A while ago, we looked at the computation of the GCD and found that the simplest implementation was, by far, the fastest. This week, we’ll have a look at another numerical problem: fast exponentiation (modulo machine-size integers).

The exponentiation looks like an expensive operation because, at first, we think it needs multiplications to be computed. Fortunately for us, and because of the arithmetic of exponentiation, we can compute in steps, which is much, much faster, even when is smallish. Let’s us see how.

Let’s start by reviewing how exponents work:

,

,

.

These properties let us write as either or as , or—as we will do—as . The decomposition that uses powers of two is especially convenient: each power can be computed by simple squaring *and* the series of powers map nicely with the binary representation of the original exponent. For example, , and we have : the exponents correspond to bits set to 1. We can exploit that to produce a general procedure.

We can write it either recursively,

unsigned expo_rec(unsigned x, unsigned p) { if (p) { if (p&1) return x*expo_rec(x,p-1); else { unsigned t=expo_rec(x,p/2); return t*t; } } else return 1; }

or iteratively,

unsigned expo_iter(unsigned x, unsigned p) { unsigned t=x, e=1; while (p) { if (p&1) e*=t; t*=t; p/=2; } return e; }

In both cases, the code is really simple, but we should suspect the second, iterative, version to be much faster. Is it?

*

* *

Again, the best is to take a quantitative approach and actually measure the thing. The following function measures the time taken to compute all combinations of power and base (yes, that’s the technical name for the number that is being raised to a power) from `0` to `range-1`. The code is as follows:

#include <stdint.h> #include <sys/time.h> typedef unsigned (*expo)(unsigned,unsigned); uint64_t now() { struct timeval tv; gettimeofday(&tv,0); return tv.tv_sec*1000000+tv.tv_usec; } uint64_t test( unsigned range, expo f) { uint64_t start=now(); for (unsigned a=0;a<range;a++) for (unsigned b=0;b<range;b++) f(a,b); uint64_t stop=now(); return stop-start; }

Plugging the two above version in the test code with a range of 10000 yields the following times:

expo_rec |
3.45s |

expo_iter |
1.11s |

…thus leaving no doubts on which is faster. Let’s have a look at the code for `expo_iter` generated by `gcc -O3` (4.8.1):

400ac0: 85 f6 test esi,esi 400ac2: b8 01 00 00 00 mov eax,0x1 400ac7: 74 1c je 400ae5 400ac9: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0] 400ad0: 89 c2 mov edx,eax 400ad2: 0f af d7 imul edx,edi 400ad5: 40 f6 c6 01 test sil,0x1 400ad9: 0f 45 c2 cmovne eax,edx 400adc: 0f af ff imul edi,edi 400adf: d1 ee shr esi,1 400ae1: 75 ed jne 400ad0 400ae3: f3 c3 repz ret 400ae5: f3 c3 repz ret 400ae7: 66 0f 1f 84 00 00 00 nop WORD PTR [rax+rax*1+0x0] 400aee: 00 00

Except for the alignment-forcing `nop`s, and the strange looking `repz ret`, the code is very straightforward.

*

* *

The next step is to replace `unsigned` by an abstract type `T`, that is, to use templates and an arbitrary-precision number class that properly overloads the needed operators. Right now, the code computes exponentiation modulo the size of machine-specific `int` (or, more precisely, `unsigned`), and that’s what I needed for now. Since the complexity raising a base to a power is (more exactly, the number of multiplies is the number of bits in the size of the integers used, plus the number of bits set to 1 in the exponent), it’s really convenient to use in, say, a hash function for lookups.

[…] we evaluate this naïvely, we end up doing multiply and additions. Even using the fast exponentiation algorithm, we still use multiplies. But we can, very easily, get down to […]