## Fast exponentiation

A while ago, we looked at the computation of the GCD and found that the simplest implementation was, by far, the fastest. This week, we’ll have a look at another numerical problem: fast exponentiation (modulo machine-size integers). The exponentiation $a^b$ looks like an expensive operation because, at first, we think it needs $b$ multiplications to be computed. Fortunately for us, and because of the arithmetic of exponentiation, we can compute $a^b$ in $O(\lg b)$ steps, which is much, much faster, even when $b$ is smallish. Let’s us see how.

Let’s start by reviewing how exponents work: $\displaystyle x^n=\underbrace{x\times{x}\times{x}\cdots\times{x}}_{n\text{~times}}$, $x^a x^b = x^{a+b}$, $(x^a)^b = x^{ab}$.

These properties let us write $x^{20}$ as either $(x^5)^4$ or as $x^{10}x^8x^2$, or—as we will do—as $x^{16}x^4$. The decomposition that uses powers of two is especially convenient: each power can be computed by simple squaring and the series of powers map nicely with the binary representation of the original exponent. For example, $25_{10}=11001_2$, and we have $x^{25}=x^{16}x^{8}x$: the exponents correspond to bits set to 1. We can exploit that to produce a general procedure.

We can write it either recursively,

unsigned expo_rec(unsigned x, unsigned p)
{
if (p)
{
if (p&1)
return x*expo_rec(x,p-1);
else
{
unsigned t=expo_rec(x,p/2);
return t*t;
}
}
else
return 1;
}


or iteratively,

unsigned expo_iter(unsigned x, unsigned p)
{
unsigned t=x, e=1;

while (p)
{
if (p&1) e*=t;

t*=t;
p/=2;
}

return e;
}


In both cases, the code is really simple, but we should suspect the second, iterative, version to be much faster. Is it?

*
* *

Again, the best is to take a quantitative approach and actually measure the thing. The following function measures the time taken to compute all combinations of power and base (yes, that’s the technical name for the number that is being raised to a power) from 0 to range-1. The code is as follows:

#include <stdint.h>
#include <sys/time.h>

typedef unsigned (*expo)(unsigned,unsigned);

uint64_t now()
{
struct timeval tv;
gettimeofday(&tv,0);
return tv.tv_sec*1000000+tv.tv_usec;
}

uint64_t test( unsigned range,
expo f)
{
uint64_t start=now();
for (unsigned a=0;a<range;a++)
for (unsigned b=0;b<range;b++)
f(a,b);
uint64_t stop=now();
return stop-start;
}


Plugging the two above version in the test code with a range of 10000 yields the following times:

 expo_rec 3.45s expo_iter 1.11s

…thus leaving no doubts on which is faster. Let’s have a look at the code for expo_iter generated by gcc -O3 (4.8.1):

  400ac0:       85 f6                   test   esi,esi
400ac2:       b8 01 00 00 00          mov    eax,0x1
400ac7:       74 1c                   je     400ae5
400ac9:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]
400ad0:       89 c2                   mov    edx,eax
400ad2:       0f af d7                imul   edx,edi
400ad5:       40 f6 c6 01             test   sil,0x1
400ad9:       0f 45 c2                cmovne eax,edx
400adc:       0f af ff                imul   edi,edi
400adf:       d1 ee                   shr    esi,1
400ae1:       75 ed                   jne    400ad0
400ae3:       f3 c3                   repz ret
400ae5:       f3 c3                   repz ret
400ae7:       66 0f 1f 84 00 00 00    nop    WORD PTR [rax+rax*1+0x0]
400aee:       00 00


Except for the alignment-forcing nops, and the strange looking repz ret, the code is very straightforward.

*
* *

The next step is to replace unsigned by an abstract type T, that is, to use templates and an arbitrary-precision number class that properly overloads the needed operators. Right now, the code computes exponentiation modulo the size of machine-specific int (or, more precisely, unsigned), and that’s what I needed for now. Since the complexity raising a base $a$ to a power $b$ is $O(\lg b)$ (more exactly, the number of multiplies is the number of bits in the size of the integers used, plus the number of bits set to 1 in the exponent), it’s really convenient to use in, say, a hash function for lookups.

### 2 Responses to Fast exponentiation

1. Horner’s Method | Harder, Better, Faster, Stronger says:

[…] we evaluate this naïvely, we end up doing multiply and additions. Even using the fast exponentiation algorithm, we still use multiplies. But we can, very easily, get down to […]

2. Fast Exponentiation, revisited | Harder, Better, Faster, Stronger says:

[…] a while ago, I presented a fast exponentiation algorithm that uses the binary decomposition of the exponent to […]