## The Speed of GCD

The subject of computing the GCD was brought up a couple of times lately, and we assumed that the straightforward divide-and-remainder implementation was the most efficient. But is it?

Before writing this post, I knew of basically two versions: one due to Euclid, invented sometime in Antiquity of course, and one that uses the remainder (that is, modulo) to do basically the same thing (and which can be implemented either recursively or iteratively). It turns out that there are other algorithms, for example the so-called binary GCD, which tries to speed up the process by removing factors of two. But which one is the most efficient?

The first version of the algorithm, attributed to Euclid himself, proceeds by successive subtraction. To make sure the algorithm works properly, we must make sure that neither of the numbers is zero, but aside from that, it proceeds as follows:

unsigned gcd_iterative_sub(unsigned a, unsigned b)
{
    if (a==0) return b;
    if (b==0) return a;
    while (a!=b)
        if (a>b)
            a-=b;
        else
            b-=a;
    return a;
}


To be honest, I find this version close to incomprehensible: it is very hard to get an intuition for how, or even why, it works. The recursive version

unsigned gcd_recursive(unsigned a, unsigned b)
{
    if (b)
        return gcd_recursive(b, a % b);
    else
        return a;
}


…is a lot clearer. See, if $b$ does not divide $a$, there's a remainder; but whatever the GCD is, it must divide $a$, $b$, and the remainder $a\:\mathrm{mod}\:b$ (since $a\:\mathrm{mod}\:b$ is just $a$ minus a multiple of $b$). So finding the GCD of $b$ and $a\:\mathrm{mod}\:b$ is equivalent to finding the GCD of $a$ and $b$! If for some reason you're allergic, or otherwise opposed, to recursion, you can make it iterative:

unsigned gcd_iterative_mod(unsigned a, unsigned b)
{
    while (b)
    {
        unsigned t=b;
        b=a % b;
        a=t;
    }
    return a;
}


*
* *

The binary GCD algorithm tries to speed things up by first removing the largest common power of two, then proceeding with a subtraction-based greatest common divisor search, and finally rescaling the result by the power of two removed at the start. My implementation of this algorithm (of course strongly “inspired” by the implementation found on Wikipedia):

unsigned gcd_binary(unsigned a, unsigned b)
{
    if (a == 0) return b;
    if (b == 0) return a;

    // remove the largest power of two common to a and b
    int shift=0;
    while (!((a | b) & 1))
    {
        a >>= 1;
        b >>= 1;
        shift++;
    }

    // remove remaining factors of two from a (they are not common)
    while (!(a & 1)) a >>= 1;

    do
    {
        // likewise, remove remaining factors of two from b
        while (!(b & 1)) b >>= 1;

        // both odd: subtract the smaller from the larger
        if (a>b) std::swap(a,b);
        b-=a;
    } while (b);

    // rescale
    return a << shift;
}


If we replace the subtraction-based GCD finding part with, say, the iterative, modulo-based, version, we get something like:

unsigned gcd_binary_mod(unsigned a, unsigned b)
{
    if (a == 0) return b;
    if (b == 0) return a;

    int shift=0;
    while (!((a | b) & 1))
    {
        a >>= 1;
        b >>= 1;
        shift++;
    }

    return gcd_iterative_mod(a,b) << shift;
}


Here we strip the largest common power of two, compute the GCD normally, then rescale the result before returning.

*
* *

Now, which one is faster? There's only one way to know. I used g++ 4.8.1 (Ubuntu/Linaro) on an i7 2600K running Ubuntu 13.10. Since the tests take quite some time (you don't want just one data point; you want many, to filter out random activity on your computer), I also decided to run the methods in parallel using threads. The “new” C++11 threads are not as easy to use as I hoped (there are a number of subtleties about how function objects are called, copied, or passed by reference), but it's still rather straightforward.

The first thing to do is to create a function object that holds the results of a particular test. In this case, I just measure the elapsed wall time using gettimeofday (in microseconds) while the test runs over all pairs of integers from 0 to range:

class tester
{
private:

    gcd f;            // the GCD variant under test
    unsigned range;   // test all pairs in [0,range) x [0,range)
    uint64_t elapsed; // wall time in microseconds

public:

    uint64_t get_elapsed() const { return elapsed; }

    void run()
    {
        uint64_t start=now();
        for (unsigned a=0;a<range;a++)
            for (unsigned b=0;b<range;b++)
                f(a,b);
        uint64_t stop=now();
        elapsed=stop-start;
    }

    void set_range(unsigned _range) { range=_range; }
    unsigned get_range() const { return range; }

    tester(gcd _f, unsigned _range):
        f(_f),
        range(_range),
        elapsed(0)
    {}
};


So basically, we instantiate this class with a function pointer to the algorithm we want to test; calling run() performs the actual test. To launch many threads, we use std::thread. The main thing is to call run() without having the thread object make a copy of the tester: instead, we pass a pointer to the member function, with this (that is, the tester's address) as its first argument.

int main()
{
    const size_t nb_algorithms=5;

    tester *testers[nb_algorithms]=
    {
        new tester(gcd_recursive,0),
        new tester(gcd_iterative_sub,0),
        new tester(gcd_iterative_mod,0),
        new tester(gcd_binary,0),
        new tester(gcd_binary_mod,0)
    };

    for (unsigned range=1000;range<=100000;range*=10)
        for (int rep=0;rep<50;rep++)
        {
            std::thread *threads[nb_algorithms];

            std::cout << std::setw(12) << range;
            for (size_t i=0;i<nb_algorithms;i++)
            {
                testers[i]->set_range(range);
                // &tester::run with testers[i] as argument keeps a
                // pointer to the already existing object; creating the
                // thread with something like
                //
                //  new std::thread( std::function<void()>(*testers[i]) );
                //
                // (assuming an operator()) would make a thread-private
                // copy of the object
                threads[i]=new std::thread(&tester::run, testers[i]);
            }

            // wait for all threads to finish before reading the timings
            for (size_t i=0;i<nb_algorithms;i++)
            {
                threads[i]->join();
                delete threads[i];
            }

            for (size_t i=0;i<nb_algorithms;i++)
                std::cout << std::setw(12) << testers[i]->get_elapsed()/1000000.0; // seconds
            std::cout << std::endl;
        }

    return 0;
}


This code launches up to five threads simultaneously (I know, it's not completely optimal, since I could reschedule a new thread as soon as one terminates, but the total time is dominated by the slowest algorithm anyway) and prints the collected data.

*
* *

Running the tests for all algorithms, with ranges from 0 to 1000, 0 to 10000, and 0 to 100000, yields the following graphs:

Unsurprisingly, the subtraction-based version does not fare very well. What is perhaps more surprising is that the recursive version does best, and with less variance than the equivalent iterative version. Lastly, the binary versions are not very impressive: “optimizing” the code by shifting and exploiting powers of two doesn't yield a speed-up, quite the contrary; the added complexity yields slower code. The original binary GCD, which proceeds by subtraction, does better than the plain subtraction-based version, but still significantly worse than the recursive version. The natural next step is to see whether we do much better by replacing the subtraction-based portion of the binary GCD with the iterative, mod-based version. It does make things better, but the result is still slower than the simple recursive version.

*
* *

So, once again, we find that an “optimization” turns out to be a significant slowdown, despite the expected complexity of the algorithm being less than that of the naïve, recursive version. It might have been faster; but we could only make sure by benchmarking (actually testing somewhat methodically) the different versions. So keep it simple, and use the recursive GCD. Unless, of course, the benchmarks give a different result on your box.

*
* *

After the discussion here, I’ve included a new version using the built-in __builtin_ctz, and, lo! We indeed get an interesting speed-up.

The new code:

unsigned gcd_binary_2(unsigned a, unsigned b)
{
    if (a == 0) return b;
    if (b == 0) return a;

    int shift=__builtin_ctz(a|b);
    a>>=__builtin_ctz(a);

    do
    {
        b>>=__builtin_ctz(b);

        if (a>b) std::swap(a,b);
        b-=a;
    } while (b);

    return a << shift;
}


And the new results:

The tests for the larger numbers show a 40% speed-up, which is much more interesting than the original binary version without intrinsics, which yielded an impressive slowdown.

*
* *

Looking at the assembly code, we see that the built-in translates directly into a specific machine instruction, tzcnt (trailing zero count), which aliases bsf on older processors (tzcnt is encoded as bsf with a rep prefix, which older processors ignore, so they execute it as plain bsf). The assembly generated by the compiler from the above C++ code is given here:

0000000000401610 <_Z12gcd_binary_2jj>:
401610: test   edi,edi
401612: mov    eax,esi
401614: je     40164a
401616: test   esi,esi
401618: mov    eax,edi
40161a: je     40164a
40161c: mov    edx,edi
40161e: tzcnt  ecx,edi
401622: or     edx,esi
401624: shr    edi,cl
401626: tzcnt  edx,edx
40162a: jmp    401632
40162c: nop    DWORD PTR [rax+0x0]
401630: mov    edi,eax
401632: tzcnt  ecx,esi
401636: mov    eax,edi
401638: shr    esi,cl
40163a: cmp    esi,edi
40163c: cmovbe eax,esi
40163f: cmovb  esi,edi
401642: sub    esi,eax
401644: jne    401630
401646: mov    ecx,edx
401648: shl    eax,cl
40164a: repz ret


### 6 Responses to The Speed of GCD

1. Nice post.

What happens if you use GCC’s __builtin_ctz (or call an instruction like bsr) instead of doing expensive bit-check-and-shift loops?

• First, I doubt that __builtin_ctz works miracles (but I haven't tried, so I can't confirm).

My guess is that it's mostly a pointless optimization, because the probability of a number having $2^n$ as a factor is proportional to $2^{-n}$, so the “optimization” gives maximal payoff with asymptotically zero probability.

(and sorry for the very late approve+reply… I’ve been rather busy lately.)

2. Non-recursive version in Euler Math Toolbox. Note that there is a built-in command.

3. […] A while ago, we looked at the computation of the GCD and found that the simplest implementation was, by far, the fastest. This week, we’ll have a look at another numerical problem: fast exponentiation (modulo machine-size integers). […]

4. […] have honestly never written a program where computing the gcd was the bottleneck. However, Pigeon wrote a blog post where the binary GCD fared very poorly compared to a simple implementation of Euler’s […]

5. […] the discussion of The Speed of GCD, Daniel Lemire remarked that one could use the compiler-specific intrinsic function __builtin_ctz […]