The subject of computing the GCD was brought up a couple of times lately, and we assumed that the straightforward divide-and-remained implementation was the most efficient. But is it?

Before writing this post, I knew of basically two versions, one due to Euclid, invented sometimes in Antiquity of course, and one that used the remainder (that is, modulo) to do basically the same thing (which can be implemented either recursively or iteratively). Turns out that there are other algorithms, for example the so-called binary GCD that tries to somewhat speed up the process by removing multiples of two. But which one is the most efficient?

The first version of the algorithm, attributed to Euclid himself, proceeds by successive subtraction. To make sure the algorithm works about properly, we must make sure that neither of the numbers is zero, but asides from that, it proceeds as follows:

unsigned gcd_iterative_sub(unsigned a, unsigned b) { if (a==0) return b; if (b==0) return a; while (a!=b) if (a>b) a-=b; else b-=a; return a; }

To be honest, I find this version close to incomprehensible: it is very hard to get the intuition on how or even *why* it works. The recursive version

unsigned gcd_recursive(unsigned a, unsigned b) { if (b) return gcd_recursive(b, a % b); else return a; }

…is a lot clearer. See, if is not a divisor for (and vice-versa) then there’s a remainder, but whatever the GCD is, it must divide , and the remainder , so we can find the GCD of and , which is equivalent to solving the GCD of and ! If for some reason you’re allergic to, or opposed to, for various reasons, recursion, you can make it iterative:

unsigned gcd_iterative_mod(unsigned a, unsigned b) { while (b) { unsigned t=b; b=a % b; a=t; } return a; }

*

* *

The binary GCD algorithms tries to speed things up by first finding the largest common power of two, then proceeds to a subtraction-based greatest common divisor search, then ends by rescaling back by the power of two that was first removed. My implementation of this algorithm (of course strongly “inspired” from the implementation found on Wikipedia):

unsigned gcd_binary(unsigned a, unsigned b) { if (a == 0) return b; if (b == 0) return a; int shift=0; while (!((a | b) & 1)) { a >>= 1; b >>= 1; shift++; } while (!(a & 1)) a >>= 1; do { while (!(b & 1)) b >>= 1; if (a>b) std::swap(a,b); b-=a; } while (b); // rescale return a << shift; }

If we replace the subtraction-based GCD finding part with, say, the iterative, modulo-based, version, we get something like:

unsigned gcd_binary_mod(unsigned a, unsigned b) { if (a == 0) return b; if (b == 0) return a; int shift=0; while (!((a | b) & 1)) { a >>= 1; b >>= 1; shift++; } return gcd_iterative_mod(a,b) << shift; }

Where we strip the largest common power of two, compute the GCD normally, then rescale the result before returning.

*

* *

Now, which one is faster? There’s only one way to know. So I have used g++ 4.8.1 (Ubuntu/Linaro), an i7 2600K, and Ubuntu 13.10. Since it’ll require quite some time (you don’t want just one data point, you want many to filter out random activities in your computer), I also decided to test the methods using threads. The “new” C++11 threads are not as easy to use as I hoped (there are a number of subtleties on how function-object are called, or copied, or ref’d), but it’s still rather straightforward.

The first thing to do, is to create a function-object that holds the results of a particular test. In this case, I just count the elapsed wall-time using `gettimeofday` (in microseconds) while the test runs for all pairs of integers from 0 to `range`:

class tester { private: gcd f; unsigned range; uint64_t elapsed; public: uint64_t get_elapsed() const { return elapsed; } void run() { uint64_t start=now(); for (unsigned a=0;a<range;a++) for (unsigned b=0;b<range;b++) f(a,b); uint64_t stop=now(); elapsed=stop-start; } void set_range(unsigned _range) { range=_range; } unsigned get_range() const { return range; } tester(gcd _f, unsigned _range): f(_f), range(_range), elapsed(0) {}; };

So basically, we instantiate this class with a function pointer to the algorithm we want to test. Calling `run()` will perform the actual test. To launch many threads, we will use `std::thread`. The main thing is to call the `run()` without having the thread object making a copy, but rather use a pointer to the function with the `this` as argument.

int main() { const size_t nb_algorithms=5; std::thread ** threads=new std::thread *[nb_algorithms]; tester *testers[nb_algorithms]= { new tester(gcd_recursive,0), new tester(gcd_iterative_sub,0), new tester(gcd_iterative_mod,0), new tester(gcd_binary,0), new tester(gcd_binary_mod,0) }; for (unsigned range=1000;range<=100000;range*=10) for (int rep=0;rep<50;rep++) { std::cout << std::setw(12) << range; for (int i=0;i<nb_algorithms;i++) { testers[i]->set_range(range); threads[i]= // this only keeps a pointer to the function and its argument is // this so it points to the already existing object, while // creating a new thread with something like // // new std::thread( std::function<void()>(*testers[i]) ); // // (assuming and operator()) would make a thread-private copy of // the object // new std::thread( &tester::run, testers[i]); } for (int i=0;i<nb_algorithms;i++) threads[i]->join(); for (int i=0;i<nb_algorithms;i++) std::cout << std::setw(12) << testers[i]->get_elapsed()/1000000.0; std::cout << std::endl; for (int i=0;i<nb_algorithms;i++) delete threads[i]; } return 0; }

This code launches up to five threads simultaneously (I know, it’s not completely optimal since I could reschedule thread object to run as soon as they terminate, but anyway the time will be dominated by the slowest algorithms) and prints the collected data.

*

* *

Running the tests for all algorithms, with ranges from 0 to 1000, 0 to 10000, and 0 to 100000 yield the following graphs:

Unsurprisingly, the subtraction-based version does not fare very well. What is maybe more surprising is that the recursive version does best, and with less variance than the equivalent iterative version. Lastly, the binary versions are somewhat not very impressive, and “optimizing” the code by shifting and exploiting powers of two doesn’t yield a speed-up, quite the contrary. The added complexity yields slower code. The original binary GCD proceeds by subtraction and does better than the original, subtraction-based version, but still does significantly worse than the recursive version. The natural idea is to see if we do much better when we replace the subtraction-based portion of the binary GCD by an iterative, mod-based, version. It does make things better, but it’s still slower than the simple recursive version.

*

* *

So, once again, we find that an “optimization” turns out to be a significant slowdown for the code, despite the expected complexity of the algorithm being less that the naïve, recursive, version. It might not have been, it might have been faster, but then again we could only make sure by benchmarking—actually testing somewhat methodically—the different versions. So, keep it simple, and use the recursive GCD. Unless, of course, the benchmarks give some different result on your box.

*

* *

After the discussion here, I’ve included a new version using the built-in `__builtin_ctz`, and, lo! We indeed get an interesting speed-up.

The new code:

unsigned gcd_binary_2(unsigned a, unsigned b) { if (a == 0) return b; if (b == 0) return a; int shift=__builtin_ctz(a|b); a>>=__builtin_ctz(a); do { b>>=__builtin_ctz(b); if (a>b) std::swap(a,b); b-=a; } while (b); return a << shift; }

And the new results:

The tests for the larger numbers give a 40% speed-up, which is much more interesing than the original binary, without intrinsics, what yielded an impressive *slow down*.

*

* *

Looking at the assembly code, we see that the built-in translated directly into a specific machine instruction, `tzcnt` (trailing zero count) aliasing `bsf` on older processors (instruction tables seem to indicate that they have the same opcode?). The assembly generated by the compile from the above C++ code is given here:

0000000000401610 <_Z12gcd_binary_2jj>: 401610: test edi,edi 401612: mov eax,esi 401614: je 40164a 401616: test esi,esi 401618: mov eax,edi 40161a: je 40164a 40161c: mov edx,edi 40161e: tzcnt ecx,edi 401622: or edx,esi 401624: shr edi,cl 401626: tzcnt edx,edx 40162a: jmp 401632 40162c: nop DWORD PTR [rax+0x0] 401630: mov edi,eax 401632: tzcnt ecx,esi 401636: mov eax,edi 401638: shr esi,cl 40163a: cmp esi,edi 40163c: cmovbe eax,esi 40163f: cmovb esi,edi 401642: sub esi,eax 401644: jne 401630 401646: mov ecx,edx 401648: shl eax,cl 40164a: repz ret

Nice post.

What happens if you use GCC’s __builtin_ctz (or call an instruction like bsr) instead of doing expensive bit-check-and-shift loops?

First I doubt that __builtin_ctz works miracles (but I haven’t tried, so can’t confirm).

I guess is just that it’s mostly a pointless optimization because the probability of having a number with as a factor is proportional to , so the “optimization” gives maximal payoff with asymptotically

zeroprobability.(and sorry for the very late approve+reply… I’ve been rather busy lately.)

Non-recursive version in Euler Math Toolbox. Note, that there is a built-in command.

[…] A while ago, we looked at the computation of the GCD and found that the simplest implementation was, by far, the fastest. This week, we’ll have a look at another numerical problem: fast exponentiation (modulo machine-size integers). […]

[…] have honestly never written a program where computing the gcd was the bottleneck. However, Pigeon wrote a blog post where the binary GCD fared very poorly compared to a simple implementation of Euler’s […]

[…] the discussion of The Speed of GCD, Daniel Lemire remarked that one could use the compiler-specific intrinsic function __builtin_ctz […]