First I doubt that __builtin_ctz works miracles (but I haven’t tried, so can’t confirm).

I guess is just that it’s mostly a pointless optimization because the probability of having a number with as a factor is proportional to , so the “optimization” gives maximal payoff with asymptotically zero probability.

(and sorry for the very late approve+reply… I’ve been rather busy lately.)

[…] A while ago, we looked at the computation of the GCD and found that the simplest implementation was, by far, the fastest. This week, we’ll have a look at another numerical problem: fast exponentiation (modulo machine-size integers). […]

[…] have honestly never written a program where computing the gcd was the bottleneck. However, Pigeon wrote a blog post where the binary GCD fared very poorly compared to a simple implementation of Euler’s […]

Nice post.

What happens if you use GCC’s __builtin_ctz (or call an instruction like bsr) instead of doing expensive bit-check-and-shift loops?

First I doubt that __builtin_ctz works miracles (but I haven’t tried, so can’t confirm).

I guess is just that it’s mostly a pointless optimization because the probability of having a number with as a factor is proportional to , so the “optimization” gives maximal payoff with asymptotically

zeroprobability.(and sorry for the very late approve+reply… I’ve been rather busy lately.)

Non-recursive version in Euler Math Toolbox. Note, that there is a built-in command.

[…] A while ago, we looked at the computation of the GCD and found that the simplest implementation was, by far, the fastest. This week, we’ll have a look at another numerical problem: fast exponentiation (modulo machine-size integers). […]

[…] have honestly never written a program where computing the gcd was the bottleneck. However, Pigeon wrote a blog post where the binary GCD fared very poorly compared to a simple implementation of Euler’s […]

[…] the discussion of The Speed of GCD, Daniel Lemire remarked that one could use the compiler-specific intrinsic function __builtin_ctz […]

Fastest for me (C#):

static uint gcd(uint a, uint b)

{

while (a != 0)

{

b %= a;

if (b == 0) return a;

a %= b;

}

return b;

}