In a previous post, we noticed that

n = 2^k + b,

where 0 ≤ b < 2^k, could be used to kick-start Newton's (or another) square root finding algorithm. But how fast can we find k and b in this decomposition?

The decomposition can be obtained efficiently, as we only have to notice that the exponent k is also the position (starting at zero) of the most significant bit of n. So we only need a fast way of locating this bit.
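To make the decomposition concrete, here is a minimal sketch (the helper name `split_pow2` is mine, not from the original post), assuming GCC's `__builtin_clz` which we discuss below:

```cpp
#include <utility>

// A sketch: split n > 0 into (k, b) such that n == 2^k + b,
// with 0 <= b < 2^k, where k is the position of the most
// significant set bit. Undefined for n == 0, as __builtin_clz is.
std::pair<int,unsigned> split_pow2(unsigned n)
{
  int k = 31 - __builtin_clz(n); // MSB position, n > 0
  return { k, n - (1u << k) };   // the remainder b
}
```

For example, `split_pow2(100)` yields (6, 36), since 100 = 64 + 36.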

A naïve approach could be to shift the number left until the sign bit is set and count the number of iterations. We could do that by using conversions between signed/unsigned:

```cpp
int naive_clzi(unsigned n)
{
  int c = 0, ni = n;
  while (ni > 0)
    {
      c++;
      ni <<= 1;
    }
  return c;
}
```

Before you ask: `clz` stands for *count leading zeroes* and is chosen to echo the name of GCC’s `__builtin_clz`, an intrinsic that maps to the CPU’s efficient instructions to do just that, count leading zeroes. If we don’t like the idea of casting, then we can rewrite the above as:

```cpp
int naive_clz(unsigned n)
{
  int c = 0;
  while ((n & 0x80000000u) == 0)
    {
      c++;
      n <<= 1;
    }
  return c;
}
```

Or, of course, we can use the builtin, `__builtin_clz`.
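As a quick sanity check (a sketch only, repeating `naive_clz` from above to keep the snippet self-contained), the naive loop and the built-in agree on any nonzero input:

```cpp
// naive_clz, as above: count leading zeroes by shifting
// until the sign bit is set. Undefined (infinite loop) for n == 0.
int naive_clz(unsigned n)
{
  int c = 0;
  while ((n & 0x80000000u) == 0)
    {
      c++;
      n <<= 1;
    }
  return c;
}

// true iff the naive loop matches GCC's __builtin_clz; n must be nonzero
bool clz_agree(unsigned n)
{
  return naive_clz(n) == __builtin_clz(n);
}
```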

* * *

Let’s now compare the speeds of the naive versions against the built-in `__builtin_clz`. All we need is a template function that takes the `clz` function as a template argument, so that the call is resolved at compile time rather than through a run-time function pointer:

```cpp
using decomposition = std::pair<int,int>;
typedef int (*clz)(unsigned);

////////////////////////////////////////
template <clz f>
decomposition decompose(unsigned n)
{
  if (n)
    {
      // __builtin_clz and naive_clz(i)
      // are undefined on 0
      int k = 31 - f(n);
      return decomposition(k, n - (1u << k)); // 1u: 1<<31 would overflow an int
    }
  else
    return decomposition(-1, 0);
}
```

And the speed-test code is

```cpp
// requires <sys/time.h>
uint64_t now()
{
  struct timeval tv;
  gettimeofday(&tv, 0);
  return tv.tv_sec * 1000000 + tv.tv_usec;
}

////////////////////////////////////////
template <clz f>
void speed_test()
{
  uint64_t start = now();
  int64_t s = 0; // 64 bits: the sum of k's overflows an int
  for (int i = 0; i < 1000000000; i++)
    {
      decomposition a = decompose<f>(i);
      s += a.first;
    }
  uint64_t stop = now();
  std::cout << '\t' << s
            << '\t' << (stop - start) / 1000000.0;
}
```

* * *

Let’s see how they behave. At first, you would think that the naive version with signed/unsigned casting would be the faster of the two naive versions, but, lo!, it is not.

While it is easy to explain the performance of the built-in (as it maps to a single efficient CPU instruction), we still have the intuition that comparing against zero is faster than and-ing then comparing. Let’s have a look at the assembly code for these functions, as generated by GCC/G++ 5.4.0, with `-O3`:

```asm
;; naive_clzi
0000000000400bf0 <_Z10naive_clzij>:
  400bf0:  31 c0                    xor    eax,eax
  400bf2:  85 ff                    test   edi,edi
  400bf4:  7e 15                    jle    400c0b <_Z10naive_clzij+0x1b>
  400bf6:  66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
  400bfd:  00 00 00
  400c00:  01 ff                    add    edi,edi
  400c02:  83 c0 01                 add    eax,0x1
  400c05:  85 ff                    test   edi,edi
  400c07:  7f f7                    jg     400c00 <_Z10naive_clzij+0x10>
  400c09:  f3 c3                    repz ret
```

```asm
;; naive_clz
0000000000400bd0 <_Z9naive_clzj>:
  400bd0:  31 c0                    xor    eax,eax
  400bd2:  85 ff                    test   edi,edi
  400bd4:  78 13                    js     400be9 <_Z9naive_clzj+0x19>
  400bd6:  66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
  400bdd:  00 00 00
  400be0:  83 c0 01                 add    eax,0x1
  400be3:  01 ff                    add    edi,edi
  400be5:  79 f9                    jns    400be0 <_Z9naive_clzj+0x10>
  400be7:  f3 c3                    repz ret
```

In the completely unsigned version, the compiler did figure out that we’re testing the sign bit, and compares only against it: it is set “for free” by the `add edi,edi` instruction just before the `jns`. In the integer-cast version, we did not test the sign bit, but for “greater than zero”. The exit condition is therefore not “negative” but “negative or zero”, so the sign-bit optimization isn’t possible. We’d have been better off writing `>=`!
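For completeness, the `>=` variant might look like this (a sketch; note that, strictly speaking, shifting a one into the sign bit of a signed `int` is undefined behavior in C++, even though GCC on x86 does the expected thing, as it does for the original cast version):

```cpp
// The ">=" variant: the loop exits exactly when the sign bit
// becomes set, so the compiler can use the jns trick here too.
// Still undefined for n == 0 (it would loop forever), and the
// left shift into the sign bit is technically UB for signed ints.
int naive_clzi2(unsigned n)
{
  int c = 0, ni = n;
  while (ni >= 0)
    {
      c++;
      ni <<= 1;
    }
  return c;
}
```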

The built-in translates to a single `bsr` instruction (bit scan reverse), which is more or less efficient depending on the particular CPU you have, but is in all cases faster than the equivalent loop.

* * *

“But I thought it was about square roots?”

Oh, right. About that. Why are we interested in finding k efficiently in n = 2^k + b? Because the bounds

2^(k/2) ≤ √(2^k + b) < 2^((k+1)/2)

can be used to kick-start a square root algorithm. But there’s another piece we need to make it much better, and we already have it: shifting by half-bits! We’ll get into this next week.
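To make the kick-start concrete, here is one possible way (a sketch only, the function name `isqrt_newton` is mine; the half-bit refinement is deliberately left out) to seed Newton's integer iteration with the upper bound 2^(k/2 + 1) > √n:

```cpp
#include <cstdint>

// Integer square root, seeded from the position k of the most
// significant bit of n: since sqrt(n) < 2^((k+1)/2) <= 2^(k/2+1),
// starting at 2^(k/2+1) guarantees the usual Newton descent
// converges to floor(sqrt(n)). A sketch, not the post's final method.
uint32_t isqrt_newton(uint32_t n)
{
  if (n < 2) return n;                 // 0 and 1 are their own roots
  int k = 31 - __builtin_clz(n);       // MSB position
  uint32_t x = 1u << (k / 2 + 1);      // upper bound on sqrt(n)
  uint32_t y = (x + n / x) / 2;        // one Newton step
  while (y < x)                        // descend until it stops improving
    {
      x = y;
      y = (x + n / x) / 2;
    }
  return x;
}
```

Starting above the root matters: seeding with the lower bound 2^(k/2) and stopping as soon as the iterate stops decreasing would return the seed itself for some inputs.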

*…to be continued.*
