Square Roots (part IV)

In a previous post, we noticed that

$\sqrt{2^k} \leqslant \sqrt{n} = \sqrt{2^k+b} \leqslant \sqrt{2^{k+1}}$,

where $0 \leqslant b < 2^k$, could be used to kick-start Newton's (or another) square root finding algorithm. But how fast can we find $k$ and $b$ in this decomposition?

The decomposition $n=2^k+b$ can be obtained efficiently, as we only have to notice that the exponent $k$ is also the position (starting at zero) of the most significant bit in $n$. So we only need a fast way of locating this bit.

A naïve approach could be to shift the number left until the sign bit is set and count the number of iterations. We could do that by using conversions between signed/unsigned:

int naive_clzi(unsigned n)
{
int c=0,ni=n;
while (ni>0)
{
c++;
ni<<=1;
}
return c;
}


Before you ask: clz stands for count leading zeroes and is chosen to echo the name of GCC’s __builtin_clz, an intrinsic that maps to the CPU’s efficient instructions to do just that, count leading zeroes. If we don’t like the idea of casting, then we can rewrite the above as:

int naive_clz(unsigned n)
{
int c=0;
while ((n&0x80000000u)==0)
{
c++;
n<<=1;
}
return c;
}


Or, of course, we can use the builtin, __builtin_clz.

*
* *

Let’s now compare the speeds of the naive code vs the built-in function __builtin_clz. All we need is a template function that will allow us to dispense with the function pointers:

using decomposition=std::pair<int,int>;
typedef int (*clz)(unsigned);

////////////////////////////////////////
template <clz f>
decomposition decompose(unsigned n)
{
if (n)
{
// __builtin_clz and naive_clz(i)
// are undefined on 0
int k=31-f(n);

return decomposition(k,n-(1<<k));
}
else
return decomposition(-1,0);
}


And the speed-test code is

// requires <sys/time.h>
uint64_t now()
{
struct timeval tv;
gettimeofday(&tv,0);
return tv.tv_sec*1000000+tv.tv_usec;
}

////////////////////////////////////////
template <clz f>
void speed_test()
{
uint64_t start=now();

int s=0;
for (int i=0;i<1000000000;i++)
{
decomposition a=decompose<f>(i);
s+=a.first;
}

uint64_t stop=now();

std::cout
<< '\t' << s << '\t'
<< (stop-start)/1000000.0 ;
}


*
* *

Let’s see how they behave. At first, you would think that the naive version with signed/unsigned casting is the fastest of the two naive version, but, lo!, it is not:

While it is easy to explain the performance of the built-in (as it maps to a single efficient CPU instruction), we still have the intuition that comparing against zero is faster than and-ing then comparing. Let’s have a look at the assembly code for these functions, as generated by GCC/G++ 5.4.0, with -O3:

;; naive_clzi
0000000000400bf0 <_Z10naive_clzij>:
400bf0:       31 c0                   xor    eax,eax
400bf2:       85 ff                   test   edi,edi
400bf4:       7e 15                   jle    400c0b <_Z10naive_clzij+0x1b>
400bf6:       66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
400bfd:       00 00 00
400c02:       83 c0 01                add    eax,0x1
400c05:       85 ff                   test   edi,edi
400c07:       7f f7                   jg     400c00 <_Z10naive_clzij+0x10>
400c09:       f3 c3                   repz ret

;; naive_clz
0000000000400bd0 <_Z9naive_clzj>:
400bd0:       31 c0                   xor    eax,eax
400bd2:       85 ff                   test   edi,edi
400bd4:       78 13                   js     400be9 <_Z9naive_clzj+0x19>
400bd6:       66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
400bdd:       00 00 00
400be0:       83 c0 01                add    eax,0x1
400be5:       79 f9                   jns    400be0 <_Z9naive_clzj+0x10>
400be7:       f3 c3                   repz ret


In the completely unsigned version, the compiler did figure out that we’re testing the sign bit, and compares only against it. It is set “for free” by the add edi,edi instruction just before the jns. In the integer-cast version, we did not compare the sign bit, but just for “greater than zero”. This doesn’t translate to negative, but to “negative or zero”, therefore the sign-bit optimization isn’t possible. We’d been better off to write >=!

The built-in translate to a single bsr instruction (bit scan right), that’s more or less efficient depending on the particular CPU you have, but is in all cases faster than the equivalent loop.

*
* *

“But I thought it was about square roots?”

Oh, right. About that. Why are we interested in finding $k$ efficiently in $n=2^k+b$? Because the bounds

$\sqrt{2^k} \leqslant \sqrt{n} = \sqrt{2^k+b} \leqslant \sqrt{2^{k+1}}$

cant be used to kick-start a square root algorithm. But there’s another piece that we need to make it much better: but we already have it: shifting by half-bits! We’ll get into this next week.

…to be continued.