Square Roots (part IV)

In a previous post, we noticed that

\sqrt{2^k} \leqslant \sqrt{n} = \sqrt{2^k+b} \leqslant \sqrt{2^{k+1}},

where 0 \leqslant b < 2^k, could be used to kick-start Newton's (or another) square root finding algorithm. But how fast can we find k and b in this decomposition?

The decomposition n=2^k+b can be obtained efficiently, as we only have to notice that the exponent k is also the position (starting at zero) of the most significant bit in n. So we only need a fast way of locating this bit.

A naïve approach could be to shift the number left until the sign bit is set and count the number of iterations. We could do that by using conversions between signed/unsigned:

int naive_clzi(unsigned n)
 {
  int c=0,ni=n;
  while (ni>0)
   {
    c++;
    ni<<=1;
   }
  return c;
 }
 

Before you ask: clz stands for count leading zeroes and is chosen to echo the name of GCC’s __builtin_clz, an intrinsic that maps to the CPU’s efficient instructions to do just that, count leading zeroes. If we don’t like the idea of casting, then we can rewrite the above as:

int naive_clz(unsigned n)
 {
  int c=0;
  while ((n&0x80000000u)==0)
   {
    c++;
    n<<=1;
   }
  return c;
 }

Or, of course, we can use the builtin, __builtin_clz.

*
* *

Let’s now compare the speeds of the naive code vs the built-in function __builtin_clz. All we need is a template function that will allow us to dispense with the function pointers:

using decomposition=std::pair<int,int>;
typedef int (*clz)(unsigned);

////////////////////////////////////////
template <clz f>
 decomposition decompose(unsigned n)
 {
  if (n)
   {
    // __builtin_clz and naive_clz(i)
    // are undefined on 0
    int k=31-f(n);

    return decomposition(k,n-(1<<k));
   }
  else
   return decomposition(-1,0);
 }

And the speed-test code is

// requires <sys/time.h>
uint64_t now()
 {
  struct timeval tv;
  gettimeofday(&tv,0);
  return tv.tv_sec*1000000+tv.tv_usec;
 }
 
////////////////////////////////////////
template <clz f>
 void speed_test()
  {
   uint64_t start=now();

   int s=0;
   for (int i=0;i<1000000000;i++)
    {
     decomposition a=decompose<f>(i);
     s+=a.first;
    }
   
   uint64_t stop=now();

   std::cout
    << '\t' << s << '\t'
    << (stop-start)/1000000.0 ;
  }

*
* *

Let’s see how they behave. At first, you would think that the naive version with signed/unsigned casting is the fastest of the two naive version, but, lo!, it is not:

naive-vs-builtin

While it is easy to explain the performance of the built-in (as it maps to a single efficient CPU instruction), we still have the intuition that comparing against zero is faster than and-ing then comparing. Let’s have a look at the assembly code for these functions, as generated by GCC/G++ 5.4.0, with -O3:

;; naive_clzi
0000000000400bf0 <_Z10naive_clzij>:
  400bf0:       31 c0                   xor    eax,eax
  400bf2:       85 ff                   test   edi,edi
  400bf4:       7e 15                   jle    400c0b <_Z10naive_clzij+0x1b>
  400bf6:       66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
  400bfd:       00 00 00 
  400c00:       01 ff                   add    edi,edi
  400c02:       83 c0 01                add    eax,0x1
  400c05:       85 ff                   test   edi,edi
  400c07:       7f f7                   jg     400c00 <_Z10naive_clzij+0x10>
  400c09:       f3 c3                   repz ret
;; naive_clz
0000000000400bd0 <_Z9naive_clzj>:
  400bd0:       31 c0                   xor    eax,eax
  400bd2:       85 ff                   test   edi,edi
  400bd4:       78 13                   js     400be9 <_Z9naive_clzj+0x19>
  400bd6:       66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
  400bdd:       00 00 00 
  400be0:       83 c0 01                add    eax,0x1
  400be3:       01 ff                   add    edi,edi
  400be5:       79 f9                   jns    400be0 <_Z9naive_clzj+0x10>
  400be7:       f3 c3                   repz ret 

In the completely unsigned version, the compiler did figure out that we’re testing the sign bit, and compares only against it. It is set “for free” by the add edi,edi instruction just before the jns. In the integer-cast version, we did not compare the sign bit, but just for “greater than zero”. This doesn’t translate to negative, but to “negative or zero”, therefore the sign-bit optimization isn’t possible. We’d been better off to write >=!

The built-in translate to a single bsr instruction (bit scan right), that’s more or less efficient depending on the particular CPU you have, but is in all cases faster than the equivalent loop.

*
* *

“But I thought it was about square roots?”

Oh, right. About that. Why are we interested in finding k efficiently in n=2^k+b? Because the bounds

\sqrt{2^k} \leqslant \sqrt{n} = \sqrt{2^k+b} \leqslant \sqrt{2^{k+1}}

cant be used to kick-start a square root algorithm. But there’s another piece that we need to make it much better: but we already have it: shifting by half-bits! We’ll get into this next week.

…to be continued.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: