## Square Roots (Part V)

Last week we examined the complexity of obtaining $k$ in the decomposition $n=2^k+b$ for some integer $n$. This week, we’ll have a look at how we can use it to speed up square root extraction. Recall, we’re interested in $k$ because

$2^k \leqslant 2^k+b < 2^{k+1}$,

with $0 \leqslant b < 2^k$, which allows us to get easy bounds on $\sqrt{n}$. Better, we also have that

$\sqrt{2^k} \leqslant \sqrt{2^k+b} < \sqrt{2^{k+1}}$,

and we know how to compute $\sqrt{2^k}=2^{\frac{k}{2}}$ (somewhat efficiently!). Let’s combine all this and see how it speeds things up.

Most of the code for square roots is found here; the code for shifting, including by half a bit, is found here.

Using $2^{\frac{k}{2}}$ and $2^{\frac{k+1}{2}}$ as the lower and upper bounds for the binary search provides better performance because it reduces the number of iterations from, say, 16 for 16-bit integers to about $\frac{k}{2}$, where $k$ depends on $n$.
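The series’ `shift_half` isn’t reproduced in this post, but to make the bounds concrete, here is a rough stand-in (the name `half_pow2` and the fixed-point constant are mine, not the series’ code) that returns a truncated $2^{\frac{k}{2}}$: an exact shift for even $k$, and a 16.16 fixed-point multiply by $\sqrt{2}$ for odd $k$.

```cpp
#include <cstdint>

// Hypothetical helper (not the series' shift_half): truncated 2^(k/2).
// Even k: an exact shift. Odd k: scale 2^((k-1)/2) by sqrt(2) ~= 1.41421,
// in 16.16 fixed point (92682/65536 ~= 1.414215).
uint32_t half_pow2(unsigned k)
{
    uint32_t x = 1u << (k / 2);
    if (k & 1)
        x = static_cast<uint32_t>((static_cast<uint64_t>(x) * 92682u) >> 16);
    return x;
}
```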

```cpp
unsigned driven_binary_sqrt_bounds(unsigned n, int & i)
{
    i=0;
    if (n)
    {
        unsigned k=std::numeric_limits<unsigned>::digits-__builtin_clz(n);
        unsigned l=shift_half(1<<k); // OK, truncates
        unsigned h=shift_half(1<<(k+1))+1; // rounding error from half shift
        while (l!=h)
        {
            i++;
            unsigned m=(l+h)/2;
            if (m*m > n)
                h=m;
            else
                l=m+1;
        }

        return l-1;
    }
    else
        return 0;
}
```
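As a sanity check that doesn’t depend on the half-bit shift machinery, here is a self-contained variant (names and bound choices are mine, not the original code) that seeds the same binary search with the exact, if coarser, bounds $2^{\lfloor k/2 \rfloor} \leqslant \sqrt{n} < 2^{\lfloor k/2 \rfloor + 1}$, where $k=\lfloor \log_2 n \rfloor$:

```cpp
#include <limits>

// Self-contained sketch (not the original shift_half-based version):
// seed the binary search with the exact power-of-two bounds
//   2^floor(k/2) <= sqrt(n) < 2^(floor(k/2)+1),  k = floor(log2 n).
unsigned bounded_isqrt(unsigned n, int &iters)
{
    iters = 0;
    if (!n) return 0;
    unsigned k = std::numeric_limits<unsigned>::digits - __builtin_clz(n) - 1;
    unsigned l = 1u << (k / 2);           // at most sqrt(n)
    unsigned h = (1u << (k / 2 + 1)) + 1; // strictly above sqrt(n)
    while (l != h)
    {
        iters++;
        unsigned m = (l + h) / 2;
        if (m * m > n) // note: m*m can overflow near the top of the range;
            h = m;     // widen to 64 bits for full 32-bit inputs
        else
            l = m + 1;
    }
    return l - 1;
}
```

For 16-bit inputs this takes roughly eight iterations, in line with the $\frac{k}{2}$ estimate.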


As for Newton’s method, we can use the knowledge of $k$, and of $\frac{k}{2}$, to kick-start the search at a better point than $\frac{n}{2}$. The modifications to the algorithm aren’t extensive:

```cpp
unsigned driven_newton_sqrt_bounds(unsigned n, int & i)
{
    i=0;
    if (n)
    {
        unsigned k=std::numeric_limits<unsigned>::digits-__builtin_clz(n);
        unsigned l=shift_half(1<<k); // OK, truncates anyway
        unsigned h=shift_half(1<<(k+1))+1; // rounding error from half shift
        unsigned x=(l+h)/2; // should be pretty close

        unsigned d;
        do
        {
            i++;
            d = n/x;
            x = (x+d)/2;
        }
        while (d<x);

        // precision fix: it sometimes rounds up (42.9767
        // becomes 43, which is the best answer, but) we
        // need truncation (as for the other methods)
        if (x*x>n) x--;
        return x;
    }
    else
        return 0;
}
```


The last correction can be left out if rounding is deemed OK.
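A caveat if one reproduces this with plain power-of-two bounds instead of `shift_half`: the midpoint guess can then land far enough below $\sqrt{n}$ that a single decrement no longer suffices (for $n=2^{20}-1$, whose root is 1023, the coarse midpoint seed of 768 stops the loop above at 1066). The sketch below (names and seeding mine, not the article’s code) therefore seeds Newton from the upper bound $2^{\lfloor k/2\rfloor+1} \geqslant \sqrt{n}$ and uses the monotone-descent form of the iteration, which stops exactly at $\lfloor\sqrt{n}\rfloor$ with no final correction:

```cpp
#include <limits>

// Self-contained sketch (not the article's shift_half-based code):
// seed from the power-of-two upper bound and descend monotonically.
// Once x >= floor(sqrt(n)), every Newton iterate stays >= floor(sqrt(n)),
// so when the sequence stops decreasing, x is exactly floor(sqrt(n)).
unsigned bounded_newton_isqrt(unsigned n, int &iters)
{
    iters = 0;
    if (!n) return 0;
    unsigned k = std::numeric_limits<unsigned>::digits - __builtin_clz(n) - 1;
    unsigned x = 1u << (k / 2 + 1); // >= sqrt(n)
    for (;;)
    {
        iters++;
        unsigned y = (x + n / x) / 2;
        if (y >= x) return x; // stopped decreasing: x == floor(sqrt(n))
        x = y;
    }
}
```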

*
* *

How does the knowledge of $k$ help in finding square roots? As before, we instrumented the code to count the iterations performed in the main loop, and we will use this count as our performance metric rather than, say, raw CPU time. Why? Because timing a specific implementation measures only that implementation; as we showed a couple of weeks ago, shifting by half a bit is a trade-off of speed and precision, and we cannot exclude that someone would go through the trouble of implementing his own CPU with a “shift by half a bit” instruction. Let us, again, refer to this previous post for the details of all the methods. Now, compare (Y-axis: number of iterations, X-axis: numbers):

where the “Bounded” versions are the versions shown just above.

We see that the use of $k$ greatly reduces the number of iterations for the binary search method. Again, the number of iterations will be about $\frac{k}{2}$. As for Newton’s method, using an initial guess of

$\displaystyle x=\frac{1}{2}\left(2^{\frac{k}{2}}+2^{\frac{k+1}{2}}\right)$

reduces the number of iterations somewhat further than the “driven” Newton search, and quite a lot further than the naïve version, where the initial guess is merely $\frac{n}{2}$.

*
* *

Unsurprisingly, then, having better initial guesses is a plus. I’m not sure, though, that the speed-up, as measured by the number of iterations, is worth the added complexity, small as it is. However, we’ve seen that using a built-in function that maps to an efficient instruction (__builtin_clz, which compiles down to the CPU instruction bsr) is much faster than the equivalent naïve implementation, so in the end the added complexity might be beneficial.
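To put that last point in perspective, here is what a naïve equivalent of __builtin_clz might look like (a hypothetical comparison version, not the series’ code): a loop of up to 32 tests, against what is typically a single bsr/lzcnt instruction.

```cpp
#include <climits>

// Naive leading-zero count for comparison (hypothetical): scans one bit
// at a time from the top. Unlike __builtin_clz, which is undefined for 0,
// this returns the full word width for n == 0.
int naive_clz(unsigned n)
{
    int c = 0;
    for (unsigned mask = 1u << (sizeof(unsigned) * CHAR_BIT - 1);
         mask && !(n & mask);
         mask >>= 1)
        ++c;
    return c;
}
```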
