Square Roots (Part V)

Last week we examined the complexity of obtaining $k$ in the decomposition $n=2^k+b$ for some integer $n$. This week, we’ll have a look at how we can use it to speed up square root extraction. Recall that we’re interested in $k$ because $2^k \leqslant 2^k+b < 2^{k+1}$,

with $0 \leqslant b < 2^k$, which allows us to get easy bounds on $\sqrt{n}$. Better, we also have that $\sqrt{2^k} \leqslant \sqrt{2^k+b} < \sqrt{2^{k+1}}$,

and we know how to compute $\sqrt{2^k}=2^{\frac{k}{2}}$ (somewhat) efficiently! Let’s combine all this and see how it speeds things up.
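As a concrete check of these bounds (floating point is used here only for illustration; it is not part of the square-root code itself, and the helper name is mine):

```cpp
#include <cmath>

// For 2^k <= n < 2^(k+1), check that 2^(k/2) <= sqrt(n) < 2^((k+1)/2).
// Example: n = 1000 = 2^9 + 488, so k = 9, and indeed
// 2^4.5 ~ 22.63 <= 31.62... < 32 = 2^5.
bool bounds_hold(unsigned n)
{
    int k = 31 - __builtin_clz(n);   // floor(log2(n)); 32-bit unsigned assumed
    double r  = std::sqrt((double)n);
    double lo = std::exp2(k / 2.0);
    double hi = std::exp2((k + 1) / 2.0);
    return lo <= r + 1e-9 && r < hi; // tiny slack for rounding in exp2
}
```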

Most of the code for square roots is found here; the code for shifting, including by half a bit, is found here.
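The linked implementation isn’t reproduced here, but for a power of two the effect of a half-bit shift can be sketched as follows (the function name, the fixed-point constant, and the handling of odd exponents are my assumptions, not necessarily what the linked code does):

```cpp
#include <cstdint>

// Sketch: for x == 2^k, return floor(2^(k/2)) -- a "shift by half a
// bit" of the exponent. For odd k, the leftover factor of sqrt(2) is
// applied in 16.16 fixed point (92682 ~ sqrt(2) * 2^16), truncating.
unsigned shift_half_sketch(unsigned x)
{
    int k = __builtin_ctz(x);          // x is assumed to be a power of two
    unsigned base = 1u << (k / 2);
    if ((k & 1) == 0)
        return base;
    return (unsigned)(((uint64_t)base * 92682u) >> 16);
}
```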

Using $2^{\frac{k}{2}}$ and $2^{\frac{k+1}{2}}$ as lower and upper bounds for the binary search provides better performance because we reduce the number of iterations from, say, 16 for 16-bit integers to about $\frac{k}{2}$, where $k$ depends on $n$.

unsigned driven_binary_sqrt_bounds(unsigned n, int & i)
{
    i = 0;
    if (n)
    {
        // k such that 2^k <= n < 2^(k+1)
        unsigned k = std::numeric_limits<unsigned>::digits - __builtin_clz(n) - 1;
        unsigned l = shift_half(1u << k);           // OK, truncates
        unsigned h = shift_half(1u << (k + 1)) + 1; // rounding error from half shift
        while (l != h)
        {
            i++;
            unsigned m = (l + h) / 2;
            if (m * m > n)
                h = m;
            else
                l = m + 1;
        }

        return l - 1;
    }
    else
        return 0;
}
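As a quick sanity check, here is a self-contained harness that pits the bounded search against a plain binary search over $[1, n]$ and counts iterations; the `shift_half` re-declared here is only a sketch of the linked helper, assumed to return $\lfloor 2^{k/2}\rfloor$ for an input of $2^k$:

```cpp
#include <cstdint>
#include <limits>

// Assumed stand-in for the linked shift_half: for x == 2^k,
// returns floor(2^(k/2)), truncating.
static unsigned shift_half(unsigned x)
{
    int k = __builtin_ctz(x);
    unsigned base = 1u << (k / 2);
    return (k & 1) ? (unsigned)(((uint64_t)base * 92682u) >> 16) : base;
}

// The bounded search from the text.
static unsigned driven_binary_sqrt_bounds(unsigned n, int & i)
{
    i = 0;
    if (n)
    {
        unsigned k = std::numeric_limits<unsigned>::digits - __builtin_clz(n) - 1;
        unsigned l = shift_half(1u << k);
        unsigned h = shift_half(1u << (k + 1)) + 1;
        while (l != h)
        {
            i++;
            unsigned m = (l + h) / 2;
            if (m * m > n) h = m; else l = m + 1;
        }
        return l - 1;
    }
    return 0;
}

// Plain binary search over [1, n+1) for comparison
// (small n only, to avoid m*m overflowing).
static unsigned naive_binary_sqrt(unsigned n, int & i)
{
    i = 0;
    unsigned l = 1, h = n + 1;
    while (l != h)
    {
        i++;
        unsigned m = l + (h - l) / 2;
        if (m * m > n) h = m; else l = m + 1;
    }
    return l - 1;
}
```

Both return the truncated square root, but the bounded version gets there in far fewer probes because its search interval is only about $0.41\cdot 2^{k/2}$ wide.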

As for Newton’s method, we can use the knowledge of $k$ and $\frac{k}{2}$ to kick-start the search at a better point than $\frac{n}{2}$. The modifications to the algorithm aren’t extensive:

unsigned driven_newton_sqrt_bounds(unsigned n, int & i)
{
    i = 0;
    if (n)
    {
        // k such that 2^k <= n < 2^(k+1)
        unsigned k = std::numeric_limits<unsigned>::digits - __builtin_clz(n) - 1;
        unsigned l = shift_half(1u << k);           // OK, truncates anyway
        unsigned h = shift_half(1u << (k + 1)) + 1; // rounding error from half shift
        unsigned x = (l + h) / 2; // should be pretty close

        unsigned d;
        do
        {
            i++;
            d = n / x;
            x = (x + d) / 2;
        }
        while (d < x);

        // precision fix: it sometimes rounds up (42.9767
        // becomes 43, which is the best answer, but we
        // need truncation, as for the other methods)
        if (x * x > n) x--;
        return x;
    }
    else
        return 0;
}

The last correction can be left out if rounding is deemed OK.
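A minimal illustration of that correction, with a naïve starting point to keep it self-contained (the driven start above behaves the same way; the function name and the `fix` flag are mine):

```cpp
// Integer Newton iteration; with fix == false the result may round
// up (e.g. sqrt(3) = 1.73... comes out as 2), while fix == true
// truncates, like the other methods.
unsigned newton_sqrt(unsigned n, bool fix)
{
    if (n == 0) return 0;
    unsigned x = n / 2 ? n / 2 : 1;  // naïve initial guess
    unsigned d;
    do
    {
        d = n / x;
        x = (x + d) / 2;
    } while (d < x);
    if (fix && x * x > n) x--;       // the correction discussed above
    return x;
}
```

For $n=3$, the loop settles on $x=2$ (and $2^2>3$); the correction brings it back to $1=\lfloor\sqrt{3}\rfloor$.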

*
* *

How does the knowledge of $k$ help in finding square roots? As we did before, we instrumented the code to obtain the number of iterations performed in the main loop. We will use this as a metric of performance rather than, say, raw CPU time. Why so? Because measuring the performance of a specific implementation is one thing, but, as we showed a couple of weeks ago, shifting by half a bit is a trade-off of speed and precision, and we cannot exclude that someone would go through the trouble of implementing their own CPU with a “shift by half a bit” instruction. Let us, again, refer to this previous post for the details of all the methods. Now, compare (Y-axis: number of iterations; X-axis: numbers), where the “Bounded” versions are the versions just above.

We see that the use of $k$ greatly reduces the number of iterations for the binary search method. Again, the number of iterations will be about $\frac{k}{2}$. As for Newton’s method, using an initial guess of $\displaystyle x=\frac{1}{2}\left(2^{\frac{k}{2}}+2^{\frac{k+1}{2}}\right)$

somewhat reduces the number of iterations further from the “driven” Newton search, and quite a lot from the naïve version where the initial guess is merely $\frac{n}{2}$.
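To get a feel for how much better the starting point is, one can measure both guesses against the true root (a throwaway measurement using floating point, with my sketch of `shift_half` standing in for the linked helper):

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Assumed stand-in for the linked shift_half: for x == 2^k,
// returns floor(2^(k/2)).
static unsigned shift_half(unsigned x)
{
    int k = __builtin_ctz(x);
    unsigned base = 1u << (k / 2);
    return (k & 1) ? (unsigned)(((uint64_t)base * 92682u) >> 16) : base;
}

// Absolute error of the driven initial guess (l+h)/2 versus the
// naïve guess n/2.
static void guess_errors(unsigned n, double & driven, double & naive)
{
    unsigned k = std::numeric_limits<unsigned>::digits - __builtin_clz(n) - 1;
    unsigned l = shift_half(1u << k);
    unsigned h = shift_half(1u << (k + 1)) + 1;
    double r = std::sqrt((double)n);
    driven = std::fabs((l + h) / 2.0 - r);
    naive  = std::fabs(n / 2.0 - r);
}
```

For $n=1000$, the driven guess is about $27$ against $\sqrt{1000}\approx 31.6$, while $n/2=500$ is off by more than $468$.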

*
* *

Unsurprisingly, then, having better initial guesses is a plus. I’m not sure, though, that the speed-up, as measured by the number of iterations, is worth the added complexity, even if it is a tiny amount. On the other hand, we’ve seen that using a built-in function that maps to an efficient instruction (__builtin_clz, which maps to the CPU instruction bsr) is much faster than the equivalent naïve implementation, so in the end the added complexity might be beneficial.
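For reference, the naïve counterpart of __builtin_clz scans bits one at a time, which is what makes the builtin such a win (a sketch, not the benchmark code from that earlier post):

```cpp
#include <limits>

// Count leading zeros by scanning from the most significant bit down;
// __builtin_clz typically compiles to a single instruction instead.
int naive_clz(unsigned x)
{
    int n = 0;
    unsigned mask = 1u << (std::numeric_limits<unsigned>::digits - 1);
    while (mask && !(x & mask))
    {
        n++;
        mask >>= 1;
    }
    return n; // like __builtin_clz, meaningless for x == 0
}
```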
