Yet another Square Root Algorithm

Computing (integer) square root is usually done by using a float square root routine and casting it to an integer, possibly with rounding. What if, for some reason, we can’t use floats? What are the possibilities?

One possibility is to use binary search to find the square root of a number. It may be a good choice because it will only perform a number of iterations that is half of the (maximum) numbers of bits of the number (we will explain why in a moment). Another possibility is to use Newton’s method to find the square root. It does a bit better than binary search, but not exceedingly so: on very small integers, it may iterate a third as much iterations as binary search, but on large integers, it will do only a bit fewer iterations. Can we do better? Yes!

Let’s start by describing the binary search algorithm. Let’s say we want to extract the square roots of unsigneds. On a typical box, unsigneds will be 32 bits. A 32 bits number has, at most, a 16 bits square root (since $\sqrt{2^{32}}=2^{16}$), and a binary search can search between 0 to 65535, inclusive.

A straightforward implementation would look something like:

unsigned binary_sqrt(unsigned n)
{
unsigned l=0;
unsigned h=(std::numeric_limits<unsigned>::max()
>> (std::numeric_limits<unsigned>::digits/2))+1;

while (l!=h)
{
unsigned m=(l+h)/2;

if (m*m > n)
h=m;
else
l=m+1;
}

return l-1;
}


You can instrument it to count the number of times it cycles through the while loop (which I did).

In another post, I discussed Newton's algorithm to extract square roots. Read the previous post for the details of how Newton's algorithm works. Again, a straightforward implementation is given:

unsigned newton_sqrt(unsigned n)
{
if (n)
{
unsigned x=std::max(2u,n/2);
unsigned d;
do
{
d = n/x;
x = (x+d)/2;
}
while (d<x);
return x;
}

else return 0;
}


I am sure you've been thinking that, surely, we can restrict the binary search, or use a better initial estimate for Newton's method. Well, we can have a good guess on the bounds in which the square root lies.

Let's see how we can derive some loose bounds. Suppose

$k=2^n+b$

the number of which we want to extract the root, and where $0\leqslant b < 2^n$ (for the decomposition to be unique). If $k$ is written in such a way, then we have the following inequalities:

$\displaystyle \sqrt{2^n} \leqslant \sqrt{k}=\sqrt{2^n+b}<\sqrt{2^{n+1}}$

But $\sqrt{2^n}=2^\frac{n}{2}$, and we might be able to approximate this efficiently, at least getting $2^{\lfloor\frac{n}{2}\rfloor}$. Well, in fact, we can. We can estimate $n+1$ using one of GCC's builtin functions. We can then use $2^{\lfloor\frac{n}{2}\rfloor}$ for the lower bound, and $2^{\lfloor\frac{n}{2}+1\rfloor}$ for the upper bound.

The implementation using the __builtin_clz intrinsic function is:

unsigned driven_binary_sqrt(unsigned n)
{
if (n)
{
unsigned bits=std::numeric_limits<unsigned>::digits-__builtin_clz(n);
unsigned shift=std::max(1u,bits/2);
unsigned h=std::min(65536u,(1u<<(shift+1))-1);
unsigned l=(h>>2);
while (l!=h)
{
unsigned m=(l+h)/2;
if (m*m > n)
h=m;
else
l=m+1;
}
return l-1;
}
else
return 0;
}


*
* *

Was it worth all that trouble? We have seen a couple of times that a more sophisticated algorithm didn't yield the expected speed ups. But this times, it does a bit:

We see that the fixed-range binary search yields, unsurprisingly, a constant number of iterations. Newton's method with an initial guess of $\frac{k}{2}$ (remember, $n$ is the exponent, but $k$ is the number of which we want to know the square root) does a lot better. But, lo! the binary search with a smarter interval does a lot better! yay!

But now, what if we use a better estimate for Newton's method? Say we replace

    unsigned x=std::max(1u,(n+1)/2);


by

    unsigned shift=(std::numeric_limits<unsigned>::digits-__builtin_clz(n))/2;
unsigned x=std::min(65536u,(1u<<(shift+1))-1);


?

Well, we get a killer speed-up for Newton (at least iteration-wise):

Only reaction possible:

*
* *

I had the intuition that extracting the square root of a $n$ bit long number would yield a number of approximately $\frac{n}{2}$ bits for a long time. The inequalities I presented above justify this view, but they are still loose because they are not exact, they ignore the $b$ term, or more exactly, assumes it to be zero on the left side and at most $2^n$ on the right side. A bit of refinement may be in order to get even better bounds.

6 Responses to Yet another Square Root Algorithm

1. Interesting way to get square roots!

2. zeuxcg says:

There are a few bugs in the code above; here’s a thorough test: https://gist.github.com/zeux/259846e93b0ffb2cde48 (no reason not to exhaustively test functions with 32-bit inputs ;-)

Otherwise for large numbers it seems like naive binary sqrt is best in terms of runtime. Driven binary is probably slower because it has extra setup that is more expensive than a few extra iterations it saves (for large numbers); newton uses horribly slow integer division, and while driven newton manages to recover because of a lower iteration count, on platforms without native integer division driven newton can easily be an order of magnitude slower…

• There are no parentheses because of operator precedence. See

http://en.cppreference.com/w/cpp/language/operator_precedence

But you’re right. Style is inconsistent.

There might be bugs, but they are verified to give $\lfloor\sqrt{n}\rfloor$ on 0 to $2^{31}$ or so.

• zeuxcg says:

I am well aware of C++ precedence rules. Some of them are not obvious; clang knows this and emits a warning, which is why I changed the code. This is not related to correctness – I just marked this location in the code because it was the only thing I changed.

I’m not sure what the discussed range is. You don’t mention a range explicitly so I’m assuming a full 32-bit range of inputs. Running code from the gist displays error cases for functions that are not 100% correct.

• All routines in their original version worked properly for $0$ to $2^{31}-1$. I fixed them so that the range is the complete (unsigned) 32 bits.

3. […] last week, we saw that we could use a (supposed) efficient machine-specific instruction to derive good bounds (but not very tight) to help binary search-based and Newton-based square root extraction algorithms go faster. Last week, we saw that the technique did lead to fewer iterations in both algorithms. […]