Last week we examined the complexity of obtaining $k=\lfloor \log_2 n \rfloor$ in the decomposition $n=2^k(1+f)$ for some integer $n$. This week, we'll have a look at how we can use it to speed up square root extraction. Recall, we're interested in $\lfloor \log_2 n \rfloor$ because

$\sqrt{n}=\sqrt{2^k(1+f)}=2^{\frac{k}{2}}\sqrt{1+f}$,

with $0 \le f < 1$, which allows us to get easy bounds on $\sqrt{n}$. Better, we also have that

$2^{\frac{k}{2}} \le \sqrt{n} < 2^{\frac{k+1}{2}}$,

and we know how to compute $2^{\frac{k}{2}}$ (somewhat efficiently)! Let's combine all this and see how it speeds up things.

Most of the code for square roots is found here; the code for shifting, including by half a bit, is found here.

Using $2^{\frac{k}{2}}$ and $2^{\frac{k+1}{2}}$ as lower and upper bounds for the binary search will provide better performance because we reduce the number of iterations from, say, 16 for 16-bit integers to about $\frac{k}{2}$, where $k=\lfloor \log_2 n \rfloor$ depends on $n$.

```cpp
unsigned driven_binary_sqrt_bounds(unsigned n, int & i)
 {
  i=0;
  if (n)
   {
    unsigned k=std::numeric_limits<unsigned>::digits-__builtin_clz(n);
    unsigned l=shift_half(1<<k); // OK, truncates
    unsigned h=shift_half(1<<(k+1))+1; // rounding error from half shift

    while (l!=h)
     {
      i++;
      unsigned m=(l+h)/2;
      if (m*m > n)
       h=m;
      else
       l=m+1;
     }

    return l-1;
   }
  else
   return 0;
 }
```

As for Newton's method, we can use the knowledge of $k$, $l$, and $h$ to kick-start the search at a better point than the naïve initial guess. The modifications to the algorithm aren't extensive:

```cpp
unsigned driven_newton_sqrt_bounds(unsigned n, int & i)
 {
  i=0;
  if (n)
   {
    unsigned k=std::numeric_limits<unsigned>::digits-__builtin_clz(n);
    unsigned l=shift_half(1<<k); // OK, truncates anyway
    unsigned h=shift_half(1<<(k+1))+1; // rounding error from half shift
    unsigned x=(l+h)/2; // should be pretty close
    unsigned d;

    do
     {
      i++;
      d = n/x;
      x = (x+d)/2;
     }
    while (d<x);

    // precision fix: it sometimes rounds up (42.9767
    // becomes 43, which is the best answer, but) we
    // need truncation (as for the other methods)
    if (x*x>n) x--;

    return x;
   }
  else
   return 0;
 }
```

The last correction can be left out if rounding is deemed OK.

*

* *

How does the knowledge of $\lfloor \log_2 n \rfloor$ help finding square roots? As we did before, we instrumented the code to obtain the number of iterations performed in the principal loop. We will use this as a metric of performance rather than, say, raw CPU time. Why so? Because measuring the performance of a specific implementation is one thing, and, as we showed a couple of weeks ago, shifting by half a bit is a trade-off between speed and precision, and we cannot exclude that someone would go through the trouble of implementing his own CPU with a “shift by half a bit” instruction. Let us, again, refer to this previous post for the details of all the methods. Now, compare (Y-axis: number of iterations, X-axis: numbers):

where the “Bounded” versions are the versions just above.

We see that the use of $\lfloor \log_2 n \rfloor$ greatly reduces the number of iterations for the binary search method. Again, the number of iterations will be about $\frac{k}{2}$. As for Newton's method, using an initial guess of

$x_0=\frac{l+h}{2}$,

halfway between the two bounds, somewhat reduces the number of iterations further from the “driven” Newton search, and quite a lot from the naïve version, where the initial guess is much farther off.

*

* *

Unsurprisingly, then, having better initial guesses is a plus. I'm not sure, though, that the speed-up, as measured by the number of iterations, is worth the added complexity, even if it is a tiny amount. We've seen that using a built-in function that maps to an efficient instruction (`__builtin_clz`, which maps to the CPU instruction `bsr`) is much faster than the equivalent naïve implementation, so in the end the added complexity might be beneficial.