Last week, we saw that a (supposedly) efficient machine-specific instruction could be used to derive good, though not very tight, bounds to help binary search-based and Newton-based square root extraction algorithms go faster, and that the technique did lead to fewer iterations in both algorithms.
Now, does the reduced number of iterations translate into actual, machine speed-up?
Let’s see. A quick timing test extracting square roots from 0 to 100000000 with each of the algorithms reveals the following graph:
Where “sqrt” is the C math library function, “binary” the binary search, “driven binary” the binary search with tighter bounds, “Newton” Newton’s algorithm, and “driven Newton” Newton’s algorithm with a better initial guess.
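A minimal reconstruction of such a timing loop might look like the following. Here `isqrt` is a stand-in for whichever variant is being measured (a plain binary search in this sketch), and the checksum keeps the compiler from optimizing the calls away; the function names and the reduced limit are illustrative, not the exact benchmark code.

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Stand-in extractor: a plain binary search for floor(sqrt(n)).
   Swap in any of the variants being compared. */
uint32_t isqrt(uint32_t n)
{
    uint32_t lo = 0, hi = 65536;  /* the root of a 32-bit int fits in 16 bits */
    while (hi - lo > 1) {
        uint32_t mid = (lo + hi) / 2;
        if ((uint64_t)mid * mid <= n) lo = mid;
        else                          hi = mid;
    }
    return lo;
}

/* Extract square roots from 0 to limit-1, returning a checksum so the
   calls cannot be optimized away. */
uint64_t time_isqrt(uint32_t limit)
{
    uint64_t sum = 0;
    clock_t start = clock();
    for (uint32_t i = 0; i < limit; i++)
        sum += isqrt(i);
    double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("%u roots in %.3f s (checksum %llu)\n",
           limit, elapsed, (unsigned long long)sum);
    return sum;
}
```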
Newton’s algorithm does fewer iterations than the binary searches, but each iteration is much more expensive. So expensive, in fact, that even with half (or fewer) the iterations, it is still much slower than the “dumb” binary search. We also see that “driven binary” is slightly slower than the vanilla version. So, after all, __builtin_clz isn’t all that cheap an instruction.
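For reference, here is a sketch of the two binary-search variants as described above; the exact code from last week’s post may differ. The “driven” version uses __builtin_clz to pin the root between two consecutive powers of two before the same search runs:

```c
#include <stdint.h>

/* Plain binary search for floor(sqrt(n)): multiplications, halvings,
   and comparisons only. */
uint32_t isqrt_binary(uint32_t n)
{
    uint32_t lo = 0, hi = 65536;  /* the root of a 32-bit int fits in 16 bits */
    while (hi - lo > 1) {
        uint32_t mid = (lo + hi) / 2;
        if ((uint64_t)mid * mid <= n) lo = mid;
        else                          hi = mid;
    }
    return lo;
}

/* "Driven" variant: __builtin_clz yields the bit length of n, and
   floor(sqrt(n)) has roughly half as many bits, which tightens the
   initial [lo, hi) interval before the search starts. */
uint32_t isqrt_driven(uint32_t n)
{
    if (n == 0) return 0;
    int bits = 32 - __builtin_clz(n);      /* 2^(bits-1) <= n < 2^bits */
    uint32_t lo = 1u << ((bits - 1) / 2);  /* lo <= sqrt(n)            */
    uint32_t hi = 1u << ((bits + 1) / 2);  /* sqrt(n) < hi             */
    while (hi - lo > 1) {
        uint32_t mid = (lo + hi) / 2;
        if ((uint64_t)mid * mid <= n) lo = mid;
        else                          hi = mid;
    }
    return lo;
}
```

Both loops do the same work per iteration; the driven version only saves the iterations needed to shrink the interval down to the right pair of powers of two, which is why the cost of __builtin_clz can eat the savings.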
Wild excitement gives way to disappointment. Kind of. Despite a much better initial value, Newton’s algorithm fails to beat a simple binary search. Why? One can guess that multiplication is much faster than division on my CPU. The binary search uses only multiplication, division by two, and comparison; Newton’s algorithm also uses a division by two and a comparison, but adds a division by an arbitrary number. The big difference, then, is the integer division (even though it is supported by a hardware instruction).
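To make it concrete where that division shows up, here is the standard integer form of Newton’s iteration for floor(sqrt(n)), not necessarily the exact code that was timed: every pass through the loop pays for one full n / x.

```c
#include <stdint.h>

/* Integer Newton's method for floor(sqrt(n)). The update
   x <- (x + n/x) / 2 halves cheaply (a one-bit shift), but the n/x
   term is an integer division by an arbitrary number: the costly step.
   A "driven" version would replace the initial x = n with a power of
   two near sqrt(n) obtained from __builtin_clz. */
uint32_t isqrt_newton(uint32_t n)
{
    if (n < 2) return n;
    uint64_t x = n;                 /* 64-bit to avoid overflow in x + n/x */
    uint64_t y = (x + n / x) / 2;   /* one Newton step                     */
    while (y < x) {                 /* the sequence decreases until it     */
        x = y;                      /* lands on floor(sqrt(n))             */
        y = (x + n / x) / 2;
    }
    return (uint32_t)x;
}
```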