## Interpolation Search

We all know about binary search. It’s been with us such a long time. Knuth thinks it’s first appearance in print is in the Moore School Lectures, in 1946. The technique search for an item in a list by halving, iteratively, the searched portion. Its complexity is $O(\log n)$ for a list of $n$ values, making it a very efficient algorithm for searching.

One may even be tempted to think that it’s in fact optimal, that we can’t do significantly better. Is that so?

Well, actually no. The (classical?) binary searches splits the list in half, exactly at its center. But the center might not be the best place to split the list. What if the value found in the middle of the (current) list is much bigger or much smaller than the value we are searching for? The list is then likely biased somehow and we should take advantage of that. One idea, introduced by Perl, Itai and Avni in 1978, uses information on both ends of the (current) list to guesstimate where in that range is the searched value.

Their idea is to use linear interpolation to compute the location of the search value under the hypothesis that all values between the two extremes (the smallest value at the beginning of the list and the largest at the end) are increasing evenly. They make the hypothesis that values form a straight line between the two extremes of the list.

Let $v$ be the searched value, $l$ the index of the first item in the current list, $h$ the index of the last item, and $v_l$ and $v_h$, the two values that either end of the list. The location $p$ of $v$ is estimated as $\displaystyle p=\frac{v-v_l}{v_h-v_l}(h-l)$,

whereas in binary search, we have $p=\lfloor (l+h)/2 \rfloor$.

*
* *

A quick PoC:

#include <iostream>
#include <algorithm>
#include <vector>

//////////////////////////////
typedef size_t (*interpolator)(size_t low, int low_val, size_t high, int high_val, int val);

//////////////////////////////
size_t binary_interpolator(size_t low, int, size_t high, int, int)
{
return (low+high)/2;
}

//////////////////////////////
size_t linear_interpolator(size_t low, int low_val, size_t high, int high_val, int val)
{
if (low!=high)
{
if (val-low_val<0)
return low;
else
return low+((val-low_val)*(high-low))/(high_val-low_val);
}
else
return low;
}

//////////////////////////////
void init_vect(std::vector<int> & v, int amplitude, size_t nb_items)
{
v.resize(nb_items);
for (size_t i=0;i<nb_items;i++)
v[i]=rand() % amplitude;

sort(v.begin(),v.end());
}

//////////////////////////////
size_t binary_search( const std::vector<int> & v,
int value,
uint64_t & steps,
interpolator interp)
{
steps=0;
size_t l=0, h=v.size()-1;
while (l<h)
{
steps++;
size_t p=interp(l,v[l],h,v[h],value);
if (v[p]<value)
l=p+1;
else
if (v[p]==value)
return p;
else
h=p;
}
return l;
}

//////////////////////////////
int main()
{
const int amplitude=1000;
const size_t nb_items=1000; // or whatev.
std::vector<int> v;

init_vect(v,amplitude,nb_items);

size_t total_binary_steps=0;
size_t total_linear_steps=0;
for (int i=0;i<amplitude;i++)
{
size_t binary_steps;
size_t linear_steps;
size_t w_b=binary_search(v,i,binary_steps,binary_interpolator);
size_t w_l=binary_search(v,i,linear_steps,linear_interpolator);
total_binary_steps+=binary_steps;
total_linear_steps+=linear_steps;
}

std::cout
<< "log2(" << nb_items << ")=" << log(nb_items)/log(2) << std::endl
<< "avg bin steps=" << total_binary_steps/(float)nb_items << std::endl
<< "avg lin steps=" << total_linear_steps/(float)nb_items << std::endl
;
return 0;
}


*
* *

The complexity of interpolation search is much better that binary search’s. Binary search is $O(\lg n)$ but Perl et al. have shown that interpolation search (under favorable conditions) is $O(\lg \lg n)$. For a table with $n=10^6$, binary search does 20 or 21 iterations. Interpolation will do about 4. With $n=10^{10}$, interpolation will iterate about 5 times, while binary search will iterate 33 or 34 times. That’s how much faster interpolation search is.

*
* *

However, interpolation search doesn’t quite fit the general frame imposed by binary search. Binary search, especially with the variant where the “center” of the list is chosen as $p=\lfloor (h+l)/2 \rfloor$, is just what it advertises: binary. It makes a split where one list contains the items from $l$ to $p-1$ and another with the items from $p$ to $h$. It therefore guarantees that which ever side the algorithm chooses, the (current) list shrinks.

Interpolation search, on the other hand, doesn’t quite give you this warranty. The guesstimated position can basically be anywhere. But here’s the catch: if you do the binary search style partitions, $l$ to $p-1$ and $p$ to $h$, interpolation search may well find that the best next guess is still $p$, and your partition doesn’t shrink! To fix that, interpolation search becomes a “trichotomy”, where the partition becomes $l$ to $p-1$, $p$, and $p$ to $h$. The “partition” containing only the item at position $p$ is checked once with equality. If the position contains the item, you’re done. Otherwise the next predicted position will be different.

Reference

Yehoshua Perl, Alon Itai, Haim Avni — Interpolation Search – A Log Log N Search — Comm. ACM, vol. 21 no. 7 (july 1978) p. 550–552