Hash Table and Magic (Relatively Primes) Numbers

Today, let’s talk about hash tables. Or, more precisely, one type of secondary probing technique, one that uses some number theory—but not quadratic residues, just relatively prime numbers. I don’t know if you’re like me, but it often bugs me when I read something in a textbook that seems to make sense, but is claimed without proof or demonstration (oh, I think I said that before). That seems to happen particularly often when I read stuff about data structures, and this time I decided to explore one of those claims.

You know how a hash table works. You have a large array T with $T_s$ slots. You want to insert data into it, and you compute its location as $H(key)~\mathrm{mod}~T_s$, with $H(\cdot)$ being a function that repeatably produces a random-looking number from the key, giving the same random-looking number whenever it is fed the same key. If you find that there’s already something in that slot, you use a secondary hash function that will probe other places in the table where the data associated with the $key$ might be.

To be useful, the secondary hash function must be capable of visiting all the entries of the tables exactly once. In other words, it has to generate a permutation of $0,1,\ldots,T_s-1$.

The simplest way of doing this is to use linear (secondary) probing. This method proceeds as follows. You first get $a_0 = H(key)~\mathrm{mod}~T_s$,

the starting address. If you happen to find what you’re looking for (or an empty entry, if you’re inserting) at address $a_0$, then you’re done, you’ve found where the data should be. If the table does not contain the data corresponding to the $key$ at address $a_0$ (or the slot is not empty, if you’re inserting), then you have to resort to a secondary hash function that will examine further candidates with $a_0$ as a starting point. So you proceed to search for a slot in the table containing the key you’re looking for (or an empty slot; whichever comes first). By varying $t=0,1,2,\ldots,T_s-1$, you examine the addresses $a_t = (a_0 + t)~\mathrm{mod}~T_s$.

So basically, once you get the initial address $a_0$, you try $a_0+1$, $a_0+2$, $a_0+3$, …, possibly wrapping around the table, until you find the key you’re looking for or an empty slot, or until you have probed the entire table and returned to $a_0$ (when $t=T_s$).
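As a concrete sketch of the procedure above (the toy_table type, its use of std::hash as $H(\cdot)$, and the std::optional slots are my own illustration, not a production design):

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// A minimal linear-probing table. An empty optional is a free slot.
struct toy_table
{
    std::vector<std::optional<std::pair<std::string,int>>> slots;

    explicit toy_table(std::size_t size) : slots(size) {}

    // a_0 = H(key) mod T_s, with std::hash standing in for H
    std::size_t a0(const std::string & key) const
    { return std::hash<std::string>{}(key) % slots.size(); }

    bool insert(const std::string & key, int value)
    {
        std::size_t start = a0(key);
        for (std::size_t t = 0; t < slots.size(); t++)
        {
            // a_t = (a_0 + t) mod T_s
            std::size_t a = (start + t) % slots.size();
            if (!slots[a] || slots[a]->first == key)
            { slots[a] = std::make_pair(key, value); return true; }
        }
        return false; // probed all T_s slots: table full
    }

    std::optional<int> find(const std::string & key) const
    {
        std::size_t start = a0(key);
        for (std::size_t t = 0; t < slots.size(); t++)
        {
            std::size_t a = (start + t) % slots.size();
            if (!slots[a]) return {};  // hit a hole: key is not in the table
            if (slots[a]->first == key) return slots[a]->second;
        }
        return {}; // probed everything
    }
};
```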

However, this method has one major problem: it creates clusters. If you have something at $a_0$ and something else at $a_0+2$, inserting another item that hashes to $a_0$ will cause $a_0+1$ to be chosen (since it’s empty), and now we have three consecutive entries. If we insert yet another item that hashes to $a_0$ (because sometimes hash functions produce the same hash for different keys), it will get inserted at $a_0+3$, and the cluster grows. This makes the search (or insertion) slow for all the keys that land near $a_0$: it “clogs” your table.

This clustering is mitigated by using a better secondary hash function, one that generates very different addresses, not just consecutive ones. Remember, a secondary hash function must generate a permutation of $0,1,2,\ldots,T_s-1$. One simple way of doing so is to pick an integer $\gamma$ that is relatively prime to $T_s$ (it will not suffice for it to be merely prime; it must be relatively prime to $T_s$, that is, share no divisor with $T_s$ other than 1).
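Testing that condition is a one-liner with C++17’s std::gcd (the valid_step name is mine):

```cpp
#include <numeric> // std::gcd

// gamma is an acceptable step size for a table of ts slots
// exactly when it shares no divisor with ts other than 1.
bool valid_step(int gamma, int ts)
{
    return std::gcd(gamma, ts) == 1;
}
```

For instance, with $T_s=100$, $\gamma=5$ is prime yet invalid (it divides 100), while $\gamma=9$ is composite yet perfectly valid.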

Once you pick $\gamma$, the secondary probing becomes $a_t = (a_0 + t\gamma)~\mathrm{mod}~T_s$,

and it visits all entries of the table exactly once, starting the probe at $a_0$, and landing in random-looking locations as $t$ varies.
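We can check the permutation property directly with a sketch like this one: when $\gcd(\gamma,T_s)=1$, the probe sequence reaches every slot in $T_s$ steps, and when it does not, it cycles early.

```cpp
#include <cstddef>
#include <vector>

// Counts how many distinct slots a_t = (a0 + t*gamma) mod ts
// visits in ts steps. If gamma is relatively prime to ts, the
// count is ts: the sequence is a permutation of 0..ts-1.
std::size_t slots_visited(std::size_t a0, std::size_t gamma, std::size_t ts)
{
    std::vector<bool> seen(ts, false);
    std::size_t count = 0;
    for (std::size_t t = 0; t < ts; t++)
    {
        std::size_t a = (a0 + t * gamma) % ts;
        if (!seen[a]) { seen[a] = true; count++; }
    }
    return count;
}
```

With $T_s=100$, $\gamma=73$ visits all 100 slots, while $\gamma=10$ keeps revisiting the same 10.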

But since the goal is to minimize clustering, $\gamma$ should be chosen so that, on average, it creates as few clusters as possible; therefore, each $t\gamma~(\mathrm{mod}~T_s)$ should land as far as possible from all the previous probes. How can we choose $\gamma$?

Ha! there we are. Some authors claim that $\displaystyle \gamma\approx(\phi-1)T_s=\left(\frac{1+\sqrt{5}}{2}-1\right)T_s$

is the best $\gamma$! And now, you should have the exact same reaction that I had: where did that came from?!. Let’s try to find out. $\phi~(\mathrm{mod}~1)$, $2\phi~(\mathrm{mod}~1)$, $3\phi~(\mathrm{mod}~1)$,…

produces, at step $t$, points that are all approximately $1/t$ apart, thus “maximally spread”. Now, we’re not interested in the unit interval, but in the integers $0,1,2,\ldots,T_s-1$: does the same miracle occur when $\gamma\approx(\phi-1)T_s$ is an integer relatively prime to $T_s$? Let us see.
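Constructing that candidate $\gamma$ might look like the sketch below: round $(\phi-1)T_s$ to the nearest integer, then nudge it to the nearest value relatively prime to $T_s$ (the nudging rule and the golden_gamma name are my own assumptions, not something the textbooks prescribe).

```cpp
#include <cmath>
#include <numeric>

// Golden-ratio step candidate: the integer nearest (phi - 1) * ts,
// adjusted outward (g, g+1, g-1, g+2, ...) until relatively prime to ts.
int golden_gamma(int ts)
{
    const double phi_frac = (1.0 + std::sqrt(5.0)) / 2.0 - 1.0; // ~0.6180339887
    int g = static_cast<int>(std::lround(phi_frac * ts));
    for (int d = 0; ; d++)
    {
        if (std::gcd(g + d, ts) == 1) return g + d;
        if (d > 0 && std::gcd(g - d, ts) == 1) return g - d;
    }
}
```

For $T_s=100$ this yields 63 (62 shares a factor 2 with 100), and for $T_s=1000$ it yields 619.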

First, we generate the prime numbers from 2 to $T_s-1$. This could be done, for example, as we did before. Once the list is acquired, we test each of those prime numbers as $\gamma$ and see how each disperses in the table. To measure dispersion, I generate $t\gamma~(\mathrm{mod}~T_s)$ and look for its two nearest neighbors among the addresses generated before (i.e., already visited): the largest one smaller than $t\gamma~(\mathrm{mod}~T_s)$ and the smallest one larger than it. The distance between those two is the size of the gap that $t\gamma~(\mathrm{mod}~T_s)$ cuts; all sizes are added up to compute the “score” of a candidate $\gamma$.

I generated the results for a few plausible table sizes: 100, 200, 500, 1000, 10000, 100000. The results are summarized:

$T_s$     $\gamma$   ratio
100       73         0.73
200       139        0.695
500       347        0.694
1000      821        0.821
10000     6199       0.6199
100000    70717      0.70717

[Figures: gap sizes for $T_s=100$, $200$, $500$, $1000$, $10000$, and $100000$.]

So it seems that $\frac{\gamma}{T_s}\approx{}0.7$ is better than $\approx 0.618\ldots$ Busted!

*
* *

While the claim that $\frac{\gamma}{T_s}\approx(\phi-1)$ is busted, we see that there are a lot of “good” values for $\gamma$. Some are spectacularly bad, for example, very small ones, those close to $T_s$, or those near $\frac{1}{2}T_s$. Beyond that, however, it becomes difficult to decide, just from the magnitude of the chosen number, whether it will be a good choice. That might need more exploring.

*
* *

OK, let us now discuss the implementation. Finding the size of the gap where we put the next $t\gamma~(\mathrm{mod}~T_s)$ is the most costly part of the simulation. One idea might be to use a bitmap: set a bit wherever we insert an item, then scan the bitmap for the nearest set bit to the left and to the right. This works, but it is slow, and it gets worse as $T_s$ grows. A better solution is to use a std::map to hold the addresses drawn so far. This way, we can insert the current address, find the next item using upper_bound, and find the previous one by stepping an iterator back.

Computing gap size around some address x would look something like:

int gap_size(const std::map<int,bool> & mappe,
             int x,
             int min=0, int max=10000)
{
  // first visited address strictly greater than x
  std::map<int,bool>::const_iterator n=mappe.upper_bound(x);
  int hi = (n!=mappe.end()) ? n->first : max;

  // largest visited address strictly smaller than x:
  // lower_bound finds x (or the first entry above it);
  // stepping back gives the previous entry, if there is one
  std::map<int,bool>::const_iterator p=mappe.lower_bound(x);
  int lo = min;
  if (p!=mappe.begin()) { --p; lo=p->first; }

  return hi-lo;
}

Since std::map is a self-balancing search tree (a red-black tree, in practice), upper_bound and lower_bound cost $O(\lg n)$ and stepping the iterator costs at most $O(\lg n)$, so the above procedure is essentially $O(\lg n)$ for $n$ addresses inserted. This means that the whole simulation is $O(n \lg n)$, rather than the $O(n^2)$ of scanning a bitmap.
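Using the same gap-cutting logic, the scoring loop described earlier can be sketched self-contained as follows (the score name and the choice of $a_0=0$ are my assumptions):

```cpp
#include <map>

// Score a candidate gamma for a table of ts slots: each probe
// x = t*gamma mod ts lands in a gap between already-visited
// addresses; the sizes of the gaps cut are summed up.
long score(int gamma, int ts)
{
    std::map<int,bool> visited;
    long total = 0;
    for (int t = 0; t < ts; t++)
    {
        int x = int((long(t) * gamma) % ts);

        // nearest visited neighbor above x (or the end of the table)
        std::map<int,bool>::const_iterator n = visited.upper_bound(x);
        int hi = (n != visited.end()) ? n->first : ts;

        // nearest visited neighbor below x (or the start of the table)
        int lo = 0;
        if (n != visited.begin()) { --n; lo = n->first; }

        total += hi - lo; // the size of the gap x cuts
        visited[x] = true;
    }
    return total;
}
```

On a tiny table of 5 slots, $\gamma=2$ (probes 0, 2, 4, 1, 3) scores 17, while $\gamma=1$ (probes 0, 1, 2, 3, 4) scores 19, the first probe always cutting the whole table.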

6 Responses to Hash Table and Magic (Relatively Primes) Numbers

1. yann says:

I’m not sure I fully understand the benefit.

The main point seems to be finding a way to deal with hash collisions.
The proposition is to use the same table and look for a free slot.

(Conversely, it also means that searching the table will need to follow the same process.)

The “search for a free slot” algorithm is compared between a naive solution (+1) and a slightly different one (+P, for some P).

I just don’t understand how come +P is supposed to be better than +1. I understand that it can reduce “clustering”, but does it matter?

• Steven Pigeon says:

The benefit is that if you do +1, then you will end up scanning the whole clog to find a free spot, even if your hash function is good. You can clog the table even if your hash function is good because the distribution of your keys might be bad for some application-specific reason, even if the occupancy is rather low. With some other random probe (but one that is guaranteed to visit all slots), you avoid primary clustering because you poke around the table randomly, and the spot you find might be isolated. I guess that if the distribution of the keys is evil, you may succeed in creating secondary clustering.

• yann says:

OK.
Let’s imagine that the hash function has a perfect distribution property (which obviously should be verified, but a few such functions exist).

That means that, given a first entry generating a value H, a second, different entry has roughly the same probability of generating a value of H, or H+1, or H+P.

Typical distribution issues happen when inserting identical entries. In this case, by definition, they all generate the same value H. Depending on the application, this might be a non-issue (a single entry is enough and is therefore kept) or generate a lot of duplicates.

I’m not sure if your use case refers to the first situation (different entries generating the same hash) or the second one (multiple identical entries inserted into the hash table).

2. Khuram Ali says:

Reblogged this on Khuram Ali.

3. awesoham says:

I dig the ISO lg() notation. Nice article!

4. changsul says:

Reblogged this on ChangSu's tech blog.