Hash Table and Magic (Relatively Primes) Numbers

Today, let’s talk about hash tables. Or, more precisely, one type of secondary probing technique, one that uses some number theory—but not quadratic residues, just relatively prime numbers.

[Figure: golden spiral in rectangles]

I don’t know if you’re like me, but it often bugs me when I read something in a textbook that seems to make sense, but is claimed without proof or demonstration (oh, I think I said that before). That seems to happen particularly often when I read stuff about data structures, and this time I decided to explore one of those claims.

You know how a hash table works. You have a large array T with T_s slots. You want to insert data into it, and you compute its location as H(key)~\mathrm{mod}~T_s, with H(\cdot) being a function that repeatably produces a random-looking number from the key, giving the same random-looking number whenever it is fed the same key. If you find that there’s already something in that slot, you use a secondary hash function that will probe other places in the table where the data associated with the key might be.
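For concreteness, computing the starting slot might look like the following sketch, where std::hash stands in for H(\cdot) (an assumption of mine; any well-behaved hash of the key would do):

#include <cstddef>
#include <functional>
#include <string>

// a_0 = H(key) mod T_s, the starting address in the table.
// std::hash is used here only as a stand-in for H(.).
std::size_t initial_slot(const std::string & key, std::size_t table_size)
 {
  return std::hash<std::string>{}(key) % table_size;
 }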

To be useful, the secondary hash function must be capable of visiting all the entries of the table exactly once. In other words, it has to generate a permutation of 0,1,\ldots,T_s-1.

The simplest way of doing this is to use linear (secondary) probing. This method proceeds as follows. You first get

a_0 = H(key)~\mathrm{mod}~T_s,

the starting address. If you happen to find what you’re looking for (or an empty entry, if you’re inserting) at address a_0, then you’re done: you’ve found where the data should be. If the table does not contain the data corresponding to the key at address a_0 (or the slot is not empty, if you’re inserting), then you have to resort to a secondary hash function that will examine further candidates with a_0 as a starting point. So you proceed to search for a slot in the table containing the key you’re looking for (or an empty slot, whichever comes first). Varying t=0,1,2,\ldots,T_s-1, you examine the addresses

a_t = (a_0 + t)~\mathrm{mod}~T_s.

So basically, once you get the initial address a_0, you try a_0+1, a_0+2, a_0+3, …, possibly wrapping around the table, until you either find the key you’re looking for, an empty slot, or probe the entire table, returning to a_0 (when t=T_s).
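As a rough sketch (not from the original post), the linear probe sequence can be written as a loop; the stop predicate below stands in for “the slot holds the key” or “the slot is empty”, which the table representation itself would provide:

#include <functional>

// Probes a_0, a_0+1, a_0+2, ... (mod T_s) until `stop` accepts a slot
// (key found or empty slot), or every slot has been examined.
int linear_probe(int a0, int T_s, const std::function<bool(int)> & stop)
 {
  for (int t=0; t<T_s; t++)
   {
    int a_t=(a0+t) % T_s;      // wraps around the table
    if (stop(a_t)) return a_t;
   }
  return -1;                   // probed the entire table without success
 }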

However, this method has one major problem: it creates clusters. If you have something at a_0 and something else at a_0+2, inserting another item at a_0 will cause a_0+1 to be chosen (since it’s empty), and now we have consecutive entries. If we insert yet another item at a_0 (because sometimes hash functions produce the same hash for different keys), it will get inserted at a_0+3, and the cluster grows. This makes searches (and insertions) slow for all the keys that land near a_0: it “clogs” your table.

This clustering is mitigated by using a better secondary hash function, one that generates widely separated addresses, not just consecutive ones. Remember, a secondary hash function must generate a permutation of 0,1,2,\ldots,T_s-1. One simple way of doing so is to pick an integer \gamma that is relatively prime to T_s (it will not suffice for it to be merely prime; it must be relatively prime to T_s, sharing no common divisor with T_s other than 1).

Once you pick \gamma, the secondary probing becomes

a_t = (a_0 + t\gamma)~\mathrm{mod}~T_s,

and it visits all entries of the table exactly once, starting the probe at a_0 and landing in random-looking locations as t varies.
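To convince yourself that the stride indeed yields a permutation when \gamma and T_s are relatively prime, a small verification sketch (mine, not the post’s) can count the distinct slots visited:

#include <vector>

// Returns true iff a_t=(a_0+t*gamma) mod T_s visits every slot exactly
// once for t=0,...,T_s-1; this holds exactly when gcd(gamma,T_s)==1.
bool visits_all_slots(int a0, int gamma, int T_s)
 {
  std::vector<bool> seen(T_s,false);
  int a=a0 % T_s, count=0;
  for (int t=0; t<T_s; t++)
   {
    if (!seen[a]) { seen[a]=true; count++; }
    a=(a+gamma) % T_s;   // a_{t+1}=(a_t+gamma) mod T_s, same as a_0+t*gamma
   }
  return count==T_s;
 }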

But since the goal is to minimize clustering, \gamma should be chosen so that, on average, it creates as few clusters as possible; therefore, each t\gamma~(\mathrm{mod}~T_s) should land as far as possible from all the previous probes. How can we choose \gamma?

Ha! There we are. Some authors claim that

\displaystyle \gamma\approx(\phi-1)T_s=\left(\frac{1+\sqrt{5}}{2}-1\right)T_s

is the best \gamma! And now, you should have the exact same reaction that I had: where did that come from?! Let’s try to find out.

I distinctly remember reading—but not where I read it—that the sequence

\phi~(\mathrm{mod}~1), 2\phi~(\mathrm{mod}~1), 3\phi~(\mathrm{mod}~1),…

produces, at step t, points that are all approximately 1/t apart, thus “maximally spread”. Now, we’re not interested in the unit line but in the integers 0,1,2,\ldots,T_s-1: does the miracle still occur when \gamma\approx(\phi-1)T_s is an integer relatively prime to T_s? Let us see.
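Before moving to the integers, here is a small numerical sketch (mine, not the post’s) that prints the first few fractional parts of t\phi, so you can eyeball how evenly they spread over the unit interval:

#include <cmath>
#include <cstdio>

int main()
 {
  const double phi=(1.0+std::sqrt(5.0))/2.0;        // golden ratio, ~1.618
  for (int t=1; t<=10; t++)
   {
    double x=t*phi;
    std::printf("%2d: %.4f\n", t, x-std::floor(x)); // t*phi mod 1
   }
  return 0;
 }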

First, we generate the prime numbers from 2 to T_s-1. This could be done, for example, as we did before. Once the list is acquired, we test each of those primes as the candidate \gamma and see how it disperses addresses in the table. To measure dispersion, I generate t\gamma~(\mathrm{mod}~T_s) and find its two nearest previously generated (i.e., already visited) neighbors: the largest one smaller than t\gamma~(\mathrm{mod}~T_s) and the smallest one larger than it. The distance between those two is the size of the gap that t\gamma~(\mathrm{mod}~T_s) cuts; all the gap sizes are added up to compute the “score” of a candidate \gamma.

I generated results for a few plausible table sizes: 100, 200, 500, 1000, 10000, and 100000. They are summarized below:

T_s      \gamma    \gamma/T_s
100      73        0.73
200      139       0.695
500      347       0.694
1000     821       0.821
10000    6199      0.6199
100000   70717     0.70717
[Figures: gap plots for T_s=100, 200, 500, 1000, 10000, and 100000.]

So it seems that \frac{\gamma}{T_s}\approx{}0.7 is better than \approx 0.618\ldots Busted!

*
* *

While the claim that \frac{\gamma}{T_s}\approx(\phi-1) is optimal is busted, we see that there are a lot of “good” values for \gamma. Some are contrabulously bad, for example very small, close to T_s, or near \frac{1}{2}T_s. Beyond that, however, it becomes difficult to decide, just from the magnitude of the chosen number, whether it will be a good choice. That might need more exploring.

*
* *

OK, let us now discuss the implementation. Finding the size of the gap where we put the next t\gamma~(\mathrm{mod}~T_s) is the most costly part of the simulation. One idea might be to use a bitmap, setting a bit wherever we insert an item, then scanning the bitmap for the nearest set bit to the left and to the right. This works, but it is slow, and it gets slower as T_s grows. A better solution is to use a std::map to hold the addresses drawn so far. This way, we can insert the current address, find the next one using upper_bound, and the previous one by using find and then decrementing the iterator.

Computing gap size around some address x would look something like:

#include <map>

int gap_size(const std::map<int,bool> & mappe,
             int x,
             int min=0, int max=10000)
 {
  std::map<int,bool>::const_iterator n,p;
  int lo,hi;

  n=mappe.upper_bound(x);  // first entry strictly greater than x
  hi = (n!=mappe.end()) ? n->first : max;

  p=mappe.find(x);         // x itself; it must already have been inserted
  lo = (p!=mappe.begin()) ? (--p)->first : min;  // previous entry, if any, else min

  return hi-lo;
 }

Since std::map is a self-balancing search tree (typically a red-black tree), find and upper_bound are both O(\lg n) for n addresses inserted, and moving the iterator to a neighbor costs essentially constant time. The whole simulation is therefore O(n \lg n), rather than the O(n^2) of scanning a bitmap.
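As a usage sketch, here is roughly how the scoring loop described earlier could drive gap_size; this is my reconstruction of the simulation, not the original code, and the function name score_gamma and the choice of a_0=0 are assumptions:

#include <map>

// Scores a candidate gamma for a table of size T_s: every new probe
// address contributes the size of the gap (between already visited
// neighbours) it falls into; larger accumulated gaps mean the probes
// are better spread out.
long long score_gamma(int gamma, int T_s)
 {
  std::map<int,bool> visited;
  long long score=0;
  int a=0;                                // starting address a_0

  for (int t=0; t<T_s; t++)
   {
    visited[a]=true;                      // mark the probe as visited
    score+=gap_size(visited,a,0,T_s);     // gap cut by this probe
    a=(a+gamma) % T_s;                    // next probe address
   }
  return score;
 }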

6 Responses to Hash Table and Magic (Relatively Primes) Numbers

  1. yann says:

    I’m not sure I fully understand the benefit.

    The main point seems to find a way to deal with hash collision.
    The proposition is to use the same table and look for a free slot.

    (Conversely, it also means that searching into the table will need to follow the same process.)

    The “search for a free slot” algorithm is compared between a naive solution (+1) and a slightly different one (+P, find a P).

    I just don’t understand how come +P is supposed to be better than +1. I understand that it can reduce “clustering”, but does it matter?

    • The benefit is that if you do +1, then you will end up scanning the whole clog to find a free spot, even if your hash function is good. You can clog the table even if your hash function is good because the distribution of your keys might be bad for some application-specific reason, even if the occupancy is rather low. With some other random probe (but one that is guaranteed to visit all slots), you avoid primary clustering because you poke around the table randomly, and the spot you find might be isolated. I guess that if the distribution of the keys is evil, you may succeed in creating secondary clustering.

      • yann says:

        OK.
        Let’s imagine that the hash function has a perfect distribution property (which obviously should be verified, but a few such functions exist).

        That means that, given a first entry generating a value H, a second, different entry has roughly the same probability of generating H, H+1, or H+P.

        Typical distribution issues happen when inserting identical entries. In this case, by definition, they all generate the same value H. Depending on the application, this might be a non-issue (a single entry is enough and is therefore kept) or generate a lot of duplicates.

        I’m not sure if your use case makes reference to the first situation (different entries generating the same hash), or the second one (multiple identical entries inserted into the hash table).

  2. awesoham says:

    I dig the ISO lg() notation. Nice article!
