Today, let’s talk about hash tables. Or, more precisely, one type of secondary probing technique, one that uses some number theory—but not quadratic residues, just relatively prime numbers.

I don’t know if you’re like me, but it often bugs me when I read something in a textbook that seems to make sense, but is claimed without proof nor demonstration (oh, I think I said that before). That seems to happen particularly often when I read stuff about data structures, and this time I decided to explore one of those claims.

You know how a hash table works. You have a large array T with slots. You want to insert data into it, and you compute its location as , with being a function that repeatably produces a random-looking number from the key, giving the same random-looking number whenever it is fed the same key. If you find that there’s already something in that slot, you use a *secondary hash function* that will probe other places in the table where the data associated with the might be.

To be useful, the secondary hash function must be capable of visiting all the entries of the tables *exactly once*. In other words, it has to generate a permutation of .

The simplest way of doing this is to use a linear (secondary) probing. This method proceeds as follows. You first get

,

the starting address. If you happen to find what you’re looking for (or an empty entry, if you’re inserting) at address , then you’re done, you’ve found where the data should be. If the table does not contain the data corresponding to the at address (and/or is not an empty slot if you’re inserting), then you have to resort to a secondary hash function that will examine further candidates with as a staring point. So you proceed to search for a slot in the table containing the key you’re looking for (or an empty slot; whichever comes first). By varying , you examine the addresses

.

So basically, once you get the initial address , you try , , , …, possibly wrapping around the table, until you either find the key you’re looking for, an empty slot, or probe the entire table, returning to (when ).

However, this method has one major problem: it creates clusters. If you have something at and something else at , inserting something else at will cause to be chosen (since it’s empty) and now we have consecutive entries. If we insert something else again at (because sometimes hash functions produce the same hash for different keys) then it will get inserted at , and the cluster grows: it makes the search (or insertion) slow for all the keys that land near : it “clogs” your table.

Secondary clustering is mitigated by using a better secondary hash function, one that generates very different addresses, not just consecutive ones. Remember, a secondary hash function must generate a permutation of . One simple way of doing so is to pick an integer that is relatively prime to (it will not suffice for it to be merely prime, it must be relatively prime to : it must not share common divisors).

Once you pick , the secondary probing becomes

,

and it visits all entries of the table exactly one, commencing the probe at , and in random-looking locations as varies.

But since the goal is to minimize clustering, should be chosen so that, on average, it creates as little clusters as possible, therefore, each ) should land as far as possible as of all the previous probes. How can we chose ?

Ha! there we are. Some authors claim that

is the best ! And now, you should have the exact same reaction that I had: *where did that came from?!*. Let’s try to find out.

I distinctly remember reading—but not where I read it—that the sequence

, , ,…

produces, at step , points that are all approximately apart, thus “maximally spread”. Now we’re not interested in the unit line, but with the integers , and does the miracle occur when is an integer relatively prime to ? Let us see.

First, we generate the prime numbers from 2 to . This could be done, for example, as we did before. Once the list is acquired, we will proceed to test each of those prime numbers as the , and see how each disperses in the table. To measure dispersion, I generate , and check for its two nearest neighbors generated before (i.e., already visited), the largest one smaller than and the smallest one larger than . The distance between those two is the size of the gap cuts; all sizes are added up to compute the “score” of a candidate .

I generated the results for a few plausible table sizes: 100, 200, 500, 1000, 10000, 100000. The results are summarized:

ratio | ||

100 | 73 | 0.73 |

200 | 139 | 0.695 |

500 | 347 | 0.694 |

1000 | 821 | 0.821 |

10000 | 6199 | 0.6199 |

100000 | 70717 | 0.70717 |

So it seems that is better than … Busted!

*

* *

If the claim that is busted, we see that there are a lot of “good” values for . Some are contrabulously bad, for example, very small or close to , or near . However, after that, it becomes difficult, just by the magnitude of the chosen number, to decide whether it will be a good choice. That might need more exploring.

*

* *

OK, let us now discuss the implementation. Finding the size of the gap were we put the next is what is the most costly thing in the simulation. One idea might be to use a bitmap and put a 1 bit whenever/where-ever we insert an item, and scan down the bitmap for the nearest set bit to the left, and to the right. This works, but it slow, even more so as grows. A better solution is to use a `std::map` to hold the addresses drawn so far. Like this, we can insert the current address and find the next item using `upper_bound`, and the previous by using `find` then decrementing the iterator.

Computing gap size around some address `x` would look something like:

int gap_size(std::map<int,bool> & mappe, int x, int min=0, int max=10000) { std::map<int,bool>::const_iterator n,p; int lo,hi; p=mappe.find(x);p--; // finds x, then previous (NOT lower_bounds) n=mappe.upper_bound(x); // finds first > x if (n!=mappe.end()) hi=n->first; else hi=max; if (p!=mappe.end()) lo=p->first; else lo=0; return hi-lo; }

If `std::map` is some kind of leaf-linked self-balancing tree, the above procedure is essentially for addresses inserted. This means that the procedure (again assuming self-balancing trees) is rather than scanning a bit map in .

I’m not sure to fully understand the benefit.

The main point seems to find a way to deal with hash collision.

The proposition is to use the same table and look for a free slot.

(Conversely, it also means that searching into the table will need to follow the same process.)

The “search for a free slot” algorithm is compared between a naive solution (+1) and a slightly different one (+P, find a P).

I just don’t understand how come +P is supposed to be better than +1. I understand that it can reduce “clustering”, but does it matter ?

The benefit is that if you do +1, then you will end up scanning the whole clog to find a free spot, even if your hash function is good. You can clog the table even if your hash function is good because the distribution of your keys might be bad for some application-specific reason, even if the occupancy is rather low. With some other random probe (but that is garanteed to visit all slots), you avoid primary clustering because you poke around the table randomly, and the spot you find might be isolated. I guess that if the distribution of the key is evil, you may succeed in creating secondary clustering.

OK.

Let’s imagine that the Hash function has a perfect distribution property (which obviously should be verified, but a few are existing).

That means that, given a first entry generating a value H, a second different entry has roughly the same probability to generate a value of H, or H+1, or H+P.

Typical Distribution issue happen when inserting Identical entries. In this case, by definition, they all generate the same value H. Depending on the application, this might be a non-issue (a single entry is enough and therefore is kept) or generate a lot of duplicates.

I’m not sure if your use case makes reference to the first situation (different entries generating the same hash), or the second one (multiple identical entries inserted into the hash table).

Reblogged this on Khuram Ali.

I gid the ISO lg() notation. Nice article!

Reblogged this on ChangSu's tech blog.