Perfect Hashing (part I)

A few days ago, a friend asked me how to write, if it is even possible, a hash function that has no collisions. Of course, it’s relatively easy to get a reasonably good hash function that gives the expected number of collisions (especially when used as usual, modulo the size of the table). But what if we do not want collisions at all? What can we do?


There are hash functions, known as perfect hash functions, that yield a different hash for every key from an a priori known set of keys, say K. Are they hard to build? Fortunately, not really. Can we exploit the uniqueness of the hashed value to speed look-ups up? Yes, but it’s a bit more complicated.

The idea is to build a searchable data structure that assigns a unique number to every key. If we were to build that from scratch, that’d be cumbersome, but we can use predefined data structures such as std::map to build a stateful hash function. The basic idea is that if we see a key never seen before, it gets a new hash, and if we’ve seen it before, we return the same hash as last time. In C++, we could write:

#include <cstddef>
#include <map>

template <typename T>
class perfect_hash
 {
  private:

   size_t count=0;          // next unused hash value
   std::map<T,size_t> keys; // keys seen so far and their hashes

  public:

   size_t operator()(const T & k)
    {
     // if the key is already known, return its stored hash;
     // otherwise assign it the next free value and remember it
     auto i=keys.find(k);
     if (i!=keys.end())
      return i->second;
     else
      return keys[k]=count++;
    }

   perfect_hash() = default;
   ~perfect_hash() = default;
 };

If the key is in the keys map, we return the stored hash; otherwise we assign it a new one and store it. This way, the class memorizes the keys and guarantees uniqueness.
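As a quick sanity check, here is a minimal sketch (not from the original post) that reuses the class above; repeated keys come back with the same value, while new keys get the next free one:

#include <iostream>
#include <string>

int main()
 {
  perfect_hash<std::string> h; // the class defined above

  std::cout << h("apple")  << std::endl; // 0
  std::cout << h("banana") << std::endl; // 1
  std::cout << h("apple")  << std::endl; // 0 again: same key, same hash
  return 0;
 }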

*
* *

So now you’re thinking, hmmm, OK, but isn’t that hash expensive? Well, yes, but we’ll work around this in a moment. An ordinary hash function costs at least an amount of computation linear in the length of the key. If we’re hashing strings, we’d expect the hash function to iterate over each character at least once. If we’re hashing integers, it may be a constant number of operations, if we count integer operations as constant-time. With the perfect_hash class, we’re sure to do O(\lg |K|) comparisons in the map (with \lg being the base-2 logarithm), times the cost of each comparison, which may be up to linear in the length of the keys. That is, the perfect hash costs O(n \lg |K|), with n being the average key length.
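To make that concrete with some made-up but plausible numbers: for |K| = 2^20 keys of average length n = 16 characters, an ordinary string hash does on the order of 16 character operations, while perfect_hash does up to \lg 2^20 = 20 map comparisons, each touching up to 16 characters, so roughly 320 character operations, about twenty times more.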

To work around this problem we can create, once, key-hash pairs. The first time we create an item with a given key, we call the perfect hash function and form a pair of the key and its hash, and that pair becomes the key. Later on, this allows O(1) indexing into a hash table, with no holes and no collisions.

The following program exemplifies the idea:

#include <iostream>
#include <list>
#include <string>
#include <utility>

typedef std::pair<std::string,size_t> key_pair;

int main()
 {
  perfect_hash<std::string> pf;
  std::list<key_pair> words;

  // read words from standard input, pairing each
  // with its perfect hash as it is first seen
  std::string word;
  while (std::cin >> word)
   words.push_back(key_pair(word,pf(word)));

  // sort by the key itself: equal words end up adjacent,
  // each carrying the same hash
  words.sort([](const key_pair & a,
                const key_pair & b)
             { return a.first<b.first; });

  for (const key_pair & w : words)
   std::cout << w.first << " " << w.second << std::endl;
  return 0;
 }

*
* *

This algorithm uses a linear amount of memory (the std::map) and runs in log-linear time to create the key_pairs. After that, we can use the hash value stored in each key_pair to get O(1) indexing, but at the price of artificially inflating the key.
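To sketch what that last step could look like (the function name and the word-count use case here are hypothetical, just for illustration), the stored hash can index directly into a dense vector, since the hashes form a contiguous range starting at 0:

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string,size_t> key_pair;

// hypothetical word-count table: the hash stored in the key_pair is a
// dense index in [0, number of distinct keys), so a plain vector works
// as a hash table with no holes and no collisions
void count_word(std::vector<size_t> & counts, const key_pair & k)
 {
  if (k.second >= counts.size())
   counts.resize(k.second+1, 0); // grow the table as new keys appear
  ++counts[k.second];            // O(1) access, no probing needed
 }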
