A few days ago, a friend asked me how to write, if even possible, a hash function that had no collisions. Of course, it’s relatively easy to get a reasonably good hash function that will give the expected amount of collisions (especially when used as usual, modulo the size of the table). But what if we do not want collisions at all? What can we do?

There are some hash functions, known as perfect hash functions, that will yield a different hash for every key from an *a priori* known set of keys, say *K*. Are they hard to build? Fortunately, not really. Can we exploit the uniqueness of the hashed value to speed look-ups up? Yes, but it’s a bit more complicated.

The idea is to build a searchable data structure that assigns an unique number to every key. If we were to build that from scratch, that’d be cumbersome, but we can use predefined data structures such as `std::map` to build a stateful hash function. The basic idea is that if we see a key never seen before, it gets a new hash, and if we’ve seen it before, we return the same hash as last time. In C++, we could write:

```cpp
#include <cstddef>
#include <map>

template <typename T>
class perfect_hash
{
private:

    size_t count = 0;
    std::map<T, size_t> keys;

public:

    size_t operator()(const T & k)
    {
        auto i = keys.find(k);
        if (i != keys.end())
            return i->second; // seen before: same hash as last time
        else
            return keys[k] = count++; // new key: assign the next hash
    }

    perfect_hash() = default;
    ~perfect_hash() = default;
};
```

If the key is in the `keys` map, we return the stored hash, otherwise we assign it a new one and store it. This way, the class memorizes the keys and guarantees uniqueness.

* * *

So now you’re thinking, *hmmm, OK, but isn’t that hash expensive*? Well, yes, but we’ll work around this in a moment. An ordinary hash function will cost at least an amount of computation linear in the length of the key. If we’re hashing strings, we’d expect the hash function to iterate over each character at least once. If we’re hashing integers, well, it may be a constant number of operations, if we count integer operations as constant-time. With the `perfect_hash` class, we’re sure to do O(lg *n*) comparisons in the map (with lg being the base-2 logarithm and *n* the number of keys), times the cost of the comparisons, which may be up to linear in the length of the keys. That is, the perfect hash is O(*ℓ* lg *n*), with *ℓ* being the average key length.

To work around this problem we can create—once—key-hash pairs. The first time we create an item with a given key, we call the perfect hash function and form a pair with the key and its given hash, and *that* will become the key. This, later on, will allow indexing into a hash table—with no holes and no collisions.

The following program exemplifies the idea:

```cpp
#include <iostream>
#include <list>
#include <string>
#include <utility>

typedef std::pair<std::string, size_t> key_pair;

int main()
{
    perfect_hash<std::string> pf;
    std::list<key_pair> words;
    std::string word;

    // read words, pairing each with its perfect hash
    do
    {
        if (std::cin >> word)
            words.push_back(key_pair(word, pf(word)));
    } while (std::cin);

    // sort by key: the stored hashes travel with the keys
    words.sort([](const key_pair & a, const key_pair & b)
               { return a.first < b.first; });

    for (const key_pair & w : words)
        std::cout << w.first << " " << w.second << std::endl;

    return 0;
}
```

* * *

This algorithm uses a linear amount of memory (the `std::map`) and runs in log-linear time for the creation of each of the `key_pair`s. After that, we can use the hash value stored into the `key_pair` to get indexing, but at the price of artificially inflating the key.