## Hash Functions (Part I)

Hash tables are a great way of ensuring $O(1)$ access to your data. Well, they are, but only as long as the hash function is good.

But what exactly makes a hash function a good hash function?

Recall that a hash function is a function $h$ that takes a key $k$ and computes an address, which is subsequently reduced modulo the table size $m$. That is,

$a = h(k) \bmod m$.

The size of the table, $m$, is also called the modulus.

A good hash function should

• Be independent of the data to which it is applied. A good hash function can be used on any data type without loss of performance. This requirement pretty much excludes ad hoc hash functions tailored to your current data. Of course, you are free to design your own data-specific hash function, but it will break as soon as your data changes. You will then go back and tweak your hash function to accommodate the changes, but this will break any persistent storage based on the old hash function. Independence also implies that the quality of the hash function shouldn't depend much on the length of the data: it shouldn't get more random only as the keys get longer.
• Cover the range of values covered by the machine-specific size_t. This ensures that you can use the hash function to build tables of any size, or at least of any size your system can handle. The C predefined type size_t varies in a machine- and system-specific way, but represents the largest addressable slab of (internal) memory. On a 64-bit system, its maximum value is $2^{64}-1$. You may consider extending this requirement to the size of off_t, the type of the largest addressable offset in a file on your system.
• Be persistent across sessions. If the hash function is to be used for persistent storage, it can't depend on run-time values. In other words, you can't use srand and rand to fill in some of the values your hash function uses, or you have to generate them once and store them somewhere.
• Appear uniformly random regardless of the modulus. This requirement makes the hash function independent of the table size and vice versa. This property ensures that, regardless of the table size $m>0$ and the number of items $n$, the expected number of items per table entry will be very nearly the same, $\Theta(\frac{n}{m})$. In other words, everything gets distributed very evenly across table entries, with no entries left almost empty while others are comparatively overfull.
• Be fast. How fast should a hash function be? If we use a (balanced) tree for look-up, the cost of finding a data item is $O(n \lg m)$, where $n$ is the average key length and $m$ the number of items in the tree: we make the hypothesis that comparing two keys is $O(n)$ (proportional to the length of the shorter of the two) and that searching the tree takes $O(\lg m)$ comparisons. A hash table will have $O(1)$ access time if the hash function is good, but that $O(1)$ counts comparisons, not the cost of computing the hash function. To be attractive, the hash function must be computed in much less than $O(n \lg m)$ time, which puts a bound on the hidden constant. True, asymptotic notation brushes the hidden constant under the rug, but in practice, if we have $m\sim 2^{30}$, the hash function must have a hidden constant much smaller than $30$ (because $\lg 2^{30}=30$) to make the whole hash table much faster than a balanced binary search tree.

* * *

Over the next few weeks or so, we will examine several hash functions and assess their respective performance using a variety of goodness tests. Then, once we've visited a couple of hash functions, I will make a comparative summary.

Next week: the anatomy of a hash function.