## Hash Functions (checksums, part II?)

On a number of different occasions, I briefly discussed Hash Functions, saying that if a hash function needn’t be very strong if the intend is to use it for look-up, contrary to cryptographic applications. But, unsurprisingly, it’s rather difficult to get a good fast hash function.

Coming up with a very good hash function isn’t easy, but we can at least make an effort to build one by understanding what makes a better hash function, especially by explicitly testing them. Let us have a try at building a good hash function (for look-up) from scratch.

For the test, we have a list of 323578 unique words (coming from here, with all the accents removed, using the iconv command). These words will provide us with a highly structured set of keys—keys will not be random but share long prefixes, many suffixes, and other structures. Each of these words will be keys, drawn from a dictionary, that is, $D=\{k_1,k_2,\ldots,k_n\}$. For each $k_i \in D$, we compute its address $a_i$ in the table of size $m$ by:

$a_i=h(k_i)~\mathrm{mod}~m$

where $h(\cdot)$ is the hash function that maps the key to a random-looking (but deterministic) integer number in a range larger or equal to $[0,m-1]$. Typically, a hash function will generate a 32-bits or 64-bits number from the key in order to accommodate convenient values of $m$, which typically ranges from a few thousands to quadrillions (short scale).

So the objective for function $h(\cdot)$ is to take a key, maybe a number, but in general it will be a string-like object, and produce a deterministic, but random-looking number out of it. The intuition is that, somehow, we have to “mix” and conflapulate the bits of the key onto a smaller number of bits, typically 32 or 64. So we want the hash function to take every bit of the key into account and combine them in a way that looks as random as possible, with the implication that if we flip a single bit in the key, it should cause and avalanche propagating this single bit’s change and yield a completely different value.

The simplest possible function one can come up with is the check sum, where, say, byte elements forming the key, are added together. It does seem to have some kind of avalanche mechanism through the carry, so maybe it’s not that bad. A simple check-sum based hash function would look like:

uint32_t hash_sum(const std::string & key)
{
uint32_t sum=0;

for (int i=0;i<key.size();i++)
sum+=key[i];

return sum;
}


Of course, the avalanches caused by the carry only move one way, that is, the avalanche propagates only to the left, and not very far on average. One possibility is to add an explicit avalanche mechanism.

uint32_t hash_rotative(const std::string & key)
{
uint32_t h=0;

// Attributed to Knuth, but references
// do not match actual text
for (int i=0;i<key.size();i++)
h=(h<<5)^(h>>27)^key[i];

return h;
}


In this function, the “bilateral” avalanche is provoked by the (h<<5)^(h>>27) that rotates-right the temporary value by 5 bits (probably than rotating left would have been better). Note also that the above code doesn’t do the same thing at all if h is signed. Also, probably having +key[i] instead of the exclusive-or would push a little bit more avalanche in there.

OK, so, maybe rotating isn’t much of a randomization operation. What else could we use (while being computationally cheap)? Maybe map to a Gray code? And have + to combine the new element?

uint32_t hash_gray(const std::string & key)
{
uint32_t h=0;

for (int i=0;i<key.size();i++)
h= (h^(h>>1))+key[i];

return h;
}


We could make sure that the bits spread even more and one would think that using a weave would do the trick. Also known as perfect shuffle, the operation consists in cutting the card pack in two halves and you recombine the cards to a single pack of cards by interleaving them one by one. This video explains it better.

uint32_t hash_shuffle(const std::string & key)
{
uint32_t h=0;

for (int i=0;i<key.size();i++)
h=perfect_shuffle(h)+key[i];

return h;
}


So now, after each addition affecting the lower-portion of the temporary hash value, we make sure that we spread the modifications pretty much everywhere in the word using the shuffle. That should work better.

The addition operation affects only the lower-portion of the temporary hash value and its propagating effect, the carry, only travels a small number of bits on average; it is just addition after all. Maybe we can somehow embiggen the value of key[i] so that it affects more bits?

One way of doing so is to map every possible byte to a 32-bits random number, preferably generated using a “strong” generator. This table, say, expand, will be indexed by key[i] and add more non-locality to the hash functions.

uint32_t hash_shuffle_expanded(const std::string & key)
{
static const uint32_t expansion[256]=
#include "random-expansion"
;
uint32_t h=0;

for (int i=0;i<key.size();i++)
h=perfect_shuffle(h)+expansion[(unsigned char)key[i]];

return h;
}

uint32_t hash_rotative_expanded(const std::string & key)
{
static const uint32_t expansion[256]=
#include "random-expansion"
;
uint32_t h=0;

for (int i=0;i<key.size();i++)
h=(h<<5)^(h>>27)^expansion[(unsigned char)key[i]];

return h;
}


*
* *

OK, so now we have a bunch of candidate hash functions, each better than the others, or so we think, and it’s time to verify how they perform in reality. For each hash function discussed here, we process all the words of the word list (the dictionary), and see how keys are distributed to bins. The code is pretty much only:

void test_hash(const std::list<std::string> & dictionary,
hash_function_t hash,
uint32_t test_modulo=1000)
{
std::vector<int> hits(test_modulo,0);
for ( std::list<std::string>::const_iterator i=dictionary.begin();
i!=dictionary.end();
++i)
hits[hash(*i) % test_modulo]++;

for (int i=0;i<test_modulo;i++)
std::cout << hits[i] << std::endl;
}


If we set the test_modulo to the table size (remember $m$ from before?) to 200, we get the following box plots:

We see that the simple hash function using only sums does not do a good job at addressing each bin equally; the hits are distributed from 850 to 2400, while an ideal hash function would give exactly $\approx 1617$ hits per bin. OK, so it’s not doing so well.

The rotative hash function doesn’t do significantly better than sum, because, probably, the avalanches do not go very far. The Gray code variant does a bit better, but not that much.

The shuffle function that spreads the bits in all directions, does a lot better. So our hypotheses on what it’s doing seem to be verified. Adding the embiggening doesn’t do much on the shuffle hash, because it seems to be doing well already. But added to the rotative hash, embiggening gives us a function that compares to shuffling only. Interesting!

What if $m=1000$?

The overall conclusions remain the same, but is seems that the sum is even worse (spreading from almost zero to almost 1200 instead of being nicely grouped around $\approx 323$). Surprisingly, the rotative hash function does better than the Gray hash function when $m=1000$. Why? Good question. Maybe we need to investigate that a bit more?

What if we look at how bins are occupied? At $m=200$, we get:

Wow! That’s weird! The first (bad) hash functions behave like periodic oscillators! Well, that, I did not expect. To solve this mystery completely, we would need a bit more analysis, but let’s say that for the sum hash function, as words tend to be made of the same letters (e comes very often in a word) it is kind of expected that sums are similar for similar groups of words. This effect is known as primary clustering, and it is to be avoided as much as possible. The rotative function only disperse the sum in a way that it exhibits a higher frequency oscillation. The Gray hash function also produces this phenomenon, but with a lesser amplitude. It does a better job at spreading the hash values around. Needless to say, it does look weird!

The $m=1000$ data also show the same general behavior but it also gives us more information on what’s going on in the hash functions. The Gray hash function looks kind of periodic, but much more chaotic than the sum and rotative hash functions.

*
* *

So what have we learned? That sum is bad as a hash function, sure. But more importantly that avalanches are a key property of a good hash function, and that better the avalanches are, the better will be the hash functions. The Shuffle hash function does much better than its predecessors exactly because it is better at “avalanching.” If your function is good at avalanches, embiggening does not much more—unless, of course, your function isn’t good at avalanches at all. We have seen that embiggening brought the rotative hash function close to the Shuffle hash function just by replacing input bytes by 32-bits random numbers.

*
* *

The source can be found here, just g++ hash-test.cpp to yield the executable—that’s all there is to it.