Three bad functions (Hash functions, part IV)

Devising a good, fast, simple hash function isn't easy. At all. It's quite the challenge, in fact. Unsurprisingly, the first tries will be somewhat disappointing, but that shouldn't keep you from trying. This week, let's consider a few simple (and not very good) functions.

The Test Set. The test set for this series of posts is a list of (French) words coded in ISO-8859, taken from a site offering dictionaries for the game of Scrabble (I've had this list for a very long time and can't remember exactly where I got it). You can download it from here.

Checksum. The simplest of all hash functions is the checksum. It consists, as you might have guessed, of the sum of the bytes of the hashed object. A direct implementation would look something like:

#include <cstdint>  // uint64_t
#include <cstddef>  // size_t
#include <climits>  // CHAR_BIT

typedef uint64_t hash_uint_t;
const size_t hash_uint_t_bits=sizeof(hash_uint_t)*CHAR_BIT;

// checksum: the hash is just the sum of the bytes of the object
hash_uint_t checksum(const char buffer[], size_t n)
 {
  hash_uint_t s=0;

  for (size_t i=0;i<n;i++)
   s+=buffer[i];

  return s;
 }

In this implementation, the hashed object is represented as a const char buffer[] because it’s the easiest way to have a hash function applicable to pretty much everything.
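
Applied to the test set, usage might look like this (a sketch, not from the original post; the filename is a placeholder for wherever you saved the word list, and reading line by line is fine since ISO-8859 text is just bytes):

#include <fstream>
#include <iostream>
#include <string>

// sketch: hash every word of the list, one word per line;
// "liste-de-mots.txt" is a placeholder filename
int main()
 {
  std::ifstream in("liste-de-mots.txt");
  std::string word;

  while (std::getline(in,word))
   std::cout << checksum(word.c_str(), word.size()) << '\n';

  return 0;
 }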

We don't need to examine the various graphs and tests to see it won't be a good hash function (but we will, in a moment). First, we notice that the checksum isn't resistant to permutations. Indeed, all the permutations of the same string will give the same checksum. Another important defect is that the hash will not cover all possible values. The largest value it can produce is proportional to the length of the object (size_t n, in the arguments). This limits the size of the hash table (or, conversely, will produce a large number of empty cells towards the end of the table).
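
For instance, anagrams necessarily collide. A quick check (a sketch, assuming the checksum above is in scope; chien and niche are French anagrams):

#include <cassert>

int main()
 {
  // anagrams have the same byte sum, so the checksum can't tell them apart
  assert(checksum("chien",5)==checksum("niche",5));

  // and for a word of n bytes the sum is at most a few hundred times n,
  // a tiny fraction of the 64-bit range
  return 0;
 }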

Let’s have a look at the bit histogram:

[Figure: checksum-histogram — bit histogram of the checksum over the test set]

The bit histogram confirms what we deduced: only a relatively small number of bits behave somewhat randomly; all the others remain zero. How does the checksum behave under different moduli (1000, 10000, 100000)? Very surprisingly, actually.

[Figures: checksum-1000, checksum-10000, checksum-100000 — checksum bucket graphs for moduli 1000, 10000, and 100000]
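
The graphs themselves are easy to reproduce. Here is a sketch (not the exact tooling used for this post) of how the bit histogram and the bucket counts could be computed once the hash of every word, for example the checksum, is collected in a vector:

#include <cstdint>
#include <cstdio>
#include <vector>

// bit histogram: for each bit position, how often that bit is set
void bit_histogram(const std::vector<uint64_t> & hashes)
 {
  size_t counts[64]={0};

  for (uint64_t h : hashes)
   for (int b=0;b<64;b++)
    counts[b]+=(h>>b)&1;

  for (int b=0;b<64;b++)
   std::printf("bit %2d: %zu\n",b,counts[b]);
 }

// bucket graph: how many items fall in each bucket for a table of size m
void bucket_counts(const std::vector<uint64_t> & hashes, size_t m)
 {
  std::vector<size_t> buckets(m,0);

  for (uint64_t h : hashes)
   buckets[h%m]++;

  for (size_t i=0;i<m;i++)
   std::printf("%zu %zu\n",i,buckets[i]);
 }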

Knuth's hash function. Actually, this function isn't Knuth's, but it is attributed to him by Bob Jenkins. In an effort to propagate the effect of addition further, a confusion step is added. In this function, confusion is just a right rotation by 5 bits, which pushes some of the highly variable low bits toward the high end of the word.

hash_uint_t knuth(const char buffer[], size_t n)
 {
  hash_uint_t h=0;
  for (size_t i=0;i<n;i++)
   h = bit_helpers::ror(h,5) + buffer[i];

  return h;
 }
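
The post doesn't show bit_helpers::ror; any right rotation will do. A minimal version (a sketch, valid for rotation counts strictly between 0 and the word width, and of course to be declared before knuth) could be:

namespace bit_helpers
 {
  // rotate x right by r bits, with 0 < r < hash_uint_t_bits
  inline hash_uint_t ror(hash_uint_t x, unsigned r)
   {
    return (x>>r) | (x<<(hash_uint_t_bits-r));
   }
 }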

Is it much better? Let’s look at the bit histogram:

[Figure: knuth-histogram — bit histogram of Knuth's hash function]

And the bucket graphs for the same moduli (1000, 10000, 100000):

[Figures: knuth-1000, knuth-10000, knuth-100000 — bucket graphs for moduli 1000, 10000, and 100000]

There's a rather strange "banding" effect. Some buckets have five to ten times as many items as others. The function is clearly unbalanced.

Gray. Since the rotation isn't a very efficient confusion step, could we use something like the transformation to Gray codes?

// binary to Gray code: bit i of the result is (bit i) xor (bit i+1) of x
inline uint64_t to_gray(uint64_t x) { return x^(x>>1); }

hash_uint_t gray(const char buffer[], size_t n)
 {
  hash_uint_t h=0;
  for (size_t i=0;i<n;i++)
   h = to_gray(h+buffer[i]);

  return h;
 }

Is this function any better? Let’s see.

[Figure: gray-histogram — bit histogram of the Gray-code hash]

The bit histogram reveals a behavior reminiscent of the simple checksum: only the first few bits vary; the most significant bits remain mostly zeroes.
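
One way to see why (not from the original post, but easy to check): the Gray conversion x^(x>>1) never moves the most significant set bit of its argument, so the only way for h to reach the upper bits is through the carries of the additions, and those rarely propagate far. A quick verification:

#include <cassert>
#include <cstdint>
#include <random>

inline uint64_t to_gray(uint64_t x) { return x^(x>>1); }

// position of the highest set bit (x must be nonzero)
static int msb(uint64_t x) { int p=-1; while (x) { p++; x>>=1; } return p; }

int main()
 {
  std::mt19937_64 gen(42);

  for (int i=0;i<1000000;i++)
   {
    uint64_t x=gen()|1;                // nonzero
    assert(msb(to_gray(x))==msb(x));   // Gray conversion preserves the top bit
   }

  return 0;
 }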

The bucket histograms confirm that the Gray hash function doesn't cover the whole range well and, just like the simple checksum, concentrates items in the first few slots. However, the Gray conversion somewhat diffuses the sine-like pattern.

[Figures: gray-1000, gray-10000, gray-100000 — bucket graphs for moduli 1000, 10000, and 100000]

*
* *

So simple functions won't do a good job at hashing. Recall that a hash function needs both a good combination step and a good confusion step. Addition alone isn't much of a combination step because the only "large scale" effect it can produce is through the carry, and, on average, the carry doesn't ripple very far. Simple confusion steps like rotation and Gray conversion don't randomize enough to produce good random-looking values. We need something much better.
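
To put a rough number on that last claim (a quick experiment, not from the original post): adding a random byte to a random 64-bit accumulator changes, on average, only about one bit above the low byte.

#include <cstdint>
#include <cstdio>
#include <random>

// count how many bits above the low byte change when a random byte
// is added to a random 64-bit value
int main()
 {
  std::mt19937_64 gen(12345);
  const int trials=1000000;
  double total=0;

  for (int t=0;t<trials;t++)
   {
    uint64_t x=gen();
    uint64_t y=x+(gen()&0xff);
    uint64_t changed=(x^y)>>8;   // bits affected beyond the added byte

    while (changed) { total+=changed&1; changed>>=1; }
   }

  std::printf("average ripple beyond the low byte: %.3f bits\n", total/trials);
  return 0;
 }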

To be continued…
