Three (somewhat) Better Functions (Hash functions, part V)

Last week’s hash functions—the check-sum, Knuth’s, and gray—produced less than optimal results. A function based only on addition, such as the check-sum, cannot possibly produce very large numbers, and therefore fails to distribute item over a very large table. That’s why there’s a confusion step following the combination step, to spread the bits around the (machine-sized) word.

So the confusion step must explicitly shuffle the bits around (although, not necessarily a permutation) and make sure that the most- and least-significant bits gets thrown around. Let’s try a couple of things!

I’ve discussed moving bits around before. The first function (that I called “blender”) moves bits around following this diagram:

blender-1

The blender function can be implemented using shifting and other bit manipulations. A hash function using the blender function would be implemented by something like this:

uint64_t blender(uint64_t x)
 {
  x= ((x & 0xffff000000000000) >> 48) |
     ((x & 0x0000ffff00000000) >> 16) |
     ((x & 0x00000000ffff0000) << 16) |
     ((x & 0x000000000000ffff) << 48) ;

  x= ((x & 0xff000000ff000000) >> 24) |
     ((x & 0x00ff000000ff0000) >> 8 ) |
     ((x & 0x0000ff000000ff00) << 8 ) |
     ((x & 0x000000ff000000ff) << 24) ;

  x= ((x & 0xf000f000f000f000) >> 12) |
     ((x & 0x0f000f000f000f00) >> 4 ) |
     ((x & 0x00f000f000f000f0) << 4 ) |
     ((x & 0x000f000f000f000f) << 12) ;

  x= ((x & 0xc0c0c0c0c0c0c0c0) >> 6) |
     ((x & 0x3030303030303030) >> 2) |
     ((x & 0x0c0c0c0c0c0c0c0c) << 2) |
     ((x & 0x0303030303030303) << 6) ;

  x= ((x & 0x8888888888888888) >> 3) |
     ((x & 0x4444444444444444) >> 1) |
     ((x & 0x2222222222222222) << 1) |
     ((x & 0x1111111111111111) << 3) ;
  return x;
 }
 
hash_uint_t blender(const char buffer[], size_t n)
 {
  hash_uint_t h=0;
  for (size_t i=0;i<n;i++)
   h=blender(h+buffer[i]);

  return h;
 }

We can be confident that the compiler eliminates all superfluous operations. You can look at the generated code using objdump -D -Mintel (disassemble in Intel syntax). Using the same test data, we get the following bit-histogram:

blender-histogram

The function produces the following bucket histograms:

blender-1000
blender-10000
blender-100000

The buckets histograms show variance, but it seems that overall, there aren’t any outliers (empty or overfull buckets). That somewhat surprising since the bit histogram shows a good distribution only for a few bits; and most other bits being zero.

*
* *

Let’s try something else. Let’s try to spread bits around more evenly. Let’s try this function:

blender-2

A direct implementation would be something like:

uint64_t blender2(uint64_t x)
 {
  x= ((x & 0xffff000000000000) >> 48) | // rol(x,16) (optimized correctly by g++)
     ((x & 0x0000ffffffffffff) << 16) ;

  x= ((x & 0xff000000ff000000) >> 24) |
     ((x & 0x00ffffff00ffffff) << 8 ) ;

  x= ((x & 0xf000f000f000f000) >> 12) |
     ((x & 0x0fff0fff0fff0fff) << 4 ) ;

  x= ((x & 0xc0c0c0c0c0c0c0c0) >> 6 ) |
     ((x & 0x3f3f3f3f3f3f3f3f) << 2 ) ;

  x= ((x & 0x8888888888888888) >> 3 ) |
     ((x & 0x7777777777777777) << 1 ) ;

  return x;
 }

hash_uint_t blender2(const char buffer[], size_t n)
 {
  hash_uint_t h=0;
  for (size_t i=0;i<n;i++)
   h=blender2(h+buffer[i]);

  return h;
 }

It gives the following bit histogram:

blender2-histogram

It shows somewhat of an amelioration on the previous function. Some bits are almost always 1 or almost always zero, but there are now more bits with a better distribution. Part of the effect may be linked to the average length of the keys from the test set: there are a lot of short words, and those only make the hash function spin a few times. Maybe it’s insufficient to even out everything properly?

Let’s look at the bucket histograms:

blender2-1000
blender2-10000
blender2-100000

It seems that the hash function using blender2 spreads the items in two classes of buckets: about one half is under-full, the other half over-full. Yet, no disastrous outliers.

*
* *

OK, let’s try something else again. The first stage of the previous function is basically a rotation, but by 16 bits; and successive stages make bits stay together more or less, or disperse but not very far from each other. Let’s try to shuffle them more.

blender-6

The first stage is an arbitrary rotation. In the diagram, it shows a rotation of 2, but in the code, you may actually want something that’s not a multiple/power of two, maybe a prime number. Let’s see what happens with a rotation of 3:

uint64_t blender6(uint64_t x)
  {
   x=rol(x,3);

   x= ((x & 0xffff000000000000) >> 16) |
      ((x & 0x0000ffff00000000) >> 32) |
      ((x & 0x00000000ffff0000) << 32) |
      ((x & 0x000000000000ffff) << 16) ;

   x= ((x & 0xff000000ff000000) >> 8 ) |
      ((x & 0x00ff000000ff0000) >> 16) |
      ((x & 0x0000ff000000ff00) << 16) |
      ((x & 0x000000ff000000ff) << 8 ) ;

   x= ((x & 0xf000f000f000f000) >> 4) |
      ((x & 0x0f000f000f000f00) >> 8) |
      ((x & 0x00f000f000f000f0) << 8) |
      ((x & 0x000f000f000f000f) << 4) ;

   x= ((x & 0xc0c0c0c0c0c0c0c0) >> 2) |
      ((x & 0x3030303030303030) >> 4) |
      ((x & 0x0c0c0c0c0c0c0c0c) << 4) |
      ((x & 0x0303030303030303) << 2) ;

   x= ((x & 0x8888888888888888) >> 1) |
      ((x & 0x4444444444444444) >> 2) |
      ((x & 0x2222222222222222) << 2) |
      ((x & 0x1111111111111111) << 1) ;

   return x;
  }

hash_uint_t blender6(const char buffer[], size_t n)
 {
  hash_uint_t h=0;
  for (size_t i=0;i<n;i++)
   h=bit_helpers::blender6(h+buffer[i]);

  return h;
 }

This gives us the following bit histogram:

blender6-histogram

It shows a somewhat better variance. There are still highly biased bits, but they’re a bit closer to being centered. The bucket histograms, however:

blender6-1000
blender6-10000
blender6-100000

This last function is not very good, then. It shows clear banding.

*
* *

In these tries, surprisingly, the first blender function gives the best results on bucket histograms, but does a really bad job on the bit histogram. Conversely, the other two do a better job on the bit histogram, but both create banding in the distribution, which is clearly an unwanted feature.

We’ll have to explore this matter a bit further.

To be continued…

One Response to Three (somewhat) Better Functions (Hash functions, part V)

  1. […] the previous entries, we learned that a good hash function for look-ups should disperse bits as much as possible as well […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: