In the previous entries, we learned that a good hash function for look-ups should disperse bits as much as possible as well as being unpredictable, that is, behave more or less like a pseudo-random number generator. We had a few failed attempts, a few promising ones, and now, a good one.
One of the weak operations in the previous hash functions is the combination operation, which is the addition. We remarked that it isn’t very good because it is unlikely to provoke a global change in the hash value. Indeed, if you add an 8-bit quantity, then you’re reasonably certain that the value changes for the first 8 bits, but after that, changes are operated only through the carry ripple, which has only probability of being propagated to the
th bit. That is, it is very unlikely to ripple to the end of the word.
So we need an operation (as simple as possible) to make sure that the new bits are spread across, and affect, the hash value. Therefore, we must scatter input bits. But how? Well, we could design some bit-wise function that takes 8 bits and spread them, but that function would be fixed, and independent of the input bits (if we consider a permutation-type function). So we need a splatter that depends on the input, but covers more or less all bits. Well, we can do that by (integer) multiplying the next input byte by a large random-looking number. A random-looking prime number, in fact. Why prime? It will not introduce new common factors in the subsequent additions other than those in the input.
Let’s pick one:
.
This number is 58 bits long. If you multiply an 8-bit value by a 56-bits value, you’d get, most of the times, a 64-bits value. This time, it is a bit bigger to compensate the fact the the 8-bit input doesn’t necessarily use all of its 8 bits. Plus it’s prime. Why? How?
(Yes. For real. That how I typed it. Not sorry.) Then let’s mix the product. Let’s use the perfect_shuffle we’ve already used. Then combine this value with a simple addition. The combination step being strong enough now, we could use a simple confusion step. Let’s use cut_deck, a function that swaps the high- and low part of the word, without exchanging bits in each parts, for a bit more confusion.
The resulting code is given by
inline uint64_t cut_deck(uint64_t x) { return (x<<32) | (x>>32); } hash_uint_t bricolage(const char buffer[], size_t n) { // a prime number more than 56 bits // (it is 58 bits long) const uint64_t magic=173773926194192273u; hash_uint_t h=0; for (size_t i=0;i<n;i++) h = cut_deck(h+perfect_shuffle(buffer[i]*magic)); return h; }
*
* *
How does it compare to the other hash functions so far? Let’s see! Here’s the bit-histogram:
At the same scale as the other functions, it appears to be flat. However, if we change the scale a bit, we see that there are still variations.
So there are still variations, but very small. The buckets histogram reveal that it disperses data evenly over several table sizes:
*
* *
We finally found something that looks like a good hash function. In the meanwhile, we have seen that building a good hash function, even just for look-up, isn’t as easy as we might have thought. It doesn’t suffice to hack a couple of bit-twiddling thingies together with duct-tape to make it work: we must be a lot more deliberate: make sure the combination step propagate changes all over the hash value and that any insufficiency is compensated by a good confusion step. If the combination step is weak, confusion must take over; if the combination step is strong, confusion could even be dispensed with.