Testing Hash functions (Hash functions, part III)

So, this week, let’s have a look at how to test hash functions. Testing is necessary: a hash function that merely *looks* random is not necessarily random enough to be used for look-up.


The first thing you’ll need before starting to test is a representative test set, composed of the various keys you expect to encounter at run time. To make sure the hash function is good for your data but not merely tuned to it, you can use more than one test set.

To assess the general fitness of a given hash function we can use many different techniques: bit histograms, bucket histograms, and statistical goodness tests. We could also use other tests, say, actual run-time statistics gathered from instrumented code.

Bit histograms. We’re not going to prove this here, but if the hash function is to behave as a uniform random variable for every possible modulus, then each bit must behave as a fair coin, being 0 or 1 with the same probability. We can test for this property using a statistical goodness test (see a bit below) or simply visually, using a bit histogram. The further a bit strays from the probability p=0.5 of being a one, the less random it is. To compute the histogram, we run the hash function over the test data set and count, for every bit position i, how many times bit i was set to one.
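As a sketch (in Python; the function names are mine, and 32-bit FNV-1a stands in as the hash under test), the bit histogram can be computed like this:

```python
def fnv1a(key: str) -> int:
    """32-bit FNV-1a, used here only as an example hash under test."""
    h = 0x811C9DC5
    for b in key.encode():
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h

def bit_histogram(hash_fn, keys, bits=32):
    """For each bit position, count how many keys hash to a value
    with that bit set to one."""
    counts = [0] * bits
    for key in keys:
        h = hash_fn(key)
        for i in range(bits):
            counts[i] += (h >> i) & 1
    return counts

# Hypothetical test set: 10000 string keys.
keys = [f"key-{n}" for n in range(10000)]
hist = bit_histogram(fnv1a, keys)
# For a well-behaved hash, each count should hover near 5000 (p = 0.5).
```

Plotting `hist` (or just eyeballing the counts) gives the histograms shown below; a bit whose count sits far from half the number of keys is a biased bit.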

The following histogram is an example of a rather bad hash function.

[Figure: bit histogram of a bad hash function]

While the next shows what we would expect for a good hash function.

[Figure: bit histogram of a good hash function]

Bucket histograms. In much the same way as bit histograms, bucket histograms (or plots) show how many times a hash value lands in each bucket, modulo m. This is particularly useful to spot weird behavior. To compute the histogram for a given modulus (table size) m, run through the test set and count how many items land in each bucket.
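A minimal sketch of the bucket histogram, again in Python (names are mine; any hash function returning an integer will do):

```python
def bucket_histogram(hash_fn, keys, m):
    """Count how many keys land in each of the m buckets,
    taking the hash value modulo m (the table size)."""
    counts = [0] * m
    for key in keys:
        counts[hash_fn(key) % m] += 1
    return counts

# Example with a deliberately poor "hash" (the identity on numeric strings):
# structure in the keys passes straight through to the buckets.
counts = bucket_histogram(lambda s: int(s), [str(n) for n in range(10)], 5)
```

It is worth computing this histogram for several values of m, including "unfriendly" ones such as powers of two, since a hash function can distribute well for one modulus and badly for another.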

The following bucket histogram shows that there is something seriously wrong—and weird—with the considered hash function:

[Figure: bucket histogram of a bad hash function]

While the next bucket histogram shows a much better hash function, without many outliers.

[Figure: bucket histogram of a good hash function]

Statistical goodness tests. While visual inspection might be sufficient to detect really bad hash functions, we may need something a bit more sophisticated to compare the relative merits of different hash functions. If we are working under the hypothesis that the hash function must behave like a uniform random variable, one possible test is the \chi^2 (chi-squared) test.

The \chi^2 test is used to determine if an observed probability distribution is the same as an expected (and known) distribution. In our case, we want to know if the number of items per bucket is distributed uniformly. The expected number of items in bucket 0\leqslant{i}<N, denoted E_i, is simply E_i=\frac{M}{N} for M items hashed into N buckets. The observed number of items in bucket i is denoted O_i, and depends entirely on the hash function being tested. Then the \chi^2 value for this experiment is

\displaystyle \chi^2=\sum_{i=0}^{N-1}\frac{(O_i-E_i)^2}{E_i}.

We then compare this value to the chi-squared distribution with N-1 degrees of freedom to obtain a p-value for this particular result, and through statistical hypothesis testing, decide how likely it is that the hash function behaves like a uniform random variable.
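The statistic itself is a one-liner; here is a sketch in Python (names are mine). For the p-value you would compare the result against a chi-squared table, or use something like `scipy.stats.chi2.sf` if SciPy is available; the bare statistic is already useful for ranking candidate hash functions against each other, since lower is closer to uniform:

```python
def chi_squared(observed, total_items):
    """Chi-squared statistic for observed per-bucket counts against
    a uniform expectation of total_items / len(observed) per bucket."""
    expected = total_items / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

# A perfectly uniform spread gives 0; a lopsided one a large value.
uniform = chi_squared([2, 2, 2, 2, 2], 10)
skewed = chi_squared([10, 0], 10)
```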

*
* *

So in the next entry, next week, we’ll examine a few simple hash functions and see whether or not you’ll want to use them in your code.
