Before we go on exploring hash functions for look-up, let’s discuss their basic anatomy. This will give us some vocabulary as well as help us identify what are the important characteristics of good hash functions.
The general anatomy of hash function is composed of four principal parts, as shown in the block diagram here below.
The hash value, denoted in the diagram (and as the output), is the value computed so far by the hash function. It receives some initial value, , and is successively modified by each symbol of the input.
The input symbol, shown as in the diagram, is the -th part of the input, fed at time to the function. In turn, in causes the hash value to change, yielding .
The combination step combines the hash value with the next input symbol . This operation can be as simple as adding the symbol to the current hash value; but it’s preferably an operation whose result is much harder to predict.
The confusion step mixes up (not necessarily a permutation) the hash value yielded by the combination step. This step is necessary to mitigate any inefficiencies the combination step may exhibit. If you would have a perfect combination step, the confusion step would be omitted, but, as you might have guessed, the combination step is rarely that perfect and some additional help is needed to produce strong, random-looking, values.
A good hash function must exhibit confusion and diffusion. Diffusion (in Shannon’s sense) is observed when a single bit changed in the input affects many bits in the output. Confusion in Shannon’s sense (not the same as the figure above) is when an output bit depends on several input bits. If your hash function exhibits both properties, then it will be robust.
The choice of the initial hash value, , may not be as critical as one might think at first. It may seem that choosing a good, random-looking initial value will kick-start the function and help produce stronger outputs, but it really only depends on the hash function itself. If the hash function hash good enough combination and confusion steps, the effect of the initial value—often taken to be zero anyway—will be lessened because the two steps randomize the value sufficiently given input symbols. If, on the contrary, the hash function is really bad (say, a simple sum of the input symbols), then the initial value will only introduce a constant bias to the output values, and not make them a lot more random.