Some time ago, I presented a piece on compressing voxel worlds and I just realized that I discussed different types of variable length codes quite a few times, but that I never took the time to present you the basic method of Huffman coding!
The problem of finding an optimal variable length code is to find an uniquely decodable binary code (that is, a code using only 0 and 1 for which there exists only one way of decoding a message) that encodes each symbol in a number of bits that is proportional to its probability of occurrence. A first strategy was devised by Robert Fano, Huffman’s teacher at one point. The so-called Shannon-Fano code is a top-down approach to solving the problem but it was shown to be suboptimal. The code proposed by David Huffman solves the problem optimally by taking exactly the opposite approach, that is, building the code bottom-up.
In a previous post, I summarized what makes a tree-structured code uniquely decodable:
This code is uniquely decodable because it is a prefix code, that is, a code for which no prefix part of a code is a code by itself. A simple way of understanding this property is to consider that decoding these types of code is like walking down a binary tree: you read a zero, you go down left, you read a one, you go down right, and if you reach a leaf in the tree, you’ve completed the decoding of a symbol. You’re not allowed to stop at an internal node, only at a leaf.
Huffman’s goal was to find, from a list of symbols and their relative frequencies of occurrence, such a code tree. However, Huffman realized that the code, to be optimal, that is, with the shortest average code length, had to obey these conditions:
- No code is a prefix of another code;
- No extra information is needed to delimit codes;
- Symbols, sorted in non-increasing order of frequency of occurrence receive codes of non-decreasing length; and in particular, the two least frequent receive codes of the same length; in order word, frequent symbols get the shorter codes and the least frequent symbols the longer codes;
- The last two codes, the longest, differ only in the last bit
Huffman procedure is relatively simple but yields an optimal code complying to the above conditions. Here how it works.
First, form a node for each symbol containing the symbol itself and its frequency of occurrence. Then, form a list with these nodes and sort them from most to least frequent. At the tail of the list, merge the last two nodes, yielding an internal node with no symbol but whose frequency of occurrence is the sum of the frequencies of the two merged nodes, and put it back in the list so that the list remains sorted from most to least frequent. Repeat until there’s only one node left, the root of the code three.
The method is exceedingly simple but let us present an example. Let’s say we have messages to encode that use the five symbols a, b, c, d, and e, which we observe with relative frequencies 3, 3, 2, 1, and 1.
So we create our list of nodes:
We merge the last two nodes:
Since the list remained sorted, nothing is to be done except proceed to the next iteration. The last two nodes are merged again:
and the list order restored:
Let us merge the last two nodes again:
…and restore list order again:
Lets us merge the two last nodes:
and since now we have only one node in the list, we are done, we have a complete tree! We now assign codes:
Yielding the code:
Using the code, the sequence aabcea is encoded as
00 00 01 10 111 00
or, in memory, as a blob without the spaces:
which is much shorter than assigning a naïve code with 3 bits to each codes (since there are 5 possible codes, you need 3 bits as 2 bits would give you only 4 codes).
The optimality of Huffman’s procedure is not quite trivial to show but can be found in many Information Theory and Data Compression books. The average Huffman code length (provided sufficiently long messages and unchanging frequencies of occurrence) is never off by more than bits compared to the entropy of the source, which makes Huffman codes very efficient in general.
So you can use Huffman’s simple procedure to create your own variable length codes suited to the frequencies of occurrence of the symbols in the source you are considering. This will greatly help even ad hoc compression strategies because it will ensure that the codes you are using are exactly as long as they need to be for the code to be uniquely.