Some time ago, I presented a piece on compressing voxel worlds and I just realized that I discussed different types of variable length codes quite a few times, but that I never took the time to present you the basic method of Huffman coding!

The problem of finding an optimal variable length code is to find an uniquely decodable binary code (that is, a code using only 0 and 1 for which there exists only one way of decoding a message) that encodes each symbol in a number of bits that is proportional to its probability of occurrence. A first strategy was devised by Robert Fano, Huffman’s teacher at one point. The so-called Shannon-Fano code is a top-down approach to solving the problem but it was shown to be suboptimal. The code proposed by David Huffman solves the problem *optimally* by taking exactly the opposite approach, that is, building the code bottom-up.

In a previous post, I summarized what makes a tree-structured code uniquely decodable:

This code is uniquely decodable because it is a prefix code, that is, a code for which no prefix part of a code is a code by itself. A simple way of understanding this property is to consider that decoding these types of code is like walking down a binary tree: you read a zero, you go down left, you read a one, you go down right, and if you reach a leaf in the tree, you’ve completed the decoding of a symbol. You’re not allowed to stop at an internal node, only at a leaf.

Huffman’s goal was to find, from a list of symbols and their relative frequencies of occurrence, such a code tree. However, Huffman realized that the code, to be optimal, that is, with the shortest average code length, had to obey these conditions:

- No code is a prefix of another code;
- No extra information is needed to delimit codes;
- Symbols, sorted in non-increasing order of frequency of occurrence receive codes of non-decreasing length; and in particular, the two least frequent receive codes of the same length; in order word, frequent symbols get the shorter codes and the least frequent symbols the longer codes;
- The last two codes, the longest, differ only in the last bit

Huffman procedure is relatively simple but yields an optimal code complying to the above conditions. Here how it works.

First, form a node for each symbol containing the symbol itself and its frequency of occurrence. Then, form a list with these nodes and sort them from most to least frequent. At the tail of the list, merge the last two nodes, yielding an internal node with no symbol but whose frequency of occurrence is the sum of the frequencies of the two merged nodes, and put it back in the list so that the list remains sorted from most to least frequent. Repeat until there’s only one node left, the root of the code three.

The method is exceedingly simple but let us present an example. Let’s say we have messages to encode that use the five symbols *a*, *b*, *c*, *d*, and *e*, which we observe with relative frequencies 3, 3, 2, 1, and 1.

So we create our list of nodes:

We merge the last two nodes:

Since the list remained sorted, nothing is to be done except proceed to the next iteration. The last two nodes are merged again:

and the list order restored:

Let us merge the last two nodes again:

…and restore list order again:

Lets us merge the two last nodes:

and since now we have only one node in the list, we are done, we have a complete tree! We now assign codes:

Yielding the code:

a |
00 | |

b |
01 | |

c |
10 | |

d |
110 | |

e |
111 |

Using the code, the sequence *aabcea* is encoded as

00 00 01 10 111 00

or, in memory, as a blob without the spaces:

0000011011100

which is much shorter than assigning a naïve code with 3 bits to each codes (since there are 5 possible codes, you need 3 bits as 2 bits would give you only 4 codes).

*

* *

The optimality of Huffman’s procedure is not quite trivial to show but can be found in many Information Theory and Data Compression books. The average Huffman code length (provided sufficiently long messages and unchanging frequencies of occurrence) is never off by more than bits compared to the entropy of the source, which makes Huffman codes *very* efficient in general.

*

* *

So you can use Huffman’s simple procedure to create your own variable length codes suited to the frequencies of occurrence of the symbols in the source you are considering. This will greatly help even ad hoc compression strategies because it will ensure that the codes you are using are *exactly* as long as they need to be for the code to be uniquely.

I remember my college days when I struggled to get hold of Huffman Codes :) would have been nice if I get this tutorial. good stuff buddy.

[...] at Harder, Better, Faster, Stronger, Steven Pigeon has a nice post on Huffman codes. He explains how Huffman codes operate and works through a pencil and paper example. Because he [...]

It’s been a while since I’ve seen Scheme but this seems to be a nice implementation of Huffman codes! Good work!

Wow the best description of huffman I’ve read in a long time (the last time I read it, it was in a book). Also thanks for revealing the secret behind huffman…why things are done the way are done.

The algorithm itself is really simple. Proofs on performance bounds are suprisingly complex given the algorithm. A few years ago I wrote chapter 4 in the Lossless Compression Handbook (with Khalid Sayood as editor) on Huffman coding and all its variations. It also has a copious reference list.

Nevertheless, sometimes simple codes (such as those described in “Compressing Voxel Worlds”) will give you an extra bang with comparatively little effort; Huffman (or other optimal/near-optimal) codes will give you that little extra edge on compression.

Thanks for the book reference. Couldn’t agree more with your observations on optimal coding.

Btw could you please recommend some good algorithm/online resource/book for doing lossless compression in hardware? I need something which gives the best bang for the buck (w.r.t. silicon real estate and execution speed).

No, I do not have a good reference on hardware-accelerated (lossless) compression. I’m pretty sure searching the IEEE Xplore (or similar ressource), would give results, but I think you can get pretty far using software methods only. In chapter 4, I describe a state-machine based technique that greatly speed-up coding/decoding by avoiding manipulating individual bits.

Thanks. I will readup on it.

[...] only two types, I think: normal and with a pit), so 2 bits are needed. Using something simple as Huffman Coding would probably get us nowhere further because the distributions aren’t skewed enough to [...]

[...] time ago, I discussed Huffman codes, how they were generated, and how you could (de)code information with it. I also said that they [...]

[...] first strategy that comes to mind, using this assumption, is to use a method reminiscent of how Huffman Codes are constructed. That is, we scan the original list from left to right: we take two nodes (if there [...]

There’s a nice implementation of this up at http://www.learninglover.com/examples.php?id=19