Last time, we presented universal codes and two simple code. This week, let’s revisit the topic and present Elias codes, which are much better universal codes than Chaitin’s and modified Chaitin codes.
We will use the idea of recursion introduced before, but we will push the idea further, getting codes with lengths , which are pretty much as good as it gets.
Elias (in 1975) proposed to use recursion to code of the length of the number, that is, we code the length of the length of the length … of the code, rather than directly the length. It’s an unusual “bottom up” recursive scheme, where we start by the end. But since we have to know how large the end is, we will decide on a minimum code size, , say .
The Elias code for is
where : denotes simple bit-wise concatenation; and where
This generates, for , codes such as:
|4||10 100 0|
|5||10 101 0|
|6||10 110 0|
|7||10 111 0|
|8||11 1000 0|
|14||11 1110 0|
|15||11 1111 0|
|16||10 100 10000 0|
|17||10 100 10001 0|
The length of the code is , and it’s much better than the Chaitin codes with bits.
So why is universal coding interesting at all? If you know a little bit about compression, you known that an efficient code will be close to the entropy, an information measure introduced by Shannon in the 1940s. Roughly speaking, the entropy is the lower-bound on the average code length one can reach with an uniquely decodable encoding. Here, we pretty much give that up and limit ourselves to codes that eventually yield bits for .
But it happens often that we have to code numbers that are a priori unbounded, and that we do not know much on the distribution except that small numbers are more likely than large numbers (something that is trivially true, pretty much independently of what we call “small” (see here)). In this case, universal codes are ideal. They’re in general relatively simple, and we can create all kind of variations, including those that explicitly optimize for a given (approximated) distribution on small numbers while allowing arbitrary large numbers as part of the tail. And one can also sacrifice finesse for machine-efficiency (an euphemism for “simple programming”), such as in the case of MIDI VLQ, or include error-checking and resynchronization capabilities, like in UTF-8.