Universal Coding (part III)

In Universal Coding (part II), we explored the idea of variable-length codes that encode integers of any magnitude in $O(\lg n+\lg\lg n)$ bits, but the encoding requires (somewhat) complex arithmetic, and we concluded that sometimes, it’s better to use a simple code that sacrifices some coding efficiency in exchange for machine efficiency. Let’s now look at one of those, the MIDI VLQ, and how we can make it a bit better.

The MIDI VLQ code allows a signed integer of any magnitude (either using 2s complement or an explicit sign bit, depending on the variant) to be encoded using extension bits that introduce more bytes. The encoding is as follows. The number to encode is broken into bytes, seven bits per byte, with the 8th bit of each byte (the continuation bit) indicating whether or not it is the last byte of the number. If there are not enough bits to fill the last byte, one either pads with zeroes (in the explicit sign-bit variant, with the sign bit in the 7th bit of the last byte) or sign-extends. VLQ is therefore a code where the length is coded in unary (0 for length 1, 10 for length 2, 110 for length 3, etc.) but dispersed through the code rather than encoded at the beginning.
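
As a sketch, here is what the unsigned, big-endian variant of this encoding might look like in Python (the function name `vlq_encode` is my own):

```python
def vlq_encode(n: int) -> bytes:
    """Encode a nonnegative integer as a MIDI-style VLQ: seven payload
    bits per byte, most-significant group first, with the 8th bit set
    on every byte except the last."""
    groups = [n & 0x7F]          # least-significant 7-bit group
    n >>= 7
    while n > 0:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()             # MIDI VLQ is big-endian
    return bytes(g | 0x80 for g in groups[:-1]) + bytes(groups[-1:])

# 300 = 0b10_0101100 -> groups 0000010 and 0101100 -> 0x82 0x2C
print(vlq_encode(300).hex())     # -> 822c
print(vlq_encode(0).hex())       # -> 00
```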

A typical encoding would look like this: take 300, which is 100101100 in binary. Split it into the 7-bit groups 0000010 and 0101100, prefix each group with its continuation bit (1 for every byte but the last, 0 for the last), and you get the two-byte VLQ 0x82 0x2C.

While it is simple, MIDI VLQ encoding is not absolutely code-efficient because it does not make full use of the extension bits: there are multiple representations for numbers. For example, one can encode zero in an infinite number of ways, because 0x80 0x80 ... 0x80 0x00 is a valid code, whatever its length. We can transform the VLQ so that each number has exactly one representation, and see what the coding-efficiency gain is.
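
The corresponding decoder makes the redundancy easy to see: a leading 0x80 byte contributes only zero payload bits, so any number of them can prefix a code without changing its value (again a sketch of the unsigned variant, with a name of my choosing):

```python
def vlq_decode(data: bytes) -> int:
    """Decode an unsigned MIDI-style VLQ, most-significant group first."""
    n = 0
    for b in data:
        n = (n << 7) | (b & 0x7F)   # append the 7 payload bits
        if not (b & 0x80):          # continuation bit clear: last byte
            break
    return n

# Zero has infinitely many encodings: leading 0x80 bytes are no-ops.
print(vlq_decode(b"\x00"))               # -> 0
print(vlq_decode(b"\x80\x00"))           # -> 0
print(vlq_decode(b"\x80\x80\x80\x00"))   # -> 0
```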

First, let’s get rid of the multiple-representation problem. In MIDI VLQ, there is no strong link between the continuation bit and the actual number being coded, but we can make sure that whenever a continuation bit is set, the range of representable numbers changes. Consider the following table:

 Code form                     Range
 xxxxxxx0                      0-127
 xxxxxxx1 xxxxxxx0             128-16511
 xxxxxxx1 xxxxxxx1 xxxxxxx0    16512-2113663

Since we can represent values 0 to 127 with a single-byte code (with the 8th bit being 0), it would be wasteful to start the two-byte codes at zero: we start at 128. A two-byte code has 14 payload bits, so we can represent $2^{14}$ values; starting at 128, they run up to $128+16384-1=16511$, and the two-byte codes therefore represent values 128-16511. The same applies to the three-byte codes, which start at 16512 and represent $2^{21}$ more values, yielding the range 16512-2113663. And so on.
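
This bootstrapping is easy to check numerically; here is a small sketch (the helper name is mine):

```python
def byte_range(n: int):
    """Inclusive (lo, hi) of the values assigned to n-byte codes when
    each code length starts where the previous one left off."""
    lo = sum(2 ** (7 * i) for i in range(1, n))          # 0 for n == 1
    hi = sum(2 ** (7 * i) for i in range(1, n + 1)) - 1
    return lo, hi

for n in (1, 2, 3):
    print(n, byte_range(n))
# 1 -> (0, 127), 2 -> (128, 16511), 3 -> (16512, 2113663)
```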

In general, an $n$-byte code represents the range

$\displaystyle \sum_{i=1}^{n-1} 2^{7i}$ to $\displaystyle \left(\sum_{i=1}^n 2^{7i}\right)-1$

or

$\displaystyle \frac{128}{127}(128^{n-1}-1)$ to $\displaystyle\frac{128}{127}(128^n-1)-1$,

using the identity $\displaystyle \sum_{i=1}^n a^i=\frac{a(a^n-1)}{a-1}$.
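
We can sanity-check that the closed forms match the raw sums (integer division is exact here, since $128^{n-1}-1$ is always divisible by 127):

```python
# Raw sums, straight from the range formula.
def lo_sum(n): return sum(128 ** i for i in range(1, n))
def hi_sum(n): return sum(128 ** i for i in range(1, n + 1)) - 1

# Closed forms from the geometric-series identity.
def lo_closed(n): return 128 * (128 ** (n - 1) - 1) // 127
def hi_closed(n): return 128 * (128 ** n - 1) // 127 - 1

for n in range(1, 8):
    assert lo_sum(n) == lo_closed(n) and hi_sum(n) == hi_closed(n)
print("closed forms agree for n = 1..7")
```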

With VLQ, an $n$-byte code represents $2^{7n}$ values (of which $\sum_{i=1}^{n-1} 2^{7i}$ are redundant); therefore, it represents only

$\displaystyle 2^{7n}-\sum_{i=1}^{n-1} 2^{7i}$

new codes. The efficiency of coding is therefore

$\displaystyle \lim_{n\to\infty} \frac{128^n-\sum_{i=1}^{n-1}128^i}{128^n}=\frac{126}{127}\approx{}0.992$,
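
A quick numerical check shows the per-length efficiency settling around $\frac{126}{127}$ almost immediately:

```python
from fractions import Fraction

def efficiency(n: int) -> Fraction:
    """Fraction of the 2^(7n) n-byte VLQ codes that denote new values."""
    total = Fraction(128 ** n)
    redundant = sum(128 ** i for i in range(1, n))
    return (total - redundant) / total

for n in (1, 2, 3, 10):
    print(n, float(efficiency(n)))   # n=2 gives exactly 0.9921875
print("limit:", 126 / 127)
```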

while the code with no redundant representation yields an efficiency of $1$.

* * *

That was a lot of work to get asymptotically 1% more efficiency than the basic VLQ, and one can further verify that even for small $n$, the efficiency remains about the same (there’s a dominating term $\frac{126}{127}$ followed by a rapidly vanishing term in $128^{-n}$). Are there cases where a 1% gain is worth all that trouble? Well, you decide. More important than the multiple representations for values is the fact that there may be up to 6 pad bits. While asymptotically this inefficiency vanishes as well, for small $n$, $\frac{6}{7n}$ is rather large.
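
The $\frac{6}{7n}$ pad-bit overhead is easy to tabulate:

```python
# Worst case: 6 of the 7n payload bits of an n-byte code are padding,
# a fraction 6/(7n) of the payload.
for n in (1, 2, 3, 10):
    print(n, round(6 / (7 * n), 3))
# 1 -> 0.857, 2 -> 0.429, 3 -> 0.286, 10 -> 0.086
```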