## Ad Hoc Compression Methods: Move To Front

In a previous post, I’ve presented an ad hoc compression method known as digram (digraph, bigram, bigraph) coding that coded pairs of symbols on used codes from the original alphabet yielding a very simple encode/decode and also sufficiently good compression (in the vicinity of 1.5:1 or so). While clearly not on par with state-of-the-art codecs such as Bzip2 or PAQ8P, this simple codec can still buy you extra space, even when using a very slow/simple processor.

This week, I’ll introduce you to another simple algorithm, Move To Front coding.

Move to front coding comes in many variants, but they all do essentially the same: compute a ranking of symbols according to their relative frequency while allowing them to be temporarily move to the front of the list (thus the name of the method). On the long run, however, the symbols remain in approximately the ranking order.

Let us compare three algorithms:

• Savage MTF. When a symbol is observed, it is moved at the first rank, shifting all the symbols between the head and its former position by one.
• Bubble MTF. When a symbol is observed, it is bubbled exactly 1 position towards the first position, shifting only one symbol down.
• Ranking MTF. The ranks are incrementally computed as symbols are encoded. The ordering converges towards the ranking on the whole sequence.

For the first test, I used a simple encoding that gives 4 bits code to the most frequent symbols and much longer—12 bits—codes to the symbols that are not in the first 15 rank positions. In clear: the codes 0000 to 1110 are assigned to the first to fifteenth symbols, and the code 1111, used as an “escape” introduces 8 literal bits that encode the symbol. This encoding is somewhat naive, but is sufficient to test which of the three variants is the most efficient.

So I will use the same data set as with digram coding. Running all three algorithms, I get the following results:

Comparing Algorithms.

Unsurprisingly, the ranking algorithm gives better results except for file 11800-8 which is harder to compress, and somewhat uncharacteristic. So we’ll focus on the ranking algorithm.

Since the ranking algorithm is the most efficient, it’s worthwhile to implement it using an efficient algorithm. We have three operations that must be performed as fast as possible: finding the rank of a symbol, finding a symbol given a rank, and finally update the rank of a symbol. The first two can be performed in constant time, and the last one in expected constant time. Consider the following data structure:

The Ranking Data Structures

The Counts table holds the number of times a symbol have been observed so far. It is drawn much wider than the other two because the counts are integers large enough to represent a file size (off_t in C/C++, which is likely a 64 bits integer). The Ranks table is indexed by the symbol itself and contains its current rank. The Reverse table gives the symbol corresponding to a given rank, allowing fast look-up during decode. Finally, the crucial part of the algorithm, the update of ranking, proceeds pretty much as a bubble sort:

```void update(uint8_t c)
{
counts[c]++;

// bubbling update
int i=ranks[c]; // where we start from
int j=i; // the next most popular symbol (smaller rank)

// should bubble at most once

// when distribution is stabilizing
//
while ( (j>0) && (counts[c]>counts[reverse[j-1]]) )
j--;

std::swap(ranks[c],ranks[reverse[j]]);
std::swap(reverse[i],reverse[j]);
}
```

The costly step (the while) isn’t that costly because after a while, symbols do not change rank very often.

So we proceed with the ranking algorithm. But the basic encoding isn’t all that great, because it gives very long codes to the symbols that aren’t in the first 15 positions. Examining the ranking, we see that the distribution falls very rapidly. For example, from file 12606.txt, we see that the symbol occupying the first position is observed 481038 times while the symbol in the 15th position is seen only 58189 times. Such a distribution is surely amenable to an efficient code. Let us try a Huffman Code. Knowing that the escape is used 2950504 times, we arrive at the following code:

```symbol  L    Code

esc   1    1 + xxxxxxxx (+ 8 bits)

0   3    000
1   4    0010
2   5    00110
3   5    00111
4   5    01000
5   5    01001
6   5    01010
7   5    01011
8   5    01100
9   5    01101
10   6    011100
11   6    011101
12   6    011110
13   7    0111110
14   7    0111111
```

So, according to the code, the symbol occupying the 9th position, for example, would receive the code 01101 (notice that we start at zero, in a very C-ish fashion). Let Ranks+ be the new algorithm with the first encoding.

Thinking a bit, you notice that the 9 bit codes given to symbols not in the top 15 may be overly long for codes almost in the top 15, say the 16th or 17th. We can still refine the code a bit by assigning 5 bits codes to the next 32 symbols, and full, 8 bits code to the other ones, yielding:

```symbol  L    Code

esc   1    10 + xxxxx (+ 5 bits) (codes 15 to 47) (7 bits total)
11 + xxxxxxxx (+8 bits) else (10 bits)
```

let us call this second variant Ranks++. Running the compressors, we get the following:

Comparing Encodings.

So the third encoding is much better than the first two. So we have improved the compression quite a bit by using smarter codes. Maybe we could refine coding a bit more by using a full (static) Huffman code to encore all symbols, but I would guess the improvement would not be significant.

Let us now proceed to a “reality check”. Is the algorithm very good? Let us compare to digram coding. Comparing results, we get:

Comparing with Digram Coding

So digram coding is still doing better despite the use of a variable length code and ranking! Clearly, the ranking algorithm isn’t very good, even more so when compared to Bzip2:

Comparing with Bzip2

Again, Bzip2 beats either method flat. Real flat.

*
* *

Still, what I shown here this week is still a valuable lesson, I would think. First, I show how you get an idea and test it in many steps. First we tested the three ranking methods to see which one is best for the intended data. Then we devised a fast/efficient algorithm to implement the method. We then compared variable length codes to determine which is better. Finally, we proceeded to a reality check that allowed us to assess the actual usefulness of the method.

*
* *

As a commentor, Andrew, pointed out, MTF by itself isn’t very good, but can be used as a step in a more complex compression algorithm. For example, Bzip2 does use MTF coding but only after having transformed the input using the Burrows-Wheeler Transform that massages the input so that it presents a greater number of repetitions. It then applies RLE (as we described in the Pokémon experiments), then Huffman coding (and other types of coding as well, depending on which special case the algorithm uses), yielding extremely interesting compression ratios.

### 4 Responses to Ad Hoc Compression Methods: Move To Front

1. Andrew Polar says:

If method does not beat bzip2 it does not worth time reading. You forgot to say that MTF effective when applied after BWT, this why bzip2 shows strong compression. The typical data fragment after BWT may look like follows 26,25,26,25,103,103,101,101,25,25,25… You can see that it is best data for further MTF. You can also say that MTF can be applied as context modeling. I mean different MTF objects after different bytes. I use some technique that it is close to MTF in RunCoder1. It is sorting based on context dependent statistics. I approached bzip2 much closer on large files, but bzip2 is ideal for short files.

2. Steven Pigeon says:

You may or mayn’t have noticed, it’s about MTF itself, not bzip2.

But you are right that bzip2 performs the Burrows-Wheeler Transform as a step before using MTF, which is also followed by huffman coding, and that it is this step that gives it its edge, but wrong about the fact that a method has no interest if it doesn’t beat <insert your favorite algorithm here>.

Here, I merely presented the effect of MTF + modified Huffman, which removes nothing of its use in general, and doesn’t precludes you from combining these techniques in conjuction with <your favorite algorithm>.

3. Tom says:

I have a variant I call Enhanced MTF. Instead of having one list for move-to-front, I use a unique list for each of the prior characters. In fact,it has now evolved into a separate list for each of the prior 3 bytes!! Works great for text. tcoffin1@cfl.rr.com

• Steven Pigeon says:

Maybe a better name for this would be Conditional MTF? I think we can see MTF as a (not always very good) approximation to $P(X=x)$. Your variant approximates $P(X_t=x_t|x_{t-1})$, $P(X_t=x_t|x_{t-1},x_{t-2})$, etc. which should give better approximations to $P(X_t=x_t|\{x_i\}_{i=1}^{t-1})$ than naïve MTF.