Yestermorning I presented an ad hoc compression method known as RLE, or *run-length encoding* and I got quite more reactions than I thought I would. Many suggested doing plenty of *other* stuff despite the fact that the post was about RLE, not about <insert your favorite compression method>. But something very nice happened: people got interested and many suggestions came forth.

I decided to implement the median-based prediction method as suggested by **Dark Shikari** here. Let us see the results!

The median prediction, as used in the FFV1 lossless video codec that uses only intra-frame coding, uses the following context:

The prediction is given by:

Which means that if color images are considered, the predicted color of the pixel ‘?’ is given by the median of the colors. In RGB (or more like some other YUV-like color space), that means that the prediction is applied on each color plane separately. The difference with the actual color and the predicted color is then encoded.

Using “continuous” colors, the residual (the difference) is likely to be smaller in magnitude than the original range of the colors; using an efficient code yields compression.

What happens when you have a palette image? As **Dark Shikari** suggested, let us use the color in the palette that is the closest to the prediction as the predicted color. But that leaves the problem of encoding the error. Palette images have a problem in this regard: colors are not necessarily sorted in any smart way! Encoding the error pretty much means encoding the correct palette index.

To circumvent this problem, I use the same technique as I used with the simpler context prediction of yesterday; I compute the ranking for each color in the palette at each pixel. The closest to the prediction gets rank 0, the furthest, rank 3.

To measure the difference with the palette colors and the predicted color, I tried two approaches. One is based on the approximate perceived brightness difference, where the difference is given by the formula . The second is to count the number of components that differ, something like . I labeled the first “median-luma” and the second “median-diff”.

The “media-luma” method yields an average compression ratio of 1.77:1, and the “median-diff” method 1.82:1, just wee short of the context-prediction that yields 1.9:1. The median prediction is followed by the same steps as the context predictor; so results compare directly the quality of prediction. Why median-diff does a bit better than the median-luma method, I am not sure; intuitively I would say that the bitmaps exhibit some color structure that makes some color components (not palette entries, but separate components such as R, G, B) more likely to be the same as they are pretty much primary-colored. What’s your take on this?

*

* *

So the important point here is not that RLE is state-of-the-art (because it certainly is *not*) but that compression algorithms perform better when there are multiple stages of analysis, prediction, and encoding. A good prediction or transform helps the following stages exploit redundancy. In our case, we have median or context prediction that transforms the original pixel stream into a ranking stream, which, it turns out, has more repetitions than the original stream. The repetitions are then exploited by an auto-trigger RLE encoder, whose output is further encoded by a code reminescent of Huffman codes.

And most codecs are structured like that, at the global scale—although a lot more complex.

Here’s a possibly even-better method:

1. Use no prediction at all.

2. Use a very simple arithmetic coder with context choice decided by the value of the top neighbor and the value of the left neighbor.

3. Either write the probabilities into the frame header (similar to optimal huffman coding) or make it context-adaptive to avoid such a header at the cost of adaptation time.

So, for example, let’s say we have 4 colors in our palette, the Left pixel is “1”, the top pixel is “3”, and the current pixel is “0”.

Thus, context = Left + Top * 4 =1 + 3 * 4 = 13

Now, how do we write the symbol itself? You could use a multi-symbol rangecoder, in which case making things adaptive would be way too complex (you’d just want to hardcode the range values in the header). You could also use a binary rangecoder, and do the following:

encode_decision( context*3 + 0, pixel == 0 )

encode_decision( context*3 + 1, pixel == 1 )

encode_decision( context*3 + 2, pixel == 2 )

So we write 3 decisions to the bitstream, three of which are zero (for the values which the pixel isn’t) and one of which is one (for the value the pixel is). If the pixel is 3, nothing gets written, because the last bit would be redundant.

Another scheme would be:

encode_decision( context*3, (pixel&1) )

encode_decision( context*3 + (pixel&1) + 1, pixel>>1 )

I suspect these are probably practically equivalent.

Note this is using x264 encode_decision syntax, where encode_decision is defined as encode_decision( context, bit ).

Arithmetic coding is probably very feasible if you also have to manage really limited precision, like it would be the case here. Probability estimates given contexts are necessarily low precision because you won’t be using zillions of examplars to formulate your estimate of the pdf. If a given context shows up 10 times, you’re pretty sure that one decimal point is quite enough. I don’t know how we can exploit that to build a “lazy” arithmetic coder…

Well, with only 16 contexts and a thousand pixels or so per sprite, you’re sure to each one a decent number of times.

What you’re probably looking for is an arithmetic coder that adapts quickly–this can be done pretty trivially. You could also make one that adapts faster for contexts which are used more rarely; that would be interesting.

The easiest way, but perhaps less interesting, would probably be to calculate optimal arithcodes and store them in the frame header.

The typical ways to deal with variable adaptation rates are:

1) Count the number of times a given context decides each way. Current probability estimate is a/(a+b). Simple and good, but involves a division, and the initial estimate is heavily quantized.

2) Use a state machine. Instead of h264’s flat statespace, it can be a tree, where higher levels in the tree increment probability by larger amounts, and only the leaves form a closed cycle. Easy to add to an h264-style coder (and it was in one of the drafts iirc), but initial estimate is still heavily quantized.

3) As 2, except leave statespace flat and switch transition tables instead.

Optimal bayesian updating of pdfs should always compress better than explicitly coding an initial pdf (updating gives you exactly the right amount of less precision in less frequently used contexts). But explicit coding lets you leave the probability estimation out of the standard (and of course makes decoding faster).

I’ve been thinking. Another possiblity would be to encode larger tiles progressively. Start with an encoding of the bitmap using 2×2 or larger tiles. Nothing special, just a code for uniform/non-uniform and if uniform, in a separate stream, the uniform color. If non-uniform, read four pixels, possibly predicted from surroundings.

A similar method to what you’re describing is used in the (currently prototype-stage) fast lossless video codec FFV2.

Note that this is *not* generally better than a single-pixel arithmetic coding solution; it was chosen here because it reduces waste as a result of VLCs being of limited bit-length-precision and because it improves performance on decoding due to fewer VLCs having to be read.

The way FFV2 works is as follows:

1. Median-filter the image.

2. Divide the frame up into 8×4 blocks.

3. Divide each 8×4 block up into 2×2 blocks. Create a bitmask based on which 2×2 blocks have all-zero residual and which don’t. Write this bitmask to the bitstream; this is called the coded block pattern. This bitmask isn’t written directly, instead, the CBP has its own VLC table for this.

4. For each 2×2 block, write the VLC for the block. Each of the four residuals in the block can have one of 6 states:

ESC, -2, -1, 0, 1, 2

This leads to a total of 6^4 VLCs.

5. If the block has escape codes, after writing the block, write, in order, the escape codes for the escaped coefficients. These have their own VLC table as well.

The format has, of course, massive Huffman tables written at each keyframe.

Personally I’d think the arithmetic coding solution would be better for what you’re doing, though this solution might work if you hardcode the VLC tables into the device instead of each individual image (there’d be far fewer VLCs due there being only 4 possible pixel values, of course).

By the way, if you’re interested in more in-depth realtime discussion, feel free to drop by #x264dev or #x264 on Freenode; you’ll probably find more concentrated knowledge about entropy coding there than anywhere else I can think of.

That actually might be a good idea. Although I am very well informed regarding probability estimation and integer (tree-structured) coding, my knowledge of arithmetic and range coders is rather sketchy.

great domain name for blog like this)))

I found your blog on google and read a few of your other posts. I just added you to my Google News Reader. Keep up the good work. Look forward to reading more from you in the future.