In a previous post, we started discussing the subject of encoding values using fractional bits. This is quite counter-intuitive, as we usually think that encoding asks for *whole bits*, and that bits are essentially atomic: unsplittable.

However, bits can be fractioned, helping us devise efficient encodings, something not unlike a primitive form of arithmetic coding, for a variety of data. Let's consider a simple case that illustrates the principle: encoding a date.

In a naïve encoding scheme, if we were to encode a date (without resorting to a system-provided class that manages all calendar exceptions), we could consider using three different integers to code year, month, and day, resulting in `3*sizeof(int)` bytes of storage. On a system with `int` being 32 bits, that's 12 bytes of storage: efficient, because you need no further processing to access the values, but clearly wasteful of storage. We would then consider bit fields, and a tighter bit allocation:

```c
typedef struct {
    short year  : 10;  // 1024 years
    short month : 4;   // 16 months?
    short day   : 5;   // 32 days?
} __attribute__((packed)) date_t;
```

This structure compiles to 19 bits, aligned to 24, that is, 3 bytes. That's a lot better than our first naïve scheme using 12 bytes. This scheme is still somewhat inefficient, because it allocates 4 bits for the month, giving us 16 months instead of 12, and 5 bits for the day, giving us 32 days per month instead of 28, 29, 30, or 31. Taken together, those 9 bits encode the day of the year, giving us years of 512 days rather than 365 or 366. The scheme is somewhat machine-efficient (as one might expect of bit fields) but is not encoding-efficient.

How inefficient is it?

To encode 366 different values, one needs ⌈lg 366⌉ = 9 bits, while the information content is only lg 366 ≈ 8.52 bits. How can we achieve an encoding very close to lg 366 bits? One possibility is to use a phase-in code, which will yield codes of roughly 8.6 bits long on average. Another is to use fractional shifts and ors, as we did before, and this will get us to lg 372 ≈ 8.54 bits, which is about 0.2% off.

The code for the fractional-shift approach is as follows:

```c
int pair(int m, int d) {
    // shifts the month by LESS
    // than 5 bits (but more than 4)
    return m * 31 + d;
}

void depair(int z, int *m, int *d) {
    // shifts the z by LESS
    // than 5 bits (but more than 4)
    *m = z / 31;
    // "ands" the z by LESS
    // than 5 bits (but more than 4)
    *d = z % 31;
}
```

This method encodes the day of the year as one of 372 (11*31+31=372) values instead of 512, just 6 off the target of 366. (And lg 372 ≈ 8.54, while lg 366 ≈ 8.52.)

* * *

Using this encoding in a standalone bit field struct still yields 9 allocated bits, which is exactly zero improvement over the original naïve struct. So when will this encoding pay off? When we bundle a large number of dates, of course. If we bundle *two* dates, we almost shave off one complete bit; if we bundle 5, we save 2.3 bits; and if we bundle 18, we save more than a byte: we use 153.7 bits instead of 162.

* * *

Of course, the correct way of encoding a date is to count the number of days since some point in time, and have a good piece of software that translates that number of days into the corresponding date, managing all calendar quirks like leap days and uneven numbers of days per month, or even translating it into some other calendar.

But the point is, we can encode values by "shifting" and "anding" bits by *any* number of bits, even a fractional one (well, by the logarithm of any natural number other than 1). This will allow us, in some cases, to devise *ad hoc* encodings that come very close to the ideal encoding. I must add that *ad hoc* solutions are sometimes welcome, even though "serious" computer scientists frown upon them as lacking generality; their value depends on the problem at hand and on considerations such as the desired encoding speed and efficiency.

A bit of self-promotion, but just in case, since it seems to fit the title of your blog post:

Maybe you are already aware, but a new method to compress fractional bits, called ANS, has recently been proven to be competitive (from a computing efficiency and accuracy perspective) thanks to a first implementation called FSE (a few more implementations have emerged since then).

https://github.com/Cyan4973/FiniteStateEntropy

It's definitely more complex than the scheme presented here, but since it's also about fractional bits, and since it's too recent to be widely reported and published, I figured it may be a good read.

Some blog posts explain the mechanics behind the code, starting here:

http://fastcompression.blogspot.fr/2013/12/finite-state-entropy-new-breed-of.html
