## Phase-in Codes

Once in a while, we have to encode a series of values to memory or file about whose distribution we know very little, or that are very close to being drawn from a uniform random variable: basically, all values are equally likely. This precludes the use of Huffman (or similar) codes because the distribution is flat, and a Huffman code will afford no real compression despite the added complexity.

We do not want to use a fixed number of bits either, because the number of possible values might be very far from the next power of two. For example, say I have 11 different values. I could encode them on 4 bits each, but I would waste more than half a bit by doing so ($4-\lg 11\approx{}0.54$ bits), for every code. Fortunately, there's an easy, fast encoding that can help us achieve high efficiency: phase-in codes.

The first written reference to phase-in codes I could find is US Patent 4464650 (1984), where they are referred to as economy codes [1]. In [4], they are named phase-in codes and briefly discussed, but without reference to their inventor. They then show up in the work of Acharya and Já Já [2,3], though I am sure they appear elsewhere under different names.

OK, so our objective is to devise a simple code that allows us to use, on average, very close to $\lg N$ bits (where $\lg$ is a shorthand for $\log_2$) for $N$ distinct values. We try to give codes $\lceil \lg N \rceil$ bits long to as few values as possible, while encoding the remaining values on $\lfloor\lg N\rfloor$ bits.

The usual binary code on $k$ bits will be optimal whenever $N=2^k$, that is, when $N$ is a power of two. If $N$ isn't, it will be of the form $N=2^k+b$, where $k=\lfloor \lg N \rfloor$ (the exponent of the greatest power of two less than or equal to $N$) determines $b$ uniquely.

Using the decomposition $N = 2^k+b$, we compute $B=2^k-b$, the threshold. If a value is smaller than the threshold $B$, it receives a code $k$ bits long; if it's greater than or equal to the threshold, it receives a code $k+1$ bits long. For example, with $N=11$, we have $k=3$, $b=3$, $B=2^3-3=5$, and we obtain the following code table:

| $n$ | Length | Code |
|-----|--------|------|
| 0   | 3      | 000  |
| 1   | 3      | 001  |
| 2   | 3      | 010  |
| 3   | 3      | 011  |
| 4   | 3      | 100  |
| 5   | 4      | 1010 |
| 6   | 4      | 1011 |
| 7   | 4      | 1100 |
| 8   | 4      | 1101 |
| 9   | 4      | 1110 |
| 10  | 4      | 1111 |

Examining the table, we see that the first $k$ bits change every two codes past $B$, and that the trailing bit is just $(n-B) \bmod 2$. We can understand the phase-in code as using $N-B=2b$ codes of length $k+1$ and only $B=2^k-b$ codes of length $k$.

The encoding procedure is given by:

```cpp
void encode( int n,
             int k, int B,
             unsigned char buffer[],
             int & offset)
{
    if (n < B)
        put_bitstring(n, k, buffer, offset);
    else
    {
        // codes past the threshold share a k-bit prefix between
        // pairs of values, plus a trailing parity bit
        put_bitstring(B + (n - B) / 2, k, buffer, offset);
        put_bit((n - B) % 2, buffer, offset);
    }
}
```

put_bit and put_bitstring are functions that put the bits into the buffer, while updating the write offset. Similarly, the decoding procedure is given by:

```cpp
unsigned decode( int k,
                 int B,
                 unsigned char buffer[],
                 int & offset)
{
    unsigned n = get_bitstring(k, buffer, offset);

    if (n < B)
        return n;
    else
        // read the trailing bit and undo the pairing
        return B + 2 * (n - B) + get_bit(buffer, offset);
}
```

get_bit and get_bitstring are the functions that read bits from the buffer, updating the read offset.

Phase-in codes' efficiency is very good. If we take again our example with $N=11$, we have a total code length of $Bk+(N-B)(k+1)=5\times3{}+6\times{}4=15+24=39$ bits, for an average of $39/11\approx{}3.545$ bits, while $\lg 11\approx{}3.459$, an excess of only $\approx{}0.086$ bits! That means that the average phase-in code length is actually pretty close to $\lg 11$, and the code is therefore very efficient.

The following figure shows the efficiency of phase-in codes for the first few $N$. The Y axis is the average code length, in bits, and the X axis is $N$. In red, the average phase-in code length; in black, $\lg N$. In fact, we can show that the average excess is never more than $\approx{}0.0860713\ldots$ bits for any $N$.

*(Figure: Performance of Phase-In Codes.)*

Phase-in codes may be used by themselves, but I think they serve their purpose better when combined with other codes, especially variable-length codes. A variable-length code could announce the number of distinct values $N$, and then you read a $k$- or $(k+1)$-bit code depending on the previously decoded $N$. You may end up with a code that looks like:

| Code | Followed by                  |
|------|------------------------------|
| 1    | phase-in code with $N=3$     |
| 01   | phase-in code with $N=7$     |
| 001  | phase-in code with $N=17$    |
| 000  | phase-in code with $N=35$    |

Using Phase-in codes for the “tail” code will speed up decoding quite a bit compared to the naïve VLC decoder that walks a code tree as it reads bits one by one.

[1] Willard L. Eastman, Abraham Lempel, Jacob Ziv, Martin Cohn — Apparatus and Method for Compressing Data Signals and Restoring the Compressed Data Signals — US Patent 4464650, 1984 (filed 1981)

[2] Tinku Acharya, Joseph Já Já — Enhancing LZW Coding Using a Variable-Length Binary Encoding — Tech. Rep. TR-95-70, Institute for Systems Research (1995)

[3] Tinku Acharya, Joseph Já Já — An On-line Variable-Length Binary Encoding of Text — Information Sciences, v. 94 (1996) p. 1–22

[4] Timothy C. Bell, John G. Cleary, Ian H. Witten — Text Compression — Prentice-Hall (1990)
