## Length of Phase-in Codes

In a previous post, I presented the phase-in codes. I stated that they were very efficient given the distribution is uniform. I also stated that, given the distribution is uniform, they never do worse than $\approx 0.086$ bits worse than the theoretical lower bound of $\lg N$, where $N$ is the number of distinct symbols to code and $\lg N$ is a convenient shorthand for $\log_2$. In this post, I derive that result.

The Phase-in Codes follow closely Lg N. In blue, the exact Lg N. In red, the average length of the corresponding Phase-in Code.

Bell et al. introduce a notation for codes that I will reuse here, and from now on. Don’t worry, I won’t expect you to remember all of those; I will reintroduce them as need be. So, let $C_{\phi(N)}(n)$ be the phase-in code for $n$ using $N$ symbols. Again, we use the decomposition $N=2^k+b$, where $k=\lfloor \lg N\rfloor$, and $b$ is uniquely determined.

The average length $\bar{L}_{\phi(N)}(X)$ of code $C_{\phi(N)}(X)$ for random variable $X$ is given by:

$\bar{L}_{\phi(N)}(X)=P(X

and $B=2^k-b$. If $X$ is uniform on $0 \ldots N-1$, we can rewrite the above equation as:

$\bar{L}_{\phi(N)}(X)=\displaystyle\frac{(2^k-b)k+2b(k+1)}{2^k+b}$

as, indeed, there are $2b$ codes of length $k+1$. After simplification, we get:

$\bar{L}_{\phi(N)}(X)=k+\displaystyle\frac{2b}{2^k+b}$

again provided that $X$ is uniformly, identically, independently, distributed. So, now that we have an expression for the average length of the codes, we are interested to know what $b$ must be to create the maximal difference between $\bar{L}_{\phi(N)}$ and $\lg N$. So we want to find $0 \leqslant b < 2^k$ that maximizes:

$\bar{L}_{\phi(N)}(X)-\lg N$

since $\bar{L}_{\phi(N)}(X) \geqslant \lg N$, we do not need to use least mean squares, or something similar. Also, luckily for us, this function is concave in $b$, so it suffice to solve:

$\displaystyle\frac{\partial}{\partial b}(\bar{L}_{\phi(N)}(X)-\lg N)=0$

for $b$. After much agony, we find $b=(2 \ln 2-1)2^k\approx 2^k 0.3862\ldots$

Substituting $b$ back into $\bar{L}_{\phi(N)}(X)-\lg N$, we find, finally,

$\bar{L}_{\phi(N)}(X)-\lg N=\frac{1}{\ln 2}(\ln 2-\ln \ln 2-1)\approx0.0860713$

Which concludes the demonstration.

The constant $0.0860713\ldots$ shows up quite often in proofs relating to code lengths, and, in particular, in the proof that Huffman codes are optimal, within a small constant.

Upperbounded by 0.0860713... the excess average length of Phase-in Codes reaches a minimum when b=0

[1]Timothy C Bell, John G. Cleary, Ian H. Witten — Text Compression — Prentice-Hall (1990)

### One Response to Length of Phase-in Codes

1. […] Fortunately, we can “repair” the algorithm quite easily, and it’s another code that will come to our aid: Phase-In Codes, which have average code lengths close to (as I show here). […]