In a previous post, I presented the phase-in codes. I stated that they are very efficient when the distribution is uniform, and that, given the distribution is uniform, they never do worse than $\approx 0.0861$ bits above the theoretical lower bound of $\lg N$, where $N$ is the number of distinct symbols to code and $\lg$ is a convenient shorthand for $\log_2$. In this post, I derive that result.
Bell et al. introduce a notation for codes that I will reuse here, and from now on. Don't worry, I won't expect you to remember all of those; I will reintroduce them as need be. So, let $C_N(x)$ be the phase-in code for $x$ using $N$ symbols. Again, we use the decomposition $N = 2^b + r$, where $b = \lfloor \lg N \rfloor$, and $r = N - 2^b$ is uniquely determined, with $0 \leqslant r < 2^b$.
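To make the decomposition concrete, here is a small Python sketch of a phase-in encoder (the function name and the bit-string output are my own, not from the original post): with $N = 2^b + r$, the first $2^b - r$ symbols receive $b$-bit codes and the remaining $2r$ symbols receive $(b+1)$-bit codes.

```python
def phase_in(x, n):
    """Phase-in (truncated binary) code for x in [0, n), n >= 2.

    Decompose n = 2**b + r with b = floor(lg n); the first
    2**b - r symbols get b-bit codes, the remaining 2*r symbols
    get (b+1)-bit codes.
    """
    b = n.bit_length() - 1       # b = floor(lg n)
    r = n - (1 << b)             # remainder, 0 <= r < 2**b
    short = (1 << b) - r         # how many b-bit codes there are
    if x < short:
        return format(x, "b").zfill(b)
    # long codes: shift past the b-bit block by adding `short`
    return format(x + short, "b").zfill(b + 1)
```

For example, with $N=5$ this yields the codes 00, 01, 10, 110, 111 for $x = 0, \ldots, 4$: three codes of length $b=2$ and two of length $b+1=3$.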
The average length of code $C_N$ for random variable $X$ is given by:

$$\bar{L}(C_N) = \sum_{x=0}^{N-1} P(X=x)\,|C_N(x)|,$$

and $|c|$ denotes the length of code $c$, in bits. If $X$ is uniform on $\{0,1,\ldots,N-1\}$, so that $P(X=x)=\frac{1}{N}$, we can rewrite the above equation as:

$$\bar{L}(C_N) = \frac{1}{N}\Big( (2^b - r)\,b + 2r\,(b+1) \Big)$$
as, indeed, there are $2^b - r$ codes of length $b$ and $2r$ codes of length $b+1$. After simplification, we get:

$$\bar{L}(C_N) = b + \frac{2r}{N} = b + \frac{2r}{2^b + r},$$
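As a quick sanity check (a small Python snippet of my own, not from the post), the enumeration above and the simplified closed form $b + 2r/N$ agree for every $N$:

```python
from math import isclose

def avg_len(n):
    """Average code length under a uniform source, by direct
    enumeration: (2**b - r) codes of b bits, 2*r codes of b+1 bits."""
    b = n.bit_length() - 1
    r = n - (1 << b)
    return (((1 << b) - r) * b + 2 * r * (b + 1)) / n

# the closed form b + 2r/N matches the enumeration
for n in range(2, 10_000):
    b = n.bit_length() - 1
    r = n - (1 << b)
    assert isclose(avg_len(n), b + 2 * r / n)
```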
again provided that $X$ is uniformly, identically, and independently distributed. So, now that we have an expression for the average length of the codes, we are interested to know what $r$ must be to create the maximal difference between $\bar{L}(C_N)$ and the lower bound $\lg N$. So we want to find the $r$ that maximizes:

$$D(r) = b + \frac{2r}{2^b + r} - \lg(2^b + r);$$
since $D(r)$ has a simple closed form, we do not need to use least mean squares, or something similar. Also, luckily for us, this function is concave in $r$, so it suffices to solve:

$$D'(r) = \frac{2^{b+1}}{(2^b + r)^2} - \frac{1}{(2^b + r)\ln 2} = 0$$
for $r$. After much agony, we find

$$r^* = (2\ln 2 - 1)\,2^b \approx 0.3863 \cdot 2^b.$$
Substituting $r^*$ back into $D(r)$, we note that $2^b + r^* = 2^{b+1}\ln 2$, so that $\frac{2r^*}{2^b + r^*} = 2 - \lg e$ and $\lg(2^b + r^*) = b + 1 - \lg\lg e$. We find, finally,

$$D(r^*) = 1 - \lg e + \lg\lg e \approx 0.0861,$$

which concludes the demonstration.
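We can also verify the maximum numerically (again a sketch of mine, with arbitrary variable names): treating $r$ as a real number, $D(r)$ peaks at $r^* = (2\ln 2 - 1)\,2^b$ with value $1 - \lg e + \lg\lg e \approx 0.0861$, independently of $b$.

```python
from math import e, log, log2

b = 20
two_b = 1 << b

def redundancy(r):
    """D(r) = b + 2r/(2**b + r) - lg(2**b + r), for real r."""
    n = two_b + r
    return b + 2 * r / n - log2(n)

r_star = (2 * log(2) - 1) * two_b          # stationary point
d_max = 1 - log2(e) + log2(log2(e))        # closed-form maximum

assert abs(redundancy(r_star) - d_max) < 1e-12
# nearby points do worse, as concavity promises
assert redundancy(0.9 * r_star) < d_max
assert redundancy(1.1 * r_star) < d_max
```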
The constant $1 - \lg e + \lg\lg e \approx 0.0861$ shows up quite often in proofs relating to code lengths and, in particular, in the proof that Huffman codes are optimal to within a small constant.
Timothy C. Bell, John G. Cleary, Ian H. Witten — Text Compression — Prentice-Hall (1990)