On many occasions, we have to estimate a probability distribution from observations because we do not know the probability and we do not have enough *a priori* knowledge to infer it. An example is text analysis. We may want to know the probability distribution of letters conditional on the few preceding letters. Say, what is the probability of having ‘a’ after ‘prob’? To estimate the probability, we start by measuring frequencies using a large set of observations, say, the whole Project Gutenberg ASCII-coded English corpus. So we scan the corpus, count frequencies, and observe that, despite the size of the corpus, some letter combinations were never observed. Should the corresponding probability be zero?
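As a small sketch of that counting step (the corpus string here is a hypothetical stand-in for a real corpus scan):

```python
from collections import Counter

def next_letter_counts(text, context):
    """Count which letters immediately follow `context` in `text`."""
    counts = Counter()
    start = text.find(context)
    while start != -1:
        follower = start + len(context)
        if follower < len(text):
            counts[text[follower]] += 1
        start = text.find(context, start + 1)
    return counts

corpus = "the probability of a problem in probability theory is probably small"
counts = next_letter_counts(corpus, "prob")
print(counts)   # 'a' follows three times, 'l' once; most letters never
```

Every letter absent from `counts` has an observed frequency of zero, which is exactly the situation discussed below.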

We may argue that these probabilities should be zero because the combinations were never seen; but one could just as well argue that they are merely unlikely, and should be non-zero, but small.

The problem of zero probabilities is particularly acute in data compression. In data compression, to create an efficient code (such as a Huffman code), we must first estimate the probabilities of occurrence. If we can look at the whole text before coding it, that’s not a problem, because we will know exactly what it contains. But if we can’t observe the whole text before starting to code, or if we have to estimate the probabilities from another text, then we potentially have the problem of zero-frequency symbols. First, we want to ensure that the code-book allows us to encode any symbol, and second, we do not want to estimate probabilities too coarsely, because that leads to an inefficient code and bad compression ratios.

So, how do we usually solve this problem?

The first piece of the puzzle is to figure out how entropy is affected by adding symbols. Say that we have seen the subset of symbols $S$ out of the possible alphabet $A$. If we add one symbol to $S$, its entropy increases by an amount log-proportional to the probability of the new symbol. Therefore, any new symbol added must have the smallest possible probability to minimize the increase in entropy (and the drop in compression efficiency).
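To make that concrete, a quick numerical sketch (the base distribution is hypothetical): giving a new symbol probability $\varepsilon$ and rescaling the rest increases the entropy by roughly $-\varepsilon \log_2 \varepsilon$, which vanishes as $\varepsilon$ shrinks:

```python
from math import log2

def entropy(ps):
    """Shannon entropy in bits, ignoring zero probabilities."""
    return -sum(p * log2(p) for p in ps if p > 0)

base = [0.5, 0.25, 0.25]            # a hypothetical distribution, H = 1.5 bits
for eps in (0.1, 0.01, 0.001):
    # give the new symbol probability eps, scale the others to sum to 1
    extended = [p * (1 - eps) for p in base] + [eps]
    print(f"eps={eps}: entropy gain = {entropy(extended) - entropy(base):.4f}")
```

The gain is always positive but shrinks with $\varepsilon$, which is why we want unseen symbols to get the smallest probability we can manage.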

One strategy could be to start all frequency counts at $1$, rather than zero, then simply increment symbol frequencies as symbols are observed. How much does this bias the frequency estimation?

Let’s see. Suppose that $f_1, f_2, \ldots, f_n$ are the frequencies of the $n$ symbols of $A$, some of which are possibly zero, and let $N = f_1 + f_2 + \cdots + f_n$ be the total number of observations. Let $X$ be the random variable of interest. We have, ignoring the zero probabilities, that

$\hat{p}_i = P(X = a_i) = \dfrac{f_i}{N}$.

If $f_i = 0$ then $\hat{p}_i = 0$, and that’s it. If we started the counts at $1$ for all frequencies, the (estimated) probabilities become

$\tilde{p}_i = \dfrac{f_i + 1}{N + n}$.

If we compute the difference with the naïve probabilities, we find

$\tilde{p}_i - \hat{p}_i = \dfrac{f_i + 1}{N + n} - \dfrac{f_i}{N} = \dfrac{N - n f_i}{N(N + n)}$.
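As a quick sanity check (with hypothetical counts: $n = 4$ symbols, counts $f = (7, 2, 1, 0)$), the add-one estimates reproduce this difference exactly, and no symbol is left with probability zero:

```python
n = 4
f = [7, 2, 1, 0]          # hypothetical counts; the last symbol was never seen
N = sum(f)                # N = 10

naive   = [fi / N for fi in f]              # zero stays zero
add_one = [(fi + 1) / (N + n) for fi in f]  # counts started at 1

for fi, a, b in zip(f, naive, add_one):
    # the bias matches (N - n*fi) / (N*(N + n)) for every symbol
    assert abs((b - a) - (N - n * fi) / (N * (N + n))) < 1e-12

print(add_one)   # the unseen symbol now has probability 1/14
```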

Another strategy is to count frequencies as they are, then replace the zero frequencies by a count of 1. Suppose that there are $k$ zero frequencies. The probability of a(n initially non-zero) symbol is

$\bar{p}_i = \dfrac{f_i}{N + k}$,

where $N = f_1 + f_2 + \cdots + f_n$ is the total number of observations. If we calculate the difference between this new estimate and the naïve probabilities, we find

$\bar{p}_i - \hat{p}_i = \dfrac{f_i}{N + k} - \dfrac{f_i}{N} = \dfrac{-k f_i}{N(N + k)}$,

which is a smaller (or equal) difference, in absolute value, than with the other method whenever

$\dfrac{k f_i}{N + k} \leqslant \dfrac{|N - n f_i|}{N + n}$,

that is, unless $f_i$ falls close to the uniform frequency $N/n$ (and provided $N$ is larger than both $n$ and $k$, which is certain for any sizable corpus).
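To compare the two strategies side by side, here is a small sketch (hypothetical counts again: one unseen symbol out of four) that prints each observed symbol’s deviation from the naïve estimate under both corrections:

```python
f = [7, 2, 1, 0]                    # hypothetical counts; one unseen symbol
n, N = len(f), sum(f)
k = sum(1 for fi in f if fi == 0)   # number of zero frequencies (k = 1)

naive     = [fi / N for fi in f]
add_one   = [(fi + 1) / (N + n) for fi in f]        # start counts at 1
zeros_one = [max(fi, 1) / (N + k) for fi in f]      # replace zeros by 1

for i, fi in enumerate(f):
    if fi > 0:                                      # initially non-zero symbols
        d1 = abs(add_one[i] - naive[i])
        d2 = abs(zeros_one[i] - naive[i])
        # d2 also matches the closed form k*fi / (N*(N + k))
        assert abs(d2 - k * fi / (N * (N + k))) < 1e-12
        print(f"f={fi}: add-one bias {d1:.4f}, zeros-to-one bias {d2:.4f}")
```

For the very frequent and the very rare symbols the second strategy deviates less; for $f_i = 2$, which sits close to $N/n = 2.5$, the first one does.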

* * *

We could decrease the difference again, and preferably in a way that suits the code we will use. If we are to use arithmetic coding, we can assign an arbitrarily small probability to an unobserved symbol and it would not affect the overall coding performance. If we were to use Huffman coding instead, which has *grain*, for lack of a better term, we must make sure that the frequencies or probabilities are collectively small enough that all the symbols thought to have zero frequency are assigned the longest codes, and that they only push the least frequent observed symbol down one bit. (In Huffman coding, the two least frequent symbols have codes of the same length, and those two codes are longer than, or equal to, all other codes. To avoid having more frequent symbols’ codes increase in length, we must make sure that all the zero-frequency symbols are siblings of the least frequent non-zero-frequency symbol. Have a look back at the previous post to convince yourself that it must be so.)
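To see that grain effect concretely, here is a sketch (the standard heap-based Huffman construction over hypothetical counts, not the code from the previous post) that computes code lengths when the zero-frequency symbols are given a tiny weight:

```python
import heapq
import itertools

def huffman_code_lengths(weights):
    """Return each symbol's Huffman code length, by repeated min-merges."""
    tie = itertools.count()           # stable tie-breaker so tuples compare
    heap = [(w, next(tie), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:             # every merge adds one bit to its members
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, next(tie), s1 + s2))
    return lengths

counts = [40, 20, 10, 0, 0, 0]       # hypothetical: three symbols never seen
smoothed = [c if c > 0 else 0.1 for c in counts]
print(huffman_code_lengths(smoothed))   # zero-frequency symbols sink lowest
```

With the tiny weights, the three unseen symbols end up as the deepest leaves, and the only observed symbol whose code grows is the least frequent one; compare with `huffman_code_lengths([40, 20, 10])`, which gives lengths 1, 2 and 2.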