In the last installment of this series, we had a look at Caesar’s cipher, an absurdly simple encryption technique where the symmetric encryption only consists in shifting symbols places.

While it’s ridiculously easy to break the cipher, even with pen-and-paper techniques, we ended up, last time, surmising that we should be able to crack the cipher automatically, without human intervention, if only we had a reasonable language model. This week, let us have a look at how we could build a very simple language model that does just that.

There are a lot of ways of building language models. In our case, we’re only interested in having something that gives a higher score to text that looks like reasonable text and very low scores to text that looks like gibberish. One way to do so, is to use a somewhat classical framework where we estimate the probability that we observe a given text fragment given it’s supposed to be, say, English.

That is, we want an estimation of

.

While we cannot really estimate the above directly, we can use, as a proxy, the following:

with having some default value, maybe “space.” That is, the model we will use is a *first order Markov chain*, where the probability of the entire sentence is estimated as the product of the conditional probabilities , the probability that is observed after ; the conditional probability that a given letter is observed after another. This leaves us with the much simpler task of estimating only .

*

* *

To estimate , we will simply construct a matrix , where is the number of distinct symbols in the language—let us say , and treat text as composed of bytes. Initially, all entries of are set to 1 (not zero, in order to deal with rare, but not impossible, observations, as we will make clearer in a moment). Then we scan a *large* amount of text, reading characters one by one. The pseudo-code looks something like

” ”

while read s:

We then normalize row-wise the matrix . Why row-wise? Because we transform each row of to a probability distribution . That is, we want:

,

where we account for all possible successors, and that all probabilities sum to exactly one. Which brings us back to the initial value of for the entries of the un-normalized . We could have set to zero initially, and after we fill it (by scanning the text), the entries that are still zero correspond not to impossible combinations, but only to *combinations that have not been observed* while filling the matrix. So it may be quite possible later on, when estimating the score of a new piece of text to encounter combinations not seen while “learning” the matrix. This would yield a very, very bad score for such a text. Well, a score of zero. We just want it to be unlikely, so we must settle for a small, but non-zero probability to prevent the product from being zero whenever we observe a combination not seen during the learning phase.

*

* *

Let us now have a look at the implementation of the estimation of .

// compile with --std=c++11 #include <iostream> #include <vector> #include <algorithm> #include <configuration> int main() { std::vector<std::vector<int>> matrix(charset_size, std::vector<int>(charset_size, 1)); size_t nb_read; do { unsigned char page[page_size]; std::cin.read((char*)page,page_size); nb_read=std::cin.gcount(); int last=0; for (int i=0;i<nb_read;i++) { unsigned char next=page[i]; if ( (next>=charset_min) && (next<=charset_max) ) { next-=charset_min; matrix[last][next]++; last=next; } else ; // skip, out of charset we're // interested in } } while (nb_read==page_size); for (int i=0;i<charset_size;i++) { size_t norm=0; for (int j=0;j<charset_size;j++) norm+=matrix[i][j]; for (int j=0;j<charset_size-1;j++) std::cout << matrix[i][j]/(double)norm << ' '; std::cout << matrix[i][charset_size-1]/(double)norm << std::endl; } return 0; }

where the `configuration` file contains

#ifndef __MODULE_CONFIGURATION__ #define __MODULE_CONFIGURATION__ const size_t page_size=1000000; // sure, why not? const size_t charset_min=0; const size_t charset_max=255; const size_t charset_size=charset_max-charset_min+1; #endif // __MODULE_CONFIGURATION__

…various details on how you want to deal with files and charsets. You can then invoke

bzcat blob.txt.bz2 | make-matrix > matrix.txt

where `blob.txt.bz2` is a large text file assembled from files from the Project Gutenberg.

It’s interesting to look at the matrix graphically. If we look at it as a histogram, we get something like

If we look at it as a (log) density plot, we get something like

*

* *

Evaluating scores, once the matrix is computed (and normalized), is rather straightforward. The implementation of

translates into

double product_score( std::string text, const std::vector<std::vector<double>> & matrix) { double product=1.0; int last=0; for (int i=0;i<text.size();i++) { unsigned char next=text[i]; if ( (next>=charset_min) && (next<=charset_max) ) { next-=charset_min; product*=matrix[last][next]; last=next; } } return product; }

However, if the string is rather long, the above implementation will just give *zero*. To avoid the problem, we observe that

,

an expression that should be more stable, numerically. We get:

double log_score( std::string text, const std::vector<std::vector<double>> & matrix) { double sum_log=0; int last=0; for (int i=0;i<text.size();i++) { unsigned char next=text[i]; if ( (next>=charset_min) && (next<=charset_max) ) { next-=charset_min; sum_log+=std::log(matrix[last][next]); last=next; } } return sum_log; }

And we’re ready to assemble everything to form an estimator of how likely is the current piece of text to be actual, readable, English. Let us now apply this to the automatic breaking of Caesar’s cipher… next week.

[…] To be continued… […]

[…] the last installment of this series, we looked at Markov chains as a mean of estimating the likelihood of a given piece […]

[…] time ago, we presented the Caesar Cipher, developed a simple language model that allowed us to break the cipher relatively easily. This week, we will look at (simple) […]

[…] if we had a language model that helps us decide if a tentative deciphering makes sense? Oh, right, we do have one. And since we can use it as a “fitness” function, we can apply genetic programming to […]