In the last installment of this series, we looked at Markov chains as a mean of estimating the likelihood of a given piece of text of actually being a message, written in English, rather than mere gibberish.

This week, we finally piece everything together to obtain a program to crack Caesar’s cipher without (much) human intervention.

So the Markov chain model gives us a way of assigning a score to a piece of text as a probability:

,

according to a model that estimates the likeliness of a symbol following an other. We saw how we estimate this probability last week. It basically reduces to scan a large body of text (the larger the better) and estimating frequencies, then converting frequencies into probabilities. Once the score engine is ready, we can finally assemble everything together and solve for the Caesar code automagically.

Cycling through keys and getting a score:

#!/usr/bin/env bash key=9 message="attack the gauls at dawn" for ((k=0;k<25;k++)) do cesar $k $(cesar $key "$message") done | compute-score matrix.txt

Producing:

product=7.7064e-69 log=-156.836 jccjltcqnpjdubjcmjfw product=1.10597e-54 log=-124.239 kddkmudroqkevckdnkgx product=1.08628e-38 log=-87.4155 leelnvesprlfwdleolhy product=2.74988e-54 log=-123.328 mffmowftqsmgxemfpmiz product=5.62482e-54 log=-122.612 nggnpxgurtnhyfngqnja product=1.46081e-50 log=-114.75 ohhoqyhvsuoizgohrokb product=4.27383e-51 log=-115.979 piiprziwtvpjahpisplc product=6.22185e-75 log=-170.866 qjjqsajxuwqkbiqjtqmd product=2.27773e-56 log=-128.122 rkkrtbkyvxrlcjrkurne product=8.98141e-43 log=-96.816 sllsuclzwysmdkslvsof product=7.42025e-57 log=-129.243 tmmtvdmaxztneltmwtpg product=6.0041e-45 log=-101.824 unnuwenbyauofmunxuqh product=7.90349e-53 log=-119.97 voovxfoczbvpgnvoyvri product=5.10408e-62 log=-141.13 wppwygpdacwqhowpzwsj product=3.1834e-69 log=-157.72 xqqxzhqebdxripxqaxtk product=5.26779e-49 log=-111.165 yrryairfceysjqyrbyul product=3.82049e-61 log=-139.117 zsszbjsgdfztkrzsczvm product=4.59541e-32 log=-72.1577 attackthegaulsatdawn product=3.90715e-53 log=-120.674 buubdluifhbvmtbuebxo product=6.96003e-65 log=-147.728 cvvcemvjgicwnucvfcyp product=2.09647e-66 log=-151.23 dwwdfnwkhjdxovdwgdzq product=1.66271e-34 log=-77.7794 exxegoxlikeypwexhear product=7.40958e-65 log=-147.665 fyyfhpymjlfzqxfyifbs product=2.26381e-56 log=-128.128 gzzgiqznkmgarygzjgct product=1.10932e-50 log=-115.026 haahjraolnhbszhakhdu

On the first column, we have scores yielded by the product formula, the second by the log-sum, and finally, the corresponding tentative decryption. In this form, we must scan for the maximum probability (or maximum log-sum) to discover the best candidate. It shows the correctly decrypted text at position (and indeed (mod 26), it’s the correct inverse—and it’s not like the key in itself has any value). It seems to be working correctly.

It’s easier if we sort it out:

#!/usr/bin/env bash key=9 message="attack the gauls at dawn" for ((k=0;k<25;k++)) do cesar $k $(cesar $key "$message") done | compute-score matrix.txt \ | sort --numeric-sort \ --reverse \ --field-separator '=' \ --key 3

and now:

product=4.59541e-32 log=-72.1577 attackthegaulsatdawn product=1.66271e-34 log=-77.7794 exxegoxlikeypwexhear product=1.08628e-38 log=-87.4155 leelnvesprlfwdleolhy product=8.98141e-43 log=-96.816 sllsuclzwysmdkslvsof product=6.0041e-45 log=-101.824 unnuwenbyauofmunxuqh product=5.26779e-49 log=-111.165 yrryairfceysjqyrbyul product=1.46081e-50 log=-114.75 ohhoqyhvsuoizgohrokb product=1.10932e-50 log=-115.026 haahjraolnhbszhakhdu product=4.27383e-51 log=-115.979 piiprziwtvpjahpisplc product=7.90349e-53 log=-119.97 voovxfoczbvpgnvoyvri product=3.90715e-53 log=-120.674 buubdluifhbvmtbuebxo product=5.62482e-54 log=-122.612 nggnpxgurtnhyfngqnja product=2.74988e-54 log=-123.328 mffmowftqsmgxemfpmiz product=1.10597e-54 log=-124.239 kddkmudroqkevckdnkgx product=2.27773e-56 log=-128.122 rkkrtbkyvxrlcjrkurne product=2.26381e-56 log=-128.128 gzzgiqznkmgarygzjgct product=7.42025e-57 log=-129.243 tmmtvdmaxztneltmwtpg product=3.82049e-61 log=-139.117 zsszbjsgdfztkrzsczvm product=5.10408e-62 log=-141.13 wppwygpdacwqhowpzwsj product=7.40958e-65 log=-147.665 fyyfhpymjlfzqxfyifbs product=6.96003e-65 log=-147.728 cvvcemvjgicwnucvfcyp product=2.09647e-66 log=-151.23 dwwdfnwkhjdxovdwgdzq product=7.7064e-69 log=-156.836 jccjltcqnpjdubjcmjfw product=3.1834e-69 log=-157.72 xqqxzhqebdxripxqaxtk product=6.22185e-75 log=-170.866 qjjqsajxuwqkbiqjtqmd

As we go down the sorted list, we see that solutions become progressively exponentially less likely; at least as estimated by the language model. This means that we only have to have a look at the very first few to ascertain that the decoding has succeeded.

And this means we can use the language model (despite its needing refinement, no doubt) for other ciphers/algorithms. And maybe we should.

*To Be continued…*

[…] To be continued… […]

The base of the logarithm is not specified here, because the result only changes by a constant factor when another base is used. A constant factor, is usually disregarded in the analysis of algorithms under the standard uniform cost model .

[…] time ago, we presented the Caesar Cipher, developed a simple language model that allowed us to break the cipher relatively easily. This week, we will look at (simple) substitution […]