Breaking Caesar’s Cipher (part III)

In the last installment of this series, we looked at Markov chains as a mean of estimating the likelihood of a given piece of text of actually being a message, written in English, rather than mere gibberish.

This week, we finally piece everything together to obtain a program to crack Caesar’s cipher without (much) human intervention.

So the Markov chain model gives us a way of assigning a score to a piece of text as a probability:

\displaystyle P(sentence~s~is~English~text) \propto \prod_{i=1}^n P(s_i|s_{i-1}),

according to a model that estimates the likeliness of a symbol following an other. We saw how we estimate this probability last week. It basically reduces to scan a large body of text (the larger the better) and estimating frequencies, then converting frequencies into probabilities. Once the score engine is ready, we can finally assemble everything together and solve for the Caesar code automagically.

Cycling through keys and getting a score:

#!/usr/bin/env bash

key=9
message="attack the gauls at dawn"

for ((k=0;k<25;k++))
do
    cesar $k $(cesar $key "$message") 
done | compute-score matrix.txt

Producing:

product=7.7064e-69  log=-156.836 jccjltcqnpjdubjcmjfw
product=1.10597e-54 log=-124.239 kddkmudroqkevckdnkgx
product=1.08628e-38 log=-87.4155 leelnvesprlfwdleolhy
product=2.74988e-54 log=-123.328 mffmowftqsmgxemfpmiz
product=5.62482e-54 log=-122.612 nggnpxgurtnhyfngqnja
product=1.46081e-50 log=-114.75  ohhoqyhvsuoizgohrokb
product=4.27383e-51 log=-115.979 piiprziwtvpjahpisplc
product=6.22185e-75 log=-170.866 qjjqsajxuwqkbiqjtqmd
product=2.27773e-56 log=-128.122 rkkrtbkyvxrlcjrkurne
product=8.98141e-43 log=-96.816  sllsuclzwysmdkslvsof
product=7.42025e-57 log=-129.243 tmmtvdmaxztneltmwtpg
product=6.0041e-45  log=-101.824 unnuwenbyauofmunxuqh
product=7.90349e-53 log=-119.97  voovxfoczbvpgnvoyvri
product=5.10408e-62 log=-141.13  wppwygpdacwqhowpzwsj
product=3.1834e-69  log=-157.72  xqqxzhqebdxripxqaxtk
product=5.26779e-49 log=-111.165 yrryairfceysjqyrbyul
product=3.82049e-61 log=-139.117 zsszbjsgdfztkrzsczvm
product=4.59541e-32 log=-72.1577 attackthegaulsatdawn
product=3.90715e-53 log=-120.674 buubdluifhbvmtbuebxo
product=6.96003e-65 log=-147.728 cvvcemvjgicwnucvfcyp
product=2.09647e-66 log=-151.23  dwwdfnwkhjdxovdwgdzq
product=1.66271e-34 log=-77.7794 exxegoxlikeypwexhear
product=7.40958e-65 log=-147.665 fyyfhpymjlfzqxfyifbs
product=2.26381e-56 log=-128.128 gzzgiqznkmgarygzjgct
product=1.10932e-50 log=-115.026 haahjraolnhbszhakhdu

On the first column, we have scores yielded by the product formula, the second by the log-sum, and finally, the corresponding tentative decryption. In this form, we must scan for the maximum probability (or maximum log-sum) to discover the best candidate. It shows the correctly decrypted text at position k=17 (and indeed 17+9 \equiv 0 (mod 26), it’s the correct inverse—and it’s not like the key in itself has any value). It seems to be working correctly.

It’s easier if we sort it out:

#!/usr/bin/env bash

key=9
message="attack the gauls at dawn"

for ((k=0;k<25;k++))
do
    cesar $k $(cesar $key "$message") 
done | compute-score matrix.txt \
     | sort --numeric-sort \
            --reverse \
            --field-separator '=' \
            --key 3

and now:

product=4.59541e-32 log=-72.1577 attackthegaulsatdawn
product=1.66271e-34 log=-77.7794 exxegoxlikeypwexhear
product=1.08628e-38 log=-87.4155 leelnvesprlfwdleolhy
product=8.98141e-43 log=-96.816  sllsuclzwysmdkslvsof
product=6.0041e-45  log=-101.824 unnuwenbyauofmunxuqh
product=5.26779e-49 log=-111.165 yrryairfceysjqyrbyul
product=1.46081e-50 log=-114.75  ohhoqyhvsuoizgohrokb
product=1.10932e-50 log=-115.026 haahjraolnhbszhakhdu
product=4.27383e-51 log=-115.979 piiprziwtvpjahpisplc
product=7.90349e-53 log=-119.97  voovxfoczbvpgnvoyvri
product=3.90715e-53 log=-120.674 buubdluifhbvmtbuebxo
product=5.62482e-54 log=-122.612 nggnpxgurtnhyfngqnja
product=2.74988e-54 log=-123.328 mffmowftqsmgxemfpmiz
product=1.10597e-54 log=-124.239 kddkmudroqkevckdnkgx
product=2.27773e-56 log=-128.122 rkkrtbkyvxrlcjrkurne
product=2.26381e-56 log=-128.128 gzzgiqznkmgarygzjgct
product=7.42025e-57 log=-129.243 tmmtvdmaxztneltmwtpg
product=3.82049e-61 log=-139.117 zsszbjsgdfztkrzsczvm
product=5.10408e-62 log=-141.13  wppwygpdacwqhowpzwsj
product=7.40958e-65 log=-147.665 fyyfhpymjlfzqxfyifbs
product=6.96003e-65 log=-147.728 cvvcemvjgicwnucvfcyp
product=2.09647e-66 log=-151.23  dwwdfnwkhjdxovdwgdzq
product=7.7064e-69  log=-156.836 jccjltcqnpjdubjcmjfw
product=3.1834e-69  log=-157.72  xqqxzhqebdxripxqaxtk
product=6.22185e-75 log=-170.866 qjjqsajxuwqkbiqjtqmd

As we go down the sorted list, we see that solutions become progressively exponentially less likely; at least as estimated by the language model. This means that we only have to have a look at the very first few to ascertain that the decoding has succeeded.

And this means we can use the language model (despite its needing refinement, no doubt) for other ciphers/algorithms. And maybe we should.

To Be continued…

3 Responses to Breaking Caesar’s Cipher (part III)

  1. Desmond Holt says:

    The base of the logarithm is not specified here, because the result only changes by a constant factor when another base is used. A constant factor, is usually disregarded in the analysis of algorithms under the standard uniform cost model .

  2. […] time ago, we presented the Caesar Cipher, developed a simple language model that allowed us to break the cipher relatively easily. This week, we will look at (simple) substitution […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: