Some time ago, we presented the Caesar Cipher, developed a simple language model that allowed us to break the cipher relatively easily. This week, we will look at (simple) substitution ciphers.

Superficially, substitution ciphers seem much stronger than Caesar’s

cipher because, rather than just using shifting of the alphabet, it uses an

arbitrary substitution, for example,

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

↓

H W P G S K Z B Q L C O U Y D I X E J F N M A V R T

Which encodes A as H, B as W, etc.

This produces reasonably garbled messages, at least at first sight. The clear text

The Executive Magistrate of the American Union, unmindful of his obligation to execute the laws for the equal benefit of his fellow citizens, has sanctioned a censorship of the press, by which papers incompatible with the compact are excluded from the southern mails, and he has officially advised Congress to do by law, although in violation of the Constitution, what he had himself virtually done already in despite of both.

could be encoded as

hun ndnxqhmcn obrmihvbhn pk hun bonvmxbl qlmpl, qlomlfkqt pk umi pjtmrbhmpl hp ndnxqhn hun tbgi kpv hun nsqbt jnlnkmh pk umi knttpg xmhmynli, ubi iblxhmplnf b xnlipviumz pk hun zvnii, ja gumxu zbznvi mlxpozbhmjtn gmhu hun xpozbxh bvn ndxtqfnf kvpo hun ipqhunvl obmti, blf un ubi pkkmxmbtta bfcminf xplrvnii hp fp ja tbg, bthupqru ml cmptbhmpl pk hun xplihmhqhmpl, gubh un ubf umointk cmvhqbtta fpln btvnbfa ml fnizmhn pk jphu.

for some random substitution. While this looks rather garbled, it takes relatively little work to decipher. First, we have statistical analysis and knowledge about the English language (if we do suspect it’s written in English) that let us guess that “hun” is most likely “the” (since it repeats a number of times), “ubi” is probably “has” or “and”, and that gives us groups of three letters against which we can start. We can, using frequency analysis, tentatively guess other letters, then using pattern matching, and soon recover the message.

*

* *

Implementing a substitution cipher cleanly is a matter of a few lines of code. A C++ implementation looks something like:

class substitution { private: char forward[26]; char inverse[26]; public: char encode(char x) const { if (!std::islower(x)) return x; return forward[x-'a']; } char decode(char x) const { if (!std::islower(x)) return x; return inverse[x-'a']; } substitution(my::random::random_base<unsigned> & r) { // "If there is no null byte among the first n // bytes of src, the string placed in dest will // not be null-terminated." std::strncpy(forward,"abcdefghijklmnopqrstuvwxyz",26); // Fisher-Yates for (int i=25;i;i--) std::swap(forward[i],forward[r % i]); // and the inverse... for (int i=0;i<26;i++) inverse[forward[i]-'a']='a'+i; } };

where `my::random::random_base & r` is a base class for (uniform?) random number generators, and `r` controls the generation of the substitution (the seed of the generator is the “key”).

As most often, everything happens in the constructor. The first part uses the Fisher-Yates shuffling to generate a random (forward) substitution. The second part, somewhat tricky, computes the inverse, very elegantly so, for ‘a’ to ‘z’.

*

* *

Now, how can we automate the the deciphering of (simple) substitution ciphers? Well, we can use frequency analysis, as hinted earlier, or we can use a language model to help us look for a suitable (inverse) substitution. Maybe we could use genetic programming?

*To be continued…*

Anyone with mad substitution-cryptography skills and a love of sci-fi is welcome to attempt my moderately difficult interstellar cryptography puzzle at

https://thismoonlesssky.wordpress.com/2013/11/05/a-little-game-to-start-with-easy-interstellar-cryptography-puzzle/

[…] while ago, we had a look at a (simple) substitution cipher. Recall, this cipher proceeds by, as its name suggests, substituting one letter for another. For […]