# Substitution cipher breaker

This online calculator tries to decode substitution cipher without knowing the key. It uses genetic algorithm over text fitness function to break the encoded text

This content is licensed under Creative Commons Attribution/Share-Alike License 3.0 (Unported). That means you may freely redistribute or modify this content under the same license conditions and must attribute the original author by placing a hyperlink from your site to this work https://planetcalc.com/8047/. Also, please do not modify any references to the original work (if any) contained in this content.

The calculator below tries to automatically decode the text enciphered with the simple substitution cipher without knowing the key. The calculator logic is explained below the calculator.

In cryptography, a substitution cipher is a method of encrypting by which units of plaintext are replaced with ciphertext, according to a fixed system; the "units" may be single letters (the most common), pairs of letters, triplets of letters, mixtures of the above, and so forth. The receiver deciphers the text by performing the inverse substitution. Substitution of single letters separately —

simple substitution— can be demonstrated by writing out the alphabet in some order to represent the substitution. It is a cipher key, and it is also called asubstitution alphabet.^{1}

For simple substitution cipher, the set of all possible keys is the set of all possible permutations. Thus, for English alphabet, the number of keys is 26! (factorial of 26), which is about . Because of this, if you want to decipher the text without knowing the key, brute force approach is out of the question.

However, the simple substitution cipher is considered as a weak cipher, because it is vulnerable to cryptoanalysis. First of all, substitution does not change frequencies of the letters, so, if you have a decent amount of enciphered text and you know the language it was written in, you can try frequency analysis. For example, most common letter in English language is E, so, most common letter in the encrypted text is probable the E substitution. Analyst also looks for frequencies of bigrams and trigrams, because some unigram frequencies are too close to each other to rely on them. Using frequencies analyst can create trial keys and test them to see if they reveal some words and phases in the encrypted text.

But this manual approach is time-consuming, so the goal of automated solution is to exclude human from the process of breaking the cipher. And it is possible due to another simple substitution cipher vulnerability, known as **Utility of Partial Solution**.

In other words, if there are many pairs of keys in the keyspace where the decryption of the ciphertext by the key more similar to the correct key more closely resembles the plaintext than the decryption of the ciphertext by the other key, the cipher has Utility of Partial Solutions... If there is a correlation between the degree to which a key resembles the correct key and the degree to which that key's decryption of the ciphertext resembles the plaintext, it should be possible to search the keyspace efficiently by quickly discarding keys that are "worse" than whatever key is the closest match at any moment, climbing ever closer to the optimal key without knowing it initially. More specially, these keyspaces can be searched via Stochastic Optimization Algorithms.

^{2}

The tricky part here is how you can measure if one key is "worse" than another. To address this, we need **text fitness** which gives us some sort of score on how given text looks like typical English text. There are different approaches, and I've tried this and that, but one which worked for me is outlined here: Text fitness (version 3). In short, it uses the sum of log probabilities of quadgrams, and compares the sum with the sum for the "normal" English text (created as the sum of log probabilities of the most often English quadgrams). Here I'd like to thank Jens Guballa (site), author of another substitution solver, who kindly gives me the hint that text fitness function should be "normalized".

The implementation below uses genetic algorithm to search for correct key. If it fails, you can try to repeat couple of times (each time it starts from set of random keys as initial generation) or tweak the settings, for example, increase the number of generations. Just click the **Details** to reveal additional settings. In this mode, calculator also displays best key in each generation, which is quite curious to watch.

If you see that the found key is close to the correct one, but want to tweak couple of letters, you may want to use Substitution Cipher Tool to manually test the keys.

## Comments