Text fitness (version 3)
This online calculator computes a fitness score, a measure of how similar a given text is to typical English text. It uses normalized logarithms of quadgram probabilities to compute the score. The reference frequencies are calculated from the Leipzig Corpora Collection and can be downloaded from http://practicalcryptography.com/cryptanalysis/text-characterisation/quadgrams/. The lower the value, the better.
The calculator below is another example of a fitness function that can be applied to text. I've already created the Text fitness and Text fitness (version 2) calculators, but they failed to achieve my goal: a fitness function good enough to be used for automatic substitution cipher breaking. I tried them with different implementations, namely a hill-climbing algorithm and a genetic algorithm, but achieved only moderate results: pretty good on long texts but almost useless on short ones.
Finally, I asked the author of one of the best online substitution cipher breakers, Jens Guballa (here is the link), about the fitness function he used. He responded that it is still the sum of log probabilities of quadgrams, normalized. "Normalized" was the key. The first version of my function did use the sum of log probabilities, but technically, a lower value does not guarantee that the text is closer to English. It does guarantee that the text consists of the most frequently used quadgrams, but this is not the same thing.
So I decided to normalize my fitness function. To do this, I calculate the normal value of the fitness function (i.e., the value for "normal" English text) as the sum of the log probabilities of the top N most frequently used quadgrams, divided by N.
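The normal value described above can be sketched in Python as follows. This is a minimal illustration, not the calculator's actual code: it assumes the frequency file holds one `QUADGRAM COUNT` pair per line, uses base-10 logarithms (the base is not stated here), and the names `load_log_probs`, `normal_value`, and the toy counts are hypothetical.

```python
import math

def load_log_probs(lines):
    """Parse 'QUADGRAM COUNT' lines into a dict of log10 probabilities.

    The line format and the log base are assumptions for this sketch.
    """
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        counts[parts[0].upper()] = int(parts[1])
    total = sum(counts.values())
    return {gram: math.log10(c / total) for gram, c in counts.items()}

def normal_value(log_probs, n):
    """Average log probability of the n most frequent quadgrams."""
    top = sorted(log_probs.values(), reverse=True)[:n]
    return sum(top) / len(top)

# Toy counts standing in for the real frequency file (made-up numbers).
sample = ["TION 500", "THER 300", "NTHE 150", "ZZZZ 50"]
log_probs = load_log_probs(sample)
norm = normal_value(log_probs, 2)  # average over the top 2 quadgrams
```

With a real frequency file, `lines` could simply be the open file object, and N would be chosen much larger than 2.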
Then my fitness function looks like this
where f is the sum of the log probabilities of all K quadgrams in the given text.
This function was quite successful in breaking even short texts enciphered with a substitution cipher. Note that it uses the frequencies available at Practical Cryptography. According to that site, the frequencies were generated from the Leipzig Corpora Collection.
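To make the idea concrete, here is a rough Python sketch of a normalized quadgram score. The normalization used, the absolute deviation of the per-quadgram average f/K from the normal value, is an assumption chosen so that lower means "closer to normal English"; it is not necessarily the calculator's exact formula. The `floor` value for unseen quadgrams, the function names, and the toy table are likewise hypothetical.

```python
def quadgrams(text):
    """Yield every overlapping 4-letter window over the text's A-Z letters."""
    letters = "".join(ch for ch in text.upper() if "A" <= ch <= "Z")
    for i in range(len(letters) - 3):
        yield letters[i:i + 4]

def fitness(text, log_probs, norm, floor):
    """Score a text; lower means closer to the normal value.

    log_probs: quadgram -> log probability
    norm:      normal value (average log probability of the top N quadgrams)
    floor:     log probability assumed for quadgrams missing from the table
    """
    grams = list(quadgrams(text))
    if not grams:
        raise ValueError("text needs at least four letters")
    f = sum(log_probs.get(g, floor) for g in grams)  # log probabilities sum
    k = len(grams)                                   # number of quadgrams
    return abs(f / k - norm)  # ASSUMED normalization, see the note above

# Toy table with made-up values, only to show the call shape.
toy_probs = {"TION": -1.0, "IONS": -2.0}
score_english_like = fitness("tions", toy_probs, norm=-1.5, floor=-6.0)
score_gibberish = fitness("ZZZZZ", toy_probs, norm=-1.5, floor=-6.0)
```

Dividing by K keeps the score comparable across texts of different lengths, and taking the deviation from the normal value penalizes text that is either too rare or too heavily built from the most common quadgrams.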
As in the previous articles, let's compare the fitness scores of some texts:
- A random article from the NYT (Source)
The score is 1.18
- The "To be or not to be" speech from Hamlet.
The score is 1.26
- "Jabberwocky" by Lewis Carroll
The score is 1.59
- A random letter sequence produced by the Random Letters Generator
The score is 5.27
As you can see, now we have some meaning in the results. The computer can tell you that the NYT article is certainly "more English" than a random sequence of letters. This can be used in many applications, for example, in automatic cracking of classical simple substitution ciphers (actually, this is why I need this function). Of course, like all statistical measures, this relies heavily on the text being "normal" English text. It fails miserably if the text statistics differ from normal, as in the widely known example from Simon Singh's book "The Code Book":
From Zanzibar to Zambia to Zaire, ozone zones make zebras run zany zigzags
You can play with this quote right below if you are interested.