
# Text fitness (version 3)

This online calculator computes a measure, known as a fitness score, of how similar a given text is to typical English text. It uses normalized logarithms of quadgram probabilities to calculate the fitness score. The reference frequencies are derived from the Leipzig Corpora Collection and can be downloaded from http://practicalcryptography.com/cryptanalysis/text-characterisation/quadgrams/. The lower the value, the better.

The calculator below is yet another example of a fitness function that can be applied to text. I have already created the Text fitness and Text fitness (version 2) calculators, but they failed to achieve my goal: a fitness function good enough to be used for automatic substitution cipher breaking. I tried both with different implementations, namely the hill climbing algorithm and the genetic algorithm, but achieved only moderate results: pretty good on long texts, but almost useless on short ones.

Finally, I asked Jens Guballa, the author of one of the best online substitution cipher breakers (here is the link), about the fitness function he used. He responded that it is still the sum of log probabilities of quadgrams, normalized. "Normalized" was the key. The first version of my function did use the sum of log probabilities, but technically a lower value of it does not guarantee that the text is closer to English. It does guarantee that the text consists of the most frequently used quadgrams, but this is not the same thing.

So I decided to normalize my fitness function. To do this, I calculate the normal value of the fitness function (i.e. the value for "normal" English text) as the sum of the log probabilities of the top N most frequently used quadgrams, divided by N:
$f_{normal}=\frac{\sum_{i=1}^{N} \log(p(quadgram_i))}{N}$

Then my fitness function looks like this:
$f'=\frac{|f - f_{normal}|}{f_{normal}}$,
where f is the sum of the log probabilities of all K quadgrams in the given text, divided by K:
$f=\frac{\sum_{i=1}^{K} \log(p(quadgram_i))}{K}$
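The formulas above can be sketched in Python roughly as follows. This is only an illustration under several assumptions that the article does not spell out: non-letters are stripped before quadgrams are taken, quadgrams missing from the table get a hypothetical `floor` log probability, and the denominator uses the absolute value of $f_{normal}$ (log probabilities are negative, while the scores reported here are positive). The function names, the `top_n` and `floor` parameters, and the toy table are all made up for the example.

```python
import re

# Made-up log10 probabilities, for illustration only (not real corpus values).
TOY_LOG_PROBS = {"TION": -2.0, "THAT": -2.5, "NTHE": -2.5, "HERE": -3.0}

def quadgrams(text):
    """Return all overlapping 4-letter windows of the A-Z letters in text."""
    letters = re.sub(r"[^A-Z]", "", text.upper())
    return [letters[i:i + 4] for i in range(len(letters) - 3)]

def normal_value(log_probs, top_n):
    """f_normal: mean log probability of the top_n most frequent quadgrams."""
    top = sorted(log_probs.values(), reverse=True)[:top_n]
    return sum(top) / len(top)

def fitness(text, log_probs, top_n, floor):
    """Normalized score f' = |f - f_normal| / |f_normal|; lower is better.

    floor is an assumed log probability for quadgrams absent from log_probs.
    """
    grams = quadgrams(text)
    if not grams:
        raise ValueError("text needs at least four letters")
    # f: mean log probability over all K quadgrams of the text.
    f = sum(log_probs.get(g, floor) for g in grams) / len(grams)
    f_normal = normal_value(log_probs, top_n)
    # Assumption: divide by |f_normal| so that the score is non-negative.
    return abs(f - f_normal) / abs(f_normal)

# A text made of a frequent quadgram scores lower (better) than an unseen one.
print(fitness("TION", TOY_LOG_PROBS, top_n=2, floor=-10.0))
print(fitness("ZZZZ", TOY_LOG_PROBS, top_n=2, floor=-10.0))
```

Because the score is a ratio of log probabilities, the base of the logarithm cancels out, provided the `floor` value is expressed in the same base as the table.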

This function was quite successful at breaking even short texts enciphered with a substitution cipher. Note that it uses the frequencies available at Practical Cryptography, which states that they were generated from the Leipzig Corpora Collection.
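For readers who want to experiment with those frequencies, here is a minimal sketch of turning a quadgram count list into log probabilities. It assumes each line holds a quadgram and its count separated by whitespace (for example `TION 13168375`), which should be checked against the downloaded file. The function name `load_log_probs` and the choice of base 10 are assumptions, not something the article specifies.

```python
import math

def load_log_probs(lines):
    """Parse 'QUADGRAM COUNT' lines into a dict of log10 probabilities."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        gram, count = parts
        counts[gram.upper()] = int(count)
    total = sum(counts.values())
    # p(quadgram) = count / total; zero counts are skipped to avoid log(0).
    return {g: math.log10(c / total) for g, c in counts.items() if c > 0}

# Hypothetical usage, assuming the file was saved locally under this name:
# with open("english_quadgrams.txt") as fh:
#     log_probs = load_log_probs(fh)
```

The function accepts any iterable of lines, so it can be tried on a short list of strings before pointing it at the real file.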

As in the previous articles, let's compare the fitness scores for some texts:

1. A random article from NYT (Source)

The score is 1.18.

2. The "To be or not to be" speech from Hamlet

The score is 1.26.

3. JABBERWOCKY by Lewis Carroll

The score is 1.59.

4. A random letter sequence produced by Random Letters Generator

The score is 5.27.

As you can see, the results are now meaningful. A computer can tell that the NYT article is certainly "more English" than a random sequence of letters. This can be used in a number of applications, for example the automatic cracking of classical simple substitution ciphers (which is actually why I needed this function). Of course, like all statistical measures, this one relies heavily on the text being "normal" English. It fails miserably if the text statistics differ from normal, as in the widely known example from Simon Singh’s book "The Code Book":

From Zanzibar to Zambia to Zaire, ozone zones make zebras run zany zigzags

You can play with this quote right below, if you are interested.