Text fitness (version 2)
This online calculator computes a fitness score: a measure of how similar a given text is to typical English text. It uses the frequencies of unigrams, bigrams and trigrams to calculate the score. The reference frequencies are calculated from the Leipzig Corpora Collection and can be downloaded from http://practicalcryptography.com/cryptanalysis/text-characterisation/quadgrams/. The lower the value, the better.
The calculator below is an alternative example of a fitness function that can be applied to a text. I've already created the Text fitness calculator, which uses logarithms of quadgram probabilities. This one uses a different approach, so maybe somebody will find it interesting.
The idea of the fitness function is the following:
- We take a large text corpus and count the occurrences of unigrams, bigrams and trigrams.
- Then we sum the counters to get the total number of unigrams, bigrams and trigrams.
- Then we calculate the reference frequency of each unigram, bigram and trigram (you can also call them probabilities).
- Then, in the same way, we calculate the frequencies within the target text to get the frequencies of its unigrams, bigrams and trigrams.
- Then we calculate the fitness score using the following formula:
where alpha, beta and gamma are the weights we assign to the importance of unigrams, bigrams and trigrams respectively. This implementation uses 1/6, 1/3 and 1/2, assigning most of the weight to trigrams.
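The steps above can be sketched in Python as follows. The counting and normalisation follow the list directly, and the weights are the 1/6, 1/3 and 1/2 mentioned above. The way each pair of frequency tables is compared (`distance`, half the sum of absolute differences) is an assumption made for illustration and is not necessarily the calculator's formula. The function names, the letters-only filtering and the use of a plain string as the reference corpus are likewise hypothetical.

```python
from collections import Counter

# Weights for unigrams, bigrams and trigrams, as stated in the text.
WEIGHTS = {1: 1 / 6, 2: 1 / 3, 3: 1 / 2}


def ngram_frequencies(text, n):
    """Relative frequencies of letter n-grams.

    ASSUMPTION: only the letters A-Z are kept, so n-grams run across
    spaces and punctuation.
    """
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    grams = ["".join(letters[i:i + n]) for i in range(len(letters) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def distance(ref, target):
    """ASSUMPTION: half the sum of absolute frequency differences.

    Gives 0 for identical tables and 1 for tables with no n-gram in common.
    """
    keys = set(ref) | set(target)
    return 0.5 * sum(abs(ref.get(k, 0.0) - target.get(k, 0.0)) for k in keys)


def fitness(reference_text, target_text):
    """Weighted sum of the per-order distances; lower means 'more similar'."""
    return sum(
        weight * distance(ngram_frequencies(reference_text, n),
                          ngram_frequencies(target_text, n))
        for n, weight in WEIGHTS.items()
    )
```

In practice the reference tables would be loaded once from precomputed n-gram counts rather than rebuilt from raw text on every call. Because the weights sum to 1, this particular sketch always yields a score between 0 and 1.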
A couple of words about the reference frequencies. Note that the first version of Text fitness uses data from Google Books Ngram statistics. This one, however, uses the frequencies available at Practical Cryptography, which are said there to be generated from the Leipzig Corpora Collection.
As in the first article, let's compare the fitness scores of several texts:
- A random article from the NYT (Source)
The score is 0.50
- The "To be or not to be" speech from Hamlet
The score is 0.49
- "Jabberwocky" by Lewis Carroll
The score is 0.62
- A random letter sequence produced by the Random Letters Generator
The score is 0.92
As you can see, the results are now meaningful. A computer can tell you that the NYT article is certainly "more English" than the random sequence of letters. This can be used in many applications, for example, in the automatic cracking of classical simple substitution ciphers (actually, this is why I need this function). Of course, like all statistical measures, it relies heavily on the text being "normal" English text. It fails miserably if the text's statistics differ from the norm, as in the widely known example from Simon Singh’s book "The Code Book":
From Zanzibar to Zambia to Zaire, ozone zones make zebras run zany zigzags
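To see why this sentence trips up a frequency-based measure, here is a small illustrative check in Python of how often the letter Z appears in it. The variable names are hypothetical and the snippet only counts letters; it does not compute the fitness score.

```python
from collections import Counter

quote = ("From Zanzibar to Zambia to Zaire, "
         "ozone zones make zebras run zany zigzags")

# Keep only the letters A-Z, ignoring case, spaces and punctuation.
letters = [c for c in quote.upper() if "A" <= c <= "Z"]
counts = Counter(letters)

z_share = counts["Z"] / len(letters)
print(f"Z: {counts['Z']} of {len(letters)} letters ({z_share:.1%})")
# prints "Z: 10 of 61 letters (16.4%)"
```

Z is the most frequent letter in the quote, whereas in ordinary English text it is one of the rarest letters, accounting for well under 1% of all letters, so the unigram frequencies alone are already far from the reference.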
You can play with this quote right below if you are interested.