
# Text fitness (version 2)

This online calculator computes a fitness score, a measure of how similar a given text is to typical English text. It uses the frequencies of unigrams, bigrams and trigrams to calculate the score. The reference frequencies are calculated from the Leipzig Corpora Collection and can be downloaded from http://practicalcryptography.com/cryptanalysis/text-characterisation/quadgrams/. The lower the score, the better.

The calculator below is an alternative example of a fitness function that can be applied to text. I've already created the Text fitness calculator, which uses logarithms of quadgram probabilities. This one uses a different approach, so maybe somebody will find it interesting.

The idea of the fitness function is the following:

1. We take a large text corpus and count the occurrences of unigrams $C^u_i$, bigrams $C^b_{ij}$ and trigrams $C^t_{ijk}$.
2. Then we sum all the counters to get the totals for unigrams $S_u$, bigrams $S_b$ and trigrams $S_t$.
3. Then we calculate the reference frequency of each unigram $R^u_i=\frac{C^u_i}{S_u}$, bigram $R^b_{ij}=\frac{C^b_{ij}}{S_b}$ and trigram $R^t_{ijk}=\frac{C^t_{ijk}}{S_t}$ (you can also call them probabilities).
4. Then, in the same way, we calculate the frequencies within the target text to get the partial frequencies for unigrams $P^u_i$, bigrams $P^b_{ij}$ and trigrams $P^t_{ijk}$.
5. Then we calculate the fitness score using the following formula:
$f=\alpha \sum (R^u_i - P^u_i) + \beta \sum (R^b_{ij} - P^b_{ij}) + \gamma \sum (R^t_{ijk} - P^t_{ijk})$
where $\alpha$, $\beta$ and $\gamma$ are the weights we assign to the importance of unigrams, bigrams and trigrams respectively. This implementation uses 1/6, 1/3 and 1/2, assigning most of the weight to trigrams.
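
The steps above can be sketched in Python roughly as follows. This is only an illustration under several assumptions that the article does not state: the differences are taken in absolute value (signed differences between two complete frequency distributions would simply cancel to zero), the sums run over every n-gram seen in either the reference or the target, and spaces and punctuation are dropped before counting. The names `ngram_frequencies`, `distance`, `fitness` and the tiny `REFERENCE_TEXT` are hypothetical; a real reference would come from a large corpus, so this sketch is not expected to reproduce the calculator's scores.

```python
from collections import Counter

# Hypothetical stand-in for a large corpus (assumption, for illustration only).
REFERENCE_TEXT = (
    "the quick brown fox jumps over the lazy dog "
    "and then the cat sat on the mat"
)


def ngram_frequencies(text, n):
    """Relative frequencies of letter n-grams (steps 1-4 of the list above)."""
    # Assumption: only the letters A-Z are kept, so n-grams span word boundaries.
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    grams = ["".join(letters[i:i + n]) for i in range(len(letters) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def distance(reference, observed):
    """Sum of absolute frequency differences over all n-grams seen in either set."""
    keys = set(reference) | set(observed)
    return sum(abs(reference.get(k, 0.0) - observed.get(k, 0.0)) for k in keys)


def fitness(text, references, weights=(1 / 6, 1 / 3, 1 / 2)):
    """Weighted sum of unigram, bigram and trigram distances (step 5); lower is better."""
    return sum(
        weight * distance(references[n], ngram_frequencies(text, n))
        for n, weight in zip((1, 2, 3), weights)
    )


# Reference frequencies for n = 1, 2, 3, built from the stand-in corpus.
references = {n: ngram_frequencies(REFERENCE_TEXT, n) for n in (1, 2, 3)}

print(fitness("the cat sat", references))
print(fitness("zzzzqqqq", references))
```

With these assumptions the score lies between 0 (the target matches the reference frequencies exactly) and 2 (no n-grams in common), because the three weights add up to 1 and each distance is at most 2.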

A few words about the reference frequencies. The first version of Text fitness uses data from Google Books Ngram statistics. This one, however, uses the frequencies available at Practical Cryptography, where they are said to be generated from the Leipzig Corpora Collection.

As in the first article, let's compare the fitness scores for some texts:

1. A random article from NYT (Source)

The score is 0.50.

2. The "To be or not to be" speech from Hamlet

The score is 0.49.

3. Jabberwocky by Lewis Carroll

The score is 0.62.

4. A random sequence of letters produced by Random Letters Generator

The score is 0.92.

As you can see, the results now carry some meaning. A computer can tell you that the NYT article is certainly "more English" than a random sequence of letters. This can be used in a number of applications, for example, in the automatic cracking of classical simple substitution ciphers (actually, this is the reason I need this function). Of course, like all statistical measures, this one relies heavily on the text being "normal" English text. It fails miserably if the text statistics differ from normal, as in the widely known example from Simon Singh’s book "The Code Book":

From Zanzibar to Zambia to Zaire, ozone zones make zebras run zany zigzags
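
To see why this sentence defeats a frequency-based measure, it helps to count its letters. The short sketch below (the variable names are only illustrative) measures the share of the letter Z, which normally accounts for well under 1% of the letters in English text.

```python
from collections import Counter

quote = "From Zanzibar to Zambia to Zaire, ozone zones make zebras run zany zigzags"

# Keep only the letters A-Z, ignoring case, spaces and punctuation.
letters = [c for c in quote.upper() if "A" <= c <= "Z"]
counts = Counter(letters)
z_share = counts["Z"] / len(letters)

print(f"Z: {counts['Z']} of {len(letters)} letters ({z_share:.1%})")
# prints "Z: 10 of 61 letters (16.4%)"
```

Roughly one letter in six is a Z, so the unigram frequencies alone are already far from the reference values.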

You can play with this quote right below, if you are interested.