
# Text fitness

This online calculator computes text "fitness", that is, how similar a given text is to other texts written in English.

The calculator below is an example of a fitness function that can be applied to text.

According to Wikipedia,

A fitness function is a particular type of objective function that is used to summarise, as a single figure of merit, how close a given design solution is to achieving the set aims. Fitness functions are used in genetic programming and genetic algorithms to guide simulations towards optimal design solutions. [1]

Why might we need it for texts? Unlike a human, a computer can't look at a text and say whether it is normal text or some sort of gibberish, so it needs something measurable. This particular implementation calculates such a measure (or fitness score) based on the statistics of quadgrams (aka 4-grams, aka tetragraphs). Thanks to the Google Books team, who released their Ngram statistics under a Creative Commons Attribution 3.0 Unported License, we can actually count the occurrences of any n-gram in the whole Google corpus (here is the link to Ngram Viewer). And thanks to Peter Norvig, who actually calculated these occurrences, I do not need to download 23 GB of text and count them myself (here is the link to Peter Norvig's article English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU).

For my fitness function, I used the 20000 most frequent quadgrams. The total number of quadgrams analyzed is 1 467 684 913 428. To give you the idea, here are the top ten, along with their frequencies (each calculated by dividing the number of occurrences by the total number of quadgrams):

| Quadgram | Occurrences | Frequency |
|---|---|---|
| TION | 16 665 142 795 | 0.0113547142 |
| ATIO | 8 806 923 204 | 0.0060005544 |
| THAT | 8 006 441 484 | 0.0054551501 |
| THER | 6 709 891 631 | 0.0045717521 |
| WITH | 6 099 136 075 | 0.0041556168 |
| MENT | 5 424 670 138 | 0.0036960727 |
| IONS | 4 103 605 496 | 0.0027959717 |
| THIS | 3 830 166 510 | 0.0026096654 |
| HERE | 3 590 397 215 | 0.0024462997 |
| FROM | 3 473 404 890 | 0.0023665876 |
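
The frequency column can be reproduced from the occurrence counts. Here is a minimal Python sketch that uses the total stated above and three of the listed counts; the names `TOTAL_QUADGRAMS`, `COUNTS`, and `frequency` are illustrative and do not come from the calculator itself.

```python
# Total number of quadgrams analyzed, as stated in the article.
TOTAL_QUADGRAMS = 1_467_684_913_428

# Occurrence counts copied from the top-ten list (first three entries only).
COUNTS = {
    "TION": 16_665_142_795,
    "ATIO": 8_806_923_204,
    "THAT": 8_006_441_484,
}


def frequency(count, total=TOTAL_QUADGRAMS):
    """Relative frequency: occurrences divided by the total number of quadgrams."""
    return count / total


for quadgram, count in COUNTS.items():
    # Printed with ten decimal places, the same precision as the list above.
    print(f"{quadgram} {frequency(count):.10f}")
```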

Having these frequencies, we can, technically, estimate the probability of finding a given text in the whole corpus (which is a good candidate for a fitness measure). For example, let our text be the word "MENTION". It consists of the following quadgrams: MENT - ENTI - NTIO - TION. So,

$$P(\text{MENTION}) \approx P(\text{MENT}) \cdot P(\text{ENTI}) \cdot P(\text{NTIO}) \cdot P(\text{TION})$$

Well, of course, only approximately. Language rules do impose additional constraints, but we do not care much about them as long as our fitness function works as expected. The real problem here, however, is that the probabilities are quite small, so their product quickly shrinks to even smaller values, introduces rounding errors, and is not really usable. The solution is well known: apply the logarithm function. In this case,

$$\log_{10} P(\text{MENTION}) \approx \log_{10} P(\text{MENT}) + \log_{10} P(\text{ENTI}) + \log_{10} P(\text{NTIO}) + \log_{10} P(\text{TION})$$

As you can see, multiplication is replaced with addition. Since the probabilities are less than one but greater than zero, the base-10 logarithm gives us negative values, and the more rare quadgrams we have, the more negative the sum becomes. By the way, for quadgrams outside of the first 1000, I used a very small constant probability of 1/1 467 684 913 428, whose logarithm is -12.1666328301.
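
To see why the product is unusable in practice, here is a small Python sketch that compares multiplying probabilities with summing their logarithms. It assumes a hypothetical text consisting of 30 rare quadgrams, each assigned the constant probability mentioned above; the function names are illustrative only.

```python
import math

TOTAL_QUADGRAMS = 1_467_684_913_428       # total stated in the article
FLOOR_PROBABILITY = 1 / TOTAL_QUADGRAMS   # constant used for rare quadgrams


def product_of(probabilities):
    """Multiply the probabilities directly (the naive approach)."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result


def log_sum_of(probabilities):
    """Sum the base-10 logarithms instead of multiplying."""
    return sum(math.log10(p) for p in probabilities)


# Assumption for illustration: a text made of 30 rare quadgrams.
rare = [FLOOR_PROBABILITY] * 30

print(product_of(rare))   # underflows to 0.0 in double precision
print(log_sum_of(rare))   # roughly -365, still an ordinary number
```

The true product is about 10^-365, which is far below what a double-precision float can represent, while the logarithmic sum stays well within range.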

So, this is exactly how the fitness metric is calculated in the calculator below. I break the text into quadgrams, sum the logarithms of their probabilities, normalize by dividing by the length of the text, and take the absolute value of the result (just for convenience). The more rare quadgrams appear in the text, the larger the value we get; the fewer rare quadgrams appear, the smaller the value.

Of course, this is just one of the possible text metrics, and, taken alone, a score means nothing. The power comes from comparing texts. Let's compare several cases:
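
The steps above can be sketched in Python as follows. This is not the calculator's actual code: it uses only the ten quadgrams listed earlier as a toy table, so its scores will not match the calculator's. It also assumes that the text is upper-cased, that everything except the letters A-Z is dropped before splitting, and that "length of the text" means the number of remaining letters. All names are illustrative.

```python
import math

TOTAL_QUADGRAMS = 1_467_684_913_428  # total stated in the article

# Toy table: only the top ten quadgrams from the article.
COUNTS = {
    "TION": 16_665_142_795, "ATIO": 8_806_923_204, "THAT": 8_006_441_484,
    "THER": 6_709_891_631, "WITH": 6_099_136_075, "MENT": 5_424_670_138,
    "IONS": 4_103_605_496, "THIS": 3_830_166_510, "HERE": 3_590_397_215,
    "FROM": 3_473_404_890,
}

LOG_PROB = {q: math.log10(c / TOTAL_QUADGRAMS) for q, c in COUNTS.items()}
LOG_FLOOR = math.log10(1 / TOTAL_QUADGRAMS)  # about -12.1666, for unlisted quadgrams


def letters_only(text):
    """Upper-case the text and keep only A-Z (an assumption, see above)."""
    return "".join(ch for ch in text.upper() if "A" <= ch <= "Z")


def quadgrams(text):
    """Return all overlapping four-letter sequences of the cleaned text."""
    letters = letters_only(text)
    return [letters[i:i + 4] for i in range(len(letters) - 3)]


def fitness(text):
    """Absolute value of the summed log-probabilities divided by text length."""
    letters = letters_only(text)
    grams = quadgrams(text)
    if not grams:
        raise ValueError("text must contain at least four letters")
    total = sum(LOG_PROB.get(g, LOG_FLOOR) for g in grams)
    return abs(total / len(letters))


print(quadgrams("MENTION"))               # ['MENT', 'ENTI', 'NTIO', 'TION']
print(fitness("MENTION THAT STATION"))    # lower score: several listed quadgrams
print(fitness("QZXJKVQWPFGZBYXKJQ"))      # higher score: no listed quadgrams
```

Both sample strings contain 18 letters, so their scores are directly comparable; with such a small table the absolute values are only meaningful relative to each other.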

1. A random article from the NYT (Source)

The score is 5.61

2. The "To be or not to be" speech from Hamlet

The score is 6.08

3. Jabberwocky by Lewis Carroll

The score is 6.53

4. A random sequence of letters produced by Random Letters Generator

The score is 11.46

As you can see, the results now carry some meaning. A computer can tell you that the NYT article is certainly "more English" than a random sequence of letters. This can be used in a number of applications, for example, in the automatic cracking of classical simple substitution ciphers (actually, this is the reason I need this function). Of course, like all statistical measures, it relies heavily on the text being "normal" English text. It fails miserably if the text statistics differ from the norm, as in the widely known example from Simon Singh’s book "The Code Book":

From Zanzibar to Zambia to Zaire, ozone zones make zebras run zany zigzags

You can play with this quote right below, if you are interested. 