Online calculator: Text fitness

The calculator below is an example of a fitness function that can be applied to text.

According to Wikipedia,

A fitness function is a particular type of objective function used to summarise, as a single figure of merit, how close a given design solution is to achieving the set aims. Fitness functions are used in genetic programming and genetic algorithms to guide simulations towards optimal design solutions.¹

Why might we need this for texts? Unlike humans, computers can't look at a text and tell whether it is normal text or gibberish, so they need something measurable. This particular implementation calculates such a measure (or fitness score) based on the statistics of quadgrams (aka 4-grams, aka tetragraphs). Thanks to Google Books, whose team released their Ngram statistics under a Creative Commons Attribution 3.0 Unported License, we can count the occurrences of any n-gram in the whole Google corpus (here is the link to Ngram Viewer). And thanks to Peter Norvig, who actually calculated these occurrences, I do not need to download 23 GB of text and calculate them myself (here is the link to Peter Norvig's article English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU).

For my fitness function, I used the 20000 most frequent quadgrams. The total number of quadgrams analysed is 1 467 684 913 428. To give you an idea, here are the top ten, along with their frequencies (calculated by dividing the number of occurrences by the total number of quadgrams):

Quadgram	Occurrences	Frequency
TION	16 665 142 795	0,0113547142
ATIO	8 806 923 204	0,0060005544
THAT	8 006 441 484	0,0054551501
THER	6 709 891 631	0,0045717521
WITH	6 099 136 075	0,0041556168
MENT	5 424 670 138	0,0036960727
IONS	4 103 605 496	0,0027959717
THIS	3 830 166 510	0,0026096654
HERE	3 590 397 215	0,0024462997
FROM	3 473 404 890	0,0023665876
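
The frequency column can be reproduced by dividing each occurrence count by the total number of quadgrams. Here is a minimal Python sketch using the first three rows of the table; the names `TOTAL_QUADGRAMS`, `OCCURRENCES` and `FREQUENCIES` are illustrative and are not taken from the calculator itself.

```python
# Total number of quadgrams analysed, as quoted in the article.
TOTAL_QUADGRAMS = 1_467_684_913_428

# Occurrence counts copied from the first three rows of the table.
OCCURRENCES = {
    "TION": 16_665_142_795,
    "ATIO": 8_806_923_204,
    "THAT": 8_006_441_484,
}

# Frequency = number of occurrences / total number of quadgrams.
FREQUENCIES = {q: n / TOTAL_QUADGRAMS for q, n in OCCURRENCES.items()}

for quadgram, freq in FREQUENCIES.items():
    # Print with ten decimal places, like the table.
    print(f"{quadgram}\t{freq:.10f}")
```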

Having these frequencies, we can technically estimate the probability of finding a given text in the whole text corpus (which is a good candidate for a fitness measure). For example, let our text be the word "MENTION". It consists of the following quadgrams: MENT - ENTI - NTIO - TION. So,

$p(\text{MENTION}) = p(\text{MENT}) \cdot p(\text{ENTI}) \cdot p(\text{NTIO}) \cdot p(\text{TION})$
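
Splitting a text into quadgrams amounts to sliding a four-letter window over it one position at a time. A minimal Python sketch follows; the helper name `quadgrams` is hypothetical and is used here only for illustration.

```python
def quadgrams(text: str) -> list[str]:
    """Return all overlapping 4-letter windows of the text, in order."""
    # A text shorter than four letters yields an empty list.
    return [text[i:i + 4] for i in range(len(text) - 3)]


print(quadgrams("MENTION"))
```

For "MENTION" this yields the four quadgrams listed above: MENT, ENTI, NTIO and TION.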

Well, approximately, of course. Language rules impose additional limitations, but we do not care much about them as long as our fitness function works as expected. The real problem here, however, is that the probabilities are quite small, so their product quickly shrinks to even smaller values, introduces rounding errors, and is not very usable. The solution is well known: apply the logarithm function. In this case,

$\log p(\text{MENTION}) = \log p(\text{MENT}) + \log p(\text{ENTI}) + \log p(\text{NTIO}) + \log p(\text{TION})$

As you can see, multiplication is replaced with addition. Since the probabilities are less than one but greater than zero, the base-10 logarithm gives us negative values. And the more rare quadgrams we have, the larger in magnitude the negative sum becomes. By the way, for quadgrams outside of the first 1000, I used a very small constant probability of 1/1 467 684 913 428; its logarithm is -12.1666328301.
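
To see why the logarithm helps, here is a small Python sketch. It computes the constant used for unknown quadgrams, then compares multiplying that probability 100 times with summing its logarithm 100 times. The count of 100 and all variable names are arbitrary choices made for this illustration.

```python
import math

# Total number of quadgrams analysed, as quoted in the article.
TOTAL_QUADGRAMS = 1_467_684_913_428

# Constant probability used for quadgrams missing from the table.
floor_probability = 1 / TOTAL_QUADGRAMS
floor_log = math.log10(floor_probability)
print(round(floor_log, 10))

# Multiplying many small probabilities underflows in floating point.
product = 1.0
for _ in range(100):
    product *= floor_probability
print(product)

# The sum of the logarithms stays an ordinary, usable number.
log_sum = sum(floor_log for _ in range(100))
print(log_sum)
```

With double-precision floats the product collapses to zero long before 100 factors, while the sum of logarithms remains a finite negative number.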

So, this is exactly how the fitness metric is calculated in the calculator below. I break the text into quadgrams, sum the logarithms of their probabilities, normalize by dividing by the text's length, and take the absolute value of the result (just for convenience). The more rare quadgrams appear in the text, the larger the value; the fewer rare quadgrams appear, the smaller the value.
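
The steps above can be sketched in Python as follows. This is not the calculator's actual code: it uses only the ten quadgrams from the table instead of the full list, so it will not reproduce the scores quoted on this page. It also assumes that non-letters are dropped, that the text is upper-cased, and that "text length" means the number of remaining letters. The names `fitness`, `LOG_PROB` and `FLOOR_LOG` are hypothetical.

```python
import math

TOTAL_QUADGRAMS = 1_467_684_913_428

# Illustrative subset: the top-ten occurrence counts from the table.
OCCURRENCES = {
    "TION": 16_665_142_795, "ATIO": 8_806_923_204,
    "THAT": 8_006_441_484, "THER": 6_709_891_631,
    "WITH": 6_099_136_075, "MENT": 5_424_670_138,
    "IONS": 4_103_605_496, "THIS": 3_830_166_510,
    "HERE": 3_590_397_215, "FROM": 3_473_404_890,
}

# Base-10 logarithm of each known probability, plus the constant
# used for any quadgram that is not in the table.
LOG_PROB = {q: math.log10(n / TOTAL_QUADGRAMS) for q, n in OCCURRENCES.items()}
FLOOR_LOG = math.log10(1 / TOTAL_QUADGRAMS)


def fitness(text: str) -> float:
    """Return the absolute, length-normalized sum of quadgram log-probabilities."""
    # Assumption: keep only the letters A-Z, compared in upper case.
    letters = "".join(ch for ch in text.upper() if "A" <= ch <= "Z")
    if len(letters) < 4:
        raise ValueError("need at least four letters")
    total = sum(
        LOG_PROB.get(letters[i:i + 4], FLOOR_LOG)
        for i in range(len(letters) - 3)
    )
    # Assumption: normalize by the number of letters kept.
    return abs(total / len(letters))


print(fitness("this that"))
print(fitness("zqxj kvwp"))
```

Even with this tiny table, the text made of common quadgrams scores lower than the random-looking one, which is the behaviour the metric relies on.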

Of course, this is only one of the possible text metrics, and, taken alone, a score means nothing. The power comes from comparing texts. Let's compare several cases:

A random article from the NYT (Source)

The score is 5.61

The "To be or not to be" speech from Hamlet.

The score is 6.08

The JABBERWOCKY by Lewis Carroll

The score is 6.53

A random sequence of letters produced by the Random Letters Generator

The score is 11.46.

As you can see, the results now carry some meaning. The computer can tell you that the NYT article is certainly "more English" than a random sequence of letters. This can be used in many applications, for example, in the automatic cracking of classical simple substitution ciphers (actually, this is why I needed this function). Of course, like all statistical measures, this relies heavily on the text being "normal" English text. It fails miserably if the text's statistics differ from normal, as in the widely known example from Simon Singh’s book "The Code Book":

From Zanzibar to Zambia to Zaire, ozone zones make zebras run zany zigzags

You can play with this quote right below if you are interested.

Text Fitness

To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them. To die—to sleep,
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to: 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep, perchance to dream—ay, there's the rub:
For in that sleep of death what dreams may come,
When we have shuffled off this mortal coil,
Must give us pause—there's the respect
That makes calamity of so long life.
For who would bear the whips and scorns of time,
Th'oppressor's wrong, the proud man's contumely,
The pangs of dispriz'd love, the law's delay,
The insolence of office, and the spurns
That patient merit of th'unworthy takes,
When he himself might his quietus make
With a bare bodkin? Who would fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscovere'd country, from whose bourn
No traveller returns, puzzles the will,
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all,
And thus the native hue of resolution
Is sicklied o'er with the pale cast of thought,
And enterprises of great pitch and moment
With this regard their currents turn awry
And lose the name of action.

Fitness function ↩

PLANETCALC Online calculators

This page exists due to the efforts of the following people:

Timur
