Text fitness

This online calculator computes text "fitness", that is, how similar the given text is to other texts written in the English language.

This page exists due to the efforts of the following people:

Timur

Created: 2018-10-23 09:46:26, Last updated: 2021-03-20 16:13:18
Creative Commons Attribution/Share-Alike License 3.0 (Unported)

This content is licensed under Creative Commons Attribution/Share-Alike License 3.0 (Unported). That means you may freely redistribute or modify this content under the same license conditions and must attribute the original author by placing a hyperlink from your site to this work https://planetcalc.com/7959/. Also, please do not modify any references to the original work (if any) contained in this content.

The calculator below is an example of a fitness function that can be applied to text.

According to Wikipedia,

A fitness function is a particular type of objective function used to summarise, as a single figure of merit, how close a given design solution is to achieving the set aims. Fitness functions are used in genetic programming and genetic algorithms to guide simulations towards optimal design solutions.

Why might we need it for texts? Unlike humans, computers can't look at a text and tell whether it is normal text or gibberish, so they need something measurable. This particular implementation calculates such a measure (or fitness score) based on quadgram (aka 4-gram, aka tetragraph) statistics. Thanks to the Google Books team, which released its Ngram statistics under a Creative Commons Attribution 3.0 Unported License, we can actually count the occurrences of any n-gram in the whole Google Books corpus (here is the link to the Ngram Viewer). And thanks to Peter Norvig, who actually calculated these occurrences, I do not need to download 23 GB of text and compute them myself (here is the link to Peter Norvig's article English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU).

For my fitness function, I used the 20,000 most frequently occurring quadgrams. The total number of quadgrams analysed is 1 467 684 913 428. To get the idea, here are the top ten, along with their frequencies (calculated by dividing the number of occurrences by the total number of quadgrams):

Quadgram   Occurrences       Frequency
TION       16 665 142 795    0.0113547142
ATIO        8 806 923 204    0.0060005544
THAT        8 006 441 484    0.0054551501
THER        6 709 891 631    0.0045717521
WITH        6 099 136 075    0.0041556168
MENT        5 424 670 138    0.0036960727
IONS        4 103 605 496    0.0027959717
THIS        3 830 166 510    0.0026096654
HERE        3 590 397 215    0.0024462997
FROM        3 473 404 890    0.0023665876
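
Just to make the relationship concrete, here is a minimal Python check of the first row (the counts are copied from the table above; the variable names are my own):

```python
# Frequency = occurrences / total number of quadgrams (values from the table above).
TOTAL_QUADGRAMS = 1_467_684_913_428
tion_occurrences = 16_665_142_795

tion_frequency = tion_occurrences / TOTAL_QUADGRAMS
print(f"{tion_frequency:.10f}")  # -> 0.0113547142
```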

Having these frequencies, we can, technically, estimate the probability of finding a given text in the whole text corpus (which is a good candidate for a fitness measure). For example, let our text be the word "MENTION". It consists of the following quadgrams: MENT - ENTI - NTIO - TION. So,

p(MENTION)=p(MENT)*p(ENTI)*p(NTIO)*p(TION)

Well, of course, only approximately. Language rules impose additional constraints, but we do not care much about them as long as our fitness function works as expected. The real problem here, however, is that the probabilities are quite small, so multiplying them quickly produces even smaller values, introduces rounding errors, and is not really usable. The solution is well known - apply the logarithm. In this case,

log(p(MENTION))=log(p(MENT))+log(p(ENTI))+log(p(NTIO))+log(p(TION))

As you can see, multiplication is replaced with addition. Since the probabilities are greater than zero but less than one, the base-10 logarithm gives us negative values, and the more rare quadgrams we have, the larger the negative value we get. By the way, for quadgrams outside of the first 1000, I used a very small constant probability of 1/1 467 684 913 428; its logarithm is -12.1666328301.
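
Here is a small numeric sketch of that log trick (the ENTI and NTIO probabilities below are invented just for illustration; MENT and TION come from the table above, and the floor constant is the one just mentioned):

```python
import math

p = {
    "MENT": 0.0036960727,  # from the table above
    "ENTI": 0.0020,        # made-up value, for illustration only
    "NTIO": 0.0030,        # made-up value, for illustration only
    "TION": 0.0113547142,  # from the table above
}

product = p["MENT"] * p["ENTI"] * p["NTIO"] * p["TION"]     # a very small number (~2.5e-10)
log_sum = sum(math.log10(x) for x in p.values())            # the same information, but manageable

print(product)
print(log_sum, math.log10(product))  # both ~ -9.6, equal up to rounding

# The floor used for quadgrams missing from the frequency table:
print(math.log10(1 / 1_467_684_913_428))  # -12.1666328301...
```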

So, this is exactly how the fitness metric is calculated in the calculator below: I break the text into quadgrams, sum the logarithms of their probabilities, normalize by dividing by the text's length, and take the absolute value of the result (just for convenience). The more rare quadgrams appear in the text, the bigger the value we get; the fewer rare quadgrams appear, the smaller the value.
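
As a rough sketch of that procedure (this is my own Python illustration, not the calculator's source code; in particular, whether the score is normalized by the raw text length or by the number of quadgrams is my assumption here), it could look like this:

```python
import math

def text_fitness(text, quadgram_log_probs, floor=-12.1666328301):
    """Absolute value of the average log10 probability of the text's quadgrams.
    Lower values mean "more English-like" text, higher values mean rarer quadgrams.
    quadgram_log_probs maps quadgrams like "TION" to log10 of their frequency;
    anything missing from the table falls back to the floor constant."""
    # Keep letters only and uppercase them, so "Don't!" becomes "DONT".
    letters = "".join(c for c in text.upper() if c.isalpha())
    if len(letters) < 4:
        raise ValueError("text is too short to contain a quadgram")
    total = 0.0
    quadgram_count = len(letters) - 3
    for i in range(quadgram_count):
        total += quadgram_log_probs.get(letters[i:i + 4], floor)
    # Normalize by the number of quadgrams (one reading of "the text's length").
    return abs(total / quadgram_count)
```

Fed with the log10 frequencies of the 20,000 most frequent quadgrams, such a function should reproduce the kind of gap shown in the comparison below: ordinary English prose scores lower than gibberish.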

Of course, this is only one of many possible text metrics, and, taken alone, it actually means nothing. The power comes from comparing texts. Let's compare several cases:

  1. A random article from the NYT (Source). The score is 5.61.

  2. The "To be or not to be" speech from Hamlet. The score is 6.08.

  3. JABBERWOCKY by Lewis Carroll. The score is 6.53.

  4. A random letter sequence produced by a Random Letters Generator. The score is 11.46.

As you can see, now the results carry some meaning. The computer can tell you that the NYT article is certainly "more English" than a random sequence of letters. This can be used in many applications, for example, in the automatic cracking of classical simple substitution ciphers (actually, this is why I need this function; a small sketch of that idea follows the quote below). Of course, like all statistical measures, this one heavily relies on the text being "normal" English. It fails miserably if the text's statistics differ from normal, as in the widely known example from Simon Singh's book "The Code Book":

From Zanzibar to Zambia to Zaire, ozone zones make zebras run zany zigzags
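
Coming back to the cipher-cracking application mentioned above, here is a sketch (my own illustration, not the author's solver) of how a fitness function like the one described here can drive a simple hill-climbing attack on a monoalphabetic substitution cipher:

```python
import random
import string

def crack_substitution(ciphertext, fitness, iterations=10_000):
    """Toy hill climber: repeatedly swap two letters of the candidate key and
    keep the swap whenever the decrypted text's fitness score improves
    (gets smaller, i.e. looks "more English")."""
    alphabet = string.ascii_uppercase
    key = list(alphabet)
    random.shuffle(key)  # random starting key

    def decrypt(k):
        # Map each ciphertext letter k[i] back to the plaintext letter alphabet[i].
        return ciphertext.upper().translate(str.maketrans("".join(k), alphabet))

    best_score = fitness(decrypt(key))
    for _ in range(iterations):
        a, b = random.sample(range(26), 2)
        key[a], key[b] = key[b], key[a]
        score = fitness(decrypt(key))
        if score < best_score:
            best_score = score                 # keep the improving swap
        else:
            key[a], key[b] = key[b], key[a]    # revert it
    return decrypt(key), best_score
```

In practice such a search would be restarted several times from different random keys, but even this toy version shows why a cheap, text-only fitness score is useful: it lets the computer judge candidate decryptions without any human in the loop.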

You can play with this quote in the calculator below if you are interested.


PLANETCALC, Text Fitness

Text Fitness

Digits after the decimal point: 2
Fitness Score
 
