Online calculator: Text fitness (version 2)

The calculator below is another example of a fitness function that can be applied to text. I have already created the Text fitness calculator, which uses logarithms of quadgram probabilities. This one takes a different approach, so maybe somebody will find it interesting.

The idea of the fitness function is the following:

We take a large text corpus and count the occurrences of unigrams $C^u_i$ , bigrams $C^b_{ij}$ and trigrams $C^t_{ijk}$ .
Then we sum all the counters to get the totals for unigrams $S_u$ , bigrams $S_b$ and trigrams $S_t$ .
Then we calculate the reference frequency of each unigram $R^u_i=\frac{C^u_i}{S_u}$ , bigram $R^b_{ij}=\frac{C^b_{ij}}{S_b}$ and trigram $R^t_{ijk}=\frac{C^t_{ijk}}{S_t}$ (you can also call them probabilities).
Then, in the same way, we calculate the frequencies within the target text to get the partial frequencies for unigrams $P^u_i$ , bigrams $P^b_{ij}$ and trigrams $P^t_{ijk}$ .
Then we calculate the fitness score using the following formula:
$f=\alpha \sum (R^u_i - P^u_i) + \beta \sum (R^b_{ij} - P^b_{ij}) + \gamma \sum (R^t_{ijk} - P^t_{ijk})$
where $\alpha$ , $\beta$ and $\gamma$ are the weights we assign to the importance of unigrams, bigrams and trigrams respectively. This implementation uses 1/6, 1/3 and 1/2, assigning most of the weight to trigrams.
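
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the calculator's actual code. Several things here are assumptions: the function names (`ngram_frequencies`, `frequency_distance`, `fitness`) are made up for the example, n-grams are taken over letters A-Z only and are allowed to run across word boundaries, and the differences are summed as absolute values, because signed differences between two sets of frequencies that each add up to 1 would cancel out. The stand-in corpus is far too small for real use, so the numbers this sketch prints are not expected to match the scores quoted on this page.

```python
import re
from collections import Counter

# alpha, beta, gamma from the text, keyed by n-gram length
WEIGHTS = {1: 1 / 6, 2: 1 / 3, 3: 1 / 2}


def ngram_frequencies(text, n):
    """Return {n-gram: relative frequency} for the letters of `text`."""
    # Assumption: keep only A-Z and let n-grams run across word boundaries.
    letters = re.sub(r"[^A-Z]", "", text.upper())
    counts = Counter(letters[i:i + n] for i in range(len(letters) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def frequency_distance(reference, partial):
    """Sum of |R - P| over every n-gram seen in either table (assumed form)."""
    grams = reference.keys() | partial.keys()
    return sum(abs(reference.get(g, 0.0) - partial.get(g, 0.0)) for g in grams)


def fitness(text, reference_tables):
    """Weighted distance between `text` and the reference tables."""
    return sum(
        weight * frequency_distance(reference_tables[n], ngram_frequencies(text, n))
        for n, weight in WEIGHTS.items()
    )


if __name__ == "__main__":
    # Tiny stand-in corpus for illustration only; a real run needs a large one.
    corpus = "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun"
    reference = {n: ngram_frequencies(corpus, n) for n in WEIGHTS}
    for sample in ("the dog jumps over the fox", "zqxjk vwzqx kjxqz"):
        print(sample, round(fitness(sample, reference), 2))
```

With this form of the score, a text whose statistics match the reference exactly scores 0, and the score grows as the text drifts away from the reference, so lower means "more like the corpus".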

A few words about the reference frequencies. The first version of Text fitness uses data from the Google Books Ngram statistics. This one, however, uses the frequencies available at Practical Cryptography, where they state that the frequencies were generated from the Leipzig Corpora Collection.

As in the first article, let's compare the fitness scores of several texts:

A random article from the NYT (Source)

The score is 0.50

The "To be or not to be" speech from Hamlet.

The score is 0.49

"Jabberwocky" by Lewis Carroll

The score is 0.62

A random letter sequence produced by Random Letters Generator

The score is 0.92.

As you can see, the results now carry some meaning: the lower the score, the closer the text's statistics are to the reference English frequencies. A computer can tell you that the NYT article is certainly "more English" than a random sequence of letters. This can be used in many applications, for example, in the automatic cracking of classical simple substitution ciphers (actually, this is why I need this function). Of course, like all statistical measures, this relies heavily on the text being "normal" English text. It fails miserably if the text statistics differ from normal, as in the widely known example from Simon Singh’s book "The Code Book":

From Zanzibar to Zambia to Zaire, ozone zones make zebras run zany zigzags

You can play with this quote right below if you are interested.

Click here to calculate fitness score
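
To see why this quote defeats a frequency-based measure, it helps to count its letters. The short sketch below (an illustration only, not part of the calculator) measures how much of the quote is the letter z, which in ordinary English text accounts for well under 1% of letters.

```python
from collections import Counter

quote = "From Zanzibar to Zambia to Zaire, ozone zones make zebras run zany zigzags"

# Keep letters only, ignoring case, spaces and punctuation.
letters = [ch for ch in quote.lower() if ch.isalpha()]
counts = Counter(letters)
z_share = counts["z"] / len(letters)

print(f"z: {counts['z']} of {len(letters)} letters ({z_share:.1%})")
# prints "z: 10 of 61 letters (16.4%)"
```

A unigram table built from normal English will therefore sit far from this text, and the bigram and trigram tables even more so.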

Text Fitness Version 2

To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them. To die—to sleep,
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to: 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep, perchance to dream—ay, there's the rub:
For in that sleep of death what dreams may come,
When we have shuffled off this mortal coil,
Must give us pause—there's the respect
That makes calamity of so long life.
For who would bear the whips and scorns of time,
Th'oppressor's wrong, the proud man's contumely,
The pangs of dispriz'd love, the law's delay,
The insolence of office, and the spurns
That patient merit of th'unworthy takes,
When he himself might his quietus make
With a bare bodkin? Who would fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscover'd country, from whose bourn
No traveller returns, puzzles the will,
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all,
And thus the native hue of resolution
Is sicklied o'er with the pale cast of thought,
And enterprises of great pitch and moment
With this regard their currents turn awry
And lose the name of action.

PLANETCALC Online calculators

This page exists due to the efforts of the following people:

Timur
