Online calculator: Text fitness (version 3)

The calculator below is another example of a fitness function that can be applied to text. I've already created two calculators, Text fitness and Text fitness (version 2), but they failed to achieve my goal: a fitness function good enough to be used for automatic substitution cipher breaking. I tried both with different search implementations, namely a hill-climbing algorithm and a genetic algorithm, but achieved only moderate results: pretty good on long texts but almost useless on short ones.

Finally, I asked the author of one of the best online substitution cipher breakers, Jens Guballa (here is the link), about the fitness function he used. He replied that it is still the sum of log probabilities of quadgrams, but normalized. "Normalized" was the key. The first version of my function did use the sum of log probabilities, but technically, the lesser value does not guarantee that the text is closer to English. It does guarantee that the text consists of the most frequently used quadgrams, but this is not the same thing.

So I decided to normalize my fitness function. To do this, I calculate the normal value of the fitness function (i.e., the value expected for "normal" English text) as the sum of the log probabilities of the top N most frequently used quadgrams, divided by N.
$f_{normal}=\frac{\sum_1^N log(p(quadgram)) }{N}$
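As a rough illustration of the normal value, here is a minimal Python sketch. It rests on several assumptions that the article does not state: the logarithm is base 10, a quadgram's probability is its count divided by the total count, and the function name `normal_value` and the toy counts are hypothetical, made up for the example.

```python
import math

def normal_value(counts, n):
    """Mean log10 probability of the n most frequent quadgrams.

    counts maps quadgram -> raw count. Assumption: p = count / total count.
    """
    if n < 1 or not counts:
        raise ValueError("need at least one quadgram and n >= 1")
    total = sum(counts.values())
    # Keep only the n largest counts (fewer if the table is smaller than n).
    top = sorted(counts.values(), reverse=True)[:n]
    return sum(math.log10(c / total) for c in top) / len(top)

# Toy counts, invented for illustration only (not real English frequencies).
toy_counts = {"TION": 500, "THER": 300, "NTHE": 150, "ZZZZ": 50}
f_normal = normal_value(toy_counts, n=2)
print(round(f_normal, 4))
```

With a real frequency table, N becomes a tuning parameter; the article does not say which value was used.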

Then my fitness function looks like this
$f'=\frac{|f - f_{normal}|}{f_{normal}}$ ,
where f is the sum of the log probabilities of all K quadgrams in the given text, divided by K:
$f=\frac{\sum_1^K log(p(quadgram)) }{K}$
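The two formulas can be sketched in Python as follows. This is only an illustration under stated assumptions, not the calculator's actual code: unseen quadgrams get a fixed `floor` log probability, non-letters are dropped before quadgrams are extracted, and the denominator uses the absolute value of the normal value so the score stays non-negative (log probabilities are negative, while the scores reported on this page are positive). The names `quadgrams`, `fitness`, and the toy table are hypothetical.

```python
def quadgrams(text):
    """Overlapping 4-letter windows over the A-Z letters of text."""
    letters = "".join(ch for ch in text.upper() if "A" <= ch <= "Z")
    return [letters[i:i + 4] for i in range(len(letters) - 3)]

def fitness(text, log_probs, f_normal, floor):
    """Normalized fitness f' = |f - f_normal| / |f_normal|.

    log_probs maps quadgram -> log probability.
    floor is an assumed log probability for quadgrams missing from the table.
    """
    grams = quadgrams(text)
    if not grams:
        raise ValueError("text needs at least four letters")
    # f: mean log probability over the K quadgrams of the text.
    f = sum(log_probs.get(g, floor) for g in grams) / len(grams)
    # Assumption: abs() in the denominator keeps the score non-negative.
    return abs(f - f_normal) / abs(f_normal)

# Toy table, invented for illustration only.
toy_log_probs = {"THER": -2.0, "HERE": -3.0}
print(fitness("there", toy_log_probs, f_normal=-2.0, floor=-10.0))  # 0.25
print(fitness("zzzzz", toy_log_probs, f_normal=-2.0, floor=-10.0))  # 4.0
```

In this toy run the English-like input scores closer to zero than the repeated-letter input, which matches the ordering of the scores listed further down the page.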

This function was quite successful in breaking even short texts enciphered with a substitution cipher. Note that it uses the quadgram frequencies available at Practical Cryptography, which, according to that site, were generated from the Leipzig Corpora Collection.
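To show how a fitness function like this might plug into a substitution cipher breaker, here is a generic hill-climbing sketch in Python. It is not the implementation used on this site or by Jens Guballa. It assumes that a lower score means "closer to English", and the `score` argument is a placeholder for any function that maps text to a number; `decrypt` and `hill_climb` are hypothetical names.

```python
import random
import string

def decrypt(ciphertext, key):
    """Replace cipher letters A..Z with the key letters at the same positions."""
    table = str.maketrans(string.ascii_uppercase, key)
    return ciphertext.upper().translate(table)

def hill_climb(ciphertext, score, iterations=2000, seed=0):
    """Swap two key letters at a time; keep swaps that do not worsen the score."""
    rng = random.Random(seed)
    key = list(string.ascii_uppercase)
    rng.shuffle(key)
    best = score(decrypt(ciphertext, "".join(key)))
    for _ in range(iterations):
        i, j = rng.sample(range(26), 2)
        key[i], key[j] = key[j], key[i]
        candidate = score(decrypt(ciphertext, "".join(key)))
        if candidate <= best:
            best = candidate                  # accept the swap
        else:
            key[i], key[j] = key[j], key[i]   # undo the swap
    return "".join(key), best
```

A practical breaker would usually restart from several random keys, because a single climb can get stuck in a local optimum, especially on short texts.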

As in the previous articles, let's compare the fitness scores for some texts:

A random article from the NYT (Source)

The score is 1.18

The "To be or not to be" speech from Hamlet.

The score is 1.26

"Jabberwocky" by Lewis Carroll

The score is 1.59

A random letter sequence produced by the Random Letters Generator

The score is 5.27.

As you can see, the results are now meaningful: the lower the score, the closer the text is to normal English. The computer can tell you that the NYT article is certainly "more English" than a random sequence of letters. This can be used in many applications, for example, in automatic cracking of classical simple substitution ciphers (actually, this is why I need this function). Of course, like all statistical measures, it relies heavily on the text being "normal" English. It fails miserably if the text's statistics differ from normal, as in the widely known example from Simon Singh’s book "The Code Book".

From Zanzibar to Zambia to Zaire, ozone zones make zebras run zany zigzags

You can play with this quote right below if you are interested.

Text Fitness Version 3

To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles
And by opposing end them. To die—to sleep,
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to: 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep, perchance to dream—ay, there's the rub:
For in that sleep of death what dreams may come,
When we have shuffled off this mortal coil,
Must give us pause—there's the respect
That makes calamity of so long life.
For who would bear the whips and scorns of time,
Th'oppressor's wrong, the proud man's contumely,
The pangs of dispriz'd love, the law's delay,
The insolence of office, and the spurns
That patient merit of th'unworthy takes,
When he himself might his quietus make
With a bare bodkin? Who would fardels bear,
To grunt and sweat under a weary life,
But that the dread of something after death,
The undiscover'd country, from whose bourn
No traveller returns, puzzles the will,
And makes us rather bear those ills we have
Than fly to others that we know not of?
Thus conscience does make cowards of us all,
And thus the native hue of resolution
Is sicklied o'er with the pale cast of thought,
And enterprises of great pitch and moment
With this regard their currents turn awry
And lose the name of action.

PLANETCALC Online calculators

This page exists due to the efforts of the following people:

Timur
