This calculator computes the index of coincidence, abbreviated IOC or IC, for a given text. The index of coincidence is the probability that two letters selected at random from the text are the same. The metric was first proposed by William F. Friedman in Riverbank Publication No. 22, titled "The Index of Coincidence and Its Applications in Cryptography", which was published in 1922. In 1967, the historian David Kahn wrote:
Riverbank Publication No. 22, written in 1920, when Friedman was 28, must be regarded as the most important single publication in cryptography. It took the science into a new world. 1
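From this definition, one can derive the formula for the IOC.
Let $N$ be the length of the text.
Let $n$ be the size of the alphabet.
Let $a_i$ be the i-th letter of the alphabet.
Let $F_i$ be the number of occurrences of the i-th letter in the text.
Then the probability that both selected letters are $a_i$ (the two letters are drawn without replacement) is
$$\frac{F_i}{N} \cdot \frac{F_i - 1}{N - 1} = \frac{F_i (F_i - 1)}{N (N - 1)}$$
The total probability (which is the IOC) is the sum of these probabilities over all letters:
$$IC = \sum_{i=1}^{n} \frac{F_i (F_i - 1)}{N (N - 1)}$$
Note that the IOC is sometimes "normalized". This is usually done by multiplying the result by $n$, the size of the alphabet.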
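The formula can be illustrated with a minimal Python sketch. It is not the calculator's actual implementation; the function name `index_of_coincidence`, the default 26-letter uppercase Latin alphabet, and the choice to ignore every character outside that alphabet are assumptions made for illustration.

```python
from collections import Counter
import string


def index_of_coincidence(text, alphabet=string.ascii_uppercase, normalized=False):
    """Return the IOC of `text`, counting only characters found in `alphabet`.

    Assumption: the text is upper-cased first and everything outside the
    alphabet (spaces, digits, punctuation) is ignored.
    """
    letters = [ch for ch in text.upper() if ch in alphabet]
    total = len(letters)  # N in the formula
    if total < 2:
        raise ValueError("need at least two letters to pick a pair")

    counts = Counter(letters)  # F_i for each letter that occurs
    # Sum of F_i * (F_i - 1) over all letters, divided by N * (N - 1).
    ioc = sum(c * (c - 1) for c in counts.values()) / (total * (total - 1))

    # Optional normalization: multiply by the alphabet size n.
    return ioc * len(alphabet) if normalized else ioc


if __name__ == "__main__":
    # "AABB": two letters with two occurrences each, so (2*1 + 2*1) / (4*3).
    print(index_of_coincidence("AABB"))
```

For a text in which no letter repeats, the sum is zero, so the IOC is 0; for a text made of a single repeated letter it is 1.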
The calculator below parses the text and calculates the IOC using the formula above. The section after the calculator explains why the IOC is so important.
Why is the index of coincidence so important?
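It is important because we can calculate the expected index of coincidence for a given language from that language's letter frequencies. Denoting the relative frequency of the i-th letter as $f_i$, we can approximate its count $F_i$ in a text of length $N$ as $f_i N$, which gives us the following:
$$IC_{\text{expected}} = \sum_{i=1}^{n} \frac{f_i N (f_i N - 1)}{N (N - 1)} = \sum_{i=1}^{n} f_i \cdot \frac{f_i N - 1}{N - 1}$$
If $N$ is large enough, we can approximate the fraction $\frac{f_i N - 1}{N - 1}$ as $f_i$, which gives us
$$IC_{\text{expected}} \approx \sum_{i=1}^{n} f_i^2$$
We can also calculate the expected index of coincidence for completely random text, in which all $n$ letters of the alphabet have the same frequency $\frac{1}{n}$. It is
$$IC_{\text{random}} = \sum_{i=1}^{n} \left(\frac{1}{n}\right)^2 = \frac{1}{n}$$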
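As a rough illustration of both expected values, the sketch below sums the squared letter frequencies for English and compares the result with the value for uniformly random text. The frequency table is an assumption: it holds commonly cited approximate percentages for English, and other corpora give slightly different figures. The names `ENGLISH_FREQ_PERCENT`, `expected_ioc`, and `random_ioc` are hypothetical.

```python
# Approximate English letter frequencies in percent (commonly cited values;
# treat them as an assumption, since they vary from corpus to corpus).
ENGLISH_FREQ_PERCENT = {
    "A": 8.167, "B": 1.492, "C": 2.782, "D": 4.253, "E": 12.702,
    "F": 2.228, "G": 2.015, "H": 6.094, "I": 6.966, "J": 0.153,
    "K": 0.772, "L": 4.025, "M": 2.406, "N": 6.749, "O": 7.507,
    "P": 1.929, "Q": 0.095, "R": 5.987, "S": 6.327, "T": 9.056,
    "U": 2.758, "V": 0.978, "W": 2.360, "X": 0.150, "Y": 1.974,
    "Z": 0.074,
}


def expected_ioc(frequencies):
    """Large-text approximation: the sum of squared relative frequencies.

    The values are rescaled by their total, so percentages or raw counts
    can be passed in directly.
    """
    total = sum(frequencies.values())
    return sum((f / total) ** 2 for f in frequencies.values())


def random_ioc(alphabet_size):
    """Expected IOC when every letter is equally likely."""
    return 1 / alphabet_size


if __name__ == "__main__":
    print(f"English (approx.): {expected_ioc(ENGLISH_FREQ_PERCENT):.4f}")
    print(f"Random, 26 letters: {random_ioc(26):.4f}")
```

With this table the English value comes out at roughly 0.066, against about 0.038 for random text over 26 letters.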
With the expected index of coincidence in hand, you can quickly assess a ciphertext that you suspect was produced by one of the "classical" ciphers. If the index of coincidence is high and close to the expected IC for the language, the text was probably encrypted with a transposition cipher or a simple (monoalphabetic) substitution cipher, since neither changes the set of letter counts. If instead the index of coincidence is low and close to the expected IC for random text, the text was probably encrypted with a polyalphabetic cipher.
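The rule of thumb above could be sketched as a small helper. Everything specific here is an assumption made for illustration: the function name `guess_cipher_family`, the default language IC of 0.066 (an approximate figure for English), and above all the decision threshold, which is simply placed halfway between the two expected values. Short ciphertexts give noisy IC values, so the result is only a hint.

```python
def guess_cipher_family(ioc, language_ioc=0.066, alphabet_size=26):
    """Return a rough guess at the cipher family from an observed IOC.

    Assumptions: `language_ioc` is an approximate expected IC for English,
    and the cut-off is the midpoint between the language value and the
    random-text value 1 / alphabet_size (an arbitrary illustrative choice).
    """
    random_ioc = 1 / alphabet_size
    threshold = (language_ioc + random_ioc) / 2

    if ioc >= threshold:
        # Letter counts look like natural language.
        return "transposition or monoalphabetic substitution"
    # Letter counts look flattened, as a polyalphabetic cipher produces.
    return "polyalphabetic"


if __name__ == "__main__":
    print(guess_cipher_family(0.065))
    print(guess_cipher_family(0.041))
```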
According to Wikipedia,
The index of coincidence is useful both in the analysis of natural-language plaintext and in the analysis of ciphertext (cryptanalysis). Even when only ciphertext is available for testing and plaintext letter identities are disguised, coincidences in ciphertext can be caused by coincidences in the underlying plaintext. This technique is used to cryptanalyze the Vigenère cipher, for example. For a repeating-key polyalphabetic cipher arranged into a matrix, the coincidence rate within each column will usually be highest when the width of the matrix is a multiple of the key length, and this fact can be used to determine the key length, which is the first step in cracking the system. Coincidence counting can help determine when two texts are written in the same language using the same alphabet. (This technique has been used to examine the purported Bible code). The causal coincidence count for such texts will be distinctly higher than the accidental coincidence count for texts in different languages, or texts using different alphabets, or gibberish texts.2
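The key-length technique mentioned in the quotation can be sketched as follows: split the ciphertext into columns for each candidate width, compute the IOC of every column, and rank the widths by the average. This is a minimal illustration, not a complete Vigenère attack; the function names and the `max_width` default are hypothetical, and only the letters A to Z are considered.

```python
from collections import Counter


def ioc(letters):
    """IOC of a sequence of letters; 0.0 if fewer than two letters."""
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def average_column_ioc(ciphertext, width):
    """Write the text row by row into `width` columns and average their IOCs."""
    letters = [ch for ch in ciphertext.upper() if "A" <= ch <= "Z"]
    # Column i holds every width-th letter starting at position i, i.e. all
    # letters enciphered with the same key letter if width matches the key.
    columns = [letters[i::width] for i in range(width)]
    return sum(ioc(col) for col in columns) / width


def rank_key_lengths(ciphertext, max_width=12):
    """Return (width, average IOC) pairs, highest average first."""
    scores = {w: average_column_ioc(ciphertext, w) for w in range(1, max_width + 1)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Artificial example with period 3: every third letter is identical.
    for width, score in rank_key_lengths("ABC" * 20, max_width=6):
        print(width, round(score, 3))
```

Multiples of the true key length score highly as well, so the smallest width among the top candidates is usually the one to try first.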