Index of Coincidence

This online calculator computes the index of coincidence (IC, IOC) of a given text.

This page exists due to the efforts of the following people:

Timur

Created: 2018-10-19 11:12:12, Last updated: 2021-02-18 12:16:53
Creative Commons Attribution/Share-Alike License 3.0 (Unported)

This content is licensed under Creative Commons Attribution/Share-Alike License 3.0 (Unported). That means you may freely redistribute or modify this content under the same license conditions and must attribute the original author by placing a hyperlink from your site to this work https://planetcalc.com/7944/. Also, please do not modify any references to the original work (if any) contained in this content.

This calculator computes the index of coincidence, or IOC (IC), of a given text. The sections after the calculator explain what the index of coincidence is and how it is calculated.

The index of coincidence

The index of coincidence is the probability that two letters selected at random from a text are the same. William F. Friedman first proposed this metric in 1922 in Riverbank Publication No. 22, titled "The Index of Coincidence and Its Applications in Cryptography". In 1967, the historian David Kahn wrote:

Riverbank Publication No. 22, written in 1920, when Friedman was 28, must be regarded as the most important single publication in cryptography. It took science into a new world. [1]

Given the definition above, we can derive the formula for the IOC.
Let N be the length of the text.
Let n be the size of the alphabet.
Let a_i be the i-th letter of the alphabet.
Let F_i be the number of occurrences of the i-th letter in the text.

Then the probability of drawing the letter a_i twice in a row, without replacement, is P_i=\frac{F_i*(F_i-1)}{N*(N-1)}
The total probability (which is the IOC) is the sum of these probabilities over all letters:
IOC=\frac{1}{N*(N-1)}*\sum^{n}_{i=1}F_i*(F_i-1)

Note that the IOC is sometimes "normalized", usually by multiplying the result by n, the size of the alphabet. With this scaling, text whose letters are uniformly random scores close to 1.
IOC_{normalized}=\frac{n}{N*(N-1)}*\sum^{n}_{i=1}F_i*(F_i-1)
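The two formulas above can be sketched in Python as follows. This is a minimal illustration, not the calculator's actual code: the function names are made up for this example, and it assumes a 26-letter Latin alphabet with case folded and all other characters ignored.

```python
from collections import Counter
import string


def letter_counts(text, alphabet=string.ascii_uppercase):
    """Count occurrences of each alphabet letter, ignoring everything else."""
    return Counter(ch for ch in text.upper() if ch in alphabet)


def index_of_coincidence(text, alphabet=string.ascii_uppercase):
    """IOC = sum(F_i * (F_i - 1)) / (N * (N - 1))."""
    counts = letter_counts(text, alphabet)
    n_letters = sum(counts.values())  # N, the number of counted letters
    if n_letters < 2:
        raise ValueError("need at least two letters to compute the IOC")
    pairs = sum(f * (f - 1) for f in counts.values())
    return pairs / (n_letters * (n_letters - 1))


def normalized_ioc(text, alphabet=string.ascii_uppercase):
    """Normalized IOC: the plain IOC multiplied by the alphabet size n."""
    return len(alphabet) * index_of_coincidence(text, alphabet)


# "HELLO" has N = 5 and only L repeats (F = 2), so the IOC is 2 / (5 * 4).
print(index_of_coincidence("HELLO"))
print(normalized_ioc("HELLO"))
```

For a text as short as "HELLO" the value is not meaningful for analysis; the example only makes the arithmetic easy to check by hand.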

The calculator above parses the text and calculates the IOC using these formulas. The next section explains why the IOC is so important.

Why is the Index of Coincidence so important?

It is important because we can calculate the expected index of coincidence for a given language from that language's letter frequencies. With the frequency of the i-th letter denoted p_i, we can approximate F_i as p_i*N, which gives the following:
\begin{aligned}
IOC_{expected}&=\frac{1}{N*(N-1)}*\sum^{n}_{i=1}F_i*(F_i-1)\\
&=\frac{1}{N*(N-1)}*\sum^{n}_{i=1}(p_i*N)*(p_i*N-1)\\
&=\sum^{n}_{i=1}p_i*\frac{p_i*N-1}{N-1}
\end{aligned}
If N is large enough, we can approximate the fraction \frac{p_i*N-1}{N-1} as p_i, which gives us
IOC_{expected}=\sum^{n}_{i=1}p_i^2

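We can also calculate the expected index of coincidence for completely random text, in which every letter has the same frequency 1/n. The sum then has n equal terms of (1/n)^2, so the expected IOC is n*(1/n)^2 = 1/n. For a 26-letter alphabet this is 1/26, or about 0.0385.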
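As an illustration of the sum-of-squares formula, the sketch below evaluates it for English and for a uniform 26-letter alphabet. The frequency table is an assumption: it holds approximate, commonly tabulated percentages for English, and other corpora give slightly different figures. The function name `expected_ioc` is made up for this example.

```python
# Approximate relative frequencies of English letters, in percent.
# Assumption: rounded, commonly tabulated figures; not taken from this page.
ENGLISH_PERCENT = {
    "A": 8.17, "B": 1.49, "C": 2.78, "D": 4.25, "E": 12.70, "F": 2.23,
    "G": 2.02, "H": 6.09, "I": 6.97, "J": 0.15, "K": 0.77, "L": 4.03,
    "M": 2.41, "N": 6.75, "O": 7.51, "P": 1.93, "Q": 0.10, "R": 5.99,
    "S": 6.33, "T": 9.06, "U": 2.76, "V": 0.98, "W": 2.36, "X": 0.15,
    "Y": 1.97, "Z": 0.07,
}


def expected_ioc(frequencies):
    """Large-N approximation of the IOC: the sum of squared probabilities."""
    total = sum(frequencies.values())  # rescale so the probabilities sum to 1
    return sum((f / total) ** 2 for f in frequencies.values())


# Uniform case: every letter gets the same weight, so the result is 1/26.
uniform = {letter: 1.0 for letter in ENGLISH_PERCENT}

print(round(expected_ioc(ENGLISH_PERCENT), 4))  # roughly 0.065 with this table
print(round(expected_ioc(uniform), 4))          # 1/26
```

The gap between the two values is what makes the IOC useful: English-like text scores well above the uniform baseline.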

Once you know the expected index of coincidence, you can quickly classify a ciphertext that you suspect was produced by one of the "classical" ciphers. Transposition ciphers and simple (monoalphabetic) substitution ciphers only reorder or rename letters, so they leave the set of letter counts, and therefore the IOC, unchanged. If the index of coincidence is high and close to the expected IC for the language, the text was probably encrypted with a transposition cipher or a simple (monoalphabetic) substitution cipher. If instead it is low and close to the expected IC for random text, the text was probably encrypted with a polyalphabetic cipher.
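The following sketch shows this distinction on a small sample. Everything in it is illustrative: the sample text, the Caesar shift of 3, and the Vigenère key "LEMON" are arbitrary choices, and a reversed string stands in for a transposition cipher.

```python
from collections import Counter
from itertools import cycle
import string

ALPHABET = string.ascii_uppercase


def ioc(letters):
    """Index of coincidence of a string of uppercase letters."""
    counts = Counter(letters)
    n = len(letters)
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))


def caesar(letters, shift):
    """Monoalphabetic substitution: every letter is shifted by the same amount."""
    return "".join(ALPHABET[(ALPHABET.index(ch) + shift) % 26] for ch in letters)


def vigenere(letters, key):
    """Polyalphabetic substitution: the shift cycles through the key letters."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + ALPHABET.index(k)) % 26]
        for ch, k in zip(letters, cycle(key))
    )


SAMPLE = (
    "The index of coincidence measures how often two letters drawn at random "
    "from a text turn out to be the same. Natural language uses some letters "
    "far more often than others, so ordinary English text scores noticeably "
    "higher than a stream of letters chosen uniformly at random. Classical "
    "ciphers change this score in different ways."
)
plain = "".join(ch for ch in SAMPLE.upper() if ch in ALPHABET)

print("plaintext    ", round(ioc(plain), 4))
print("reversed     ", round(ioc(plain[::-1]), 4))              # same as plaintext
print("Caesar       ", round(ioc(caesar(plain, 3)), 4))         # same as plaintext
print("Vigenere     ", round(ioc(vigenere(plain, "LEMON")), 4))  # typically lower
```

The first three values are identical because reordering or renaming letters does not change the multiset of counts F_i. The Vigenère value is usually noticeably lower, although on very short texts the difference can be noisy.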

According to Wikipedia,

The index of coincidence is useful in the analysis of natural-language plaintext and ciphertext analysis (cryptanalysis). Even when only ciphertext is available for testing and plaintext letter identities are disguised, coincidences in the underlying plaintext can cause coincidences in the ciphertext. This technique is used to cryptanalyze the Vigenère cipher, for example. For a repeating-key polyalphabetic cipher arranged into a matrix, the coincidence rate within each column will usually be highest when the width of the matrix is a multiple of the key length, and this fact can be used to determine the key length, which is the first step in cracking the system. Coincidence counting can help determine when two texts are written in the same language using the same alphabet. (This technique has been used to examine the purported Bible code). The causal coincidence count for such texts will be distinctly higher than the accidental coincidence count for texts in different languages or texts using different alphabets or gibberish texts. [2]
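The key-length idea in the quotation can be sketched as follows: write the ciphertext into a matrix of a candidate width, compute the IOC of each column, and average. This is an illustrative sketch under stated assumptions, not a complete attack. The plaintext, the key "LEMON", the candidate range 1 to 10, and the helper names are all invented for the example.

```python
from collections import Counter
from itertools import cycle
import string

ALPHABET = string.ascii_uppercase


def ioc(letters):
    """Index of coincidence of a string of uppercase letters."""
    counts = Counter(letters)
    n = len(letters)
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))


def average_column_ioc(ciphertext, width):
    """Fill a matrix of `width` columns row by row; average the column IOCs."""
    columns = [ciphertext[i::width] for i in range(width)]
    return sum(ioc(col) for col in columns) / width


def vigenere(letters, key):
    """Repeating-key cipher used here only to produce a test ciphertext."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + ALPHABET.index(k)) % 26]
        for ch, k in zip(letters, cycle(key))
    )


PLAINTEXT = (
    "Written language is far from uniform. Some letters appear again and "
    "again while others are rare, and this unevenness survives many simple "
    "methods of encryption. When a message is enciphered with a short "
    "repeating key, every letter in the same column of the key is shifted by "
    "the same amount, so each column still looks like ordinary language with "
    "its letters renamed. Counting how often letters coincide inside the "
    "columns therefore reveals the length of the key, and once the length is "
    "known each column can be attacked separately as a simple shift cipher."
)
letters = "".join(ch for ch in PLAINTEXT.upper() if ch in ALPHABET)
ciphertext = vigenere(letters, "LEMON")  # hypothetical key of length 5

# Score every candidate width; multiples of the key length should stand out.
scores = {w: average_column_ioc(ciphertext, w) for w in range(1, 11)}
for width, score in scores.items():
    print(width, round(score, 4))
print("best candidate width:", max(scores, key=scores.get))
```

Because every multiple of the key length also scores high, the best-scoring width may be 10 rather than 5 here; in practice one usually takes the smallest width whose score is close to the expected IC for the language.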


  1. David Kahn, The Codebreakers, Macmillan, 1967.

  2. Index of coincidence, Wikipedia.
