Analyzing digram frequencies

Concept summary and lesson

Examples/demo

Last time, we counted letters in a piece of text. That by itself is a pretty useful tool for breaking classical crypto, but we can do even better. It turns out that English has a much stronger pattern to it when you look at pairs of letters rather than single letters. If you look at this link: digraph frequencies in English text, you'll see an interesting thing: they only gave frequencies for the 22 most common digrams, even though there are 26² = 676 possible pairs. That's because the pairs are very unevenly distributed: a small number of digrams account for a large share of real text, and most of the rest hardly ever show up!

That's good news for us as codebreakers, because it lets us rule out a lot of potential solutions based on how unlikely their digrams are. Say you've found 10 possible simple substitution cipher keys that give believable single-letter frequencies. You can use any of them to "decipher" the message, but only one of them will actually be correct. You could then look at the digram frequencies in each candidate decryption, and you'll probably be able to rule out almost all of the bad ones immediately! That's because digrams are a lot more sparse than single letters, and the weird ones almost never appear in correct English text.
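Here is a rough sketch of that filtering idea in Python. The names `COMMON_DIGRAMS`, `common_digram_share`, and `rank_candidates` are made up for illustration, and the ten digrams listed are just a small hand-picked sample of pairs that are well known to be common in English, not the table from the link above. The score is simply the share of a text's letter pairs that fall in that sample, so treat it as a crude heuristic rather than a real statistical test.

```python
# Illustrative sample of common English digrams (an assumption for this
# sketch, not the frequency table referenced in the lesson).
COMMON_DIGRAMS = {"th", "he", "in", "er", "an", "re", "on", "at", "en", "nd"}


def common_digram_share(text):
    """Return the fraction of adjacent letter pairs that are in COMMON_DIGRAMS."""
    # Keep only the letters a-z, lowercased, so spaces and punctuation are ignored.
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    pairs = [a + b for a, b in zip(letters, letters[1:])]
    if not pairs:
        return 0.0
    return sum(p in COMMON_DIGRAMS for p in pairs) / len(pairs)


def rank_candidates(candidates):
    """Sort candidate decryptions so the most English-looking one comes first."""
    return sorted(candidates, key=common_digram_share, reverse=True)


candidates = ["xqzjv kwpfg", "the rain in there"]
best = rank_candidates(candidates)[0]
print(best)
```

A candidate that scores near zero on a long message is very unlikely to be the right decryption, although a short message may not contain enough pairs for the score to mean much.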

So today we're going to write a function that will give us the count of all digrams in a piece of text. It's going to work
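As a starting point, a digram counter might look something like the sketch below. The function name `count_digrams` is hypothetical, and the sketch assumes we want to ignore case and drop everything that isn't one of the letters a-z before pairing, which means a pair can span a word boundary. If you would rather only count pairs inside words, you would split the text into words first.

```python
from collections import Counter
import string


def count_digrams(text):
    """Count every adjacent pair of letters in text.

    Assumptions for this sketch: case is ignored, and any character that
    is not one of a-z is removed before the pairs are formed.
    """
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    # zip(letters, letters[1:]) pairs each letter with the one that follows it.
    return Counter(a + b for a, b in zip(letters, letters[1:]))


counts = count_digrams("The theme")
print(counts.most_common(2))
```

Because the result is a `Counter`, asking for a digram that never appeared (for example `counts["qz"]`) gives 0 instead of raising an error, which is handy when comparing against a list of expected pairs.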

Media resources

Exercises