Cryptogram Cracking

Concept summary and lesson

So far, we have made a bunch of functions that can give us histograms of the frequencies of the letters in pieces of text. However, there's a problem: it's hard to compare one histogram to another! Our corpus of text that we use for a model probably has thousands of characters in it, while the cryptogram probably only has dozens. That means the raw counts in the two histograms are going to be nothing alike! One is going to be orders of magnitude larger than the other, and that makes it hard to line things up.

In order to make our lives easier, we're going to convert the histogram into a probability distribution. A probability distribution is just a collection of all possibilities, each of which has a certain chance of happening. Notice that I said all possibilities and chance of happening. That means that every number in a probability distribution is between 0 and 1, and they all add up to exactly 1.

Turning a histogram into a probability distribution (sometimes loosely abbreviated PDF, for probability distribution function) is easy - just add up all of the counts in the histogram, then divide each count by the total. If you do that, you'll have a whole collection of fractions with a common denominator, all of which add up to exactly one!

This process is called normalizing the data - we rescale everything to a common size so that histograms built from very different amounts of text can be compared directly.
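The normalization step described above can be sketched in a few lines. This is a minimal example, not the lesson's actual code: the function name `normalize` and the use of a plain dict for the histogram are my assumptions.

```python
from collections import Counter

def normalize(histogram):
    """Convert a histogram of letter counts into a probability
    distribution by dividing each count by the total count.
    (Hypothetical helper; names are illustrative.)"""
    total = sum(histogram.values())
    return {letter: count / total for letter, count in histogram.items()}

# Counts from a tiny sample; a real corpus has thousands of characters.
counts = Counter("hello world")
dist = normalize(counts)

# Every probability is between 0 and 1, and they all sum to 1.
assert all(0 <= p <= 1 for p in dist.values())
assert abs(sum(dist.values()) - 1.0) < 1e-9
```

Because every count is divided by the same total, the result is exactly the "fractions with a common denominator" described above, no matter how big the original sample was.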

Updating our crypto cracker

Now, instead of just aligning our character arrays by frequency, we're going to try to find the best match for each character mapping. We'll still want to start with sorted arrays, but instead of just picking the next element from the corpus each time, we'll keep the current ciphertext letter active until we find the corpus letter with the closest probability, and use that letter as our substitution.
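The matching idea above can be sketched as a greedy search: walk the ciphertext letters from most to least frequent, and for each one pick the unused corpus letter whose probability is closest. This is a sketch under my own assumptions - both distributions are dicts of probabilities, and the function name `build_mapping` is hypothetical.

```python
def build_mapping(cipher_dist, corpus_dist):
    """For each ciphertext letter (most frequent first), pick the
    not-yet-used corpus letter whose probability is closest to the
    ciphertext letter's probability. (Illustrative sketch.)"""
    available = dict(corpus_dist)
    mapping = {}
    for c, p in sorted(cipher_dist.items(), key=lambda kv: -kv[1]):
        # Closest remaining corpus letter by absolute difference.
        best = min(available, key=lambda letter: abs(available[letter] - p))
        mapping[c] = best
        del available[best]  # each corpus letter may be used only once
    return mapping

# Toy distributions, just to show the matching behavior.
cipher_dist = {'x': 0.5, 'y': 0.3, 'z': 0.2}
corpus_dist = {'e': 0.45, 't': 0.35, 'a': 0.2}
print(build_mapping(cipher_dist, corpus_dist))  # {'x': 'e', 'y': 't', 'z': 'a'}
```

Removing each corpus letter once it's chosen keeps the mapping one-to-one, which is what a substitution cipher requires.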

This is still not going to solve the cryptogram! However, it will get us a lot closer.

Media resources

Exercises