To understand why serious encryption algorithms are so complex, and why it's
so important to be careful with the critical secrets that make an encryption
system work, it's useful to understand something about how people break
encryption systems. The study of this is called cryptanalysis, and it's
an amazingly fascinating field of applied mathematics. I'm going to be
interspersing information about cryptanalysis with my cryptography posts. One
thing to remember here is that we'll be talking about it mainly in the context
of how you can break an encryption system - but cryptanalysis is also used for
designing cryposystems, because you can only design a successful cryptosystem by
thinking about how it can defeat the ways that it could be broken.
One caveat: I'm going to be describing cryptanalysis in terms of how I understand it, which is sometimes different from classical descriptions by cryptanalysists. My
understanding is strongly rooted in computation and information theory, rather than pure math. So sometimes my presentation will be a bit different, but hopefully by staying in the ground where I'm most comfortable, I can do a better job of making it comprehensible.
To be a bit formal for a moment, a simple cryptosystem consists of
- C = En(P, K)
- P = De(C, K)
Where C is encrypted text (called the cipher-text), P is the unencrypted text (called the clear-text), K is the secret key at the heart of the system. We typically think of K as being something like a password, but it doesn't have to be - in a simple substitution cipher, K is the mapping from clear-text characters to cipher characters.
A good cryptosystem is one where it's easy to generate the cipher-text given
the clear-text and the secret key; easy to generate the clear-text given the
cipher-text and the secret key; and very hard to figure out the secret key, even
given a large number of clear-text/cipher-text pairs.
When we analyze cryptosystems to try to divine their secrets, we
can look at them in terms of a number of different attack scenarios. What I mean by attack scenario is that we're looking at a cryptosystem
that we want to break. An attack scenario is a way of describing
what kind of information the agent trying to crack the cryptosystem
has access to. For example, some of the common scenarios discussed
- Cipher-only: we can watch encrypted messages sent from one user
to another, giving us a large library of encrypted text. But we don't know
any of the clear-text.
- Known clear-text: we have a library of encrypted text, like we did in
cipher-only, and in addition, we have some collection of clear-text/cipher-text
- Chosen clear-text: we have a library of encrypted text, like we did in
cipher-only, and in addition, we can select a finite set of clear-texts,
and have them encrypted.
- Iterated chosen clear-text (also sometimes called "black-box"):
like chosen clear-text, but instead of
only being able to select a set of texts exactly once, we can submit
a set of texts, get back the cipher-text, and then based on what we learn
from earlier sets of plain/cipher-text pairs, we can select more sets of
clear-text to encrypt.
These scenarios all assume that you know the basic cryptosystem. In
general, that's a reasonable assumption. If you don't have a clue
about what kind of cryptosystem was used to encrypt a cipher-text,
things get much harder. There are some tricks that you can use
to make a guess at what kind of cryptosystem was used, but for non-trivial
cryptography, you're often pretty much honked if you don't have any idea of
what sort of crypto was used.
To give you a sense of what it's like to attack a cryptosystem,
I'll walk through a simple example. Below is a simple cipher-text,
encrypted using a simple substititution, with spaces and punctuation
b czfbczc bh bu gqlvh hbxz hl uhgih vd xt lkp qrly. lpz hjbpy hjgh b jgaz plhbfzc bp xt hbxz qilkubpy hjz qrlyludjziz bu hjgh hjziz giz g kjlrz rlh lm hziibmbf ufbzpfz qrlyu lvh hjziz: ligf, djgitpyvrg, gzhblrlyt, evuh hl pgxz g mzk. qvh hjziz bu plh pzgirt ul xvfj lvh hjziz czcbfghzc hl xghj - gpc bp dgihbfvrgi hl hjz xbuvuz lm xghj. b hjbpw hjgh hjgh bu g cgxp ujgxz, qzfgvuz bp xt zodzibzpfz, lpz lm hjz xluh frzgi kgtu lm bczphbmtbpy g figfwdlh bu hjilvyj xghj. pl xghhzi hjz udzfbmbf uvqezfh, hjz figfwdlhu grkgtu zbhjzi galbc li ufizk vd hjz xghj. kjzhjzi bh'u hjz "xzifvit fgvuzu gvhbux" mlrwu, hjz azrbwlauwbgpu, fizghblpbuhu, grh-xzcbfbpz svgfwu, izdvqrbfgp dlrruhziu, li ufbzphlrlybuhu - tlv fgp grkgtu izflypbnz hjz figfwdlhu qt hjzbi xghj. ul b gx ylbpy hl cl xt qzuh hl dilabcz g albfz lm xghjzxghbfgr ugpbht - qlhj qt ujlkbpy kjgh'u kilpy kbhj hjz qgc xghj urld dvxdzc lvh qt hjz rllpbzu, gpc qt ujlkbpy jlk yllc xghj kliwu.
The first question - and this is always the first question in
an attempt to break an encryption - is: "What do we know?"
In this case, we know quite a lot:
- The clear-text is english text.
- The encryption is a simple substitution, which preserves
spaces and punctuation, but not case.
That might not look like a lot, but it's actually huge. Knowing
that the clear-text is english gives us a huge amount of information
that we can use to break it. Knowing that work-breaks are preserved
makes it nearly trivial to solve.
To break it, we'll use a simple form of frequency analysis. Frequency analysis is a great technique for attacking simple ciphers. The
idea of it is that we know a huge amount of information about patterns
in human languages. Different letters don't appear equally frequently in
English; for example, there are a lot more "E"s than "U"s.
A lot of people mistakenly think that frequency analysis means only using the specific frequency of individual letters. That's not true. A good
frequency analysis based attack on a cryptosystem uses not just the frequency of
particular letters, but of groups of letters, relationships between
letters, and frequencies of words.
So let's start deciphering that message. To try to make it easier
to see how we're decrypting, as we figure out letters, I'll write
the message with the clear-text and the cipher-text tangled - clear-text
letters will be uppercase, and cipher-text letters will be lowercase. When
we figure out that a cipher-text letter "a" encodes to a particular
clear-text letter "B", we'll write that as "a/B".
We'll start off with a nice word-based frequency trick. There are only two
common one-letter words in english: "A" and "I", and "I" occurs more frequently
at the beginning of a sentence than "A". So there's a good chance that a single
letter word that occurs frequently, including at the beginning of sentences is
"I". So we'll try that here: there are two one-letter words in the cipher-text:
"b" and "g", and "b" occurs at the beginning of sentences. So we'll try "B/I"
I czfIczc Ih Iu Aqlvh hIxz hl uhAih vd xt lkp qrly. lpz hjIpy hjAh I jAaz plhIfzc Ip xt hIxz qilkuIpy hjz qrlyludjziz Iu hjAh hjziz Aiz A kjlrz rlh lm hziiImIf ufIzpfz qrlyu lvh hjziz: liAf, djAitpyvrA, AzhIlrlyt, evuh hl pAxz A mzk. qvh hjziz Iu plh pzAirt ul xvfj lvh hjziz czcIfAhzc hl xAhj - Apc Ip dAihIfvrAi hl hjz xIuvuz lm xAhj. I hjIpw hjAh hjAh Iu A cAxp ujAxz, qzfAvuz Ip xt zodziIzpfz, lpz lm hjz xluh frzAi kAtu lm IczphImtIpy A fiAfwdlh Iu hjilvyj xAhj. pl xAhhzi hjz udzfImIf uvqezfh, hjz fiAfwdlhu ArkAtu zIhjzi AalIc li ufizk vd hjz xAhj. kjzhjzi Ih'u hjz "xzifvit fAvuzu AvhIux" mlrwu, hjz azrIwlauwIApu, fizAhIlpIuhu, Arh-xzcIfIpz svAfwu, izdvqrIfAp dlrruhziu, li ufIzphlrlyIuhu - tlv fAp ArkAtu izflypInz hjz fiAfwdlhu qt hjzIi xAhj. ul I Ax ylIpy hl cl xt qzuh hl dilaIcz A alIfz lm xAhjzxAhIfAr uApIht - qlhj qt ujlkIpy kjAh'u kilpy kIhj hjz qAc xAhj urld dvxdzc lvh qt hjz rllpIzu, Apc qt ujlkIpy jlk yllc xAhj kliwu.
Now, in the first sentence, we see two two-letter words side by side starting with "I". There are only three common two-letter words starting with "I": "if", "is", and "it". Since it's a pair of words, we can consider the
different permutations: "it is", "it if", "if it", "if is", "is if", and "is it". "it if" and "if is" are both very unlikely. We can't discard them out
of hand, but we can consider them very unlikely. "is it" would usually imply
a question, but there's no question mark. So it's probably either "if it", or "it is". Looking into the next line, where we see "hjAh", we can say that "h/F" is very unlikely, because there are very few words in english that start and end with an "f", but a lot that start and end with a T. So we'll guess that the phrase is "it is", and "h/T" and "u/S".
I czfIczc IT IS AqlvT TIxz Tl STAiT vd xt lkp qrly. lpz TjIpy TjAT I jAaz plTIfzc Ip xt TIxz qilkSIpy Tjz qrlylSdjziz IS TjAT Tjziz Aiz A kjlrz rlT lm TziiImIf SfIzpfz qrlyS lvT Tjziz: liAf, djAitpyvrA, AzTIlrlyt, evST Tl pAxz A mzk. qvT Tjziz IS plT pzAirt Sl xvfj lvT Tjziz czcIfATzc Tl xATj - Apc Ip dAiTIfvrAi Tl Tjz xISvSz lm xATj. I TjIpw TjAT TjAT IS A cAxp SjAxz, qzfAvSz Ip xt zodziIzpfz, lpz lm Tjz xlST frzAi kAtS lm IczpTImtIpy A fiAfwdlT IS Tjilvyj xATj. pl xATTzi Tjz SdzfImIf SvqezfT, Tjz fiAfwdlTS ArkAtS zITjzi AalIc li Sfizk vd Tjz xATj. kjzTjzi IT'S Tjz "xzifvit fAvSzS AvTISx" mlrwS, Tjz azrIwlaSwIApS, fizATIlpISTS, ArT-xzcIfIpz svAfwS, izdvqrIfAp dlrrSTziS, li SfIzpTlrlyISTS - tlv fAp ArkAtS izflypInz Tjz fiAfwdlTS qt TjzIi xATj. Sl I Ax ylIpy Tl cl xt qzST Tl dilaIcz A alIfz lm xATjzxATIfAr SApITt - qlTj qt SjlkIpy kjAT'S kilpy kITj Tjz qAc xATj Srld dvxdzc lvT qt Tjz rllpIzS, Apc qt SjlkIpy jlk yllc xATj kliwS.
Now, we see "Tl" in the first line. There's only one common two-letter
word starting with "T": "to". So we'll do "l/O". We also see "STAiT",
which looks like it's probably "start"; and we see that "i" occurs very frequently in the cipher-text; "R" is one of the most common consonants in english. So we'll also add "i/R".
I czfIczc IT IS AqOvT TIxz TO START vd xt Okp qrOy. Opz TjIpy TjAT I jAaz pOTIfzc Ip xt TIxz qROkSIpy Tjz qrOyOSdjzRz IS TjAT TjzRz ARz A kjOrz rOT Om TzRRImIf SfIzpfz qrOyS OvT TjzRz: ORAf, djARtpyvrA, AzTIOrOyt, evST TO pAxz A mzk. qvT TjzRz IS pOT pzARrt SO xvfj OvT TjzRz czcIfATzc TO xATj - Apc Ip dARTIfvrAR TO Tjz xISvSz Om xATj. I TjIpw TjAT TjAT IS A cAxp SjAxz, qzfAvSz Ip xt zodzRIzpfz, Opz Om Tjz xOST frzAR kAtS Om IczpTImtIpy A fRAfwdOT IS TjROvyj xATj. pO xATTzR Tjz SdzfImIf SvqezfT, Tjz fRAfwdOTS ArkAtS zITjzR AaOIc OR SfRzk vd Tjz xATj. kjzTjzR IT'S Tjz "xzRfvRt fAvSzS AvTISx" mOrwS, Tjz azrIwOaSwIApS, fRzATIOpISTS, ArT-xzcIfIpz svAfwS, RzdvqrIfAp dOrrSTzRS, OR SfIzpTOrOyISTS - tOv fAp ArkAtS RzfOypInz Tjz fRAfwdOTS qt TjzIR xATj. SO I Ax yOIpy TO cO xt qzST TO dROaIcz A aOIfz Om xATjzxATIfAr SApITt - qOTj qt SjOkIpy kjAT'S kROpy kITj Tjz qAc xATj SrOd dvxdzc OvT qt Tjz rOOpIzS, Apc qt SjOkIpy jOk yOOc xATj kORwS.
We're starting to see some really good progress. Some short words are
starting to pop up, and lots of places where the correct word is pretty obvious.
For example, "TjAT" is clearly "THAT", so we know "j/H". And we see lots of
three letter words, starting with "T". The most common three-letter word
starting with "T" is "the", and we see lots of three-letter words with "T"
followed by "j"; if our guess that "j" encodes "H" is correct, that means that
"z" almost certainly encodes "E".
I cEfIcEc IT IS AqOvT TIxE TO START vd xt Okp qrOy. OpE THIpy THAT I HAaE pOTIfEc Ip xt TIxE qROkSIpy THE qrOyOSdHERE IS THAT THERE ARE A kHOrE rOT Om TERRImIf SfIEpfE qrOyS OvT THERE: ORAf, dHARtpyvrA, AETIOrOyt, evST TO pAxE A mEk. qvT THERE IS pOT pEARrt SO xvfH OvT THERE cEcIfATEc TO xATH - Apc Ip dARTIfvrAR TO THE xISvSE Om xATH. I THIpw THAT THAT IS A cAxp SHAxE, qEfAvSE Ip xt EodERIEpfE, OpE Om THE xOST frEAR kAtS Om IcEpTImtIpy A fRAfwdOT IS THROvyH xATH. pO xATTER THE SdEfImIf SvqeEfT, THE fRAfwdOTS ArkAtS EITHER AaOIc OR SfREk vd THE xATH. kHETHER IT'S THE "xERfvRt fAvSES AvTISx" mOrwS, THE aErIwOaSwIApS, fREATIOpISTS, ArT-xEcIfIpE svAfwS, REdvqrIfAp dOrrSTERS, OR SfIEpTOrOyISTS - tOv fAp ArkAtS REfOypInE THE fRAfwdOTS qt THEIR xATH. SO I Ax yOIpy TO cO xt qEST TO dROaIcE A aOIfE Om xATHExATIfAr SApITt - qOTH qt SHOkIpy kHAT'S kROpy kITH THE qAc xATH SrOd dvxdEc OvT qt THE rOOpIES, Apc qt SHOkIpy HOk yOOc xATH kORwS.
Looking very good. We can see from things like "OvT" that "v" is cipher for "U"; from "HAaE", we can guess that "a" is cipher for "V"; from "OpE", we can guess that "p" is cipher for "N".
I cEfIcEc IT IS AqOUT TIxE TO START Ud xt OkN qrOy. ONE THINy THAT I HAVE NOTIfEc IN xt TIxE qROkSINy THE qrOyOSdHERE IS THAT THERE ARE A kHOrE rOT Om TERRImIf SfIENfE qrOyS OUT THERE: ORAf, dHARtNyUrA, AETIOrOyt, eUST TO NAxE A mEk. qUT THERE IS NOT NEARrt SO xUfH OUT THERE cEcIfATEc TO xATH - ANc IN dARTIfUrAR TO THE xISUSE Om xATH. I THINw THAT THAT IS A cAxN SHAxE, qEfAUSE IN xt EodERIENfE, ONE Om THE xOST frEAR kAtS Om IcENTImtINy A fRAfwdOT IS THROUyH xATH. NO xATTER THE SdEfImIf SUqeEfT, THE fRAfwdOTS ArkAtS EITHER AVOIc OR SfREk Ud THE xATH. kHETHER IT'S THE "xERfURt fAUSES AUTISx" mOrwS, THE VErIwOVSwIANS, fREATIONISTS, ArT-xEcIfINE sUAfwS, REdUqrIfAN dOrrSTERS, OR SfIENTOrOyISTS - tOU fAN ArkAtS REfOyNInE THE fRAfwdOTS qt THEIR xATH. SO I Ax yOINy TO cO xt qEST TO dROVIcE A VOIfE Om xATHExATIfAr SANITt - qOTH qt SHOkINy kHAT'S kRONy kITH THE qAc xATH SrOd dUxdEc OUT qt THE rOONIES, ANc qt SHOkINy HOk yOOc xATH kORwS.
From "SfIENfE", we can guess that "f" is cipher for "C";
from "xATH", "TIxE", and "NAxE" we can guess that "x" is cipher for M.
We can see "NEARrt", which should be either "NEARBY" or "NEARLY"; L occurs
much more frequently than B, and some of the other instances of "r" appear likely to be "L"s.
I cECIcEc IT IS AqOUT TIME TO START Ud MY OkN qLOy. ONE THINy THAT I HAVE NOTICEc IN MY TIME qROkSINy THE qLOyOSdHERE IS THAT THERE ARE A kHOLE LOT Om TERRImIC SCIENCE qLOyS OUT THERE: ORAC, dHARYNyULA, AETIOLOyY, eUST TO NAME A mEk. qUT THERE IS NOT NEARLY SO MUCH OUT THERE cEcICATEc TO MATH - ANc IN dARTICULAR TO THE MISUSE Om MATH. I THINw THAT THAT IS A cAMN SHAME, qECAUSE IN MY EodERIENCE, ONE Om THE MOST CLEAR kAYS Om IcENTImYINy A CRACwdOT IS THROUyH MATH. NO MATTER THE SdECImIC SUqeECT, THE CRACwdOTS ALkAYS EITHER AVOIc OR SCREk Ud THE MATH. kHETHER IT'S THE "MERCURY CAUSES AUTISM" mOLwS, THE VELIwOVSwIANS, CREATIONISTS, ALT-MEcICINE sUACwS, REdUqLICAN dOLLSTERS, OR SCIENTOLOyISTS - YOU CAN ALkAYS RECOyNInE THE CRACwdOTS qY THEIR MATH. SO I AM yOINy TO cO MY qEST TO dROVIcE A VOICE Om MATHEMATICAL SANITY - qOTH qY SHOkINy kHAT'S kRONy kITH THE qAc MATH SLOd dUMdEc OUT qY THE LOONIES, ANc qY SHOkINy HOk yOOc MATH kORwS.
You should be able to finish it from here: there are a ton of obvious
substittions now: "c/D", "q/B", "y/G", "d/P", "m/F", etc. You should also
now really understand why simple substition is such a poor encryption technique: look how easily we just cracked that! In fact, there's a
cryptanalysis proof that on average, you need less that 50 characters of
cipher-text to decode a simple substitution for english.
(The clear-text is the text of the first-ever post on the original version
of this blog, with contractions and paragraph breaks removed.)