# Simple Encryption: Introduction and Substitution Ciphers

Aug 08 2008 Published by under Encryption

The starting point talking about encryption is to understand
what the point of it is; what it's supposed to do, what problems it's supposed to avoid.

Encryption is fundamentally about communication: you've got two parties who want to communicate, but don't want anyone else to be able to listen in.

They way that you do that is by sharing a secret. You use that secret to somehow modify the information that you're going to send, so that it can't be read by someone who doesn't have the secret. People often think of encryption as a way of using a password to hide information, but a password is just one of many kinds of secrets that you can use. The secret that you share with your counterpart can be a password, a number, a textbook, or just about anything else you can imagine.

For example, one very hard to break encryption is based on you and your counterpart having identical copies of a textbook. Then for each word in your message,
you find that word in your textbook. You encode the word as page#, paragraph#, word#. The encrypted message is then a sequence of triplets of numbers. Without knowing what book you used, if you do a good job with this kind of system,
you can make it pretty much impossible to decode your message without knowing
what book you used. But this illustrates an important point: in encryption, you need to have a clear understanding of exactly what the shared secret is, and
take care to protect it. (During World War 2, the allies only managed to break
the Enigma encryption system after they got an actual Enigma encryption machine: the structure of the machine is a critical part of the secret of Enigma encryption. With
the machine, and a few other tricks to figure out the rest of the secret, Enigma
was cracked. Without the machine, it wouldn't have been.)

One of the simplest forms of encryption is something called a simple
substitution cipher. The idea of that is that you have a mapping from one set of symbols to another. If you know the mapping, you can encode and decode messages. The secret that you're sharing with the person you're communicating with is the character mapping. If someone can either get, or figure out that mapping, then
you no longer have a secret. As I'll explain later, the problem with this kind
of encryption is that it's pretty easy to figure out the secret, and thus break
the encryption.

To see how substitution ciphers work, we'll look at a simple example. To keep things simpule, we'll only look at letters; we'll leave spaces and punctuation unchanged, and upper/lower case unchanged. The substitution map is as follows:

 Character Coding Character Coding A B C D E F G H I J K L M G Q F C Z L Y J B E W R X N O P Q R S T U V W X Y Z P L D S I U H V A K O T N

Using this, we can encrypt things. We take each letter of the message
we want to send, and look it up in the character row of the table, and replace it
with the corresponding letter in the coding row. To decrypt, we do the opposite: look up the letter in the encrypted message in the coding row, and replace it with the corresponding letter in the character row.

For example, "Mark has a big nose" would be encrypted as "Xgiw jgu g qby pluz".

Implementing a cipher like this borders on trivial. Here's a way of doing it in python, with a sample cipher and invocation demo:

```class Cipher(object):
def __init__(self, mapping):
self._mapping = mapping
self._reverse = {}
# set up the reverse mapping for decryption.
for k in mapping:
self._reverse[mapping[k]] = k
def encrypt(self, message):
"""Take a message string, and encrypt it according to the
cipher mapping""""
result = ""
for c in message:
if c.islower():
cu = c.upper()
else:
cu = c
if self._mapping.has_key(cu):
m = self._mapping[cu]
else:
m = cu
if c.isupper():
result += m
else:
result += m.lower()
return result
def decrypt(self, message):
result = ""
for c in message:
if c.islower():
cu = c.upper()
else:
cu = c
if self._reverse.has_key(cu):
m = self._reverse[cu]
else:
m = cu
if c.isupper():
result += m
else:
result += m.lower()
return result
cipher = Cipher({ 'A': 'G', 'B': 'Q', 'C': 'F', 'D': 'C', 'E': 'Z',
'F': 'L', 'G': 'Y', 'H': 'J', 'I': 'B', 'J': 'E',
'K': 'W', 'L': 'R', 'M': 'X', 'N': 'P', 'O': 'L',
'P': 'D', 'Q': 'S', 'R': 'I', 'S': 'U', 'T': 'H',
'U': 'V', 'V': 'A', 'W': 'K', 'X': 'O', 'Y': 'T',
'Z': 'N', })
en = cipher.encrypt("Mark has a Big nose")
de = cipher.decrypt(en)
print "Encrypted = %s" % en
print "Decrypted = %s" % de
```

That's it, and the largest part of that is really dealing with maintaining upper and lower case through the encryption.

Obviously, this isn't a great way of encrypting things. Why?

By maintaining case, punctuation, and word boundaries, you make it really easy to identify some likely mappings. For example, if you see a single uppercase letter anywhere other than the beginning of a sentence, you can be almost certain that it's "I". If you see a singe character that appears on its own in lowercase, it's
quite probably "a". Frequent three letter words will be either "and" or "the", and "the" will frequently appear at the beginning of sentences. Even if you include mappings for all of the punctuation and space, and you don't distinguish case, it's still easy to decrypt.
With a fixed mapping like this, by using frequency tables for the characters and character groups, you'll be able to pretty quickly
figure it out. This is easy enough to crack that there's actually a popular daily newspaper puzzle called "CryptoQuote" that presents an encrypted quote every day for readers to decode.

Just to give you a sense of how easy this is: just look at the encryption of "Mark has a big nose": ""Xgiw jgu g qby pluz". The "g" is obviously an A. So we immediately
get "xAiw jAu A qby pluz". A three letter word with "a" as the middle is probably
"has", so we can guess that "h" is translated to "j", and "s" is translated to "u". So
we get "xAiw HAS A qbz plSe". In 10 seconds, you've got a third of the message decoded - and that's for a message that's only 5 words/20 letters long!

Improving on this, while keeping the same basic concept is pretty easy. For example, one common technique is to have each letter that you encode also alter the translation in some way. Ultimately, that's the basis for how the infamous "Enigma" encryption machine used by the Nazi's during World War 2 worked. Next post, I'll show you a basic example of this technique, using something called a rotating cipher.

• Tim Howland says:

They way that you do that is by sharing a secret. You use that secret to somehow modify the information that you're going to send, so that it can't be read by someone who doesn't have the secret.

Actually, this isn't true- Public Key Cryptography (eg PGP) doesn't rely on a shared secret at all. Instead it relies on public and private information, and one-way functions to convert between the two. I'd argue that with the growth of SSL traffic on the internet, public key cryptography is becoming the dominant form of encryption in use today.

• Dunc says:

That's true Tim, but a discussion of asymmetric cryptography isn't exactly suited to an introduction to simple cryptography.
Anyway, SSL only uses asymmetric encryption to negotiate a shared key - all the actual traffic is encrypted using (much faster) symmetric algorithms which do rely on a shared secret.

• Charles says:

During World War 2, the allies only managed to break the Enigma encryption system after they got an actual Enigma encryption machine: the structure of the machine is a critical part of the secret of Enigma encryption. With the machine, and a few other tricks to figure out the rest of the secret, Enigma was cracked. Without the machine, it wouldn't have been.

I thought the Poles had made a preliminary break in the system without the machine by analyzing weaker messages.

• KeithB says:

I will share with the world my idea for a better substitution cipher:
use a 2 byte key and give multiple symbols to common characters so that the probability comes out about even.
For example, you may have 20 (or more!) different symbols for 'e', 50 or more different symbols for 'space' and so on.
Assuming you can get the frequencies of the symbols to be about equal, does this make it much harder to crack?

• Stephen Waits says:

MarkCC,
Umm, are you sure you have your Enigma history right? It seems glaringly wrong to me.
--Steve

• Who Cares says:

Isn't a simple way of preventing people from finding out that that 'g' is an 'a' by making used of fixed grouping?
that is group the message like this (note padding at the end to keep it at 4 letters per block):
xgiw jgug qbyp luzr
The Polish got a bit lucky. They managed to get their hands on a how does this work training paper (plain text, key at each step, encrypted text)
And then they had a brilliant guy, Buro something, who deduced that there were limited possibilities to get from one plain text letter to an encrypted letter, so that the Polish started writing up all chains of encryption. From this they developed devices that could do this faster (or reduce the work load).
The only codes they couldn't break were the German navy because those used extra settings. For that they did need that Enigma.

• Mark C. Chu-Carroll says:

Stephen:
I'm not much of a historian - my description of Enigma is based on memory of an article I read a while ago. Checking up on it, I do have some details wrong - but the basic point, that without knowing the basics of how the machine worked, it wouldn't have been solvable, is (I think) still correct; and the Bletchley park system for decrypting made heavy use of knowledge gleaned from examining a physical copy of the Enigma.
There were, of course, other things that led to the great crack of the Enigma. Just having a copy of the machine wasn't *sufficient* to crack it. There were also a number of fundamental mistakes that the Germans made in how they used the machine that greatly helped decoding it.
The original Polish crack of the Enigma was based on a combination of understanding the physical mechanism of the Enigma (it exploited the fact that the last rotor shifted in a fixed way, regardless of machine setting), and the fact that there were predictable message contents (the first six characters of each message were two copies of the three-character setting for the dials), plus certain things like various basis sending out the same message every day ("Nothing to report") provided the data that they needed to
break it.

• Pocket Nerd says:

This post and the last are perfect fodder for my growing interest in cryptography and steganography. They're also very timely, considering the western governments' increasing sense of entitlement to snoop on our data. Thanks for posting these!

• Charles says:

KeithB, thanks for sharing. Unfortunately, the world has known about this technique for a long time. It predates babbage's break of le chiffre indechiffrable. The weaknesses of the simple substitution cipher where known about a month after it was first used. Your suggested method was a common one, but it has two failings.
With a large enough corpus you do begin to see bigram trends which indicate what symbols are related. The english language is riddled with patterns, not just in terms of letter frequency.
Humans are not random enough, and to offer a real advantage the choice of symbol would have to be truly random.
Mark, even though BP did have reference materials for the Enigma, even to the point of being set back for months due to the addition of some new cipher wheels, there was a major break in the FISH cipher that protected German teletype transmissions. A Swede named Arne Beurling worked in the Swedish cipher clerks office, and due to several critical mistakes by the Nazi teletype operators, he was able to break the code and reverse engineer the machine.

• Who Cares says:

@charles:
I think this blog post was meant to be a simple start.
A 'simple' improvement is to expand the space on which you map the original text. This reduces picking out common letters by allowing to map to more then one substitute and increases the amount of encrypted data needed to detect patterns. Downside to this approach is that it starts to resemble a code and not a cipher or you have some formula embedded which allows for even easier cracking if noticed.
Another option is to start with grouping the encrypted message and then taking group n-1 to add to group n (or n and then add to n-1). That also increases the amount of data needed to detect patterns.

• D'Archangel says:

Actually, this isn't true- Public Key Cryptography (eg PGP) doesn't rely on a shared secret at all.

ObPedantry: I object to your use of "at all". All asymmetric implementations of which I am aware actually wrap a symmetric cipher. The actual information is encrypted using a shared key; the asymmetric cipher is used to encrypt that key. In a sense, the asymmetric cipher exists to provide a channel on which you can share your secret; it's the algorithmic equivalent of a hollow tree stump.
I'll stipulate that you could encrypt longer passages asymmetrically and therefore never have to share a key. But you wouldn't want to. Especially in an introduction, sharing a secret is the right way to think about encryption; when Mark gets to asymmetry, it can be framed as a particularly good way to share the secret.
D'A
[ObBitchyUser: Forms that require JavaScript are the suck.]

• Perhaps you could do a discussion on stenography and cryptolinguistics?
It doesn't affect computers much, but I'd love to learn more about that. (I was sad that my crypto course at A&M didn't discuss such things.)
That said, I've always thought that something written in a cryptolanguage, encrypted with 3AES with the key in El Gamal, and stenographicaly hidden in a JPG, could be put up on any website and go undetected and unbroken by all. The idea of the DoD using 4chan to route it's most secure messages is, I must say, quite funny.

• Pseudonym says:

Checking up on it, I do have some details wrong - but the basic point, that without knowing the basics of how the machine worked, it wouldn't have been solvable, is (I think) still correct; and the Bletchley park system for decrypting made heavy use of knowledge gleaned from examining a physical copy of the Enigma.

The thing you're referring to is the Steckerboard. This was a modification added to Enigma in the middle of the war, and it wasn't until one was captured that Bletchley Park worked out how to break it.
Before the war, the Polish mathematicians had exactly one piece of information beyond the actual traffic, and that's a copy of the patent for the machine. So they knew roughly how it worked, but they knew almost none of the details, and none of the modifications that the German millitary made.
What's interesting is that they reverse engineered it anyway, and the way they did it is fascinating. I'll give a brief description when Mark does his thing on rotating ciphers, because it requires a bit of understanding about how those work.
In the case of the Lorenz cipher (which Bletchley codenamed "FISH"), they managed to break it without knowing anything about the machine.

• Nick Johnson says:

Wow, that's some verbose Python code. Let me try:
class Cipher(object):
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
def __init__(self, mapping):
self._mapping = dict(zip(Cipher.alphabet, mapping))
self._reverse = dict(zip(mapping, Cipher.alphabet))
def encrypt(self, message):
"""Take a message string, and encrypt it according to the
cipher mapping""""
return "".join(self.mapping.get(x.upper(), x.upper()) for x in message)
def decrypt(self, message):
return "".join(self._reverse.get(x.upper(), x.upper()) for x in message)
cipher = Cipher("GQFCZLYJBEWRXPLDSIUHVAKOTN")
en = cipher.encrypt("Mark has a Big nose")
de = cipher.decrypt(en)
print "Encrypted = %s" % en
print "Decrypted = %s" % de

(It looks oddly spaced because Scienceblogs seems to do something odd with pre blocks.)

• Mark C. Chu-Carroll says:

Nick:
The code is deliberately verbose. In code on the blog, I'm not trying to write code that I'd use in the real world; I'm trying to write code that's easy to read. For someone who doesn't know python, "self._mapping = dict(zip(Cipher.alphabet, mapping))" isn't going to make a lot of sense; but an explicitly written dictionary is pretty clear, even if it's verbose.

• Nick Johnson says:

Yes, sorry, I should have made clear: I wasn't criticising the verbosity of your code; merely showing how much more compact Python can be. ðŸ™‚
Though that said, I think parts of your code _are_ needlessly overwrought (such as explicitly checking for upper/lower-case-ness and assigning accordingly instead of just calling .lower()), but that wasn't the reason I posted. ðŸ™‚

• Uncle Al says:

Render a two-color image of the document (e.g., screen capure). Draw lines of "random" orientation and width. Everything under a line reverses color. Very quickly all is noise. Then encrypt that graphics file and have NSA venture forth upon a jolly romp. Decryption uses the "random" seed to decrypt the image in reverse process. The weakness is obvious.

• Jeremy says:

For anyone as fascinated as I am with WWII-era cryptography, I highly recommend checking out the book "Between Silk and Cyanide: A Codemaker's War" -- great, great (nonfiction) story -- nothing to do with Enigma, though.
As far as I remember, Enigma's single biggest weakness was that it could not map any character in the plaintext to the same character in the ciphertext-- in fact it was specifically designed that way. Just having the blueprints and/or a stolen copy of the machine *should* not have been sufficient to break the code. In fact, if not for the poor codemaking habits of the operators (mentioned earlier in the comments) and this major flaw in the design, Enigma might well have remained secure for the entire war despite the Allies being in possession of a copy of the machine. I must say, though, I've seen a couple of these in museums and they're absolutely fascinating machines to look at....

• Eileen says:

I thought this text book idea was really interesting!
Eileen
Dedicated Elementary Teacher Overseas (in the Middle East)
elementaryteacher.wordpress.com

• Eileen says:

I forgot to say, the problem is, what kind of text book would have enough words to communicate about? I guess you'd have to get one with some subject-specific vocabulary that you wanted to talk about.
Eileen
Dedicated Elementary Teacher Overseas (in the Middle East)
elementaryteacher.wordpress.com

• Anonymous says:

Also cracking the enigma they would look at the "fists" of the morse code operators. "Fists" is the cadence or signature of the actual key presses. Some of the "Fists" were very unique, like a signature. It was many many things that helped. Often the Germans would send the same type of traffic at the same type of day int he same format. For example, morning weather reports were sent at the same time and followed a strict format. From those they could figure out the key for the day. Sometimes the Brits would do something just so the Germans would report it. The Brits would know the context and possible contents of the message and that made determining the key easier. Since the Germans thought the code was unbreakable they didn't feel the need to change things or inspect their assumptions. Good for the Allied Powers.

• MarkW says:

The Sherlock Holmes story "The Adventure of the Dancing Men" is based around Holmes's solution (by frequency analysis) of a simple substitution cypher.

• Scientopia Blogs