At this point, we've seen a fair bit about regular languages, and we've gone through the introduction to context free languages. We know one way of showing that a language is regular or context free: if you can write a (regular/context free) grammar for a language, then that language is necessarily (regular/context free). But... what if we have a language that we suspect is *not* regular?

Suppose, for example, that we have a context free grammar for a language. The fact that we specified it using a context free grammar doesn't mean that the language *requires* a context-free grammar. Sometimes it's easier to write things with a CFG, even if it's possible to write them with a regular grammar. We suspect that it really can't be done using a regular grammar - but we're not sure. Just because we *think* that we can't write a regular grammar for the language doesn't mean anything. It's possible that we're wrong - that there's some way of writing an alternative grammar that's regular. So... how can we actually show that a language isn't regular? (Or stepping up a level, that a language isn't context free?)

There's a clever tool called the *pumping lemma*, which we can use to show that a language isn't regular. (And there's a variation of it that can be used to show that a language isn't context free.) What it does is exploit a simple property of a regular language in a way that allows us to use it to show that a language isn't regular.

One of the fundamental limits of a regular language is that the machine recognizing it has only a bounded amount of memory. It consumes its input one symbol at a time, from left to right, without being able to look back at earlier context. That's the property that we're going to exploit. What the pumping lemma says is: given a regular language \(L\), there's a language-specific minimum length such that any string in \(L\) at least that long can be split into three parts - a prefix (\(a\)), a center (\(b\)), and a suffix (\(c\)) - where *repeating* the center part any number of times produces strings that are *also* in \(L\). So we can *pump* the string: \(abc\) will be in \(L\); \(abbc\) will be in \(L\); \(abbbc\) will be in \(L\), and so on. If we can find a sufficiently long string in \(L\) with no pumpable center, then \(L\) isn't regular.

Let's make it more formal, so that we can be precise. We need one quick notational thing - we'll write the length of string \(s\) as \(|s|\). With that, the pumping lemma says:

If \(L\) is a regular language, then there exists an integer \(p \ge 1\) that depends only on \(L\), such that every string \(s \in L\) with \(|s| \ge p\) can be divided into substrings \(a\), \(b\), and \(c\) such that:

- \(s = abc\)
- \(|ab| \le p\)
- \(|b| \ge 1\)
- \(\forall i \ge 0: ab^ic \in L\)
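To make the conditions concrete, here's a small Python sketch. The language \(ab^+c\) and the particular split are my own illustration, not part of the lemma; the point is just to show a decomposition satisfying all four conditions:

```python
import re

# An example regular language: one 'a', one or more 'b's, then one 'c'.
lang = re.compile(r"ab+c")

# s = "abbc", with pumping length p = 3: one decomposition that works.
a, b, c = "a", "b", "bc"
s = a + b + c

assert len(a + b) <= 3    # |ab| <= p
assert len(b) >= 1        # |b| >= 1

# Pumping the middle any number of times (including zero) stays in L.
for i in range(10):
    pumped = a + b * i + c
    assert lang.fullmatch(pumped) is not None
```

Note that \(i = 0\) is included: dropping the center entirely must also leave a string in the language.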

Now, how do we use this? Let's look at how it works on the canonical example of a non-regular language: \(x^ny^n\) - strings consisting of a sequence of \(x\)s followed by the same number of \(y\)s.

Let's assume that this language is regular. Then there must be some pumping length \(n\), and any string in the language longer than \(n\) must have a pumpable middle.

So, let's take the string \(x^{n+1}y^{n+1}\). By the pumping lemma, the combined length of the prefix and the middle must be at most \(n\). Since the first \(n\) characters of the string are all \(x\)s, both the prefix and the middle must consist entirely of \(x\)s: the prefix is \(a = x^m\) for some \(m \ge 0\), and the middle is \(b = x^k\) for some \(k \ge 1\) with \(m + k \le n\). That is, we've broken the string into \(x^m \cdot x^k \cdot x^{n+1-m-k}y^{n+1}\).

If the language is regular, then \(ab^ic\) must be in the language for every \(i \ge 0\). Take \(i = 2\): then \(x^m(x^k)^2x^{n+1-m-k}y^{n+1}\) must be in the language. Simplifying, that's \(x^{n+1+k}y^{n+1}\). But \(k \ge 1\), so \(n+1+k > n+1\): the pumped string has more \(x\)s than \(y\)s, so it's not in the language. Since this happens no matter how we choose the decomposition, the language can't be regular.
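We can check this argument mechanically. Here's a short Python sketch (the membership test and the choice \(n = 10\) are mine, purely for illustration) that enumerates every legal decomposition of \(x^{n+1}y^{n+1}\) and verifies that pumping each one falls out of the language:

```python
def in_lang(s: str) -> bool:
    """Membership test for {x^n y^n : n >= 0}."""
    n = len(s) // 2
    return s == "x" * n + "y" * n

n = 10  # stands in for the (unknown) pumping length
s = "x" * (n + 1) + "y" * (n + 1)
assert in_lang(s)

# Any decomposition with |ab| <= n puts the middle entirely inside the x's.
# Pumping the middle (i = 2) adds x's without adding y's:
for m in range(n):                   # |a| = m >= 0
    for k in range(1, n - m + 1):    # |b| = k >= 1, with m + k <= n
        a, b, c = s[:m], s[m:m+k], s[m+k:]
        pumped = a + b * 2 + c
        assert not in_lang(pumped)   # every decomposition fails when pumped
```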

One thing to be aware of - a mistake I made far too often in my first class that covered this - is that the fact that a language is pumpable does *not* mean that the language *is* regular. All regular languages are pumpable, but not all pumpable languages are regular. If you can find a sufficiently long string that can't be pumped, you've proved the language isn't regular. But the fact that you can pump it doesn't tell you anything. If you want to prove that a language *is* regular, you have to either show a regular grammar, a regular expression, or an FSA that describes the language.

The regular language pumping lemma that I showed above is really easy to use. It takes a bit of practice to get comfortable choosing the basic structure of a string to pump - but the proofs are pretty much always as short as the one above. For context free languages, the basic concept is similar: you exploit the structure of the class of languages to find a structure that can be repeated - and then you show that your language of interest doesn't contain that sort of repetitive structure. But since CFLs are a lot more expressive, and can describe more complex structures, the repetitive structure that you exploit is, similarly, more complicated.

The context free pumping lemma says:

If \(L\) is a context-free language, then there exists some integer \(p \ge 1\) such that any string \(s \in L\) with \(|s| \ge p\) can be divided into substrings \(u, v, x, y, z\) such that:

- \(s = uvxyz\)
- \(|vxy| \le p\)
- \(|vy| \ge 1\)
- \(\forall n \ge 0: uv^nxy^nz \in L\)

So... Instead of just having a prefix, a middle, and a suffix, we've now got five substrings. There are *two* repeated parts instead of just one, and there can be something in between them. And we pump repetitions of both the front part and the back part in lockstep. So if \(uvxyz \in L\), then \(uvvxyyz \in L\), and \(uvvvxyyyz \in L\), and so on.
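For a language that *is* context free, such a decomposition really exists. Here's a quick sketch (my own illustration) using \(x^ny^n\) - context free, though not regular - where pumping \(v\) and \(y\) in lockstep preserves membership:

```python
def in_lang(s: str) -> bool:
    """Membership test for the context-free language {x^n y^n : n >= 0}."""
    n = len(s) // 2
    return s == "x" * n + "y" * n

# One valid decomposition of s = "xxyy" into u, v, x, y, z:
u, v, x, y, z = "x", "x", "", "y", "y"
assert u + v + x + y + z == "xxyy"

# Pumping v and y the *same* number of times keeps the string in the language.
for i in range(6):
    assert in_lang(u + v * i + x + y * i + z)
```

The two repeated parts moving together is exactly what lets a CFL keep matched counts balanced - and it's exactly what a regular language can't do.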

It's more complicated than the regular language pumping lemma, but it's still not awful. And proofs with it end up looking very much like proofs with the regular language pumping lemma. For example, we could look at the language \(1^n2^n3^n\), which is a canonical example of a non-context-free language. The proof ends up nearly exactly the same as the \(x^ny^n\) proof, using the context free pumping lemma. We pick the string \(1^p2^p3^p\); since \(|vxy| \le p\), the pumped region can touch at most two of the three symbols. Then we pump it, and show that the counts come out unequal - for example, more 2s than 3s.
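Here too, the check can be brute-forced for a small pumping length. This Python sketch (the membership test and the choice \(p = 4\) are mine) tries every decomposition of \(1^p2^p3^p\) with \(|vxy| \le p\) and \(|vy| \ge 1\), and confirms that pumping with \(i = 2\) always leaves the language:

```python
def in_lang(s: str) -> bool:
    """Membership test for {1^n 2^n 3^n : n >= 0}."""
    n = len(s) // 3
    return s == "1" * n + "2" * n + "3" * n

p = 4                       # stands in for the (unknown) pumping length
s = "1" * p + "2" * p + "3" * p

# Try every split s = u v x y z with |vxy| <= p and |vy| >= 1,
# and check that pumping with i = 2 always falls out of the language.
for q1 in range(len(s) + 1):                           # start of v
    for q2 in range(q1, len(s) + 1):                   # end of v
        for q3 in range(q2, len(s) + 1):               # end of x
            for q4 in range(q3, min(len(s), q1 + p) + 1):  # end of y
                u, v, x, y, z = s[:q1], s[q1:q2], s[q2:q3], s[q3:q4], s[q4:]
                if len(v) + len(y) == 0:
                    continue                           # |vy| >= 1 required
                assert not in_lang(u + v * 2 + x + y * 2 + z)
```

The reason every case fails: a window of at most \(p\) characters can touch at most two of the three symbol types, so pumping always leaves at least one count behind.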

"But the fact that you can pump it doesn't tell you anything. If you want to prove that a language is regular, you have to either show a regular grammar, a regular expression, or an FSA that describes the language."

Which is why, when teaching this, I tend to prefer the suffix characterization of regularity to the pumping lemma (since it's almost always equally easy to apply to show non-regularity, and gives an 'if and only if' characterization).

[Briefly, for each w in Sigma^* let L_w = {v | wv in L}. L is regular if and only if {L_w | w in Sigma^*} is finite.]
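As a concrete illustration of how this shows non-regularity (my own example, reusing x^ny^n): the prefixes x^0, x^1, x^2, ... all have distinct suffix languages L_w, because the suffix y^i completes x^i and no other x^j:

```python
def in_lang(s: str) -> bool:
    """Membership test for {x^n y^n : n >= 0}."""
    n = len(s) // 2
    return s == "x" * n + "y" * n

# y^i is in L_w for w = x^i, but not for w = x^j with j != i.
# So x^0, x^1, x^2, ... all land in distinct classes: infinitely many
# distinct L_w, hence the language isn't regular.
for i in range(8):
    for j in range(8):
        assert in_lang("x" * j + "y" * i) == (i == j)
```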

I'd never heard the Myhill-Nerode theorem called "the suffix characterization of regularity". It fits, but why don't you use the name?

Because I'm lazy? No, more seriously because when I say 'the suffix characterization of regularity' I remember what it means, whereas 'the Myhill-Nerode theorem' requires a fetch from memory that frequently fails.

[No slight or failure of attribution intended.]

One cool proof one can get from this is to show that there are infinitely many primes in Z. Proof sketch:

1) For any base d >= 1 and any b >= 1, there's a DFA that accepts exactly the base-d representations of the numbers divisible by b.

2) In base 3, the set of representations of powers of 2 is not pumpable (and so not regular).

Combine 1 and 2 by observing that the languages accepted by DFAs are closed under finite unions, complements, and intersections. So if there were only finitely many primes, one could use 1 to represent the powers of 2 by taking the complement of (divisible by 3 or 5 or 7 ... p), where p is the largest prime. Contradiction.
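The DFA in step 1 is easy to sketch (the code and names here are mine, just to illustrate the construction): the states are the remainders mod b, and reading a digit d in base `base` sends remainder r to (r*base + d) mod b:

```python
def divisibility_dfa(b: int, base: int) -> dict:
    """Transition table of a DFA whose states are remainders mod b.
    Reading digit d in state r moves to (r * base + d) % b."""
    return {(r, d): (r * base + d) % b for r in range(b) for d in range(base)}

def accepts(delta: dict, b: int, base: int, digits: list) -> bool:
    r = 0                       # start state: remainder of the empty prefix
    for d in digits:
        r = delta[(r, d)]
    return r == 0               # accept iff the number is divisible by b

def to_digits(n: int, base: int) -> list:
    """Most-significant-first base-`base` digits of n."""
    if n == 0:
        return [0]
    ds = []
    while n:
        ds.append(n % base)
        n //= base
    return ds[::-1]

# Check the DFA against ordinary arithmetic: divisibility by 7, in base 3.
delta = divisibility_dfa(7, 3)
for n in range(500):
    assert accepts(delta, 7, 3, to_digits(n, 3)) == (n % 7 == 0)
```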

It looks like there are a bunch of typos in the mathematical notation, or maybe the software you are using to render it into gifs is messing it up.