CMPS 260: CFL Pumping Lemma

Just as there is a pumping lemma for regular languages, there is one for context-free languages. The lemma describes a necessary condition for a language to be a CFL, which is to say that any language that fails to satisfy that conidition is not context-free.

First we state a lemma, and a corollary to it, that makes an observation about derivation trees and the lengths of their yields.

Lemma 1: Let G be a context-free grammar, and let k be the length of the longest right-hand side of any of its productions. Let T be a derivation tree of G corresponding to a derivation X ⟹^* α, where X is a symbol (nonterminal or nonterminal) in G. If the height of T is ≤h, then |α| ≤ k^h.

Proof: The proof is by induction on h. As a base case, take h=0. In that case, α = X, which has length one, which is equal to k^h.

For the inductive case, take h≥1 and assume as an induction hypothesis that the lemma holds for all h' < h. Each of the children of the root node of T is the root node of a derivation tree of height at most h-1. By the induction hypothesis, the yield of each of those subtrees has length at most k^h-1. There being at most k such subtrees, the yield of T as a whole has length bounded above by k·k^h-1 = k^h. ◼

Corollary: For every CFG G there is an ascending function MaxLen_G: ℕ ⟶ ℕ such that, for all h≥0,

MaxLen_G(h) = max{ |α| | A ⇒^* α by a derivation tree of height ≤h}

That is, MaxLen_G(h) is the length of the longest sentential form that is the yield of any derivation tree having height h or less.

Let L be a CFL, and suppose that G is a CFG such that L = L(G). Let n be the number of nonterminal symbols in G and let m satisfy m ≥ MaxLen_G(n+1), which is to say that any derivation tree whose yield has length m has height n+1 or more.

Now suppose that w ∈ L(G), with |w| ≥ m, and consider some derivation tree T corresponding to a derivation S ⟹^* w. Then the height of T is at least n+1, which means that there is at least one root-to-leaf path in T on which there are at least n+2 nodes. Choose one such path, and consider the last n+2 nodes on that path, all but the last of which is labeled by a nonterminal symbol. Choose two among those nodes labeled by the same nonterminal symbol. (They must exist, because there are only n different nonterminals in G.) Let A be that nonterminal symbol, and call the two nodes Ancestor-A and Descendant-A.

T

S /|\ / | \ / | \ / | \ / | \ / A \ / /|\ \ / / | \ \ / / | \ \ / / A \ \ / / / \ \ \ / / / \ \ \ +-----+---+-----+---+-----+ u v x y z

T
S /\|\ / \| \ / \| \ / \| \ / \| \ / A \ / /\|\ \ / / \| \ \ / / \| \ \ / / A \ \ / / / \ \ \ / / / \ \ \ +-----+---+-----+---+-----+ u v x y z

Then T must look like the tree depicted to the right. (Of course, the higher occurrence of 'A' marks the position of the Ancestor-A node, and the lower occurrence marks the position of the Descendant-A node.) What that primitive illustration is intended to convey is that w = uvxyz, where x is the yield of the subtree rooted at Descendant-A, vxy is the yield of the subtree rooted at Ancestor-A, and w = uvxyz is the yield of T as a whole.

Now consider that if we replaced within T the subtree rooted at Ancestor-A by the subtree rooted at Descendant-A, we would get the tree T₀ shown below left, the yield of which is uxz (or, equivalently, uv⁰xy⁰z). (Notice that the v and y substrings of w are absent.)

But if we replaced within T the subtree rooted at Descendant-A by the subtree rooted at Ancestor-A, we would get the tree T₂ shown below right, the yield of which is uv²xy²z.

S /|\ / | \ / | \ / | \ / | \ / A \ / / \ \ / / \ \ +-----+-----+-----+ u x z

S /|\ / | \ / | \ / | \ / | \ / A \ / /|\ \ / / | \ \ / / | \ \ / / A \ \ / / /|\ \ \ / / / | \ \ \ / / / | \ \ \ / / / A \ \ \ / / / / \ \ \ \ / / / / \ \ \ \ +-----+---+---+-----+---+---+-----+ u v v x y y z

T₀ T₂

S /\|\ / \| \ / \| \ / \| \ / \| \ / A \ / / \ \ / / \ \ +-----+-----+-----+ u x z	S /\|\ / \| \ / \| \ / \| \ / \| \ / A \ / /\|\ \ / / \| \ \ / / \| \ \ / / A \ \ / / /\|\ \ \ / / / \| \ \ \ / / / \| \ \ \ / / / A \ \ \ / / / / \ \ \ \ / / / / \ \ \ \ +-----+---+---+-----+---+---+-----+ u v v x y y z
T₀	T₂

If, in T₂, we replace the subtree whose yield is x by the subtree whose yield is vxy, we get a derivation tree T₃ whose yield is uv³xy³z. Generalizing, we can perform the same operation on any tree T_i (i≥1) (the yield of which is uvⁱxyⁱz) to obtain tree T_i+1 having yield uvⁱ⁺¹xyⁱ⁺¹z.

We are almost ready to state the pumping lemma, but there are two small points to take care of.

One is that the length of vxy is at most m. That follows from these facts:

vxy is the yield of the derivation tree rooted at Ancestor-A, the height of which is at most n+1.
MaxLen_G(n+1) ≤ m, which is to say that no derivation tree of height n+1 or less has a yield longer than m.

The second point is that we can assert that at least one among v and y is a nonempty string, which can be stated as |vy| > 0. Why? Because if both v and y were the empty string, it could only be because G admitted the derivation A ⟹⁺ A. Such a derivation can occur only if G includes either λ-productions or unit productions (of the form H → K, the right-hand side of which is a single nonterminal symbol). But we can assume, without loss of generality, that G has neither of those kinds of productions, because there is an algorithm by which any CFG can be transformed into a CFG without such productions and generating the same language, except for the loss of the empty string.

All the reasoning appearing above serves, in effect, as a proof of the following theorem:

Pumping Lemma for Context-free Languages: If G is a CFG, where L = L(G), then there is a positive integer m (that depends upon G) such that if w∈L and |w|≥m, then there exist terminal strings u, v, x, y, and z satisfying these conditions:

w = uvxyz
|vxy| ≤ m
|vy| > 0
for all i ≥ 0, uvⁱxyⁱz ∈ L

The manner in which to this lemma can be used to show that a language L is not context-free is as follows:

Let m>0 be arbitrary.
Choose a string w ∈ L having length at least m. In order to describe w, one or more references to m are needed in order to ensure that |w| ≥ m.
Identify every possible way of splitting w into five substrings u, v, x, y, and z satisfying the first three conditions of the lemma (namely (1) w = uvxyz, (2) |vxy| ≤ m, and (3) |vy| > 0), and partition them into cases so that all the strings covered by any one case can be treated in a uniform manner.
For each case arising from (3), find a value of i for which uvⁱxyⁱz ∉ L.

Example application of the Pumping Lemma:
The "canonical" non-context free language is {aⁿbⁿcⁿ | n ≥ 0}. We prove it not to be a CFL using the Pumping Lemma, as follows:

Let m be the constant mentioned in the lemma, and choose w = a^mb^mc^m. Now we consider every possible way of factoring w into u, v, x, y, and z that satisfies the first three conditions listed in the lemma. For each of those ways, we show that the fourth condition does not hold.
Case 1: Either v or y contains two different "kinds" of symbols (e.g., a and b, or b and c). Then pumping (i.e., taking i to be 2 or greater) results in a string in which one of the substrings ba or cb occurs. (For example, suppose that v = a⁴b⁹. Then v² = a⁴b⁹a⁴b⁹, which has ba as a substring.) No such string is a member of L.
Case 2: Each of v and y contains at most one kind of symbol. (Remember that one of them, but not both, can be the empty string.) But then pumping (i.e., taking i to be 2) increases the number of occurrences of at most two kinds of symbols (e.g., a and b) while leaving the number of occurrences of the third kind of symbol (e.g., c) the same. The resulting string is thus not in L. ◼