Just as there is a pumping lemma for regular languages, there is one for context-free languages. The lemma describes a necessary condition for a language to be a CFL, which is to say that any language that fails to satisfy that conidition is not context-free.
First we state a lemma, and a corollary to it, that makes an observation about derivation trees and the lengths of their yields.
Lemma 1: Let G be a context-free grammar, and let k be the length of the longest right-hand side of any of its productions. Let T be a derivation tree of G corresponding to a derivation X ⟹* α, where X is a symbol (nonterminal or nonterminal) in G. If the height of T is ≤h, then |α| ≤ kh.
Proof: The proof is by induction on h. As a base case, take h=0. In that case, α = X, which has length one, which is equal to kh.
For the inductive case, take h≥1 and assume as an induction hypothesis that the lemma holds for all h' < h. Each of the children of the root node of T is the root node of a derivation tree of height at most h-1. By the induction hypothesis, the yield of each of those subtrees has length at most kh-1. There being at most k such subtrees, the yield of T as a whole has length bounded above by k·kh-1 = kh. ◼
Corollary: For every CFG G there is an ascending function MaxLenG: ℕ ⟶ ℕ such that, for all h≥0,
That is, MaxLenG(h) is the length of the longest sentential form that is the yield of any derivation tree having height h or less.
Let L be a CFL, and suppose that G is a CFG such that L = L(G). Let n be the number of nonterminal symbols in G and let m satisfy m ≥ MaxLenG(n+1), which is to say that any derivation tree whose yield has length m has height n+1 or more.
Now suppose that w ∈ L(G), with |w| ≥ m, and consider some derivation tree T corresponding to a derivation S ⟹* w. Then the height of T is at least n+1, which means that there is at least one root-to-leaf path in T on which there are at least n+2 nodes. Choose one such path, and consider the last n+2 nodes on that path, all but the last of which is labeled by a nonterminal symbol. Choose two among those nodes labeled by the same nonterminal symbol. (They must exist, because there are only n different nonterminals in G.) Let A be that nonterminal symbol, and call the two nodes Ancestor-A and Descendant-A.
S /|\ / | \ / | \ / | \ / | \ / A \ / /|\ \ / / | \ \ / / | \ \ / / A \ \ / / / \ \ \ / / / \ \ \ +-----+---+-----+---+-----+ u v x y z |
Then T must look like the tree depicted to the right. (Of course, the higher occurrence of 'A' marks the position of the Ancestor-A node, and the lower occurrence marks the position of the Descendant-A node.) What that primitive illustration is intended to convey is that w = uvxyz, where x is the yield of the subtree rooted at Descendant-A, vxy is the yield of the subtree rooted at Ancestor-A, and w = uvxyz is the yield of T as a whole.
Now consider that if we replaced within T the subtree rooted at Ancestor-A by the subtree rooted at Descendant-A, we would get the tree T0 shown below left, the yield of which is uxz (or, equivalently, uv0xy0z). (Notice that the v and y substrings of w are absent.)
But if we replaced within T the subtree rooted at Descendant-A by the subtree rooted at Ancestor-A, we would get the tree T2 shown below right, the yield of which is uv2xy2z.
S /|\ / | \ / | \ / | \ / | \ / A \ / / \ \ / / \ \ +-----+-----+-----+ u x z |
S /|\ / | \ / | \ / | \ / | \ / A \ / /|\ \ / / | \ \ / / | \ \ / / A \ \ / / /|\ \ \ / / / | \ \ \ / / / | \ \ \ / / / A \ \ \ / / / / \ \ \ \ / / / / \ \ \ \ +-----+---+---+-----+---+---+-----+ u v v x y y z |
T0 | T2 |
---|
If, in T2, we replace the subtree whose yield is x by the subtree whose yield is vxy, we get a derivation tree T3 whose yield is uv3xy3z. Generalizing, we can perform the same operation on any tree Ti (i≥1) (the yield of which is uvixyiz) to obtain tree Ti+1 having yield uvi+1xyi+1z.
We are almost ready to state the pumping lemma, but there are two small points to take care of.
One is that the length of vxy is at most m. That follows from these facts:
The second point is that we can assert that at least one among v and y is a nonempty string, which can be stated as |vy| > 0. Why? Because if both v and y were the empty string, it could only be because G admitted the derivation A ⟹+ A. Such a derivation can occur only if G includes either λ-productions or unit productions (of the form H → K, the right-hand side of which is a single nonterminal symbol). But we can assume, without loss of generality, that G has neither of those kinds of productions, because there is an algorithm by which any CFG can be transformed into a CFG without such productions and generating the same language, except for the loss of the empty string.
All the reasoning appearing above serves, in effect, as a proof of the following theorem:
Pumping Lemma for Context-free Languages: If G is a CFG, where L = L(G), then there is a positive integer m (that depends upon G) such that if w∈L and |w|≥m, then there exist terminal strings u, v, x, y, and z satisfying these conditions:
The manner in which to this lemma can be used to show that a language L is not context-free is as follows:
Example application of the Pumping Lemma:
The "canonical" non-context free language is
{anbncn | n ≥ 0}.
We prove it not to be a CFL using the Pumping Lemma, as follows:
Let m be the constant mentioned in the lemma, and choose
w = ambmcm.
Now we consider every possible way of factoring w into
u, v, x, y, and z that satisfies the first three conditions
listed in the lemma. For each of those ways, we show that
the fourth condition does not hold.
Case 1: Either v or y contains two different "kinds" of symbols (e.g., a and b, or b and c). Then pumping (i.e., taking i to be 2 or greater) results in a string in which one of the substrings ba or cb occurs. (For example, suppose that v = a4b9. Then v2 = a4b9a4b9, which has ba as a substring.) No such string is a member of L. Case 2: Each of v and y contains at most one kind of symbol. (Remember that one of them, but not both, can be the empty string.) But then pumping (i.e., taking i to be 2) increases the number of occurrences of at most two kinds of symbols (e.g., a and b) while leaving the number of occurrences of the third kind of symbol (e.g., c) the same. The resulting string is thus not in L. ◼ |