Parse Trees
Parse trees are trees labeled by symbols of a particular CFG. The leaves are labeled by terminal symbols, and the internal nodes are labeled by nonterminal symbols. The root is labeled by the start symbol of the grammar. Children are labeled by the right-hand side of a production. The concatenation of the labels of the leaves of a parse tree from left to right is the string generated by the tree, often called the yield of the tree.
Thoroem 1 Each leftmost (or rightmost) derivation of a string
We sometimes talk about tree that are not exactly parse trees, but trees with the root not labeled by the start symbol. Apparently, these trees are sub-trees of the parse trees, which is often called parse tree with the root
Ambiguity
A grammar is ambiguous if there exists a string that has more than one parse tree. (i.e. more than one leftmost (or rightmost) derivation). Ambiguity is a property of the grammar, not the language. This means a language may have two grammar where one of them is ambiguous and the other one is not. For example, the grammar:
is the grammar for the language of balanced parentheses. Nonetheless, the string
Though in each round we choose the leftmost nonterminal to replace, the two leftmost derivations are still different. The main reason is that we have multiple choices when replacing the nonterminal.
Grammar
We can rewrite the grammar for balanced parentheses to make it unambiguous as follows:
where
This grammar is called
Inherent Ambiguity
Some languages are inherently ambiguous, which means no unambiguous grammar can describe the language. For example, the language:
is inherently ambiguous. The proof is complicated, and we will not discuss it here. You can refer to this site for more details.
Normoral Form of CFG
As we examplified in the previous section, a language can have multiple grammars. In real-world applications, we often want to have a unique grammar for a language, which is easy to implement and analyze. We can convert a CFG to a normal form to achieve this goal. The Chomsky Normal Form (CNF) is a common normal form for CFGs.
Eliminate variables derive nothing
Some grammars have variables that derive nothing. We can remove these variables by iteratively removing variables that derive nothing. A variable
We can find that
On the other hand, those variables that cannot be reached from the start symbol are also useless. Thus, we can remove them as well.
Remove Order
We must remove variables in the right order:
- Remove variables that derive nothing. (All variables derive something)
- Remove variables that cannot be reached from the start symbol. (All variables are reachable)
If we remove variables in the wrong order, there may still be redundant variables.
Eliminate -productions
A nullable variable is a variable that can derive
We can easily find that
The key idea of eliminating
Eliminate unit productions
A unit production is a production where the right-hand side is a single variable. We can remove unit productions by iteratively replacing unit productions with the productions of the variable. This is a intuitive process, and we will not discuss too much here.
Cleaning up the CFG
We need to perform the following steps to clean up the CFG:
- Eliminate
-productions. - Eliminate unit productions.
- Eliminate variables that derive nothing (no terminal string).
- Eliminate variables that cannot be reached from the start symbol.
The main idea is that each step should not influence the result of preceding step.
CNF
A CFG is said to be in Chomsky Normal Form (CNF) if and only if each production of the CFG satisfies:
- The body contains only one terminal.
- The body contains only two nonterminals.
It can be proved that all CFLs have corresponding CFG in CNF.