Parse Trees

Parse trees are trees labeled by symbols of a particular CFG. The leaves are labeled by terminal symbols, and the internal nodes are labeled by nonterminal symbols. The root is labeled by the start symbol of the grammar. Children are labeled by the right-hand side of a production. The concatenation of the labels of the leaves of a parse tree from left to right is the string generated by the tree, often called the yield of the tree.

Thoroem 1 Each leftmost (or rightmost) derivation of a string $w$ corresponds to a unique parse tree for $w$ , the converse is also true.

We sometimes talk about tree that are not exactly parse trees, but trees with the root not labeled by the start symbol. Apparently, these trees are sub-trees of the parse trees, which is often called parse tree with the root $A$ .

Ambiguity

A grammar is ambiguous if there exists a string that has more than one parse tree. (i.e. more than one leftmost (or rightmost) derivation). Ambiguity is a property of the grammar, not the language. This means a language may have two grammar where one of them is ambiguous and the other one is not. For example, the grammar:

\begin{aligned} E & \to E E ∣ (E) ∣ () \end{aligned}

is the grammar for the language of balanced parentheses. Nonetheless, the string $() () ()$ has two leftmost derivations:

\begin{aligned} E & \Rightarrow_{l m} E E \Rightarrow_{l m} () E \Rightarrow_{l m} () E E \Rightarrow_{l m} () () E \Rightarrow_{l m} () () () \\ E & \Rightarrow_{l m} E E \Rightarrow_{l m} E E E \Rightarrow_{l m} () E E \Rightarrow_{l m} () () E \Rightarrow_{l m} () () () \end{aligned}

Though in each round we choose the leftmost nonterminal to replace, the two leftmost derivations are still different. The main reason is that we have multiple choices when replacing the nonterminal.

$L L (1)$ Grammar

We can rewrite the grammar for balanced parentheses to make it unambiguous as follows:

\begin{aligned} B & \to (R B ∣ ε \\ R & \to) ∣ (R R \end{aligned}

where $B$ is the start symbol. We only have one leftmost derivation for the string $() () ()$ :

\begin{aligned} B & \Rightarrow_{l m} (R B \Rightarrow_{l m} () B \Rightarrow_{l m} () (R B \Rightarrow_{l m} () () B \Rightarrow_{l m} () () (R B \Rightarrow_{l m} () () () B \Rightarrow_{l m} () () () \end{aligned}

This grammar is called $L L (1)$ grammar, which means we can determine which production to use in leftmost derivation by peeking merely one symbol of lookahead.

Inherent Ambiguity

Some languages are inherently ambiguous, which means no unambiguous grammar can describe the language. For example, the language:

{a^{i} b^{j} c^{k} ∣ i = j \lor j = k is true}

is inherently ambiguous. The proof is complicated, and we will not discuss it here. You can refer to this site for more details.

Normoral Form of CFG

As we examplified in the previous section, a language can have multiple grammars. In real-world applications, we often want to have a unique grammar for a language, which is easy to implement and analyze. We can convert a CFG to a normal form to achieve this goal. The Chomsky Normal Form (CNF) is a common normal form for CFGs.

Eliminate variables derive nothing

Some grammars have variables that derive nothing. We can remove these variables by iteratively removing variables that derive nothing. A variable $A$ derives nothing if there is no derivation $A \Rightarrow^{*} w$ where $w$ is a string of terminal symbols. For example, the grammar:

\begin{aligned} S & \to A B ∣ A \\ A & \to a A ∣ a \\ B & \to b B \end{aligned}

We can find that $B$ and $S$ derive nothing. After removing these variables, we get the grammar:

\begin{aligned} S & \to A \\ A & \to a A ∣ a \end{aligned}

On the other hand, those variables that cannot be reached from the start symbol are also useless. Thus, we can remove them as well.

Remove Order

We must remove variables in the right order:

Remove variables that derive nothing. (All variables derive something)
Remove variables that cannot be reached from the start symbol. (All variables are reachable)

If we remove variables in the wrong order, there may still be redundant variables.

Eliminate $ε$ -productions

A nullable variable is a variable that can derive $ε$ . We can find all nullable variables by iteratively finding variables that derive only nullable variables. For example, the grammar:

\begin{aligned} S & \to A B \\ A & \to a A ∣ ε \\ B & \to b B ∣ A \end{aligned}

We can easily find that $A$ is nullable. Because $B \to A$ , $B$ is also nullable. Then, $S$ is nullable for $S \to A B$ .

The key idea of eliminating $ε$ -productions is to replace each production deriving multiple nullable variables with a family of productions. For example, we can modify the grammar above to:

\begin{aligned} S & \to A B ∣ B \\ A & \to a A ∣ a \\ B & \to b B ∣ A \end{aligned}

Eliminate unit productions

A unit production is a production where the right-hand side is a single variable. We can remove unit productions by iteratively replacing unit productions with the productions of the variable. This is a intuitive process, and we will not discuss too much here.

Cleaning up the CFG

We need to perform the following steps to clean up the CFG:

Eliminate $ε$ -productions.
Eliminate unit productions.
Eliminate variables that derive nothing (no terminal string).
Eliminate variables that cannot be reached from the start symbol.

The main idea is that each step should not influence the result of preceding step.

CNF

A CFG is said to be in Chomsky Normal Form (CNF) if and only if each production of the CFG satisfies:

The body contains only one terminal.
The body contains only two nonterminals.

It can be proved that all CFLs have corresponding CFG in CNF.

需求分析

体系结构

详细设计

构造测试

交付演化

补充内容

交互设计的评估

基于总线的计算机系统

HarmonyOS 开发

并发

数理逻辑

Parse Trees

Ambiguity

$L L (1)$ Grammar

Inherent Ambiguity