Introduction
This study guide is a summary of the material covered in lecture thus far. I've estimated where we would end up by the conclusion of Wednesday's lecture. Material we haven't covered by the end of Wednesday's lecture (and the readings that correspond to it) will not be included on the Midterm, even if it's listed below. (This is the hazard presented by posting a study guide early, but I thought you'd rather have an estimated version earlier than a perfect version later.)
While I don't intend to include questions on the Midterm testing tiny details from the textbook that were not covered in lecture, you are responsible for the assigned readings. Broad ideas and important points from the readings are fair game for the exam, even if we didn't cover them in lecture. I haven't included this material in the study guide. However, there are a few things that I've specifically listed below as off-limits for the exam (e.g. the DFA minimization algorithm), which will slightly reduce the amount of reading material you'll need to study.
The study guide
The big picture: What does a compiler do?
- A compiler is a program that takes a program in some source language as its input, and emits an equivalent program in some target language as output.
- The notion of an equivalent program needs to be defined. An equivalent program need not perform the same specific set of operations in the same order as the original, but it must produce the same observable result.
- In general, we'd like our compilers to make improvements to our programs, so long as the observable result is the same. Such improvements are generally called optimizations.
- Of course, we need to think very carefully as compiler writers about what constitutes an "observable result."
- A compiler does a pretty big job. Like any large piece of software, it makes sense to separate concerns, splitting the program into smaller pieces.
- For many years, a typical design strategy was to split the compiler into two pieces: a front end and a back end.
- The front end focused on validating the source program and understanding its meaning.
- The back end concerned itself with generating an equivalent program in the target language.
- The output of the front end was a simplified version of the source program, often called an intermediate representation (IR).
- The name intermediate representation arises because it's neither the source nor the target program, and is used only within the compiler.
- The back end assumes that the IR is well-formed, since it was built by the front end, rather than a fallible human programmer.
- This approach has a couple of clear advantages:
- Separation of concerns. The jobs performed by the front and back ends are largely distinct, so it makes good software engineering sense to separate them.
- Reuse. By building a library of front ends that build the same kind of IR, and a library of back ends that emit target programs for various architectures using that same kind of IR, one can easily build a large collection of compilers by simply connecting different front ends to different back ends.
- A more recent overall design strategy includes a third component, an optimizer. Optimizers take as input a program in some kind of IR and generate as output a "better" version of the program in some kind of IR (either the same kind or a different kind).
- What constitutes "better" is a matter of some debate. There are many attributes of the program we could optimize: execution speed, code size, and power consumption are three examples.
- Not surprisingly, passing IR's between the phases allows you to plug together a front end, an optimizer (or a sequence of optimizers), and a back end to yield a compiler.
- Along the way, a compiler may use some global data structures, such as symbol tables, to record information about the program that is determined at different phases.
Desirable properties of a compiler
- Ideally, we'd like compilers to exhibit certain desirable properties.
- Execution speed of compiled code. If necessary, we'd like our compiler to be able to generate the most speed-efficient code possible.
- Size of compiled code. We'd also like our compiler to be able to generate the most space-efficient code possible. Oftentimes, this represents a tradeoff, whereby you give up execution speed for size and vice versa. Ideally, programmers would be able to specify which is more important and instruct the compiler to trade one for the other.
- Error reporting. We'd like our compiler to emit understandable error messages when an error is encountered in the source program. (This is easier said than done.)
- Support for debugging. Compilers also play a role in supporting debugging. To use a symbolic or graphical debugger on, say, a compiled C++ program, the compiler must be instructed to include information such as variable names and function signatures in the generated code, information that is normally left out of a compiled C++ program.
- Execution speed of compiler. All other things being equal, a fast compiler is better than a slow one. Of course, we're often willing to use a slower compiler if it gives us other, more desirable benefits.
More details about the front end of a compiler
- The front end performs the following tasks:
- Reads the source program from some input source such as a file.
- Validates the source program, ensuring that it's a legal program in the programming language expected.
- Comes to an understanding of the meaning of the source program.
- Conveys that meaning (as concisely as possible) by building an intermediate representation and passing it to later stages of compilation.
- Extracting and conveying the meaning of the source program is a big job. It makes sense to divide it into smaller ones.
- The first stage is called scanning, in which the compiler takes the program, which is a stream of characters, and turns it into a stream of meaningful patterns called tokens. The rest of the compiler, then, does not need to concern itself with every character of the input program, but instead deals with larger abstractions such as identifiers and string literals.
- The next stage is called parsing, in which the compiler takes that stream of tokens and discovers whether these tokens form a syntactically legal program. In this stage, only the syntax (the form of the program) is checked. The semantics (or meaning) of the program is checked later.
- The third stage is called semantic analysis or context-sensitive analysis. In this stage, the meaning of the program is determined, and semantic errors, such as using undeclared variables, are detected.
Scanning
- A scanner's job is to recognize patterns in the input program and use those patterns to group characters together into tokens or words.
- A scanner is unconcerned with whether the sequence of tokens makes sense. It is only concerned with recognizing legal tokens.
- One way to solve this problem is to build a hand-coded scanner, in which you write a sequence of conditional statements and loops to detect these patterns.
- While plausible for a very simple language, this can become very tedious in a big hurry.
- Some theory helps us to solve the problem in a better way.
- A deterministic finite automaton (DFA) is a five-tuple (S, Σ, δ, s0, SF):
- S is a finite set of states
- Σ is the alphabet (the character set) used by the DFA
- δ : S × Σ → S is the transition function. The value δ(s, c) indicates the state that the DFA should move to if it is in state s and sees the character c on the input.
- s0 ∈ S is the start state
- SF ⊆ S is the set of final states
- A DFA begins in its start state, consuming each input symbol in order, and using the δ function to determine which state to move to. If, after consuming all the input, the DFA is in a final state, the input is accepted. If the DFA is not in a final state, the input is rejected.
- We say that the set of strings accepted by a DFA D is the language of D, denoted L(D).
- Implementing a DFA in a computer program turns out to be relatively straightforward:
- Declare a variable, say an integer, to keep track of the current state.
- Initialize that variable to the start state.
- Process the input characters in a loop, setting the state variable to the new state after each character.
- To easily look up the value of the δ function, store it in a two-dimensional array, with states as the rows and input characters (or groups of input characters) as the columns.
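- To make this concrete, here is a minimal table-driven DFA in Java (my own sketch, not code from lecture). It implements a hypothetical DFA for identifiers consisting of a letter followed by letters or digits, with states as the rows of the table and character groups (letter, digit, other) as the columns:
-
// States: 0 = start, 1 = have seen a legal identifier so far (final), 2 = dead state.
// Columns: 0 = letter, 1 = digit, 2 = anything else.
public class IdentifierDFA {
    private static final int[][] DELTA = {
        { 1, 2, 2 },   // state 0: a letter moves to state 1; anything else is dead
        { 1, 1, 2 },   // state 1: letters and digits stay in state 1
        { 2, 2, 2 },   // state 2: the dead state, which can never be escaped
    };
    private static final boolean[] IS_FINAL = { false, true, false };

    private static int column(char c) {
        if (Character.isLetter(c)) return 0;
        if (Character.isDigit(c))  return 1;
        return 2;
    }

    public static boolean accepts(String input) {
        int state = 0;                                        // begin in the start state
        for (int i = 0; i < input.length(); i++) {
            state = DELTA[state][column(input.charAt(i))];    // one table lookup per character
        }
        return IS_FINAL[state];                               // accept iff we end in a final state
    }
}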
- In trying to use DFA's directly to build a scanner, we realize two things:
- Implementing a DFA in this way is tedious. Specifying the δ function in a two-dimensional array will be painful. A tool would help.
- Even with a tool that would take the definition of a DFA as input and spit out code for it as output, there's a problem. There's quite a semantic distance between the problem we're trying to solve -- specifying the patterns we're interested in recognizing -- and the formalism we're using to solve it.
- Ideally, we'd rather specify the patterns in some more intuitive notation, and let a tool convert it to a DFA and spit out the code for it.
Regular expressions
- A regular expression defines a language over some alphabet Σ.
- ε is a regular expression denoting the set containing only the empty string.
- A character c ∈ Σ is, itself, a regular expression that denotes the language containing only the one-character string c.
- The concatenation of two regular expressions R and S, denoted RS, is a regular expression. The resulting regular expression describes a language of all strings that consist of a string in L(R) followed by a string in L(S).
- The alternation of two regular expressions R and S, denoted R | S, is a regular expression. The resulting regular expression describes a language of all strings that are in L(R) ∪ L(S).
- The Kleene closure of a regular expression R, denoted R*, is a regular expression. The resulting regular expression describes a language of all strings that consist of zero or more occurrences of strings in L(R) concatenated together.
- To simplify the notation, we specify operator precedences as follows:
- Kleene closure has the highest precedence
- Concatenation has the next-highest precedence
- Alternation has the lowest precedence
- Parentheses are used to override precedence, in much the same way as we use them in mathematics and programming languages.
- While the above is a complete definition of regular expressions in a mathematical sense, we use a few shorthands sometimes, too:
- We use the notation R+ to denote a language of all strings that consist of one or more occurrences of strings in L(R) concatenated together. (In other words, RR*.)
- We use the notation R? to denote a language of all strings that consist of zero or one occurrences of strings in L(R). (In other words, ε | R.)
- Character classes are shorthands for a set of characters. For example, the character class [0-9] denotes one occurrence of a character that is either 0, 1, 2, ..., 9. Naturally, you can combine character classes with other operators to yield regular expressions like [0-9]+, which denotes sequences of one or more digits.
- Examples:
- Two regular expressions, each of which recognizes any one of the words "public", "protected", and "private":
- public | protected | private
- p(ublic | r(otected | ivate))
- A regular expression that recognizes integers without leading zeroes:
- 0 | [1-9][0-9]*
- A regular expression that recognizes real-number constants for a hypothetical programming language (including scientific notation):
- [0-9]* . [0-9]+ ((e | E) [0-9]+)?
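- The same pattern can also be written in the notation of a practical regular expression library. For example, here is a rough translation into Java's java.util.regex (my own illustration, not something covered in lecture); note that the literal decimal point must be escaped and that [eE] replaces (e | E):
-
import java.util.regex.Pattern;

public class RealLiteralCheck {
    // [0-9]* . [0-9]+ ((e | E) [0-9]+)?  translated into java.util.regex notation
    private static final Pattern REAL = Pattern.compile("[0-9]*\\.[0-9]+([eE][0-9]+)?");

    public static void main(String[] args) {
        System.out.println(REAL.matcher("3.14").matches());   // true
        System.out.println(REAL.matcher(".5e10").matches());  // true
        System.out.println(REAL.matcher("42").matches());     // false: no decimal point
    }
}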
- Regular expressions are a much more intuitive notation for humans to use to specify patterns. Unfortunately, they make a lousy implementation strategy.
- Fortunately, some theory can help. Regular expressions can be converted to DFA's mechanically, which implies that a tool could do the following:
- Take a regular expression as input.
- Convert the regular expression to a DFA.
- Emit code that implements the DFA.
Converting regular expressions to DFA's
- A useful theoretical result is that regular expressions and DFA's describe the same set of languages.
- In other words, any language that can be described by a regular expression can be recognized by a DFA and vice versa.
- The algorithm given in the book for converting a DFA to a regular expression will not be covered on the exam.
- This implies that it must be possible to convert a regular expression to a DFA. Unfortunately, it needs to be done in multiple steps.
- The first step is to convert the regular expression to an intermediate form called a non-deterministic finite automaton or NFA. An NFA is just like a DFA, with two differences:
- It can specify more than one transition from some state on a given input character.
- It can specify ε-transitions, which can be followed without consuming an input character.
- Perhaps surprisingly, even with these differences, NFA's are not theoretically more powerful than DFA's.
- In other words, any language that can be recognized by an NFA can also be recognized by a DFA.
- This implies that any NFA can be converted to a DFA.
- An NFA processes an input string differently from a DFA.
- A DFA is in exactly one state at any given time. It begins in the start state. Upon consuming each input character, it moves to exactly one state.
- An NFA is in a set of states at any given time.
- It begins in the ε-closure of the start state, which is the start state along with any states that can be reached from the start state by taking only ε-transitions.
- Upon consuming each input character, it moves to the ε-closure of all the states it reaches by following transitions on that character from all the states it was in.
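- As a sketch of how this can be computed (my own illustration, not from lecture), the ε-closure is a small worklist algorithm; here the NFA is assumed to be represented by a hypothetical map epsilonMoves from each state to the states reachable from it by a single ε-transition:
-
import java.util.*;

public class EpsilonClosure {
    // epsilonMoves.get(s) lists the states reachable from state s by one ε-transition.
    public static Set<Integer> epsilonClosure(Set<Integer> states,
                                              Map<Integer, List<Integer>> epsilonMoves) {
        Set<Integer> closure = new HashSet<>(states);
        Deque<Integer> worklist = new ArrayDeque<>(states);
        while (!worklist.isEmpty()) {
            int s = worklist.pop();
            for (int t : epsilonMoves.getOrDefault(s, List.of())) {
                if (closure.add(t)) {     // add() returns true only if t was not already present
                    worklist.push(t);
                }
            }
        }
        return closure;
    }
}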
- An NFA can be constructed from a regular expression mechanically, using a construction called Thompson's construction.
- It should be noted that there are other constructions that can be used to construct an NFA from a regular expression.
- Thompson's construction is used in automated tools because it can be implemented more efficiently than others.
- The principle it is built upon is that every NFA it constructs (including all of the intermediate ones) has a single start state and a single final state, with no transitions entering the start state and no transitions leaving the final state.
- Thompson's construction proceeds by taking a regular expression and applying operators to it in precedence order.
- Applying each operator requires taking the NFA's for its operands and pasting them together into one NFA using a template.
- The templates are shown in the textbook and were shown in lecture.
- After applying the templates for all of the operators in the regular expression, the resulting NFA recognizes the same language described by the regular expression.
- An NFA, unfortunately, is not a very good implementation tool. No problem! We'll convert it to a DFA, using the subset construction.
- The subset construction builds a DFA that simulates the behavior of the NFA.
- Each state in the DFA corresponds to a subset of the states in the NFA. (The total number of states in the DFA could be as large as the size of the power set of the states in the NFA, but in practice it turns out to be much smaller, since most of these subsets cannot be reached.)
- For every state s in the DFA, there is a transition on an input character to the state s' if the NFA would go from the subset of its states corresponding to s to the subset of its states corresponding to s' on that input character.
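- Combining this idea with the ε-closure sketch above, here is a rough Java sketch of the subset construction (again my own formulation; the maps used to represent the NFA are hypothetical):
-
import java.util.*;

public class SubsetConstruction {
    // NFA representation (hypothetical): moves.get(s).get(c) lists the states reachable
    // from s on character c; epsilonMoves.get(s) lists the states reachable from s on ε.
    // The result maps each DFA state (a set of NFA states) to its outgoing transitions.
    public static Map<Set<Integer>, Map<Character, Set<Integer>>> build(
            int nfaStart,
            Set<Character> alphabet,
            Map<Integer, Map<Character, List<Integer>>> moves,
            Map<Integer, List<Integer>> epsilonMoves) {

        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
        Set<Integer> start = closure(Set.of(nfaStart), epsilonMoves);
        Deque<Set<Integer>> worklist = new ArrayDeque<>();
        worklist.push(start);

        while (!worklist.isEmpty()) {
            Set<Integer> current = worklist.pop();
            if (dfa.containsKey(current)) continue;           // this DFA state is already done
            Map<Character, Set<Integer>> row = new HashMap<>();
            dfa.put(current, row);
            for (char c : alphabet) {
                Set<Integer> next = new HashSet<>();
                for (int s : current) {                       // follow ordinary transitions on c...
                    next.addAll(moves.getOrDefault(s, Map.of()).getOrDefault(c, List.of()));
                }
                next = closure(next, epsilonMoves);           // ...then take the ε-closure
                if (!next.isEmpty()) {
                    row.put(c, next);
                    worklist.push(next);
                }
            }
        }
        return dfa;
    }

    private static Set<Integer> closure(Set<Integer> states,
                                        Map<Integer, List<Integer>> epsilonMoves) {
        Set<Integer> result = new HashSet<>(states);
        Deque<Integer> worklist = new ArrayDeque<>(states);
        while (!worklist.isEmpty()) {
            int s = worklist.pop();
            for (int t : epsilonMoves.getOrDefault(s, List.of())) {
                if (result.add(t)) worklist.push(t);
            }
        }
        return result;
    }
}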
- After constructing a DFA using the subset construction, we can run a DFA minimization algorithm to minimize the number of states in the DFA.
- This algorithm was not covered in lecture and will not be covered on the exam.
Constructing an automated scanner
- A DFA takes an input string and, ultimately, says "accept" or "reject" as its output.
- We need a tool that takes an input program, a stream of characters, and returns a stream of tokens. In other words, we need a DFA to recognize many words, not just one.
- Using an automated tool such as JFlex, you specify a list of regular expressions that specify all the patterns you'd like your scanner to recognize.
- The tool will do a few things:
- Use alternation ('|') to build one regular expression out of all of your patterns.
- Convert the regular expression to an NFA using Thompson's construction.
- Convert the NFA to a DFA using the subset construction.
- Minimize the number of DFA states using a minimization algorithm.
- Emit code that recognizes patterns using the DFA.
- One approach would involve using the DFA as a recognizer. Whenever a final state was reached in the DFA, we could consider the consumed input to be a token, return that token, then go back to the start state. There's a problem with this approach:
- Due to overlapping of patterns, this would result in some very odd behavior. For example, the word fork in a Java program should be considered an identifier. But our pattern matcher would recognize it as the keyword for followed by the identifier k.
- To solve problems like this, automated scanners search for the longest possible pattern instead, using the following approach:
- Continue running the DFA and consuming input until it reaches a situation in which there is no outgoing transition on the current character.
- If, at that point, the DFA is in a final state, you've recognized the pattern represented by that final state.
- If not, backtrack through the input you consumed until you encounter a final state, and that's your pattern instead.
- If you were never in a final state, the input does not begin with a lexeme, and an error will be reported.
- Since more than one pattern will often be matched by the same sequence of input, an automated tool will generally disambiguate by choosing the first pattern listed in the input script.
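- A hand-written version of this longest-match loop might look roughly like the following Java sketch (my own illustration; the delta transition table and tokenKind array are hypothetical and would normally be generated by a tool, with ties between patterns already resolved in favor of the rule listed first):
-
public final class MaximalMunch {
    // delta[state][c] is the next state on character c (columns indexed by character code
    // here, for simplicity), or -1 if there is no outgoing transition. tokenKind[state]
    // names the pattern recognized in a final state, and is null for non-final states.
    public static Token nextToken(String input, int start,
                                  int[][] delta, String[] tokenKind) {
        int state = 0;                  // the DFA's start state
        int pos = start;
        int lastFinalPos = -1;          // end of the longest match seen so far
        String lastFinalKind = null;

        while (pos < input.length()) {
            int next = delta[state][input.charAt(pos)];
            if (next < 0) break;        // no outgoing transition on this character: stop
            state = next;
            pos++;
            if (tokenKind[state] != null) {        // passed through a final state,
                lastFinalPos = pos;                // so remember it and its pattern
                lastFinalKind = tokenKind[state];
            }
        }
        if (lastFinalPos < 0) {
            throw new IllegalArgumentException("no lexeme at position " + start);
        }
        // "Backtrack" to the most recent final state and return that lexeme as the token.
        return new Token(lastFinalKind, input.substring(start, lastFinalPos));
    }

    public record Token(String kind, String lexeme) {}
}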
Will you be covering JFlex on the exam?
- Intimate details of JFlex will not be covered, but I do expect you to understand the underlying theory discussed above.
Parsing
- The job of a parser is to determine if the sequence of tokens recognized by the scanner can be combined to form a syntactically legal input program.
- Its output, theoretically speaking, is a parse tree for the input program.
- In practice, a boiled-down version of the parse tree, or a flat intermediate representation, is emitted instead.
- A context-free grammar is a formalism that is suitable for specifying the syntax of a programming language. It consists of:
- A set of terminal symbols, which correspond to the tokens that can be recognized by the scanner.
- A set of nonterminal symbols, which are abstractions for groupings of tokens that are considered legal in an input program.
- A start symbol, which is a nonterminal symbol that is considered to be an abstraction for an entire input program.
- A set of rules (sometimes called productions) that explain how nonterminal symbols can be replaced legally with a sequence of nonterminal and/or terminal symbols.
- A grammar describes a (potentially infinite) set of syntactically legal input programs. This set is called the language of the grammar.
- An example grammar for identifiers surrounded by a matched set of parentheses:
- S → '(' S ')' | identifier
- There are two ways to demonstrate that an input program is in the language described by a grammar.
- One is to draw a parse tree, with the start symbol as the root node, terminal symbols as leaves, and, for each node containing a nonterminal symbol, child nodes corresponding to the symbols on the right-hand side of one of that nonterminal's rules.
- Another is to write a derivation. A derivation corresponds to some parse tree. Examples:
- S ⇒ ( S ) ⇒ ( identifier )
- S ⇒ ( S ) ⇒ (( S )) ⇒ (( identifier ))
- Compiler writers are primarily interested in two kinds of derivations:
- leftmost derivations. Derivations in which the leftmost nonterminal is replaced at each step.
- rightmost derivations. Derivations in which the rightmost nonterminal is replaced at each step.
- A parser's job, then, is to discover either a parse tree or a derivation that indicates that the input program is in the language of the grammar.
- If there is some input program for which there is more than one parse tree (or more than one leftmost or rightmost derivation), the grammar is said to be ambiguous.
- Since a compiler typically infers the program's meaning from the structure of the parse tree, ambiguity is generally considered to be a very bad thing in the grammar of a programming language.
- There isn't one way to resolve ambiguity in a context-free grammar, though there are techniques that help. One such technique is the stratification technique, which is used to establish precedence and associativity of operators.
- Starting with this grammar for arbitrarily nested expressions:
- E → E + E | E - E | E * E | E / E | ( E ) | identifier
- We can transform it to this:
- E → E + E2 | E - E2 | E2
- E2 → E2 * E3 | E2 / E3 | E3
- E3 → ( E ) | identifier
- We often call this grammar the classic expression grammar.
- The resulting grammar establishes that:
- + and - have the lowest precedence and are left-associative
- * and / have the next-highest precedence and are left-associative
- parenthesized expressions have a higher precedence than those that are not parenthesized
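- For example, in this grammar the input identifier + identifier * identifier has exactly one leftmost derivation, and in the corresponding parse tree the * is grouped below (i.e. bound more tightly than) the +:
- E ⇒ E + E2 ⇒ E2 + E2 ⇒ E3 + E2 ⇒ identifier + E2 ⇒ identifier + E2 * E3 ⇒ identifier + E3 * E3 ⇒ identifier + identifier * E3 ⇒ identifier + identifier * identifier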
Top-down parsing
- A top-down parser begins with a parse tree whose root contains the start symbol.
- At every step, it attempts to expand one of the nonterminals at the lower fringe of the tree (i.e. a nonterminal in a leaf node) by one of its rules.
- To expand the nonterminal, the symbols on the right-hand side of the selected rule are added as children of the node containing that nonterminal.
- A top-down parser, theoretically, could expand the nonterminals in any order. In practice, however, top-down parsing algorithms are constrained by a couple of desirable properties:
- We'd like to be able to parse the input in the natural order that we'd read and scan the input file: left-to-right.
- We'd like to parse the input as efficiently as possible, ideally making the "right decision" at every step and only looking at each input token once.
- In order to solve the first problem, parsing left-to-right, we could start with this algorithm:
-
make new parse tree, with start symbol in root
node = root of parse tree
loop
{
if node contains a terminal t
if next input token is t
advance node to next node on lower fringe of tree
else
backtrack
else if node contains a nonterminal nt
pick a rule nt → β
extend tree using that rule
node = leftmost symbol in β
}
- This algorithm works, theoretically speaking. It is capable of finding a parse tree for the input program (provided that it backtracks through all possible choices of rules until it finds one that works).
- However, as a practical matter, this algorithm is terribly inefficient. It essentially amounts to an exhaustive search of all possible parse trees (or derivations), looking for the right one!
- To solve the second problem, we need to find an algorithm that allows us, when looking at a nonterminal nt, always to choose the right rule to expand by.
- In order to do this, we need some support from the grammar. The grammar needs to be constructed in such a way that we can always make the right choice by looking only at the next token of input.
Recursive descent parsing
- This topic is discussed in a great deal of detail in Assignment #2 and also in the textbook.
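- To give a flavor of the technique, here is a minimal recursive descent parser in Java for the earlier grammar S → '(' S ')' | identifier (my own sketch; the token-handling helpers are hypothetical and are not the interfaces used in Assignment #2):
-
import java.util.List;

public class ParenParser {
    private final List<String> tokens;   // e.g. ["(", "(", "identifier", ")", ")"]
    private int pos = 0;

    public ParenParser(List<String> tokens) { this.tokens = tokens; }

    public void parse() {
        parseS();
        expect("eof");                   // the whole input must be consumed
    }

    // One method per nonterminal. The rule for S is chosen by looking only at the
    // next token: "(" selects S → ( S ), and "identifier" selects S → identifier.
    private void parseS() {
        if (peek().equals("(")) {
            expect("(");
            parseS();
            expect(")");
        } else {
            expect("identifier");
        }
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : "eof"; }

    private void expect(String kind) {
        if (!peek().equals(kind)) {
            throw new IllegalStateException("expected " + kind + " but saw " + peek());
        }
        pos++;
    }

    public static void main(String[] args) {
        new ParenParser(List.of("(", "(", "identifier", ")", ")")).parse();  // legal input: no exception
    }
}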
Table-driven LL(1) parsing
- Recursive descent parsing is one form of top-down LL(1) parsing.
- While it is straightforward to code by hand, it has a problem: the size of the code is directly proportional to the size of the grammar.
- Yet the code all follows a very regular, predictable pattern, directly encoding information from the grammar and its FIRST+/FOLLOW sets.
- If we were to restructure the program somewhat, we might be able to write code for the recurring pattern once, then encode the rules for expanding nonterminals into a table.
- A table-driven LL(1) parser does just that!
- It is built around a parsing table, with nonterminal symbols labeling the rows and terminal symbols (and eof) labeling the columns.
- Intuitively, the cell Table[X, y] contains an indication of the rule that should be used to expand X if the next input symbol is y.
- The rules are generally numbered consecutively, and these numbers are stored in the table.
- More precisely:
- Table[X, y] = n if n is the number of a rule X → β and y ∈ FIRST+(β).
- If the rule X → ε is also in the grammar, Table[X, y] should be the number of this ε-rule for all symbols in FOLLOW(X).
- It should be reiterated that this is exactly the same information that is encoded into the structure of the code of a recursive descent parser. The information is encoded into the table, so that the code size is not affected by the size of the grammar. (Obviously, the size of the table is, but the table is much more compact than the code was.)
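- As a tiny example (mine, not from the textbook), number the rules of the earlier parentheses grammar 1: S → ( S ) and 2: S → identifier. Since ( is the only token that can begin rule 1 and identifier is the only token that can begin rule 2, the table has a single row:
- Table[S, (] = 1, Table[S, identifier] = 2, and every other entry (e.g. Table[S, )] and Table[S, eof]) is an error.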
- Once you have the table, the algorithm works like this:
-
push eof
push start symbol
loop
{
if (top of stack is eof and next token is eof)
ACCEPT!
else if (top of stack is terminal or eof)
if (top of stack is same as next token)
pop terminal from stack
consume next token
else
ERROR!
else if (Table[top of stack, next token] is a rule A → β1 β2 ... βk)
pop A from stack
push βk ... β1
else
ERROR!
}
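- Tracing the algorithm on the input ( identifier ), using the little table above (the stack is written with its top on the left):
-
Stack: S eof              next token: (           Table[S, (] = 1: pop S, push ( S )
Stack: ( S ) eof          next token: (           terminal matches: pop (, consume (
Stack: S ) eof            next token: identifier  Table[S, identifier] = 2: pop S, push identifier
Stack: identifier ) eof   next token: identifier  terminal matches: pop, consume
Stack: ) eof              next token: )           terminal matches: pop, consume
Stack: eof                next token: eof         ACCEPT!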
- Naturally, you wouldn't want to hand-code one of these table-driven LL(1) parsers, but you can certainly imagine a tool that would take an LL(1) grammar as input and emit a table-driven LL(1) parser as output.
Bottom-up parsing
- Top-down parsers, specifically recursive descent or table-driven LL(1) parsers, are simple and efficient.
- On the other hand, implementing them may require a number of transformations (e.g. left factoring, left recursion elimination) on the grammar of the programming language...
- ...making a formerly intuitive, clear grammar into a much clunkier one.
- Particularly if we're using tools to generate our parser, in which the grammar is actually part of the source code of our compiler, keeping the grammar readable is important.
- Also, although most programming language constructs can be expressed using LL(1) grammars, some can't.
- For these reasons, we need bottom-up parsers.
- A bottom-up parser begins with the input program as leaves in a parse tree, and the start symbol in the root.
- It endeavors, at every step, to combine a sequence of orphaned nodes (i.e. those with no parents) that represent the right-hand side of some grammar rule together, putting the nonterminal on the left-hand side of the rule above them.
- The process terminates when either there are no more sequences of orphaned nodes that can be combined in this fashion, or when the entire input program has been connected into one tree with the start symbol at the root.
Shift-reduce parsing
- A typical bottom-up parser uses a shift-reduce technique. A shift-reduce parser is one that is built around a parser stack.
- The parser stack consists of (at least) nonterminal and terminal symbols.
- At every step, the parser makes a decision about whether to do one of two things:
- Shift. Push the next token of input on to the stack.
- Reduce. Pop k symbols off the top of the stack and replace them with the nonterminal on the left-hand side of a grammar rule whose right-hand side is those k symbols. (In other words, apply one of the grammar rules in reverse.)
- The tricky part is knowing whether to shift or reduce. If a sequence of symbols β on the top of the stack is supposed to be reduced by the rule A → β, we say that β is a handle.
- By "supposed to," I mean that if the correct move to make is to reduce by the rule A → β, we call the top symbols on the stack β a handle.
- With this in mind, here's a simple shift-reduce parsing algorithm:
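-
push eof
loop
{
    if (the stack contains only eof and the start symbol, and next token is eof)
        ACCEPT!
    else if (there is a handle A → β on top of the stack)
        pop |β| symbols                     (reduce)
        push A
    else if (next token is not eof)
        push next token on to the stack     (shift)
        consume next token
    else
        ERROR!
}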
- The problem with this algorithm, of course, is that it's left the most important part -- determining if there is a handle on top of the stack -- as voodoo magic.
- We need an algorithm for finding handles.
- If we can develop such an algorithm, particularly one that can make the decision in constant time at each step, the entire parse will be linear with respect to s + r, where s is the total number of shifts and r is the total number of reductions.
LR(1) parsing
- An LR(1) parser is a bottom-up, shift-reduce parser that uses one such mechanism for finding handles.
- At every step, the LR(1) parser is considered to be in one out of a finite set of states (somewhat like a DFA). It encodes the knowledge of how to find handles into two tables:
- An Action table, with the rows labeled with parser states and the columns labeled with terminal symbols (and eof). Each entry in the table indicates one of four actions:
- shift s', meaning that the parser should shift the next token of input and move to parser state s'.
- reduce A → β, meaning that the parser should reduce by the rule A → β. (In a typical LR(1) parser, all rules are numbered consecutively and uniquely, so a reduce action might be "reduce 3", meaning that the reduction should be done using rule #3 in the grammar.)
- accept, meaning that the parser should accept the input program as being legal.
- error, meaning that the parser has determined that there is an error in the input program.
- A Goto table, with the rows labeled with parser states and the columns labeled with nonterminal symbols. Each entry in the table is a parser state, and means that if the parser was in state s and then recognized an A, it should proceed to state Goto[s, A].
- Given these tables, here is the LR parsing algorithm:
-
push start state s0
loop
{
if Action[top of stack, next token] is shift s1
push next token
push s1
else if Action[top of stack, next token] is reduce A → β
pop 2|β| symbols
s = top of stack
push A
push Goto[s, A]
else if Action[top of stack, next token] is accept
ACCEPT!
else
ERROR!
}
- An example of this algorithm in action appears in the textbook (and was covered in lecture).
- Of course, this algorithm only works if the Action and Goto tables have been constructed.
- To construct these tables, use the algorithm discussed in Section 3.5 of the textbook.
- Fortunately, there are tools (e.g. CUP) that, given a grammar and a set of associated actions, will generate an LR parser for you, crunching through the details of the LR(1) table construction (or a related construction, such as LALR). After each reduction, the associated action will be taken.
A subset of the following topics will be covered in lecture on Wednesday, February 11. These topics will be covered on the Midterm in whatever detail we cover them in that lecture.
- Semantic analysis (a.k.a. context-sensitive analysis). What is it? Why is it different from syntax analysis (parsing)?
- Abstract syntax trees as an intermediate representation of an input program.
- Symbol tables. What are they? What do they store? How can they be used to implement static scoping?
- Type checking.