Engineering a compiler Notes
Chapter 1 Overview of Compilation
Compilers are computer programs that translate a program written in one language into a program written in another language.
1.1 INTRODUCTION
The compiler has a front end to deal with the source language, which focus on understanding the source-language program, and it has a back end to deal with the target language, which focus on mapping programs to the target machine.
Many research compilers produce C program as their output. Because compilers for C are available on most computers.
Two fundamental principles:
- The compiler must preserve the meaning of the program being compiled.
- The compiler must improve the input program in some discernible way.
1.2 COMPILER STRUCTURE
A compiler uses some sets of data structures to represent the code it processes. That form is called an intermediate representation.
Two-phase compiler, 只有 front end 和 back end.
Three-phase compiler, 有 front end, optimizer 和 back end.
1.3 OVERVIEW OF TRANSLATION
Notation
A compiler translates a program written in one notation into an equivalent program written in another notation.
1.3.1 The Front End
Cheching Syntax
The source language is a set, usually infinite, of strings defined by some finite set of rules, called a grammar.
The scanner takes a stream of characters and converts it to a stream of classified words.
The actual spelling of the words might be stored in a hash table and represented in the pairs with an integer index.
The process of automatically finding derivations is called parsing.
Intermediate Representations
Compilers use a variety of different kinds of IR.
1.3.2 The Optimizer
The optimizer analyzes the IR form of the code to discover facts about that context and uses that contextual knowledge to rewrite the code so that it computes the same answer in a more efficient way.
Analysis
Transformation
1.3.3 The Back End
The compiler’s back end traverses the IR form of the code and emits code for the target machine.
ILOC
–”intermediate language for an optimizing compiler”, is a notation fo low-level examples.
Instruction Selection
Instruction selection maps each IR operation, in its context, into one or more target machine operations.
Virtual register
A symbolic register name that the compiler uses to indicate that a value can be stored in a register.
意思就是选择最合适的 instruction, 所以叫 instruction selector, 比如一个 source code, mapping to 一个 machine code, 有几种选择, 这时 instruction selector 会选择最好的一个.
Register Allocation
The register allocator must map those virtual registers onto actual target-machine registers.
The optimization would increase demand for registers but eliminate a later instruction. 意思就是把前面计算的值用于后面的计算.
Instruction Scheduling
To reorder operation.
It attempts to minimize the number of cycles wasted waiting for operands.
Interactions Among Code-Generation Components
1.4 SUMMARY AND PERSPECTIVE
Chapter 2 Scanners
The scanner is the only pass in the compiler to touch every character in the input program.
The regular expression, a notation used to describe the valid words in a programming laguage.
2.1 INTRODUCTION
The scanner, 也就是 lexical analyzer, reads a stream of characters and produces a stream of words.
Conceptual Roadmap
Overview
Syntactic category, a classification of words according to their grammatical usage.
Microsyntax, the lexical structure of a language, which specifies how to group characters into words and how to separate words that run together.
In most languages, blanks and punctuation marks terminate a word.
Keyword, a word that is reserved for a particular syntactic purpose and, thus, cannot be used as an identifier.
Both generated and hand-crafted scanners can be implemented to require just O(1) time per character.
2.2 RECONGNIZING WORDS
The recognizer takes one transition per input character.
2.2.1 A Formalism for Recognizers
Finite automation, a formalism for recogmizers that has a finite set of states, an alphabet, a transition function, a start state, and one or more accepting states. A finite automaton(FA) is a five-tuple($S, \epsilon, \delta, s_0, S_A$).
有一个 error state $s_e$.
2.2.2 Recognizing More Complex Words
Syntactic category 和 lexeme.
2.3 REGULAR EXPRESSION
The language described by an RE is called a regular languages.
Simple recognizers have simple RE specifications.
2.3.1 Formalizing the Notation
An RE is built up from three basic operations:
- Alternation, R|S
- Concatenation, RS
- Closure, $R^*$
Regular expression are used in many applications to specify patterns in character strings.
Parentheses have highest precedence, followed by closure, concatenation, and alternation, in that order.
2.3.2 Examples
The point is critical: the cost of operating an FA is proportional to the length of the input, not to the length or complexity of the RE that generates the FA.
2.3.3 Closure Properties of REs
Regular expressions are closed under many operations–that is, if we apply the operation to an RE or a collection of REs, the result is an RE.
2.4 FROM REGULAR EXPRESSION TO SCANNER
The goal of out work with finite automata is to automate the derivation of executable scanners from a collection of REs.
Deterministic FAs, or DFAs. Kleene’s construction derives an RE from a DFA.
Nondeterministic FAs, or NFAs. Thompson’s construction derives an NFA from an RE.
2.4.1 Nondeterministic Finite Automata
An FA that includes states such as $s_0$ that have multiple transitions on a single character is called a nondeterministic finite automaton. Its an FA thats allows transitions on the empty string, $\epsilon$, and states that have multiple transitions on the same character.
An FA with unique character transitions in each state is called a deterministic finite automaton. Its an FA where the transition function is single-valued. DFAs do not allow $\epsilon$-transitions.
NFA 的行为:
- Each time the NFA must make a nondeterministic choice, it follows the transition that leads to an accepting state for the input string, if such a transition exists. 也就是说NFA会把一条路线走完.
- NFA 会 clones itself to pursue each possible transition. At any point, we call the specific set of states in which the NFA is active its configuration. 也就是其中某一条.
Equivalence of NFAs and DFAs
NFAs and DFAs are equivalent in their expressive power. Any DFA is a special case of an NFA.
2.4.2 Regular Expression to NFA: Thompson’s Construction
Deriving an NFA from the RE.
因为 NFA 要尝试所有情况, 因此 NFA 需要 $\epsilon$ transformation. 在每一个分支处都需要 $\epsilon$ transformation.
Each NFA has one start state and one accepting state of NFAs for some component REs.
An $\epsilon$-transition always connects two states.
The NFA also contains many $\epsilon$-moves that are obviously unneeded.
2.4.3 NFA to DFA: The Subset Construction
DFA executions is much easier to simulate than NFA.
The algorithm that constructs a DFA from an NFA is called subset construction.
Valid configuration, configuration of an NFA that can be reached by some input string.
2.4.5 Using a DFA as a Recognizer
2.5 IMPLEMENTING SCANNERS
最后一部似乎是 convert the DFA into executable code.
Three implementation strategies for converting a DFA into executable code:
- a table-driven scanner
- a direct-coded scanner
- a hand-coded sanner
All of these scanners operate in the same manner, by simulating the DFA.
The three implementation strategies differ in the details of their runtime cost.