Engineering a compiler Notes

Chapter 1 Overview of Compilation

Compilers are computer programs that translate a program written in one language into a program written in another language.

1.1 INTRODUCTION

The compiler has a front end to deal with the source language, which focus on understanding the source-language program, and it has a back end to deal with the target language, which focus on mapping programs to the target machine.

Many research compilers produce C program as their output. Because compilers for C are available on most computers.

Two fundamental principles:
- The compiler must preserve the meaning of the program being compiled.
- The compiler must improve the input program in some discernible way.

1.2 COMPILER STRUCTURE

A compiler uses some sets of data structures to represent the code it processes. That form is called an intermediate representation.

Two-phase compiler, 只有 front end 和 back end.
Three-phase compiler, 有 front end, optimizer 和 back end.

1.3 OVERVIEW OF TRANSLATION

Notation
A compiler translates a program written in one notation into an equivalent program written in another notation.

1.3.1 The Front End

Cheching Syntax

The source language is a set, usually infinite, of strings defined by some finite set of rules, called a grammar.

The scanner takes a stream of characters and converts it to a stream of classified words.

The actual spelling of the words might be stored in a hash table and represented in the pairs with an integer index.

The process of automatically finding derivations is called parsing.

Intermediate Representations

Compilers use a variety of different kinds of IR.

1.3.2 The Optimizer

The optimizer analyzes the IR form of the code to discover facts about that context and uses that contextual knowledge to rewrite the code so that it computes the same answer in a more efficient way.

Analysis

Transformation

1.3.3 The Back End

The compiler’s back end traverses the IR form of the code and emits code for the target machine.

ILOC–”intermediate language for an optimizing compiler”, is a notation fo low-level examples.

Instruction Selection

Instruction selection maps each IR operation, in its context, into one or more target machine operations.
Virtual register
A symbolic register name that the compiler uses to indicate that a value can be stored in a register.

意思就是选择最合适的 instruction, 所以叫 instruction selector, 比如一个 source code, mapping to 一个 machine code, 有几种选择, 这时 instruction selector 会选择最好的一个.

Register Allocation

The register allocator must map those virtual registers onto actual target-machine registers.

The optimization would increase demand for registers but eliminate a later instruction. 意思就是把前面计算的值用于后面的计算.

Instruction Scheduling

To reorder operation.

It attempts to minimize the number of cycles wasted waiting for operands.

Interactions Among Code-Generation Components

1.4 SUMMARY AND PERSPECTIVE

Chapter 2 Scanners

The scanner is the only pass in the compiler to touch every character in the input program.

The regular expression, a notation used to describe the valid words in a programming laguage.

2.1 INTRODUCTION

The scanner, 也就是 lexical analyzer, reads a stream of characters and produces a stream of words.

Conceptual Roadmap

Overview

Syntactic category, a classification of words according to their grammatical usage.

Microsyntax, the lexical structure of a language, which specifies how to group characters into words and how to separate words that run together.

In most languages, blanks and punctuation marks terminate a word.

Keyword, a word that is reserved for a particular syntactic purpose and, thus, cannot be used as an identifier.

Both generated and hand-crafted scanners can be implemented to require just O(1) time per character.

2.2 RECONGNIZING WORDS

The recognizer takes one transition per input character.

2.2.1 A Formalism for Recognizers

Finite automation, a formalism for recogmizers that has a finite set of states, an alphabet, a transition function, a start state, and one or more accepting states. A finite automaton(FA) is a five-tuple($S, \epsilon, \delta, s_0, S_A$).

有一个 error state $s_e$.

2.2.2 Recognizing More Complex Words

Syntactic category 和 lexeme.

2.3 REGULAR EXPRESSION

The language described by an RE is called a regular languages.

Simple recognizers have simple RE specifications.

2.3.1 Formalizing the Notation

An RE is built up from three basic operations:

  1. Alternation, R|S
  2. Concatenation, RS
  3. Closure, $R^*$

Regular expression are used in many applications to specify patterns in character strings.

Parentheses have highest precedence, followed by closure, concatenation, and alternation, in that order.

2.3.2 Examples

The point is critical: the cost of operating an FA is proportional to the length of the input, not to the length or complexity of the RE that generates the FA.

2.3.3 Closure Properties of REs

Regular expressions are closed under many operations–that is, if we apply the operation to an RE or a collection of REs, the result is an RE.

2.4 FROM REGULAR EXPRESSION TO SCANNER

The goal of out work with finite automata is to automate the derivation of executable scanners from a collection of REs.

Deterministic FAs, or DFAs. Kleene’s construction derives an RE from a DFA.

Nondeterministic FAs, or NFAs. Thompson’s construction derives an NFA from an RE.

2.4.1 Nondeterministic Finite Automata

An FA that includes states such as $s_0$ that have multiple transitions on a single character is called a nondeterministic finite automaton. Its an FA thats allows transitions on the empty string, $\epsilon$, and states that have multiple transitions on the same character.

An FA with unique character transitions in each state is called a deterministic finite automaton. Its an FA where the transition function is single-valued. DFAs do not allow $\epsilon$-transitions.

NFA 的行为:

  • Each time the NFA must make a nondeterministic choice, it follows the transition that leads to an accepting state for the input string, if such a transition exists. 也就是说NFA会把一条路线走完.
  • NFA 会 clones itself to pursue each possible transition. At any point, we call the specific set of states in which the NFA is active its configuration. 也就是其中某一条.

Equivalence of NFAs and DFAs

NFAs and DFAs are equivalent in their expressive power. Any DFA is a special case of an NFA.

2.4.2 Regular Expression to NFA: Thompson’s Construction

Deriving an NFA from the RE.

因为 NFA 要尝试所有情况, 因此 NFA 需要 $\epsilon$ transformation. 在每一个分支处都需要 $\epsilon$ transformation.

Each NFA has one start state and one accepting state of NFAs for some component REs.

An $\epsilon$-transition always connects two states.

The NFA also contains many $\epsilon$-moves that are obviously unneeded.

2.4.3 NFA to DFA: The Subset Construction

DFA executions is much easier to simulate than NFA.

The algorithm that constructs a DFA from an NFA is called subset construction.

Valid configuration, configuration of an NFA that can be reached by some input string.

2.4.5 Using a DFA as a Recognizer

2.5 IMPLEMENTING SCANNERS

最后一部似乎是 convert the DFA into executable code.

Three implementation strategies for converting a DFA into executable code:

  • a table-driven scanner
  • a direct-coded scanner
  • a hand-coded sanner
    All of these scanners operate in the same manner, by simulating the DFA.
    The three implementation strategies differ in the details of their runtime cost.

2.5.1 Table-Driven Scanners


Engineering a compiler Notes
http://example.com/2022/08/01/Engineering-a-compiler-Notes/
作者
Jie
发布于
2022年8月1日
许可协议