CS143 Notes

p7 Lexical Analysis

Token classes correspond to sets of strings.

Identifier: strings of letters of digits, starting with a letter.

Integer: a non-empty string of digits.

Keyword:

Whitespace: a non-empty sequence of blanks, newlines, and tabs.

Classify program substrings according to role.

<class,string> 这个东西称为 token.

如果有一段代码:

1
2
3
4
if(i==j)
z=0;
else
z=1;

在 Lexical Analyser 看来就是:

1
\tif(i==j)\n\t\tz=0;\n\telse\n\t\tz=1;

Often the punctuation marks of the language are in token classes all by themselves.

连在一起的 whitespace 会被 group 为一个 token.

An implementation must do two things:

  1. Recognize substrings corresponding to tokens.
    • The lexemes(the substrings are called lexemes)
  2. Identify the token class of each lexeme.

<token class, lexeme> – a token,

p8 Lexical Analysis Example

FORTRAN rule: Whitespace is insignificant.

VAR1 is the same as VA R1.

Lookahead, in order to understand the role of something as we are going left to right.

One of the goal in the design of lexical system is to minimize the amount of lookahead or bound the amount of lookahead that is required.

重点:

  1. The goal is to partition the string. This is implemented by reading left-to-right, recongnizing one token at a time.
  2. Lookahead may be required to decide where one token ends and the next token begins.

p9 Regular Languages

Summarize:
- Regular expressions specify regular languages
- Five constructs:
- Two base cases: empty and 1-character strings
- Three compound expressions: union, concatenation, iteration.

p10 Formal Languages

A formal language is just any set of strings over some alphabet.

如:

1
2
3
4
5
6
7
8
Alphabet = English
characters

Language = English
sentences

Alphabet = ASCII
Language = C programs

Meaning function L maps syntax to semantics:

1
L(e) = M

Meaning is many to one, syntax 对 semantic 是多对一。

p11 Lexical Specification

Whitespace: a non-empty sequences of blanks, newlines, and tabs.

p12 Lexical Specification

“Maximal Munch”. 取长的 string.

When there is more than one token class to which a string might belong, 通过优先级,choose the one listed first.

If no rule matches:
Making a Error regex for all strings not in the lexical specification. And put it last in priority.

Summarise
Regular expressions are a concise notation for string patterns.

Use in lexical analysis requires small extensions:
- To resolve ambiguities: match as long as possible; highest priority match
- To handle errors

p13 Finite Automata


CS143 Notes
http://example.com/2022/07/30/CS143-Notes/
作者
Jie
发布于
2022年7月30日
许可协议