CS143 Notes
p7 Lexical Analysis
Token classes correspond to sets of strings.
Identifier: strings of letters or digits, starting with a letter.
Integer: a non-empty string of digits.
Keyword: one of a fixed set of reserved words, e.g. "if" or "else".
Whitespace: a non-empty sequence of blanks, newlines, and tabs.
Classify program substrings according to role.
A pair <class, string> is called a token.
For example, given a fragment of code such as `foo = 42`, the lexical analyser sees a sequence of such pairs: <Identifier, "foo"> <Whitespace, " "> <Operator, "="> <Whitespace, " "> <Integer, "42">.
Often the punctuation marks of the language are in token classes all by themselves.
Consecutive whitespace characters are grouped into a single whitespace token.
An implementation must do two things:
- Recognize the substrings corresponding to tokens (these substrings are called lexemes).
- Identify the token class of each lexeme.
The output is a sequence of <token class, lexeme> pairs, i.e. tokens.
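The two steps above can be sketched as a small scanner. This is a minimal illustration, not code from the notes: the token classes and patterns are assumptions chosen so they do not overlap, and disambiguation rules come later.

```python
import re

# Illustrative token classes, each defined by a regex.
# Classes are tried in listed order at the current position.
TOKEN_CLASSES = [
    ("Whitespace", r"[ \t\n]+"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer",    r"[0-9]+"),
]

def tokenize(source):
    """Partition `source` into <class, lexeme> pairs, left to right."""
    tokens, pos = [], 0
    while pos < len(source):
        for cls, pattern in TOKEN_CLASSES:
            m = re.match(pattern, source[pos:])
            if m:
                tokens.append((cls, m.group(0)))   # record the pair
                pos += len(m.group(0))             # advance past the lexeme
                break
        else:
            raise ValueError(f"no token class matches at position {pos}")
    return tokens
```

Note that the whole input is partitioned: every character, including whitespace, ends up inside exactly one lexeme.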
p8 Lexical Analysis Example
FORTRAN rule: Whitespace is insignificant.
`VAR1` is the same as `VA R1`.
Lookahead: to understand the role of a substring while scanning left to right, we may need to look at characters beyond it. One goal in the design of a lexical system is to minimize, or at least bound, the amount of lookahead required.
Key points:
- The goal is to partition the string. This is implemented by reading left to right, recognizing one token at a time.
- Lookahead may be required to decide where one token ends and the next token begins.
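A classic one-character-lookahead case is distinguishing `==` from `=`. The sketch below is illustrative; the operator class names `Assign` and `Equals` are hypothetical, not from the notes.

```python
def next_op(source, pos):
    """Having seen '=' at `pos`, peek one character ahead to decide
    whether the token is '==' (Equals) or '=' (Assign)."""
    if source[pos] == "=":
        if pos + 1 < len(source) and source[pos + 1] == "=":
            return ("Equals", "==")
        return ("Assign", "=")
    raise ValueError(f"expected '=' at position {pos}")
```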
p9 Regular Languages
Summarize:
- Regular expressions specify regular languages
- Five constructs:
- Two base cases: the empty string and single-character strings
- Three compound expressions: union, concatenation, iteration.
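The five constructs can be made concrete by interpreting each expression as the set of strings it denotes. A minimal sketch, with iteration truncated to a bound so the sets stay finite (the function names are my own, not standard notation):

```python
def epsilon():
    return {""}                                  # base: the empty string

def char(c):
    return {c}                                   # base: a 1-character string

def union(A, B):
    return A | B                                 # A + B

def concat(A, B):
    return {a + b for a in A for b in B}         # AB

def star(A, bound=3):
    """A* truncated to at most `bound` repetitions."""
    result, layer = {""}, {""}
    for _ in range(bound):
        layer = {a + b for a in layer for b in A}
        result |= layer
    return result
```

For example, `star(char("a"))` with the default bound yields {"", "a", "aa", "aaa"}.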
p10 Formal Languages
A formal language is just any set of strings over some alphabet.
For example, the set of all strings matched by a regular expression over some alphabet is a formal language.
Meaning function L maps syntax to semantics: L(e) is the set of strings denoted by the regular expression e. For instance, L(A + B) = L(A) ∪ L(B).
Meaning is many-to-one: multiple different syntaxes can map to the same semantics.
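A quick check of the many-to-one point: the regexes `a+` and `aa*` are syntactically different but denote the same language, as a sample of strings shows. This uses Python's `re.fullmatch` purely as an illustration.

```python
import re

# Two different syntaxes, one semantics: "a+" and "aa*" accept
# exactly the same strings, so L maps both to the same language.
samples = ["", "a", "aa", "aaa", "b", "ab"]
lang1 = [s for s in samples if re.fullmatch(r"a+", s)]
lang2 = [s for s in samples if re.fullmatch(r"aa*", s)]
assert lang1 == lang2
```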
p11 Lexical Specification
Whitespace: a non-empty sequence of blanks, newlines, and tabs.
p12 Lexical Specification
"Maximal Munch": when several matches are possible, take the longest matching string.
When there is more than one token class to which a string might belong, resolve by priority: choose the one listed first.
If no rule matches: handle the error by writing a regex for all strings not in the lexical specification, and put it last in priority.
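The three rules above (maximal munch, priority by listing order, and a lowest-priority error class) can be sketched together. The rule names and patterns are illustrative assumptions, not from the notes.

```python
import re

# Rules in priority order; the Error class matches any single
# character and is listed last as the catch-all.
RULES = [
    ("Keyword",    r"if|else"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer",    r"[0-9]+"),
    ("Whitespace", r"[ \t\n]+"),
    ("Error",      r"."),
]

def lex(source):
    """Try every rule at the current position; keep the longest match
    (maximal munch), breaking ties by listing order (priority)."""
    tokens, pos = [], 0
    while pos < len(source):
        best = None
        for cls, pattern in RULES:
            m = re.match(pattern, source[pos:])
            if m and (best is None or len(m.group(0)) > len(best[1])):
                best = (cls, m.group(0))   # strictly longer match wins
        tokens.append(best)
        pos += len(best[1])
    return tokens
```

Note how `lex("iffy")` yields a single Identifier rather than the Keyword `if` followed by `fy` (maximal munch), while `lex("if")` yields a Keyword because Keyword is listed before Identifier (priority).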
Summarize:
Regular expressions are a concise notation for string patterns.
Use in lexical analysis requires small extensions:
- To resolve ambiguities: match as long as possible; among equal-length matches, take the highest-priority one
- To handle errors