Parsing & Tokenization
1. Lexing
The first stage of a compiler is lexing. Here, we convert the input string into tokens, which are easier to work with. We do this with regex (regular expressions). Formally:
- A regular expression matches literals, alternation, sequencing and iteration.
- A context-free grammer matches the above and also recursion.
- A context-sensitive grammar matches the above and also context by recording arbitrary state.
One token can represent multiple meanings (e.g. * can be multiplication or pointer dereference). So, the tokenizer can simply name it Star and let the parser decide. Another problem that can arise is when a token is a prefix of another token. For example, in Java, >> can be a right shift or two generic tokens. The problem is tokenizers don't have enough context.
Instead, we can use scanerless parsing which operates on character streams.
2. Context Free Grammars
There are two different subsets of context-free-grammars (CFGs), each having its own associated parser. These algorithms have to operate on token streams.
| Approach | Top-down | Bottom-up |
| Does Left Recursion | No | Prefers It |
| Good Errors | Can be good | Reasonably weak |
| Allows for ambiguity | Up to | Up to |
| Complexity |
In general, any CFG can be parsed in at least
3. Parsing Expression Grammars
A PEG is unambiguous by construction. This means that although there may be local ambiguities, the parser will eventually backtrack to resolve these ambiguities and return a single parse-tree. PEG is capable of parsing any
A PEG consists of:
- Literals:
'x' - Variables:
v - Empty:
. - Sequencing:
e1 e2. - Alternation (left biased):
e1 / e2. - Negative Lookahead:
!e. - Grouping:
(e). - End of File:
eof.
As well as redundant (but useful):
- Any Character:
.. - Character Classes:
[0-9]. - Optional:
e?. - Zero or More:
e*. - One or More:
e+. - Lookahead:
&e.
3.1 Writing PEGs
The most important part of PEG is alternation. e1 / e2 means if e1 succeeds, e2 is not tried. This is left-biased, so e1 is tried first. This resolves ambiguity by always picking the parse-tree resulting from left-most choices.
PEGs also exibit greediness: they will always consume as much as they can. This is important to make sure that longer ambiguous branches are kept more to the left:
1stmt <- "if" expr "then" stmt "else" stmt 2 / "if" expr "then" stmt
Left recursive or nullable rules (those which can succeed without consuming input) can cause infinite loops when parsing PEGs. We cannot easily detect left recursion in PEGs, so we must avoid it.
As we have already seen, local ambiguity in PEGs can be expensive. This can be avoided with left-factoring - a transformation of the form e e1 / e e2 ==> e (e1 / e2). This reduces backtracking without changing the semantics of the expression.
> A Word of Warning!
Be careful!
e1 e / e2 e ==> (e1 / e2) eis not a valid transformation, and does not preserve meaning!
3.2 Resolving Left Recursion
We can solve left recursion using the following transformation:
1E <- E OP / T 2// ===> 3E <- T OP*
This transformation right-associates the expressions. It is up to the parser writer to ensure that a generated AST from this is left-associated.