Parsing & Tokenization

1. Lexing

The first stage of a compiler is lexing. Here, we convert the input string into tokens, which are easier to work with. We do this with regex (regular expressions). Formally:

A regular expression matches literals, alternation, sequencing and iteration.
A context-free grammer matches the above and also recursion.
A context-sensitive grammar matches the above and also context by recording arbitrary state.

One token can represent multiple meanings (e.g. * can be multiplication or pointer dereference). So, the tokenizer can simply name it Star and let the parser decide. Another problem that can arise is when a token is a prefix of another token. For example, in Java, >> can be a right shift or two generic tokens. The problem is tokenizers don't have enough context.

Instead, we can use scanerless parsing which operates on character streams.

2. Context Free Grammars

There are two different subsets of context-free-grammars (CFGs), each having its own associated parser. These algorithms have to operate on token streams.


Approach	Top-down	Bottom-up
Does Left Recursion	No	Prefers It
Good Errors	Can be good	Reasonably weak
Allows for ambiguity	Up to tokens	Up to tokens
Complexity	with perfect choices

In general, any CFG can be parsed in at least and can be fully ambiguous with multiple parses using the CYK algorithm. and demand any ambiguities can be resolved by looking ahead tokens. Good programming languages should asprire to have unambiguous grammars with , for better parsing performance.

3. Parsing Expression Grammars

A PEG is unambiguous by construction. This means that although there may be local ambiguities, the parser will eventually backtrack to resolve these ambiguities and return a single parse-tree. PEG is capable of parsing any grammar, but it is still unknown whether it can parse any CFG. A PEG can be parsed in time with no backtracking, or time with backtracking on every decision.

A PEG consists of:

Literals: 'x'
Variables: v
Empty: .
Sequencing: e1 e2.
Alternation (left biased): e1 / e2.
Negative Lookahead: !e.
Grouping: (e).
End of File: eof.

As well as redundant (but useful):

Any Character: ..
Character Classes: [0-9].
Optional: e?.
Zero or More: e*.
One or More: e+.
Lookahead: &e.

3.1 Writing PEGs

The most important part of PEG is alternation. e1 / e2 means if e1 succeeds, e2 is not tried. This is left-biased, so e1 is tried first. This resolves ambiguity by always picking the parse-tree resulting from left-most choices.

PEGs also exibit greediness: they will always consume as much as they can. This is important to make sure that longer ambiguous branches are kept more to the left:

1stmt <- "if" expr "then" stmt "else" stmt
2      / "if" expr "then" stmt

Left recursive or nullable rules (those which can succeed without consuming input) can cause infinite loops when parsing PEGs. We cannot easily detect left recursion in PEGs, so we must avoid it.

As we have already seen, local ambiguity in PEGs can be expensive. This can be avoided with left-factoring - a transformation of the form e e1 / e e2 ==> e (e1 / e2). This reduces backtracking without changing the semantics of the expression.

> A Word of Warning!

Be careful! e1 e / e2 e ==> (e1 / e2) e is not a valid transformation, and does not preserve meaning!

3.2 Resolving Left Recursion

We can solve left recursion using the following transformation:

1E <- E OP / T
2// ===>
3E <- T OP*

This transformation right-associates the expressions. It is up to the parser writer to ensure that a generated AST from this is left-associated.

Back to Home