Compilers Introduction

A compiler is a program which processes programs written in some language, writes programs in another language. It translates code between two different languages. It is a tool to enable you to program at a higher level, mapping high level concepts to low level implementation. They may also produce warning / error messages.

1. Phases of Compilation

Input in some language (e.g. C, Java, C#).
Analysis - construct an internal representation of the source language structure, and hence its meaning. Usually start by builing a tree representation, but may also build a graph to represent control flow.
Synthesis - use this internal representation to construct the target language. Analyse and transform the internal representation, and then traverse it to produce output.
Output in the target language (e.g. Intel x86 Assembly).

1.1 Lexical Analysis

Lexical Analysis takes the input sequence of characters and converts it into a sequence of tokens. User identifiers (e.g. variable names) are represented by the same lexical token. Additionally, a symbol table is constructed to keep track of the name and type of each identifier.

1.2 Syntax Analysis

Syntax Analysis (parsing) extracts the grammatical structure of the program - to work out how BNF (Backus-Naur Form) rules must have been applied to yield the input program. It outputs a data structure representing the program structure, an Abstract Syntax Tree (AST).

2. A Complete Compiler

We will now build a simple compiler in Haskell. The language will translate a simple arithmetic expression language into stack machine code, using a lexical analyser (scan), syntax analyser (parse) and translator (translate):

1data Instruction = PushConst Int | PushVar [Char] | Add | Sub
2
3compile :: [Char] -> [Instruction]
4compile = translate . parse . scan

2.1 Syntactic Analysis (Parsing)

We must seperate the ideas of syntax and semnatics:

Syntax consists of rules for acceptable utterances. To determine a program is syntactically correct, you must determine how the grammatical rules were used to construct it.
Semantics is harder to specify - it is the meaning of the program. It is the job of the compiler to determine the meaning of the program, and to ensure that the meaning is preserved in the translation.

We can specify syntactic rules using a context-free grammar, often Backus-Naur Form or Backus Normal Form. Each production shows one valid way by which a non-terminal (LHS) an be expanded into a string of terminals and non-terminals (RHS). For example: . For formally, a context free grammar (CFG) consists of components:

is a non-terminal start symbol.
is a set of productions.
is a set of tokens (terminals).
is a set of non-terminals. Strings of terminals can be derived using grammar by replacing each non-terminal with the RHS of a production. A string which consists of only terminals is called a sentence. The language of a grammar is the set of all sentences which can be derived from the start symbol.

The parse tree shows pictorially how the string is derived from the start symbol - it is a graphical proof. Parsing is the process of finding this proof.

A grammar is ambiguous if it generates contains strings which can be derived in more than one way. For example, the grammar is ambiguous because can be derived in two ways. Ambiguity is bad because it makes it hard to determine the meaning of a program. It is often possible to remove ambiguity by rewriting the grammar. The abstract syntax tree can be simpler than the parse tree, because it removes the ambiguity in the way the tree is constructed.

In top down parsing, we start with the start symbol, and look at each terminal just once, and use it to decide which rules to use. In bottom up parsing, we start with the terminals and work up to the start symbol (using the rules from right-to-left). Although they are are complicated, bottom-up parsers can be constructured automatically by parser generators.