Compilers Introduction

A compiler is a program which processes programs written in some language, writes programs in another language. It translates code between two different languages. It is a tool to enable you to program at a higher level, mapping high level concepts to low level implementation. They may also produce warning / error messages.

1. Phases of Compilation

  1. Input in some language (e.g. C, Java, C#).
  2. Analysis - construct an internal representation of the source language structure, and hence its meaning. Usually start by builing a tree representation, but may also build a graph to represent control flow.
  3. Synthesis - use this internal representation to construct the target language. Analyse and transform the internal representation, and then traverse it to produce output.
  4. Output in the target language (e.g. Intel x86 Assembly).

1.1 Lexical Analysis

Lexical Analysis takes the input sequence of characters and converts it into a sequence of tokens. User identifiers (e.g. variable names) are represented by the same lexical token. Additionally, a symbol table is constructed to keep track of the name and type of each identifier.

1.2 Syntax Analysis

Syntax Analysis (parsing) extracts the grammatical structure of the program - to work out how BNF (Backus-Naur Form) rules must have been applied to yield the input program. It outputs a data structure representing the program structure, an Abstract Syntax Tree (AST).

2. A Complete Compiler

We will now build a simple compiler in Haskell. The language will translate a simple arithmetic expression language into stack machine code, using a lexical analyser (scan), syntax analyser (parse) and translator (translate):

1data Instruction = PushConst Int | PushVar [Char] | Add | Sub 2 3compile :: [Char] -> [Instruction] 4compile = translate . parse . scan

2.1 Syntactic Analysis (Parsing)

We must seperate the ideas of syntax and semnatics:

We can specify syntactic rules using a context-free grammar, often Backus-Naur Form or Backus Normal Form. Each production shows one valid way by which a non-terminal (LHS) an be expanded into a string of terminals and non-terminals (RHS). For example: . For formally, a context free grammar (CFG) consists of components:

The parse tree shows pictorially how the string is derived from the start symbol - it is a graphical proof. Parsing is the process of finding this proof.

A grammar is ambiguous if it generates contains strings which can be derived in more than one way. For example, the grammar is ambiguous because can be derived in two ways. Ambiguity is bad because it makes it hard to determine the meaning of a program. It is often possible to remove ambiguity by rewriting the grammar. The abstract syntax tree can be simpler than the parse tree, because it removes the ambiguity in the way the tree is constructed.

In top down parsing, we start with the start symbol, and look at each terminal just once, and use it to decide which rules to use. In bottom up parsing, we start with the terminals and work up to the start symbol (using the rules from right-to-left). Although they are are complicated, bottom-up parsers can be constructured automatically by parser generators.

Back to Home