Principles of Compilers
Introduction
The word compilation is used to denote the task of translating high
level language (HLL) programs into machine language programs. Though the
objective of this translation task is similar to that of an
assembler, the problem of compilation is much more complex. A
compiler is a program that performs the compilation
task. A compiler recognises programs in a particular HLL and produces
equivalent output programs appropriate for some particular computer
configuration (hardware and OS). Thus, an HLL program is to a great extent
independent of the configuration of the machine it will eventually run on, as
long as it is ensured that the program is compiled by a compiler that
recognises that HLL and produces output for the required machine
configuration. It is common for a machine to have compilers that
translate programs into executables for that same machine (the host).
But there are also compilers that run on one type of machine but whose
output consists of programs that run on some other machine configuration,
such as generating an MS-DOS executable program by compiling an HLL program
under UNIX. Such a compiler is called a cross compiler. Another kind of
translator that accepts programs in an HLL is known as an interpreter. An
interpreter translates an input HLL program and also runs it on
the same machine. Hence the output of running an interpreter is actually
the output of the program that it translates.
Important phases in Compilation
The following is a typical breakdown of the overall task of a compiler in an
approximate sequence -
Lexical analysis, Syntax analysis, Intermediate code generation, Code
optimisation, Code generation.
Like an assembler, a compiler usually performs the above tasks by making
multiple passes over the input or some intermediate representation of the
same. The compilation task calls for intensive processing of the information
extracted from the input programs, and hence the data structures for
representing such information need to be carefully selected. During
translation a compiler also detects certain kinds of errors in the input, and
may attempt some recovery steps for these.
Lexical Analysis
Lexical analysis in a compiler can be performed in the same way as in an
assembler. Generally an HLL has a larger variety of tokens to be
recognised - various keywords (such as for, while, if, else, etc.),
punctuation symbols (such as commas, semi-colons, braces, etc.), operators
(such as arithmetic operators, logical operators, etc.), identifiers, etc.
Tools like lex or flex are used to create lexical analysers.
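As an illustration, a minimal hand-written lexical analyser for a tiny C-like fragment can be sketched in Python. The token names and the token set chosen here are assumptions for illustration, not taken from any particular compiler:

```python
import re

# Token specification: (token name, regular expression) pairs, tried in
# order, so keywords are listed before the general identifier pattern.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:if|else|while|for)\b"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("NUMBER",  r"\d+"),
    ("OP",      r"[+\-*/=<>!]=?"),
    ("PUNCT",   r"[(){};,]"),
    ("SKIP",    r"\s+"),
]
MASTER_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    """Yield (token name, lexeme) pairs; whitespace is skipped."""
    for m in MASTER_RE.finditer(source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())
```

For example, `list(tokenize("if (x) y = 1;"))` classifies `if` as a keyword, `x` and `y` as identifiers, `1` as a number, and so on.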
Syntax Analysis
Syntax analysis deals with recognising the structure of input programs
according to known set of syntax rules defined for the HLL. This is
the most important aspect in which HLLs are significantly different from
lower level languages such as assembly language. In assembly languages the
syntax rules are simple, roughly requiring that a program be a
sequence of statements, and that each statement essentially contain a
mnemonic followed by zero or more operands depending on the mnemonic.
Optionally, there can also be an identifier preceding the mnemonic. In case
of HLLs, the syntax rules are much more complicated. In most HLLs the notion
of a statement itself is very flexible, and often allows recursion,
making nested constructs valid. These languages usually support multiple data
types and often allow programmers to define abstract data types to be used in
the programs. These and many other such features make the process of creating
software easier and less error prone compared to assembly language
programming. But, on the other hand, these features make the process of
compilation complicated.
The non-trivial syntax rules of HLLs need to be cleverly specified using some
suitable notation, so that these can be encoded in the compiler program. One
commonly used formalism for this purpose is the Context Free Grammar
(CFG). CFG is a formalism that is more powerful than regular
grammars (used to write regular expressions to describe tokens in
a lexical analyser). Recursion, which is a common feature in most
constructs of HLLs, can be defined using a CFG in a concise way, whereas a
regular grammar is incapable of doing so. It needs to be noted that there are
certain constructs that cannot be adequately described using CFG, and may
require other more powerful formalisms, such as Context Sensitive Grammars
(CSG). A common notation used to write the rules of a CFG or CSG is
BNF (Backus-Naur Form).
During syntax analysis, the compiler tries to apply the rules of the grammar
of the input HLL given using BNF, to recognise the structure of the input
program. This is called parsing and the module that performs this task
is called a parser. From a somewhat abstract point of view, the output
of this phase is a parse tree that depicts how various rules of the
grammar can be repetitively applied to recognise the input program. If the
parser cannot create a parse tree for some given input program, then the
input program is not valid according to the syntax of the HLL.
The soundness of the CFG formalism and the BNF notation makes it possible to
create different types of efficient parsers to recognise input according to a
given language. These parsers can be broadly classified as top-down
parsers and bottom-up parsers. Recursive descent parsers
and Predictive parsers are two examples of top-down parsers. SLR
parsers and LALR parsers are two examples of bottom-up parsers. For
certain simple context free languages (languages that can be defined using
CFG) simpler bottom-up parsers can be written. For example, for recognising
mathematical expressions, an operator precedence parser can be created.
In creating a compiler, a parser is often built using tools such as
yacc and bison. To do so the CFG of the input language is
written in BNF notation, and given as input to the tool (along with other
details).
Intermediate Code Generation
Having recognised a given input program as valid, a compiler tries to create
the equivalent program in the language of the target environment. In case of
an assembler this translation is somewhat simpler, since for the operation
implied by the mnemonic opcode in each statement of the input program, there
is some equivalent machine opcode. The number of operands applicable for each
operation in the machine language is the same as that allowed for the
corresponding assembly language mnemonic opcode. Thus for assembly
language the translation can be done for each statement
almost independently of the rest of the program. But, in case of an HLL, it
is futile to try to associate a single machine opcode for each statement of
the input language. One of the reasons for this is, as stated above, the
extent of a statement is not always fixed and may contain recursion.
Moreover, data references in HLL programs can assume significant levels of
abstractions in comparison to what the target execution environment may
directly support. The task of associating meanings (in terms of primitive
operations that can be supported by a machine) to programs or segments of a
program is called semantic processing.
Syntax Directed Translation
Though it is not entirely straightforward to associate target language
operations to statements in the HLL programs, the CFG for the HLL allows one
to associate semantic actions (or implications) for the various
syntactic rules. Hence in the broad task of translation, when the input
program is parsed, a compiler also tries to perform certain semantic actions
corresponding to the various syntactic rules that are eventually applied.
However, most HLLs contain certain syntactic features for which the semantic
actions are to be determined using some additional information, such as the
contents of the symbol table. Hence, building and using data structures
such as the symbol table is an important part of the semantic actions
performed by the compiler.
Upon carrying out the semantic processing a more manageable equivalent
form of the input program is obtained. This is stored (represented) using
some Intermediate code representation that makes further processing
easy. In this representation, the compiler often has to introduce several
temporary variables to store intermediate results of various operations.
The language used for the intermediate code is generally not any particular
machine language, but is such which can be efficiently converted to a
required machine language (some form of assembly language can be considered
for such use).
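The introduction of compiler-generated temporaries can be sketched as follows: the routine below flattens an expression tree into three-address intermediate code, inventing a fresh temporary for each operation. The textual instruction format and the `t1`, `t2`, ... naming are illustrative assumptions:

```python
def to_three_address(node, code, counter):
    """Flatten a nested-tuple expression tree such as
    ('+', 'a', ('*', 'b', 'c')) into three-address instructions,
    appending them to `code` and returning the name of the result."""
    if not isinstance(node, tuple):
        return str(node)            # leaf: a variable name or constant
    op, lhs, rhs = node
    left = to_three_address(lhs, code, counter)
    right = to_three_address(rhs, code, counter)
    counter[0] += 1
    temp = f"t{counter[0]}"         # compiler-generated temporary
    code.append(f"{temp} = {left} {op} {right}")
    return temp
```

Flattening `('+', 'a', ('*', 'b', 'c'))` this way produces the two instructions `t1 = b * c` and `t2 = a + t1`, with `t1` holding the intermediate product.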
Code Optimisation
The programs represented in the intermediate code form usually contain much
scope for optimisation both in terms of storage space as well as run time
efficiency of the intended output program. Sometimes the input program itself
contains such scope. Besides that, the process of generating the intermediate
code representation usually leaves much room for such optimisation. Hence,
compilers usually implement explicit steps to optimise the intermediate code.
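One simple such optimisation is constant folding, where operations whose operands are known constants are evaluated at compile time rather than at run time. A minimal sketch in Python, assuming an illustrative textual three-address instruction format:

```python
import operator

# Operators the folder understands; the instruction format
# "dest = a op b" is an illustrative assumption.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold_constants(code):
    """Evaluate instructions whose operands are integer constants,
    substituting the computed value into later instructions."""
    env, out = {}, []
    for instr in code:
        dest, expr = instr.split(" = ")
        parts = [env.get(p, p) for p in expr.split()]
        if (len(parts) == 3
                and parts[0].lstrip("-").isdigit()
                and parts[2].lstrip("-").isdigit()):
            # Both operands constant: fold now, emit nothing.
            env[dest] = str(OPS[parts[1]](int(parts[0]), int(parts[2])))
        else:
            out.append(f"{dest} = {' '.join(parts)}")
    return out
```

For example, the pair `t1 = 2 * 3` and `t2 = a + t1` folds to the single instruction `t2 = a + 6`, eliminating both a run-time multiplication and a temporary.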
Code Generation
Finally, the compiler converts the (optimised) program in the intermediate
code representation to the required machine language. It needs to be noted
that if the program being translated by the compiler actually has
dependencies on some external modules, then linking has to be
performed on the output of the compiler. These activities are independent of
whether the input program was in HLL or assembly language.
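A naive final translation step can be sketched as mapping each three-address instruction onto assembly-like text. The accumulator-style mnemonics LOAD/ADD/SUB/MUL/STORE used below are hypothetical, not a real instruction set:

```python
# Map illustrative three-address operators to hypothetical mnemonics.
ASM_OP = {"+": "ADD", "-": "SUB", "*": "MUL"}

def generate(code):
    """Translate instructions of the form "dest = a op b" into a
    hypothetical accumulator-machine assembly, one triple per instruction."""
    asm = []
    for instr in code:
        dest, expr = instr.split(" = ")
        a, op, b = expr.split()
        asm += [f"LOAD {a}", f"{ASM_OP[op]} {b}", f"STORE {dest}"]
    return asm
```

A real code generator would additionally allocate registers and pick addressing modes; this sketch only shows the shape of the final phase.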
Run-time Storage Administration
One of the important aspects of the semantic actions of a compiler is to
ensure an efficient and error free run-time storage model to the output
program. Most modern programming languages allow some extent of
block-structuring, nesting of constructs, and recursion of
subroutines. All these call for an efficient model of data storage that
is dynamic, and it turns out that a stack meets most of these criteria.
Thus allocation and access of storage for program variables, subroutine
parameters, and compiler generated internal variables on the stack is an
important part of the task of a compiler.
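A toy model of such stack-based storage can be sketched as follows: each subroutine call pushes an activation record (frame) holding its parameters and local variables, and the frame is popped on return. The record layout here is an illustrative assumption, not any particular compiler's scheme:

```python
class RuntimeStack:
    """Toy model of run-time storage: a stack of activation records."""

    def __init__(self):
        self.frames = []

    def call(self, name, params):
        # Entering a subroutine pushes a fresh frame for its storage.
        self.frames.append({"name": name, "params": params, "locals": {}})

    def set_local(self, var, value):
        # Local variables are allocated in the topmost (current) frame.
        self.frames[-1]["locals"][var] = value

    def ret(self):
        # Returning pops the frame, freeing the subroutine's storage.
        return self.frames.pop()
```

Because each recursive call gets its own frame, the locals and parameters of different activations of the same subroutine cannot interfere with one another.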