Lexical Analyser

Introduction

In the process of translation (of programs), a lexical analyser scans an input text and recognises sequences of bytes in it as certain valid "items", or as invalid sequences. For example, if an assembly language allows a set of mnemonics, alphanumeric identifiers, numbers (sequences of digits), and some symbols such as comma, colon, etc., then a lexical analyser for that assembly language should detect such items in an input program written in that language. Usually a larger assembler program invokes a lexical analyser module to recognise such items one by one, starting from the beginning of the text. Each such distinct item that the lexical analyser may detect and pass on (return) to the main assembler is called a token. So for an assembly language there can be tokens such as IDENTIFIER, NUMBER, LITERAL, COMMA, KEY_LDA, KEY_START, KEY_LOOP (the prefix KEY_ may indicate that the token represents a keyword), and so on. The designer of a translator can define the set of tokens for his purpose.

In an assembler, a distinct number is used to denote each token, so that the lexical analyser can pass back the number representing the token that it recognises in the input. For example, for the above set of tokens we may use 1 for IDENTIFIER, 2 for NUMBER, 3 for LITERAL, 4 for COMMA, 5 for KEY_LDA, and so on. From an implementation point of view, in the C language one can use the C preprocessor's #define directive to establish these equivalences, such as, #define IDENTIFIER 1.
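In full, such a set of definitions might be collected in a header, as in the following minimal sketch (the token names and numbers are the illustrative ones used above):

    /* Token numbers for the assembler's lexical analyser (illustrative). */
    #define IDENTIFIER 1
    #define NUMBER     2
    #define LITERAL    3
    #define COMMA      4
    #define KEY_LDA    5    /* keyword LDA   */
    #define KEY_START  6    /* keyword START */
    #define KEY_LOOP   7    /* keyword LOOP  */

The lexical analyser function then simply returns one of these numbers each time it is called, and the main assembler tests the returned value against the same names.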

Implementation

The sequences of bytes that a lexical analyser has to recognise are either fixed or variable (generic). Fixed sequences are for keywords and some symbols such as comma, colon, etc., where a predefined sequence of one or more bytes (characters) is recognised as some item. For example, for each keyword the sequence of letters is predefined. Variable or generic sequences are for identifiers, numbers, etc. For these, some rule is given instead of a predefined sequence. An example of a rule is, "a sequence of alphanumeric characters or underscores, whose first character is not a digit, is an identifier". Usually, if a sequence of input bytes matches a fixed sequence as well as some rule, that sequence is recognised as the item corresponding to the fixed sequence. For example, the sequence "MULT" may match the fixed sequence for a keyword as well as the rule for identifiers given above. It will then be recognised as a keyword. (Relate this to the concepts of keyword and reserved word.) Some items for recognition may have a combination of a fixed part and a variable part. For example, suppose in a language an address value is prefixed by an "@" character to distinguish it from a numeric constant. Then the rule for an address value is "@ followed by a sequence of digits".
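The two rules just stated translate almost directly into code. The following is a minimal sketch (the function names are illustrative, not from any standard library) that tests a complete string against each rule:

    #include <ctype.h>

    /* Rule: a sequence of alphanumeric characters or underscores,
       whose first character is not a digit, is an identifier. */
    int is_identifier(const char *s)
    {
        if (*s == '\0' || isdigit((unsigned char)*s))
            return 0;
        for ( ; *s != '\0'; s++)
            if (!isalnum((unsigned char)*s) && *s != '_')
                return 0;
        return 1;
    }

    /* Rule: '@' followed by a sequence of digits is an address value. */
    int is_address(const char *s)
    {
        if (*s++ != '@' || *s == '\0')
            return 0;
        for ( ; *s != '\0'; s++)
            if (!isdigit((unsigned char)*s))
                return 0;
        return 1;
    }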

First Approach - String comparison

One approach that can easily be imagined is to read some number of characters from the input file and match the string against known strings (spellings) of items. If the input string does not match any of these known fixed strings, then it may be further tested against the variable sequences. In practice, most programming languages allow identifiers (names of variables, constants, procedures, etc.) to consist of sequences of alphanumeric and possibly "underscore" characters. Keywords are usually strings of alphabetic characters. Numbers are usually strings of digits (decimal or hexadecimal) and possibly a period (decimal point). Other valid items are formed with symbols other than these mentioned characters. Fixed strings corresponding to keywords can be stored in a table, preferably arranged in a way that facilitates easy search (e.g. sorted, to facilitate binary search).

Thus, wherever in the input there is a continuous string of alphanumeric or underscore characters, the lexical analyser can read it into a buffer. While doing so, it can set one boolean flag to indicate whether any underscore was read, another to indicate whether any alphabetic character was read, and another to indicate whether any decimal digit was read. If the string contained no underscore and no digits, then it might be a keyword: the string is matched against the keywords stored in the table, and if a match occurs, the lexical analyser declares the string to be that keyword. Otherwise it is an identifier. If the string in the buffer contained one or more underscores, it is not necessary to search the keyword table, and it can straightaway be recognised as an identifier. If, however, the string had only digits, it can be recognised as a number; in doing so, some more characters from the input may also be included if there is a period followed by more digits. Other items of the language may be recognised by matching the input character by character. The above describes a possible lexical analyser for a typical programming language; variations of it can be implemented for any real requirement.
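A fragment along these lines might look as follows. This is only a sketch: the keyword table, token numbers, and the function name classify() are all hypothetical, and a real analyser would return a distinct token for each keyword rather than a single KEYWORD token.

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define KEYWORD    1
    #define IDENTIFIER 2
    #define NUMBER     3

    /* Hypothetical keyword table, kept sorted so that bsearch() works. */
    static const char *keywords[] = { "ADD", "LDA", "LOOP", "MULT", "START" };
    #define NKEYWORDS (sizeof keywords / sizeof keywords[0])

    static int cmp(const void *key, const void *elem)
    {
        return strcmp((const char *)key, *(const char **)elem);
    }

    /* Read one run of alphanumeric/underscore characters from fp into buf,
       setting the three flags described above, and classify the run.
       Returns -1 if the next character does not start such a run
       (the caller must then handle that character itself). */
    int classify(FILE *fp, char *buf, size_t size)
    {
        int c, has_alpha = 0, has_digit = 0, has_under = 0;
        size_t i = 0;

        while ((c = fgetc(fp)) != EOF && (isalnum(c) || c == '_')) {
            if (i < size - 1)
                buf[i++] = (char)c;
            if (isalpha(c))      has_alpha = 1;
            else if (isdigit(c)) has_digit = 1;
            else                 has_under = 1;
        }
        if (c != EOF)
            ungetc(c, fp);              /* delimiter belongs to next item */
        buf[i] = '\0';
        if (i == 0)
            return -1;

        if (has_digit && !has_alpha && !has_under)
            return NUMBER;              /* digits only */
        if (!has_under && !has_digit &&
            bsearch(buf, keywords, NKEYWORDS, sizeof keywords[0], cmp))
            return KEYWORD;             /* matched a fixed string */
        return IDENTIFIER;              /* everything else */
    }

Note how the underscore flag lets the function skip the table search entirely, exactly as described above. The search, when it does happen, compares the buffered characters against the stored keywords once more; this repeated comparison is what the next approach removes.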

Second Approach - Finite State Machine

The first approach described above is easy to conceptualise and can be implemented as a program without much difficulty. But the need to search words in the table of keywords, and the other matching operations in the remaining part, essentially require each input character to be compared with constants multiple times. (Exercise: Why is each character compared with constants multiple times?) This situation can be improved in terms of computational complexity. The second approach to building a lexical analyser uses a Finite State Machine (FSM). It reduces the computational complexity such that each input character has to be handled only once (in fact, there is no comparison at all). In an FSM a set of states is defined for the system. Initially the system assumes a predefined state called the starting state, and as the input is processed, the system makes a transition to another state (a transition to the same state is also possible) for each input symbol. In effect, a state indicates what sequence of input symbols has been seen so far and what action to take as a result. The action may be simply to make another state transition using the next input character, or to declare that a token has been recognised in the sequence of input symbols. In the latter case the system again assumes the starting state, so that it can proceed to process the remaining input characters and recognise more tokens. Usually, the action associated with each state also includes some other tasks, such as updating some program data structures, etc.

For different languages the FSM for recognising the tokens will vary in the set of states, the state transitions to be made, and the tokens to be recognised. It needs to be understood that the implementation of an FSM as a program has two parts: some static data that describes the states and the state transitions, and a set of instructions for the program to proceed through a series of state transitions. The data is structured as a state transition table, where each row stands for a state of the FSM and each column stands for an input symbol. Each entry in the table indicates the state that the FSM should enter from the current state (corresponding to the row) upon encountering the input symbol corresponding to the column. With this information in the table, the code of the FSM simply selects the next state, corresponding to the current state and the next input symbol, by a lookup of the table. However, the other actions associated with each state also need to be incorporated.
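To make the table-driven idea concrete, here is a minimal sketch of an FSM that recognises only identifiers and numbers. The states, character classes, and token numbers are illustrative; a real analyser would have more columns and states for commas, literals, keywords, etc.

    #include <ctype.h>
    #include <stdio.h>

    #define IDENTIFIER 1
    #define NUMBER     2

    enum state  { START, IN_ID, IN_NUM, DONE_ID, DONE_NUM, NSTATES };
    enum cclass { LETTER, DIGIT, OTHER, NCLASSES };

    /* next_state[s][c] = state entered from state s on a character of
       class c. The DONE_* states mean "an item just ended"; characters
       of class OTHER seen in START are simply skipped. */
    /*                LETTER     DIGIT    OTHER     */
    static const enum state next_state[NSTATES][NCLASSES] = {
    /* START  */    { IN_ID,     IN_NUM,  START    },
    /* IN_ID  */    { IN_ID,     IN_ID,   DONE_ID  },
    /* IN_NUM */    { DONE_NUM,  IN_NUM,  DONE_NUM },
    };

    static enum cclass char_class(int c)
    {
        if (isalpha(c) || c == '_') return LETTER;
        if (isdigit(c))             return DIGIT;
        return OTHER;
    }

    /* Return the next token, or EOF when input is exhausted. Each input
       character causes exactly one table lookup -- no string comparison. */
    int next_token(FILE *fp)
    {
        enum state s = START;
        int c;

        while ((c = fgetc(fp)) != EOF) {
            s = next_state[s][char_class(c)];
            if (s == DONE_ID || s == DONE_NUM) {
                ungetc(c, fp);      /* this character starts the next item */
                return s == DONE_ID ? IDENTIFIER : NUMBER;
            }
        }
        if (s == IN_ID)  return IDENTIFIER;   /* item pending at EOF */
        if (s == IN_NUM) return NUMBER;
        return EOF;
    }

(Actions such as buffering the characters of the item, or updating symbol tables, would be attached at the points where the state changes.)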

The tedious part of implementing the FSM model as a lexical analyser program for a particular language is determining the states and the state transitions. Fortunately there exist tools that make this very simple. One such tool is lex in UNIX (flex in Linux). This tool takes the lexical specifications in a file, in the form of a regular expression for each token to be recognised. The actions to be taken upon recognition of each token are specified along with the regular expression for that token. The tool then creates a C program file that essentially contains a lexical analyser function named yylex(), which can be called to perform lexical analysis according to the specifications encoded. Thus lex (or flex) is a tool that creates a lexical analyser from lexical specifications and the actions corresponding to tokens, provided as input. [See man flex for more details]
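As an illustration, a minimal flex specification for the assembly-language tokens used earlier might look like this (the token names and patterns are illustrative):

    %{
    /* Token numbers; in a real assembler these would be in a header. */
    #define IDENTIFIER 1
    #define NUMBER     2
    #define COMMA      3
    #define KEY_LDA    4
    %}
    %option noyywrap
    %%
    "LDA"                   { return KEY_LDA; }
    [A-Za-z_][A-Za-z0-9_]*  { return IDENTIFIER; }
    [0-9]+                  { return NUMBER; }
    ","                     { return COMMA; }
    [ \t\n]+                ;   /* skip whitespace */
    .                       ;   /* ignore anything else */
    %%

Running flex on this file produces lex.yy.c, whose yylex() returns one of the token numbers above on each call and 0 at the end of input. Note that placing the rule for "LDA" before the identifier rule makes flex prefer the keyword when both patterns match the same text, which resolves the keyword-versus-identifier conflict discussed earlier.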