Lexical Analyser
Introduction
In the process of translation (of programs), a lexical analyser scans an input
text and recognises sequences of bytes in it as certain valid "items" or as
invalid sequences. For example, if an assembly language allows a set of
mnemonics, alphanumeric identifiers, numbers (sequences of digits), and some
symbols such as comma, colon, etc., then a lexical analyser for that assembly
language should detect such items in an input program written in that
assembly language. Usually a larger assembler program invokes a lexical
analyser module to recognise such items one-by-one, starting from the
beginning of the text. Each such distinct item that the lexical analyser may
detect and pass on (return) to the main assembler is called a
token. So for an assembly language there can be tokens such as
IDENTIFIER, NUMBER, LITERAL, COMMA, KEY_LDA, KEY_START, KEY_LOOP (the prefix
KEY_ may indicate that the token represents a keyword), and so on. The
designer of a translator can define the set of tokens for the purpose at hand.
In an assembler, a distinct number is used to denote each token, so that the
lexical analyser can pass the number representing the token that it
recognises in the input. For example, for the above set of tokens we may use
1 for IDENTIFIER, 2 for NUMBER, 3 for LITERAL, 4 for COMMA, 5 for KEY_LDA,
and so on. From an implementation point of view, in the C language one can use
the C preprocessor's #define directive to establish these equivalences, such as,
#define IDENTIFIER 1
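The rest of the token set can be defined in the same way; for instance (the
names and numbers below are only illustrative, matching the list given above):

    #define IDENTIFIER 1
    #define NUMBER     2
    #define LITERAL    3
    #define COMMA      4
    #define KEY_LDA    5
    #define KEY_START  6
    #define KEY_LOOP   7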
Implementation
The sequences of bytes that a lexical analyser has to recognise are either
fixed or variable (generic). Fixed sequences are for keywords and some
symbols such as comma, colon, etc., where a predefined sequence of one or more
bytes (characters) is recognised as some item. For example, for each keyword
the sequence of letters is predefined. Variable or generic sequences are
for identifiers, numbers, etc. For these, some rule is given instead of any
predefined sequence. An example of a rule is, "a sequence of alphanumeric
characters and underscores, of which the first character is not a digit, is
an identifier".
Usually, if a sequence of input bytes matches a fixed sequence as well as
some rule, that sequence is recognised as the item corresponding to the fixed
sequence. For example, the sequence "MULT" may match the fixed sequence for a
keyword as well as the rule for identifiers given above. It will then be
recognised as a keyword. (Relate this to the concepts of Keyword and
Reserved word.) Some items for recognition may have a combination of a
fixed part and a variable part. For example, suppose that in a language an address
value is prefixed by an "@" character to distinguish it from a numeric
constant. Then the rule for an address value is "@ followed by a sequence of
digits".
First Approach - String comparison
One approach that can be easily imagined is to read some number of characters
from the input file and match the string against some known strings
(spellings) of items. If the input string does not match any of these known
fixed strings, then it may be further tested to match variable sequences. In
practice, most programming languages allow identifiers (names of variables,
constants, procedures, etc.) to consist of sequences of alphanumeric and
possibly "underscore" characters. The keywords are usually strings of
alphabetic characters. Numbers are usually strings of digits (decimal or
hexadecimal) and possibly a period (decimal point). Other valid items
are formed from symbols other than these characters. Fixed strings
corresponding to keywords can be stored in a table, preferably arranged in
such a way that facilitates easy search (e.g. sorted to permit binary
search). Thus, wherever in the input there is a contiguous string of
alphanumeric characters or underscores, a lexical analyser can read it into a
buffer. While doing so, it can set one boolean flag to indicate whether any
underscore was read, another to indicate whether any alphabetic character was
read, and another to indicate whether any decimal digit was read. If the
string contained no underscore or digit, then the string might be a keyword. The
string is matched against the keywords stored in the table. If a match
occurs, the lexical analyser declares the string to be that keyword.
Otherwise it is an identifier. If the string in the buffer contained one or
more underscores, it is not necessary to search the keyword table, and it can
straightaway be recognised as an identifier. If, however, the string had
only digits, it can be recognised as a number. In doing so, some more
characters from the input may also be included if there is a period followed
by some more digits. Other items of the language may be recognised by
matching the input character-by-character. The above is a description of a
possible lexical analyser for a typical programming language.
Variations of it can be implemented for any real requirement.
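As a sketch of the classification step described above, the following C
fragment uses a sorted keyword table with bsearch(); the keyword list, token
numbers, and function names are assumptions invented for this example:

    #include <stdlib.h>
    #include <string.h>

    #define IDENTIFIER 1
    #define NUMBER     2

    struct kw { const char *name; int token; };

    /* Keyword table, kept sorted by name so that bsearch() can be used. */
    static const struct kw keywords[] = {
        { "ADD", 6 }, { "LDA", 5 }, { "STA", 7 }
    };

    static int cmp(const void *key, const void *elem)
    {
        return strcmp((const char *)key, ((const struct kw *)elem)->name);
    }

    /* Classify the string collected in buf, using the boolean flags that
       were set while the string was being read into the buffer.          */
    int classify(const char *buf, int saw_underscore,
                 int saw_alpha, int saw_digit)
    {
        if (!saw_underscore && !saw_digit) {
            const struct kw *hit =
                bsearch(buf, keywords,
                        sizeof keywords / sizeof keywords[0],
                        sizeof keywords[0], cmp);
            if (hit)
                return hit->token;   /* matched a keyword */
            return IDENTIFIER;       /* purely alphabetic, not a keyword */
        }
        if (saw_digit && !saw_alpha && !saw_underscore)
            return NUMBER;           /* only digits */
        return IDENTIFIER;           /* any other mix */
    }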
Second Approach - Finite State Machine
The first approach described above is easy to conceptualise and can be
implemented as a program without much difficulty. But the need to search for
words in the table of keywords, and the other matching operations in the
remaining part, essentially require each input character to be compared
with constants multiple times. (Exercise: Why is each character compared
with constants multiple times?) This situation can be improved in terms of
computational complexity.
The second approach to building a lexical analyser is using a Finite State
Machine (FSM). It reduces the computational complexity such that each
input character has to be handled only once (in fact, there is no comparison
at all). In an FSM a set of states is defined for the system.
Initially the system assumes a pre-defined state called the starting state,
and as the input is processed, the system makes a transition to another state
(a transition to the same state is also possible) for each input symbol. In a
way, a state actually indicates what sequence of input symbols has been seen
so far and what action to take as a result. The action may be simply to
make another state transition using the next input character, or to declare that
a token has been recognised in the sequence of input symbols. In the latter
case the system again assumes the starting state so that it can proceed to
process the remaining input characters to recognise more tokens. Usually, the
action associated with each state also includes some other tasks, such as to
update some program data structures, etc.
For different languages the FSM for recognizing the tokens will vary in the
set of states, the state transitions that are to be made, and the tokens to
be recognized. It needs to be understood that the implementation of an FSM as
a program has two parts - some static data that describes the states and
the state transitions, and a set of instructions for the program to proceed
through a series of state transitions. The data is structured as a
state transition table where each row stands for a state of the FSM
and each column stands for an input symbol. Each entry in the table indicates
the state that the FSM should enter from the current state (corresponding
to the row) upon encountering the input symbol corresponding to the
column. With this information in the table, the code of the FSM need only
select the next state corresponding to the current state and the next input
symbol by a lookup of the table. However, the other actions associated with
each state also need to be incorporated.
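To make the table-driven structure concrete, here is a minimal sketch in C;
the states, character classes, and token numbers are assumptions invented for
this example, and end-of-input handling is omitted for brevity:

    #include <stdio.h>

    /* Character classes: the columns of the transition table. */
    enum { C_ALPHA, C_DIGIT, C_OTHER, NCLASS };

    /* States: the rows of the transition table. */
    enum { S_START, S_IDENT, S_NUMBER, NSTATE };

    /* Negative entries signal that a token has been recognised. */
    #define T_IDENT  -1
    #define T_NUMBER -2

    /* next_state[current state][class of next input character] */
    static const int next_state[NSTATE][NCLASS] = {
        /*             ALPHA     DIGIT     OTHER    */
        /* START  */ { S_IDENT,  S_NUMBER, S_START  },
        /* IDENT  */ { S_IDENT,  S_IDENT,  T_IDENT  },
        /* NUMBER */ { T_NUMBER, S_NUMBER, T_NUMBER },
    };

    /* Character-to-class map, filled once, so that classifying a
       character is itself a table lookup rather than a comparison. */
    static unsigned char char_class[256];

    static void init_classes(void)
    {
        int c;
        for (c = 0; c < 256; c++)  char_class[c] = C_OTHER;
        for (c = 'A'; c <= 'Z'; c++) char_class[c] = C_ALPHA;
        for (c = 'a'; c <= 'z'; c++) char_class[c] = C_ALPHA;
        for (c = '0'; c <= '9'; c++) char_class[c] = C_DIGIT;
    }

    /* Drive the FSM over the input: each step is a single table lookup. */
    void scan(const char *input)
    {
        const char *p = input;
        int state = S_START;

        init_classes();
        while (*p != '\0') {
            int s = next_state[state][char_class[(unsigned char)*p]];
            if (s < 0) {
                printf("token %d recognised\n", s);
                state = S_START;   /* restart; do not consume the character */
            } else {
                state = s;
                p++;               /* consume the character */
            }
        }
    }

In this sketch an identifier is any alphanumeric string starting with a
letter; distinguishing keywords from identifiers would require additional
states (or a final table lookup), which is exactly the tedious part that the
tools described below automate.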
The tedious part of implementing the FSM model as a lexical analyser program
for a particular language is determining the states and the state
transitions. Fortunately, there exist tools that make this very simple. One
such tool is lex in UNIX (flex in Linux). This tool takes the
lexical specifications in a file, in the form of regular expressions
corresponding to each token to be recognised. The actions to be taken upon
recognition of each token should be specified along with the regular
expression for that token. The tool then creates a C program file that
essentially contains a lexical analyser function named yylex(), which
can be called to perform lexical analysis according to the specifications
encoded. Thus lex (or flex) is a tool that creates a lexical analyser
using lexical specifications and actions corresponding to tokens provided as
input. [See man flex for more details]
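For a flavour of the input that lex/flex accepts, a minimal specification for
a few of the tokens discussed earlier might look like the following (the
token numbers and patterns are assumptions chosen to match the earlier
examples):

    %option noyywrap
    %{
    #define IDENTIFIER 1
    #define NUMBER     2
    #define COMMA      4
    #define KEY_LDA    5
    %}
    %%
    LDA                     { return KEY_LDA; }
    [A-Za-z_][A-Za-z0-9_]*  { return IDENTIFIER; }
    [0-9]+                  { return NUMBER; }
    ","                     { return COMMA; }
    [ \t\n]                 ; /* skip white space */
    %%

Running flex on such a file produces lex.yy.c, which contains yylex(); each
call to yylex() then returns the token number of the next item recognised in
the input. Note that the fixed rule for LDA is placed before the generic
identifier rule, so that, as discussed earlier, a string matching both is
recognised as the keyword.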