The main task of lexical Analyzer is to read a stream of characters as an input and produce a sequence of tokens such as names, keywords, punctuation marks etc.. for syntax analyzer.
It discards the white spaces and comments between the tokens and also keep track of line numbers.
<fig: 3.1 pp. 84>
A lexical token is a sequence of characters that can be treated as a unit in the grammar of the programming languages.
Example of tokens:
Example of non-tokens:
There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.
Regular expressions are an important notation for specifying patterns.
For example, the pattern for the Pascal identifier token, id, is: id → letter (letter | digit)*.
A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
For example, the pattern for the RELOP token contains six lexemes ( =, < >, <, < =, >, >=) so the lexical analyzer should return a RELOP token to parser whenever it sees any one of the six.
An alphabet or a character class is a finite set of symbols. Typical examples of symbols are letters and characters.
The set {0, 1} is the binary alphabet. ASCII and EBCDIC are two examples of computer alphabets.
Strings
A string over some alphabet is a finite sequence of symbol taken from that
alphabet.
For example, banana is a sequence of six symbols (i.e., string of length six) taken from ASCII computer alphabet. The empty string denoted by , is a special string with zero symbols (i.e., string length is 0).
If x and y are two strings, then the concatenation of x and y, written xy, is
the string formed by appending y to x.
For example, If x = dog and y = house, then xy = doghouse. For empty string,
, we
have S
= S
= S.
String exponentiation concatenates a string with itself a given number of times:
S2 = SS or S.S
S3 = SSS or S.S.S
S4 = SSSS or S.S.S.S and so on
By definition S0 is an empty string, , and S` = S. For example, if x =ba and na then xy2 = banana.
Languages
A language is a set of strings over some fixed alphabet. The language may
contain a finite or an infinite number of strings.
Let L and M be two languages where L = {dog, ba, na} and M = {house, ba} then
The kleene closure of language L, denoted by L*, is "zero or more Concatenation of" L.
L* = L0 U L` U L2 U L3 . . . U Ln . . .
For example, If L = {a, b}, then
L* = {, a, b, aa, ab, ab, ba, bb, aaa, aba, baa, . . . }
The positive closure of Language L, denoted by L+, is "one or more Concatenation of" L.
L+ = L` U L2 U L3 . . . U Ln . . .
For example, If L = {a, b}, then
L+ = {a, b, aa, ba, bb, aaa, aba, . . . }