Finite automata correspond one-to-one with transition diagrams; from any transition diagram one can write down the formal automaton in terms of items #1-#5 above, and vice versa. To draw the transition diagram for a finite automaton: draw a circle for each state and a labeled arrow for each transition, mark accepting states with double circles, and mark the start state with an unlabeled incoming arrow.
A DFA can be implemented directly as a loop around a switch on the current state, as in this C fragment:

```c
int state = 0;                  /* start state S0 */
int input = getchar();
for (;;) {
    switch (state) {
    case 0:
        switch (input) {
        case 'a': state = 1; input = getchar(); break;
        case 'b': input = getchar(); break;     /* stay in state 0 */
        default: printf("dfa error\n"); exit(1);
        }
        break;
    case 1:
        switch (input) {
        case EOF: printf("accept\n"); exit(0);
        default: printf("dfa error\n"); exit(1);
        }
    }
}
```
Fortunately, one can prove that for any NFA there is an equivalent DFA; NFAs are just a notational convenience. So, finite automata help us get from a set of regular expressions to a computer program that recognizes them efficiently.
Operations to keep track of sets of NFA states:

- e_closure(T): the set of NFA states reachable from some state in T on epsilon-transitions alone
- move(T, a): the set of NFA states to which there is a transition on input symbol a from some state in T
Algorithm:
```
Dstates := { e_closure(start_state) }
while T := unmarked_member(Dstates) do {
   mark(T)
   for each input symbol a do {
      U := e_closure(move(T, a))
      if not member(Dstates, U) then
         insert(Dstates, U)
      Dtran[T, a] := U
   }
}
```
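As a minimal sketch, the e_closure operation can be computed with a worklist; this C fragment assumes a hypothetical adjacency-list NFA representation (eps, neps, and the fixed MAXSTATES bound are assumptions, not from the notes):

```c
#include <stdbool.h>

#define MAXSTATES 128

/* Hypothetical NFA representation: eps[s] lists the epsilon-successors
   of state s, and neps[s] gives how many there are. */
extern int eps[MAXSTATES][MAXSTATES];
extern int neps[MAXSTATES];

/* Compute the epsilon-closure of the state set T in place, using a
   worklist of states whose epsilon-edges have not yet been explored. */
void e_closure(bool T[MAXSTATES])
{
    int stack[MAXSTATES], top = 0;
    for (int s = 0; s < MAXSTATES; s++)
        if (T[s])
            stack[top++] = s;
    while (top > 0) {
        int s = stack[--top];
        for (int i = 0; i < neps[s]; i++) {
            int t = eps[s][i];
            if (!T[t]) {                /* newly reachable on epsilon */
                T[t] = true;
                stack[top++] = t;
            }
        }
    }
}
```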
The C code generated by lex has the following public interface. Note the use of global variables instead of parameters, and the use of the prefix yy to distinguish scanner names from your program names. This prefix is also used in the YACC parser generator.
```c
FILE *yyin;     /* set this variable prior to calling yylex() */
int yylex();    /* call this function once for each token */
char yytext[];  /* yylex() writes the token's lexeme to this array */
```
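A typical driver for this interface calls yylex() in a loop until it returns 0 at end of file; a minimal sketch (the tab-separated output format is just for illustration):

```c
#include <stdio.h>

extern FILE *yyin;       /* lex's public interface, as above */
extern int yylex(void);
extern char yytext[];

/* Print each token's category code and lexeme, one per line. */
int main(int argc, char *argv[])
{
    int category;
    if (argc < 2 || (yyin = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    while ((category = yylex()) != 0)   /* 0 means end of file */
        printf("%d\t%s\n", category, yytext);
    return 0;
}
```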
The .l file format consists of a mixture of lex syntax and C code fragments. The percent sign (%) is used to signify lex elements. The whole file is divided into three sections separated by %%:
```
header
%%
body
%%
helper functions
```
The header consists of C code fragments enclosed in %{ and %} as well as macro definitions consisting of a name and a regular expression denoted by that name. lex macros are invoked explicitly by enclosing the macro name in curly braces. Following are some example lex macros.
```
letter  [a-zA-Z]
digit   [0-9]
ident   {letter}({letter}|{digit})*
```
The body consists of a sequence of regular expressions for different token categories and other lexical entities. Each regular expression can have a C code fragment enclosed in curly braces that executes when that regular expression is matched. For most of the regular expressions this code fragment (also called a semantic action) consists of returning an integer that identifies the token category to the rest of the compiler, particularly for use by the parser to check syntax. Some typical regular expressions and semantic actions might include:
" " { /* no-op, discard whitespace */ } {ident} { return IDENTIFIER; } "*" { return ASTERISK; }You also need regular expressions for lexical errors such as unterminated character constants, or illegal characters.
The helper functions in a lex file typically compute lexical attributes, such as the actual integer or string values denoted by literals.
```c
struct token {
   int category;
   char *text;
   int linenumber;
   int column;
   char *filename;
   union literal value;
};
```
The union literal will hold computed values of integers, real numbers, and strings.
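A sketch of such a helper, using the struct token above; yylineno is maintained by flex when %option yylineno is used, and column tracking is omitted for brevity:

```c
#include <stdlib.h>
#include <string.h>

extern char yytext[];
extern int yylineno;     /* provided by flex with %option yylineno */

/* Build a token for the current lexeme. Computing the value field
   would switch on the category (e.g. atoi(yytext) for integers). */
struct token *alloctoken(int category, char *filename)
{
    struct token *t = malloc(sizeof(struct token));
    if (t == NULL)
        exit(1);
    t->category = category;
    t->text = strdup(yytext);
    t->linenumber = yylineno;
    t->column = 0;
    t->filename = filename;
    return t;
}
```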
A hash table or other efficient data structure can avoid this duplication. The software engineering design pattern to use is called the "flyweight". Example: Figure 3.18, p. 109. Use "install_id()" instead of "strdup()" to avoid duplication in the lexical data.
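A minimal sketch of install_id() using a chained hash table; the table size, hash function, and entry layout here are arbitrary choices, not from the text:

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211

static struct entry { char *s; struct entry *next; } *tab[NBUCKETS];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Return a shared copy of s, allocating it only the first time. */
char *install_id(const char *s)
{
    unsigned h = hash(s);
    for (struct entry *e = tab[h]; e != NULL; e = e->next)
        if (strcmp(e->s, s) == 0)
            return e->s;            /* already interned: share it */
    struct entry *e = malloc(sizeof *e);
    e->s = strdup(s);
    e->next = tab[h];
    tab[h] = e;
    return e->s;
}
```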
```
let X = the start symbol s
while there is some nonterminal Y in X do
   apply any one production rule using Y, e.g. Y -> w
```
When X consists only of terminal symbols, it is a string of the language denoted by the grammar. Each iteration of the loop is a derivation step. If an iteration has several nonterminals to choose from at some point, the rules of derivation would allow any of these to be applied. In practice, parsing algorithms tend to always choose the leftmost nonterminal, or the rightmost nonterminal, resulting in strings that are leftmost derivations or rightmost derivations.
The grammar

```
E -> E + E
E -> E * E
E -> ( E )
E -> ident
```

allows two different derivations for strings such as "x + y * z". The grammar is ambiguous, but the semantics of the language dictate a particular operator precedence that should be used. One way to eliminate such ambiguity is to rewrite the grammar. For example, we can force the precedence we want by adding some nonterminals and production rules.
```
E -> E + T
E -> T
T -> T * F
T -> F
F -> ( E )
F -> ident
```
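With the rewritten grammar, "x + y * z" has only one leftmost derivation; the * is applied lower in the parse tree (binding more tightly) than the +:

```
E => E + T => T + T => F + T => x + T => x + T * F
  => x + F * F => x + y * F => x + y * z
```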
```
S -> A B C
A -> a A
A -> epsilon
B -> b
C -> c
```
maps to code like:
```
procedure S()
   if A() & B() & C() then succeed
end

procedure A()
   if currtoken == a then      # use production 2
      currtoken = scan()
      return A()
   else
      succeed                  # production rule 3, match epsilon
end

procedure B()
   if currtoken == b then
      currtoken = scan()
      succeed
   else
      fail
end

procedure C()
   if currtoken == c then
      currtoken = scan()
      succeed
   else
      fail
end
```
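The same parser rendered as a runnable C sketch; scan() here just reads single characters, where a real parser would call the lexical analyzer:

```c
#include <stdio.h>
#include <stdlib.h>

int currtoken;

int scan(void) { return getchar(); }    /* stand-in for yylex() */

void fail(void) { printf("syntax error\n"); exit(1); }

void A(void)
{
    if (currtoken == 'a') {             /* production A -> a A */
        currtoken = scan();
        A();
    }                                   /* else A -> epsilon: match nothing */
}

void B(void)
{
    if (currtoken == 'b') currtoken = scan();   /* B -> b */
    else fail();
}

void C(void)
{
    if (currtoken == 'c') currtoken = scan();   /* C -> c */
    else fail();
}

void S(void) { A(); B(); C(); }         /* S -> A B C */

int main(void)
{
    currtoken = scan();
    S();
    printf("accepted\n");
    return 0;
}
```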
```
E -> E + T | T
T -> T * F | F
F -> ( E ) | ident
```
We can remove the left recursion by introducing new nonterminals and new production rules.
```
E  -> T E'
E' -> + T E' | epsilon
T  -> F T'
T' -> * F T' | epsilon
F  -> ( E ) | ident
```
Getting rid of such immediate left recursion is not enough; one must also get rid of indirect left recursion, where two or more nonterminals are mutually left-recursive. One can rewrite any CFG to remove left recursion (Algorithm 4.1).
```
for i := 1 to n do
   for j := 1 to i-1 do begin
      replace each production of the form Ai -> Aj gamma
      by the productions Ai -> delta1 gamma | ... | deltak gamma,
      where Aj -> delta1 | ... | deltak are the current Aj-productions
   end
   eliminate the immediate left recursion among the Ai-productions
```
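For instance, applying the algorithm to the classic textbook grammar

```
S -> A a | b
A -> A c | S d | epsilon
```

with the nonterminals ordered S, A: for i = 2, j = 1, the production A -> S d is replaced by A -> A a d | b d, and eliminating the remaining immediate left recursion in A then gives:

```
S  -> A a | b
A  -> b d A' | A'
A' -> c A' | a d A' | epsilon
```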
Example: with the grammar

```
S -> cAd
A -> ab
A -> a
```

and input "cad", a top-down parser cannot tell which A-production to use after seeing only the first a, and may have to backtrack. Left factoring can often solve such problems:
```
S  -> cAd
A  -> a A'
A' -> b
A' -> epsilon
```
One can also perform left factoring (Algorithm 4.2) to reduce or eliminate the lookahead or backtracking needed to tell which production rule to use. If the end result has no lookahead or backtracking needed, the resulting CFG can be parsed by a "predictive parser" and coded easily in a conventional language. If backtracking is needed, a recursive descent parser takes more work to implement, but is still feasible. As a more concrete example:
```
S -> if E then S
S -> if E then S1 else S2
```
can be factored to:
```
S  -> if E then S S'
S' -> else S2 | epsilon
```
To compute FIRST(X) for a production X -> Y1 Y2 ... Yn, add FIRST(Y1) to FIRST(X), and then:

```
for (i = 1; Yi can derive epsilon; i++)
   add First(Yi+1) to First(X)
```
Example. For the grammar
```
(1) S -> aABe
(2) A -> Abc
(3) A -> b
(4) B -> d
```
the string "abbcde" can be parsed bottom-up by the following reduction steps:
```
abbcde
aAbcde
aAde
aABe
S
```
Definition: a handle is a substring that matches the right-hand side of some production rule, and whose reduction to the nonterminal on the left-hand side of that rule represents one step along the reverse of a rightmost derivation.
The parser starts in the configuration

| Stack | Input |
|---|---|
| $ | w$ |

where w is the string to be parsed. At each step, the parser performs one of the following actions: shift, reduce, accept, or error (see the algorithm below).
The LR parsing algorithm is given below. See Figure 4.29 for a schematic.
```
ip = first symbol of input
repeat {
   s = state on top of parse stack
   a = *ip
   case action[s,a] of {
      SHIFT s':       { push(a); push(s'); advance ip }
      REDUCE A->beta: { pop 2*|beta| symbols;
                        s' = new state on top;
                        push A;
                        push goto(s', A) }
      ACCEPT:         return 0 /* success */
      ERROR:          { error("syntax error", s, a); halt }
   }
}
```
Example shift-reduce conflict:

```
S -> if E then S
S -> if E then S else S
```
In many languages two nested "if" statements produce a situation where an "else" clause could legally belong to either "if". The usual rule (to shift) attaches the else to the nearest (i.e. inner) if statement. Example reduce-reduce conflict:
```
(1) S -> id LP plist RP
(2) S -> E GETS E
(3) plist -> plist , p
(4) plist -> p
(5) p -> id
(6) E -> id LP elist RP
(7) E -> id
(8) elist -> elist , E
(9) elist -> E
```
By the time the stack holds ...id LP id, the parser cannot tell whether to reduce the id on top to p (rule 5) or to E (rule 7).
Definition: An LR(0) item of a grammar G is a production of G with a dot at some position of the RHS.
Example: The production A->aAb gives the items:
```
A -> .aAb
A -> a.Ab
A -> aA.b
A -> aAb.
```
Note: a production A -> epsilon generates only one item:

```
A -> .
```
Intuition: an item A -> alpha . beta denotes that we have already seen a string derivable from alpha, and that we hope next to see a string derivable from beta.
Closure: if I is a set of items for a grammar G, then closure(I) is the set of items constructed as follows:

- every item in I is in closure(I)
- if A -> alpha . B beta is in closure(I) and B -> gamma is a production, then add the item B -> . gamma to closure(I), if it is not already there

These two rules are applied repeatedly until no new items can be added.
Intuition: if A -> alpha . B beta is in closure(I), then we hope to see a string derivable from B in the input. So if B -> gamma is a production, we should hope to see a string derivable from gamma. Hence, B -> . gamma is in closure(I).
Goto: if I is a set of items and X is a grammar symbol, then goto(I,X) is defined to be:
```
goto(I,X) = closure({ [A -> alpha X . beta] | [A -> alpha . X beta] is in I })
```
Intuition: goto(I,X) is the set of items that are valid after the grammar symbol X has been seen; it characterizes the parser state reached from I on X.
```
E -> E+T | T
T -> T*F | F
F -> (E) | id
```
Let I = {[E -> E.+T]}; then:
```
goto(I,+) = closure({[E -> E+.T]})
          = closure({[E -> E+.T], [T -> .T*F], [T -> .F]})
          = closure({[E -> E+.T], [T -> .T*F], [T -> .F], [F -> .(E)], [F -> .id]})
          = { [E -> E+.T], [T -> .T*F], [T -> .F], [F -> .(E)], [F -> .id] }
```
```
begin
   C := { closure({[S' -> .S]}) };
   repeat
      for each set of items I in C:
         for each grammar symbol X:
            if goto(I,X) is not empty and goto(I,X) is not in C then
               add goto(I,X) to C;
   until no new sets of items can be added to C;
   return C;
end
```
Valid Items: an item A -> beta1 . beta2 is valid for a viable prefix alpha beta1 if there is a derivation:

```
S' =>*rm alpha A w =>rm alpha beta1 beta2 w
```

Suppose A -> beta1 . beta2 is valid for alpha beta1, and alpha beta1 is on the parsing stack. Then: if beta2 != epsilon, the right move is to shift; if beta2 = epsilon, it appears that A -> beta1 is the handle and the right move is to reduce.
Note: two valid items may tell us to do different things for the same viable prefix. Some of these conflicts can be resolved using lookahead on the input string.
Example:
```
S -> aABe     FIRST(S)  = {a}    FOLLOW(S)  = {$}
A -> Abc      FIRST(A)  = {b}    FOLLOW(A)  = {b,d}
A -> b        FIRST(B)  = {d}    FOLLOW(B)  = {e}
B -> d        FIRST(S') = {a}    FOLLOW(S') = {$}

I0 = closure([S'->.S]) = closure([S'->.S], [S->.aABe])
goto(I0,S) = closure([S'->S.]) = I1
goto(I0,a) = closure([S->a.ABe]) = closure([S->a.ABe], [A->.Abc], [A->.b]) = I2
goto(I2,A) = closure([S->aA.Be], [A->A.bc]) = closure([S->aA.Be], [A->A.bc], [B->.d]) = I3
goto(I2,b) = closure([A->b.]) = I4
goto(I3,B) = closure([S->aAB.e]) = I5
goto(I3,b) = closure([A->Ab.c]) = I6
goto(I3,d) = closure([B->d.]) = I7
goto(I5,e) = closure([S->aABe.]) = I8
goto(I6,c) = closure([A->Abc.]) = I9
```
```
declarations
%%
grammar
%%
subroutines
```
The declarations section defines the terminal symbols (tokens) and nonterminal symbols. The most useful declarations are:

- %token, which declares terminal symbols (the token category codes returned by yylex())
- %start, which declares the start symbol (by default, the left-hand side of the first production rule)
- %union and %type, which declare the types of semantic values for symbols
- %left, %right, and %nonassoc, which declare operator associativity and (by order of appearance) precedence
The grammar gives the production rules, interspersed with program code fragments called semantic actions that let the programmer do what's desired when the grammar productions are reduced. They follow the syntax
```
A : body ;
```
where body is a sequence of 0 or more terminals, nonterminals, or semantic actions (code, in curly braces) separated by spaces. As a notational convenience, multiple production rules may be grouped together using the vertical bar (|).
```
%right ASSIGN
%left PLUS MINUS
%left TIMES DIVIDE
%right POWER
%%
expr : expr ASSIGN expr
     | expr PLUS expr
     | expr MINUS expr
     | expr TIMES expr
     | expr DIVIDE expr
     | expr POWER expr
     ;
```
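As a sketch of how semantic actions fit into grammar rules (the token names and the use of integer semantic values here are illustrative assumptions):

```
%token NUMBER PLUS TIMES
%%
expr : expr PLUS term    { $$ = $1 + $3; }
     | term
     ;
term : term TIMES NUMBER { $$ = $1 * $3; }
     | NUMBER
     ;
```

When a rule's action is omitted, yacc supplies the default action $$ = $1.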
yacc provides a reserved token named error that may be placed in production rules where errors are expected, giving the parser a way to recover and continue parsing.
What we have at the start of semantic analysis is a tree built by the parser; tree nodes can have all the synthesized attributes they want. In practice, attributes get stored in parse tree nodes and the semantic rules are evaluated either (a) during parsing (for easy rules) or (b) during one or more (sub)tree traversals.
```c
struct c_type {
   int base_type;                 /* 1 = int, 2 = float, ... */
   union {
      struct array {
         int size;
         struct c_type *elemtype;
      } a;                        /* arrays */
      struct c_type *p;           /* pointers */
      struct struc {
         char *label;
         struct field **f;
      } s;                        /* structs */
   } u;
};

struct field {
   char *name;
   struct c_type *elemtype;
};
```
Given this representation, how would you initialize a variable to represent each of the following types?
```
int [10][20]
struct foo { int x; char *s; }
```
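One possible answer, sketched with C99 designated initializers; the base_type codes other than 1 = int and 2 = float are assumptions, as is the NULL terminator on the field array:

```c
struct c_type int_type     = { 1 };                     /* 1 = int */
struct c_type char_type    = { 3 };                     /* 3 = char (assumed) */
struct c_type charptr_type = { 4, .u.p = &char_type };  /* 4 = pointer (assumed) */

/* int [10][20]: an array of 10 elements, each an array of 20 ints */
struct c_type int20    = { 5, .u.a = { 20, &int_type } };   /* 5 = array (assumed) */
struct c_type int10x20 = { 5, .u.a = { 10, &int20 } };

/* struct foo { int x; char *s; } */
struct field fx = { "x", &int_type };
struct field fs = { "s", &charptr_type };
struct field *foo_fields[] = { &fx, &fs, NULL };        /* NULL-terminated (assumed) */
struct c_type foo_type = { 6, .u.s = { "foo", foo_fields } };  /* 6 = struct (assumed) */
```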
Scope rules for each language determine how to go from names to declarations.
Each use of a variable name must be associated with a declaration. This is generally done via a symbol table. In most compiled languages it happens at compile time (in contrast, for example, with LISP).
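A minimal sketch of symbol table lookup with nested scopes; all the names and the linear-list representation here are illustrative assumptions:

```c
#include <string.h>
#include <stddef.h>

/* One symbol table per scope, chained to the enclosing scope. */
struct sym {
    char *name;
    struct sym *next;        /* next symbol in this scope */
    /* type, offset, and other attributes would go here */
};

struct scope {
    struct sym *syms;        /* linear list for brevity; a real
                                compiler would hash within a scope */
    struct scope *parent;    /* enclosing scope, NULL at outermost */
};

/* Search the innermost scope first, then enclosing scopes. */
struct sym *lookup(struct scope *sc, const char *name)
{
    for (; sc != NULL; sc = sc->parent)
        for (struct sym *s = sc->syms; s != NULL; s = s->next)
            if (strcmp(s->name, name) == 0)
                return s;
    return NULL;             /* undeclared */
}
```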
```
+--------------------------------------------------+
| code                                             |
+--------------------------------------------------+
| static data                                      |
+--------------------------------------------------+
| stack (grows down)                               |
+--------------------------------------------------+
| heap (may grow up, from bottom of address space) |
+--------------------------------------------------+
```
```
        +------------------------------+
        | return value                 |
        | parameter                    |
        | ...                          |
        | parameter                    |
        | previous frame pointer (FP)  |
        | saved registers              |
        | ...                          |
FP -->  | saved PC                     |
        | local                        |
        | ...                          |
        | local                        |
        | temporaries                  |
SP -->  | ...                          |
        +------------------------------+
```
Object-oriented programs are the same, except that every activation record has an associated object instance; activation records need one extra "register" to point to that instance.
Goal-directed programs have an activation tree each instant, due to suspended activations that may be resumed for additional results. The lifetime view is a sort of multidimensional tree, with three types of nodes.
Basic problem in garbage collection: given a piece of memory, are there any pointers to it (and if so, where exactly are they)? Approaches include reference counting, mark-and-sweep collection, and copying (compacting) collection.
Intermediate code generation can be formulated as syntax-directed translation. The semantic rules below use two helper functions: newtemp(), which allocates a new temporary variable name (t1, t2, ...) each time it is called, and newlabel(), which allocates a new unique label (L1, L2, ...).
| Production | Semantic Rules |
|---|---|
| S -> id ASN E | S.code = E.code \|\| gen(ASN, id.place, E.place) |
| E -> E1 PLUS E2 | E.place = newtemp(); E.code = E1.code \|\| E2.code \|\| gen(PLUS, E.place, E1.place, E2.place); |
| E -> E1 MUL E2 | E.place = newtemp(); E.code = E1.code \|\| E2.code \|\| gen(MUL, E.place, E1.place, E2.place); |
| E -> MINUS E1 | E.place = newtemp(); E.code = E1.code \|\| gen(NEG, E.place, E1.place); |
| E -> LP E1 RP | E.place = E1.place; E.code = E1.code; |
| E -> IDENT | E.place = id.place; E.code = emptylist(); |
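For example, these semantic rules translate x := a + b * c into the following code list, with temporary names supplied by newtemp():

```
t1 := b * c
t2 := a + t1
x := t2
```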
Instruction set:
| Instruction | Description |
|---|---|
| x := y op z | store result of binary operation on y and z to x |
| x := op y | store result of unary operation on y to x |
| x := y | store y to x |
| x := &y | store address of y to x |
| x := *y | store contents pointed to by y to x |
| *x := y | store y to location pointed to by x |
| goto L | unconditional jump to L |
| if x rop y then goto L | binary conditional jump to L |
| if x then goto L | unary conditional jump to L |
| if !x then goto L | unary negative conditional jump to L |
| param x | store x as a parameter |
| call p,n,x | call procedure p with n parameters, store result in x |
| return x | return from procedure, use x as the result |
Declarations (Pseudo instructions): These declarations list size units as "bytes"; in a uniform-size environment offsets and counts could be given in units of "slots", where a slot (4 bytes on 32-bit machines) holds anything.
| Pseudo-instruction | Description |
|---|---|
| global x,n1,n2 | declare a global named x at offset n1 having n2 bytes of space |
| proc x,n1,n2 | declare a procedure named x with n1 bytes of parameter space and n2 bytes of local variable space |
| local x,n | declare a local named x at offset n from the procedure frame |
| label Ln | designate that label Ln refers to the next instruction |
| end | declare the end of the current procedure |
| Pseudo-instruction | Description |
|---|---|
| x := y field z | look up the field named z within y, store its address to x |
| class x,n1,n2 | declare a class named x with n1 bytes of class variables and n2 bytes of class method pointers |
| field x,n | declare a field named x at offset n in the class frame |
| new x | create a new instance of class named x |
Depending on your source language's semantic rules for things like "short-circuit" evaluation for boolean operators, the operators like || and && might be similar to + and * (non-short-circuit) or they might be more like if-then code.
A general technique for implementing control flow code is to add new attributes to tree nodes to hold labels that denote the possible targets of jumps. The labels in question are somewhat analogous to FIRST and FOLLOW: for any given list of instructions corresponding to a given tree node, we might want a .first attribute to hold the label for the beginning of the list, and a .follow attribute to hold the label for the next instruction that comes after the list of instructions. The .first attribute can be easily synthesized. The .follow attribute must be inherited from a sibling. The labels have to actually be allocated and attached to instructions at appropriate nodes in the tree, corresponding to grammar production rules that govern control flow. An instruction in the middle of a basic block needs neither a .first nor a .follow.
| Production | Attribute Manipulations |
|---|---|
| S -> if E then S1 | E.true = newlabel(); E.false = S.follow; S1.follow = S.follow; S.code = E.code \|\| gen(LABEL, E.true) \|\| S1.code |
| S -> if E then S1 else S2 | E.true = newlabel(); E.false = newlabel(); S1.follow = S.follow; S2.follow = S.follow; S.code = E.code \|\| gen(LABEL, E.true) \|\| S1.code \|\| gen(GOTO, S.follow) \|\| gen(LABEL, E.false) \|\| S2.code |
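For example, for if E then S1 else S2 where E is a < b, the rules above produce a code list shaped like this (L1 = E.true and L2 = E.false come from newlabel(); Lf stands for the inherited S.follow):

```
    if a < b goto L1
    goto L2
L1: ... code for S1 ...
    goto Lf
L2: ... code for S2 ...
Lf: ...
```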
The non-short-circuit and short-circuit alternatives are illustrated by the following two examples.
Non-Short-Circuit Example

a < b or c < d and e < f translates into:

```
100: if a < b goto 103
101: t1 = 0
102: goto 104
103: t1 = 1
104: if c < d goto 107
105: t2 = 0
106: goto 108
107: t2 = 1
108: if e < f goto 111
109: t3 = 0
110: goto 112
111: t3 = 1
112: t4 = t2 AND t3
113: t5 = t1 OR t4
```

Short-Circuit Example
a < b or c < d and e < f translates into short-circuit code of this form:

```
    if a < b goto L1
    if c < d goto L2
    goto L3
L2: if e < f goto L1
    goto L3
L1: t = 1
    goto L4
L3: t = 0
L4: ...
```

Note: L3 might instead be the target E.false; L1 might instead be E.true; no computation of a 0 or 1 into t might be needed at all.

Final Code Generation
Final code generation takes a linear sequence of 3-address intermediate code instructions, and translates each 3-address instruction into one or more native instructions. The big issues in code generation are (a) instruction selection, and (b) register allocation and assignment.

Instruction Selection
The hardware may have many different sequences of instructions to accomplish a given task. Instruction selection must choose a particular sequence. At issue may be: how many registers to use, whether a special-case instruction is available, and what addressing mode(s) to use. Given a choice among equivalent/alternative sequences, the decision on which sequence of instructions to use is based on estimates or measurements of which sequence executes the fastest. This is usually determined by the number of memory references incurred during execution, including the memory references for the instructions themselves. Simply picking the shortest sequence of instructions is often a good approximation of the optimal result, since fewer instructions usually translates into fewer memory references.

Register Allocation and Assignment
Accessing values in registers is much faster than accessing main memory. Register allocation denotes the selection of which variables will go into registers. Register assignment is the determination of exactly which register a given variable is placed in. The goal of these operations is generally to minimize the total number of memory accesses required by the program.

In the old days, there were load-store hardware architectures in which only one (accumulator) register was present. On such an architecture, register allocation and assignment is not needed; the compiler has few options about how it uses the accumulator register. Traditional 16-bit x86 was only a little better than a load-store architecture, with 4 registers instead of 1. At the other extreme, recent history has included CPUs with 32 or more general-purpose registers. On such systems, high-quality compiler register allocation and assignment makes a huge difference in program execution speed. Unfortunately, optimal register allocation and assignment is NP-complete, so compilers must settle for doing a "good" job.
When the number of variables in use at a given time exceeds the number of registers available (the common case), some variables may be used directly from memory if the instruction set supports memory-based operations. When an instruction set does not support memory-based operations, all variables must be loaded into a register in order to perform arithmetic or logic using them.
Even if an instruction set does support memory-based operations, most compilers will want to load a value into a register while it is being used, and then spill it back out to main memory when the register is needed for another purpose. The task of minimizing memory accesses becomes the task of minimizing register loads and spills.
Code Generation for Virtual Machines
A virtual machine architecture such as the JVM changes the "final" code generation somewhat. We have seen several changes, some of which simplify final code generation and some of which complicate things.
- no registers, simplified addressing: a virtual machine may omit a register model and avoid complex addressing modes for different types of variables
- uni-size or descriptor-based values: if all variables are "the same size", some of the details of memory management are simplified. In Java most values occupy a standard "slot" size, although some values occupy two slots. In Icon and Unicon, all values are stored using a same-size descriptor.
- runtime type system: requiring type information at runtime may complicate the code generation task, since type information must be present in generated code. For example, in Java, method invocation and field access instructions must encode class information.