Fall 1996
What is a Compiler? Compiler Vs Interpreter

When Interpreter?
- To execute command lang's
- When info about data is known only at runtime (eg. arrays in APL)
- If the number of runs is one or few

Parts of compilation
o Analysis
o Synthesis

Analysis phases
1. Lexical Analysis
2. Syntax Analysis (Parsing)
3. Semantic Analysis (Type checking)

Parse tree Vs Syntax tree

Phases of a compiler (p. 10)

              Lexical Analyzer
                    \/
              Syntax Analyzer
                    \/
              Semantic Analyzer
                    \/
    Symbol -- Intermediate Code Generator -- Error
    Table           \/                       handler
              Code Optimizer
                    \/
              Code Generator
                    \/
              Target Program

    (the Symbol Table and Error handler interact with every phase)

Cousins of compiler
1. Preprocessor
2. Assembler
3. Loaders and Link-editors

Front End Vs Back End

Compiler construction tools
1. Parser generators
2. Scanner generators
3. Syntax-directed translation engines
4. Automatic code generators (from intermediate lang)
5. Data-flow engines (for optimization)
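To show how these phases chain together in practice, here is a minimal, hypothetical driver sketch in C; every type and function name below is an assumption made for illustration, not part of any real compiler's interface.

    /* Hypothetical compiler driver: each phase consumes the previous
     * phase's output.  All types and functions here are illustrative. */
    #include <stdio.h>

    typedef struct TokenList  TokenList;    /* output of lexical analysis */
    typedef struct SyntaxTree SyntaxTree;   /* output of parsing          */
    typedef struct IRCode     IRCode;       /* intermediate code          */

    extern TokenList  *lex_analyze(FILE *src);           /* assumed */
    extern SyntaxTree *parse(TokenList *toks);            /* assumed */
    extern void        check_types(SyntaxTree *tree);     /* assumed */
    extern IRCode     *gen_intermediate(SyntaxTree *tree);/* assumed */
    extern IRCode     *optimize(IRCode *ir);               /* assumed */
    extern void        gen_target(IRCode *ir, FILE *out);  /* assumed */

    int compile(FILE *src, FILE *out)
    {
        TokenList  *toks = lex_analyze(src);       /* analysis ...   */
        SyntaxTree *tree = parse(toks);
        check_types(tree);
        IRCode     *ir   = gen_intermediate(tree); /* ... synthesis  */
        ir = optimize(ir);
        gen_target(ir, out);
        return 0;
    }

The analysis half (lexing, parsing, type checking) builds up the intermediate form; the synthesis half (optimization, code generation) turns it into the target program.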
Terminals, nonterminals, productions, tokens.
Context-Free Grammar (CFG), Language (token strings derived from S)

Parsing
- process of finding a parse tree for a string of tokens
- process of determining whether a string of tokens can be generated by a grammar.
Time: worst - O(n^3); typical - O(n)

Parse tree (eg. 9-5+2, p. 28)

Ambiguous grammar - more than one parse tree for a token string
eg.
1. S -> S + S | S - S | a
2. S -> S S + | S S - | a   (ambiguous?) postfix.

Why postfix?
- only one interpretation
- no parentheses needed to disambiguate

Parsing types
- Top down (start from start symbol and derive string)
  o Efficient hand-written parsers
  eg. Recursive-Descent, Predictive
- Bottom up (start from string and reduce to start symbol)
  o Handles larger class of grammars
  o Used by parser generators
  eg. LR parser

Recursive-descent
- 1 proc for each nonterminal (NT). Call the proc if the RHS has the NT.
problems:
1. Backtracking
2. Left recursion

So, Predictive parsing
- no backtracking
- depends on the first symbols generated by the RHS of a production
  (the first symbols of a production's alternatives should be distinct)
- decide production to use based on lookahead symbol
- NT on RHS? call the corres proc
  T on RHS? read next token
problems:
1. Left recursion

Eliminating left recursion
  A -> Aa | b
transformed to
  A -> bR
  R -> aR | epsilon

(A small predictive parser sketch follows below.)
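To make the predictive-parsing idea concrete, here is a minimal sketch in C of a recursive-descent predictive parser for the expression grammar behind the 9-5+2 example (expr -> term rest; rest -> + term rest | - term rest | epsilon; term -> digit). It assumes single-character tokens read straight from stdin, emits postfix as it parses, and its error handling is deliberately crude.

    #include <stdio.h>
    #include <stdlib.h>
    #include <ctype.h>

    static int lookahead;                 /* 1-symbol lookahead */

    static void error(const char *msg) { fprintf(stderr, "%s\n", msg); exit(1); }

    static void match(int t)              /* consume the expected terminal */
    {
        if (lookahead == t) lookahead = getchar();
        else error("syntax error");
    }

    static void term(void)                /* term -> digit */
    {
        if (isdigit(lookahead)) { putchar(lookahead); match(lookahead); }
        else error("digit expected");
    }

    static void rest(void)                /* rest -> + term rest | - term rest | epsilon */
    {
        while (lookahead == '+' || lookahead == '-') {
            int op = lookahead;
            match(op); term(); putchar(op);   /* emit postfix on the way */
        }
    }

    static void expr(void)                /* expr -> term rest */
    {
        term(); rest();
    }

    int main(void)
    {
        lookahead = getchar();
        expr();                           /* e.g. "9-5+2" prints "95-2+" */
        putchar('\n');
        return 0;
    }

Feeding it 9-5+2 prints 95-2+, which is also why the notes above like postfix: the output needs no parentheses and has only one interpretation.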
Why separate lexical analysis phase?
1. Simpler lang design
   parser processing comments and whitespace is more complex.
2. Improved compiler efficiency
   special techniques for lexing
3. Improved compiler portability

Regular Expressions
  r & s are reg exp's: r+s, rs, r*, r?, r+
  char classes [abc], [a-z]*, etc.

Limitations of reg exp's
1. Can't use for anything that involves matching
   - balanced paren's
   - repeated strings { wcw } (type-checking problem?)
2. Can use only for fixed or unspecified number of repetitions

lex - Constructs a lexer from a lang spec (in reg lang)

Steps in lex implementation
1. Read input lang spec
2. Construct NFA with epsilon-moves (can also do DFA directly)
3. Convert NFA to DFA
4. Optimize the DFA
5. Generate parsing tables & code

lex file format
  declarations
  %%
  translation rules
  %%
  auxiliary procedures
lex eg. on p. 109

FA - used to construct a recognizer that recognizes sentences in the lang
NFA - slower but smaller recognizers
DFA - faster but bigger recognizers
NFA to DFA - subset construction
reg exp to NFA - Thompson's construction
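For contrast with a lex-generated scanner, here is a minimal hand-written lexer sketch in C recognizing identifiers, integer constants, and single-character operators; the Token struct and token codes are invented for illustration.

    #include <stdio.h>
    #include <ctype.h>

    /* Illustrative token codes; a real compiler shares these with the parser. */
    enum { TOK_ID = 256, TOK_NUM, TOK_EOF };

    typedef struct {
        int  kind;           /* TOK_ID, TOK_NUM, TOK_EOF, or the character itself */
        char text[64];       /* lexeme                                            */
        long value;          /* numeric value when kind == TOK_NUM                */
    } Token;

    /* Skip whitespace, then run a small hand-coded DFA on the next lexeme. */
    Token next_token(FILE *in)
    {
        Token t = { TOK_EOF, "", 0 };
        int c = getc(in);

        while (isspace(c)) c = getc(in);      /* the whitespace job the parser is spared */
        if (c == EOF) return t;

        if (isalpha(c) || c == '_') {         /* identifier: letter (letter|digit)* */
            size_t n = 0;
            while ((isalnum(c) || c == '_') && n < sizeof t.text - 1) {
                t.text[n++] = (char)c;
                c = getc(in);
            }
            t.text[n] = '\0';
            ungetc(c, in);
            t.kind = TOK_ID;
        } else if (isdigit(c)) {              /* integer constant: digit+ */
            while (isdigit(c)) {
                t.value = t.value * 10 + (c - '0');
                c = getc(in);
            }
            ungetc(c, in);
            t.kind = TOK_NUM;
        } else {
            t.kind = c;                       /* operators/punctuation: return the char */
            t.text[0] = (char)c; t.text[1] = '\0';
        }
        return t;
    }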
CFGs specify the lang syntax

Why grammars to specify lang syntax?
1. Grammar gives a precise spec of lang
2. Automatic construction of parser from grammar
3. Detects lang ambiguities & difficulties (useful in lang design)
4. Imparts structure to lang (helpful in translation & error detection)
5. Addition of new constructs is easy (helpful in lang evolution)

Parser - takes tokens from lexer and builds a parse tree

Types of parsers
1. Universal (CYK, Earley)
   o inefficient
2. Top-down (LL)
   o simple
   o suitable for hand-written parsers
3. Bottom-up (LR)
   o complex
   o covers a larger class of grammars
   o suitable for automated parsers

Parse trees & derivations
- left most
- right most
Ambiguous grammar - more than 1 left-most or right-most derivation

Why reg exp's over CFG's for lex spec's? (p. 172)
1. Lex rules are simple. For this, reg exp is sufficient; power of CFG not needed
2. Reg exp's provide a concise & easier to understand notation for tokens.
3. More efficient lexers can be constructed from reg exp's than from grammars
4. Separating syntax into lexical and non-lexical parts makes the compiler manageable

Reg exp uses: specifying syntax for id's, const's, keywords, etc.
Grammar uses: nested structures, if-then-else, etc.

if-then-else disambiguation
grammar:
  stmt -> if expr then stmt
        | if expr then stmt else stmt
input:
  if expr1 then stmt1 if expr2 then stmt2 else stmt3
'else' goes with first or second 'if'?
Solutions:
1. Modify grammar
   stmt -> m_s | u_s
   m_s  -> if expr then m_s else m_s
   u_s  -> if expr then stmt | if expr then m_s else u_s
2. Always match else with nearest if (shift) - default in yacc

Left recursion elimination eg. on p. 176
- immediate left recursion eg. A -> Aa
- non-immediate (indirect) left recursion eg. A -> Bb, B -> Aa
  o both need to be eliminated for use in top-down methods
  o bottom-up (eg. yacc) can accommodate left recursion

Left factoring
- done to produce distinct first symbols across the alternatives of a production
  (so as to make it suitable for predictive parsing)
eg. (p. 178) { LL(5) needed? }
  S -> iEtS | iEtSeS | a
  E -> b
transformed to
  S  -> iEtSS' | a
  S' -> eS | epsilon
  E  -> b

Top-down parsing (LL(1) grammar)
1. Recursive-descent
   - requires backtracking
   - left recursion in grammar not permissible
2. Predictive
   - no backtracking
   - left recursion not permissible
   - first symbols should be distinct (do left factoring)
3. Non-recursive predictive
   - explicit stack (see the sketch after this section)

LL(1) - First L: Read from Left to right
        Second L: Apply Left-most derivation
        (1): 1-symbol lookahead
LR(1) - L: Read from Left to right
        R: Use Right-most derivation for reduction
        (1): 1-symbol lookahead

Bottom-up parsing
- Reducing a string to the start symbol of the grammar, eg. p 199
Methods
  LR parsing (most general)
  Shift-reduce parsing
  Operator precedence parsing (least general)

shift/reduce conflict - to shift or to reduce? eg. the dangling-else 'if'
reduce/reduce conflict - more than 1 NT to reduce to. Which one to reduce to?
Both conflicts mean the grammar is not LR as written; ambiguity is a common cause, and an ambiguous grammar can never be LR.

LR parsers are the most general & thus the most complex of all bottom-up parsers.
Then, why LR parsers?
1. Recognizes a large class of CFGs.
2. Most general non-backtracking shift-reduce parsing method
3. LR grammars are a proper superset of predictive grammars
4. LR parsers detect syntactic errors as soon as it is possible to do so on a left-to-right scan.
Disadv of LR parsers:
1. Difficult to construct by hand - need LR parser generators

Methods for construction of LR parsing tables
1. Simple LR (SLR)
2. Look Ahead LR (LALR)
3. Canonical LR
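For the "non-recursive predictive (explicit stack)" entry above, here is a minimal table-driven sketch in C for the toy LL(1) grammar S -> ( S ) S | epsilon (balanced parentheses). The grammar and names are chosen purely for illustration, and the one-row parse table is folded into an if/else chain instead of a 2-D array.

    #include <stdio.h>

    /* Grammar (LL(1)):  S -> ( S ) S | epsilon
     * Parse table:      M[S,'('] = (S)S,   M[S,')'] = M[S,'$'] = epsilon
     * Stack symbols:    'S' (nonterminal), '(' and ')' (terminals), '$' (bottom marker)
     */
    int parse(const char *input)
    {
        char stack[256];
        int  top = 0;
        int  ip  = 0;                                  /* index of current input symbol */

        stack[top++] = '$';
        stack[top++] = 'S';                            /* start symbol on top */

        for (;;) {
            char X = stack[top - 1];                   /* top of stack      */
            char a = input[ip] ? input[ip] : '$';      /* current lookahead */

            if (X == '$' && a == '$') return 1;        /* accept */
            if (X == '(' || X == ')') {                /* terminal on top   */
                if (X != a) return 0;                  /* mismatch: reject  */
                top--; ip++;                           /* match: pop & advance */
            } else if (X == 'S') {                     /* nonterminal: consult the table */
                top--;                                 /* pop S */
                if (a == '(') {                        /* S -> ( S ) S      */
                    stack[top++] = 'S';                /* push RHS reversed */
                    stack[top++] = ')';
                    stack[top++] = 'S';
                    stack[top++] = '(';
                } else if (a == ')' || a == '$') {
                    /* S -> epsilon: push nothing */
                } else {
                    return 0;                          /* empty table entry */
                }
            } else {
                return 0;
            }
        }
    }

    int main(void)
    {
        printf("%d\n", parse("(()())"));   /* 1: balanced   */
        printf("%d\n", parse("(()"));      /* 0: unbalanced */
        return 0;
    }

The loop does exactly what the method prescribes: match and pop when a terminal is on top, consult the table and push the RHS in reverse when a nonterminal is on top, and accept when both stack and input are exhausted.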
LR grammars are a proper subset of CFGs. This is not a problem since most prog lang's can be represented by LR grammars.

Difference between LL(k) and LR(k) grammars (p 221)

Ambiguous grammars - sometimes we use them. Why?
o shorter, natural spec for some lang constructs (eg. expr's)
o isolate common constructs for special case optimization

Yacc - Yet Another Compiler Compiler - an LALR parser (p. 257 onwards)
file format
  declarations
  %%
  translation rules
  %%
  supporting C routines
yacc eg. p. 258
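As an illustration of the "shorter, natural spec" point, compare the two standard textbook expression grammars; the ambiguous one is the form typically handed to yacc along with %left precedence declarations, while the unambiguous one builds precedence and associativity into extra nonterminals.

    Ambiguous (short; in yacc, paired with  %left '+' '-'  and  %left '*' '/'):
      E -> E + E | E - E | E * E | E / E | ( E ) | id

    Unambiguous (precedence & left associativity built into the productions):
      E -> E + T | E - T | T
      T -> T * F | T / F | F
      F -> ( E ) | id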
Attribute grammars - useful in intermediate code generation
Synthesized Vs Inherited Attributes (p. 281)
Syntax trees (condensed form of parse tree)
Directed Acyclic Graph (DAG) for expressions
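A minimal sketch in C of a syntax-tree node for expressions, plus a bottom-up walk that computes a synthesized val attribute (computed from the children's attributes); the struct layout and function names are assumptions for illustration.

    #include <stdio.h>
    #include <stdlib.h>

    /* Expression syntax-tree node: interior nodes hold an operator,
     * leaves hold a number.  'val' is a synthesized attribute. */
    typedef struct Node {
        char op;                 /* '+', '-', '*', or 0 for a leaf */
        long num;                /* leaf value                     */
        struct Node *left, *right;
        long val;                /* synthesized attribute          */
    } Node;

    Node *leaf(long n)
    {
        Node *p = calloc(1, sizeof *p);
        p->num = n;
        return p;
    }

    Node *node(char op, Node *l, Node *r)
    {
        Node *p = calloc(1, sizeof *p);
        p->op = op; p->left = l; p->right = r;
        return p;
    }

    /* Synthesized attributes only: evaluate children first, then the parent. */
    long eval(Node *p)
    {
        if (p->op == 0) return p->val = p->num;
        long l = eval(p->left), r = eval(p->right);
        switch (p->op) {
        case '+': return p->val = l + r;
        case '-': return p->val = l - r;
        case '*': return p->val = l * r;
        default : return p->val = 0;     /* unreachable in this sketch */
        }
    }

    int main(void)
    {
        /* syntax tree for 9-5+2, i.e. (9-5)+2 */
        Node *t = node('+', node('-', leaf(9), leaf(5)), leaf(2));
        printf("%ld\n", eval(t));        /* prints 6 */
        return 0;
    }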
Static Vs Dynamic type checking

Structural equivalence (p. 356)
- two types are equal if their structures are equal
Name equivalence
- two types are equal if the type names are the same
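C itself can illustrate the difference: a typedef only introduces a new name for the same type, while two separately declared structs with identical members are distinct types under C's rules, even though they are structurally equal. A small sketch:

    #include <stdio.h>

    typedef int Celsius;          /* new name, same underlying type as int    */
    typedef int Fahrenheit;       /* also just int: equivalent to Celsius     */

    struct Point { int x, y; };   /* two struct types with identical          */
    struct Coord { int x, y; };   /* structure ...                            */

    int main(void)
    {
        Celsius    c = 20;
        Fahrenheit f = c;         /* OK: both are really int                  */

        struct Point p = { 1, 2 };
        /* struct Coord q = p; */ /* error: distinct types, even though they
                                     are structurally equal (name equivalence) */
        printf("%d %d %d\n", f, p.x, p.y);
        return 0;
    }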
Type Conversion
- Implicit (coercion)
- Explicit

Overloading & polymorphic functions
Type variables (p. 366) (templates?)
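A quick C illustration of the two kinds of conversion (variable names are arbitrary):

    #include <stdio.h>

    int main(void)
    {
        int    i = 7;
        double d = i;             /* implicit conversion (coercion): int -> double */
        double q = 7 / 2;         /* integer division happens first: q == 3.0      */
        double r = (double)7 / 2; /* explicit cast forces real division: r == 3.5  */
        printf("%f %f %f\n", d, q, r);
        return 0;
    }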
Memory Layout

  -----------
     Code
  -----------
  Static Data
  -----------
    Stack
     \/

     /\
    Heap
  -----------

Activation Record (AR) or frame
- Info needed by a single execution of a procedure (p. 398)

AR fields
1. Returned value (to the caller)
2. Actual parameters (from the caller)
3. Optional control link (points to caller's AR)
4. Optional access link (points to AR of proc in enclosing scope)
5. Saved machine status (PC, regs)
6. Local data
7. Temporaries

Maintaining ARs - division of tasks between caller & callee
fig 7.14 on p. 407

Access to non-local names
- static scope: use access link in AR
- dynamic scope: use control link in AR

Techniques for accessing non-local names
1. Static chains
   o follow access links thru ARs till you hit the name.
2. Display - array of pointers to ARs; array size corresponds to number of nesting levels
   o To find a name at nesting depth i, look in the activation record pointed to by the display element d[i]
   eg. on p. 421

Parameter passing methods
1. Call by value
2. Call by reference (aka call-by-address, call-by-location)
3. Copy-Restore (aka copy-in copy-out, value-result)
4. Call by name (macro?)

Symbol Tables
- used to store type, scope, storage, etc. of a symbol.
  If the symbol is a proc name then
  o number & type of args
  o method of arg passing
  o type returned
  are also stored.
Static scope: compiler maintains symbol table
Dynamic scope: program needs to maintain?
Hashing (and thus hash tables) is commonly used for symbol tables
Representing scope info in symbol table
- insert in front of chain as you enter a new scope (see the sketch below)

Dynamic Storage
1. Allocation
   a. fixed-size blocks
   b. var-size blocks
2. Deallocation
   a. Reference counts  -|
   b. Marking Techniques |-> Garbage collection
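A minimal sketch in C of the "insert in front of the chain" idea for scoped symbol tables; one hash bucket is shown as a single linked list for brevity, and all names and attributes are illustrative.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* One chain of symbol-table entries.  Entries for an inner scope are
     * inserted at the front, so lookup finds the innermost declaration
     * first, and leaving a scope just pops entries off the front. */
    typedef struct Sym {
        const char *name;
        const char *type;      /* illustrative attribute               */
        int         depth;     /* nesting depth of the declaring scope */
        struct Sym *next;
    } Sym;

    static Sym *table = NULL;  /* front of the chain     */
    static int  depth = 0;     /* current nesting depth  */

    void enter_scope(void) { depth++; }

    void leave_scope(void)
    {
        while (table && table->depth == depth) {  /* discard this scope's entries */
            Sym *dead = table;
            table = table->next;
            free(dead);
        }
        depth--;
    }

    void insert(const char *name, const char *type)
    {
        Sym *s = malloc(sizeof *s);
        s->name = name; s->type = type; s->depth = depth;
        s->next = table;                          /* insert in front of chain */
        table = s;
    }

    Sym *lookup(const char *name)
    {
        for (Sym *s = table; s; s = s->next)
            if (strcmp(s->name, name) == 0) return s;   /* innermost wins */
        return NULL;
    }

    int main(void)
    {
        insert("x", "int");                 /* outer x                        */
        enter_scope();
        insert("x", "float");               /* inner x shadows the outer one  */
        printf("%s\n", lookup("x")->type);  /* float                          */
        leave_scope();
        printf("%s\n", lookup("x")->type);  /* int                            */
        return 0;
    }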
Intermediate code - machine-independent

Why intermediate code?
1. Can retarget for a different machine by changing the back-end
2. Machine-independent code optimizer can be applied

Intermediate representations
1. syntax trees (DAGs)
2. postfix notation
3. 3-address code  x := y op z

Boolean expressions - short-circuit evaluation

'case' statement translation
1. Series of if's
2. Nested if's
3. Branch table

Backpatching - used in one-pass compiler/assembler
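As a worked example of short-circuit evaluation in three-address code (hand-translated; the labels and layout are arbitrary), `if a < b && c < d then S` can be lowered to:

            if a < b goto L1
            goto Lfalse          -- a < b is false: && short-circuits, c < d is never evaluated
    L1:     if c < d goto L2
            goto Lfalse
    L2:     ... code for S ...
    Lfalse: ...

In a one-pass translator the targets Lfalse (and the label after S) are not known when the gotos are emitted, which is exactly what backpatching fixes up later.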
Input choices
- linear rep's
  1. postfix
  2. 3-address code
- virtual machine rep
  3. stack machine code
- graphical rep
  4. Syntax trees
  5. DAGs

Output choices
1. Absolute machine lang
2. Relocatable machine lang
3. Assembly lang

Issues in the design of a code generator
1. Input to code generator
2. Target programs (output choice)
3. Memory Management
4. Instruction Selection
5. Register Allocation, etc.

Transformations on basic blocks (kinda optimization)
A. Structure-preserving
   1. Common subexpression elimination
   2. Dead-code elimination
   3. Renaming temp variables
   4. Interchange of 2 independent adjacent statements
B. Algebraic

Peep-hole optimizations
- optimize a small window (peephole) of code (not necessarily contiguous)
- chain effect: doing one peep-hole optimization uncovers other chances for peep-hole optimizations.
  Usually run the algorithm thru the code several times to optimize it fully.

Examples of peep-hole optimizations (see the sketch below):
1. Redundant instruction elimination
   o loads & stores
   o unreachable code
2. Flow-of-control optimizations
   eg. multiple goto's to single goto.
3. Algebraic simplifications
   eg. X + 0 to X; X * 1 to X, etc.
4. Strength reduction - convert costly (time-wise) operations to less costly operations
   eg. X^2 to X * X
5. Use of machine idioms (special machine instructions)
   eg. auto inc/dec while addressing memory
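A hand-made before/after illustration of peephole items 1 and 3 on an invented two-address pseudo-assembly (the mnemonics do not correspond to any particular machine):

    Before:
        MOV  a, R0        ; R0 := a
        MOV  R0, a        ; a := R0      -- redundant store: a already holds this value
        ADD  #0, R0       ; R0 := R0 + 0 -- algebraic identity x + 0 = x
        MOV  R0, b        ; b := R0

    After:
        MOV  a, R0
        MOV  R0, b

Removing the redundant store is only safe if that instruction carries no label (i.e., it cannot be reached by a jump), which is why peephole rules are stated over a small window of the code.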
A. Function-preserving transformations (p. 592)
1. Common subexpression elimination
2. Copy propagation
3. Dead-code elimination
4. Constant folding

B. Loop optimizations (p. 596)
1. Loop-invariant code motion
2. Induction-variable elimination
3. Strength reduction

Beyond this, in the text... stuff you need to know to get a PhD in compilers :)
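A source-level C illustration of two of the loop optimizations; a real optimizer performs these on intermediate code rather than on the source, and the arrays and names below are made up.

    /* Before: x*y is loop-invariant, and i*4 is an induction expression. */
    void before(int *a, int n, int x, int y)
    {
        for (int i = 0; i < n; i++)
            a[i] = x * y + i * 4;
    }

    /* After: the invariant x*y is hoisted out of the loop, and the
     * multiplication i*4 is strength-reduced to a running addition. */
    void after(int *a, int n, int x, int y)
    {
        int t  = x * y;              /* loop-invariant code motion          */
        int i4 = 0;                  /* tracks the induction expression i*4 */
        for (int i = 0; i < n; i++) {
            a[i] = t + i4;
            i4 += 4;                 /* strength reduction: + instead of *  */
        }
    }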