# Parser Parses [[cfg|CFG]] of the language. > [!summary] Only one way to derive a thing! ## LL(1) Parsing ### First and Follow Sets To construct parser for [[ll1|LL(1) grammar]], first workout _First_ and _Follow_ sets. - Generate First set - $\text{First}(a) = \{a\}$ - For $X \rightarrow Y_1Y_2\dots Y_k$ - Add terminals in $\text{First}(Y_1)$ to $\text{First}(X)$. - If $Y_1\dots Y_{n-1} \Rightarrow \epsilon$, add terminals in $\text{First}(Y_n)$ to $\text{First}(X)$. - If $Y_1\dots Y_k \Rightarrow \epsilon$, add $\epsilon$ to $\text{First}(X)$. - Generate Follow set - $\text{Follow}(S) = \$ if $S$ is the **start** symbol. - For $A \rightarrow \alpha B \beta$, add $\text{First}(\beta)$ to $\text{Follow}(B)$, excluding $\epsilon$. - For $A \rightarrow \alpha B$, or if $\epsilon\in\text{First}(\beta)$, add non-terminals in $\text{Follow}(A)$ to $\text{Follow}(B)$. > [!warning] > > - In deriving First set, be careful with non-terminals that could potentially > produce $\epsilon$! > - In deriving Follow set, add Follow set of the LHS to the RHS, not vice > versa. CFG goes from **left to right** only. (Think of the things in the > right have no "control" on where it's placed.) > [!tip] In deriving First set, go bottom-up; in deriving Follow set, go > top-down. - CFG -> LL(1) -> Recursive Descent or Table Driven ### Recursive Descent Parsing - One simple function for each non-terminal. - Function body follows the RHS of the rule: - Non-terminals result in a call to another parse function - Terminals result in considering the next token. - Expect what's in the follow set. ```c int parse_P() { return parse_E() && expect_token(TOKEN_EOF); } int parse_E() { return parse_T() && parse_E_prime(); } ``` ### Table-Driven Parsing This approach is top-down. Using First set to determine which rule will be applied. - For each rule $A \rightarrow \alpha$ (denoted by the number): - $\forall \alpha \in \text{First}(\alpha)$, add this rule to $T[A, a]$. - If $\epsilon\in\text{First}(\alpha)$, $\forall b \in \text{Follow}(A)$, including $\$, add this rule to $T[A, b]$. | Stack | Input | Action | | ------------: | :------------ | :---------------: | | `P
| `int * int
| apply 1: `P => E` | | ... | ... | ... | | `int T' E'
| `int * int
| match `int` | | `T' E'
| `* int
| | - Parsing process - Push top symbol and EOF into stack. - Each time, look at the left-most token in the Input column. - Based on the table, decide which rule to apply. Pop and push symbols into the stack accordingly. - Some times token is matched without rule applied. ## LR Parsing ### Shift-Reduce Parsing | Stack | input | Action | | --------- | :----------------- | ---------------- | | | `id ( id + id )
| shift | | `id` | `( id + id )
| shift | | `id (` | `id + id )
| shift | | `id ( id` | `+ id )
| reduce `T -> id` | | `T ( id` | `+ id )
| reduce `E -> T` | - **Shift** = consume a token and push it onto the stack - **Reduce** = apply a rule to the top of the stack, pop tokens from and push non-terminals onto the stack. In this process, conflicts may occur! ### LR(0) Automaton - Represents all the possible rules that are currently under consideration by a shift-reduce parser. - Also known as _canonical collection_ or _compact [[fa|finite state machine]]_ of the grammar. - Each state: - **_Kernel_**, for state 0: `P -> . E`. - **_Closure_**, add all rules on the right side of the dot: `E -> . E + T`, `E -> . T`. - Takes non-terminals or terminals, and transition to other state. Move the dot accordingly. - Conflicts - Shift-reduce: `T -> id . ( E )` versus `T -> id .` - Reduce-reduce: `S -> id ( E ) .` versus `E -> id ( E ) .` - LR(0) automaton tells us which rules are _available_, but we don't know which rules to _apply_ exactly! ### SLR Parsing > [!summary] TL;DR SLR = LR(0) automaton + Follow Set - **S** stands for **Simple**. SLR parsing utilizes the **Follow set** to resolve conflicts. - In short, take reduction to `A` only when the next token is in `Follow(A)`. - Grammar can be parsed by SLR => _SLR Grammar_ - **SLR Parse Table** has _Goto_ and _Action_ sections. | State | `E` | `T` | `+` | `int` | `$ | | ----- | ---- | ---- | ------ | ----- | --- | | 0 | | | | | | | 1 | GOTO | GOTO | ACTION | ... | | | ... | | | | | | - To construct the table, if the **DOT** is followed by: - **Terminal**, simply **shift** to another state according to LR(0) automaton. - **Non-Terminal**, **goto** another state according to LR(0) automaton. - Nothing, reduce according to the rule. | Stack | Symbols | Input | Action | | :------ | :-------- | -----------------: | :--------------- | | 0 | | `id ( id + id )
| shift 4 | | 0 4 | `id` | `( id + id )
| shift 5 | | 0 4 5 | `id (` | `id + id )
| shift 4 | | 0 4 5 4 | `id ( id` | `+ id )
| reduce `T -> id` | | 0 4 5 8 | `id ( T` | `+ id )
| | - Parsing process - Shift tokens - Push states into stack 1. - Push symbols into stack 2. - If reduce - Pop the state at top. - Since a non-terminal is created, now use the non-terminal and the state at the top of the stack to perform **goto** - Push the new state to the stack. > [!note] Distinction between LL(1) and LR(1) Parsing Each LL(1) parsing state > considers only a single non-terminal, while each LR(1) parsing state considers > a large number of possible configurations. However, SLR is only a proper subset of LR(1)! ### [[lr1|LR(1)]] Parsing - LR(1) automaton augments LR(0) automaton by annotating the _lookahead_. - _Lookahead_ = set of tokens that can follow the non-terminal, in the _current parsing state_. -- ==I'm looking for this after this non-terminal.== - For $A \rightarrow \alpha\cdot B$, add new rules like $B \rightarrow \cdot \gamma$ with the same lookahead. - For $A \rightarrow \cdot B\beta$ - If $\beta$ **cannot** be $\epsilon$, the lookahead is $\text{First}(\beta)$. - If $\beta$ **can** be $\epsilon$, the lookahead is $\text{First}(\beta)$ plus the lookahead of $A$. - When constructing the automaton, we ==bring the lookahead from one state to the next.== ### LALR Parsing - **LA** = Lookahead. - The aforementioned automaton could be too big. In LALR parsing, we can merge states with the same **core**, by merging their lookahead.