# Parser
Parses [[cfg|CFG]] of the language.
> [!summary] Only one way to derive a thing!
## LL(1) Parsing
### First and Follow Sets
To construct parser for [[ll1|LL(1) grammar]], first workout _First_ and
_Follow_ sets.
- Generate First set
- $\text{First}(a) = \{a\}$
- For $X \rightarrow Y_1Y_2\dots Y_k$
- Add terminals in $\text{First}(Y_1)$ to $\text{First}(X)$.
- If $Y_1\dots Y_{n-1} \Rightarrow \epsilon$, add terminals in
$\text{First}(Y_n)$ to $\text{First}(X)$.
- If $Y_1\dots Y_k \Rightarrow \epsilon$, add $\epsilon$ to
$\text{First}(X)$.
- Generate Follow set
- $\text{Follow}(S) = \$ if $S$ is the **start** symbol.
- For $A \rightarrow \alpha B \beta$, add $\text{First}(\beta)$ to
$\text{Follow}(B)$, excluding $\epsilon$.
- For $A \rightarrow \alpha B$, or if $\epsilon\in\text{First}(\beta)$, add
non-terminals in $\text{Follow}(A)$ to $\text{Follow}(B)$.
> [!warning]
>
> - In deriving First set, be careful with non-terminals that could potentially
> produce $\epsilon$!
> - In deriving Follow set, add Follow set of the LHS to the RHS, not vice
> versa. CFG goes from **left to right** only. (Think of the things in the
> right have no "control" on where it's placed.)
> [!tip] In deriving First set, go bottom-up; in deriving Follow set, go
> top-down.
- CFG -> LL(1) -> Recursive Descent or Table Driven
### Recursive Descent Parsing
- One simple function for each non-terminal.
- Function body follows the RHS of the rule:
- Non-terminals result in a call to another parse function
- Terminals result in considering the next token.
- Expect what's in the follow set.
```c
int parse_P() { return parse_E() && expect_token(TOKEN_EOF); }
int parse_E() { return parse_T() && parse_E_prime(); }
```
### Table-Driven Parsing
This approach is top-down. Using First set to determine which rule will be
applied.
- For each rule $A \rightarrow \alpha$ (denoted by the number):
- $\forall \alpha \in \text{First}(\alpha)$, add this rule to $T[A, a]$.
- If $\epsilon\in\text{First}(\alpha)$, $\forall b \in \text{Follow}(A)$,
including $\$, add this rule to $T[A, b]$.
| Stack | Input | Action |
| ------------: | :------------ | :---------------: |
| `P
| `int * int
| apply 1: `P => E` |
| ... | ... | ... |
| `int T' E'
| `int * int
| match `int` |
| `T' E'
| `* int
| |
- Parsing process
- Push top symbol and EOF into stack.
- Each time, look at the left-most token in the Input column.
- Based on the table, decide which rule to apply. Pop and push symbols into
the stack accordingly.
- Some times token is matched without rule applied.
## LR Parsing
### Shift-Reduce Parsing
| Stack | input | Action |
| --------- | :----------------- | ---------------- |
| | `id ( id + id )
| shift |
| `id` | `( id + id )
| shift |
| `id (` | `id + id )
| shift |
| `id ( id` | `+ id )
| reduce `T -> id` |
| `T ( id` | `+ id )
| reduce `E -> T` |
- **Shift** = consume a token and push it onto the stack
- **Reduce** = apply a rule to the top of the stack, pop tokens from and push
non-terminals onto the stack.
In this process, conflicts may occur!
### LR(0) Automaton
- Represents all the possible rules that are currently under consideration by a
shift-reduce parser.
- Also known as _canonical collection_ or _compact [[fa|finite state machine]]_
of the grammar.
- Each state:
- **_Kernel_**, for state 0: `P -> . E`.
- **_Closure_**, add all rules on the right side of the dot: `E -> . E + T`,
`E -> . T`.
- Takes non-terminals or terminals, and transition to other state. Move the
dot accordingly.
- Conflicts
- Shift-reduce: `T -> id . ( E )` versus `T -> id .`
- Reduce-reduce: `S -> id ( E ) .` versus `E -> id ( E ) .`
- LR(0) automaton tells us which rules are _available_, but we don't know which
rules to _apply_ exactly!
### SLR Parsing
> [!summary] TL;DR SLR = LR(0) automaton + Follow Set
- **S** stands for **Simple**. SLR parsing utilizes the **Follow set** to
resolve conflicts.
- In short, take reduction to `A` only when the next token is in `Follow(A)`.
- Grammar can be parsed by SLR => _SLR Grammar_
- **SLR Parse Table** has _Goto_ and _Action_ sections.
| State | `E` | `T` | `+` | `int` | `$ |
| ----- | ---- | ---- | ------ | ----- | --- |
| 0 | | | | | |
| 1 | GOTO | GOTO | ACTION | ... | |
| ... | | | | | |
- To construct the table, if the **DOT** is followed by:
- **Terminal**, simply **shift** to another state according to LR(0)
automaton.
- **Non-Terminal**, **goto** another state according to LR(0) automaton.
- Nothing, reduce according to the rule.
| Stack | Symbols | Input | Action |
| :------ | :-------- | -----------------: | :--------------- |
| 0 | | `id ( id + id )
| shift 4 |
| 0 4 | `id` | `( id + id )
| shift 5 |
| 0 4 5 | `id (` | `id + id )
| shift 4 |
| 0 4 5 4 | `id ( id` | `+ id )
| reduce `T -> id` |
| 0 4 5 8 | `id ( T` | `+ id )
| |
- Parsing process
- Shift tokens
- Push states into stack 1.
- Push symbols into stack 2.
- If reduce
- Pop the state at top.
- Since a non-terminal is created, now use the non-terminal and the state at
the top of the stack to perform **goto**
- Push the new state to the stack.
> [!note] Distinction between LL(1) and LR(1) Parsing Each LL(1) parsing state
> considers only a single non-terminal, while each LR(1) parsing state considers
> a large number of possible configurations.
However, SLR is only a proper subset of LR(1)!
### [[lr1|LR(1)]] Parsing
- LR(1) automaton augments LR(0) automaton by annotating the _lookahead_.
- _Lookahead_ = set of tokens that can follow the non-terminal, in the _current
parsing state_. -- ==I'm looking for this after this non-terminal.==
- For $A \rightarrow \alpha\cdot B$, add new rules like
$B \rightarrow \cdot \gamma$ with the same lookahead.
- For $A \rightarrow \cdot B\beta$
- If $\beta$ **cannot** be $\epsilon$, the lookahead is
$\text{First}(\beta)$.
- If $\beta$ **can** be $\epsilon$, the lookahead is $\text{First}(\beta)$
plus the lookahead of $A$.
- When constructing the automaton, we ==bring the lookahead from one state to
the next.==
### LALR Parsing
- **LA** = Lookahead.
- The aforementioned automaton could be too big. In LALR parsing, we can merge
states with the same **core**, by merging their lookahead.