Part 2: Diving into Parser Generators
Traditionally, parser generators are the most versatile tool available for implementing custom language parsers (parser combinators are the other tool). These tools take a parser definition and produce a working parser in the language that the parser generator targets. The parser definition usually consists of a Backus–Naur Form (BNF)-like description of your grammar, with each production rule defined as a series of consecutive symbols and giving a semantic action expressed in the target language syntax. These actions spell out how to form the semantic extract of the given rule, in other words, what the rule means in terms of your internal representation.
Symbols can be either terminal or non-terminal: terminal symbols are obtained from the lexer (a module that breaks the input stream into a series of tokens or terminal symbols) and non-terminal symbols are those you defined in your parser definition. These symbols expand to a number of possible derivations (parse trees) based on your grammar. If you have more than one derivation for a given input string, your grammar is said to be ambiguous.
If the preceding paragraph didn't give it away, using parsers generated by tools relies on two separate definitions: one for the lexer and one for the parser. In my previous article on implementing DSLs in F#, you saw a lexer and parser for a small DSL intended for describing user interfaces. As I explained in that article, having a separate lexer and parser definition forces you to include the extra step of translating those to F#—and you are liable to make mistakes in that step that are not uncovered until you compile the generated code. Lacking on-the-fly type checking and IntelliSense support when developing FsYacc parsers is a significant burden, and it can be a daunting task to get things right with more complex grammars.
Another significant source of issues is the way you define your grammar. FsYacc produces so-called LALR parsers, a special subset of LR parsers—both classes reading input from left to right but producing derivations by expanding the rightmost symbol first. This means that in contrast to LL parsers (which read from left to right and derive non-terminals from the left) the input stream is parsed as far as possible before deriving a given rule, so if your grammar is ambiguous you will get a potentially large list of conflicts (shift-reduce and reduce-reduce conflicts that reflect the state of the parser automaton when the conflict arose).
You can find a numerous online resources on the classification of languages and parsers and the different techniques that you can apply to write parsers by hand. The main point to take away from this section is that while parser generators will give you a great deal of flexibility in terms of the languages you can define, they do so at the cost of not having on-the-fly type checking and IntelliSense support. In addition, their flexibility can lead to various grammar ambiguity issues.
Parsing with Active Patterns: Establishing the Basic Infrastructure
One way to deal with ambiguous grammars is to simply rule out ambiguities. Fundamentally, ambiguities arise because there are multiple productions that can derive the same input. For example, the string 1+2+3 can be parsed as meaning 1+(2+3) or (1+2)+3. What you can do in these situations is simply take the first rule that matches and go with that. That’s what parsing with active patterns does. After it locates a successful match case, no other match cases are considered; this is precisely the semantics that rule out ambiguities.
To get started, you need a way to identify and strip off terminal symbols (tokens) from the input stream. You can define tokens using regular expressions, so having a function that takes a regular expression (regex) pattern and returns a pair of the matched string and the remaining string would be ideal. Starting with that, you can define a new module:
module Language =
let private matchToken pattern s =
Regex.Match(s, pattern |> sprintf "\A(%s)((?s).*)",
|> fun mtch ->
if mtch.Success then
(mtch.Groups..Value, mtch.Groups..Value) |> Some
Note here that you are wrapping the regex pattern into two groups: one for the input pattern (%s) and another for the remaining input string ((?s).*).
|Author's Note: In the preceding code, the \A specifies that you are matching from the beginning of the input string, and the (?s) instructs the matcher to match irrespective of newline characters.|
With matchToken it is now easy to define whitespace and one-line comments (such as those starting with #):
let (|WS|_|) = matchToken "[ |\t|\n|\n\r]+"
let (|COMMENT|_|) = matchToken "#.*[\n|\r\n]"
Using these you can define a slightly more robust version of whitespace as:
let (|WHITESPACE|_|) s =
match s with
| WS rest ->
rest |> Some
| COMMENT rest ->
rest |> Some
| _ ->
This says that a WHITESPACE is either whitespace (a series of space, tab, LF, or CR+LF characters) or a comment. But remember that once a successful match is made, the other match cases are ignored, so WHITESPACE won't match things like " # one-line comment \n " (whitespace and comments following each other), simply because WHITESPACE will quit after matching a WS.
What you need is the * operator from ordinary BNF syntax: this matches zero or more occurrences of something.
let rec (|Star|_|) f acc s =
match f s with
| Some (res, rest) ->
(|Star|_|) f (res :: acc) rest
| None ->
(acc |> List.rev , s) |> Some
This is an example of a parameterized partial active pattern: you supply additional parameters (f and acc) to the pattern recognizer before the value that you are matching against. Note how the active pattern uses these arguments to drive how matching takes place. First you try to parse a new symbol (via f), and if you succeed, you recurse on the remaining input string with the matched input accumulated. If matching fails you simply return what you accumulated in preceding calls. (Note that this function never fails—if you need "one or more" matches you need to modify the None clause to return None if the first match was unsuccessful.) Normally, you would want to call (|Star|_|) with an empty list at first:
let (|WhiteSpace|_|) s = (|Star|_|) (|WHITESPACE|_|)  s
This just says that WhiteSpace matches zero or more instances of WHITESPACE, and will return a string list of these whitespace blocks and the remaining input string.
Armed with WhiteSpace you can implement your basic "tokenizer." For most grammars whitespace is ignored (a notable exception here is F# where indentation does matter), and in general it is much more tedious to treat whitespace as separate tokens, so you can simply strip them for this example.
let rec MatchTokenNoWS s pattern =
match (|WhiteSpace|_|) s with
| Some (_, rest) ->
rest |> matchToken pattern
| None ->
s |> matchToken pattern
The preceding function simply removes any leading whitespace before matching a regex pattern. You can now write a function that parses a token for a given regex, as:
let MatchToken s f pattern =
pattern |> MatchTokenNoWS s |> Option.bind f
Here, the parameter f carries the function that is executed on the match, which has type (string * string), where the first value in the tuple is the string matched, and the second is the remaining input string.
For tokens where you don't need the matched string (operators, keywords, etc.) you can write a function that returns only the rest of the input:
let MatchSymbol s pattern =
pattern |> MatchToken s (fun (_, rest) -> rest |> Some)