Parsing with Active Patterns in F#

Parsing with Active Patterns in F#

ou may have come across the need to read textual input into a structured representation before. Consider interpreting your own domain-specific language (DSL), reading back marshaled data from the disk, turning XML into a typed representation, or simply reading a comma-separated file: When it comes to parsing you have a number of options. For many common parsing tasks (such as working with the various XML APIs, regular expressions, or string splitting for comma-separated files), .NET provides extensive library support, but once you start growing your input language you will need a uniform parsing approach that can handle your requirements.

The first half of this article discusses active patterns in F# and shows single-case, multi-case, and partial active patterns in action. Active patterns are one of the most versatile tools in F# programming; they let you convert and partition arbitrary values into shapes against which you can perform pattern matching. With this ability you can turn grammar parsing into a typed discipline directly expressible as pattern-matching rules. This in turn allows you to quickly prototype DSLs, custom parsers for your data, and the like?and it does so in an incredibly concise and elegant way.

Editor’s Note: If you don’t have F#, you can download it here.

The second part of the article shows how to use your new background in active patterns to develop parser rules using partial active patterns, and then combine them to form a complete parser for a small language of arithmetic expressions with function calls. You will also develop an evaluator for this language and see how quickly you can get from ground zero to a small working language that you can interact with.

Part 1: Discriminated Unions and Pattern Matching

In ordinary pattern matching over discriminated unions you define a union type with various shapes, and then examine those in a pattern match. You can write arbitrarily complex match cases to match very specific shapes of your union values, and the compiler gives you a warning if your match cases are not complete or if certain matches are never reached. This sort of data abstraction is central in functional programming for at least two reasons: it provides a type-safe way to deal with multi-shaped values, and recursive discriminated unions are an ideal way to express many common data scenarios. They go perfectly with recursive problem decomposition: you solve problems by decomposing them to smaller ones that are in turn decomposed until you are working with primitive values.

Consider the following simple arithmetic expression evaluator:

type Expr =| Number of int| Sum    of Expr * Expr| Diff   of Expr * Expr| Prod   of Expr * Expr| Div    of Expr * Exprlet rec Eval = function| Number i      -> i| Sum (e1, e2)  -> Eval e1 + Eval e2| Diff (e1, e2) -> Eval e1 - Eval e2| Prod (e1, e2) -> Eval e1 * Eval e2| Div (e1, e2)  -> Eval e1 / Eval e2

Sending this bit of code to F# Interactive allows you to quickly experiment with Expression values. For example, you can create a new instance and evaluate it on the fly:

let i = Prod (Sum (Number 11, Number 12),    Diff (Number 43, Number 34))i |> Eval |> printf "The result is %d

The preceding code produces the output:

val i : ExprThe result is 207

You may find it awkward to define the arithmetic operators as separate shapes where in reality only their label (the operator name) is different. Instead you can express the above using a generic binary operation shape that carries the operation as a function and simply define additional members that map to this shape:

type Expr =| Number of int| BinOp  of (int -> int -> int) * Expr * Exprwith    static member Sum  (e1, e2) = BinOp (( + ), e1, e2)    static member Diff (e1, e2) = BinOp (( - ), e1, e2)    static member Prod (e1, e2) = BinOp (( * ), e1, e2)    static member Div  (e1, e2) = BinOp (( / ), e1, e2)let rec Eval = function| Number i          -> i| BinOp (f, e1, e2) -> f (Eval e1) (Eval e2)

You can verify this with the previous test value, this time without retaining a binding to it:

> Expr.Prod (Expr.Sum (Number 11, Number 12), Expr.Diff (Number 43, Number 34))|> fun i ->    i |> Eval |> printf "The result is %d
";;The result is 207

Single and Multi-Case Active Patterns

Active patterns let you view arbitrary values through a different lens, much the same way as you gave a different underlying representation to expressions in the two examples above. The key difference from ordinary pattern matching is that active patterns can be used on any value, not just discriminated unions. As you’ll see shortly, this makes them a handy and versatile tool in your F# programs.

Consider the following simple active pattern:

open Systemlet (|Empty|Null|String|) (s: string) =    if s = null then        Null    elif s.Trim() |> String.IsNullOrEmpty then        Empty    else        String (s.Trim())

First, note that active patterns are defined by surrounding each pattern case with pipe (|) characters. Because these are special characters, you always end up enclosing the entire pattern recognizer in parentheses (hence the so called “banana clips”).

What the preceding code says is: Take a string and decompose it into one of the three shapes depending on whether it is null, empty, or anything else. Using active patterns, instead of writing checks as you would normally do, now you can pattern match on strings?effectively hiding the details of the underlying string representation and creating a nice model around string values:

let TestString s =    match s with    | Null     -> printf "String is null
"    | Empty    -> printf "String is empty
"    | String _ -> printf "String is non-empty
"let _ =    null  |> TestString    ""    |> TestString    "foo" |> TestString

The output is:

String is nullString is emptyString is non-empty

Matching with active patterns works by calling the active pattern recognizer (in the above case (|Empty|Null|String|)?a normal function with a special annotation so it can be used inside match constructs and bind values), which in turn returns one of three shapes it defines. Note that these shapes may also carry values (as in the String case above); if they do you can bind those in the match case.

Besides partitioning values into a closed set of shapes (or put differently: erecting a conceptual model on those values), you can also use active patterns to convert values simply by having a single match case in the active pattern as in the following example:

let (|NiceString|) (s: string) =    if s = null then        ""    else        s.Trim()let PrintString s =    let (NiceString ns) = s    s |> printf "Nice string=[%s]

Here you used a simple let statement to match the argument and to bind the “nice” string representation of it to ns and printed that value (remember that a let binding also performs pattern matching).

Partial Active Patterns

The active patterns you have seen so far in this article were complete?in other words they represented or viewed the value that was matched using a closed set of shapes. These have the added benefit that the compiler is able to find incomplete matches or match cases that are never reached, just as in ordinary pattern matching. However, you often need a way to define pattern recognizers that may not conclusively commit to any shape. Consider the following example:

let (|PrimeEnding99|_|) (num: int) =    { 2 .. num |> float |> Math.Sqrt |> int }    |> Seq.tryFind (fun i -> num % i = 0)    |> function        | None ->            if (num + 1) % 100 = 0 then                num |> Some            else                None        | _ ->            None

Here the trailing |_| in the definition indicates that you are defining a partial active pattern. These work slightly differently from complete active patterns, because they always return a value that indicates whether a match occurred. In this example, the (|PrimeEnding99|_|) pattern recognizer takes an integer, and the match succeeds only when the integer is a prime number that ends with 99. In every other case, no successful match is made.

Naturally, when you match with partial active patterns you need to include a catch-all case in order to avoid a compiler warning:

let _ =    match 199 with    | PrimeEnding99 _ ->        printf "199 is a prime ending with 99
"    | _ ->        printf "199 is not a prime

Another useful bit of knowledge about active patterns is that they can be parameterized. This is particularly useful in situations where you need to control how matching should take place, or when the matching process needs external information to proceed. You will see an example of this in the next section.

Part 2: Diving into Parser Generators

Traditionally, parser generators are the most versatile tool available for implementing custom language parsers (parser combinators are the other tool). These tools take a parser definition and produce a working parser in the language that the parser generator targets. The parser definition usually consists of a Backus?Naur Form (BNF)-like description of your grammar, with each production rule defined as a series of consecutive symbols and giving a semantic action expressed in the target language syntax. These actions spell out how to form the semantic extract of the given rule, in other words, what the rule means in terms of your internal representation.

Symbols can be either terminal or non-terminal: terminal symbols are obtained from the lexer (a module that breaks the input stream into a series of tokens or terminal symbols) and non-terminal symbols are those you defined in your parser definition. These symbols expand to a number of possible derivations (parse trees) based on your grammar. If you have more than one derivation for a given input string, your grammar is said to be ambiguous.

If the preceding paragraph didn’t give it away, using parsers generated by tools relies on two separate definitions: one for the lexer and one for the parser. In my previous article on implementing DSLs in F#, you saw a lexer and parser for a small DSL intended for describing user interfaces. As I explained in that article, having a separate lexer and parser definition forces you to include the extra step of translating those to F#?and you are liable to make mistakes in that step that are not uncovered until you compile the generated code. Lacking on-the-fly type checking and IntelliSense support when developing FsYacc parsers is a significant burden, and it can be a daunting task to get things right with more complex grammars.

Another significant source of issues is the way you define your grammar. FsYacc produces so-called LALR parsers, a special subset of LR parsers?both classes reading input from left to right but producing derivations by expanding the rightmost symbol first. This means that in contrast to LL parsers (which read from left to right and derive non-terminals from the left) the input stream is parsed as far as possible before deriving a given rule, so if your grammar is ambiguous you will get a potentially large list of conflicts (shift-reduce and reduce-reduce conflicts that reflect the state of the parser automaton when the conflict arose).

You can find a numerous online resources on the classification of languages and parsers and the different techniques that you can apply to write parsers by hand. The main point to take away from this section is that while parser generators will give you a great deal of flexibility in terms of the languages you can define, they do so at the cost of not having on-the-fly type checking and IntelliSense support. In addition, their flexibility can lead to various grammar ambiguity issues.

Parsing with Active Patterns: Establishing the Basic Infrastructure

One way to deal with ambiguous grammars is to simply rule out ambiguities. Fundamentally, ambiguities arise because there are multiple productions that can derive the same input. For example, the string 1+2+3 can be parsed as meaning 1+(2+3) or (1+2)+3. What you can do in these situations is simply take the first rule that matches and go with that. That?s what parsing with active patterns does. After it locates a successful match case, no other match cases are considered; this is precisely the semantics that rule out ambiguities.

To get started, you need a way to identify and strip off terminal symbols (tokens) from the input stream. You can define tokens using regular expressions, so having a function that takes a regular expression (regex) pattern and returns a pair of the matched string and the remaining string would be ideal. Starting with that, you can define a new module:

module Language =    open System    open System.Text.RegularExpressions    let private matchToken pattern s =        Regex.Match(s, pattern |> sprintf "A(%s)((?s).*)",            RegexOptions.Multiline)        |> fun mtch ->            if mtch.Success then                (mtch.Groups.[1].Value, mtch.Groups.[2].Value) |> Some            else                None

Note here that you are wrapping the regex pattern into two groups: one for the input pattern (%s) and another for the remaining input string ((?s).*).

Author’s Note: In the preceding code, the A specifies that you are matching from the beginning of the input string, and the (?s) instructs the matcher to match irrespective of newline characters.

With matchToken it is now easy to define whitespace and one-line comments (such as those starting with #):

    let (|WS|_|) = matchToken "[ |	|

]+"    let (|COMMENT|_|) = matchToken "#.*[

Using these you can define a slightly more robust version of whitespace as:

    let (|WHITESPACE|_|) s =        match s with        | WS rest ->            rest |> Some        | COMMENT rest ->            rest |> Some        | _ ->            None

This says that a WHITESPACE is either whitespace (a series of space, tab, LF, or CR+LF characters) or a comment. But remember that once a successful match is made, the other match cases are ignored, so WHITESPACE won’t match things like " # one-line comment
(whitespace and comments following each other), simply because WHITESPACE will quit after matching a WS.

What you need is the * operator from ordinary BNF syntax: this matches zero or more occurrences of something.

    let rec (|Star|_|) f acc s =        match f s with        | Some (res, rest) ->            (|Star|_|) f (res :: acc) rest        | None ->            (acc |> List.rev , s) |> Some

This is an example of a parameterized partial active pattern: you supply additional parameters (f and acc) to the pattern recognizer before the value that you are matching against. Note how the active pattern uses these arguments to drive how matching takes place. First you try to parse a new symbol (via f), and if you succeed, you recurse on the remaining input string with the matched input accumulated. If matching fails you simply return what you accumulated in preceding calls. (Note that this function never fails?if you need “one or more” matches you need to modify the None clause to return None if the first match was unsuccessful.) Normally, you would want to call (|Star|_|) with an empty list at first:

    let (|WhiteSpace|_|) s = (|Star|_|) (|WHITESPACE|_|) [] s

This just says that WhiteSpace matches zero or more instances of WHITESPACE, and will return a string list of these whitespace blocks and the remaining input string.

Armed with WhiteSpace you can implement your basic “tokenizer.” For most grammars whitespace is ignored (a notable exception here is F# where indentation does matter), and in general it is much more tedious to treat whitespace as separate tokens, so you can simply strip them for this example.

    let rec MatchTokenNoWS s pattern =        match (|WhiteSpace|_|) s with        | Some (_, rest) ->            rest |> matchToken pattern        | None ->            s |> matchToken pattern

The preceding function simply removes any leading whitespace before matching a regex pattern. You can now write a function that parses a token for a given regex, as:

    let MatchToken s f pattern =        pattern |> MatchTokenNoWS s |> Option.bind f

Here, the parameter f carries the function that is executed on the match, which has type (string * string), where the first value in the tuple is the string matched, and the second is the remaining input string.

For tokens where you don’t need the matched string (operators, keywords, etc.) you can write a function that returns only the rest of the input:

    let MatchSymbol s pattern =        pattern |> MatchToken s (fun (_, rest) -> rest |> Some)

Parsing with Active Patterns: Building Up Rules

Now you have all the primitives to start building your parsers. Consider a simple language consisting of arithmetic expressions such as those you saw in the first part of this article. In that language (for which you only defined the abstract syntax earlier) you had numbers and the four basic arithmetic operators. This time you want to build a parser for this language, but also extend it to include floats (you just make all numbers floats) and function calls. A sample input in this language might be the string 1+cos(2+3)/sin(4*5).

This expression language calls for a slightly updated AST, so you need to change the earlier definition a bit (note that you are defining this as part of a new Ast module):

module Ast =    type var = string    type Expr =    | Number   of float    | BinOp    of (float -> float -> float) * Expr * Expr    | FunApply of var * Expr list    with        static member Sum (e1, e2) = BinOp (( + ), e1, e2)        static member Diff (e1, e2) = BinOp (( - ), e1, e2)        static member Prod (e1, e2) = BinOp (( * ), e1, e2)        static member Ratio (e1, e2) = BinOp (( / ), e1, e2)

The parser (continuing the Language module you defined earlier) needs active patterns to recognize numbers (floats), identifiers (function names that you are calling), and various tokens (parentheses and operators):

    let (|NUMBER|_|) s =        "[0-9]+.?[0-9]*" |> MatchToken s            (fun (n, rest) -> (n |> Double.Parse, rest) |> Some)    let (|ID|_|) s =        "[a-zA-Z]+" |> MatchToken s (fun res -> res |> Some)    let (|PLUS|_|)   s = "+" |> MatchSymbol s    let (|MINUS|_|)  s = "-"  |> MatchSymbol s    let (|MUL|_|)    s = "*" |> MatchSymbol s    let (|DIV|_|)    s = "/"  |> MatchSymbol s    let (|LPAREN|_|) s = "(" |> MatchSymbol s    let (|RPAREN|_|) s = ")" |> MatchSymbol s

Using these you can express the essential part of your grammar: expressions. To get around operator precedence you can simply introduce a hierarchy of active patterns that parse parts of an expression in decreasing precedence as is commonly done in recursive descent (LL) parsers.

    let rec (|Factor|_|) = function    | NUMBER (n, rest) ->        (Ast.Expr.Number n, rest) |> Some    | ID (f, LPAREN (Star (|Expression|_|) [] (args, RPAREN rest))) ->        (Ast.Expr.FunApply (f, args), rest) |> Some    | _ ->        None            and (|Term|_|) = function    | Factor (e1, MUL (Term (e2, rest))) ->        (Ast.Expr.Prod (e1, e2), rest) |> Some    | Factor (e1, DIV (Term (e2, rest))) ->        (Ast.Expr.Ratio (e1, e2), rest) |> Some    | Factor (e, rest) ->        (e, rest) |> Some    | _ ->        None    and (|Sum|_|) = function    | Term (e1, PLUS (Sum (e2, rest))) ->        (Ast.Expr.Sum (e1, e2), rest) |> Some    | Term (e1, MINUS (Sum (e2, rest))) ->        (Ast.Expr.Diff (e1, e2), rest) |> Some    | Term (e, rest) ->        (e, rest) |> Some    | _ ->        None    and (|Expression|_|) = (|Sum|_|)

Note that in this language the arguments passed to a function call are not separated, so you can pass arbitrary number of arguments to functions by separating them simply with whitespace. If you want to add your own separator symbol, you can build another rule that parses expressions followed by that separator symbol, and have the Star active pattern refer to that rule.

Because the rule that you first match successfully preempts all others, you need to be careful to list those rules that are recursive first?which inherently implies that you cannot start a match case with the same active pattern that you are defining; otherwise, you wind up in an infinite recursion. To get around that issue, you can make all binary operators associate to the right, so 1+2+3 will actually reduce to 1+(2+3).

Finally, a useful active pattern is Eof, which recognizes the end of the input string. However, rather than simply checking for an empty remaining input string, you should also check whether the only remaining input is whitespace (recall that whitespace can contain comments also).

    let (|Eof|_|) s =        if s |> String.IsNullOrEmpty then            () |> Some        else            match s with            | WhiteSpace (_, rest) when rest |> String.IsNullOrEmpty ->                () |> Some            | _ ->                None

After you load the Ast and the Language module into F# Interactive, you can quickly test your parser:

> let _ =    match "1+cos(2+3)/sin(4*5)" with    | Language.Expression (e, Language.Eof) ->        e |> printf "Match ? AST: %A
"    | _ ->        printf "No match
";;> Match ? AST: BinOp  (,Number 1.0,   BinOp     (,      FunApply ("cos",[BinOp (,Number 2.0,Number 3.0)]),      FunApply ("sin",[BinOp (,Number 4.0,Number 5.0)])))

F# lets you test your active patterns interactively, which is tremendously useful when developing your parser rules:

> Language.(|Expression|_|) "1+2*3/4";;val it : (Ast.Expr * string) option =  Some    (BinOp       (,Number 1.0,        BinOp          (,Number 2.0,           BinOp (,Number 3.0,Number 4.0))), "")

You can now patch up the evaluator you saw in the first part of this article to evaluate this new expression type by adding the evaluation of function calls (here you add only the sine and cosine functions, both of which work on a single argument):

module Evaluator =    open System        let rec Eval = function    | Ast.Number f          -> f    | Ast.BinOp (f, e1, e2) -> f (Eval e1) (Eval e2)    | Ast.FunApply (f, [e]) when f.ToLower() = "sin" ->        (Eval e) |> Math.Sin    | Ast.FunApply (f, [e]) when f.ToLower() = "cos" ->        (Eval e) |> Math.Cos    | e ->        e |> sprintf "Unknown function or arity: %A" |> failwith

Finally, you can test this evaluator in F# Interactive:

> let _ =    match "1+cos(2+3)/sin(4*5)" with    | Language.Expression (e, Language.Eof) ->        e |> Evaluator.Eval |> printf "Result=%f
"    | _ ->        printf "No match
";;> Result=1.310711

You’ve seen how to define active patterns that convert and partition values, via an example that used partial active patterns to implement a parser for arithmetic expressions with function calls. Effectively, you’ve seen how to translate the boring and error-prone task of implementing grammars with parser generators into a typed discipline that closely matches your parsing rules expressed in BNF syntax. Using these techniques you are better equipped to quickly prototype and implement your own parsers, define your own domain-specific languages, and work with your own concrete syntax for your data.

For further reading, try these links:


Share the Post: