09 May 2008

Parsing Dot with F#: Part 3 - The parser

The parser is produced by fsyacc (yacc stands for Yet Another Compiler Compiler). Basically it turns the tokens produced by the lexer into a nice set of types that you can easily manipulate from within your program. Fsyacc can generate a parser based on a definition that looks pretty much the same as the simplified grammar for Dot that I discussed in part 1.

Our dotparser.fsy file starts off with some preliminaries:


open Ast
open System.Collections.Generic


Same as the lexer, this code will eventually  be transformed into regular F# code. So here we can open some modules for use later, and potentially do other preliminaries.

Then, some parser-related startup:

%start start

%token <string> ID

%type < Ast.Graph > start

The first and last line indicate to fsyacc where to start parsing (at which rule, to be defined later), and the  type the complete parsing process will construct; if you remember our abstract syntax tree from part 1, we're trying to make a Graph type.

The lines in between are just an enumeration of all the possible terminal tokens (i.e. tokens that needn't be parsed further), along with the type of data that they carry. You should recognize these from the lexer. In our parser, only the ID token carries information of type string.

And finally, the part that makes this all tick:

start: Graph { $1 }

Graph: DIGRAPH ID LACC StmtList SEMI RACC    { ($4:List<_>).Reverse(); Graph $4 }

StmtList: Stmt            { let l = new List<Element>() in l.Add($1); l }
    | StmtList SEMI Stmt  { $1.Add($3); $1 }

    | ID AttrListOpt             { Node($1,$2) } 
    | ID DIEDGE ID AttrListOpt   { Edge($1,$3,$4) }
    | GRAPH AttrListOpt          { GraphAttributeList($2) }
    | NODE AttrListOpt           { NodeAttributeList($2) }
    | EDGE AttrListOpt           { EdgeAttributeList($2) }

    |                            { new Dictionary<string,string>() }
    | LPAREN AttrList RPAREN     { $2 }

Attr: ID ASSIGN ID               { new KeyValuePair<string,string>($1,$3) }

AttrList: Attr              { let attr = new Dictionary<string,string>() in attr.Add($1.Key,$1.Value); attr }
    | AttrList COMMA Attr   { $1.Add($3.Key, $3.Value); $1 }

Notice that this is very similar to the grammar defined earlier. I guess with some experience one can derive this fsyacc spec from the grammar in a fairly straightforward way. Expect some bumps on the road the first time around though. Notice that with each rule we can again associate the transformation that needs to occur, similar as we did in the lexer. Only this time we're not producing tokens, we're producing the actual F# types that we need. The $n variables in the right part refer to the data that is carried by the nth token on the left.

Some things to be aware of:

  • If you need to parse a list of things, like the StmtList above, you need to do this "backwards", i.e. the second option in StmtList is 'StmtList SEMI Stmt', not 'Stmt SEMI StmtList'. This also means I reverse the list once read in the Graph rule. This is simply a consequence of how the parser produced by fsyacc works under the hood - in this case a minor inconvenience.
  • I chose here to use regular .NET types as much as possible, since I intended to use this code from C#. If you use F# types like list, the code actually becomes a lot shorter. For example, 'let l = new List<Element>() in l.Add($1); l" just becomes '[$1]'.
  • I deliberately reversed the order of the rules Attr and AttrList, so that F# could infer the types of the arguments better. This is a result of the fact that F# parser top to bottom, left to right. If you get a lot of type inference errors, it pays off to experiment with the order of your statements, if you can reverse them. I try to avoid adding type annotations as much as possible, they make the code harder to read and to maintain. (in C# my use of the C# 'var' keyword has increased exponentially)

That's it; next time, I'll show you how to put everything together and call the parser.

No comments: