Forty Six and Two: fsyacc


10 May 2008

Parsing Dot with F#: Part 4 - Conclusion

In this very last part, I'll just show how to put everything together: the types, the lexer and the parser. First, we need to generate some F# files from the dotlexer.fsl and the dotparser.fsy files. This is easily accomplished by running fslex and fsyacc, respectively. This creates dotlexer.fs, dotparser.fs and dotparser.fsi. If, like me, you're working in Visual Studio, you can add these files to your project, but take care that you put everything in the right order or it won't compile (I actually lost quite a bit of time with this): first the definition of the abstract syntax type, then the parser files, then the lexer file. Finally, you can call the parser:

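open System.Collections.Generic
open Ast

// Parse a dot document held in a string into an Ast.Graph.
let parseFromString(text) =
    let lexbuf = Lexing.from_string text
    try
        // start is the entry point generated from "%start start" in dotparser.fsy
        DotParser.start DotLexer.token lexbuf
    with e ->
        // Report where the lexer stopped, then fall back to an empty graph
        let pos = lexbuf.EndPos
        printf "error near line %d, character %d\n%s\n" pos.Line pos.Column (e.ToString())
        Graph (new List<_>())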

This is fairly straightforward: the Lexing module defines functions to create a lexing buffer from a string. Feed this buffer into the DotParser.start function generated by fsyacc. This function will return the expected type, as defined in the .fsy file.

That's really all there is to it - the rest is some exception catching and printing an error for debugging purposes. One thing I'd like to warn you about is that debugging this stuff can be quite tedious, because the F# files are first generated by the fsyacc and fslex tools and only then compiled. Although the compiler does a fairly good job of pointing at the offending location in the fsy and fsl files, it is sometimes hard to track down what exactly the problem is.

Finally, you can download the example project.


09 May 2008

Parsing Dot with F#: Part 3 - The parser

The parser is produced by fsyacc (yacc stands for Yet Another Compiler Compiler). Basically it turns the tokens produced by the lexer into a nice set of types that you can easily manipulate from within your program. Fsyacc can generate a parser based on a definition that looks pretty much the same as the simplified grammar for Dot that I discussed in part 1.

Our dotparser.fsy file starts off with some preliminaries:

%{

open Ast
open System.Collections.Generic

%}

As with the lexer, this code is copied into the generated F# file. So this is the place to open modules for later use, and potentially to do other preliminaries.

Then, some parser-related startup:

%start start

%token <string> ID
%token NODE EDGE DIGRAPH GRAPH DIEDGE UNEDGE QUOTE LACC RACC LPAREN RPAREN SEMI COMMA ASSIGN EOF 

%type < Ast.Graph > start

The first and last line indicate to fsyacc where to start parsing (at which rule, to be defined later), and the type the complete parsing process will construct; if you remember our abstract syntax tree from part 1, we're trying to make a Graph type.

The lines in between are just an enumeration of all the possible terminal tokens (i.e. tokens that needn't be parsed further), along with the type of data that they carry. You should recognize these from the lexer. In our parser, only the ID token carries information of type string.
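To make the "data that they carry" part concrete: fsyacc turns the %token declarations into an F# union type, roughly like the hand-written approximation below. This is only a sketch in present-day F# script syntax. The helper payload and the sample token list are made up for illustration, and the real generated file contains a good deal more.

```fsharp
// Hand-written approximation of the token type generated from the
// %token declarations above; not the actual fsyacc output.
type token =
    | ID of string   // declared as %token <string> ID, so it carries a string
    | NODE | EDGE | DIGRAPH | GRAPH
    | DIEDGE | UNEDGE | QUOTE
    | LACC | RACC | LPAREN | RPAREN
    | SEMI | COMMA | ASSIGN | EOF

// Hypothetical helper: only ID has a payload to extract.
let payload (t: token) : string option =
    match t with
    | ID text -> Some text
    | _ -> None

// A token stream a lexer might plausibly produce for: digraph G { 0; }
// (written by hand here, not produced by the real dotlexer)
let sample = [ DIGRAPH; ID "G"; LACC; ID "0"; SEMI; RACC; EOF ]

// Collect the strings carried by the ID tokens, in order.
let ids = sample |> List.choose payload
printfn "%A" ids
```

All the other tokens are plain cases without data, which is why the parser rules only ever read a value out of ID.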

And finally, the part that makes this all tick:

start: Graph { $1 }

Graph: DIGRAPH ID LACC StmtList SEMI RACC    { ($4:List<_>).Reverse(); Graph $4 }

StmtList: Stmt            { let l = new List<Element>() in l.Add($1); l }
    | StmtList SEMI Stmt  { $1.Add($3); $1 }

Stmt: 
    | ID AttrListOpt             { Node($1,$2) } 
    | ID DIEDGE ID AttrListOpt   { Edge($1,$3,$4) }
    | GRAPH AttrListOpt          { GraphAttributeList($2) }
    | NODE AttrListOpt           { NodeAttributeList($2) }
    | EDGE AttrListOpt           { EdgeAttributeList($2) }

AttrListOpt:
    |                            { new Dictionary<string,string>() }
    | LPAREN AttrList RPAREN     { $2 }

Attr: ID ASSIGN ID               { new KeyValuePair<string,string>($1,$3) }

AttrList: Attr              { let attr = new Dictionary<string,string>() in attr.Add($1.Key,$1.Value); attr }
    | AttrList COMMA Attr   { $1.Add($3.Key, $3.Value); $1 }

Notice that this is very similar to the grammar defined earlier. I guess with some experience one can derive this fsyacc spec from the grammar in a fairly straightforward way. Expect some bumps on the road the first time around though. Notice that with each rule we can again associate the transformation that needs to occur, much as we did in the lexer. Only this time we're not producing tokens, we're producing the actual F# types that we need. The $n variables in the action (the code in braces on the right) refer to the data carried by the nth symbol of the rule to its left, so in 'ID DIEDGE ID AttrListOpt', $1 and $3 are the two IDs and $4 is the attribute dictionary.

Some things to be aware of:

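If you need to parse a list of things, like the StmtList above, write the rule "backwards" (left-recursively), i.e. the second option in StmtList is 'StmtList SEMI Stmt', not 'Stmt SEMI StmtList'. This is simply a consequence of how the parser produced by fsyacc works under the hood: a left-recursive rule can be reduced as each item is read, whereas a right-recursive one first piles up the whole list on the parser stack. I also reverse the list in the Graph rule once it has been read. Strictly speaking, that step is only needed when the list is built by prepending (as with an F# list and ::); List<T>.Add appends, so the elements already arrive in source order, and the Reverse() call is worth double-checking against your own output.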
I chose here to use regular .NET types as much as possible, since I intended to use this code from C#. If you use F# types like list, the code actually becomes a lot shorter. For example, 'let l = new List<Element>() in l.Add($1); l' just becomes '[$1]'.
I deliberately reversed the order of the rules Attr and AttrList, so that F# could infer the types of the arguments better. This is a result of the fact that F# infers types top to bottom, left to right. If you get a lot of type inference errors, it pays off to experiment with the order of your rules, where they can be reordered. I try to avoid adding type annotations as much as possible, since they make the code harder to read and to maintain. (For the same reason, my use of the 'var' keyword in C# has increased exponentially.)
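The effect of a left-recursive list rule on element order can be simulated without fsyacc. The sketch below is plain present-day F# script code, with strings standing in for statements. It assumes the reductions happen in source order, first 'Stmt', then 'StmtList SEMI Stmt' once per further statement, and compares building an F# list by prepending with building a .NET List<T> by Add.

```fsharp
open System.Collections.Generic

// Stand-ins for three statements in source order.
let statements = [ "a"; "b"; "c" ]

// F# list version: each reduction prepends the new statement (like $3 :: $1),
// so the result comes out reversed and needs one List.rev at the end.
let consed = statements |> List.fold (fun acc stmt -> stmt :: acc) []
let consedFixed = List.rev consed

// .NET List<T> version: Add appends at the end, so the elements are already
// in source order after the last reduction.
let appended = List<string>()
for stmt in statements do
    appended.Add stmt

printfn "%A %A %A" consed consedFixed (List.ofSeq appended)
```

So whether a final reversal is needed depends on how each action extends the list, not on the grammar rule alone.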

That's it; next time, I'll show you how to put everything together and call the parser.


02 April 2008

Parsing Dot with F#: Part 1

A while back I decided to write a parser for dot, the language used by Graphviz. This is both my first real project in F#, and in parsing. I learned the basics from the excellent Expert F# book.

I'll try to explain my solution, using fslex and fsyacc, in the order that I tackled the problem. There are a few more basic examples out there, explaining what parsing is, what a lexer and a parser are, etc. However, it seems the examples given are always small, typically parsing some expressions. Graphviz' dot definitely has more of a real world flavor, and I'll present it as a real world example of using fslex and fsyacc, without explaining much about those tools per se.

This is how I see this mini-series play out:

Simplified dot grammar and abstract syntax tree
The lexer
The parser
Putting it all together

This post is part 1.

Graphviz

Graphviz is a command line tool that takes a description of a graph as input, and outputs a description (or an image) of a layout of the graph. For example, as input you can give:

digraph G {
0 [label="Type1<>", shape=box];
1 [label="Type1<Type2>", shape=box];
2 [label="T", shape=box];
3 [label="Type2", shape=box];
4 [label="Type2[], Type2*, Type2&", shape=box];
5 [label="#Type2", shape=box];
6 [label="Type2.Type3", shape=box];
0 -> 1 [ label="MakeGenericType(Type2)"];
0 -> 2 [ label="GetGenericArguments()"];
1 -> 0 [ label="GetGenericTypeDefinition()"];
1 -> 3 [ label="GetGenericArguments()"];
2 -> 0 [ label="DeclaringType"];
3 -> 4 [ label="MakeArrayType(), MakePointerType(), MakeByRefType()"];
3 -> 6 [ label="GetNestedType(Type3.Name)"];
4 -> 3 [ label="GetElementType()"];
5 -> 3 [ label="BaseType"];
6 -> 3 [ label="DeclaringType"];
}

Basically, just a sequence of nodes and edges annotated with attributes. In the above example, only label and shape are used, but dot supports many, many more. Given such a file, graphviz outputs:

digraph G {
    node [label="\N"];
    graph [bb="0,0,1028,328"];
    0 [label="Type1<>", shape=box, pos="502,302", width="1.03", height="0.50"];
    1 [label="Type1<Type2>", shape=box, pos="380,210", width="1.61", height="0.50"];
    2 [label=T, shape=box, pos="625,210", width="0.75", height="0.50"];
    3 [label=Type2, shape=box, pos="568,118", width="0.78", height="0.50"];
    4 [label="Type2[], Type2*, Type2&", shape=box, pos="249,26", width="2.61", height="0.50"];
    5 [label="#Type2", shape=box, pos="703,210", width="0.92", height="0.50"];
    6 [label="Type2.Type3", shape=box, pos="784,26", width="1.42", height="0.50"];
    0 -> 1 [label="MakeGenericType(Type2)", pos="e,322,216 465,301 381,298 181,288 161,266 156,259 156,252 161,246 170,235 252,223 312,217", lp="263,256"];
    0 -> 2 [label="GetGenericArguments()", pos="e,623,228 539,298 561,293 589,284 607,266 615,258 619,247 621,237", lp="708,256"];
    1 -> 0 [label="GetGenericTypeDefinition()", pos="s,465,293 455,290 430,284 403,275 394,266 385,256 381,240 380,228", lp="504,256"];
    1 -> 3 [label="GetGenericArguments()", pos="e,540,121 394,192 405,179 420,163 437,154 467,137 504,128 531,123", lp="528,164"];
    2 -> 0 [label=DeclaringType, pos="s,539,300 548,300 628,295 785,284 802,266 807,259 807,252 802,246 781,222 691,237 661,228 658,227 655,226 652,225", lp="863,256"];
    3 -> 4 [label="MakeArrayType(), MakePointerType(), MakeByRefType()", pos="e,155,36 540,117 432,114 50,100 32,82 26,75 27,68 32,62 40,53 93,44 145,37", lp="251,72"];
    3 -> 6 [label="GetNestedType(Type3.Name)", pos="e,733,40 596,101 616,89 645,73 671,62 688,54 707,48 725,43", lp="781,72"];
    4 -> 3 [label="GetElementType()", pos="s,540,110 532,107 515,100 495,92 480,82 470,75 472,67 461,62 441,51 390,42 343,36", lp="557,72"];
    5 -> 3 [label=BaseType, pos="e,595,136 676,192 655,178 625,157 602,141", lp="692,164"];
    6 -> 3 [label=DeclaringType, pos="s,596,117 605,117 687,113 875,103 894,82 915,58 873,43 835,34", lp="956,72"];
}

The basic structure of the file is the same: a sequence of nodes and edges, but Graphviz has added position, height, width and other layout info. This is actually the file that was used to draw the graph in my previous post.

The dot grammar

When parsing using fslex and fsyacc, the first thing you should find or make is a grammar of the thing you're trying to parse. Everything else sort of flows from there. Luckily, the complete dot grammar can be found here. I thought it was a bit complicated for my purposes, so I simplified the grammar a bit:

(in the following, terminals are shown in bold. Literal characters are given in single quotes. Parentheses ( and ) indicate grouping when needed. Square brackets [ and ] enclose optional items. Vertical bars | separate alternatives.)

graph    :    digraph [ ID ] '{' stmt_list '}'
stmt_list:    [ stmt [ ';' ] [ stmt_list ] ]
stmt     :    node_stmt
         |    edge_stmt
         |    attr_stmt /*defines a default attribute*/
attr_stmt:    (graph | node | edge) attr_list
attr_list:    '[' attr  [ ',' ] [ attr_list ] ']'
attr     :    ID '=' ID
edge_stmt:    node_id -> node_id [ attr_list ]
node_stmt:    node_id [ attr_list ]
node_id  :    INT

If you compare with the original dot grammar, I made the following simplifications:

No node ports
No sub-graphs
No short definition of multiple edges (a -> b -> c)
No HTML IDs
Only digraphs

Basically this grammar says that a graph is "digraph name { bunch of node, edge or default statements }". We've already seen node and edge statements; default statements just set an attribute on all the nodes or edges that follow. Such a default is overridden by an attribute of the same name on a node or edge itself, or by a new default statement.
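The override rules for default statements can be sketched in a few lines of F#. This is not part of the parser: the Stmt type and the resolve function are hypothetical, cover node defaults only, and assume the behaviour exactly as described above (a default affects only the nodes that follow it, and a node's own attribute wins).

```fsharp
// Simplified statement type just for this sketch (hypothetical names).
type Stmt =
    | NodeDefault of (string * string) list
    | NodeStmt of string * (string * string) list

// Walk the statements in order, carrying the current node defaults along.
// Map.add replaces an existing key, so whatever is added later wins.
let resolve (stmts: Stmt list) =
    let addAll attrs m = List.fold (fun acc (k, v) -> Map.add k v acc) m attrs
    let step (defaults, nodes) stmt =
        match stmt with
        | NodeDefault attrs -> (addAll attrs defaults, nodes)
        | NodeStmt (id, attrs) -> (defaults, (id, addAll attrs defaults) :: nodes)
    stmts |> List.fold step (Map.empty, []) |> snd |> List.rev

let statements =
    [ NodeStmt ("0", [])
      NodeDefault [ "shape", "box" ]
      NodeStmt ("1", [])
      NodeStmt ("2", [ "shape", "circle" ])
      NodeDefault [ "shape", "ellipse" ]
      NodeStmt ("3", []) ]

let resolved = resolve statements

for (id, attrs) in resolved do
    printfn "%s %A" id (Map.toList attrs)
```

Node 0 gets no shape because it precedes the first default, node 1 inherits box, node 2 keeps its own circle, and node 3 picks up the newer ellipse default.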

The abstract syntax tree

Based on that grammar, I came up with the following abstract syntax tree using F# discriminated unions:

#light

open List
open System
open System.Collections.Generic

type Attributes = Dictionary<string,string>

type Element = 
    | Node of string * Attributes
    | Edge of string * string * Attributes
    | GraphAttributeList of Attributes
    | NodeAttributeList of Attributes
    | EdgeAttributeList of Attributes

type Graph = Graph of List<Element>

It's easiest to read this from bottom to top (unfortunately it needs to be defined the other way round, because F# requires a type to be defined before it is used). A graph is a list of elements. An element can either be a node, an edge or a default attribute list for the graph, the subsequent nodes or the subsequent edges. Each of these elements can have a number of attributes. Attributes are simply represented as a Dictionary. The abstract syntax tree was fairly straightforward to build from the grammar; I expect the same for any well-written grammar.
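To get a feel for working with these types, here is a small sketch in present-day F# script syntax that builds a Graph value by hand and walks it with pattern matching. The type definitions repeat the ones above so the snippet stands on its own. The helpers attrs and countNodesAndEdges are hypothetical, and the sample value only stands in for what the parser would return.

```fsharp
open System.Collections.Generic

type Attributes = Dictionary<string, string>

type Element =
    | Node of string * Attributes
    | Edge of string * string * Attributes
    | GraphAttributeList of Attributes
    | NodeAttributeList of Attributes
    | EdgeAttributeList of Attributes

type Graph = Graph of List<Element>

// Hypothetical helper: build an attribute dictionary from key/value pairs.
let attrs (pairs: (string * string) list) : Attributes =
    let d = Dictionary<string, string>()
    for (k, v) in pairs do
        d.Add(k, v)
    d

// Hand-built value standing in for the parse result of:
//   digraph G { 0 [shape=box]; 1; 0 -> 1 [label="x"]; }
let sample =
    let elements = List<Element>()
    elements.Add(Node("0", attrs [ "shape", "box" ]))
    elements.Add(Node("1", attrs []))
    elements.Add(Edge("0", "1", attrs [ "label", "x" ]))
    Graph elements

// Count nodes and edges by matching on each element.
let countNodesAndEdges (Graph elements) =
    let mutable nodes = 0
    let mutable edges = 0
    for e in elements do
        match e with
        | Node _ -> nodes <- nodes + 1
        | Edge _ -> edges <- edges + 1
        | GraphAttributeList _ | NodeAttributeList _ | EdgeAttributeList _ -> ()
    (nodes, edges)

printfn "%A" (countNodesAndEdges sample)
```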

The abstract syntax tree is the parser's interface to the outside, so it's important that you think about how you're going to use the parser when making decisions about the representation of the abstract syntax tree. For example, I took care not to use any F# specific types in the above AST definition, so that client assemblies in other languages would not need to reference F# specific assemblies. On that topic, don't forget to compile your F# assemblies with the --standalone flag, otherwise client assemblies will still need some F# specific libraries (e.g. discriminated unions implement IStructuralHash, so clients also need to know this interface).

Another issue you should think about is how 'deep' you want to parse. For example, it would be theoretically possible to define separate cases for each of the different types of attributes that can be generated by Graphviz. This would also allow us to parse some of the arguments (e.g. the list of position coordinates could be parsed into a list of tuples). However, given the large number of Graphviz dot attributes and how often they change, I chose not to take this route.
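If a client does want the coordinates, that deeper parsing can be layered on top of the string-valued attributes afterwards. The following is a minimal sketch, not part of the project: tryFloat, parsePoint and parsePos are hypothetical helpers, and it assumes the pos format seen in the Graphviz output above, i.e. space-separated "x,y" pairs, with edge positions optionally starting with an "e,x,y" or "s,x,y" endpoint that this sketch simply drops.

```fsharp
open System
open System.Globalization

// Parse a number independently of the machine's regional settings.
let tryFloat (s: string) : float option =
    match Double.TryParse(s, NumberStyles.Float, CultureInfo.InvariantCulture) with
    | true, v -> Some v
    | _ -> None

// Accept exactly two numeric parts; "e,322,216" has three parts and a
// non-numeric first one, so it yields None.
let parsePoint (text: string) : (float * float) option =
    match text.Split(',') |> Array.map tryFloat with
    | [| Some x; Some y |] -> Some (x, y)
    | _ -> None

// Split a pos attribute value into points, skipping anything unparseable.
let parsePos (pos: string) : (float * float) list =
    pos.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries)
    |> Array.toList
    |> List.choose parsePoint

printfn "%A" (parsePos "502,302")
printfn "%A" (parsePos "e,595,136 676,192 655,178 625,157 602,141")
```

Keeping this outside the AST means the parser itself does not have to change when Graphviz adds or alters attributes.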

Next episode: the lexer!

