02 April 2008

Parsing Dot with F#: Part 1

A while back I decided to write a parser for dot, the language used by Graphviz. This is both my first real project in F#, and in parsing. I learned the basics from the excellent Expert F# book.

I'll try to explain my solution, using fslex and fsyacc, in the order that I tackled the problem.  There are a few more basic examples out there, explaining what parsing is, what a lexer and a parser are etc., However it seems the examples given are always small, typically parsing some expressions. Graphviz' dot definitely has more of a real world flavor, and I'll present it as a real world example of using fslex and fsyacc, without explaining much about those tools per se.

This is how I see this mini-series play out:

  • Simplified dot grammar and abstract syntax tree
  • The lexer
  • The parser
  • Putting it all together

This post is part 1.

Graphviz

Graphviz is a command line tool that takes a description of a graph as input, and outputs a description (or an image) of a layout of the graph. For example, as input you can give:

digraph G {
0 [label="Type1<>", shape=box];
1 [label="Type1<Type2>", shape=box];
2 [label="T", shape=box];
3 [label="Type2", shape=box];
4 [label="Type2[], Type2*, Type2&", shape=box];
5 [label="#Type2", shape=box];
6 [label="Type2.Type3", shape=box];
0 -> 1 [ label="MakeGenericType(Type2)"];
0 -> 2 [ label="GetGenericArguments()"];
1 -> 0 [ label="GetGenericTypeDefinition()"];
1 -> 3 [ label="GetGenericArguments()"];
2 -> 0 [ label="DeclaringType"];
3 -> 4 [ label="MakeArrayType(), MakePointerType(), MakeByRefType()"];
3 -> 6 [ label="GetNestedType(Type3.Name)"];
4 -> 3 [ label="GetElementType()"];
5 -> 3 [ label="BaseType"];
6 -> 3 [ label="DeclaringType"];
}

Basically, just a sequence of nodes and edges annotated with attributes. In the above example, only label and shape are used, but dot supports many, many more. Given such a file, graphviz outputs:

digraph G {
    node [label="\N"];
    graph [bb="0,0,1028,328"];
    0 [label="Type1<>", shape=box, pos="502,302", width="1.03", height="0.50"];
    1 [label="Type1<Type2>", shape=box, pos="380,210", width="1.61", height="0.50"];
    2 [label=T, shape=box, pos="625,210", width="0.75", height="0.50"];
    3 [label=Type2, shape=box, pos="568,118", width="0.78", height="0.50"];
    4 [label="Type2[], Type2*, Type2&", shape=box, pos="249,26", width="2.61", height="0.50"];
    5 [label="#Type2", shape=box, pos="703,210", width="0.92", height="0.50"];
    6 [label="Type2.Type3", shape=box, pos="784,26", width="1.42", height="0.50"];
    0 -> 1 [label="MakeGenericType(Type2)", pos="e,322,216 465,301 381,298 181,288 161,266 156,259 156,252 161,246 170,235 252,223 312,217", lp="263,256"];
    0 -> 2 [label="GetGenericArguments()", pos="e,623,228 539,298 561,293 589,284 607,266 615,258 619,247 621,237", lp="708,256"];
    1 -> 0 [label="GetGenericTypeDefinition()", pos="s,465,293 455,290 430,284 403,275 394,266 385,256 381,240 380,228", lp="504,256"];
    1 -> 3 [label="GetGenericArguments()", pos="e,540,121 394,192 405,179 420,163 437,154 467,137 504,128 531,123", lp="528,164"];
    2 -> 0 [label=DeclaringType, pos="s,539,300 548,300 628,295 785,284 802,266 807,259 807,252 802,246 781,222 691,237 661,228 658,227 655,226 652,225", lp="863,256"];
    3 -> 4 [label="MakeArrayType(), MakePointerType(), MakeByRefType()", pos="e,155,36 540,117 432,114 50,100 32,82 26,75 27,68 32,62 40,53 93,44 145,37", lp="251,72"];
    3 -> 6 [label="GetNestedType(Type3.Name)", pos="e,733,40 596,101 616,89 645,73 671,62 688,54 707,48 725,43", lp="781,72"];
    4 -> 3 [label="GetElementType()", pos="s,540,110 532,107 515,100 495,92 480,82 470,75 472,67 461,62 441,51 390,42 343,36", lp="557,72"];
    5 -> 3 [label=BaseType, pos="e,595,136 676,192 655,178 625,157 602,141", lp="692,164"];
    6 -> 3 [label=DeclaringType, pos="s,596,117 605,117 687,113 875,103 894,82 915,58 873,43 835,34", lp="956,72"];
}

The basic structure of the file is the same: a sequence of nodes and edges, but Graphviz has added position, height, width and other layout info. This is actually the file that was used to draw the graph in my previous post.

The dot grammar

When parsing using fslex and fsyacc, the first thing you should find or make is a grammar of the thing you're trying to parse. Everything else sort of flows from there. Luckily, the complete dot grammar can be found here. I thought it was a bit complicated for my purposes, so I simplified the grammar a bit:

(in the following, terminals are shown in bold. Literal characters are given in single quotes. Parentheses ( and ) indicate grouping when needed. Square brackets [ and ] enclose optional items. Vertical bars | separate alternatives.)

graph    :    digraph [ ID ] '{' stmt_list '}'
stmt_list:    [ stmt [ ';' ] [ stmt_list ] ]
stmt     :    node_stmt
         |    edge_stmt
         |    attr_stmt /*defines a default attribute*/
attr_stmt:    (graph | node | edge) attr_list
attr_list:    '[' attr  [ ',' ] [ attr_list ] ']' 
atrtr    :    ID '=' ID  
edge_stmt:    node_id -> node_id [ attr_list ]
node_stmt:    node_id [ attr_list ]
node_id  :    INT

If you compare with the original dot grammar, I made the following simplifications:

  • No node ports
  • No sub-graphs
  • No short definition of multiple edges (a -> b -> c)
  • no HTML IDs
  • only digraphs

Basically this grammar says that a graph is "digraph name { bunch of node, edge or default statements }". We've already seen node and edge statements; default statements basically just set an attribute on all the nodes and edges that follow. It is overridden by an attribute of the same name on a node or edge itself, or by a new default statement.

The abstract syntax tree

Based on that grammar, I came up with the following abstract syntax tree using F# discriminated unions:

#light

open List
open System
open System.Collections.Generic

type Attributes = Dictionary<string,string>

type Element = 
    | Node of string * Attributes
    | Edge of string * string * Attributes
    | GraphAttributeList of Attributes
    | NodeAttributeList of Attributes
    | EdgeAttributeList of Attributes

type Graph = Graph of List<Element>

It's easiest to read this from bottom to top (unfortunately it needs to be defined the other way round, otherwise F# has difficulty parsing). A graph is a list of elements. An element can either be a node, an edge or a default attribute list for the graph, the subsequent nodes or the subsequent edges. Each of these elements can have a number of attributes. Attributes are simply presented as a Dictionary. The abstract syntax tree was fairly straightforward to build from the grammar; I expect the same for any well-written grammar.

The abstract syntax tree is the parser's interface to the outside, so it's important that you think about how you're going to use the parser when making decisions about the representation of the abstract syntax tree. For example, I took care not to use any F# specific types in the above AST definition, so that client assemblies in other language would not need to reference F# specific assemblies. On that topic, don't forget to compile your F# assemblies with the --standalone flag, otherwise client assemblies will still need some F# specific libraries (e.g. discriminated unions implement IStructuralHash, so clients also need to know this interface).

Another issue you should think about is how 'deep' you want to parse. For example, it would be theoretically possible to define separate cases for each of the different types of attributes that can be generated by Graphviz. This would also allow us to parse some of the arguments (e.g. the list of position coordinates could be parsed into a list of tuples). However, given the large amount and frequent changes in Graphviz dot attributes, I choose not to take this route.

Next episode: the lexer!

1 comment: