
10 May 2008

Parsing Dot with F#: Part 4 - Conclusion

In this very last part, I'll show how to put everything together: the types, the lexer and the parser. First, we need to generate some F# files from the dotlexer.fsl and dotparser.fsy files. This is easily accomplished by running fslex and fsyacc, respectively, which creates dotlexer.fs, dotparser.fs and dotparser.fsi. If, like me, you're working in Visual Studio, you can add these files to your project, but take care to put everything in the right order or it won't compile (I actually lost quite a bit of time on this): first the definition of the abstract syntax type, then the parser files, then the lexer file.
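On the command line, the invocations look something like this (a sketch; the exact flags vary between F# releases, and depending on your version fsyacc may need a flag such as --module DotParser so the generated parser ends up in the right module):

fslex dotlexer.fsl
fsyacc dotparser.fsy

With everything generated and in the right order, you can call the parser: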

let parseFromString(text) =
    // build a lexing buffer over the input string
    let lexbuf = Lexing.from_string text
    try
        // run the fsyacc-generated parser, pulling tokens from the fslex-generated lexer
        DotParser.start DotLexer.token lexbuf
    with e ->
        // report where the lexer got to, then return an empty graph
        let pos = lexbuf.EndPos
        printf "error near line %d, character %d\n%s\n" pos.Line pos.Column (e.ToString())
        Graph (new List<_>())

This is fairly straightforward: the Lexing module defines functions to create a lexing buffer from a string. We feed this buffer into the DotParser.start function generated by fsyacc, which returns the expected type, as defined in the .fsy file.
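To give an idea of how the result is consumed, here's a minimal sketch (the match cases are the Node and Edge cases of the abstract syntax type from part 1; the input string is just an illustration):

let (Graph elements) = parseFromString "digraph G { 0 [label=\"a\"] 0 -> 1 }"
for e in elements do
    match e with
    | Node (id, attrs) -> printf "node %s with %d attributes\n" id attrs.Count
    | Edge (f, t, _)   -> printf "edge %s -> %s\n" f t
    | _                -> printf "default attribute list\n"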

That's really all there is to it - the rest is catching exceptions and printing an error for debugging purposes. One thing I'd like to warn you about is that debugging this stuff can be quite tedious, because the F# files are first generated by the fsyacc and fslex tools and only then compiled. Although the compiler does a fairly good job of showing where the actual errors are in the .fsy and .fsl files, it is sometimes hard to track down what exactly the problem is.

Finally, you can download the example project.


09 April 2008

Parsing Dot with F#: Part 2 - The lexer

Now that the preliminaries are over, we can get on with the first half of the fslex/fsyacc tandem: the lexer. The lexer's responsibility is to break up the long input text into tokens that are more manageable for the parser, which in turn will produce the actual abstract syntax tree. So, instead of having to deal with the string "digraph MyGraph {  0 [label="a string_{}"] }", the lexer will turn this into a list of tokens, where each token can be attributed with some information. For the example, the output of our lexer is the following list of tokens: DIGRAPH ID["MyGraph"] LACC ID["0"] LPAREN ID["label"] ASSIGN ID["a string_{}"] RPAREN RACC. As you can see, the only token with an attribute is ID, which carries all kinds of string names. Notice also that an ID string can contain "control characters" like { and }. The lexer's main job is to filter those kinds of things out, so the parser's job is much easier.

You can tell fslex how to tokenize a string by providing it with a list of rules. Each rule is a series of regular-expression/action pairs, each producing an (attributed) token. These rules are specified in a special format, which the fslex tool pre-processes into an actual F# code file. I'm not going to explain that format in detail; there are other sources for that.

Our dotlexer.fsl file starts with some preliminaries:

{
open System
open DotParser
open Lexing
}

let digit = ['0'-'9']
let whitespace = [' ' '\t' ]
let newline = ('\n' | '\r' '\n')
let letter = [ 'a'-'z' 'A'-'Z' '_']

Notice here that, besides the System and Lexing namespaces, the DotParser namespace is opened. That's because the type that defines the tokens (actually a regular discriminated union type) is defined in the fsyacc file. We'll come back to that in the next post. Then some basic regular expressions, useful for almost any lexer, are defined. If you're at all familiar with regular expressions, this stuff should be easy. Notice that these look like ordinary F# let bindings; fslex treats them as named regular-expression abbreviations that can be reused in the rules below. Fslex translates the whole file to actual F# code, so in the actions you're free to use F# constructs as you see fit.
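These abbreviations can also be combined into new ones. For example, we could have named the identifier pattern that appears inline in the rule below (a hypothetical refactoring, not something the project does):

let ident = letter (letter | digit)*

and then simply matched on ident in the rule.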

Our actual lexer consists of two rules. This is the first:

rule token = parse
| whitespace     { token lexbuf }
| ('\r' | '\t')  { token lexbuf }
| "digraph"      { DIGRAPH }
| "node"         { NODE }
| "edge"         { EDGE }
| "graph"        { GRAPH }
| "->"           { DIEDGE }
| "--"           { UNEDGE }
| "\""           { ID ( (string lexbuf.StartPos "" lexbuf) ) }
| '{'            { LACC }
| '}'            { RACC }
| '['            { LPAREN }
| ']'            { RPAREN }
| ';'            { SEMI }
| ','            { COMMA }
| '='            { ASSIGN }
| letter(letter|digit)*                { ID (lexeme lexbuf) }
| ['-']?('.'digit+ | digit+('.'digit*)? ) { ID (lexeme lexbuf) }
| newline        { lexbuf.EndPos <- lexbuf.EndPos.NextLine; token lexbuf }
| eof            { EOF }

On the left, written like a pattern match, are the regular expressions that will be matched against the input string; on the right are the tokens that are returned. As you can see in the line for lexing IDs, you can access the current lexeme, i.e. the part of the input currently being matched, by calling lexeme lexbuf. Skipping the current lexeme is done by recursively calling token lexbuf. Also, it is useful to keep track of the line number for debugging later; that's what the action in the newline pattern does.

One gotcha here is that the order of these regular expressions matters not only for lexing (a line higher up takes precedence over a lower line), but also for type inference. The above rule needs no type annotations, but try placing the newline pattern at the top: after pre-processing, F# will complain during compilation that it needs some type annotations. It's not a big deal, but when the order doesn't matter for lexing anyway, it's nicer and more flexible to choose an order that avoids the annotations.

One line in the above sample may puzzle you:

| "\""            { ID ( (string lexbuf.StartPos "" lexbuf) ) }

That's because this calls a second rule, called string, which lexes a string, i.e. something between double quotes ("). We switch to this second rule when we encounter an opening double quote in the input. Such multiple rules are useful when the input contains different contexts in which different lexing rules apply. For dot, a string can contain characters such as brackets, arrows and semicolons that have a particular meaning in the main dot file. But between double quotes, the lexer is in another context, where these characters lose their meaning and should just be added to a string that is part of the name of some entity. So once we encounter a " in the main rule, we start lexing with the second rule; once the second rule finds the closing ", we return the string lexed so far and continue in the main rule:

and string pos s = parse
| "\\"    { string pos s lexbuf }
| "\r"    { string pos s lexbuf }
| "\""    { s }
| "\n"    { lexbuf.EndPos <- lexbuf.EndPos.NextLine;
            string pos s lexbuf }
| eof     { failwithf "end of file in string started at or near %A" pos }
| _       { string pos (s + (Lexing.lexeme lexbuf)) lexbuf }

As you can see, the string rule takes the position pos of the start of the string (for debugging) and accumulates the string lexed so far in its parameter s. If the third pattern matches, the rule is finished: we've encountered the closing ". The last pattern matches anything that isn't matched by the previous lines, so we add it to the s accumulator.

The other patterns (the 1st, 2nd and 4th) match control characters that are simply skipped; in the case of newline, we also update the line number. I did this because graphviz has the annoying habit of inserting a newline after a certain number of characters on one line. Presumably this is for readability, but why anyone would want to read dot files is beyond me. Anyway, graphviz inserts a \ followed by \r in such cases, and this is what the first two patterns filter out. That saved quite some pain afterwards when interpreting IDs: "230,345" is a much easier string to interpret than "230,3\\r45", especially when you don't know if, when or where graphviz will insert a line break. (Yes, it actually did this in the middle of numbers.)

By the way, another typical example where multiple rules are useful is parsing comments; there, too, the normal lexing rules do not apply. In fact, I adapted the above string rule from a similar function designed to lex comments in the excellent Expert F# book.
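To illustrate, here is a minimal sketch of such a rule for C-style /* */ comments, in the same style as the string rule (not part of our dot lexer):

and comment pos = parse
| "*/"    { token lexbuf }
| newline { lexbuf.EndPos <- lexbuf.EndPos.NextLine; comment pos lexbuf }
| eof     { failwithf "end of file in comment started at or near %A" pos }
| _       { comment pos lexbuf }

The main rule would then enter it with | "/*" { comment lexbuf.StartPos lexbuf }.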

That's it for the lexer. Paste all the stuff from this post together and you should have a working .fsl file. Not compilable yet, I'm afraid: you need the token definitions, which are produced by the parser (remember the open DotParser?).
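For the curious: in dotparser.fsy, those tokens are declared with %token directives, roughly like this (a sketch; the real declarations come up in the parser post):

%token <string> ID
%token DIGRAPH GRAPH NODE EDGE DIEDGE UNEDGE
%token LACC RACC LPAREN RPAREN SEMI COMMA ASSIGN EOF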

You've made it halfway! Next post: the parser.


02 April 2008

Parsing Dot with F#: Part 1

A while back I decided to write a parser for dot, the language used by Graphviz. This is my first real project both in F# and in parsing. I learned the basics from the excellent Expert F# book.

I'll try to explain my solution, using fslex and fsyacc, in the order that I tackled the problem. There are a few basic examples out there explaining what parsing is, what a lexer and a parser are, and so on; however, the examples given are always small, typically parsing some expressions. Graphviz's dot definitely has more of a real-world flavor, and I'll present it as a real-world example of using fslex and fsyacc, without explaining much about those tools per se.

This is how I see this mini-series play out:

  • Simplified dot grammar and abstract syntax tree
  • The lexer
  • The parser
  • Putting it all together

This post is part 1.

Graphviz

Graphviz is a command line tool that takes a description of a graph as input, and outputs a description (or an image) of a layout of the graph. For example, as input you can give:

digraph G {
0 [label="Type1<>", shape=box];
1 [label="Type1<Type2>", shape=box];
2 [label="T", shape=box];
3 [label="Type2", shape=box];
4 [label="Type2[], Type2*, Type2&", shape=box];
5 [label="#Type2", shape=box];
6 [label="Type2.Type3", shape=box];
0 -> 1 [ label="MakeGenericType(Type2)"];
0 -> 2 [ label="GetGenericArguments()"];
1 -> 0 [ label="GetGenericTypeDefinition()"];
1 -> 3 [ label="GetGenericArguments()"];
2 -> 0 [ label="DeclaringType"];
3 -> 4 [ label="MakeArrayType(), MakePointerType(), MakeByRefType()"];
3 -> 6 [ label="GetNestedType(Type3.Name)"];
4 -> 3 [ label="GetElementType()"];
5 -> 3 [ label="BaseType"];
6 -> 3 [ label="DeclaringType"];
}

Basically, it's just a sequence of nodes and edges annotated with attributes. In the above example, only label and shape are used, but dot supports many, many more. Given such a file, Graphviz outputs:

digraph G {
    node [label="\N"];
    graph [bb="0,0,1028,328"];
    0 [label="Type1<>", shape=box, pos="502,302", width="1.03", height="0.50"];
    1 [label="Type1<Type2>", shape=box, pos="380,210", width="1.61", height="0.50"];
    2 [label=T, shape=box, pos="625,210", width="0.75", height="0.50"];
    3 [label=Type2, shape=box, pos="568,118", width="0.78", height="0.50"];
    4 [label="Type2[], Type2*, Type2&", shape=box, pos="249,26", width="2.61", height="0.50"];
    5 [label="#Type2", shape=box, pos="703,210", width="0.92", height="0.50"];
    6 [label="Type2.Type3", shape=box, pos="784,26", width="1.42", height="0.50"];
    0 -> 1 [label="MakeGenericType(Type2)", pos="e,322,216 465,301 381,298 181,288 161,266 156,259 156,252 161,246 170,235 252,223 312,217", lp="263,256"];
    0 -> 2 [label="GetGenericArguments()", pos="e,623,228 539,298 561,293 589,284 607,266 615,258 619,247 621,237", lp="708,256"];
    1 -> 0 [label="GetGenericTypeDefinition()", pos="s,465,293 455,290 430,284 403,275 394,266 385,256 381,240 380,228", lp="504,256"];
    1 -> 3 [label="GetGenericArguments()", pos="e,540,121 394,192 405,179 420,163 437,154 467,137 504,128 531,123", lp="528,164"];
    2 -> 0 [label=DeclaringType, pos="s,539,300 548,300 628,295 785,284 802,266 807,259 807,252 802,246 781,222 691,237 661,228 658,227 655,226 652,225", lp="863,256"];
    3 -> 4 [label="MakeArrayType(), MakePointerType(), MakeByRefType()", pos="e,155,36 540,117 432,114 50,100 32,82 26,75 27,68 32,62 40,53 93,44 145,37", lp="251,72"];
    3 -> 6 [label="GetNestedType(Type3.Name)", pos="e,733,40 596,101 616,89 645,73 671,62 688,54 707,48 725,43", lp="781,72"];
    4 -> 3 [label="GetElementType()", pos="s,540,110 532,107 515,100 495,92 480,82 470,75 472,67 461,62 441,51 390,42 343,36", lp="557,72"];
    5 -> 3 [label=BaseType, pos="e,595,136 676,192 655,178 625,157 602,141", lp="692,164"];
    6 -> 3 [label=DeclaringType, pos="s,596,117 605,117 687,113 875,103 894,82 915,58 873,43 835,34", lp="956,72"];
}

The basic structure of the file is the same: a sequence of nodes and edges, but Graphviz has added position, height, width and other layout info. This is actually the file that was used to draw the graph in my previous post.

The dot grammar

When parsing using fslex and fsyacc, the first thing you should find or make is a grammar of the thing you're trying to parse. Everything else sort of flows from there. Luckily, the complete dot grammar can be found here. I thought it was a bit complicated for my purposes, so I simplified the grammar a bit:

(in the following, terminals are shown in bold. Literal characters are given in single quotes. Parentheses ( and ) indicate grouping when needed. Square brackets [ and ] enclose optional items. Vertical bars | separate alternatives.)

graph    :    digraph [ ID ] '{' stmt_list '}'
stmt_list:    [ stmt [ ';' ] [ stmt_list ] ]
stmt     :    node_stmt
         |    edge_stmt
         |    attr_stmt /*defines a default attribute*/
attr_stmt:    (graph | node | edge) attr_list
attr_list:    '[' attr  [ ',' ] [ attr_list ] ']' 
attr     :    ID '=' ID
edge_stmt:    node_id '->' node_id [ attr_list ]
node_stmt:    node_id [ attr_list ]
node_id  :    INT

If you compare with the original dot grammar, I made the following simplifications:

  • No node ports
  • No sub-graphs
  • No short definition of multiple edges (a -> b -> c)
  • No HTML IDs
  • Only digraphs

Basically this grammar says that a graph is "digraph name { bunch of node, edge or default statements }". We've already seen node and edge statements; a default statement just sets an attribute on all the nodes or edges that follow it. Such a default is overridden by an attribute of the same name on a node or edge itself, or by a new default statement.
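For example, in the following snippet (an illustration, not from the project) node 0 gets shape=box from the preceding default statement, while node 1 overrides it:

digraph G {
node [shape=box];
0 [label="a"];
1 [label="b", shape=ellipse];
}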

The abstract syntax tree

Based on that grammar, I came up with the following abstract syntax tree using F# discriminated unions:

#light

open List
open System
open System.Collections.Generic

type Attributes = Dictionary<string,string>

type Element = 
    | Node of string * Attributes
    | Edge of string * string * Attributes
    | GraphAttributeList of Attributes
    | NodeAttributeList of Attributes
    | EdgeAttributeList of Attributes

type Graph = Graph of List<Element>

It's easiest to read this from bottom to top (unfortunately it needs to be defined the other way around, since F# requires definitions to precede their use). A graph is a list of elements. An element is either a node, an edge, or a default attribute list for the graph, the subsequent nodes or the subsequent edges. Each of these elements can have a number of attributes, simply represented as a Dictionary. The abstract syntax tree was fairly straightforward to build from the grammar; I expect the same for any well-written grammar.
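To make this concrete, for the input digraph G { 0 [label="a"] 0 -> 1 } the parser will produce roughly the following value (a sketch; note that the graph name G is not kept in this simplified AST):

let attrs = new Attributes()
attrs.["label"] <- "a"
let g = Graph (new List<Element>([ Node ("0", attrs); Edge ("0", "1", new Attributes()) ]))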

The abstract syntax tree is the parser's interface to the outside world, so it's important to think about how you're going to use the parser when making decisions about the representation of the abstract syntax tree. For example, I took care not to use any F#-specific types in the above AST definition, so that client assemblies in other languages would not need to reference F#-specific assemblies. On that topic, don't forget to compile your F# assemblies with the --standalone flag, otherwise client assemblies will still need some F#-specific libraries (e.g. discriminated unions implement IStructuralHash, so clients would also need to know about this interface).

Another issue you should think about is how 'deep' you want to parse. For example, it would theoretically be possible to define separate cases for each of the different types of attributes that can be generated by Graphviz. This would also allow us to parse some of the attribute values (e.g. the list of position coordinates could be parsed into a list of tuples). However, given the large number of Graphviz dot attributes and how frequently they change, I chose not to take this route.
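That said, nothing stops you from post-processing the attribute strings outside the parser. A minimal sketch, assuming a node's pos attribute of the form "502,302" (parsePos is a hypothetical helper, not part of the project):

let parsePos (s: string) =
    // split "502,302" into its x and y coordinates
    match s.Split([|','|]) with
    | [| x; y |] -> (Double.Parse x, Double.Parse y)
    | _ -> failwithf "unexpected pos format: %s" s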

Next episode: the lexer!