09 April 2008

Parsing Dot with F#: Part 2 - The lexer

Now that the preliminaries are over, we can get on with the first half of the fslex/fsyacc tandem: the lexer. The lexer's responsibility is to break up the long input text into tokens that are more manageable for the parser, which in turn will produce the actual abstract syntax tree. So, instead of having to deal with the string "digraph MyGraph { 0 [label="a string_{}"] }", the parser gets a list of tokens, where each token can carry some attributed information. For this example, the output of our lexer is the following list of tokens: DIGRAPH ID["MyGraph"] LACC ID["0"] LPAREN ID["label"] ASSIGN ID["a string_{}"] RPAREN RACC. As you can see, the only token with an attribute is ID, which holds all kinds of string names. Notice also that an ID string can contain "control characters" like { and }. Filtering those kinds of things out is the lexer's main job, and it makes the parser's job much easier.
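To make that concrete, the token type ends up looking roughly like the discriminated union below. This is only a sketch of my own; the real type is generated by fsyacc from the %token declarations in the parser specification, which is the subject of the next post.

// Rough sketch only; the real token type is generated by fsyacc from the
// %token declarations in the parser specification (next post).
type token =
    | DIGRAPH | GRAPH | NODE | EDGE      // keywords
    | DIEDGE | UNEDGE                    // -> and --
    | LACC | RACC | LPAREN | RPAREN      // { } [ ]
    | SEMI | COMMA | ASSIGN              // ; , =
    | ID of string                       // the only attributed token
    | EOF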

You can tell fslex how to tokenize a string by providing it with a list of rules. Each rule is a series of pairs of a regular expression and an action, each producing an (attributed) token. These rules are specified in a special format, which the fslex tool pre-processes into an actual F# code file. I'm not going to explain the format in detail; there are other sources for that.

Our lexer's .fsl file starts with some preliminaries:

{
open System
open DotParser
open Lexing
}

let digit = ['0'-'9']
let whitespace = [' ' '\t' ]
let newline = ('\n' | '\r' '\n')
let letter = [ 'a'-'z' 'A'-'Z' '_']

Notice here that, besides the System and Lexing namespaces, the DotParser namespace is opened. That's because the type that defines the tokens (actually a regular discriminated union type) is defined in the fsyacc file; we'll come back to that in the next post. Then some basic regular expressions are defined that are useful for almost any lexer. If you're at all familiar with regular expressions, this stuff should be easy. Notice that these look just like regular F# let bindings. Fslex translates this file to actual F# code, so you're free to use F# constructs as you see fit.
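For instance, nothing stops you from building further abbreviations on top of these, or from putting ordinary helper functions inside the { } block at the top. The identifier abbreviation below is purely illustrative (a name I made up here; the rule further down matches the same pattern inline):

let identifier = letter (letter | digit)*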

Our actual lexer consists of two rules. This is the first:

rule token = parse
| whitespace     { token lexbuf }
| ('\r' | '\t')  { token lexbuf }
| "digraph"      { DIGRAPH }
| "node"         { NODE }
| "edge"         { EDGE }
| "graph"        { GRAPH }
| "->"           { DIEDGE }
| "--"           { UNEDGE }
| "\""           { ID ( (string lexbuf.StartPos "" lexbuf) ) }
| '{'            { LACC }
| '}'            { RACC }
| '['            { LPAREN }
| ']'            { RPAREN }
| ';'            { SEMI }
| ','            { COMMA }
| '='            { ASSIGN }
| letter(letter|digit)*                    { ID (lexeme lexbuf) }
| ['-']?('.'digit+ | digit+('.'digit*)? )  { ID (lexeme lexbuf) }
| newline        { lexbuf.EndPos <- lexbuf.EndPos.NextLine; token lexbuf }
| eof            { EOF }

On the left, written like a pattern match, are the regular expressions that are matched against the input string; on the right, the tokens that are returned. As you can see in the line that lexes IDs, you can access the current lexeme, i.e. the part of the input currently being lexed, by calling lexeme lexbuf. You can skip the current lexeme by recursively calling token lexbuf. It is also useful to keep track of the line number, for debugging later; that's what the action in the newline pattern does.
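If you want to inspect the token stream yourself, a little driver like the one below does the trick. This is a sketch under some assumptions: DotLexer is whatever your generated lexer module ends up being called, and depending on your F# version the lexbuf is created with Lexing.from_string or with LexBuffer<char>.FromString.

// Hand-rolled driver for the generated lexer (sketch; DotLexer is an assumed
// module name). Collects tokens until the lexer reports EOF.
let tokenize (input: string) =
    let lexbuf = Lexing.from_string input
    let rec loop acc =
        match DotLexer.token lexbuf with
        | DotParser.EOF -> List.rev (DotParser.EOF :: acc)
        | tok           -> loop (tok :: acc)
    loop []

// tokenize "digraph MyGraph { 0 [label=\"a string_{}\"] }"
// should yield DIGRAPH; ID "MyGraph"; LACC; ID "0"; LPAREN; ID "label";
// ASSIGN; ID "a string_{}"; RPAREN; RACC; EOF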

One gotcha here is that the order of these regular expressions matters, not only for lexing (a line higher up takes precedence over a lower one), but also for type inference. The rule above needs no type annotations, but try placing the newline pattern at the top: after pre-processing, F# will complain during compilation that it needs some type annotations. It's not a big deal, but where the order doesn't matter for lexing anyway, arranging the patterns so that no annotations are needed keeps the rule nicer and more flexible to change later.

One line in the above sample may puzzle you:

| "\""            { ID ( (string lexbuf.StartPos "" lexbuf) ) }

That's because this calls a second rule, called string, which lexes a string, i.e. something between double quotes ("). We enter this second rule when we encounter an opening double quote in the input. Multiple rules like this are useful when the input has different contexts in which different lexing rules apply. In dot, a string can contain characters such as brackets, arrows and semicolons that have a particular meaning elsewhere in the file. But between double quotes the lexer is in another context, where these characters lose their meaning and should simply be added to a string that is part of the name of some entity. So once we encounter a " in the main rule, we start lexing with the second rule; once the second rule finds the closing ", we return the string accumulated so far and continue in the main rule:

and string pos s = parse
| "\\"    { string pos s lexbuf }
| "\r"    { string pos s lexbuf }
| "\""    { s }
| "\n"    { lexbuf.EndPos <- lexbuf.EndPos.NextLine;
            string pos s lexbuf }
| eof     { failwithf "end of file in string started at or near %A" pos }
| _       { string pos (s + (Lexing.lexeme lexbuf)) lexbuf }

As you can see, the string function takes the position of the start of the string (for debugging) and accumulates the string in its parameter s. If the third pattern matches, the rule is finished: we've encountered the closing ". The last pattern matches anything that isn't matched by the previous lines, so we add it to the s accumulator.

The other patterns, the 1st, 2nd and 4th, match control characters that are simply skipped; in the case of a newline we also update the line number. I did this because graphviz has the annoying habit of inserting a line break after a certain number of characters on one line. Presumably this is for readability, but why anyone would want to read dot files is beyond me. Anyway, graphviz inserts a \ followed by a \r in such cases, and that is what the first two patterns filter out. This saved quite some pain later when interpreting IDs: "230,345" is a much easier string to interpret than "230,3\\r45", especially when you don't know if, when or where graphviz will insert a line break. (Yes, it actually did this in the middle of numbers.)
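To make that concrete, here is a hypothetical check, reusing the tokenize sketch from earlier (module names are still assumptions):

// The \ and the \r are eaten by the first two patterns of the string rule,
// and the \n by the fourth, so only the digits end up in the accumulator.
tokenize "digraph G { n [pos=\"230,3\\\r\n45\"] }"
// the pos attribute should come back as ID "230,345", not as "230,3\r\n45"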

By the way, another typical example where multiple rules are useful is lexing comments; there, too, the normal lexing rules don't apply. In fact, I adapted the above string function from a similar function for handling comments in the excellent Expert F# book.
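For illustration, such a comment rule could look roughly like the sketch below, written in the same style as the string rule. Dot allows C-style /* ... */ comments; this is my own sketch, not part of the lexer above, and the main rule would enter it with something like | "/*" { comment lexbuf.StartPos lexbuf }.

and comment pos = parse
| "*/"    { token lexbuf }
| "\n"    { lexbuf.EndPos <- lexbuf.EndPos.NextLine; comment pos lexbuf }
| eof     { failwithf "end of file in comment started at or near %A" pos }
| _       { comment pos lexbuf }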

That's it for the lexer. Paste all the pieces from this post together and you should have a working .fsl file. It won't compile yet, I'm afraid: you still need the token definitions, which are produced by the parser (remember the open DotParser?).

You've made it halfway! Next post: the parser.

