2013-06-19

Parsing Erlang Terms in OCaml (part 2)

In this installment, we'll continue my example of parsing with OCaml by focusing on the parser.

As I was developing this program, I didn't first write the tokenizer and then the parser. It was an iterative process where some progress was made breaking the input into tokens, and then supporting it in the grammar. Other times, I defined more of the grammar and then tweaked the tokenizer to support it. Describing this in a few blog entries would get confusing if I kept going back and forth between the stages, so I'll simply focus on the parser for now and explain the tokenizer in the next article.

How to Invoke

The grammar is defined in a source file with a .yml suffix. Once compiled, an .ml file is generated. So, for instance, if my grammar specification is in a file called grammar.yml, I would compile it with:

ocamlyacc grammar.yml

This command generates grammar.ml, which holds the parser implementation. Amid all the various functions and tables used to recognize the language is an exported function which starts all the processing. This function takes two parameters: a function that breaks the input into tokens, and a source for the lexer. The function used for the first parameter will be created when we define the tokenizer. The second parameter is obtained from the Lexing module in the standard library.

Defining the Grammar

First we'll define tokens we expect to get from the input. The order in which the tokens are arranged determines the grammar of the language. Some tokens are fixed, like operators in an expression or keywords in a programming language. Some tokens represent values which vary, but have a consistent format. For instance, a string token is represented by text between two quotes and a number token is represented by 1 or more consecutive digits.

ocamlyacc has a section where the tokens are defined. If a token is fixed (i.e. punctuation), then the format is "%token NAME [NAME2 ...]". If the token represents a changing entity, then the syntax is extended to include the type of the value: "%token <type> NAME [NAME2 ...]"

For our Erlang term parser, we have several fixed tokens which represent the punctuation used; square brackets are used to mark where a list begins and ends, braces mark the beginning We can itemize all of them in one line:

%token OLIST OTUPLE CLIST CTUPLE COMMA PERIOD PIPE EOF

There are four primitive types that the tokenizer will recognize (the container types will be handled by the parser.) The four types read by the tokenizer are integers, floating point numbers, strings, and atoms. We associate these tokens with their expected types using the following three lines:

%token <int> INT
%token <float> FLOAT
%token <string> STRING ATOM

In the token section of the .yml file, we also get to specify the name of the function which begins the parsing and what its return value will be. In this case, we have a file containing a series of Erlang terms which we want to parse one term at a time. We'll have the function return an Erlang.t option so if a value is read, it will return Some v and at end of file it will return None.

%start next
%type <Erlang.t option> next

The next section of the source file defines the rules of the grammar. These rules are made up of the name of the rule, followed by the pattern of tokens that define the rule and then OCaml code to execute, if a match is found. The name of the rule will end up being the name of a function in the generated .ml file.

Our first rule will be our starting point, the function next. This function will return a parsed term or None if we reach the end of file. We know all top level terms are followed by a period, so we define the next rule as:

next: term PERIOD               { Some $1 }
| EOF                           { None }

The $1 in the OCaml code will be replaced with the value found in the first matched term. The second matched term is accessed with $2, but it's simply a period so it isn't very interesting. Typically we only take the values of tokens that have a type argument since the fixed tokens are used to build the context of the grammar.

Now we need to add a term rule so Erlang terms can be matched. The four primitive types are simple:

term:
| ATOM                          { Erlang.Atom $1 }
| STRING                        { Erlang.String $1 }
| INT                           { Erlang.Integer $1 }
| FLOAT                         { Erlang.Number $1 }

The four tokens we match against (ATOM, STRING, INT, and FLOAT) are tokens that are defined with a type argument. When the OCaml code refers to a token (the $1 in these examples) it wants the value that was associated with the token. Think of tokens with types as a variant type constructor that takes an argument. In fact, these four rules are transforming ocamlyacc "constructors" into our variant type's constructors.

We're not fully done with this rule, though. We also need to match Erlang tuples and lists:

| OLIST terms CLIST             { Erlang.List $2 }
| OLIST terms PIPE term CLIST   { Erlang.List (List.append $2 [$4]) }
| OTUPLE terms CTUPLE           { Erlang.Tuple $2 }

Even though I differentiate between Erlang tuples and lists, I stored the contents for both in an OCaml list. I didn't want to define an Erlang.Tuple2, Erlang.Tuple3, etc. for varying lengths of tuples. My OCaml knowledge isn't strong enough to know if there's a way to support different tuple sizes in a single variant type constructor, so I went for simplicity and saved the values in a list.

These rules match on the opening and closing tokens to determine whether it's a list or tuple. It relies on the terms rule to read in the values.

You may have also noticed that there are two rules to make a list. The Erlang designers, for some reason, allow building a list where the tail isn't another list! For instance: [ 3 | 9 ] is a list that is more like a 2-tuple. You can't pass this "list" to many of the standard library list functions without throwing an exception, so I'm not sure of the wisdom in allowing these lists to be built. Unfortunately, they can occur (there are some in the profiler output!) so we need to support them. The grammer I've defined will recognize degenerate lists and convert them into a normal OCaml list.

The last rule to define is the terms rule:

terms:                          { [] }
| term                          { [$1] }
| term COMMA terms              { ($1 :: $3) }

In the first line, there isn't a pattern, just OCaml code. This rule is an empty rule and it matches if the other rules don't. This rule will become a function named terms and it's use in previous rules show it needs to return a list of Erlang.t values. So if we don't match any pattern, we return the empty list.

What we're trying to do in the terms rule is find a comma separated list of terms. The second line finds a term that isn't followed by a comma, so it returns a one-item list. The third line finds a term followed by a comma and then recursively calls itself to parse the value after the list. The associated OCaml code builds up an OCaml list using the values of the two rules.

Done for now

The calculator examples you typically find in YACC tutorials are very useful because they show operator precedence along with doing slightly more processing in each rule (by computing a result.) This example shows a different use where we're trying to take text representing data and converting it into something usable by OCaml functions. My example, along with the traditional calculator example, will hopefully make ocamlyacc more accessible to programmers.

The next article will cover tokenizing the input.

Here is the full content of grammar.yml:

%{
%}

%token <int> INT
%token <float> FLOAT
%token <string> STRING ATOM
%token OLIST OTUPLE CLIST CTUPLE COMMA PERIOD PIPE EOF

%start next
%type <Erlang.t option> next

%%

next: term PERIOD               { Some $1 }
| EOF                           { None }

term:
| ATOM                          { Erlang.Atom $1 }
| STRING                        { Erlang.String $1 }
| INT                           { Erlang.Integer $1 }
| FLOAT                         { Erlang.Number $1 }
| OLIST terms CLIST             { Erlang.List $2 }
| OLIST terms PIPE term CLIST   { Erlang.List (List.append $2 [$4]) }
| OTUPLE terms CTUPLE           { Erlang.Tuple $2 }

terms:                          { [] }
| term                          { [$1] }
| term COMMA terms              { ($1 :: $3) }

2013-06-06

Parsing Erlang Terms in OCaml (Part 1)

I recently used the Erlang profiler to find the bottleneck in a project. Using the profiler was relatively easy but its output wasn't obvious in what it was trying to say (at least not at my experience level.) After trying to interpret the results, I realized I could to do some processing on the output to pull the information I needed. About the same time, I was investigating how to do parsing in OCaml using ocamllex and ocamlyacc. Although I was able to sort out the profiler information on my own, I still thought it would be interesting to try to parse the output using OCaml.

The Input

When saving performance statistics to a file, the profiler writes out the information using Erlang data terms. These could be primitive terms (i.e. integers, atoms) or they can be containers of terms. Each top-level term in the file ends with a period. For instance here's a file with three top-level terms:

[1,2,3].
this_is_an_atom.
[{ok, "this is a string"}, 5].

If you aren't familiar with Erlang, the first line is a list of three integers. The second line is a single atom with a long name. The third row is a little more complex: it's a list of two items. The first item is a tuple of two items; the atom ok and a string. The second item in the list is an integer.

These terms won't map directly into fundamental OCaml types for several reasons. First, there is no atom type in OCaml and, second, OCaml lists can only hold data of one type. Interestingly enough, we can model Erlang's loose typing using OCaml variant types. Another detail in our favor is that Erlang is a simple language in which you can't create new types, so representing all possibilities is easy.

The Output

We want the parser to read the terms from a file and represent them using OCaml values. To do this, we need to define how the terms will appear to an OCaml application. This, of course, is easily done by creating a module, Erlang, and defining the type t as:

type t =
  | Atom of string
  | String of string
  | Integer of int
  | Number of float
  | List of t list
  | Tuple of t list

The variant type, Erlang.t, can represent atoms, strings, and numbers. It can also represent containers like tuples and lists. OCaml type definitions are recursive by default, which means constructors of the type can take a value of the type of which they're a member. We see the List constructor takes, as an argument, an OCaml list of type Erlang.t, which can be any of these values, including other lists or tuples!

At this point, we should pause a moment to appreciate OCaml's elegance; in seven lines of OCaml, we can describe six Erlang data types! I have to admit, however, I made several simplifications and alterations:

  • Erlang integers are arbitrary precision, so the Integer constructor should use the Big_int module in the OCaml standard library.
  • We shouldn't have a string value since Erlang doesn't really have strings (it has lists of low-valued integers.) However, most of the uses I've seen have had list of integers used when the data was a list of integers and a string of text between quotes when the data was supposed to be human-readable. So our parser will take advantage of this.
  • Erlang also has process IDs and references, but they're typically represented by strings containing carefully formatted values, so we'll make them OCaml strings.

In practice, these limitations haven't been a problem.

So, knowing what our data types are going to be, the three lines in the above example file will get parsed and transformed into the following OCaml values:

Line 1: List [Integer 1; Integer 2; Integer 3]
line 2: Atom "this_is_an_atom"
line 3: List [Tuple [Atom "ok"; String "this is a string"];
              Integer 5]

Next...

In the next installment, I'll focus on the grammer used by ocamlyacc.