2013-06-06

Parsing Erlang Terms in OCaml (Part 1)

I recently used the Erlang profiler to find the bottleneck in a project. Using the profiler was relatively easy but its output wasn't obvious in what it was trying to say (at least not at my experience level.) After trying to interpret the results, I realized I could to do some processing on the output to pull the information I needed. About the same time, I was investigating how to do parsing in OCaml using ocamllex and ocamlyacc. Although I was able to sort out the profiler information on my own, I still thought it would be interesting to try to parse the output using OCaml.

The Input

When saving performance statistics to a file, the profiler writes out the information using Erlang data terms. These could be primitive terms (i.e. integers, atoms) or they can be containers of terms. Each top-level term in the file ends with a period. For instance here's a file with three top-level terms:

[1,2,3].
this_is_an_atom.
[{ok, "this is a string"}, 5].

If you aren't familiar with Erlang, the first line is a list of three integers. The second line is a single atom with a long name. The third row is a little more complex: it's a list of two items. The first item is a tuple of two items; the atom ok and a string. The second item in the list is an integer.

These terms won't map directly into fundamental OCaml types for several reasons. First, there is no atom type in OCaml and, second, OCaml lists can only hold data of one type. Interestingly enough, we can model Erlang's loose typing using OCaml variant types. Another detail in our favor is that Erlang is a simple language in which you can't create new types, so representing all possibilities is easy.

The Output

We want the parser to read the terms from a file and represent them using OCaml values. To do this, we need to define how the terms will appear to an OCaml application. This, of course, is easily done by creating a module, Erlang, and defining the type t as:

type t =
  | Atom of string
  | String of string
  | Integer of int
  | Number of float
  | List of t list
  | Tuple of t list

The variant type, Erlang.t, can represent atoms, strings, and numbers. It can also represent containers like tuples and lists. OCaml type definitions are recursive by default, which means constructors of the type can take a value of the type of which they're a member. We see the List constructor takes, as an argument, an OCaml list of type Erlang.t, which can be any of these values, including other lists or tuples!

At this point, we should pause a moment to appreciate OCaml's elegance; in seven lines of OCaml, we can describe six Erlang data types! I have to admit, however, I made several simplifications and alterations:

  • Erlang integers are arbitrary precision, so the Integer constructor should use the Big_int module in the OCaml standard library.
  • We shouldn't have a string value since Erlang doesn't really have strings (it has lists of low-valued integers.) However, most of the uses I've seen have had list of integers used when the data was a list of integers and a string of text between quotes when the data was supposed to be human-readable. So our parser will take advantage of this.
  • Erlang also has process IDs and references, but they're typically represented by strings containing carefully formatted values, so we'll make them OCaml strings.

In practice, these limitations haven't been a problem.

So, knowing what our data types are going to be, the three lines in the above example file will get parsed and transformed into the following OCaml values:

Line 1: List [Integer 1; Integer 2; Integer 3]
line 2: Atom "this_is_an_atom"
line 3: List [Tuple [Atom "ok"; String "this is a string"];
              Integer 5]

Next...

In the next installment, I'll focus on the grammer used by ocamlyacc.

No comments:

Post a Comment