Skip to content

Writing a parser: Sequencing, Binding, Diagnostics

Armando Blancas edited this page May 26, 2018 · 8 revisions

In this tutorial we'll write a custom parser for a function call in the following form:

funcall ::=  fname '(' (arg (, arg)*)* ')' ';'
arg ::=  number | boolean

The result of the parsing should be a map with the following structure:

{ :fname "function name" :args [arg1 arg2 ... argn] }

We want to allow for whitespace and comments, so we'll use the C-style lexer.

(use '[blancas.kern.core]
     '[blancas.kern.lexer.c-style])

By convention, Kern parsers that don't take any arguments are defined with def to save typing and avoid clutter, so that's how we'll define our parser. Fortunately, the lexer library offers parsers for all the parts of the syntax; our job is to put it together in a sequence.

  • identifier parses the function name.
  • parens parses input in parenthesis, like our argument list.
  • comma-sep parses a comma-separated list of items.
  • semi parses a semicolon.

Sequencing and Binding

You may have seen the usage of the sequencing parser <*>. As a reminder:

(run (<*> upper digit) "U2")  ;; [\U \2]

While this works, it's difficult to grab the results when there's more input to parse. What we really need is a couple of things. One is a way to sequence helper parsers and access their results directly. The other is a way to prepare our parser's result value. The second one is easy: use the return parser, which job is to return its argument (consuming no input). The first one is done with special type of binding provided by the bind form the Kern core namespace.

Instead of listing a sequence of parsers, we use a syntax similar to that of let. To illustrate the sample above, we could do:

(bind [x upper
       y digit]
  (return (str "first " x " then " y)))

This is similar to the above, but we specify that the result of the parser upper be bound to x and that the result of the parser digit be bound to y. Note that the above form is already a custom parser that we can run:

(run
  (bind [x upper
         y digit]
    (return (str "first " x " then " y)))
  "U2")
;; "first U then 2"

The above evaluation shows how bind sets up a sequence of three parsers: upper, digit and return. It also shows how the parsed values of upper and digit are bound to x and y, respectively. Thus bind is similar to let, but its last form must be a parser (i.e., the last one in the sequence), so that to run parser1 to parserN in sequence we write:

(bind [v1 parser1 
       v2 parser2
       .. ...]
  parserN)

If you're familiar with Haskell you may have noticed that this is corresponds to the do notation. Accordingly, the bind macro expands to the following:

(>>= parser1 
     (fn [v1]
       (>>= parser2
            (fn [v2]
            ... (return (f v1 v2 ...))))))

There is usually no reason for using the lower-level >>= ("then") parser except perhaps for more granularity during debugging.

Note that in the above descriptions parser1 and parser2 are actually expressions that evaluate to a parser, and not necessarily only references to a defed parser. Also, each expression can use previous bindings. So we may, for example, use conditional expressions in bindings.

(def condbind (bind [x letter
                     y (if (= x \U) digit letter)]
                (return [x y])))

(run condbind "U2")  ;; [\U \2]
(run condbind "OS")  ;; [\O \S]

Parser funcall

We now have enough to get started writing our parser. First the name:

(def funcall
  "Parses a function call."
  (bind [name identifier]
    (return {:fname name})))

(run funcall "foo")  ;; {:fname "foo"}

Now we add the ability to parse an argument in any of the three types by adding another parser bound to a variable to capture the value. The parser parens expects parenthesis in the input; the parser <|> offers a choice for what the input can be parse into.

(def funcall
  "Parses a function call."
  (bind [name identifier
         lst (parens (<|> float-lit bool-lit))]
    (return {:fname name :args lst})))

(run funcall "foo(747)")  ;; {:fname "foo", :args 747.0}
(run funcall "foo(false)")  ;; {:fname "foo", :args false}

Next, we add the ability to parser multiple arguments using comma-sep.

(def funcall
  "Parses a function call."
  (bind [name identifier
         lst (parens (comma-sep (<|> float-lit bool-lit)))]
    (return {:fname name :args lst})))

(run funcall "foo(747, 0, true)")
;; {:fname "foo", :args [747.0 0.0 true]}

We complete the syntax with the ending semicolon using the lexer parser semi. Though we expect a semicolon to follow the argument list, the actual character can be ignored.

(def funcall
  "Parses a function call."
  (bind [name identifier
         lst (parens (comma-sep (<|> float-lit bool-lit)))
         _   semi]
    (return {:fname name :args lst})))

(run funcall "foo(747, 0, true);")
;; {:fname "foo", :args [747.0 0.0 true]}

Error Diagnostics

We now have a working parser but error diagnostics can be improved. For example, let's see what happens if the argument list is missing.

(run funcall "foo;")
;; line 1 column 4
;; unexpected \;
;; expecting \(

Though correct, the diagnostic could be more comprehensive and not too focused on the next character. With the combinator <?> is possible to override messages of the type expecting ... with a better description of the missing part. The requirement is that the parser has not consumed any input, which is the case here. To make our code modular, we'll parse the argument list in helper definition.

(def arglist
  "Parses an argument list."
  (<?> (parens (comma-sep (<|> float-lit bool-lit)))
       "the argument list"))

(def funcall
  "Parses a function call."
  (bind [name identifier  lst arglist  _ semi]
    (return {:fname name :args lst})))

(run funcall "foo(747, 0, true);")
;; {:fname "foo", :args [747.0 0.0 true]}

(run funcall "foo;")
;; line 1 column 4
;; unexpected \;
;; expecting the argument list

Note that the run command is convenient for trying out your code at the repl, but calls to funcall in a program are done with parse or value, which return the parser state or the parsed value, respectively.