Skip to content

Control of the Parsing Process

Armando Blancas edited this page May 26, 2018 · 12 revisions

When we have multiple data objects to extract from a text, it is useful to break the parsing job into smaller pieces; both for modularity and for identifying any patterns of repetition. What keeps a parser going are its subordinate parsers working on repeating patterns in the text, which may vary from single characters to complex constructs. Kern offers various combinators that will apply one or more parsers as long as they succeed, such as many, sep-by, comma-sep, semi-sep and several others.

The following will load the Kern core and C-style lexer namespaces.

(use '[blancas.kern.core]
     '[blancas.kern.lexer.c-style])

Let's start with this simple text: "/Users/blancas/dev/tools". The result should be a list of path segments and we'll ignore things like root directory or trailing slashes. The repeating pattern is straight-forward: a slash and a name. Since we don't need the slash, we use the >> operator, which applies all its parsers but only returns the result of the last one:

(>> (sym \/) (field "/"))

(run (>> (sym \/) (field "/")) "/Users")  ;; "Users"

Now we an define out operator by repeating the pattern:

(def path
  "Parses a path and produces a list of segments."
  (many1 (>> (sym \/) (field "/"))))

(run path "/Users/blancas/dev/tools")  ;; ["Users" "blancas" "dev" "tools"]

Validation

The binding form lets a parser work with intermediate values that may be validated. These values may be found invalid even if they are syntactically correct. For example, a program may specify that identifiers may not have more than a given number of characters.

(defn name [n]
  "Parses an identifier of up to n characters."
  (bind [s identifier]
    (if (> (count s) n)
      (fail (format "id %s longer than %d chars" s n))
      (return s))))

(run (name 8) "A1234567")  ;; "A1234567"
(run (name 12) "A123456789ABC")
;; line 1 column 14
;; id A123456789ABC longer than 12 chars

Parser Selection

Writing parser is the process of creating a sequence of subordinate parsers. In the above example, the function name returns a parser bind that sequences sub-parsers identifier and either fail or return, depending on the value of s. We can just as easy branch on two valid values of s with different parsers. In the following example will parse the syntax char (bool | $hex):

(def choice
  "Has a choice of a boolean or a hex number"
  (<*> char-lit (<|> bool-lit hex-lit)))

(run choice "'z' true")  ;; [\z true]
(run choice "'z' 0xBA")  ;; [\z 186]

Now suppose the choice depends on the actual char value: bool for vowels and hex for the rest. Instead of using the choice parser <|> we bind the value of the char and branch accordingly.

(def choice
  "Has a choice of a boolean or a hex number"
  (bind [c char-lit]
    (if ({\a 1 \e 2 \i 3 \o 4 \u 5} c)
      bool-lit 
      hex-lit)))

(run choice "'o' true")  ;; true
(run choice "'x' 0xBA")  ;; 186

If we want to produce a parser that will also return the character read, then we have to return a parser that will collect both values.

(def choice
  "Has a choice of a boolean or a hex number"
  (bind [c char-lit]
    (if ({\a 1 \e 2 \i 3 \o 4 \u 5} c)
      (bind [b bool-lit] (return [c b])) 
      (bind [h hex-lit] (return [c h])))))

(run choice "'o' true")  ;; [\o true]
(run choice "'x' 0xBA")  ;; [\x 186]

Backtracking

A common use of backtracking is to force the choice parser <|> to evaluate the next parser if one fails having consumed input. In this example we read a float or a range. A float input works fine, but with a range input float-num makes <|> stop because it consumes digits:

(run (<|> float-num (sep-by (<*> dot dot) dec-num)) "3.1416")  ;; 3.1416
(run (<|> float-num (sep-by (<*> dot dot) dec-num)) "30..45")
;; line 1 column 4
;; unexpected \.
;; expecting digit

What we need is for <|> to evaluate the second parser from the start of the input. This is done by adding backtracking to float-num like so: (<:> float-num). Now when float-num fails because of the second dot, it shows no input consumed and thus the range parser can be tried:

(run (<|> (<:> float-num) (sep-by (<*> dot dot) dec-num)) "30..45")  ;; [30 45]

Lookahead

Another form of backtracking is an explicit check for a particular construct. Using the look-ahead parser we can attempt to parse the start of a range, and then check the result (the rest of the parser state remains the same). If it's nil it means the input is not a range. The following example is similar to the one above but the parser returns a map properly coded with the type of data.

(def data
  "Parses a float or a range and returns a type-annotated map."
  (bind [v (look-ahead (<*> dec-num dot dot))]
    (if (nil? v)
      (bind [f float-num]
        (return {:type :FLOAT :value f}))
      (bind [r (sep-by (<*> dot dot) dec-num)]
        (return {:type :RANGE :value r})))))

(run data "3.1416")
;; {:type :FLOAT, :value 3.1416}
(run data "128..512")
;; {:type :RANGE, :value [128 512]}