-
Notifications
You must be signed in to change notification settings - Fork 15
Control of the Parsing Process
When we have multiple data objects to extract from a text, it is useful to break the parsing job into smaller pieces; both for modularity and for identifying any patterns of repetition. What keeps a parser going are its subordinate parsers working on repeating patterns in the text, which may vary from single characters to complex constructs. Kern offers various combinators that will apply one or more parsers as long as they succeed, such as many
, sep-by
, comma-sep
, semi-sep
and several others.
The following will load the Kern core
and C-style lexer
namespaces.
(use '[blancas.kern.core]
'[blancas.kern.lexer.c-style])
Let's start with this simple text: "/Users/blancas/dev/tools"
. The result should be a list of path segments and we'll ignore things like root directory or trailing slashes. The repeating pattern is straight-forward: a slash and a name. Since we don't need the slash, we use the >>
operator, which applies all its parsers but only returns the result of the last one:
(>> (sym \/) (field "/"))
(run (>> (sym \/) (field "/")) "/Users") ;; "Users"
Now we an define out operator by repeating the pattern:
(def path
"Parses a path and produces a list of segments."
(many1 (>> (sym \/) (field "/"))))
(run path "/Users/blancas/dev/tools") ;; ["Users" "blancas" "dev" "tools"]
The binding form lets a parser work with intermediate values that may be validated. These values may be found invalid even if they are syntactically correct. For example, a program may specify that identifiers may not have more than a given number of characters.
(defn name [n]
"Parses an identifier of up to n characters."
(bind [s identifier]
(if (> (count s) n)
(fail (format "id %s longer than %d chars" s n))
(return s))))
(run (name 8) "A1234567") ;; "A1234567"
(run (name 12) "A123456789ABC")
;; line 1 column 14
;; id A123456789ABC longer than 12 chars
Writing parser is the process of creating a sequence of subordinate parsers. In the above example, the function name
returns a parser bind
that sequences sub-parsers identifier
and either fail
or return
, depending on the value of s. We can just as easy branch on two valid values of s
with different parsers. In the following example will parse the syntax char (bool | $hex)
:
(def choice
"Has a choice of a boolean or a hex number"
(<*> char-lit (<|> bool-lit hex-lit)))
(run choice "'z' true") ;; [\z true]
(run choice "'z' 0xBA") ;; [\z 186]
Now suppose the choice depends on the actual char value: bool for vowels and hex for the rest. Instead of using the choice parser <|>
we bind the value of the char and branch accordingly.
(def choice
"Has a choice of a boolean or a hex number"
(bind [c char-lit]
(if ({\a 1 \e 2 \i 3 \o 4 \u 5} c)
bool-lit
hex-lit)))
(run choice "'o' true") ;; true
(run choice "'x' 0xBA") ;; 186
If we want to produce a parser that will also return the character read, then we have to return a parser that will collect both values.
(def choice
"Has a choice of a boolean or a hex number"
(bind [c char-lit]
(if ({\a 1 \e 2 \i 3 \o 4 \u 5} c)
(bind [b bool-lit] (return [c b]))
(bind [h hex-lit] (return [c h])))))
(run choice "'o' true") ;; [\o true]
(run choice "'x' 0xBA") ;; [\x 186]
A common use of backtracking is to force the choice parser <|>
to evaluate the next parser if one fails having consumed input. In this example we read a float or a range. A float input works fine, but with a range input float-num
makes <|>
stop because it consumes digits:
(run (<|> float-num (sep-by (<*> dot dot) dec-num)) "3.1416") ;; 3.1416
(run (<|> float-num (sep-by (<*> dot dot) dec-num)) "30..45")
;; line 1 column 4
;; unexpected \.
;; expecting digit
What we need is for <|>
to evaluate the second parser from the start of the input. This is done by adding backtracking to float-num
like so: (<:> float-num)
. Now when float-num
fails because of the second dot, it shows no input consumed and thus the range parser can be tried:
(run (<|> (<:> float-num) (sep-by (<*> dot dot) dec-num)) "30..45") ;; [30 45]
Another form of backtracking is an explicit check for a particular construct. Using the look-ahead
parser we can attempt to parse the start of a range, and then check the result (the rest of the parser state remains the same). If it's nil it means the input is not a range. The following example is similar to the one above but the parser returns a map properly coded with the type of data.
(def data
"Parses a float or a range and returns a type-annotated map."
(bind [v (look-ahead (<*> dec-num dot dot))]
(if (nil? v)
(bind [f float-num]
(return {:type :FLOAT :value f}))
(bind [r (sep-by (<*> dot dot) dec-num)]
(return {:type :RANGE :value r})))))
(run data "3.1416")
;; {:type :FLOAT, :value 3.1416}
(run data "128..512")
;; {:type :RANGE, :value [128 512]}