
Working with Parser State

blancas edited this page Jan 24, 2013 · 10 revisions

Kern parsers are pure functions that model mutable state with a technique called the state monad. Simply put, it is a way to chain functions such that the boilerplate of passing state along can be hidden from client code. The internal parser state consists of the input source, a position, error messages, and an arbitrary value supplied by client code called the user state.
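The underlying idea can be sketched in plain Clojure. This is an illustrative toy, not Kern's actual implementation: a stateful "parser" is modeled as a function from a state map to a pair of result value and new state, and a bind helper threads the state from one step to the next.

```clojure
;; A "parser" is a function: state -> [value new-state].
(defn s-return
  "Wraps a plain value as a parser that leaves the state unchanged."
  [v]
  (fn [s] [v s]))

(defn s-bind
  "Runs parser p, feeds its value to f, and threads the state along."
  [p f]
  (fn [s]
    (let [[v s'] (p s)]
      ((f v) s'))))

;; Example parser: read one char and advance the position in the state.
(def any-char
  (fn [{:keys [input pos] :as s}]
    [(first input) (assoc s :input (rest input) :pos (inc pos))]))

((s-bind any-char (fn [c] (s-return (str c "!")))) {:input "abc" :pos 0})
;; => ["a!" {:input (\b \c), :pos 1}]
```

Client code only sees the bindings; the state plumbing stays inside `s-bind`, which is what Kern's `bind` does for its richer parser state.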

In this tutorial we'll work with the parser position and user state. The program should read a text and produce a list of tokens, one per word. Each token is a map with four values: a word, the line, the column, and the word's current frequency within the text. The final result will be the map of words and their frequencies.

The functions to be used below are in the blancas.kern.core and blancas.kern.lexer.basic namespaces.

(use 'blancas.kern.core
     'blancas.kern.lexer.basic)

We define a word as a group of alphanumeric characters that starts with a letter and is delimited by punctuation or any other symbol. A simple approach is to read as many words as possible with the construct (many1 identifier). Because it is a lexer parser, identifier consumes any whitespace after the word but stops at the first punctuation or special symbol, so we follow it by skipping any number of such characters.

(def punct
  "Reads punctuation or special symbols."
  (one-of "~!@#$%^&*()_-=+[]{}\\|;:<>,./?"))

(def tokens
  "Parses words, optionally followed by symbols."
  (many1 (<< (many1 identifier) (many punct))))

(run tokens "now is the time,\n the time is now; or is it?")
;; [["now" "is" "the" "time"] ["the" "time" "is" "now"] ["or" "is" "it"]]

The parser tokens repeatedly applies a compound parser that reads one or more words and then zero or more symbols. Since we're not interested in the punctuation and symbols, we use the combinator <<, which keeps only the result of its first parser. The result is a vector of vectors because, due to the punctuation, the words are parsed over multiple calls to <<.

We'll start producing tokens as maps by wrapping the parser identifier in a new one that builds a map. Note the use of get-position, which binds the parser's current position to the variable p.

(def make-token
  "Makes a map token from each word."
  (bind [w identifier p get-position]
    (return {:word w :line (:line p) :col (- (:col p) (count w) 1)})))

(def tokens
  "Parses words, optionally followed by symbols."
  (<$> flatten (many1 (<< (many1 make-token) (many punct)))))

(run tokens "now is the time,\n the time is now; or is it?")
;; ({:word "now", :line 1, :col 1}
;;  {:word "is", :line 1, :col 5}
;;  {:word "the", :line 1, :col 8}
;;  {:word "time", :line 1, :col 11}
;;  {:word "the", :line 2, :col 2}
;;  {:word "time", :line 2, :col 6}
;;  {:word "is", :line 2, :col 11}
;;  {:word "now", :line 2, :col 13}
;;  {:word "or", :line 2, :col 19}
;;  {:word "is", :line 2, :col 22}
;;  {:word "it", :line 2, :col 24})

Note that the parser tokens now uses <$> to apply the function flatten to the result, so we get the tokens as a single list. What remains is to count each word using modify-state and to record the word's running frequency in each token using get-state. The function that updates the counts must handle the case where the word is not yet in the map.

(defn tick
  "Increments k's value by one; adds [k 1] if k is not in m."
  [m k]
  (if-let [v (get m k)]
    (update-in m [k] inc)
    (assoc m k 1)))
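As a quick sanity check outside the parser (tick is redefined here so the snippet stands alone), reducing tick over a list of words produces exactly the kind of frequency map we expect as the final user state:

```clojure
(defn tick
  "Increments k's value by one; adds [k 1] if k is not in m."
  [m k]
  (if-let [v (get m k)]
    (update-in m [k] inc)
    (assoc m k 1)))

(reduce tick {} ["now" "is" "now" "is" "is"])
;; => {"now" 2, "is" 3}
```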

(def make-token
  "Makes a map token from each word."
  (bind [w identifier
         p get-position
         _ (modify-state tick w)
         u get-state]
    (return {:word w :line (:line p) :col (- (:col p) (count w) 1)
             :count (get u w)})))

(def tokens
  "Parses words, optionally followed by symbols."
  (<$> flatten (many1 (<< (many1 make-token) (many punct)))))

(run tokens "now is the time,\n the time is now; or is it?" "test" {})
;; ({:word "now", :line 1, :col 1, :count 1}
;;  {:word "is", :line 1, :col 5, :count 1}
;;  {:word "the", :line 1, :col 8, :count 1}
;;  {:word "time", :line 1, :col 11, :count 1}
;;  {:word "the", :line 2, :col 2, :count 2}
;;  {:word "time", :line 2, :col 6, :count 2}
;;  {:word "is", :line 2, :col 11, :count 2}
;;  {:word "now", :line 2, :col 13, :count 2}
;;  {:word "or", :line 2, :col 19, :count 1}
;;  {:word "is", :line 2, :col 22, :count 3}
;;  {:word "it", :line 2, :col 24, :count 1})
;; {"it" 1, "or" 1, "time" 2, "the" 2, "is" 3, "now" 2}

The call to run above takes two extra arguments: one to identify the input source and one to supply the initial user state. We get the tokens with their position and count at the time each was parsed, followed by the final user state holding the word frequencies.
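As an independent cross-check of that final user state (plain Clojure, no Kern required), the same frequencies fall out of re-seq with a word pattern matching our definition, fed to Clojure's built-in frequencies:

```clojure
(def text "now is the time,\n the time is now; or is it?")

(frequencies (re-seq #"[A-Za-z][A-Za-z0-9]*" text))
;; => a map equal to {"now" 2, "is" 3, "the" 2, "time" 2, "or" 1, "it" 1}
;;    (key order may vary)
```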