The Lexer Library

Armando Blancas edited this page May 26, 2018 · 5 revisions

A lexer performs lexical analysis, typically for a programming language or a data format such as JSON. Kern's lexer library offers predefined combinators for parsing literal values of the usual data types. It can also handle whitespace, comments (regular or nested), and line continuation. By convention, every parsing function in the lexer namespace finishes by skipping any trailing whitespace and comments, leaving the input ready for the next parser. Because parsers only clean up after themselves, client code must clear any leading whitespace and comments at the very start of the parsing process.

Because of differences between languages or even personal preferences, the lexer library can be configured on various parameters:

comment-start A string that marks the start of a block comment.
comment-end A string that marks the end of a block comment.
comment-line A string that marks the start of a line comment.
nested-comments Whether the lexer accepts nested comments; a boolean.
identifier-start A parser for the start of an identifier.
identifier-letter A parser for the subsequent characters of an identifier.
reserved-names A list of names that cannot be identifiers.
case-sensitive Whether tokens are case-sensitive; a boolean.
line-continuation A parser for the token that concatenates two lines.
trim-newline Whether newline characters are treated as whitespace; a boolean.
leading-sign Whether numbers accept an optional leading sign; a boolean.

The library is configured by setting any of the above fields in a definition record and calling a function that returns the configured parsers. To make this process easier, there are predefined namespaces with configurations that follow the conventions of C, Java, Haskell, and the shell, plus a basic one that supports no comments.
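As a rough sketch of that configuration step, the following assumes that blancas.kern.lexer exposes the Java definition record as java-style and a make-parsers function that returns a map of configured parsers keyed by name. The namespace example.custom-lexer, the name my-style, and the reserved words are hypothetical and used only for illustration.

```clojure
(ns example.custom-lexer
  (:require [blancas.kern.core :refer [value]]
            [blancas.kern.lexer :as lex]))

;; Hypothetical style: start from the Java definition record and
;; override a single field.
(def my-style
  (assoc lex/java-style
         :reserved-names ["if" "else" "while"]))

;; Assumption: make-parsers returns a map from parser names to parsers.
(def parsers (lex/make-parsers my-style))

;; Pull out only the parsers this example needs.
(def trim       (:trim parsers))
(def identifier (:identifier parsers))

;; value (from blancas.kern.core) returns the parsed value.
(value (>> trim identifier) "  counter")
```

Starting from an existing style and overriding individual fields keeps the remaining settings, such as the comment delimiters, consistent with that style.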

The following sample usage follows the Java language specification; the parsers are in blancas.kern.lexer.java-style. Note that most of these parsers are more capable versions of primitive parsers and combinators in the blancas.kern.core namespace. Though they look very similar, the lexer parsers skip over whitespace and comments, support line continuation, and can be case-sensitive or case-insensitive. We recommend that you use the lexer versions instead of the more primitive ones.

(use 'blancas.kern.core
     'blancas.kern.lexer.java-style)

trim removes any leading whitespace and comments.

(:input (parse trim "\n\n  XYZ"))  ;; (\X \Y \Z)

For any text that may contain leading whitespace or comments, use the idiom (>> trim p), which will leave the input ready for the parser p.

(run (>> trim (brackets (comma-sep dec-lit))) 
     "/* Here we go! */ [1,2,3,4,5,6,7,8,9,10]")
;; [1 2 3 4 5 6 7 8 9 10]

lexeme applies the supplied parser, then calls trim on the remaining input.

(:input (parse (lexeme digit) "9   0"))  ;; (\0)

The rest of the parsers will skip any whitespace and comments after the input they consume.
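Because each parser cleans up after itself, lexer parsers can be sequenced without any explicit whitespace handling between them. The sketch below uses identifier, colon, and dec-lit, which are described further down this page, together with the <*> combinator and the value function from blancas.kern.core. The name labeled-number is hypothetical.

```clojure
(use 'blancas.kern.core
     'blancas.kern.lexer.java-style)

;; <*> applies the parsers in sequence and collects their results in a vector.
;; No parser for whitespace or comments appears between the three steps.
(def labeled-number (<*> identifier colon dec-lit))

;; value returns the parsed value instead of printing it.
(value labeled-number "count /* items */ :   42")
```

If these assumptions hold, the result is a vector holding the identifier, the colon character, and the number; the comment and the extra spaces are consumed by the parsers that precede them.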

sym parses the specified character; the comparison is case-sensitive.

(run (sym \:) ":foo")  ;; \:

new-line parses a new-line character.

(run new-line "\n")  ;; \newline

one-of succeeds if the next character is in the supplied string.

(run (one-of "$@*") "@ref")  ;; \@

none-of succeeds if the next character is not in the supplied string.

(run (none-of "$@*") "#ref")  ;; \#

token parses a specific sequence of characters, not necessarily delimited. If multiple target sequences are given, they're tried in turn until one succeeds or the parser fails.

(run (token "foo") "football")  ;; "foo"
(run (token "foo" "bar" "baz") "bazaar")  ;; "baz"

word works like token, but the matched text must be followed by a non-alphanumeric character.

(run (word "foobar") "foobar*")  ;; "foobar"
(run (word "football" "foobar") "foobar*")  ;; "foobar"
(run (word "foo") "football")
;; line 1 column 5
;; unexpected t
;; expecting end of foo

identifier parses an unquoted string that must start with a letter, followed by alphanumeric characters and delimited by any other type of character. The comparisons are case-sensitive in the Java style, though the parser works according to the setting :case-sensitive. Note that this parser may be customized to reject words in :reserved-names, but the Java style defines no such words.

(run identifier "counter")  ;; "counter"
(run identifier "A100")  ;; "A100"

field parses unquoted text terminated by any character in the supplied string.

(run (field ",") "now is the time, --")  ;; "now is the time"

char-lit parses a character literal according to the style setting. The common syntax is a symbol in single quotes with common escape codes, plus additional escapes from each language.

Common   \b \t \n \f \r \' \" \/
C        \0ooo \0xnn \unnnnnnnn
Java     \0ooo \unnnn
Haskell  \nnnn \onnnn \xnnnn
Shell    \0ooo \0xnn \unnnnnnnn

(run char-lit "'^'")  ;; \^

string-lit parses a string literal according to the style setting. The common syntax is any number of characters (can be zero) in double quotes with common escape codes, plus additional escapes from each language.

Common   \b \t \n \f \r \' \" \/
C        \0ooo \0xnn \unnnnnnnn
Java     \0ooo \unnnn
Haskell  \nnnn \onnnn \xnnnn
Shell    \0ooo \0xnn \unnnnnnnn

(run string-lit "\"hello, world\"")  ;; "hello, world"

dec-lit parses a decimal number as a Long or BigInt depending on the magnitude or if it ends with N.

(run dec-lit "1984") ;; 1984

oct-lit parses an octal number as a Long or BigInt depending on the magnitude or if it ends with N.

(run oct-lit "0644") ;; 420

hex-lit parses a hex number as a Long or BigInt depending on the magnitude or if it ends with N.

(run hex-lit "0xCAFE") ;; 51966

float-lit parses a floating-point number as Double or BigDecimal depending on the magnitude or if it ends with M. It cannot start with a period. The first period found must be followed by at least one digit.

(run float-lit "3.1415927")  ;; 3.1415927
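The N and M suffixes described for the numeric literals can be checked by inspecting the class of the parsed value. This is a minimal sketch, assuming that value from blancas.kern.core returns the parsed number and that the suffixes behave as described above.

```clojure
(use 'blancas.kern.core
     'blancas.kern.lexer.java-style)

;; Integer literals: a trailing N requests a BigInt even for small values.
(class (value dec-lit "1984"))
(class (value dec-lit "1984N"))

;; Floating-point literals: a trailing M requests a BigDecimal.
(class (value float-lit "3.1415927"))
(class (value float-lit "3.1415927M"))
```

Per the descriptions above, the unsuffixed forms should produce a Long and a Double, and the suffixed forms a BigInt and a BigDecimal.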

bool-lit parses the values true or false; it's configurable with :case-sensitive.

(run bool-lit "true statement")  ;; true

nil-lit parses the values null or nil and returns nil; it's configurable with :case-sensitive.

(run nil-lit "null, 0]")  ;; nil

parens applies the supplied parser, skipping over opening and closing parentheses in the input.

(run (parens dec-lit) "( 747 )")  ;; 747

braces applies the supplied parser, skipping over opening and closing braces in the input.

(run (braces dec-lit) "{ 747 }")  ;; 747

angles applies the supplied parser, skipping over opening and closing angle brackets in the input.

(run (angles dec-lit) "< 747 >")  ;; 747

brackets applies the supplied parser, skipping over opening and closing brackets in the input.

(run (brackets dec-lit) "[ 747 ]")  ;; 747

semi accepts a semicolon.

(run semi "; /* end of stmt */")  ;; \;

comma accepts a comma.

(run comma ", /* next! */")  ;; \,

colon accepts a colon.

(run colon ": /* here we go */")  ;; \:

dot accepts a dot.

(run dot ". /* that's it */")  ;; \.

semi-sep applies the supplied parser zero or more times, skipping over separating semicolons.

(run (semi-sep dec-num) "xyz")  ;; []
(run (semi-sep dec-num) "1;2;3;4;5")  ;; [1 2 3 4 5]

semi-sep1 applies the supplied parser one or more times, skipping over separating semicolons.

(run (semi-sep1 dec-num) "1;2;3;4;5")  ;; [1 2 3 4 5]
(run (semi-sep1 dec-num) "xyz")
;; line 1 column 1
;; unexpected \x
;; expecting decimal literal

comma-sep applies the supplied parser zero or more times, skipping over separating commas.

(run (comma-sep dec-num) "xyz")  ;; []
(run (comma-sep dec-num) "1,2,3,4,5")  ;; [1 2 3 4 5]

comma-sep1 applies the supplied parser one or more times, skipping over separating commas.

(run (comma-sep1 dec-num) "1,2,3,4,5")  ;; [1 2 3 4 5]
(run (comma-sep1 dec-num) "xyz")
;; line 1 column 1
;; unexpected \x
;; expecting decimal literal
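The parsers on this page are meant to be combined. As a closing sketch, the following parses a flat, JSON-like object by composing braces, comma-sep, colon, and the literal parsers with bind, return, and <|> from blancas.kern.core. The names pair and flat-object are hypothetical, and nested values are left out to keep the example short.

```clojure
(use 'blancas.kern.core
     'blancas.kern.lexer.java-style)

;; A key-value pair such as "id": 7; the colon is parsed and discarded.
(def pair
  (bind [k string-lit
         _ colon
         v (<|> string-lit dec-lit bool-lit nil-lit)]
    (return [k v])))

;; A flat object: zero or more pairs in braces, separated by commas.
(def flat-object
  (bind [kvs (braces (comma-sep pair))]
    (return (into {} kvs))))

;; trim clears the leading comment; every other parser cleans up after itself.
(value (>> trim flat-object)
       "/* config */ { \"id\": 7, \"ok\": true, \"next\": null }")
```

Under these assumptions, the result is a Clojure map from the string keys to the parsed values, with null read as nil.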