The Lexer Library
A lexer performs lexical analysis, typically for a programming language or a subset of one, such as JSON. Kern's lexer library offers predefined combinators for parsing literal values of the usual data types. It also handles whitespace, comments (regular or nested), and line continuation. By convention, every parsing function in the lexer namespace finishes by skipping any trailing whitespace and comments, leaving the input ready for the next parser. Because only trailing content is skipped, client code must clear any leading whitespace and comments at the very start of the parsing process.
Because languages and personal preferences differ, the lexer library can be configured through the following settings:
| Setting | Description |
|---|---|
| comment-start | A string that marks the start of a block comment. |
| comment-end | A string that marks the end of a block comment. |
| comment-line | A string that marks the start of a line comment. |
| nested-comments | Whether the lexer accepts nested comments; a boolean. |
| identifier-start | A parser for the start of an identifier. |
| identifier-letter | A parser for the subsequent characters of an identifier. |
| reserved-names | A list of names that cannot be identifiers. |
| case-sensitive | Whether tokens are case-sensitive; a boolean. |
| line-continuation | A parser for the token that concatenates two lines. |
| trim-newline | Whether newline characters are treated as whitespace; a boolean. |
| leading-sign | Whether numbers accept an optional leading sign; a boolean. |
To configure the library, set any of the above fields in a definition record and call a function that returns the configured parsers. To make this process easier, there are predefined namespaces with configurations that comply with C, Java, Haskell, and the Shell, plus a basic one that supports no comments.
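As a minimal sketch of that configuration step, the snippet below derives a custom style from the basic definition. It assumes that the definition record is exposed as basic-def and the configuring function as make-parsers in blancas.kern.lexer, and that the result can be read by keyword (:trim, :dec-lit, and so on). The namespace example.pascal-style and the chosen comment markers are hypothetical.

```clojure
(ns example.pascal-style
  (:require [blancas.kern.core :as k]
            [blancas.kern.lexer :as lex]))

;; Hypothetical style: start from the basic definition (assumed to be
;; lex/basic-def) and override only the comment-related settings.
(def pascal-style
  (assoc lex/basic-def
         :comment-start   "(*"
         :comment-end     "*)"
         :comment-line    "--"
         :nested-comments true))

;; make-parsers (assumed name) returns the configured parsers.
(def parsers (lex/make-parsers pascal-style))

;; Pull out only the parsers this example needs.
(def trim    (:trim parsers))
(def dec-lit (:dec-lit parsers))

;; Skip a leading block comment, then read a number.
(k/value (k/>> trim dec-lit) "(* answer *) 42")
```

Settings that are not overridden keep the values of the basic definition, so a custom style only needs to name what differs.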
The following sample usage follows the Java language specification; the parsers are in blancas.kern.lexer.java-style. Note that most of these parsers are more capable versions of the primitive parsers and combinators in the blancas.kern.core namespace. Though they look very similar, the lexer parsers skip over whitespace and comments, support line continuation, and can be case-sensitive or case-insensitive. We recommend that you use the lexer versions instead of the more primitive ones.
(use 'blancas.kern.core
     'blancas.kern.lexer.java-style)
trim
removes any leading whitespace and comments.
(:input (parse trim "\n\n XYZ")) ;; (\X \Y \Z)
For any text that may contain leading whitespace or comments, use the idiom (>> trim p), which will leave the input ready for the parser p.
(run (>> trim (brackets (comma-sep dec-lit)))
     "/* Here we go! */ [1,2,3,4,5,6,7,8,9,10]")
;; [1 2 3 4 5 6 7 8 9 10]
lexeme
applies the supplied parser, then calls trim on the remaining input.
(:input (parse (lexeme digit) "9 0")) ;; (\0)
The rest of the parsers will skip any whitespace and comments after the input they consume.
sym
parses the specified character; the comparison is case-sensitive.
(run (sym \:) ":foo") ;; \:
new-line
parses a new-line character.
(run new-line "\n") ;; \newline
one-of
succeeds if the next character is in the supplied string.
(run (one-of "$@*") "@ref") ;; \@
none-of
succeeds if the next character is not in the supplied string.
(run (none-of "$@*") "#ref") ;; \#
token
parses a specific sequence of characters, not necessarily delimited. If multiple target sequences are given, they are tried in turn until one succeeds; if none does, the parser fails.
(run (token "foo") "football") ;; "foo"
(run (token "foo" "bar" "baz") "bazaar") ;; "baz"
word
works like token, but the match must be delimited by a non-alphanumeric character.
(run (word "foobar") "foobar*") ;; "foobar"
(run (word "football" "foobar") "foobar*") ;; "foobar"
(run (word "foo") "football")
;; line 1 column 5
;; unexpected t
;; expecting end of foo
identifier
parses an unquoted string that must start with a letter, followed by alphanumeric characters and delimited by any other type of character. The comparisons are case-sensitive in the Java style, though the parser works according to the setting :case-sensitive. Note that this parser may be customized to reject words in :reserved-names, but the Java style defines no such words.
(run identifier "counter") ;; "counter"
(run identifier "A100") ;; "A100"
field
parses unquoted text terminated by any character in the supplied string.
(run (field ",") "now is the time, --") ;; "now is the time"
char-lit
parses a character literal according to the style setting. The common syntax is a symbol in single quotes with common escape codes, plus additional escapes from each language.
| Style | Escape codes |
|---|---|
| Common | \b \t \n \f \r \' \" \/ |
| C | \0ooo \0xnn \unnnnnnnn |
| Java | \0ooo \unnnn |
| Haskell | \nnnn \onnnn \xnnnn |
| Shell | \0ooo \0xnn \unnnnnnnn |
(run char-lit "'^'") ;; \^
string-lit
parses a string literal according to the style setting. The common syntax is any number of characters (can be zero) in double quotes with common escape codes, plus additional escapes from each language.
| Style | Escape codes |
|---|---|
| Common | \b \t \n \f \r \' \" \/ |
| C | \0ooo \0xnn \unnnnnnnn |
| Java | \0ooo \unnnn |
| Haskell | \nnnn \onnnn \xnnnn |
| Shell | \0ooo \0xnn \unnnnnnnn |
(run string-lit "\"hello, world\"") ;; "hello, world"
dec-lit
parses a decimal number and returns a Long, or a BigInt if the magnitude requires it or the literal ends with N.
(run dec-lit "1984") ;; 1984
oct-lit
parses an octal number and returns a Long, or a BigInt if the magnitude requires it or the literal ends with N.
(run oct-lit "0644") ;; 420
hex-lit
parses a hexadecimal number and returns a Long, or a BigInt if the magnitude requires it or the literal ends with N.
(run hex-lit "0xCAFE") ;; 51966
float-lit
parses a floating-point number and returns a Double, or a BigDecimal if the magnitude requires it or the literal ends with M. The number cannot start with a period, and the first period found must be followed by at least one digit.
(run float-lit "3.1415927") ;; 3.1415927
bool-lit
parses the values true or false; it's configurable with :case-sensitive.
(run bool-lit "true statement") ;; true
nil-lit
parses the values null or nil and returns nil; it's configurable with :case-sensitive.
(run nil-lit "null, 0]") ;; nil
parens
applies the supplied parser, skipping over opening and closing parentheses in the input.
(run (parens dec-lit) "( 747 )") ;; 747
braces
applies the supplied parser, skipping over opening and closing braces in the input.
(run (braces dec-lit) "{ 747 }") ;; 747
angles
applies the supplied parser, skipping over opening and closing angle brackets in the input.
(run (angles dec-lit) "< 747 >") ;; 747
brackets
applies the supplied parser, skipping over opening and closing brackets in the input.
(run (brackets dec-lit) "[ 747 ]") ;; 747
semi
accepts a semicolon.
(run semi "; /* end of stmt */") ;; \;
comma
accepts a comma.
(run comma ", /* next! */") ;; \,
colon
accepts a colon.
(run colon ": /* here we go */") ;; \:
dot
accepts a dot.
(run dot ". /* that's it */") ;; \.
semi-sep
applies the supplied parser zero or more times, skipping over separating semicolons.
(run (semi-sep dec-num) "xyz") ;; []
(run (semi-sep dec-num) "1;2;3;4;5") ;; [1 2 3 4 5]
semi-sep1
applies the supplied parser one or more times, skipping over separating semicolons.
(run (semi-sep1 dec-num) "1;2;3;4;5") ;; [1 2 3 4 5]
(run (semi-sep1 dec-num) "xyz")
;; line 1 column 1
;; unexpected \x
;; expecting decimal literal
comma-sep
applies the supplied parser zero or more times, skipping over separating commas.
(run (comma-sep dec-num) "xyz") ;; []
(run (comma-sep dec-num) "1,2,3,4,5") ;; [1 2 3 4 5]
comma-sep1
applies the supplied parser one or more times, skipping over separating commas.
(run (comma-sep1 dec-num) "1,2,3,4,5") ;; [1 2 3 4 5]
(run (comma-sep1 dec-num) "xyz")
;; line 1 column 1
;; unexpected \x
;; expecting decimal literal
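The lexer parsers compose with the combinators in blancas.kern.core. As a small sketch, the following parses a brace-delimited list of name/number pairs with the Java-style lexer. It assumes the bind and return forms and the value function from blancas.kern.core; the names pair and settings are hypothetical and exist only for this illustration.

```clojure
(use 'blancas.kern.core
     'blancas.kern.lexer.java-style)

;; A name, a colon, and a decimal literal, returned as a two-element vector.
(def pair
  (bind [k identifier
         _ colon
         v dec-lit]
    (return [k v])))

;; Zero or more pairs separated by commas, wrapped in braces.
(def settings (braces (comma-sep pair)))

;; trim clears the leading comment; each lexer parser then skips the
;; whitespace and comments that follow what it consumes.
(value (>> trim settings)
       "/* window */ { width: 80, height: 24 /* rows */ }")
```

Only the outermost parser needs the trim prefix, because each lexer parser leaves the input positioned at the next token.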