How to Customize the Lexer
The library of lexer parsers can be customized through the following properties:
- Delimiters for multi-line comments.
- Start of a single-line comment.
- Whether multi-line comments may be nested.
- Starting and subsequent characters in an identifier.
- Any reserved words not to be parsed as identifiers.
- Whether lexers for specific words are case-sensitive.
- The line-continuation character.
- Whether to treat newline characters as whitespace.
- Whether numeric lexers should accept an optional leading sign.
Kern offers predefined customizations for several language styles, which are placed in their own namespaces:
blancas.kern.lexer.basic
blancas.kern.lexer.c-style
blancas.kern.lexer.haskell-style
blancas.kern.lexer.java-style
blancas.kern.lexer.shell-style
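As a minimal sketch of using one of these predefined namespaces unchanged, the snippet below requires the Java-style lexers and runs two of them with `value` from the Kern core. The namespace `example.predefined` and the alias `jlex` are made up for illustration, and the parser names are assumed to match the keys used later on this page (`identifier`, `dec-lit`, `parens`).

```clojure
(ns example.predefined
  (:require [blancas.kern.core :refer [value]]
            [blancas.kern.lexer.java-style :as jlex]))

;; Parse an identifier; the lexer also consumes the trailing whitespace.
(def name-result (value jlex/identifier "total  "))

;; Parse a decimal literal enclosed in parentheses.
(def number-result (value (jlex/parens jlex/dec-lit) "(42)"))

(println name-result number-result)
```

Swapping the required namespace for another style should change comment and identifier rules without touching the rest of the parser, provided the same lexer names are used.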
We'll now describe how to customize the lexer by using the HOC interpreter as an illustration. The goal is to change the lexer parsers to get the following language features:
- Write comments in Pascal style: (* comment *).
- Comments may be nested.
- Reserve the words used in statements.
- Make the language not case-sensitive.
There's no need to create a separate namespace for the lexer parsers. Since we won't be using all of the parsers, the changes are small and we can make them within the HOC program itself. So instead of importing parsers from a predefined namespace, we'll follow a simple three-step process: (1) fill in the settings, (2) make the lexers, and (3) define the lexers as local vars for easy access.
The following definition extends the basic record basic-def with the values that correspond to the requirements listed above. The namespace declaration loads the Kern Core and Lexer libraries.
(ns customhoc
  (:use [blancas.kern core expr]
        [clojure.string :only (upper-case)])
  (:require [blancas.kern.lexer :as lex]))
(def hoc-style
  (assoc lex/basic-def
    :comment-start      "(*"
    :comment-end        "*)"
    :nested-comments    true
    :identifier-letter  (<|> alpha-num (one-of* "_-."))
    :reserved-names     ["while" "if" "else" "read" "print" "return" "fun" "proc"]
    :case-sensitive     false
    :trim-newline       false))
The last setting is not new: the program was already getting it from the shell-style. It means that newline characters are not skipped as whitespace but must be accounted for by the parser, because a newline can end and separate HOC statements. The rest of the settings follow directly from the requirements listed above.
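To make the effect of `:trim-newline false` concrete, here is a small sketch that is not part of the HOC program. It builds lexers from the basic record with only that setting changed and parses numbers that are each terminated by a newline. The namespace `example.newlines` and the var `numbers` are hypothetical; `many1` and `<<` are assumed to come from the Kern core.

```clojure
(ns example.newlines
  (:require [blancas.kern.core :refer [value many1 <<]]
            [blancas.kern.lexer :as lex]))

;; Lexers built from the basic record, with newline skipping turned off.
(def rec (lex/make-parsers (assoc lex/basic-def :trim-newline false)))

(def dec-lit  (:dec-lit rec))
(def new-line (:new-line rec))

;; Each number must be followed by an explicit newline token;
;; << keeps the number and discards the newline.
(def numbers (many1 (<< dec-lit new-line)))

(def parsed (value numbers "1\n2\n3\n"))
(println parsed)
```

With newline skipping left on, the newline would be consumed as whitespace after each number, so a grammar could not use it as a statement terminator.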
The lexers are generated by calling make-parsers with the hoc-style record.
(def- rec (lex/make-parsers hoc-style))
For easy access, and to avoid having to look up the lexers each time, we simply define vars for them:
(def trim (:trim rec))
(def sym (:sym rec))
(def new-line (:new-line rec))
(def word (:word rec))
(def string-lit (:string-lit rec))
(def dec-lit (:dec-lit rec))
(def float-lit (:float-lit rec))
(def parens (:parens rec))
(def braces (:braces rec))
(def comma-sep (:comma-sep rec))
(def identifier (<$> upper-case (:identifier rec)))
The lexer identifier does a little more work: after parsing an identifier, it converts it to all caps. The identifier is entered in the symbol table and looked up in that form, so it makes no difference how it is written in HOC sources. This is one part of the case insensitivity; the other is handled by Kern automatically, so a program may use if, If, or IF interchangeably.
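The following sketch illustrates why normalizing identifiers makes lookups case-insensitive. It is an illustration under stated assumptions, not the HOC interpreter's actual symbol table: the namespace `example.identifiers`, the atom `symbols`, and the helpers `define!` and `lookup` are all hypothetical. Only the `identifier` definition mirrors the one shown above.

```clojure
(ns example.identifiers
  (:require [blancas.kern.core :refer [value <$>]]
            [blancas.kern.lexer :as lex]
            [clojure.string :refer [upper-case]]))

;; Lexers built from the basic record with case sensitivity turned off.
(def rec (lex/make-parsers (assoc lex/basic-def :case-sensitive false)))

;; Same idea as above: upper-case whatever the identifier lexer returns.
(def identifier (<$> upper-case (:identifier rec)))

;; Hypothetical symbol table keyed by the normalized name.
(def symbols (atom {}))

(defn define! [source-text v]
  (swap! symbols assoc (value identifier source-text) v))

(defn lookup [source-text]
  (get @symbols (value identifier source-text)))

(define! "factorial" :a-function)
(println (lookup "FACTorial"))
```

Because both the definition and the lookup go through the same lexer, every spelling of a name maps to a single key.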
Having made the above changes, the following code is now legal HOC:
(* Define the factorial function. *)
Func factorial() {
    IF ($1 == 0) {
        Return 1
    } else {
        RETURN $1 * factorial($1 - 1)
    }
}
(* Oh, no, can't do that!
   (* Call factorial on a negative *)
   factorial(-100)
*)
(* Call the factorial *)
FACTorial(6)
All of the code is available in the program customhoc.clj. The sample HOC program is custom-fact.hoc. If you want to keep the lexer definitions in a separate namespace, the file custom_lexer.clj shows how this can be done. In this case, the main program would require or use the namespace.
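A rough sketch of the separate-namespace layout might look like the following. It is not the content of custom_lexer.clj; the namespace names `hoc-lexer` and `hoc-main` and the reduced set of settings are assumptions chosen to keep the example short.

```clojure
;; First file (hypothetical): the lexer definitions live in their own namespace.
(ns hoc-lexer
  (:require [blancas.kern.core :refer [<$>]]
            [blancas.kern.lexer :as lex]
            [clojure.string :refer [upper-case]]))

(def hoc-style
  (assoc lex/basic-def
    :case-sensitive false
    :trim-newline   false))

;; Kept private: callers only need the individual lexers below.
(def ^:private rec (lex/make-parsers hoc-style))

(def dec-lit    (:dec-lit rec))
(def identifier (<$> upper-case (:identifier rec)))

;; Second file (hypothetical): the main program requires the lexer namespace.
(ns hoc-main
  (:require [blancas.kern.core :refer [value]]
            [hoc-lexer :as lx]))

(def result (value lx/identifier "total"))
(println result)
```

Keeping the lexers in one namespace lets several parsers share the same settings, at the cost of one extra file to maintain.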