Lexer
The lexer's responsibility is to read a source string and break it into tokens — small pieces of text with special meanings. The most important properties of the lexer are:
- each token carries its location in the source text,
- the lexer can check whether all characters are valid in the HLASM source,
- the lexer can jump backward and forward in the source file when necessary (to implement instructions like AGO and AIF); because of this, it is not possible to use a standard lexing tool.
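To make these requirements concrete, here is a minimal, purely illustrative C++ sketch of an interface a lexer satisfying them could expose; the names are hypothetical and do not correspond to the project's actual classes.

```cpp
#include <cstddef>

// Hypothetical sketch only; the real lexer interface differs.
struct source_position
{
    std::size_t line;
    std::size_t column;
};

class lexer_requirements_sketch
{
public:
    virtual ~lexer_requirements_sketch() = default;

    // every emitted token knows where it lies in the source text
    virtual source_position last_token_start() const = 0;
    virtual source_position last_token_end() const = 0;

    // check that all characters read so far are valid HLASM characters
    virtual bool all_characters_valid() const = 0;

    // reposition the input, as needed by AGO- and AIF-style branching
    virtual void set_position(source_position target) = 0;
};
```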
The last of these requirements necessitated designing a custom lexer for HLASM, and there were further reasons to do so. HLASM is a complex language: it was first introduced several decades ago and has undergone prolonged development since. It also contains features, for example AREAD or COPY, that can alter the source code at parse time.
Conventional lexing tools are most often based on regular expressions. Given the difficulties described above, a regular-expression-based lexer for this particular language would be very difficult or even impossible to design.
One could instead match individual characters from the input and let the parser or the semantic analysis deal with some of the described problems. This drastic solution would cost performance, as parsers are usually more performance-demanding than lexers.
Source code encodings differ among the libraries used. All strings are encoded in UTF as follows:
- UTF-8: LSP string encoding,
- UTF-16: offsets (positions in source code) in LSP,
- UTF-32: ANTLR 4 source code representation.
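The distinction matters because a single character can occupy a different number of code units in each encoding. As an illustration (not project code), the following sketch converts a UTF-32 column index into the corresponding UTF-16 offset that LSP expects:

```cpp
#include <cstddef>
#include <string>

// A UTF-32 code point above the Basic Multilingual Plane occupies two
// UTF-16 code units (a surrogate pair), so UTF-32 and UTF-16 offsets
// into the same source line can differ.
std::size_t utf16_width(char32_t cp) { return cp > 0xFFFF ? 2 : 1; }

// Convert a UTF-32 column index into the corresponding UTF-16 offset
// within a single line of source text.
std::size_t utf32_to_utf16_offset(const std::u32string& line, std::size_t utf32_col)
{
    std::size_t off = 0;
    for (std::size_t i = 0; i < utf32_col && i < line.size(); ++i)
        off += utf16_width(line[i]);
    return off;
}
```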
Note that there are two input_source instances from which tokens are generated. The AINSERT buffer has higher priority: if the buffer is non-empty, the lexer consumes the input from it.
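A minimal sketch of that priority rule (a hypothetical helper, not the actual implementation) could look like this:

```cpp
#include <string>

// Hypothetical input-source selection: the AINSERT buffer, when non-empty,
// is consumed before the ordinary source code.
const std::u32string* select_input(const std::u32string& source, const std::u32string& ainsert_buffer)
{
    if (!ainsert_buffer.empty())
        return &ainsert_buffer;
    return &source;
}
```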
Besides the custom lexer, we altered ANTLR’s classes Token, TokenFactory and ANTLRInputStream (see Third party libraries#ANTLR 4 Pipeline). The reason was to add custom attributes to tokens that are vital for later stages of the HLASM code analysis (parsing, semantic analysis etc.). Lexer functionality is implemented in the following classes (see the picture above):
- token
Implements ANTLR's class Token and extends it with properties that locate the token within the input stream. As the LSP protocol works with offsets encoded in UTF-16 and ANTLR 4 works with the UTF-32 encoding, we add attributes for UTF-16 positions too. A token does not carry the actual text from the source but instead references its position in the code (unlike CommonToken). Note that the position of a token is vital for further analysis (a small illustrative sketch of such position-carrying tokens follows the function list below). The lexer produces the following token types:

| Token type | Meaning |
| --- | --- |
| IGNORED | sequence of characters ignored in processing |
| COMMENT | commentary statements |
| SPACE | a sequence of spaces |
| IDENTIFIER | symbol identifier |
| ORDSYMBOL | ordinary symbol identifier |
| PROCESS | process statement token |
| NUM | number |
| ATTR | apostrophe that serves as an attribute reference |
| ASTERISK, SLASH, MINUS, PLUS, LT, GT, EQUALS, LPAR, RPAR | expression tokens |
| DOT, COMMA, APOSTROPHE, AMPERSAND, VERTICAL | special meaning tokens |

A notable component of HLASM language complexity is the absence of a string token (see the table above). The lexer does not generate such a token due to the existence of model statements: there, a variable symbol can be written anywhere in the statement (even in the middle of a string), which significantly restricts the lexer.
- token_factory
Produces tokens of the previously described custom type token.
- input_source
Implements ANTLRInputStream, which encapsulates the source code. This implementation adds an API for resetting, rewinding and rewriting the input (an illustrative sketch of such an API follows the function list below). Note the usage of UTF encodings: _data (the source code string) and the positions/indices in the API are in UTF-32; getText returns a UTF-8 string.
- lexer
Based on ANTLR's TokenSource class. As with most lexers, it is, in principle, a finite state machine. The most important difference from conventional FSMs and other lexers is the added communication interface that connects the parser and the instruction interpreter with the lexer. Other unconventional features include input rewinding (to support instructions such as AREAD), lexing from parallel sources (the AINSERT buffer) and a helper API for subsequent processing stages.
Important functions:
- nextToken()
Implements the main functionality: lexing and emitting tokens. Before lexing, the function selects the right input stream (either the source code or the AINSERT buffer, if it is not empty). After choosing the input source, the lexer emits a token. HLASM introduces a continuation symbol (an arbitrary non-blank symbol in the continuation column) that breaks one logical line into two or more source-code lines (see the continuation sketch after this list).
- create_token()
Creates a token of a given type. The position of the token is taken from the lexer's internal state.
- consume()
Consumes a character from the input stream and updates the lexer's internal state (used in create_token()).
- lex_tokens()
Lexes most of the token types.
- lex_begin()
Up to a certain column, the input can be ignored (the column can be set in HLASM).
- lex_end()
Lexes everything after a continuation symbol.
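The following is a minimal sketch of the position-carrying token attributes described in the token entry above; the names are hypothetical and only illustrate why both UTF-32 and UTF-16 columns are stored.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: the token stores positions rather than a copy of its
// text. Columns are tracked both as UTF-32 indices (the ANTLR 4 view of the
// input) and as UTF-16 offsets (the view required by the LSP protocol).
struct token_sketch
{
    std::size_t type = 0;        // one of the token types from the table above
    std::size_t line = 0;
    std::size_t start_utf32 = 0; // start column in UTF-32 code points
    std::size_t end_utf32 = 0;   // end column in UTF-32 code points
    std::size_t start_utf16 = 0; // start column in UTF-16 code units
    std::size_t end_utf16 = 0;   // end column in UTF-16 code units

    // The token text is recovered from the source line on demand.
    std::u32string text(const std::u32string& source_line) const
    {
        return source_line.substr(start_utf32, end_utf32 - start_utf32);
    }
};
```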
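For the input_source class, the rewinding and rewriting API might, in spirit, look like the sketch below. It assumes the UTF-32 _data member and the seek()/index() calls inherited from ANTLR's ANTLRInputStream; the method names here are invented for illustration and are not the project's actual API.

```cpp
#include "antlr4-runtime.h"

// Hypothetical sketch built on ANTLR's ANTLRInputStream; not the actual class.
class rewindable_input : public antlr4::ANTLRInputStream
{
public:
    using antlr4::ANTLRInputStream::ANTLRInputStream;

    // Jump to an absolute UTF-32 position in the input (backward or forward).
    void rewind_to(std::size_t utf32_index) { seek(utf32_index); }

    // Replace everything from the current position onward, e.g. when an
    // instruction alters the source code at parse time.
    void rewrite_rest(const std::u32string& new_text)
    {
        _data.resize(index());   // assumes _data is the UTF-32 source string
        _data += new_text;
    }
};
```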
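Finally, a conceptual sketch of the continuation handling mentioned for nextToken(): a non-blank character in the continuation column (column 72 by default) means the statement continues on the next line. The helper below is illustrative only; the real lexer emits tokens instead of rewriting lines and also honors details such as the continue column of the following line.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Zero-based index of column 72, the default continuation column.
constexpr std::size_t continuation_column = 71;

// Join continued source lines into logical lines (illustrative only).
std::vector<std::string> join_logical_lines(const std::vector<std::string>& lines)
{
    std::vector<std::string> result;
    std::string current;
    for (const auto& line : lines)
    {
        bool continued =
            line.size() > continuation_column && line[continuation_column] != ' ';
        current += line.substr(0, std::min(line.size(), continuation_column));
        if (!continued)
        {
            result.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        result.push_back(current); // unterminated continuation at end of input
    return result;
}
```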