Guide on multi-mode lexing #132
@@ -0,0 +1,175 @@
---
title: "Multi-Mode Lexing"
weight: 400
---

Many modern programming languages such as [JavaScript](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals) or [C#](https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/tokens/interpolated) support template literals.
They are a way to easily concatenate or interpolate string values while maintaining great code readability.
This guide will show you how to support template literals in Langium.

Review comment: This paragraph is still a bit strange, as it reads more like the topic is template literals.

For this specific example, our template literals start and end with backticks `` ` `` and are interrupted by expressions wrapped in curly braces `{}`.
In our example, a template literal might look like this:

```js
println(`hello {name}!`);
```

Conceptually, template literals work by reading a start terminal that begins with `` ` `` and ends with `{`,
followed by an expression and then an end terminal, which is effectively the start terminal in reverse: it begins with `}` and ends with `` ` ``.
Since we don't want to restrict users to a single expression per template literal, we also need a "middle" terminal that reads from `}` to `{`.
A user might also write a template literal without any expressions in it,
so we additionally need a "full" terminal that reads from the start of the literal all the way to the end in one go.

To achieve this, we will define a `TemplateLiteral` parser rule and a few terminals.
These terminals adhere to the requirements we just defined.
To make the grammar easier to read and maintain, we also define a terminal fragment that we can reuse in all our terminal definitions:

```antlr
TemplateLiteral:
    // Either just the full content
    content+=TEMPLATE_LITERAL_FULL |
    // Or template literal parts with expressions in between
    (
        content+=TEMPLATE_LITERAL_START
        content+=Expression?
        (
            content+=TEMPLATE_LITERAL_MIDDLE
            content+=Expression?
        )*
        content+=TEMPLATE_LITERAL_END
    )
;

terminal TEMPLATE_LITERAL_FULL:
    '`' IN_TEMPLATE_LITERAL* '`';

terminal TEMPLATE_LITERAL_START:
    '`' IN_TEMPLATE_LITERAL* '{';

terminal TEMPLATE_LITERAL_MIDDLE:
    '}' IN_TEMPLATE_LITERAL* '{';

terminal TEMPLATE_LITERAL_END:
    '}' IN_TEMPLATE_LITERAL* '`';

// '{{' is handled in a special way so we can escape normal '{' characters
// '``' is doing the same for the '`' character
terminal fragment IN_TEMPLATE_LITERAL:
    /[^{`]|{{|``/;
```

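As a rough illustration of how these terminals partition the input, the following TypeScript sketch rebuilds them as plain regular expressions and applies them by hand to the earlier example. The names `inTemplateLiteral`, `templateTerminals` and `matchTerminalAt` are made up for this sketch and are not part of Langium; the real lexer is derived from the grammar, so treat this as an approximation only.

```typescript
// Stand-alone approximation of the terminals above; all names are illustrative only.
// Same alternatives as the IN_TEMPLATE_LITERAL fragment, with '{' escaped for JavaScript.
const inTemplateLiteral = "(?:[^{`]|\\{\\{|``)";

// Sticky ('y') regexes only match at `lastIndex`, similar to a lexer matching at an offset.
const templateTerminals: Record<string, RegExp> = {
    TEMPLATE_LITERAL_FULL: new RegExp("`" + inTemplateLiteral + "*`", "y"),
    TEMPLATE_LITERAL_START: new RegExp("`" + inTemplateLiteral + "*\\{", "y"),
    TEMPLATE_LITERAL_MIDDLE: new RegExp("\\}" + inTemplateLiteral + "*\\{", "y"),
    TEMPLATE_LITERAL_END: new RegExp("\\}" + inTemplateLiteral + "*`", "y"),
};

// Returns the text matched by the given terminal at `offset`, or undefined if it does not match.
function matchTerminalAt(terminalName: string, text: string, offset: number): string | undefined {
    const regex = templateTerminals[terminalName];
    if (regex === undefined) {
        return undefined;
    }
    regex.lastIndex = offset;
    return regex.exec(text)?.[0];
}

const sampleLiteral = "`hello {name}!`";

// The literal contains an expression, so the "full" terminal does not apply.
console.log(matchTerminalAt("TEMPLATE_LITERAL_FULL", sampleLiteral, 0)); // undefined
// The start terminal covers everything up to and including the opening brace.
console.log(matchTerminalAt("TEMPLATE_LITERAL_START", sampleLiteral, 0)); // `hello {
// The end terminal covers the closing brace up to the final backtick.
console.log(matchTerminalAt("TEMPLATE_LITERAL_END", sampleLiteral, sampleLiteral.indexOf("}"))); // }!`
```

The text between the start and end matches (`name` here) is what the `Expression` rule would consume.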
If we start parsing files with these changes, most things should work as expected.
However, depending on the structure of your existing grammar, some of these new terminals might conflict with existing terminals of your language.
For example, if your language supports block statements, placing two blocks one after another makes the issue apparent:

```js
{
    console.log('hi');
}
{
    console.log('hello');
}
```

The `} ... {` sequence in this example won't be lexed as separate `}` and `{` tokens, but as a single `TEMPLATE_LITERAL_MIDDLE` token, resulting in a parser error due to the unexpected token.
This doesn't make sense, since we aren't in the middle of a template literal at this point.
However, our lexer doesn't know yet that the `TEMPLATE_LITERAL_MIDDLE` and `TEMPLATE_LITERAL_END` terminals may only show up within a `TemplateLiteral` rule.
To rectify this, we need to make use of lexer modes. They give us the necessary context to know whether we're inside a template literal or outside of it.
Depending on the currently selected mode, we can lex different terminals. In our case, we want to exclude the `TEMPLATE_LITERAL_MIDDLE` and `TEMPLATE_LITERAL_END` terminals whenever we are outside of a template literal.

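The conflict can be reproduced outside of Langium with a plain regular expression. The sketch below is only an approximation: it rebuilds `TEMPLATE_LITERAL_MIDDLE` by hand (the names `middlePattern` and `blockSource` are made up here) and shows that the text between the two blocks satisfies it, because the fragment accepts any character other than `{` and `` ` ``, including whitespace and line breaks.

```typescript
// Hand-written approximation of TEMPLATE_LITERAL_MIDDLE: '}' IN_TEMPLATE_LITERAL* '{'
// The sticky flag ('y') makes the regex match only at `lastIndex`.
const middlePattern = /\}(?:[^{`]|\{\{|``)*\{/y;

// The two consecutive blocks from the example above.
const blockSource = [
    "{",
    "    console.log('hi');",
    "}",
    "{",
    "    console.log('hello');",
    "}",
].join("\n");

// Try to match at the first closing brace, as a lexer without modes would.
const firstClosingBrace = blockSource.indexOf("}");
middlePattern.lastIndex = firstClosingBrace;
const conflictingMatch = middlePattern.exec(blockSource)?.[0];

// The closing brace, the line break and the next opening brace form a single match.
console.log(JSON.stringify(conflictingMatch)); // "}\n{"
```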
The following implementation of a `TokenBuilder` will do the job for us. It creates two lexing modes that are almost identical: the regular mode drops the `TEMPLATE_LITERAL_MIDDLE` and `TEMPLATE_LITERAL_END` terminals, while the template mode drops the `}` keyword.

Review comment: There should be at least another sentence devoted to what the TokenBuilder is, and another one that describes what the 2 lexing modes are in this context.

We also need to make sure that the modes are switched based on the `TEMPLATE_LITERAL_START` and `TEMPLATE_LITERAL_END` terminals. We use `PUSH_MODE` and `POP_MODE` for this.

Review comment: We should explain why we need push & pop modes here, and probably a quick detail that there's a lexer mode stack underneath the hood. Even a single sentence would help to keep context.

```ts
// Review comment: I would recommend splitting this up into 3 separate parts.
// Building a custom token builder is non-trivial, and it would help to explain the steps a bit more.
// I've written a few suggestions below for splits (heads-up, some comments below appear to be
// out of order with regards to line position).
// Follow-up: As a step up from this, I still feel we should split this up. But in the interest of
// moving this along after some time, can we instead make an issue for a custom token builder guide separately?
import { DefaultTokenBuilder, isTokenTypeArray, GrammarAST } from "langium";
import { IMultiModeLexerDefinition, TokenType, TokenVocabulary } from "chevrotain";

const REGULAR_MODE = 'regular_mode';
const TEMPLATE_MODE = 'template_mode';

export class CustomTokenBuilder extends DefaultTokenBuilder {

    // Review comment: From before, I would first break this out into a separate paragraph,
    // explaining we need to first build up a multi-mode lexer definition that has various modes,
    // which are pushed on by our special tokens.
    // Reply: See above.
    override buildTokens(grammar: GrammarAST.Grammar, options?: { caseInsensitive?: boolean }): TokenVocabulary {
        const tokenTypes = super.buildTokens(grammar, options);

        if (isTokenTypeArray(tokenTypes)) {
            // Regular mode just drops template literal middle & end
            const regularModeTokens = tokenTypes
                .filter(token => !['TEMPLATE_LITERAL_MIDDLE', 'TEMPLATE_LITERAL_END'].includes(token.name));
            // Template mode needs to exclude the '}' keyword
            const templateModeTokens = tokenTypes
                .filter(token => !['}'].includes(token.name));

            const multiModeLexerDef: IMultiModeLexerDefinition = {
                modes: {
                    [REGULAR_MODE]: regularModeTokens,
                    [TEMPLATE_MODE]: templateModeTokens
                },
                defaultMode: REGULAR_MODE
            };
            return multiModeLexerDef;
        } else {
            throw new Error('Invalid token vocabulary received from DefaultTokenBuilder!');
        }
    }

    // Review comment: This would make a nice second part, indicating we need cleanup our
    // Reply: See above.
    protected override buildKeywordToken(
        keyword: GrammarAST.Keyword,
        terminalTokens: TokenType[],
        caseInsensitive: boolean
    ): TokenType {
        const tokenType = super.buildKeywordToken(keyword, terminalTokens, caseInsensitive);

        if (tokenType.name === '}') {
            // The default } token will use [TEMPLATE_LITERAL_MIDDLE, TEMPLATE_LITERAL_END] as longer alts
            // We need to delete the LONGER_ALT, they are not valid for the regular lexer mode
            delete tokenType.LONGER_ALT;
        }
        return tokenType;
    }

    // Review comment: Third part, we can add this & explain how we're associating a push/pop action
    // for start/end literals (which chevrotain needs).
    // Reply: See above.
    protected override buildTerminalToken(terminal: GrammarAST.TerminalRule): TokenType {
        const tokenType = super.buildTerminalToken(terminal);

        // Update token types to enter & exit template mode
        if (tokenType.name === 'TEMPLATE_LITERAL_START') {
            tokenType.PUSH_MODE = TEMPLATE_MODE;
        } else if (tokenType.name === 'TEMPLATE_LITERAL_END') {
            tokenType.POP_MODE = true;
        }
        return tokenType;
    }
}
```

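To see the mode stack in isolation, here is a simplified stand-in for a multi-mode lexer written in plain TypeScript. It is not Chevrotain's implementation: `SketchToken`, `sketchModes`, `tokenizeWithModes` and the tiny `ID`/`WS` tokens are assumptions made for this sketch. It only illustrates the idea that a token can push a mode onto a stack, that another token can pop it again, and that the mode on top of the stack decides which tokens are tried.

```typescript
// Simplified stand-in for a multi-mode lexer; not Chevrotain's actual implementation.
interface SketchToken {
    name: string;
    pattern: RegExp;     // sticky ('y'), so it only matches at the current offset
    pushMode?: string;   // enter this mode after matching (plays the role of PUSH_MODE)
    popMode?: boolean;   // leave the current mode after matching (plays the role of POP_MODE)
}

const inLiteral = "(?:[^{`]|\\{\\{|``)";
const sketchFull: SketchToken = { name: "TEMPLATE_LITERAL_FULL", pattern: new RegExp("`" + inLiteral + "*`", "y") };
const sketchStart: SketchToken = { name: "TEMPLATE_LITERAL_START", pattern: new RegExp("`" + inLiteral + "*\\{", "y"), pushMode: "template_mode" };
const sketchMiddle: SketchToken = { name: "TEMPLATE_LITERAL_MIDDLE", pattern: new RegExp("\\}" + inLiteral + "*\\{", "y") };
const sketchEnd: SketchToken = { name: "TEMPLATE_LITERAL_END", pattern: new RegExp("\\}" + inLiteral + "*`", "y"), popMode: true };
const sketchOpenBrace: SketchToken = { name: "{", pattern: /\{/y };
const sketchCloseBrace: SketchToken = { name: "}", pattern: /\}/y };
const sketchIdentifier: SketchToken = { name: "ID", pattern: /[a-z]+/y };
const sketchWhitespace: SketchToken = { name: "WS", pattern: /\s+/y };

const sketchModes: Record<string, SketchToken[]> = {
    // Regular mode: no MIDDLE/END, so '}' and '{' stay separate tokens.
    regular_mode: [sketchFull, sketchStart, sketchOpenBrace, sketchCloseBrace, sketchIdentifier, sketchWhitespace],
    // Template mode: no '}' keyword, so '}' can only begin a MIDDLE or END token.
    template_mode: [sketchFull, sketchStart, sketchMiddle, sketchEnd, sketchOpenBrace, sketchIdentifier, sketchWhitespace],
};

// Returns the names of the matched tokens (whitespace is skipped).
function tokenizeWithModes(text: string): string[] {
    const modeStack: string[] = ["regular_mode"];
    const tokenNames: string[] = [];
    let offset = 0;
    while (offset < text.length) {
        const currentMode = modeStack[modeStack.length - 1] ?? "regular_mode";
        const candidates = sketchModes[currentMode] ?? [];
        let matched = false;
        for (const token of candidates) {
            token.pattern.lastIndex = offset;
            const matchedText = token.pattern.exec(text)?.[0];
            if (matchedText === undefined || matchedText.length === 0) {
                continue;
            }
            if (token.name !== "WS") {
                tokenNames.push(token.name);
            }
            offset += matchedText.length;
            if (token.pushMode !== undefined) {
                modeStack.push(token.pushMode);
            }
            if (token.popMode === true && modeStack.length > 1) {
                modeStack.pop();
            }
            matched = true;
            break;
        }
        if (!matched) {
            throw new Error("Unexpected character at offset " + offset);
        }
    }
    return tokenNames;
}

// Consecutive blocks stay in regular mode, so the braces are lexed separately.
console.log(tokenizeWithModes("{ a } { b }"));
// The start terminal pushes template mode, the end terminal pops it again.
console.log(tokenizeWithModes("`hello {name}!`"));
```

A stack is used rather than a single flag so that leaving a mode returns to whichever mode was active before it was entered.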
With this change in place, the parser will work as expected. There is one last issue that we need to resolve to get everything working perfectly.
When inspecting our AST, the `content` of a `TemplateLiteral` object will contain strings that still carry input artifacts, mainly `` ` ``, `{` and `}`.
These aren't part of the semantic value of these strings, so we should get rid of them.
To do so, we create a custom `ValueConverter` that removes these artifacts:

```ts
import { CstNode, GrammarAST, DefaultValueConverter, ValueType, convertString } from 'langium';

export class CustomValueConverter extends DefaultValueConverter {

    protected override runConverter(rule: GrammarAST.AbstractRule, input: string, cstNode: CstNode): ValueType {
        if (rule.name.startsWith('TEMPLATE_LITERAL')) {
            // 'convertString' simply removes the first and last character of the input
            return convertString(input);
        } else {
            return super.runConverter(rule, input, cstNode);
        }
    }
}
```

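To make the effect of the converter concrete, the sketch below mimics what the comment in the code describes: dropping the first and last character of each template literal token. `stripDelimiters` is a hypothetical helper written for this example, not a Langium function, and the token texts are written out by hand rather than produced by a real lexer.

```typescript
// Hypothetical helper mirroring the behavior described above:
// drop the first and last character of a template literal token.
function stripDelimiters(tokenText: string): string {
    return tokenText.substring(1, tokenText.length - 1);
}

// Token texts as they would appear for the literal `hello {name}, welcome to {place}!`
// (start, middle and end terminal; the expressions in between are handled by the parser).
const rawTemplateTokens = ["`hello {", "}, welcome to {", "}!`"];
const convertedParts = rawTemplateTokens.map(stripDelimiters);

console.log(JSON.stringify(convertedParts)); // ["hello ",", welcome to ","!"]
```

Only the delimiters at both ends are removed; the text in between is left untouched.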
Finally, let's not forget to bind these services:

```ts
export const CustomModule = {
    parser: {
        TokenBuilder: () => new CustomTokenBuilder(),
        ValueConverter: () => new CustomValueConverter()
    },
};
```

Review comment: I would recommend changing the title here to something about template literals, possibly