A lexical analyzer generator. Compiles regex patterns into C++ table-driven scanners.
Lexy takes token specifications and generates standalone C++ scanners. The generated code recognizes tokens with transition tables, the same approach Lex/Flex use.
Pipeline:
- Parse .lexy token specifications
- Build regex ASTs
- Convert ASTs to NFAs (Thompson's construction)
- Merge multiple NFAs into one
- Determinize to DFA (subset construction)
- Minimize DFA (Hopcroft's algorithm)
- Generate C++ code with transition tables
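As an illustration of the determinization step, here is a minimal sketch of subset construction with epsilon-closures. The `Nfa` and `Dfa` structs and the functions `closure`, `determinize`, and `accepts` are hypothetical names chosen for this example; they are not taken from Lexy's source, which may represent automata differently.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <string_view>
#include <utility>
#include <vector>

constexpr int EPS = -1;  // label used for epsilon edges (assumption)

struct Nfa {  // hypothetical representation
    int start = 0;
    std::set<int> accepting;
    std::vector<std::vector<std::pair<int, int>>> edges;  // edges[s] = (label, target)
};

struct Dfa {  // hypothetical representation; state 0 is the start state
    std::vector<std::map<int, int>> next;  // next[state][symbol] = target
    std::vector<bool> accepting;
};

// Epsilon-closure: every state reachable from `states` using only EPS edges.
std::set<int> closure(const Nfa& nfa, std::set<int> states) {
    std::vector<int> work(states.begin(), states.end());
    while (!work.empty()) {
        int s = work.back();
        work.pop_back();
        for (auto [label, target] : nfa.edges[s])
            if (label == EPS && states.insert(target).second) work.push_back(target);
    }
    return states;
}

// Subset construction: each DFA state stands for a set of NFA states.
Dfa determinize(const Nfa& nfa) {
    assert(nfa.start >= 0 && nfa.start < static_cast<int>(nfa.edges.size()));
    Dfa dfa;
    std::map<std::set<int>, int> ids;    // NFA subset -> DFA state id
    std::queue<std::set<int>> pending;   // subsets whose moves are not yet expanded

    // Returns the id of `subset`, creating a new DFA state on first sight.
    auto intern = [&](const std::set<int>& subset) {
        auto [it, inserted] = ids.try_emplace(subset, static_cast<int>(ids.size()));
        if (inserted) {
            bool acc = false;
            for (int s : subset) acc = acc || nfa.accepting.count(s) > 0;
            dfa.next.emplace_back();
            dfa.accepting.push_back(acc);
            pending.push(subset);
        }
        return it->second;
    };

    intern(closure(nfa, {nfa.start}));
    while (!pending.empty()) {
        std::set<int> subset = pending.front();
        pending.pop();
        int from = ids.at(subset);
        std::map<int, std::set<int>> moves;  // symbol -> NFA targets before closure
        for (int s : subset)
            for (auto [label, target] : nfa.edges[s])
                if (label != EPS) moves[label].insert(target);
        for (auto& [symbol, targets] : moves) {
            int to = intern(closure(nfa, targets));
            dfa.next[from][symbol] = to;
        }
    }
    return dfa;
}

// Runs the DFA over `text`; a missing transition rejects the input.
bool accepts(const Dfa& dfa, std::string_view text) {
    int state = 0;
    for (char c : text) {
        auto it = dfa.next[state].find(c);
        if (it == dfa.next[state].end()) return false;
        state = it->second;
    }
    return dfa.accepting[state];
}
```

For a Thompson-style NFA of `a*b` with five states, `determinize` should produce three DFA states. A map per state keeps the sketch short; a generated scanner would flatten it into the dense table described under Table-Driven Design.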
Table-Driven Design:
The generated scanners use a 2D array TRANSITION_TABLE[state][char] -> next_state plus an array of accepting states. A simple loop walks the input and looks up transitions, remembering the most recent accepting state; when no transition exists, it backtracks to that position and emits the token (longest match).
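The scanning loop can be sketched as follows. This is a hand-written illustration, not Lexy's generated output: the four-state table covers only the IDENTIFIER and INTEGER patterns from the example specification, and the names `buildTable`, `ACCEPTING`, `Match`, and `nextToken` are assumptions made for the sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

constexpr int DEAD = -1;       // marks "no transition"
constexpr int NUM_STATES = 4;
constexpr int ALPHABET = 128;  // ASCII only

enum class TokenKind { None, Identifier, Integer };

struct Match {
    TokenKind kind;
    std::string lexeme;
};

using Table = std::array<std::array<int, ALPHABET>, NUM_STATES>;

// States: 0 = start, 1 = identifier, 2 = the integer "0", 3 = [1-9][0-9]*
Table buildTable() {
    Table t{};
    for (auto& row : t) row.fill(DEAD);
    for (int c = 0; c < ALPHABET; ++c) {
        const bool letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
        const bool digit = c >= '0' && c <= '9';
        if (letter) { t[0][c] = 1; t[1][c] = 1; }
        if (digit) { t[0][c] = (c == '0') ? 2 : 3; t[1][c] = 1; t[3][c] = 3; }
    }
    return t;
}

const Table TRANSITION_TABLE = buildTable();

// ACCEPTING[state] is the token recognized in that state, or None.
constexpr std::array<TokenKind, NUM_STATES> ACCEPTING = {
    TokenKind::None, TokenKind::Identifier, TokenKind::Integer, TokenKind::Integer};

// Returns the longest token starting at `pos` and advances `pos` past it.
// Returns std::nullopt (leaving `pos` unchanged) if no token starts there.
std::optional<Match> nextToken(std::string_view input, std::size_t& pos) {
    assert(pos <= input.size());
    int state = 0;
    std::size_t cursor = pos;
    std::size_t lastAcceptEnd = pos;
    TokenKind lastKind = TokenKind::None;

    while (cursor < input.size()) {
        const unsigned char c = static_cast<unsigned char>(input[cursor]);
        if (c >= ALPHABET) break;                 // outside the ASCII table
        const int next = TRANSITION_TABLE[state][c];
        if (next == DEAD) break;                  // no transition: stop scanning
        state = next;
        ++cursor;
        if (ACCEPTING[state] != TokenKind::None) {  // remember the longest match so far
            lastKind = ACCEPTING[state];
            lastAcceptEnd = cursor;
        }
    }
    if (lastKind == TokenKind::None) return std::nullopt;

    Match m{lastKind, std::string(input.substr(pos, lastAcceptEnd - pos))};
    pos = lastAcceptEnd;  // backtrack to the end of the last accepted prefix
    return m;
}
```

Calling `nextToken` repeatedly on `"123hello"` with `pos` starting at 0 yields an Integer `123` followed by an Identifier `hello`. In this small automaton every non-start state is accepting, so the backtracking step never rewinds; it matters once a DFA passes through non-accepting states on the way to a longer token.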
Supported Regex Syntax:
- Operators: |, *, +, ?, and concatenation
- Ranges: {n,m}, {n,}
- Character classes: [a-z], [^abc]
- Escapes: \n, \t, \\, and metacharacters
- Wildcard: .
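Range quantifiers can be expressed with the core operators alone. The sketch below shows one common desugaring: `x{n,m}` becomes n copies of `x` followed by m - n optional copies, and `x{n,}` becomes n copies followed by `x*`. Whether Lexy rewrites ranges this way or handles them directly during NFA construction is not stated here; `expandRange` is a hypothetical helper used only for illustration.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Expands atom{min,max} into a pattern using only concatenation, ? and *.
// max == std::nullopt stands for the unbounded form {min,}.
std::string expandRange(const std::string& atom, int min, std::optional<int> max) {
    assert(min >= 0 && (!max || *max >= min));
    const std::string group = "(" + atom + ")";  // keep the atom as one unit
    std::string out;
    for (int i = 0; i < min; ++i) out += group;  // required copies
    if (!max) return out + group + "*";          // {min,}: any number of extra copies
    for (int i = min; i < *max; ++i) out += group + "?";  // optional copies
    return out;
}
```

For example, `expandRange("a", 2, 4)` returns `(a)(a)(a)?(a)?`, and `expandRange("[0-9]", 1, std::nullopt)` returns `([0-9])([0-9])*`.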
Build:
make                                  # Build generator
./scanner_generator.exe input.lexy    # Generate scanner

Requires C++20.
Input (examples/myScanner.lexy):
IDENTIFIER ::= "[a-zA-Z_][a-zA-Z0-9_]*"
INTEGER ::= "0|[1-9][0-9]*"
Generate:
./scanner_generator.exe examples/myScanner.lexy
Output: generated/scanners/myScanner.cpp
Test:
// Longest match applies: "hello123" would scan as a single IDENTIFIER.
Scanner scanner("123hello");
Token t1 = scanner.getNextToken(); // INTEGER: "123"
Token t2 = scanner.getNextToken(); // IDENTIFIER: "hello"
Token t3 = scanner.getNextToken(); // EOF
Limitations:
- ASCII only (0-127)
- No table compression
- No token priority rules
- No whitespace skipping
References:
- Aho, Sethi, Ullman - Compilers: Principles, Techniques, and Tools (Dragon Book)
- Cooper & Torczon - Engineering a Compiler
- Hopcroft, Motwani, Ullman - Introduction to Automata Theory