# Lexing ## Overview The first stage in femto's compiler pipeline is lexical analysis, or lexing. During this stage, an input stream of characters (a single .fm file) is converted into a stream of character spans, called tokens, which represent keywords, identifiers, operators, etc. For example, take the following file: ```femto let main = fn() void { let a = 5; let b = 1; let c = a + b; }; ``` The above file gets converted to the following token stream: ```zig .k_let, .ident, .equal, .k_fn, .l_paren, .r_paren, .ident, .l_brace, .k_let, .ident, .equal, .int_lit, .semi, .k_let, .ident, .equal, .int_lit, .semi, .k_let, .ident, .equal, .ident, .plus, .ident, .semi, .r_brace, .semi, ``` ## Token Structure A token represents a span into the entire file's string contents. That is, it contains a start index (inclusive), end index (exclusive), and a token tag representing the type of token. ```zig pub const Token = struct { tag: Tag, loc: Loc, pub const Loc = struct { start: u32, end: u32, }; pub const Tag = enum { eof, ident, semi, k_let, // ... }; }; ``` Notice that the character indices are `u32`, not `u64`. femtoc places an upper limit on any single file size as UINT32_MAX characters. This cuts memory usage of storing token indices in half, and allows other optimizations later down the pipeline (such as improving cache locality). Note also that the token does not own and string data. When the pipeline wants the contents of a token, it can read it with `buf[loc.start..loc.end]`. There is only ever one copy of the file buffer (although later stages may copy specific strings, such as identifiers, for use in name resolution and string literal storage). ## Lexer Lexical analysis is performed by the `Lexer` structure. It operates a finite state machine which matches characters in certain orders, and emits a token when a token-ending character is found. This could be a whitespace character, or any other character that doesn't look like part of the same token (i.e. `abc(`) matches to `.ident` and `.l_paren` even though there is no whitespace. ```zig pub const Lexer = struct { source: [:0]const u8, // null-terminated pointer to read-only buffer index: u32, // current location of lexer in buffer (only moves forward) const State = enum { start, ident, annot, period, equal, // ... }; // runs the FSM until a token is emitted pub fn next(self: *Lexer) Token; }; ``` The `next` function body fires up the FSM, starting with the `.start` state. The span is initialized to start at the current character, and doesn't have a known end index. Then, we loop, switching on the current state, and then the current character. This is documented in the source code of lex.zig, but should be fairly straight forward to understand. Here are a few notes: * Single character tokens (like punctuation and some operators) don't have any intermediate states; They simply set the token tag and return. * Multi-character tokens that start with the same characters store state. For example, `>>` and `>=` store the `.r_angle` state when the `>` is received in the `.start` state. Then, when in the `.r_angle` state, they switch on the current character and either emit `.r_angle_r_angle` or `.r_angle_equal`. * Tokens are named using the character names, not semantic meaning (`.r_angle_equal` rather than `.greater_equal`) because many operators correspond to multiple meanings, and the lexer is not context aware. * The loop increments `self.index` at the end of each iteration. However, when breaking at the end of a token, the code must increment, since the loop afterthought won't get executed. ## Level of Analysis Each level of the compiler pipeline makes progress towards understanding and validating that femto code is legal. The lexer validates that tokens are reasonable (`abc2` is an `.ident`, but `2abc` is not valid since identifiers cannot start with numerics. However, it doesn't try to make sense of the stream of tokens; `.ident .l_brace .equal .asterisk .r_brace` is not caught.