Why regex over tree-sitter for extractors? #168
Hey, curious about the design choice to use pure regex for all 21 language extractors rather than tree-sitter. I get the zero-deps constraint, but was that the main driver, or were there other reasons (startup cost, "never throw" guarantees, accuracy being good enough for signature-only extraction)? Did you evaluate tree-sitter at any point and decide against it?
Replies: 2 comments
Good question. Zero-deps was the main driver, but not the only one.
SigMap is intentionally not a full parser. It is a fast repo-orientation layer: signatures, imports, classes, exports, and structure, which is enough to help an agent decide which files/functions to inspect first.
Tree-sitter is great, but it adds install/platform complexity, grammar maintenance, package weight, and more failure modes. For the core CLI I wanted something deterministic, portable, fast, and “never throw” by design.
The regex extractors are conservative and signature-focused. They won’t capture every language edge case, but they degrade gracefully and keep the workflow moving.
I did consider tree-sitter, and I can see it as an optional advanced mode later. But for the core SigMap use case, regex + import graph + ranking gave the best balance of simplicity, speed, portability, and usefulness.
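To make "conservative, signature-focused, never throw" concrete, here is a minimal sketch of that style of extractor. It is not SigMap's actual code: the patterns, the `extract_signatures` name, and the choice of Python as the target language are all assumptions made for illustration.

```python
import re

# Hypothetical patterns (not taken from SigMap). Each one matches a single
# declaration line and deliberately ignores multi-line signatures.
_PATTERNS = [
    ("import", re.compile(r"^\s*(?:from\s+[\w.]+\s+)?import\s+.+$")),
    ("class", re.compile(r"^\s*class\s+\w+[^:]*:")),
    ("function", re.compile(r"^\s*(?:async\s+)?def\s+\w+\s*\([^)]*\)[^:]*:")),
]


def extract_signatures(source):
    """Return a list of (kind, line_number, text) tuples. Never raises."""
    results = []
    try:
        for number, line in enumerate(source.splitlines(), start=1):
            for kind, pattern in _PATTERNS:
                match = pattern.match(line)
                if match:
                    results.append((kind, number, match.group(0).strip()))
                    break  # at most one signature per line
    except Exception:
        # Degrade gracefully: bad input yields whatever was collected so far.
        pass
    return results
```

A signature split across several lines is simply skipped rather than reported incorrectly, which is the trade-off described above: lower recall on edge cases in exchange for no parser dependency and no hard failures.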
I am working on a few more improvement plans and will share them here soon.