feat(lang): bind tree-sitter grammars for SQL, JSON, TOML, Haskell, OCaml #56
Vendors DerekStride/tree-sitter-sql v0.3.11 under internal/tsgrammars/ (parser.c generated via `tree-sitter generate` — upstream gitignores it). Registers SQL in the language registry with a TreeSitter binding, moving it from the recognition-only section into the parsed-languages section. Adds classifySQL to parser/parser.go to surface CREATE TABLE, CREATE INDEX, and CREATE FUNCTION nodes as typed symbols (table / index / function). Smoke-tested against infra/postgres/init.sql: 15 symbols — 4 tables, 6 indexes, 5 functions (all RPCs: ingest_doc, append_contract_schema, latest_contract_schema, submit_label, stale_count).
Promotes four recognition-only languages to tree-sitter parsed:
- json (github.com/tree-sitter/tree-sitter-json v0.24.8)
- toml (github.com/tree-sitter-grammars/tree-sitter-toml v0.7.0)
- haskell (github.com/tree-sitter/tree-sitter-haskell v0.23.1)
- ocaml (github.com/tree-sitter/tree-sitter-ocaml v0.25.0)

OCaml splits into two registry entries: ocaml (.ml, LanguageOCaml) and ocaml_interface (.mli, LanguageOCamlInterface). All modules ship pre-generated parser.c. go mod tidy clean. All 311 tests pass (CGO_CFLAGS="-DSQLITE_ENABLE_FTS5").

Skipped:
- dockerfile (camdencheek/tree-sitter-dockerfile v0.2.0): go.mod at bindings/go/ declares the root module path but no Go package exists at that path, so the Go toolchain cannot resolve it.
- r (r-lib/tree-sitter-r v1.2.0): binding.go omits scanner.c include, causing linker errors for external scanner symbols.
- markdown (tree-sitter-grammars/tree-sitter-markdown v0.5.3): no Go bindings directory in the repo.
- make, zig, erlang, perl: no Go bindings found in any maintained fork.
Pull request overview
This PR expands cymbal’s tree-sitter support by promoting SQL/JSON/TOML/Haskell/OCaml from extension-only recognition to fully parseable languages (via registry bindings), and adds SQL-specific symbol extraction for common DDL constructs.
Changes:
- Register new tree-sitter grammars for `sql`, `json`, `toml`, `haskell`, `ocaml`, and `ocaml_interface` in `lang.Default`.
- Add SQL symbol extraction in the parser for `CREATE TABLE`, `CREATE INDEX`, and `CREATE FUNCTION`.
- Vendor the `tree-sitter-sql` grammar internally (including generated `parser.c` + external scanner) and update tests/deps to reflect new supported languages.
Reviewed changes
Copilot reviewed 12 out of 14 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| walker/walker_feature_test.go | Updates supported-language walking expectations now that JSON/TOML are parseable. |
| parser/parser.go | Routes SQL files to a new SQL classifier for symbol extraction. |
| lang/registry.go | Adds tree-sitter language registrations for SQL/JSON/TOML/Haskell/OCaml(+interface). |
| lang/lang_test.go | Updates extension mapping expectations and Supported() coverage for newly-parseable languages. |
| internal/tsgrammars/tree-sitter-sql/UPSTREAM.txt | Records upstream source/version and why SQL is vendored. |
| internal/tsgrammars/tree-sitter-sql/src/tree_sitter/parser.h | Vendored tree-sitter runtime header required by generated parser/scanner. |
| internal/tsgrammars/tree-sitter-sql/src/tree_sitter/array.h | Vendored tree-sitter runtime header required by generated parser/scanner. |
| internal/tsgrammars/tree-sitter-sql/src/tree_sitter/alloc.h | Vendored tree-sitter runtime header required by generated parser/scanner. |
| internal/tsgrammars/tree-sitter-sql/src/scanner.c | Adds SQL external scanner implementation (dollar-quoted strings). |
| internal/tsgrammars/tree-sitter-sql/LICENSE | Includes upstream MIT license text for vendored grammar. |
| internal/tsgrammars/tree-sitter-sql/bindings/go/binding.go | Adds cgo binding exposing Language() for the vendored SQL grammar. |
| go.mod | Adds module requirements for new external grammars (toml, haskell, json, ocaml). |
| go.sum | Updates sums for newly added/updated dependencies. |
    case "sql":
        return e.classifySQL(nodeType, node)
Added TestFeatureSQLDDL in bcf7db1. Covers CREATE TABLE (table symbol), CREATE INDEX (index symbol), and CREATE FUNCTION (function symbol), including language field assertion on each.
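The three DDL mappings that the test covers can be sketched as a small lookup. This is only an illustration: `symbolKind` and `classifySQLKind` are hypothetical names, and the node type strings (`create_table`, `create_index`, `create_function`) are assumptions about what the vendored grammar emits, not taken from this PR.

```go
package main

import "fmt"

// symbolKind is a hypothetical stand-in for the project's symbol type.
type symbolKind string

// classifySQLKind maps a tree-sitter node type to a symbol kind.
// The node type names are assumed, not confirmed by the PR.
func classifySQLKind(nodeType string) (symbolKind, bool) {
	switch nodeType {
	case "create_table":
		return "table", true
	case "create_index":
		return "index", true
	case "create_function":
		return "function", true
	}
	// Any other node type produces no symbol.
	return "", false
}

func main() {
	for _, nt := range []string{"create_table", "create_index", "create_function", "select"} {
		if kind, ok := classifySQLKind(nt); ok {
			fmt.Printf("%s -> %s\n", nt, kind)
		} else {
			fmt.Printf("%s -> (no symbol)\n", nt)
		}
	}
}
```

A table-driven test in the style of TestFeatureSQLDDL would assert one expected kind per statement, plus the language field on each symbol.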
        if (state->start_tag != NULL) {
            free(state->start_tag);
            state->start_tag = NULL;
        }
        return tag_length;
    }

    void tree_sitter_sql_external_scanner_deserialize(void *payload, const char *buffer, unsigned length) {
        LexerState *state = (LexerState *)payload;
        state->start_tag = NULL;
Fixed in bcf7db1. The serialize callback no longer frees or clears start_tag after copying it into the buffer - it now returns immediately after memcpy. Deserialize now frees any existing start_tag before overwriting it.
    void tree_sitter_sql_external_scanner_deserialize(void *payload, const char *buffer, unsigned length) {
        LexerState *state = (LexerState *)payload;
        state->start_tag = NULL;
        // A length of 1 can't exists.
        if (length > 1) {
            state->start_tag = malloc(length);
            memcpy(state->start_tag, buffer, length);
        }
    }
Fixed in bcf7db1 (same commit as serialize fix). deserialize now frees any existing start_tag before assigning NULL or the new malloc'd buffer.
    void tree_sitter_sql_external_scanner_deserialize(void *payload, const char *buffer, unsigned length) {
        LexerState *state = (LexerState *)payload;
        state->start_tag = NULL;
        // A length of 1 can't exists.
serialize was mutating state after copying (freeing start_tag), breaking parse-stack backtracking. deserialize was leaking any existing start_tag before replacing it. Fix both and correct a grammar error in a comment.
Vendor opa-oz/tree-sitter-nginx at commit 47ade644 under internal/tsgrammars/tree-sitter-nginx with a hand-written CGO binding mirroring the SQL grammar pattern. Register it in lang.Default with Filenames: ["nginx.conf"] and Extensions: [".nginx"] to avoid claiming the ambiguous .conf extension. Upstream Go bindings were not usable directly: the upstream go.mod declares the wrong module path and uses the smacker API instead of github.com/tree-sitter/go-tree-sitter. Also add a TODO comment for future make/cmake grammar additions above the recognition-only language block.
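To illustrate why registering `nginx.conf` as a filename rather than claiming `.conf` as an extension avoids false matches, here is a minimal sketch of filename-first detection. The `language` struct and `detect` function are hypothetical simplifications; the real registry in `lang.Default` may resolve files differently.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// language is a hypothetical, simplified registry entry.
type language struct {
	Name       string
	Filenames  []string
	Extensions []string
}

// detect checks exact filenames first, then falls back to extensions.
func detect(registry []language, path string) (string, bool) {
	base := filepath.Base(path)
	ext := strings.ToLower(filepath.Ext(base))
	for _, l := range registry {
		for _, f := range l.Filenames {
			if f == base {
				return l.Name, true
			}
		}
	}
	for _, l := range registry {
		for _, e := range l.Extensions {
			if e == ext {
				return l.Name, true
			}
		}
	}
	return "", false
}

func main() {
	registry := []language{
		{Name: "nginx", Filenames: []string{"nginx.conf"}, Extensions: []string{".nginx"}},
	}
	for _, p := range []string{"/etc/nginx/nginx.conf", "conf.d/site.nginx", "redis.conf"} {
		name, ok := detect(registry, p)
		fmt.Println(p, name, ok)
	}
}
```

With this entry, `redis.conf` and other unrelated `.conf` files stay unclaimed, while `nginx.conf` and `*.nginx` files resolve to the Nginx grammar.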
Without language-specific classifiers, Haskell and OCaml files were indexed with zero symbols despite having valid tree-sitter grammars: classifyGeneric only matches function_definition/class_definition, which neither grammar uses.
- classifyHaskell: function node → function (variable child), data_type → type
- classifyOCaml: value_definition → function (let_binding/value_name), type_definition → type (type_binding/type_constructor), module_definition → module (module_binding/module_name)
- classifyNginx: upstream attribute → upstream (value child after keyword), location node → location (location_route child)

Adds TestFeatureHaskellFunctions, TestFeatureOCamlFunctions, TestFeatureNginxSymbols to parser_feature_test.go.
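The Haskell rules can be sketched against a toy tree to show the shape of such a classifier. Everything here is illustrative: `node`, `firstChild`, and this `classifyHaskell` signature are hypothetical, and the sketch models `name` as a child type, whereas the real grammar may expose it as a field.

```go
package main

import "fmt"

// node is a hypothetical, minimal stand-in for a tree-sitter node.
type node struct {
	Type     string
	Text     string
	Children []*node
}

// firstChild returns the first direct child with the given type, or nil.
func firstChild(n *node, typ string) *node {
	for _, c := range n.Children {
		if c.Type == typ {
			return c
		}
	}
	return nil
}

// classifyHaskell returns a symbol kind and name for the node types
// described above, and ok=false for anything else.
func classifyHaskell(n *node) (kind, name string, ok bool) {
	switch n.Type {
	case "function":
		if v := firstChild(n, "variable"); v != nil {
			return "function", v.Text, true
		}
	case "data_type":
		if c := firstChild(n, "name"); c != nil {
			return "type", c.Text, true
		}
	}
	return "", "", false
}

func main() {
	fn := &node{Type: "function", Children: []*node{{Type: "variable", Text: "parse"}}}
	dt := &node{Type: "data_type", Children: []*node{{Type: "name", Text: "Token"}}}
	for _, n := range []*node{fn, dt, {Type: "function_definition"}} {
		kind, name, ok := classifyHaskell(n)
		fmt.Println(n.Type, kind, name, ok)
	}
}
```

The last case shows the original failure mode in reverse: a classifier keyed on `function_definition` finds nothing in a grammar that names the node `function`.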
withTestWorkingDir only overrode XDG_CONFIG_HOME and APPDATA, but os.UserConfigDir on darwin ignores XDG_CONFIG_HOME and resolves to $HOME/Library/Application Support. As a result, the user-scope branch of opencodePluginPath leaked into the real user's config dir, so any project-scope install test failed when a managed plugin already existed at ~/Library/Application Support/opencode/plugins. Override HOME (and USERPROFILE) to the same tempdir so all scope lookups stay inside the sandbox on every platform.
mattn/go-sqlite3 exposes FTS5 via the canonical `sqlite_fts5` build tag, which is more idiomatic and discoverable than threading `CGO_CFLAGS=-DSQLITE_ENABLE_FTS5` through every Make target. No behavior change: `make build`, `make build-check`, `make install`, `make test`, and `make test-coverage` all continue to compile FTS5 into the sqlite driver. Bare `go test ./...` still requires the tag (CGO C deps cannot be configured from in-tree Go build tags) -- the canonical invocation remains `make test`.
Summary
Promotes six recognition-only languages to tree-sitter parsed, and adds symbol extractors for SQL, Haskell, OCaml, and Nginx.
SQL symbol extractor
Adds `classifySQL` to `parser/parser.go` that surfaces:
- `CREATE TABLE` → `table` symbol
- `CREATE INDEX` → `index` symbol
- `CREATE FUNCTION` → `function` symbol

Haskell symbol extractor
Adds `classifyHaskell` that surfaces:
- `function` node → `function` symbol (name from first `variable` child)
- `data_type` node → `type` symbol (name from `name` child)

OCaml symbol extractor
Adds `classifyOCaml` that surfaces:
- `value_definition` → `function` symbol (name from `let_binding` → `value_name`)
- `type_definition` → `type` symbol (name from `type_binding` → `type_constructor`)
- `module_definition` → `module` symbol (name from `module_binding` → `module_name`)

Nested definitions (functions inside modules, inner let-recs) are parented correctly.
Nginx symbol extractor
Adds `classifyNginx` that surfaces:
- `upstream name { ... }` → `upstream` symbol (name from `value` child after `upstream` keyword)
- `location /path { ... }` → `location` symbol (path from `location_route` child)

Server blocks are anonymous and not surfaced.
SQL vendoring
DerekStride/tree-sitter-sql gitignores `parser.c`, so it cannot be used as an external Go module. Follows the existing pattern in `internal/tsgrammars/` (dart, elixir, swift): parser.c was generated with `tree-sitter generate` at v0.3.11 and vendored alongside scanner.c and the tree_sitter headers.

Nginx vendoring
opa-oz/tree-sitter-nginx upstream go.mod declares the wrong module path and uses the smacker API instead of github.com/tree-sitter/go-tree-sitter. Vendored at commit 47ade644 following the same pattern.
OCaml split
OCaml exposes two sub-grammars: `LanguageOCaml()` for `.ml` files and `LanguageOCamlInterface()` for `.mli` files (same pattern as TypeScript/TSX). Registered as separate cymbal languages: `ocaml` and `ocaml_interface`.

Skipped (no working Go binding)
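- dockerfile: `go.mod` at `bindings/go/` declares the root module path but no Go package exists at that path, so the Go toolchain cannot resolve the import
- r: `binding.go` omits `scanner.c` include, causing linker errors for external scanner symbols

Test plan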
- `go test ./...` with `CGO_CFLAGS="-DSQLITE_ENABLE_FTS5"`: all relevant tests pass
- `TestFeatureSQLDDL`: tables, indexes, functions surface as symbols
- `TestFeatureHaskellFunctions`: function and data type symbols extracted
- `TestFeatureOCamlFunctions`: function, type, module symbols extracted (including nested)
- `TestFeatureNginxSymbols`: upstream and location symbols extracted
- `TestSupported` in lang package: all 6 new languages marked as tree-sitter-backed
- `go mod tidy` produces no changes