feat(lang): bind tree-sitter grammars for SQL, JSON, TOML, Haskell, OCaml #56

Open
stochastic-sisyphus wants to merge 8 commits into 1broseidon:main from stochastic-sisyphus:feat/sql-and-recognition-grammars

stochastic-sisyphus commented May 11, 2026

Summary

Promotes six recognition-only languages to tree-sitter parsed, and adds symbol extractors for SQL, Haskell, OCaml, and Nginx:

| Language | Module | Version | Notes |
| --- | --- | --- | --- |
| sql | vendored (DerekStride/tree-sitter-sql) | v0.3.11 | parser.c generated; upstream gitignores it |
| json | github.com/tree-sitter/tree-sitter-json | v0.24.8 | ships parser.c |
| toml | github.com/tree-sitter-grammars/tree-sitter-toml | v0.7.0 | ships parser.c |
| haskell | github.com/tree-sitter/tree-sitter-haskell | v0.23.1 | ships parser.c |
| ocaml | github.com/tree-sitter/tree-sitter-ocaml | v0.25.0 | two entries: ocaml (.ml) + ocaml_interface (.mli) |
| nginx | vendored (opa-oz/tree-sitter-nginx) | commit 47ade644 | upstream go.mod uses wrong module path + smacker API |

SQL symbol extractor

Adds classifySQL to parser/parser.go that surfaces:

  • CREATE TABLE → table symbol
  • CREATE INDEX → index symbol
  • CREATE FUNCTION → function symbol
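A minimal sketch of the mapping above as a table-driven lookup. The node-type strings (`create_table`, `create_index`, `create_function`) and the helper name `sqlSymbolKind` are assumptions made for illustration; the actual classifySQL in parser/parser.go works on a tree-sitter node and may be structured differently.

```go
package main

import "fmt"

// sqlKinds maps a grammar node type to the symbol kind it surfaces as.
// The node-type names are assumed, not copied from the PR diff.
var sqlKinds = map[string]string{
	"create_table":    "table",
	"create_index":    "index",
	"create_function": "function",
}

// sqlSymbolKind reports the symbol kind for a node type. ok is false when
// the node is not one of the DDL statements that should be indexed.
func sqlSymbolKind(nodeType string) (kind string, ok bool) {
	kind, ok = sqlKinds[nodeType]
	return kind, ok
}

func main() {
	for _, nt := range []string{"create_table", "create_index", "create_function", "select"} {
		kind, ok := sqlSymbolKind(nt)
		fmt.Printf("%s -> kind=%q indexed=%v\n", nt, kind, ok)
	}
}
```

Keeping the mapping in one table makes it cheap to add further DDL statements later without touching the lookup logic.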

Haskell symbol extractor

Adds classifyHaskell that surfaces:

  • function node → function symbol (name from first variable child)
  • data_type node → type symbol (name from name child)

OCaml symbol extractor

Adds classifyOCaml that surfaces:

  • value_definition → function symbol (name from let_binding → value_name)
  • type_definition → type symbol (name from type_binding → type_constructor)
  • module_definition → module symbol (name from module_binding → module_name)

Nested definitions (functions inside modules, inner let-recs) are parented correctly.
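The lookup path and parenting described above can be sketched with a stand-in tree. `Node`, `Symbol`, `ocamlRules`, `collect`, and `sampleTree` are hypothetical names; the real extractor operates on go-tree-sitter nodes, and the actual tree shape for nested modules may differ from this simplified one.

```go
package main

import "fmt"

// Node is a simplified stand-in for a tree-sitter node.
type Node struct {
	Type     string
	Text     string
	Children []*Node
}

// Symbol is a hypothetical extracted symbol plus the name of its enclosing symbol.
type Symbol struct {
	Kind, Name, Parent string
}

// rule describes where a definition node keeps its name:
// definition -> binding child -> name child.
type rule struct{ kind, binding, name string }

var ocamlRules = map[string]rule{
	"value_definition":  {"function", "let_binding", "value_name"},
	"type_definition":   {"type", "type_binding", "type_constructor"},
	"module_definition": {"module", "module_binding", "module_name"},
}

// child returns the first direct child of n with the given type, or nil.
func child(n *Node, typ string) *Node {
	for _, c := range n.Children {
		if c.Type == typ {
			return c
		}
	}
	return nil
}

// collect walks the tree depth-first, records one symbol per matching
// definition, and passes the nearest enclosing symbol name down as parent.
func collect(n *Node, parent string, out *[]Symbol) {
	next := parent
	if r, ok := ocamlRules[n.Type]; ok {
		if b := child(n, r.binding); b != nil {
			if nm := child(b, r.name); nm != nil {
				*out = append(*out, Symbol{r.kind, nm.Text, parent})
				next = nm.Text
			}
		}
	}
	for _, c := range n.Children {
		collect(c, next, out)
	}
}

// sampleTree models `module M = struct let f = ... end` in simplified form.
func sampleTree() *Node {
	return &Node{Type: "module_definition", Children: []*Node{{
		Type: "module_binding", Children: []*Node{
			{Type: "module_name", Text: "M"},
			{Type: "structure", Children: []*Node{{
				Type: "value_definition", Children: []*Node{{
					Type:     "let_binding",
					Children: []*Node{{Type: "value_name", Text: "f"}},
				}},
			}}},
		},
	}}}
}

func main() {
	var syms []Symbol
	collect(sampleTree(), "", &syms)
	fmt.Printf("%+v\n", syms)
}
```

Threading the parent name through the recursion is one simple way to get nesting right; the PR may track parents differently.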

Nginx symbol extractor

Adds classifyNginx that surfaces:

  • upstream name { ... } → upstream symbol (name from value child after upstream keyword)
  • location /path { ... } → location symbol (path from location_route child)

Server blocks are anonymous and not surfaced.

SQL vendoring

DerekStride/tree-sitter-sql gitignores parser.c, so it cannot be used as an external Go module. Follows the existing pattern in internal/tsgrammars/ (dart, elixir, swift) — parser.c was generated with tree-sitter generate at v0.3.11 and vendored alongside scanner.c and the tree_sitter headers.

Nginx vendoring

opa-oz/tree-sitter-nginx upstream go.mod declares the wrong module path and uses the smacker API instead of github.com/tree-sitter/go-tree-sitter. Vendored at commit 47ade644 following the same pattern.

OCaml split

OCaml exposes two sub-grammars: LanguageOCaml() for .ml files and LanguageOCamlInterface() for .mli files (same pattern as TypeScript/TSX). Registered as separate cymbal languages: ocaml and ocaml_interface.

Skipped (no working Go binding)

  • dockerfile (camdencheek/tree-sitter-dockerfile v0.2.0): go.mod declares root module path but the Go package lives in bindings/go/ — Go toolchain cannot resolve the import
  • r (r-lib/tree-sitter-r v1.2.0): binding.go omits scanner.c include, causing linker errors for external scanner symbols
  • markdown (tree-sitter-grammars/tree-sitter-markdown v0.5.3): no Go bindings directory
  • make, zig, erlang, perl: no Go bindings found in maintained upstream forks

Test plan

  • go test ./... with CGO_CFLAGS="-DSQLITE_ENABLE_FTS5" — all relevant tests pass
  • TestFeatureSQLDDL — tables, indexes, functions surface as symbols
  • TestFeatureHaskellFunctions — function and data type symbols extracted
  • TestFeatureOCamlFunctions — function, type, module symbols extracted (including nested)
  • TestFeatureNginxSymbols — upstream and location symbols extracted
  • TestSupported in lang package — all 6 new languages marked as tree-sitter-backed
  • go mod tidy produces no changes

Vendors DerekStride/tree-sitter-sql v0.3.11 under internal/tsgrammars/
(parser.c generated via `tree-sitter generate` — upstream gitignores it).
Registers SQL in the language registry with a TreeSitter binding, moving
it from the recognition-only section into the parsed-languages section.

Adds classifySQL to parser/parser.go to surface CREATE TABLE, CREATE INDEX,
and CREATE FUNCTION nodes as typed symbols (table / index / function).

Smoke-tested against infra/postgres/init.sql: 15 symbols — 4 tables,
6 indexes, 5 functions (all RPCs: ingest_doc, append_contract_schema,
latest_contract_schema, submit_label, stale_count).

Promotes four recognition-only languages to tree-sitter parsed:
- json    (github.com/tree-sitter/tree-sitter-json v0.24.8)
- toml    (github.com/tree-sitter-grammars/tree-sitter-toml v0.7.0)
- haskell (github.com/tree-sitter/tree-sitter-haskell v0.23.1)
- ocaml   (github.com/tree-sitter/tree-sitter-ocaml v0.25.0)

OCaml splits into two registry entries: ocaml (.ml, LanguageOCaml)
and ocaml_interface (.mli, LanguageOCamlInterface).

All modules ship pre-generated parser.c. go mod tidy clean.
All 311 tests pass (CGO_CFLAGS="-DSQLITE_ENABLE_FTS5").

Skipped:
- dockerfile (camdencheek/tree-sitter-dockerfile v0.2.0): go.mod at
  bindings/go/ declares the root module path but no Go package exists
  at that path — Go toolchain cannot resolve it.
- r (r-lib/tree-sitter-r v1.2.0): binding.go omits scanner.c include,
  causing linker errors for external scanner symbols.
- markdown (tree-sitter-grammars/tree-sitter-markdown v0.5.3): no Go
  bindings directory in the repo.
- make, zig, erlang, perl: no Go bindings found in any maintained fork.

Copilot AI review requested due to automatic review settings May 11, 2026 04:33

Copilot AI left a comment

Pull request overview

This PR expands cymbal’s tree-sitter support by promoting SQL/JSON/TOML/Haskell/OCaml from extension-only recognition to fully parseable languages (via registry bindings), and adds SQL-specific symbol extraction for common DDL constructs.

Changes:

  • Register new tree-sitter grammars for sql, json, toml, haskell, ocaml, and ocaml_interface in lang.Default.
  • Add SQL symbol extraction in the parser for CREATE TABLE, CREATE INDEX, and CREATE FUNCTION.
  • Vendor the tree-sitter-sql grammar internally (including generated parser.c + external scanner) and update tests/deps to reflect new supported languages.

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| walker/walker_feature_test.go | Updates supported-language walking expectations now that JSON/TOML are parseable. |
| parser/parser.go | Routes SQL files to a new SQL classifier for symbol extraction. |
| lang/registry.go | Adds tree-sitter language registrations for SQL/JSON/TOML/Haskell/OCaml(+interface). |
| lang/lang_test.go | Updates extension mapping expectations and Supported() coverage for newly-parseable languages. |
| internal/tsgrammars/tree-sitter-sql/UPSTREAM.txt | Records upstream source/version and why SQL is vendored. |
| internal/tsgrammars/tree-sitter-sql/src/tree_sitter/parser.h | Vendored tree-sitter runtime header required by generated parser/scanner. |
| internal/tsgrammars/tree-sitter-sql/src/tree_sitter/array.h | Vendored tree-sitter runtime header required by generated parser/scanner. |
| internal/tsgrammars/tree-sitter-sql/src/tree_sitter/alloc.h | Vendored tree-sitter runtime header required by generated parser/scanner. |
| internal/tsgrammars/tree-sitter-sql/src/scanner.c | Adds SQL external scanner implementation (dollar-quoted strings). |
| internal/tsgrammars/tree-sitter-sql/LICENSE | Includes upstream MIT license text for vendored grammar. |
| internal/tsgrammars/tree-sitter-sql/bindings/go/binding.go | Adds cgo binding exposing Language() for the vendored SQL grammar. |
| go.mod | Adds module requirements for new external grammars (toml, haskell, json, ocaml). |
| go.sum | Updates sums for newly added/updated dependencies. |

Comment thread parser/parser.go
Comment on lines +902 to +903
    case "sql":
        return e.classifySQL(nodeType, node)

Author

Added TestFeatureSQLDDL in bcf7db1. Covers CREATE TABLE (table symbol), CREATE INDEX (index symbol), and CREATE FUNCTION (function symbol), including language field assertion on each.

Comment on lines +173 to +182
  if (state->start_tag != NULL) {
    free(state->start_tag);
    state->start_tag = NULL;
  }
  return tag_length;
}

void tree_sitter_sql_external_scanner_deserialize(void *payload, const char *buffer, unsigned length) {
  LexerState *state = (LexerState *)payload;
  state->start_tag = NULL;

Author

Fixed in bcf7db1. The serialize callback no longer frees or clears start_tag after copying it into the buffer - it now returns immediately after memcpy. Deserialize now frees any existing start_tag before overwriting it.

Comment on lines +180 to +188
void tree_sitter_sql_external_scanner_deserialize(void *payload, const char *buffer, unsigned length) {
  LexerState *state = (LexerState *)payload;
  state->start_tag = NULL;
  // A length of 1 can't exists.
  if (length > 1) {
    state->start_tag = malloc(length);
    memcpy(state->start_tag, buffer, length);
  }
}

Author

Fixed in bcf7db1 (same commit as serialize fix). deserialize now frees any existing start_tag before assigning NULL or the new malloc'd buffer.
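The contract behind both fixes can be modeled in Go. This is only an analogy under stated assumptions: the vendored scanner is C, the names `scannerState`, `serialize`, and `deserialize` are hypothetical, and Go's garbage collector stands in for the explicit free calls. The point it illustrates is that serialize must be a pure copy, while deserialize must fully replace whatever state was there before.

```go
package main

import "fmt"

// scannerState models the scanner payload: the tag of the currently open
// dollar-quoted string, or nil when none is open.
type scannerState struct {
	startTag []byte
}

// serialize copies the tag into buf and returns the number of bytes written.
// It deliberately leaves s untouched, so the live state is still valid if the
// parser keeps scanning from this position.
func (s *scannerState) serialize(buf []byte) int {
	return copy(buf, s.startTag)
}

// deserialize replaces the state with the contents of buf. The old tag is
// dropped first (the C version frees it) so nothing from a previous state
// survives the restore.
func (s *scannerState) deserialize(buf []byte) {
	s.startTag = nil
	if len(buf) > 0 {
		s.startTag = append([]byte(nil), buf...)
	}
}

func main() {
	live := &scannerState{startTag: []byte("$fn$")}
	buf := make([]byte, 16)
	n := live.serialize(buf)
	fmt.Printf("after serialize: live tag=%q, wrote %d bytes\n", live.startTag, n)

	restored := &scannerState{startTag: []byte("$old$")}
	restored.deserialize(buf[:n])
	fmt.Printf("after deserialize: restored tag=%q\n", restored.startTag)
}
```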

void tree_sitter_sql_external_scanner_deserialize(void *payload, const char *buffer, unsigned length) {
  LexerState *state = (LexerState *)payload;
  state->start_tag = NULL;
  // A length of 1 can't exists.

Author

Fixed in bcf7db1. Comment now reads "can't exist".

serialize was mutating state after copying (freeing start_tag), breaking
parse-stack backtracking. deserialize was leaking any existing start_tag
before replacing it. Fix both and correct a grammar error in a comment.

Vendor opa-oz/tree-sitter-nginx at commit 47ade644 under
internal/tsgrammars/tree-sitter-nginx with a hand-written CGO binding
mirroring the SQL grammar pattern. Register it in lang.Default with
Filenames: ["nginx.conf"] and Extensions: [".nginx"] to avoid claiming
the ambiguous .conf extension.
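A rough sketch of why registering by filename avoids the .conf ambiguity. The `Language` struct, `registry` slice, and `detect` function are hypothetical stand-ins; only the field values for nginx come from this commit message, and cymbal's real lookup in lang.Default may resolve matches in a different order.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Language is a hypothetical registry entry.
type Language struct {
	Name       string
	Filenames  []string
	Extensions []string
}

var registry = []Language{
	{Name: "nginx", Filenames: []string{"nginx.conf"}, Extensions: []string{".nginx"}},
	{Name: "ocaml", Extensions: []string{".ml"}},
	{Name: "ocaml_interface", Extensions: []string{".mli"}},
}

// detect returns the language name for path, trying exact base-name matches
// before extensions. It returns "" when no entry claims the file.
func detect(path string) string {
	base := filepath.Base(path)
	for _, l := range registry {
		for _, f := range l.Filenames {
			if f == base {
				return l.Name
			}
		}
	}
	ext := strings.ToLower(filepath.Ext(base))
	for _, l := range registry {
		for _, e := range l.Extensions {
			if e == ext {
				return l.Name
			}
		}
	}
	return ""
}

func main() {
	for _, p := range []string{"/etc/nginx/nginx.conf", "site.nginx", "app.conf", "lib.mli"} {
		fmt.Printf("%s -> %q\n", p, detect(p))
	}
}
```

Because no entry lists `.conf` as an extension, a generic file such as `app.conf` stays unclaimed while `nginx.conf` is still picked up by name.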

Upstream Go bindings were not usable directly: the upstream go.mod
declares the wrong module path and uses the smacker API instead of
github.com/tree-sitter/go-tree-sitter.

Also add a TODO comment for future make/cmake grammar additions above
the recognition-only language block.

Without language-specific classifiers, Haskell and OCaml files indexed
as zero symbols despite having valid tree-sitter grammars — classifyGeneric
only matches function_definition/class_definition which neither grammar uses.

classifyHaskell: function node → function (variable child), data_type → type
classifyOCaml: value_definition → function (let_binding/value_name),
  type_definition → type (type_binding/type_constructor),
  module_definition → module (module_binding/module_name)
classifyNginx: upstream attribute → upstream (value child after keyword),
  location node → location (location_route child)

Adds TestFeatureHaskellFunctions, TestFeatureOCamlFunctions,
TestFeatureNginxSymbols to parser_feature_test.go.

withTestWorkingDir only overrode XDG_CONFIG_HOME and APPDATA, but
os.UserConfigDir on darwin ignores XDG_CONFIG_HOME and resolves to
$HOME/Library/Application Support. As a result, the user-scope branch
of opencodePluginPath leaked into the real user's config dir, so any
project-scope install test failed when a managed plugin already
existed at ~/Library/Application Support/opencode/plugins.

Override HOME (and USERPROFILE) to the same tempdir so all scope
lookups stay inside the sandbox on every platform.

mattn/go-sqlite3 exposes FTS5 via the canonical `sqlite_fts5` build tag,
which is more idiomatic and discoverable than threading
`CGO_CFLAGS=-DSQLITE_ENABLE_FTS5` through every Make target.

No behavior change: `make build`, `make build-check`, `make install`,
`make test`, and `make test-coverage` all continue to compile FTS5 into
the sqlite driver. Bare `go test ./...` still requires the tag (CGO C
deps cannot be configured from in-tree Go build tags) -- the canonical
invocation remains `make test`.