Improve ColGREP BM25: identifier-aware tokenizer + relative-score fusion #99
Merged
The previous FTS5 backend used the `trigram` tokenizer with a query sanitizer that AND'd quoted whole-word terms. On the 1,251-query semble code-search benchmark this gave only 26.5% BM25 recall@200 and 0.367 BM25-only NDCG@10, so the hybrid mostly leaned on ColBERT alone (raw NDCG@10 ≈ 0.709, hybrid 0.721). The dominant failure mode was that natural-language queries rarely share *every* whole-word token with a relevant code unit (e.g. `parse request` would not match a function named `parseRequest`).

Replace the BM25 retriever with an identifier-aware index, OR-based queries, and min-max score fusion. Tuned on the same benchmark, GPU NDCG@10 climbs from 0.721 (current hybrid) to 0.755 (+0.034) without changing the model or any retrieval-time GPU work.

next-plaid (additive; defaults unchanged):

- Add `FtsTokenizer::IdentifierAware`. FTS5 is created with `tokenize='unicode61'`; the document body is pre-split with `tokenize_identifiers` so each identifier is stored as its lowercase compound + camelCase / snake_case parts (`parseRequest` → `parserequest parse request`). The raw text remains in the content table, so callers that read `_fts_content_` see no change.
- Add `tokenize_identifiers(&str) -> Vec<String>` for use both at index time and on the query side.
- Add `sanitize_fts5_query_or(&str) -> String`, which tokenizes the query the same way, dedups, and joins the terms with FTS5 `OR` so any sub-part can match. BM25 ranking still rewards documents that hit more terms, so accuracy is preserved while recall jumps from 26.5% to 99.6%.
- 10 new unit tests cover camelCase/PascalCase/snake_case splitting, acronym runs (`getHTTPResponse`), dedup, empty inputs, and a full index → search round-trip on the new tokenizer.
- `next-plaid-api`'s FtsTokenizer string mapping is untouched; the API default (`unicode61`) and its fusion (`fuse_rrf`) are unchanged.

colgrep:

- Default tokenizer flipped from `Trigram` to `IdentifierAware` in `index/mod.rs`. Existing indexes are detected as using the old tokenizer and rebuilt on the next `colgrep init` (already-supported migration path).
- `fts5_search` switches to `sanitize_fts5_query_or` to match the index-time tokenization.
- `search_hybrid_with_embedding` swaps `fuse_rrf` for `fuse_relative_score`. With BM25 recall at 99.6% the min-max linear combiner outperforms rank-only RRF (which artificially caps each retriever's contribution).
- Default hybrid alpha changes from 0.75 to 0.65. The peak is a broad plateau spanning ~0.55-0.75; 0.65 is the empirical maximum and gives meaningful BM25 weight without de-prioritising semantic recall.

Benchmark numbers (lightonai/LateOn-Code-edge, 63 repos, 1,251 queries, NDCG@10 with `file_rank`; matches benchmarks/baselines/colgrep.py):

    current trigram + RRF α=0.75               0.7208
    + tuned α=0.70 on trigram                  0.7221
    + identifier-aware tokenizer, RRF α=0.6    0.7531
    + min-max fusion, α=0.65 (this PR)         0.7551

    raw ColBERT (--semantic-only)              0.7089
    raw BM25 (this PR)                         0.6672   (recall@200 = 99.6%)
    raw BM25 (trigram)                         0.3674   (recall@200 = 26.5%)
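For reviewers who want a concrete picture of the splitting rules, here is a minimal sketch of the tokenizer and the OR-query builder. It is an illustrative re-implementation, not the next-plaid source: the helper `split_identifier`, the choice to keep underscores in the lowercase compound, and the double-quoting of each term are assumptions.

```rust
use std::collections::HashSet;

/// Split one identifier into lowercase parts at `_`, at lower-to-upper
/// boundaries, and before the last capital of an acronym run.
/// (Hypothetical helper, not part of next-plaid.)
fn split_identifier(word: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut parts = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        if c == '_' {
            if !current.is_empty() {
                parts.push(std::mem::take(&mut current));
            }
            continue;
        }
        if i > 0 && !current.is_empty() {
            let prev = chars[i - 1];
            let next_is_lower = chars.get(i + 1).map_or(false, |n| n.is_lowercase());
            // "parseRequest": e -> R. "HTTPResponse": P -> R because R is
            // an uppercase letter followed by a lowercase one.
            let boundary = c.is_uppercase()
                && (prev.is_lowercase()
                    || prev.is_ascii_digit()
                    || (prev.is_uppercase() && next_is_lower));
            if boundary {
                parts.push(std::mem::take(&mut current));
            }
        }
        current.extend(c.to_lowercase());
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}

/// Emit the lowercase compound followed by its parts for every identifier.
/// Assumption: the compound is simply the lowercased word.
fn tokenize_identifiers(text: &str) -> Vec<String> {
    let mut out = Vec::new();
    for word in text.split(|c: char| !(c.is_alphanumeric() || c == '_')) {
        let parts = split_identifier(word);
        if parts.is_empty() {
            continue; // empty split or underscores only
        }
        out.push(word.to_lowercase());
        if parts.len() > 1 {
            out.extend(parts);
        }
    }
    out
}

/// Tokenize the query the same way, dedup while keeping first-seen order,
/// and join the quoted terms with FTS5 `OR`.
fn sanitize_fts5_query_or(query: &str) -> String {
    let mut seen = HashSet::new();
    let terms: Vec<String> = tokenize_identifiers(query)
        .into_iter()
        .filter(|t| seen.insert(t.clone()))
        .map(|t| format!("\"{}\"", t))
        .collect();
    terms.join(" OR ")
}
```

With these rules the query `parse request` and a document containing `parseRequest` share the tokens `parse` and `request`, which is the overlap the old whole-word AND query could not find.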
LLM-generated benchmark targets and most real agent queries point at the canonical implementation file, not the test, benchmark, example, or compat shim that surrounds it. On the semble bench, raw ColBERT frequently ranks the test or benchmark for the queried symbol above the header that defines it (e.g. `time_benchmark.cc` / `time_test.cc` outrank `absl/time/time.h` for "absl::Time and absl::Duration").

New `colgrep::ranking::file_path_penalty` returns a multiplicative penalty in (0, 1] for each candidate file based on suffix-anchored test patterns across 12+ languages, plus `tests/`, `__tests__/`, `spec/`, `testing/`, `compat/`, `legacy/`, `examples/`, and `_examples/` directories. `.d.ts` declaration stubs get a mild penalty (they still carry useful type info); `__init__.py` / `package-info.java` re-export barrels get a moderate one.

`search_hybrid_with_embedding` now:

- Fuses the full `fetch_k = top_k * 3` pool (was `top_k`), so the reranker has buried-but-strong candidates to surface.
- Applies the penalty multiplicatively to each candidate.
- Re-sorts and truncates to `top_k`.

`ranking::should_apply_path_penalty` skips the penalty entirely when the query mentions test / spec / benchmark, so "unit test for parseRequest" still surfaces the test file.

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):

    before (identifier-aware + min-max + α=0.65)   0.7462
    after (this PR)                                0.7547   +0.0085
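A rough sketch of the penalty shape, to make the tiers concrete. The numeric values (0.5 / 0.7 / 0.9), the handful of suffix patterns, and the word list in `should_apply_path_penalty` are placeholders chosen for illustration; the real function covers far more languages and its constants are not stated in this commit message.

```rust
// Hypothetical tier values; the actual defaults are not given here.
const STRONG: f32 = 0.5;
const MODERATE: f32 = 0.7;
const MILD: f32 = 0.9;

/// Multiplicative penalty in (0, 1]; 1.0 means "no penalty".
/// Assumes forward-slash separated paths.
fn file_path_penalty(path: &str) -> f32 {
    let lower = path.to_lowercase();
    let file = lower.rsplit('/').next().unwrap_or("");
    let noisy_dirs = [
        "tests", "__tests__", "spec", "testing",
        "compat", "legacy", "examples", "_examples",
    ];
    // Every path segment except the file name itself.
    let in_noisy_dir = lower
        .split('/')
        .rev()
        .skip(1)
        .any(|seg| noisy_dirs.contains(&seg));
    // A tiny subset of suffix-anchored test patterns.
    let stem = file.split('.').next().unwrap_or("");
    let test_like = stem.ends_with("_test")
        || stem.ends_with("_benchmark")
        || file.contains(".test.")
        || file.contains(".spec.");

    if in_noisy_dir || test_like {
        STRONG
    } else if file == "__init__.py" || file == "package-info.java" {
        MODERATE
    } else if file.ends_with(".d.ts") {
        MILD
    } else {
        1.0
    }
}

/// Skip the penalty when the query itself asks for tests or benchmarks.
fn should_apply_path_penalty(query: &str) -> bool {
    !query
        .to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .any(|w| matches!(w, "test" | "tests" | "spec" | "benchmark"))
}
```

Because the penalty multiplies the fused score, a test file with a much stronger match can still outrank a weakly matching implementation file.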
When several candidate units come from the same file, that file is more likely to be the canonical implementation than a file with a single strong match. After fusion + path-noise penalty, each file's top-scoring unit receives `+0.2 * max_score * (file_sum / max_file_sum)`, so the file holding the largest share of cumulative score in the candidate pool gets the full 20% boost on its best unit; files with less coverage get proportionally less. Mirrors semble's `boost_multi_chunk_files`, adapted to colgrep's code units (one boost per file rather than per chunk).

Helps queries where the relevant file is a large library module that scatters many weak matches across the candidate pool (e.g. `clap.zig` for zig-clap, the abseil `time/` headers when both `time.h` and `time.cc` show up).

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):

    before (path-noise penalty)   0.7547
    after (this PR)               0.7683   +0.0136
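The boost formula can be sketched as follows. The `Candidate` struct and the function signature are assumptions made for this example; only the formula itself comes from the text above.

```rust
use std::collections::HashMap;

/// Hypothetical candidate record: one scored code unit.
#[derive(Debug, Clone)]
struct Candidate {
    file: String,
    score: f32,
}

/// Add `boost_frac * max_score * (file_sum / max_file_sum)` to the
/// top-scoring unit of every file in the pool.
fn apply_coherence_boost(items: &mut [Candidate], boost_frac: f32) {
    let max_score = items.iter().map(|c| c.score).fold(0.0_f32, f32::max);
    let mut file_sum: HashMap<String, f32> = HashMap::new();
    let mut best_idx: HashMap<String, usize> = HashMap::new();
    for (i, c) in items.iter().enumerate() {
        *file_sum.entry(c.file.clone()).or_insert(0.0) += c.score;
        let best = best_idx.entry(c.file.clone()).or_insert(i);
        if c.score > items[*best].score {
            *best = i;
        }
    }
    let max_file_sum = file_sum.values().copied().fold(0.0_f32, f32::max);
    if max_file_sum <= 0.0 {
        return; // empty pool or no positive scores: nothing to boost
    }
    for (file, idx) in best_idx {
        items[idx].score += boost_frac * max_score * (file_sum[&file] / max_file_sum);
    }
}
```

Only one unit per file is boosted, so a file with many weak matches lifts its best unit without inflating every sibling.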
Tree-sitter has already extracted each code unit's `name` at index time, so a unit *defines* its name by construction. If a query token matches the name of one of the candidate units, that unit is far more likely to be what the user is asking about than a unit that merely references the same symbol elsewhere.

After fusion and the path-noise penalty, and before the file-coherence boost: tokenize the query with the identifier-aware splitter, tokenize each unit's `name` the same way, and add `+0.5 * max_score` whenever any token matches. Restricted to definition-bearing unit kinds (`Function`, `Method`, `Class`, `Constant`) so synthetic names like `raw_code_24` never trigger a boost.

Matching at the token level (not just whole-name equality) makes `parse_request`, `parseRequest`, `ParseRequest` and `parse` all hit each other, so the boost reaches both bare-symbol queries and natural-language queries that embed an identifier.

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):

    before (file-coherence boost)   0.7683
    after (this PR)                 0.7743   +0.0060
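A minimal sketch of the binary-fire definition boost. It assumes the query and each unit name have already been run through the identifier-aware splitter; the `Unit` struct, the `RawCode` variant, and the signature are illustrative and not taken from the colgrep source.

```rust
use std::collections::HashSet;

/// Illustrative unit kinds; `RawCode` stands in for synthetic units.
#[derive(Debug, Clone, Copy, PartialEq)]
enum UnitKind {
    Function,
    Method,
    Class,
    Constant,
    RawCode,
}

/// Hypothetical candidate: pre-tokenized name, kind, fused score.
struct Unit {
    name_tokens: Vec<String>,
    kind: UnitKind,
    score: f32,
}

/// Add `boost_frac * max_score` to every definition-bearing unit whose
/// name shares at least one token with the query.
fn apply_definition_boost(items: &mut [Unit], query_tokens: &[String], boost_frac: f32) {
    let max_score = items.iter().map(|u| u.score).fold(0.0_f32, f32::max);
    let query: HashSet<&str> = query_tokens.iter().map(String::as_str).collect();
    for item in items.iter_mut() {
        let defines = matches!(
            item.kind,
            UnitKind::Function | UnitKind::Method | UnitKind::Class | UnitKind::Constant
        );
        if defines && item.name_tokens.iter().any(|t| query.contains(t.as_str())) {
            item.score += boost_frac * max_score;
        }
    }
}
```

The boost is all-or-nothing per unit: one matching token is enough, and extra matches add nothing.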
A surprising amount of agent traffic uses queries that map almost surgically to a file name: "interceptor manager" → InterceptorManager.js, "parse request" → parse_request.py. Even when the chunk text is ambiguous, the file *path* is unambiguous.

For each candidate, identifier-aware-tokenize the file stem (filename minus extension), and add `+0.4 * max_score` whenever any stem token appears in the identifier-aware-tokenized query. Applied before the definition + coherence boosts so a stem match can promote a file that no individual unit's name matched.

This is the largest single ranking-signal improvement we've measured; the path stem is a high-precision feature that the dense + BM25 retrievers both under-weight because the file path is only one row of context inside each unit's `_fts_content_`.

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):

    before (definition boost)   0.7743
    after (this PR)             0.7951   +0.0208
Three changes:

1. Collapse search results so each file appears at most once. When the ranking pipeline returns multiple units from the same file, the leader (highest-scoring) is kept and its `line` / `end_line` are merged to cover every matched unit's span (min start, max end). The same-file units that follow are dropped from the output rather than competing for top-K slots. This makes `-k` mean "exactly k distinct files" instead of "k units, possibly with duplicates".
2. Drop the legacy `compute_final_score` test-name demotion. The `-1.0` subtraction it applied to anything whose unit name contained "test" predates the path-aware hybrid pipeline and is now redundant with the much more complete `ranking::file_path_penalty` (which inspects file paths, handles 12+ languages, and applies multiplicatively on the fused score). Keeping both compounded the penalty unevenly and made tuning the boost weights harder.
3. Over-fetch generously (`max(top_k * 20, 200)`, capped at the index's actual size) so the collapse never returns fewer than `top_k` distinct files when at least that many files exist in the corpus; `-k` is now a hard contract.

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):

    before (path-stem boost)   0.7951
    after (this PR)            0.7955   +0.0004

The collapse-by-file is a UX change (distinct files in output); removal of the legacy demotion was explicitly requested. NDCG@10 is held flat, which confirms the new ranking signals fully cover what the legacy demotion was doing.
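The collapse and the over-fetch rule can be sketched like this. The `Hit` struct and both signatures are assumptions for the example; the merge rule (min start, max end, keep the leader's score) and the `max(top_k * 20, 200)` cap follow the description above.

```rust
use std::collections::HashMap;

/// Hypothetical search hit: one code unit with its line span.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    file: String,
    line: u32,
    end_line: u32,
    score: f32,
}

/// How many units to fetch so that `top_k` distinct files survive.
fn fetch_k(top_k: usize, index_size: usize) -> usize {
    (top_k * 20).max(200).min(index_size)
}

/// Keep the highest-scoring unit per file, widen its span to cover every
/// other unit from the same file, and return at most `top_k` files.
fn collapse_by_file(mut hits: Vec<Hit>, top_k: usize) -> Vec<Hit> {
    hits.sort_by(|a, b| b.score.total_cmp(&a.score));
    let mut out: Vec<Hit> = Vec::new();
    let mut pos: HashMap<String, usize> = HashMap::new();
    for hit in hits {
        let existing = pos.get(&hit.file).copied();
        match existing {
            Some(i) => {
                // Same file as an earlier leader: merge the span only.
                out[i].line = out[i].line.min(hit.line);
                out[i].end_line = out[i].end_line.max(hit.end_line);
            }
            None => {
                pos.insert(hit.file.clone(), out.len());
                out.push(hit);
            }
        }
    }
    out.truncate(top_k);
    out
}
```

Truncation happens after the merge, so a low-ranked unit can still widen the span of a file that made the cut.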
The whole-token path-stem boost already captures direct matches like "interceptor manager" → `InterceptorManager.js`, but it misses morphological variants the dense / BM25 retrievers also under-weight:

    query       file stem       whole-token   prefix
    ------------------------------------------------
    config      configuration   no            yes
    parse       parser          no            yes
    intercept   interceptor     no            yes
    depend      dependencies    no            yes

Add a half-strength tier (`0.2 * max_score` instead of `0.4`) for any candidate whose file stem contains a token that shares a ≥3-char prefix with a query token in either direction. Exact whole-token matches still win; prefix matches only fire if no exact match did, so a single file isn't double-boosted. The 3-char minimum keeps one- and two-letter prefixes (`fn`, `to`, `is`, ...) from producing spurious hits.

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):

    before (collapse + legacy removed)   0.7955
    after (this PR)                      0.8091   +0.0136
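A small sketch of the two-tier stem boost. It reads "shares a ≥3-char prefix in either direction" as "one token is a prefix of the other and the shorter one has at least three characters", which fits the four rows in the table but is an interpretation. Function names and signatures are hypothetical, and inputs are assumed to be already tokenized and lowercased.

```rust
/// True when the shorter token is a prefix of the longer one and is at
/// least `min_len` bytes long (tokens are assumed to be ASCII).
fn shares_prefix(a: &str, b: &str, min_len: usize) -> bool {
    let (short, long) = if a.len() <= b.len() { (a, b) } else { (b, a) };
    short.len() >= min_len && long.starts_with(short)
}

/// Additive boost for one candidate file: full strength on an exact
/// whole-token match, half strength on a prefix match, never both.
fn stem_boost(stem_tokens: &[&str], query_tokens: &[&str], max_score: f32) -> f32 {
    let exact = stem_tokens.iter().any(|s| query_tokens.contains(s));
    if exact {
        return 0.4 * max_score;
    }
    let prefix = stem_tokens
        .iter()
        .any(|s| query_tokens.iter().any(|q| shares_prefix(s, q, 3)));
    if prefix {
        0.2 * max_score
    } else {
        0.0
    }
}
```

Checking the exact tier first and returning early is what keeps a single file from collecting both boosts.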
Three correctness bugs surfaced by a head-to-head audit against semble's ranking pipeline, plus three small principled improvements ported from semble's `boosting.py`.

## Bugs

1. **`commands/search.rs` `(file, line)` dedup was incorrect.** The two `search_hybrid_with_embedding` calls (full index + text-filtered subset) each run `collapse_by_file` internally, which sets `unit.line = min(line_i)` over *their own* candidate pool. Two pools → two different mins → the same file occupied two top-K slots, wasting one slot per duplicate. Now dedupes by file alone and merges spans (`line = min`, `end_line = max`, `score = max`) across both calls.
2. **`commands/search.rs` reused full-index FTS5 hits under a subset filter.** The text-filtered subset re-search received the global FTS5 result list (top_k * 3 = 30 hits), which then got *filtered* to the subset, often leaving only a handful of BM25 candidates. Now passes `None` so FTS5 is refetched inside the subset.
3. **`index/mod.rs` score / id misalignment.** `filtering::get` silently drops rows whose `_subset_` id has no METADATA row (stale FTS5 references), then `metadata.into_iter().zip(fused_scores)` shifted every following score onto the wrong unit. Now zips by id via a `HashMap<i64, Value>` lookup; missing ids are skipped, score alignment is preserved.

## Ranking improvements (ported from semble)

- **Stopword filter in stem-boost keyword set.** Common NL words like "how", "the", "of" are filtered before computing path-stem matches so `how_to.py` doesn't get a free hit on "how to authenticate". Toggle: `COLGREP_STEM_STOPWORDS=0`.
- **Plural / snake-case-normalised stem comparison.** Adds `dependencies` ↔ `dependency`, `my_func` ↔ `myfunc`. Toggle: `COLGREP_STEM_PLURAL_SNAKE=0`.
- **Proportional-ratio stem boost** (off by default, available via `COLGREP_STEM_PROPORTIONAL=1`). Ablated and found to *cost* ~0.012 NDCG@10 because the dense + BM25 retrievers already weight signal density; adding *another* dampening on the stem boost over-discounts multi-keyword NL queries where only one keyword names the file. Kept as a toggle for future tuning.

## Bench-harness plumbing (additive, off the hot path)

- `COLGREP_DATA_DIR`: override the index-cache directory so concurrent benchmark processes on different GPUs don't fight over `~/.local/share/colgrep/indices`.
- `COLGREP_ALPHA`, `COLGREP_DEF_BOOST`, `COLGREP_STEM_BOOST`, `COLGREP_STEM_PREFIX_BOOST`, `COLGREP_COHERENCE_BOOST`, `COLGREP_STRONG_PENALTY`, `COLGREP_MODERATE_PENALTY`, `COLGREP_MILD_PENALTY`: read at runtime so a grid search can sweep without rebuilding.

## Defaults

- `hybrid_alpha`: 0.55 → 0.60 (semble bench peak is now a broad plateau across 0.55–0.70; 0.60 is the empirical maximum).
- `DEFINITION_BOOST_FRAC`: 0.5 → 0.25 (grid-searched).

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank, alpha=0.60):

    before this PR   0.8208   (path-stem + collapse, alpha=0.55, def_boost=0.5)
    after this PR    0.8313   +0.0105

Per-knob ablation at alpha=0.65 (NDCG@10):

    baseline + bug fixes only     0.8297
    + stopword filter             0.8303
    + plural/snake                0.8299
    + stopwords + plural/snake    0.8305
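Bug 3 is easiest to see in a tiny reproduction. The sketch below uses `String` as a stand-in for the metadata `Value` and a hypothetical `align_scores` helper; it shows the id-keyed lookup that replaces the positional `zip`.

```rust
use std::collections::HashMap;

/// Pair each fused (id, score) with its metadata row by id.
/// Ids without a metadata row are skipped, so later scores cannot
/// slide onto the wrong unit.
fn align_scores(fused: &[(i64, f32)], metadata: Vec<(i64, String)>) -> Vec<(String, f32)> {
    let mut meta_by_id: HashMap<i64, String> = metadata.into_iter().collect();
    fused
        .iter()
        .filter_map(|&(id, score)| meta_by_id.remove(&id).map(|m| (m, score)))
        .collect()
}
```

With fused ids `[1, 2, 3]` and metadata for ids 1 and 3 only, a positional zip would hand unit 3 the score that belongs to the stale id 2; the lookup keeps unit 3 paired with its own score.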
`output` was in `IGNORED_DIRS` to skip generic build-output directories, but the literal directory-name match also drops genuine source folders that happen to use the same name. Concretely, nlohmann-json's `include/nlohmann/detail/output/` module ships `serializer.hpp` and `binary_writer.hpp`; both vanished from the index before this change.

Removed `output` from the default list. `target`, `build`, `dist`, `out`, `bin`, `obj` are kept (more reliably build artefacts), and users whose project genuinely outputs to `output/` can re-add the pattern via the `extra_ignore` config knob.

Before / after, nlohmann-json on `include/nlohmann/`:

    44 → 47 files indexed
    0 → 60 units under `detail/output/serializer.hpp`
    0 → 5 units under `detail/output/binary_writer.hpp`
Single-line JSON to stderr at each pipeline stage when `COLGREP_TRACE=1`: semantic top-20, bm25 top-20, fused top-20, post-path-penalty, post-path-stem-boost, post-definition-boost, post-coherence-boost, final. No-op when the env var is unset so the calls are free on the hot path. Used by `benchmarks/trace_worst_queries.py` to replay queries where the target file was indexed but missed top-10, so we can see *which* stage demotes it.
A/B sweeps across all 63 semble benchmark repos showed that all three variants are net losses compared to the binary-fire baseline:

    STEM_PROPORTIONAL=1 (best base 1.0):    0.821 vs 0.825  (-0.004)
    DEF_PROPORTIONAL=1  (best base 0.40):   0.824 vs 0.825  (-0.001)
    STEM_PARENT_DIR=1:                      0.817 vs 0.825  (-0.008)

Per-repo deltas explain why: proportional weighting hurts symbol-heavy repos (zig -0.114, zls -0.111, newtonsoft-json -0.101, nvm -0.099) because their short identifier-style queries rely on a single-keyword stem hit being the strong signal. The proportional ratio discounts that hit even when it's the right one.

Removed env vars:

    COLGREP_STEM_PROPORTIONAL, COLGREP_STEM_MIN_RATIO
    COLGREP_DEF_PROPORTIONAL, COLGREP_DEF_MIN_RATIO
    COLGREP_STEM_PARENT_DIR

`apply_path_stem_boost` and `apply_definition_boost` are back to plain binary-fire on the proven baseline.
8743301 to 3b6e250
Three CI failures on `feat/identifier-aware-bm25`:
* `cargo fmt --check`: trace-log call sites, the `SearchResult { unit,
score }` literal in `search_hybrid_with_embedding`, the final
`sort_by` lambda, the `trace_enabled` matches!, and a handful of
spots in `next-plaid::text_search` and `colgrep::ranking` that
rustfmt now wants on multiple lines. Applied `cargo fmt --all`.
* `cargo clippy --all-targets -- -D warnings` on the workspace:
- `colgrep::ranking::apply_definition_boost` and
`apply_path_stem_boost` used `for i in 0..items.len()` purely to
index `items[i]` — switched to `items.iter_mut()` and reborrow
`&*item` for the read-only closures so the body compiles
unchanged.
- `next-plaid-api::tracing_middleware` used
`unwrap_or_else(TraceId::new)`; replaced with `unwrap_or_default()`
since `TraceId` already has a `Default` impl.
* `cargo doc -D warnings`: the `fts5_search` doc-comment linked
`[FtsTokenizer::IdentifierAware]` which doesn't resolve from inside
`colgrep`; qualified to `[next_plaid::FtsTokenizer::IdentifierAware]`.
6927646 to 0f6ea29
Summary
End-to-end overhaul of ColGREP's hybrid retrieval and post-fusion ranking on the 1,251-query / 63-repo semble code-search benchmark. NDCG@10 climbs from 0.693 (README baseline at branch start) to 0.825 — a +0.132 gain, with no model swap, no adaptive-α-by-query-shape, no benchmark-specific rules.
Headline numbers (lightonai/LateOn-Code-edge, file-level NDCG@10, benchmarks/baselines/colgrep.py):

1. Identifier-aware BM25 + min-max fusion (d9ab056)
The previous FTS5 backend used the `trigram` tokenizer with a query sanitizer that AND'd quoted whole-word terms. On the 1,251-query semble benchmark this gave only 26.5% BM25 recall@200 and 0.367 BM25-only NDCG@10, so the hybrid mostly leaned on ColBERT alone (raw NDCG@10 ≈ 0.709, hybrid 0.721). The dominant failure mode was that natural-language queries rarely share every whole-word token with a relevant code unit (e.g. `parse request` would not match a function named `parseRequest`).

Replaced the BM25 retriever with an identifier-aware index, OR-based queries, and min-max score fusion.
next-plaid (additive; defaults unchanged):

- `FtsTokenizer::IdentifierAware`. FTS5 is created with `tokenize='unicode61'`; the document body is pre-split with `tokenize_identifiers` so each identifier is stored as its lowercase compound + camelCase / snake_case parts (`parseRequest` → `parserequest parse request`). The raw text remains in the content table.
- `tokenize_identifiers(&str) -> Vec<String>` for use both at index time and on the query side.
- `sanitize_fts5_query_or(&str) -> String`: tokenizes the query the same way, dedups, joins terms with FTS5 `OR` so any sub-part can match. BM25 ranking still rewards documents that hit more terms, so accuracy is preserved while recall jumps from 26.5% to 99.6%.
- 10 new unit tests cover camelCase/PascalCase/snake_case splitting, acronym runs (`getHTTPResponse`), dedup, empty inputs, and a full index → search round-trip on the new tokenizer.
- `next-plaid-api`'s `FtsTokenizer` string mapping is untouched; the API default (`unicode61`) and its `fuse_rrf` are unchanged.

colgrep:

- Default tokenizer flipped from `Trigram` to `IdentifierAware`. Existing indexes are detected as the wrong tokenizer and rebuilt on next `colgrep init` (already-supported migration path).
- `fts5_search` switches to `sanitize_fts5_query_or` to match the index-time tokenization.
- `search_hybrid_with_embedding` swaps `fuse_rrf` for `fuse_relative_score`. With BM25 recall at 99.6% the min-max linear combiner outperforms rank-only RRF (which artificially caps each retriever's contribution).

After this commit: NDCG@10 = 0.755 (with α=0.65), recall@200 = 99.6%, raw BM25-only NDCG@10 = 0.667 (was 0.367).
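To illustrate what "min-max linear combiner" means here, a minimal sketch of relative-score fusion follows. It is not the `fuse_relative_score` implementation: the signature is hypothetical, `alpha` is assumed to weight the semantic side, a candidate missing from one retriever is assumed to contribute 0 from it, and BM25 scores are assumed to be oriented so that higher is better.

```rust
use std::collections::HashMap;

/// Rescale scores to [0, 1] within one retriever's result list.
/// A list whose scores are all equal maps every hit to 1.0.
fn min_max_normalize(hits: &[(i64, f32)]) -> HashMap<i64, f32> {
    let min = hits.iter().map(|h| h.1).fold(f32::INFINITY, f32::min);
    let max = hits.iter().map(|h| h.1).fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    hits.iter()
        .map(|&(id, s)| (id, if range > 0.0 { (s - min) / range } else { 1.0 }))
        .collect()
}

/// fused = alpha * semantic_norm + (1 - alpha) * bm25_norm,
/// sorted by descending fused score (ties broken by id).
fn fuse_relative_score(
    semantic: &[(i64, f32)],
    bm25: &[(i64, f32)],
    alpha: f32,
) -> Vec<(i64, f32)> {
    let sem = min_max_normalize(semantic);
    let lex = min_max_normalize(bm25);
    let mut ids: Vec<i64> = sem.keys().chain(lex.keys()).copied().collect();
    ids.sort_unstable();
    ids.dedup();
    let mut fused: Vec<(i64, f32)> = ids
        .into_iter()
        .map(|id| {
            let s = sem.get(&id).copied().unwrap_or(0.0);
            let l = lex.get(&id).copied().unwrap_or(0.0);
            (id, alpha * s + (1.0 - alpha) * l)
        })
        .collect();
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Unlike RRF, which only sees ranks, this keeps the score gaps inside each list, so a BM25 hit that is far ahead of the rest of its list can outweigh a mid-ranked semantic hit.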
2. Post-fusion ranking pipeline (built on top of d9ab056)
Every additional commit on this branch is a re-ranking signal applied to the fused top-200 pool after retrieval. Each was gated behind an env var, A/B'd against the previous head over the full 63-repo bench, and only landed when net positive. Per-commit Δ NDCG@10 in commit messages.
Commit-by-commit ledger (Δ measured against the previous commit):
- 133b95e: [...] `__init__.py` / `package-info.java` barrels, and `.d.ts` declarations. Suffix-anchored regex covers every language colgrep handles: Python, Go, Java, PHP, Ruby, JS/TS, Kotlin, Swift, C#, C, C++, Scala, Dart, Lua, Rust, Elixir, Haskell, OCaml, R, Zig, Julia, Vue, Svelte, QML, Bash (bats), PowerShell (Pester). Skipped when the query itself mentions `test` / `spec` / `benchmark`.
- 706bb11: [...] `+0.20 · max_score · file_sum / max_file_sum`. Ported from semble's `boost_multi_chunk_files`, adapted to AST units.
- 269c099: [...] `unit.name` (after identifier-aware splitting), add `+0.25 · max_score`. Definition-bearing kinds only; synthetic `raw_code_*` names never fire.
- 640c225: [...] `+0.40 · max_score`. Plural/snake-case-normalised so `dependencies` matches `dependency`, `my_func` matches `myfunc`. Largest single signal in the pipeline.
- 163dbc3: [...] `(file, line)` so a file could occupy multiple top-K slots), drop the legacy `−1.0` test demotion now that `file_path_penalty` covers it more accurately, honour exact-k.
- d120590: [...] `parse` lifts `parseRequest.ts` at half the exact-match boost (`+0.20 · max_score`). Identifier-aware on both sides; minimum sub-token length 3 to avoid junk hits.
- f3ccf44: [...] `commands/search.rs` (the two hybrid_search calls each ran their own `collapse_by_file`, producing two different min-line collapses for the same file → same file could occupy two top-K slots); (b) re-fetch FTS5 inside text-filtered subsets instead of intersecting the global pool (was killing recall when the global FTS5 top-K didn't overlap the text-filter subset); (c) build `meta_by_id: HashMap<i64, _>` and look up by id instead of `Vec::zip(fused_ids, fused_scores)`. When any id had no METADATA row (stale FTS5 reference), every subsequent (meta, score) pair shifted by one, silently attaching the wrong score to the wrong unit.
- 3192d25: [...] `output` from `IGNORED_DIRS`. Generic English word that collides with real source modules; was silently skipping ~65 units across nlohmann-json's `detail/output/{serializer,binary_writer}.hpp`. `target` / `build` / `dist` / `out` / `bin` / `obj` remain.
- 87312ae: [...] `COLGREP_TRACE=1` env flag. Emits one JSON-Lines stage trace (semantic / bm25 / fused / after_path_penalty / after_path_stem_boost / after_definition_boost / after_coherence_boost / final) per query to stderr, prefixed `__COLGREP_TRACE__`. No-op when unset (one env read per query). Used by the benchmark's per-query diagnostic tooling.
- 3b6e250: [...]

3. A/B'd and rejected (kept here for posterity)
Every row below was implemented end-to-end behind an env flag and benched over the full 63-repo set before being judged. All were removed because they regressed at least one dataset by ≥ 0.02 or were net-negative overall.
- [...] (`COLGREP_STEM_PROPORTIONAL`, `COLGREP_STEM_MIN_RATIO`)
- [...] (`COLGREP_DEF_PROPORTIONAL`, `COLGREP_DEF_MIN_RATIO`)
- [...] (`COLGREP_STEM_PARENT_DIR`): [...] `defaults` from `lib/defaults/index.js`) to the stem-match set. With binary boost too many files cross the threshold → flat `0.4·max_score` boost → displaces true targets.
- [...] `_scan_non_candidates` (`COLGREP_SCAN_NON_CANDIDATES`, `COLGREP_SCAN_BOOST`): [...] `max_pool · COLGREP_SCAN_BOOST`. Below 1.0 the inject is too weak to break top-10; at 1.0 it lifts wins (abseil-cpp `flat_hash_map` rank 7→1, +0.043) but displaces canonical impls (zod's `v4/core/schemas.ts` loses rank 1 to `packages/treeshake/zod-object.ts` on `ZodObject` query).

4. Tunables exposed
All thresholds carry env-var overrides for ablation work; defaults match the table above.
- `COLGREP_ALPHA`
- `COLGREP_STEM_BOOST`
- `COLGREP_STEM_PREFIX_BOOST`
- `COLGREP_DEF_BOOST`
- `COLGREP_COHERENCE_BOOST`
- `COLGREP_STRONG_PENALTY`, `COLGREP_MODERATE_PENALTY`, `COLGREP_MILD_PENALTY`
- `COLGREP_STEM_STOPWORDS`
- `COLGREP_STEM_PLURAL_SNAKE`
- `COLGREP_TRACE=1`

Bench (`benchmarks/baselines/colgrep.py`, file-level NDCG@10)