Improve ColGREP BM25: identifier-aware tokenizer + relative-score fusion by raphaelsty · Pull Request #99 · lightonai/next-plaid

raphaelsty · 2026-05-18T12:08:38Z

Summary

End-to-end overhaul of ColGREP's hybrid retrieval and post-fusion ranking on the 1,251-query / 63-repo semble code-search benchmark. NDCG@10 climbs from 0.693 (README baseline at branch start) to 0.825 — a +0.132 gain, with no model swap, no adaptive-α-by-query-shape, no benchmark-specific rules.

Headline numbers (lightonai/LateOn-Code-edge, file-level NDCG@10, benchmarks/baselines/colgrep.py):

Configuration	NDCG@10
README baseline (buggy GPU batching)	0.693
Identifier-aware BM25 + min-max fusion (`d9ab056`)	0.755
+ post-fusion ranking pipeline (this PR head)	0.825
semble reference	0.852

1 — Identifier-aware BM25 + min-max fusion (`d9ab056`)

The previous FTS5 backend used the trigram tokenizer with a query sanitizer that AND'd quoted whole-word terms. On the 1,251-query semble benchmark this gave only 26.5 % BM25 recall@200 and 0.367 BM25-only NDCG@10, so the hybrid mostly leaned on ColBERT alone (raw NDCG@10 ≈ 0.709, hybrid 0.721). The dominant failure mode was that natural-language queries rarely share every whole-word token with a relevant code unit (e.g. parse request would not match a function named parseRequest).

Replaced the BM25 retriever with an identifier-aware index, OR-based queries, and min-max score fusion.

next-plaid (additive — defaults unchanged):

New FtsTokenizer::IdentifierAware. FTS5 is created with tokenize='unicode61'; the document body is pre-split with tokenize_identifiers so each identifier is stored as its lowercase compound + camelCase / snake_case parts (parseRequest → parserequest parse request). The raw text remains in the content table.
New tokenize_identifiers(&str) -> Vec<String> for use both at index time and on the query side.
New sanitize_fts5_query_or(&str) -> String — tokenizes the query the same way, dedups, joins terms with FTS5 OR so any sub-part can match. BM25 ranking still rewards documents that hit more terms, so accuracy is preserved while recall jumps from 26.5 % to 99.6 %.
10 new unit tests covering camelCase / PascalCase / snake_case splitting, acronym runs (getHTTPResponse), dedup, empty inputs, and a full index → search round-trip on the new tokenizer.
next-plaid-api's FtsTokenizer string mapping is untouched; the API default (unicode61) and its fuse_rrf are unchanged.
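
The splitting and OR-joining described above can be sketched as follows. This is a minimal illustration, not the code in next-plaid: the names `split_parts`, `tokenize_identifiers_sketch` and `or_query_sketch` are hypothetical, and the choices to emit the compound only for multi-part identifiers, to strip underscores from it, and to double-quote each query term are assumptions.

```rust
use std::collections::HashSet;

/// Split one identifier at snake_case, camelCase and acronym boundaries,
/// lowercasing as we go (getHTTPResponse -> get, http, response).
fn split_parts(ident: &str) -> Vec<String> {
    let chars: Vec<char> = ident.chars().collect();
    let mut parts: Vec<String> = Vec::new();
    let mut cur = String::new();
    for (i, &c) in chars.iter().enumerate() {
        if c == '_' {
            if !cur.is_empty() {
                parts.push(std::mem::take(&mut cur));
            }
            continue;
        }
        let prev_lower = i > 0 && chars[i - 1].is_lowercase();
        let prev_upper = i > 0 && chars[i - 1].is_uppercase();
        let next_lower = chars.get(i + 1).map_or(false, |n| n.is_lowercase());
        // Boundary on lower->Upper, or on the last capital of an acronym run.
        let boundary = c.is_uppercase() && (prev_lower || (prev_upper && next_lower));
        if boundary && !cur.is_empty() {
            parts.push(std::mem::take(&mut cur));
        }
        cur.extend(c.to_lowercase());
    }
    if !cur.is_empty() {
        parts.push(cur);
    }
    parts
}

/// Emit the lowercase compound followed by its parts for every word in `text`.
/// Assumption: the compound is only emitted when the word has several parts.
fn tokenize_identifiers_sketch(text: &str) -> Vec<String> {
    let mut out = Vec::new();
    for word in text.split(|c: char| !(c.is_alphanumeric() || c == '_')) {
        let parts = split_parts(word);
        if parts.is_empty() {
            continue;
        }
        if parts.len() > 1 {
            out.push(parts.concat());
        }
        out.extend(parts);
    }
    out
}

/// Tokenize the query the same way, dedup in first-seen order, join with OR.
/// Assumption: terms are double-quoted so FTS5 treats them as plain strings.
fn or_query_sketch(query: &str) -> String {
    let mut seen = HashSet::new();
    let terms: Vec<String> = tokenize_identifiers_sketch(query)
        .into_iter()
        .filter(|t| seen.insert(t.clone()))
        .map(|t| format!("\"{t}\""))
        .collect();
    terms.join(" OR ")
}
```

Under these assumptions the query `parse request` and the indexed identifier `parseRequest` share the terms `parse` and `request`, which is the overlap the old whole-word AND query could not find.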

colgrep:

Default tokenizer flipped from Trigram to IdentifierAware. Existing indexes are detected as the wrong tokenizer and rebuilt on next colgrep init (already-supported migration path).
fts5_search switches to sanitize_fts5_query_or to match the index-time tokenization.
search_hybrid_with_embedding swaps fuse_rrf for fuse_relative_score. With BM25 recall at 99.6 % the min-max linear combiner outperforms rank-only RRF (which artificially caps each retriever's contribution).
Default hybrid α changes from 0.75 to 0.65. The peak is a broad plateau spanning ~0.55–0.75; 0.65 is the empirical maximum and gives meaningful BM25 weight without de-prioritising semantic recall.
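
For readers unfamiliar with relative-score fusion, here is a minimal sketch of a min-max linear combiner. It is not the body of `fuse_relative_score`; the function names are hypothetical, and several behaviours are assumptions: both inputs are "higher is better", a document missing from one list contributes 0 from that retriever, and a list whose scores are all equal normalises to 1.0.

```rust
use std::collections::HashMap;

/// Min-max normalise one retriever's scores into [0, 1].
fn min_max(scores: &[(i64, f32)]) -> HashMap<i64, f32> {
    let min = scores.iter().map(|&(_, s)| s).fold(f32::INFINITY, f32::min);
    let max = scores.iter().map(|&(_, s)| s).fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    scores
        .iter()
        .map(|&(id, s)| {
            // Assumption: a degenerate (constant) list maps every score to 1.0.
            let n = if range > 0.0 { (s - min) / range } else { 1.0 };
            (id, n)
        })
        .collect()
}

/// fused = alpha * semantic_norm + (1 - alpha) * bm25_norm, sorted descending.
fn fuse_relative_score_sketch(
    semantic: &[(i64, f32)],
    bm25: &[(i64, f32)],
    alpha: f32,
) -> Vec<(i64, f32)> {
    let sem = min_max(semantic);
    let lex = min_max(bm25);
    let mut fused: HashMap<i64, f32> = HashMap::new();
    for (&id, &s) in &sem {
        *fused.entry(id).or_insert(0.0) += alpha * s;
    }
    for (&id, &s) in &lex {
        *fused.entry(id).or_insert(0.0) += (1.0 - alpha) * s;
    }
    let mut out: Vec<(i64, f32)> = fused.into_iter().collect();
    // Highest fused score first; ties broken by id for a stable order.
    out.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    out
}
```

Unlike RRF, which only sees ranks, this combiner keeps the size of the gap between candidates within each retriever, so a document that one retriever scores far above the rest keeps that advantage after fusion.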

After this commit: NDCG@10 = 0.755 (with α=0.65), recall@200 = 99.6 %, raw BM25-only NDCG@10 = 0.667 (was 0.367).

2 — Post-fusion ranking pipeline (built on top of `d9ab056`)

Every additional commit on this branch is a re-ranking signal applied to the fused top-200 pool after retrieval. Each was gated behind an env var, A/B'd against the previous head over the full 63-repo bench, and only landed when net positive. Per-commit Δ NDCG@10 in commit messages.

query
 ├─ ColBERT (LateOn-Code-edge, 17 M, ONNX-GPU)  ──▶ semantic top-N
 └─ FTS5 (SQLite, identifier-aware tokenizer)   ──▶ BM25 top-N
                            │
                            ▼
            min-max relative-score fusion at α=0.60
                            │
                            ▼
           fetch metadata for fused IDs by `_subset_`
                            │
              ┌─────────────┴─────────────┐
              ▼                           ▼
       file_path_penalty           (skipped if query mentions test/bench/spec)
              │
              ▼
      apply_path_stem_boost   (+max_score·0.40 on stem hit, half on prefix)
              │
      apply_definition_boost  (+max_score·0.25 on unit.name token hit)
              │
      apply_file_coherence_boost  (+max_score·0.20 · file_share)
              │
              ▼
   sort by score · collapse-by-file (min start_line, max end_line) · top-k

Commit-by-commit ledger (Δ measured against the previous commit):

Commit	Change	Δ NDCG@10
`133b95e`	File-path noise penalty. Multiplicative penalty for test files, compat shims, example/demo trees, `__init__.py`/`package-info.java` barrels, and `.d.ts` declarations. Suffix-anchored regex covers every language colgrep handles — Python, Go, Java, PHP, Ruby, JS/TS, Kotlin, Swift, C#, C, C++, Scala, Dart, Lua, Rust, Elixir, Haskell, OCaml, R, Zig, Julia, Vue, Svelte, QML, Bash (bats), PowerShell (Pester). Skipped when the query itself mentions `test` / `spec` / `benchmark`.	+0.009
`706bb11`	File-coherence boost. Files with multiple high-scoring units win over files with a single strong hit: each file's top unit gets `+0.20 · max_score · file_sum / max_file_sum`. Ported from semble's `boost_multi_chunk_files` adapted to AST units.	+0.014
`269c099`	Definition boost. When a query token equals a candidate's `unit.name` (after identifier-aware splitting), add `+0.25 · max_score`. Definition-bearing kinds only — synthetic `raw_code_*` names never fire.	+0.006
`640c225`	File-path stem boost. Exact identifier-token match between the query and the file stem grants `+0.40 · max_score`. Plural/snake-case-normalised so `dependencies` matches `dependency`, `my_func` matches `myfunc`. Largest single signal in the pipeline.	+0.021
`163dbc3`	Correctness: collapse same-file units into one entry (was deduping by `(file, line)` so a file could occupy multiple top-K slots), drop the legacy `−1.0` test demotion now that `file_path_penalty` covers it more accurately, honour exact `-k`.	+0.0004 (flat; correctness and UX change)
`d120590`	Half-strength prefix stem match. `parse` lifts `parseRequest.ts` at half the exact-match boost (`+0.20 · max_score`). Identifier-aware on both sides; minimum sub-token length 3 to avoid junk hits.	+0.014
`f3ccf44`	Three correctness fixes: (a) dedup-by-file at the merge step in `commands/search.rs` (the two hybrid_search calls each ran their own `collapse_by_file`, producing two different min-line collapses for the same file → same file could occupy two top-K slots); (b) re-fetch FTS5 inside text-filtered subsets instead of intersecting the global pool (was killing recall when the global FTS5 top-K didn't overlap the text-filter subset); (c) build `meta_by_id: HashMap<i64, _>` and look up by id instead of `Vec::zip(fused_ids, fused_scores)` — when any id had no METADATA row (stale FTS5 reference), every subsequent (meta, score) pair shifted by one, silently attaching the wrong score to the wrong unit.	+0.011
`3192d25`	Drop `output` from `IGNORED_DIRS`. Generic English word that collides with real source modules — was silently skipping ~65 units across nlohmann-json's `detail/output/{serializer,binary_writer}.hpp`. `target`/`build`/`dist`/`out`/`bin`/`obj` remain.	+small (2 NDCG targets recovered)
`87312ae`	`COLGREP_TRACE=1` env flag. Emits one JSON-Lines stage trace (`semantic` / `bm25` / `fused` / `after_path_penalty` / `after_path_stem_boost` / `after_definition_boost` / `after_coherence_boost` / `final`) per query to stderr, prefixed `__COLGREP_TRACE__`. No-op when unset (one env read per query). Used by the benchmark's per-query diagnostic tooling.	0
`3b6e250`	Cleanup: remove the (reverted) proportional / parent-dir flags that lost the A/B sweep.	0

3 — A/B'd and rejected (kept here for posterity)

Every row below was implemented end-to-end behind an env flag and benched over the full 63-repo set before being judged. All were removed because they regressed at least one dataset by ≥ 0.02 or were net-negative overall.

Experiment	Result	Failure mode
Proportional path-stem boost (`COLGREP_STEM_PROPORTIONAL`, `COLGREP_STEM_MIN_RATIO`)	0.821 vs 0.825 (−0.004)	Helped NL-heavy repos (abseil-cpp +0.099, nlohmann-json +0.075, chi +0.065) but hurt symbol-heavy ones (zig −0.114, zls −0.111, newtonsoft-json −0.101). The ratio discounts the single-keyword stem hit that's the right signal for short symbol queries.
Proportional definition boost (`COLGREP_DEF_PROPORTIONAL`, `COLGREP_DEF_MIN_RATIO`)	0.824 vs 0.825 (−0.001)	Same failure mode as stem variant.
Parent-dir matching in stem boost (`COLGREP_STEM_PARENT_DIR`)	0.817 vs 0.825 (−0.008)	Adds parent-dir tokens (e.g. `defaults` from `lib/defaults/index.js`) to the stem-match set. With binary boost too many files cross the threshold → flat 0.4·max_score boost → displaces true targets.
`_scan_non_candidates` (`COLGREP_SCAN_NON_CANDIDATES`, `COLGREP_SCAN_BOOST`)	0.826 vs 0.825 (+0.001) with −0.037 zod regression	SQL-probe METADATA for files whose stem matches a query symbol, inject defining units with synthetic score `max_pool · COLGREP_SCAN_BOOST`. Below 1.0 the inject is too weak to break top-10; at 1.0 it lifts wins (abseil-cpp `flat_hash_map` rank 7→1, +0.043) but displaces canonical impls (zod's `v4/core/schemas.ts` loses rank 1 to `packages/treeshake/zod-object.ts` on `ZodObject` query).

4 — Tunables exposed

All thresholds carry env-var overrides for ablation work; defaults match the table above.

knob	default	env var
α (semantic vs BM25)	0.60	`COLGREP_ALPHA`
stem boost frac	0.40	`COLGREP_STEM_BOOST`
stem prefix boost frac	0.20	`COLGREP_STEM_PREFIX_BOOST`
definition boost frac	0.25	`COLGREP_DEF_BOOST`
file coherence boost frac	0.20	`COLGREP_COHERENCE_BOOST`
path penalty (strong / mod / mild)	0.30 / 0.50 / 0.70	`COLGREP_STRONG_PENALTY`, `COLGREP_MODERATE_PENALTY`, `COLGREP_MILD_PENALTY`
stopwords in stem boost	ON	`COLGREP_STEM_STOPWORDS`
plural/snake stem norm	ON	`COLGREP_STEM_PLURAL_SNAKE`
per-stage trace logging	OFF	`COLGREP_TRACE=1`
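
A runtime override of this kind can be read with a small helper. The sketch below is an assumption about the shape of such a helper, not colgrep's code; `parse_or_default` and `env_f32` are hypothetical names, and falling back to the default on an unparsable value is an assumed policy.

```rust
use std::env;

/// Parse an optional raw string as f32, falling back to `default`
/// when the value is missing or not a number.
fn parse_or_default(raw: Option<&str>, default: f32) -> f32 {
    raw.and_then(|v| v.trim().parse::<f32>().ok())
        .unwrap_or(default)
}

/// Read a knob such as COLGREP_ALPHA at call time, so a grid search
/// can sweep values without rebuilding the binary.
fn env_f32(name: &str, default: f32) -> f32 {
    let raw = env::var(name).ok();
    parse_or_default(raw.as_deref(), default)
}
```

A call such as `env_f32("COLGREP_ALPHA", 0.60)` would then return the table default unless the variable is set to a valid number.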

Bench (`benchmarks/baselines/colgrep.py`, file-level NDCG@10)

current trigram + RRF α=0.75                       0.7208
+ identifier-aware tokenizer, RRF α=0.6            0.7531
+ min-max fusion, α=0.65 (d9ab056)                 0.7551
+ post-fusion ranking pipeline (3b6e250, this PR)  0.8255
raw ColBERT (--semantic-only)                      0.7089
raw BM25 (this PR)                                 0.6672  (recall@200 = 99.6%)
raw BM25 (trigram)                                 0.3674  (recall@200 = 26.5%)
semble reference                                   0.8523

The previous FTS5 backend used the `trigram` tokenizer with a query sanitizer that AND'd quoted whole-word terms. On the 1,251-query semble code-search benchmark this gave only 26.5% BM25 recall@200 and 0.367 BM25-only NDCG@10, so the hybrid mostly leaned on ColBERT alone (raw NDCG@10 ≈ 0.709, hybrid 0.721). The dominant failure mode was that natural-language queries rarely share *every* whole-word token with a relevant code unit (e.g. `parse request` would not match a function named `parseRequest`).

Replace the BM25 retriever with an identifier-aware index, OR-based queries, and min-max score fusion. Tuned on the same benchmark, GPU NDCG@10 climbs from 0.720 (current hybrid) to 0.755: +0.035 NDCG@10 without changing the model or any retrieval-time GPU work.

next-plaid (additive, defaults unchanged):

- Add `FtsTokenizer::IdentifierAware`. FTS5 is created with `tokenize='unicode61'`; the document body is pre-split with `tokenize_identifiers` so each identifier is stored as its lowercase compound + camelCase / snake_case parts (`parseRequest` → `parserequest parse request`). The raw text remains in the content table, so callers that read `_fts_content_` see no change.
- Add `tokenize_identifiers(&str) -> Vec<String>` for use both at index time and on the query side.
- Add `sanitize_fts5_query_or(&str) -> String`, which tokenizes the query the same way, dedups, and joins the terms with FTS5 `OR` so any sub-part can match. BM25 ranking still rewards documents that hit more terms, so accuracy is preserved while recall jumps from 26.5% to 99.6%.
- 10 new unit tests cover camelCase/PascalCase/snake_case splitting, acronym runs (`getHTTPResponse`), dedup, empty inputs, and a full index → search round-trip on the new tokenizer.
- `next-plaid-api`'s FtsTokenizer string mapping is untouched; the API default (`unicode61`) and its fusion (`fuse_rrf`) are unchanged.

colgrep:

- Default tokenizer flipped from `Trigram` to `IdentifierAware` in `index/mod.rs`. Existing indexes are detected as the wrong tokenizer and rebuilt on next `colgrep init` (already-supported migration path).
- `fts5_search` switches to `sanitize_fts5_query_or` to match the index-time tokenization.
- `search_hybrid_with_embedding` swaps `fuse_rrf` for `fuse_relative_score`. With BM25 recall at 99.6% the min-max linear combiner outperforms rank-only RRF (which artificially caps each retriever's contribution).
- Default hybrid alpha changes from 0.75 to 0.65. The peak is a broad plateau spanning ~0.55-0.75; 0.65 is the empirical maximum and gives meaningful BM25 weight without de-prioritising semantic recall.

Benchmark numbers (lightonai/LateOn-Code-edge, 63 repos × 1,251 queries, NDCG@10 with `file_rank`, matching benchmarks/baselines/colgrep.py):

current trigram + RRF α=0.75                 0.7208
+ tuned α=0.70 on trigram                    0.7221
+ identifier-aware tokenizer, RRF α=0.6      0.7531
+ min-max fusion, α=0.65 (this PR)           0.7551
raw ColBERT (--semantic-only)                0.7089
raw BM25 (this PR)                           0.6672  (recall@200 = 99.6%)
raw BM25 (trigram)                           0.3674  (recall@200 = 26.5%)

LLM-generated benchmark targets and most real agent queries point at the canonical implementation file, not the test, benchmark, example, or compat shim that surrounds it. On the semble bench, raw ColBERT frequently returns the test/benchmark for the queried symbol above the header that defines it (e.g. `time_benchmark.cc` / `time_test.cc` outrank `absl/time/time.h` for "absl::Time and absl::Duration").

New `colgrep::ranking::file_path_penalty` returns a multiplicative penalty in (0, 1] for each candidate file based on suffix-anchored test patterns across 12+ languages, plus `tests/`, `__tests__/`, `spec/`, `testing/`, `compat/`, `legacy/`, `examples/`, and `_examples/` directories. `.d.ts` declaration stubs get a mild penalty (they still carry useful type info), `__init__.py` / `package-info.java` re-export barrels get a moderate one.

`search_hybrid_with_embedding` now:

- Fuses to the full `fetch_k = top_k * 3` pool (was `top_k`), so the reranker has buried-but-strong candidates to surface.
- Applies the penalty multiplicatively to each candidate.
- Re-sorts and truncates to `top_k`.

`ranking::should_apply_path_penalty` skips the penalty entirely when the query mentions test / spec / benchmark, so "unit test for parseRequest" still surfaces the test file.

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):

before (identifier-aware + min-max + α=0.65)  0.7462
after (this PR)                               0.7547  +0.0085
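
To make the shape of the penalty concrete, here is a deliberately small sketch. It is not `colgrep::ranking::file_path_penalty`: it uses plain string checks instead of the suffix-anchored regexes, covers only a handful of patterns, and the mapping of test files to the strong tier and of `compat/`, `legacy/` and example directories to the moderate tier is an assumption (the source only states the tiers for `.d.ts` and re-export barrels). All function and constant names are hypothetical.

```rust
// Tier values taken from the PR's tunables table (strong / moderate / mild).
const STRONG: f32 = 0.30;
const MODERATE: f32 = 0.50;
const MILD: f32 = 0.70;

// Tiny illustrative subsets, not the full pattern lists.
const TEST_SUFFIXES: [&str; 6] = [
    "_test.go", "_test.cc", "_benchmark.cc", "_test.py", ".test.ts", ".spec.ts",
];
const TEST_DIRS: [&str; 4] = ["tests", "__tests__", "spec", "testing"];
const AUX_DIRS: [&str; 4] = ["compat", "legacy", "examples", "_examples"];

/// True when any path segment equals one of `names`.
fn has_segment(path: &str, names: &[&str]) -> bool {
    path.split('/').any(|seg| names.contains(&seg))
}

/// Multiplicative penalty in (0, 1]; 1.0 means "leave the score alone".
fn file_path_penalty_sketch(path: &str) -> f32 {
    let p = path.to_lowercase();
    let file = p.rsplit('/').next().unwrap_or("");
    if file.ends_with(".d.ts") {
        return MILD;
    }
    if file == "__init__.py" || file == "package-info.java" {
        return MODERATE;
    }
    // Assumed tier: test and benchmark files get the strong penalty.
    if has_segment(&p, &TEST_DIRS) || TEST_SUFFIXES.iter().any(|&s| file.ends_with(s)) {
        return STRONG;
    }
    // Assumed tier: compat / legacy / example trees get the moderate penalty.
    if has_segment(&p, &AUX_DIRS) {
        return MODERATE;
    }
    1.0
}

/// Skip the penalty when the query itself asks for tests or benchmarks.
/// Substring matching is a simplification; it also fires on e.g. "specific".
fn should_apply_path_penalty_sketch(query: &str) -> bool {
    let q = query.to_lowercase();
    !["test", "spec", "benchmark"].iter().any(|&w| q.contains(w))
}
```

A caller would multiply each fused score by the returned factor when `should_apply_path_penalty_sketch` is true, then re-sort and truncate to `top_k`.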

When several candidate units come from the same file it is more likely to be the canonical implementation than a file with a single strong match. After fusion + path-noise penalty, each file's top-scoring unit receives `+0.2 * max_score * (file_sum / max_file_sum)` so the file holding the largest share of cumulative score in the candidate pool gets the full 20% boost on its best unit; files with less coverage get proportionally less.

Mirrors semble's `boost_multi_chunk_files` adapted to colgrep's code units (one boost per file rather than per chunk). Helps queries where the relevant file is a large library module that scatters many weak matches across the candidate pool (e.g. `clap.zig` for zig-clap, the abseil `time/` headers when both `time.h` and `time.cc` show up).

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):

before (path-noise penalty)  0.7547
after (this PR)              0.7683  +0.0136
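
The formula above can be written out as a short sketch. The `Candidate` struct and the function name are hypothetical stand-ins for colgrep's own types; the code assumes non-negative scores and computes every boost from the pre-boost scores.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
struct Candidate {
    file: String,
    score: f32,
}

/// Add `frac * max_score * (file_sum / max_file_sum)` to each file's best unit.
fn apply_file_coherence_boost_sketch(items: &mut [Candidate], frac: f32) {
    let max_score = items.iter().map(|c| c.score).fold(0.0_f32, f32::max);
    let mut file_sum: HashMap<String, f32> = HashMap::new();
    let mut best: HashMap<String, usize> = HashMap::new();
    for (i, c) in items.iter().enumerate() {
        // Cumulative score per file, and the index of its top-scoring unit.
        *file_sum.entry(c.file.clone()).or_insert(0.0) += c.score;
        let b = best.entry(c.file.clone()).or_insert(i);
        if c.score > items[*b].score {
            *b = i;
        }
    }
    let max_file_sum = file_sum.values().copied().fold(0.0_f32, f32::max);
    if max_file_sum <= 0.0 {
        return; // Nothing to boost in an empty or all-zero pool.
    }
    for (file, &i) in &best {
        items[i].score += frac * max_score * (file_sum[file] / max_file_sum);
    }
}
```

With one unit from `a.rs` at 4.0 and two units from `b.rs` at 3.8 each, `b.rs` holds the largest cumulative score, so its best unit receives the full boost and overtakes `a.rs`.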

Tree-sitter has already extracted each code unit's `name` at index time, so a unit *defines* its name by construction. If a query token matches the name of one of the candidate units, that unit is far more likely to be what the user is asking about than a unit that merely references the same symbol elsewhere.

After fusion, path-noise penalty, and before file-coherence: tokenize the query with the identifier-aware splitter, tokenize each unit's `name` the same way, and add `+0.5 * max_score` whenever any token matches. Restricted to definition-bearing unit kinds (`Function`, `Method`, `Class`, `Constant`) so synthetic names like `raw_code_24` never trigger a boost.

Matching at the token level (not just whole-name equality) makes `parse_request`, `parseRequest`, `ParseRequest` and `parse` all hit each other, so the boost reaches both bare-symbol queries and natural-language queries that embed an identifier.

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):

before (file-coherence boost)  0.7683
after (this PR)                0.7743  +0.0060
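
A minimal sketch of the boost, assuming the query and unit names have already been split by the identifier-aware tokenizer. The `Kind` and `Unit` types and the function name are hypothetical; the boost fraction is a parameter because this commit used 0.5 while the PR's final default is 0.25.

```rust
use std::collections::HashSet;

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Function,
    Method,
    Class,
    Constant,
    RawCode, // Stand-in for synthetic `raw_code_*` units.
}

#[derive(Debug, Clone)]
struct Unit {
    name_tokens: Vec<String>,
    kind: Kind,
    score: f32,
}

/// Add `frac * max_score` to every definition-bearing unit whose name
/// shares at least one token with the query.
fn apply_definition_boost_sketch(units: &mut [Unit], query_tokens: &HashSet<String>, frac: f32) {
    let max_score = units.iter().map(|u| u.score).fold(0.0_f32, f32::max);
    for u in units.iter_mut() {
        let defines = matches!(
            u.kind,
            Kind::Function | Kind::Method | Kind::Class | Kind::Constant
        );
        if defines && u.name_tokens.iter().any(|t| query_tokens.contains(t)) {
            u.score += frac * max_score;
        }
    }
}
```

The kind gate is what keeps a synthetic unit from being boosted even when one of its name tokens happens to coincide with a query token.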

A surprising amount of agent traffic uses queries that map almost surgically to a file name: "interceptor manager" → InterceptorManager.js, "parse request" → parse_request.py. Even when the chunk text is ambiguous, the file *path* is unambiguous.

For each candidate, identifier-aware-tokenize the file stem (filename minus extension), and add `+0.4 * max_score` whenever any stem token appears in the identifier-aware-tokenized query. Applied before the definition + coherence boosts so a stem match can promote a file that no individual unit's name matched.

This is the largest single ranking-signal improvement we've measured; the path stem is a high-precision feature that the dense + BM25 retrievers both under-weight because the file path is only one row of context inside each unit's `_fts_content_`.

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):

before (definition boost)  0.7743
after (this PR)            0.7951  +0.0208

Three changes:

1. Collapse search results so each file appears at most once. When the ranking pipeline returns multiple units from the same file, the leader (highest-scoring) is kept and its `line` / `end_line` are merged to cover every matched unit's span (min start, max end). The same-file units that follow are dropped from the output rather than competing for top-K slots. This makes `-k` mean "exactly k distinct files" instead of "k units, possibly with duplicates".
2. Drop the legacy `compute_final_score` test-name demotion. The `-1.0` subtraction it applied to anything whose unit name contained "test" predates the path-aware hybrid pipeline and is now redundant with the much more complete `ranking::file_path_penalty` (which inspects file paths, handles 12+ languages, and applies multiplicatively on the fused score). Keeping both compounded the penalty unevenly and made tuning the boost weights harder.
3. Over-fetch generously (`max(top_k * 20, 200)`, capped at the index's actual size) so the collapse never returns fewer than `top_k` distinct files when at least that many files exist in the corpus. `-k` is now a hard contract.

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):

before (path-stem boost)  0.7951
after (this PR)           0.7955  +0.0004

The collapse-by-file is a UX change (distinct files in output); legacy boost removal was explicitly requested. NDCG@10 is held flat, which confirms the new ranking signals fully cover what the legacy demotion was doing.
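
The collapse step can be sketched as below. `Hit` and `collapse_by_file_sketch` are hypothetical names, and the sketch merges the spans of all same-file units in the pool before truncating, which is an assumption about ordering.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Hit {
    file: String,
    line: u32,
    end_line: u32,
    score: f32,
}

/// Keep the highest-scoring unit per file and widen its span to cover
/// every other matched unit from that file (min start, max end).
fn collapse_by_file_sketch(mut hits: Vec<Hit>, top_k: usize) -> Vec<Hit> {
    // Best score first, so the first hit seen for a file is its leader.
    hits.sort_by(|a, b| b.score.total_cmp(&a.score));
    let mut leader_of: HashMap<String, usize> = HashMap::new();
    let mut out: Vec<Hit> = Vec::new();
    for h in hits {
        let existing = leader_of.get(&h.file).copied();
        match existing {
            Some(i) => {
                out[i].line = out[i].line.min(h.line);
                out[i].end_line = out[i].end_line.max(h.end_line);
            }
            None => {
                leader_of.insert(h.file.clone(), out.len());
                out.push(h);
            }
        }
    }
    out.truncate(top_k);
    out
}
```

Because followers are merged rather than emitted, `top_k` counts distinct files, which is why the over-fetch in change 3 is needed to fill every slot.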

The whole-token path-stem boost already captures direct matches like "interceptor manager" → `InterceptorManager.js`, but it misses morphological variants the dense / BM25 retrievers also under-weight:

query       file stem       whole-token   prefix
--------------------------------------------------
config      configuration   no            yes
parse       parser          no            yes
intercept   interceptor     no            yes
depend      dependencies    no            yes

Add a half-strength tier (`0.2 * max_score` instead of `0.4`) for any candidate whose file stem contains a token that shares a ≥3-char prefix with a query token in either direction. Exact whole-token matches still win; prefix matches only fire if no exact match did, so a single file isn't double-boosted. 3-char minimum keeps one- and two-letter prefixes (`fn`, `to`, `is`, ...) from producing spurious hits.

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):

before (collapse + legacy removed)  0.7955
after (this PR)                     0.8091  +0.0136
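
The two tiers can be sketched as a small classifier over pre-tokenized stems and queries. This is an illustration only: `StemMatch` and both functions are hypothetical, and "shares a ≥3-char prefix in either direction" is read here as "one token is a prefix of the other and the shorter token has at least 3 bytes", which is one possible interpretation.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum StemMatch {
    Exact,
    Prefix,
    Miss,
}

/// Classify a file stem against the query: exact token match wins,
/// prefix match is only considered when no exact match fired.
fn stem_match_sketch(stem_tokens: &[&str], query_tokens: &[&str]) -> StemMatch {
    if stem_tokens.iter().any(|s| query_tokens.contains(s)) {
        return StemMatch::Exact;
    }
    let prefix = stem_tokens.iter().any(|s| {
        query_tokens.iter().any(|q| {
            // Assumed rule: the shorter token must be at least 3 bytes long.
            let shorter = s.len().min(q.len());
            shorter >= 3 && (s.starts_with(*q) || q.starts_with(*s))
        })
    });
    if prefix {
        StemMatch::Prefix
    } else {
        StemMatch::Miss
    }
}

/// Additive boost for each tier, using the fractions quoted in the PR.
fn stem_boost_sketch(m: StemMatch, max_score: f32) -> f32 {
    match m {
        StemMatch::Exact => 0.40 * max_score,
        StemMatch::Prefix => 0.20 * max_score,
        StemMatch::Miss => 0.0,
    }
}
```

Returning a single tier per file, rather than summing both checks, is what keeps a file from being boosted twice.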

Three correctness bugs surfaced by a head-to-head audit against semble's ranking pipeline, plus three small principled improvements ported from semble's `boosting.py`.

## Bugs

1. **`commands/search.rs` `(file, line)` dedup was incorrect.** The two `search_hybrid_with_embedding` calls (full index + text-filtered subset) each run `collapse_by_file` internally, which sets `unit.line = min(line_i)` over *their own* candidate pool. Two pools → two different mins → same file occupied two top-K slots, wasting one slot per duplicate. Now dedupes by file alone and merges spans (`line = min`, `end_line = max`, `score = max`) across both calls.
2. **`commands/search.rs` reused full-index FTS5 hits under a subset filter.** The text-filtered subset re-search received the global FTS5 result list (top_k * 3 = 30 hits), which then got *filtered* to the subset, often leaving only a handful of BM25 candidates. Now passes `None` so FTS5 is refetched inside the subset.
3. **`index/mod.rs` score / id misalignment.** `filtering::get` silently drops rows whose `_subset_` id has no METADATA row (stale FTS5 references), then `metadata.into_iter().zip(fused_scores)` shifted every following score onto the wrong unit. Now zips by id via a `HashMap<i64, Value>` lookup; missing ids are skipped, score alignment is preserved.

## Ranking improvements (ported from semble)

- **Stopword filter in stem-boost keyword set.** Common NL words like "how", "the", "of" are filtered before computing path-stem matches so `how_to.py` doesn't get a free hit on "how to authenticate". Toggle: `COLGREP_STEM_STOPWORDS=0`.
- **Plural / snake-case-normalised stem comparison.** Adds `dependencies` ↔ `dependency`, `my_func` ↔ `myfunc`. Toggle: `COLGREP_STEM_PLURAL_SNAKE=0`.
- **Proportional-ratio stem boost** (off by default, available via `COLGREP_STEM_PROPORTIONAL=1`). Ablated and found to *cost* ~0.012 NDCG@10 because the dense + BM25 retrievers already weight signal density; adding *another* dampening on the stem boost over-discounts multi-keyword NL queries where only one keyword names the file. Kept as a toggle for future tuning.

## Bench-harness plumbing (additive, off the hot path)

- `COLGREP_DATA_DIR`: override the index-cache directory so concurrent benchmark processes on different GPUs don't fight over `~/.local/share/colgrep/indices`.
- `COLGREP_ALPHA`, `COLGREP_DEF_BOOST`, `COLGREP_STEM_BOOST`, `COLGREP_STEM_PREFIX_BOOST`, `COLGREP_COHERENCE_BOOST`, `COLGREP_STRONG_PENALTY`, `COLGREP_MODERATE_PENALTY`, `COLGREP_MILD_PENALTY`: read at runtime so a grid search can sweep without rebuilding.

## Defaults

- `hybrid_alpha`: 0.55 → 0.60 (semble bench peak is now a broad plateau across 0.55–0.70; 0.60 is the empirical maximum).
- `DEFINITION_BOOST_FRAC`: 0.5 → 0.25 (grid-searched).

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank, alpha=0.60):

before this PR  0.8208  (path-stem + collapse, alpha=0.55, def_boost=0.5)
after this PR   0.8313  +0.0105

Per-knob ablation at alpha=0.65 (NDCG@10):

baseline + bug fixes only   0.8297
+ stopword filter           0.8303
+ plural/snake              0.8299
+ stopwords + plural/snake  0.8305
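
The score / id misalignment fix (bug 3 above) is easy to reproduce in miniature. The sketch below is not the code in `index/mod.rs`; `Meta` and `attach_scores_sketch` are hypothetical, and it only shows why an id-keyed lookup stays aligned when a metadata row is missing.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Meta {
    id: i64,
    name: String,
}

/// Pair each fused (id, score) with its metadata row by id.
/// Ids without a row (stale references) are skipped instead of
/// shifting every later score onto the wrong unit, as a positional zip would.
fn attach_scores_sketch(fused: &[(i64, f32)], rows: Vec<Meta>) -> Vec<(Meta, f32)> {
    let mut meta_by_id: HashMap<i64, Meta> = rows.into_iter().map(|m| (m.id, m)).collect();
    fused
        .iter()
        .filter_map(|&(id, score)| meta_by_id.remove(&id).map(|m| (m, score)))
        .collect()
}
```

With fused ids 1, 2, 3 and metadata rows only for 1 and 3, a positional zip would hand unit 3 the score of id 2; the lookup keeps unit 3 paired with its own score.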

`output` was in `IGNORED_DIRS` to skip generic build-output directories, but the literal directory-name match also drops genuine source folders that happen to use the same name. Concretely, nlohmann-json's `include/nlohmann/detail/output/` module ships `serializer.hpp` and `binary_writer.hpp`, and both vanished from the index before this change.

Removed `output` from the default list. `target`, `build`, `dist`, `out`, `bin`, `obj` are kept (more reliably build artefacts) and users whose project genuinely outputs to `output/` can re-add the pattern via the `extra_ignore` config knob.

Before / after, nlohmann-json on `include/nlohmann/`:

44 → 47 files indexed
0 → 60 units under `detail/output/serializer.hpp`
0 → 5 units under `detail/output/binary_writer.hpp`

Single-line JSON to stderr at each pipeline stage when `COLGREP_TRACE=1`: semantic top-20, bm25 top-20, fused top-20, post-path-penalty, post-path-stem-boost, post-definition-boost, post-coherence-boost, final. No-op when the env var is unset so the calls are free on the hot path. Used by `benchmarks/trace_worst_queries.py` to replay queries where the target file was indexed but missed top-10, so we can see *which* stage demotes it.

A/B sweeps across all 63 semble benchmark repos showed all three variants are net losses compared to the binary-fire baseline:

STEM_PROPORTIONAL=1 (best base 1.0):  0.821 vs 0.825 (-0.004)
DEF_PROPORTIONAL=1 (best base 0.40):  0.824 vs 0.825 (-0.001)
STEM_PARENT_DIR=1:                    0.817 vs 0.825 (-0.008)

Per-repo deltas explain why: proportional weighting hurts symbol-heavy repos (zig -0.114, zls -0.111, newtonsoft-json -0.101, nvm -0.099) because their short identifier-style queries rely on a single-keyword stem hit being the strong signal. The proportional ratio discounts that hit even when it's the right one.

Removed env vars:

COLGREP_STEM_PROPORTIONAL, COLGREP_STEM_MIN_RATIO
COLGREP_DEF_PROPORTIONAL, COLGREP_DEF_MIN_RATIO
COLGREP_STEM_PARENT_DIR

apply_path_stem_boost and apply_definition_boost are back to plain binary-fire on the proven baseline.

Three CI failures on `feat/identifier-aware-bm25`:

* `cargo fmt --check`: trace-log call sites, the `SearchResult { unit, score }` literal in `search_hybrid_with_embedding`, the final `sort_by` lambda, the `trace_enabled` matches!, and a handful of spots in `next-plaid::text_search` and `colgrep::ranking` that rustfmt now wants on multiple lines. Applied `cargo fmt --all`.
* `cargo clippy --all-targets -- -D warnings` on the workspace:
  - `colgrep::ranking::apply_definition_boost` and `apply_path_stem_boost` used `for i in 0..items.len()` purely to index `items[i]`; switched to `items.iter_mut()` and reborrow `&*item` for the read-only closures so the body compiles unchanged.
  - `next-plaid-api::tracing_middleware` used `unwrap_or_else(TraceId::new)`; replaced with `unwrap_or_default()` since `TraceId` already has a `Default` impl.
* `cargo doc -D warnings`: the `fts5_search` doc-comment linked `[FtsTokenizer::IdentifierAware]` which doesn't resolve from inside `colgrep`; qualified to `[next_plaid::FtsTokenizer::IdentifierAware]`.

raphaelsty added 11 commits May 18, 2026 11:48

raphaelsty force-pushed the feat/identifier-aware-bm25 branch from 8743301 to 3b6e250 Compare May 19, 2026 17:21

raphaelsty force-pushed the feat/identifier-aware-bm25 branch from 6927646 to 0f6ea29 Compare May 19, 2026 17:34

raphaelsty self-assigned this May 19, 2026

raphaelsty merged commit e6d182b into main May 19, 2026
20 checks passed

raphaelsty merged 12 commits into main from feat/identifier-aware-bm25
