Improve ColGREP BM25: identifier-aware tokenizer + relative-score fusion #99

Merged: raphaelsty merged 12 commits into main from feat/identifier-aware-bm25 on May 19, 2026

@raphaelsty raphaelsty commented May 18, 2026

Summary

End-to-end overhaul of ColGREP's hybrid retrieval and post-fusion ranking on the 1,251-query / 63-repo semble code-search benchmark. NDCG@10 climbs from 0.693 (README baseline at branch start) to 0.825 — a +0.132 gain, with no model swap, no adaptive-α-by-query-shape, no benchmark-specific rules.

Headline numbers (lightonai/LateOn-Code-edge, file-level NDCG@10, benchmarks/baselines/colgrep.py):

| Configuration | NDCG@10 |
|---|---|
| README baseline (buggy GPU batching) | 0.693 |
| Identifier-aware BM25 + min-max fusion (d9ab056) | 0.755 |
| + post-fusion ranking pipeline (this PR head) | 0.825 |
| semble reference | 0.852 |

1 — Identifier-aware BM25 + min-max fusion (d9ab056)

The previous FTS5 backend used the trigram tokenizer with a query sanitizer that AND'd quoted whole-word terms. On the 1,251-query semble benchmark this gave only 26.5 % BM25 recall@200 and 0.367 BM25-only NDCG@10, so the hybrid mostly leaned on ColBERT alone (raw NDCG@10 ≈ 0.709, hybrid 0.721). The dominant failure mode was that natural-language queries rarely share every whole-word token with a relevant code unit (e.g. parse request would not match a function named parseRequest).

Replaced the BM25 retriever with an identifier-aware index, OR-based queries, and min-max score fusion.

next-plaid (additive — defaults unchanged):

  • New FtsTokenizer::IdentifierAware. FTS5 is created with tokenize='unicode61'; the document body is pre-split with tokenize_identifiers so each identifier is stored as its lowercase compound + camelCase / snake_case parts (parseRequest → parserequest parse request). The raw text remains in the content table.
  • New tokenize_identifiers(&str) -> Vec<String> for use both at index time and on the query side.
  • New sanitize_fts5_query_or(&str) -> String — tokenizes the query the same way, dedups, joins terms with FTS5 OR so any sub-part can match. BM25 ranking still rewards documents that hit more terms, so accuracy is preserved while recall jumps from 26.5 % to 99.6 %.
  • 10 new unit tests covering camelCase / PascalCase / snake_case splitting, acronym runs (getHTTPResponse), dedup, empty inputs, and a full index → search round-trip on the new tokenizer.
  • next-plaid-api's FtsTokenizer string mapping is untouched; the API default (unicode61) and its fuse_rrf are unchanged.
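
The splitting rule described in the bullets above can be sketched as follows. This is a minimal illustration, not the code in next-plaid: the function names `tokenize_identifiers_sketch` and `split_parts` are hypothetical, identifiers are assumed to be ASCII, and the compound form is assumed to drop underscores.

```rust
/// Hypothetical sketch: emit, for each identifier, its lowercase compound
/// followed by its camelCase / snake_case parts. The real
/// `tokenize_identifiers` may differ in details.
fn tokenize_identifiers_sketch(text: &str) -> Vec<String> {
    let mut out = Vec::new();
    // An identifier is a run of ASCII alphanumerics and underscores.
    for ident in text.split(|c: char| !(c.is_ascii_alphanumeric() || c == '_')) {
        if ident.is_empty() {
            continue;
        }
        let parts = split_parts(ident);
        if parts.len() > 1 {
            // Compound form first, e.g. "parserequest".
            out.push(parts.concat());
        }
        out.extend(parts);
    }
    out
}

/// Split one identifier on underscores and case boundaries, lowercasing parts.
fn split_parts(ident: &str) -> Vec<String> {
    let chars: Vec<char> = ident.chars().collect();
    let mut parts = Vec::new();
    let mut cur = String::new();
    for (i, &c) in chars.iter().enumerate() {
        if c == '_' {
            if !cur.is_empty() {
                parts.push(std::mem::take(&mut cur));
            }
            continue;
        }
        if i > 0 && c.is_ascii_uppercase() && !cur.is_empty() {
            let prev = chars[i - 1];
            let next_lower = chars.get(i + 1).map_or(false, |n| n.is_ascii_lowercase());
            // Boundary: lower/digit followed by upper ("parseRequest"), or the
            // last capital of an acronym run ("HTTPResponse" -> "http" | "response").
            if prev.is_ascii_lowercase()
                || prev.is_ascii_digit()
                || (prev.is_ascii_uppercase() && next_lower)
            {
                parts.push(std::mem::take(&mut cur));
            }
        }
        cur.push(c.to_ascii_lowercase());
    }
    if !cur.is_empty() {
        parts.push(cur);
    }
    parts
}
```

With this sketch, `parseRequest` yields `parserequest`, `parse`, `request`, and `getHTTPResponse` yields `gethttpresponse`, `get`, `http`, `response`.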

colgrep:

  • Default tokenizer flipped from Trigram to IdentifierAware. Existing indexes are detected as the wrong tokenizer and rebuilt on next colgrep init (already-supported migration path).
  • fts5_search switches to sanitize_fts5_query_or to match the index-time tokenization.
  • search_hybrid_with_embedding swaps fuse_rrf for fuse_relative_score. With BM25 recall at 99.6 % the min-max linear combiner outperforms rank-only RRF (which artificially caps each retriever's contribution).
  • Default hybrid α changes from 0.75 to 0.65. The peak is a broad plateau spanning ~0.55–0.75; 0.65 is the empirical maximum and gives meaningful BM25 weight without de-prioritising semantic recall.
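
As an illustration of the min-max combiner, here is a minimal sketch. It assumes α weights the semantic side and (1 − α) the BM25 side, that both score lists are already oriented so that higher is better, and that an id missing from one list contributes 0 on that side. The name `fuse_relative_score_sketch` is hypothetical and the real `fuse_relative_score` may handle these cases differently.

```rust
use std::collections::HashMap;

/// Min-max normalise scores to [0, 1]; a list with a single distinct score maps to 1.0.
fn min_max(scores: &[(i64, f32)]) -> HashMap<i64, f32> {
    let lo = scores.iter().map(|&(_, s)| s).fold(f32::INFINITY, f32::min);
    let hi = scores.iter().map(|&(_, s)| s).fold(f32::NEG_INFINITY, f32::max);
    scores
        .iter()
        .map(|&(id, s)| (id, if hi > lo { (s - lo) / (hi - lo) } else { 1.0 }))
        .collect()
}

/// Hypothetical relative-score fusion: alpha * semantic + (1 - alpha) * bm25,
/// computed on min-max-normalised scores and sorted by descending fused score.
fn fuse_relative_score_sketch(
    semantic: &[(i64, f32)],
    bm25: &[(i64, f32)],
    alpha: f32,
) -> Vec<(i64, f32)> {
    let mut fused: HashMap<i64, f32> = HashMap::new();
    for (id, s) in min_max(semantic) {
        *fused.entry(id).or_insert(0.0) += alpha * s;
    }
    for (id, s) in min_max(bm25) {
        *fused.entry(id).or_insert(0.0) += (1.0 - alpha) * s;
    }
    let mut out: Vec<(i64, f32)> = fused.into_iter().collect();
    // Descending score; ties broken by ascending id for determinism.
    out.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    out
}
```

Unlike rank-only RRF, the fused score here keeps the size of the gap between candidates within each retriever, which is the property the bullet above credits for the gain.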

After this commit: NDCG@10 = 0.755 (with α=0.65), recall@200 = 99.6 %, raw BM25-only NDCG@10 = 0.667 (was 0.367).

2 — Post-fusion ranking pipeline (built on top of d9ab056)

Every additional commit on this branch is a re-ranking signal applied to the fused top-200 pool after retrieval. Each was gated behind an env var, A/B'd against the previous head over the full 63-repo bench, and only landed when net positive. Per-commit Δ NDCG@10 figures are recorded in the commit messages.

query
 ├─ ColBERT (LateOn-Code-edge, 17 M, ONNX-GPU)  ──▶ semantic top-N
 └─ FTS5 (SQLite, identifier-aware tokenizer)   ──▶ BM25 top-N
                            │
                            ▼
            min-max relative-score fusion at α=0.60
                            │
                            ▼
           fetch metadata for fused IDs by `_subset_`
                            │
              ┌─────────────┴─────────────┐
              ▼                           ▼
       file_path_penalty           (skipped if query mentions test/bench/spec)
              │
              ▼
      apply_path_stem_boost   (+max_score·0.40 on stem hit, half on prefix)
              │
      apply_definition_boost  (+max_score·0.25 on unit.name token hit)
              │
      apply_file_coherence_boost  (+max_score·0.20 · file_share)
              │
              ▼
   sort by score · collapse-by-file (min start_line, max end_line) · top-k

Commit-by-commit ledger (Δ measured against the previous commit):

| Commit | Change | Δ NDCG@10 |
|---|---|---|
| 133b95e | File-path noise penalty. Multiplicative penalty for test files, compat shims, example/demo trees, `__init__.py`/package-info.java barrels, and .d.ts declarations. Suffix-anchored regex covers every language colgrep handles: Python, Go, Java, PHP, Ruby, JS/TS, Kotlin, Swift, C#, C, C++, Scala, Dart, Lua, Rust, Elixir, Haskell, OCaml, R, Zig, Julia, Vue, Svelte, QML, Bash (bats), PowerShell (Pester). Skipped when the query itself mentions test / spec / benchmark. | +0.009 |
| 706bb11 | File-coherence boost. Files with multiple high-scoring units win over files with a single strong hit: each file's top unit gets +0.20 · max_score · file_sum / max_file_sum. Ported from semble's boost_multi_chunk_files adapted to AST units. | +0.014 |
| 269c099 | Definition boost. When a query token equals a candidate's unit.name (after identifier-aware splitting), add +0.25 · max_score. Definition-bearing kinds only; synthetic `raw_code_*` names never fire. | +0.006 |
| 640c225 | File-path stem boost. Exact identifier-token match between the query and the file stem grants +0.40 · max_score. Plural/snake-case-normalised so dependencies matches dependency, my_func matches myfunc. Largest single signal in the pipeline. | +0.021 |
| 163dbc3 | Correctness: collapse same-file units into one entry (was deduping by (file, line) so a file could occupy multiple top-K slots), drop the legacy −1.0 test demotion now that file_path_penalty covers it more accurately, honour exact -k. | unmeasured but significant |
| d120590 | Half-strength prefix stem match. parse lifts parseRequest.ts at half the exact-match boost (+0.20 · max_score). Identifier-aware on both sides; minimum sub-token length 3 to avoid junk hits. | +0.014 |
| f3ccf44 | Three correctness fixes: (a) dedup-by-file at the merge step in commands/search.rs (the two hybrid_search calls each ran their own collapse_by_file, producing two different min-line collapses for the same file → same file could occupy two top-K slots); (b) re-fetch FTS5 inside text-filtered subsets instead of intersecting the global pool (was killing recall when the global FTS5 top-K didn't overlap the text-filter subset); (c) build meta_by_id: `HashMap<i64, _>` and look up by id instead of Vec::zip(fused_ids, fused_scores); when any id had no METADATA row (stale FTS5 reference), every subsequent (meta, score) pair shifted by one, silently attaching the wrong score to the wrong unit. | +0.011 |
| 3192d25 | Drop output from IGNORED_DIRS. Generic English word that collides with real source modules; it was silently skipping ~65 units across nlohmann-json's detail/output/{serializer,binary_writer}.hpp. target/build/dist/out/bin/obj remain. | +small (2 NDCG targets recovered) |
| 87312ae | COLGREP_TRACE=1 env flag. Emits one JSON-Lines stage trace (semantic / bm25 / fused / after_path_penalty / after_path_stem_boost / after_definition_boost / after_coherence_boost / final) per query to stderr, prefixed `__COLGREP_TRACE__`. No-op when unset (one env read per query). Used by the benchmark's per-query diagnostic tooling. | 0 |
| 3b6e250 | Cleanup: remove the (reverted) proportional / parent-dir flags that lost the A/B sweep. | 0 |

3 — A/B'd and rejected (kept here for posterity)

Every row below was implemented end-to-end behind an env flag and benched over the full 63-repo set before being judged. All were removed because they regressed at least one dataset by ≥ 0.02 or were net-negative overall.

| Experiment | Result | Failure mode |
|---|---|---|
| Proportional path-stem boost (COLGREP_STEM_PROPORTIONAL, COLGREP_STEM_MIN_RATIO) | 0.821 vs 0.825 (−0.004) | Helped NL-heavy repos (abseil-cpp +0.099, nlohmann-json +0.075, chi +0.065) but hurt symbol-heavy ones (zig −0.114, zls −0.111, newtonsoft-json −0.101). The ratio discounts the single-keyword stem hit that's the right signal for short symbol queries. |
| Proportional definition boost (COLGREP_DEF_PROPORTIONAL, COLGREP_DEF_MIN_RATIO) | 0.824 vs 0.825 (−0.001) | Same failure mode as stem variant. |
| Parent-dir matching in stem boost (COLGREP_STEM_PARENT_DIR) | 0.817 vs 0.825 (−0.008) | Adds parent-dir tokens (e.g. defaults from lib/defaults/index.js) to the stem-match set. With binary boost too many files cross the threshold → flat 0.4·max_score boost → displaces true targets. |
| `_scan_non_candidates` (COLGREP_SCAN_NON_CANDIDATES, COLGREP_SCAN_BOOST) | 0.826 vs 0.825 (+0.001) with −0.037 zod regression | SQL-probe METADATA for files whose stem matches a query symbol, inject defining units with synthetic score max_pool · COLGREP_SCAN_BOOST. Below 1.0 the inject is too weak to break top-10; at 1.0 it lifts wins (abseil-cpp flat_hash_map rank 7→1, +0.043) but displaces canonical impls (zod's v4/core/schemas.ts loses rank 1 to packages/treeshake/zod-object.ts on ZodObject query). |

4 — Tunables exposed

All thresholds carry env-var overrides for ablation work; defaults match the table above.

| knob | default | env var |
|---|---|---|
| α (semantic vs BM25) | 0.60 | COLGREP_ALPHA |
| stem boost frac | 0.40 | COLGREP_STEM_BOOST |
| stem prefix boost frac | 0.20 | COLGREP_STEM_PREFIX_BOOST |
| definition boost frac | 0.25 | COLGREP_DEF_BOOST |
| file coherence boost frac | 0.20 | COLGREP_COHERENCE_BOOST |
| path penalty (strong / mod / mild) | 0.30 / 0.50 / 0.70 | COLGREP_STRONG_PENALTY, COLGREP_MODERATE_PENALTY, COLGREP_MILD_PENALTY |
| stopwords in stem boost | ON | COLGREP_STEM_STOPWORDS |
| plural/snake stem norm | ON | COLGREP_STEM_PLURAL_SNAKE |
| per-stage trace logging | OFF | COLGREP_TRACE=1 |

Bench (benchmarks/baselines/colgrep.py, file-level NDCG@10)

current trigram + RRF α=0.75                       0.7208
+ identifier-aware tokenizer, RRF α=0.6            0.7531
+ min-max fusion, α=0.65 (d9ab056)                 0.7551
+ post-fusion ranking pipeline (3b6e250, this PR)  0.8255
raw ColBERT (--semantic-only)                      0.7089
raw BM25 (this PR)                                 0.6672  (recall@200 = 99.6%)
raw BM25 (trigram)                                 0.3674  (recall@200 = 26.5%)
semble reference                                   0.8523

raphaelsty added 11 commits May 18, 2026 11:48
The previous FTS5 backend used the `trigram` tokenizer with a query
sanitizer that AND'd quoted whole-word terms. On the 1,251-query semble
code-search benchmark this gave only 26.5% BM25 recall@200 and 0.367
BM25-only NDCG@10, so the hybrid mostly leaned on ColBERT alone (raw
NDCG@10 ≈ 0.709, hybrid 0.721). The dominant failure mode was that
natural-language queries rarely share *every* whole-word token with a
relevant code unit (e.g. `parse request` would not match a function
named `parseRequest`).

Replace the BM25 retriever with an identifier-aware index, OR-based
queries, and min-max score fusion. Tuned on the same benchmark, GPU
NDCG@10 climbs from 0.720 (current hybrid) to 0.755 — +0.035 NDCG@10
without changing the model or any retrieval-time GPU work.

next-plaid (additive — defaults unchanged):
- Add `FtsTokenizer::IdentifierAware`. FTS5 is created with
  `tokenize='unicode61'`; the document body is pre-split with
  `tokenize_identifiers` so each identifier is stored as its lowercase
  compound + camelCase / snake_case parts (`parseRequest` →
  `parserequest parse request`). The raw text remains in the content
  table, so callers that read `_fts_content_` see no change.
- Add `tokenize_identifiers(&str) -> Vec<String>` for use both at index
  time and on the query side.
- Add `sanitize_fts5_query_or(&str) -> String` — tokenizes the query the
  same way, dedups, and joins the terms with FTS5 `OR` so any sub-part
  can match. BM25 ranking still rewards documents that hit more terms,
  so accuracy is preserved while recall jumps from 26.5% to 99.6%.
- 10 new unit tests cover camelCase/PascalCase/snake_case splitting,
  acronym runs (`getHTTPResponse`), dedup, empty inputs, and a full
  index → search round-trip on the new tokenizer.
- `next-plaid-api`'s FtsTokenizer string mapping is untouched; the API
  default (`unicode61`) and its fusion (`fuse_rrf`) are unchanged.
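
A minimal sketch of the OR-query construction, assuming the query has already been split into identifier tokens. The function name `or_query_sketch` is hypothetical; quoting each term as an FTS5 string (with embedded double quotes doubled) is one way to keep arbitrary tokens from being parsed as FTS5 operators, and the real `sanitize_fts5_query_or` may do this differently.

```rust
use std::collections::HashSet;

/// Hypothetical sketch: dedup tokens (keeping first-seen order), quote each
/// one as an FTS5 string, and join with the FTS5 `OR` operator.
fn or_query_sketch(tokens: &[&str]) -> String {
    let mut seen: HashSet<String> = HashSet::new();
    let mut terms: Vec<String> = Vec::new();
    for token in tokens {
        let t = token.to_lowercase();
        // `insert` returns false when the token was already present.
        if !t.is_empty() && seen.insert(t.clone()) {
            // Double any embedded quote so the term stays a single FTS5 string.
            terms.push(format!("\"{}\"", t.replace('"', "\"\"")));
        }
    }
    terms.join(" OR ")
}
```

For the tokens `parserequest`, `parse`, `request`, `parse` this produces `"parserequest" OR "parse" OR "request"`; an empty token list produces an empty string, which the caller would need to treat as "no BM25 query".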

colgrep:
- Default tokenizer flipped from `Trigram` to `IdentifierAware` in
  `index/mod.rs`. Existing indexes are detected as the wrong tokenizer
  and rebuilt on next `colgrep init` (already-supported migration path).
- `fts5_search` switches to `sanitize_fts5_query_or` to match the
  index-time tokenization.
- `search_hybrid_with_embedding` swaps `fuse_rrf` for
  `fuse_relative_score`. With BM25 recall at 99.6% the min-max linear
  combiner outperforms rank-only RRF (which artificially caps each
  retriever's contribution).
- Default hybrid alpha changes from 0.75 to 0.65. The peak is a broad
  plateau spanning ~0.55-0.75; 0.65 is the empirical maximum and gives
  meaningful BM25 weight without de-prioritising semantic recall.

Benchmark numbers (lightonai/LateOn-Code-edge, 63 repos × 1,251 queries,
NDCG@10 with `file_rank` — matches benchmarks/baselines/colgrep.py):
  current trigram + RRF α=0.75           0.7208
  + tuned α=0.70 on trigram               0.7221
  + identifier-aware tokenizer, RRF α=0.6 0.7531
  + min-max fusion, α=0.65 (this PR)      0.7551
  raw ColBERT (--semantic-only)           0.7089
  raw BM25 (this PR)                      0.6672  (recall@200 = 99.6%)
  raw BM25 (trigram)                      0.3674  (recall@200 = 26.5%)
LLM-generated benchmark targets and most real agent queries point at the
canonical implementation file, not the test, benchmark, example, or
compat shim that surrounds it.  On the semble bench, raw ColBERT
frequently returns the test/benchmark for the queried symbol above the
header that defines it (e.g. `time_benchmark.cc` / `time_test.cc` outrank
`absl/time/time.h` for "absl::Time and absl::Duration").

New `colgrep::ranking::file_path_penalty` returns a multiplicative
penalty in (0, 1] for each candidate file based on suffix-anchored test
patterns across 12+ languages, plus `tests/`, `__tests__/`, `spec/`,
`testing/`, `compat/`, `legacy/`, `examples/`, and `_examples/`
directories.  `.d.ts` declaration stubs get a mild penalty (they still
carry useful type info), `__init__.py` / `package-info.java` re-export
barrels get a moderate one.

`search_hybrid_with_embedding` now:
- Fuses to the full `fetch_k = top_k * 3` pool (was top_k), so the
  reranker has buried-but-strong candidates to surface.
- Applies the penalty multiplicatively to each candidate.
- Re-sorts and truncates to `top_k`.

`ranking::should_apply_path_penalty` skips the penalty entirely when the
query mentions test / spec / benchmark, so "unit test for parseRequest"
still surfaces the test file.
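
A rough sketch of the two functions, for illustration only. The tier values 0.30 / 0.50 / 0.70 are assumed to be the strong / moderate / mild multipliers, the assignment of test files and noisy directories to the strong tier is an assumption, and the handful of filename patterns here stands in for the much larger suffix-anchored set described above. Both function names are hypothetical.

```rust
/// Hypothetical sketch of a multiplicative path penalty in (0, 1].
fn file_path_penalty_sketch(path: &str) -> f32 {
    let p = path.to_lowercase();
    let file = p.rsplit('/').next().unwrap_or("");
    // Mild: declaration stubs still carry useful type info.
    if file.ends_with(".d.ts") {
        return 0.70;
    }
    // Moderate: re-export barrels.
    if file == "__init__.py" || file == "package-info.java" {
        return 0.50;
    }
    // Strong (assumed tier): test-like file names and noisy directories.
    let stem = file.split('.').next().unwrap_or("");
    let test_like_file = stem.ends_with("_test")
        || stem.ends_with("_benchmark")
        || stem.starts_with("test_")
        || file.contains(".test.")
        || file.contains(".spec.");
    const NOISY_DIRS: [&str; 8] = [
        "tests", "__tests__", "spec", "testing", "compat", "legacy", "examples", "_examples",
    ];
    let noisy_dir = p.split('/').any(|seg| NOISY_DIRS.contains(&seg));
    if test_like_file || noisy_dir { 0.30 } else { 1.0 }
}

/// Hypothetical sketch: skip the penalty when a query word starts with
/// "test", "spec", or "bench" (a deliberately crude word-prefix check).
fn should_apply_path_penalty_sketch(query: &str) -> bool {
    let q = query.to_lowercase();
    !q.split(|c: char| !c.is_ascii_alphanumeric())
        .any(|w| w.starts_with("test") || w.starts_with("spec") || w.starts_with("bench"))
}
```

In this sketch `absl/time/time_test.cc` is multiplied by 0.30 while `absl/time/time.h` keeps its score, and the query "unit test for parseRequest" disables the penalty altogether.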

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):
  before (identifier-aware + min-max + α=0.65)  0.7462
  after  (this PR)                              0.7547   +0.0085
When several candidate units come from the same file it is more likely
to be the canonical implementation than a file with a single strong
match. After fusion + path-noise penalty, each file's top-scoring unit
receives `+0.2 * max_score * (file_sum / max_file_sum)` so the file
holding the largest share of cumulative score in the candidate pool
gets the full 20% boost on its best unit; files with less coverage get
proportionally less.

Mirrors semble's `boost_multi_chunk_files` adapted to colgrep's code
units (one boost per file rather than per chunk).  Helps queries where
the relevant file is a large library module that scatters many weak
matches across the candidate pool (e.g. `clap.zig` for zig-clap, the
abseil `time/` headers when both `time.h` and `time.cc` show up).
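
The formula above can be sketched as follows. The `Hit` struct and the function name are hypothetical stand-ins for colgrep's own types; only the arithmetic `frac * max_score * (file_sum / max_file_sum)` is taken from the description.

```rust
use std::collections::HashMap;

/// Hypothetical candidate: one code unit with its file and fused score.
struct Hit {
    file: String,
    score: f32,
}

/// Add `frac * max_score * (file_sum / max_file_sum)` to each file's best unit.
fn apply_file_coherence_boost_sketch(hits: &mut [Hit], frac: f32) {
    let max_score = hits.iter().map(|h| h.score).fold(0.0_f32, f32::max);
    let mut file_sum: HashMap<String, f32> = HashMap::new();
    let mut best: HashMap<String, usize> = HashMap::new();
    for (i, h) in hits.iter().enumerate() {
        // Cumulative score per file.
        *file_sum.entry(h.file.clone()).or_insert(0.0) += h.score;
        // Index of the highest-scoring unit per file.
        let e = best.entry(h.file.clone()).or_insert(i);
        if h.score > hits[*e].score {
            *e = i;
        }
    }
    let max_sum = file_sum.values().copied().fold(0.0_f32, f32::max);
    if max_sum <= 0.0 {
        return; // Nothing to boost (empty pool or all-zero scores).
    }
    for (file, &i) in &best {
        hits[i].score += frac * max_score * file_sum[file] / max_sum;
    }
}
```

With units scoring 1.0 and 0.5 in `a.rs` and 0.9 in `b.rs`, `a.rs` holds the largest cumulative score (1.5) and its best unit gets the full +0.20, while the `b.rs` unit gets 0.20 · 0.9 / 1.5 = +0.12.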

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):
  before (path-noise penalty)                   0.7547
  after  (this PR)                              0.7683   +0.0136
Tree-sitter has already extracted each code unit's `name` at index
time, so a unit *defines* its name by construction.  If a query token
matches the name of one of the candidate units, that unit is far more
likely to be what the user is asking about than a unit that merely
references the same symbol elsewhere.

After fusion, path-noise penalty, and before file-coherence: tokenize
the query with the identifier-aware splitter, tokenize each unit's
`name` the same way, and add `+0.5 * max_score` whenever any token
matches.  Restricted to definition-bearing unit kinds (`Function`,
`Method`, `Class`, `Constant`) so synthetic names like `raw_code_24`
never trigger a boost.

Matching at the token level (not just whole-name equality) makes
`parse_request`, `parseRequest`, `ParseRequest` and `parse` all hit each
other, so the boost reaches both bare-symbol queries and natural-
language queries that embed an identifier.
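
A minimal sketch of the boost, assuming the query and each unit name have already been split into identifier tokens. `Kind`, `Unit`, and the function name are hypothetical; `RawCode` stands in for whatever kind carries the synthetic `raw_code_*` names.

```rust
use std::collections::HashSet;

/// Hypothetical unit kinds; only the first four are definition-bearing.
#[allow(dead_code)]
enum Kind {
    Function,
    Method,
    Class,
    Constant,
    RawCode,
}

/// Hypothetical candidate unit with its name pre-split into tokens.
struct Unit {
    name_tokens: Vec<String>,
    kind: Kind,
    score: f32,
}

/// Add `frac * max_score` to definition-bearing units whose name shares
/// at least one token with the query.
fn apply_definition_boost_sketch(units: &mut [Unit], query_tokens: &HashSet<String>, frac: f32) {
    let max_score = units.iter().map(|u| u.score).fold(0.0_f32, f32::max);
    for u in units.iter_mut() {
        let defines = matches!(
            u.kind,
            Kind::Function | Kind::Method | Kind::Class | Kind::Constant
        );
        if defines && u.name_tokens.iter().any(|t| query_tokens.contains(t)) {
            u.score += frac * max_score;
        }
    }
}
```

The boost is binary: a unit either receives the full `frac * max_score` or nothing, regardless of how many tokens match.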

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):
  before (file-coherence boost)                 0.7683
  after  (this PR)                              0.7743   +0.0060
A surprising amount of agent traffic uses queries that map almost
surgically to a file name: "interceptor manager" → InterceptorManager.js,
"parse request" → parse_request.py.  Even when the chunk text is
ambiguous, the file *path* is unambiguous.

For each candidate, identifier-aware-tokenize the file stem (filename
minus extension), and add `+0.4 * max_score` whenever any stem token
appears in the identifier-aware-tokenized query.  Applied before the
definition + coherence boosts so a stem match can promote a file that
no individual unit's name matched.
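
As a sketch of the exact-match tier: extract the file stem, split both the stem and the query into lowercase tokens, and fire when any stem token appears in the query. `split_ident` below is a deliberately simplified splitter (no acronym handling, no compound form) and both function names are hypothetical.

```rust
use std::path::Path;

/// Simplified splitter: break on non-alphanumerics and lower-to-upper
/// case boundaries, lowercasing every part.
fn split_ident(s: &str) -> Vec<String> {
    let mut parts = vec![String::new()];
    let mut prev_lower = false;
    for c in s.chars() {
        if !c.is_ascii_alphanumeric() || (c.is_ascii_uppercase() && prev_lower) {
            parts.push(String::new());
        }
        if c.is_ascii_alphanumeric() {
            parts.last_mut().unwrap().push(c.to_ascii_lowercase());
        }
        prev_lower = c.is_ascii_lowercase();
    }
    parts.retain(|p| !p.is_empty());
    parts
}

/// Return the additive boost for one candidate path: `frac * max_score`
/// when any stem token appears among the query tokens, else 0.
fn stem_boost_sketch(path: &str, query: &str, max_score: f32, frac: f32) -> f32 {
    let stem = Path::new(path)
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("");
    let query_tokens = split_ident(query);
    if split_ident(stem).iter().any(|t| query_tokens.contains(t)) {
        frac * max_score
    } else {
        0.0
    }
}
```

Here "interceptor manager" fires on `lib/core/InterceptorManager.js` and "parse request" fires on `src/parse_request.py`, while `src/utils.py` gets no boost for either query.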

This is the largest single ranking-signal improvement we've measured;
the path stem is a high-precision feature that the dense + BM25
retrievers both under-weight because the file path is only one row of
context inside each unit's `_fts_content_`.

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):
  before (definition boost)                     0.7743
  after  (this PR)                              0.7951   +0.0208
Three changes:

1. Collapse search results so each file appears at most once. When the
   ranking pipeline returns multiple units from the same file, the
   leader (highest-scoring) is kept and its `line` / `end_line` are
   merged to cover every matched unit's span (min start, max end).
   The same-file units that follow are dropped from the output rather
   than competing for top-K slots. This makes `-k` mean "exactly k
   distinct files" instead of "k units, possibly with duplicates".

2. Drop the legacy `compute_final_score` test-name demotion. The
   `-1.0` subtraction it applied to anything whose unit name contained
   "test" predates the path-aware hybrid pipeline and is now redundant
   with the much more complete `ranking::file_path_penalty` (which
   inspects file paths, handles 12+ languages, and applies
   multiplicatively on the fused score). Keeping both compounded the
   penalty unevenly and made tuning the boost weights harder.

3. Over-fetch generously (`max(top_k * 20, 200)`, capped at the
   index's actual size) so the collapse never returns fewer than
   `top_k` distinct files when at least that many files exist in the
   corpus — `-k` is now a hard contract.
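
The collapse and over-fetch rules above can be sketched like this. The `Hit` struct and both function names are hypothetical; the input is assumed to be sorted by descending score already, so the first unit seen for a file is its leader.

```rust
use std::collections::HashMap;

/// Hypothetical search result for one code unit.
#[derive(Debug)]
struct Hit {
    file: String,
    line: u32,
    end_line: u32,
    score: f32,
}

/// Pool size to fetch before collapsing: max(top_k * 20, 200), capped at the index size.
fn fetch_k_sketch(top_k: usize, index_len: usize) -> usize {
    (top_k * 20).max(200).min(index_len)
}

/// Keep one entry per file (the highest-scoring unit), widening its span to
/// cover every same-file unit, then truncate to `top_k` distinct files.
/// `hits` must be sorted by descending score.
fn collapse_by_file_sketch(hits: Vec<Hit>, top_k: usize) -> Vec<Hit> {
    let mut out: Vec<Hit> = Vec::new();
    let mut slot: HashMap<String, usize> = HashMap::new();
    for h in hits {
        let existing = slot.get(&h.file).copied();
        match existing {
            Some(i) => {
                // Merge spans into the leader; the follower is dropped.
                out[i].line = out[i].line.min(h.line);
                out[i].end_line = out[i].end_line.max(h.end_line);
            }
            None => {
                slot.insert(h.file.clone(), out.len());
                out.push(h);
            }
        }
    }
    out.truncate(top_k);
    out
}
```

Truncating after the collapse rather than before is what lets `-k` count distinct files.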

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):
  before (path-stem boost)                      0.7951
  after  (this PR)                              0.7955   +0.0004
The collapse-by-file is a UX change (distinct files in output);
removal of the legacy demotion was explicitly requested. NDCG@10 is
held flat, which confirms the new ranking signals fully cover what the
legacy demotion was doing.
The whole-token path-stem boost already captures direct matches like
"interceptor manager" → `InterceptorManager.js`, but it misses
morphological variants the dense / BM25 retrievers also under-weight:

  query    file stem      whole-token  prefix
  --------------------------------------------
  config   configuration  no           yes
  parse    parser         no           yes
  intercept interceptor   no           yes
  depend   dependencies   no           yes

Add a half-strength tier (`0.2 * max_score` instead of `0.4`) for any
candidate whose file stem contains a token that shares a ≥3-char prefix
with a query token in either direction.  Exact whole-token matches
still win; prefix matches only fire if no exact match did, so a single
file isn't double-boosted.  3-char minimum keeps one- and two-letter
prefixes (`fn`, `to`, `is`, ...) from producing spurious hits.
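
One possible reading of the two-tier rule, as a sketch. It interprets "shares a ≥3-char prefix in either direction" as "one token is a prefix of the other and the shorter token has at least 3 characters", which is consistent with `config`/`configuration` and `parse`/`parser` but is an interpretation; a looser shared-prefix rule would be needed for a pair like `depend`/`dependencies` only if the stem were not normalised first. Function names are hypothetical and tokens are assumed to be ASCII.

```rust
/// True when the shorter token is a prefix of the longer one and is at
/// least `min_len` bytes long (ASCII tokens assumed).
fn is_prefix_pair(a: &str, b: &str, min_len: usize) -> bool {
    let (short, long) = if a.len() <= b.len() { (a, b) } else { (b, a) };
    short.len() >= min_len && long.starts_with(short)
}

/// Hypothetical sketch: return the boost fraction for one file stem.
/// Exact token matches win; the prefix tier only fires when no exact
/// match did, so a file is never boosted twice.
fn stem_boost_frac_sketch(stem_tokens: &[&str], query_tokens: &[&str]) -> f32 {
    const EXACT: f32 = 0.40;
    const PREFIX: f32 = 0.20;
    const MIN_PREFIX_LEN: usize = 3;
    if stem_tokens.iter().any(|s| query_tokens.contains(s)) {
        return EXACT;
    }
    let prefix_hit = stem_tokens
        .iter()
        .any(|s| query_tokens.iter().any(|q| is_prefix_pair(s, q, MIN_PREFIX_LEN)));
    if prefix_hit { PREFIX } else { 0.0 }
}
```

The returned fraction would then be multiplied by `max_score`, as in the exact-match tier.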

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank):
  before (collapse + legacy removed)            0.7955
  after  (this PR)                              0.8091   +0.0136
Three correctness bugs surfaced by a head-to-head audit against semble's
ranking pipeline, plus three small principled improvements ported from
semble's `boosting.py`.

## Bugs

1. **`commands/search.rs` `(file, line)` dedup was incorrect.** The two
   `search_hybrid_with_embedding` calls (full index + text-filtered
   subset) each run `collapse_by_file` internally, which sets
   `unit.line = min(line_i)` over *their own* candidate pool. Two pools
   → two different mins → same file occupied two top-K slots, wasting
   one slot per duplicate. Now dedupes by file alone and merges spans
   (`line = min`, `end_line = max`, `score = max`) across both calls.

2. **`commands/search.rs` reused full-index FTS5 hits under a subset
   filter.** The text-filtered subset re-search received the global
   FTS5 result list (top_k * 3 = 30 hits), which then got *filtered*
   to the subset — often leaving only a handful of BM25 candidates.
   Now passes `None` so FTS5 is refetched inside the subset.

3. **`index/mod.rs` score / id misalignment.** `filtering::get`
   silently drops rows whose `_subset_` id has no METADATA row (stale
   FTS5 references), then `metadata.into_iter().zip(fused_scores)`
   shifted every following score onto the wrong unit. Now zips by id
   via a `HashMap<i64, Value>` lookup; missing ids are skipped, score
   alignment is preserved.
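
The third bug is easy to reproduce in isolation. The sketch below uses hypothetical types (a `String` stands in for the metadata row) and a hypothetical function name; it shows the id-keyed lookup that keeps scores aligned when a metadata row is missing.

```rust
use std::collections::HashMap;

/// Pair each fused (id, score) with its metadata by id. Ids with no
/// metadata row are skipped, so later scores are not shifted.
fn align_scores_sketch(
    fused: &[(i64, f32)],
    metadata: Vec<(i64, String)>,
) -> Vec<(String, f32)> {
    let meta_by_id: HashMap<i64, String> = metadata.into_iter().collect();
    fused
        .iter()
        .filter_map(|&(id, score)| meta_by_id.get(&id).map(|m| (m.clone(), score)))
        .collect()
}
```

With fused ids 1, 2, 3 scoring 0.9, 0.8, 0.7 and metadata rows only for ids 1 and 3, a positional zip would pair the row for id 3 with 0.8 (the score of the missing id 2). The lookup above pairs it with 0.7.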

## Ranking improvements (ported from semble)

- **Stopword filter in stem-boost keyword set.** Common NL words like
  "how", "the", "of" are filtered before computing path-stem matches
  so `how_to.py` doesn't get a free hit on "how to authenticate".
  Toggle: `COLGREP_STEM_STOPWORDS=0`.
- **Plural / snake-case-normalised stem comparison.** Adds
  `dependencies` ↔ `dependency`, `my_func` ↔ `myfunc`. Toggle:
  `COLGREP_STEM_PLURAL_SNAKE=0`.
- **Proportional-ratio stem boost** (off by default, available via
  `COLGREP_STEM_PROPORTIONAL=1`). Ablated and found to *cost* ~0.012
  NDCG@10 because the dense + BM25 retrievers already weight signal
  density; adding *another* dampening on the stem boost over-discounts
  multi-keyword NL queries where only one keyword names the file.
  Kept as a toggle for future tuning.
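
For illustration, the first two items might look like the sketch below. The stopword list and the plural rules (`ies` to `y`, drop a trailing `s` unless the word ends in `ss`) are assumptions chosen to reproduce the examples above, not the lists used by colgrep or semble, and both function names are hypothetical.

```rust
/// Hypothetical normaliser: lowercase, drop underscores, and strip a
/// simple English plural so `dependencies` and `dependency` compare equal.
fn normalize_stem_token_sketch(token: &str) -> String {
    let t: String = token.to_lowercase().replace('_', "");
    if let Some(base) = t.strip_suffix("ies") {
        if !base.is_empty() {
            return format!("{base}y");
        }
    }
    if t.len() > 3 && t.ends_with('s') && !t.ends_with("ss") {
        return t[..t.len() - 1].to_string();
    }
    t
}

/// Hypothetical stopword filter applied to query tokens before stem matching.
fn keywords_sketch<'a>(query_tokens: &[&'a str]) -> Vec<&'a str> {
    const STOPWORDS: [&str; 5] = ["how", "the", "of", "to", "a"];
    query_tokens
        .iter()
        .copied()
        .filter(|t| !STOPWORDS.contains(t))
        .collect()
}
```

With this filter, "how to authenticate" reduces to the single keyword `authenticate`, so a file named `how_to.py` no longer matches on `how` or `to`.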

## Bench-harness plumbing (additive, off the hot path)

- `COLGREP_DATA_DIR`: override the index-cache directory so concurrent
  benchmark processes on different GPUs don't fight over
  `~/.local/share/colgrep/indices`.
- `COLGREP_ALPHA`, `COLGREP_DEF_BOOST`, `COLGREP_STEM_BOOST`,
  `COLGREP_STEM_PREFIX_BOOST`, `COLGREP_COHERENCE_BOOST`,
  `COLGREP_STRONG_PENALTY`, `COLGREP_MODERATE_PENALTY`,
  `COLGREP_MILD_PENALTY`: read at runtime so a grid search can sweep
  without rebuilding.
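
A runtime override of this kind is a few lines with the standard library. The sketch below is an assumption about the shape of such a helper, not the code in colgrep; `parse_override` and `knob_sketch` are hypothetical names, and falling back to the default on an unparsable value is a design choice made here.

```rust
/// Parse an optional raw string as f32, falling back to `default` when the
/// value is missing or not a valid number.
fn parse_override(raw: Option<&str>, default: f32) -> f32 {
    raw.and_then(|v| v.trim().parse::<f32>().ok())
        .unwrap_or(default)
}

/// Read a tunable from the environment at call time.
fn knob_sketch(name: &str, default: f32) -> f32 {
    parse_override(std::env::var(name).ok().as_deref(), default)
}
```

For example, `knob_sketch("COLGREP_ALPHA", 0.60)` would return 0.60 when the variable is unset and 0.55 when the process is started with `COLGREP_ALPHA=0.55`.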

## Defaults

- `hybrid_alpha`: 0.55 → 0.60 (semble bench peak is now a broad plateau
  across 0.55–0.70; 0.60 is the empirical maximum).
- `DEFINITION_BOOST_FRAC`: 0.5 → 0.25 (grid-searched).

Bench (LateOn-Code-edge, semble bench, NDCG@10 with file_rank, alpha=0.60):
  before this PR  0.8208 (path-stem + collapse, alpha=0.55, def_boost=0.5)
  after  this PR  0.8313     +0.0105

Per-knob ablation at alpha=0.65 (NDCG@10):
  baseline + bug fixes only                                  0.8297
  + stopword filter                                          0.8303
  + plural/snake                                             0.8299
  + stopwords + plural/snake                                 0.8305
`output` was in `IGNORED_DIRS` to skip generic build-output directories,
but the literal directory-name match also drops genuine source folders
that happen to use the same name.  Concretely, nlohmann-json's
`include/nlohmann/detail/output/` module ships `serializer.hpp` and
`binary_writer.hpp` — both vanished from the index before this change.

Removed `output` from the default list.  `target`, `build`, `dist`,
`out`, `bin`, `obj` are kept (more reliably build artefacts) and users
whose project genuinely outputs to `output/` can re-add the pattern via
the `extra_ignore` config knob.

Before / after, nlohmann-json on `include/nlohmann/`:
  44 → 47 files indexed
  0 → 60 units under `detail/output/serializer.hpp`
  0 → 5 units under `detail/output/binary_writer.hpp`
Single-line JSON to stderr at each pipeline stage when
`COLGREP_TRACE=1`: semantic top-20, bm25 top-20, fused top-20,
post-path-penalty, post-path-stem-boost, post-definition-boost,
post-coherence-boost, final.  No-op when the env var is unset so the
calls are free on the hot path.

Used by `benchmarks/trace_worst_queries.py` to replay queries where the
target file was indexed but missed top-10, so we can see *which* stage
demotes it.
A/B sweeps across all 63 semble benchmark repos showed all three
variants are net losses compared to the binary-fire baseline:

  STEM_PROPORTIONAL=1 (best base 1.0): 0.821 vs 0.825 (-0.004)
  DEF_PROPORTIONAL=1  (best base 0.40): 0.824 vs 0.825 (-0.001)
  STEM_PARENT_DIR=1:                    0.817 vs 0.825 (-0.008)

Per-repo deltas explain why: proportional weighting hurts symbol-heavy
repos (zig -0.114, zls -0.111, newtonsoft-json -0.101, nvm -0.099)
because their short identifier-style queries rely on a single-keyword
stem hit being the strong signal. The proportional ratio discounts
that hit even when it's the right one.

Removed env vars:
  COLGREP_STEM_PROPORTIONAL, COLGREP_STEM_MIN_RATIO
  COLGREP_DEF_PROPORTIONAL,  COLGREP_DEF_MIN_RATIO
  COLGREP_STEM_PARENT_DIR

apply_path_stem_boost and apply_definition_boost are back to plain
binary-fire on the proven baseline.
@raphaelsty raphaelsty force-pushed the feat/identifier-aware-bm25 branch from 8743301 to 3b6e250 Compare May 19, 2026 17:21
Three CI failures on `feat/identifier-aware-bm25`:

* `cargo fmt --check`: trace-log call sites, the `SearchResult { unit,
  score }` literal in `search_hybrid_with_embedding`, the final
  `sort_by` lambda, the `trace_enabled` matches!, and a handful of
  spots in `next-plaid::text_search` and `colgrep::ranking` that
  rustfmt now wants on multiple lines. Applied `cargo fmt --all`.
* `cargo clippy --all-targets -- -D warnings` on the workspace:
  - `colgrep::ranking::apply_definition_boost` and
    `apply_path_stem_boost` used `for i in 0..items.len()` purely to
    index `items[i]` — switched to `items.iter_mut()` and reborrow
    `&*item` for the read-only closures so the body compiles
    unchanged.
  - `next-plaid-api::tracing_middleware` used
    `unwrap_or_else(TraceId::new)`; replaced with `unwrap_or_default()`
    since `TraceId` already has a `Default` impl.
* `cargo doc -D warnings`: the `fts5_search` doc-comment linked
  `[FtsTokenizer::IdentifierAware]` which doesn't resolve from inside
  `colgrep`; qualified to `[next_plaid::FtsTokenizer::IdentifierAware]`.
@raphaelsty raphaelsty force-pushed the feat/identifier-aware-bm25 branch from 6927646 to 0f6ea29 Compare May 19, 2026 17:34
@raphaelsty raphaelsty self-assigned this May 19, 2026
@raphaelsty raphaelsty merged commit e6d182b into main May 19, 2026
20 checks passed