
test(bench): pin benchmark repos to fixed commits for reproducible metrics #236

Merged
manojmallick merged 1 commit into develop from chore/pin-benchmark-repos
Jun 9, 2026



@manojmallick

Owner

Summary

  • Fixes the root cause behind the v6.15.0 hit@5 "drop" (81.1% → 75.6%): the drop came from a moving benchmark corpus, not from a regression in SigMap.

Why

The retrieval/token benchmark cloned each repo with `git clone --depth 1 --single-branch`, which always fetches the latest upstream HEAD. With byte-for-byte identical SigMap ranking code, hit@5 measured 81.1% at v6.11 and 75.6% at v6.15.0 purely because the cloned repo contents changed. The number was also carried forward unchanged through v6.12/v6.13 (only the benchmark_id label was bumped), so v6.15.0 was the first real re-measurement since v6.11. Net effect: the metric was neither reproducible nor a trustworthy regression signal.

Changes

  • scripts/run-benchmark.mjs: each of the 21 benchmark repos now carries a `commit:` pin (the exact corpus the v6.15.0 numbers were measured on). A new fetchPinned() fetches by SHA on clone (GitHub allows fetching by SHA; verified). Existing caches are re-checked-out to the pinned commit; if a pin can't be fetched, the run falls back to the default branch with a WARN, since results may then drift.
  • docs-vp/guide/benchmark.md: documents the frozen-corpus methodology.
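
A minimal sketch of how a SHA-pinned fetch with a default-branch fallback could look. It assumes each repo entry has `url`, `dir`, and `commit` fields and that git is driven through an injected `run` callback. The name `syncPinned`, the entry shape, and the `run` signature are hypothetical; the actual `fetchPinned()` in scripts/run-benchmark.mjs may be structured differently.

```javascript
// Hypothetical sketch; not the actual fetchPinned() from scripts/run-benchmark.mjs.
// `run(args)` is an assumed callback that executes `git <args>` and throws on failure.
function syncPinned(repo, run, warn = console.warn) {
  const git = (...args) => run(['-C', repo.dir, ...args]);

  // Re-running `git init` on an existing repository is harmless,
  // so the same path can serve fresh clones and existing caches.
  run(['init', '--quiet', repo.dir]);

  let pinned = true;
  try {
    // Fetch exactly one commit, addressed by its SHA rather than a branch name.
    git('fetch', '--depth', '1', repo.url, repo.commit);
  } catch (err) {
    pinned = false;
    warn(`WARN: could not fetch pin ${repo.commit} for ${repo.url}; ` +
         'using default branch (results may drift)');
    // Fallback: the remote's HEAD, i.e. the tip of its default branch.
    git('fetch', '--depth', '1', repo.url, 'HEAD');
  }

  // FETCH_HEAD points at whatever the last successful fetch retrieved.
  git('checkout', '--detach', 'FETCH_HEAD');
  return pinned;
}
```

In Node, `run` could be a thin wrapper around `execFileSync('git', args)` from `node:child_process`. Injecting it keeps the sequence of git commands inspectable without touching the network.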

Effect

hit@5 / token reduction now move only when SigMap's own ranking/extraction changes — a true release-over-release signal.
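
For context, hit@k is commonly computed as the share of queries whose top-k results include at least one expected item. The sketch below follows that common definition. The `hitAtK` name and the `{ ranked, expected }` case shape are hypothetical, and SigMap's benchmark may define or compute the metric differently.

```javascript
// Assumed definition: a query "hits" if any expected file appears in its top-k results.
function hitAtK(cases, k = 5) {
  if (cases.length === 0) return 0;
  const hits = cases.filter(({ ranked, expected }) =>
    ranked.slice(0, k).some((file) => expected.includes(file))
  ).length;
  return hits / cases.length;
}

// Illustrative data only, not taken from the benchmark.
const cases = [
  { ranked: ['a.js', 'b.js'], expected: ['b.js'] }, // hit
  { ranked: ['c.js', 'd.js'], expected: ['z.js'] }, // miss
];
console.log(hitAtK(cases)); // 0.5
```

Because `ranked` depends on which files exist in the cloned repo, the score can shift when the corpus changes even if the ranking code does not, which is why the corpus has to be frozen.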

Test plan

  • node -c scripts/run-benchmark.mjs
  • SHA-fetch verified against GitHub (fresh clone at a pinned commit)
  • --skip-clone run: all 21 repos report "present at pinned commit"
  • retrieval benchmark reproduces 75.6% on the frozen corpus
  • VitePress docs build passes
  • no tests reference the benchmark script internals
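
The `--skip-clone` check in the list above can be pictured as comparing each cached checkout's HEAD with its pin. A minimal sketch, assuming the HEAD SHA is obtained elsewhere (for example from `git rev-parse HEAD`) and that pins are full SHAs. The `pinStatus` name and the wording of the mismatch message are hypothetical; only the "present at pinned commit" string comes from the test plan.

```javascript
// Hypothetical helper: report whether a cached repo sits on its pinned commit.
// `repo.commit` is the pin; `headSha` is the raw output of `git rev-parse HEAD`.
function pinStatus(repo, headSha) {
  const head = headSha.trim().toLowerCase();
  const pin = repo.commit.trim().toLowerCase();
  return head === pin
    ? 'present at pinned commit'
    : `not at pinned commit (HEAD ${head.slice(0, 7)}, pin ${pin.slice(0, 7)})`;
}
```

Trimming matters because command output usually ends with a newline.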

🤖 Generated with Claude Code

…trics

The retrieval/token benchmark cloned each repo with `git clone --depth 1
--single-branch` (always latest upstream HEAD), so the corpus silently drifted
between releases. With identical SigMap code, hit@5 read 81.1% at v6.11 and
75.6% now purely because the cloned repo contents changed — making the metric
non-reproducible and a poor regression signal.

Pin all 21 repos to fixed commits (the corpus the v6.15.0 numbers were measured
on) and fetch them by SHA on clone (GitHub allows SHA fetch). Existing caches
are re-checked-out to the pinned commit; if a pin can't be fetched the run
degrades to the default branch with a warning. Now hit@5 only moves when
SigMap's ranking/extraction changes.

- scripts/run-benchmark.mjs: commit pins + fetchPinned() + clone/cache logic
- docs-vp/guide/benchmark.md: document the frozen-corpus methodology
@manojmallick manojmallick merged commit d232c80 into develop Jun 9, 2026
5 checks passed
@manojmallick manojmallick deleted the chore/pin-benchmark-repos branch June 9, 2026 10:04
