test(bench): pin benchmark repos to fixed commits for reproducible metrics by manojmallick · Pull Request #236 · manojmallick/sigmap

manojmallick · 2026-06-09T10:02:32Z

Summary

Fixes the root cause behind the v6.15.0 hit@5 "drop" (81.1% → 75.6%): the drop came from a benchmark corpus that was a moving target, not from a regression in SigMap.

Why

The retrieval/token benchmark cloned each repo with git clone --depth 1 --single-branch — always the latest upstream HEAD. So with byte-for-byte identical SigMap ranking code, hit@5 measured 81.1% at v6.11 and 75.6% now purely because the cloned repo contents changed. The number was also carried forward unchanged through v6.12/v6.13 (only the benchmark_id label was bumped), so v6.15.0 was the first real re-measurement since v6.11. Net effect: the metric was neither reproducible nor a trustworthy regression signal.
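To illustrate how corpus drift alone can move the number, here is a minimal sketch. It assumes the common definition of hit@k (the fraction of queries with at least one expected file among the top k results); SigMap's exact scoring is not shown in this PR. `hitAtK`, `makeRanker`, and the toy corpora are hypothetical names and data for illustration only.

```javascript
// Hypothetical toy ranker: orders files by how many query words appear in the path.
// This is NOT SigMap's ranking; it only stands in for "identical ranking code".
function makeRanker(corpus) {
  return (text) => {
    const words = text.toLowerCase().split(/\s+/);
    const score = (path) =>
      words.filter((w) => path.toLowerCase().includes(w)).length;
    // Copy before sorting so the corpus array is not mutated.
    return [...corpus].sort((a, b) => score(b) - score(a));
  };
}

// Assumed definition: a query is a hit if any expected file is in the top k.
function hitAtK(queries, rank, k = 5) {
  if (queries.length === 0) return 0;
  let hits = 0;
  for (const q of queries) {
    const top = rank(q.text).slice(0, k);
    if (top.some((path) => q.expected.includes(path))) hits += 1;
  }
  return hits / queries.length;
}

const queries = [
  { text: 'router', expected: ['src/router.js'] },
  { text: 'login', expected: ['src/auth/login.js'] },
];

// Same queries, same ranker; only the cloned contents differ.
const corpusAtPin = ['src/router.js', 'src/auth/login.js', 'README.md'];
const corpusAtHead = ['src/router.js', 'src/auth/session.js', 'README.md']; // file renamed upstream

const pinnedScore = hitAtK(queries, makeRanker(corpusAtPin), 1);
const driftedScore = hitAtK(queries, makeRanker(corpusAtHead), 1);
console.log(pinnedScore, driftedScore);
```

In this toy setup the second query can no longer hit once the expected file disappears upstream, so the score falls even though the ranking code is unchanged.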

Changes

scripts/run-benchmark.mjs: every one of the 21 benchmark repos now carries a commit: pin (the exact corpus the v6.15.0 numbers were measured on). New fetchPinned() fetches by SHA on clone (GitHub allows SHA fetch — verified). Existing caches are re-checked-out to the pinned commit; if a pin can't be fetched, the run degrades to the default branch with a WARN (results may drift).
docs-vp/guide/benchmark.md: documents the frozen-corpus methodology.
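The clone/cache behaviour described above could look roughly like the following. This is a sketch under stated assumptions, not the code in scripts/run-benchmark.mjs: only the name `fetchPinned()` and the `commit` field come from this PR. The parameter list, the injected `git` runner, the `hasCache` flag, the return shape, and the warning text are all hypothetical.

```javascript
// Sketch only: `git` is an injected function that runs `git <args>` and throws on failure.
// `repo` is assumed to look like { url, commit }; `dir` is the local clone directory.
function fetchPinned(repo, dir, git, hasCache = false) {
  try {
    if (!hasCache) {
      // Fresh clone: start from an empty repository so only the pin is downloaded.
      git(['init', '--quiet', dir]);
      git(['-C', dir, 'remote', 'add', 'origin', repo.url]);
    }
    // Fetch exactly the pinned commit by SHA, then check it out detached.
    // For an existing cache this re-checks-out the pinned commit.
    git(['-C', dir, 'fetch', '--depth', '1', 'origin', repo.commit]);
    git(['-C', dir, 'checkout', '--quiet', '--detach', 'FETCH_HEAD']);
    return { pinned: true, commit: repo.commit };
  } catch (err) {
    // Degrade to the remote's default branch and say so loudly.
    console.warn(
      `WARN: could not fetch pin ${repo.commit} for ${repo.url}; ` +
        'using default branch (results may drift)'
    );
    git(['-C', dir, 'fetch', '--depth', '1', 'origin', 'HEAD']);
    git(['-C', dir, 'checkout', '--quiet', '--detach', 'FETCH_HEAD']);
    return { pinned: false, commit: null };
  }
}
```

In a real script the runner could be `(args) => execFileSync('git', args, { stdio: 'inherit' })` using `execFileSync` from `node:child_process`. Injecting it keeps the sketch testable without network access.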

Effect

hit@5 / token reduction now move only when SigMap's own ranking/extraction changes — a true release-over-release signal.

Test plan

node -c scripts/run-benchmark.mjs
SHA-fetch verified against GitHub (fresh clone at a pinned commit)
--skip-clone run: all 21 repos report "present at pinned commit"
retrieval benchmark reproduces 75.6% on the frozen corpus
VitePress docs build passes
no tests reference the benchmark script internals
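The "present at pinned commit" check in the test plan can be approximated by comparing each cached clone's HEAD with its pin. The sketch below is an assumption about how such a check might be written: `reportPins`, the `headOf` callback, the repo object shape, and the wording for a mismatch are hypothetical. Only the "present at pinned commit" message is taken from the plan above.

```javascript
// Sketch only: `headOf(dir)` is an injected function returning the full SHA of HEAD in `dir`.
// Each repo is assumed to look like { name, dir, commit }.
function reportPins(repos, headOf) {
  return repos.map((repo) => {
    const head = headOf(repo.dir);
    const ok = head === repo.commit;
    return {
      name: repo.name,
      ok,
      status: ok ? 'present at pinned commit' : `not at pinned commit (HEAD is ${head})`,
    };
  });
}
```

A real `headOf` could wrap `git -C <dir> rev-parse HEAD`, for example through `execFileSync` from `node:child_process` with the output trimmed.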

🤖 Generated with Claude Code

…trics

The retrieval/token benchmark cloned each repo with `git clone --depth 1 --single-branch` (always latest upstream HEAD), so the corpus silently drifted between releases. With identical SigMap code, hit@5 read 81.1% at v6.11 and 75.6% now purely because the cloned repo contents changed, making the metric non-reproducible and a poor regression signal.

Pin all 21 repos to fixed commits (the corpus the v6.15.0 numbers were measured on) and fetch them by SHA on clone (GitHub allows SHA fetch). Existing caches are re-checked-out to the pinned commit; if a pin can't be fetched the run degrades to the default branch with a warning.

Now hit@5 only moves when SigMap's ranking/extraction changes.

- scripts/run-benchmark.mjs: commit pins + fetchPinned() + clone/cache logic
- docs-vp/guide/benchmark.md: document the frozen-corpus methodology

manojmallick merged commit d232c80 into develop Jun 9, 2026
5 checks passed

manojmallick deleted the chore/pin-benchmark-repos branch June 9, 2026 10:04

manojmallick merged 1 commit into develop from chore/pin-benchmark-repos
