chore(benchmark): archive snapshot + purge x-ai/grok-4.1-fast raw data #230
Merged
…st raw data
xAI deprecated grok-4.1-fast upstream (every call returns 404 with the
"switch to Grok 4.3" advisory) and grok-4.3 is now the live entry. Keeping
780 dead grok-4.1-fast rows in calls.jsonl was bloating the prompt/response
tree by 903 files without adding any benchmark signal.
Before deleting, snapshot the report with grok-4.1-fast still shown as an
active model so we keep the comparative numbers (F1 0.642 at $0.1509/ep,
rank 2 overall, rank 1 F1/$) on disk:
    results/archive/2026-05-16/
      report.md
      report_assets/   (full 32-model chart set)
Then purge:
- 780 grok-4.1-fast rows removed from calls.jsonl (20,492 -> 19,712)
- 780 corresponding response files deleted
- 123 prompt files (unique to grok-4.1-fast hashes) deleted
- live results/report.md regenerated -- no grok-4.1-fast anywhere,
and the Deprecated Models footnote drops out since there is no
aggregate data left to render
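The purge steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the script actually used for this PR: it assumes each row of calls.jsonl is a JSON object with hypothetical `model` and `prompt_hash` fields, and the helper name `purge_model` is made up for illustration.

```python
import json


def purge_model(lines, model):
    """Split JSONL lines into kept rows, removed rows, and orphaned prompt hashes.

    Assumes (hypothetically) that each row has "model" and "prompt_hash" keys.
    """
    kept, removed = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        (removed if row.get("model") == model else kept).append(row)

    # A prompt file is safe to delete only if no surviving row references it.
    kept_hashes = {row["prompt_hash"] for row in kept}
    orphaned = {row["prompt_hash"] for row in removed} - kept_hashes
    return kept, removed, orphaned


sample = [
    '{"model": "x-ai/grok-4.1-fast", "prompt_hash": "a1"}',
    '{"model": "x-ai/grok-4.1-fast", "prompt_hash": "b2"}',
    '{"model": "other/model", "prompt_hash": "a1"}',
]
kept, removed, orphaned = purge_model(sample, "x-ai/grok-4.1-fast")
print(len(kept), len(removed), sorted(orphaned))
```

With the numbers in this PR, the same split removes 780 rows (20,492 -> 19,712) and leaves 123 prompt files unreferenced, which together with the 780 response files accounts for the 903 deleted files.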
The benchmark.toml entry stays as deprecated=true so the runner skips it
on any future sweep (no new data sneaks back in). A local backup of the
pre-purge calls.jsonl sits at results/raw/calls.jsonl.bak-pre-grok-purge
and is intentionally not committed.
No production runtime impact, no version bump, no CHANGELOG.
Summary
xAI deprecated `grok-4.1-fast` upstream (every call returns 404 with the "switch to Grok 4.3" advisory) and `grok-4.3` is the live xAI entry. Keeping 780 dead `grok-4.1-fast` rows in `calls.jsonl` was bloating the prompt/response tree by 903 files without adding any benchmark signal.

Before purging, this PR snapshots the report with `grok-4.1-fast` shown as an active model so the comparative numbers stay on disk (F1 0.642 at $0.1509/ep, rank 2 overall by F1, rank 1 by F1-per-dollar):

- `results/archive/2026-05-16/report.md`
- `results/archive/2026-05-16/report_assets/` (full 32-model chart set)

Then purges:

- 780 `grok-4.1-fast` rows removed from `calls.jsonl` (20,492 -> 19,712)
- 780 corresponding response files deleted
- 123 prompt files (unique to `grok-4.1-fast` hashes) deleted
- live `results/report.md` regenerated -- no `grok-4.1-fast` anywhere, and the Deprecated Models footnote drops out since there is no aggregate data left to render

The `benchmark.toml` entry stays as `deprecated = true` so the runner skips it on any future sweep (no new data sneaks back in). A local backup of the pre-purge `calls.jsonl` sits at `results/raw/calls.jsonl.bak-pre-grok-purge` and is intentionally not committed.

Benchmark-tooling / data only (`benchmarks/` is dockerignored). No `version.py` bump, no CHANGELOG entry.

Test plan

- `grep -c "grok-4.1-fast" benchmarks/llm/results/report.md` returns `0`
- `grep -c "grok-4.1-fast" benchmarks/llm/results/archive/2026-05-16/report.md` returns `26` (historical snapshot intact)
- `wc -l benchmarks/llm/results/raw/calls.jsonl` returns `19712`
- Live `report.md` no longer has a `### Deprecated Models` section
- Archived `report.md` shows the full Deprecated-Models-free pre-purge picture (grok-4.1-fast as active model)
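The test plan checks above could also be scripted. The following is a minimal sketch rather than part of this PR: the helper names `count_matching_lines` and `count_lines` are hypothetical, and the demo runs on a throwaway file instead of the real report. Note that `grep -c` counts matching lines, not total occurrences, which the first helper mirrors.

```python
import tempfile
from pathlib import Path


def count_matching_lines(path, needle):
    """Like `grep -c`: number of lines containing needle, not total occurrences."""
    with open(path, encoding="utf-8") as fh:
        return sum(1 for line in fh if needle in line)


def count_lines(path):
    """Like `wc -l`: number of newline characters in the file."""
    return Path(path).read_bytes().count(b"\n")


# Demo on a temporary file standing in for report.md.
with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.md"
    report.write_text("grok-4.1-fast grok-4.1-fast\nother model\n", encoding="utf-8")
    matches = count_matching_lines(report, "grok-4.1-fast")
    total = count_lines(report)

print(matches, total)
```

Pointed at the real paths, the expected values would be the ones listed in the test plan: 0 matches in the live report, 26 in the archived one, and 19712 lines in `calls.jsonl`.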