chore(benchmark): archive snapshot + purge x-ai/grok-4.1-fast raw data by ttlequals0 · Pull Request #230 · ttlequals0/MinusPod

ttlequals0 · 2026-05-16T00:24:01Z

Summary

xAI deprecated grok-4.1-fast upstream (every call returns 404 with the "switch to Grok 4.3" advisory) and grok-4.3 is the live xAI entry. Keeping 780 dead grok-4.1-fast rows in calls.jsonl was bloating the prompt/response tree by 903 files without adding any benchmark signal.

Before purging, this PR snapshots the report with grok-4.1-fast shown as an active model so the comparative numbers stay on disk (F1 0.642 at $0.1509/ep, rank 2 overall by F1, rank 1 by F1-per-dollar):

results/archive/2026-05-16/
  report.md
  report_assets/   (full 32-model chart set)

Then purges:

780 grok-4.1-fast rows removed from calls.jsonl (20,492 -> 19,712)
780 corresponding response files deleted
123 prompt files (unique to grok-4.1-fast hashes) deleted
live results/report.md regenerated -- no grok-4.1-fast anywhere, and the Deprecated Models footnote drops out since there is no aggregate data left to render
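The purge steps above can be sketched roughly as follows. This is a minimal illustration, not the script used in this PR: it assumes each row of calls.jsonl is a JSON object with a `model` field and a `prompt_hash` field, and both field names are hypothetical. The helper names `split_rows` and `orphaned_prompt_hashes` are also made up for the example.

```python
import json

# Model id as written in the PR title; the value stored in calls.jsonl may differ.
MODEL_ID = "x-ai/grok-4.1-fast"


def split_rows(lines, model_id=MODEL_ID):
    """Partition JSONL lines into (kept, purged) on the assumed "model" field."""
    kept, purged = [], []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        row = json.loads(line)
        if row.get("model") == model_id:
            purged.append(row)
        else:
            kept.append(row)
    return kept, purged


def orphaned_prompt_hashes(kept, purged):
    """Prompt hashes referenced only by purged rows (assumed "prompt_hash" field)."""
    still_used = {row.get("prompt_hash") for row in kept}
    return {row.get("prompt_hash") for row in purged} - still_used
```

Only prompts whose hash no longer appears in any kept row are safe to delete. That would explain why 780 response files but only 123 prompt files were removed: a prompt shared with another model has to stay.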

The benchmark.toml entry stays as deprecated = true so the runner skips it on any future sweep (no new data sneaks back in). A local backup of the pre-purge calls.jsonl sits at results/raw/calls.jsonl.bak-pre-grok-purge and is intentionally not committed.

Benchmark-tooling / data only (benchmarks/ is dockerignored), so no production runtime impact. No version.py bump, no CHANGELOG entry.

Test plan

CI lint + tests pass
grep -c "grok-4.1-fast" benchmarks/llm/results/report.md returns 0
grep -c "grok-4.1-fast" benchmarks/llm/results/archive/2026-05-16/report.md returns 26 (historical snapshot intact)
wc -l benchmarks/llm/results/raw/calls.jsonl returns 19712
Live report.md no longer has a ### Deprecated Models section
Archive report.md shows the full pre-purge picture with no Deprecated Models section (grok-4.1-fast listed as an active model)
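The grep and wc checks above could also be scripted. The sketch below mirrors `grep -c` (count of matching lines, not of occurrences) and `wc -l` (count of newline characters) on text already read from disk. The function names are hypothetical. One difference: the unescaped dots in the grep pattern match any character, while this sketch matches the string literally, like `grep -F`.

```python
def grep_count(text, needle):
    """Count lines containing needle, like `grep -c` with a fixed string."""
    return sum(1 for line in text.splitlines() if needle in line)


def wc_lines(text):
    """Count newline characters, like `wc -l`."""
    return text.count("\n")


def has_heading(text, heading="### Deprecated Models"):
    """True if any line is exactly the given markdown heading."""
    return any(line.strip() == heading for line in text.splitlines())
```

Against the real files, the expected values from the test plan are `grep_count(...) == 0` for the live report, `26` for the archived report, and `wc_lines(...) == 19712` for calls.jsonl.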


ttlequals0 merged commit 53ac483 into main May 16, 2026
9 checks passed

ttlequals0 deleted the chore/archive-and-purge-grok-4.1-fast branch May 16, 2026 00:25
