
Add x-ai/grok-4.3 v2 benchmark results (xhigh) #25

Open
patelnav wants to merge 3 commits into petergpt:main from patelnav:add-grok-4-3

Conversation


@patelnav patelnav commented May 1, 2026

Status: ready for review

Adds x-ai/grok-4.3@reasoning=xhigh as a new published row in the v2 track. PR contains the full diff and is ready for review.

  • Phase 0 — pre-flight (config, dry-run gate)
  • Phase 1a — collect (100 q) + primary judge (Sonnet 4.6)
  • Phase 1b — additional judges (GPT-5.2 + Gemini 3.1 Pro Preview) + publish
  • Metadata (model_launch_dates.csv, model_params.csv)
  • Durable config promotion (config.v2.json) + CHANGELOG entry (2.0.12)
  • Re-publish to refresh data/v2/latest/ metadata snapshots after canonical rows landed (commit 1743413)
  • Local viewer smoke test against viewer/index.v2.html — page renders, zero console errors, grok-4.3 row appears at rank 22 with exact expected values (1.3167 / 52 / 30 / 18), version selector + model filter work

Final results

Run ID: run_20260501_175254 · Panel ID: run_20260501_175254_panel

Leaderboard row (data/v2/latest/leaderboard.csv):

22,x-ai/grok-4.3@reasoning=xhigh,x-ai,xhigh,1.3167,0.52,0.18,52,30,18,100,0

leaderboard_with_launch.csv row (after metadata fix):

22,x-ai/grok-4.3@reasoning=xhigh,x-ai,xhigh,1.3167,0.52,0.18,52,30,18,100,0,x-ai/grok-4.3,2026-04-30,1,https://openrouter.ai/x-ai/grok-4.3,closed,,,not_disclosed,proprietary
Mean consensus score: 1.3167
Green / Amber / Red: 52 / 30 / 18
Errors: 0

For context: lands between grok-4.20-beta@xhigh (1.27 / 54 green) and grok-4.20-multi-agent-beta@xhigh (1.43 / 64 green) — sensible position in the xAI family.
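The published row can be cross-footed against itself. A minimal sketch, assuming the column order implied by the values above (rank, model, vendor, effort, mean consensus, green rate, red rate, green/amber/red counts, total, errors — my inference, not the canonical schema):

```python
# Sanity-check the published leaderboard row. FIELDS is inferred from the
# row's values and is NOT guaranteed to match the repo's canonical schema.
import csv
import io

ROW = "22,x-ai/grok-4.3@reasoning=xhigh,x-ai,xhigh,1.3167,0.52,0.18,52,30,18,100,0"
FIELDS = ["rank", "model", "vendor", "effort", "mean_consensus",
          "green_rate", "red_rate", "green", "amber", "red", "total", "errors"]

row = dict(zip(FIELDS, next(csv.reader(io.StringIO(ROW)))))
green, amber, red, total = (int(row[k]) for k in ("green", "amber", "red", "total"))

# Counts must partition the 100 questions, and the rates must match the counts.
assert green + amber + red == total == 100
assert abs(green / total - float(row["green_rate"])) < 1e-9  # 0.52
assert abs(red / total - float(row["red_rate"])) < 1e-9      # 0.18
assert int(row["errors"]) == 0
print("row internally consistent:", row["model"])
```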

Cost details

| Stage | Spend | Notes |
| --- | --- | --- |
| Cost probe (6 calls, pre-run) | $0.030 | Sample-of-3 calls used to size the run before committing budget |
| Generation — Grok 4.3 xhigh (100 q) | $0.557 | 100/100 ok, max_attempt=1, 0 rate-limit requeues, 147,611 reasoning tokens (avg 1,476/sample), avg latency 29.3s |
| Judge — Sonnet 4.6 (medium reasoning) | $1.768 | 164,531 prompt + 84,933 completion + 45,169 reasoning tokens |
| Judge — GPT-5.2 (medium reasoning) | $0.627 | |
| Judge — Gemini 3.1 Pro Preview (medium reasoning) | $0.000* | OpenRouter usage.cost reports 0 for this preview tier — same behavior observed in the cost probe; likely unsurfaced rather than literally free |
| Total | ~$2.98 | Stayed well under the $10 OpenRouter key cap |

Token usage is fully captured in data/v2/latest/collection_stats.json::usage_summary and per-row in data/v2/latest/responses.jsonl.
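The per-row records can be cross-footed against the summary. A sketch, assuming each responses.jsonl line carries a usage object with a reasoning_tokens field — the actual field names may differ:

```python
# Recompute an aggregate token count from per-row JSONL records, for
# cross-checking against collection_stats.json::usage_summary.
# The "usage"/"reasoning_tokens" layout here is an assumption.
import json
import io

# Inline stand-in for data/v2/latest/responses.jsonl
sample = io.StringIO(
    '{"sample_id": "q001", "usage": {"reasoning_tokens": 1500}}\n'
    '{"sample_id": "q002", "usage": {"reasoning_tokens": 1452}}\n'
)
total_reasoning = sum(json.loads(line)["usage"]["reasoning_tokens"] for line in sample)
print(total_reasoning)  # → 2952
```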

Why a one-off config.grok-4.3.v2.json instead of config.new-models.v2.json

config.new-models.v2.json already contains other contributors' pending candidates (moonshotai/kimi-k2.6, z-ai/glm-5.1, qwen/qwen3.6-plus). Running run_end_to_end.sh against that config would have re-collected and additively merged duplicate rows for those models into the published v2 dataset. The one-off run config (gitignored, kept outside the repo per AGENTS.md §Benchmark config alignment) scoped this contribution strictly to x-ai/grok-4.3. The model is promoted to config.v2.json in this PR.
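For reference, the one-off config would look roughly like this. This is an illustrative sketch only — the real file is gitignored and its exact key names may differ; the contents are as described above (x-ai/grok-4.3, ["xhigh"], same 3-judge panel as config.v2.json):

```json
{
  "collect": {
    "models": ["x-ai/grok-4.3"],
    "model_reasoning_efforts": { "x-ai/grok-4.3": ["xhigh"] }
  },
  "judges": [
    "anthropic/claude-sonnet-4.6",
    "openai/gpt-5.2",
    "google/gemini-3.1-pro-preview"
  ]
}
```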

Reasoning variant rationale

Single xhigh variant matches the high-effort comparator from the existing x-ai/grok-4.20-beta family (["low","xhigh"]). The OpenRouter key budget excluded a second variant for the first pass; a none-effort ablation ($0.50 incremental) and the v1/55-question track ($1.10) are straightforward follow-ups if you'd like.

Provenance

  • Run pinned to upstream 4ff1d2dc48a880b195652438b9e98e5c3763eed5. Diff produced from a fork (patelnav/bullshit-benchmark) on branch add-grok-4-3.
  • Pipeline used unmodified: ./scripts/run_end_to_end.sh (split into Phase 1a collect + primary judge then Phase 1b --skip-collect --skip-primary-judge --with-additional-judges for an explicit cost-check pause point). No hand-rolled CSVs or aggregate edits.
  • Per-row OpenRouter generation IDs are in data/v2/latest/responses.jsonl::response_id. Aggregate rows in data/v2/latest/aggregate.jsonl join by sample_id. Spot-check any row against https://openrouter.ai/api/v1/generation?id=<response_id>.
  • num_runs: 1. No silent retries; if the run had errored, partial state would have been preserved and surfaced in this PR rather than silently re-run.
  • For independent verification with your own panel:
    ./scripts/run_end_to_end.sh \
      --config config.grok-4.3.v2.json \
      --skip-collect --skip-primary-judge --with-additional-judges \
      --run-id run_20260501_175254 --panel-id <new_panel_id> \
      --viewer-output-dir <your_dir>
    This re-runs the panel against the published data/v2/latest/responses.jsonl using your OpenRouter key. (You'd need a matching one-off config; happy to attach mine on request — content is x-ai/grok-4.3 + ["xhigh"] + the same 3-judge panel as config.v2.json.)
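The per-row spot-check described above can also be scripted. A sketch using a made-up response_id (the endpoint URL is the one quoted in the list; a real OpenRouter API key is needed to actually call it):

```python
# Build the OpenRouter generation-lookup URL for one response row.
# "gen-abc123" is a fabricated ID; real ones come from
# data/v2/latest/responses.jsonl::response_id.
import json
import urllib.parse

def generation_url(response_id: str) -> str:
    # Actually fetching this requires an Authorization: Bearer <key> header.
    return ("https://openrouter.ai/api/v1/generation?"
            + urllib.parse.urlencode({"id": response_id}))

row = json.loads('{"sample_id": "q042", "response_id": "gen-abc123"}')
print(generation_url(row["response_id"]))
# → https://openrouter.ai/api/v1/generation?id=gen-abc123
```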

Intentionally not in this PR

  • Raw runs/<run_id>/ gist. Everything needed for spot-checking is already in data/v2/latest/{responses.jsonl,aggregate.jsonl} (per-row OpenRouter generation IDs + verbatim judge justifications). If you want the raw run directory (intermediate grade dirs, retry log, partial files) attached as a gist, I'm happy to upload it — just say so.
  • Refreshed docs/images/* README screenshots. The 2.0.10 release refreshed those after a multi-row update; this PR adds a single row that doesn't materially change chart imagery. Happy to refresh if you'd prefer the README always reflect the latest dataset.

Files changed

  • data/v2/latest/* — 9 files; responses.jsonl and aggregate.jsonl get +100 rows each, leaderboard CSVs re-sort to place the new row at rank 22, the JSON manifests update counts, recent_additions.json reflects the new entry. Verified: responses_added=100, replaced=0, aggregate_added=100, replaced=0, no sample_id collisions on the initial publish.
  • data/model_metadata/model_launch_dates.csv — Grok 4.3 row, launch 2026-04-30 sourced from https://openrouter.ai/x-ai/grok-4.3 (matches the pattern used by recent DeepSeek V4 / Tencent Hy3 / Xiaomi MiMo rows).
  • data/model_metadata/model_params.csv — closed / not_disclosed / proprietary, mirroring existing xAI rows; xAI does not publicly disclose parameter counts for Grok 4.3.
  • config.v2.json — durable promotion: x-ai/grok-4.3 added to collect.models and model_reasoning_efforts: ["xhigh"].
  • .gitignore — minor additions for local secret/working files (separate prep commit).
  • CHANGELOG.md — 2.0.12 entry.

Test plan

  • AGENTS.md verification gate: JSON parse + collect --dry-run --limit 1 passed before live run
  • Live collect + primary judge: 100/100 ok, no retries, costs within budget
  • Live additional judges + publish: panel completed, responses_added=100 aggregate_added=100 *_replaced=0
  • panel_summary.json shows 3 judges + mean consensus
  • Exactly one new leaderboard row for x-ai/grok-4.3@reasoning=xhigh
  • git diff --stat -- data/v2/latest shows expected file set only (no unrelated touches)
  • Re-publish to fix metadata snapshots (leaderboard_with_launch.csv row populated, data/v2/latest/{model_launch_dates,model_params}.csv include grok-4.3)
  • Local viewer smoke test: page renders, zero console errors, grok-4.3 row visible at rank 22 with correct values, filters/version-selector functional
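The "exactly one new leaderboard row" check from the test plan can be sketched as follows, over stand-in rows (a hypothetical truncation of the real data/v2/latest/leaderboard.csv):

```python
# Assert the model appears exactly once in the published leaderboard.
import csv
import io

MODEL = "x-ai/grok-4.3@reasoning=xhigh"
# Two stand-in rows; in practice, read data/v2/latest/leaderboard.csv.
data = io.StringIO(
    "21,x-ai/grok-4.20-beta@reasoning=xhigh,x-ai,xhigh,1.27\n"
    "22,x-ai/grok-4.3@reasoning=xhigh,x-ai,xhigh,1.3167\n"
)
matches = [r for r in csv.reader(data) if r[1] == MODEL]
assert len(matches) == 1, "expected exactly one new leaderboard row"
print("rank:", matches[0][0])  # → rank: 22
```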

patelnav added 3 commits May 1, 2026 15:09
Adds patterns for local env files (.env, .env.*), per-contributor
working notes (*.local.md), and the one-off run config used for this
contribution (config.grok-4.3.*.json) plus a small cost-probe script
kept outside the published pipeline.

Adds x-ai/grok-4.3@reasoning=xhigh as a new published row in the v2
track. Run identifiers: run_20260501_175254 / run_20260501_175254_panel.

Pipeline used unmodified scripts/run_end_to_end.sh against a one-off
config (config.grok-4.3.v2.json, gitignored) scoped to a single model
to avoid re-collecting other contributors' pending candidates from
config.new-models.v2.json. AGENTS.md §Benchmark config alignment
permits a documented one-off; the model is promoted into the durable
config.v2.json in this commit.

3-judge panel: anthropic/claude-sonnet-4.6 + openai/gpt-5.2 +
google/gemini-3.1-pro-preview. panel_mode=full, consensus_method=mean.
Final consensus: 1.3167 (52 green / 30 amber / 18 red).

Per-row OpenRouter generation IDs are preserved in
data/v2/latest/responses.jsonl::response_id; aggregate rows in
data/v2/latest/aggregate.jsonl join by sample_id. Spot-check via
https://openrouter.ai/api/v1/generation?id=<response_id>.

- data/v2/latest/* — 100 responses + 100 aggregate rows added,
  0 replaced; leaderboard re-sort accounts for most diff churn
- data/model_metadata/model_launch_dates.csv — Grok 4.3 launch
  2026-04-30 sourced from OpenRouter model listing (matches the
  pattern used by recent DeepSeek/Tencent/Xiaomi rows)
- data/model_metadata/model_params.csv — closed/not_disclosed/
  proprietary, mirroring existing Grok rows
- config.v2.json — durable promotion with ["xhigh"] reasoning setting
- CHANGELOG.md — 2.0.12 entry

Re-runs publish_latest_to_viewer.sh against the existing run artifacts
(run_20260501_175254) to refresh the published metadata snapshots
after the canonical rows were added in the previous commit.

Without this, data/v2/latest/leaderboard_with_launch.csv carried the
new grok-4.3 row with blank launch-date and model-params fields, and
data/v2/latest/{model_launch_dates,model_params}.csv were missing the
new row entirely (the publish step had run before the canonical
metadata was added).

The supplemental merge replaces 100 responses/aggregate rows in place
(idempotent) and refreshes the metadata snapshots.
@patelnav patelnav marked this pull request as ready for review May 1, 2026 19:43

patelnav commented May 1, 2026

[Image: bsbench-v2-hero]
