
Add x-ai/grok-4.3 v2 benchmark results (xhigh) #25

Open
patelnav wants to merge 3 commits into petergpt:main from patelnav:add-grok-4-3

Conversation


@patelnav patelnav commented May 1, 2026

Status: ready for review

Adds x-ai/grok-4.3@reasoning=xhigh as a new published row in the v2 track. PR contains the full diff and is ready for review.

  • Phase 0 — pre-flight (config, dry-run gate)
  • Phase 1a — collect (100 q) + primary judge (Sonnet 4.6)
  • Phase 1b — additional judges (GPT-5.2 + Gemini 3.1 Pro Preview) + publish
  • Metadata (model_launch_dates.csv, model_params.csv)
  • Durable config promotion (config.v2.json) + CHANGELOG entry (2.0.12)
  • Re-publish to refresh data/v2/latest/ metadata snapshots after canonical rows landed (commit 1743413)
  • Local viewer smoke test against viewer/index.v2.html — page renders, zero console errors, grok-4.3 row appears at rank 22 with exact expected values (1.3167 / 52 / 30 / 18), version selector + model filter work

Final results

Run ID: run_20260501_175254 · Panel ID: run_20260501_175254_panel

Leaderboard row (data/v2/latest/leaderboard.csv):

22,x-ai/grok-4.3@reasoning=xhigh,x-ai,xhigh,1.3167,0.52,0.18,52,30,18,100,0

leaderboard_with_launch.csv row (after metadata fix):

22,x-ai/grok-4.3@reasoning=xhigh,x-ai,xhigh,1.3167,0.52,0.18,52,30,18,100,0,x-ai/grok-4.3,2026-04-30,1,https://openrouter.ai/x-ai/grok-4.3,closed,,,not_disclosed,proprietary
Mean consensus score: 1.3167
Green / Amber / Red: 52 / 30 / 18
Errors: 0

For context: lands between grok-4.20-beta@xhigh (1.27 / 54 green) and grok-4.20-multi-agent-beta@xhigh (1.43 / 64 green) — sensible position in the xAI family.
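The published row can be cross-footed against itself. A minimal sketch, assuming the column order implied by the values above (rank, model, vendor, effort, mean consensus, green rate, red rate, green/amber/red counts, total, errors — my inference, not the canonical schema):

```python
# Sanity-check the published leaderboard row. FIELDS is inferred from the
# row's values and is NOT guaranteed to match the repo's canonical schema.
import csv
import io

ROW = "22,x-ai/grok-4.3@reasoning=xhigh,x-ai,xhigh,1.3167,0.52,0.18,52,30,18,100,0"
FIELDS = ["rank", "model", "vendor", "effort", "mean_consensus",
          "green_rate", "red_rate", "green", "amber", "red", "total", "errors"]

row = dict(zip(FIELDS, next(csv.reader(io.StringIO(ROW)))))
green, amber, red, total = (int(row[k]) for k in ("green", "amber", "red", "total"))

# Counts must partition the 100 questions, and the rates must match the counts.
assert green + amber + red == total == 100
assert abs(green / total - float(row["green_rate"])) < 1e-9  # 0.52
assert abs(red / total - float(row["red_rate"])) < 1e-9      # 0.18
assert int(row["errors"]) == 0
print("row internally consistent:", row["model"])
```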

Cost details

| Stage | Spend | Notes |
| --- | --- | --- |
| Cost probe (6 calls, pre-run) | $0.030 | Sample-of-3 calls used to size the run before committing budget |
| Generation — Grok 4.3 xhigh (100 q) | $0.557 | 100/100 ok, max_attempt=1, 0 rate-limit requeues, 147,611 reasoning tokens (avg 1,476/sample), avg latency 29.3s |
| Judge — Sonnet 4.6 (medium reasoning) | $1.768 | 164,531 prompt + 84,933 completion + 45,169 reasoning tokens |
| Judge — GPT-5.2 (medium reasoning) | $0.627 | |
| Judge — Gemini 3.1 Pro Preview (medium reasoning) | $0.000* | OpenRouter usage.cost reports 0 for this preview tier — same behavior observed in the cost probe; likely unsurfaced rather than literally free |
| Total | ~$2.98 | Stayed well under the $10 OpenRouter key cap |

Token usage is fully captured in data/v2/latest/collection_stats.json::usage_summary and per-row in data/v2/latest/responses.jsonl.
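The per-row records can be cross-footed against the summary. A sketch, assuming each responses.jsonl line carries a usage object with a reasoning_tokens field — the actual field names may differ:

```python
# Recompute an aggregate token count from per-row JSONL records, for
# cross-checking against collection_stats.json::usage_summary.
# The "usage"/"reasoning_tokens" layout here is an assumption.
import json
import io

# Inline stand-in for data/v2/latest/responses.jsonl
sample = io.StringIO(
    '{"sample_id": "q001", "usage": {"reasoning_tokens": 1500}}\n'
    '{"sample_id": "q002", "usage": {"reasoning_tokens": 1452}}\n'
)
total_reasoning = sum(json.loads(line)["usage"]["reasoning_tokens"] for line in sample)
print(total_reasoning)  # → 2952
```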

Why a one-off config.grok-4.3.v2.json instead of config.new-models.v2.json

config.new-models.v2.json already contains other contributors' pending candidates (moonshotai/kimi-k2.6, z-ai/glm-5.1, qwen/qwen3.6-plus). Running run_end_to_end.sh against that config would have re-collected and additively merged duplicate rows for those models into the published v2 dataset. The one-off run config (gitignored, kept outside the repo per AGENTS.md §Benchmark config alignment) scoped this contribution strictly to x-ai/grok-4.3. The model is promoted to config.v2.json in this PR.
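For reference, the one-off config would look roughly like this. This is an illustrative sketch only — the real file is gitignored and its exact key names may differ; the contents are as described above (x-ai/grok-4.3, ["xhigh"], same 3-judge panel as config.v2.json):

```json
{
  "collect": {
    "models": ["x-ai/grok-4.3"],
    "model_reasoning_efforts": { "x-ai/grok-4.3": ["xhigh"] }
  },
  "judges": [
    "anthropic/claude-sonnet-4.6",
    "openai/gpt-5.2",
    "google/gemini-3.1-pro-preview"
  ]
}
```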

Reasoning variant rationale

Single xhigh variant matches the high-effort comparator from the existing x-ai/grok-4.20-beta family (["low","xhigh"]). The OpenRouter key budget excluded a second variant for the first pass; a none-effort ablation ($0.50 incremental) and the v1/55-question track ($1.10) are straightforward follow-ups if you'd like.

Provenance

  • Run pinned to upstream 4ff1d2dc48a880b195652438b9e98e5c3763eed5. Diff produced from a fork (patelnav/bullshit-benchmark) on branch add-grok-4-3.
  • Pipeline used unmodified: ./scripts/run_end_to_end.sh (split into Phase 1a collect + primary judge then Phase 1b --skip-collect --skip-primary-judge --with-additional-judges for an explicit cost-check pause point). No hand-rolled CSVs or aggregate edits.
  • Per-row OpenRouter generation IDs are in data/v2/latest/responses.jsonl::response_id. Aggregate rows in data/v2/latest/aggregate.jsonl join by sample_id. Spot-check any row against https://openrouter.ai/api/v1/generation?id=<response_id>.
  • num_runs: 1. No silent retries; if the run had errored, partial state would have been preserved and surfaced in this PR rather than silently re-run.
  • For independent verification with your own panel:
    ./scripts/run_end_to_end.sh \
      --config config.grok-4.3.v2.json \
      --skip-collect --skip-primary-judge --with-additional-judges \
      --run-id run_20260501_175254 --panel-id <new_panel_id> \
      --viewer-output-dir <your_dir>
    This re-runs the panel against the published data/v2/latest/responses.jsonl using your OpenRouter key. (You'd need a matching one-off config; happy to attach mine on request — content is x-ai/grok-4.3 + ["xhigh"] + the same 3-judge panel as config.v2.json.)
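The per-row spot-check described above can also be scripted. A sketch using a made-up response_id (the endpoint URL is the one quoted in the list; a real OpenRouter API key is needed to actually call it):

```python
# Build the OpenRouter generation-lookup URL for one response row.
# "gen-abc123" is a fabricated ID; real ones come from
# data/v2/latest/responses.jsonl::response_id.
import json
import urllib.parse

def generation_url(response_id: str) -> str:
    # Actually fetching this requires an Authorization: Bearer <key> header.
    return ("https://openrouter.ai/api/v1/generation?"
            + urllib.parse.urlencode({"id": response_id}))

row = json.loads('{"sample_id": "q042", "response_id": "gen-abc123"}')
print(generation_url(row["response_id"]))
# → https://openrouter.ai/api/v1/generation?id=gen-abc123
```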

Intentionally not in this PR

  • Raw runs/<run_id>/ gist. Everything needed for spot-checking is already in data/v2/latest/{responses.jsonl,aggregate.jsonl} (per-row OpenRouter generation IDs + verbatim judge justifications). If you want the raw run directory (intermediate grade dirs, retry log, partial files) attached as a gist, I'm happy to upload it — just say so.
  • Refreshed docs/images/* README screenshots. The 2.0.10 release refreshed those after a multi-row update; this PR adds a single row that doesn't materially change chart imagery. Happy to refresh if you'd prefer the README always reflect the latest dataset.

Files changed

  • data/v2/latest/* — 9 files; responses.jsonl and aggregate.jsonl get +100 rows each, leaderboard CSVs re-sort to place the new row at rank 22, the JSON manifests update counts, recent_additions.json reflects the new entry. Verified: responses_added=100, replaced=0, aggregate_added=100, replaced=0, no sample_id collisions on the initial publish.
  • data/model_metadata/model_launch_dates.csv — Grok 4.3 row, launch 2026-04-30 sourced from https://openrouter.ai/x-ai/grok-4.3 (matches the pattern used by recent DeepSeek V4 / Tencent Hy3 / Xiaomi MiMo rows).
  • data/model_metadata/model_params.csv — closed / not_disclosed / proprietary, mirroring existing xAI rows; xAI does not publicly disclose parameter counts for Grok 4.3.
  • config.v2.json — durable promotion: x-ai/grok-4.3 added to collect.models and model_reasoning_efforts: ["xhigh"].
  • .gitignore — minor additions for local secret/working files (separate prep commit).
  • CHANGELOG.md — 2.0.12 entry.

Test plan

  • AGENTS.md verification gate: JSON parse + collect --dry-run --limit 1 passed before live run
  • Live collect + primary judge: 100/100 ok, no retries, costs within budget
  • Live additional judges + publish: panel completed, responses_added=100 aggregate_added=100 *_replaced=0
  • panel_summary.json shows 3 judges + mean consensus
  • Exactly one new leaderboard row for x-ai/grok-4.3@reasoning=xhigh
  • git diff --stat -- data/v2/latest shows expected file set only (no unrelated touches)
  • Re-publish to fix metadata snapshots (leaderboard_with_launch.csv row populated, data/v2/latest/{model_launch_dates,model_params}.csv include grok-4.3)
  • Local viewer smoke test: page renders, zero console errors, grok-4.3 row visible at rank 22 with correct values, filters/version-selector functional
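The "exactly one new leaderboard row" check from the test plan can be sketched as follows, over stand-in rows (a hypothetical truncation of the real data/v2/latest/leaderboard.csv):

```python
# Assert the model appears exactly once in the published leaderboard.
import csv
import io

MODEL = "x-ai/grok-4.3@reasoning=xhigh"
# Two stand-in rows; in practice, read data/v2/latest/leaderboard.csv.
data = io.StringIO(
    "21,x-ai/grok-4.20-beta@reasoning=xhigh,x-ai,xhigh,1.27\n"
    "22,x-ai/grok-4.3@reasoning=xhigh,x-ai,xhigh,1.3167\n"
)
matches = [r for r in csv.reader(data) if r[1] == MODEL]
assert len(matches) == 1, "expected exactly one new leaderboard row"
print("rank:", matches[0][0])  # → rank: 22
```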

patelnav added 3 commits May 1, 2026 15:09
Adds patterns for local env files (.env, .env.*), per-contributor
working notes (*.local.md), and the one-off run config used for this
contribution (config.grok-4.3.*.json) plus a small cost-probe script
kept outside the published pipeline.

Adds x-ai/grok-4.3@reasoning=xhigh as a new published row in the v2
track. Run identifiers: run_20260501_175254 / run_20260501_175254_panel.

Pipeline used unmodified scripts/run_end_to_end.sh against a one-off
config (config.grok-4.3.v2.json, gitignored) scoped to a single model
to avoid re-collecting other contributors' pending candidates from
config.new-models.v2.json. AGENTS.md §Benchmark config alignment
permits a documented one-off; the model is promoted into the durable
config.v2.json in this commit.

3-judge panel: anthropic/claude-sonnet-4.6 + openai/gpt-5.2 +
google/gemini-3.1-pro-preview. panel_mode=full, consensus_method=mean.
Final consensus: 1.3167 (52 green / 30 amber / 18 red).

Per-row OpenRouter generation IDs are preserved in
data/v2/latest/responses.jsonl::response_id; aggregate rows in
data/v2/latest/aggregate.jsonl join by sample_id. Spot-check via
https://openrouter.ai/api/v1/generation?id=<response_id>.

- data/v2/latest/* — 100 responses + 100 aggregate rows added,
  0 replaced; leaderboard re-sort accounts for most diff churn
- data/model_metadata/model_launch_dates.csv — Grok 4.3 launch
  2026-04-30 sourced from OpenRouter model listing (matches the
  pattern used by recent DeepSeek/Tencent/Xiaomi rows)
- data/model_metadata/model_params.csv — closed/not_disclosed/
  proprietary, mirroring existing Grok rows
- config.v2.json — durable promotion with ["xhigh"] reasoning setting
- CHANGELOG.md — 2.0.12 entry

Re-runs publish_latest_to_viewer.sh against the existing run artifacts
(run_20260501_175254) to refresh the published metadata snapshots
after the canonical rows were added in the previous commit.

Without this, data/v2/latest/leaderboard_with_launch.csv carried the
new grok-4.3 row with blank launch-date and model-params fields, and
data/v2/latest/{model_launch_dates,model_params}.csv were missing the
new row entirely (the publish step had run before the canonical
metadata was added).

The supplemental merge replaces 100 responses/aggregate rows in place
(idempotent) and refreshes the metadata snapshots.
@patelnav patelnav marked this pull request as ready for review May 1, 2026 19:43

patelnav commented May 1, 2026

[Image: bsbench-v2-hero]
