Add x-ai/grok-4.3 v2 benchmark results (xhigh) #25
Open
patelnav wants to merge 3 commits into
Adds patterns for local env files (.env, .env.*), per-contributor working notes (*.local.md), and the one-off run config used for this contribution (config.grok-4.3.*.json) plus a small cost-probe script kept outside the published pipeline.
Adds x-ai/grok-4.3@reasoning=xhigh as a new published row in the v2 track.

Run identifiers: run_20260501_175254 / run_20260501_175254_panel.

Pipeline used unmodified scripts/run_end_to_end.sh against a one-off config (config.grok-4.3.v2.json, gitignored) scoped to a single model to avoid re-collecting other contributors' pending candidates from config.new-models.v2.json. AGENTS.md §Benchmark config alignment permits a documented one-off; the model is promoted into the durable config.v2.json in this commit.

3-judge panel: anthropic/claude-sonnet-4.6 + openai/gpt-5.2 + google/gemini-3.1-pro-preview. panel_mode=full, consensus_method=mean. Final consensus: 1.3167 (52 green / 30 amber / 18 red).

Per-row OpenRouter generation IDs are preserved in data/v2/latest/responses.jsonl::response_id; aggregate rows in data/v2/latest/aggregate.jsonl join by sample_id. Spot-check via https://openrouter.ai/api/v1/generation?id=<response_id>.

- data/v2/latest/*: 100 responses + 100 aggregate rows added, 0 replaced; leaderboard re-sort accounts for most diff churn
- data/model_metadata/model_launch_dates.csv: Grok 4.3 launch 2026-04-30 sourced from OpenRouter model listing (matches the pattern used by recent DeepSeek/Tencent/Xiaomi rows)
- data/model_metadata/model_params.csv: closed/not_disclosed/proprietary, mirroring existing Grok rows
- config.v2.json: durable promotion with ["xhigh"] reasoning setting
- CHANGELOG.md: 2.0.12 entry
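
The spot-check described above (join aggregate rows to responses by sample_id, then look up the generation by response_id) could be sketched roughly as follows. This is a minimal illustration, not the repository's tooling: only the field names sample_id and response_id and the endpoint URL come from this PR; the helper names and any other fields are hypothetical. The sketch only builds the lookup URLs; actually fetching them would presumably require an OpenRouter API key.

```python
import json
from urllib.parse import urlencode

# Endpoint quoted in the commit message above.
GENERATION_ENDPOINT = "https://openrouter.ai/api/v1/generation"


def generation_url(response_id: str) -> str:
    """Build the spot-check URL for one OpenRouter generation ID."""
    return f"{GENERATION_ENDPOINT}?{urlencode({'id': response_id})}"


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def join_by_sample_id(responses: list[dict], aggregates: list[dict]) -> list[dict]:
    """Attach a spot-check URL to each aggregate row via its sample_id.

    Aggregate rows without a matching response are skipped (hypothetical policy).
    """
    by_sample = {row["sample_id"]: row for row in responses}
    joined = []
    for agg in aggregates:
        resp = by_sample.get(agg["sample_id"])
        if resp is not None:
            joined.append({
                "sample_id": agg["sample_id"],
                "aggregate": agg,
                "spot_check_url": generation_url(resp["response_id"]),
            })
    return joined
```

In practice the two inputs would be the contents of data/v2/latest/responses.jsonl and data/v2/latest/aggregate.jsonl, read with `open(path, encoding="utf-8").read()`.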
Re-runs publish_latest_to_viewer.sh against the existing run artifacts
(run_20260501_175254) to refresh the published metadata snapshots
after the canonical rows were added in the previous commit.
Without this, data/v2/latest/leaderboard_with_launch.csv carried the
new grok-4.3 row with blank launch-date and model-params fields, and
data/v2/latest/{model_launch_dates,model_params}.csv were missing the
new row entirely (the publish step had run before the canonical
metadata was added).
The supplemental merge replaces 100 responses/aggregate rows in place
(idempotent) and refreshes the metadata snapshots.
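
To illustrate why re-running the merge is harmless, here is a minimal sketch of an idempotent replace-in-place merge. It assumes rows are keyed by sample_id; the real key and logic used by publish_latest_to_viewer.sh are not shown in this PR, and the function name merge_rows is hypothetical.

```python
def merge_rows(existing: list[dict], incoming: list[dict], key: str = "sample_id"):
    """Merge incoming rows into existing ones, replacing matches in place.

    Returns the merged list plus added/replaced counts. Assumes `key`
    uniquely identifies a row (an assumption, not confirmed by the PR).
    """
    index = {row[key]: pos for pos, row in enumerate(existing)}
    merged = list(existing)
    added = replaced = 0
    for row in incoming:
        pos = index.get(row[key])
        if pos is None:
            # New key: append and remember its position.
            index[row[key]] = len(merged)
            merged.append(row)
            added += 1
        else:
            # Known key: overwrite at the same position, preserving order.
            merged[pos] = row
            replaced += 1
    return merged, {"added": added, "replaced": replaced}
```

Under this sketch, the first publish of a batch reports every row as added and none replaced; merging the same batch again reports every row as replaced and leaves the data unchanged, which matches the counts described in the two commits.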

Status: ready for review
Adds x-ai/grok-4.3@reasoning=xhigh as a new published row in the v2 track. PR contains the full diff and is ready for review.

- Metadata rows (model_launch_dates.csv, model_params.csv)
- Durable config promotion (config.v2.json) + CHANGELOG entry (2.0.12)
- Refreshed data/v2/latest/ metadata snapshots after canonical rows landed (commit 1743413)
- viewer/index.v2.html: page renders, zero console errors, grok-4.3 row appears at rank 22 with exact expected values (1.3167 / 52 / 30 / 18), version selector + model filter work

Final results
Run ID: run_20260501_175254 · Panel ID: run_20260501_175254_panel

Leaderboard row (data/v2/latest/leaderboard.csv):

leaderboard_with_launch.csv row (after metadata fix):

For context: lands between grok-4.20-beta@xhigh (1.27 / 54 green) and grok-4.20-multi-agent-beta@xhigh (1.43 / 64 green), a sensible position in the xAI family.

Cost details
usage.cost reports 0 for this preview tier, the same behavior observed in the cost probe; the cost is likely unsurfaced rather than literally free.

Token usage is fully captured in data/v2/latest/collection_stats.json::usage_summary and per-row in data/v2/latest/responses.jsonl.

Why a one-off config.grok-4.3.v2.json instead of config.new-models.v2.json

config.new-models.v2.json already contains other contributors' pending candidates (moonshotai/kimi-k2.6, z-ai/glm-5.1, qwen/qwen3.6-plus). Running run_end_to_end.sh against that config would have re-collected and additively merged duplicate rows for those models into the published v2 dataset. The one-off run config (gitignored, kept outside the repo per AGENTS.md §Benchmark config alignment) scoped this contribution strictly to x-ai/grok-4.3. The model is promoted to config.v2.json in this PR.

Reasoning variant rationale
Single xhigh variant matches the high-effort comparator from the existing x-ai/grok-4.20-beta family (["low","xhigh"]). The OpenRouter key budget excluded a second variant for the first pass; a none-effort ablation ($0.50 incremental) and the v1/55-question track ($1.10) are straightforward follow-ups if you'd like.

Provenance
- 4ff1d2dc48a880b195652438b9e98e5c3763eed5. Diff produced from a fork (patelnav/bullshit-benchmark) on branch add-grok-4-3.
- Pipeline used unmodified ./scripts/run_end_to_end.sh (split into Phase 1a, collect + primary judge, then Phase 1b, --skip-collect --skip-primary-judge --with-additional-judges, for an explicit cost-check pause point). No hand-rolled CSVs or aggregate edits.
- Per-row OpenRouter generation IDs are preserved in data/v2/latest/responses.jsonl::response_id. Aggregate rows in data/v2/latest/aggregate.jsonl join by sample_id. Spot-check any row against https://openrouter.ai/api/v1/generation?id=<response_id>.
- num_runs: 1. No silent retries; if the run had errored, partial state would have been preserved and surfaced in this PR rather than silently re-run.
- data/v2/latest/responses.jsonl using your OpenRouter key. (You'd need a matching one-off config; happy to attach mine on request. Its content is x-ai/grok-4.3 + ["xhigh"] + the same 3-judge panel as config.v2.json.)

Intentionally not in this PR
- runs/<run_id>/ gist. Everything needed for spot-checking is already in data/v2/latest/{responses.jsonl,aggregate.jsonl} (per-row OpenRouter generation IDs + verbatim judge justifications). If you want the raw run directory (intermediate grade dirs, retry log, partial files) attached as a gist, I'm happy to upload it; just say so.
- docs/images/* README screenshots. The 2.0.10 release refreshed those after a multi-row update; this PR adds a single row that doesn't materially change chart imagery. Happy to refresh if you'd prefer the README always reflect the latest dataset.

Files changed
- data/v2/latest/*: 9 files; responses.jsonl and aggregate.jsonl get +100 rows each, leaderboard CSVs re-sort to place the new row at rank 22, the JSON manifests update counts, recent_additions.json reflects the new entry. Verified: responses_added=100, replaced=0, aggregate_added=100, replaced=0, no sample_id collisions on the initial publish.
- data/model_metadata/model_launch_dates.csv: Grok 4.3 row, launch 2026-04-30 sourced from https://openrouter.ai/x-ai/grok-4.3 (matches the pattern used by recent DeepSeek V4 / Tencent Hy3 / Xiaomi MiMo rows).
- data/model_metadata/model_params.csv: closed / not_disclosed / proprietary, mirroring existing xAI rows; xAI does not publicly disclose parameter counts for Grok 4.3.
- config.v2.json: durable promotion; x-ai/grok-4.3 added to collect.models and model_reasoning_efforts: ["xhigh"].
- .gitignore: minor additions for local secret/working files (separate prep commit).
- CHANGELOG.md: 2.0.12 entry.

Test plan
- collect --dry-run --limit 1 passed before live run
- responses_added=100 aggregate_added=100 *_replaced=0
- panel_summary.json shows 3 judges + mean consensus
- x-ai/grok-4.3@reasoning=xhigh
- git diff --stat -- data/v2/latest shows expected file set only (no unrelated touches)
- leaderboard_with_launch.csv row populated; data/v2/latest/{model_launch_dates,model_params}.csv include grok-4.3
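
The collision and row-count checks mentioned in this description could be approximated with a sketch like the one below. It is an illustration under stated assumptions, not the checks the pipeline actually runs: sample_id comes from this PR, while the model field name and both helper names are hypothetical.

```python
from collections import Counter


def duplicate_sample_ids(rows: list[dict]) -> list[str]:
    """Return every sample_id that appears more than once, sorted."""
    counts = Counter(row["sample_id"] for row in rows)
    return sorted(sid for sid, n in counts.items() if n > 1)


def rows_for_model(rows: list[dict], model_label: str, field: str = "model") -> list[dict]:
    """Select rows for one model; `field` is an assumed column name."""
    return [row for row in rows if row.get(field) == model_label]
```

A reviewer might load data/v2/latest/responses.jsonl into a list of dicts, then confirm that duplicate_sample_ids returns an empty list and that rows_for_model yields 100 rows for x-ai/grok-4.3@reasoning=xhigh, assuming the model label is stored in that form.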