Local analysis of 7 frontier models × 25 problems from the AmnesiaBench Scott 25 Kaggle task.
Each model ran the full pipeline: unbounded → prediction → sweep →
binary search. We recomputed n_reliable from the raw sweep /
binary-search trials (≥2/3 pass rule), then looked at how the models
compare on several axes.
Models: claude-haiku, claude-opus, claude-sonnet, deepseek-3.2,
glm-5, gpt-5.4, qwen-480b.
(gemini-flash-2.5 was excluded from this run — see the data
sub-section at the bottom for why.)
| model | solved | total cost | mean compaction_ratio ↓ | composite_ctx_eff (incl. unsolved) ↑ | composite_winpred (abstention-aware) ↑ |
|---|---|---|---|---|---|
| claude-opus | 14/25 | $32.51 | 1.09 | 0.47 | 0.22 |
| claude-haiku | 10/25 | $3.40 | 1.43 | 0.25 | 0.15 |
| qwen-480b | 6/25 | $8.23 | 1.56 | 0.15 | 0.19 |
| deepseek-3.2 | 4/25 | $3.19 | 1.85 | 0.06 | 0.08 |
| claude-sonnet | 9/25 | $42.40 | 1.97 | 0.19 | 0.22 |
| gpt-5.4 | 2/25 | $0.14 | 2.03 | 0.08 | 0.02 |
| glm-5 | 13/25 | $4.15 | 2.05 | 0.13 | 0.04 |
compaction_ratio = n_reliable / n_while_unbounded over solved
problems. Lower is better.
composite_ctx_eff = baseline_n_reliable / n_reliable averaged over
all 25 problems (unsolved → 0, cost_unbounded as cost). Baseline is the
minimum n_reliable across the 7 observed models per problem.
composite_winpred uses the abstention-aware rule: a model that
declines to predict (no finite n_predicted) is excluded from that
problem's window-prediction score, rather than penalized.
The seven models in this cohort all ship with max context windows between 128k and 1M tokens, released between July 2025 and March 2026. The AmnesiaBench harness tested at 131,072 tokens (the dashed horizontal line) — a size that was the 2024 frontier but is now middle-of-the-pack. Against that backdrop, compaction quality matters more than raw max: the question isn't whether a model can reach 1M tokens, it's whether it can solve in far less than its advertised capacity.
Dates and max context windows sourced from Anthropic / OpenAI / Google / DeepSeek / Alibaba / Zhipu launch announcements and third-party trackers (verified April 2026). Filled rings with a black border are the seven Scott-25 cohort models.
- claude-opus is the only model at the 1× floor. Mean 1.09, tight spread (σ = 0.30), 11-problem sample.
- claude-haiku beats every larger model except opus (1.43) — the biggest surprise in the ranking.
- gpt-5.4 is cheapest per defining run but not efficient — only solved 2 problems and averages 2× overhead.
- The cost axis spans ~2 orders of magnitude; the ratio axis spans ~2×. Paying 100× more does not buy 100× better compaction.
- Compaction actually helped (ratio < 1.0) on only 9/59 solved problems (~15 %). Every one of those wins came from the claude family: haiku (4), opus (4), sonnet (1). For every other model, compaction is pure overhead — never a focusing mechanism.
Two panels share the same Y-axis (baseline_n_reliable / n_reliable,
unsolved → 0). Every model has all 25 points: small dots =
individual problems, big dot = per-model mean.
- Panel A: X = cost of the n_reliable run (nanodollars of the
single trial set that defined
n_reliable; for unsolved problems,cost_unbounded). This is the "what did it cost to land at your score" view. - Panel B: X = cost per token (the model's average price
sum(cost_nd) / sum(tokens)across all 25 problems). This is the "how expensive is this model per unit of compute" view; each model's dots share an x-coordinate.
Findings:
- claude-opus leads at 0.47. It's the only model whose mean lifts above 0.2 once you include the problems it failed.
- Everyone else bunches between 0.06 and 0.25. 50× cost differences in that cluster don't predict meaningful score differences.
- Pareto frontier (Panel A): gpt-5.4 → qwen-480b → claude-haiku → claude-opus. Sonnet, glm-5, deepseek are all dominated.
- Pareto frontier (Panel B): deepseek-3.2 → qwen-480b → claude-haiku → claude-opus. deepseek and qwen are nearly tied on price per token (~850 nd/token), but qwen's score is 2.4× higher.
- Opus is ~14× more expensive per token than qwen-480b but only ~3× better on ctx_eff. Sublinear scaling.
Same Y-axis, but X is the model's average cost per token
(sum(cost_nanodollars) / sum(tokens) across all 25 problems). Since
cost-per-token is a per-model scalar, dots for a given model are
x-aligned.
- Opus is ~14× more expensive per token than qwen-480b but only ~3× better on ctx_eff. Sublinear scaling.
- deepseek and qwen are priced near-identically (~850 nd/token) but qwen wins on both solved count and ctx_eff.
- Panel A (solved-only) shows the "abstention reward" problem: gpt-5.4's tiny n biases it toward high-looking scores.
Window prediction = how close the model's self-predicted n_predicted
is to the measured n_reliable:
- ratio =
n_reliable / n_predicted - score =
ratioif ≤ 1 (under-predicted, good) - score =
(1/ratio)²if > 1 (over-confident, soft penalty) - score = 0 if the model committed a number but failed to solve
- dropped from composite if the model abstained (
n_predictedis missing / infinite — the model said "I don't know how much I'll need")
Top panel: X = cost of the n_reliable run. Bottom panel: X = cost-per-token.
Abstention rates:
| model | predicted | abstained |
|---|---|---|
| claude-opus | 25/25 | 0 |
| claude-sonnet | 25/25 | 0 |
| deepseek-3.2 | 25/25 | 0 |
| gpt-5.4 | 25/25 | 0 |
| claude-haiku | 24/25 | 1 |
| qwen-480b | 20/25 | 5 |
| glm-5 | 8/25 | 17 |
glm-5 declines to predict on 17/25 problems. Before this correction, those were counted as 0, dragging its composite to 0.01; with the fix, the composite is 0.04. Still worst among committers, but the gap to leaders shrank 3×.
qwen-480b also jumped from 0.15 → 0.19 once its 5 honest abstentions stopped counting.
Ranking after the fix:
| model | winpred |
|---|---|
| claude-sonnet | 0.22 |
| claude-opus | 0.22 |
| qwen-480b | 0.19 |
| claude-haiku | 0.15 |
| deepseek-3.2 | 0.08 |
| glm-5 | 0.04 |
| gpt-5.4 | 0.02 |
Human testing established ~200 tokens as near-optimal for
op_char_count (95-token prompt + 100-token compaction trigger).
| model | n_reliable | × human |
|---|---|---|
| claude-opus | 464 | 2.3× |
| qwen-480b | 736 | 3.7× |
| claude-sonnet | 829 | 4.1× |
| claude-haiku | 1,024 | 5.1× |
| glm-5 | 3,200 | 16.0× |
| deepseek-3.2 | ∞ | — |
| gpt-5.4 | ∞ | — |
No model in this 7-model cohort beats the human. The best (opus) is 2.3× over; glm-5 needs 16× more budget for a task a human did in 200 tokens; deepseek and gpt-5.4 can't solve it at all.
Across the 25 problems:
- 9 universally unsolved:
op_chess_board,op_color_blocks_10,op_color_blocks_12,op_letter_distance_q,op_pressure_vessel,op_simulation_10x10x10,op_simulation_5x5x5,op_stock_cut_3,op_stock_cut_4. The benchmark's score distribution is dominated by whether a model can crack any of these. - 3 near-universal solves (6-7 of 7):
op_prime_7547,op_closed_loops,op_prime_sum_negative. - 3 single-solver problems:
op_stock_cut_5(opus only),op_prime_2149519(glm-5 only),op_labeled_trees(glm-5 only). These are where glm-5 earns real points despite its bad composites. - Per-problem
n_reliablespreads up to 433× — the benchmark's geometric-midpoint binary search is critical; a linear search would be hopeless at this dynamic range.
-
Compression distribution per model — strip plot showing individual problem dots + per-model mean and ±1 SD:

-
3-panel comparison — compression_ratio, cost, and context_eff side by side per model:

-
Self-knowledge composite —
sqrt(success_pred · window_pred)with abstention-aware window component:
-
Success-prediction composite — almost always 1.0 (solved-only) or solve-rate (including unsolved) because
attempt=Trueis near-universal:
-
Per-problem × per-model table — interactive HTML table you can switch between
n_reliable,n_unbounded,cost_unbounded, andcost of n_reliable run:scott25_table.html
- The headline finding holds with sharper numbers. No model comes close to the human-demonstrated floor on the one problem we have a trace for; median compaction overhead across solved problems is ~1.8×.
- "Frontier" doesn't predict compaction quality. Opus leads, but haiku beats every larger model except opus on compaction ratio, and glm-5 solves 13/25 problems at 1/8th of opus's cost. Meanwhile gpt-5.4 solves 2 and deepseek solves 4.
- More dollars don't buy much more compaction. Cost spans 2 orders of magnitude across models; compaction_ratio and context_efficiency span less than 1. Opus's lead is real but narrow.
- Pick your metric carefully. Solved-only vs. all-problems rankings disagree in almost every row. The inclusive view rewards breadth and flattens narrow-but-cheap models.
- Honest abstention deserves credit, not a penalty. The original window-prediction scoring gave 0 to both "committed and failed" and "honestly said I don't know." That disincentivizes honesty. The abstention-aware variant here moves glm-5 from essentially 0 to a real-but-bad 0.04 and rewards qwen for its 5 judicious abstentions.
n_reliablerecomputed fromtraces_scott25.jsonusing theamnesia_kaggle/log_search.pyrule (≥2/3 pass for 3-trial entries; 1/1 for single-trial outer sweeps).cost_of_n_reliable_runis the sum ofcost_nanodollarsover only the trials at the defining N. When all those trials havefinish_reason = reused_unbounded(0 billed), we substitute the model's unbounded-phase cost. Reason: those trials did no new compute — they replayed the unbounded trace — so charging "0" misleadingly undercounts the real cost of that run.baseline_n_reliableforcomposite_context_efficiency_scoreis the minimumn_reliableacross the 7 observed models per problem. This is not the official frozen baseline the Kaggle scoring module uses, so absolute ctx_eff values here are upper-bounded by whatever the best local model achieved on each problem.gemini-flash-2.5was excluded from this cohort (its results showed anomalies — notably an_reliable = 100onop_char_countthat we couldn't independently verify; keeping it would have skewed every context_efficiency baseline).attempt = Falseoccurs only once (glm-5 on a single problem), so the success-prediction axis is essentially (solve rate × 1) and carries little independent signal.
Every chart above was generated by the Python files in this repo;
see each plot_*.py for the exact data-loading and scoring logic.





