v2: multi-model support (Gemini, OpenRouter), prediction phase, Scott scoring#2

Open
82deutschmark wants to merge 4 commits into master from v2-multi-model

Conversation

@82deutschmark
Collaborator

Summary

Promotes the live, working v2 from Son's Strix Halo machine into the repo. The original commit contained only the llama.cpp baseline; this is the version that ran the overnight benchmark on 28-March-2026.

Changes

New backends

  • GeminiClient — Google Gemini API via gemini:// URL prefix
  • OpenRouterClient — OpenRouter via openrouter:// prefix
  • create_client() factory routes URLs to the right backend

Multi-model overnight runs

  • --model / --model-name / --api-key flags for single-model invocation
  • --run-all-models reads models.json and runs sequentially
  • models.json added: Qwen3.5-35B-A3B-Q4, gemini-3.1-flash-lite-preview, 4× OSS models via OpenRouter
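The `--run-all-models` loop might look like the sketch below. The `models.json` schema (a list of objects with `name`, `url`, and optional `api_key`) is an assumption, and `run_one` stands in for the single-model benchmark entry point.

```python
import json

def load_models(path: str = "models.json"):
    """Read the model roster; schema assumed: [{"name", "url", "api_key"?}, ...]."""
    with open(path) as f:
        return json.load(f)

def run_all_models(path, run_one):
    """Sequentially run the benchmark once per roster entry."""
    for entry in load_models(path):
        run_one(entry["name"], entry["url"], entry.get("api_key"))
```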

Prediction phase + Scott scoring

  • run_prediction_phase(): model estimates n_reliable before binary search (calibration signal)
  • calculate_scores(): composite Scott scoring — efficiency, calibration, reliability, adaptability
  • --scores subcommand prints leaderboard from saved results
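The composite could be combined as below. The PR names the four components but not the weighting, so the equal-weight average here is purely an illustrative assumption (each component taken as a 0-1 value).

```python
def calculate_scores(efficiency, calibration, reliability, adaptability,
                     weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four Scott-score components into one composite.

    Equal weights are an assumption for illustration; the PR does not
    specify how the components are blended.
    """
    parts = (efficiency, calibration, reliability, adaptability)
    return sum(w * p for w, p in zip(weights, parts))
```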

Reliability + retry

  • with_exponential_backoff(): wraps all external API calls, 5 retries, base 1s, max 60s + jitter
  • SUCCESS_THRESHOLD raised 20% → 60% (2/3 trials required)
  • Per-model result namespacing: results/<ModelName>_<problem>_<config>.json
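The retry wrapper could be sketched as follows, using the stated parameters (5 retries, 1 s base, 60 s cap). The jitter formula and the exception types treated as transient are assumptions.

```python
import random
import time

def with_exponential_backoff(fn, retries=5, base=1.0, cap=60.0):
    """Call fn(), retrying transient failures with capped exponential
    backoff plus jitter. Defaults mirror the PR: 5 retries, 1 s base,
    60 s cap; the 10% jitter is an illustrative assumption.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            delay = min(cap, base * 2 ** attempt)
            time.sleep(delay + random.uniform(0, 0.1 * delay))
```

Wrapping every external API call this way keeps an overnight multi-model run alive through 429/503 blips instead of losing a whole model's results.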

Config simplification

  • Reduced from 4 configs to 2: NoTIR_HardCut, NoTIR_Compact
  • TIR configs removed — cleaner baseline, run separately if needed

Testing

Ran overnight on Son's Strix Halo (28-March-2026):

  • Qwen3.5-35B-A3B-Q4: all 10 problems complete
  • gemini-3.1-flash-lite-preview: all 10 problems complete (parallel launch)
  • Results in results/ on the Strix Halo

Notes

This is the exact file from /home/bubba-bench/bench-work/AmnesiaBench/amnesia_bench.py on the Strix Halo, committed to the repo so it's no longer orphaned on the local machine.

Bubba added 4 commits March 29, 2026 08:58
…se, Scott scoring

Key changes:
- Add GeminiClient and OpenRouterClient alongside LLMClient (llama.cpp)
- Add create_client() factory that routes gemini://, openrouter://, http:// URLs
- Add --model / --model-name / --api-key flags for single-model runs
- Add --run-all-models flag to iterate models.json sequentially
- Add models.json with Qwen3.5-35B-A3B-Q4, gemini-3.1-flash-lite-preview, OSS models
- Add prediction phase (run_prediction_phase): model estimates n_reliable before binary search
- Add composite Scott scoring (calculate_scores): efficiency, calibration, reliability, adaptability
- Add --scores subcommand to display leaderboard from saved results
- Raise SUCCESS_THRESHOLD 20% → 60% (2/3 trials required, more reliable measurement)
- Add with_exponential_backoff() wrapping all external API calls (429/503, 5 retries, jitter)
- Per-model result namespacing: results/<ModelName>_<problem>_<config>.json
- Reduce configs from 4 to 2: NoTIR_HardCut, NoTIR_Compact (TIR removed for cleaner baseline)

Author: Claude Sonnet 4.6 (Bubba), 28-March-2026
… binary search

- Add ARC module: evaluator, prompts (simple/guided), problem generator
- 23 ARC problems (15 unsolved + 8 hard-but-solvable) flattened to problems/
- 10 AIMO3 problems (7 new from openrouter-support branch)
- 11 models in models.json: Qwen3.6+, Nemotron-120B, GLM-4.5, Trinity,
  Qwen3-Coder, Step-3.5, MiniMax, Gemini FlashLite, Mistral Small,
  Claude Sonnet 4.6 (OAuth), DeepSeek v3.2
- AnthropicOAuthClient: Bearer auth, oauth-2025-04-20 header, Claude Code
  system preamble, no temperature param
- ARC prompt routing: Anthropic gets fixed preamble + ARC in user message,
  all others get ARC as system prompt
- Binary search: TRIALS_PER_WINDOW=1, snap to 16-token granularity
- ARC evaluation via evaluate_arc_answer() for topic==arc problems
- Add MODEL_CONTEXT_WINDOWS mapping (9 models)
- Add run_unbounded() for 5-run token measurement
- SYSTEM_UNBOUNDED prompt (no context limit mention)
- Forced compaction at 50% of token_limit in Compact config
- MAX_COMPACTIONS bumped from 5 to 10
- Snap grid = 16 tokens
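The 16-token snap grid from the commit above could look like this. Only the grid size is stated; the rounding direction and the minimum-of-one-grid-step floor are assumptions.

```python
SNAP_GRID = 16  # token granularity stated in the commit message

def snap(tokens: int, grid: int = SNAP_GRID) -> int:
    """Round a candidate token limit to the nearest grid multiple,
    never below one grid step, so binary-search midpoints always land
    on 16-token boundaries.
    """
    return max(grid, round(tokens / grid) * grid)
```

Snapping shrinks the binary-search space and keeps measured limits comparable across runs.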