v2: multi-model support (Gemini, OpenRouter), prediction phase, Scott scoring #2
Open
82deutschmark wants to merge 4 commits into master from
Conversation
added 4 commits on March 29, 2026 at 08:58
…se, Scott scoring

Key changes:
- Add GeminiClient and OpenRouterClient alongside LLMClient (llama.cpp)
- Add create_client() factory that routes gemini://, openrouter://, http:// URLs
- Add --model / --model-name / --api-key flags for single-model runs
- Add --run-all-models flag to iterate models.json sequentially
- Add models.json with Qwen3.5-35B-A3B-Q4, gemini-3.1-flash-lite-preview, OSS models
- Add prediction phase (run_prediction_phase): model estimates n_reliable before binary search
- Add composite Scott scoring (calculate_scores): efficiency, calibration, reliability, adaptability
- Add --scores subcommand to display leaderboard from saved results
- Raise SUCCESS_THRESHOLD 20% → 60% (2/3 trials required, more reliable measurement)
- Add with_exponential_backoff() wrapping all external API calls (429/503, 5 retries, jitter)
- Per-model result namespacing: results/<ModelName>_<problem>_<config>.json
- Reduce configs from 4 to 2: NoTIR_HardCut, NoTIR_Compact (TIR removed for cleaner baseline)

Author: Claude Sonnet 4.6 (Bubba), 28-March-2026
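The retry wrapper named in this commit can be sketched roughly as follows. The function name and the retry parameters (429/503, 5 retries, jitter) come from the commit message; everything else, including how errors expose their HTTP status, is an assumption for illustration:

```python
import random
import time

RETRYABLE_STATUS = {429, 503}  # rate-limit and overload responses

def with_exponential_backoff(call, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry `call` on retryable HTTP errors with exponential backoff plus jitter.

    `call` is a zero-argument function. This sketch assumes raised exceptions
    carry a `status_code` attribute; the real client wiring may differ.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status not in RETRYABLE_STATUS or attempt == max_retries:
                raise
            # Exponential delay capped at max_delay, plus a small random jitter
            # so concurrent workers don't retry in lockstep.
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

A transient 429 is retried transparently; a non-retryable error, or exhausting the retry budget, re-raises to the caller.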
… binary search

- Add ARC module: evaluator, prompts (simple/guided), problem generator
- 23 ARC problems (15 unsolved + 8 hard-but-solvable) flattened to problems/
- 10 AIMO3 problems (7 new from openrouter-support branch)
- 11 models in models.json: Qwen3.6+, Nemotron-120B, GLM-4.5, Trinity, Qwen3-Coder, Step-3.5, MiniMax, Gemini FlashLite, Mistral Small, Claude Sonnet 4.6 (OAuth), DeepSeek v3.2
- AnthropicOAuthClient: Bearer auth, oauth-2025-04-20 header, Claude Code system preamble, no temperature param
- ARC prompt routing: Anthropic gets fixed preamble + ARC in user message, all others get ARC as system prompt
- Binary search: TRIALS_PER_WINDOW=1, snap to 16-token granularity
- ARC evaluation via evaluate_arc_answer() for topic==arc problems
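The snapped binary search mentioned in this commit might look like the sketch below. The 16-token grid and one trial per window come from the commit message; the `trial` callable and the search skeleton are assumptions:

```python
SNAP_GRID = 16  # token-limit granularity from the commit message

def snap(tokens: int, grid: int = SNAP_GRID) -> int:
    """Round a token limit down to the nearest multiple of `grid` (min one grid step)."""
    return max(grid, (tokens // grid) * grid)

def binary_search_limit(trial, lo: int, hi: int) -> int:
    """Find the smallest snapped token limit at which `trial(limit)` succeeds.

    `trial` is a hypothetical callable returning True on success; with
    TRIALS_PER_WINDOW=1, each candidate limit is probed exactly once.
    """
    best = hi
    while lo <= hi:
        mid = snap((lo + hi) // 2)
        if trial(mid):
            best, hi = mid, mid - SNAP_GRID   # success: try a tighter limit
        else:
            lo = mid + SNAP_GRID              # failure: need more tokens
    return best
```

Snapping keeps the search from distinguishing limits that differ by less than a grid step, which cuts the number of trials without changing the answer materially.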
- Add MODEL_CONTEXT_WINDOWS mapping (9 models)
- Add run_unbounded() for 5-run token measurement
- SYSTEM_UNBOUNDED prompt (no context-limit mention)
- Forced compaction at 50% of token_limit in Compact config
- MAX_COMPACTIONS bumped from 5 to 10
- Snap grid = 16 tokens
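The forced-compaction rule in this commit reduces to a simple threshold check. A minimal sketch, assuming a helper of this shape (the 50% threshold and MAX_COMPACTIONS=10 are from the commit; the function and its arguments are hypothetical):

```python
COMPACT_THRESHOLD = 0.5  # force compaction at 50% of the token limit
MAX_COMPACTIONS = 10     # hard cap per run, raised from 5 in this commit

def should_compact(history_tokens: int, token_limit: int, compactions_done: int) -> bool:
    """Return True when the conversation history should be force-compacted.

    Hypothetical helper: the real script tracks token counts per client, but
    the decision is the same threshold comparison.
    """
    return (history_tokens >= token_limit * COMPACT_THRESHOLD
            and compactions_done < MAX_COMPACTIONS)
```

Compacting at half the limit leaves headroom for the compaction summary itself plus the next turn, instead of compacting only once the window is already full.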
Summary
Promotes the live working v2 from Son's Strix Halo to the repo. The original commit only had the llama.cpp baseline; this is the version that ran the overnight benchmark on 28-March-2026.
Changes
New backends
- GeminiClient: Google Gemini API via the gemini:// URL prefix
- OpenRouterClient: OpenRouter via the openrouter:// prefix
- create_client() factory routes URLs to the right backend

Multi-model overnight runs
- --model / --model-name / --api-key flags for single-model invocation
- --run-all-models reads models.json and runs sequentially
- models.json added: Qwen3.5-35B-A3B-Q4, gemini-3.1-flash-lite-preview, 4× OSS models via OpenRouter

Prediction phase + Scott scoring
- run_prediction_phase(): model estimates n_reliable before binary search (calibration signal)
- calculate_scores(): composite Scott scoring (efficiency, calibration, reliability, adaptability)
- --scores subcommand prints leaderboard from saved results

Reliability + retry
- with_exponential_backoff(): wraps all external API calls; 5 retries, base 1 s, max 60 s plus jitter
- per-model result namespacing: results/<ModelName>_<problem>_<config>.json

Config simplification
- NoTIR_HardCut, NoTIR_Compact

Testing
Ran overnight on Son's Strix Halo (28-March-2026):
- results/ on the Strix Halo

Notes
This is the exact file from /home/bubba-bench/bench-work/AmnesiaBench/amnesia_bench.py on the Strix Halo, committed to the repo so it's no longer orphaned on the local machine.
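For reference, the gemini:// / openrouter:// / http:// routing described in the Changes section might look roughly like this. The client class names and prefixes come from the change list; the constructors here are stubs and the signatures are assumptions:

```python
class GeminiClient:
    """Stub standing in for the real Google Gemini backend."""
    def __init__(self, model, api_key=None):
        self.model = model

class OpenRouterClient:
    """Stub standing in for the real OpenRouter backend."""
    def __init__(self, model, api_key=None):
        self.model = model

class LLMClient:
    """Stub standing in for the original llama.cpp HTTP backend."""
    def __init__(self, base_url, model_name=None):
        self.base_url = base_url

def create_client(model_url, api_key=None, model_name=None):
    """Route a model URL to the matching backend client (sketch)."""
    if model_url.startswith("gemini://"):
        return GeminiClient(model_url.removeprefix("gemini://"), api_key)
    if model_url.startswith("openrouter://"):
        return OpenRouterClient(model_url.removeprefix("openrouter://"), api_key)
    if model_url.startswith(("http://", "https://")):
        return LLMClient(model_url, model_name)  # llama.cpp server endpoint
    raise ValueError(f"unrecognized model URL: {model_url}")
```

Keeping the routing in one factory means --model and --run-all-models can share the same code path regardless of which backend a given entry in models.json points at.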