v2: multi-model support (Gemini, OpenRouter), prediction phase, Scott scoring#2

Open
82deutschmark wants to merge 4 commits into master from v2-multi-model

Conversation

@82deutschmark
Collaborator

Summary

Promotes the live, working v2 from Son's Strix Halo machine into the repo. The original commit contained only the llama.cpp baseline; this is the version that ran the overnight benchmark on 28-March-2026.

Changes

New backends

  • GeminiClient — Google Gemini API via gemini:// URL prefix
  • OpenRouterClient — OpenRouter via openrouter:// prefix
  • create_client() factory routes URLs to the right backend

Multi-model overnight runs

  • --model / --model-name / --api-key flags for single-model invocation
  • --run-all-models reads models.json and runs sequentially
  • models.json added: Qwen3.5-35B-A3B-Q4, gemini-3.1-flash-lite-preview, 4× OSS models via OpenRouter
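The `--run-all-models` loop might look like the sketch below. The `models.json` schema (a list of objects with `name`, `url`, and optional `api_key`) is an assumption, and `run_one` stands in for the single-model benchmark entry point.

```python
import json

def load_models(path: str = "models.json"):
    """Read the model roster; schema assumed: [{"name", "url", "api_key"?}, ...]."""
    with open(path) as f:
        return json.load(f)

def run_all_models(path, run_one):
    """Sequentially run the benchmark once per roster entry."""
    for entry in load_models(path):
        run_one(entry["name"], entry["url"], entry.get("api_key"))
```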

Prediction phase + Scott scoring

  • run_prediction_phase(): model estimates n_reliable before binary search (calibration signal)
  • calculate_scores(): composite Scott scoring — efficiency, calibration, reliability, adaptability
  • --scores subcommand prints leaderboard from saved results
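The composite could be combined as below. The PR names the four components but not the weighting, so the equal-weight average here is purely an illustrative assumption (each component taken as a 0-1 value).

```python
def calculate_scores(efficiency, calibration, reliability, adaptability,
                     weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four Scott-score components into one composite.

    Equal weights are an assumption for illustration; the PR does not
    specify how the components are blended.
    """
    parts = (efficiency, calibration, reliability, adaptability)
    return sum(w * p for w, p in zip(weights, parts))
```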

Reliability + retry

  • with_exponential_backoff(): wraps all external API calls, 5 retries, base 1s, max 60s + jitter
  • SUCCESS_THRESHOLD raised 20% → 60% (2/3 trials required)
  • Per-model result namespacing: results/<ModelName>_<problem>_<config>.json
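The retry wrapper could be sketched as follows, using the stated parameters (5 retries, 1 s base, 60 s cap). The jitter formula and the exception types treated as transient are assumptions.

```python
import random
import time

def with_exponential_backoff(fn, retries=5, base=1.0, cap=60.0):
    """Call fn(), retrying transient failures with capped exponential
    backoff plus jitter. Defaults mirror the PR: 5 retries, 1 s base,
    60 s cap; the 10% jitter is an illustrative assumption.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            delay = min(cap, base * 2 ** attempt)
            time.sleep(delay + random.uniform(0, 0.1 * delay))
```

Wrapping every external API call this way keeps an overnight multi-model run alive through 429/503 blips instead of losing a whole model's results.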

Config simplification

  • Reduced from 4 configs to 2: NoTIR_HardCut, NoTIR_Compact
  • TIR configs removed — cleaner baseline, run separately if needed

Testing

Ran overnight on Son's Strix Halo (28-March-2026):

  • Qwen3.5-35B-A3B-Q4: all 10 problems complete
  • gemini-3.1-flash-lite-preview: all 10 problems complete (parallel launch)
  • Results in results/ on the Strix Halo

Notes

This is the exact file from /home/bubba-bench/bench-work/AmnesiaBench/amnesia_bench.py on the Strix Halo, committed to the repo so it's no longer orphaned on the local machine.

Bubba added 4 commits March 29, 2026 08:58
…se, Scott scoring

Key changes:
- Add GeminiClient and OpenRouterClient alongside LLMClient (llama.cpp)
- Add create_client() factory that routes gemini://, openrouter://, http:// URLs
- Add --model / --model-name / --api-key flags for single-model runs
- Add --run-all-models flag to iterate models.json sequentially
- Add models.json with Qwen3.5-35B-A3B-Q4, gemini-3.1-flash-lite-preview, OSS models
- Add prediction phase (run_prediction_phase): model estimates n_reliable before binary search
- Add composite Scott scoring (calculate_scores): efficiency, calibration, reliability, adaptability
- Add --scores subcommand to display leaderboard from saved results
- Raise SUCCESS_THRESHOLD 20% → 60% (2/3 trials required, more reliable measurement)
- Add with_exponential_backoff() wrapping all external API calls (429/503, 5 retries, jitter)
- Per-model result namespacing: results/<ModelName>_<problem>_<config>.json
- Reduce configs from 4 to 2: NoTIR_HardCut, NoTIR_Compact (TIR removed for cleaner baseline)

Author: Claude Sonnet 4.6 (Bubba), 28-March-2026
… binary search

- Add ARC module: evaluator, prompts (simple/guided), problem generator
- 23 ARC problems (15 unsolved + 8 hard-but-solvable) flattened to problems/
- 10 AIMO3 problems (7 new from openrouter-support branch)
- 11 models in models.json: Qwen3.6+, Nemotron-120B, GLM-4.5, Trinity,
  Qwen3-Coder, Step-3.5, MiniMax, Gemini FlashLite, Mistral Small,
  Claude Sonnet 4.6 (OAuth), DeepSeek v3.2
- AnthropicOAuthClient: Bearer auth, oauth-2025-04-20 header, Claude Code
  system preamble, no temperature param
- ARC prompt routing: Anthropic gets fixed preamble + ARC in user message,
  all others get ARC as system prompt
- Binary search: TRIALS_PER_WINDOW=1, snap to 16-token granularity
- ARC evaluation via evaluate_arc_answer() for topic==arc problems
- Add MODEL_CONTEXT_WINDOWS mapping (9 models)
- Add run_unbounded() for 5-run token measurement
- SYSTEM_UNBOUNDED prompt (no context limit mention)
- Forced compaction at 50% of token_limit in Compact config
- MAX_COMPACTIONS bumped from 5 to 10
- Snap grid = 16 tokens
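The 16-token snap grid from the commit above could look like this. Only the grid size is stated; the rounding direction and the minimum-of-one-grid-step floor are assumptions.

```python
SNAP_GRID = 16  # token granularity stated in the commit message

def snap(tokens: int, grid: int = SNAP_GRID) -> int:
    """Round a candidate token limit to the nearest grid multiple,
    never below one grid step, so binary-search midpoints always land
    on 16-token boundaries.
    """
    return max(grid, round(tokens / grid) * grid)
```

Snapping shrinks the binary-search space and keeps measured limits comparable across runs.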