LeakyReLU(0.75)² + Legal TTT + Parallel Muon — 1.1185 BPB (3-seed mean)#977
michaelwinczuk wants to merge 1 commit into `openai:main`
Conversation
One-line activation change (negative_slope 0.5→0.75) plus minor LR/warmdown tuning, discovered via a multi-agent think-tank swarm research system. 3-seed results with legal TTT:

- Seed 1337: 1.1183 BPB (15.96 MB)
- Seed 42: 1.1194 BPB (15.96 MB)
- Seed 2024: 1.1179 BPB (15.95 MB)
- Mean: 1.1185 BPB
Negative slope 0.75 preserves more gradient flow for negative inputs. Combined with EVAL_STRIDE=32 and TTT tuning, targeting 1.1144 BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- slope 0.75 + LR 0.027 + warmdown 3700 (PR openai#977)
- No SWA with QAT (PR openai#989)
- QAT from 50% + range fix [-31, 31]
- mHC 22-param residual mixing (PR openai#928)
- VE128 + no gated_attn + no value_residual (PR openai#549)
- LZMA preset 7 compression (PR openai#999)
- Muon TTT with NS3 (PR openai#999)
- Entropy-adaptive TTT epochs 2/3/4 (PR openai#999)
- Per-layer TTT LR (PR openai#995)
- TTT momentum 0.95 (PR openai#995)
Community Review — LeakyReLU(0.75)² + Legal TTT + Parallel Muon — 1.1185 BPB (3-seed mean)

BPB: 1.1185 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code: the TTT path at line 1074 implements the score-first-per-chunk pattern, with each chunk scored under the pre-update adapter state. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here, chunk by chunk.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=11, vocab=1024, code=89459 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16 MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based classifier.
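For readers unfamiliar with the legality pattern the review checks for, here is a minimal, hypothetical sketch of score-first-per-chunk TTT (function names are mine, not from the submission): every chunk is scored with the adapter state from before that chunk, and only then does the adapter update on it.

```python
def score_first_ttt(chunks, score_fn, update_fn, state):
    """Score-first-per-chunk test-time training loop (illustrative only).

    Each chunk's bits are computed under `state` as it existed *before*
    the adapter sees that chunk, which is what makes the eval legal:
    no token is scored by a model that has already trained on it.
    """
    total_bits = 0.0
    total_tokens = 0
    for chunk in chunks:
        total_bits += score_fn(chunk, state)   # score with pre-update state
        state = update_fn(chunk, state)        # then adapt on the same chunk
        total_tokens += len(chunk)
    return total_bits / total_tokens
```

Swapping the two lines in the loop body (update before score) is exactly the pattern Issue #402 rules illegal.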
@MatoTeziTanka thanks for the review — same root cause as #1094, so the fix is the same shape.
Pushed a minimal fix:

```python
import os
import sys

from flash_attn_interface import flash_attn_func as flash_attn_3_func

# Make the submission self-contained regardless of eval-harness CWD: the
# sibling swarm_agents.py and kg_data.py live next to this file but aren't
# on sys.path when the harness runs `python records/.../train_gpt.py` from
# the repo root.
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from swarm_agents import VotingMesh, TrainingMetrics
```

Verified locally under Python 3.10.11.
Ready for a re-run of the compliance audit whenever convenient. Thanks for the clear repro steps.
Summary
- `negative_slope=0.5→0.75`
- `MATRIX_LR=0.027`, `WARMDOWN_ITERS=3700`

3-Seed Results

- Seed 1337: 1.1183 BPB
- Seed 42: 1.1194 BPB
- Seed 2024: 1.1179 BPB
- Mean: 1.1185 BPB
Key Finding
Systematic sweep of LeakyReLU `negative_slope` found that 0.75 beats the SOTA default of 0.5 by ~0.008 BPB. The higher slope passes 2.25× more gradient through negative pre-activations, accelerating convergence in a capacity-constrained 600-second training window.

This was discovered by a multi-agent think-tank swarm: 8 specialist AI agents traversing competition-specific knowledge graphs, with validation from Grok, Claude Opus, and Gemini. Full details in the README.
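As a sanity check on the 2.25× figure: assuming the activation is applied as act(x) = LeakyReLU(x, slope)², the gradient on the negative side is 2·slope²·x, so raising the slope from 0.5 to 0.75 scales that gradient by (0.75/0.5)² = 2.25. A minimal pure-Python sketch (names are illustrative, not from the submission):

```python
def leaky_relu_sq(x, slope=0.75):
    """Squared LeakyReLU: (x if x > 0 else slope * x) ** 2."""
    y = x if x > 0 else slope * x
    return y * y

def neg_side_grad(x, slope):
    """d/dx of (slope * x)**2 for x < 0, i.e. 2 * slope**2 * x."""
    return 2.0 * slope * slope * x
```

Note the squaring makes the negative-side gradient ratio quadratic in the slope, which is why a modest 0.5→0.75 change moves so much gradient.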
Changes from SOTA (PR #549)
- `negative_slope=0.75` (was 0.5)
- `MATRIX_LR=0.027` (was 0.025)
- `WARMDOWN_ITERS=3700` (was 3500)

All other architecture and hyperparameters identical.
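The full delta from the baseline fits in three values; a throwaway sketch for reviewers diffing configs (constant names mirror the PR text, not necessarily the actual source):

```python
# Baseline hyperparameters per PR #549 and this PR's stated changes.
BASELINE_PR549 = {"negative_slope": 0.5, "MATRIX_LR": 0.025, "WARMDOWN_ITERS": 3500}
THIS_PR = {"negative_slope": 0.75, "MATRIX_LR": 0.027, "WARMDOWN_ITERS": 3700}

# Collect (old, new) pairs for every changed key; all three keys change,
# and everything else in the config is inherited unchanged.
delta = {k: (BASELINE_PR549[k], THIS_PR[k])
         for k in BASELINE_PR549 if THIS_PR[k] != BASELINE_PR549[k]}
```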
Hardware
8×H100 SXM (RunPod), PyTorch 2.7.1 + CUDA 12.6 + flash-attn v3.