
Turbo-Muon + EngramLite + ParamBanking + GPTQ Reserve Opt — val_bpb 1.1126 (3-seed mean) #1169

Open
Bortlesboat wants to merge 1 commit into openai:main from Bortlesboat:submission/v18-turbomuon-fused-1.1126

Conversation

@Bortlesboat

Summary

  • val_bpb: 1.1126 (3-seed mean, std 0.0003)
  • Artifact: ~15.98 MB (all seeds under 16,000,000 bytes)
  • Eval time: ~120s (no TTT, sliding window stride=64)
  • Built on PR #1089 by @mikeapedia

3-Seed Results

| Seed | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------------|-----------------|------------------|
| 1337 | 1.1126      | 1.87857         | 15,981,856       |
| 42   | 1.1123      | 1.87803         | 15,984,349       |
| 999  | 1.1129      | 1.87900         | 15,985,912       |
| Mean | 1.1126      | 1.87853         |                  |

vs merged SOTA (PR #549, 1.89002 nats): -0.01149 nats. Note: open PRs #1089 (1.1091) and #1105 (1.1138) achieve better scores.
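For reference, the BPB and nats columns are related by a bits conversion plus a token-to-byte renormalization. A minimal sketch of that conversion, with hypothetical token and byte counts chosen only to illustrate the ~2.44 bytes-per-token ratio implied by the table (the real counts come from the eval harness, not this PR):

```python
import math

def nats_per_token_to_bpb(val_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    # Total information in bits = loss_nats * n_tokens / ln(2);
    # normalize by the raw byte count of the validation text.
    return val_loss_nats * n_tokens / (n_bytes * math.log(2))

# Hypothetical counts, chosen only to reproduce the ratio implied by the table
# (1.87853 nats/token vs. 1.1126 bits/byte, i.e. ~2.44 bytes/token).
print(nats_per_token_to_bpb(1.87853, n_tokens=1_000_000, n_bytes=2_436_000))  # ~1.1125
```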

What's New vs PR #1089

  1. GPTQ Reserve Optimization: Reduced calibration reserve from 14s to 9s (actual calibration ~8.4s), recovering ~55 extra training steps
  2. Experimental fused Triton MLP kernel: Forward-only fusion via torch.library.triton_op with standard PyTorch backward. Hard-disabled in this submission (produces NaN on PT2.9 due to TTIR analysis bug). Included as experimental code for future work.
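A minimal sketch of the reserve-budget idea behind item 1, assuming a hypothetical train_with_reserve helper (the submission's actual loop differs): shrinking the reserve from 14 s to 9 s frees ~5 s, which at the roughly 90 ms/step this implies is consistent with the ~55 extra steps claimed.

```python
import time

TRAIN_BUDGET_S = 600.0   # record-track training limit
GPTQ_RESERVE_S = 9.0     # was 14.0; actual GPTQ calibration measured at ~8.4 s

def train_with_reserve(run_step, run_gptq_calibration):
    # Hypothetical helper: train until only the calibration reserve remains,
    # then spend the reserve on post-training GPTQ calibration (train data only).
    t0 = time.time()
    steps = 0
    while time.time() - t0 < TRAIN_BUDGET_S - GPTQ_RESERVE_S:
        run_step()
        steps += 1
    run_gptq_calibration()
    return steps
```

And for item 2, a sketch of the forward-only torch.library.triton_op pattern with a plain PyTorch backward, shown on a toy relu()**2 elementwise op rather than the submission's fused MLP. The op name and kernel are illustrative; the HAS_FUSED_MLP gate mirrors the flag name mentioned in the audit below, and the path is left disabled just as the PR ships it. Requires PyTorch >= 2.6 (for torch.library.wrap_triton) and Triton.

```python
# Minimal sketch (not the submission's kernel): a forward-only fused relu()**2
# op registered via torch.library.triton_op, with the backward left to
# ordinary PyTorch ops.
import torch
import triton
import triton.language as tl
from torch.library import triton_op, wrap_triton

HAS_FUSED_MLP = False  # hard-disable flag, mirroring the PR's approach

@triton.jit
def _relu_sq_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.maximum(x, 0.0)
    tl.store(out_ptr + offs, y * y, mask=mask)

@triton_op("sketch::relu_sq", mutates_args={})
def relu_sq(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    wrap_triton(_relu_sq_kernel)[grid](x, out, n, BLOCK=1024)
    return out

def _setup_context(ctx, inputs, output):
    (x,) = inputs
    ctx.save_for_backward(x)

def _backward(ctx, grad_out):
    # Standard PyTorch backward: d/dx relu(x)**2 = 2 * relu(x)
    (x,) = ctx.saved_tensors
    return grad_out * 2.0 * torch.relu(x)

relu_sq.register_autograd(_backward, setup_context=_setup_context)

def mlp_act(x: torch.Tensor) -> torch.Tensor:
    # Fall back to eager relu()**2 while the fused path stays disabled.
    if HAS_FUSED_MLP and x.is_cuda:
        return relu_sq(x)
    return torch.relu(x).square()
```

Keeping only the forward in Triton and the backward in eager PyTorch avoids writing (and debugging) a backward kernel while still giving torch.compile a gradient-aware custom op.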

Compliance

  • Standard F.cross_entropy scoring
  • No TTT, no eval-time training data access
  • Artifact < 16,000,000 bytes (all 3 seeds)
  • Training < 600s, eval < 600s
  • Causal sliding-window evaluation (stride=64; see the sketch after this list)
  • 3-seed verification: -0.01149 nats vs merged SOTA (> 0.005 threshold)
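A minimal sketch of the stride-64 causal sliding-window scoring referenced above, assuming model(x) returns logits; the function below is illustrative and is not the submission's eval_val_sliding:

```python
import torch
import torch.nn.functional as F

@torch.inference_mode()
def sliding_window_nats(model, tokens: torch.Tensor, window: int = 1024, stride: int = 64):
    """Score a 1-D token stream with a causal sliding window (read-only).

    Each chunk re-runs up to `window` tokens of context, but only positions
    not yet scored (at most `stride` per step) contribute to the loss, so
    every target is scored exactly once with near-maximal left context.
    """
    model.eval()
    total_nll, total_count = 0.0, 0
    scored_upto = 1                          # position 0 has no prediction target
    end = min(window, tokens.numel())
    while True:
        start = max(0, end - window)
        chunk = tokens[start:end].unsqueeze(0)      # (1, T)
        logits = model(chunk[:, :-1])               # assumes model returns logits
        targets = chunk[:, 1:]
        nll = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
        )
        new = end - scored_upto                     # newly scored positions this step
        total_nll += nll[nll.numel() - new:].sum().item()
        total_count += new
        scored_upto = end
        if end == tokens.numel():
            break
        end = min(end + stride, tokens.numel())
    return total_nll / total_count                  # mean nats per token
```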

Credits

….1126

3-seed results: 1.1126/1.1123/1.1129 (mean 1.1126, std 0.0003)
Built on PR openai#1089 with GPTQ reserve optimization (14s to 9s).
Includes experimental fused Triton MLP kernel (hard-disabled).
@MatoTeziTanka

Community Review — Turbo-Muon + EngramLite + ParamBanking + GPTQ Reserve Opt — val_bpb 1.1126 (3-seed mean)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #1169 — Audit Summary

Head SHA: c41b299
File audited: records/track_10min_16mb/2026-03-30_V18_FusedTritonOp/train_gpt.py (2672 lines)

---

### Checklist

1. ILLEGAL n-gram family bug (target XOR'd into hash key): CLEAR. The n-gram implementation is EngramLite (lines 944–985). Hash keys use input_ids (the current-position token) and prev_ids/pp_ids (prior-position tokens shifted via F.pad). No target labels (y) are ever passed to or used in hashing. The forward signature takes input_ids only (line 965). No target leakage found.
2. ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first): CLEAR. eval_val (line 521) and eval_val_sliding (line 1258) both run under torch.inference_mode() (lines 550, 1287) with model.eval(). No optimizer steps, .backward() calls, or gradient accumulation occur anywhere using val_tokens. The GPTQ calibration at end-of-training (lines 2393–2396) uses train_files via DistributedTokenLoader, not val_tokens. Zero TTT of any kind is present.
3. LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard): N/A. No TTT is present at all, so no is_last_chunk guard is needed or present.
4. SCORED-REGION SLOT HOLD: N/A. No special scored-region patching or slot tricks are present.
5. Pure neural characterization: this submission is a pure transformer training run with:
   - Standard training loop (train data only for gradient updates, lines 2258–2303)
   - GPTQ int6 quantization post-training on train data
   - Optional fused Triton MLP kernel (HAS_FUSED_MLP)
   - EngramLite multi-head bigram+trigram embeddings (learned, clean hash using only context tokens)
   - EMA/SWA weight averaging
   - No external lookup, no n-gram inference tricks at eval time
   - Sliding-window eval is scoring-only (inference_mode, no weight updates)

All val usage is read-only...
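For readers unfamiliar with the hashing pattern described in checklist item 1, a minimal sketch of target-free context hashing; the mixing constants, bin count, and function name are illustrative, not the submission's values:

```python
import torch
import torch.nn.functional as F

def context_ngram_hash(input_ids: torch.Tensor, n_bins: int = 65536):
    """Hash current + previous context tokens into bigram/trigram bucket ids.

    Only input_ids (and shifted copies of it) enter the key; labels never do,
    which is the property the audit checks for.
    """
    prev_ids = F.pad(input_ids, (1, 0), value=0)[:, :-1]   # token at position t-1
    pp_ids   = F.pad(input_ids, (2, 0), value=0)[:, :-2]   # token at position t-2
    # Illustrative mixing constants; the submission's hash function may differ.
    bigram  = (input_ids * 1000003 + prev_ids * 999983) % n_bins
    trigram = (bigram * 1000033 + pp_ids * 999979) % n_bins
    return bigram, trigram
```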

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

