
Turbo-Muon + EngramLite + ParamBanking + GPTQ Reserve Opt — val_bpb 1.1126 (3-seed mean) #1169

Open
Bortlesboat wants to merge 1 commit into openai:main from Bortlesboat:submission/v18-turbomuon-fused-1.1126

Conversation

@Bortlesboat

Summary

  • val_bpb: 1.1126 (3-seed mean, std 0.0003)
  • Artifact: ~15.98 MB (all seeds under 16,000,000 bytes)
  • Eval time: ~120s (no TTT, sliding window stride=64)
  • Built on PR #1089 by @mikeapedia

3-Seed Results

| Seed | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------------|-----------------|------------------|
| 1337 | 1.1126      | 1.87857         | 15,981,856       |
| 42   | 1.1123      | 1.87803         | 15,984,349       |
| 999  | 1.1129      | 1.87900         | 15,985,912       |
| Mean | 1.1126      | 1.87853         |                  |

vs merged SOTA (PR #549, 1.89002 nats): -0.01149 nats. Note: open PRs #1089 (1.1091) and #1105 (1.1138) achieve better scores.
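For reference, the BPB and nats columns are related by a bits conversion plus a token-to-byte renormalization. A minimal sketch of that conversion, with hypothetical token and byte counts chosen only to illustrate the ~2.44 bytes-per-token ratio implied by the table (the real counts come from the eval harness, not this PR):

```python
import math

def nats_per_token_to_bpb(val_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    # Total information in bits = loss_nats * n_tokens / ln(2);
    # normalize by the raw byte count of the validation text.
    return val_loss_nats * n_tokens / (n_bytes * math.log(2))

# Hypothetical counts, chosen only to reproduce the ratio implied by the table
# (1.87853 nats/token vs. 1.1126 bits/byte, i.e. ~2.44 bytes/token).
print(nats_per_token_to_bpb(1.87853, n_tokens=1_000_000, n_bytes=2_436_000))  # ~1.1125
```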

What's New vs PR #1089

  1. GPTQ Reserve Optimization: Reduced calibration reserve from 14s to 9s (actual calibration ~8.4s), recovering ~55 extra training steps
  2. Experimental fused Triton MLP kernel: Forward-only fusion via torch.library.triton_op with standard PyTorch backward. Hard-disabled in this submission (produces NaN on PT2.9 due to TTIR analysis bug). Included as experimental code for future work.
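A minimal sketch of the reserve-budget idea behind item 1, assuming a hypothetical train_with_reserve helper (the submission's actual loop differs): shrinking the reserve from 14 s to 9 s frees ~5 s, which at the roughly 90 ms/step this implies is consistent with the ~55 extra steps claimed.

```python
import time

TRAIN_BUDGET_S = 600.0   # record-track training limit
GPTQ_RESERVE_S = 9.0     # was 14.0; actual GPTQ calibration measured at ~8.4 s

def train_with_reserve(run_step, run_gptq_calibration):
    # Hypothetical helper: train until only the calibration reserve remains,
    # then spend the reserve on post-training GPTQ calibration (train data only).
    t0 = time.time()
    steps = 0
    while time.time() - t0 < TRAIN_BUDGET_S - GPTQ_RESERVE_S:
        run_step()
        steps += 1
    run_gptq_calibration()
    return steps
```

And for item 2, a sketch of the forward-only torch.library.triton_op pattern with a plain PyTorch backward, shown on a toy relu()**2 elementwise op rather than the submission's fused MLP. The op name and kernel are illustrative; the HAS_FUSED_MLP gate mirrors the flag name mentioned in the audit below, and the path is left disabled just as the PR ships it. Requires PyTorch >= 2.6 (for torch.library.wrap_triton) and Triton.

```python
# Minimal sketch (not the submission's kernel): a forward-only fused relu()**2
# op registered via torch.library.triton_op, with the backward left to
# ordinary PyTorch ops.
import torch
import triton
import triton.language as tl
from torch.library import triton_op, wrap_triton

HAS_FUSED_MLP = False  # hard-disable flag, mirroring the PR's approach

@triton.jit
def _relu_sq_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.maximum(x, 0.0)
    tl.store(out_ptr + offs, y * y, mask=mask)

@triton_op("sketch::relu_sq", mutates_args={})
def relu_sq(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    wrap_triton(_relu_sq_kernel)[grid](x, out, n, BLOCK=1024)
    return out

def _setup_context(ctx, inputs, output):
    (x,) = inputs
    ctx.save_for_backward(x)

def _backward(ctx, grad_out):
    # Standard PyTorch backward: d/dx relu(x)**2 = 2 * relu(x)
    (x,) = ctx.saved_tensors
    return grad_out * 2.0 * torch.relu(x)

relu_sq.register_autograd(_backward, setup_context=_setup_context)

def mlp_act(x: torch.Tensor) -> torch.Tensor:
    # Fall back to eager relu()**2 while the fused path stays disabled.
    if HAS_FUSED_MLP and x.is_cuda:
        return relu_sq(x)
    return torch.relu(x).square()
```

Keeping only the forward in Triton and the backward in eager PyTorch avoids writing (and debugging) a backward kernel while still giving torch.compile a gradient-aware custom op.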

Compliance

  • Standard F.cross_entropy scoring
  • No TTT, no eval-time training data access
  • Artifact < 16,000,000 bytes (all 3 seeds)
  • Training < 600s, eval < 600s
  • Causal sliding-window evaluation (stride=64; see the sketch after this list)
  • 3-seed verification: -0.01149 nats vs merged SOTA (> 0.005 threshold)
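A minimal sketch of the stride-64 causal sliding-window scoring referenced above, assuming model(x) returns logits; the function below is illustrative and is not the submission's eval_val_sliding:

```python
import torch
import torch.nn.functional as F

@torch.inference_mode()
def sliding_window_nats(model, tokens: torch.Tensor, window: int = 1024, stride: int = 64):
    """Score a 1-D token stream with a causal sliding window (read-only).

    Each chunk re-runs up to `window` tokens of context, but only positions
    not yet scored (at most `stride` per step) contribute to the loss, so
    every target is scored exactly once with near-maximal left context.
    """
    model.eval()
    total_nll, total_count = 0.0, 0
    scored_upto = 1                          # position 0 has no prediction target
    end = min(window, tokens.numel())
    while True:
        start = max(0, end - window)
        chunk = tokens[start:end].unsqueeze(0)      # (1, T)
        logits = model(chunk[:, :-1])               # assumes model returns logits
        targets = chunk[:, 1:]
        nll = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
        )
        new = end - scored_upto                     # newly scored positions this step
        total_nll += nll[nll.numel() - new:].sum().item()
        total_count += new
        scored_upto = end
        if end == tokens.numel():
            break
        end = min(end + stride, tokens.numel())
    return total_nll / total_count                  # mean nats per token
```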

Credits

….1126

3-seed results: 1.1126/1.1123/1.1129 (mean 1.1126, std 0.0003)
Built on PR openai#1089 with GPTQ reserve optimization (14s to 9s).
Includes experimental fused Triton MLP kernel (hard-disabled).
@MatoTeziTanka

Community Review — Turbo-Muon + EngramLite + ParamBanking + GPTQ Reserve Opt — val_bpb 1.1126 (3-seed mean)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #1169 — Audit Summary

Head SHA: c41b299
File audited: records/track_10min_16mb/2026-03-30_V18_FusedTritonOp/train_gpt.py (2672 lines)

---

### Checklist

1. ILLEGAL n-gram family bug (target XOR'd into hash key): CLEAR. The n-gram implementation is EngramLite (lines 944–985). Hash keys use input_ids (the current-position token) and prev_ids/pp_ids (prior-position tokens shifted via F.pad). No target labels (y) are ever passed to or used in hashing. The forward signature takes input_ids only (line 965). No target leakage found.
2. ILLEGAL Pre-Quant TTT (multi-epoch on val_tokens without score-first): CLEAR. eval_val (line 521) and eval_val_sliding (line 1258) both run under torch.inference_mode() (lines 550, 1287) with model.eval(). No optimizer steps, .backward() calls, or gradient accumulation occur anywhere using val_tokens. The GPTQ calibration at end-of-training (lines 2393–2396) uses train_files via DistributedTokenLoader, not val_tokens. Zero TTT of any kind is present.
3. LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard): N/A. No TTT is present at all, so no is_last_chunk guard is needed or present.
4. SCORED-REGION SLOT HOLD: N/A. No special scored-region patching or slot tricks are present.
5. Pure neural characterization: this submission is a pure transformer training run with:
   - Standard training loop (train data only for gradient updates, lines 2258–2303)
   - GPTQ int6 quantization post-training on train data
   - Optional fused Triton MLP kernel (HAS_FUSED_MLP)
   - EngramLite multi-head bigram+trigram embeddings (learned, clean hash using only context tokens)
   - EMA/SWA weight averaging
   - No external lookup, no n-gram inference tricks at eval time
   - Sliding-window eval is scoring-only (inference_mode, no weight updates)

All val usage is read-only...
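For readers unfamiliar with the hashing pattern described in checklist item 1, a minimal sketch of target-free context hashing; the mixing constants, bin count, and function name are illustrative, not the submission's values:

```python
import torch
import torch.nn.functional as F

def context_ngram_hash(input_ids: torch.Tensor, n_bins: int = 65536):
    """Hash current + previous context tokens into bigram/trigram bucket ids.

    Only input_ids (and shifted copies of it) enter the key; labels never do,
    which is the property the audit checks for.
    """
    prev_ids = F.pad(input_ids, (1, 0), value=0)[:, :-1]   # token at position t-1
    pp_ids   = F.pad(input_ids, (2, 0), value=0)[:, :-2]   # token at position t-2
    # Illustrative mixing constants; the submission's hash function may differ.
    bigram  = (input_ids * 1000003 + prev_ids * 999983) % n_bins
    trigram = (bigram * 1000033 + pp_ids * 999979) % n_bins
    return bigram, trigram
```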

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

