1.1145 BPB: Parallel Muon + INT5 GPTQ + Legal TTT #1171
EthanYangTW wants to merge 1 commit into openai:main
Conversation
Key innovations over previous submission (1.1195, PR openai#529):

1. **Parallel Muon Optimizer** — Parameter banking with async reduce-scatter/all-gather overlapping Newton-Schulz orthogonalization. Three-phase training loop: (1) launch async reduce-scatter for the banks, (2) all-reduce + Adam step for replicated params (overlaps with the reduce-scatter), (3) wait on the reduce-scatter, run NS5, launch async all-gather. Eliminates the DDP wrapper entirely. From PR openai#1120 (Rascal/Cambrian).
2. **INT5 Quantization (clip_range=15)** — 31 unique integer levels instead of 63 (INT6). Combined with GPTQ Hessian-aware error compensation, achieves ~0.476 bytes/param compression vs ~0.64 for INT6. Enables fitting a larger model (MHA 8/8, MLP 3.5x, BigramHash 6144, ~32M unique params) under the 16MB artifact limit.
3. **Coprime Stride Data Loader** — Deterministic, permutation-free sampling using coprime strides over memory-mapped shards. Each shard is traversed via a stride coprime to its block count, guaranteeing full coverage without storing permutation arrays. Adaptive shard selection with power-law weighting (alpha decays 0.9→0.5 over training).
4. **Wallclock-Adaptive LR Schedule** — LR warmdown triggers based on elapsed wallclock time rather than step count. Automatically adapts to varying step times across hardware, ensuring consistent convergence regardless of system performance.
5. **MHA 8/8 + MLP 3.5x + BigramHash 6144** — Larger architecture than previous submissions (was GQA 8/4, MLP 3.0x, BigramHash 2048). Full multi-head attention, wider MLP, richer bigram hash embeddings. Only possible due to INT5 compression.

Architecture: 11L, dim=512, MHA 8/8, MLP 3.5x (1792), LeakyReLU²(0.5), XSA on all 11 layers, partial RoPE 16/64, LN scale 1/√(L+1), SmearGate, OrthoInit, BigramHash 6144, shared VE128 (layers 9, 10), U-Net skip connections, EMA 0.997, tight SWA (every 50 steps), late QAT (threshold 0.15), Muon lr=0.025 WD=0.04 (momentum warmup 0.92→0.99 over 1500 steps)

Training: 94ms/step → ~6333 steps in 600s wallclock on 8×H100 SXM

Quantization: INT5 GPTQ (clip_range=15, block_size=64, 256-sample calibration) + 2% magnitude pruning + zstd-22 compression

Eval: Sliding window (stride=64) + legal score-first AdamW TTT (5 epochs, lr=0.0001, last 2 blocks + norms + head unfrozen, 262144-token chunks)

3-seed results:

- Seed 1337: 1.1144 BPB (16.12 MB artifact)
- Seed 42: 1.1141 BPB (15.12 MB artifact)
- Seed 7: 1.1150 BPB (15.26 MB artifact)
- Mean: 1.1145 BPB (std 0.0005)
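For context, a minimal sketch of the stride-64 sliding-window scoring described under Eval above; the `model` call signature, window length, and helper name are illustrative assumptions, not the submission's code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, tokens, window=1024, stride=64, device="cuda"):
    """Score every token once with a sliding context window (sketch).

    Each forward pass sees up to `window` tokens of context, but only the
    trailing `stride` positions contribute to the loss, so every token is
    scored with close to the maximum available left context and no token
    is counted twice. Assumes `model(x)` returns logits of shape (B, T, V).
    """
    total_nll, total_tokens = 0.0, 0
    for end in range(stride, tokens.numel(), stride):
        start = max(0, end - window)
        chunk = tokens[start:end + 1].unsqueeze(0).to(device)  # +1 for targets
        logits = model(chunk[:, :-1])
        targets = chunk[:, 1:]
        nll = F.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1), reduction="none"
        )
        fresh = nll[-stride:]  # positions already covered by earlier windows are skipped
        total_nll += fresh.sum().item()
        total_tokens += fresh.numel()
    # mean nats/token; BPB additionally divides by ln(2) and the bytes-per-token ratio
    return total_nll / total_tokens
```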
Community Review — 1.1145 BPB: Parallel Muon + INT5 GPTQ + Legal TTT

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

### Analysis

Head SHA: 38702da
Files changed: train_gpt.py only

### N-gram / BigramHash family bug check — CLEAN
Summary
3-seed mean: 1.1145 BPB (std 0.0005)
All runs: 600s training + ~335s eval (sliding window stride=64 + 5-epoch TTT) on 8×H100 SXM.
Key Techniques
1. INT5 GPTQ Quantization (clip_range=15)
31 unique integer levels instead of the standard 63 (INT6). Combined with full GPTQ (Hessian-aware error compensation, column reordering, 256-sample self-generated calibration), achieves ~0.476 bytes/param — 26% smaller than INT6. This unlocks fitting a larger model under the 16MB artifact limit.
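A minimal sketch of the per-block INT5 rounding with symmetric per-row scales; GPTQ's Hessian-aware error compensation, column reordering, and calibration are omitted here, and the helper names are illustrative:

```python
import torch

def quantize_int5_blocks(weight, clip_range=15, block_size=64):
    """Symmetric per-block round-to-nearest quantization to INT5 (sketch).

    clip_range=15 gives integer levels in [-15, 15]: 31 unique values,
    vs. 63 for the INT6 setting (clip_range=31). One scale is stored per
    row per block of `block_size` input columns.
    """
    out_f, in_f = weight.shape
    q = torch.empty_like(weight, dtype=torch.int8)
    scales = torch.empty(out_f, (in_f + block_size - 1) // block_size)
    for b, start in enumerate(range(0, in_f, block_size)):
        block = weight[:, start:start + block_size]
        # largest magnitude in each row of the block maps to clip_range
        scale = block.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / clip_range
        q_block = torch.clamp(torch.round(block / scale), -clip_range, clip_range)
        q[:, start:start + block_size] = q_block.to(torch.int8)
        scales[:, b] = scale.squeeze(1)
    return q, scales

def dequantize_int5_blocks(q, scales, block_size=64):
    out = torch.empty(q.shape, dtype=torch.float32)
    for b, start in enumerate(range(0, q.shape[1], block_size)):
        out[:, start:start + block_size] = (
            q[:, start:start + block_size].float() * scales[:, b:b + 1]
        )
    return out
```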
2. XSA on All 11 Layers
Cross-sequence attention applied to every layer, not just the last 4. Against conventional wisdom, but consistently better in our ablations.
3. Legal Score-First Chunked TTT
Validation data split into 262144-token chunks. For each chunk: score first (sliding window, inference mode), then adapt with AdamW (lr=0.0001, 5 epochs, last 2 blocks + norms + head unfrozen). Cosine LR decay across chunks. Every token scored BEFORE any gradient update touches it.
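A sketch of the score-first-per-chunk loop, under assumptions noted in the comments (a GPT-2-style `model(x, targets=y)` forward returning the loss, a `score_fn` wrapping the sliding-window scorer, and trainable params already restricted by the caller):

```python
import math
import torch

def score_first_ttt(model, chunks, score_fn, lr=1e-4, epochs=5, train_window=2048):
    """Legal test-time training: every token is scored BEFORE any update sees it.

    `chunks` is a list of 1-D token tensors (262144 tokens each in the PR);
    `score_fn(model, chunk)` returns (total_nll, n_tokens). Only the last two
    blocks, the norms, and the head are assumed to have requires_grad=True.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr)
    total_nll, total_tokens = 0.0, 0
    for i, chunk in enumerate(chunks):
        # 1) score this chunk with the model adapted on *previous* chunks only
        model.eval()
        with torch.inference_mode():
            nll, n = score_fn(model, chunk)
        total_nll, total_tokens = total_nll + nll, total_tokens + n
        # 2) only now adapt on the already-scored chunk (cosine LR across chunks)
        model.train()
        chunk_lr = lr * 0.5 * (1 + math.cos(math.pi * i / max(1, len(chunks) - 1)))
        for g in opt.param_groups:
            g["lr"] = chunk_lr
        for _ in range(epochs):
            for s in range(0, chunk.numel() - 1, train_window):
                x = chunk[s:s + train_window].unsqueeze(0)
                y = chunk[s + 1:s + train_window + 1].unsqueeze(0)
                x = x[:, :y.size(1)]
                loss = model(x, targets=y)  # assumed GPT-2-style forward
                opt.zero_grad(set_to_none=True)
                loss.backward()
                opt.step()
    return total_nll / total_tokens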
4. Coprime Stride Data Loader
Deterministic permutation-free sampling using strides coprime to shard block counts. Guarantees full data coverage without storing permutation arrays. Adaptive shard selection with decaying power-law weighting.
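A sketch of the stride idea, assuming uint16 token shards; the adaptive power-law shard weighting is omitted and all names are illustrative:

```python
import math
import random
import numpy as np

class CoprimeStrideLoader:
    """Deterministic, permutation-free block sampling over a memory-mapped shard.

    Blocks 0..n_blocks-1 are visited in the order (offset + i*stride) % n_blocks
    with gcd(stride, n_blocks) == 1, so every block is covered exactly once per
    pass without materializing a permutation array.
    """
    def __init__(self, shard_path, block_tokens, seed=0):
        self.data = np.memmap(shard_path, dtype=np.uint16, mode="r")  # assumes uint16 tokens
        self.block_tokens = block_tokens
        self.n_blocks = len(self.data) // block_tokens
        rng = random.Random(seed)
        stride = 1
        if self.n_blocks > 2:
            stride = rng.randrange(1, self.n_blocks)
            while math.gcd(stride, self.n_blocks) != 1:
                stride = rng.randrange(1, self.n_blocks)
        self.stride = stride
        self.offset = rng.randrange(max(1, self.n_blocks))
        self.i = 0

    def next_block(self):
        idx = (self.offset + self.i * self.stride) % self.n_blocks
        self.i += 1
        start = idx * self.block_tokens
        return np.asarray(self.data[start:start + self.block_tokens], dtype=np.int64)
```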
5. Wallclock-Adaptive LR Schedule
LR warmdown triggers based on elapsed wall time rather than step count, automatically adapting to hardware variation.
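A sketch with illustrative constants (the actual warmdown fraction and floor may differ from the submission's):

```python
import time

def wallclock_lr(base_lr, t_start, total_seconds=600.0, warmdown_frac=0.4, min_frac=0.0):
    """LR multiplier driven by elapsed wall time rather than step count (sketch).

    LR stays at base_lr until the final `warmdown_frac` of the time budget,
    then decays linearly to min_frac * base_lr at the deadline, so fast and
    slow hardware both finish the warmdown exactly at the wallclock limit.
    """
    frac = min(1.0, (time.time() - t_start) / total_seconds)
    warmdown_start = 1.0 - warmdown_frac
    if frac < warmdown_start:
        return base_lr
    progress = (frac - warmdown_start) / warmdown_frac
    return base_lr * ((1 - progress) + progress * min_frac)
```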
6. Parallel Muon Optimizer
Parameter banking with async reduce-scatter/all-gather overlapping Newton-Schulz orthogonalization (adapted from PR #1120). Three-phase training loop eliminates DDP wrapper.
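A structural sketch of the three-phase step, assuming pre-built banks of flat gradient/parameter buffers with per-rank shard views; momentum, weight decay, and the banking itself are omitted and the bank dict layout is illustrative:

```python
import torch
import torch.distributed as dist

def zeropower_ns5(G, steps=5):
    """Newton-Schulz iteration approximately orthogonalizing G (Muon core)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16() / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return (X.T if transposed else X).to(G.dtype)

def three_phase_step(banks, replicated, adam, world_size):
    """One optimizer step overlapping communication with compute (sketch)."""
    # Phase 1: launch async reduce-scatter of every bank's flat gradient buffer
    rs_handles = [
        dist.reduce_scatter_tensor(b["grad_shard"], b["grad_flat"], async_op=True)
        for b in banks
    ]
    # Phase 2: replicated (non-Muon) params take their Adam step while RS runs
    for p in replicated:
        dist.all_reduce(p.grad)
        p.grad /= world_size
    adam.step()
    # Phase 3: wait on each bank, orthogonalize its local shard, all-gather back
    ag_handles = []
    for b, h in zip(banks, rs_handles):
        h.wait()
        for g, p, lr in b["shard_views"]:  # 2-D (grad, param, lr) views of the shard
            p.add_(zeropower_ns5(g), alpha=-lr)
        ag_handles.append(
            dist.all_gather_into_tensor(b["param_flat"], b["param_shard"], async_op=True)
        )
    for h in ag_handles:
        h.wait()
```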
Architecture
Training: Muon (lr=0.025, WD=0.04, NS5) + AdamW. 94ms/step, ~6333 steps in 600s.
Compliance