
1.1145 BPB: Parallel Muon + INT5 GPTQ + Legal TTT#1171

Open
EthanYangTW wants to merge 1 commit into openai:main from EthanYangTW:submission/v47-pmuon-int5-3seed

Conversation

@EthanYangTW

@EthanYangTW EthanYangTW commented Mar 31, 2026

Summary

3-seed mean: 1.1145 BPB (std 0.0005)

| Seed | TTT BPB | Artifact Size |
|------|---------|---------------|
| 1337 | 1.1144 | 15.38 MB |
| 42 | 1.1141 | 15.12 MB |
| 7 | 1.1150 | 15.26 MB |
| **Mean** | **1.1145** | |

All runs: 600s training + ~335s eval (sliding window stride=64 + 5-epoch TTT) on 8×H100 SXM.


Key Techniques

1. INT5 GPTQ Quantization (clip_range=15)

31 unique integer levels instead of the standard 63 (INT6). Combined with full GPTQ (Hessian-aware error compensation, column reordering, 256-sample self-generated calibration), achieves ~0.476 bytes/param — 26% smaller than INT6. This unlocks fitting a larger model under the 16MB artifact limit.
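As a sketch of the INT5 half of this pipeline (omitting the GPTQ Hessian-aware error compensation and column reordering), symmetric quantization with clip_range=15 gives exactly the 31 levels described above. The function names here are illustrative, not the submission's API:

```python
import numpy as np

def quantize_int5(w, clip_range=15):
    """Symmetric quantization to 2*clip_range+1 = 31 integer levels."""
    scale = np.abs(w).max() / clip_range
    q = np.clip(np.round(w / scale), -clip_range, clip_range).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_int5(w)
max_err = float(np.abs(dequantize(q, scale) - w).max())  # bounded by scale / 2
```

With round-to-nearest and a scale chosen so the largest weight maps to exactly ±15, no value is clipped and the worst-case per-weight error is scale/2; GPTQ then redistributes that error across not-yet-quantized columns.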

2. XSA on All 11 Layers

Cross-sequence attention applied to every layer, not just the last 4. This runs against conventional wisdom, but it was consistently better in our ablations.

3. Legal Score-First Chunked TTT

Validation data split into 262144-token chunks. For each chunk: score first (sliding window, inference mode), then adapt with AdamW (lr=0.0001, 5 epochs, last 2 blocks + norms + head unfrozen). Cosine LR decay across chunks. Every token scored BEFORE any gradient update touches it.
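The score-first discipline can be sketched with a toy least-squares model (the function and gradient step are illustrative only; the real loop scores a transformer under a sliding window and adapts with AdamW):

```python
import numpy as np

def score_first_ttt(w, chunks, lr=0.01, epochs=5):
    """Score each chunk BEFORE any gradient update touches it, then adapt on it."""
    losses = []
    for ci, (X, y) in enumerate(chunks):
        losses.append(float(np.mean((X @ w - y) ** 2)))  # 1) score chunk ci first
        if ci < len(chunks) - 1:                         # 2) last chunk never adapted on
            for _ in range(epochs):                      # 3) adapt; used for chunk ci+1
                w = w - lr * 2 * X.T @ (X @ w - y) / len(y)
    return losses, w

rng = np.random.default_rng(0)
w_true = rng.normal(size=4)
chunks = []
for _ in range(3):
    X = rng.normal(size=(32, 4))
    chunks.append((X, X @ w_true))
losses, w = score_first_ttt(np.zeros(4), chunks)
```

Because adaptation on chunk `ci` happens strictly after its loss is banked, later chunks benefit from the updated model while no scored token ever influenced its own score.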

4. Coprime Stride Data Loader

Deterministic permutation-free sampling using strides coprime to shard block counts. Guarantees full data coverage without storing permutation arrays. Adaptive shard selection with decaying power-law weighting.
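The coverage guarantee is elementary modular arithmetic: if gcd(stride, n) = 1, then start + i·stride (mod n) visits every block index exactly once. A minimal sketch, with the function name and offset constant chosen for illustration:

```python
from math import gcd

def coprime_stride_order(n_blocks, seed=1337):
    """Visit all n_blocks indices exactly once via a stride coprime to n_blocks."""
    stride = seed % n_blocks or 1
    while gcd(stride, n_blocks) != 1:              # smallest coprime stride >= seed mod n
        stride += 1
    start = (seed * 2654435761) % n_blocks         # cheap deterministic starting offset
    return [(start + i * stride) % n_blocks for i in range(n_blocks)]

order = coprime_stride_order(10, seed=7)           # a permutation of 0..9, never stored
```

No permutation array is ever materialized; each worker only needs (stride, start, i) to know its next block.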

5. Wallclock-Adaptive LR Schedule

LR warmdown triggers based on elapsed wall time rather than step count, automatically adapting to hardware variation.
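A minimal sketch of a wall-time-keyed schedule; the linear decay shape and the 40% warmdown fraction are assumptions, since the submission only specifies that the trigger is elapsed time rather than step count:

```python
import time

def wallclock_lr(base_lr, start_time, budget_s=600.0, warmdown_frac=0.4):
    """LR warmdown keyed to elapsed wall time, not step count."""
    frac = (time.time() - start_time) / budget_s
    if frac < 1.0 - warmdown_frac:
        return base_lr                             # cruise phase: full LR
    remaining = max(0.0, (1.0 - frac) / warmdown_frac)
    return base_lr * remaining                     # linear decay to 0 at the budget

lr_now = wallclock_lr(0.025, start_time=time.time() - 480.0)  # ~0.0125 at 480/600 s
```

A slow machine takes fewer steps before the warmdown begins, but every machine finishes its decay exactly at the time budget.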

6. Parallel Muon Optimizer

Parameter banking with async reduce-scatter/all-gather overlapping Newton-Schulz orthogonalization (adapted from PR #1120). Three-phase training loop eliminates DDP wrapper.
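The Newton-Schulz step that the async communication overlaps with can be sketched as the quintic iteration used in public Muon implementations (coefficients 3.4445, -4.7750, 2.0315 from the Muon reference; the parameter banking and reduce-scatter/all-gather machinery are omitted here):

```python
import numpy as np

def newton_schulz5(G, steps=5):
    """Quintic Newton-Schulz iteration driving singular values toward 1 (Muon-style)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)             # Frobenius-normalize: spectral norm <= 1
    if G.shape[0] > G.shape[1]:
        X = X.T                                    # keep X @ X.T the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if G.shape[0] > G.shape[1] else X

G = np.random.default_rng(0).normal(size=(16, 32))
O = newton_schulz5(G)
sv = np.linalg.svd(O, compute_uv=False)            # all singular values pulled toward 1
```

Because the iteration is pure matmuls, it runs on the banked shard while the all-reduce for replicated parameters proceeds on a separate stream.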


Architecture

  • 11 layers, model_dim=512, MHA 8/8 (head_dim=64)
  • MLP: LeakyReLU squared with 3.5x expansion (1792 hidden)
  • XSA on all 11 layers, Partial RoPE (16/64), LN Scale (1/sqrt(layer+1))
  • SmearGate + OrthoInit, U-Net skip connections
  • BigramHash 6144 (dim=128), Shared ValueEmbedding (layers 9,10)
  • EMA 0.997, Tight SWA (every 50 steps during warmdown)
  • Late QAT (threshold 0.15), 2% magnitude pruning
  • ~32M unique params, INT5 GPTQ + zstd-22 compression
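Of the items above, Partial RoPE (16/64) rotates only the first 16 of the 64 head dimensions and passes the rest through untouched. A minimal single-vector sketch, where the frequency spacing and pairing convention are assumptions rather than the submission's exact layout:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims of a head_dim vector; pass the remainder through."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)      # assumed frequency spacing
    angles = pos * freqs
    x1, x2 = x[:half], x[half:rot_dims]            # one (x1, x2) pair per frequency
    rot1 = x1 * np.cos(angles) - x2 * np.sin(angles)
    rot2 = x1 * np.sin(angles) + x2 * np.cos(angles)
    return np.concatenate([rot1, rot2, x[rot_dims:]])

v = np.ones(64)
out = partial_rope(v, pos=3)                       # dims 16..63 are unchanged
```

The rotation is norm-preserving on the rotated pairs, so the untouched 48 dimensions carry position-free content while only a quarter of each head encodes position.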

Training: Muon (lr=0.025, WD=0.04, NS5) + AdamW. 94ms/step, ~6333 steps in 600s.


Compliance

Key innovations over previous submission (1.1195, PR openai#529):

1. **Parallel Muon Optimizer** — Parameter banking with async reduce-scatter/
   all-gather overlapping Newton-Schulz orthogonalization. 3-phase training
   loop: (1) launch async RS for banks, (2) all-reduce + Adam step for
   replicated params (overlaps with RS), (3) wait RS, NS5, async AG.
   Eliminates DDP wrapper entirely. From PR openai#1120 (Rascal/Cambrian).

2. **INT5 Quantization (clip_range=15)** — 31 unique integer levels instead
   of 63 (INT6). Combined with GPTQ Hessian-aware error compensation,
   achieves ~0.476 bytes/param compression ratio vs ~0.64 for INT6.
   Enables fitting a larger model (MHA 8/8, MLP 3.5x, BigramHash 6144,
   ~32M unique params) under the 16MB artifact limit.

3. **Coprime Stride Data Loader** — Deterministic permutation-free sampling
   using coprime strides over memory-mapped shards. Each shard is traversed
   via stride coprime to block count, guaranteeing full coverage without
   storing permutation arrays. Adaptive shard selection with power-law
   weighting (alpha decays 0.9→0.5 over training).

4. **Wallclock-Adaptive LR Schedule** — LR warmdown triggers based on
   elapsed wallclock time rather than step count. Automatically adapts to
   varying step times across hardware, ensuring consistent convergence
   regardless of system performance.

5. **MHA 8/8 + MLP 3.5x + BigramHash 6144** — Larger architecture than
   previous submissions (was GQA 8/4, MLP 3.0, BigramHash 2048). Full
   multi-head attention, wider MLP, richer bigram hash embeddings. Only
   possible due to INT5 compression.

Architecture: 11L, dim=512, MHA 8/8, MLP 3.5x (1792), LeakyReLU²(0.5),
  XSA all 11 layers, partial RoPE 16/64, LN scale 1/√(L+1), SmearGate,
  OrthoInit, BigramHash 6144, Shared VE128 (layers 9,10), U-Net skip
  connections, EMA 0.997, Tight SWA (every 50), Late QAT (threshold 0.15),
  Muon lr=0.025 WD=0.04 (momentum warmup 0.92→0.99 over 1500 steps)

Training: 94ms/step → ~6333 steps in 600s wallclock on 8×H100 SXM
Quantization: INT5 GPTQ (clip_range=15, block_size=64, 256-sample calibration)
  + 2% magnitude pruning + zstd-22 compression
Eval: Sliding window (stride=64) + Legal score-first AdamW TTT (5 epochs,
  lr=0.0001, last 2 blocks + norms + head unfrozen, 262144-token chunks)

3-seed results:
  Seed 1337: 1.1144 BPB (15.38 MB artifact)
  Seed 42:   1.1141 BPB (15.12 MB artifact)
  Seed 7:    1.1150 BPB (15.26 MB artifact)
  Mean:      1.1145 BPB (std 0.0005)
@EthanYangTW EthanYangTW marked this pull request as ready for review March 31, 2026 07:20
Copilot AI review requested due to automatic review settings March 31, 2026 07:20

Copilot AI left a comment


Copilot wasn't able to review any files in this pull request.



@MatoTeziTanka

Community Review — 1.1145 BPB: Parallel Muon + INT5 GPTQ + Legal TTT

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

**Analysis**

Head SHA: 38702da. Files changed: `train_gpt.py` only.

### N-gram / BigramHash family bug check — CLEAN

`BigramHashEmbedding.bigram_hash` (line 236):

```python
out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
```

`t` is `token_ids` passed from `forward(self, token_ids)`, which is `input_ids` (i.e., `x`, the context). The target `y` is not passed to the bigram embedding at any call site (lines 301, 316), so there is no target XOR leakage into the hash key. The hash uses only consecutive input token pairs — the legal n-gram family pattern.

### TTT classification — LEGAL SCORE-FIRST (PR #1413 pattern)

`eval_val_sliding_ttt` (lines 338–390) operates chunk-by-chunk:

1. **Score chunk `ci` first** (lines 358–365): inside `torch.no_grad()`, scores all windows assigned to chunk `ci`, accumulating `loss_sum` / `token_count` / `byte_count`.
2. **`is_last_chunk = ci == num_chunks - 1`** (line 366): the last chunk is never adapted on.
3. **Adapt AFTER scoring** (lines 367–383): `if not is_last_chunk and ttt_epochs > 0:` — trains on the full content of chunk `ci` only after its scored tokens are already banked. The updated model is used to score chunk `ci+1`.

This is the canonical score-first pattern with the `is_last_chunk` guard intact. The model never sees a future chunk's content before scoring it.

TTT is Post-Quant: lines 665–666 confirm TTT runs on `eval_model` (the dequantized INT5 model), not on `base_model` during training. This is Post-Quant TTT, not Pre-Quant TTT.

### Pre-Quant TTT check — NOT PRESENT

No multi-epoch gradient updates on `val_tokens` occur before quantization. The training loop (lines 590–633) uses only `train_loader` data; `val_tokens` is read-only during training (used only...

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full `train_gpt.py` source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

