Record: Coprime-Stride Loader + Full GPTQ + Score-First TTT — val_bpb 1.08008 (3-seed mean) #1876

Open
Meirzhan05 wants to merge 2 commits into openai:main from Meirzhan05:feature/mtp

Conversation

@Meirzhan05 Meirzhan05 commented Apr 27, 2026

Record: Coprime-Stride Loader + Full GPTQ + Score-First TTT — val_bpb 1.08008 (3-seed mean)

val_bpb: 1.08008 (3-seed mean, std 0.0009) | ~15.99 MB | 8×H100 SXM

Beats the current SOTA (PR #1493, 1.0810) by 0.00092 BPB with std 0.0009 → comparable in magnitude to recent record gaps on the leaderboard (e.g., #1 → #2 was 0.0012, #2 → #3 was 0.0006, #3 → #4 was 0.0007).

Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | Steps | Pre-quant BPB | Quantized BPB | Sliding BPB | TTT BPB | Artifact (bytes) |
|------|-------|---------------|---------------|-------------|---------|------------------|
| 1337 | 4565  | 1.08596       | 1.09737       | 1.08058     | 1.07907 | 15,992,892       |
| 42   | 4570  | 1.08742       | 1.09877       | 1.08217     | 1.08075 | 15,996,411       |
| 2025 | 4566  | 1.08722       | 1.09874       | 1.08196     | 1.08043 | 15,993,485       |
| Mean | 4567  | 1.08686       | 1.09829       | 1.08157     | 1.08008 | 15,994,263       |

Key Innovations

1. Coprime-Stride Multi-Shard Loader

Replaces the standard ShuffledSequenceLoader with a coprime-stride data loader (PR #726 style). Within each shard, sequences are accessed with a stride coprime to the block count, guaranteeing every block is visited exactly once per epoch without cyclic patterns. Adaptive shard selection uses progress-based weighting (alpha decays from 0.9 to 0.5) with interleaved bucket draining for maximum diversity per batch.

Effect: +36 extra training steps (4565 vs 4529 baseline), better pre-quant BPB (1.0860 vs 1.0866).
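For illustration, a minimal sketch of the within-shard visit order, assuming a num_blocks-sized shard and an illustrative stride-picking rule (the PR's adaptive shard weighting and bucket draining are a layer above this and are omitted):

```python
import math

def coprime_stride_order(num_blocks: int, seed: int = 1337):
    """Visit every block index in [0, num_blocks) exactly once per epoch.

    A stride coprime to num_blocks generates the full cyclic group
    Z/num_blocks, so the walk covers each block exactly once while
    avoiding the fixed pattern of a plain sequential scan.
    """
    stride = 2 + seed % max(1, num_blocks - 2)  # illustrative initial guess
    while math.gcd(stride, num_blocks) != 1:
        stride += 1
    pos = seed % num_blocks
    for _ in range(num_blocks):
        yield pos
        pos = (pos + stride) % num_blocks
```

Each yielded index addresses one block of sequences inside a shard; the alpha-weighted shard choice then decides which shard's iterator to draw from next.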

2. Full Hessian GPTQ with Cholesky Fallback

Standard GPTQ with Cholesky error compensation + actorder (column sorting by Hessian diagonal). SD-based clipping at 12.85σ for int6 matrices, 20σ for int8 embeddings. Added Cholesky fallback: if torch.linalg.cholesky fails on an ill-conditioned Hessian, falls back to simple per-row quantization instead of crashing.
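A hedged sketch of the fallback path, assuming the usual dampened-Hessian GPTQ setup (the quantization loop and the per-row fallback helper are omitted; names here are illustrative, not the PR's actual code):

```python
import torch

def hessian_inv_chol(H: torch.Tensor, damp: float = 0.01):
    """Return the upper Cholesky factor of H^-1 that GPTQ consumes,
    or None when H is too ill-conditioned, so the caller can fall back
    to simple per-row round-to-nearest quantization instead of crashing."""
    H = H.clone()
    idx = torch.arange(H.shape[0], device=H.device)
    H[idx, idx] += damp * torch.diag(H).mean()  # standard GPTQ dampening
    try:
        L = torch.linalg.cholesky(H)
        Hinv = torch.cholesky_inverse(L)
        return torch.linalg.cholesky(Hinv, upper=True)
    except torch.linalg.LinAlgError:
        return None  # signal the caller to use the per-row fallback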

3. LZMA Code Compression

Full Python source (53KB) is LZMA-compressed + base85-encoded into a 2-line self-extracting .py file (18KB). Saves ~35KB in artifact size, keeping total under 16MB. Same technique as the current SOTA record.
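A sketch of how such a packing step can look; only the LZMA + base85 + 2-line self-extracting shape comes from the PR text, and the stub format below is illustrative:

```python
import base64
import lzma

# Compress the full training script and wrap it in a 2-line stub
# that decompresses and exec()s the original source when run.
src = open("train_gpt.py", "rb").read()
blob = base64.b85encode(lzma.compress(src, preset=9 | lzma.PRESET_EXTREME))

with open("train_gpt_packed.py", "w") as f:
    f.write("import base64,lzma\n")
    f.write(f"exec(lzma.decompress(base64.b85decode({blob!r})))\n")
```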

4. Score-First TTT (Legal)

Score-first per-chunk test-time training following the PR #461/#549 framework (a sketch follows the list):

  • Score each 32K-token chunk under torch.no_grad() first
  • Then train on that chunk with SGD (momentum=0.9, LR=0.005, 3 epochs)
  • Adapted model only scores future chunks — never rescores tokens it trained on
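A minimal sketch of that loop under the stated hyperparameters; the model, the iterable of tokenized 32K chunks, and the bits-per-byte bookkeeping are assumed, and chunk_nll is an illustrative helper, not the PR's code:

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr=0.005, momentum=0.9, epochs=3):
    """Score each 32K-token chunk before ever training on it."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    nll_sum, n_tokens = 0.0, 0

    def chunk_nll(chunk):
        # Full normalized distribution over the whole vocab (Condition 2).
        logits = model(chunk[:-1].unsqueeze(0)).squeeze(0)
        return F.cross_entropy(logits, chunk[1:], reduction="sum")

    for ci in range(len(chunks)):              # single left-to-right pass
        chunk = chunks[ci]
        model.eval()
        with torch.no_grad():                  # 1) score first: weights here
            nll_sum += chunk_nll(chunk).item() #    reflect only chunks < ci
        n_tokens += chunk.numel() - 1
        model.train()                          # 2) then adapt on this chunk;
        for _ in range(epochs):                #    it is never rescored
            opt.zero_grad(set_to_none=True)
            chunk_nll(chunk).backward()
            opt.step()
    return nll_sum / n_tokens                  # nats/token; bpb divides by bytes
```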

Architecture

  • SP8192 BPE tokenizer (8192 tokens)
  • 11 physical layers, 17 virtual (depth recurrence: layers 3-5 looped 3×)
  • dim=512, 8 heads, 4 KV heads (GQA), MLP 4× with LeakyReLU(0.5)²
  • XSA on all 11 layers, parallel residuals from layer 7+
  • U-Net skip connections with learnable gates
  • Tied embeddings, logit softcap=30 (see the sketch below)
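The logit softcap referenced above, written in the common tanh form (an assumption; the PR does not spell out its exact softcap function):

```python
import torch

def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Smoothly bounds logits to (-cap, cap) while keeping gradients nonzero.
    return cap * torch.tanh(logits / cap)
```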

Training

  • Muon optimizer (5-step Newton-Schulz; sketched after this list) + AdamW for embeddings/scalars
  • EMA (decay 0.9965)
  • 72% warmdown, 20-step warmup + 20-step loop warmup
  • Gradient clipping at 0.3
  • Brotli-11 compression + byte shuffling
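The 5-step Newton-Schulz orthogonalization at the heart of Muon, in the form published with the original optimizer (a reference sketch with the published quintic coefficients, not this PR's exact code):

```python
import torch

@torch.no_grad()
def newtonschulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D gradient via quintic Newton-Schulz."""
    a, b, c = 3.4445, -4.7750, 2.0315   # published quintic coefficients
    X = G.bfloat16()
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.mT
    X = X / (X.norm() + 1e-7)           # scale so the spectral norm is <= 1
    for _ in range(steps):
        A = X @ X.mT
        X = a * X + (b * A + c * (A @ A)) @ X
    return (X.mT if transposed else X).to(G.dtype)
```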

Compliance

Condition 1 (Strict Causal Dependence)

Causal attention via flash_attn_func(causal=True). TTT only incorporates tokens from already-scored chunks.

Condition 2 (Full Normalized Distribution)

Standard F.cross_entropy over full vocab_size logits. No top-k masking.

Condition 3 (Score-Before-Update)

Each chunk scored under torch.no_grad() before any training on that chunk. Model weights at scoring time reflect only prior chunks.

Condition 4 (Single Left-to-Right Pass)

Single for ci in range(num_chunks) loop. Each token scored exactly once. No rescoring or min-over-runs.

Artifacts

3-seed mean: 1.08008 BPB (std 0.0009), all artifacts under 16MB.

records/track_10min_16mb/2026-04-25_CoprimeStride_GPTQ_TTT/
- README.md
- submission.json (3-seed metadata)
- train_gpt.py (LZMA-compressed self-extracting training script)
@Meirzhan05 Meirzhan05 changed the title from "Non-record: Coprime-Stride Loader + Full GPTQ + Score-First TTT (3-seed mean 1.08008 BPB)" to "Record: Coprime-Stride Loader + Full GPTQ + Score-First TTT — val_bpb 1.08008 (3-seed mean)" on Apr 28, 2026
