Record: Coprime-Stride Loader + Full GPTQ + Score-First TTT — val_bpb 1.08008 (3-seed mean) #1876
Open
Meirzhan05 wants to merge 2 commits into openai:main from
3-seed mean: 1.08008 BPB (std 0.0009), all artifacts under 16MB.

records/track_10min_16mb/2026-04-25_CoprimeStride_GPTQ_TTT/
- README.md
- submission.json (3-seed metadata)
- train_gpt.py (LZMA-compressed self-extracting training script)
val_bpb: 1.08008 (3-seed mean, std 0.0009) | ~15.99 MB | 8×H100 SXM
Beats the current SOTA (PR #1493, 1.0810) by 0.00092 BPB (std 0.0009), a gap comparable in magnitude to recent record gaps on the leaderboard (#1→#2: 0.0012, #2→#3: 0.0006, #3→#4: 0.0007).
Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
Key Innovations
1. Coprime-Stride Multi-Shard Loader
Replaces the standard ShuffledSequenceLoader with a coprime-stride data loader (PR #726 style). Within each shard, sequences are accessed with a stride coprime to the block count, guaranteeing every block is visited exactly once per epoch without cyclic patterns. Adaptive shard selection uses progress-based weighting (alpha decays from 0.9 to 0.5) with interleaved bucket draining for maximum diversity per batch.

Effect: +36 extra training steps (4565 vs 4529 baseline) and better pre-quant BPB (1.0860 vs 1.0866).
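The core property of the stride trick can be shown in a few lines. This is a minimal sketch of the idea, not the PR's actual loader; `coprime_stride_order` is a hypothetical helper, and the stride-selection policy here (random search for a coprime value) is an assumption.

```python
import math
import random

def coprime_stride_order(num_blocks: int, seed: int = 0) -> list[int]:
    """Visit every block index in [0, num_blocks) exactly once by stepping
    with a stride coprime to num_blocks. Because gcd(stride, num_blocks) == 1,
    the walk forms a full cycle over the blocks, so each block is hit exactly
    once per epoch, in a scrambled rather than sequential order."""
    rng = random.Random(seed)
    # Hypothetical stride policy: draw random candidates until one is coprime.
    # (A stride of 1 is always coprime, so this loop terminates.)
    stride = rng.randrange(1, num_blocks)
    while math.gcd(stride, num_blocks) != 1:
        stride = rng.randrange(1, num_blocks)
    start = rng.randrange(num_blocks)
    return [(start + i * stride) % num_blocks for i in range(num_blocks)]

order = coprime_stride_order(10, seed=3)
assert sorted(order) == list(range(10))  # every block visited exactly once
```

Any coprime stride gives the exactly-once guarantee; the scrambled visit order is what avoids the cyclic access patterns mentioned above.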
2. Full Hessian GPTQ with Cholesky Fallback
Standard GPTQ with Cholesky error compensation + actorder (column sorting by Hessian diagonal). SD-based clipping at 12.85σ for int6 matrices, 20σ for int8 embeddings. Added a Cholesky fallback: if torch.linalg.cholesky fails on an ill-conditioned Hessian, the quantizer falls back to simple per-row quantization instead of crashing.

3. LZMA Code Compression

The full Python source (53KB) is LZMA-compressed and base85-encoded into a 2-line self-extracting .py file (18KB), saving ~35KB of artifact size and keeping the total under 16MB. Same technique as the current SOTA record.
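The self-extraction trick can be sketched with the standard library alone. This is an illustration of the technique under stated assumptions, not the PR's actual packer; `make_self_extracting` is a hypothetical helper name.

```python
import base64
import lzma

def make_self_extracting(source: str) -> str:
    """Pack Python source into a tiny self-extracting script:
    LZMA-compress the text, base85-encode the bytes, and wrap the
    payload in a two-line exec stub that reverses both steps at
    runtime. The base85 alphabet contains no quotes or backslashes,
    so repr() embeds the payload safely as a string literal."""
    packed = base64.b85encode(lzma.compress(source.encode())).decode()
    return (
        "import base64,lzma\n"
        f"exec(lzma.decompress(base64.b85decode({packed!r})).decode())\n"
    )

stub = make_self_extracting("print('hello from the packed script')")
exec(stub)  # prints: hello from the packed script
```

The stub only pays off for sources large enough that LZMA's savings outweigh the ~25% base85 expansion, which holds easily at the 53KB scale described above.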
4. Score-First TTT (Legal)
Score-first per-chunk test-time training following the PR #461/#549 framework: each chunk is scored under torch.no_grad() first, before any update on that chunk.

Architecture
Training
Compliance
Condition 1 (Strict Causal Dependence)
Causal attention via flash_attn_func(causal=True). TTT only incorporates tokens from already-scored chunks.

Condition 2 (Full Normalized Distribution)
Standard F.cross_entropy over full vocab_size logits. No top-k masking.

Condition 3 (Score-Before-Update)
Each chunk is scored under torch.no_grad() before any training on that chunk. Model weights at scoring time reflect only prior chunks.

Condition 4 (Single Left-to-Right Pass)
A single for ci in range(num_chunks) loop. Each token is scored exactly once. No rescoring or min-over-runs.

Credits