Non-record: SP8192 + dim=464 + Pre-Quantization TTT + Brotli (1.1863 BPB)#1760
Open
BrandtChristian wants to merge 2 commits intoopenai:mainfrom
Open
Non-record: SP8192 + dim=464 + Pre-Quantization TTT + Brotli (1.1863 BPB)#1760BrandtChristian wants to merge 2 commits intoopenai:mainfrom
BrandtChristian wants to merge 2 commits intoopenai:mainfrom
Conversation
…otli (1.1863 BPB) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ommand Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
val_bpb: 1.1863 (roundtrip, seed 1337) | 15.92 MB | 1×RTX 5090, 12k steps
Post-TTT: 1.1524 BPB (score-first TTT, 3 epochs on preq-adapted weights)
Submitting to non-record track: trained 12k steps on a single RTX 5090 (~33 min), exceeding the 10-min budget. The technique is designed to run on 8×H100 with
MAX_WALLCLOCK_SECONDS=600andPREQ_TTT_EPOCHS=21.Key Technique: Pre-Quantization TTT
After training ends, before INT6 quantization, adapt the FP32 weights on the full validation set using standard (non-score-first) TTT. This conditions the weights to the val distribution before the precision loss from quantization locks them in.
Scaling law (dim=464, 12k steps, 1×RTX 5090):
Still scaling at 7 epochs. On 8×H100 (DDP-interleaved chunks,
all_reduceper epoch), 21 epochs ≈ 240s — expected ~1.15 BPB.Stack
SP8192 tokenizer · dim=464 · 11 layers · MLP 3× LeakyReLU(0.5)² · BigramHash(1536) · XSA last 4 layers · depth recurrence layers 3–5 ×2 · parallel residuals from layer 7 · QAT INT6 (all layers) · INT8 embeddings · brotli+byte-shuffle compression · EMA+SWA · MuonEq-R
Artifact
15,915,528 bytes (84 KB under 16 MB limit)