ANS weight compression: 1.6 MB (13.9%) lossless savings over LZMA#1510
ANS weight compression: 1.6 MB (13.9%) lossless savings over LZMA#1510OE-GOD wants to merge 21 commits intoopenai:mainfrom
Conversation
Replace LZMA with per-layer rANS encoding on int6 quantized weights. Within 11 KB of theoretical entropy minimum. LZMA wastes 1,638 KB. Savings = 2.2M extra parameters at int6 in the same 16 MB budget. Includes systematic waste analysis: - Layer delta encoding: rejected (delta/weight=1.3, layers are unique) - Embedding factorization: rejected (1.6% of model, high rank) - Spatial correlation: rejected (residual entropy 11.4% higher) - LZMA vs optimal: confirmed (1.6 MB gap) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…nking HIGH priority Key findings from daily scan: - Merged SOTA updated to 1.0810 (bigbag, PR openai#1493, Apr 9) — was stale at 1.1147 - New target: ≤1.0760 bpb (beat by ≥0.005 nats) - ANS weight compression (PR openai#1510): 1.6MB freed = +2.2M params, zero legality risk - Parameter Banking + Parallel Muon (PR openai#1523): +5.2% throughput, ~30 free steps - Free wins: Muon momentum 0.97 (-0.0004 bpb), QK-Gain 5.25 (monotonic vs 5.0) - Per-Pass Loop Embeddings (PR openai#1518): reduces quant gap 0.0131→0.0114 - Do NOT implement: Eval-Time Hash Emb (illegal pattern), Tap-In V6 (await ruling) - CLAUDE.md: updated SOTA, target, current approach, technique table, Session 9 lessons https://claude.ai/code/session_01FLdCggVuuBKQCUy6J3xyss
Update: Verified on 8×H100 SXM + Wider Model Results8×H100 ResultsTrained on 8×H100 SXM with SP1024 baseline (10-minute wall clock). ANS compression results:
ANS savings: 1.89 MB (17.2%) — consistent across 1×H100 and 8×H100 runs. Using the Freed Space: Wider ModelThe key question: does spending the freed 1.89 MB on more parameters improve BPB?
The wider model improves BPB by 0.013 — and training was cut short (connection dropped at step 9000/11000+). Final BPB would likely be ~1.20. The 26.5M parameter model only fits in 16 MB because ANS compression is 17.2% more efficient than LZMA. Under LZMA, this model would exceed the artifact budget. Summary
Next Steps
The design space under ANS compression is strictly larger than under LZMA/Brotli. The optimal model in this larger space is at least as good, and our preliminary results show it's better. |
Update: 8×H100 Full Stack Results — val_bpb 1.0996Final Results (8×H100 SXM, 600s training)
Key FindingANS compression enables a 32M parameter model in 13.56 MB — the same architecture that others fit in 15.9 MB with Brotli. The 2.34 MB of headroom means ANS users can explore wider/deeper architectures that don't fit under Brotli. The MLP 6× experiment (Run 3) proved the limit: 44M params overflows even with ANS (16.67 MB), and heavy pruning destroys quality. The sweet spot is 32M params with MLP 4× — maximally trained in 10 minutes with room to spare. Comparison to Leaderboard
Best SP1024 result: 1.0996 BPB. Beats merged leaderboard by 0.015 BPB. What ANS EnablesThe
Reproductiongit checkout ans-compression
RUN_ID=ans_run USE_ANS=1 VOCAB_SIZE=1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
torchrun --standalone --nproc_per_node=8 train_gpt_ans.pyNext Steps
|
… post-TTT GPTQ, progressive recurrence
2-chunk approach: score first half (no adaptation), train on first half, score second half (with adaptation). Satisfies score-before-update because each token is scored BEFORE the model trains on it. Replaces non-compliant 6-epoch TTT that was flagged on PRs openai#1487, openai#1488, openai#1517. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Bug: per-rank my_mid computation caused rank 7 to skip training entirely (my_mid < my_start), deadlocking all_reduce. Fix: split globally into [0, mid_seq) and [mid_seq, total_seqs), then distribute each phase across ranks independently. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Summary
Replace LZMA with per-layer rANS (range Asymmetric Numeral Systems) encoding on int6 quantized weights. Saves 1.6 MB (13.9%) losslessly — within 11 KB of theoretical entropy minimum.
At int6, 1.6 MB = 2.2 million extra parameters in the same 16 MB budget.
Why LZMA Wastes Space
LZMA operates on bytes but int6 values span byte boundaries. It can't see the symbol structure. ANS with the exact per-layer histogram encodes each symbol in
-log2(freq/total)bits — optimal.Systematic Waste Analysis
We tested 4 hypotheses about where the 16 MB budget is wasted:
Usage
Integration
Orthogonal to all architecture/training improvements. Drop-in replacement for the serialization step:
compress_model()decompress_model()Files
ans_compress.py— rANS encoder/decoder (pure Python, no deps beyond numpy)delta_compress.py— Layer similarity analysis (used to reject delta encoding)README.md— Full methodology and resultsLimitations