Skip to content

ANS weight compression: 1.6 MB (13.9%) lossless savings over LZMA#1510

Open
OE-GOD wants to merge 21 commits intoopenai:mainfrom
OE-GOD:ans-compression
Open

ANS weight compression: 1.6 MB (13.9%) lossless savings over LZMA#1510
OE-GOD wants to merge 21 commits intoopenai:mainfrom
OE-GOD:ans-compression

Conversation

@OE-GOD
Copy link
Copy Markdown

@OE-GOD OE-GOD commented Apr 9, 2026

Summary

Replace LZMA with per-layer rANS (range Asymmetric Numeral Systems) encoding on int6 quantized weights. Saves 1.6 MB (13.9%) losslessly — within 11 KB of theoretical entropy minimum.

At int6, 1.6 MB = 2.2 million extra parameters in the same 16 MB budget.

Why LZMA Wastes Space

Method Size Bits/param vs Theoretical
Theoretical minimum 9.80 MB 4.82
ANS (this PR) 9.81 MB 4.82 +11 KB
LZMA (current) 11.40 MB 5.60 +1,638 KB

LZMA operates on bytes but int6 values span byte boundaries. It can't see the symbol structure. ANS with the exact per-layer histogram encodes each symbol in -log2(freq/total) bits — optimal.

Systematic Waste Analysis

We tested 4 hypotheses about where the 16 MB budget is wasted:

  1. Layer delta encoding ✗ — Rejected. Delta/weight ratio = 1.3. Layers are unique.
  2. Embedding factorization ✗ — Rejected. Only 1.6% of model. High rank (SVD confirms).
  3. Spatial correlation ✗ — Rejected. Residual entropy 11.4% higher. Adjacent weights uncorrelated.
  4. LZMA vs optimal coding ✓ — Confirmed. 1.6 MB gap.

Usage

# Analyze savings on any trained model
python ans_compress.py --input model.npz --analyze --bits 6

# Compress + verify roundtrip
python ans_compress.py --input model.npz --output model.ans --bits 6 --verify

Integration

Orthogonal to all architecture/training improvements. Drop-in replacement for the serialization step:

  1. After training + GPTQ, replace LZMA with compress_model()
  2. At load time, replace LZMA with decompress_model()
  3. Use freed 1.6 MB for wider model
  4. Retrain + measure BPB

Files

  • ans_compress.py — rANS encoder/decoder (pure Python, no deps beyond numpy)
  • delta_compress.py — Layer similarity analysis (used to reject delta encoding)
  • README.md — Full methodology and results

Limitations

  • Tested on baseline model (17M params). Needs 8×H100 to test on SOTA stack.
  • Pure Python rANS — production version should use C for speed.
  • Not a BPB record submission — this is a compression tool for others to integrate.

Replace LZMA with per-layer rANS encoding on int6 quantized weights.
Within 11 KB of theoretical entropy minimum. LZMA wastes 1,638 KB.

Savings = 2.2M extra parameters at int6 in the same 16 MB budget.

Includes systematic waste analysis:
- Layer delta encoding: rejected (delta/weight=1.3, layers are unique)
- Embedding factorization: rejected (1.6% of model, high rank)
- Spatial correlation: rejected (residual entropy 11.4% higher)
- LZMA vs optimal: confirmed (1.6 MB gap)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 10, 2026
…nking HIGH priority

Key findings from daily scan:
- Merged SOTA updated to 1.0810 (bigbag, PR openai#1493, Apr 9) — was stale at 1.1147
- New target: ≤1.0760 bpb (beat by ≥0.005 nats)
- ANS weight compression (PR openai#1510): 1.6MB freed = +2.2M params, zero legality risk
- Parameter Banking + Parallel Muon (PR openai#1523): +5.2% throughput, ~30 free steps
- Free wins: Muon momentum 0.97 (-0.0004 bpb), QK-Gain 5.25 (monotonic vs 5.0)
- Per-Pass Loop Embeddings (PR openai#1518): reduces quant gap 0.0131→0.0114
- Do NOT implement: Eval-Time Hash Emb (illegal pattern), Tap-In V6 (await ruling)
- CLAUDE.md: updated SOTA, target, current approach, technique table, Session 9 lessons

https://claude.ai/code/session_01FLdCggVuuBKQCUy6J3xyss
@OE-GOD
Copy link
Copy Markdown
Author

OE-GOD commented Apr 10, 2026

Update: Verified on 8×H100 SXM + Wider Model Results

8×H100 Results

Trained on 8×H100 SXM with SP1024 baseline (10-minute wall clock). ANS compression results:

Method Size Bits/param vs Theoretical
Theoretical minimum 8.89 MB 4.37
ANS (this PR) 8.90 MB 4.37 +11 KB
LZMA 10.74 MB 5.28 +1,903 KB

ANS savings: 1.89 MB (17.2%) — consistent across 1×H100 and 8×H100 runs.

Using the Freed Space: Wider Model

The key question: does spending the freed 1.89 MB on more parameters improve BPB?

Config Params val_bpb Notes
Baseline (MLP 2×) 17.0M 1.2262 Standard config
Wide MLP (MLP 4×) 26.5M 1.2129 Uses ANS-freed space. Still improving at step 9000.

The wider model improves BPB by 0.013 — and training was cut short (connection dropped at step 9000/11000+). Final BPB would likely be ~1.20.

The 26.5M parameter model only fits in 16 MB because ANS compression is 17.2% more efficient than LZMA. Under LZMA, this model would exceed the artifact budget.

Summary

  • ANS compression: 17.2% lossless savings, verified on 8×H100
  • Freed space enables 55% more parameters (17M → 26.5M)
  • Wider model: -0.013 BPB improvement (incomplete training, likely more)
  • ANS is within 11 KB of theoretical entropy minimum on all tested models

Next Steps

  1. Full 3-seed validation on the wider model with ANS serialization
  2. Combine with depth recurrence + SLOT for further gains
  3. Test SP4096 tokenizer where embedding cost makes ANS even more impactful

The design space under ANS compression is strictly larger than under LZMA/Brotli. The optimal model in this larger space is at least as good, and our preliminary results show it's better.

@OE-GOD
Copy link
Copy Markdown
Author

OE-GOD commented Apr 11, 2026

Update: 8×H100 Full Stack Results — val_bpb 1.0996

Final Results (8×H100 SXM, 600s training)

Run Params Config Artifact val_bpb (sliding)
Baseline 17M SP1024, train_gpt.py 15.83 MB 1.2265
ANS + full stack 32M SP1024, XSA-11, TTT 6ep, GPTQ, ANS 13.56 MB 1.0996
ANS + wider MLP 6× 44M SP1024, MLP 6×, same stack 16.67 MB (over!) 1.1251

Key Finding

ANS compression enables a 32M parameter model in 13.56 MB — the same architecture that others fit in 15.9 MB with Brotli. The 2.34 MB of headroom means ANS users can explore wider/deeper architectures that don't fit under Brotli.

The MLP 6× experiment (Run 3) proved the limit: 44M params overflows even with ANS (16.67 MB), and heavy pruning destroys quality. The sweet spot is 32M params with MLP 4× — maximally trained in 10 minutes with room to spare.

Comparison to Leaderboard

Entry BPB Tokenizer
Merged #1 (PR #1019) 1.1147 SP1024
This work (ANS) 1.0996 SP1024
Pending #1 (PR #1517) 1.0632 SP8192

Best SP1024 result: 1.0996 BPB. Beats merged leaderboard by 0.015 BPB.

What ANS Enables

The train_gpt_ans.py script adds USE_ANS=1 flag. Drop-in replacement for Brotli compression:

  • 17% lossless savings on quantized weights
  • Within 11 KB of theoretical entropy minimum
  • Compatible with any architecture, tokenizer, or training recipe

Reproduction

git checkout ans-compression
RUN_ID=ans_run USE_ANS=1 VOCAB_SIZE=1024 \
  DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
  TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
  torchrun --standalone --nproc_per_node=8 train_gpt_ans.py

Next Steps

  • 3-seed validation (pending compute credits)
  • SP8192 integration (pending tokenized data)
  • Combine with progressive recurrence for additional gains

OE-GOD and others added 3 commits April 18, 2026 13:20
2-chunk approach: score first half (no adaptation), train on first half,
score second half (with adaptation). Satisfies score-before-update because
each token is scored BEFORE the model trains on it.

Replaces non-compliant 6-epoch TTT that was flagged on PRs openai#1487, openai#1488, openai#1517.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Bug: per-rank my_mid computation caused rank 7 to skip training
entirely (my_mid < my_start), deadlocking all_reduce.

Fix: split globally into [0, mid_seq) and [mid_seq, total_seqs),
then distribute each phase across ranks independently.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant