ANS weight compression: 1.6 MB (13.9%) lossless savings over LZMA by OE-GOD · Pull Request #1510 · openai/parameter-golf

OE-GOD · 2026-04-09T20:06:21Z

Summary

Replace LZMA with per-layer rANS (range Asymmetric Numeral Systems) encoding on int6 quantized weights. Saves 1.6 MB (13.9%) losslessly — within 11 KB of theoretical entropy minimum.

At int6, 1.6 MB = 2.2 million extra parameters in the same 16 MB budget.

Why LZMA Wastes Space

Method	Size	Bits/param	vs Theoretical
Theoretical minimum	9.80 MB	4.82	—
ANS (this PR)	9.81 MB	4.82	+11 KB
LZMA (current)	11.40 MB	5.60	+1,638 KB

LZMA operates on bytes but int6 values span byte boundaries. It can't see the symbol structure. ANS with the exact per-layer histogram encodes each symbol in -log2(freq/total) bits — optimal.

Systematic Waste Analysis

We tested 4 hypotheses about where the 16 MB budget is wasted:

Layer delta encoding ✗ — Rejected. Delta/weight ratio = 1.3. Layers are unique.
Embedding factorization ✗ — Rejected. Only 1.6% of model. High rank (SVD confirms).
Spatial correlation ✗ — Rejected. Residual entropy 11.4% higher. Adjacent weights uncorrelated.
LZMA vs optimal coding ✓ — Confirmed. 1.6 MB gap.

Usage

# Analyze savings on any trained model
python ans_compress.py --input model.npz --analyze --bits 6

# Compress + verify roundtrip
python ans_compress.py --input model.npz --output model.ans --bits 6 --verify

Integration

Orthogonal to all architecture/training improvements. Drop-in replacement for the serialization step:

After training + GPTQ, replace LZMA with compress_model()
At load time, replace LZMA with decompress_model()
Use freed 1.6 MB for wider model
Retrain + measure BPB

Files

ans_compress.py — rANS encoder/decoder (pure Python, no deps beyond numpy)
delta_compress.py — Layer similarity analysis (used to reject delta encoding)
README.md — Full methodology and results

Limitations

Tested on baseline model (17M params). Needs 8×H100 to test on SOTA stack.
Pure Python rANS — production version should use C for speed.
Not a BPB record submission — this is a compression tool for others to integrate.

Replace LZMA with per-layer rANS encoding on int6 quantized weights. Within 11 KB of theoretical entropy minimum. LZMA wastes 1,638 KB. Savings = 2.2M extra parameters at int6 in the same 16 MB budget. Includes systematic waste analysis: - Layer delta encoding: rejected (delta/weight=1.3, layers are unique) - Embedding factorization: rejected (1.6% of model, high rank) - Spatial correlation: rejected (residual entropy 11.4% higher) - LZMA vs optimal: confirmed (1.6 MB gap) Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

…nking HIGH priority Key findings from daily scan: - Merged SOTA updated to 1.0810 (bigbag, PR openai#1493, Apr 9) — was stale at 1.1147 - New target: ≤1.0760 bpb (beat by ≥0.005 nats) - ANS weight compression (PR openai#1510): 1.6MB freed = +2.2M params, zero legality risk - Parameter Banking + Parallel Muon (PR openai#1523): +5.2% throughput, ~30 free steps - Free wins: Muon momentum 0.97 (-0.0004 bpb), QK-Gain 5.25 (monotonic vs 5.0) - Per-Pass Loop Embeddings (PR openai#1518): reduces quant gap 0.0131→0.0114 - Do NOT implement: Eval-Time Hash Emb (illegal pattern), Tap-In V6 (await ruling) - CLAUDE.md: updated SOTA, target, current approach, technique table, Session 9 lessons https://claude.ai/code/session_01FLdCggVuuBKQCUy6J3xyss

OE-GOD · 2026-04-10T21:21:00Z

Update: Verified on 8×H100 SXM + Wider Model Results

8×H100 Results

Trained on 8×H100 SXM with SP1024 baseline (10-minute wall clock). ANS compression results:

Method	Size	Bits/param	vs Theoretical
Theoretical minimum	8.89 MB	4.37	—
ANS (this PR)	8.90 MB	4.37	+11 KB
LZMA	10.74 MB	5.28	+1,903 KB

ANS savings: 1.89 MB (17.2%) — consistent across 1×H100 and 8×H100 runs.

Using the Freed Space: Wider Model

The key question: does spending the freed 1.89 MB on more parameters improve BPB?

Config	Params	val_bpb	Notes
Baseline (MLP 2×)	17.0M	1.2262	Standard config
Wide MLP (MLP 4×)	26.5M	1.2129	Uses ANS-freed space. Still improving at step 9000.

The wider model improves BPB by 0.013 — and training was cut short (connection dropped at step 9000/11000+). Final BPB would likely be ~1.20.

The 26.5M parameter model only fits in 16 MB because ANS compression is 17.2% more efficient than LZMA. Under LZMA, this model would exceed the artifact budget.

Summary

ANS compression: 17.2% lossless savings, verified on 8×H100
Freed space enables 55% more parameters (17M → 26.5M)
Wider model: -0.013 BPB improvement (incomplete training, likely more)
ANS is within 11 KB of theoretical entropy minimum on all tested models

Next Steps

Full 3-seed validation on the wider model with ANS serialization
Combine with depth recurrence + SLOT for further gains
Test SP4096 tokenizer where embedding cost makes ANS even more impactful

The design space under ANS compression is strictly larger than under LZMA/Brotli. The optimal model in this larger space is at least as good, and our preliminary results show it's better.

… adapted weights

…_DETACH=1)

OE-GOD · 2026-04-11T09:00:11Z

Update: 8×H100 Full Stack Results — val_bpb 1.0996

Final Results (8×H100 SXM, 600s training)

Run	Params	Config	Artifact	val_bpb (sliding)
Baseline	17M	SP1024, train_gpt.py	15.83 MB	1.2265
ANS + full stack	32M	SP1024, XSA-11, TTT 6ep, GPTQ, ANS	13.56 MB	1.0996
ANS + wider MLP 6×	44M	SP1024, MLP 6×, same stack	16.67 MB (over!)	1.1251

Key Finding

ANS compression enables a 32M parameter model in 13.56 MB — the same architecture that others fit in 15.9 MB with Brotli. The 2.34 MB of headroom means ANS users can explore wider/deeper architectures that don't fit under Brotli.

The MLP 6× experiment (Run 3) proved the limit: 44M params overflows even with ANS (16.67 MB), and heavy pruning destroys quality. The sweet spot is 32M params with MLP 4× — maximally trained in 10 minutes with room to spare.

Comparison to Leaderboard

Entry	BPB	Tokenizer
Merged #1 (PR #1019)	1.1147	SP1024
This work (ANS)	1.0996	SP1024
Pending #1 (PR #1517)	1.0632	SP8192

Best SP1024 result: 1.0996 BPB. Beats merged leaderboard by 0.015 BPB.

What ANS Enables

The train_gpt_ans.py script adds USE_ANS=1 flag. Drop-in replacement for Brotli compression:

17% lossless savings on quantized weights
Within 11 KB of theoretical entropy minimum
Compatible with any architecture, tokenizer, or training recipe

Reproduction

git checkout ans-compression
RUN_ID=ans_run USE_ANS=1 VOCAB_SIZE=1024 \
  DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
  TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
  torchrun --standalone --nproc_per_node=8 train_gpt_ans.py

Next Steps

3-seed validation (pending compute credits)
SP8192 integration (pending tokenized data)
Combine with progressive recurrence for additional gains

… post-TTT GPTQ, progressive recurrence

…d SP8192

… incrementally

…o fix it

…rom diagnostic

2-chunk approach: score first half (no adaptation), train on first half, score second half (with adaptation). Satisfies score-before-update because each token is scored BEFORE the model trains on it. Replaces non-compliant 6-epoch TTT that was flagged on PRs openai#1487, openai#1488, openai#1517. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Bug: per-rank my_mid computation caused rank 7 to skip training entirely (my_mid < my_start), deadlocking all_reduce. Fix: split globally into [0, mid_seq) and [mid_seq, total_seqs), then distribute each phase across ranks independently. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

OE-GOD added 2 commits April 10, 2026 12:56

Add ready-to-run ANS experiment script for 8xH100

70f5985

Add 5-run design space sweep for ANS advantage

91879fb

OE-GOD added 3 commits April 11, 2026 00:07

Integrate ANS into openai#1 entry (PR 1517) — USE_ANS=1 to enable

67f5f94

Add post-TTT GPTQ calibration (POST_TTT_GPTQ=1) — matches Hessians to…

7042566

… adapted weights

Add progressive recurrence (PROGRESSIVE_RECUR=1, RECUR_MAX_K=4, RECUR…

2a0b935

…_DETACH=1)

OE-GOD added 12 commits April 11, 2026 22:40

Add quantization optimization sweep: k-sweep, ANS vs Brotli, high WD,…

ea25cfd

… post-TTT GPTQ, progressive recurrence

Fix post-TTT GPTQ: handle 1D tensor in collect_hessians_from_tokens

334fd30

Fix post-TTT GPTQ: cast val_tokens to int64 for embedding layer

292fddc

Add casefold retokenization pipeline — CPU-only data prep for casefol…

7b9dcb7

…d SP8192

Add fast multiprocessing casefold tokenization

67a59af

Add streaming casefold tokenization — no RAM explosion, writes shards…

95950db

… incrementally

Add PR 813 BackoffNgramMixer training script (0.6671 BPB)

8cd62ec

Add per-matrix quantization test (Q,K at int4/int5 vs uniform int6)

706c525

Add diagnostic: analyze WHERE and WHY the model fails before trying t…

0cebe52

…o fix it

Add word-start loss weighting (WORD_START_WEIGHT=1.5) — data-driven f…

2ff2f72

…rom diagnostic

Add PR 1493 merged openai#1 entry (decoded + raw)

bc5d0b2

Add GDN-Hybrid architecture from PR 1545 (1.028 BPB, non-transformer)

3677596

OE-GOD mentioned this pull request Apr 16, 2026

GDN-Hybrid: Fix warmdown/SWA/QAT timing — 1.0208 BPB #1681

Closed

5 tasks

OE-GOD and others added 3 commits April 18, 2026 13:20

Fix nonlocal scope for tensor accumulators in _score_range

090291a

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ANS weight compression: 1.6 MB (13.9%) lossless savings over LZMA#1510

ANS weight compression: 1.6 MB (13.9%) lossless savings over LZMA#1510
OE-GOD wants to merge 21 commits intoopenai:mainfrom
OE-GOD:ans-compression

OE-GOD commented Apr 9, 2026

Uh oh!

OE-GOD commented Apr 10, 2026

Uh oh!

OE-GOD commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

OE-GOD commented Apr 9, 2026

Summary

Why LZMA Wastes Space

Systematic Waste Analysis

Usage

Integration

Files

Limitations

Uh oh!

OE-GOD commented Apr 10, 2026

Update: Verified on 8×H100 SXM + Wider Model Results

8×H100 Results

Using the Freed Space: Wider Model

Summary

Next Steps

Uh oh!

OE-GOD commented Apr 11, 2026

Update: 8×H100 Full Stack Results — val_bpb 1.0996

Final Results (8×H100 SXM, 600s training)

Key Finding

Comparison to Leaderboard

What ANS Enables

Reproduction

Next Steps

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant