
GDN-Hybrid: Fix warmdown/SWA/QAT timing — 1.0208 BPB#1681

Closed
OE-GOD wants to merge 19 commits into openai:main from OE-GOD:gdn-warmdown-fix

Conversation


OE-GOD commented Apr 16, 2026

Summary

PR #1545's GDN-Hybrid has three training systems (warmdown, SWA, Late QAT) that are configured but never activate due to a timing mismatch:

  • ITERATIONS=9999 + WARMDOWN_ITERS=3000 → warmdown starts at step 6999
  • Training only reaches ~2100 steps in 590s wallclock
  • Result: zero warmdown, zero SWA, zero QAT (see the sketch below)
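
For reference, a minimal sketch of the arithmetic, assuming warmdown starts at ITERATIONS - WARMDOWN_ITERS as described above (the helper name is illustrative, not from the training script):

# Illustrative only: why the old defaults can never trigger warmdown.
def warmdown_start(iterations: int, warmdown_iters: int) -> int:
    return iterations - warmdown_iters

steps_in_budget = 2100                # ~steps completed in 590s on 8xH100
print(warmdown_start(9999, 3000))     # 6999 -> far beyond the step budget
print(warmdown_start(2200, 400))      # 1800 -> comfortably inside it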

The fix (2 lines)

# Before (PR #1545):
iterations = int(os.environ.get("ITERATIONS", 9999))
warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 3000))

# After:
iterations = int(os.environ.get("ITERATIONS", 2200))
warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 400))

This activates all three systems:

  • Warmdown starts at step 1800 (cosine decay lr 1.0 → 0.0)
  • SWA triggers at step 2100 (lr_mul < 0.2), collects 3 checkpoints
  • Late QAT triggers at step 2099, reducing quantization degradation (sketched below)
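
A minimal sketch of how the three triggers relate under the cosine warmdown described above; the identifiers (lr_mul, SWA_LR_THRESHOLD, QAT_START) are illustrative assumptions, not the actual names in gdn/gdn_train_gpt.py:

import math

ITERATIONS = 2200
WARMDOWN_ITERS = 400
SWA_LR_THRESHOLD = 0.2   # SWA begins averaging once lr_mul falls below this
QAT_START = 2099         # late-QAT switch-on step reported above

def lr_mul(step: int) -> float:
    """Cosine warmdown from 1.0 to 0.0 over the final WARMDOWN_ITERS steps."""
    warmdown_start = ITERATIONS - WARMDOWN_ITERS      # 2200 - 400 = 1800
    if step < warmdown_start:
        return 1.0
    frac = (step - warmdown_start) / WARMDOWN_ITERS   # 0.0 -> 1.0
    return 0.5 * (1.0 + math.cos(math.pi * frac))

def phases(step: int) -> dict:
    return {
        "warmdown": step >= ITERATIONS - WARMDOWN_ITERS,
        "swa": lr_mul(step) < SWA_LR_THRESHOLD,
        "qat": step >= QAT_START,
    }

# With the old defaults (ITERATIONS=9999, WARMDOWN_ITERS=3000) none of these
# conditions is reachable within the ~2100 steps that fit in 590s.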

Also fixes a SWA device mismatch bug on line 1037 where swa_avg tensors on CPU were mixed with avg_state tensors on CUDA.
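
A minimal sketch of the kind of fix involved, assuming the SWA accumulator lives on CPU while the live weights are on CUDA; the function and the incremental-average form are illustrative, not the exact code at that line:

import torch

def swa_update(swa_avg: dict, avg_state: dict, n_collected: int) -> None:
    # Move each CUDA tensor to the accumulator's device before averaging,
    # so CPU and CUDA tensors are never mixed in a single arithmetic op.
    for name, param in avg_state.items():
        cpu_param = param.detach().to("cpu")
        if name not in swa_avg:
            swa_avg[name] = cpu_param.clone()
        else:
            swa_avg[name] += (cpu_param - swa_avg[name]) / (n_collected + 1)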

Results

8×H100, seed=42, SP1024, runpod/parameter-golf:latest, 590s wallclock:

                 Without fix (PR #1545 defaults)   With fix
Training steps   2119                              2200
Warmdown         None (never triggers)             Steps 1800-2200
SWA checkpoints  0                                 3
Late QAT         Never                             Step 2099
EMA BPB          1.0109                            1.0064
Quantized BPB    1.0243                            1.0208
Artifact         14.91 MB                          15.03 MB

Improvement: -0.0035 quantized BPB from fixing the training pipeline alone.

The insight

The model architecture is already good. The waste is in the pipeline around it — three systems designed to improve final model quality were silently disabled by a config mismatch. Same pattern as PR #1510 (ANS compression): optimize the boring parts.

How to reproduce

ARCH_MODE=D VOCAB_SIZE=1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VAL_LOSS_EVERY=9999 SEED=42 \
ITERATIONS=2200 WARMDOWN_ITERS=400 \
torchrun --standalone --nproc_per_node=8 gdn/gdn_train_gpt.py

Test plan

  • Verified warmdown triggers at step 1800 (lr_mul starts decreasing)
  • Verified SWA starts at step 2100 (3 checkpoints collected)
  • Verified Late QAT at step 2099
  • Quantized BPB: 1.0208 (artifact 15.03 MB, under 16 MB)
  • Compared with unfixed baseline on same hardware/seed

Based on PR #1545 GDN-Hybrid architecture.

OE-GOD and others added 19 commits April 9, 2026 13:05
Replace LZMA with per-layer rANS encoding on int6 quantized weights.
Within 11 KB of theoretical entropy minimum. LZMA wastes 1,638 KB.

Savings = 2.2M extra parameters at int6 in the same 16 MB budget.

Includes systematic waste analysis:
- Layer delta encoding: rejected (delta/weight=1.3, layers are unique)
- Embedding factorization: rejected (1.6% of model, high rank)
- Spatial correlation: rejected (residual entropy 11.4% higher)
- LZMA vs optimal: confirmed (1.6 MB gap)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
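
The entropy-minimum claim above can be sanity-checked with a small estimator; a sketch assuming each layer's int6 quantized weights are available as a flat NumPy integer array (names are illustrative):

import numpy as np

def entropy_floor_bytes(q: np.ndarray) -> float:
    # Empirical per-symbol entropy of the int6 codes; a well-tuned rANS coder
    # should come close to this many bytes for the layer.
    codes = q.ravel()
    counts = np.bincount(codes - codes.min(), minlength=64)
    probs = counts[counts > 0] / counts.sum()
    bits_per_symbol = -(probs * np.log2(probs)).sum()
    return bits_per_symbol * q.size / 8.0

# total_floor = sum(entropy_floor_bytes(q) for q in int6_layers)  # hypothetical list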
The default config (ITERATIONS=9999, WARMDOWN_ITERS=3000) means warmdown
starts at step 6999, but training only reaches ~2100 steps in 590s.
Result: warmdown never triggers, SWA never collects checkpoints, and
Late QAT never activates. Three systems designed to work together but
all inactive.

Fix: ITERATIONS=2200, WARMDOWN_ITERS=400
- Warmdown starts at step 1800 (cosine decay from lr=1.0 to 0.0)
- SWA triggers at step 2100 (lr_mul < 0.2), collects 3 checkpoints
- Late QAT triggers at step 2099, reducing quantization degradation

Also fixes SWA device mismatch bug (line 1037) where swa_avg tensors
on CPU were mixed with avg_state tensors on CUDA.

Results (8xH100, seed=42, SP1024, 590s wallclock):
  Without fix: EMA 1.0109, Quantized 1.0243, Artifact 14.91 MB
  With fix:    EMA 1.0064, Quantized 1.0208, Artifact 15.03 MB

Based on PR openai#1545 GDN-Hybrid architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

OE-GOD commented Apr 17, 2026

Closing — BPB numbers computed with incorrect byte counting. Will resubmit with corrected evaluation.
