
GDN-Hybrid: Fix warmdown/SWA/QAT timing — 1.0208 BPB#1681

Closed
OE-GOD wants to merge 19 commits into openai:main from OE-GOD:gdn-warmdown-fix

Conversation


OE-GOD commented Apr 16, 2026

Summary

PR #1545's GDN-Hybrid has three training systems (warmdown, SWA, Late QAT) that are configured but never activate due to a timing mismatch:

  • ITERATIONS=9999 + WARMDOWN_ITERS=3000 → warmdown starts at step 6999
  • Training only reaches ~2100 steps in 590s wallclock
  • Result: zero warmdown, zero SWA, zero QAT (see the sketch below)
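
For reference, a minimal sketch of the arithmetic, assuming warmdown starts at ITERATIONS - WARMDOWN_ITERS as described above (the helper name is illustrative, not from the training script):

# Illustrative only: why the old defaults can never trigger warmdown.
def warmdown_start(iterations: int, warmdown_iters: int) -> int:
    return iterations - warmdown_iters

steps_in_budget = 2100                # ~steps completed in 590s on 8xH100
print(warmdown_start(9999, 3000))     # 6999 -> far beyond the step budget
print(warmdown_start(2200, 400))      # 1800 -> comfortably inside it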

The fix (2 lines)

# Before (PR #1545):
iterations = int(os.environ.get("ITERATIONS", 9999))
warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 3000))

# After:
iterations = int(os.environ.get("ITERATIONS", 2200))
warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 400))

This activates all three systems:

  • Warmdown starts at step 1800 (cosine decay lr 1.0 → 0.0)
  • SWA triggers at step 2100 (lr_mul < 0.2), collects 3 checkpoints
  • Late QAT triggers at step 2099, reducing quantization degradation (sketched below)
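
A minimal sketch of how the three triggers relate under the cosine warmdown described above; the identifiers (lr_mul, SWA_LR_THRESHOLD, QAT_START) are illustrative assumptions, not the actual names in gdn/gdn_train_gpt.py:

import math

ITERATIONS = 2200
WARMDOWN_ITERS = 400
SWA_LR_THRESHOLD = 0.2   # SWA begins averaging once lr_mul falls below this
QAT_START = 2099         # late-QAT switch-on step reported above

def lr_mul(step: int) -> float:
    """Cosine warmdown from 1.0 to 0.0 over the final WARMDOWN_ITERS steps."""
    warmdown_start = ITERATIONS - WARMDOWN_ITERS      # 2200 - 400 = 1800
    if step < warmdown_start:
        return 1.0
    frac = (step - warmdown_start) / WARMDOWN_ITERS   # 0.0 -> 1.0
    return 0.5 * (1.0 + math.cos(math.pi * frac))

def phases(step: int) -> dict:
    return {
        "warmdown": step >= ITERATIONS - WARMDOWN_ITERS,
        "swa": lr_mul(step) < SWA_LR_THRESHOLD,
        "qat": step >= QAT_START,
    }

# With the old defaults (ITERATIONS=9999, WARMDOWN_ITERS=3000) none of these
# conditions is reachable within the ~2100 steps that fit in 590s.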

Also fixes a SWA device mismatch bug on line 1037 where swa_avg tensors on CPU were mixed with avg_state tensors on CUDA.
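
A minimal sketch of the kind of fix involved, assuming the SWA accumulator lives on CPU while the live weights are on CUDA; the function and the incremental-average form are illustrative, not the exact code at that line:

import torch

def swa_update(swa_avg: dict, avg_state: dict, n_collected: int) -> None:
    # Move each CUDA tensor to the accumulator's device before averaging,
    # so CPU and CUDA tensors are never mixed in a single arithmetic op.
    for name, param in avg_state.items():
        cpu_param = param.detach().to("cpu")
        if name not in swa_avg:
            swa_avg[name] = cpu_param.clone()
        else:
            swa_avg[name] += (cpu_param - swa_avg[name]) / (n_collected + 1)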

Results

8×H100, seed=42, SP1024, runpod/parameter-golf:latest, 590s wallclock:

                 Without fix (PR #1545 defaults)   With fix
Training steps   2119                              2200
Warmdown         None (never triggers)             Steps 1800-2200
SWA checkpoints  0                                 3
Late QAT         Never                             Step 2099
EMA BPB          1.0109                            1.0064
Quantized BPB    1.0243                            1.0208
Artifact         14.91 MB                          15.03 MB

Improvement: -0.0035 quantized BPB from fixing the training pipeline alone.

The insight

The model architecture is already good. The waste is in the pipeline around it — three systems designed to improve final model quality were silently disabled by a config mismatch. Same pattern as PR #1510 (ANS compression): optimize the boring parts.

How to reproduce

ARCH_MODE=D VOCAB_SIZE=1024 \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VAL_LOSS_EVERY=9999 SEED=42 \
ITERATIONS=2200 WARMDOWN_ITERS=400 \
torchrun --standalone --nproc_per_node=8 gdn/gdn_train_gpt.py

Test plan

  • Verified warmdown triggers at step 1800 (lr_mul starts decreasing)
  • Verified SWA starts at step 2100 (3 checkpoints collected)
  • Verified Late QAT at step 2099
  • Quantized BPB: 1.0208 (artifact 15.03 MB, under 16 MB)
  • Compared with unfixed baseline on same hardware/seed

Based on PR #1545 GDN-Hybrid architecture.

OE-GOD and others added 19 commits April 9, 2026 13:05
Replace LZMA with per-layer rANS encoding on int6 quantized weights.
Within 11 KB of theoretical entropy minimum. LZMA wastes 1,638 KB.

Savings = 2.2M extra parameters at int6 in the same 16 MB budget.

Includes systematic waste analysis:
- Layer delta encoding: rejected (delta/weight=1.3, layers are unique)
- Embedding factorization: rejected (1.6% of model, high rank)
- Spatial correlation: rejected (residual entropy 11.4% higher)
- LZMA vs optimal: confirmed (1.6 MB gap)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
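
The entropy-minimum claim above can be sanity-checked with a small estimator; a sketch assuming each layer's int6 quantized weights are available as a flat NumPy integer array (names are illustrative):

import numpy as np

def entropy_floor_bytes(q: np.ndarray) -> float:
    # Empirical per-symbol entropy of the int6 codes; a well-tuned rANS coder
    # should come close to this many bytes for the layer.
    codes = q.ravel()
    counts = np.bincount(codes - codes.min(), minlength=64)
    probs = counts[counts > 0] / counts.sum()
    bits_per_symbol = -(probs * np.log2(probs)).sum()
    return bits_per_symbol * q.size / 8.0

# total_floor = sum(entropy_floor_bytes(q) for q in int6_layers)  # hypothetical list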
The default config (ITERATIONS=9999, WARMDOWN_ITERS=3000) means warmdown
starts at step 6999, but training only reaches ~2100 steps in 590s.
Result: warmdown never triggers, SWA never collects checkpoints, and
Late QAT never activates. Three systems designed to work together but
all inactive.

Fix: ITERATIONS=2200, WARMDOWN_ITERS=400
- Warmdown starts at step 1800 (cosine decay from lr=1.0 to 0.0)
- SWA triggers at step 2100 (lr_mul < 0.2), collects 3 checkpoints
- Late QAT triggers at step 2099, reducing quantization degradation

Also fixes SWA device mismatch bug (line 1037) where swa_avg tensors
on CPU were mixed with avg_state tensors on CUDA.

Results (8xH100, seed=42, SP1024, 590s wallclock):
  Without fix: EMA 1.0109, Quantized 1.0243, Artifact 14.91 MB
  With fix:    EMA 1.0064, Quantized 1.0208, Artifact 15.03 MB

Based on PR openai#1545 GDN-Hybrid architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

OE-GOD commented Apr 17, 2026

Closing — BPB numbers computed with incorrect byte counting. Will resubmit with corrected evaluation.
