GDN-Hybrid: Fix warmdown/SWA/QAT timing — 1.0208 BPB #1681
Closed
OE-GOD wants to merge 19 commits into openai:main from
Conversation
Replace LZMA with per-layer rANS encoding on int6 quantized weights. Within 11 KB of the theoretical entropy minimum; LZMA wastes 1,638 KB. Savings = 2.2M extra parameters at int6 in the same 16 MB budget.

Includes systematic waste analysis:
- Layer delta encoding: rejected (delta/weight = 1.3, layers are unique)
- Embedding factorization: rejected (1.6% of model, high rank)
- Spatial correlation: rejected (residual entropy 11.4% higher)
- LZMA vs optimal: confirmed (1.6 MB gap)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
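As a rough illustration of the kind of gap measured above, here is a minimal sketch (not the PR's actual tooling; the use of NumPy/lzma and the variable names are assumptions) comparing the zeroth-order entropy lower bound of int6 weight codes against what LZMA actually spends:

```python
# Hedged sketch: estimate the Shannon lower bound for a layer's int6 codes and
# compare it with LZMA's output size. The random codes below are a stand-in for
# real quantized weights; the PR's actual analysis script is not shown here.
import lzma
import numpy as np

def entropy_bytes(codes: np.ndarray) -> float:
    """Zeroth-order entropy lower bound, in bytes, for the empirical symbol distribution."""
    _, counts = np.unique(codes, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum() * codes.size / 8)

codes = np.random.randint(0, 64, size=1_000_000, dtype=np.uint8)  # int6 -> 64 symbols
lzma_size = len(lzma.compress(codes.tobytes(), preset=9))
print(f"entropy bound: {entropy_bytes(codes) / 1024:.1f} KB, LZMA: {lzma_size / 1024:.1f} KB")
```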
… post-TTT GPTQ, progressive recurrence
The default config (ITERATIONS=9999, WARMDOWN_ITERS=3000) means warmdown starts at step 6999, but training only reaches ~2100 steps in 590s. Result: warmdown never triggers, SWA never collects checkpoints, and Late QAT never activates. Three systems designed to work together, but all inactive.

Fix: ITERATIONS=2200, WARMDOWN_ITERS=400
- Warmdown starts at step 1800 (cosine decay from lr=1.0 to 0.0)
- SWA triggers at step 2100 (lr_mul < 0.2), collects 3 checkpoints
- Late QAT triggers at step 2099, reducing quantization degradation

Also fixes a SWA device mismatch bug (line 1037) where swa_avg tensors on CPU were mixed with avg_state tensors on CUDA.

Results (8×H100, seed=42, SP1024, 590s wallclock):
- Without fix: EMA 1.0109, Quantized 1.0243, Artifact 14.91 MB
- With fix: EMA 1.0064, Quantized 1.0208, Artifact 15.03 MB

Based on PR openai#1545 GDN-Hybrid architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
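A minimal sketch of the timing described above, assuming the schedule is driven by an lr_mul multiplier as in the commit message; the helper names and the exact QAT trigger are illustrative rather than the script's actual code:

```python
import math

ITERATIONS = 2200       # was 9999
WARMDOWN_ITERS = 400    # was 3000
SWA_LR_THRESHOLD = 0.2  # SWA starts collecting once lr_mul < 0.2 (per the commit message)
QAT_START = 2099        # Late QAT step per the commit message; how the script derives it is an assumption

def lr_mul(step: int) -> float:
    """Cosine warmdown from 1.0 to 0.0 over the final WARMDOWN_ITERS steps."""
    warmdown_start = ITERATIONS - WARMDOWN_ITERS  # 1800 with the fix (6999 with the old config)
    if step < warmdown_start:
        return 1.0
    frac = (step - warmdown_start) / WARMDOWN_ITERS
    return 0.5 * (1.0 + math.cos(math.pi * frac))

def swa_collecting(step: int) -> bool:
    return lr_mul(step) < SWA_LR_THRESHOLD

def qat_active(step: int) -> bool:
    return step >= QAT_START

# With the old config, training stops near step 2100, far below warmdown_start=6999,
# so lr_mul stays at 1.0 and neither SWA nor Late QAT ever fires.
```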
Author
Closing — BPB numbers computed with incorrect byte counting. Will resubmit with corrected evaluation.
Summary
PR #1545's GDN-Hybrid has three training systems (warmdown, SWA, Late QAT) that are configured but never activate due to a timing mismatch:
- ITERATIONS=9999 + WARMDOWN_ITERS=3000 → warmdown starts at step 6999
- Training only reaches ~2100 steps within the 590s wallclock budget
The fix (2 lines)
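A sketch of the two-line change, assuming the constants sit at the top of the training script under these names (the names come from the PR text; the exact file and location are not shown here):

```python
ITERATIONS = 2200      # was 9999
WARMDOWN_ITERS = 400   # was 3000 -> warmdown now begins at step 1800 instead of 6999
```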
This activates all three systems:
- Warmdown: cosine decay of lr_mul from 1.0 to 0.0, starting at step 1800
- SWA: starts collecting checkpoints at step ~2100 (once lr_mul < 0.2), 3 checkpoints total
- Late QAT: activates at step 2099, reducing quantization degradation
Also fixes a SWA device mismatch bug on line 1037 where swa_avg tensors on CPU were mixed with avg_state tensors on CUDA.
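A minimal sketch of that fix, assuming the averaging step iterates over named tensors; only the swa_avg and avg_state names come from the PR, and the blend rule here is illustrative:

```python
import torch

def merge_swa(avg_state: dict[str, torch.Tensor], swa_avg: dict[str, torch.Tensor]) -> None:
    """Fold the CPU-resident SWA average into avg_state (CUDA) without a device mismatch."""
    for name, avg in avg_state.items():
        swa = swa_avg[name].to(avg.device)  # move the CPU tensor onto CUDA before mixing
        avg.copy_((avg + swa) / 2)          # illustrative blend; the real update rule may differ
```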
Results
8×H100, seed=42, SP1024, runpod/parameter-golf:latest, 590s wallclock:

| | EMA BPB | Quantized BPB | Artifact |
| --- | --- | --- | --- |
| Without fix | 1.0109 | 1.0243 | 14.91 MB |
| With fix | 1.0064 | 1.0208 | 15.03 MB |

Improvement: -0.0035 BPB from fixing the training pipeline alone.
The insight
The model architecture is already good. The waste is in the pipeline around it — three systems designed to improve final model quality were silently disabled by a config mismatch. Same pattern as PR #1510 (ANS compression): optimize the boring parts.
How to reproduce
Test plan
Based on PR #1545 GDN-Hybrid architecture.