Non-record: MoE Upcycling + Depth Recurrence — Quantization Gap Analysis #2102
Open
MaxIv25 wants to merge 2 commits into openai:main from
Non-record: MoE Upcycling + Depth Recurrence — Quantization Gap Analysis
Summary
This submission explores Mixture-of-Experts (MoE) upcycling combined with depth recurrence (looping) and Causal Bigram Blending. We find that while MoE upcycling and depth recurrence each improve raw BPB, their combination creates a severe quantization gap under GPTQ int6, making the post-quantization model significantly worse.
The script is kept compact at ~1500 lines (per repository guidelines) while including MoE upcycling, depth recurrence, Causal Bigram Blending, GPTQ quantization, and EMA.
Architecture
Pass schedule: `[0,1,2, 3,4*,5*, 3,4*,5*, 3,4*,5*, 6,7,8]` — 6 of the 15 passes use MoE routing (marked `*`).
Key Finding: MoE × Looping Quantization Gap
Hypothesis: When MoE layers are reused at multiple depths via looping, experts specialize for different depth positions. The router learns to dispatch differently depending on the input distribution at each depth pass. GPTQ int6 quantization destroys this fine-grained depth-dependent specialization, causing catastrophic quality loss.
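To make the hypothesized mechanism concrete, here is a minimal, self-contained sketch of a top-1 routed MoE block that is re-entered at several depths by the pass schedule above. It is not code from `train_gpt_exp4_moe.py` — names such as `MoEMLP` and `PASS_SCHEDULE` are illustrative — but it shows how each re-entry feeds the shared router a different input distribution, which is what allows depth-dependent expert specialization:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEMLP(nn.Module):
    """Top-1 routed MLP with a small expert count, as in an upcycled dense MLP."""
    def __init__(self, dim: int, num_experts: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                           # x: (tokens, dim)
        gate = F.softmax(self.router(x), dim=-1)    # routing probabilities
        top = gate.argmax(dim=-1)                   # top-1 expert id per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top == i
            if mask.any():
                out[mask] = gate[mask, i].unsqueeze(-1) * expert(x[mask])
        return out

# Pass schedule from this PR: layers 3-5 are re-entered via looping, so the
# MoE blocks at layers 4 and 5 are evaluated at three different depths.
PASS_SCHEDULE = [0, 1, 2, 3, 4, 5, 3, 4, 5, 3, 4, 5, 6, 7, 8]
MOE_LAYERS = {4, 5}

dim = 64
blocks = nn.ModuleList(
    MoEMLP(dim) if i in MOE_LAYERS
    else nn.Sequential(nn.Linear(dim, dim), nn.GELU())
    for i in range(9)
)

x = torch.randn(16, dim)            # a batch of 16 token embeddings
for layer_idx in PASS_SCHEDULE:     # same weights reused at multiple depths
    x = x + blocks[layer_idx](x)    # residual update per pass
```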
Evidence: Without looping, MoE quantizes with a normal gap (0.024 BPB) even with more MoE layers (5 vs. 2), confirming that the interaction between depth recurrence and expert specialization is the root cause.
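For reference, the gap is measured as validation BPB after weight quantization minus BPB at full precision. The sketch below is a simplified stand-in: it uses plain per-output-channel round-to-nearest int6 fake-quantization rather than GPTQ's error-compensated, calibration-driven updates (cf. `GPTQ_CALIB_BATCHES` in the reproduction command), and it assumes a `model(tokens) -> logits` interface:

```python
import math
import torch
import torch.nn.functional as F

def fake_quant_int6(weight: torch.Tensor) -> torch.Tensor:
    """Per-output-channel symmetric int6 round-to-nearest (stand-in for GPTQ)."""
    qmax = 2 ** (6 - 1) - 1                                       # 31 levels per side
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    return (weight / scale).round().clamp(-qmax, qmax) * scale

@torch.no_grad()
def bits_per_byte(model, tokens: torch.Tensor, bytes_per_token: float) -> float:
    """Validation BPB from mean next-token cross-entropy on a held-out stream."""
    logits = model(tokens[:-1].unsqueeze(0)).squeeze(0)           # (T-1, vocab)
    nll = F.cross_entropy(logits, tokens[1:], reduction="mean")   # nats per token
    return nll.item() / (math.log(2) * bytes_per_token)

@torch.no_grad()
def quantization_gap(model, tokens: torch.Tensor, bytes_per_token: float) -> float:
    """BPB(quantized) - BPB(full precision): the quantity reported as the gap."""
    bpb_fp = bits_per_byte(model, tokens, bytes_per_token)
    for mod in model.modules():
        if isinstance(mod, torch.nn.Linear):                      # quantize weights in place
            mod.weight.copy_(fake_quant_int6(mod.weight))
    return bits_per_byte(model, tokens, bytes_per_token) - bpb_fp
```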
Results (1×H200, 5000 steps, 3-seed)
Configuration: 9 layers, MoE on layers 4–5, loop over layers 3–5 (×2), no CaseOps, 5/128 training shards.
Causal Bigram Blending
Same technique as PR #2088 — zero-cost eval-time blending with an online causal bigram prior. Provides ~0.011 BPB improvement.
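A minimal sketch of how this eval-time blend can be computed, assuming per-position next-token logits are already available; the exact formulation in PR #2088 may differ, the function name is illustrative, and the 0.03 default simply mirrors `BIGRAM_BLEND_LAMBDA` from the reproduction command:

```python
import torch

def bigram_blended_nll(logits: torch.Tensor, tokens: torch.Tensor,
                       vocab_size: int, lam: float = 0.03) -> torch.Tensor:
    """Average NLL when model probabilities are blended with an online,
    strictly causal bigram prior at evaluation time (no training involved)."""
    counts = torch.ones(vocab_size, vocab_size)        # add-one smoothed bigram counts
    nll = torch.zeros(())
    for t in range(1, tokens.numel()):
        prev, cur = tokens[t - 1].item(), tokens[t].item()
        p_model = torch.softmax(logits[t - 1], dim=-1)  # logits[t-1] predicts tokens[t]
        p_bigram = counts[prev] / counts[prev].sum()    # prior built from past tokens only
        p = (1.0 - lam) * p_model + lam * p_bigram      # convex blend of the two predictors
        nll = nll - torch.log(p[cur])
        counts[prev, cur] += 1                          # update after scoring, so it stays causal
    return nll / max(tokens.numel() - 1, 1)
```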
Potential Fixes (Future Work)
Reproduction
```bash
# 1×H200, single GPU
BIGRAM_BLEND_ENABLED=1 BIGRAM_BLEND_LAMBDA=0.03 \
NUM_LAYERS=9 ITERATIONS=5000 \
MOE_START=4 MOE_END=5 MOE_NUM_EXPERTS=2 \
ENABLE_MOE_AT=0.30 ENABLE_LOOPING_AT=0.35 \
LOOP_START=3 LOOP_END=5 PARALLEL_START_LAYER=6 \
SEED=42 MIN_LR=0.1 QK_GAIN_INIT=5.25 \
GPTQ_CALIB_BATCHES=16 \
python train_gpt_exp4_moe.py
```

Files
- `train_gpt_exp4_moe.py` — training script with MoE upcycling + Causal Bigram Blending
- `moe_9L_loop_seed42.log`, `moe_9L_seed0.log`, `moe_9L_seed314.log` — 3-seed logs

Built Upon
This work builds on the following PRs from the parameter-golf leaderboard:
- `train_gpt_sota_exp.py` (PR #2088, "Non-record: Causal Bigram Blending — eval-time BPB improvement (1×H20…") — source of the Causal Bigram Blending technique and architectural reference for the MoE exploration.

The MoE upcycling mechanism is inspired by the Switch Transformer (Fedus et al., 2022) and Sparse Upcycling (Komatsuzaki et al., 2023).