Non-record: MoE Upcycling + Depth Recurrence — Quantization Gap Analysis #2102

Open
MaxIv25 wants to merge 2 commits into openai:main from MaxIv25:moe-upcycling-nonrecord

Conversation

MaxIv25 commented on May 1, 2026

Non-record: MoE Upcycling + Depth Recurrence — Quantization Gap Analysis

Summary

This submission explores Mixture-of-Experts (MoE) upcycling combined with depth recurrence (looping) and Causal Bigram Blending. We find that while MoE upcycling and depth recurrence each improve raw BPB, their combination creates a severe quantization gap under GPTQ int6, leaving the post-quantization model significantly worse than the dense baseline.

The script is kept compact at ~1500 lines (per repository guidelines) while including MoE upcycling, depth recurrence, Causal Bigram Blending, GPTQ quantization, and EMA.

Architecture

  • 9-layer, 512-dim transformer (reduced from 11L to fit MoE params in 16 MB)
  • MoE upcycling on layers 4–5: dense MLP → 2-expert top-1 MoE at 30% training progress (sketched after this list)
  • Depth recurrence (loop layers 3–5, 2 loops) activated at 35% training progress
  • Effective forward path: [0,1,2, 3,4*,5*, 3,4*,5*, 3,4*,5*, 6,7,8] — 6 of 15 passes use MoE routing
  • Parallel residuals from layer 6+
  • U-Net skip connections, SmearGate, BigramHash, XSA
  • Causal Bigram Blending (eval-time, λ=0.03)
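
A minimal sketch of the two central mechanisms, upcycling and the looped forward path, assuming simplified names (Top1MoE, forward_blocks) rather than the actual train_gpt_exp4_moe.py API:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """2-expert top-1 MoE built by upcycling a trained dense MLP."""
    def __init__(self, dense_mlp: nn.Module, dim: int, num_experts: int = 2):
        super().__init__()
        # Upcycling: each expert starts as an exact copy of the dense MLP,
        # so the network computes the same function at the moment of conversion.
        self.experts = nn.ModuleList(copy.deepcopy(dense_mlp) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts, bias=False)

    def forward(self, x):  # x: (tokens, dim)
        probs = F.softmax(self.router(x), dim=-1)
        top_p, top_idx = probs.max(dim=-1)  # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Scale by the router probability so the router receives gradients.
                out[mask] = top_p[mask, None] * expert(x[mask])
        return out

def forward_blocks(blocks, x, loop_start=3, loop_end=5, extra_loops=2):
    # Effective path: 0,1,2, 3,4,5, 3,4,5, 3,4,5, 6,7,8 (15 block passes).
    for blk in blocks[:loop_start]:
        x = blk(x)
    for _ in range(1 + extra_loops):
        for blk in blocks[loop_start:loop_end + 1]:
            x = blk(x)
    for blk in blocks[loop_end + 1:]:
        x = blk(x)
    return x
```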

Key Finding: MoE × Looping Quantization Gap

| Config | Raw BPB | Quantized BPB | Quant gap (BPB) |
|---|---|---|---|
| MoE only, no loops (11L, MoE layers 4–8, 1000 steps, no bigram)† | 1.2068 | 1.2304 | 0.024 |
| MoE + Looping (9L, MoE layers 4–5, 5000 steps, seed 42) | 1.1092 | 1.3367 | 0.227 |
| MoE + Looping (9L, MoE layers 4–5, 5000 steps, seed 0) | 1.1089 | 1.3226 | 0.214 |
| MoE + Looping (9L, MoE layers 4–5, 5000 steps, seed 314) | 1.1095 | 1.3688 | 0.259 |
| Dense 11L baseline (SOTA, no MoE) | 1.0789 | 1.0750 | −0.004 |

†Different setup (11L, 5 MoE layers on 4–8, seed 1337, no Bigram Blend, no looping). Included as evidence that the MoE quantization gap is small when depth recurrence is not used.

Hypothesis: When MoE layers are reused at multiple depths via looping, experts specialize for different depth positions. The router learns to dispatch differently depending on the input distribution at each depth pass. GPTQ int6 quantization destroys this fine-grained depth-dependent specialization, causing catastrophic quality loss.

Evidence: Without looping, MoE quantizes with a normal gap (0.024 BPB) even with more MoE layers (5 vs. 2), pointing to the interaction between depth recurrence and expert specialization as the root cause.
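
One way to probe the hypothesis directly would be to log the router's dispatch statistics separately for each loop pass; if experts specialize by depth, the per-pass histograms should diverge. A sketch of such a diagnostic, building on the Top1MoE from the architecture sketch above (the pass_idx bookkeeping is an illustrative assumption, not part of the actual script):

```python
import torch

class InstrumentedTop1MoE(Top1MoE):
    """Records how many tokens each expert receives at each loop pass."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.pass_idx = 0   # set externally before each loop pass
        self.counts = {}    # pass_idx -> tensor of per-expert token counts

    def forward(self, x):
        top_idx = self.router(x).argmax(dim=-1)
        c = self.counts.setdefault(self.pass_idx, torch.zeros(len(self.experts)))
        c += torch.bincount(top_idx, minlength=len(self.experts)).float().cpu()
        return super().forward(x)
```

Setting `moe.pass_idx = p` inside the loop of `forward_blocks` and comparing `counts[0]`, `counts[1]`, and `counts[2]` after an eval epoch would show whether routing shifts across depth passes.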

Results (1×H200, 5000 steps, 3-seed)

9L, MoE layers 4–5, loop 3–5 (×2), no CaseOps, 5/128 training shards.

| Seed | val_bpb (raw) | Quantized BPB | Artifact size |
|---|---|---|---|
| 42 | 1.1092 | 1.3367 | 14.97 MB |
| 0 | 1.1089 | 1.3226 | 15.03 MB |
| 314 | 1.1095 | 1.3688 | 15.06 MB |
| mean | 1.1092 | 1.3427 | 15.02 MB |

Raw BPB is competitive (~1.109 with bigram blend), but quantization degrades it to ~1.34 BPB.

Causal Bigram Blending

Same technique as PR #2088: zero-cost, eval-time blending with an online causal bigram prior. Provides a ~0.011 BPB improvement.
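
The blend has this shape (a minimal sketch; λ = 0.03 matches the run above, while the add-α smoothing and variable names are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def blended_log_probs(logits, prev_token, bigram_counts, lam=0.03, alpha=1.0):
    """Eval-time blend of model probabilities with an online causal bigram prior.
    bigram_counts: (V, V) running counts of prev -> next transitions, updated
    only from tokens already scored, so the prior stays causal."""
    p_model = F.softmax(logits, dim=-1)                            # (V,)
    row = bigram_counts[prev_token]                                # (V,)
    p_bigram = (row + alpha) / (row.sum() + alpha * row.numel())   # smoothed prior
    return torch.log((1.0 - lam) * p_model + lam * p_bigram)

# After scoring position t, record the observed transition:
#   bigram_counts[tokens[t - 1], tokens[t]] += 1
```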

Potential Fixes (Future Work)

  1. Per-depth quantization: Quantize MoE experts separately for each loop pass
  2. Depth-aware GPTQ: Use calibration batches that include all loop depths (see the sketch after this list)
  3. Avoid MoE on looped layers: Place MoE only on non-looped layers (0-2, 6-8)
  4. Higher bit-width for MoE experts: Use int8 for expert weights, int6 for dense layers
  5. Distillation-aware quantization: Fine-tune quantized MoE with knowledge distillation
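
Option 2 is likely the cheapest to prototype: ensure GPTQ's per-layer calibration sees expert inputs from every loop pass, not just the first. A sketch of collecting such activations with forward pre-hooks (the `model.blocks[li].mlp` module path is an assumption; this is not the script's actual GPTQ pipeline):

```python
import torch

@torch.no_grad()
def collect_depth_aware_calibration(model, calib_batches, moe_layer_ids):
    """Run calibration batches through the full looped forward so each MoE
    layer's buffer mixes activations from every loop pass; GPTQ then solves
    its weight update against this depth-mixed calibration set."""
    acts = {li: [] for li in moe_layer_ids}
    hooks = []
    for li in moe_layer_ids:
        layer = model.blocks[li].mlp  # assumed module path
        hooks.append(layer.register_forward_pre_hook(
            lambda mod, inp, li=li: acts[li].append(inp[0].detach().cpu())))
    for batch in calib_batches:       # e.g. GPTQ_CALIB_BATCHES=16
        model(batch)                  # the looped forward fires each hook 3x
    for h in hooks:
        h.remove()
    return {li: torch.cat(a, dim=0) for li, a in acts.items()}
```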

Reproduction

```sh
# 1×H200, single GPU
BIGRAM_BLEND_ENABLED=1 BIGRAM_BLEND_LAMBDA=0.03 \
NUM_LAYERS=9 ITERATIONS=5000 \
MOE_START=4 MOE_END=5 MOE_NUM_EXPERTS=2 \
ENABLE_MOE_AT=0.30 ENABLE_LOOPING_AT=0.35 \
LOOP_START=3 LOOP_END=5 PARALLEL_START_LAYER=6 \
SEED=42 MIN_LR=0.1 QK_GAIN_INIT=5.25 \
GPTQ_CALIB_BATCHES=16 \
python train_gpt_exp4_moe.py
```

Files

  • train_gpt_exp4_moe.py — training script with MoE upcycling + Causal Bigram Blending
  • moe_9L_loop_seed42.log, moe_9L_seed0.log, moe_9L_seed314.log — 3-seed logs

Built Upon

This work builds on prior PRs from the parameter-golf leaderboard, including the Causal Bigram Blending of PR #2088.

The MoE upcycling mechanism is inspired by the Switch Transformer (Fedus et al., 2022) and Sparse Upcycling (Komatsuzaki et al., 2023).
