Non-record: MoE Upcycling + Depth Recurrence — Quantization Gap Analysis #2102

Open
MaxIv25 wants to merge 2 commits into openai:main from MaxIv25:moe-upcycling-nonrecord

Conversation

MaxIv25 commented on May 1, 2026

Non-record: MoE Upcycling + Depth Recurrence — Quantization Gap Analysis

Summary

This submission explores Mixture-of-Experts (MoE) upcycling combined with depth recurrence (looping) and Causal Bigram Blending. We find that while MoE upcycling and depth recurrence each improve raw BPB, their combination creates a severe quantization gap under GPTQ int6, leaving the post-quantization model significantly worse than the dense baseline.

The script is kept compact at ~1500 lines (per repository guidelines) while including MoE upcycling, depth recurrence, Causal Bigram Blending, GPTQ quantization, and EMA.

Architecture

  • 9-layer, 512-dim transformer (reduced from 11L to fit MoE params in 16 MB)
  • MoE upcycling on layers 4–5: dense MLP → 2-expert top-1 MoE at 30% training progress (sketched after this list)
  • Depth recurrence (loop layers 3–5, 2 loops) activated at 35% training progress
  • Effective forward path: [0,1,2, 3,4*,5*, 3,4*,5*, 3,4*,5*, 6,7,8] — 6 of 15 passes use MoE routing
  • Parallel residuals from layer 6+
  • U-Net skip connections, SmearGate, BigramHash, XSA
  • Causal Bigram Blending (eval-time, λ=0.03)
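
A minimal sketch of the two central mechanisms, upcycling and the looped forward path, assuming simplified names (Top1MoE, forward_blocks) rather than the actual train_gpt_exp4_moe.py API:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """2-expert top-1 MoE built by upcycling a trained dense MLP."""
    def __init__(self, dense_mlp: nn.Module, dim: int, num_experts: int = 2):
        super().__init__()
        # Upcycling: each expert starts as an exact copy of the dense MLP,
        # so the network computes the same function at the moment of conversion.
        self.experts = nn.ModuleList(copy.deepcopy(dense_mlp) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts, bias=False)

    def forward(self, x):  # x: (tokens, dim)
        probs = F.softmax(self.router(x), dim=-1)
        top_p, top_idx = probs.max(dim=-1)  # top-1 routing
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Scale by the router probability so the router receives gradients.
                out[mask] = top_p[mask, None] * expert(x[mask])
        return out

def forward_blocks(blocks, x, loop_start=3, loop_end=5, extra_loops=2):
    # Effective path: 0,1,2, 3,4,5, 3,4,5, 3,4,5, 6,7,8 (15 block passes).
    for blk in blocks[:loop_start]:
        x = blk(x)
    for _ in range(1 + extra_loops):
        for blk in blocks[loop_start:loop_end + 1]:
            x = blk(x)
    for blk in blocks[loop_end + 1:]:
        x = blk(x)
    return x
```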

Key Finding: MoE × Looping Quantization Gap

| Config | Raw BPB | Quantized BPB | Quant gap (BPB) |
|---|---|---|---|
| MoE only, no loops (11L, MoE layers 4–8, 1000 steps, no bigram)† | 1.2068 | 1.2304 | 0.024 |
| MoE + Looping (9L, MoE layers 4–5, 5000 steps, seed 42) | 1.1092 | 1.3367 | 0.227 |
| MoE + Looping (9L, MoE layers 4–5, 5000 steps, seed 0) | 1.1089 | 1.3226 | 0.214 |
| MoE + Looping (9L, MoE layers 4–5, 5000 steps, seed 314) | 1.1095 | 1.3688 | 0.259 |
| Dense 11L baseline (SOTA, no MoE) | 1.0789 | 1.0750 | −0.004 |

†Different setup (11L, 5 MoE layers on 4–8, seed 1337, no Bigram Blend, no looping). Included as evidence that the MoE quantization gap is small when depth recurrence is not used.

Hypothesis: When MoE layers are reused at multiple depths via looping, experts specialize for different depth positions. The router learns to dispatch differently depending on the input distribution at each depth pass. GPTQ int6 quantization destroys this fine-grained depth-dependent specialization, causing catastrophic quality loss.

Evidence: Without looping, MoE quantizes with a normal gap (0.024 BPB) even with more MoE layers (5 vs. 2), pointing to the interaction between depth recurrence and expert specialization as the root cause.
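
One way to probe the hypothesis directly would be to log the router's dispatch statistics separately for each loop pass; if experts specialize by depth, the per-pass histograms should diverge. A sketch of such a diagnostic, building on the Top1MoE from the architecture sketch above (the pass_idx bookkeeping is an illustrative assumption, not part of the actual script):

```python
import torch

class InstrumentedTop1MoE(Top1MoE):
    """Records how many tokens each expert receives at each loop pass."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.pass_idx = 0   # set externally before each loop pass
        self.counts = {}    # pass_idx -> tensor of per-expert token counts

    def forward(self, x):
        top_idx = self.router(x).argmax(dim=-1)
        c = self.counts.setdefault(self.pass_idx, torch.zeros(len(self.experts)))
        c += torch.bincount(top_idx, minlength=len(self.experts)).float().cpu()
        return super().forward(x)
```

Setting `moe.pass_idx = p` inside the loop of `forward_blocks` and comparing `counts[0]`, `counts[1]`, and `counts[2]` after an eval epoch would show whether routing shifts across depth passes.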

Results (1×H200, 5000 steps, 3-seed)

9L, MoE layers 4–5, loop 3–5 (×2), no CaseOps, 5/128 training shards.

| Seed | val_bpb (raw) | Quantized BPB | Artifact size |
|---|---|---|---|
| 42 | 1.1092 | 1.3367 | 14.97 MB |
| 0 | 1.1089 | 1.3226 | 15.03 MB |
| 314 | 1.1095 | 1.3688 | 15.06 MB |
| mean | 1.1092 | 1.3427 | 15.02 MB |

Raw BPB is competitive (~1.109 with bigram blend), but quantization degrades it to ~1.34 BPB.

Causal Bigram Blending

Same technique as PR #2088: zero-cost, eval-time blending with an online causal bigram prior. Provides a ~0.011 BPB improvement.
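
The blend has this shape (a minimal sketch; λ = 0.03 matches the run above, while the add-α smoothing and variable names are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def blended_log_probs(logits, prev_token, bigram_counts, lam=0.03, alpha=1.0):
    """Eval-time blend of model probabilities with an online causal bigram prior.
    bigram_counts: (V, V) running counts of prev -> next transitions, updated
    only from tokens already scored, so the prior stays causal."""
    p_model = F.softmax(logits, dim=-1)                            # (V,)
    row = bigram_counts[prev_token]                                # (V,)
    p_bigram = (row + alpha) / (row.sum() + alpha * row.numel())   # smoothed prior
    return torch.log((1.0 - lam) * p_model + lam * p_bigram)

# After scoring position t, record the observed transition:
#   bigram_counts[tokens[t - 1], tokens[t]] += 1
```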

Potential Fixes (Future Work)

  1. Per-depth quantization: Quantize MoE experts separately for each loop pass
  2. Depth-aware GPTQ: Use calibration batches that include all loop depths (see the sketch after this list)
  3. Avoid MoE on looped layers: Place MoE only on non-looped layers (0-2, 6-8)
  4. Higher bit-width for MoE experts: Use int8 for expert weights, int6 for dense layers
  5. Distillation-aware quantization: Fine-tune quantized MoE with knowledge distillation
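
Option 2 is likely the cheapest to prototype: ensure GPTQ's per-layer calibration sees expert inputs from every loop pass, not just the first. A sketch of collecting such activations with forward pre-hooks (the `model.blocks[li].mlp` module path is an assumption; this is not the script's actual GPTQ pipeline):

```python
import torch

@torch.no_grad()
def collect_depth_aware_calibration(model, calib_batches, moe_layer_ids):
    """Run calibration batches through the full looped forward so each MoE
    layer's buffer mixes activations from every loop pass; GPTQ then solves
    its weight update against this depth-mixed calibration set."""
    acts = {li: [] for li in moe_layer_ids}
    hooks = []
    for li in moe_layer_ids:
        layer = model.blocks[li].mlp  # assumed module path
        hooks.append(layer.register_forward_pre_hook(
            lambda mod, inp, li=li: acts[li].append(inp[0].detach().cpu())))
    for batch in calib_batches:       # e.g. GPTQ_CALIB_BATCHES=16
        model(batch)                  # the looped forward fires each hook 3x
    for h in hooks:
        h.remove()
    return {li: torch.cat(a, dim=0) for li, a in acts.items()}
```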

Reproduction

```sh
# 1×H200, single GPU
BIGRAM_BLEND_ENABLED=1 BIGRAM_BLEND_LAMBDA=0.03 \
NUM_LAYERS=9 ITERATIONS=5000 \
MOE_START=4 MOE_END=5 MOE_NUM_EXPERTS=2 \
ENABLE_MOE_AT=0.30 ENABLE_LOOPING_AT=0.35 \
LOOP_START=3 LOOP_END=5 PARALLEL_START_LAYER=6 \
SEED=42 MIN_LR=0.1 QK_GAIN_INIT=5.25 \
GPTQ_CALIB_BATCHES=16 \
python train_gpt_exp4_moe.py
```

Files

  • train_gpt_exp4_moe.py — training script with MoE upcycling + Causal Bigram Blending
  • moe_9L_loop_seed42.log, moe_9L_seed0.log, moe_9L_seed314.log — 3-seed logs

Built Upon

This work builds on prior PRs from the parameter-golf leaderboard, including the Causal Bigram Blending of PR #2088.

The MoE upcycling mechanism is inspired by the Switch Transformer (Fedus et al., 2022) and Sparse Upcycling (Komatsuzaki et al., 2023).
