The first cross-architecture distillation framework for diffusion LLMs: 8B dense and 16B MoE teachers into a 0.6B student
Gongbo Zhang¹ · Wen Wang² · Ye Tian¹ · Li Yuan¹,*
¹ Peking University · ² Zhejiang University (* corresponding author)
- +1.53 average gain over the non-distilled BD3LM baseline across 8 benchmarks (34.20 vs. 32.67).
- +16.48 on HumanEval over the equivalent-size AR baseline (48.78 vs. 32.30): distilled dLLMs especially excel at code generation.
- 22× peak-memory reduction vs. the 16B MoE LLaDA2 teacher (1.4 GB vs. 31.3 GB) and 5.2× faster inference (6.25 s vs. 32.55 s for 256 tokens on H100), enabling commodity-hardware deployment.
All numbers are as reported in the paper; see arxiv.org/abs/2604.26951 for the full setup and ablations.
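The ratios and gains in the bullets above follow directly from the raw figures they quote. A quick arithmetic check, using only numbers stated in this README (the variable names are illustrative):

```python
# Raw figures quoted in the highlights above.
teacher_mem_gb, student_mem_gb = 31.3, 1.4
teacher_latency_s, student_latency_s = 32.55, 6.25

memory_reduction = teacher_mem_gb / student_mem_gb   # peak-memory ratio
speedup = teacher_latency_s / student_latency_s      # latency ratio for 256 tokens
avg_gain = 34.20 - 32.67                             # distilled vs. BD3LM average
humaneval_gain = 48.78 - 32.30                       # distilled vs. AR on HumanEval

print(f"memory reduction: {memory_reduction:.1f}x")
print(f"speedup:          {speedup:.1f}x")
print(f"average gain:     {avg_gain:+.2f}")
print(f"HumanEval gain:   {humaneval_gain:+.2f}")
```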
| Component | Paper | Role | One-line description |
|---|---|---|---|
| TIDAL | §2.1 | Scheduling: when to learn | Dual-axis interpolation along both the training-progress and diffusion-timestep axes; down-weights the teacher at high masking ratios, where it is unreliable. Generalizes prior single-axis interpolation to the diffusion setting. |
| CompDemo | §2.2 | Contextual: what to enrich | Two-pass teacher inference with complementary mask splits; every masked position sees ~50% revealed context, raising teacher signal quality at high noise. |
| Reverse CALM | §2.3 | Output: how to project | Reverse-direction chunk-level binary cross-entropy for cross-tokenizer matching. Bounded gradient coefficient (depends only on the fixed teacher) and dual-end noise filtering; equivalent to a Bernoulli-KL mode-seeking objective. |
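To make the CompDemo row concrete, here is a minimal sketch of how complementary mask splits could be built. It is an illustration under assumptions, not the repository's implementation: the helper names (`complementary_splits`, `two_pass_inputs`), the `[MASK]` placeholder and the random splitting rule are all hypothetical. Only the two-pass idea and the ~50% split come from the table above.

```python
import random

def complementary_splits(masked_positions, demo_ratio=0.5, seed=0):
    """Randomly split the masked positions into two disjoint, complementary sets."""
    shuffled = list(masked_positions)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * demo_ratio)
    return set(shuffled[:cut]), set(shuffled[cut:])

def two_pass_inputs(tokens, masked_positions, mask_token="[MASK]",
                    demo_ratio=0.5, seed=0):
    """Build both teacher inputs as (input_tokens, positions_scored_in_this_pass)."""
    split_a, split_b = complementary_splits(masked_positions, demo_ratio, seed)
    # Pass 1 keeps split_a masked and reveals split_b as extra context.
    pass_1 = [mask_token if i in split_a else tok for i, tok in enumerate(tokens)]
    # Pass 2 swaps the roles, so every masked position is scored exactly once.
    pass_2 = [mask_token if i in split_b else tok for i, tok in enumerate(tokens)]
    return (pass_1, split_a), (pass_2, split_b)

# Toy ground-truth sequence; positions 1, 3, 4 and 6 are masked for the student.
tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]
masked = [1, 3, 4, 6]
(pass_1, scored_1), (pass_2, scored_2) = two_pass_inputs(tokens, masked)
print(pass_1, sorted(scored_1))
print(pass_2, sorted(scored_2))
```

The teacher predictions from the two passes would then be merged, each masked position taking its prediction from the pass in which it was still masked.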
Headline finding (§3.2): each pipeline favors its native strategy.
- Cross-Tokenizer (LLaDA2 → BD3LM): native = TIDE-Cross = Reverse CALM. Bounded-gradient mode-seeking tolerates the alignment noise from chunk-level cross-tokenizer matching. Beats the swapped TIDE-Shared by +0.37 on average.
- Shared-Tokenizer (WeDLM → BD3LM): native = TIDE-Shared = TIDAL + CompDemo (over forward KL). Progressive scheduling and enriched signals work best when token-level alignment is exact. Beats the swapped TIDE-Cross by +2.76 on average.
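The "bounded-gradient" property of Reverse CALM can be illustrated with a small sketch. It assumes one plausible reading of "reverse-direction chunk-level binary cross-entropy": the cross-entropy of Bernoulli(p_student) against Bernoulli(p_teacher), where each p is the probability a model assigns to a whole aligned text chunk. The function names and the clamping constant `eps` are assumptions made for illustration; §2.3 of the paper defines the actual loss.

```python
import math

def chunk_probability(token_logprobs):
    """Probability of a chunk = product of the probabilities of its tokens."""
    return math.exp(sum(token_logprobs))

def reverse_chunk_bce(p_student, p_teacher, eps=1e-6):
    """Assumed form: cross-entropy of Bernoulli(p_student) against Bernoulli(p_teacher)."""
    p_t = min(max(p_teacher, eps), 1.0 - eps)  # clamp the fixed teacher probability
    return -(p_student * math.log(p_t) + (1.0 - p_student) * math.log(1.0 - p_t))

def gradient_coefficient(p_teacher, eps=1e-6):
    """d(loss)/d(p_student) = -logit(p_teacher): no student term, bounded by the clamp."""
    p_t = min(max(p_teacher, eps), 1.0 - eps)
    return -(math.log(p_t) - math.log(1.0 - p_t))

# The two tokenizers split the same text span differently (2 vs. 3 tokens),
# but both models still assign a probability to the span as a whole.
p_teacher = chunk_probability([-0.1, -0.2])
p_student = chunk_probability([-0.3, -0.1, -0.2])
loss = reverse_chunk_bce(p_student, p_teacher)
print(f"loss={loss:.4f}  dloss/dp_student={gradient_coefficient(p_teacher):.4f}")
```

Under this reading the loss is linear in the student's chunk probability, so a noisy chunk alignment cannot blow up the gradient: its magnitude is capped by the clamped teacher logit.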
| Pipeline | Teacher | Student | Tokenizer | Native strategy | Paper avg |
|---|---|---|---|---|---|
| A: Cross-Tokenizer | LLaDA2.0-mini (16B MoE) | Qwen3-0.6B-BD3LM | Cross (chunk align via tokenkit) | TIDE-Cross = Reverse CALM | 34.20 |
| B: Shared-Tokenizer | WeDLM-8B-Instruct (8B dense) | Qwen3-0.6B-BD3LM | Shared (vocab 151646) | TIDE-Shared = TIDAL + CompDemo | 33.55 |
Main results across eight benchmarks. All distillation methods include a cross-entropy loss term. Bold: best among dLLM models; italic: second best.
| Benchmark | Qwen3-0.6B (AR) | Qwen3-0.6B (BD3LM) | Shared-Tok: KL | Shared-Tok: TIDE-Cross | Shared-Tok: TIDE-Shared | Cross-Tok: CALM | Cross-Tok: TIDE-Shared | Cross-Tok: TIDE-Cross |
|---|---|---|---|---|---|---|---|---|
| GSM8K | 59.60 | 45.56 | 43.97 | 45.03 | 48.98 | 48.60 | *49.89* | **52.24** |
| MATH | 32.40 | 13.08 | 9.40 | 9.76 | 11.16 | *13.14* | 12.98 | **13.20** |
| BBH | 41.50 | 26.32 | 25.79 | 26.00 | 26.79 | 24.21 | *26.85* | **27.37** |
| MMLU-Pro | 24.70 | 13.80 | 13.19 | 12.88 | *14.48* | 13.47 | 14.02 | **14.52** |
| HellaSwag | 47.40 | 39.28 | 39.78 | 39.50 | **40.50** | *40.42* | 39.57 | 39.88 |
| MMLU | 52.80 | 39.15 | 39.57 | 39.09 | **39.92** | 39.42 | 39.54 | *39.59* |
| HumanEval | 32.30 | 46.34 | 41.46 | 42.68 | *48.78* | 43.90 | **49.39** | 48.17 |
| MBPP | 36.60 | 37.80 | 31.20 | 31.40 | 37.80 | 34.80 | *38.40* | **38.60** |
| Avg | 40.91 | 32.67 | 30.55 | 30.79 | 33.55 | 32.25 | *33.83* | **34.20** |
See the paper (§3.2) at arxiv.org/abs/2604.26951 for the full discussion.
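The Avg row and the "+0.37" and "+2.76" margins quoted earlier can be recomputed from the per-benchmark scores. The sketch below copies a subset of columns from the results table above; the dictionary keys are just illustrative labels.

```python
# Scores copied from the results table, in row order:
# GSM8K, MATH, BBH, MMLU-Pro, HellaSwag, MMLU, HumanEval, MBPP.
scores = {
    "BD3LM (no distillation)": [45.56, 13.08, 26.32, 13.80, 39.28, 39.15, 46.34, 37.80],
    "Shared-Tok TIDE-Shared":  [48.98, 11.16, 26.79, 14.48, 40.50, 39.92, 48.78, 37.80],
    "Shared-Tok TIDE-Cross":   [45.03, 9.76, 26.00, 12.88, 39.50, 39.09, 42.68, 31.40],
    "Cross-Tok TIDE-Shared":   [49.89, 12.98, 26.85, 14.02, 39.57, 39.54, 49.39, 38.40],
    "Cross-Tok TIDE-Cross":    [52.24, 13.20, 27.37, 14.52, 39.88, 39.59, 48.17, 38.60],
}
# Averages as printed in the Avg row.
reported_avg = {
    "BD3LM (no distillation)": 32.67,
    "Shared-Tok TIDE-Shared": 33.55,
    "Shared-Tok TIDE-Cross": 30.79,
    "Cross-Tok TIDE-Shared": 33.83,
    "Cross-Tok TIDE-Cross": 34.20,
}

averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}
for name, mean in averages.items():
    print(f"{name:26s} computed={mean:.2f} reported={reported_avg[name]:.2f}")

# Native strategy minus swapped strategy, per pipeline.
cross_margin = averages["Cross-Tok TIDE-Cross"] - averages["Cross-Tok TIDE-Shared"]
shared_margin = averages["Shared-Tok TIDE-Shared"] - averages["Shared-Tok TIDE-Cross"]
print(f"cross-tokenizer margin:  {cross_margin:+.2f}")
print(f"shared-tokenizer margin: {shared_margin:+.2f}")
```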
This is the only place in the README where the legacy CLI strings alm / taid appear, because the --distill_mode flag values include them.
| Paper variant | Pipeline | Command | Notes |
|---|---|---|---|
| CALM (baseline, Cross-Tok) | A | `distill_llada2.sh --distill_mode alm` | - |
| TIDE-Cross (native, Cross-Tok) | A | `distill_llada2.sh --distill_mode reverse_alm` | - |
| TIDE-Shared (in Cross-Tok pipeline) | A | `distill_llada2.sh --distill_mode alm_taid --use_comp_demo True` | TIDAL + CompDemo |
| KL (baseline, Shared-Tok) | B | `distill_wedlm.sh --distill_mode kl_aligned` | - |
| TIDE-Shared (native, Shared-Tok) | B | `distill_wedlm.sh --distill_mode taid_aligned --use_comp_demo True` | TIDAL + CompDemo |
| TIDE-Cross (in Shared-Tok pipeline) | B | `distill_wedlm.sh --distill_mode reverse_kl_aligned` | - |
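When launching several variants from a Python driver, the mapping above can be encoded as a lookup table. This is a hypothetical convenience wrapper, not part of the repository: `VARIANTS` and `build_command` are illustrative names, and the `scripts/` prefix, `--data_path` and `--num_gpus` flags are taken from the training commands shown elsewhere in this README.

```python
import shlex

# (pipeline, paper variant) -> (script, mode flags), copied from the table above.
VARIANTS = {
    ("A", "CALM"): ("distill_llada2.sh", ["--distill_mode", "alm"]),
    ("A", "TIDE-Cross"): ("distill_llada2.sh", ["--distill_mode", "reverse_alm"]),
    ("A", "TIDE-Shared"): ("distill_llada2.sh",
                           ["--distill_mode", "alm_taid", "--use_comp_demo", "True"]),
    ("B", "KL"): ("distill_wedlm.sh", ["--distill_mode", "kl_aligned"]),
    ("B", "TIDE-Shared"): ("distill_wedlm.sh",
                           ["--distill_mode", "taid_aligned", "--use_comp_demo", "True"]),
    ("B", "TIDE-Cross"): ("distill_wedlm.sh", ["--distill_mode", "reverse_kl_aligned"]),
}

def build_command(pipeline, variant, data_path, num_gpus=8):
    """Return the argv list for one paper variant; raises KeyError if it is unknown."""
    script, mode_flags = VARIANTS[(pipeline, variant)]
    return ["bash", f"scripts/{script}", "--data_path", data_path,
            *mode_flags, "--num_gpus", str(num_gpus)]

cmd = build_command("B", "TIDE-Shared", "data/distill_wedlm_preprocessed")
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to launch
```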
💡 Note on combinations. TIDAL is applied only to forward objectives. As discussed in the paper's gradient-analysis appendix, combining TIDAL with reverse objectives is counterproductive: the late-training $(1-\lambda_t)$ factor suppresses the self-selection mechanism of Reverse CALM.
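For intuition about the interpolation weight $\lambda_t$, the sketch below shows one way a dual-axis weight could be formed and used to build an interpolated target. Everything specific here is an assumption for illustration: the linear training schedule, the `1 - mask_ratio` timestep factor, the start and end values, and mixing in probability space. The paper (§2.1) and `taid.py` define the real schedule.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dual_axis_lambda(progress, mask_ratio, lam_start=0.1, lam_end=1.0):
    """Assumed weight: grows with training progress, shrinks at high masking ratios."""
    progress = min(max(progress, 0.0), 1.0)
    lam_train = (1.0 - progress) * lam_start + progress * lam_end  # training axis
    timestep_factor = 1.0 - mask_ratio                             # timestep axis
    return lam_train * timestep_factor

def interpolated_target(student_logits, teacher_logits, lam):
    """Mix student and teacher distributions: (1 - lam) * student + lam * teacher."""
    p_student, p_teacher = softmax(student_logits), softmax(teacher_logits)
    return [(1.0 - lam) * s + lam * t for s, t in zip(p_student, p_teacher)]

student_logits = [2.0, 0.5, -1.0]
teacher_logits = [0.2, 2.5, -0.5]
lam = dual_axis_lambda(progress=0.5, mask_ratio=0.8)  # mid-training, heavy masking
target = interpolated_target(student_logits, teacher_logits, lam)
print(f"lambda={lam:.3f}", [round(p, 3) for p in target])
```

With this form, the student's own (detached) distribution carries weight $1-\lambda_t$ in the target, so the teacher's influence is small early in training and at heavily masked timesteps.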

    # Create environment
    conda create -n dllm python=3.10 -y && conda activate dllm
    # Install PyTorch (CUDA 12.4)
    conda install cuda=12.4 -c nvidia
    pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
      --index-url https://download.pytorch.org/whl/cu124
    # Install dllm
    pip install -e .
    # Initialize submodules (lm-evaluation-harness + tokenkit)
    git submodule update --init --recursive
    # Install eval harness
    pip install -e "lm-evaluation-harness[ifeval,math]"
    # Install tokenkit (required for Pipeline A cross-tokenizer distillation)
    pip install -e "tokenkit[full]"

Six distilled student checkpoints (3 per pipeline) are released under 🤗 TIDE-dllm Models, and two preprocessed SFT datasets are released under 🤗 TIDE-dllm Datasets.
| Pipeline | Variant | 🤗 Repo |
|---|---|---|
| A: Cross-Tokenizer (LLaDA2 teacher) | TIDE-Cross (native) | distill-LLaDA2-TIDE_Cross |
| A: Cross-Tokenizer (LLaDA2 teacher) | TIDE-Shared variant | distill-LLaDA2-TIDE_Shared |
| A: Cross-Tokenizer (LLaDA2 teacher) | CALM baseline | distill-LLaDA2-CALM |
| B: Shared-Tokenizer (WeDLM teacher) | TIDE-Shared (native) | distill-WeDLM-TIDE_Shared |
| B: Shared-Tokenizer (WeDLM teacher) | TIDE-Cross variant | distill-WeDLM-TIDE_Cross |
| B: Shared-Tokenizer (WeDLM teacher) | KL baseline | distill-WeDLM-KL |
Both datasets share the same composition as dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1 (tulu-3-sft-mixture + smoltalk + opc-sft-stage1 + opc-sft-stage2), but they are tokenized for each teacher in advance to avoid NCCL timeouts during distillation.
| Pipeline | 🤗 Repo |
|---|---|
| A: for the LLaDA2 teacher | distill_llada2_sft |
| B: for the WeDLM teacher | distill_wedlm_sft |

    pip install "huggingface_hub[cli]"
    # Distilled checkpoint (example: native TIDE-Cross from Pipeline A)
    huggingface-cli download TIDE-dllm/distill-LLaDA2-TIDE_Cross \
      --local-dir ckpts/distill-LLaDA2-TIDE_Cross
    # Preprocessed datasets
    huggingface-cli download TIDE-dllm/distill_llada2_sft \
      --repo-type dataset --local-dir data/distill_llada2_sft
    huggingface-cli download TIDE-dllm/distill_wedlm_sft \
      --repo-type dataset --local-dir data/distill_wedlm_sft

Project page: pku-yuangroup.github.io/TIDE-Page.
Distillation requires offline-preprocessed data to avoid NCCL timeouts during tokenization. The fastest path is to download our preprocessed datasets from TIDE-dllm (see 📦 Released Models & Data above):

    huggingface-cli download TIDE-dllm/distill_llada2_sft \
      --repo-type dataset --local-dir data/distill_llada2_preprocessed
    huggingface-cli download TIDE-dllm/distill_wedlm_sft \
      --repo-type dataset --local-dir data/distill_wedlm_preprocessed

If you'd rather preprocess from scratch, the examples below use `tatsu-lab/alpaca` for a quick smoke test. To reproduce the paper, replace the `--dataset` value with:

    allenai/tulu-3-sft-mixture+HuggingFaceTB/smoltalk+OpenCoder-LLM/opc-sft-stage1[lang:python]+OpenCoder-LLM/opc-sft-stage2[lang:python]
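The `--dataset` value above packs four sources into one string: entries are joined with `+`, and an optional `[key:value]` suffix carries a filter such as `lang:python`. The sketch below shows how such a string can be read. It is an illustrative parser written for this README, not the one used by the preprocessing scripts, whose exact grammar may differ; `parse_dataset_spec` is a hypothetical name.

```python
import re

SPEC = ("allenai/tulu-3-sft-mixture+HuggingFaceTB/smoltalk"
        "+OpenCoder-LLM/opc-sft-stage1[lang:python]"
        "+OpenCoder-LLM/opc-sft-stage2[lang:python]")

# A dataset name, optionally followed by one bracketed option group.
ENTRY = re.compile(r"([^\[\]]+)(?:\[([^\[\]]*)\])?")

def parse_dataset_spec(spec):
    """Split 'a+b[k:v]' into [(name, {filter_key: filter_value}), ...]."""
    entries = []
    for part in spec.split("+"):
        match = ENTRY.fullmatch(part)
        if match is None:
            raise ValueError(f"cannot parse dataset entry: {part!r}")
        name, options = match.group(1), match.group(2)
        filters = {}
        for item in (options or "").split(","):
            if item:
                key, _, value = item.partition(":")
                filters[key.strip()] = value.strip()
        entries.append((name, filters))
    return entries

entries = parse_dataset_spec(SPEC)
for name, filters in entries:
    print(name, filters)
```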
Pipeline A (LLaDA2, cross-tokenizer):

    bash scripts/preprocess_llada2_data.sh \
      --dataset tatsu-lab/alpaca \
      --output_dir data/distill_llada2_preprocessed

Pipeline B (WeDLM, same-tokenizer):

    bash scripts/preprocess_wedlm_data.sh \
      --dataset tatsu-lab/alpaca \
      --output_dir data/distill_wedlm_preprocessed

The recommended command for each pipeline runs the native strategy (paper-best per §3.2).
Pipeline A: LLaDA2 teacher, TIDE-Cross (Reverse CALM):

    bash scripts/distill_llada2.sh \
      --data_path data/distill_llada2_preprocessed \
      --distill_mode reverse_alm \
      --num_gpus 8

Pipeline B: WeDLM teacher, TIDE-Shared (TIDAL + CompDemo):

    bash scripts/distill_wedlm.sh \
      --data_path data/distill_wedlm_preprocessed \
      --distill_mode taid_aligned \
      --use_comp_demo True \
      --num_gpus 8

All training script parameters
Both `distill_llada2.sh` and `distill_wedlm.sh` support:
| Parameter | Default | Description |
|---|---|---|
| `--data_path` | required | Preprocessed data directory or HF dataset name |
| `--output_dir` | `output/distill_*` | Checkpoint output directory |
| `--num_gpus` | `8` | Number of GPUs |
| `--distill_mode` | `alm` / `taid_aligned` | Distillation mode (see the Paper Variants → Code Modes table above) |
| `--use_comp_demo` | `False` | Enable CompDemo (complementary demonstration) |
| `--epochs` | `2` / `3` | Number of training epochs |
| `--lr` | `5e-5` | Learning rate |
| `--batch_size` | `8` / `10` | Per-device batch size |
| `--student_model` | `dllm-collection/Qwen3-0.6B-diffusion-bd3lm-v0.1` | Student model |
| `--teacher_model` | `inclusionAI/LLaDA2.0-mini` / `tencent/WeDLM-8B-Instruct` | Teacher model |
WeDLM-specific (TIDAL controls):
| Parameter | Default | Description |
|---|---|---|
| `--taid_axis_mode` | `both` | TIDAL axis: `both`, `training_only`, `timestep_only` |
| `--taid_timestep_weight` | `midrange` | Timestep weighting: `uniform`, `midrange` |
| `--shared_vocab_size` | `151646` | Shared vocabulary size |
| `--teacher_mask_token_id` | `151665` | Teacher mask token ID |
Run all 8 benchmarks on a trained checkpoint:

    bash scripts/eval_all.sh --model_path /path/to/checkpoint --num_gpus 8

Benchmarks: `mmlu_generative_dream`, `mmlu_pro`, `hellaswag_gen`, `gsm8k_cot`, `bbh`, `minerva_math`, `humaneval_instruct`, `mbpp_instruct`.

Evaluation protocol: block size 32, CFG scale 0.0, sampling steps from 3 (HellaSwag/MMLU) up to 256 (everything else). Results are saved to `eval_results/` by default (override with `--output_dir`).
Training settings used for the paper experiments.
| Parameter | Cross-Tokenizer (Pipeline A) | Shared-Tokenizer (Pipeline B) |
|---|---|---|
| Teacher | LLaDA2.0-mini (16B MoE) | WeDLM-8B-Instruct (8B) |
| Student init | Qwen3-0.6B-BD3LM SFT v0.1 | Qwen3-0.6B-BD3LM SFT v0.1 |
| Native method | Reverse CALM | TIDAL + CompDemo |
| Learning rate | 5e-5 | 5e-5 |
| Epochs | 10 | 10 |
| Student / teacher seq length | 512 / 1024 | 512 / 768 |
| Block size | 32 | 32 |
| Precision | bfloat16 | bfloat16 |
| TIDAL | - | |
| CompDemo demo_ratio | - | 0.5 |
| Temperature | - | 2.0 |
| Dataset | Tulu-3 SFT + SmolTalk + OpenCoder-SFT-1/2 (Python) | (same) |
ValueError: Sequence length N exceeds pad_to_length M during training
For `*_aligned` modes (Pipeline B), the preprocessing script does not truncate samples to `--max_length`; it only filters out samples whose prompt alone exceeds it. The training `--max_length` (and `--teacher_max_length`) must therefore be at least as large as the value used during preprocessing. The simplest rule: pass the same `--max_length` to both `preprocess_wedlm_data.sh` and `distill_wedlm.sh`.
Pipeline B taid_aligned requires aligned preprocessed data
The default --align_mode of preprocess_wedlm_data.sh is kl_aligned, which produces the dual-tokenizer fields (teacher_input_ids, align_student, align_teacher) needed by *_aligned training modes. If you preprocessed with --align_mode none, training in any *_aligned mode will crash with KeyError: 'teacher_input_ids'. Re-run preprocessing without overriding --align_mode.
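Both failure modes above can be caught before a multi-GPU job is launched. The following is a hypothetical pre-flight helper, not part of the repository: `preflight` is an illustrative name and a sample is assumed to be a plain dict (one record of the preprocessed dataset). Only the field names and the length rule come from the text above.

```python
# Fields that *_aligned training modes expect, per the note above.
REQUIRED_ALIGNED_FIELDS = ("teacher_input_ids", "align_student", "align_teacher")

def preflight(sample, distill_mode, preprocess_max_length,
              train_max_length, teacher_max_length):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not distill_mode.endswith("_aligned"):
        return problems  # the checks below only concern *_aligned modes
    missing = [name for name in REQUIRED_ALIGNED_FIELDS if name not in sample]
    if missing:
        problems.append(f"missing fields {missing}: re-run preprocessing "
                        "without overriding --align_mode")
    if train_max_length < preprocess_max_length:
        problems.append("--max_length is smaller than the preprocessing value")
    if teacher_max_length < preprocess_max_length:
        problems.append("--teacher_max_length is smaller than the preprocessing value")
    return problems

# Toy records standing in for one preprocessed sample each.
good = {"input_ids": [1, 2], "teacher_input_ids": [3, 4],
        "align_student": [0, 1], "align_teacher": [0, 1]}
bad = {"input_ids": [1, 2]}
print(preflight(good, "taid_aligned", 512, 512, 768))
print(preflight(bad, "taid_aligned", 1024, 512, 768))
```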

    dllm/core/trainers/
    ├── distill_bd3lm.py      # DistillBD3LMTrainer: all distillation modes (TIDAL, CompDemo, CALM, Reverse CALM, plus baselines)
    ├── distill_collator.py   # DistillCollator: chunk-level CALM alignment via tokenkit (paper §2.3)
    ├── bd3lm.py              # BD3LMTrainer (base block diffusion trainer)
    ├── mdlm.py               # MDLMTrainer (base masked diffusion trainer)
    └── losses/
        └── taid.py           # TIDAL loss implementation (paper §2.1)
    examples/a2d/bd3lm/
    ├── distill.py                        # Pipeline A entry: LLaDA2 cross-tokenizer distillation
    ├── distill_wedlm.py                  # Pipeline B entry: WeDLM same-tokenizer distillation
    ├── distill_utils.py                  # Shared utilities (alignment, tokenization)
    ├── preprocess_distill_data.py        # Data preprocessing for Pipeline A
    └── preprocess_distill_wedlm_data.py  # Data preprocessing for Pipeline B
    scripts/
    ├── distill_llada2.sh          # One-click training: Pipeline A
    ├── distill_wedlm.sh           # One-click training: Pipeline B
    ├── eval_all.sh                # One-click evaluation (8 benchmarks)
    ├── preprocess_llada2_data.sh  # One-click preprocessing: Pipeline A
    └── preprocess_wedlm_data.sh   # One-click preprocessing: Pipeline B

If you find TIDE useful for your research, please consider citing:

    @misc{zhang2026turningtidecrossarchitecturedistillation,
      title={Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models},
      author={Gongbo Zhang and Wen Wang and Ye Tian and Li Yuan},
      year={2026},
      eprint={2604.26951},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.26951},
    }

Built on the dLLM library; cross-tokenizer alignment via tokenkit; evaluation through lm-evaluation-harness.


