PKU-YuanGroup/TIDE

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

🌊 The first cross-architecture distillation framework for diffusion LLMs: 8B dense and 16B MoE teachers into a 0.6B student 🌊

Gongbo Zhang1 · Wen Wang2 · Ye Tian1 · Li Yuan1,*

1 Peking University · 2 Zhejiang University (* corresponding author)

This repository is the official implementation of TIDE, the first framework for cross-architecture dLLM distillation. While prior work focuses on step compression within a single architecture, TIDE bridges teachers and students that differ in architecture, attention mechanism, and tokenizer, via three modular components: TIDAL, CompDemo, and Reverse CALM.

[Figure: TIDE cross-architecture distillation overview]

✨ Highlights

  1. +1.53 average gain over the non-distilled BD3LM baseline across 8 benchmarks (34.20 vs. 32.67).
  2. +16.48 on HumanEval over the equivalent-size AR baseline (48.78 vs. 32.30); distilled dLLMs excel especially at code generation.
  3. 22× peak-memory reduction vs. the 16B MoE LLaDA2 teacher (1.4 GB vs. 31.3 GB) and 5.2× faster inference (6.25 s vs. 32.55 s for 256 tokens on H100), enabling commodity-hardware deployment.

All numbers are as reported in the paper; see arxiv.org/abs/2604.26951 for the full setup and ablations.
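
The headline figures follow directly from the numbers quoted in the list above. A minimal Python check (the variable names are illustrative and do not come from the repository):

```python
# Numbers quoted in the Highlights list.
tide_avg, bd3lm_avg = 34.20, 32.67            # 8-benchmark averages
tide_humaneval, ar_humaneval = 48.78, 32.30   # HumanEval scores
teacher_mem_gb, student_mem_gb = 31.3, 1.4    # peak memory
teacher_sec, student_sec = 32.55, 6.25        # 256 tokens on H100

avg_gain = round(tide_avg - bd3lm_avg, 2)
humaneval_gain = round(tide_humaneval - ar_humaneval, 2)
memory_ratio = teacher_mem_gb / student_mem_gb
speedup = teacher_sec / student_sec

print(avg_gain, humaneval_gain, round(memory_ratio, 1), round(speedup, 1))
```

The memory ratio comes out slightly above 22 and the speedup slightly above 5.2, so the README rounds both down.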

🌊 The TIDE Framework

[Figure: TIDE framework (TIDAL + CompDemo + Reverse CALM)]

| Component | Paper | Role | One-line description |
| --- | --- | --- | --- |
| TIDAL | §2.1 | Scheduling: when to learn | Dual-axis interpolation along training-progress AND diffusion-timestep axes; deweights the teacher at high masking ratios where it is unreliable. Generalizes prior single-axis interpolation to the diffusion setting. |
| CompDemo | §2.2 | Contextual: what to enrich | Two-pass teacher inference with complementary mask splits; every masked position sees ~50% revealed context, raising teacher signal quality at high noise. |
| Reverse CALM | §2.3 | Output: how to project | Reverse-direction chunk-level binary cross-entropy for cross-tokenizer matching. Bounded gradient coefficient (depends only on the fixed teacher) and dual-end noise filtering; equivalent to a Bernoulli-KL mode-seeking objective. |
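
To make the CompDemo row concrete, here is a minimal sketch of one way to build complementary mask splits. It is a reading of the one-line description, not the repository's code: the function names, the `<mask>` placeholder, and the random split are all assumptions. The real logic lives in dllm/core/trainers/distill_bd3lm.py and may differ.

```python
import random


def complementary_splits(masked_positions, demo_ratio=0.5, seed=0):
    """Randomly split the masked positions into two disjoint groups."""
    rng = random.Random(seed)
    shuffled = list(masked_positions)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * demo_ratio)
    return sorted(shuffled[:cut]), sorted(shuffled[cut:])


def two_pass_inputs(tokens, masked_positions, mask_token="<mask>",
                    demo_ratio=0.5, seed=0):
    """Build two teacher inputs plus the positions each pass is queried on.

    `tokens` is the clean sequence. In each pass only the queried group is
    masked; the other group is revealed with its ground-truth tokens.
    """
    group_a, group_b = complementary_splits(masked_positions, demo_ratio, seed)
    passes = []
    for queried in (group_b, group_a):
        hidden = set(queried)
        teacher_input = [mask_token if i in hidden else tok
                         for i, tok in enumerate(tokens)]
        passes.append((teacher_input, queried))
    return passes


tokens = ["The", "tide", "is", "turning", "now", "."]
for teacher_input, queried in two_pass_inputs(tokens, [1, 2, 3, 4]):
    print(queried, teacher_input)
```

With demo_ratio = 0.5, each pass reveals about half of the originally masked positions, and the two passes together cover every masked position exactly once.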

🔄 Two Pipelines × Two Strategies

Headline finding (§3.2): each pipeline favors its native strategy.

  • Cross-Tokenizer (LLaDA2 → BD3LM): native = TIDE-Cross = Reverse CALM. Bounded-gradient mode-seeking tolerates the alignment noise from chunk-level cross-tokenizer matching. Beats the swapped TIDE-Shared by avg +0.37.
  • Shared-Tokenizer (WeDLM → BD3LM): native = TIDE-Shared = TIDAL + CompDemo (over forward KL). Progressive scheduling and enriched signals work best when token-level alignment is exact. Beats the swapped TIDE-Cross by avg +2.76.

| Pipeline | Teacher | Student | Tokenizer | Native strategy | Paper avg |
| --- | --- | --- | --- | --- | --- |
| A (Cross-Tokenizer) | LLaDA2.0-mini (16B MoE) | Qwen3-0.6B-BD3LM | Cross (chunk align via tokenkit) | TIDE-Cross = Reverse CALM | 34.20 |
| B (Shared-Tokenizer) | WeDLM-8B-Instruct (8B dense) | Qwen3-0.6B-BD3LM | Shared (vocab 151646) | TIDE-Shared = TIDAL + CompDemo | 33.55 |
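
The "bounded gradient" property of Reverse CALM can be illustrated with a small sketch. Assumptions, none of which are confirmed by this README: a chunk's probability is the product of its token probabilities, both losses are plain binary cross-entropies over that chunk probability, and probabilities are clamped by a hypothetical `eps`. The repository's loss, including its dual-end noise filtering, may differ; see §2.3 of the paper.

```python
import math


def chunk_prob(token_logprobs):
    """Chunk probability as the product of its token probabilities."""
    return math.exp(sum(token_logprobs))


def clamp(p, eps=1e-6):
    """Keep probabilities away from 0 and 1 (eps is a hypothetical choice)."""
    return min(max(p, eps), 1.0 - eps)


def forward_bce(p_teacher, p_student):
    """Forward direction: the teacher probability is the target."""
    p_t, p_s = clamp(p_teacher), clamp(p_student)
    return -(p_t * math.log(p_s) + (1.0 - p_t) * math.log(1.0 - p_s))


def reverse_bce(p_teacher, p_student):
    """Reverse direction: the student probability weights teacher log-probs."""
    p_t, p_s = clamp(p_teacher), clamp(p_student)
    return -(p_s * math.log(p_t) + (1.0 - p_s) * math.log(1.0 - p_t))


def reverse_grad_coeff(p_teacher):
    """d(reverse_bce)/d(p_student) = -logit(p_teacher): teacher-only."""
    p_t = clamp(p_teacher)
    return -(math.log(p_t) - math.log(1.0 - p_t))


p_t = chunk_prob([math.log(0.9), math.log(0.8)])   # teacher chunk, 2 tokens
print(reverse_grad_coeff(p_t))                     # same for any student value
```

Under these assumptions the reverse loss is linear in the student chunk probability, so its derivative depends only on the fixed teacher and is bounded by the clamp. The forward loss's derivative contains a 1/p_student term that grows without bound as the student probability approaches zero.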

📊 Main Results

Main results across eight benchmarks. All distillation methods include a cross-entropy loss term. Bold: best among dLLM models; italic: second best.

| Benchmark | Qwen3-0.6B AR | Qwen3-0.6B BD3LM | Shared-Tok KL | Shared-Tok TIDE-Cross | Shared-Tok TIDE-Shared | Cross-Tok CALM | Cross-Tok TIDE-Shared | Cross-Tok TIDE-Cross |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GSM8K | 59.60 | 45.56 | 43.97 | 45.03 | 48.98 | 48.60 | *49.89* | **52.24** |
| MATH | 32.40 | 13.08 | 9.40 | 9.76 | 11.16 | *13.14* | 12.98 | **13.20** |
| BBH | 41.50 | 26.32 | 25.79 | 26.00 | 26.79 | 24.21 | *26.85* | **27.37** |
| MMLU-Pro | 24.70 | 13.80 | 13.19 | 12.88 | *14.48* | 13.47 | 14.02 | **14.52** |
| HellaSwag | 47.40 | 39.28 | 39.78 | 39.50 | **40.50** | *40.42* | 39.57 | 39.88 |
| MMLU | 52.80 | 39.15 | 39.57 | 39.09 | **39.92** | 39.42 | 39.54 | *39.59* |
| HumanEval | 32.30 | 46.34 | 41.46 | 42.68 | *48.78* | 43.90 | **49.39** | 48.17 |
| MBPP | 36.60 | 37.80 | 31.20 | 31.40 | 37.80 | 34.80 | *38.40* | **38.60** |
| Avg | 40.91 | 32.67 | 30.55 | 30.79 | 33.55 | 32.25 | *33.83* | **34.20** |

See the paper (§3.2) at arxiv.org/abs/2604.26951 for the full discussion.

🧭 Paper Variants ↔ Code Modes

This is the only place in the README where the legacy CLI strings alm / taid appear, because the --distill_mode flag values include them.

| Paper variant | Pipeline | Command | Notes |
| --- | --- | --- | --- |
| CALM (baseline, Cross-Tok) | A | distill_llada2.sh --distill_mode alm | - |
| TIDE-Cross (native, Cross-Tok) | A | distill_llada2.sh --distill_mode reverse_alm | - |
| TIDE-Shared (in Cross-Tok pipeline) | A | distill_llada2.sh --distill_mode alm_taid --use_comp_demo True | TIDAL + CompDemo |
| KL (baseline, Shared-Tok) | B | distill_wedlm.sh --distill_mode kl_aligned | - |
| TIDE-Shared (native, Shared-Tok) | B | distill_wedlm.sh --distill_mode taid_aligned --use_comp_demo True | TIDAL + CompDemo |
| TIDE-Cross (in Shared-Tok pipeline) | B | distill_wedlm.sh --distill_mode reverse_kl_aligned | - |
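
For scripted sweeps over all six variants, the table can be kept as a lookup. The sketch below copies the script names and flag values from the table; `VARIANTS` and `build_command` are hypothetical helpers, not part of the repository.

```python
# (pipeline, paper variant) -> (script, extra CLI arguments), per the table.
VARIANTS = {
    ("A", "CALM"): ("distill_llada2.sh", ["--distill_mode", "alm"]),
    ("A", "TIDE-Cross"): ("distill_llada2.sh", ["--distill_mode", "reverse_alm"]),
    ("A", "TIDE-Shared"): ("distill_llada2.sh",
                           ["--distill_mode", "alm_taid", "--use_comp_demo", "True"]),
    ("B", "KL"): ("distill_wedlm.sh", ["--distill_mode", "kl_aligned"]),
    ("B", "TIDE-Shared"): ("distill_wedlm.sh",
                           ["--distill_mode", "taid_aligned", "--use_comp_demo", "True"]),
    ("B", "TIDE-Cross"): ("distill_wedlm.sh", ["--distill_mode", "reverse_kl_aligned"]),
}


def build_command(pipeline, variant, data_path, num_gpus=8):
    """Return the argv list for one paper variant (does not run anything)."""
    script, extra = VARIANTS[(pipeline, variant)]
    return ["bash", f"scripts/{script}", "--data_path", data_path,
            *extra, "--num_gpus", str(num_gpus)]


print(" ".join(build_command("A", "TIDE-Cross", "data/distill_llada2_preprocessed")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.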

💡 Note on combinations. TIDAL is applied only to forward objectives. As discussed in the paper's gradient-analysis appendix, combining TIDAL with reverse objectives is counterproductive: the late-training $(1-\lambda_t)$ factor suppresses the self-selection mechanism of Reverse CALM.

⚙️ Setup

# Create environment
conda create -n dllm python=3.10 -y && conda activate dllm

# Install PyTorch (CUDA 12.4)
conda install cuda=12.4 -c nvidia
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
    --index-url https://download.pytorch.org/whl/cu124

# Install dllm
pip install -e .

# Initialize submodules (lm-evaluation-harness + tokenkit)
git submodule update --init --recursive

# Install eval harness
pip install -e "lm-evaluation-harness[ifeval,math]"

# Install tokenkit (required for Pipeline A cross-tokenizer distillation)
pip install -e "tokenkit[full]"

📦 Released Models & Data

Six distilled student checkpoints (3 per pipeline) are released under 🤗 TIDE-dllm Models, and two preprocessed SFT datasets are released under 🤗 TIDE-dllm Datasets.

Distilled student checkpoints

| Pipeline | Variant | 🤗 Repo |
| --- | --- | --- |
| A (Cross-Tokenizer, LLaDA2 teacher) | TIDE-Cross (native) | distill-LLaDA2-TIDE_Cross |
| A (Cross-Tokenizer, LLaDA2 teacher) | TIDE-Shared variant | distill-LLaDA2-TIDE_Shared |
| A (Cross-Tokenizer, LLaDA2 teacher) | CALM baseline | distill-LLaDA2-CALM |
| B (Shared-Tokenizer, WeDLM teacher) | TIDE-Shared (native) | distill-WeDLM-TIDE_Shared |
| B (Shared-Tokenizer, WeDLM teacher) | TIDE-Cross variant | distill-WeDLM-TIDE_Cross |
| B (Shared-Tokenizer, WeDLM teacher) | KL baseline | distill-WeDLM-KL |

Preprocessed SFT datasets

Both datasets share the same composition as dllm-hub/Qwen3-0.6B-diffusion-bd3lm-v0.1 (tulu-3-sft-mixture + smoltalk + opc-sft-stage1 + opc-sft-stage2), but are tokenized for each teacher in advance to avoid NCCL timeouts during distillation.

| Pipeline | 🤗 Repo |
| --- | --- |
| A (for the LLaDA2 teacher) | distill_llada2_sft |
| B (for the WeDLM teacher) | distill_wedlm_sft |

Download

pip install "huggingface_hub[cli]"

# Distilled checkpoint (example: native TIDE-Cross from Pipeline A)
huggingface-cli download TIDE-dllm/distill-LLaDA2-TIDE_Cross \
    --local-dir ckpts/distill-LLaDA2-TIDE_Cross

# Preprocessed datasets
huggingface-cli download TIDE-dllm/distill_llada2_sft \
    --repo-type dataset --local-dir data/distill_llada2_sft
huggingface-cli download TIDE-dllm/distill_wedlm_sft \
    --repo-type dataset --local-dir data/distill_wedlm_sft
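
The same downloads can be scripted from Python with `huggingface_hub.snapshot_download`. This is a sketch that mirrors the three CLI calls above; `download_plan` and `run` are illustrative names, and the import is deferred so the plan can be inspected without network access.

```python
def download_plan():
    """Keyword arguments for snapshot_download, one dict per CLI call above."""
    return [
        {"repo_id": "TIDE-dllm/distill-LLaDA2-TIDE_Cross",
         "local_dir": "ckpts/distill-LLaDA2-TIDE_Cross"},
        {"repo_id": "TIDE-dllm/distill_llada2_sft", "repo_type": "dataset",
         "local_dir": "data/distill_llada2_sft"},
        {"repo_id": "TIDE-dllm/distill_wedlm_sft", "repo_type": "dataset",
         "local_dir": "data/distill_wedlm_sft"},
    ]


def run(plan):
    """Download every entry; needs `pip install huggingface_hub` and network."""
    from huggingface_hub import snapshot_download
    return [snapshot_download(**kwargs) for kwargs in plan]


if __name__ == "__main__":
    for entry in download_plan():
        print(entry)
    # Uncomment to actually download:
    # run(download_plan())
```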

Project page: pku-yuangroup.github.io/TIDE-Page.

🚀 Quick Start

1. Data Preprocessing

Distillation requires offline-preprocessed data to avoid NCCL timeouts during tokenization. The fastest path is to download our preprocessed datasets from TIDE-dllm (see 📦 Released Models & Data above):

huggingface-cli download TIDE-dllm/distill_llada2_sft \
    --repo-type dataset --local-dir data/distill_llada2_preprocessed
huggingface-cli download TIDE-dllm/distill_wedlm_sft \
    --repo-type dataset --local-dir data/distill_wedlm_preprocessed

If you'd rather preprocess from scratch, the examples below use tatsu-lab/alpaca for a quick smoke test. To reproduce the paper, replace the --dataset value with:

allenai/tulu-3-sft-mixture+HuggingFaceTB/smoltalk+OpenCoder-LLM/opc-sft-stage1[lang:python]+OpenCoder-LLM/opc-sft-stage2[lang:python]
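
The value above joins several dataset names with `+` and attaches optional bracketed filters such as `[lang:python]`. The sketch below shows one way to read that syntax. It is an assumption about the format based only on this string, not the repository's parser, and `parse_dataset_spec` is a hypothetical name.

```python
import re

# name, optionally followed by one [key:value,key:value] group
_ENTRY = re.compile(r"^(?P<name>[^\[\]]+)(?:\[(?P<opts>[^\[\]]*)\])?$")


def parse_dataset_spec(spec):
    """Split 'a+b[k:v]' into [(name, {option: value}), ...]."""
    parsed = []
    for part in spec.split("+"):
        match = _ENTRY.match(part.strip())
        if match is None:
            raise ValueError(f"cannot parse dataset entry: {part!r}")
        options = {}
        if match.group("opts"):
            for item in match.group("opts").split(","):
                key, _, value = item.partition(":")
                options[key.strip()] = value.strip()
        parsed.append((match.group("name"), options))
    return parsed


spec = ("allenai/tulu-3-sft-mixture+HuggingFaceTB/smoltalk"
        "+OpenCoder-LLM/opc-sft-stage1[lang:python]"
        "+OpenCoder-LLM/opc-sft-stage2[lang:python]")
for name, options in parse_dataset_spec(spec):
    print(name, options)
```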

Pipeline A (LLaDA2, cross-tokenizer):

bash scripts/preprocess_llada2_data.sh \
    --dataset tatsu-lab/alpaca \
    --output_dir data/distill_llada2_preprocessed

Pipeline B (WeDLM, shared-tokenizer):

bash scripts/preprocess_wedlm_data.sh \
    --dataset tatsu-lab/alpaca \
    --output_dir data/distill_wedlm_preprocessed

2. Distillation Training

The recommended command for each pipeline runs the native strategy (paper-best per §3.2).

Pipeline A (LLaDA2 teacher, TIDE-Cross = Reverse CALM):

bash scripts/distill_llada2.sh \
    --data_path data/distill_llada2_preprocessed \
    --distill_mode reverse_alm \
    --num_gpus 8

Pipeline B (WeDLM teacher, TIDE-Shared = TIDAL + CompDemo):

bash scripts/distill_wedlm.sh \
    --data_path data/distill_wedlm_preprocessed \
    --distill_mode taid_aligned \
    --use_comp_demo True \
    --num_gpus 8

📋 All training script parameters

Both distill_llada2.sh and distill_wedlm.sh support the parameters below. Where two defaults are listed, the first applies to distill_llada2.sh and the second to distill_wedlm.sh.

| Parameter | Default | Description |
| --- | --- | --- |
| --data_path | required | Preprocessed data directory or HF dataset name |
| --output_dir | output/distill_* | Checkpoint output directory |
| --num_gpus | 8 | Number of GPUs |
| --distill_mode | alm / taid_aligned | Distillation mode (see Paper Variants ↔ Code Modes table above) |
| --use_comp_demo | False | Enable CompDemo (complementary demonstration) |
| --epochs | 2 / 3 | Number of training epochs |
| --lr | 5e-5 | Learning rate |
| --batch_size | 8 / 10 | Per-device batch size |
| --student_model | dllm-collection/Qwen3-0.6B-diffusion-bd3lm-v0.1 | Student model |
| --teacher_model | inclusionAI/LLaDA2.0-mini / tencent/WeDLM-8B-Instruct | Teacher model |

WeDLM-specific (TIDAL controls):

| Parameter | Default | Description |
| --- | --- | --- |
| --taid_axis_mode | both | TIDAL axis: both, training_only, timestep_only |
| --taid_timestep_weight | midrange | Timestep weighting: uniform, midrange |
| --shared_vocab_size | 151646 | Shared vocabulary size |
| --teacher_mask_token_id | 151665 | Teacher mask token ID |

3. Evaluation

Run all 8 benchmarks on a trained checkpoint:

bash scripts/eval_all.sh --model_path /path/to/checkpoint --num_gpus 8

Benchmarks: mmlu_generative_dream, mmlu_pro, hellaswag_gen, gsm8k_cot, bbh, minerva_math, humaneval_instruct, mbpp_instruct.

Evaluation protocol: block size 32, CFG scale 0.0, sampling steps from 3 (HellaSwag/MMLU) up to 256 (everything else). Results are saved to eval_results/ by default (override with --output_dir).

📋 Training Hyperparameters

Training settings used for the paper experiments.

| Parameter | Cross-Tokenizer (Pipeline A) | Shared-Tokenizer (Pipeline B) |
| --- | --- | --- |
| Teacher | LLaDA2.0-mini (16B MoE) | WeDLM-8B-Instruct (8B) |
| Student init | Qwen3-0.6B-BD3LM SFT v0.1 | Qwen3-0.6B-BD3LM SFT v0.1 |
| Native method | Reverse CALM | TIDAL + CompDemo |
| Learning rate | 5e-5 | 5e-5 |
| Epochs | 10 | 10 |
| Student / teacher seq length | 512 / 1024 | 512 / 768 |
| Block size | 32 | 32 |
| Precision | bfloat16 | bfloat16 |
| TIDAL $\lambda_{\text{init}} \to \lambda_{\max}$ | - | $0.1 \to 0.9$, cosine, midrange weighting |
| CompDemo demo_ratio | - | 0.5 |
| Temperature $T$ | - | 2.0 |
| Dataset | Tulu-3 SFT + SmolTalk + OpenCoder-SFT-1/2 (Python) | (same) |
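
To give a feel for the TIDAL row, the sketch below combines a cosine ramp from 0.1 to 0.9 over training progress with a timestep factor that shrinks as the masking ratio grows. Only the endpoints and the word "cosine" come from the table. The exact curve, the linear `timestep_weight` (a stand-in, not the "midrange" weighting), and the probability-space mixing are assumptions; the actual loss is in dllm/core/trainers/losses/taid.py and §2.1 of the paper.

```python
import math


def training_lambda(progress, lam_init=0.1, lam_max=0.9):
    """Cosine ramp from lam_init (progress=0) to lam_max (progress=1)."""
    progress = min(max(progress, 0.0), 1.0)
    ramp = 0.5 * (1.0 - math.cos(math.pi * progress))
    return lam_init + (lam_max - lam_init) * ramp


def timestep_weight(mask_ratio):
    """HYPOTHETICAL: trust the teacher less as more tokens are masked."""
    return 1.0 - min(max(mask_ratio, 0.0), 1.0)


def tidal_lambda(progress, mask_ratio):
    """Dual-axis weight: training-progress axis times diffusion-timestep axis."""
    return training_lambda(progress) * timestep_weight(mask_ratio)


def interpolate(student_probs, teacher_probs, lam):
    """Intermediate target (1 - lam) * student + lam * teacher."""
    return [(1.0 - lam) * s + lam * t
            for s, t in zip(student_probs, teacher_probs)]


lam = tidal_lambda(progress=0.5, mask_ratio=0.25)
print(lam, interpolate([0.7, 0.2, 0.1], [0.1, 0.8, 0.1], lam))
```

Early in training, or at high masking ratios, the target stays close to the student's own distribution; it moves toward the teacher as training progresses and as more context is revealed.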

🛠️ Troubleshooting

ValueError: Sequence length N exceeds pad_to_length M during training

For *_aligned modes (Pipeline B), the preprocessing script does not truncate samples to --max_length; it only filters out samples whose prompt alone exceeds it. The training --max_length (and --teacher_max_length) must therefore be at least as large as the value used during preprocessing. The simplest rule: pass the same --max_length to both preprocess_wedlm_data.sh and distill_wedlm.sh.

Pipeline B taid_aligned requires aligned preprocessed data

The default --align_mode of preprocess_wedlm_data.sh is kl_aligned, which produces the dual-tokenizer fields (teacher_input_ids, align_student, align_teacher) needed by *_aligned training modes. If you preprocessed with --align_mode none, training in any *_aligned mode will crash with KeyError: 'teacher_input_ids'. Re-run preprocessing without overriding --align_mode.
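
Both failure modes above can be caught before a multi-GPU job starts. The sketch below checks a handful of preprocessed samples, each given as a dict. The three aligned field names come from the text above; the student field name `input_ids` and the `preflight` helper itself are assumptions.

```python
REQUIRED_ALIGNED_FIELDS = ("teacher_input_ids", "align_student", "align_teacher")


def preflight(samples, max_length, teacher_max_length, student_key="input_ids"):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for idx, sample in enumerate(samples):
        missing = [key for key in REQUIRED_ALIGNED_FIELDS if key not in sample]
        if missing:
            problems.append(
                f"sample {idx}: missing {missing} "
                "(was the data preprocessed with --align_mode none?)")
            continue
        student_len = len(sample.get(student_key, []))
        if student_len > max_length:
            problems.append(
                f"sample {idx}: student length {student_len} > --max_length {max_length}")
        teacher_len = len(sample["teacher_input_ids"])
        if teacher_len > teacher_max_length:
            problems.append(
                f"sample {idx}: teacher length {teacher_len} "
                f"> --teacher_max_length {teacher_max_length}")
    return problems


toy = [
    {"input_ids": [1] * 600, "teacher_input_ids": [1] * 700,
     "align_student": [], "align_teacher": []},
    {"input_ids": [1] * 10},
]
for problem in preflight(toy, max_length=512, teacher_max_length=768):
    print(problem)
```

In practice the samples would be read from the preprocessed dataset directory; how that directory is loaded depends on the format the preprocessing script writes, which this README does not specify.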

📁 File Structure

dllm/core/trainers/
├── distill_bd3lm.py        # DistillBD3LMTrainer: all distillation modes (TIDAL, CompDemo, CALM, Reverse CALM, plus baselines)
├── distill_collator.py     # DistillCollator: chunk-level CALM alignment via tokenkit (paper §2.3)
├── bd3lm.py                # BD3LMTrainer (base block diffusion trainer)
├── mdlm.py                 # MDLMTrainer (base masked diffusion trainer)
└── losses/
    └── taid.py             # TIDAL loss implementation (paper §2.1)

examples/a2d/bd3lm/
├── distill.py              # Pipeline A entry: LLaDA2 cross-tokenizer distillation
├── distill_wedlm.py        # Pipeline B entry: WeDLM same-tokenizer distillation
├── distill_utils.py        # Shared utilities (alignment, tokenization)
├── preprocess_distill_data.py       # Data preprocessing for Pipeline A
└── preprocess_distill_wedlm_data.py # Data preprocessing for Pipeline B

scripts/
├── distill_llada2.sh       # One-click training: Pipeline A
├── distill_wedlm.sh        # One-click training: Pipeline B
├── eval_all.sh             # One-click evaluation (8 benchmarks)
├── preprocess_llada2_data.sh   # One-click preprocessing: Pipeline A
└── preprocess_wedlm_data.sh    # One-click preprocessing: Pipeline B

📝 Citation

If you find TIDE useful for your research, please consider citing:

@misc{zhang2026turningtidecrossarchitecturedistillation,
      title={Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models},
      author={Gongbo Zhang and Wen Wang and Ye Tian and Li Yuan},
      year={2026},
      eprint={2604.26951},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.26951},
}

🙏 Acknowledgements

Built on the dLLM library; cross-tokenizer alignment via tokenkit; evaluation through lm-evaluation-harness.
