SMC Speculative Decoding

Warning: This repository is under active development. APIs, configuration flags, and internal interfaces may undergo breaking changes.

This repository implements Sequential Monte Carlo Speculative Decoding (SMC-SD) on top of SGLang. SMC-SD is a population-based alternative to rejection-based speculative decoding: N particles maintain parallel generation paths, are weighted by target/draft likelihood ratios, and are resampled when the effective sample size (ESS) drops. All drafted tokens are accepted (there is no rejection step), and throughput scales with batch size as the extra per-step work raises arithmetic intensity toward the GPU compute bound.
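As a rough illustration of the weight update, here is a minimal sketch with toy tensors; it is not the repository's API, and all names below are hypothetical:

import torch

# One SMC-SD step for a single request: each of the N particles extends its
# path with gamma drafted tokens, and its running importance weight is
# multiplied by the target/draft likelihood ratio of those tokens.
def smc_weight_update(log_weights, target_logprobs, draft_logprobs):
    # log_weights: (N,) running log importance weights
    # target_logprobs, draft_logprobs: (N, gamma) log-probabilities of the
    # drafted tokens under the target and draft models, respectively
    return log_weights + (target_logprobs - draft_logprobs).sum(dim=-1)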

Paper: Faster LLM Inference via Sequential Monte Carlo


Installation

This repo vendors a patched SGLang as a git submodule at 3rdparty/sglang (branch smc_v2_clean). Install both in editable mode.

# 1. Clone with submodules
git clone --recurse-submodules https://github.com/abdelfattah-lab/smcsd.git
cd smcsd

# If you already cloned without --recurse-submodules, initialise now:
# git submodule update --init --recursive

# 2. Create a Python 3.12 environment
uv venv --python 3.12
source .venv/bin/activate

# 3. Install the patched SGLang (from the submodule), then this package
uv pip install -e 3rdparty/sglang/python
uv pip install -e .

SMCEngine will not import until the patched SGLang submodule is both checked out and installed. If you hit ModuleNotFoundError: No module named 'sglang', run:

git submodule update --init --recursive
uv pip install -e 3rdparty/sglang/python
uv pip install -e .
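To confirm that both editable installs resolve from the environment, a quick import check (the module path and class name are taken from the Architecture section below; nothing else is assumed):

import sglang                        # patched SGLang from 3rdparty/sglang
from smcsd.engine import SMCEngine   # standalone offline engine

print(sglang.__version__)
print(SMCEngine)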

Quick Start

# SMC-SD throughput on ShareGPT
python -O scripts/tps_benchmark_scripts/bench_offline_throughput.py \
  --backend smc_engine \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-draft-model-path meta-llama/Llama-3.2-1B-Instruct \
  --smc-n-particles 8 --smc-gamma 8 \
  --smc-draft-temperature 0.7 --smc-target-temperature 0.7 \
  --attention-backend fa3 \
  --mem-fraction-static 0.60 \
  --max-running-requests 1 \
  --cuda-graph-max-bs 8 \
  --dataset-name sharegpt \
  --num-prompts 200

# SMC-SD accuracy on GSM8K (N=12 particles, gamma=8)
python scripts/accuracy_test_gsm8k.py \
  --mode smc_engine \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --draft-model meta-llama/Llama-3.2-1B-Instruct \
  --particles 12 --gamma 8 \
  --temperature 0.7 \
  --attention-backend fa3 \
  --num-questions 400

Note

When using a non-Hopper GPU (such as an A100 or A6000), set --attention-backend to triton.

See scripts/README.md for more benchmark entrypoints.

SMC-SD Parameters

| Parameter | Flag | Description |
| --- | --- | --- |
| Particles (N) | --smc-n-particles | Number of parallel generation paths per request |
| Gamma (K) | --smc-gamma | Draft tokens per speculative step |
| Draft temp | --smc-draft-temperature | Sampling temperature for draft model |
| Target temp | --smc-target-temperature | Scoring temperature for target model |
| Resample threshold | --smc-resample-threshold | Resample when ESS < N × threshold (0 = disable) |
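For intuition on the resample threshold, here is a standalone sketch of the ESS criterion from the last row; it is illustrative only and is not the helper in smcsd/common/utils.py:

import torch

def should_resample(log_weights, threshold):
    # Resample when ESS < N * threshold; threshold = 0 disables resampling
    # (pure sequential importance sampling, no particle interaction).
    if threshold == 0:
        return False
    w = torch.softmax(log_weights, dim=0)   # normalized particle weights
    ess = 1.0 / (w * w).sum()               # effective sample size
    return (ess < threshold * log_weights.numel()).item()

# e.g. with N = 8 and almost all weight on one particle, ESS is close to 1,
# so a threshold of 0.5 (ESS < 4) would trigger a resample.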

Architecture

SMC lives in the top-level smcsd/ package, layered over the patched SGLang via a handful of extension points (ModelRunner._init_pools, ModelRunner._build_dummy_run_spec_info, ModelRunner._get_graph_runner_class, CudaGraphRunner.get_spec_info, Scheduler.init_tp_model_worker, TpModelWorker._init_model_runner).

| Path | Description |
| --- | --- |
| smcsd/engine.py | SMCEngine — standalone offline engine (bypasses Tokenizer/Detokenizer managers) |
| smcsd/core/scheduler.py | SMCScheduler + SMCCoordinator — slot-based decode loop and resampler |
| smcsd/core/worker.py | SMCWorker — draft AR loop + target scoring + importance weights |
| smcsd/core/req_state.py | ScheduleBatchSMC — per-slot decode state, flat slot-major weights, and group lookup |
| smcsd/core/info.py | SMCDraftInput, SMCDecodeContext — spec-info wiring |
| smcsd/core/kernels/ | Fused Triton kernels (fused_collect, fused_resample_kv) |
| smcsd/managers/smc_tp_worker.py | SMCTpModelWorker — wires SMCModelRunner into the target TP worker |
| smcsd/model_executor/smc_model_runner.py | SMCModelRunner — installs refcounted allocator + SMC warmup spec-info |
| smcsd/model_executor/smc_cuda_graph_runner.py | SMCCudaGraphRunner — supplies SMCVerifyInput during CUDA graph capture |
| smcsd/mem_cache/allocator.py | SMCRefCountedTokenAllocator + copy_block_table |
| smcsd/common/verify.py | SMCVerifyInput + Triton cache-assignment kernel |
| smcsd/common/utils.py | Particle cloning, weight normalization, ESS / resample helpers |

See docs/smc/architecture.md for the detailed design overview.
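For orientation, below is a minimal sketch of what an ESS-triggered resampling pass does conceptually. The names are hypothetical and multinomial resampling is assumed; the actual coordinator additionally remaps KV-cache blocks via the fused Triton kernels and the refcounted allocator listed above.

import torch

def resample_parents(log_weights):
    # Draw one parent index per particle in proportion to the normalized
    # weights, then reset the weights to uniform. Slots whose parent differs
    # from themselves must also clone the parent's KV-cache state.
    n = log_weights.numel()
    probs = torch.softmax(log_weights, dim=0)
    parents = torch.multinomial(probs, n, replacement=True)
    return parents, torch.zeros(n)  # (parent index per slot, reset log weights)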

Citation

@misc{smcsd2026,
  title         = {Faster LLM Inference via Sequential Monte Carlo},
  author        = {Emara, Yahya and Barba da Costa, Mauricio and Chang, Chi-Chih
                   and Freer, Cameron and Vieira, Tim and Cotterell, Ryan
                   and Abdelfattah, Mohamed S.},
  year          = {2026},
  eprint        = {2604.15672},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2604.15672},
}

Roadmap

  • EAGLE support
  • vLLM support
  • Async/Delayed resampling (CPU/GPU overlap for KV cache rewrites)
  • Async SMC-SD at resample threshold 0 (overlap draft and target for SIS)
  • Disaggregation (draft/target separation)

PRs welcome!
