Out-of-core inference engine that runs LLMs of any size by streaming layers from disk through memory, one layer at a time.
Inspired by AirLLM.
Install (editable, from a local checkout):

```shell
pip install -e .
```

Quickstart:

```python
import asyncio

from splicellm import OutOfCoreEngine


async def main():
    engine = OutOfCoreEngine()
    await engine.load_model("/path/to/huggingface-model")
    async for token in engine.generate_stream("Hello, world!"):
        print(token, end="", flush=True)
    await engine.unload_model()


asyncio.run(main())
```

Features:

- Out-of-core inference — only one transformer layer in memory at a time
- MXFP4 dequantization — on-the-fly decoding of quantized weights
- Speculative decoding — 2-3x speedup with a small draft model
- Memory guard — OOM prevention with preflight checks and background monitoring
- LoRA adapters — hot-swap PEFT adapters without reloading the base model
- Model splitter — memory-safe splitting of HuggingFace models into per-layer files
- KV-cache — incremental decoding, only processes new tokens
- NF4/8-bit compression — compress split layers for faster disk I/O
- HuggingFace downloading — load models by repo ID, auto-downloads
- Profiler — per-layer timing and GPU memory tracking
- Cross-platform — CUDA, MPS (Apple Silicon), and CPU
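The out-of-core idea behind the first feature can be sketched in a few lines. This is a toy illustration, not splicellm's actual API: the per-layer files, `split_model`, and `run_out_of_core` names are assumptions, and the "layers" are just bias vectors rather than real transformer weights.

```python
import os
import pickle
import tempfile

# Toy stand-in for a transformer: each "layer" is a bias vector added to
# the activations. Real layers would be full weight tensors.


def split_model(layers, model_dir):
    """Write each layer to its own file so only one needs RAM at a time."""
    for i, layer in enumerate(layers):
        with open(os.path.join(model_dir, f"layer_{i}.pkl"), "wb") as f:
            pickle.dump(layer, f)


def run_out_of_core(x, model_dir, num_layers):
    """Stream layers from disk one at a time; peak memory is one layer."""
    for i in range(num_layers):
        with open(os.path.join(model_dir, f"layer_{i}.pkl"), "rb") as f:
            layer = pickle.load(f)  # load: only this layer is resident
        x = [a + b for a, b in zip(x, layer)]  # apply the layer
        del layer  # free it before loading the next one
    return x


model_dir = tempfile.mkdtemp()
layers = [[1.0, 2.0], [10.0, 20.0]]  # two tiny "layers"
split_model(layers, model_dir)
print(run_out_of_core([0.0, 0.0], model_dir, len(layers)))  # [11.0, 22.0]
```

The trade-off is the same one the engine makes: memory usage stays bounded by a single layer, but every forward pass pays the disk-read cost of the whole model.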
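Speculative decoding, the source of the quoted 2-3x speedup, can be sketched with greedy toy models. Everything here is illustrative: the `draft`/`target` callables and the accept rule stand in for real models, and this is not splicellm's implementation.

```python
def speculative_decode(prompt, draft, target, k=3, steps=5):
    """Greedy speculative decoding: the cheap draft model proposes k tokens,
    the target model verifies them, and the longest correct prefix is
    accepted at once (plus the target's token at the first mismatch)."""
    out = list(prompt)
    while len(out) - len(prompt) < steps:
        # 1. Draft model proposes k tokens cheaply.
        ctx = list(out)
        proposed = []
        for _ in range(k):
            t = draft(ctx)
            proposed.append(t)
            ctx = ctx + [t]
        # 2. Target model checks each proposal; one target pass can score
        # all k positions at once, which is where the speedup comes from.
        ctx = list(out)
        for t in proposed:
            correct = target(ctx)
            if t != correct:
                out.append(correct)  # rejected: substitute target's token
                break
            out.append(t)  # accepted without a separate target step
            ctx = ctx + [t]
    return out[len(prompt):len(prompt) + steps]


# Toy models: the target always emits the context length; the draft
# agrees only while the context is short, then starts guessing wrong.
target = lambda ctx: len(ctx)
draft = lambda ctx: len(ctx) if len(ctx) < 4 else 0
print(speculative_decode([0], draft, target))  # [1, 2, 3, 4, 5]
```

The output is identical to decoding with the target alone; the draft only changes how many expensive target steps are needed, which is why quality is preserved while latency drops.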
Apache 2.0 — see LICENSE.