mchawda/spliceLLM


SpliceLLM

Out-of-core inference engine that runs LLMs of any size by streaming layers from disk through memory, one layer at a time.

Inspired by AirLLM.
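The idea in miniature: instead of holding every weight resident, load one layer's weights from disk, apply them, free them, and move on to the next. A self-contained sketch with a toy linear "layer" and a hypothetical `layer_XX.npy` file layout (illustrative only, not SpliceLLM's actual internals):

```python
import os
import tempfile
import numpy as np

def apply_layer(hidden, weight):
    # Toy stand-in for a transformer layer: a single linear map.
    return hidden @ weight

def run_out_of_core(hidden, layer_dir, num_layers):
    # Stream one layer at a time: load, apply, free, repeat.
    for i in range(num_layers):
        weight = np.load(os.path.join(layer_dir, f"layer_{i:02d}.npy"))
        hidden = apply_layer(hidden, weight)
        del weight  # only one layer is resident at any time
    return hidden

# Demo: split a tiny "model" into per-layer files, then stream them back.
with tempfile.TemporaryDirectory() as d:
    for i in range(3):
        np.save(os.path.join(d, f"layer_{i:02d}.npy"), 2.0 * np.eye(4))
    out = run_out_of_core(np.ones((1, 4)), d, num_layers=3)
    print(out)  # three doubling layers applied to ones -> all 8s
```

Peak memory is bounded by the largest single layer plus activations, which is what lets arbitrarily large models run on small hardware at the cost of disk I/O per token.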

Install

pip install -e .

Quick Start

import asyncio
from splicellm import OutOfCoreEngine

async def main():
    engine = OutOfCoreEngine()
    # Point at a local HuggingFace-format model directory (or a repo ID).
    await engine.load_model("/path/to/huggingface-model")

    # Tokens are streamed back as they are decoded.
    async for token in engine.generate_stream("Hello, world!"):
        print(token, end="", flush=True)

    await engine.unload_model()

asyncio.run(main())

Features

  • Out-of-core inference — only one transformer layer in memory at a time
  • MXFP4 dequantization — on-the-fly decoding of quantized weights
  • Speculative decoding — 2-3x speedup with a small draft model
  • Memory guard — OOM prevention with preflight checks and background monitoring
  • LoRA adapters — hot-swap PEFT adapters without reloading the base model
  • Model splitter — memory-safe splitting of HuggingFace models into per-layer files
  • KV-cache — incremental decoding, only processes new tokens
  • NF4/8-bit compression — compress split layers for faster disk I/O
  • HuggingFace downloading — load models by repo ID; weights are downloaded automatically
  • Profiler — per-layer timing and GPU memory tracking
  • Cross-platform — CUDA, MPS (Apple Silicon), and CPU
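For context on the MXFP4 bullet: the OCP Microscaling format groups weights into 32-element blocks of 4-bit E2M1 values that share one 8-bit power-of-two (E8M0) scale. A dequantization sketch, with codes unpacked one per byte for readability (real storage packs two codes per byte; this is not SpliceLLM's internal routine):

```python
import numpy as np

# FP4 (E2M1) magnitude table: the 3 low bits of each code index into
# these values; the high bit of the 4-bit code is the sign.
FP4_MAG = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def dequant_mxfp4(codes, scale_exp):
    """Dequantize one 32-element MXFP4 block.

    codes: uint8 array of 4-bit FP4 codes (one per byte here, for clarity).
    scale_exp: the block's shared E8M0 scale byte; its value is
               2 ** (scale_exp - 127).
    """
    sign = np.where(codes & 0x8, -1.0, 1.0)
    vals = sign * FP4_MAG[codes & 0x7]
    return vals * 2.0 ** (float(scale_exp) - 127)

block = np.array([0x1, 0x9, 0x7, 0xF] + [0x0] * 28, dtype=np.uint8)
# Scale byte 128 means a shared scale of 2**1 = 2.
print(dequant_mxfp4(block, 128)[:4])  # values: 1, -1, 12, -12
```

Because the scale is a pure power of two, dequantization is a table lookup plus an exponent shift, which is what makes decoding cheap enough to do on the fly while streaming layers.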
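The speculative-decoding speedup comes from letting a cheap draft model propose several tokens that the full model then verifies, keeping the longest agreeing prefix. A greedy illustrative version with toy callable "models" (the engine's real acceptance rule and batched verification are more involved):

```python
def speculative_step(target, draft, prefix, k=4):
    """One greedy speculative-decoding step. `target` and `draft` each
    map a token sequence to the next token."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2. Target model checks each position (in practice: one batched pass).
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        if target(ctx) != t:
            break  # reject the first mismatch and everything after it
        accepted.append(t)
        ctx.append(t)
    # 3. Always emit one token from the target, so progress is guaranteed.
    accepted.append(target(ctx))
    return accepted

# Toy models: the target emits last token + 1; the draft agrees until
# the value reaches 2, then guesses wrong.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + 1 if seq[-1] < 2 else 0
print(speculative_step(target, draft, [0]))  # -> [1, 2, 3]
```

When the draft agrees often, each target pass yields several tokens instead of one, which is where the quoted 2-3x comes from; in the worst case the step still emits exactly one correct token.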

License

Apache 2.0 — see LICENSE.
