⚡ Real-time inference optimizer for LLMs: faster generation, smarter decoding, and live observability 🚀✨
Kairu (流れる): to flow, to stream.
Inference should be fluid, not blocked by latency, inefficiency, or opaque performance.
Kairu wraps any HuggingFace model and adds (see the configuration sketch after the list):

- 🦅 Speculative decoding (EAGLE-style)
- ⏩ Dynamic early exit
- 💸 Token budget enforcement
- 📊 Live dashboard:
  - tokens/sec
  - latency
  - quality tradeoffs
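A minimal sketch of what turning these knobs could look like, assuming a keyword-argument API on `wrap_model` (the flag names below are illustrative assumptions, not Kairu's confirmed interface):

```python
from kairu import wrap_model  # entry point shown in the quickstart below

# All keyword arguments here are hypothetical; they mirror the feature
# list above rather than a documented Kairu API.
model = wrap_model(
    "your-model",               # any HuggingFace model id
    speculative_decoding=True,  # EAGLE-style draft-and-verify generation
    early_exit=True,            # stop forward passes early when confident
    token_budget=256,           # hard cap on generated tokens per request
    dashboard=True,             # expose live tokens/sec and latency metrics
)
```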
Speculative decoding works, but:
- locked inside heavy frameworks (vLLM, etc.)
- hard to experiment with
- no lightweight tooling
- no built-in observability
Kairu is a lightweight playground for exploring:

- Speculative decoding internals (EAGLE, Medusa; see the toy sketch after this list)
- KV cache management
- Streaming inference
- Performance optimization
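To make the first item concrete, here is a toy, self-contained sketch of the draft-and-verify loop at the core of speculative decoding. The two "models" are random stand-ins and acceptance is greedy token matching; real EAGLE/Medusa implementations verify all draft positions in a single batched forward pass of the target model and use a probabilistic acceptance rule.

```python
import random

random.seed(0)
VOCAB = list(range(100))

def draft_next(ctx):
    """Cheap draft model (toy stand-in for a small LM)."""
    return random.choice(VOCAB)

def target_next(ctx):
    """Expensive target model (toy stand-in for the full LM)."""
    return random.choice(VOCAB)

def speculative_step(ctx, k=4):
    """Draft k tokens cheaply, then verify them against the target.

    Accept the longest prefix where draft and target agree; on the first
    mismatch, keep the target's token instead. If every draft token is
    accepted, append one bonus target token, so each step always makes
    progress.
    """
    drafted = []
    for _ in range(k):
        drafted.append(draft_next(ctx + drafted))

    accepted = []
    for tok in drafted:
        expected = target_next(ctx + accepted)
        if tok == expected:
            accepted.append(tok)        # draft verified: a "free" token
        else:
            accepted.append(expected)   # first mismatch: take target token
            break
    else:
        accepted.append(target_next(ctx + accepted))  # bonus token

    return accepted

context = [1, 2, 3]
print(speculative_step(context))  # tokens accepted in one step
```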
```bash
pip install kairu
```

```python
from kairu import wrap_model

model = wrap_model("your-model")
model.generate("Hello world")
```

Make LLM inference fast, transparent, and controllable.
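For streaming with a token budget, usage might look like the following sketch; `stream=True` and `max_new_tokens` are assumed parameter names, not confirmed Kairu API:

```python
# Hypothetical streaming usage; parameter names are assumptions.
for token in model.generate("Hello world", stream=True, max_new_tokens=64):
    print(token, end="", flush=True)
```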