Skip to content

sushildalavi/nanoserve

Repository files navigation

nanoserve

nanoserve is a local LLM serving engine for Apple Silicon.

What it does

  • Serves an OpenAI-compatible chat API
  • Supports continuous batching with FCFS and synchronized admission
  • Reuses KV cache prefixes when prompts overlap
  • Includes fp16, INT8, INT4, and MLX paths
  • Exposes Prometheus metrics and a Grafana dashboard
  • Ships benchmark and evaluation harnesses

Architecture

flowchart LR
    A[Chat request] --> B[Scheduler]
    B --> C[Batch builder]
    C --> D[Model engine]
    D --> E[Streaming response]
    D --> F[(Metrics / eval artifacts)]
Loading

What’s included

  • API server
  • Scheduler and engine
  • Prefix cache
  • Quantization paths
  • Metrics and ops dashboards
  • Benchmark and eval scripts

Quick start

make dev-install
make models
make baseline-hf
make parity
make serve
make observe
make eval

Notes

  • The project is built around local MPS inference on Mac hardware.
  • Continuous batching helps when the workload and admission policy line up.
  • Quantization is useful when the runtime has native support for it; on MPS that depends on the path.

Portfolio Proof

About

OpenAI-compatible LLM serving engine built from scratch on Apple Silicon. Continuous batching, paged KV-cache, prefix caching, INT8/INT4 quantization, Prometheus/Grafana observability. Benchmarked against llama.cpp and HuggingFace baselines.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors