Inferno is a lightweight inference server that hosts multiple machine learning models with dynamic batching, configurable concurrency, and structured observability. It is designed for production deployments where predictable input validation, consistent error handling, and clear operational controls are critical.
- Multi-model hosting: Serve multiple models and versions side by side, with lazy loading and hot-swap support.
- Dynamic batching: Latency- and size-aware batching with per-model thresholds and async batching queues.
- Strict validation: Pydantic-powered schema checks and signature-aware validation guard against malformed requests and surface helpful 422 errors.
- Flexible codecs: JSON for small payloads and base64 for contiguous binary data, with dtype enforcement and shape checks.
- Observability first: Structured logging, Prometheus metrics, and readiness/liveness endpoints for orchestration systems.
- Pluggable runtimes: TorchScript models or user-defined Python entry points that conform to a simple signature contract.
- Shared state (optional): Coordinate active and loaded model metadata via Redis when running multiple replicas and let a background reconciler in each pod keep local runners aligned with the shared state.
- Getting Started
- Configuration
- Model Repository Layout
- Prediction API
- Administrative APIs
- Health & Metrics
- Batching & Concurrency
- Logging
- Deployment Notes
- Development Workflow
- API Reference
- Deployment Guide
Prerequisites:

- Python 3.12+
- `uv` or `pip` for dependency management
- PyTorch (installed automatically via dependencies, but may require system-specific wheels)

Install into a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e .
```

Tip: You can also use `uv` with `uv sync`, which respects `uv.lock`.

Start the server:

```bash
uvicorn app.main:app \
  --host 0.0.0.0 \
  --port 8000
```

The server lazily discovers models under `models/` and preloads active versions on startup (configurable via environment variables).

Run the tests:

```bash
PYTHONPATH=. pytest
```

All tests rely on the in-repo sample model stored under `models/python/1`.
Runtime configuration is managed through environment variables interpreted by app.config.Settings:
| Variable | Default | Description |
|---|---|---|
| `DEVICE` | `auto` | Selects `cpu`, `cuda`, or auto-detection using `torch.cuda.is_available()`. |
| `REPO_ROOT` | `models` | Root directory containing model packages. |
| `PRELOAD` | `True` | Eagerly load active versions at startup when `True`; lazy load otherwise. |
| `UVICORN_ACCESS_LOG` | `False` | Enables Uvicorn access logging when set to `True`. |
| `METRICS_BUCKETS` | `0.005,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28` | Prometheus latency histogram buckets (seconds). |
| `REDIS_URL` | unset | When set, enables a Redis-backed state store (e.g., `redis://hostname:6379/0`). |
| `REDIS_PREFIX` | `inferno:model_state` | Key prefix used for Redis state entries. |
| `STATE_RECONCILE_ENABLED` | `True` | Toggles the background reconciler that syncs runners with Redis state. |
| `STATE_RECONCILE_INTERVAL_S` | `5.0` | Seconds between reconciliation sweeps when Redis is enabled. |
Set variables using your process manager (e.g., export DEVICE=cpu). Unknown environment variables are ignored.
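For reference, the table above maps onto a settings object along these lines. This is only an illustrative sketch: it assumes a pydantic-settings style `BaseSettings`, which may differ from the real `app.config.Settings`, and it keeps `METRICS_BUCKETS` as a raw string rather than a parsed list.

```python
# Illustrative sketch only -- not the actual app.config.Settings implementation.
from pydantic_settings import BaseSettings


class SettingsSketch(BaseSettings):
    """Environment-driven configuration mirroring the table above."""

    DEVICE: str = "auto"
    REPO_ROOT: str = "models"
    PRELOAD: bool = True
    UVICORN_ACCESS_LOG: bool = False
    METRICS_BUCKETS: str = "0.005,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28"
    REDIS_URL: str | None = None
    REDIS_PREFIX: str = "inferno:model_state"
    STATE_RECONCILE_ENABLED: bool = True
    STATE_RECONCILE_INTERVAL_S: float = 5.0


settings = SettingsSketch()  # reads only the declared variables; others are ignored
buckets = [float(b) for b in settings.METRICS_BUCKETS.split(",")]
```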
Models live under `models/<model-name>/<version>/` alongside a `model.yaml` that describes runtime behavior.
```
models/
  my_classifier/
    1/
      model.ts      # TorchScript file or Python module
      model.yaml    # schema, batching, concurrency, codecs
    2/
      model.ts
      model.yaml
```
For a field-by-field reference, including validation rules and advanced examples, see the Model Packaging Guide.
Key elements to provide:
- `name`/`version`: identify the model and the specific revision being shipped.
- `format`: choose `torchscript` for serialized modules or `python` with an `entrypoint` for custom code.
- `signature`: declare required and optional tensors (names, dtypes, and shapes) for both inputs and outputs.
- `batching` and `concurrency` (optional): tune latency versus throughput per model.
- `codecs` (optional): select the default encoding (`json` or `base64`) for responses.
Minimal TorchScript example:
```yaml
name: my_classifier
version: 1
format: torchscript
signature:
  inputs:
    - name: tokens
      dtype: int64
      shape: [batch, seq]
  outputs:
    - name: probs
      dtype: float32
      shape: [batch, num_classes]
```

Augment with batching, concurrency, or codec overrides as needed. Python models require an `entrypoint` such as `model:build` that returns an eval()'d callable.
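For `python`-format models, the entrypoint contract might be satisfied by a module like the sketch below. The class, sizes, and the assumption that the runner calls the returned object with the declared input tensors are illustrative; only the `model:build` naming and the requirement to return an eval()'d callable come from the packaging rules above.

```python
# model.py -- hypothetical python-format package; referenced from model.yaml as
# "entrypoint: model:build". Returns a callable already switched to eval mode.
import torch


class TinyClassifier(torch.nn.Module):
    """Toy classifier matching a [batch, seq] int64 -> [batch, num_classes] float32 signature."""

    def __init__(self, vocab_size: int = 1000, num_classes: int = 4) -> None:
        super().__init__()
        self.embed = torch.nn.EmbeddingBag(vocab_size, 32)
        self.head = torch.nn.Linear(32, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.head(self.embed(tokens)), dim=-1)


def build() -> torch.nn.Module:
    model = TinyClassifier()
    model.eval()  # the server expects an eval()'d callable
    return model
```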
For complete request and response schemas, consult the API Reference.
POST /v1/models/{name}:predict
```json
{
  "inputs": {
    "tokens": {
      "dtype": "float32",
      "shape": [2, 4],
      "data": [[1, 2, 3, 4], [4, 3, 2, 1]]
    },
    "mask": {
      "dtype": "float32",
      "shape": [2, 4],
      "base64": "<base64 payload>"
    }
  },
  "outputs": ["probs", "total"],
  "timeout_ms": 2000
}
```

- `inputs`: Map of tensor names to payloads. Each payload requires `dtype`, `shape`, and either `data` (nested lists) or `base64` (contiguous binary). Both cannot be provided simultaneously.
- `outputs` (optional): Restricts returned tensors; defaults to all declared outputs.
- `timeout_ms` (optional): Overrides the per-request wait limit for batching; must be positive.
```json
{
  "outputs": {
    "probs": {"dtype": "float32", "shape": [2, 4], "data": [[...], [...]]},
    "total": {"dtype": "float32", "shape": [1, 4], "data": [[...]]}
  }
}
```

Outputs mirror the codec used by the model runner (defaults to the model-level `codecs.default`).
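As a worked example of both payload codecs, a client call might look like the sketch below. The base URL and model name are placeholders, `numpy` and `requests` are assumed client-side dependencies, and the response parsing assumes the model's default codec returns `data` lists; the request body mirrors the JSON shown above.

```python
# Hypothetical client for POST /v1/models/{name}:predict. The base URL and the
# model name "my_model" are placeholders for this sketch.
import base64

import numpy as np
import requests

mask = np.ones((2, 4), dtype=np.float32)

payload = {
    "inputs": {
        "tokens": {
            "dtype": "float32",
            "shape": [2, 4],
            "data": [[1, 2, 3, 4], [4, 3, 2, 1]],  # JSON codec: nested lists
        },
        "mask": {
            "dtype": "float32",
            "shape": [2, 4],
            # base64 codec: the array's raw contiguous bytes
            "base64": base64.b64encode(mask.tobytes()).decode("ascii"),
        },
    },
    "outputs": ["probs"],
    "timeout_ms": 2000,
}

resp = requests.post(
    "http://127.0.0.1:8000/v1/models/my_model:predict", json=payload, timeout=5
)
resp.raise_for_status()
# Assumes a json-codec response containing nested "data" lists.
probs = np.asarray(resp.json()["outputs"]["probs"]["data"], dtype=np.float32)
print(probs.shape)
```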
- Missing required inputs, unexpected tensor names, dtype/shape mismatches, or invalid output selections return `422 Unprocessable Entity` with descriptive messages (e.g., `"Missing required input(s): tokens. Expected inputs: mask, tokens"`).
- Unsupported dtypes, malformed base64 payloads, or shape/data inconsistencies are rejected during request parsing.
- Unloaded or unknown models yield `404` ("model not found or not loaded").
For payload schemas and error codes, refer to the API Reference.
- `GET /v1/models` — Unified view of active and (optionally) loaded versions per model.
- `GET /v1/models/{name}` — Model metadata and full signature (triggers lazy load if needed).
- `GET /v1/models/{name}/active` — Active version metadata (name, version, loaded state).
- `POST /v1/models:load` — Body `{ "name": "model", "version": 3 }` loads a model and registers a runner.
- `POST /v1/models:unload` — Body `{ "name": "model", "version": 2 }` unloads the runner and model from memory.
These endpoints are synchronous and return { "loaded": true } / { "unloaded": true } on success.
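For instance, a load/inspect/unload round trip could be scripted like this sketch (the base URL and model identifiers are placeholders):

```python
# Hypothetical admin round trip; base URL and model identifiers are placeholders.
import requests

BASE = "http://127.0.0.1:8000"

# Load version 2 of "my_model" and register a runner.
r = requests.post(f"{BASE}/v1/models:load", json={"name": "my_model", "version": 2}, timeout=60)
r.raise_for_status()
assert r.json().get("loaded") is True

# Inspect the active version's metadata.
print(requests.get(f"{BASE}/v1/models/my_model/active", timeout=5).json())

# Unload it when finished.
r = requests.post(f"{BASE}/v1/models:unload", json={"name": "my_model", "version": 2}, timeout=60)
r.raise_for_status()
assert r.json().get("unloaded") is True
```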
Additional details and sample responses are listed in the API Reference.
- `GET /health/live` — Process liveness (always `200` when the service is up).
- `GET /health/ready` — Indicates whether runners have started successfully (`READY` gauge set to `1`).
- `GET /metrics` — Prometheus exposition including:
  - `infer_requests_total` with labels (`model`, `route`)
  - `infer_latency_seconds` histogram per model
  - `infer_batch_size` distribution
  - `infer_exceptions_total` labeled by exception class
  - `model_loaded` gauge per model/version
Scrape /metrics from your observability stack to track throughput, latency, and error rates.
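If you gate deployments or load tests on readiness, a simple poll against `/health/ready` is enough; the base URL and deadline in this sketch are placeholders.

```python
# Hypothetical readiness gate; base URL and deadline are placeholders.
import time

import requests


def wait_until_ready(base_url: str = "http://127.0.0.1:8000", deadline_s: float = 60.0) -> None:
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        try:
            if requests.get(f"{base_url}/health/ready", timeout=2).status_code == 200:
                return  # runners have started successfully
        except requests.RequestException:
            pass  # server not accepting connections yet
        time.sleep(1.0)
    raise TimeoutError("server did not become ready in time")


wait_until_ready()
```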
- Each model has an `AsyncBatcher` that gathers requests until either `max_batch_size` is reached or `max_batch_latency_ms` elapses.
- A per-model semaphore (`max_concurrency`) limits concurrent execution to protect GPU/CPU resources.
- Timeouts supplied via `timeout_ms` bound how long a request waits in the batching queue before failing with a `504` (handled by `asyncio.wait_for`).
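The batching behaviour described in the list above can be sketched roughly as follows. This is an illustrative simplification, not the actual `AsyncBatcher`; only the knob names (`max_batch_size`, `max_batch_latency_ms`, `max_concurrency`, `timeout_ms`) mirror the configuration described here.

```python
# Illustrative sketch of a size- and latency-aware batcher; NOT the real AsyncBatcher.
import asyncio
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class _Pending:
    payload: Any
    future: asyncio.Future = field(default_factory=asyncio.Future)


class SketchBatcher:
    def __init__(
        self,
        run_batch: Callable[[list[Any]], list[Any]],
        max_batch_size: int = 8,
        max_batch_latency_ms: float = 10.0,
        max_concurrency: int = 1,
    ) -> None:
        self._run_batch = run_batch
        self._queue: asyncio.Queue[_Pending] = asyncio.Queue()
        self._max_size = max_batch_size
        self._max_wait = max_batch_latency_ms / 1000.0
        self._sem = asyncio.Semaphore(max_concurrency)  # per-model concurrency cap
        self._worker = asyncio.create_task(self._loop())

    async def submit(self, payload: Any, timeout_ms: float | None = None) -> Any:
        item = _Pending(payload)
        await self._queue.put(item)
        timeout = timeout_ms / 1000.0 if timeout_ms else None
        # A request that waits too long times out here (surfaced as a 504 upstream).
        return await asyncio.wait_for(item.future, timeout)

    async def _loop(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            first = await self._queue.get()
            batch = [first]
            deadline = loop.time() + self._max_wait
            # Fill the batch until it is full or the latency budget elapses.
            while len(batch) < self._max_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            asyncio.create_task(self._execute(batch))

    async def _execute(self, batch: list[_Pending]) -> None:
        async with self._sem:  # at most max_concurrency batches execute at once
            results = await asyncio.to_thread(self._run_batch, [p.payload for p in batch])
        for pending, result in zip(batch, results):
            if not pending.future.done():  # skip requests that already timed out
                pending.future.set_result(result)
```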
Logging is configured through `app.logging.configure_logging()`, which emits JSON-structured logs with the following fields:
- `ts` — ISO8601 timestamp
- `level` — Log level
- `logger` — Logger name (`server`, `runtime`, `httpx`, etc.)
- `msg` — Message text

Additional request context (request id, route, batch size) can be layered in downstream processors.
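As an illustration only (not the actual `app.logging` implementation), a formatter emitting these fields could look like this:

```python
# Illustrative JSON log formatter; not the actual app.logging implementation.
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
logging.getLogger("server").info("model loaded")
```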
Integrate with your log aggregation platform by redirecting stdout/stderr or by running Gunicorn with Uvicorn workers (the gunicorn dependency is included).
For end-to-end deployment patterns—including Docker builds, Gunicorn process management, Kubernetes rollouts, and observability hooks—see the Deployment Guide. Key reminders:
- Keep model artifacts on durable storage and mount them read-only in production.
- Configure `DEVICE`, `REPO_ROOT`, and related environment variables per environment.
- Use the provided health endpoints (`/health/live`, `/health/ready`) for liveness/readiness checks.
- Scrape `/metrics` and stream JSON logs to integrate with your monitoring stack.
Install Locust (pip install locust) and use scripts/load_test.py as the Locust file to stress the prediction endpoint.
```bash
locust -f scripts/load_test.py \
  --headless \
  -u 50 \
  -r 10 \
  --run-time 2m \
  --host http://127.0.0.1:8000 \
  --model-name torchscript \
  --load-model \
  --model-version 2 \
  --warmup 5 \
  --verify-response
```

Script-specific options:
- `--payload-file` points to a JSON file if you need a custom request body (defaults to the sample model payload under `models/`).
- `--request-timeout` overrides the per-request timeout (seconds).
- `--model-name`, `--model-version`, `--load-model`, and `--warmup` mirror the standalone script behaviour for staging models before traffic.
- `--verify-response` asserts that the server returns at least one output tensor per request.
- Create a virtual environment (`python -m venv .venv && source .venv/bin/activate`).
- Install dependencies (`pip install -e .[dev]` if you maintain an optional dev extra).
- Run formatting and linting (add tools such as `ruff` or `black` as needed).
- Execute the test suite (`PYTHONPATH=. pytest`).
- Add or update models under `models/` and extend the test coverage for new endpoints or validation rules.
Contributions should include updated tests and documentation when behavior changes. For large modifications, describe the expected API changes and operational impact in your pull request.
- Configure `REDIS_URL` when running multiple replicas so they share active/loaded state. Without it, each pod keeps state in-process and must receive administrative calls individually.
- With Redis configured, pods run a background reconciler that periodically reads the shared state and auto-loads or unloads runners to match the desired versions.
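Conceptually, each replica's reconciliation sweep behaves roughly like the sketch below. The key layout under `REDIS_PREFIX`, the JSON state schema, and the `load_runner` helper are all assumptions for illustration, not Inferno's actual implementation.

```python
# Conceptual sketch of a reconciliation sweep; the Redis key layout, state
# schema, and load_runner helper are hypothetical, not the real implementation.
import asyncio
import json

import redis.asyncio as redis

PREFIX = "inferno:model_state"   # mirrors the REDIS_PREFIX default
INTERVAL_S = 5.0                 # mirrors the STATE_RECONCILE_INTERVAL_S default


async def load_runner(name: str, version: int) -> object:
    """Hypothetical stand-in for the server's model-loading path."""
    return f"runner:{name}:{version}"


async def reconcile_forever(r: redis.Redis, local_runners: dict[tuple[str, int], object]) -> None:
    while True:
        # Read the desired state shared by all replicas.
        async for key in r.scan_iter(match=f"{PREFIX}:*"):
            desired = json.loads(await r.get(key))   # e.g. {"name": "...", "version": 2}
            target = (desired["name"], desired["version"])
            if target not in local_runners:
                local_runners[target] = await load_runner(*target)
        # A real reconciler would also unload runners that are no longer desired.
        await asyncio.sleep(INTERVAL_S)
```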
