Inferno

Inferno is a lightweight inference server that hosts multiple machine learning models with dynamic batching, configurable concurrency, and structured observability. It is designed for production deployments where predictable input validation, consistent error handling, and clear operational controls are critical.

Features

  • Multi-model hosting: Serve multiple models and versions side by side, with lazy loading and hot-swap support.
  • Dynamic batching: Latency- and size-aware batching with per-model thresholds and async batching queues.
  • Strict validation: Pydantic-powered schema checks and signature-aware validation guard against malformed requests and surface helpful 422 errors.
  • Flexible codecs: JSON for small payloads and base64 for contiguous binary data, with dtype enforcement and shape checks.
  • Observability first: Structured logging, Prometheus metrics, and readiness/liveness endpoints for orchestration systems.
  • Pluggable runtimes: TorchScript models or user-defined Python entry points that conform to a simple signature contract.
  • Shared state (optional): Coordinate active and loaded model metadata via Redis when running multiple replicas and let a background reconciler in each pod keep local runners aligned with the shared state.

Getting Started

Prerequisites

  • Python 3.12+
  • uv or pip for dependency management
  • PyTorch (installed automatically via dependencies but may require system-specific wheels)

Installation

python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e .

Tip: You can also use uv with uv sync, which respects uv.lock.

Run the Server

uvicorn app.main:app \
  --host 0.0.0.0 \
  --port 8000

The server lazily discovers models under models/ and preloads active versions on startup (configurable via environment variables).
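
For a quick sanity check once the server is up, a minimal sketch using the httpx client (any HTTP client works; host and port match the command above):

import httpx

# Readiness returns 200 once runners have started successfully.
resp = httpx.get("http://127.0.0.1:8000/health/ready", timeout=5.0)
print(resp.status_code)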

Run Tests

PYTHONPATH=. pytest

All tests rely on the in-repo sample model stored under models/python/1.

Configuration

Runtime configuration is managed through environment variables interpreted by app.config.Settings:

  • DEVICE (default: auto) — Selects cpu, cuda, or auto-detection using torch.cuda.is_available().
  • REPO_ROOT (default: models) — Root directory containing model packages.
  • PRELOAD (default: True) — Eagerly load active versions at startup when True; lazy load otherwise.
  • UVICORN_ACCESS_LOG (default: False) — Enables Uvicorn access logging when set to True.
  • METRICS_BUCKETS (default: 0.005,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28) — Prometheus latency histogram buckets (seconds).
  • REDIS_URL (default: unset) — When set, enables a Redis-backed state store (e.g., redis://hostname:6379/0).
  • REDIS_PREFIX (default: inferno:model_state) — Key prefix used for Redis state entries.
  • STATE_RECONCILE_ENABLED (default: True) — Toggles the background reconciler that syncs runners with Redis state.
  • STATE_RECONCILE_INTERVAL_S (default: 5.0) — Seconds between reconciliation sweeps when Redis is enabled.

Set variables using your process manager (e.g., export DEVICE=cpu). Unknown environment variables are ignored.

Model Repository Layout

Models live under models/<model-name>/<version>/ alongside a model.yaml that describes runtime behavior.

models/
  my_classifier/
    1/
      model.ts            # TorchScript file or Python module
      model.yaml          # schema, batching, concurrency, codecs
    2/
      model.ts
      model.yaml

Model Configuration (model.yaml)

For a field-by-field reference, including validation rules and advanced examples, see the Model Packaging Guide.

Key elements to provide:

  • name / version: identify the model and the specific revision being shipped.
  • format: choose torchscript for serialized modules or python with an entrypoint for custom code.
  • signature: declare required and optional tensors (names, dtypes, and shapes) for both inputs and outputs.
  • batching and concurrency (optional): tune latency versus throughput per model.
  • codecs (optional): select the default encoding (json or base64) for responses.

Minimal TorchScript example:

name: my_classifier
version: 1
format: torchscript
signature:
  inputs:
    - name: tokens
      dtype: int64
      shape: [batch, seq]
  outputs:
    - name: probs
      dtype: float32
      shape: [batch, num_classes]

Augment with batching, concurrency, or codec overrides as needed. Python models require an entrypoint such as model:build that returns a callable already switched to eval() mode.
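
For the python format, here is a minimal sketch of what such an entrypoint module might look like, assuming entrypoint: model:build in model.yaml resolves to a build() function inside model.py. The class name and layer sizes are illustrative, not the bundled sample model.

# model.py — illustrative only; shapes follow the signature example above
import torch


class TinyClassifier(torch.nn.Module):
    def __init__(self, seq_len: int = 4, num_classes: int = 4):
        super().__init__()
        self.linear = torch.nn.Linear(seq_len, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [batch, seq] -> probs: [batch, num_classes]
        return torch.softmax(self.linear(tokens.float()), dim=-1)


def build() -> torch.nn.Module:
    # Referenced from model.yaml as entrypoint: model:build;
    # returns a callable already switched to eval mode.
    return TinyClassifier().eval()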

Prediction API

For complete request and response schemas, consult the API Reference.

POST /v1/models/{name}:predict

Request Payload

{
  "inputs": {
    "tokens": {
      "dtype": "float32",
      "shape": [2, 4],
      "data": [[1, 2, 3, 4], [4, 3, 2, 1]]
    },
    "mask": {
      "dtype": "float32",
      "shape": [2, 4],
      "base64": "<base64 payload>"
    }
  },
  "outputs": ["probs", "total"],
  "timeout_ms": 2000
}
  • inputs: Map of tensor names to payloads. Each payload requires dtype, shape, and either data (nested lists) or base64 (contiguous binary); the two are mutually exclusive. A client-side sketch follows this list.
  • outputs (optional): Restricts returned tensors; defaults to all declared outputs.
  • timeout_ms (optional): Overrides the per-request wait limit for batching; must be positive.
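
As noted above, a hedged client-side sketch of building and sending this payload, assuming httpx and numpy are installed (the model name my_classifier and tensor names mirror the examples in this README):

import base64

import httpx
import numpy as np

tokens = np.array([[1, 2, 3, 4], [4, 3, 2, 1]], dtype=np.float32)
mask = np.ones((2, 4), dtype=np.float32)

payload = {
    "inputs": {
        # JSON codec: nested lists under "data"
        "tokens": {"dtype": "float32", "shape": list(tokens.shape), "data": tokens.tolist()},
        # base64 codec: the tensor's contiguous raw bytes, base64-encoded
        "mask": {
            "dtype": "float32",
            "shape": list(mask.shape),
            "base64": base64.b64encode(np.ascontiguousarray(mask).tobytes()).decode("ascii"),
        },
    },
    "outputs": ["probs"],
    "timeout_ms": 2000,
}

resp = httpx.post(
    "http://127.0.0.1:8000/v1/models/my_classifier:predict",
    json=payload,
    timeout=10.0,
)
resp.raise_for_status()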

Response Payload

{
  "outputs": {
    "probs": {"dtype": "float32", "shape": [2, 4], "data": [[...], [...]]},
    "total": {"dtype": "float32", "shape": [1, 4], "data": [[...]]}
  }
}

Outputs mirror the codec used by the model runner (defaults to the model-level codecs.default).
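
A short sketch of turning a JSON-encoded output tensor back into an array, continuing the client example above (numpy assumed):

out = resp.json()["outputs"]["probs"]
probs = np.asarray(out["data"], dtype=out["dtype"]).reshape(out["shape"])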

Validation & Errors

  • Missing required inputs, unexpected tensor names, dtype/shape mismatches, or invalid output selections return 422 Unprocessable Entity with descriptive messages (e.g., "Missing required input(s): tokens. Expected inputs: mask, tokens").
  • Unsupported dtypes, malformed base64 payloads, or shape/data inconsistencies are rejected during request parsing.
  • Unloaded or unknown models yield 404 ("model not found or not loaded").

Administrative APIs

For payload schemas and error codes, refer to the API Reference.

  • GET /v1/models — Unified view of active and (optionally) loaded versions per model.
  • GET /v1/models/{name} — Model metadata and full signature (triggers lazy load if needed).
  • GET /v1/models/{name}/active — Active version metadata (name, version, loaded state).
  • POST /v1/models:load — Body { "name": "model", "version": 3 } loads a model and registers a runner.
  • POST /v1/models:unload — Body { "name": "model", "version": 2 } unloads the runner and model from memory.

These endpoints are synchronous and return { "loaded": true } / { "unloaded": true } on success.
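
A hedged sketch of driving these endpoints from Python (httpx assumed; the model name and versions are placeholders):

import httpx

BASE = "http://127.0.0.1:8000"

# Load version 2 of a model and register its runner.
r = httpx.post(f"{BASE}/v1/models:load", json={"name": "my_classifier", "version": 2})
assert r.json().get("loaded") is True

# Inspect the unified view, then unload the previous version.
print(httpx.get(f"{BASE}/v1/models").json())
httpx.post(f"{BASE}/v1/models:unload", json={"name": "my_classifier", "version": 1})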

Health & Metrics

Additional details and sample responses are listed in the API Reference.

  • GET /health/live — Process liveness (always 200 when the service is up).
  • GET /health/ready — Indicates whether runners have started successfully (READY gauge set to 1).
  • GET /metrics — Prometheus exposition including:
    • infer_requests_total with labels (model, route)
    • infer_latency_seconds histogram per model
    • infer_batch_size distribution
    • infer_exceptions_total labeled by exception class
    • model_loaded gauge per model/version

Scrape /metrics from your observability stack to track throughput, latency, and error rates.
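
For a quick look outside a full Prometheus setup, a small sketch that fetches the exposition text and prints only the Inferno-specific samples (httpx assumed; metric names as listed above):

import httpx

text = httpx.get("http://127.0.0.1:8000/metrics").text
for line in text.splitlines():
    # Sample lines start with the metric name; HELP/TYPE comments start with '#'.
    if line.startswith(("infer_", "model_loaded")):
        print(line)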

Batching & Concurrency

  • Each model has an AsyncBatcher that gathers requests until either max_batch_size is reached or max_batch_latency_ms elapses (the general pattern is sketched after this list).
  • A per-model semaphore (max_concurrency) limits concurrent execution to protect GPU/CPU resources.
  • Timeouts supplied via timeout_ms bound how long a request waits in the batching queue before failing with a 504 (handled by asyncio.wait_for).
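
The sketch below illustrates the general size-or-latency batching pattern described in this list. It is not Inferno's AsyncBatcher; the class name, parameters, and defaults are assumptions for illustration only.

import asyncio


class SketchBatcher:
    def __init__(self, runner, max_batch_size=8, max_batch_latency_ms=10, max_concurrency=2):
        self.runner = runner                      # callable: list of requests -> list of results
        self.max_batch_size = max_batch_size
        self.max_latency = max_batch_latency_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()
        self.sem = asyncio.Semaphore(max_concurrency)

    async def submit(self, request, timeout_ms=None):
        # Each caller gets a future that is resolved when its batch finishes.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((request, fut))
        timeout = timeout_ms / 1000.0 if timeout_ms else None
        return await asyncio.wait_for(fut, timeout)   # TimeoutError maps to a 504

    async def run(self):
        while True:
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + self.max_latency
            # Grow the batch until it is full or the latency budget elapses.
            while len(batch) < self.max_batch_size:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            async with self.sem:                      # per-model concurrency limit
                results = self.runner([req for req, _ in batch])
            for (_, fut), result in zip(batch, results):
                if not fut.done():
                    fut.set_result(result)

In practice such a consumer loop would be started as a background task per model (e.g., asyncio.create_task(batcher.run())) when the runner is registered.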

Logging

Logging is configured through app.logging.configure_logging() which emits JSON-structured logs with the following fields:

  • ts — ISO8601 timestamp
  • level — Log level
  • logger — Logger name (server, runtime, httpx, etc.)
  • msg — Message text

Additional request context (request id, route, batch size) can be layered in downstream processors.
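
For illustration only, a single log line might look like the following (field values are made up):

{"ts": "2024-01-01T12:00:00.000Z", "level": "INFO", "logger": "server", "msg": "prediction completed"}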

Integrate with your log aggregation platform by redirecting stdout/stderr or by running Uvicorn workers under Gunicorn (the gunicorn dependency is included).

Deployment Notes

For end-to-end deployment patterns—including Docker builds, Gunicorn process management, Kubernetes rollouts, and observability hooks—see the Deployment Guide. Key reminders:

  • Keep model artifacts on durable storage and mount them read-only in production.
  • Configure DEVICE, REPO_ROOT, and related environment variables per environment.
  • Use the provided health endpoints (/health/live, /health/ready) for liveness/readiness checks.
  • Scrape /metrics and stream JSON logs to integrate with your monitoring stack.

Load Testing

Install Locust (pip install locust) and use scripts/load_test.py as the Locust file to stress the prediction endpoint.

locust -f scripts/load_test.py \
  --headless \
  -u 50 \
  -r 10 \
  --run-time 2m \
  --host http://127.0.0.1:8000 \
  --model-name torchscript \
  --load-model \
  --model-version 2 \
  --warmup 5 \
  --verify-response

Script-specific options:

  • --payload-file points to a JSON file if you need a custom request body (defaults to the sample model payload under models/).
  • --request-timeout overrides the per-request timeout (seconds).
  • --model-name, --model-version, --load-model, and --warmup mirror the standalone script behaviour for staging models before traffic.
  • --verify-response asserts that the server returns at least one output tensor per request.

Development Workflow

  1. Create a virtual environment (python -m venv .venv && source .venv/bin/activate).
  2. Install dependencies (pip install -e .[dev] if you maintain an optional dev extra).
  3. Run formatting and linting (add tools such as ruff or black as needed).
  4. Execute the test suite (PYTHONPATH=. pytest).
  5. Add or update models under models/ and extend the test coverage for new endpoints or validation rules.

Contributions should include updated tests and documentation when behavior changes. For large modifications, describe the expected API changes and operational impact in your pull request.

Shared State (Redis)

  • Configure REDIS_URL when running multiple replicas so they share active/loaded state. Without it, each pod keeps state in-process and must receive administrative calls individually.
  • With Redis configured, pods run a background reconciler that periodically reads the shared state and auto-loads or unloads runners to match the desired versions.
