Inferno

Inferno is a lightweight inference server that hosts multiple machine learning models with dynamic batching, configurable concurrency, and structured observability. It is designed for production deployments where predictable input validation, consistent error handling, and clear operational controls are critical.

Features

  • Multi-model hosting: Serve multiple models and versions side by side, with lazy loading and hot-swap support.
  • Dynamic batching: Latency- and size-aware batching with per-model thresholds and async batching queues.
  • Strict validation: Pydantic-powered schema checks and signature-aware validation guard against malformed requests and surface helpful 422 errors.
  • Flexible codecs: JSON for small payloads and base64 for contiguous binary data, with dtype enforcement and shape checks.
  • Observability first: Structured logging, Prometheus metrics, and readiness/liveness endpoints for orchestration systems.
  • Pluggable runtimes: TorchScript models or user-defined Python entry points that conform to a simple signature contract.
  • Shared state (optional): Coordinate active and loaded model metadata via Redis when running multiple replicas and let a background reconciler in each pod keep local runners aligned with the shared state.

Getting Started

Prerequisites

  • Python 3.12+
  • uv or pip for dependency management
  • PyTorch (installed automatically via dependencies but may require system-specific wheels)

Installation

python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e .

Tip: You can also use uv with uv sync, which respects uv.lock.

Run the Server

uvicorn app.main:app \
  --host 0.0.0.0 \
  --port 8000

The server lazily discovers models under models/ and preloads active versions on startup (configurable via environment variables).
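
For a quick sanity check once the server is up, a minimal sketch using the httpx client (any HTTP client works; host and port match the command above):

import httpx

# Readiness returns 200 once runners have started successfully.
resp = httpx.get("http://127.0.0.1:8000/health/ready", timeout=5.0)
print(resp.status_code)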

Run Tests

PYTHONPATH=. pytest

All tests rely on the in-repo sample model stored under models/python/1.

Configuration

Runtime configuration is managed through environment variables interpreted by app.config.Settings:

  • DEVICE (default: auto) — Selects cpu, cuda, or auto-detection using torch.cuda.is_available().
  • REPO_ROOT (default: models) — Root directory containing model packages.
  • PRELOAD (default: True) — Eagerly load active versions at startup when True; lazy load otherwise.
  • UVICORN_ACCESS_LOG (default: False) — Enables Uvicorn access logging when set to True.
  • METRICS_BUCKETS (default: 0.005,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28) — Prometheus latency histogram buckets (seconds).
  • REDIS_URL (default: unset) — When set, enables a Redis-backed state store (e.g., redis://hostname:6379/0).
  • REDIS_PREFIX (default: inferno:model_state) — Key prefix used for Redis state entries.
  • STATE_RECONCILE_ENABLED (default: True) — Toggles the background reconciler that syncs runners with Redis state.
  • STATE_RECONCILE_INTERVAL_S (default: 5.0) — Seconds between reconciliation sweeps when Redis is enabled.

Set variables using your process manager (e.g., export DEVICE=cpu). Unknown environment variables are ignored.

Model Repository Layout

Models live under models/<model-name>/<version>/ alongside a model.yaml that describes runtime behavior.

models/
  my_classifier/
    1/
      model.ts            # TorchScript file or Python module
      model.yaml          # schema, batching, concurrency, codecs
    2/
      model.ts
      model.yaml

Model Configuration (model.yaml)

For a field-by-field reference, including validation rules and advanced examples, see the Model Packaging Guide.

Key elements to provide:

  • name / version: identify the model and the specific revision being shipped.
  • format: choose torchscript for serialized modules or python with an entrypoint for custom code.
  • signature: declare required and optional tensors (names, dtypes, and shapes) for both inputs and outputs.
  • batching and concurrency (optional): tune latency versus throughput per model.
  • codecs (optional): select the default encoding (json or base64) for responses.

Minimal TorchScript example:

name: my_classifier
version: 1
format: torchscript
signature:
  inputs:
    - name: tokens
      dtype: int64
      shape: [batch, seq]
  outputs:
    - name: probs
      dtype: float32
      shape: [batch, num_classes]

Augment with batching, concurrency, or codec overrides as needed. Python models require an entrypoint such as model:build that returns a callable already switched to eval() mode.
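
For the python format, here is a minimal sketch of what such an entrypoint module might look like, assuming entrypoint: model:build in model.yaml resolves to a build() function inside model.py. The class name and layer sizes are illustrative, not the bundled sample model.

# model.py — illustrative only; shapes follow the signature example above
import torch


class TinyClassifier(torch.nn.Module):
    def __init__(self, seq_len: int = 4, num_classes: int = 4):
        super().__init__()
        self.linear = torch.nn.Linear(seq_len, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [batch, seq] -> probs: [batch, num_classes]
        return torch.softmax(self.linear(tokens.float()), dim=-1)


def build() -> torch.nn.Module:
    # Referenced from model.yaml as entrypoint: model:build;
    # returns a callable already switched to eval mode.
    return TinyClassifier().eval()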

Prediction API

For complete request and response schemas, consult the API Reference.

POST /v1/models/{name}:predict

Request Payload

{
  "inputs": {
    "tokens": {
      "dtype": "float32",
      "shape": [2, 4],
      "data": [[1, 2, 3, 4], [4, 3, 2, 1]]
    },
    "mask": {
      "dtype": "float32",
      "shape": [2, 4],
      "base64": "<base64 payload>"
    }
  },
  "outputs": ["probs", "total"],
  "timeout_ms": 2000
}
  • inputs: Map of tensor names to payloads. Each payload requires dtype, shape, and either data (nested lists) or base64 (contiguous binary); the two are mutually exclusive. A client-side sketch follows this list.
  • outputs (optional): Restricts returned tensors; defaults to all declared outputs.
  • timeout_ms (optional): Overrides the per-request wait limit for batching; must be positive.
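
As noted above, a hedged client-side sketch of building and sending this payload, assuming httpx and numpy are installed (the model name my_classifier and tensor names mirror the examples in this README):

import base64

import httpx
import numpy as np

tokens = np.array([[1, 2, 3, 4], [4, 3, 2, 1]], dtype=np.float32)
mask = np.ones((2, 4), dtype=np.float32)

payload = {
    "inputs": {
        # JSON codec: nested lists under "data"
        "tokens": {"dtype": "float32", "shape": list(tokens.shape), "data": tokens.tolist()},
        # base64 codec: the tensor's contiguous raw bytes, base64-encoded
        "mask": {
            "dtype": "float32",
            "shape": list(mask.shape),
            "base64": base64.b64encode(np.ascontiguousarray(mask).tobytes()).decode("ascii"),
        },
    },
    "outputs": ["probs"],
    "timeout_ms": 2000,
}

resp = httpx.post(
    "http://127.0.0.1:8000/v1/models/my_classifier:predict",
    json=payload,
    timeout=10.0,
)
resp.raise_for_status()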

Response Payload

{
  "outputs": {
    "probs": {"dtype": "float32", "shape": [2, 4], "data": [[...], [...]]},
    "total": {"dtype": "float32", "shape": [1, 4], "data": [[...]]}
  }
}

Outputs mirror the codec used by the model runner (defaults to the model-level codecs.default).
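
A short sketch of turning a JSON-encoded output tensor back into an array, continuing the client example above (numpy assumed):

out = resp.json()["outputs"]["probs"]
probs = np.asarray(out["data"], dtype=out["dtype"]).reshape(out["shape"])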

Validation & Errors

  • Missing required inputs, unexpected tensor names, dtype/shape mismatches, or invalid output selections return 422 Unprocessable Entity with descriptive messages (e.g., "Missing required input(s): tokens. Expected inputs: mask, tokens").
  • Unsupported dtypes, malformed base64 payloads, or shape/data inconsistencies are rejected during request parsing.
  • Unloaded or unknown models yield 404 ("model not found or not loaded").

Administrative APIs

For payload schemas and error codes, refer to the API Reference.

  • GET /v1/models — Unified view of active and (optionally) loaded versions per model.
  • GET /v1/models/{name} — Model metadata and full signature (triggers lazy load if needed).
  • GET /v1/models/{name}/active — Active version metadata (name, version, loaded state).
  • POST /v1/models:load — Body { "name": "model", "version": 3 } loads a model and registers a runner.
  • POST /v1/models:unload — Body { "name": "model", "version": 2 } unloads the runner and model from memory.

These endpoints are synchronous and return { "loaded": true } / { "unloaded": true } on success.
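
A hedged sketch of driving these endpoints from Python (httpx assumed; the model name and versions are placeholders):

import httpx

BASE = "http://127.0.0.1:8000"

# Load version 2 of a model and register its runner.
r = httpx.post(f"{BASE}/v1/models:load", json={"name": "my_classifier", "version": 2})
assert r.json().get("loaded") is True

# Inspect the unified view, then unload the previous version.
print(httpx.get(f"{BASE}/v1/models").json())
httpx.post(f"{BASE}/v1/models:unload", json={"name": "my_classifier", "version": 1})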

Health & Metrics

Additional details and sample responses are listed in the API Reference.

  • GET /health/live — Process liveness (always 200 when the service is up).
  • GET /health/ready — Indicates whether runners have started successfully (READY gauge set to 1).
  • GET /metrics — Prometheus exposition including:
    • infer_requests_total with labels (model, route)
    • infer_latency_seconds histogram per model
    • infer_batch_size distribution
    • infer_exceptions_total labeled by exception class
    • model_loaded gauge per model/version

Scrape /metrics from your observability stack to track throughput, latency, and error rates.
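
For a quick look outside a full Prometheus setup, a small sketch that fetches the exposition text and prints only the Inferno-specific samples (httpx assumed; metric names as listed above):

import httpx

text = httpx.get("http://127.0.0.1:8000/metrics").text
for line in text.splitlines():
    # Sample lines start with the metric name; HELP/TYPE comments start with '#'.
    if line.startswith(("infer_", "model_loaded")):
        print(line)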

Batching & Concurrency

  • Each model has an AsyncBatcher that gathers requests until either max_batch_size is reached or max_batch_latency_ms elapses (the general pattern is sketched after this list).
  • A per-model semaphore (max_concurrency) limits concurrent execution to protect GPU/CPU resources.
  • Timeouts supplied via timeout_ms bound how long a request waits in the batching queue before failing with a 504 (handled by asyncio.wait_for).
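
The sketch below illustrates the general size-or-latency batching pattern described in this list. It is not Inferno's AsyncBatcher; the class name, parameters, and defaults are assumptions for illustration only.

import asyncio


class SketchBatcher:
    def __init__(self, runner, max_batch_size=8, max_batch_latency_ms=10, max_concurrency=2):
        self.runner = runner                      # callable: list of requests -> list of results
        self.max_batch_size = max_batch_size
        self.max_latency = max_batch_latency_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()
        self.sem = asyncio.Semaphore(max_concurrency)

    async def submit(self, request, timeout_ms=None):
        # Each caller gets a future that is resolved when its batch finishes.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((request, fut))
        timeout = timeout_ms / 1000.0 if timeout_ms else None
        return await asyncio.wait_for(fut, timeout)   # TimeoutError maps to a 504

    async def run(self):
        while True:
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + self.max_latency
            # Grow the batch until it is full or the latency budget elapses.
            while len(batch) < self.max_batch_size:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            async with self.sem:                      # per-model concurrency limit
                results = self.runner([req for req, _ in batch])
            for (_, fut), result in zip(batch, results):
                if not fut.done():
                    fut.set_result(result)

In practice such a consumer loop would be started as a background task per model (e.g., asyncio.create_task(batcher.run())) when the runner is registered.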

Logging

Logging is configured through app.logging.configure_logging() which emits JSON-structured logs with the following fields:

  • ts — ISO8601 timestamp
  • level — Log level
  • logger — Logger name (server, runtime, httpx, etc.)
  • msg — Message text

Additional request context (request id, route, batch size) can be layered in downstream processors.
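
For illustration only, a single log line might look like the following (field values are made up):

{"ts": "2024-01-01T12:00:00.000Z", "level": "INFO", "logger": "server", "msg": "prediction completed"}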

Integrate with your log aggregation platform by redirecting stdout/stderr or by running Uvicorn workers under Gunicorn (the gunicorn dependency is included).

Deployment Notes

For end-to-end deployment patterns—including Docker builds, Gunicorn process management, Kubernetes rollouts, and observability hooks—see the Deployment Guide. Key reminders:

  • Keep model artifacts on durable storage and mount them read-only in production.
  • Configure DEVICE, REPO_ROOT, and related environment variables per environment.
  • Use the provided health endpoints (/health/live, /health/ready) for liveness/readiness checks.
  • Scrape /metrics and stream JSON logs to integrate with your monitoring stack.

Load Testing

Install Locust (pip install locust) and use scripts/load_test.py as the Locust file to stress the prediction endpoint.

locust -f scripts/load_test.py \
  --headless \
  -u 50 \
  -r 10 \
  --run-time 2m \
  --host http://127.0.0.1:8000 \
  --model-name torchscript \
  --load-model \
  --model-version 2 \
  --warmup 5 \
  --verify-response

Script-specific options:

  • --payload-file points to a JSON file if you need a custom request body (defaults to the sample model payload under models/).
  • --request-timeout overrides the per-request timeout (seconds).
  • --model-name, --model-version, --load-model, and --warmup mirror the standalone script behaviour for staging models before traffic.
  • --verify-response asserts that the server returns at least one output tensor per request.

Development Workflow

  1. Create a virtual environment (python -m venv .venv && source .venv/bin/activate).
  2. Install dependencies (pip install -e .[dev] if you maintain an optional dev extra).
  3. Run formatting and linting (add tools such as ruff or black as needed).
  4. Execute the test suite (PYTHONPATH=. pytest).
  5. Add or update models under models/ and extend the test coverage for new endpoints or validation rules.

Contributions should include updated tests and documentation when behavior changes. For large modifications, describe the expected API changes and operational impact in your pull request.

Shared State (Redis)

  • Configure REDIS_URL when running multiple replicas so they share active/loaded state. Without it, each pod keeps state in-process and must receive administrative calls individually.
  • With Redis configured, pods run a background reconciler that periodically reads the shared state and auto-loads or unloads runners to match the desired versions.
