feat(common): add retry/failover engine#927

Open
TroyMitchell911 wants to merge 14 commits into katanemo:main from TroyMitchell911:feat/retry-core

TroyMitchell911 commented Apr 28, 2026

Summary

Continuation of #733. Implements the core retry/failover engine as a standalone library within the common crate.

Addresses maintainer feedback from #733:

  • Exponential backoff with configurable base_ms/max_ms and jitter
  • Configurable max attempts (default_max_attempts and per-status-code max_attempts)
  • Provider selection when failing over — uses fallback_models list or strategy-based selection

PR Chain

This is PR 2 of 5 in the retry/failover feature series:

  1. feat(common): add retry policy configuration types #926 — Configuration types (must merge first)
  2. This PR — Retry/failover engine (includes feat(common): add retry policy configuration types #926 commits)
  3. feat(brightstaff): integrate retry orchestrator into LLM handler #928 — Handler integration
  4. feat: multi-provider failover support #929 — Multi-provider failover support
  5. test: add retry/failover e2e and unit tests #930 — E2E and unit tests

Note: This PR includes commits from #926. Please merge #926 first; the diff here will then show only the retry engine commits.

Components

  • Backoff calculator: Exponential backoff with equal/full/decorrelated jitter strategies, configurable base_ms (default 100) and max_ms (default 5000)
  • Error detector: Classifies HTTP responses as retryable or not based on the on_status_codes config (e.g. 429, 503)
  • Provider selector: Picks fallback providers from fallback_models list using configured RetryStrategy (same_model, same_provider, different_provider)
  • State managers: Track Retry-After headers and latency-based blocking per provider
  • Error response builder: Constructs detailed error responses when all retries are exhausted
  • Retry orchestrator: Coordinates all components into a single retry loop

Changes

  • crates/common/src/retry/: New module with all retry components
  • crates/common/Cargo.toml: Add sha2, dashmap, tokio dependencies
  • crates/Cargo.lock: Updated lockfile

Signed-off-by: Troy Mitchell <troymitchell988@gmail.com>

Signed-off-by: Troy Mitchell <i@troy-y.org>
Add retry policy configuration types to support automatic retry and
failover for LLM requests:

- RetryPolicy: top-level config with fallback_models, default_strategy,
  default_max_attempts, and per-status-code overrides
- BackoffConfig: exponential backoff with base_ms, max_ms, jitter, and
  scope (per-model, per-provider, or global)
- RetryAfterConfig: Retry-After header handling with block scope and
  duration limits
- HighLatencyConfig: latency-based blocking with threshold, measurement
  type, and trigger conditions
- LatencyTriggerConfig: min_triggers and trigger_window for debouncing
- RetryStrategy enum: same_model, same_provider, different_provider
- StatusCodeEntry: flexible status code matching (single, range, list)

Also add retry_policy field to GatewayConfig with Default impl.
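
To make the "single, range, list" matching concrete, here is a minimal sketch of how a status code entry could expand into concrete codes. The enum below is a hypothetical stand-in named after `StatusCodeEntry`; its variants and the `expand`/`matches` methods are assumptions for illustration, not the types added in this commit.

```rust
/// Hypothetical stand-in for the `StatusCodeEntry` type described above.
#[derive(Debug, Clone, PartialEq)]
pub enum StatusCodeEntry {
    /// One status code, e.g. 429.
    Single(u16),
    /// Inclusive range, e.g. 500 through 504.
    Range(u16, u16),
    /// Explicit list of codes.
    List(Vec<u16>),
}

impl StatusCodeEntry {
    /// Expand the entry into the concrete status codes it covers.
    /// A range whose start exceeds its end expands to nothing.
    pub fn expand(&self) -> Vec<u16> {
        match self {
            StatusCodeEntry::Single(code) => vec![*code],
            StatusCodeEntry::Range(lo, hi) => (*lo..=*hi).collect(),
            StatusCodeEntry::List(codes) => codes.clone(),
        }
    }

    /// Check a response status without allocating the expanded list.
    pub fn matches(&self, status: u16) -> bool {
        match self {
            StatusCodeEntry::Single(code) => *code == status,
            StatusCodeEntry::Range(lo, hi) => (*lo..=*hi).contains(&status),
            StatusCodeEntry::List(codes) => codes.contains(&status),
        }
    }
}
```

For example, `StatusCodeEntry::Range(500, 503).expand()` yields the four codes 500 through 503, and `StatusCodeEntry::List(vec![429, 503]).matches(503)` is true.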

Signed-off-by: Troy Mitchell <i@troy-y.org>
Add comprehensive tests for retry policy configuration:

- proptest: round-trip serialization, default invariants, status code
  expansion (single, range, full range)
- YAML pattern tests covering 17 real-world configuration patterns:
  multi-provider failover, same-provider model downgrade, backoff on
  multiple error types, per-status-code strategy customization,
  timeout-specific config, no-retry, backoff scopes (model/provider/
  global), high-latency blocking, retry-after handling, fallback
  models list, mixed integer and range codes

Signed-off-by: Troy Mitchell <i@troy-y.org>
Add JSON schema definitions for retry policy configuration including
RetryPolicy, BackoffConfig, RetryAfterConfig, HighLatencyConfig,
LatencyTriggerConfig, RetryStrategy, StatusCodeEntry, and all
associated enums.

Signed-off-by: Troy Mitchell <i@troy-y.org>
Add the retry module with core type definitions including:
- RequestContext, RequestSignature for request deduplication
- RetryExhaustedError, AllProvidersExhaustedError for error handling
- AttemptError, AttemptErrorType for attempt tracking
- ValidationError, ValidationWarning for config validation
- Helper functions for provider extraction and hashing

Wire up pub mod retry in lib.rs.

Signed-off-by: Troy Mitchell <i@troy-y.org>
Implement BackoffCalculator supporting:
- Exponential backoff with configurable base/max delay
- Full, equal, and decorrelated jitter strategies
- Per-provider and per-status-code backoff overrides
- Comprehensive unit tests for all strategies
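
As an illustration of the three jitter strategies, here is a rough sketch of the delay computation. The function names, the `Jitter` enum, and the injected `unit` sample are assumptions made for this example (the real `BackoffCalculator` API may differ); the formulas are the commonly used full, equal, and decorrelated jitter variants.

```rust
use std::time::Duration;

/// Hypothetical jitter selector; variant names mirror the strategies listed above.
#[derive(Debug, Clone, Copy)]
pub enum Jitter {
    Full,
    Equal,
    Decorrelated,
}

/// Un-jittered delay: base_ms * 2^attempt, capped at max_ms (`attempt` is zero-based).
pub fn exponential_ms(base_ms: u64, max_ms: u64, attempt: u32) -> u64 {
    // checked_shl returns None once the shift would exceed the bit width.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    base_ms.saturating_mul(factor).min(max_ms)
}

/// Compute the delay before the next attempt.
/// `unit` is a random sample in [0, 1], injected by the caller so the function
/// stays deterministic and easy to test. `prev_ms` is the previous delay and is
/// only used by decorrelated jitter.
pub fn backoff_delay(
    jitter: Jitter,
    base_ms: u64,
    max_ms: u64,
    attempt: u32,
    prev_ms: u64,
    unit: f64,
) -> Duration {
    let unit = unit.clamp(0.0, 1.0);
    let exp = exponential_ms(base_ms, max_ms, attempt);
    // Scale a span of milliseconds by the random sample.
    let scale = |span: u64| (span as f64 * unit) as u64;
    let ms = match jitter {
        // Uniform in [0, exp].
        Jitter::Full => scale(exp),
        // Half fixed, half random.
        Jitter::Equal => exp / 2 + scale(exp - exp / 2),
        // Uniform in [base, 3 * prev], capped at max_ms.
        Jitter::Decorrelated => {
            let upper = prev_ms.saturating_mul(3).max(base_ms);
            (base_ms + scale(upper - base_ms)).min(max_ms)
        }
    };
    Duration::from_millis(ms)
}
```

With the defaults mentioned in the PR description (base 100 ms, max 5000 ms), the un-jittered delay for the fourth attempt is 800 ms, and any attempt from the seventh onward is capped at 5000 ms.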

Signed-off-by: Troy Mitchell <i@troy-y.org>
Implement ErrorDetector that classifies HTTP responses into:
- Retryable errors (5xx, 429, timeouts)
- Non-retryable errors (4xx client errors)
- Successful responses
Supports configurable status code matching and latency-based
error detection with measurement strategies (TTFB/total).
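
A minimal sketch of the classification step, assuming the retryable set comes from configuration and that a missing status stands for a timeout or connection failure. The `Classification` enum and `classify` function are hypothetical names, and latency-based detection is left out.

```rust
/// Hypothetical result of inspecting one upstream response.
#[derive(Debug, PartialEq, Eq)]
pub enum Classification {
    Success,
    Retryable,
    NonRetryable,
}

/// Classify a response. `status` is `None` when no response arrived
/// (timeout or connection error), which this sketch treats as retryable.
/// `retry_on` is the configured list of retryable status codes.
pub fn classify(status: Option<u16>, retry_on: &[u16]) -> Classification {
    match status {
        None => Classification::Retryable,
        Some(code) if (200..300).contains(&code) => Classification::Success,
        Some(code) if retry_on.contains(&code) => Classification::Retryable,
        // Everything else, including unlisted 4xx codes, is not retried here.
        Some(_) => Classification::NonRetryable,
    }
}
```

With `retry_on = [429, 503]`, a 429 is retryable, a 404 is not, and a 200 is a success.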

Signed-off-by: Troy Mitchell <i@troy-y.org>
Implement RetryErrorResponseBuilder that constructs structured
JSON error responses when all retry attempts are exhausted,
including per-attempt error details and provider information.

Signed-off-by: Troy Mitchell <i@troy-y.org>
Add three state management components:
- LatencyBlockStateManager: tracks providers blocked due to
  high latency with configurable block duration and scope
- LatencyTriggerCounter: counts consecutive latency threshold
  breaches before triggering provider blocking
- RetryAfterStateManager: honors Retry-After headers with
  per-provider/model/endpoint blocking scope
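
The Retry-After handling can be sketched roughly as a map from a scope key to a "blocked until" instant. This is an illustrative sketch only: it uses a plain `HashMap` rather than the `dashmap` dependency this PR adds, the type and method names are hypothetical, and only the delta-seconds form of the header is parsed.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical single-threaded stand-in for a Retry-After state manager.
#[derive(Default)]
pub struct RetryAfterState {
    /// Scope key (provider, model, or endpoint) mapped to the end of its block.
    blocked_until: HashMap<String, Instant>,
}

impl RetryAfterState {
    /// Parse the delta-seconds form of Retry-After.
    /// The HTTP-date form is not handled in this sketch and yields `None`.
    pub fn parse_seconds(header: &str) -> Option<Duration> {
        header.trim().parse::<u64>().ok().map(Duration::from_secs)
    }

    /// Block `key` for `wait`, clamped to `max_wait`. An existing longer block is kept.
    pub fn block(&mut self, key: &str, now: Instant, wait: Duration, max_wait: Duration) {
        let until = now + wait.min(max_wait);
        let entry = self.blocked_until.entry(key.to_string()).or_insert(until);
        if until > *entry {
            *entry = until;
        }
    }

    /// A key is blocked while `now` is strictly before its recorded instant.
    pub fn is_blocked(&self, key: &str, now: Instant) -> bool {
        self.blocked_until.get(key).map_or(false, |until| now < *until)
    }
}
```

Passing `now` explicitly keeps the sketch testable without sleeping; a concurrent implementation would also need to evict expired entries.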

Signed-off-by: Troy Mitchell <i@troy-y.org>
Implement validate_retry_policy() that checks retry policy
configuration for errors and warnings including:
- Invalid max_retries/timeout ranges
- Conflicting backoff and jitter settings
- Missing or invalid provider references
- Latency threshold consistency checks

Signed-off-by: Troy Mitchell <i@troy-y.org>
Implement ProviderSelector that determines the next provider
for retry attempts based on:
- Failover provider list with priority ordering
- Latency-blocked provider filtering
- Retry-After header honoring
- Round-robin and priority-based selection strategies
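
One way the selection step could look, using the three `RetryStrategy` values from the PR description. This sketch assumes model identifiers have the form `provider/model` and that blocked targets (from latency or Retry-After state) are passed in as a set; `select_next` and `provider_of` are hypothetical helpers, and list order stands in for priority.

```rust
use std::collections::HashSet;

/// Mirrors the strategy names from the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetryStrategy {
    SameModel,
    SameProvider,
    DifferentProvider,
}

/// Assumption: identifiers look like "provider/model".
fn provider_of(model_id: &str) -> &str {
    model_id.split('/').next().unwrap_or(model_id)
}

/// Pick the next target, or `None` if nothing eligible remains.
/// `fallback_models` is scanned in order, so earlier entries have priority.
pub fn select_next<'a>(
    strategy: RetryStrategy,
    current: &'a str,
    fallback_models: &'a [String],
    blocked: &HashSet<String>,
) -> Option<&'a str> {
    match strategy {
        // Retry the same model unless it is currently blocked.
        RetryStrategy::SameModel => {
            if blocked.contains(current) {
                None
            } else {
                Some(current)
            }
        }
        RetryStrategy::SameProvider | RetryStrategy::DifferentProvider => {
            let want_same = strategy == RetryStrategy::SameProvider;
            fallback_models.iter().map(String::as_str).find(|m| {
                let same = provider_of(m) == provider_of(current);
                *m != current && same == want_same && !blocked.contains(*m)
            })
        }
    }
}
```

Given fallbacks `["provider_a/small", "provider_b/large"]` and a current target of `provider_a/large`, `SameProvider` picks `provider_a/small` and `DifferentProvider` picks `provider_b/large` unless it is blocked.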

Signed-off-by: Troy Mitchell <i@troy-y.org>
Implement RetryOrchestrator as the top-level coordinator that:
- Manages the full retry lifecycle per request
- Integrates backoff, error detection, provider selection
- Handles request deduplication via content hashing
- Supports both same-provider retry and cross-provider failover
- Emits structured attempt records for observability
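
The overall loop can be sketched as follows. This is a simplified, synchronous illustration rather than the `RetryOrchestrator` itself: `run_with_retry`, `Outcome`, and `AttemptRecord` are hypothetical names, the attempt budget is assumed to apply per target, and sleeping and backoff are injected as closures so no async runtime is needed.

```rust
use std::time::Duration;

/// Hypothetical result of a single upstream call.
#[derive(Debug, PartialEq)]
pub enum Outcome<T> {
    Success(T),
    Retryable(String),
    Fatal(String),
}

/// One failed attempt, kept for the final error report.
#[derive(Debug, PartialEq)]
pub struct AttemptRecord {
    pub target: String,
    pub attempt: u32,
    pub error: String,
}

/// Try each target in order, up to `max_attempts` times each (an assumption
/// of this sketch). Returns the first success, or every recorded failure.
pub fn run_with_retry<T>(
    targets: &[&str],
    max_attempts: u32,
    mut attempt_fn: impl FnMut(&str) -> Outcome<T>,
    mut sleep_fn: impl FnMut(Duration),
    backoff: impl Fn(u32) -> Duration,
) -> Result<T, Vec<AttemptRecord>> {
    let mut records = Vec::new();
    for &target in targets {
        for attempt in 0..max_attempts {
            let error = match attempt_fn(target) {
                Outcome::Success(value) => return Ok(value),
                Outcome::Fatal(error) => {
                    // Non-retryable: stop immediately and report what happened.
                    records.push(AttemptRecord { target: target.to_string(), attempt, error });
                    return Err(records);
                }
                Outcome::Retryable(error) => error,
            };
            records.push(AttemptRecord { target: target.to_string(), attempt, error });
            // Back off only between attempts on the same target, not before failover.
            if attempt + 1 < max_attempts {
                sleep_fn(backoff(attempt));
            }
        }
    }
    Err(records)
}
```

Skipping the delay before switching targets is a design choice of this sketch; a real implementation might instead consult per-provider backoff or Retry-After state at that point.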

Signed-off-by: Troy Mitchell <i@troy-y.org>
Signed-off-by: Troy Mitchell <i@troy-y.org>