feat(common): add retry/failover engine#927
Open
TroyMitchell911 wants to merge 14 commits into
Signed-off-by: Troy Mitchell <i@troy-y.org>
Add retry policy configuration types to support automatic retry and failover for LLM requests:

- RetryPolicy: top-level config with fallback_models, default_strategy, default_max_attempts, and per-status-code overrides
- BackoffConfig: exponential backoff with base_ms, max_ms, jitter, and scope (per-model, per-provider, or global)
- RetryAfterConfig: Retry-After header handling with block scope and duration limits
- HighLatencyConfig: latency-based blocking with threshold, measurement type, and trigger conditions
- LatencyTriggerConfig: min_triggers and trigger_window for debouncing
- RetryStrategy enum: same_model, same_provider, different_provider
- StatusCodeEntry: flexible status code matching (single, range, list)

Also add retry_policy field to GatewayConfig with Default impl.

Signed-off-by: Troy Mitchell <i@troy-y.org>
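For reviewers, the "single, range, list" matching described for StatusCodeEntry can be pictured with the following minimal sketch. The enum shape, field names, and the `matches`/`expand` methods are assumptions made for illustration; the actual type in this PR (including its serde representation) may differ.

```rust
use std::collections::BTreeSet;

/// Hypothetical shape for a status code matcher: one code, an inclusive
/// range, or an explicit list. Not taken from the PR's source.
#[derive(Debug, Clone, PartialEq)]
pub enum StatusCodeEntry {
    Single(u16),
    Range { start: u16, end: u16 },
    List(Vec<u16>),
}

impl StatusCodeEntry {
    /// Returns true when `code` is covered by this entry.
    pub fn matches(&self, code: u16) -> bool {
        match self {
            StatusCodeEntry::Single(c) => *c == code,
            StatusCodeEntry::Range { start, end } => (*start..=*end).contains(&code),
            StatusCodeEntry::List(codes) => codes.contains(&code),
        }
    }

    /// Expands the entry into a sorted, de-duplicated set of codes.
    pub fn expand(&self) -> BTreeSet<u16> {
        match self {
            StatusCodeEntry::Single(c) => BTreeSet::from([*c]),
            StatusCodeEntry::Range { start, end } => (*start..=*end).collect(),
            StatusCodeEntry::List(codes) => codes.iter().copied().collect(),
        }
    }
}
```

Expanding into a BTreeSet keeps the result ordered and free of duplicates, which makes "mixed integer and range codes" easy to compare in tests.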
Add comprehensive tests for retry policy configuration:

- proptest: round-trip serialization, default invariants, status code expansion (single, range, full range)
- YAML pattern tests covering 17 real-world configuration patterns: multi-provider failover, same-provider model downgrade, backoff on multiple error types, per-status-code strategy customization, timeout-specific config, no-retry, backoff scopes (model/provider/global), high-latency blocking, retry-after handling, fallback models list, mixed integer and range codes

Signed-off-by: Troy Mitchell <i@troy-y.org>
Add JSON schema definitions for retry policy configuration including RetryPolicy, BackoffConfig, RetryAfterConfig, HighLatencyConfig, LatencyTriggerConfig, RetryStrategy, StatusCodeEntry, and all associated enums.

Signed-off-by: Troy Mitchell <i@troy-y.org>
Signed-off-by: Troy Mitchell <i@troy-y.org>
Add the retry module with core type definitions including:

- RequestContext, RequestSignature for request deduplication
- RetryExhaustedError, AllProvidersExhaustedError for error handling
- AttemptError, AttemptErrorType for attempt tracking
- ValidationError, ValidationWarning for config validation
- Helper functions for provider extraction and hashing

Wire up pub mod retry in lib.rs.

Signed-off-by: Troy Mitchell <i@troy-y.org>
Implement BackoffCalculator supporting:

- Exponential backoff with configurable base/max delay
- Full, equal, and decorrelated jitter strategies
- Per-provider and per-status-code backoff overrides
- Comprehensive unit tests for all strategies

Signed-off-by: Troy Mitchell <i@troy-y.org>
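The backoff behavior listed above can be sketched as follows. This is not the PR's BackoffCalculator: the struct, the doubling factor, and the method names are assumptions, and the jitter formulas are the commonly used definitions of full, equal, and decorrelated jitter. The random sample is passed in as `unit` so the sketch needs no external crate and stays deterministic.

```rust
/// Jitter strategies named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Jitter {
    None,
    Full,
    Equal,
    Decorrelated,
}

/// Hypothetical calculator; field names mirror base_ms/max_ms from the config.
pub struct Backoff {
    pub base_ms: u64,
    pub max_ms: u64,
    pub jitter: Jitter,
}

impl Backoff {
    /// Un-jittered delay for a 0-based attempt: min(max_ms, base_ms * 2^attempt).
    /// The doubling factor is an assumption; overflow saturates.
    pub fn ceiling_ms(&self, attempt: u32) -> u64 {
        let factor = 2u64.checked_pow(attempt).unwrap_or(u64::MAX);
        self.base_ms.saturating_mul(factor).min(self.max_ms)
    }

    /// `unit` is a caller-supplied random sample in [0, 1].
    /// `prev_ms` is the previous delay and is only used by decorrelated jitter.
    pub fn delay_ms(&self, attempt: u32, prev_ms: u64, unit: f64) -> u64 {
        let unit = unit.clamp(0.0, 1.0);
        let cap = self.ceiling_ms(attempt);
        match self.jitter {
            Jitter::None => cap,
            // Full: uniform in [0, cap].
            Jitter::Full => (cap as f64 * unit) as u64,
            // Equal: half fixed, half random.
            Jitter::Equal => cap / 2 + ((cap - cap / 2) as f64 * unit) as u64,
            // Decorrelated: uniform in [base, prev * 3], capped at max_ms.
            Jitter::Decorrelated => {
                let hi = prev_ms.saturating_mul(3).max(self.base_ms);
                let span = hi - self.base_ms;
                (self.base_ms + (span as f64 * unit) as u64).min(self.max_ms)
            }
        }
    }
}
```

With base_ms = 100 and max_ms = 5000, the un-jittered ceiling grows 100, 200, 400, and so on until it is capped at 5000 from the seventh attempt onward.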
Implement ErrorDetector that classifies HTTP responses into:

- Retryable errors (5xx, 429, timeouts)
- Non-retryable errors (4xx client errors)
- Successful responses

Supports configurable status code matching and latency-based error detection with measurement strategies (TTFB/total).

Signed-off-by: Troy Mitchell <i@troy-y.org>
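The classification above might look roughly like this. All names here (`Outcome`, `Classification`, `Timing`, `classify`, `exceeds_latency`) are hypothetical, and the fixed mapping stands in for the configurable status code matching the commit describes. Treating 1xx and 3xx as non-retryable is also an assumption of the sketch.

```rust
#[derive(Debug, PartialEq)]
pub enum Classification {
    Success,
    Retryable,
    NonRetryable,
}

/// What came back from an upstream attempt (hypothetical type).
pub enum Outcome {
    Status(u16),
    Timeout,
}

/// Default classification: 2xx succeeds; 429, 5xx, and timeouts are
/// retryable; everything else is treated as non-retryable.
pub fn classify(outcome: &Outcome) -> Classification {
    match outcome {
        Outcome::Timeout => Classification::Retryable,
        Outcome::Status(429) => Classification::Retryable,
        Outcome::Status(500..=599) => Classification::Retryable,
        Outcome::Status(200..=299) => Classification::Success,
        Outcome::Status(_) => Classification::NonRetryable,
    }
}

/// Which latency figure is compared against the threshold.
#[derive(Debug, Clone, Copy)]
pub enum Measurement {
    Ttfb,
    Total,
}

pub struct Timing {
    pub ttfb_ms: u64,
    pub total_ms: u64,
}

/// Returns true when the selected measurement is above `threshold_ms`.
pub fn exceeds_latency(timing: &Timing, measurement: Measurement, threshold_ms: u64) -> bool {
    let measured = match measurement {
        Measurement::Ttfb => timing.ttfb_ms,
        Measurement::Total => timing.total_ms,
    };
    measured > threshold_ms
}
```

Note that 429 must be matched before the general client-error arm, since it is a 4xx code that is nonetheless retryable.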
Implement RetryErrorResponseBuilder that constructs structured JSON error responses when all retry attempts are exhausted, including per-attempt error details and provider information.

Signed-off-by: Troy Mitchell <i@troy-y.org>
Add three state management components:

- LatencyBlockStateManager: tracks providers blocked due to high latency with configurable block duration and scope
- LatencyTriggerCounter: counts consecutive latency threshold breaches before triggering provider blocking
- RetryAfterStateManager: honors Retry-After headers with per-provider/model/endpoint blocking scope

Signed-off-by: Troy Mitchell <i@troy-y.org>
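One way to read the debouncing role of LatencyTriggerCounter, given the min_triggers and trigger_window settings from the config commit, is a sliding-window counter. The sketch below is an interpretation, not the PR's code: the struct layout, the `record`/`reset` methods, and the choice to pass `now` explicitly (for testability) are all assumptions.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical counter: blocks only after `min_triggers` breaches
/// fall within `window` of each other.
pub struct LatencyTriggerCounter {
    min_triggers: usize,
    window: Duration,
    hits: VecDeque<Instant>,
}

impl LatencyTriggerCounter {
    pub fn new(min_triggers: usize, window: Duration) -> Self {
        Self { min_triggers, window, hits: VecDeque::new() }
    }

    /// Records a latency breach observed at `now` and returns true when
    /// enough breaches are inside the window to trigger blocking.
    pub fn record(&mut self, now: Instant) -> bool {
        // Drop breaches that are older than the window.
        while let Some(&oldest) = self.hits.front() {
            if now.duration_since(oldest) > self.window {
                self.hits.pop_front();
            } else {
                break;
            }
        }
        self.hits.push_back(now);
        self.hits.len() >= self.min_triggers
    }

    /// Clears the streak, e.g. after a response under the threshold.
    pub fn reset(&mut self) {
        self.hits.clear();
    }
}
```

Calling `reset` on a fast response is what would make the count "consecutive"; without it the counter only enforces the time window.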
Implement validate_retry_policy() that checks retry policy configuration for errors and warnings including:

- Invalid max_retries/timeout ranges
- Conflicting backoff and jitter settings
- Missing or invalid provider references
- Latency threshold consistency checks

Signed-off-by: Troy Mitchell <i@troy-y.org>
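To illustrate the split between errors and warnings, here is a small sketch of a validator. It is deliberately not named validate_retry_policy: `PolicySketch`, `Finding`, and `check_policy` are hypothetical, and the three rules shown (zero attempts, base above max, duplicate fallback entries) are assumed examples rather than the checks this commit actually performs.

```rust
use std::collections::HashSet;

/// Hypothetical, flattened stand-in for the retry policy config.
pub struct PolicySketch {
    pub default_max_attempts: u32,
    pub base_ms: u64,
    pub max_ms: u64,
    pub fallback_models: Vec<String>,
}

/// Errors should reject the config; warnings only flag likely mistakes.
#[derive(Debug, PartialEq)]
pub enum Finding {
    Error(String),
    Warning(String),
}

pub fn check_policy(p: &PolicySketch) -> Vec<Finding> {
    let mut out = Vec::new();
    if p.default_max_attempts == 0 {
        out.push(Finding::Error("default_max_attempts must be at least 1".into()));
    }
    if p.base_ms > p.max_ms {
        out.push(Finding::Error(format!(
            "base_ms ({}) exceeds max_ms ({})",
            p.base_ms, p.max_ms
        )));
    }
    // A repeated fallback entry is harmless but probably unintended.
    let mut seen = HashSet::new();
    for model in &p.fallback_models {
        if !seen.insert(model.as_str()) {
            out.push(Finding::Warning(format!("duplicate fallback model: {model}")));
        }
    }
    out
}
```

Collecting every finding instead of returning on the first error lets a user fix a config file in one pass.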
Implement ProviderSelector that determines the next provider for retry attempts based on:

- Failover provider list with priority ordering
- Latency-blocked provider filtering
- Retry-After header honoring
- Round-robin and priority-based selection strategies

Signed-off-by: Troy Mitchell <i@troy-y.org>
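The selection rules above can be sketched as a single function over an ordered fallback list. This is an illustration only: `select_next` and `provider_of` are hypothetical names, the "provider/model" identifier format is an assumption, and a single blocked set stands in for both latency blocks and Retry-After blocks. Round-robin selection is not covered.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RetryStrategy {
    SameModel,
    SameProvider,
    DifferentProvider,
}

/// Extracts the provider from a "provider/model" id (assumed format).
fn provider_of(model: &str) -> &str {
    model.split('/').next().unwrap_or(model)
}

/// Picks the next target in priority order, skipping blocked providers
/// and models that were already attempted.
pub fn select_next<'a>(
    current: &'a str,
    fallback_models: &'a [String],
    strategy: RetryStrategy,
    blocked_providers: &HashSet<String>,
    already_tried: &HashSet<String>,
) -> Option<&'a str> {
    if strategy == RetryStrategy::SameModel {
        // Retry the same target unless its provider is currently blocked.
        return (!blocked_providers.contains(provider_of(current))).then_some(current);
    }
    fallback_models.iter().map(String::as_str).find(|m| {
        let provider = provider_of(m);
        let strategy_ok = match strategy {
            RetryStrategy::SameProvider => provider == provider_of(current) && *m != current,
            RetryStrategy::DifferentProvider => provider != provider_of(current),
            RetryStrategy::SameModel => false,
        };
        strategy_ok && !blocked_providers.contains(provider) && !already_tried.contains(*m)
    })
}
```

Because `find` walks the list front to back, the order of fallback_models acts as the priority ordering.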
Implement RetryOrchestrator as the top-level coordinator that:

- Manages the full retry lifecycle per request
- Integrates backoff, error detection, provider selection
- Handles request deduplication via content hashing
- Supports both same-provider retry and cross-provider failover
- Emits structured attempt records for observability

Signed-off-by: Troy Mitchell <i@troy-y.org>
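The lifecycle the orchestrator manages can be reduced to a loop like the one below. This is a synchronous sketch under stated assumptions, not the RetryOrchestrator API: `run_with_failover`, `AttemptRecord`, and `RetryOutcome` are hypothetical, the candidate targets are passed in pre-ordered, and backoff sleeps and request deduplication are left out.

```rust
/// One entry per attempt, kept for observability.
#[derive(Debug, PartialEq)]
pub struct AttemptRecord {
    pub target: String,
    pub status: u16,
}

#[derive(Debug, PartialEq)]
pub enum RetryOutcome {
    Success { target: String, attempts: Vec<AttemptRecord> },
    Exhausted { attempts: Vec<AttemptRecord> },
}

/// Tries each target in order, up to `max_attempts`, using `send` to
/// perform the request and return an HTTP status code.
pub fn run_with_failover<F>(targets: &[&str], max_attempts: usize, mut send: F) -> RetryOutcome
where
    F: FnMut(&str) -> u16,
{
    let mut attempts = Vec::new();
    for &target in targets.iter().take(max_attempts) {
        let status = send(target);
        attempts.push(AttemptRecord { target: target.to_string(), status });
        match status {
            200..=299 => {
                return RetryOutcome::Success { target: target.to_string(), attempts };
            }
            // Retryable: move on to the next candidate.
            429 | 500..=599 => continue,
            // Non-retryable: stop without trying further targets.
            _ => break,
        }
    }
    RetryOutcome::Exhausted { attempts }
}
```

Returning the attempt records in both outcomes is what would let a caller build a structured error response once every candidate has failed.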
Signed-off-by: Troy Mitchell <i@troy-y.org>
This was referenced Apr 28, 2026
Summary
Continuation of #733. Implements the core retry/failover engine as a standalone library within the common crate.

Addresses maintainer feedback from #733:

- base_ms/max_ms and jitter
- default_max_attempts and per-status-code max_attempts
- fallback_models list or strategy-based selection

PR Chain
This is PR 2 of 5 in the retry/failover feature series:
Components
- base_ms (default 100) and max_ms (default 5000)
- on_status_codes config (e.g. 429, 503)
- fallback_models list using configured RetryStrategy (same_model, same_provider, different_provider)
- Retry-After headers and latency-based blocking per provider

Changes

- crates/common/src/retry/: New module with all retry components
- crates/common/Cargo.toml: Add sha2, dashmap, tokio dependencies
- crates/Cargo.lock: Updated lockfile

Signed-off-by: Troy Mitchell <troymitchell988@gmail.com>