feat(signer): add health monitoring and automatic failover for signer pool #399
Greptile Summary
This PR introduces per-signer health monitoring and automatic failover for the
Confidence Score: 4/5
Safe to merge; all prior P0/P1 threading concerns are resolved and the new logic is well-tested.
All three previously flagged issues (atomic race, unused error variant, no time-based recovery) are addressed. The remaining findings are P2: health status is not exposed in SignerInfo (observability gap) and a silent fallback in weighted_select_from. Neither blocks correctness. Score is 4 rather than 5 only because the SignerInfo gap is a notable operator UX regression given that health monitoring is the central feature of this PR.
crates/lib/src/signer/pool.rs: get_signers_info and weighted_select_from edge cases worth addressing before shipping to production monitoring.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Client
participant SignerPool
participant HealthState
participant Signer
Client->>SignerPool: get_next_signer()
SignerPool->>HealthState: is_eligible_for_selection()
alt signer healthy
HealthState-->>SignerPool: true
else unhealthy + cooldown elapsed + no probe in flight
HealthState-->>SignerPool: true (probe eligible)
SignerPool->>HealthState: try_acquire_probe_lock_if_needed()
HealthState-->>SignerPool: probe_in_flight = true
else unhealthy within cooldown OR probe in flight
HealthState-->>SignerPool: false (excluded)
end
SignerPool-->>Client: Arc<Signer>
Client->>Signer: sign_message()
alt signing succeeds
Signer-->>Client: Signature
Client->>SignerPool: record_signing_success()
SignerPool->>HealthState: reset to HealthState::default()
else signing fails
Signer-->>Client: Error
Client->>SignerPool: record_signing_failure()
SignerPool->>HealthState: consecutive_failures += 1
alt consecutive_failures >= 3
HealthState->>HealthState: is_healthy=false, probe_in_flight=false, last_failed_at=now()
end
Client-->>Client: return KoraError::SigningError
end
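The transitions in the diagram can be sketched as a small state machine. This is a minimal single-threaded illustration, not the code from this PR: the field and method names are taken from the diagram, but the signatures, the `FAILURE_THRESHOLD` constant name, and passing `now` and `cooldown` as arguments are assumptions made so the example is deterministic. The review's mention of an atomic race suggests the real pool uses synchronized state, which this sketch replaces with plain `&mut self`.

```rust
use std::time::{Duration, Instant};

/// Failure count at which a signer is marked unhealthy (3 in the diagram).
/// The constant name is hypothetical.
const FAILURE_THRESHOLD: u32 = 3;

/// Per-signer health record; field names follow the sequence diagram.
#[derive(Debug)]
pub struct HealthState {
    pub is_healthy: bool,
    pub consecutive_failures: u32,
    pub probe_in_flight: bool,
    pub last_failed_at: Option<Instant>,
}

impl Default for HealthState {
    fn default() -> Self {
        Self {
            is_healthy: true,
            consecutive_failures: 0,
            probe_in_flight: false,
            last_failed_at: None,
        }
    }
}

impl HealthState {
    /// Healthy signers are always eligible. Unhealthy signers are eligible
    /// only once the cooldown has elapsed and no probe is in flight.
    pub fn is_eligible_for_selection(&self, now: Instant, cooldown: Duration) -> bool {
        if self.is_healthy {
            return true;
        }
        if self.probe_in_flight {
            return false;
        }
        match self.last_failed_at {
            Some(failed_at) => now.saturating_duration_since(failed_at) >= cooldown,
            None => true,
        }
    }

    /// Returns true if the signer may be used. For an unhealthy signer this
    /// also marks a probe as in flight so that only one caller probes it.
    pub fn try_acquire_probe_lock_if_needed(&mut self, now: Instant, cooldown: Duration) -> bool {
        if self.is_healthy {
            return true; // no probe lock needed
        }
        if !self.is_eligible_for_selection(now, cooldown) {
            return false;
        }
        self.probe_in_flight = true;
        true
    }

    /// Any successful signature resets the record to its default state.
    pub fn record_signing_success(&mut self) {
        *self = HealthState::default();
    }

    /// Counts the failure; at the threshold the signer is marked unhealthy,
    /// the probe flag is cleared, and the cooldown clock restarts.
    pub fn record_signing_failure(&mut self, now: Instant) {
        self.consecutive_failures = self.consecutive_failures.saturating_add(1);
        if self.consecutive_failures >= FAILURE_THRESHOLD {
            self.is_healthy = false;
            self.probe_in_flight = false;
            self.last_failed_at = Some(now);
        }
    }
}
```

In this sketch a failed probe lands in the same branch as the third failure, so it clears `probe_in_flight` and restarts the cooldown, while a successful probe restores the signer through `record_signing_success`.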
Reviews (2): Last reviewed commit: "fix(tests): update jito bundle integrati..."
Harden signer pool health handling so unhealthy pools fail fast instead of routing to unhealthy signers. Centralize probe eligibility logic and add lease-based stale probe lock recovery to avoid stuck in-flight probes. Also wire bundle signing health telemetry and align transaction RPC test imports to use mocked config under cfg(test), which fixes pre-existing invalid signer-key test failures. Validated with cargo fmt and cargo test -p kora-lib.
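One way to picture lease-based recovery of a stale probe lock is a lock that records when it was taken and lets a later caller reclaim it once the lease has expired. The sketch below is illustrative only: `ProbeLease`, its methods, and the lease duration are hypothetical names and values, not the types used in this commit.

```rust
use std::time::{Duration, Instant};

/// Hypothetical probe lock that expires after a lease, so a probe that never
/// reports back cannot block future probes forever.
pub struct ProbeLease {
    acquired_at: Option<Instant>,
    lease: Duration,
}

impl ProbeLease {
    pub fn new(lease: Duration) -> Self {
        Self { acquired_at: None, lease }
    }

    /// Succeeds if the lock is free or if the previous holder's lease expired.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        let available = match self.acquired_at {
            None => true,
            Some(taken_at) => now.saturating_duration_since(taken_at) >= self.lease,
        };
        if available {
            self.acquired_at = Some(now);
        }
        available
    }

    /// Called when the probe finishes, whether it succeeded or failed.
    pub fn release(&mut self) {
        self.acquired_at = None;
    }
}
```

The trade-off in this design is the lease length: too short and two probes may overlap on a slow signer, too long and a crashed probe delays recovery.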
Sanitize signer and pool errors in bundle and transaction signing telemetry paths before logging or returning SigningError. Validated with cargo fmt plus signer::bundle_signer and transaction::versioned_transaction tests.
✅ Fork external live tests passed. fork-external-live-pass:0e4dade121c6efe73f2ea3b6e1e6388dc62a60da
Thank you sir
Remote signers communicate over HTTP and can fail intermittently. Previously,
SignerPool had no mechanism to detect degraded signers, so it kept routing
requests to a failing signer indefinitely.
This PR adds per-signer health tracking with automatic failover: