feat(signer): add health monitoring and automatic failover for signer pool #399

Merged
dev-jodee merged 7 commits into solana-foundation:main from raushan728:feat/signer-health-monitoring
Apr 2, 2026


@raushan728 raushan728 commented Mar 23, 2026

Remote signers communicate over HTTP and can fail intermittently. Previously,
SignerPool had no mechanism to detect degraded signers — it would keep routing
requests to a failing signer indefinitely.

This PR adds per-signer health tracking with automatic failover:

  • After 3 consecutive failures, a signer is marked unhealthy and excluded from selection
  • An unhealthy signer becomes eligible for a single recovery probe after a 30-second cooldown
  • Pinned signers (get_signer_by_pubkey) now also respect the recovery probe
  • Health is restored automatically on successful signing or after a successful probe
  • Health state is consolidated into a single HealthState struct (eliminates the paired-lock pattern)
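
The rules in the list above can be sketched roughly as follows. This is an illustration only, not the code from the PR: `SignerHealth`, the field names, and the constant names are assumptions, and `std::sync::Mutex` stands in for `parking_lot::Mutex` so the snippet needs no external crates. Probe bookkeeping is left out, and the current time is passed in explicitly so the logic can be exercised without sleeping.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Thresholds quoted in the PR description; the constant names are hypothetical.
const MAX_CONSECUTIVE_FAILURES: u32 = 3;
const RECOVERY_COOLDOWN: Duration = Duration::from_secs(30);

/// All health fields sit behind one lock, so they cannot drift apart.
#[derive(Debug)]
struct HealthState {
    consecutive_failures: u32,
    is_healthy: bool,
    last_failed_at: Option<Instant>,
}

impl Default for HealthState {
    fn default() -> Self {
        Self { consecutive_failures: 0, is_healthy: true, last_failed_at: None }
    }
}

/// Hypothetical per-signer wrapper; the real pool tracks more than this.
#[derive(Debug, Default)]
struct SignerHealth {
    state: Mutex<HealthState>,
}

impl SignerHealth {
    /// Count a failure and mark the signer unhealthy at the threshold.
    fn record_failure(&self, now: Instant) {
        let mut s = self.state.lock().unwrap();
        s.consecutive_failures += 1;
        if s.consecutive_failures >= MAX_CONSECUTIVE_FAILURES {
            s.is_healthy = false;
            s.last_failed_at = Some(now);
        }
    }

    /// A successful signature resets the whole state in one step.
    fn record_success(&self) {
        *self.state.lock().unwrap() = HealthState::default();
    }

    /// Healthy signers are always eligible; unhealthy ones only after the cooldown.
    fn is_eligible(&self, now: Instant) -> bool {
        let s = self.state.lock().unwrap();
        if s.is_healthy {
            return true;
        }
        match s.last_failed_at {
            Some(t) => now.saturating_duration_since(t) >= RECOVERY_COOLDOWN,
            None => true,
        }
    }
}
```

Because every field is read and written under the same lock, a reader can never observe a failure count at the threshold together with a stale healthy flag, which is the kind of inconsistency a pair of independent atomics allows.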


greptile-apps bot commented Mar 23, 2026

Greptile Summary

This PR introduces per-signer health monitoring and automatic failover for the SignerPool. It consolidates the previously racy paired atomics (consecutive_failures + is_healthy) into a single parking_lot::Mutex<HealthState>, fully resolving the race condition flagged in prior review threads. Unhealthy signers are excluded after 3 consecutive failures and re-admitted for a single recovery probe after a 30-second cooldown; a 10-second stale-probe lease prevents stuck locks. Health callbacks are wired into both the regular transaction signing path (versioned_transaction.rs) and the bundle signing path (bundle_signer.rs). The remaining findings are P2 observability and style suggestions that do not block merge.

Confidence Score: 4/5

Safe to merge; all prior P0/P1 threading concerns are resolved and the new logic is well-tested.

All three previously flagged issues (atomic race, unused error variant, no time-based recovery) are addressed. The remaining findings are P2: health status is not exposed in SignerInfo (observability gap) and a silent fallback in weighted_select_from. Neither blocks correctness. Score is 4 rather than 5 only because the SignerInfo gap is a notable operator UX regression given that health monitoring is the central feature of this PR.

crates/lib/src/signer/pool.rs — get_signers_info and weighted_select_from edge cases worth addressing before shipping to production monitoring.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/lib/src/signer/pool.rs | Core of the PR: replaces paired atomics with a single parking_lot::Mutex, adds per-signer health tracking with 30-second recovery probes, stale-lock lease, and a retry loop in get_next_signer. Logic is sound; SignerInfo omits health fields (observability gap). |
| crates/lib/src/transaction/versioned_transaction.rs | Adds record_signing_success/record_signing_failure callbacks around sign_message in sign_and_serialize; error path propagates correctly. |
| crates/lib/src/signer/bundle_signer.rs | Mirrors the health-tracking pattern from versioned_transaction.rs for the bundle signing path; no issues found. |
| crates/lib/src/rpc_server/method/sign_transaction.rs | Adds #[cfg(test)] conditional import to swap in mock_state::get_config; borrows get_config() result via &; no functional change to signing logic. |
| crates/lib/src/rpc_server/method/sign_and_send_transaction.rs | Same conditional-import and &get_config() borrow change as sign_transaction.rs; no issues. |
| tests/external/jito_integration.rs | Removes intermediate VersionedTransactionResolved step; now passes the already-encoded &[String] directly to send_bundle, matching the updated JitoClient::send_bundle signature. |
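
The callback pattern described for versioned_transaction.rs and bundle_signer.rs might look something like the sketch below. The method names `record_signing_success` and `record_signing_failure` come from the summary above; the `HealthSink` trait, the `sign_with_health` helper, and the `Counter` type are hypothetical and exist only to make the example self-contained.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical stand-in for the pool's health callbacks.
trait HealthSink {
    fn record_signing_success(&self);
    fn record_signing_failure(&self);
}

/// Run a signing operation and report its outcome to the health sink.
/// The original result is returned unchanged so the error still propagates.
fn sign_with_health<T, E>(
    sink: &impl HealthSink,
    sign: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    match sign() {
        Ok(signature) => {
            sink.record_signing_success();
            Ok(signature)
        }
        Err(err) => {
            sink.record_signing_failure();
            Err(err)
        }
    }
}

/// Toy sink that only counts outcomes (illustration only).
#[derive(Default)]
struct Counter {
    ok: AtomicU32,
    failed: AtomicU32,
}

impl HealthSink for Counter {
    fn record_signing_success(&self) {
        self.ok.fetch_add(1, Ordering::Relaxed);
    }
    fn record_signing_failure(&self) {
        self.failed.fetch_add(1, Ordering::Relaxed);
    }
}
```

Wrapping the call site rather than the signer keeps health reporting in one place per signing path, which is presumably why the bundle path can mirror the transaction path so closely.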

Sequence Diagram

sequenceDiagram
    participant Client
    participant SignerPool
    participant HealthState
    participant Signer

    Client->>SignerPool: get_next_signer()
    SignerPool->>HealthState: is_eligible_for_selection()
    alt signer healthy
        HealthState-->>SignerPool: true
    else unhealthy + cooldown elapsed + no probe in flight
        HealthState-->>SignerPool: true (probe eligible)
        SignerPool->>HealthState: try_acquire_probe_lock_if_needed()
        HealthState-->>SignerPool: probe_in_flight = true
    else unhealthy within cooldown OR probe in flight
        HealthState-->>SignerPool: false (excluded)
    end
    SignerPool-->>Client: Arc<Signer>

    Client->>Signer: sign_message()
    alt signing succeeds
        Signer-->>Client: Signature
        Client->>SignerPool: record_signing_success()
        SignerPool->>HealthState: reset to HealthState::default()
    else signing fails
        Signer-->>Client: Error
        Client->>SignerPool: record_signing_failure()
        SignerPool->>HealthState: consecutive_failures += 1
        alt consecutive_failures >= 3
            HealthState->>HealthState: is_healthy=false, probe_in_flight=false, last_failed_at=now()
        end
        Client-->>Client: return KoraError::SigningError
    end
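
The probe branch of the diagram, together with the 10-second stale-probe lease mentioned in the summary, could be sketched as below. This is a guess at the shape of the logic, not the PR's implementation: `ProbeState`, `try_acquire_probe`, `probe_started_at`, and the constant names are assumptions, and the `&mut self` receiver stands in for a caller that already holds the health lock.

```rust
use std::time::{Duration, Instant};

// Durations taken from the review summary; the names are hypothetical.
const RECOVERY_COOLDOWN: Duration = Duration::from_secs(30);
const PROBE_LEASE: Duration = Duration::from_secs(10);

#[derive(Debug)]
struct ProbeState {
    is_healthy: bool,
    last_failed_at: Option<Instant>,
    /// `Some(start)` means a recovery probe is currently in flight.
    probe_started_at: Option<Instant>,
}

impl ProbeState {
    /// Returns true if this caller may send the single recovery probe.
    fn try_acquire_probe(&mut self, now: Instant) -> bool {
        if self.is_healthy {
            return false; // healthy signers need no probe
        }
        if let Some(started) = self.probe_started_at {
            if now.saturating_duration_since(started) < PROBE_LEASE {
                return false; // another probe still holds a valid lease
            }
            // The lease expired: treat the old probe as lost and fall through.
        }
        let cooled_down = self
            .last_failed_at
            .map_or(true, |t| now.saturating_duration_since(t) >= RECOVERY_COOLDOWN);
        if !cooled_down {
            return false; // still inside the cooldown window
        }
        self.probe_started_at = Some(now);
        true
    }
}
```

Storing the probe's start time instead of a plain boolean is what makes the lease possible: a caller that acquired the probe and never reported back cannot block recovery for longer than the lease duration.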

Reviews (2): Last reviewed commit: "fix(tests): update jito bundle integrati..."


@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.


Harden signer pool health handling so unhealthy pools fail fast instead of routing to unhealthy signers. Centralize probe eligibility logic and add lease-based stale probe lock recovery to avoid stuck in-flight probes.
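
As a rough illustration of the fail-fast behaviour, the sketch below returns an error when no signer in the pool is eligible instead of handing back an unhealthy one. It uses a simple round-robin scan rather than the pool's weighted selection, and `Entry`, `PoolError`, and `next_signer` are hypothetical names.

```rust
/// Hypothetical error type; the real pool reports errors through its own type.
#[derive(Debug, PartialEq)]
enum PoolError {
    NoHealthySigners,
}

/// Minimal pool entry: a name plus a precomputed eligibility flag.
struct Entry {
    name: &'static str,
    eligible: bool,
}

/// Scan the pool once, starting at `cursor`, and return the first eligible
/// signer. If every signer is ineligible (or the pool is empty), fail
/// immediately rather than route the request to an unhealthy signer.
fn next_signer(entries: &[Entry], cursor: usize) -> Result<&Entry, PoolError> {
    let n = entries.len();
    (0..n)
        .map(|i| &entries[(cursor + i) % n])
        .find(|e| e.eligible)
        .ok_or(PoolError::NoHealthySigners)
}
```

Returning an explicit error lets the caller surface the outage immediately, rather than spending a signing attempt on a signer already known to be failing.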

Also wire bundle signing health telemetry and align transaction RPC test imports to use mocked config under cfg(test), which fixes pre-existing invalid signer-key test failures.

Validated with cargo fmt and cargo test -p kora-lib.

Sanitize signer and pool errors in bundle and transaction signing telemetry paths before logging or returning SigningError.

Validated with cargo fmt plus signer::bundle_signer and transaction::versioned_transaction tests.

github-actions bot commented Apr 2, 2026

✅ Fork external live tests passed.

fork-external-live-pass:0e4dade121c6efe73f2ea3b6e1e6388dc62a60da
run: https://github.com/solana-foundation/kora/actions/runs/23913498177

@dev-jodee dev-jodee merged commit 776e255 into solana-foundation:main Apr 2, 2026
13 of 14 checks passed
@raushan728

Thank you sir

@raushan728 raushan728 deleted the feat/signer-health-monitoring branch April 3, 2026 13:18