
feat(signer): implement graceful retry and timeout for remote signers#412

Open
raushan728 wants to merge 9 commits into solana-foundation:main from raushan728:feat/signer-retry-timeout

Conversation

@raushan728 (Contributor) commented Mar 28, 2026

Remote signers (KMS, Fireblocks) can hang indefinitely on HTTP calls.
This PR adds a configurable retry and timeout mechanism to prevent that.

Changes:

  • Each signing attempt wrapped with tokio::timeout (default 10s)
  • Exponential backoff (100ms × 2^exp) capped at ~12.8s (exp max 7)
  • New config fields: sign_timeout_seconds (default 10), sign_max_retries (default 2)
  • Validator rejects timeout < 1s, warns if retries > 10
  • Signing failures reported to SignerPool once per request (not per retry) to prevent single transient failure from blacklisting a signer
  • TOML serialization order for sign_timeout_seconds and sign_max_retries
  • Added #[serial] on tests sharing global state to fix flakiness
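The backoff schedule in the bullets above can be sketched as a small pure function. This is an illustrative standalone sketch (the helper name `backoff_delay` is hypothetical; the real logic lives inside the retry loop in `versioned_transaction.rs`):

```rust
use std::time::Duration;

/// Hypothetical helper illustrating the schedule described above:
/// 100ms * 2^attempt, with the exponent capped at 7, so the delay
/// tops out at 12.8s no matter how many retries are configured.
fn backoff_delay(attempt: u32) -> Duration {
    let exp = attempt.min(7);
    Duration::from_millis(100) * 2u32.pow(exp)
}

fn main() {
    assert_eq!(backoff_delay(0), Duration::from_millis(100));
    assert_eq!(backoff_delay(3), Duration::from_millis(800));
    // Cap: attempts beyond exponent 7 keep the same 12.8s delay.
    assert_eq!(backoff_delay(7), Duration::from_millis(12_800));
    assert_eq!(backoff_delay(20), Duration::from_millis(12_800));
    println!("ok");
}
```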


@raushan728 (Contributor, Author)

Hi @dev-jodee, this PR depends on feat/signer-health-monitoring for pool telemetry. The diff will resolve automatically once the health PR is merged into main.

devin-ai-integration[bot]

This comment was marked as resolved.

greptile-apps bot (Contributor) commented Mar 28, 2026

Greptile Summary

This PR adds configurable retry and timeout logic to remote signer calls (sign_message) in VersionedTransactionResolved::sign_transaction, along with a new health-tracking mechanism in SignerPool that marks signers as unhealthy after repeated failures and probes for recovery after a 30-second cooldown. The two new KoraConfig fields (sign_timeout_seconds, sign_max_retries) are properly serialized, defaulted, and validated.

Key changes:

  • versioned_transaction.rs: Retry loop wraps each sign_message call in tokio::timeout; exponential backoff (100ms × 2^exp, exponent capped at 7) between attempts; success/failure reported to SignerPool.
  • pool.rs: SignerWithMetadata gains two Mutex-protected health fields; healthy_signers() filters the pool before selection; get_next_signer and get_signer_by_pubkey both respect the health check with a 30-second recovery probe window.
  • config.rs / config_validator.rs: Two new fields with serde defaults and startup validation (error on timeout = 0, warn on retries > 10).
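The startup validation described above can be sketched as a standalone function. This is a minimal sketch under assumed names (`validate_signing_config` is hypothetical; the real checks live in `config_validator.rs` and use the crate's error and logging machinery):

```rust
/// Illustrative sketch of the validation rules described in the summary:
/// a zero timeout is a hard error, an unusually high retry count only warns.
fn validate_signing_config(
    sign_timeout_seconds: u64,
    sign_max_retries: u32,
) -> Result<Vec<String>, String> {
    if sign_timeout_seconds == 0 {
        // Error path: a 0s timeout would make every signing attempt fail.
        return Err("sign_timeout_seconds must be at least 1".to_string());
    }
    let mut warnings = Vec::new();
    if sign_max_retries > 10 {
        // Warn path: each retry can add up to sign_timeout_seconds of latency.
        warnings.push(format!("sign_max_retries = {sign_max_retries} is unusually high"));
    }
    Ok(warnings)
}

fn main() {
    assert!(validate_signing_config(0, 2).is_err());
    assert!(validate_signing_config(10, 2).unwrap().is_empty());
    assert_eq!(validate_signing_config(10, 11).unwrap().len(), 1);
    println!("ok");
}
```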

Issues found:

  • P1 — Each retry attempt within a single request calls record_signing_failure independently. With defaults (sign_max_retries = 2, MAX_CONSECUTIVE_FAILURES = 3), a single request that exhausts all retries generates exactly 3 failure reports, immediately marking the signer unhealthy. This couples two otherwise-independent constants in a non-obvious way and defeats the retry mechanism's goal of transparently handling transient failures.
  • P2 — health and last_failed_at are two separate Mutexes that are always acquired together in a fixed order. While deadlock-safe today, combining them into a single struct under one lock would be simpler and less fragile.
  • P2 — get_signers_info / SignerInfo doesn't expose health state, making it impossible to observe via monitoring endpoints which signers are being bypassed.
  • P2 — The global current_index round-robin counter is modulo'd against the filtered healthy slice; pool shrinkage can cause non-uniform distribution until the counter wraps.
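The P1 arithmetic can be made concrete with a small simulation (names are illustrative, not the crate's actual API):

```rust
/// Illustrative simulation of the P1 finding: failure reports filed per
/// *attempt* versus per *request*, against a threshold of 3 consecutive
/// failures (the hardcoded MAX_CONSECUTIVE_FAILURES from the review).
const MAX_CONSECUTIVE_FAILURES: u32 = 3;

fn is_unhealthy_after(failure_reports: u32) -> bool {
    failure_reports >= MAX_CONSECUTIVE_FAILURES
}

fn main() {
    let sign_max_retries = 2; // default: 1 initial attempt + 2 retries
    let attempts = 1 + sign_max_retries;

    // Per-attempt reporting: one fully failed request files 3 reports,
    // hitting the threshold and blacklisting the signer immediately.
    assert!(is_unhealthy_after(attempts));

    // Per-request reporting: the same failed request files one report,
    // so three independent failed requests are needed before blacklisting.
    assert!(!is_unhealthy_after(1));
    println!("ok");
}
```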

Confidence Score: 4/5

Safe to merge after addressing the per-retry health reporting — the current defaults cause a single failed request to immediately blacklist a signer, undermining the retry mechanism's resilience goal.

The P1 finding is a real present-behavior defect: with the default sign_max_retries = 2 and the hardcoded MAX_CONSECUTIVE_FAILURES = 3, one request exhausting all retries generates exactly 3 consecutive record_signing_failure calls, hitting the threshold and marking the signer unhealthy immediately. This makes healthy signers susceptible to a 30-second blackout after a single transient failure — the opposite of what retries are meant to achieve. The P2 findings are quality/observability concerns that do not block correctness. Config, validator, and test-mock changes are clean.

Files needing attention: crates/lib/src/transaction/versioned_transaction.rs (per-retry health reporting) and crates/lib/src/signer/pool.rs (dual-mutex design, observability gap).

Important Files Changed

| Filename | Overview |
|---|---|
| crates/lib/src/transaction/versioned_transaction.rs | Adds retry/timeout loop around sign_message; each retry attempt reports independently to the health pool, meaning one fully-retried request can immediately mark a signer unhealthy (P1). |
| crates/lib/src/signer/pool.rs | Adds health tracking (consecutive failures, 30s recovery probe) to SignerPool; lock ordering is consistent so no deadlock, but two separate Mutexes for health state are fragile; round-robin index skew and missing observability are minor concerns. |
| crates/lib/src/config.rs | Adds sign_timeout_seconds (default 10) and sign_max_retries (default 2) to KoraConfig with serde defaults and Default impl; straightforward and correct. |
| crates/lib/src/validator/config_validator.rs | Adds validation: errors on sign_timeout_seconds == 0, warns if sign_max_retries > 10; correct and consistent with existing validator patterns. |
| crates/lib/src/tests/config_mock.rs | Adds hardcoded defaults for the two new fields in both mock builders; no issues. |
| crates/lib/src/tests/toml_mock.rs | Extends ConfigBuilder with with_signing_config helper and TOML serialization for the two new fields; clean addition with no issues. |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant VTx as VersionedTransactionResolved
    participant Pool as SignerPool
    participant Signer as Remote Signer

    Caller->>Pool: get_next_signer()
    Pool->>Pool: healthy_signers() — filter unhealthy
    Pool-->>Caller: Arc<Signer>

    Caller->>VTx: sign_transaction(config, signer, rpc_client)

    loop attempt 0..=max_retries
        VTx->>VTx: tokio::timeout(sign_timeout, ...)
        VTx->>Signer: sign_message(&message_bytes)

        alt Success
            Signer-->>VTx: Ok(signature)
            VTx->>Pool: record_signing_success(signer)
            VTx-->>Caller: (transaction, encoded)
        else Signing error
            Signer-->>VTx: Err(e)
            VTx->>Pool: record_signing_failure(signer) per-attempt
            VTx->>VTx: backoff (100ms x 2^exp)
        else Timeout
            VTx->>Pool: record_signing_failure(signer) per-attempt
            VTx->>VTx: backoff (100ms x 2^exp)
        end
    end

    opt All retries exhausted
        VTx-->>Caller: Err(SigningError)
    end

    Note over Pool: After MAX_CONSECUTIVE_FAILURES (3), the signer is marked unhealthy. A recovery probe is allowed after 30s.

Comments Outside Diff (1)

  1. crates/lib/src/signer/pool.rs, lines 297-307

    P2 get_signers_info Omits Health State — Limits Observability

    SignerInfo doesn't surface is_healthy or consecutive_failures. Operators consuming /getConfig or similar monitoring endpoints have no way to tell which signers are actively being skipped by the health filter. Given that this PR introduces the health mechanism as a production reliability feature, exposing it in the monitoring surface would make it far easier to debug degraded pools.

    Consider adding a healthy: bool (or consecutive_failures: u32) field to SignerInfo and populating it from the health lock in get_signers_info.
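A sketch of what that suggestion could look like. The field names follow the reviewer's wording and are hypothetical; the real SignerInfo in pool.rs has its own fields and derives:

```rust
/// Hypothetical extension of SignerInfo carrying the health state the
/// reviewer asks for, so monitoring endpoints can show bypassed signers.
#[derive(Debug, Clone)]
struct SignerInfo {
    name: String,
    public_key: String,
    healthy: bool,
    consecutive_failures: u32,
}

fn main() {
    // An unhealthy signer as it would appear via the monitoring surface,
    // making it obvious why healthy_signers() is skipping it.
    let info = SignerInfo {
        name: "kms-signer-1".to_string(),
        public_key: "placeholder".to_string(),
        healthy: false,
        consecutive_failures: 3,
    };
    assert!(!info.healthy);
    assert_eq!(info.consecutive_failures, 3);
    println!("ok");
}
```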

Reviews (1): Last reviewed commit: "merge(signer): resolve conflicts and fix..."

Comment on lines +263 to +268
fn round_robin_select_from<'a>(
    &self,
    signers: &[&'a SignerWithMetadata],
) -> Result<&'a SignerWithMetadata, KoraError> {
    let index = self.current_index.fetch_add(1, Ordering::AcqRel);
-   let signer_index = index % self.signers.len();
-   Ok(&self.signers[signer_index])
+   Ok(signers[index % signers.len()])
Contributor:

P2 Round-Robin Counter Skews Distribution When Healthy Pool Shrinks

current_index is a global monotonically-increasing counter that is modulo'd against healthy.len(). When the healthy slice shrinks (e.g., from 3 signers to 2 because one became unhealthy), the modulo boundary shifts and two consecutive selections can land on the same signer, breaking the strict alternating guarantee.

This is a pre-existing design trade-off, but it becomes more visible with the dynamic health filtering introduced by this PR. Consider documenting the known limitation or resetting the counter when healthy pool composition changes.
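A small self-contained simulation of the skew (pool contents and the free-standing `select` helper are illustrative, mirroring the modulo logic in round_robin_select_from):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative reproduction of the skew: a global monotonic counter
/// modulo'd against a healthy slice whose size changes between calls.
fn select<'a>(counter: &AtomicUsize, healthy: &[&'a str]) -> &'a str {
    let index = counter.fetch_add(1, Ordering::AcqRel);
    healthy[index % healthy.len()]
}

fn main() {
    let counter = AtomicUsize::new(0);

    // Healthy pool of two: strict alternation while the size is stable.
    let two = ["A", "B"];
    assert_eq!(select(&counter, &two), "A"); // counter 0 -> 0 % 2
    assert_eq!(select(&counter, &two), "B"); // counter 1 -> 1 % 2
    assert_eq!(select(&counter, &two), "A"); // counter 2 -> 2 % 2

    // A third signer recovers; the modulo boundary shifts and "A" is
    // picked twice in a row, breaking the alternating guarantee.
    let three = ["A", "B", "C"];
    assert_eq!(select(&counter, &three), "A"); // counter 3 -> 3 % 3
    println!("ok");
}
```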

raushan728 (Contributor, Author):

Known trade-off with monotonic counter. Distribution skew is transient
and self-corrects. Will document in follow-up if needed.


@raushan728 force-pushed the feat/signer-retry-timeout branch from f67e065 to 7d3c420 on March 29, 2026 at 06:38
