
feat(signer): implement graceful retry and timeout for remote signers#412

Open
raushan728 wants to merge 9 commits into solana-foundation:main from raushan728:feat/signer-retry-timeout

Conversation

@raushan728 (Contributor) commented Mar 28, 2026

Remote signers (KMS, Fireblocks) can hang indefinitely on HTTP calls.
This PR adds a configurable retry and timeout mechanism to prevent that.

Changes:

  • Each signing attempt wrapped with tokio::timeout (default 10s)
  • Exponential backoff (100ms × 2^exp) capped at ~12.8s (exp max 7)
  • New config fields: sign_timeout_seconds (default 10), sign_max_retries (default 2)
  • Validator rejects timeout < 1s, warns if retries > 10
  • Signing failures reported to SignerPool once per request (not per retry) to prevent single transient failure from blacklisting a signer
  • TOML serialization order for sign_timeout_seconds and sign_max_retries
  • Added #[serial] on tests sharing global state to fix flakiness
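The backoff schedule in the bullets above can be sketched as a small pure function. This is an illustrative standalone sketch (the helper name `backoff_delay` is hypothetical; the real logic lives inside the retry loop in `versioned_transaction.rs`):

```rust
use std::time::Duration;

/// Hypothetical helper illustrating the schedule described above:
/// 100ms * 2^attempt, with the exponent capped at 7, so the delay
/// tops out at 12.8s no matter how many retries are configured.
fn backoff_delay(attempt: u32) -> Duration {
    let exp = attempt.min(7);
    Duration::from_millis(100) * 2u32.pow(exp)
}

fn main() {
    assert_eq!(backoff_delay(0), Duration::from_millis(100));
    assert_eq!(backoff_delay(3), Duration::from_millis(800));
    // Cap: attempts beyond exponent 7 keep the same 12.8s delay.
    assert_eq!(backoff_delay(7), Duration::from_millis(12_800));
    assert_eq!(backoff_delay(20), Duration::from_millis(12_800));
    println!("ok");
}
```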


@raushan728 (Contributor, Author)

Hi @dev-jodee, this PR depends on feat/signer-health-monitoring for pool telemetry. The diff will resolve automatically once the health PR is merged into main.

devin-ai-integration[bot]

This comment was marked as resolved.

greptile-apps bot (Contributor) commented Mar 28, 2026

Greptile Summary

This PR adds configurable retry and timeout logic to remote signer calls (sign_message) in VersionedTransactionResolved::sign_transaction, along with a new health-tracking mechanism in SignerPool that marks signers as unhealthy after repeated failures and probes for recovery after a 30-second cooldown. The two new KoraConfig fields (sign_timeout_seconds, sign_max_retries) are properly serialized, defaulted, and validated.

Key changes:

  • versioned_transaction.rs: Retry loop wraps each sign_message call in tokio::timeout; exponential backoff (100ms × 2^exp, exponent capped at 7) between attempts; success/failure reported to SignerPool.
  • pool.rs: SignerWithMetadata gains two Mutex-protected health fields; healthy_signers() filters the pool before selection; get_next_signer and get_signer_by_pubkey both respect the health check with a 30-second recovery probe window.
  • config.rs / config_validator.rs: Two new fields with serde defaults and startup validation (error on timeout = 0, warn on retries > 10).
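The startup validation described above can be sketched as a standalone function. This is a minimal sketch under assumed names (`validate_signing_config` is hypothetical; the real checks live in `config_validator.rs` and use the crate's error and logging machinery):

```rust
/// Illustrative sketch of the validation rules described in the summary:
/// a zero timeout is a hard error, an unusually high retry count only warns.
fn validate_signing_config(
    sign_timeout_seconds: u64,
    sign_max_retries: u32,
) -> Result<Vec<String>, String> {
    if sign_timeout_seconds == 0 {
        // Error path: a 0s timeout would make every signing attempt fail.
        return Err("sign_timeout_seconds must be at least 1".to_string());
    }
    let mut warnings = Vec::new();
    if sign_max_retries > 10 {
        // Warn path: each retry can add up to sign_timeout_seconds of latency.
        warnings.push(format!("sign_max_retries = {sign_max_retries} is unusually high"));
    }
    Ok(warnings)
}

fn main() {
    assert!(validate_signing_config(0, 2).is_err());
    assert!(validate_signing_config(10, 2).unwrap().is_empty());
    assert_eq!(validate_signing_config(10, 11).unwrap().len(), 1);
    println!("ok");
}
```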

Issues found:

  • P1 — Each retry attempt within a single request calls record_signing_failure independently. With defaults (sign_max_retries = 2, MAX_CONSECUTIVE_FAILURES = 3), a single request that exhausts all retries generates exactly 3 failure reports, immediately marking the signer unhealthy. This couples two otherwise-independent constants in a non-obvious way and defeats the retry mechanism's goal of transparently handling transient failures.
  • P2 — health and last_failed_at are two separate Mutexes that are always acquired together in a fixed order. While deadlock-safe today, combining them into a single struct under one lock would be simpler and less fragile.
  • P2 — get_signers_info / SignerInfo doesn't expose health state, making it impossible to observe via monitoring endpoints which signers are being bypassed.
  • P2 — The global current_index round-robin counter is modulo'd against the filtered healthy slice; pool shrinkage can cause non-uniform distribution until the counter wraps.
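The P1 arithmetic can be made concrete with a small simulation (names are illustrative, not the crate's actual API):

```rust
/// Illustrative simulation of the P1 finding: failure reports filed per
/// *attempt* versus per *request*, against a threshold of 3 consecutive
/// failures (the hardcoded MAX_CONSECUTIVE_FAILURES from the review).
const MAX_CONSECUTIVE_FAILURES: u32 = 3;

fn is_unhealthy_after(failure_reports: u32) -> bool {
    failure_reports >= MAX_CONSECUTIVE_FAILURES
}

fn main() {
    let sign_max_retries = 2; // default: 1 initial attempt + 2 retries
    let attempts = 1 + sign_max_retries;

    // Per-attempt reporting: one fully failed request files 3 reports,
    // hitting the threshold and blacklisting the signer immediately.
    assert!(is_unhealthy_after(attempts));

    // Per-request reporting: the same failed request files one report,
    // so three independent failed requests are needed before blacklisting.
    assert!(!is_unhealthy_after(1));
    println!("ok");
}
```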

Confidence Score: 4/5

Safe to merge after addressing the per-retry health reporting — the current defaults cause a single failed request to immediately blacklist a signer, undermining the retry mechanism's resilience goal.

The P1 finding is a real present-behavior defect: with the default sign_max_retries = 2 and the hardcoded MAX_CONSECUTIVE_FAILURES = 3, one request exhausting all retries generates exactly 3 consecutive record_signing_failure calls, hitting the threshold and marking the signer unhealthy immediately. This makes healthy signers susceptible to a 30-second blackout after a single transient failure — the opposite of what retries are meant to achieve. The P2 findings are quality/observability concerns that do not block correctness. Config, validator, and test-mock changes are clean.

Files needing attention: crates/lib/src/transaction/versioned_transaction.rs (per-retry health reporting) and crates/lib/src/signer/pool.rs (dual-mutex design, observability gap).

Important Files Changed

| Filename | Overview |
|---|---|
| crates/lib/src/transaction/versioned_transaction.rs | Adds retry/timeout loop around sign_message; each retry attempt reports independently to the health pool, meaning one fully-retried request can immediately mark a signer unhealthy (P1). |
| crates/lib/src/signer/pool.rs | Adds health tracking (consecutive failures, 30s recovery probe) to SignerPool; lock ordering is consistent so no deadlock, but two separate Mutexes for health state are fragile; round-robin index skew and missing observability are minor concerns. |
| crates/lib/src/config.rs | Adds sign_timeout_seconds (default 10) and sign_max_retries (default 2) to KoraConfig with serde defaults and Default impl; straightforward and correct. |
| crates/lib/src/validator/config_validator.rs | Adds validation: errors on sign_timeout_seconds == 0, warns if sign_max_retries > 10; correct and consistent with existing validator patterns. |
| crates/lib/src/tests/config_mock.rs | Adds hardcoded defaults for the two new fields in both mock builders; no issues. |
| crates/lib/src/tests/toml_mock.rs | Extends ConfigBuilder with with_signing_config helper and TOML serialization for the two new fields; clean addition with no issues. |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant VTx as VersionedTransactionResolved
    participant Pool as SignerPool
    participant Signer as Remote Signer

    Caller->>Pool: get_next_signer()
    Pool->>Pool: healthy_signers() — filter unhealthy
    Pool-->>Caller: Arc<Signer>

    Caller->>VTx: sign_transaction(config, signer, rpc_client)

    loop attempt 0..=max_retries
        VTx->>VTx: tokio::timeout(sign_timeout, ...)
        VTx->>Signer: sign_message(&message_bytes)

        alt Success
            Signer-->>VTx: Ok(signature)
            VTx->>Pool: record_signing_success(signer)
            VTx-->>Caller: (transaction, encoded)
        else Signing error
            Signer-->>VTx: Err(e)
            VTx->>Pool: record_signing_failure(signer) per-attempt
            VTx->>VTx: backoff (100ms x 2^exp)
        else Timeout
            VTx->>Pool: record_signing_failure(signer) per-attempt
            VTx->>VTx: backoff (100ms x 2^exp)
        end
    end

    opt All retries exhausted
        VTx-->>Caller: Err(SigningError)
    end

    Note over Pool: After MAX_CONSECUTIVE_FAILURES (3), the signer is marked unhealthy. A recovery probe is allowed after 30s.

Comments Outside Diff (1)

  1. crates/lib/src/signer/pool.rs, lines 297-307

    P2 get_signers_info Omits Health State — Limits Observability

    SignerInfo doesn't surface is_healthy or consecutive_failures. Operators consuming /getConfig or similar monitoring endpoints have no way to tell which signers are actively being skipped by the health filter. Given that this PR introduces the health mechanism as a production reliability feature, exposing it in the monitoring surface would make it far easier to debug degraded pools.

    Consider adding a healthy: bool (or consecutive_failures: u32) field to SignerInfo and populating it from the health lock in get_signers_info.
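A sketch of what that suggestion could look like. The field names follow the reviewer's wording and are hypothetical; the real SignerInfo in pool.rs has its own fields and derives:

```rust
/// Hypothetical extension of SignerInfo carrying the health state the
/// reviewer asks for, so monitoring endpoints can show bypassed signers.
#[derive(Debug, Clone)]
struct SignerInfo {
    name: String,
    public_key: String,
    healthy: bool,
    consecutive_failures: u32,
}

fn main() {
    // An unhealthy signer as it would appear via the monitoring surface,
    // making it obvious why healthy_signers() is skipping it.
    let info = SignerInfo {
        name: "kms-signer-1".to_string(),
        public_key: "placeholder".to_string(),
        healthy: false,
        consecutive_failures: 3,
    };
    assert!(!info.healthy);
    assert_eq!(info.consecutive_failures, 3);
    println!("ok");
}
```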

Reviews (1): Last reviewed commit: "merge(signer): resolve conflicts and fix..."

Comment on lines +263 to +268
fn round_robin_select_from<'a>(
    &self,
    signers: &[&'a SignerWithMetadata],
) -> Result<&'a SignerWithMetadata, KoraError> {
    let index = self.current_index.fetch_add(1, Ordering::AcqRel);
-   let signer_index = index % self.signers.len();
-   Ok(&self.signers[signer_index])
+   Ok(signers[index % signers.len()])
Contributor:

P2 Round-Robin Counter Skews Distribution When Healthy Pool Shrinks

current_index is a global monotonically-increasing counter that is modulo'd against healthy.len(). When the healthy slice shrinks (e.g., from 3 signers to 2 because one became unhealthy), the modulo boundary shifts and two consecutive selections can land on the same signer, breaking the strict alternating guarantee.

This is a pre-existing design trade-off, but it becomes more visible with the dynamic health filtering introduced by this PR. Consider documenting the known limitation or resetting the counter when healthy pool composition changes.
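A small self-contained simulation of the skew (pool contents and the free-standing `select` helper are illustrative, mirroring the modulo logic in round_robin_select_from):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative reproduction of the skew: a global monotonic counter
/// modulo'd against a healthy slice whose size changes between calls.
fn select<'a>(counter: &AtomicUsize, healthy: &[&'a str]) -> &'a str {
    let index = counter.fetch_add(1, Ordering::AcqRel);
    healthy[index % healthy.len()]
}

fn main() {
    let counter = AtomicUsize::new(0);

    // Healthy pool of two: strict alternation while the size is stable.
    let two = ["A", "B"];
    assert_eq!(select(&counter, &two), "A"); // counter 0 -> 0 % 2
    assert_eq!(select(&counter, &two), "B"); // counter 1 -> 1 % 2
    assert_eq!(select(&counter, &two), "A"); // counter 2 -> 2 % 2

    // A third signer recovers; the modulo boundary shifts and "A" is
    // picked twice in a row, breaking the alternating guarantee.
    let three = ["A", "B", "C"];
    assert_eq!(select(&counter, &three), "A"); // counter 3 -> 3 % 3
    println!("ok");
}
```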

raushan728 (Contributor, Author):

Known trade-off with monotonic counter. Distribution skew is transient
and self-corrects. Will document in follow-up if needed.


@raushan728 force-pushed the feat/signer-retry-timeout branch from f67e065 to 7d3c420 on March 29, 2026 at 06:38
