feat(signer): implement graceful retry and timeout for remote signers #412
raushan728 wants to merge 9 commits into solana-foundation:main
hi @dev-jodee this PR depends on |
Greptile Summary
This PR adds configurable retry and timeout logic to remote signer calls ( Key changes:
Issues found:
Confidence Score: 4/5
Safe to merge after addressing the per-retry health reporting: the current defaults cause a single failed request to immediately blacklist a signer, undermining the retry mechanism's resilience goal. The P1 finding is a real present-behavior defect: with the default
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Caller
participant VTx as VersionedTransactionResolved
participant Pool as SignerPool
participant Signer as Remote Signer
Caller->>Pool: get_next_signer()
Pool->>Pool: healthy_signers() — filter unhealthy
Pool-->>Caller: Arc<Signer>
Caller->>VTx: sign_transaction(config, signer, rpc_client)
loop attempt 0..=max_retries
VTx->>VTx: tokio::timeout(sign_timeout, ...)
VTx->>Signer: sign_message(&message_bytes)
alt Success
Signer-->>VTx: Ok(signature)
VTx->>Pool: record_signing_success(signer)
VTx-->>Caller: (transaction, encoded)
else Signing error
Signer-->>VTx: Err(e)
VTx->>Pool: record_signing_failure(signer) per-attempt
VTx->>VTx: backoff (100ms x 2^exp)
else Timeout
VTx->>Pool: record_signing_failure(signer) per-attempt
VTx->>VTx: backoff (100ms x 2^exp)
end
end
opt All retries exhausted
VTx-->>Caller: Err(SigningError)
end
Note over Pool: After MAX_CONSECUTIVE_FAILURES=3 signer marked unhealthy. Recovery probe allowed after 30s.
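The interaction between per-attempt failure reporting and the health threshold in the diagram can be illustrated with a small std-only sketch. It is not the PR's code: `SignerHealth`, `Reporting`, and `simulate_failed_request` are hypothetical names, the exponent cap in `backoff_delay` is an added assumption, and `max_retries = 2` in the usage note is chosen only for illustration. Only the 100 ms base delay, the doubling, and `MAX_CONSECUTIVE_FAILURES = 3` come from the diagram.

```rust
use std::time::Duration;

/// Base delay from the diagram: backoff is 100 ms x 2^exp.
const BASE_BACKOFF_MS: u64 = 100;
/// From the diagram note: a signer is marked unhealthy after 3 consecutive failures.
const MAX_CONSECUTIVE_FAILURES: u32 = 3;

/// Delay applied after failed attempt `attempt` (0-based).
/// Capping the exponent at 10 is an assumption, added to keep the shift bounded.
fn backoff_delay(attempt: u32) -> Duration {
    let exp = attempt.min(10);
    Duration::from_millis(BASE_BACKOFF_MS * (1u64 << exp))
}

/// Hypothetical stand-in for the pool's per-signer health state.
#[derive(Default)]
struct SignerHealth {
    consecutive_failures: u32,
}

impl SignerHealth {
    fn record_failure(&mut self) {
        self.consecutive_failures += 1;
    }

    fn record_success(&mut self) {
        self.consecutive_failures = 0;
    }

    fn is_healthy(&self) -> bool {
        self.consecutive_failures < MAX_CONSECUTIVE_FAILURES
    }
}

/// Whether a failure is reported after every attempt or once per request.
enum Reporting {
    PerAttempt,
    PerRequest,
}

/// Simulates one request in which every attempt fails and returns the
/// signer's resulting health state. No sleeping is done here; a real
/// loop would wait for `backoff_delay(attempt)` between attempts.
fn simulate_failed_request(max_retries: u32, reporting: Reporting) -> SignerHealth {
    let mut health = SignerHealth::default();
    for _attempt in 0..=max_retries {
        if matches!(reporting, Reporting::PerAttempt) {
            health.record_failure();
        }
    }
    if matches!(reporting, Reporting::PerRequest) {
        health.record_failure();
    }
    health
}
```

Assuming `max_retries = 2`, a single request makes three attempts. With `Reporting::PerAttempt` the counter reaches the threshold and the signer is marked unhealthy after that one request, while `Reporting::PerRequest` leaves it at one failure and still healthy.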
  fn round_robin_select_from<'a>(
      &self,
      signers: &[&'a SignerWithMetadata],
  ) -> Result<&'a SignerWithMetadata, KoraError> {
      let index = self.current_index.fetch_add(1, Ordering::AcqRel);
-     let signer_index = index % self.signers.len();
-     Ok(&self.signers[signer_index])
+     Ok(signers[index % signers.len()])
Round-Robin Counter Skews Distribution When Healthy Pool Shrinks
current_index is a global monotonically-increasing counter that is modulo'd against healthy.len(). When the healthy slice shrinks (e.g., from 3 signers to 2 because one became unhealthy), the modulo boundary shifts and two consecutive selections can land on the same signer, breaking the strict alternating guarantee.
This is a pre-existing design trade-off, but it becomes more visible with the dynamic health filtering introduced by this PR. Consider documenting the known limitation or resetting the counter when healthy pool composition changes.
Known trade-off with monotonic counter. Distribution skew is transient
and self-corrects. Will document in follow-up if needed.
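The skew discussed in this thread can be reproduced with a minimal std-only sketch. `Pool` and `demo_skew` are hypothetical names, signers are represented by string labels rather than `SignerWithMetadata`, and the `Option` return for an empty slice is an assumption added so the sketch cannot divide by zero. Only the counter-plus-modulo selection mirrors the diff above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical pool holding only the shared, monotonically increasing counter.
struct Pool {
    current_index: AtomicUsize,
}

impl Pool {
    /// Picks from the slice passed in (the currently healthy signers)
    /// using the global counter modulo the slice length.
    fn round_robin_select_from<'a>(&self, signers: &[&'a str]) -> Option<&'a str> {
        if signers.is_empty() {
            return None;
        }
        let index = self.current_index.fetch_add(1, Ordering::AcqRel);
        Some(signers[index % signers.len()])
    }
}

/// Selects three times from a full pool, then twice after signer "B"
/// has dropped out of the healthy set, and returns the picks in order.
fn demo_skew() -> Vec<&'static str> {
    let pool = Pool { current_index: AtomicUsize::new(0) };
    let mut picks = Vec::new();

    let all = ["A", "B", "C"];
    for _ in 0..3 {
        picks.push(pool.round_robin_select_from(&all).unwrap());
    }

    // "B" becomes unhealthy; the counter keeps its value (3) across the change.
    let healthy = ["A", "C"];
    for _ in 0..2 {
        picks.push(pool.round_robin_select_from(&healthy).unwrap());
    }
    picks
}
```

In this run the picks are `A, B, C, C, A`: counter value 3 modulo the new length 2 lands on `C` again right after `C` was served. The repeat happens once at the boundary, and strict alternation resumes afterwards, which matches the "transient and self-corrects" description in the reply.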
f67e065 to 7d3c420
Remote signers (KMS, Fireblocks) can hang indefinitely on HTTP calls.
This PR adds a configurable retry and timeout mechanism to prevent that.
Changes: