feat: 2-of-3 threshold ECDSA signing (CGGMP21) #14
Slymeofthemonth wants to merge 26 commits into daydreamsai:master
Conversation
- saw-mpc: shared MPC core wrapping cggmp21 (keygen, presigning, signing, transport)
- saw-policy: policy agent binary (Share 2) with rule evaluation engine
- saw-cosigner: human cosigner binary (Share 3) for recovery + override
- All scaffolding with TODO markers at cggmp21 integration points
- Wired cggmp21 keygen, aux_info_gen, and signing into saw-mpc
- Built in-memory transport layer (Delivery impl for testing)
- Integration test: full 2-of-3 keygen → signing → verification
- Presignature pool structure ready for background generation
- All parties derive same ETH address from shared public key
Mode 1: Single-key (SAW as-is, no threshold)
Mode 2: Self-hosted threshold (separate machine)
Mode 3: TEE-hosted policy via Phala (future)
- WsConnection<S> generic over stream type (client + server)
- ws_connect() for client, ws_accept() for server
- into_delivery() bridges WS into cggmp21's Delivery trait
- WireMessage routing: MPC messages go through delivery, others filtered
- Integration test: full keygen → WS signing → verification passes
- Clean up unused import in keygen.rs
saw-daemon:
- ThresholdClient: connects to policy via WS, sends sign requests, participates in MPC signing if policy approves
- ThresholdError with PolicyUnavailable/PolicyDenied/Escalated/Mpc variants
- 5s timeout on policy decision (fail-fast as specced)

saw-policy:
- WebSocket server accepting persistent connections from saw-daemon
- Receives SignRequest → evaluates policy rules → responds with decision
- If approved, participates in MPC signing as party 1
- Sequential request processing per connection

Integration test (daemon_policy_flow):
- Full end-to-end: keygen → WS connect → sign request → policy approve → MPC signing → ECDSA verification — all passing
- config.rs: DaemonConfig with per-wallet SigningMode (single-key or threshold) loaded from config.yaml; defaults to single-key if no config file
- Server now holds config, tokio runtime, and ThresholdClient instances
- handle_sign_evm_tx branches on threshold mode:
  - Single-key: unchanged (local policy + local sign)
  - Threshold: compute sighash → ThresholdClient::sign() → reconstruct signed tx
- Extract compute_evm_tx_sighash() and build_signed_evm_tx() helpers
- Recovery id (v) computed by trying both parities via secp256k1 recovery
- Existing tests still pass (backward compatible)
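A hypothetical sketch of what the per-wallet mode switch in config.yaml might look like. The mode and key_share_path field names follow identifiers quoted in the review comments on this PR; wallet names, paths, and list layout are made up for illustration:

```yaml
# Hypothetical config.yaml sketch — not the shipped schema.
wallets:
  - name: main
    mode: threshold
    key_share_path: keys/threshold/main_key_share.json
  - name: hot
    mode: single-key   # the default when no config file is present
```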
- handle_sign_eip2612_permit now branches on threshold mode
- Derives owner address from key share's shared public key
- Verifies owner matches payload if provided
- Computes EIP-712 digest via extracted compute_permit_digest()
- Proper recovery id: compares recovered pubkey against known key share pubkey
- get_address returns address + public_key for threshold wallets (with mode: threshold)
- Fixed EVM tx threshold path to also compare against known pubkey for v recovery
- ThresholdClient::public_key() exposes the shared public key
- All existing tests pass
New components:
- saw-mpc/relay.rs: WebSocket relay server for multi-party MPC
  - run_relay(): accepts N parties, routes MPC messages (broadcast + P2P)
  - connect_to_relay(): connects as a party, returns Delivery-compatible channels
  - RelayEnvelope protocol: Join/Joined/AllJoined/Mpc/PhaseComplete
  - Phase-aware routing (aux, keygen, sign can use same relay)
- saw-cli/keygen_threshold.rs: CLI for threshold keygen ceremony
  - run_relay(): start relay server on specified address
  - run_party(): connect to relay, run full ceremony (primes → aux → keygen → save)
  - Saves key share (0600) + metadata JSON to keys/threshold/
- saw keygen-threshold CLI command:
  - --relay --listen <addr>: start relay
  - --party <0|1|2> --wallet <name> --connect <url>: join as party
- Integration test (keygen_ceremony.rs): relay-routed signing test (keygen via in-memory + signing via relay, verifies relay routing works)

Usage:
Machine A: saw keygen-threshold --relay --listen 0.0.0.0:9444
Machine B: saw keygen-threshold --party 0 --wallet main --connect ws://A:9444
Machine C: saw keygen-threshold --party 1 --wallet main --connect ws://A:9444
Machine D: saw keygen-threshold --party 2 --wallet main --connect ws://A:9444
- Add saw-mpc/encryption module: encrypt/decrypt with SAW1 binary format
- Argon2id KDF (memory-hard) derives 256-bit key from passphrase + salt
- ChaCha20-Poly1305 AEAD for authenticated encryption
- Backward compatible: auto-detects plaintext vs encrypted key shares
- SAW_PASSPHRASE env var controls encryption on keygen + decryption on load
- saw-cli keygen-threshold encrypts if SAW_PASSPHRASE is set, warns if not
- saw-daemon decrypts on startup, errors clearly if encrypted but no passphrase
- 5 unit tests covering round-trip, wrong passphrase, passthrough, etc.
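A minimal sketch of the plaintext-vs-encrypted auto-detection described above, assuming the SAW1 magic sits at the start of the file. The exact header layout (salt, nonce, ciphertext framing) is not shown and the names here are hypothetical; only the detection idea comes from the commit message:

```rust
// Assumed magic prefix for the encrypted "SAW1" container format.
const MAGIC: &[u8; 4] = b"SAW1";

#[derive(Debug, PartialEq)]
enum KeyShareFormat {
    /// File starts with the SAW1 magic; salt + nonce + AEAD ciphertext follow.
    Encrypted,
    /// Anything else is treated as legacy plaintext JSON.
    Plaintext,
}

fn detect_format(data: &[u8]) -> KeyShareFormat {
    if data.starts_with(MAGIC) {
        KeyShareFormat::Encrypted
    } else {
        KeyShareFormat::Plaintext
    }
}
```

Keying decryption off a magic prefix rather than a file extension is what makes the loader backward compatible: old plaintext shares simply fail the prefix check and fall through to JSON deserialization.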
Presignature pool enables sub-50ms signing latency by splitting MPC into:
- Offline phase (background): daemon + policy generate presignatures via MPC
- Online phase (fast): local partial sig + exchange, no MPC rounds needed

Changes:
- saw-mpc/protocol: add PresignRequest, PresignReady, PartialSignRequest, PartialSignResponse wire messages
- saw-mpc/signing: PresignaturePool now indexed by presig_index (BTreeMap) with reserve_index(), take(index), take_next() for coordinated consumption
- saw-daemon/threshold: background refill loop (10s interval, target 5, refill at 2), sign() tries fast path first then falls back to sign_full
- saw-policy/server: handles PresignRequest (MPC presign generation) and PartialSignRequest (policy eval + local partial sig + response)
- saw-daemon/lib: starts presign refill tasks on Server::new()

Signing paths:
1. Fast path: take presig from pool → issue_partial_signature locally → send PartialSignRequest to policy → receive policy's partial → combine_partial_signatures → done (no MPC rounds)
2. Slow path: full inline MPC signing (existing behavior, fallback)
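The index-coordination API described above can be sketched with plain std types. The method names (reserve_index, take, take_next) come from the commit message; the Presignature placeholder and everything else are assumptions standing in for the real cggmp21 presignature material:

```rust
use std::collections::BTreeMap;

// Placeholder for the cggmp21 presignature material held by each party.
struct Presignature;

#[derive(Default)]
struct PresignaturePool {
    next_index: u64,
    ready: BTreeMap<u64, Presignature>,
}

impl PresignaturePool {
    /// Reserve an index before background generation starts, so both parties
    /// agree on which slot the new presignature will fill.
    fn reserve_index(&mut self) -> u64 {
        let idx = self.next_index;
        self.next_index += 1;
        idx
    }

    fn insert(&mut self, index: u64, presig: Presignature) {
        self.ready.insert(index, presig);
    }

    /// Take a specific presignature (used when the peer names the index).
    fn take(&mut self, index: u64) -> Option<Presignature> {
        self.ready.remove(&index)
    }

    /// Take the lowest-indexed ready presignature for the fast path.
    fn take_next(&mut self) -> Option<(u64, Presignature)> {
        let idx = *self.ready.keys().next()?;
        self.ready.remove(&idx).map(|p| (idx, p))
    }

    fn len(&self) -> usize {
        self.ready.len()
    }
}
```

The BTreeMap keeps indices ordered, so take_next on both sides consumes slots in the same order, which is what lets the daemon and policy agree on which presignature backs a given partial signature.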
Add PersistentWs struct for the daemon's fast-path signing:
- Lazy connect on first use, auto-reconnect on failure
- Exponential backoff on consecutive connection failures (1s → 30s max)
- invalidate() marks connection dead so next call reconnects
- send()/recv()/recv_timeout() handle connection lifecycle

Signing strategy:
- Fast path (presignature): uses persistent WS → avoids handshake overhead on every sign request, enables sub-50ms latency
- Fast path failure: gracefully falls back to slow path instead of failing
- Slow path (full MPC): separate per-request connection (not latency-critical)
- Presign generation (background): separate connections (not latency-critical)
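The reconnect schedule above (1s doubling to a 30s cap) can be isolated as a small sketch. The struct name and field are hypothetical; only the delay policy comes from the commit message:

```rust
use std::time::Duration;

// Sketch of the PersistentWs backoff policy: 1s, 2s, 4s, ... capped at 30s.
struct Backoff {
    consecutive_failures: u32,
}

impl Backoff {
    fn new() -> Self {
        Self { consecutive_failures: 0 }
    }

    /// Delay before the next reconnect attempt, doubling per failure.
    fn next_delay(&mut self) -> Duration {
        let secs = 1u64
            .checked_shl(self.consecutive_failures) // avoids overflow on long outages
            .unwrap_or(u64::MAX)
            .min(30);
        self.consecutive_failures += 1;
        Duration::from_secs(secs)
    }

    /// A successful connect resets the schedule.
    fn reset(&mut self) {
        self.consecutive_failures = 0;
    }
}
```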
Add per-package opt-level=3 for heavy crypto dependencies in dev profile:
- rug, gmp-mpfr-sys (big integer arithmetic)
- paillier-zk (Paillier prime generation — the main bottleneck)
- cggmp21 (MPC protocol)
- generic-ec (elliptic curve ops)

Result: keygen_2of3_and_sign integration test completes in ~195s (was 39+ min in pure debug mode, killed without completing). Paillier prime generation: ~60-80s each (was 39+ min for first prime). This avoids needing --release for MPC tests while keeping fast compile times for non-crypto code.
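A sketch of what these per-package overrides look like in the workspace Cargo.toml; the crate list is from the commit message above, and exact placement in the real manifest may differ:

```toml
# Optimize only the heavy crypto crates in dev builds; the rest of the
# workspace keeps the default dev opt-level for fast compiles.
[profile.dev.package.rug]
opt-level = 3

[profile.dev.package.gmp-mpfr-sys]
opt-level = 3

[profile.dev.package.paillier-zk]
opt-level = 3

[profile.dev.package.cggmp21]
opt-level = 3

[profile.dev.package.generic-ec]
opt-level = 3
```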
New CLI command that generates all 3 key shares in one process using
in-memory MPC transport. Much simpler than the distributed ceremony
for cases where you generate locally and distribute shares after.
Usage: SAW_PASSPHRASE=secret saw keygen-local --wallet my-wallet
Outputs 3 encrypted key share files + metadata:
{wallet}_party0.json → saw-daemon
{wallet}_party1.json → saw-policy (Railway)
{wallet}_party2.json → cosigner (recovery)
- Update saw-policy main.rs: support KEY_SHARE_BASE64 and POLICY_YAML env vars for containerized deployments (no files needed)
- Support PORT env var (Railway-style) for listen address
- Add base64 dep for key share decoding
- Add Dockerfile (multi-stage: rust:1.85 builder + debian-slim runtime)
- Add railway.toml pointing to the Dockerfile
- Deployed to: saw-policy-production.up.railway.app
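A sketch of the multi-stage build described above. Paths, package flags, and the binary name follow details quoted in the review comments on this PR; everything else (including the non-root user the review asks for) is an assumption, not the shipped Dockerfile:

```dockerfile
# Builder: rust:1.85 with the native deps the crypto crates need.
FROM rust:1.85 AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
        m4 pkg-config libssl-dev && rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY . .
RUN cargo build --release -p saw-policy

# Runtime: slim image, unprivileged user, /data for key share + policy.
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
        ca-certificates libssl3 && rm -rf /var/lib/apt/lists/* \
    && useradd --system --uid 1000 saw
COPY --from=builder /build/target/release/saw-policy /usr/local/bin/saw-policy
WORKDIR /data
RUN chown saw /data
USER saw
CMD ["saw-policy"]
```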
start_presign_refill() called tokio::spawn synchronously from Server::new(), which runs outside the tokio runtime, so the spawn had no runtime context. Moved the refill loop to a standalone async function and spawned it via rt.spawn() so tokio::spawn has proper runtime context. Also added pool()/key_share_clone()/policy_url() accessors on ThresholdClient so the refill loop can be spawned externally.
The EVM transaction sighash is already Keccak256'd. MPC signing was additionally SHA-256 hashing it via DataToSign::from_digest, so the recovered signer address differed from the threshold wallet address. Changed to DataToSign::from_scalar with the raw bytes so ecrecover works correctly for EVM transactions.
- Replace verbose draft spec with concise documentation of what's built - Update recovery README with correct port and macOS/Linux base64 syntax - Add critical implementation note about DataToSign::from_scalar - Document all components, config, deployment, and recovery procedures
Add .gitignore patterns for key shares (.json.enc), private keys (.key, .pem), env files (.env), and SAW data directories (.saw/).
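The resulting .gitignore additions might look like the following; the exact entries are assumptions based on the description above:

```gitignore
# Key shares, private keys, secrets, and local SAW state
*.json.enc
*.key
*.pem
.env
.env.*
.saw/
```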
📝 Walkthrough

Adds full 2-of-3 threshold MPC support: a new saw-mpc library with transports, keygen, presignature pool, and signing; a saw-policy agent (policy evaluation + WS server); a saw-cosigner scaffold; CLI keygen (local and threshold); daemon integration for threshold signing; Dockerfiles, tests, and docs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Daemon as Daemon<br/>(Party 0)
    participant Relay
    participant Policy as Policy<br/>(Party 1)
    participant Cosigner as Cosigner<br/>(Party 2)

    Note over Daemon,Cosigner: Threshold Signing Flow (Presignature Fast Path)
    Client->>Daemon: Sign Request
    Daemon->>Daemon: Compute EVM sighash
    Daemon->>Policy: PartialSignRequest<br/>(presign available)
    rect rgba(173,216,230,0.5)
        Note over Policy: Fast Path - Presignature
        Policy->>Policy: Evaluate Policy Rules
        alt Policy Approved
            Policy->>Policy: Issue Partial Signature<br/>(from presignature pool)
            Policy->>Daemon: PartialSignResponse<br/>(with partial_signature)
        else Policy Denied/Escalated
            Policy->>Daemon: PartialSignResponse<br/>(decision without signature)
        end
    end
    alt Presignature Available & Approved
        Daemon->>Daemon: Combine Partial Signatures<br/>(local + policy)
        Daemon->>Client: Signed Transaction
    else Presignature Unavailable or Denied
        rect rgba(255,182,193,0.5)
            Note over Daemon,Cosigner: Slow Path - Full MPC Signing
            Daemon->>Relay: Connect (Party 0)
            Policy->>Relay: Connect (Party 1)
            Cosigner->>Relay: Connect (Party 2)
            Relay->>Relay: AllJoined signal
            Daemon->>Relay: MPC Sign Messages
            Policy->>Relay: MPC Sign Messages
            Cosigner->>Relay: MPC Sign Messages
            Relay->>Daemon: MPC Messages routed
            Relay->>Policy: MPC Messages routed
            Relay->>Cosigner: MPC Messages routed
            Daemon->>Daemon: Complete Signature<br/>(from MPC rounds)
            Daemon->>Client: Signed Transaction
        end
    end
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~90+ minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 19
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
🤖 Fix all issues with AI agents
In `@crates/saw-cli/src/keygen_local.rs`:
- Around line 48-51: The ExecutionId values built from deterministic strings
("local-{wallet}-aux" and "local-{wallet}-dkg") can be reused across runs;
update the eid construction in the aux and dkg tasks so they include a
run-unique random nonce or high-resolution timestamp before calling
cggmp21::ExecutionId::new (the variables eid_bytes / eid and the calls to
keygen::generate_aux_info and keygen::generate_dkg_info are the places to
change). Generate a cryptographically random nonce (or use a UUID or
chrono::Utc::now().timestamp_nanos()), append it to the existing string when
creating eid_bytes for both the aux and dkg branches, and keep the rest of the
async spawn and generate_* calls unchanged.
- Around line 105-110: The code currently uses assert_eq!(address,
output.address, ...) which panics on mismatch; change this to return an Err from
the surrounding function (which returns Result<String, String>) with a clear
message including both addresses. Specifically, in the block that compares
address and output.address (where address, public_key, output.address,
output.public_key are used), replace the assert_eq! with an early return like
Err(format!(...)) describing the "address mismatch between parties" and include
the conflicting values so the caller can handle the error instead of panicking.
In `@crates/saw-daemon/src/lib.rs`:
- Around line 540-577: The loop that tries both y-parities to recover v is
duplicated in handle_sign_evm_tx_threshold and
handle_sign_eip2612_permit_threshold—extract it into a single helper fn
recover_y_parity(r_bytes: &[u8], s_bytes: &[u8], msg_hash: &[u8;32],
expected_uncompressed_pk: &[u8]) -> u8 that creates Secp256k1 and Message,
builds the 64-byte compact sig for v=0..1, attempts RecoveryId::from_i32,
RecoverableSignature::from_compact and secp.recover_ecdsa, compares
recovered.serialize_uncompressed()[1..] to expected_uncompressed_pk[1..],
returning the matching v (or 0 fallback); then replace the duplicated loops in
handle_sign_evm_tx_threshold and handle_sign_eip2612_permit_threshold to call
this helper with the local r_bytes, s_bytes, &sighash (as [u8;32]) and
expected_pk.to_bytes(false).as_ref().
- Around line 161-165: The current code uses a single hardcoded default
"keys/threshold/key_share.json" for every wallet when wcfg.key_share_path is
None, causing different threshold wallets to share the same key share; update
the logic in the wallet setup (where wcfg, key_share_path, root and full_path
are used) to either (a) require wcfg.key_share_path when wcfg.mode ==
"threshold" and return an error if missing, or (b) compute a unique default per
wallet by incorporating a wallet identifier (e.g., wcfg.name or the wallet ID)
into the path (example: format!("keys/threshold/{}_key_share.json",
wallet_name)) before joining with root to form full_path; pick one approach and
ensure validation/error messages reference wcfg.key_share_path and the wallet
identifier.
In `@crates/saw-daemon/src/threshold.rs`:
- Around line 541-545: The PresignRequest is being sent with wallet:
String::new(), which can break routing; update the WireMessage::PresignRequest
construction to pass the actual wallet name used elsewhere (match the wallet
field used in the SignRequest path) instead of an empty string so the policy
server can look up the correct key share (modify the PresignRequest wallet value
to use the wallet variable/identifier used in your SignRequest flow).
In `@crates/saw-mpc/tests/daemon_policy_flow.rs`:
- Around line 289-295: The verification is double-hashing the message by using
DataToSign::from_digest with SHA-256; change the verification to mirror the
signing path by constructing the DataToSign from the raw message hash scalar
instead. Replace the DataToSign::from_digest(...) call used before
sig_daemon.verify(...) with
DataToSign::from_scalar(generic_ec::Scalar::from_be_bytes_mod_order(message_hash))
so verification uses the same scalar representation as the signing code in
signing.rs (the instances at DataToSign::from_scalar in that file).
In `@crates/saw-mpc/tests/keygen_and_sign.rs`:
- Around line 104-109: The test constructs DataToSign via
DataToSign::from_digest which double-hashes the message and so won't match the
signing path; change the construction to use
DataToSign::from_scalar(generic_ec::Scalar::from_be_bytes_mod_order(message_hash))
so verification matches how signatures are produced (see issue_partial_signature
and sign_full which use from_scalar), ensuring
signatures[0].verify(&key_shares[0].shared_public_key, &data) uses the same
raw-scalar DataToSign as the signer.
In `@crates/saw-mpc/tests/keygen_ceremony.rs`:
- Around line 81-90: listener2 is bound to addr2 and remains alive, causing
run_relay (which calls TcpListener::bind) to fail with EADDRINUSE; fix by
dropping or moving ownership of listener2 before spawning the relay (i.e.,
release the TcpListener held in listener2 after obtaining addr2 and before
calling tokio::spawn for relay_task) or modify relay::run_relay to accept an
existing TcpListener and pass listener2 directly so run_relay doesn't re-bind
the same address (referencing listener2, addr2, run_relay, and relay_task).
In `@crates/saw-mpc/tests/ws_transport.rs`:
- Around line 130-136: The test verifies the signature using
DataToSign::from_digest which re-hashes message_hash, but the signing code
(sign_full) used DataToSign::from_scalar with
generic_ec::Scalar::from_be_bytes_mod_order(message_hash); update the
verification to construct the same DataToSign via
DataToSign::from_scalar(generic_ec::Scalar::from_be_bytes_mod_order(&message_hash))
so sig_daemon.verify(&complete_shares[0].shared_public_key, &data) checks the
same value that was signed.
In `@crates/saw-policy/Dockerfile`:
- Around line 14-34: The Dockerfile currently runs the runtime stage as root;
modify the runtime stage to create a dedicated non-root user (e.g., group/user
like sawpolicy or use UID 1000), change ownership of the copied binary and /data
to that user, and add a USER directive before CMD so saw-policy runs
unprivileged; update references in the stage that use /data (WORKDIR, ENV like
SAW_ROOT, POLICY_PATH, KEY_SHARE_PATH) to preserve functionality under the new
user and ensure the binary at /usr/local/bin/saw-policy is executable by that
user (adjust ownership/permissions where needed).
In `@crates/saw-policy/src/policy.rs`:
- Around line 116-124: The tripped_breakers lookup in evaluate is dead because
CircuitBreaker configs (struct CircuitBreaker with fields condition, action,
cooldown_hours) are parsed but never evaluated or inserted into
state.tripped_breakers; fix by implementing circuit-breaker evaluation where
requests are processed (e.g., inside evaluate or the request pre-check flow):
iterate configured breakers, evaluate each breaker.condition against the current
request/context, and when a breaker fires insert an entry into
state.tripped_breakers (keyed by request.wallet) with the breaker name and
cooldown expiry (use cooldown_hours to compute expiry), then have evaluate
return a Deny (or the configured action) with matched_rule set to the breaker
name; alternatively if you intended scaffolding, remove the tripped_breakers
check from evaluate or add a clear TODO comment in evaluate referencing
CircuitBreaker, condition, action, and cooldown_hours so users aren’t misled.
- Around line 147-195: matches_rule currently ignores RuleConditions fields
max_value_usd, max_daily_spend_usd, and max_per_minute which lets rules that
specify only those conditions match everything; update matches_rule (in function
matches_rule) to fail-closed by returning false when any of
rule.conditions.max_value_usd, rule.conditions.max_daily_spend_usd, or
rule.conditions.max_per_minute is Some(...) and the corresponding enforcement is
not yet implemented, or alternatively implement the actual checks (compare tx
value in USD for max_value_usd, consult spend-tracking for max_daily_spend_usd,
and apply rate tracking for max_per_minute); ensure you reference the
RuleConditions fields (max_value_usd, max_daily_spend_usd, max_per_minute) and
add/adjust tests to cover the fail-closed behavior so a rule that only sets
these fields no longer matches by default.
In `@crates/saw-policy/src/server.rs`:
- Around line 465-475: The signing result is only logged and never sent to the
daemon; after awaiting sign_task (the block around sign_task.await and the match
on result) construct a WireMessage::SignResult (or equivalent) containing
sign_req.request_id and either success info or the error string, then send it
back over the same outbound websocket channel used elsewhere in this file (the
outbound/sender object used for out_task) and await/handle the send result
(log/send error if it fails); ensure you send the message before aborting or
dropping out_task so the daemon receives the explicit success/failure signal.
- Around line 43-60: The current loop uses listener.accept().await.map_err(|e|
MpcError::Transport(...)) which returns Err on any accept failure and kills the
server; change the accept logic in run() to match on listener.accept().await
instead of mapping the error: on Ok((tcp_stream, addr)) proceed as before (clone
key_share and policy_config and tokio::spawn calling handle_connection), and on
Err(e) log the failure (e.g., tracing::warn!(%e, "accept failed; continuing"))
and continue the loop instead of returning MpcError::Transport; this removes the
immediate propagation of accept errors while preserving the existing connection
handler behavior.
In `@docker/recovery/Dockerfile`:
- Around line 5-11: The Dockerfile currently relies on sensitive environment
variables (SAW_PASSPHRASE, KEY_SHARE_BASE64, POLICY_YAML) which can be leaked;
change the image and entrypoint so secrets are read from files (e.g.
/run/secrets/ or a mounted file) instead of env vars: update the Dockerfile to
document and expect docker secrets or a mounted secrets file (and remove
guidance to pass SAW_PASSPHRASE/KEY_SHARE_BASE64 via -e), and modify the
container startup/entrypoint logic to load SAW_PASSPHRASE, KEY_SHARE_BASE64 and
POLICY_YAML from secure file paths (and support an --env-file fallback only with
strict file permissions) so sensitive values are not passed as process
environment variables.
- Around line 21-30: Add a dedicated non-root user and switch to it in the
Dockerfile: create a system user/group (e.g., "saw") before copying or after
COPY, chown the binary /usr/local/bin/saw-policy and the data directory (/data)
to that user, set WORKDIR ownership, and add USER saw (or UID/GID) before the
CMD so the container does not run as root; ensure the user has no interactive
login (system user) and keep EXPOSE/CMD unchanged.
In `@Dockerfile.policy`:
- Around line 14-34: The runtime Docker stage runs as root; add a non-root user
and switch to it to reduce blast radius: in the stage that builds from
debian:bookworm-slim (the one that COPYs /build/target/release/saw-policy and
sets WORKDIR /data), create a dedicated unprivileged user (e.g., user "saw" or
UID 1000), change ownership of /data and the copied binary (saw-policy) to that
user, set HOME/ENV as needed, and add USER <username> before CMD so the
container runs the saw-policy binary as the non-root user instead of root.
Ensure directories and the binary remain readable/executable by that user.
In `@docs/threat-model-and-design-rationale.md`:
- Around line 60-65: The Accepted risks section is incomplete—add explicit
entries for missing attack vectors: document risks and mitigations for
Unauthenticated WebSocket/relay transport (MITM, message injection,
impersonation) and note design decision "Not included: mTLS/WS auth"; for Policy
server compromise (policy signer root compromise, bypassing rules, exfiltration
of signer share) reference "policy signer" and explain detection/response and
acceptance or hardening options; for Presignature pool attacks (exhaustion,
tampering, integrity loss) reference "presignature pool" and "presignature" and
describe rate-limits, authentication, integrity checks, and pool
monitoring/rehydration; for Denial-of-service on signing ceremonies (relay
flood, dropped party) reference MPC ceremonies and add timeouts, retries,
fallback policies; and for Key share encryption at rest (Argon2id +
ChaCha20-Poly1305) describe threats from disk theft/backups and mitigations like
stronger KDF params, hardware-backed keys, and key rotation/re-key procedures or
explicitly mark as accepted risk—update the Accepted risks list to include these
items and link to existing items like "2-of-2 mode" and "policy_unavailable"
where relevant.
In `@docs/threshold-signing-spec.md`:
- Around line 156-159: The Transport section currently documents an
unauthenticated plaintext WebSocket between "Daemon ↔ Policy" and lists message
types like the JSON-serialized WireMessage (SignRequest, PolicyDecision,
MpcWireMessage, PresignRequest) and “mTLS” as future — add a tracked task:
create a repository issue titled something like "Add WebSocket auth for
Daemon↔Policy (mTLS/token)" describing required options (mTLS with pinned certs,
token-based auth, and migration plan), assign owner and milestone, then update
the docs/threshold-signing-spec.md Transport paragraph to include a link to that
issue and a TODO note next to "mTLS" so reviewers can find the tracking item
before production; ensure the issue references WireMessage/PolicyDecision to
show impact on co-signing.
🟡 Minor comments (11)
docs/threat-model-and-design-rationale.md-31-31 (1)
31-31: ⚠️ Potential issue | 🟡 Minor — Clarify the complexity of share rotation.
Line 31 states "rotate the shares via a new keygen ceremony and the stolen share becomes useless," which makes recovery sound straightforward. However, line 65 reveals "No key refresh in v1" and that mitigation requires "a full re-keygen to a new wallet and on-chain fund transfer." This is a significant operational burden (new address, fund migration) that should be acknowledged here to avoid misleading readers about the ease of recovery.
📝 Suggested clarification

```diff
-With MPC, a compromise of one machine is recoverable — rotate the shares via a new keygen ceremony and the stolen share becomes useless.
+With MPC, a compromise of one machine is recoverable — generate a new wallet via a fresh keygen ceremony, transfer funds on-chain, and the stolen share becomes useless (note: in-place key refresh is not supported in v1).
```

docs/threat-model-and-design-rationale.md-56-56 (1)
56-56: ⚠️ Potential issue | 🟡 Minor — TEE/premium threshold mode is not yet implemented.
Line 56 presents "TEE/premium threshold" as mode 3, but the PR objectives explicitly list "TEE deployment" under "Not included / future work." This could mislead readers into thinking this deployment mode is currently available. Either mark this as planned/future work or remove it from the current deployment modes list.
📝 Suggested clarification

```diff
-3. **TEE/premium threshold** — Managed co-signing service running in a Trusted Execution Environment. Appropriate for agents handling real money where operators want professional-grade security without running their own infrastructure.
+3. **TEE/premium threshold (planned)** — Managed co-signing service running in a Trusted Execution Environment. Appropriate for agents handling real money where operators want professional-grade security without running their own infrastructure. *(Not yet implemented; see future work.)*
```

docs/threat-model-and-design-rationale.md-64-64 (1)
64-64: ⚠️ Potential issue | 🟡 Minor — The 2-of-2 mode is implemented but untested; clarify support status in documentation.
The 2-of-2 mode is actually implemented (a `new_2of2()` function exists in `crates/saw-mpc/src/types.rs`), not theoretical. However, it appears completely untested: there are no test cases for 2-of-2 in the provided test suite (all tests use 2-of-3). Claiming it is "Supported" in the documentation is misleading without evidence that 2-of-2 signing actually works. Either add integration tests for 2-of-2 or qualify the documentation to state it is "available as a configuration option" rather than actively "supported" until validated.

docs/threshold-signing-spec.md-70-70 (1)
70-70: ⚠️ Potential issue | 🟡 Minor — Typo: `mywallt` → `mywallet`.

```diff
-SAW_PASSPHRASE="secret" saw keygen-local --wallet mywallt --root ~/.saw
+SAW_PASSPHRASE="secret" saw keygen-local --wallet mywallet --root ~/.saw
```

crates/saw-policy/Dockerfile-5-5 (1)
5-5: ⚠️ Potential issue | 🟡 Minor — Add `--no-install-recommends` to `apt-get install` invocations.
Both `apt-get install` lines (builder Line 5, runtime Line 17) pull in recommended packages unnecessarily, increasing image size and attack surface.
Proposed fix

```diff
-RUN apt-get update && apt-get install -y m4 pkg-config libssl-dev && rm -rf /var/lib/apt/lists/*
+RUN apt-get update && apt-get install -y --no-install-recommends m4 pkg-config libssl-dev && rm -rf /var/lib/apt/lists/*
```

```diff
-RUN apt-get update && apt-get install -y ca-certificates libssl3 && rm -rf /var/lib/apt/lists/*
+RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates libssl3 && rm -rf /var/lib/apt/lists/*
```

Also applies to: 17-17
crates/saw-daemon/src/config.rs-48-59 (1)
48-59: ⚠️ Potential issue | 🟡 Minor — `Err(_)` silently swallows permission and other I/O errors.
Currently, any `read_to_string` failure (including permission denied, I/O errors) falls through to defaults without warning. Only a missing file should silently use defaults.
Proposed fix

```diff
 match std::fs::read_to_string(&config_path) {
     Ok(contents) => {
         serde_yaml::from_str(&contents).unwrap_or_else(|e| {
             eprintln!("warning: invalid config.yaml: {e}, using defaults");
             DaemonConfig::default()
         })
     }
-    Err(_) => DaemonConfig::default(),
+    Err(e) if e.kind() == std::io::ErrorKind::NotFound => DaemonConfig::default(),
+    Err(e) => {
+        eprintln!("warning: could not read config.yaml: {e}, using defaults");
+        DaemonConfig::default()
+    }
 }
```

crates/saw-policy/src/main.rs-125-135 (1)
125-135: ⚠️ Potential issue | 🟡 Minor — Environment variable silently overrides the `--config` CLI flag.
If `POLICY_YAML` is set, `load_policy` will use it regardless of whether `--config <path>` was explicitly passed. This could silently ignore the user's explicit CLI flag. Consider checking the env var only when no CLI path was provided, or at least logging which source was used.
♻️ Suggested approach: check env var before CLI resolution
Move the `POLICY_YAML` check into the config_path resolution block (lines 89-91) so that CLI flags take precedence:

```diff
 fn load_policy(path: &PathBuf) -> Result<policy::PolicyConfig, String> {
-    let contents = if let Ok(yaml) = std::env::var("POLICY_YAML") {
-        tracing::info!("loading policy from POLICY_YAML env var");
-        yaml
-    } else {
-        std::fs::read_to_string(path)
-            .map_err(|e| format!("read policy {}: {e}", path.display()))?
-    };
+    let contents = std::fs::read_to_string(path)
+        .map_err(|e| format!("read policy {}: {e}", path.display()))?;
     serde_yaml::from_str(&contents).map_err(|e| format!("parse policy: {e}"))
 }
```

And in the config_path resolution:

```diff
 let config_path = config_path.unwrap_or_else(|| {
-    PathBuf::from(std::env::var("POLICY_PATH").unwrap_or_else(|_| "policy.yaml".into()))
+    if let Ok(_yaml) = std::env::var("POLICY_YAML") {
+        // POLICY_YAML will be handled inline by load_policy
+        PathBuf::from("(inline)")
+    } else {
+        PathBuf::from(std::env::var("POLICY_PATH").unwrap_or_else(|_| "policy.yaml".into()))
+    }
 });
```

Alternatively, keep the current behavior but add a warning log when the CLI flag is overridden.

crates/saw-policy/src/server.rs-330-340 (1)
crates/saw-policy/src/server.rs-330-340 (1)
330-340:⚠️ Potential issue | 🟡 MinorInconsistent error handling for bad message hash between
PartialSignRequestandSignRequest.For
PartialSignRequest(Lines 266-284), an invalid hash returns a denial response to the daemon. ForSignRequest(Lines 331-337), an invalid hash propagates anMpcErrorthat terminates the connection. TheSignRequestpath should also send an error response andcontinueto remain consistent and avoid dropping the connection.crates/saw-mpc/src/keygen.rs-141-149 (1)
141-149: ⚠️ Potential issue | 🟡 Minor — Doc comment is misleading: no plaintext fallback implemented.
The doc says "If the data is not encrypted, falls back to plaintext deserialization", but the implementation unconditionally calls `crate::encryption::decrypt(data, passphrase)?`, which will fail on plaintext JSON input. Either update the doc to match the behavior or implement the fallback.
Option A: Fix the doc

```diff
-/// Decrypt and deserialize a KeyShare from storage.
-/// If the data is not encrypted, falls back to plaintext deserialization.
+/// Decrypt and deserialize a KeyShare from encrypted storage.
```

Option B: Implement the fallback

```diff
 pub fn deserialize_key_share_encrypted(
     data: &[u8],
     passphrase: &[u8],
 ) -> Result<cggmp21::KeyShare<Secp256k1>, MpcError> {
-    let plaintext = crate::encryption::decrypt(data, passphrase)?;
-    serde_json::from_slice(&plaintext).map_err(MpcError::Serde)
+    match crate::encryption::decrypt(data, passphrase) {
+        Ok(plaintext) => serde_json::from_slice(&plaintext).map_err(MpcError::Serde),
+        Err(_) => serde_json::from_slice(data).map_err(MpcError::Serde),
+    }
 }
```

crates/saw-daemon/src/threshold.rs-208-229 (1)
208-229: ⚠️ Potential issue | 🟡 Minor

Presignature is consumed even when the fast path fails — an expensive resource leak.

At lines 210-212, the presignature is removed from the pool via `take_next()`. If `sign_with_presignature` fails (line 221), the presignature is already consumed and cannot be returned to the pool. Since presignature generation requires multiple MPC rounds over the network, each lost presignature is expensive.

Consider returning the presignature to the pool on recoverable failures (e.g., transport errors), or at minimum logging at a higher severity level:

Suggested improvement

```diff
 match self
     .sign_with_presignature(wallet, action.clone(), tx_details.clone(), message_hash, presig_index, presignature)
     .await
 {
     Ok(result) => return Ok(result),
     Err(e) => {
-        // Fast path failed (e.g. presig index mismatch after reconnect).
-        // Fall through to slow path rather than losing the request.
-        eprintln!("fast path failed: {e}, falling back to slow path");
+        // Fast path failed — presignature is lost (cannot be safely reused
+        // after partial signature was issued). Fall through to slow path.
+        eprintln!("WARN: fast path failed, presignature {presig_index} lost: {e}");
     }
 }
```

Note: once `issue_partial_signature` has been called (line 248), the presignature must not be reused (security invariant), so recovery is only possible if the failure occurred before that point.

`crates/saw-mpc/src/signing.rs` (1)
183-183: ⚠️ Potential issue | 🟡 Minor

Remove unused import.

The `use sha2::Digest as _;` import at the end of the file is not required. The trait is not used anywhere in the code — `sha2::Sha256` appears only as a type parameter in generic bounds (lines 110, 160), not for trait method dispatch. No trait methods from `Digest` are invoked. Remove this import.
🧹 Nitpick comments (36)
docker/recovery/Dockerfile (2)
18-18: Add `--no-install-recommends` to both `apt-get install` calls.

Avoids pulling unnecessary recommended packages, reducing image size and attack surface (Trivy DS-0029).

Proposed fix

```diff
-RUN apt-get update && apt-get install -y m4 pkg-config libssl-dev && rm -rf /var/lib/apt/lists/*
+RUN apt-get update && apt-get install -y --no-install-recommends m4 pkg-config libssl-dev && rm -rf /var/lib/apt/lists/*
```

```diff
-RUN apt-get update && apt-get install -y ca-certificates libssl3 && rm -rf /var/lib/apt/lists/*
+RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates libssl3 && rm -rf /var/lib/apt/lists/*
```

Also applies to: 22-22

15-16: Add a `.dockerignore` file to exclude unnecessary files from the build context.

The `COPY . .` at lines 15-16 sends the entire repository to the Docker daemon. Without a `.dockerignore`, directories like `target/`, `.git/`, `node_modules/`, and local configuration files will be included, increasing build context size and risking accidental exposure of sensitive files in the builder layer.

Create `.dockerignore` at the repository root with at minimum:

```text
.git
target
node_modules
.env
.env.local
*.log
```

`crates/saw-daemon/Cargo.toml` (1)
6-27: Consider consolidating the `cggmp21` version via workspace dependencies.

Same observation as `saw-cli`: `cggmp21` appears as both a direct dependency and a transitive one through `saw-mpc`. With three crates all pinning `cggmp21 = "0.6"` independently, a `[workspace.dependencies]` entry would be the standard way to keep versions synchronized as the project evolves.

`Dockerfile.policy` (1)
5-5: Add `--no-install-recommends` to `apt-get install` in both stages.

Reduces image size and attack surface by skipping unnecessary recommended packages. Flagged by Trivy DS-0029.

Proposed fix

```diff
-RUN apt-get update && apt-get install -y m4 pkg-config libssl-dev && rm -rf /var/lib/apt/lists/*
+RUN apt-get update && apt-get install -y --no-install-recommends m4 pkg-config libssl-dev && rm -rf /var/lib/apt/lists/*
```

```diff
-RUN apt-get update && apt-get install -y ca-certificates libssl3 && rm -rf /var/lib/apt/lists/*
+RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates libssl3 && rm -rf /var/lib/apt/lists/*
```

Also applies to: 17-17
docs/threshold-signing-spec.md (1)
12-12: Add language identifiers to fenced code blocks.

The ASCII diagram blocks at lines 12 and 37 are missing language specifiers (flagged by markdownlint MD040). Using a `text` language specifier would silence the lint and improve rendering.

Also applies to: 37-37
crates/saw-cli/src/keygen_local.rs (1)
12-14: Unix-only import: `PermissionsExt` prevents compilation on Windows.

`std::os::unix::fs::PermissionsExt` (line 13) and `from_mode(0o600)` (line 126) are Unix-only. If cross-platform support is ever needed, this would need a `#[cfg(unix)]` gate. Fine if the target is strictly Linux/macOS.

`crates/saw-mpc/src/protocol.rs` (2)
22-25: `from_str` method shadows the `std::str::FromStr` trait convention.

Defining an inherent `from_str` on `SessionId` is confusing — it looks like a trait impl but isn't. Users expecting `"x".parse::<SessionId>()` will find it doesn't work. Consider implementing `From<&str>` or the `FromStr` trait instead.

Proposed alternatives

```diff
-    /// Create from a known string (e.g., received from coordinator).
-    pub fn from_str(s: &str) -> Self {
-        Self(s.to_string())
-    }
```

Then either:

```rust
impl From<&str> for SessionId {
    fn from(s: &str) -> Self {
        Self(s.to_string())
    }
}
```

or implement `std::str::FromStr`.
46-48: `message_hash` as a hex `String` is fragile — consider `[u8; 32]` with hex serde.

Passing the 32-byte hash as a hex-encoded `String` in `SignRequest` and `PartialSignRequest` pushes encoding/decoding responsibility to every call site and allows malformed values (odd-length hex, wrong length, non-hex chars) to slip through at runtime. A typed `[u8; 32]` with a hex serde adapter (e.g., `serde_with` or a custom module) catches errors at deserialization time.

Also applies to: 118-119
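A minimal, std-only sketch of the boundary validation (function name is hypothetical; a serde adapter as suggested above would move this check into deserialization itself):

```rust
// Hypothetical std-only validator: decode the 32-byte hash from hex once at
// the boundary, so malformed values are rejected at parse time instead of
// deep inside the signing path.
fn parse_message_hash(s: &str) -> Result<[u8; 32], String> {
    if !s.is_ascii() || s.len() != 64 {
        return Err(format!("expected 64 hex chars, got {} bytes", s.len()));
    }
    let mut out = [0u8; 32];
    for (i, byte) in out.iter_mut().enumerate() {
        let pair = &s[2 * i..2 * i + 2];
        *byte = u8::from_str_radix(pair, 16)
            .map_err(|_| format!("invalid hex at offset {}", 2 * i))?;
    }
    Ok(out)
}
```

With this in place, every downstream consumer receives a well-formed `[u8; 32]` and the error surfaces once, with context, at the request boundary.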
crates/saw-policy/Dockerfile (1)
8-9: Docker layer caching is suboptimal — any source change rebuilds everything.

`COPY crates/ crates/` on line 9 copies all source files in a single layer, so any source change invalidates the `cargo build` cache and forces a full dependency rebuild. A common pattern is to first copy only the `Cargo.toml`/`Cargo.lock` files, run a dummy build to cache dependencies, then copy sources. However, with a workspace this is admittedly non-trivial to set up, so this is fine for now.

`crates/saw-mpc/tests/ws_transport.rs` (1)
101-102: `sleep(50ms)` for server readiness is a flaky-test risk.

Under heavy CI load, the server task may not have started accepting on the listener before the client connects. A synchronization primitive (e.g., a `tokio::sync::oneshot` signaled after `ws_accept` starts) or a connect-with-retry loop would be more robust.

Sketch: retry loop alternative

```rust
// Client side: retry connection
let conn = loop {
    match transport::ws_connect(&ws_url, PARTY_DAEMON, PARTY_POLICY, n).await {
        Ok(c) => break c,
        Err(_) => tokio::time::sleep(std::time::Duration::from_millis(10)).await,
    }
};
```

`crates/saw-mpc/tests/keygen_and_sign.rs` (2)
114-114: Move import to the top of the file with the other imports.

The `use sha2::Digest;` import is placed at the very end of the file (line 114), detached from the other imports at lines 4-6. This is unconventional and easy to miss.

Suggested move

```diff
 use saw_mpc::keygen;
 use saw_mpc::signing;
 use saw_mpc::transport;
+use sha2::Digest;
```

And remove line 114.

15-17: Prime generation is sequential despite the comment saying "do in parallel".

`pregenerate_primes()` is CPU-intensive and called sequentially in the `.map().collect()`. For a test this is likely acceptable, but the comment on line 16 is misleading.

`crates/saw-mpc/tests/keygen_ceremony.rs` (1)
19-21: First `listener` and `relay_url` are unused — dead code.

The `TcpListener` bound at line 19 and `relay_url` at line 21 are never used. The test performs keygen in-memory and only uses a second listener for the relay signing phase. This is likely a leftover from an earlier iteration.

Suggested cleanup

```diff
-    let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await.unwrap();
-    let addr = listener.local_addr().unwrap();
-    let relay_url = format!("ws://{addr}");
-
-    // We need the relay to accept connections for TWO phases (aux + keygen).
-    // The current relay accepts N parties then finishes.
-    // For multi-phase, we run the relay once per phase.
-    // Actually, let's test the connect_to_relay function directly
-    // with our existing in-memory approach for keygen, and just
-    // verify the relay routing works for a single phase.
-
-    // --- Test relay with a simple signing phase ---
-    // First, do keygen in-memory (proven), then test relay routing for signing.
-
```

`crates/saw-cli/src/lib.rs` (1)
611-627: `CliError::InvalidArg` is misused for non-argument errors.

Runtime creation failures (line 615), relay server errors (line 620), and party keygen errors (line 626) are all mapped to `CliError::InvalidArg`, which is semantically incorrect. This will produce misleading error messages like `"invalid argument: relay error: ..."`.

Consider adding a variant to `CliError` (e.g., `Keygen(String)`) for these operational errors.

♻️ Sketch

```diff
 pub enum CliError {
     MissingArg(&'static str),
     InvalidArg(String),
+    Threshold(String),
     GenKey(GenKeyError),
     // ...
 }
```

Then use `CliError::Threshold(e)` in the threshold/keygen command handlers.

`crates/saw-cosigner/src/main.rs` (1)
1-67: Scaffold looks reasonable.

The CLI structure follows the same pattern as `saw-policy`. Multiple TODO comments mark the unimplemented functionality clearly.

Would you like me to open an issue to track the `--join` implementation and the other TODO items (lines 51-53, 62-64)?

`crates/saw-policy/src/main.rs` (1)
125-125: Prefer `&Path` over `&PathBuf` for function parameters.

Both `load_policy` and `load_key_share` take `&PathBuf` as parameters. Idiomatic Rust prefers `&Path`, since `PathBuf` auto-derefs to `Path` and `&Path` is more general (it accepts both `&PathBuf` and `&Path`).

♻️ Proposed fix

```diff
-fn load_policy(path: &PathBuf) -> Result<policy::PolicyConfig, String> {
+fn load_policy(path: &Path) -> Result<policy::PolicyConfig, String> {
```

```diff
-fn load_key_share(path: &PathBuf) -> Result<saw_mpc::KeyShare<saw_mpc::Secp256k1>, String> {
+fn load_key_share(path: &Path) -> Result<saw_mpc::KeyShare<saw_mpc::Secp256k1>, String> {
```

Add `use std::path::Path;` alongside the existing `PathBuf` import.

Also applies to: 137-137
crates/saw-mpc/src/encryption.rs (1)
70-74: Silent passthrough for unencrypted data — all call sites properly guard against downgrade attacks.

The backward-compatible passthrough is intentionally documented for graceful fallback, but it does create a downgrade risk if callers don't validate. Verification confirms both discovered call sites (`saw-policy/src/main.rs:156` and `saw-daemon/src/lib.rs:175`) correctly check `is_encrypted()` before calling `deserialize_key_share_encrypted()`. Current usage is safe. Consider adding a prominent note in the function documentation emphasizing that callers expecting encrypted data must explicitly validate with `is_encrypted()` first.

`crates/saw-mpc/src/types.rs` (2)
62-63: Doc says "Ethereum address" but `Chain` can be `Sol`.

The doc comment on the `address` field says "Ethereum address derived from the combined public key", but `KeyShareData` embeds a `ThresholdConfig` that supports `Chain::Sol`. Consider generalizing the doc to "derived address" or making it chain-agnostic.

33-54: Constructors accept any `party_id` without a bounds check.

`new_2of3` and `new_2of2` don't validate `party_id < num_parties`. Callers like `keygen_threshold::run_party` do check before calling, but a defensive `debug_assert!` or explicit validation here would prevent misuse from other call sites.

`crates/saw-cli/src/keygen_threshold.rs` (2)
64-68: `pregenerate_primes` is synchronous and CPU-intensive — consider `spawn_blocking`.

`keygen::pregenerate_primes()` is described as taking ~1 minute. It's called directly on the async runtime, which will block a Tokio executor thread. Wrap it in `tokio::task::spawn_blocking` to avoid starving other tasks (e.g., the relay connection timeout).

Suggested fix

```diff
-    let primes = keygen::pregenerate_primes();
+    let primes = tokio::task::spawn_blocking(keygen::pregenerate_primes)
+        .await
+        .map_err(|e| format!("prime gen task: {e}"))?;
```
132-145: Metadata file not restricted to 0o600, and chain is hardcoded to `Evm`.

Two observations:

1. The key share file gets `chmod 0o600` (line 129), but the metadata file at `meta_path` (line 145) does not. While metadata doesn't contain secret key material, it does reveal the wallet address/public key — applying the same permissions would be consistent.
2. `Chain::Evm` is hardcoded at line 134. The `run_party` function signature doesn't accept a `chain` parameter, so Solana keygen would produce incorrect metadata. Consider adding `chain` as a parameter for forward-compatibility.

Suggested fix for permissions

```diff
     fs::write(&meta_path, meta_json).map_err(|e| format!("write meta: {e}"))?;
+    fs::set_permissions(&meta_path, fs::Permissions::from_mode(0o600))
+        .map_err(|e| format!("chmod meta: {e}"))?;
```

`crates/saw-mpc/tests/daemon_policy_flow.rs` (1)
183-184: 50ms sleep for server readiness is potentially flaky.

The 50ms sleep between starting the policy server task and connecting the daemon client may not be enough under load (e.g., CI). Consider using a readiness signal (e.g., a `tokio::sync::Notify` or a `oneshot` channel) after the listener is bound, or retry the connection with backoff.

`crates/saw-policy/src/server.rs` (2)
396-463: Busy-wait polling loop with 50ms timeout in production server code.

The MPC message feed loop uses `tokio::time::timeout(50ms, ws_rx.next())` as a poll, burning CPU cycles in a tight loop. This pattern is acceptable in tests but suboptimal for a production server.

Consider using `tokio::select!` to concurrently await both the WebSocket message and the sign task completion:

Sketch

```rust
loop {
    tokio::select! {
        result = &mut sign_task => {
            // sign_task completed
            break;
        }
        msg = ws_rx.next() => {
            // process incoming WS message
        }
    }
}
```
76-76: Hardcoded presignature pool parameters.

`PresignaturePool::new(10, 2)` uses magic numbers. These should come from configuration (e.g., `PolicyConfig` or a separate pool config) to allow tuning without code changes.

`crates/saw-mpc/src/lib.rs` (1)
21-27: Consider re-exporting `Chain` alongside `ThresholdConfig`.

`Chain` is a field of `ThresholdConfig` (which is re-exported) and is needed by consumers (e.g., `saw-cli/src/keygen_threshold.rs` imports `saw_mpc::types::Chain` directly). Re-exporting it at the crate root would be consistent with the other type re-exports.

```diff
-pub use types::{KeyShareData, PartyId, ThresholdConfig};
+pub use types::{Chain, KeyShareData, PartyId, ThresholdConfig};
```

`crates/saw-mpc/src/relay.rs` (3)
161-161: Dead code: `_n` is assigned but never used.

```diff
-    let _n = expected_parties;
```

297-335: Unnecessary `Arc<Mutex>` wrapping for `ws_tx` and `ws_rx` in `connect_to_relay`.

Both `ws_tx` (line 299) and `ws_rx` (line 329) are wrapped in `Arc<Mutex<>>` but are each consumed by exactly one spawned task. They could be moved directly into their respective tasks without the synchronization overhead.

Suggested simplification for `ws_tx`

```diff
-    let ws_tx = Arc::new(Mutex::new(ws_tx));
-    let ws_tx_out = ws_tx.clone();
     tokio::spawn(async move {
         use cggmp21::round_based::MessageDestination;
         while let Some(outgoing) = outgoing_rx.next().await {
             // ...
-            let mut tx = ws_tx_out.lock().await;
-            if tx.send(WsMessage::Text(json.into())).await.is_err() {
+            if ws_tx.send(WsMessage::Text(json.into())).await.is_err() {
                 break;
             }
         }
     });
```

83-86: WebSocket handshake failure kills the relay server.

If one party's WS handshake fails (e.g., TLS issue, protocol mismatch), the entire relay returns an error and all connected parties are dropped. For a ceremony relay this may be acceptable since all parties are required, but consider logging and waiting for a retry from that party instead.
crates/saw-mpc/src/transport.rs (1)
338-339: Minor: `out_rxs.remove(0)` in a loop is O(n²).

Each `remove(0)` shifts all remaining elements. For the typical n=3 this is negligible. For clarity, consider draining:

```diff
-    for sender_idx in 0..n {
-        let mut rx = out_rxs.remove(0);
+    for (sender_idx, mut rx) in (0..n).zip(out_rxs.drain(..)) {
```

`crates/saw-daemon/src/lib.rs` (2)
1247-1320: `sign_evm_tx` duplicates the sighash + build logic now extracted into helpers.

`compute_evm_tx_sighash` and `build_signed_evm_tx` were extracted for the threshold path, but `sign_evm_tx` (the single-key path) still has its own copy of the same RLP encoding logic. Consider refactoring `sign_evm_tx` to use these helpers and keep EIP-1559 encoding in one place.

Also applies to: 1322-1385

532-538: `block_on` in the synchronous request handler blocks the entire daemon.

While `rt.block_on()` is safe here (no nested runtime), the daemon's synchronous `serve_loop` processes one request at a time. During threshold signing — especially the slow MPC path, which involves multiple WS round-trips and up to 5-second timeouts — the daemon is completely unresponsive to all other requests. This may be acceptable for an MVP but should be noted as a scaling limitation.

`crates/saw-policy/src/policy.rs` (1)
78-96: `PolicyState` fields are never read or written.

`daily_spend`, `rate_history`, and `tripped_breakers` are initialized in `new()` but never populated or queried anywhere in this module (aside from the read-only `tripped_breakers` check in `evaluate`). Consider adding a doc comment noting these are scaffolding for the unimplemented TODO conditions, so future contributors know the state management is intentionally deferred.

`crates/saw-daemon/src/threshold.rs` (4)
263-280: Persistent WS lock held for up to 5 seconds during fast-path signing.

The `ws` mutex is acquired at line 263 and held through the 5-second timeout recv loop (line 267). While this is safe today (the daemon is single-threaded due to `block_on`), if the daemon is ever made concurrent, this would block all other WS operations for the timeout duration. Consider releasing the lock after `send` and re-acquiring for `recv`, or using a split-lock pattern.
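The split-lock idea can be illustrated with a std `Mutex` and a simplified stand-in for the connection (the `Ws` struct and both functions below are hypothetical, not the daemon's actual WS handle):

```rust
use std::sync::Mutex;

// Hypothetical stand-in for a shared connection handle.
struct Ws {
    outbox: Vec<String>,
    inbox: Vec<String>,
}

// Acquire the lock only for the send, release it, then re-acquire for the
// receive — so other tasks can use the connection during the (potentially
// long) wait between the two operations.
fn send_then_recv(ws: &Mutex<Ws>, msg: &str) -> Option<String> {
    {
        let mut guard = ws.lock().unwrap();
        guard.outbox.push(msg.to_string()); // lock held only for the send
    } // lock released here — other callers may interleave
    let mut guard = ws.lock().unwrap(); // re-acquire for the receive
    guard.inbox.pop()
}
```

The trade-off is that a message from another caller may interleave between send and recv, so responses must be matched by a request/session id rather than by ordering.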
382-470: ~180 lines of nearly identical MPC-over-WS dispatch code duplicated between `sign_full_mpc` and `generate_one_presignature`.

The outgoing forwarding task and the incoming message dispatch loop are structurally identical in both functions. Extract a shared helper, e.g.:

```rust
async fn run_mpc_over_ws<F, T>(
    ws_tx: Arc<Mutex<WsTx>>,
    ws_rx: &mut WsRx,
    session_id: SessionId,
    mpc_future: F,
) -> Result<T, ThresholdError>
where
    F: Future<Output = Result<T, MpcError>>,
```

This would eliminate the duplication and make future changes (e.g., adding message type filtering, better error handling) easier to maintain.

Also applies to: 558-645

249-251: Dead computation — `our_partial_bytes` serialized but immediately discarded.

Lines 249-251 serialize the partial signature into bytes and then immediately suppress the unused-variable warning. If this is scaffolding for a future optimization, a TODO comment with the planned usage would be more informative than the current pattern.

488-518: Presign refill loop has no graceful shutdown mechanism.

The `presign_refill_loop` runs an infinite `loop` (line 493) with no cancellation token or shutdown signal. It will run until the tokio runtime is dropped. While the runtime drop will abort the task, this means in-progress presignature generation (which involves WS connections and MPC rounds) will be abruptly terminated. Consider accepting a `CancellationToken` or `watch::Receiver` for clean shutdown.
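The clean-shutdown pattern can be sketched with std threads and an atomic flag (names are illustrative; a `CancellationToken` or `watch::Receiver` plays the same role in the async refill loop):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// The loop checks a shared flag each iteration and exits cleanly instead of
// being aborted mid-round when the runtime is dropped.
fn spawn_refill_loop(shutdown: Arc<AtomicBool>) -> thread::JoinHandle<u32> {
    thread::spawn(move || {
        let mut refills = 0u32;
        while !shutdown.load(Ordering::Relaxed) {
            refills += 1; // stand-in for generating one presignature
            thread::sleep(Duration::from_millis(5));
        }
        refills // clean exit: in-progress work finished, count reported
    })
}
```

The key property is that shutdown is observed only between iterations, so a presignature generation that has already started runs to completion rather than being torn down mid-MPC-round.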
```dockerfile
# Usage:
#   docker build -t saw-recovery .
#   docker run -p 8080:8080 \
#     -e SAW_PASSPHRASE="your-passphrase" \
#     -e KEY_SHARE_BASE64="<base64-encoded-party2-share>" \
#     -e POLICY_YAML="<base64-encoded-policy>" \
#     saw-recovery
```
Secrets passed as environment variables are easily leaked.
`SAW_PASSPHRASE` and `KEY_SHARE_BASE64` contain highly sensitive cryptographic material. Environment variables are visible via `docker inspect`, `/proc/*/environ`, orchestrator dashboards, and crash logs. For a recovery co-signer, this is a meaningful security exposure.

Prefer Docker secrets (`docker run --secret`), a mounted secrets file, or an environment file (`--env-file`) with restricted permissions, and read the values from files at runtime instead.
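A minimal POSIX-shell sketch of the file-based approach (helper name and paths are illustrative, not the image's actual entrypoint):

```shell
# Hypothetical entrypoint helper: prefer a mounted secrets file over an
# inherited environment variable, so the secret never has to be passed
# via `-e` and never appears in `docker inspect`.
load_secret() {
  var_name="$1"
  file_path="$2"
  if [ -f "$file_path" ]; then
    # Read the file contents into the named variable.
    eval "$var_name=\"\$(cat \"$file_path\")\""
  fi
}
```

An entrypoint would call e.g. `load_secret SAW_PASSPHRASE /run/secrets/saw_passphrase` before exec'ing the binary, keeping the env-var path only as an explicit fallback.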
🤖 Prompt for AI Agents
In `@docker/recovery/Dockerfile` around lines 5 - 11, The Dockerfile currently
relies on sensitive environment variables (SAW_PASSPHRASE, KEY_SHARE_BASE64,
POLICY_YAML) which can be leaked; change the image and entrypoint so secrets are
read from files (e.g. /run/secrets/ or a mounted file) instead of env vars:
update the Dockerfile to document and expect docker secrets or a mounted secrets
file (and remove guidance to pass SAW_PASSPHRASE/KEY_SHARE_BASE64 via -e), and
modify the container startup/entrypoint logic to load SAW_PASSPHRASE,
KEY_SHARE_BASE64 and POLICY_YAML from secure file paths (and support an
--env-file fallback only with strict file permissions) so sensitive values are
not passed as process environment variables.
> ### Accepted risks
>
> - **Root compromise + active policy signer**: An attacker with root on the agent machine can sign transactions as long as they pass policy rules and the policy signer is reachable. Mitigation: strict policy rules (allowlists, rate limits, circuit breakers).
> - **Policy signer downtime**: Signing fails fast (5s timeout) and returns `policy_unavailable`. The agent decides whether to retry, queue, or skip. No silent failures.
> - **2-of-2 mode (no recovery share)**: Supported but explicitly warned against. If either share is lost, the wallet is irrecoverable. Only appropriate when operators accept this tradeoff.
> - **No key refresh in v1**: The cggmp21 crate doesn't support threshold key refresh. If a share is suspected compromised, the mitigation is a full re-keygen to a new wallet and on-chain fund transfer. This is a known limitation we accept for v1.
Threat model is incomplete — several attack vectors are missing.

The "Accepted risks" section focuses primarily on the agent-compromise scenario, but several other threat vectors relevant to the MPC architecture are not addressed:

- **Unauthenticated WebSocket/relay transport**: PR objectives state "Not included: mTLS/WS auth," meaning the party-to-party communication during MPC ceremonies is unauthenticated. This opens risks of man-in-the-middle attacks, protocol message injection, or impersonation during keygen/signing.
- **Policy server compromise**: The doc discusses agent compromise but not policy co-signer compromise. What are the implications if an attacker gains root on the policy server? Can they bypass rules, exfiltrate their key share, or cause signing to fail?
- **Presignature pool attacks**: The PR adds a presignature pool for performance. What happens if an attacker can exhaust the pool, cause presignature generation to fail, or tamper with stored presignatures?
- **Denial-of-service on signing ceremonies**: MPC protocols require multi-party coordination. What happens if an attacker disrupts the relay, floods the WebSocket, or causes one party to drop during signing?
- **Key share encryption at rest**: The PR implements "key-share encryption using Argon2id + ChaCha20-Poly1305," but the threat model doesn't discuss an attacker who gains disk access without runtime access (e.g., stolen drive, backup compromise). What protections exist?
These should be documented with their mitigations or acknowledged as accepted risks. A threat model document should comprehensively cover the attack surface introduced by the architecture.
Would you like me to draft expanded threat scenarios covering these attack vectors?
🤖 Prompt for AI Agents
In `@docs/threat-model-and-design-rationale.md` around lines 60 - 65, The Accepted
risks section is incomplete—add explicit entries for missing attack vectors:
document risks and mitigations for Unauthenticated WebSocket/relay transport
(MITM, message injection, impersonation) and note design decision "Not included:
mTLS/WS auth"; for Policy server compromise (policy signer root compromise,
bypassing rules, exfiltration of signer share) reference "policy signer" and
explain detection/response and acceptance or hardening options; for Presignature
pool attacks (exhaustion, tampering, integrity loss) reference "presignature
pool" and "presignature" and describe rate-limits, authentication, integrity
checks, and pool monitoring/rehydration; for Denial-of-service on signing
ceremonies (relay flood, dropped party) reference MPC ceremonies and add
timeouts, retries, fallback policies; and for Key share encryption at rest
(Argon2id + ChaCha20-Poly1305) describe threats from disk theft/backups and
mitigations like stronger KDF params, hardware-backed keys, and key
rotation/re-key procedures or explicitly mark as accepted risk—update the
Accepted risks list to include these items and link to existing items like
"2-of-2 mode" and "policy_unavailable" where relevant.
> - **Daemon ↔ Policy:** WebSocket (plaintext WS over Railway's TLS termination)
> - **Message format:** JSON-serialized `WireMessage` enum (SignRequest, PolicyDecision, MpcWireMessage, PresignRequest, etc.)
> - **Reconnection:** Persistent WS with exponential backoff (1s → 30s)
> - **Future:** mTLS with pinned certificates
Transport section acknowledges no WS authentication — ensure this is tracked.
The doc notes that auth is "planned for production" and mTLS is future work. Given that the policy server co-signs transactions, an unauthenticated WebSocket is a significant gap. Consider linking to a tracking issue so this doesn't slip through the cracks.
Would you like me to open an issue to track adding WebSocket authentication (mTLS / token-based) before production deployment?
🤖 Prompt for AI Agents
In `@docs/threshold-signing-spec.md` around lines 156 - 159, The Transport section
currently documents an unauthenticated plaintext WebSocket between "Daemon ↔
Policy" and lists message types like the JSON-serialized WireMessage
(SignRequest, PolicyDecision, MpcWireMessage, PresignRequest) and “mTLS” as
future — add a tracked task: create a repository issue titled something like
"Add WebSocket auth for Daemon↔Policy (mTLS/token)" describing required options
(mTLS with pinned certs, token-based auth, and migration plan), assign owner and
milestone, then update the docs/threshold-signing-spec.md Transport paragraph to
include a link to that issue and a TODO note next to "mTLS" so reviewers can
find the tracking item before production; ensure the issue references
WireMessage/PolicyDecision to show impact on co-signing.
Actionable comments posted: 9
🤖 Fix all issues with AI agents
In `@crates/saw-cli/src/keygen_local.rs`:
- Around line 136-138: The code currently does fs::write(&share_path,
&share_data) then fs::set_permissions(&share_path, ...), which temporarily
creates the file with the process umask; instead open and write the file with
explicit restrictive mode atomically: replace the fs::write +
fs::set_permissions sequence with creating the file via
OpenOptions::new().create(true).write(true).mode(0o600).open(&share_path), write
share_data into that handle (e.g., write_all), and handle errors similarly so
the file is never observable with default umask; update the error messages that
referenced fs::write and fs::set_permissions to reference the new open/write
operation and keep the same formatting (e.g., "write party {i}: {e}").
In `@crates/saw-daemon/src/lib.rs`:
- Around line 1281-1304: The function recover_y_parity currently falls back to
returning 0 on failure which yields an invalid v value; change its signature to
return a Result<u8, E> (e.g., Result<u8, anyhow::Error> or a custom error) and
on success return Ok(v_candidate), but if no recovery id matches return Err with
a descriptive message mentioning the input r_bytes/s_bytes/msg_hash/expected_pk
context; update callers to propagate or handle the error (or log and abort)
instead of assuming 0. Keep the existing recovery loop using Secp256k1::new(),
RecoveryId::from_i32, RecoverableSignature::from_compact, and
secp.recover_ecdsa; ensure the error path includes which stage failed (no
matching recovery id / recovery failed) to aid debugging.
In `@crates/saw-daemon/src/threshold.rs`:
- Around line 248-251: The code serializes the partial signature into
our_partial_bytes via serde_json::to_vec and then immediately discards it;
remove that dead serialization and the unused our_partial_bytes binding: keep
the call to signing::issue_partial_signature to produce our_partial, delete the
serde_json::to_vec(...).map_err(|e| ThresholdError::Mpc(format!("serialize
partial: {e}")))? expression and the let _ = our_partial_bytes; line so no
unused variable or unnecessary serialization remains.
In `@crates/saw-mpc/src/relay.rs`:
- Around line 269-286: The loop in connect_to_relay that awaits AllJoined via
ws_rx currently blocks indefinitely; wrap the waiting logic in a configurable
timeout (e.g., using tokio::time::timeout with a Duration pulled from
configuration or a new parameter) so that if AllJoined is not received within
the timeout you return a meaningful MpcError (e.g., MpcError::Transport or a new
MpcError::Timeout) instead of hanging; keep the existing envelope parsing
(RelayEnvelope) and error handling for Joined/Error cases but place the entire
receive-and-match loop under the timeout future and surface the timeout error to
the caller.
- Around line 90-103: The join read loop blocks forever when waiting for a
RelayEnvelope; wrap the receive logic (the loop that consumes ws_rx and
constructs join_msg) in a tokio::time::timeout (choose a 10–30s Duration, e.g.,
15s) so the await fails if no Join arrives; on timeout return an appropriate
crate::error::MpcError::Transport (e.g., "join timeout") instead of hanging.
Locate the loop that assigns join_msg, the ws_rx usage, and the RelayEnvelope
deserialization to apply the timeout wrapper and map the timeout error into
MpcError::Transport.
In `@crates/saw-policy/src/server.rs`:
- Line 246: The partial-sign path inconsistently swallows send_wire errors
(calls like let _ = send_wire(&ws_tx, &resp).await;) while one call uses ?; make
the error-handling consistent by either propagating errors from send_wire
(replace the ignore with send_wire(&ws_tx, &resp).await? so the connection
handler can clean up) or by explicitly logging and continuing (match
send_wire(...).await { Ok(_) => {}, Err(e) => tracing::error!("send_wire failed:
{:?}", e) }). Update all occurrences in the partial-sign flow that reference
send_wire, ws_tx, and resp to use the same chosen strategy.
- Around line 78-82: The outer message loop currently breaks unconditionally
when read_next_wire(&mut ws_rx).await returns Err, which disconnects the client
on any malformed frame; change the logic in that loop (the match around
read_next_wire in server.rs) to inspect the error variant returned by
read_next_wire: log and continue the loop for deserialization/parse errors so a
single bad frame is skipped, but only break/return on transport-level errors
(EOF/connection closed) so the session stays alive; use the existing logger to
record the parse error and which client/ws_rx message was dropped.
In `@docs/threshold-signing-spec.md`:
- Around line 67-73: Add a security note under the "Local keygen (current)"
section clarifying that the keygen-local mode (invoked via the keygen-local
command/SAW_PASSPHRASE example) generates all shares in one process and
therefore temporarily creates full signing capability on a single machine; mark
it explicitly as a convenience/testing mode and warn that it should not be used
in production where the threat model requires that no single party ever holds
all shares, and recommend using distributed key generation (distributed keygen)
for production deployments instead.
- Line 90: The spec currently says "detects plaintext shares and loads them
directly" which is a security risk; update the spec text to require a
mitigation: either emit a clear, high-severity warning when plaintext shares are
detected, require an explicit opt-in flag (e.g., --allow-plaintext-shares)
before loading plaintext shares, or mandate automatic encryption-on-first-load
(and refuse subsequent unencrypted loads) with details on behavior and failure
modes; refer to the existing phrase "detects plaintext shares and loads them
directly" when replacing it and add the chosen mitigation option and required
CLI/behavioral semantics.
🧹 Nitpick comments (17)
docs/threshold-signing-spec.md (2)
56-62: Document the rationale for memory-only presignatures.

The presignature pool is marked as "memory-only" and "not persisted to disk," which means they're regenerated on every restart. Consider adding a brief explanation of why this design choice was made (e.g., security trade-offs, performance considerations, or attack surface reduction) to help readers understand the availability implications.
12-12: Address markdown linting issues.

Three minor formatting issues flagged by markdownlint:

- Lines 12 & 37: Fenced code blocks (ASCII diagrams) should specify a language. Use `text` as the fence language for ASCII art.
- Line 122: The table at line 122 should be surrounded by blank lines (add a blank line before it).

✨ Proposed formatting fixes

- Fix 1: Add the `text` language specifier to the opening fence of the architecture diagram at line 12 (the `Agent Machine / Policy Server (Railway/VPS) / Human` diagram).
- Fix 2: Add the `text` language specifier to the opening fence of the signing-flow diagram at line 37 (`Agent ──signTx()──► saw-daemon ──SignRequest──► saw-policy`).
- Fix 3: Insert a blank line between the environment-variable table (`SAW_PASSPHRASE`, `PORT`) and the `## Deployment` heading (lines 121-122).

As per coding guidelines, markdownlint-cli2 flagged these formatting issues.

Also applies to: 37-37, 122-122
crates/saw-mpc/src/signing.rs (2)
46-52: Consider documenting thread-safety expectations.

`PresignaturePool` uses `&mut self` throughout, so callers (e.g., the daemon doing background refill alongside signing) must wrap it in a `Mutex` or similar. This is a reasonable design, but a brief doc comment on the struct noting it's not internally synchronized would help prevent misuse.
135-137: Extract `DataToSign` construction into a shared helper.

The `DataToSign::from_scalar(generic_ec::Scalar::from_be_bytes_mod_order(message_hash))` pattern is duplicated in `issue_partial_signature` (Line 137) and `sign_full` (Line 171). A small helper like `fn data_to_sign(hash: &[u8; 32]) -> DataToSign<Secp256k1>` would reduce duplication and centralize the "no double-hash" invariant.

♻️ Proposed helper

```diff
+/// Build a DataToSign from a pre-hashed message (no double-hashing).
+fn data_to_sign(message_hash: &[u8; 32]) -> DataToSign<Secp256k1> {
+    DataToSign::from_scalar(generic_ec::Scalar::from_be_bytes_mod_order(message_hash))
+}
+
 /// Issue a partial signature from a presignature (local, no network).
 pub fn issue_partial_signature(
     presig: cggmp21::Presignature<Secp256k1>,
     message_hash: &[u8; 32],
 ) -> PartialSignature<Secp256k1> {
-    let data = DataToSign::from_scalar(generic_ec::Scalar::from_be_bytes_mod_order(message_hash));
+    let data = data_to_sign(message_hash);
     presig.issue_partial_signature(data)
 }
```

And similarly in `sign_full`.

Also applies to: 169-171
crates/saw-mpc/tests/ws_transport.rs (1)
99-100: 50ms sleep is a race-prone synchronization mechanism.

While acceptable for tests, the `sleep` between server bind and client connect can fail on slow CI runners. A more robust approach is to use a `tokio::sync::Notify` or a `oneshot` channel signaled after `ws_accept` is ready.

crates/saw-cli/src/keygen_local.rs (1)
148-149: Metadata hardcodes `Chain::Evm` — future chains will need plumbing here.

`ThresholdConfig::new_2of3(0, wallet, Chain::Evm)` locks local keygen to EVM. Consider accepting a `--chain` CLI flag so this generalizes when other chains are supported.

crates/saw-mpc/src/protocol.rs (1)
22-25: `from_str` method name shadows the `FromStr` trait convention.

Naming an inherent method `from_str` is confusing because it shadows `std::str::FromStr::from_str`. If someone imports `FromStr` and calls `SessionId::from_str(s)`, the inherent method wins silently. Consider renaming to `from_string` or `from_id`, or implementing `From<&str>`/`FromStr` instead.

crates/saw-mpc/src/relay.rs (2)
296-329: Unnecessary `Arc<Mutex>` wrappers around `ws_tx` and `ws_rx`.

`ws_tx` (line 298) is only used by the outgoing task, and `ws_rx` (line 328) is only used by the incoming task. Each can be moved directly into its respective spawned task without `Arc<Mutex>` overhead.

Proposed fix (ws_tx):

```diff
-    let ws_tx = Arc::new(Mutex::new(ws_tx));
-    let ws_tx_out = ws_tx.clone();
     tokio::spawn(async move {
         use cggmp21::round_based::MessageDestination;
         while let Some(outgoing) = outgoing_rx.next().await {
             // ...
-            let mut tx = ws_tx_out.lock().await;
-            if tx.send(WsMessage::Text(json.into())).await.is_err() {
+            if ws_tx.send(WsMessage::Text(json.into())).await.is_err() {
                 break;
             }
         }
     });
```
308-311: Serialization/deserialization errors silently dropped.

At lines 308-311 and 359, failures to serialize outgoing messages or deserialize incoming `M` are silently ignored with `continue`. These should at least emit a `tracing::warn!` to aid debugging MPC protocol issues.

Also applies to: 359-374
crates/saw-daemon/src/threshold.rs (2)
371-470: Massive code duplication between `sign_full_mpc` and `generate_one_presignature`.

The outgoing task (lines 382–407 vs 560–585) and the WS-to-MPC bridging loop (lines 419–470 vs 596–647) are nearly identical. Extract a shared helper, e.g. `run_mpc_over_ws(ws_tx, ws_rx, session_id, incoming_tx) -> ...`, to reduce ~100 lines of duplication and prevent the two paths from diverging.

Also applies to: 550-647
419-470: Polling loop with 50ms timeout is less efficient than `tokio::select!`.

The loop at lines 419–470 polls `ws_rx.next()` with a 50ms timeout and checks `sign_task.is_finished()` each iteration. This introduces up to 50ms latency between task completion and detection. Using `tokio::select!` on both the WS stream and the sign task's `JoinHandle` would be more responsive and idiomatic.

crates/saw-daemon/src/lib.rs (2)
534-534: `rt.block_on()` will panic if called from within an existing tokio runtime.

`handle_sign_evm_tx_threshold` and `handle_sign_eip2612_permit_threshold` use `rt.block_on(...)` (lines 534, 846). If `serve_loop` is ever invoked from within a tokio context (e.g., integration tests or a future refactoring to async), this panics with "Cannot start a runtime from within a runtime". The current synchronous Unix socket server is safe, but document this constraint or consider `rt.spawn` + a blocking channel for resilience.

Also applies to: 846-846
303-326: Threshold address derivation duplicates logic from `derive_evm_address`.

Lines 303–326 re-implement Keccak256-based address derivation from an uncompressed public key. This duplicates the pattern in `derive_evm_address` (lines 1155–1179), using a different public key source. Consider extracting an `address_from_uncompressed_pubkey(pub_bytes: &[u8]) -> String` helper to keep the derivation logic in one place.

crates/saw-mpc/tests/daemon_policy_flow.rs (2)
182-182: Fixed sleep before connect is a test-fragility risk.

The 50ms sleep between server bind and client connect is a race. Under CI load, the listener may not be ready. Consider passing the bound `addr` to the daemon task and retrying the connect in a loop, or using a `tokio::sync::oneshot` to signal readiness.
130-180: Significant duplication of the MPC↔WS routing loop between daemon and policy sides.

The outgoing router (serialize → WS send) and incoming router (WS recv → deserialize → channel) are nearly identical between the two tasks, differing only in the `from` party ID. Consider extracting a helper to reduce ~90 lines of copy-paste. This is test code, so it's a nice-to-have.

Also applies to: 233-278
crates/saw-policy/src/server.rs (1)
86-216: Extract the MPC↔WS routing loop into a shared helper.

The presign flow (Lines 86–216) and sign flow (Lines 349–470) contain nearly identical MPC message routing: outgoing task (serialize → WS send), incoming loop (WS recv → deserialize → channel feed), and result collection. The only differences are the MPC operation called and the execution ID prefix.

Extracting a generic `run_mpc_over_ws(...)` helper that takes a future-producing closure would eliminate ~120 lines of duplication and make it easier to keep both paths consistent (e.g., error handling, timeout values).

Also applies to: 349-470
crates/saw-policy/Dockerfile (1)

8-9: Add a `.dockerignore` to exclude unnecessary build artifacts and sensitive files from the Docker build context.

Without a `.dockerignore`, running `docker build` from the repository root sends the entire repo — including `target/`, `.git/`, and any local key material — to the Docker daemon, unnecessarily increasing build context size and exposing potential secrets. Add a `.dockerignore` file at the repository root with entries for `target/`, `.git/`, `node_modules/`, and other build artifacts.
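A minimal `.dockerignore` along those lines might look like the following sketch; the `*.key` and `.env` patterns are illustrative additions for local key material, not entries named by the review:

```text
# Keep the Docker build context small and free of secrets
target/
.git/
node_modules/
**/*.key
.env
```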
```rust
fs::write(&share_path, &share_data).map_err(|e| format!("write party {i}: {e}"))?;
fs::set_permissions(&share_path, fs::Permissions::from_mode(0o600))
    .map_err(|e| format!("chmod party {i}: {e}"))?;
```
Key share written with default umask before set_permissions narrows it.
Between fs::write and fs::set_permissions, the file briefly exists with whatever the process umask allows (often 0o644). For secret key material, consider using OpenOptions::new().mode(0o600).write(true).create(true) to create the file with restrictive permissions atomically.
Proposed fix:

```diff
-    fs::write(&share_path, &share_data).map_err(|e| format!("write party {i}: {e}"))?;
-    fs::set_permissions(&share_path, fs::Permissions::from_mode(0o600))
-        .map_err(|e| format!("chmod party {i}: {e}"))?;
+    {
+        use std::fs::OpenOptions;
+        use std::io::Write;
+        use std::os::unix::fs::OpenOptionsExt;
+        let mut f = OpenOptions::new()
+            .write(true)
+            .create(true)
+            .truncate(true)
+            .mode(0o600)
+            .open(&share_path)
+            .map_err(|e| format!("write party {i}: {e}"))?;
+        f.write_all(&share_data)
+            .map_err(|e| format!("write party {i}: {e}"))?;
+    }
```

🤖 Prompt for AI Agents
In `@crates/saw-cli/src/keygen_local.rs` around lines 136 - 138, The code
currently does fs::write(&share_path, &share_data) then
fs::set_permissions(&share_path, ...), which temporarily creates the file with
the process umask; instead open and write the file with explicit restrictive
mode atomically: replace the fs::write + fs::set_permissions sequence with
creating the file via
OpenOptions::new().create(true).write(true).mode(0o600).open(&share_path), write
share_data into that handle (e.g., write_all), and handle errors similarly so
the file is never observable with default umask; update the error messages that
referenced fs::write and fs::set_permissions to reference the new open/write
operation and keep the same formatting (e.g., "write party {i}: {e}").
```rust
fn recover_y_parity(
    r_bytes: &[u8],
    s_bytes: &[u8],
    msg_hash: &[u8; 32],
    expected_pk: &[u8],
) -> u8 {
    let secp = Secp256k1::new();
    let msg = Message::from_digest_slice(msg_hash).expect("valid hash");
    let mut sig_bytes = [0u8; 64];
    sig_bytes[..32].copy_from_slice(r_bytes);
    sig_bytes[32..].copy_from_slice(s_bytes);
    for v_candidate in 0u8..2 {
        if let Ok(rid) = secp256k1::ecdsa::RecoveryId::from_i32(v_candidate as i32) {
            if let Ok(rec_sig) = RecoverableSignature::from_compact(&sig_bytes, rid) {
                if let Ok(recovered) = secp.recover_ecdsa(&msg, &rec_sig) {
                    if recovered.serialize_uncompressed()[1..] == expected_pk[1..] {
                        return v_candidate;
                    }
                }
            }
        }
    }
    0 // fallback
}
```
recover_y_parity silently returns 0 on failure — produces an invalid transaction.
If neither recovery ID matches the expected public key (e.g., due to a bug or key mismatch), the fallback 0 at line 1303 silently produces a transaction with the wrong v value. This would result in a valid-looking but unbroadcastable transaction, making the root cause hard to diagnose.
Consider returning a Result or at least logging a warning on fallback.
Proposed fix:

```diff
 fn recover_y_parity(
     r_bytes: &[u8],
     s_bytes: &[u8],
     msg_hash: &[u8; 32],
     expected_pk: &[u8],
-) -> u8 {
+) -> Result<u8, String> {
     let secp = Secp256k1::new();
     let msg = Message::from_digest_slice(msg_hash).expect("valid hash");
     let mut sig_bytes = [0u8; 64];
     sig_bytes[..32].copy_from_slice(r_bytes);
     sig_bytes[32..].copy_from_slice(s_bytes);
     for v_candidate in 0u8..2 {
         if let Ok(rid) = secp256k1::ecdsa::RecoveryId::from_i32(v_candidate as i32) {
             if let Ok(rec_sig) = RecoverableSignature::from_compact(&sig_bytes, rid) {
                 if let Ok(recovered) = secp.recover_ecdsa(&msg, &rec_sig) {
                     if recovered.serialize_uncompressed()[1..] == expected_pk[1..] {
-                        return v_candidate;
+                        return Ok(v_candidate);
                     }
                 }
             }
         }
     }
-    0 // fallback
+    Err("failed to recover y_parity: no recovery ID matched the expected public key".into())
 }
```
}🤖 Prompt for AI Agents
In `@crates/saw-daemon/src/lib.rs` around lines 1281 - 1304, The function
recover_y_parity currently falls back to returning 0 on failure which yields an
invalid v value; change its signature to return a Result<u8, E> (e.g.,
Result<u8, anyhow::Error> or a custom error) and on success return
Ok(v_candidate), but if no recovery id matches return Err with a descriptive
message mentioning the input r_bytes/s_bytes/msg_hash/expected_pk context;
update callers to propagate or handle the error (or log and abort) instead of
assuming 0. Keep the existing recovery loop using Secp256k1::new(),
RecoveryId::from_i32, RecoverableSignature::from_compact, and
secp.recover_ecdsa; ensure the error path includes which stage failed (no
matching recovery id / recovery failed) to aid debugging.
```rust
let our_partial = signing::issue_partial_signature(presignature, message_hash);
let our_partial_bytes = serde_json::to_vec(&our_partial)
    .map_err(|e| ThresholdError::Mpc(format!("serialize partial: {e}")))?;
let _ = our_partial_bytes; // used for potential future optimization
```
🛠️ Refactor suggestion | 🟠 Major
Dead code: `our_partial_bytes` is computed then immediately discarded.

`our_partial_bytes` is serialized on line 249 and then explicitly dropped on line 251. Remove the unused serialization.
Proposed fix:

```diff
 // Issue our partial signature locally (zero network!)
 let our_partial = signing::issue_partial_signature(presignature, message_hash);
-let our_partial_bytes = serde_json::to_vec(&our_partial)
-    .map_err(|e| ThresholdError::Mpc(format!("serialize partial: {e}")))?;
-let _ = our_partial_bytes; // used for potential future optimization
```
- let _ = our_partial_bytes; // used for potential future optimization📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let our_partial = signing::issue_partial_signature(presignature, message_hash); | |
| let our_partial_bytes = serde_json::to_vec(&our_partial) | |
| .map_err(|e| ThresholdError::Mpc(format!("serialize partial: {e}")))?; | |
| let _ = our_partial_bytes; // used for potential future optimization | |
| let our_partial = signing::issue_partial_signature(presignature, message_hash); |
🤖 Prompt for AI Agents
In `@crates/saw-daemon/src/threshold.rs` around lines 248 - 251, The code
serializes the partial signature into our_partial_bytes via serde_json::to_vec
and then immediately discards it; remove that dead serialization and the unused
our_partial_bytes binding: keep the call to signing::issue_partial_signature to
produce our_partial, delete the serde_json::to_vec(...).map_err(|e|
ThresholdError::Mpc(format!("serialize partial: {e}")))? expression and the let
_ = our_partial_bytes; line so no unused variable or unnecessary serialization
remains.
```rust
let join_msg = loop {
    match ws_rx.next().await {
        Some(Ok(WsMessage::Text(t))) => {
            break serde_json::from_str::<RelayEnvelope>(&t)
                .map_err(|e| crate::error::MpcError::Transport(format!("bad join: {e}")))?;
        }
        Some(Ok(WsMessage::Binary(d))) => {
            break serde_json::from_slice::<RelayEnvelope>(&d)
                .map_err(|e| crate::error::MpcError::Transport(format!("bad join: {e}")))?;
        }
        Some(Ok(_)) => continue,
        _ => return Err(crate::error::MpcError::Transport("connection lost before join".into())),
    }
};
```
No timeout on Join message read — a slow or malicious peer blocks the entire relay.
The relay loops indefinitely at line 90 waiting for a Join message from each incoming connection. A client that connects but never sends data will prevent additional parties from joining. Add a timeout (e.g., 10–30 seconds) around this read loop.
Proposed fix:

```diff
 // Read Join message
-    let join_msg = loop {
-        match ws_rx.next().await {
+    let join_msg = tokio::time::timeout(Duration::from_secs(30), async {
+        loop {
+            match ws_rx.next().await {
+                // ... existing arms ...
+            }
+        }
+    })
+    .await
+    .map_err(|_| crate::error::MpcError::Transport("join timeout".into()))??;
```
+ .map_err(|_| crate::error::MpcError::Transport("join timeout".into()))??;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let join_msg = loop { | |
| match ws_rx.next().await { | |
| Some(Ok(WsMessage::Text(t))) => { | |
| break serde_json::from_str::<RelayEnvelope>(&t) | |
| .map_err(|e| crate::error::MpcError::Transport(format!("bad join: {e}")))?; | |
| } | |
| Some(Ok(WsMessage::Binary(d))) => { | |
| break serde_json::from_slice::<RelayEnvelope>(&d) | |
| .map_err(|e| crate::error::MpcError::Transport(format!("bad join: {e}")))?; | |
| } | |
| Some(Ok(_)) => continue, | |
| _ => return Err(crate::error::MpcError::Transport("connection lost before join".into())), | |
| } | |
| }; | |
| let join_msg = tokio::time::timeout(Duration::from_secs(30), async { | |
| loop { | |
| match ws_rx.next().await { | |
| Some(Ok(WsMessage::Text(t))) => { | |
| break serde_json::from_str::<RelayEnvelope>(&t) | |
| .map_err(|e| crate::error::MpcError::Transport(format!("bad join: {e}")))?; | |
| } | |
| Some(Ok(WsMessage::Binary(d))) => { | |
| break serde_json::from_slice::<RelayEnvelope>(&d) | |
| .map_err(|e| crate::error::MpcError::Transport(format!("bad join: {e}")))?; | |
| } | |
| Some(Ok(_)) => continue, | |
| _ => return Err(crate::error::MpcError::Transport("connection lost before join".into())), | |
| } | |
| } | |
| }) | |
| .await | |
| .map_err(|_| crate::error::MpcError::Transport("join timeout".into()))??; |
🤖 Prompt for AI Agents
In `@crates/saw-mpc/src/relay.rs` around lines 90 - 103, The join read loop blocks
forever when waiting for a RelayEnvelope; wrap the receive logic (the loop that
consumes ws_rx and constructs join_msg) in a tokio::time::timeout (choose a
10–30s Duration, e.g., 15s) so the await fails if no Join arrives; on timeout
return an appropriate crate::error::MpcError::Transport (e.g., "join timeout")
instead of hanging. Locate the loop that assigns join_msg, the ws_rx usage, and
the RelayEnvelope deserialization to apply the timeout wrapper and map the
timeout error into MpcError::Transport.
```rust
loop {
    let raw = match ws_rx.next().await {
        Some(Ok(WsMessage::Text(t))) => t.into_bytes(),
        Some(Ok(WsMessage::Binary(d))) => d.to_vec(),
        Some(Ok(_)) => continue,
        _ => return Err(crate::error::MpcError::Transport("connection lost".into())),
    };
    let env: RelayEnvelope = serde_json::from_slice(&raw)
        .map_err(|e| crate::error::MpcError::Transport(format!("bad msg: {e}")))?;
    match env {
        RelayEnvelope::Joined { .. } => continue,
        RelayEnvelope::AllJoined { .. } => break,
        RelayEnvelope::Error { message } => {
            return Err(crate::error::MpcError::Transport(format!("relay error: {message}")));
        }
        _ => continue,
    }
}
```
No timeout on AllJoined wait — client blocks indefinitely if not all parties connect.
`connect_to_relay` loops forever at line 269 waiting for `AllJoined`. If the relay never sends it (e.g., a party is missing), the caller hangs with no recourse. Wrap in a configurable timeout.
🤖 Prompt for AI Agents
In `@crates/saw-mpc/src/relay.rs` around lines 269 - 286, The loop in
connect_to_relay that awaits AllJoined via ws_rx currently blocks indefinitely;
wrap the waiting logic in a configurable timeout (e.g., using
tokio::time::timeout with a Duration pulled from configuration or a new
parameter) so that if AllJoined is not received within the timeout you return a
meaningful MpcError (e.g., MpcError::Transport or a new MpcError::Timeout)
instead of hanging; keep the existing envelope parsing (RelayEnvelope) and error
handling for Joined/Error cases but place the entire receive-and-match loop
under the timeout future and surface the timeout error to the caller.
```rust
loop {
    let wire = match read_next_wire(&mut ws_rx).await {
        Ok(msg) => msg,
        Err(_) => break,
    };
```
Any deserialization failure on the outer message loop disconnects the client.
`read_next_wire` returns `Err` for malformed messages, and Line 81 unconditionally breaks the connection loop. A single garbled or unexpected frame from the daemon kills the entire session. Consider logging and continuing for parse errors, only breaking on transport-level errors (close/EOF).
Proposed approach:

```diff
 let wire = match read_next_wire(&mut ws_rx).await {
     Ok(msg) => msg,
-    Err(_) => break,
+    Err(MpcError::Transport(_)) => break,
+    Err(e) => {
+        tracing::warn!(error = %e, "ignoring malformed message");
+        continue;
+    }
 };
```
In `@crates/saw-policy/src/server.rs` around lines 78 - 82, The outer message loop
currently breaks unconditionally when read_next_wire(&mut ws_rx).await returns
Err, which disconnects the client on any malformed frame; change the logic in
that loop (the match around read_next_wire in server.rs) to inspect the error
variant returned by read_next_wire: log and continue the loop for
deserialization/parse errors so a single bad frame is skipped, but only
break/return on transport-level errors (EOF/connection closed) so the session
stays alive; use the existing logger to record the parse error and which
client/ws_rx message was dropped.
```rust
    reason: decision.reason,
    partial_signature: None,
});
let _ = send_wire(&ws_tx, &resp).await;
```
Inconsistent error handling on `send_wire` calls in the partial-sign path.

Lines 246, 261, and 281 discard send errors with `let _ =`, but Line 303 propagates with `?`. If the WebSocket is broken, the `let _ =` calls silently swallow the failure (the daemon never learns why its request was denied), while Line 303 kills the connection handler. Pick a consistent strategy — either always propagate (so the handler cleans up) or always log-and-continue.

Also applies to: 261-261, 281-281, 303-303
🤖 Prompt for AI Agents
In `@crates/saw-policy/src/server.rs` at line 246, The partial-sign path
inconsistently swallows send_wire errors (calls like let _ = send_wire(&ws_tx,
&resp).await;) while one call uses ?; make the error-handling consistent by
either propagating errors from send_wire (replace the ignore with
send_wire(&ws_tx, &resp).await? so the connection handler can clean up) or by
explicitly logging and continuing (match send_wire(...).await { Ok(_) => {},
Err(e) => tracing::error!("send_wire failed: {:?}", e) }). Update all
occurrences in the partial-sign flow that reference send_wire, ws_tx, and resp
to use the same chosen strategy.
````markdown
### Local keygen (current)
Generate all 3 shares in one process, then distribute:
```bash
SAW_PASSPHRASE="secret" saw keygen-local --wallet mywallet --root ~/.saw
```
Outputs 3 encrypted share files + metadata with the derived address.
````
Local keygen undermines threshold security during setup.
The local keygen mode generates all 3 shares in a single process before distribution. This means the full signing capability exists on one machine during keygen, which contradicts the core security property stated in the overview ("The full private key never exists on any single machine"). While shares are encrypted before distribution, an attacker with access to that machine during keygen could compromise all shares.
Consider adding a security note that local keygen is a convenience mode for testing/development and that distributed keygen should be used for production deployments where the threat model requires that no single party ever holds all shares.
🤖 Prompt for AI Agents
In `@docs/threshold-signing-spec.md` around lines 67 - 73, Add a security note
under the "Local keygen (current)" section clarifying that the keygen-local mode
(invoked via the keygen-local command/SAW_PASSPHRASE example) generates all
shares in one process and therefore temporarily creates full signing capability
on a single machine; mark it explicitly as a convenience/testing mode and warn
that it should not be used in production where the threat model requires that no
single party ever holds all shares, and recommend using distributed key
generation (distributed keygen) for production deployments instead.
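One possible wording for such a note, as an illustrative sketch rather than the spec's actual text:

```markdown
> **Security note:** `keygen-local` produces all three shares in a single
> process, so full signing capability briefly exists on one machine. Treat it
> as a convenience mode for testing and development only. For production,
> where the threat model requires that no single party ever holds all shares,
> use distributed key generation (`keygen-threshold`) instead.
```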
```markdown
- AEAD: ChaCha20-Poly1305
- Format: `SAW1` magic bytes + salt + nonce + ciphertext + tag
- Passphrase via `SAW_PASSPHRASE` env var
- Backward-compatible: detects plaintext shares and loads them directly
```
Backward compatibility with plaintext shares poses a security risk.
The system detects and loads plaintext shares directly. This creates a risk where shares might be accidentally stored or remain unencrypted without the user's knowledge, potentially exposing the key material. Consider one of the following mitigations:

- Emit a loud warning when plaintext shares are detected
- Require an explicit flag like `--allow-plaintext-shares` to load unencrypted shares
- Automatically encrypt plaintext shares on first load and refuse to load them subsequently
🤖 Prompt for AI Agents
In `@docs/threshold-signing-spec.md` at line 90, The spec currently says "detects
plaintext shares and loads them directly" which is a security risk; update the
spec text to require a mitigation: either emit a clear, high-severity warning
when plaintext shares are detected, require an explicit opt-in flag (e.g.,
--allow-plaintext-shares) before loading plaintext shares, or mandate automatic
encryption-on-first-load (and refuse subsequent unencrypted loads) with details
on behavior and failure modes; refer to the existing phrase "detects plaintext
shares and loads them directly" when replacing it and add the chosen mitigation
option and required CLI/behavioral semantics.
Adds threshold ECDSA signing to SAW using the CGGMP21 protocol. The full private key never exists on any single machine.
What's included
`keygen-local` (generate all 3 shares locally) and `keygen-threshold` (distributed relay-based ceremony)

Architecture
Normal ops: party 0 + party 1 auto-sign per policy rules. Any 2-of-3 can sign for recovery.
Tested
Not included (future work)
single-key mode unchanged.
Summary by CodeRabbit
New Features
Documentation
Chores