FGP defense-in-depth hardening — follow-ups from the PR #13 audit

**Context**   

  During the May 2026 audit of PR #13 / PR #11 (issues #6, #8, #10, #12) and a follow-up systematic test-design pass, ten defense-in-depth gaps were identified in the FGP codebase. Each was confirmed with a test; several were probed live on bft-fgp-2sh or via integration tests through the real IPC path. The tests will land in a single PR linked to this issue (currently   negative-assertion form — they pass by documenting the present-but-undesired behavior, and flip to positive assertions as each fix lands).

  Most of these are narrower than they first looked — defense-in-depth at a downstream layer (kernel, BFT certificate/CBOR/signature validation, shard-config hash check, transport  self-healing) catches the worst case. Per finding below: initial view → final verdict after deeper testing.

**Hard rules for the fixes (please follow)**

  1. Propose and implement the EXACT minimal fix. Do not introduce new abstractions, retry loops, fallback paths, or key-management layers that would themselves need hardening. Each fix  below is scoped to the smallest change that closes the gap. If a fix seems to require new subsystems, stop and discuss — that's a signal the scope is wrong.
  2. Every fix must be covered by tests exercising BOTH paths — the positive path (valid input still works) AND the negative path (the bad input is now rejected/handled). The negative-assertion tests already in the PR become the negative-path tests once flipped; add the positive-path assertions alongside.
  3. No behavior change beyond the stated fix. If closing a gap changes a wire format or an error surface, call it out explicitly (only F20 and the F12-family touch wire/serialization; none  of the items below should).
  
**Findings (priority order)**

  **1. F14 — Private key bytes never zeroed in memory**

  **- Found during:** Task 2, #8 audit — "Related Issues" in the body.
**- System impact now:** the validator's signing identity is extractable from node memory. If extracted (core dump, swap, /proc/<pid>/mem, malware), an attacker can forge that validator's  signatures — sign arbitrary certification requests / impersonate the node in BFT. One key compromise = ability to act as that validator in consensus. Operator note: a routine gcore for  debugging writes the key to disk.
  **- System impact after fix:** the KeyConf-held copy is wiped at shutdown, removing it from post-shutdown dumps/snapshots. Partial — libp2p-held copies remain (documented), so the extraction  surface shrinks but isn't eliminated.
  **- Initial view:** HIGH — "private key material never zeroed; recoverable from memory."
  **- Final verdict:** HIGH and the only finding NOT narrowed by any downstream defense — the memory is the key. Live gcore on fgp-node-0 found the real keys (SIG, AUTH) at 2 distinct memory locations each → a single scrub site is insufficient (KeyConf raw bytes + libp2p/crypto-derived copies coexist).
  **- Exact fix (no new hardening surface):** add func (c *KeyConf) Zero() that overwrites SigKey.PrivateKey and AuthKey.PrivateKey in place; call it from the quitSignalContext() cancellation path in cmd/main.go AFTER the signer/host are done. Do not build a key-vault abstraction. Document explicitly that libp2p-held copies persist (can't be scrubbed without upstream support) —
   the fix is partial by nature; don't over-engineer to chase the libp2p copy.
  **- Expected result:** KeyConf-held raw bytes are zeroed on shutdown; comment states the known libp2p-copy limitation.
  **- Tests in the PR:** partition/f14_keyzero_gap_test.go (6 — incl. K3 unsafe-pointer readback). Live evidence: 2 copies each from gcore.
  **- Fix-test coverage required:** positive — signer/host still work before shutdown; negative — after Zero(), the KeyConf backing arrays read all-zero (flip K3's "still readable" assertion to   "zeroed at the controlled location").

  **2. F11 — BlockHeader not validated at the IPC boundary**

  **- Found during:** Task 1, #6 audit — "Related Issues" in the body (silently dropped by PR #11).
  **- System impact now:** a single follower's own compromised/buggy unicityd can make that node accept an orphan/invalid PoW block as certifiable state (or reject a valid one). Scoped to one node — 2f+1 + each node querying its own unicityd prevents unilateral consensus corruption — but that node contributes a bad vote and its local finality view diverges.
  **- System impact after fix:** the node anchors its consensus view to well-formed PoW headers; malformed data fails fast at the IPC boundary instead of flowing into the IsActive() gate and the certification path.
  **- Initial view:** MEDIUM — "block header data lacks validation."
  **- Final verdict:** MEDIUM, narrowed by BFT certification-request validation downstream — BUT the integration test confirmed a concrete consensus consequence: the Confirmations field (attacker-controlled, unvalidated) gates IsActive() in FollowerVerify (txsystem/fgp_txsystem.go:193). A Confirmations=1 orphan is accepted by a single follower. Cluster-level 2f+1 + per-node-own-unicityd still bounds the blast radius.
  **- Exact fix:** add func (b *BlockHeader) Validate() error checking: hash is 64 hex chars + valid hex; Confirmations within a sane bound; Height within an upper bound; (optional)  PreviousBlockHash hex/length. Call it immediately after JSON unmarshal in GetTip, GetBlockHeaderByHash, GetBlockHeaderByHeight. Reject on failure — do not transform/normalize values.
  **- Expected result:** malformed headers (bad hash length/hex, absurd height, nonsense Confirmations) are rejected at the IPC boundary with a clear error.
  **- Tests in the PR:** pow/ipc_client_test.go — Test_IPCClient_NoBlockHeaderValidation_DocumentsGap (6), Test_IPCClient_NonHexHash_AcceptedByClient, Test_IPCClient_BlockHeader_HashLengthBVA_DocumentsGap (7), Test_IPCClient_BlockHeader_UncheckedNumericFields_DocumentsGap (3); txsystem/fgp_txsystem_ipc_integration_test.go —  TestFGP11_Integration_ConfirmationsGatesIsActive (4).
  **- Fix-test coverage required:** positive — a well-formed 64-hex hash + sane fields passes; negative — each of the BVA/equivalence cases above now returns an error (flip require.NoError →   require.Error).
  
  **3. F17 — Validators() returns zero-value peer.ID("") entries on decode failure**
  
  **- Found during:** Task 3, #10 audit — "Related Issues" in the body.
**- System impact now:** the effective validator set silently shrinks (N → N-bad), reducing Byzantine fault tolerance (fewer honest nodes counting toward 2f+1) and dropping broadcasts to phantom peers. Enough malformed IDs at genesis/setup could prevent the partition from reaching quorum at all. Operator note: cluster reports Active while participation is silently reduced.
  **- System impact after fix:** the validator set contains only valid members; quorum math is correct; misconfiguration surfaces as a logged error instead of silent degradation.
  **- Initial view:** LOW-MEDIUM — "causes invalid downstream sends."
  **- Final verdict:** narrowed by the BFT shard-config hash check (e2e: a tampered nodeId got the node isolated before broadcasts mattered). Fires for real only on genesis-gen bugs or operator typos at setup. Loop testing confirmed the empty-set case is safe (returns empty slice, no panic).
  **- Exact fix:** change make(peer.IDSlice, len(validators)) + index-assignment to make(peer.IDSlice, 0, len(validators)) + append — skip failed decodes from the result. That's the whole  change.
  **- Expected result:** Validators() returns only successfully-decoded IDs; no peer.ID("") entries.
  **- Tests in the PR:** partition/issue10_audit_test.go — TestFGP10_RELATED_Validators_ZeroValueOnBadDecode_DocumentsGap, TestFGP17_Validators_LoopBoundaries_DocumentsGap (5).
  **- Fix-test coverage required:** positive — all-valid input returns N IDs; negative — mixed valid/invalid returns only the valid ones (0 zero-values), and all-invalid returns an empty slice (flip the zero-value-count assertions).

  **4. F3 — Pool-side stale-FD partition wedge after PoW restart**
  
  **- Found during:** Task 2, #8 SIGKILL e2e window; also pre-existing as Scenario E / SESSION_FGP_IPC_AND_BFT_SNAPSHOTS.md §7.
**- System impact now:** finality (FGP round progression) stalls ~60s after each PoW node restart while the connection pool self-heals — a temporary, self-recovering liveness degradation. 
  _Operator note: a burst of connection reset by peer warnings that looks like a hard outage._ 
  **- System impact after fix:** finality continuity across PoW restarts; the liveness gap and the warning burst go away.
  **- Initial view:** HIGH — "permanent wedge for 3+ rounds."
  **- Final verdict:** MEDIUM operator-experience. Live probe: the transport pool self-heals within ~1 T1 cycle (~60s); the originally-observed multi-round wedge was compound with the unicityd-startmining one-shot gotcha (separate; file in aggregator-subscription). Even a hostile abrupt-close unit test could not reproduce the wedge — it self-heals. Not unit-reproducible; live e2e is the only repro.
  **- Exact fix:** in the http.Transport config in pow/ipc_client.go, set IdleConnTimeout: 30s (or DisableKeepAlives: true if simpler) and call transport.CloseIdleConnections() on RPC error.
  **Optional:** leader rotation on repeated failure. Do not add a custom retry/backoff loop that could mask other failures — the goal is to retire dead connections faster, not to paper over errors.
  **- Expected result:** recovery after PoW restart drops from ~60s to ~immediate; no behavior change in the healthy path.
  **- Tests in the PR:** partition/t1_timer_rearm_regression_test.go (4); pow/ipc_client_test.go — Test_IPCClient_StalePoolAfterServerRestart_SelfHeals,
  Test_IPCClient_AbruptServerClose_ReproducesStaleFD_ThenRecovers. Live: F3-S1/S2 in E2E_FOLLOWUP_PROBES.md.
  **- Fix-test coverage required:** positive — healthy calls still pooled/reused; negative — after a server restart, the next call succeeds without the ~60s delay (add a timing assertion once   IdleConnTimeout/CloseIdleConnections is in).
  
  **5. F20 — JSON-RPC error field type confusion masks valid results**

  **- Found during:** 2026-05-20 technique pass (decision table over error×result + equivalence partitioning of the error field's JSON type).
  **- System impact now:** if unicityd ever returns a non-object error field (false/0/""), all of that node's PoW queries fail → it can't fetch headers/tip → that node's finality stalls until unicityd is corrected. Liveness impact, single node, triggered only by an upstream regression. Operator note: near-undiagnosable error ("rpc error: ").
  **- System impact after fix:** spurious error-field values no longer break PoW queries; node liveness is preserved and errors are diagnosable.
  **- Initial view:** (new finding — not in any issue body).
  **- Final verdict:** LOW-MEDIUM defense-in-depth. if rpcResp.Error != nil with Error interface{} treats false/0/"" as an RPC error and discards the valid result. Spec-compliant unicityd never triggers it, but a buggy one makes every call fail with cryptic messages ("rpc error: false", or "rpc error: " for empty string). Confirmed through FollowerVerify (integration test).
  **- Exact fix:** type the Error field as *jsonRpcError struct (Code int, Message string) instead of interface{}. Non-object JSON values then fail to decode or decode to nil, so only a real error object triggers the error path. Wire/serialization note: this changes how the response is parsed (not the request) — confirm it still decodes spec-compliant error objects. No request-side wire change.
  **- Expected result:** valid result with error: false/0/"" → result is returned, not masked; a real {"code":...,"message":...} → error as before.
  **- Tests in the PR:** pow/ipc_client_test.go — Test_IPCClient_ResponseEnvelope_DecisionTable (8); txsystem/fgp_txsystem_ipc_integration_test.go — TestFGP20_Integration_ErrorFieldMasksResult  (3).
  **- Fix-test coverage required:** positive — real error object still surfaces an error; valid result with non-object error field now returns the result; negative — flip the false/0/"" PROBE rows to require.NoError + correct result.
  
  **6. F16 — printUC panics on nil InputRecord**
  
  **- Found during:** Task 3, #10 audit — "Related Issues" in the body.
 **- System impact now:** a crafted/corrupt UC reaching a debug log statement panics the message-handler goroutine → node instability and dropped message processing for that handler. Higher
  exposure when debug logging is on.
  **- System impact after fix:** the UC logs safely; message processing continues; no panic at any log level.
  **- Initial view:** LOW — nil dereference in consensus paths.
  **- Final verdict:** LOW, narrowed by CBOR schema + BFT signature validation upstream (threat model = direct blocks.db tampering).
  **- Exact fix:** replace the four direct field accesses with the existing nil-safe accessors — uc.GetStateHash(), uc.GetPreviousStateHash(), uc.GetBlockHash(), uc.GetFeeSum(). No new guard
  needed; matches the codebase convention. (The GetRoundNumber()/GetRootRoundNumber() calls already there are already nil-safe.)
  **- Expected result:** printUC is fully nil-safe; a UC with nil InputRecord renders zero values instead of panicking.
  **- Tests in the PR:** partition/issue10_audit_test.go — TestFGP10_RELATED_PrintUC_NilInputRecord_PanicsCurrently_DocumentsGap.
  **- Fix-test coverage required:** positive — a complete UC prints the real values; negative — nil InputRecord returns a string without panic (flip require.Panics → require.NotPanics).

  **7. F18 — sendCertificationRequest panics on nil luc.UnicitySeal**

  **- Found during:** Task 3, #10 audit — "Related Issues" in the body.
**- System impact now:** a corrupt/partial stored latest-UC panics the leader during proposal construction → that node drops out of leadership, stalling round progress until recovery/rotation.
  **- System impact after fix:** the leader handles the malformed UC with an error instead of crashing; round continuity preserved.
  **- Initial view:** LOW — nil dereference.
  **- Final verdict:** LOW, same narrowing as F16. Refinement: sendCertificationRequest already uses the nil-safe luc.GetStateHash() right next to the unsafe luc.UnicitySeal.Timestamp — there's no GetTimestamp() accessor (see F21).
  **- Exact fix:** depends on F21 — once GetTimestamp() exists in bft-go-base, use luc.GetTimestamp(). Until then, a one-line guard at function start: if luc == nil || luc.UnicitySeal == nil { 
  return fmt.Errorf("nil UnicitySeal in latest UC") }. Prefer the accessor route (no local guard).
  **- Expected result:** no panic when UnicitySeal is nil.
  **- Tests in the PR:** partition/issue10_audit_test.go — TestFGP10_RELATED_NilUnicitySeal_PanicsOnTimestampAccess_DocumentsGap.
  **- Fix-test coverage required:** positive — valid luc produces a correct certification request; negative — nil UnicitySeal returns an error / uses the safe accessor without panic.

  **8. F13 — Logger file handle never Sync()'d / Close()'d on shutdown**

  **- Found during:** Task 2, #8 audit — "Related Issues" in the body.
  **- System impact now:** none to consensus or liveness — observability only. On a kernel panic or power loss (not normal crashes), the last log lines may never reach disk — a forensic gap
  exactly where post-mortems need data.
  **- System impact after fix:** explicit flush on graceful shutdown; log-loss window shrinks to true sudden power loss; better post-mortems.
  **- Initial view:** "risking data loss."
  **- Final verdict:** LOW — downgraded by the validity probe. Go's *os.File.Write is unbuffered; the kernel page cache survives process death (SIGTERM and SIGKILL). Only system-level crashes
  (kernel panic, power loss) lose un-Sync()'d data (probe: 500 lines all persisted without Sync()).
  **- Exact fix:** track the *os.File returned by filenameToWriter; add (*Logger).Close() (or expose the file) that calls Sync() then Close(); invoke from the shutdown handler. Do not add a
  buffered writer — that would introduce a flush dependency (new hardening surface), the opposite of the goal.
  **- Expected result:** explicit flush+close on shutdown; system-crash durability improved; no change to the steady-state write path.
  **- Tests in the PR:** logger/f13_sync_gap_test.go (4 — incl. the severity probe showing 500 lines persist).
  **- Fix-test coverage required:** positive — normal logging still writes every line; negative — after Close(), the file is synced and the handle released (assert Sync() succeeds, no fd leak).

  **9. F10 — Response-ID validation TODO cleanup (BLOCKED)**

  **- Found during:** Task 1, #6 audit.
 **- System impact now:** none currently — spec-compliant unicityd echoes the id correctly. Latent integrity gap: a misrouted/interleaved or socket-MITM-substituted RPC response would not be detected, so a node could act on consensus-relevant PoW data that doesn't correspond to its request.
  **- System impact after fix:** response-to-request integrity is enforced; mismatched responses rejected. No change in normal operation.
  **- Initial view:** stale TODO.   
  **- Final verdict:** TRIVIAL — the upstream cause was fixed in unicity-node@f7d9fdb ("Mirror request id in rpc response") on bft-snapshots. Blocked on unicity-node merging bft-snapshots → main
   (otherwise FGP against a main-built unicityd would fail every call).
  **- Exact fix:** uncomment the response-ID validation block at pow/ipc_client.go:109-112. No new logic.
  **- Expected result:** a response whose id ≠ the request id is rejected.
  **- Tests in the PR:** pow/ipc_client_test.go — Test_IPCClient_RequestID_ResponseMismatchAcceptedToday_DocumentsGap (+ existing …_StrictlyMonotonic / …_UniqueUnderConcurrency).
  **- Fix-test coverage required:** positive — matching id passes; negative — mismatched id rejected with "rpc response ID mismatch" (flip the gap test).

  **Test PR plan**

  - All the tests listed above land in one PR titled e.g. "FGP hardening: gap-documentation + regression tests (F3, F10, F11, F13, F14, F16, F17, F18, F20)", linked to this issue.
  - The tests are currently negative-assertion (they pass by reproducing the gap; named *_DocumentsGap / *_AcceptedToday). As each fix lands, flip its assertion and add the positive-path
  assertion — that's the per-finding "fix-test coverage required" above.

  **Priority order**

  1. F14 (HIGH, no upstream defense) — file/fix first; partial-fix limitations documented
  2. F11 (concrete consensus-gate consequence)
  3. F17 (small, genesis-gen-bug protection)
  4. F3 (operator-experience; transport config)
  5. F20 (parsing robustness)
  6. F16, F18 (one-liners; F18 waits on F21)
  7. F13 (low; no buffering)
  8. F10 (blocked on unicity-node main merge)

Current set of tests: https://github.com/unicitynetwork/finality-gadget/pull/18


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FGP defense-in-depth hardening — follow-ups from the PR #13 audit #17

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

FGP defense-in-depth hardening — follow-ups from the PR #13 audit #17

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions