Skip to content

FGP defense-in-depth hardening — follow-ups from the PR #13 audit #17

@b3y0urs3lf

Description

@b3y0urs3lf

Context

During the May 2026 audit of PR #13 / PR #11 (issues #6, #8, #10, #12) and a follow-up systematic test-design pass, ten defense-in-depth gaps were identified in the FGP codebase. Each was confirmed with a test; several were probed live on bft-fgp-2sh or via integration tests through the real IPC path. The tests will land in a single PR linked to this issue (currently negative-assertion form — they pass by documenting the present-but-undesired behavior, and flip to positive assertions as each fix lands).

Most of these are narrower than they first looked — defense-in-depth at a downstream layer (kernel, BFT certificate/CBOR/signature validation, shard-config hash check, transport self-healing) catches the worst case. Per finding below: initial view → final verdict after deeper testing.

Hard rules for the fixes (please follow)

  1. Propose and implement the EXACT minimal fix. Do not introduce new abstractions, retry loops, fallback paths, or key-management layers that would themselves need hardening. Each fix below is scoped to the smallest change that closes the gap. If a fix seems to require new subsystems, stop and discuss — that's a signal the scope is wrong.
  2. Every fix must be covered by tests exercising BOTH paths — the positive path (valid input still works) AND the negative path (the bad input is now rejected/handled). The negative-assertion tests already in the PR become the negative-path tests once flipped; add the positive-path assertions alongside.
  3. No behavior change beyond the stated fix. If closing a gap changes a wire format or an error surface, call it out explicitly (only F20 and the F12-family touch wire/serialization; none of the items below should).

Findings (priority order)

1. F14 — Private key bytes never zeroed in memory

- Found during: Task 2, #8 audit — "Related Issues" in the body.
- System impact now: the validator's signing identity is extractable from node memory. If extracted (core dump, swap, /proc//mem, malware), an attacker can forge that validator's signatures — sign arbitrary certification requests / impersonate the node in BFT. One key compromise = ability to act as that validator in consensus. Operator note: a routine gcore for debugging writes the key to disk.
- System impact after fix: the KeyConf-held copy is wiped at shutdown, removing it from post-shutdown dumps/snapshots. Partial — libp2p-held copies remain (documented), so the extraction surface shrinks but isn't eliminated.
- Initial view: HIGH — "private key material never zeroed; recoverable from memory."
- Final verdict: HIGH and the only finding NOT narrowed by any downstream defense — the memory is the key. Live gcore on fgp-node-0 found the real keys (SIG, AUTH) at 2 distinct memory locations each → a single scrub site is insufficient (KeyConf raw bytes + libp2p/crypto-derived copies coexist).
- Exact fix (no new hardening surface): add func (c *KeyConf) Zero() that overwrites SigKey.PrivateKey and AuthKey.PrivateKey in place; call it from the quitSignalContext() cancellation path in cmd/main.go AFTER the signer/host are done. Do not build a key-vault abstraction. Document explicitly that libp2p-held copies persist (can't be scrubbed without upstream support) —
the fix is partial by nature; don't over-engineer to chase the libp2p copy.
- Expected result: KeyConf-held raw bytes are zeroed on shutdown; comment states the known libp2p-copy limitation.
- Tests in the PR: partition/f14_keyzero_gap_test.go (6 — incl. K3 unsafe-pointer readback). Live evidence: 2 copies each from gcore.
- Fix-test coverage required: positive — signer/host still work before shutdown; negative — after Zero(), the KeyConf backing arrays read all-zero (flip K3's "still readable" assertion to "zeroed at the controlled location").

2. F11 — BlockHeader not validated at the IPC boundary

- Found during: Task 1, #6 audit — "Related Issues" in the body (silently dropped by PR #11).
- System impact now: a single follower's own compromised/buggy unicityd can make that node accept an orphan/invalid PoW block as certifiable state (or reject a valid one). Scoped to one node — 2f+1 + each node querying its own unicityd prevents unilateral consensus corruption — but that node contributes a bad vote and its local finality view diverges.
- System impact after fix: the node anchors its consensus view to well-formed PoW headers; malformed data fails fast at the IPC boundary instead of flowing into the IsActive() gate and the certification path.
- Initial view: MEDIUM — "block header data lacks validation."
- Final verdict: MEDIUM, narrowed by BFT certification-request validation downstream — BUT the integration test confirmed a concrete consensus consequence: the Confirmations field (attacker-controlled, unvalidated) gates IsActive() in FollowerVerify (txsystem/fgp_txsystem.go:193). A Confirmations=1 orphan is accepted by a single follower. Cluster-level 2f+1 + per-node-own-unicityd still bounds the blast radius.
- Exact fix: add func (b *BlockHeader) Validate() error checking: hash is 64 hex chars + valid hex; Confirmations within a sane bound; Height within an upper bound; (optional) PreviousBlockHash hex/length. Call it immediately after JSON unmarshal in GetTip, GetBlockHeaderByHash, GetBlockHeaderByHeight. Reject on failure — do not transform/normalize values.
- Expected result: malformed headers (bad hash length/hex, absurd height, nonsense Confirmations) are rejected at the IPC boundary with a clear error.
- Tests in the PR: pow/ipc_client_test.go — Test_IPCClient_NoBlockHeaderValidation_DocumentsGap (6), Test_IPCClient_NonHexHash_AcceptedByClient, Test_IPCClient_BlockHeader_HashLengthBVA_DocumentsGap (7), Test_IPCClient_BlockHeader_UncheckedNumericFields_DocumentsGap (3); txsystem/fgp_txsystem_ipc_integration_test.go — TestFGP11_Integration_ConfirmationsGatesIsActive (4).
- Fix-test coverage required: positive — a well-formed 64-hex hash + sane fields passes; negative — each of the BVA/equivalence cases above now returns an error (flip require.NoError → require.Error).

3. F17 — Validators() returns zero-value peer.ID("") entries on decode failure

- Found during: Task 3, #10 audit — "Related Issues" in the body.
- System impact now: the effective validator set silently shrinks (N → N-bad), reducing Byzantine fault tolerance (fewer honest nodes counting toward 2f+1) and dropping broadcasts to phantom peers. Enough malformed IDs at genesis/setup could prevent the partition from reaching quorum at all. Operator note: cluster reports Active while participation is silently reduced.
- System impact after fix: the validator set contains only valid members; quorum math is correct; misconfiguration surfaces as a logged error instead of silent degradation.
- Initial view: LOW-MEDIUM — "causes invalid downstream sends."
- Final verdict: narrowed by the BFT shard-config hash check (e2e: a tampered nodeId got the node isolated before broadcasts mattered). Fires for real only on genesis-gen bugs or operator typos at setup. Loop testing confirmed the empty-set case is safe (returns empty slice, no panic).
- Exact fix: change make(peer.IDSlice, len(validators)) + index-assignment to make(peer.IDSlice, 0, len(validators)) + append — skip failed decodes from the result. That's the whole change.
- Expected result: Validators() returns only successfully-decoded IDs; no peer.ID("") entries.
- Tests in the PR: partition/issue10_audit_test.go — TestFGP10_RELATED_Validators_ZeroValueOnBadDecode_DocumentsGap, TestFGP17_Validators_LoopBoundaries_DocumentsGap (5).
- Fix-test coverage required: positive — all-valid input returns N IDs; negative — mixed valid/invalid returns only the valid ones (0 zero-values), and all-invalid returns an empty slice (flip the zero-value-count assertions).

4. F3 — Pool-side stale-FD partition wedge after PoW restart

- Found during: Task 2, #8 SIGKILL e2e window; also pre-existing as Scenario E / SESSION_FGP_IPC_AND_BFT_SNAPSHOTS.md §7.
- System impact now: finality (FGP round progression) stalls ~60s after each PoW node restart while the connection pool self-heals — a temporary, self-recovering liveness degradation.
Operator note: a burst of connection reset by peer warnings that looks like a hard outage.
- System impact after fix: finality continuity across PoW restarts; the liveness gap and the warning burst go away.
- Initial view: HIGH — "permanent wedge for 3+ rounds."
- Final verdict: MEDIUM operator-experience. Live probe: the transport pool self-heals within ~1 T1 cycle (~60s); the originally-observed multi-round wedge was compound with the unicityd-startmining one-shot gotcha (separate; file in aggregator-subscription). Even a hostile abrupt-close unit test could not reproduce the wedge — it self-heals. Not unit-reproducible; live e2e is the only repro.
- Exact fix: in the http.Transport config in pow/ipc_client.go, set IdleConnTimeout: 30s (or DisableKeepAlives: true if simpler) and call transport.CloseIdleConnections() on RPC error.
Optional: leader rotation on repeated failure. Do not add a custom retry/backoff loop that could mask other failures — the goal is to retire dead connections faster, not to paper over errors.
- Expected result: recovery after PoW restart drops from ~60s to ~immediate; no behavior change in the healthy path.
- Tests in the PR: partition/t1_timer_rearm_regression_test.go (4); pow/ipc_client_test.go — Test_IPCClient_StalePoolAfterServerRestart_SelfHeals,
Test_IPCClient_AbruptServerClose_ReproducesStaleFD_ThenRecovers. Live: F3-S1/S2 in E2E_FOLLOWUP_PROBES.md.
- Fix-test coverage required: positive — healthy calls still pooled/reused; negative — after a server restart, the next call succeeds without the ~60s delay (add a timing assertion once IdleConnTimeout/CloseIdleConnections is in).

5. F20 — JSON-RPC error field type confusion masks valid results

- Found during: 2026-05-20 technique pass (decision table over error×result + equivalence partitioning of the error field's JSON type).
- System impact now: if unicityd ever returns a non-object error field (false/0/""), all of that node's PoW queries fail → it can't fetch headers/tip → that node's finality stalls until unicityd is corrected. Liveness impact, single node, triggered only by an upstream regression. Operator note: near-undiagnosable error ("rpc error: ").
- System impact after fix: spurious error-field values no longer break PoW queries; node liveness is preserved and errors are diagnosable.
- Initial view: (new finding — not in any issue body).
- Final verdict: LOW-MEDIUM defense-in-depth. if rpcResp.Error != nil with Error interface{} treats false/0/"" as an RPC error and discards the valid result. Spec-compliant unicityd never triggers it, but a buggy one makes every call fail with cryptic messages ("rpc error: false", or "rpc error: " for empty string). Confirmed through FollowerVerify (integration test).
- Exact fix: type the Error field as *jsonRpcError struct (Code int, Message string) instead of interface{}. Non-object JSON values then fail to decode or decode to nil, so only a real error object triggers the error path. Wire/serialization note: this changes how the response is parsed (not the request) — confirm it still decodes spec-compliant error objects. No request-side wire change.
- Expected result: valid result with error: false/0/"" → result is returned, not masked; a real {"code":...,"message":...} → error as before.
- Tests in the PR: pow/ipc_client_test.go — Test_IPCClient_ResponseEnvelope_DecisionTable (8); txsystem/fgp_txsystem_ipc_integration_test.go — TestFGP20_Integration_ErrorFieldMasksResult (3).
- Fix-test coverage required: positive — real error object still surfaces an error; valid result with non-object error field now returns the result; negative — flip the false/0/"" PROBE rows to require.NoError + correct result.

6. F16 — printUC panics on nil InputRecord

- Found during: Task 3, #10 audit — "Related Issues" in the body.
- System impact now: a crafted/corrupt UC reaching a debug log statement panics the message-handler goroutine → node instability and dropped message processing for that handler. Higher
exposure when debug logging is on.
- System impact after fix: the UC logs safely; message processing continues; no panic at any log level.
- Initial view: LOW — nil dereference in consensus paths.
- Final verdict: LOW, narrowed by CBOR schema + BFT signature validation upstream (threat model = direct blocks.db tampering).
- Exact fix: replace the four direct field accesses with the existing nil-safe accessors — uc.GetStateHash(), uc.GetPreviousStateHash(), uc.GetBlockHash(), uc.GetFeeSum(). No new guard
needed; matches the codebase convention. (The GetRoundNumber()/GetRootRoundNumber() calls already there are already nil-safe.)
- Expected result: printUC is fully nil-safe; a UC with nil InputRecord renders zero values instead of panicking.
- Tests in the PR: partition/issue10_audit_test.go — TestFGP10_RELATED_PrintUC_NilInputRecord_PanicsCurrently_DocumentsGap.
- Fix-test coverage required: positive — a complete UC prints the real values; negative — nil InputRecord returns a string without panic (flip require.Panics → require.NotPanics).

7. F18 — sendCertificationRequest panics on nil luc.UnicitySeal

- Found during: Task 3, #10 audit — "Related Issues" in the body.
- System impact now: a corrupt/partial stored latest-UC panics the leader during proposal construction → that node drops out of leadership, stalling round progress until recovery/rotation.
- System impact after fix: the leader handles the malformed UC with an error instead of crashing; round continuity preserved.
- Initial view: LOW — nil dereference.
- Final verdict: LOW, same narrowing as F16. Refinement: sendCertificationRequest already uses the nil-safe luc.GetStateHash() right next to the unsafe luc.UnicitySeal.Timestamp — there's no GetTimestamp() accessor (see F21).
- Exact fix: depends on F21 — once GetTimestamp() exists in bft-go-base, use luc.GetTimestamp(). Until then, a one-line guard at function start: if luc == nil || luc.UnicitySeal == nil {
return fmt.Errorf("nil UnicitySeal in latest UC") }. Prefer the accessor route (no local guard).
- Expected result: no panic when UnicitySeal is nil.
- Tests in the PR: partition/issue10_audit_test.go — TestFGP10_RELATED_NilUnicitySeal_PanicsOnTimestampAccess_DocumentsGap.
- Fix-test coverage required: positive — valid luc produces a correct certification request; negative — nil UnicitySeal returns an error / uses the safe accessor without panic.

8. F13 — Logger file handle never Sync()'d / Close()'d on shutdown

- Found during: Task 2, #8 audit — "Related Issues" in the body.
- System impact now: none to consensus or liveness — observability only. On a kernel panic or power loss (not normal crashes), the last log lines may never reach disk — a forensic gap
exactly where post-mortems need data.
- System impact after fix: explicit flush on graceful shutdown; log-loss window shrinks to true sudden power loss; better post-mortems.
- Initial view: "risking data loss."
- Final verdict: LOW — downgraded by the validity probe. Go's *os.File.Write is unbuffered; the kernel page cache survives process death (SIGTERM and SIGKILL). Only system-level crashes
(kernel panic, power loss) lose un-Sync()'d data (probe: 500 lines all persisted without Sync()).
- Exact fix: track the *os.File returned by filenameToWriter; add (*Logger).Close() (or expose the file) that calls Sync() then Close(); invoke from the shutdown handler. Do not add a
buffered writer — that would introduce a flush dependency (new hardening surface), the opposite of the goal.
- Expected result: explicit flush+close on shutdown; system-crash durability improved; no change to the steady-state write path.
- Tests in the PR: logger/f13_sync_gap_test.go (4 — incl. the severity probe showing 500 lines persist).
- Fix-test coverage required: positive — normal logging still writes every line; negative — after Close(), the file is synced and the handle released (assert Sync() succeeds, no fd leak).

9. F10 — Response-ID validation TODO cleanup (BLOCKED)

- Found during: Task 1, #6 audit.
- System impact now: none currently — spec-compliant unicityd echoes the id correctly. Latent integrity gap: a misrouted/interleaved or socket-MITM-substituted RPC response would not be detected, so a node could act on consensus-relevant PoW data that doesn't correspond to its request.
- System impact after fix: response-to-request integrity is enforced; mismatched responses rejected. No change in normal operation.
- Initial view: stale TODO.
- Final verdict: TRIVIAL — the upstream cause was fixed in unicity-node@f7d9fdb ("Mirror request id in rpc response") on bft-snapshots. Blocked on unicity-node merging bft-snapshots → main
(otherwise FGP against a main-built unicityd would fail every call).
- Exact fix: uncomment the response-ID validation block at pow/ipc_client.go:109-112. No new logic.
- Expected result: a response whose id ≠ the request id is rejected.
- Tests in the PR: pow/ipc_client_test.go — Test_IPCClient_RequestID_ResponseMismatchAcceptedToday_DocumentsGap (+ existing …_StrictlyMonotonic / …_UniqueUnderConcurrency).
- Fix-test coverage required: positive — matching id passes; negative — mismatched id rejected with "rpc response ID mismatch" (flip the gap test).

Test PR plan

  • All the tests listed above land in one PR titled e.g. "FGP hardening: gap-documentation + regression tests (F3, F10, F11, F13, F14, F16, F17, F18, F20)", linked to this issue.
  • The tests are currently negative-assertion (they pass by reproducing the gap; named *_DocumentsGap / *_AcceptedToday). As each fix lands, flip its assertion and add the positive-path
    assertion — that's the per-finding "fix-test coverage required" above.

Priority order

  1. F14 (HIGH, no upstream defense) — file/fix first; partial-fix limitations documented
  2. F11 (concrete consensus-gate consequence)
  3. F17 (small, genesis-gen-bug protection)
  4. F3 (operator-experience; transport config)
  5. F20 (parsing robustness)
  6. F16, F18 (one-liners; F18 waits on F21)
  7. F13 (low; no buffering)
  8. F10 (blocked on unicity-node main merge)

Current set of tests: #18

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions