Context
During the May 2026 audit of PR #13 / PR #11 (issues #6, #8, #10, #12) and a follow-up systematic test-design pass, ten defense-in-depth gaps were identified in the FGP codebase. Each was confirmed with a test; several were probed live on bft-fgp-2sh or via integration tests through the real IPC path. The tests will land in a single PR linked to this issue (currently negative-assertion form — they pass by documenting the present-but-undesired behavior, and flip to positive assertions as each fix lands).
Most of these are narrower than they first looked — defense-in-depth at a downstream layer (kernel, BFT certificate/CBOR/signature validation, shard-config hash check, transport self-healing) catches the worst case. Per finding below: initial view → final verdict after deeper testing.
Hard rules for the fixes (please follow)
- Propose and implement the EXACT minimal fix. Do not introduce new abstractions, retry loops, fallback paths, or key-management layers that would themselves need hardening. Each fix below is scoped to the smallest change that closes the gap. If a fix seems to require new subsystems, stop and discuss — that's a signal the scope is wrong.
- Every fix must be covered by tests exercising BOTH paths — the positive path (valid input still works) AND the negative path (the bad input is now rejected/handled). The negative-assertion tests already in the PR become the negative-path tests once flipped; add the positive-path assertions alongside.
- No behavior change beyond the stated fix. If closing a gap changes a wire format or an error surface, call it out explicitly (only F20 and the F12-family touch wire/serialization; none of the items below should).
Findings (priority order)
1. F14 — Private key bytes never zeroed in memory
- Found during: Task 2, #8 audit — "Related Issues" in the body.
- System impact now: the validator's signing identity is extractable from node memory. If extracted (core dump, swap, /proc//mem, malware), an attacker can forge that validator's signatures — sign arbitrary certification requests / impersonate the node in BFT. One key compromise = ability to act as that validator in consensus. Operator note: a routine gcore for debugging writes the key to disk.
- System impact after fix: the KeyConf-held copy is wiped at shutdown, removing it from post-shutdown dumps/snapshots. Partial — libp2p-held copies remain (documented), so the extraction surface shrinks but isn't eliminated.
- Initial view: HIGH — "private key material never zeroed; recoverable from memory."
- Final verdict: HIGH and the only finding NOT narrowed by any downstream defense — the memory is the key. Live gcore on fgp-node-0 found the real keys (SIG, AUTH) at 2 distinct memory locations each → a single scrub site is insufficient (KeyConf raw bytes + libp2p/crypto-derived copies coexist).
- Exact fix (no new hardening surface): add func (c *KeyConf) Zero() that overwrites SigKey.PrivateKey and AuthKey.PrivateKey in place; call it from the quitSignalContext() cancellation path in cmd/main.go AFTER the signer/host are done. Do not build a key-vault abstraction. Document explicitly that libp2p-held copies persist (can't be scrubbed without upstream support) —
the fix is partial by nature; don't over-engineer to chase the libp2p copy.
- Expected result: KeyConf-held raw bytes are zeroed on shutdown; comment states the known libp2p-copy limitation.
- Tests in the PR: partition/f14_keyzero_gap_test.go (6 — incl. K3 unsafe-pointer readback). Live evidence: 2 copies each from gcore.
- Fix-test coverage required: positive — signer/host still work before shutdown; negative — after Zero(), the KeyConf backing arrays read all-zero (flip K3's "still readable" assertion to "zeroed at the controlled location").
2. F11 — BlockHeader not validated at the IPC boundary
- Found during: Task 1, #6 audit — "Related Issues" in the body (silently dropped by PR #11).
- System impact now: a single follower's own compromised/buggy unicityd can make that node accept an orphan/invalid PoW block as certifiable state (or reject a valid one). Scoped to one node — 2f+1 + each node querying its own unicityd prevents unilateral consensus corruption — but that node contributes a bad vote and its local finality view diverges.
- System impact after fix: the node anchors its consensus view to well-formed PoW headers; malformed data fails fast at the IPC boundary instead of flowing into the IsActive() gate and the certification path.
- Initial view: MEDIUM — "block header data lacks validation."
- Final verdict: MEDIUM, narrowed by BFT certification-request validation downstream — BUT the integration test confirmed a concrete consensus consequence: the Confirmations field (attacker-controlled, unvalidated) gates IsActive() in FollowerVerify (txsystem/fgp_txsystem.go:193). A Confirmations=1 orphan is accepted by a single follower. Cluster-level 2f+1 + per-node-own-unicityd still bounds the blast radius.
- Exact fix: add func (b *BlockHeader) Validate() error checking: hash is 64 hex chars + valid hex; Confirmations within a sane bound; Height within an upper bound; (optional) PreviousBlockHash hex/length. Call it immediately after JSON unmarshal in GetTip, GetBlockHeaderByHash, GetBlockHeaderByHeight. Reject on failure — do not transform/normalize values.
- Expected result: malformed headers (bad hash length/hex, absurd height, nonsense Confirmations) are rejected at the IPC boundary with a clear error.
- Tests in the PR: pow/ipc_client_test.go — Test_IPCClient_NoBlockHeaderValidation_DocumentsGap (6), Test_IPCClient_NonHexHash_AcceptedByClient, Test_IPCClient_BlockHeader_HashLengthBVA_DocumentsGap (7), Test_IPCClient_BlockHeader_UncheckedNumericFields_DocumentsGap (3); txsystem/fgp_txsystem_ipc_integration_test.go — TestFGP11_Integration_ConfirmationsGatesIsActive (4).
- Fix-test coverage required: positive — a well-formed 64-hex hash + sane fields passes; negative — each of the BVA/equivalence cases above now returns an error (flip require.NoError → require.Error).
3. F17 — Validators() returns zero-value peer.ID("") entries on decode failure
- Found during: Task 3, #10 audit — "Related Issues" in the body.
- System impact now: the effective validator set silently shrinks (N → N-bad), reducing Byzantine fault tolerance (fewer honest nodes counting toward 2f+1) and dropping broadcasts to phantom peers. Enough malformed IDs at genesis/setup could prevent the partition from reaching quorum at all. Operator note: cluster reports Active while participation is silently reduced.
- System impact after fix: the validator set contains only valid members; quorum math is correct; misconfiguration surfaces as a logged error instead of silent degradation.
- Initial view: LOW-MEDIUM — "causes invalid downstream sends."
- Final verdict: narrowed by the BFT shard-config hash check (e2e: a tampered nodeId got the node isolated before broadcasts mattered). Fires for real only on genesis-gen bugs or operator typos at setup. Loop testing confirmed the empty-set case is safe (returns empty slice, no panic).
- Exact fix: change make(peer.IDSlice, len(validators)) + index-assignment to make(peer.IDSlice, 0, len(validators)) + append — skip failed decodes from the result. That's the whole change.
- Expected result: Validators() returns only successfully-decoded IDs; no peer.ID("") entries.
- Tests in the PR: partition/issue10_audit_test.go — TestFGP10_RELATED_Validators_ZeroValueOnBadDecode_DocumentsGap, TestFGP17_Validators_LoopBoundaries_DocumentsGap (5).
- Fix-test coverage required: positive — all-valid input returns N IDs; negative — mixed valid/invalid returns only the valid ones (0 zero-values), and all-invalid returns an empty slice (flip the zero-value-count assertions).
4. F3 — Pool-side stale-FD partition wedge after PoW restart
- Found during: Task 2, #8 SIGKILL e2e window; also pre-existing as Scenario E / SESSION_FGP_IPC_AND_BFT_SNAPSHOTS.md §7.
- System impact now: finality (FGP round progression) stalls ~60s after each PoW node restart while the connection pool self-heals — a temporary, self-recovering liveness degradation.
Operator note: a burst of connection reset by peer warnings that looks like a hard outage.
- System impact after fix: finality continuity across PoW restarts; the liveness gap and the warning burst go away.
- Initial view: HIGH — "permanent wedge for 3+ rounds."
- Final verdict: MEDIUM operator-experience. Live probe: the transport pool self-heals within ~1 T1 cycle (~60s); the originally-observed multi-round wedge was compound with the unicityd-startmining one-shot gotcha (separate; file in aggregator-subscription). Even a hostile abrupt-close unit test could not reproduce the wedge — it self-heals. Not unit-reproducible; live e2e is the only repro.
- Exact fix: in the http.Transport config in pow/ipc_client.go, set IdleConnTimeout: 30s (or DisableKeepAlives: true if simpler) and call transport.CloseIdleConnections() on RPC error.
Optional: leader rotation on repeated failure. Do not add a custom retry/backoff loop that could mask other failures — the goal is to retire dead connections faster, not to paper over errors.
- Expected result: recovery after PoW restart drops from ~60s to ~immediate; no behavior change in the healthy path.
- Tests in the PR: partition/t1_timer_rearm_regression_test.go (4); pow/ipc_client_test.go — Test_IPCClient_StalePoolAfterServerRestart_SelfHeals,
Test_IPCClient_AbruptServerClose_ReproducesStaleFD_ThenRecovers. Live: F3-S1/S2 in E2E_FOLLOWUP_PROBES.md.
- Fix-test coverage required: positive — healthy calls still pooled/reused; negative — after a server restart, the next call succeeds without the ~60s delay (add a timing assertion once IdleConnTimeout/CloseIdleConnections is in).
5. F20 — JSON-RPC error field type confusion masks valid results
- Found during: 2026-05-20 technique pass (decision table over error×result + equivalence partitioning of the error field's JSON type).
- System impact now: if unicityd ever returns a non-object error field (false/0/""), all of that node's PoW queries fail → it can't fetch headers/tip → that node's finality stalls until unicityd is corrected. Liveness impact, single node, triggered only by an upstream regression. Operator note: near-undiagnosable error ("rpc error: ").
- System impact after fix: spurious error-field values no longer break PoW queries; node liveness is preserved and errors are diagnosable.
- Initial view: (new finding — not in any issue body).
- Final verdict: LOW-MEDIUM defense-in-depth. if rpcResp.Error != nil with Error interface{} treats false/0/"" as an RPC error and discards the valid result. Spec-compliant unicityd never triggers it, but a buggy one makes every call fail with cryptic messages ("rpc error: false", or "rpc error: " for empty string). Confirmed through FollowerVerify (integration test).
- Exact fix: type the Error field as *jsonRpcError struct (Code int, Message string) instead of interface{}. Non-object JSON values then fail to decode or decode to nil, so only a real error object triggers the error path. Wire/serialization note: this changes how the response is parsed (not the request) — confirm it still decodes spec-compliant error objects. No request-side wire change.
- Expected result: valid result with error: false/0/"" → result is returned, not masked; a real {"code":...,"message":...} → error as before.
- Tests in the PR: pow/ipc_client_test.go — Test_IPCClient_ResponseEnvelope_DecisionTable (8); txsystem/fgp_txsystem_ipc_integration_test.go — TestFGP20_Integration_ErrorFieldMasksResult (3).
- Fix-test coverage required: positive — real error object still surfaces an error; valid result with non-object error field now returns the result; negative — flip the false/0/"" PROBE rows to require.NoError + correct result.
6. F16 — printUC panics on nil InputRecord
- Found during: Task 3, #10 audit — "Related Issues" in the body.
- System impact now: a crafted/corrupt UC reaching a debug log statement panics the message-handler goroutine → node instability and dropped message processing for that handler. Higher
exposure when debug logging is on.
- System impact after fix: the UC logs safely; message processing continues; no panic at any log level.
- Initial view: LOW — nil dereference in consensus paths.
- Final verdict: LOW, narrowed by CBOR schema + BFT signature validation upstream (threat model = direct blocks.db tampering).
- Exact fix: replace the four direct field accesses with the existing nil-safe accessors — uc.GetStateHash(), uc.GetPreviousStateHash(), uc.GetBlockHash(), uc.GetFeeSum(). No new guard
needed; matches the codebase convention. (The GetRoundNumber()/GetRootRoundNumber() calls already there are already nil-safe.)
- Expected result: printUC is fully nil-safe; a UC with nil InputRecord renders zero values instead of panicking.
- Tests in the PR: partition/issue10_audit_test.go — TestFGP10_RELATED_PrintUC_NilInputRecord_PanicsCurrently_DocumentsGap.
- Fix-test coverage required: positive — a complete UC prints the real values; negative — nil InputRecord returns a string without panic (flip require.Panics → require.NotPanics).
7. F18 — sendCertificationRequest panics on nil luc.UnicitySeal
- Found during: Task 3, #10 audit — "Related Issues" in the body.
- System impact now: a corrupt/partial stored latest-UC panics the leader during proposal construction → that node drops out of leadership, stalling round progress until recovery/rotation.
- System impact after fix: the leader handles the malformed UC with an error instead of crashing; round continuity preserved.
- Initial view: LOW — nil dereference.
- Final verdict: LOW, same narrowing as F16. Refinement: sendCertificationRequest already uses the nil-safe luc.GetStateHash() right next to the unsafe luc.UnicitySeal.Timestamp — there's no GetTimestamp() accessor (see F21).
- Exact fix: depends on F21 — once GetTimestamp() exists in bft-go-base, use luc.GetTimestamp(). Until then, a one-line guard at function start: if luc == nil || luc.UnicitySeal == nil {
return fmt.Errorf("nil UnicitySeal in latest UC") }. Prefer the accessor route (no local guard).
- Expected result: no panic when UnicitySeal is nil.
- Tests in the PR: partition/issue10_audit_test.go — TestFGP10_RELATED_NilUnicitySeal_PanicsOnTimestampAccess_DocumentsGap.
- Fix-test coverage required: positive — valid luc produces a correct certification request; negative — nil UnicitySeal returns an error / uses the safe accessor without panic.
8. F13 — Logger file handle never Sync()'d / Close()'d on shutdown
- Found during: Task 2, #8 audit — "Related Issues" in the body.
- System impact now: none to consensus or liveness — observability only. On a kernel panic or power loss (not normal crashes), the last log lines may never reach disk — a forensic gap
exactly where post-mortems need data.
- System impact after fix: explicit flush on graceful shutdown; log-loss window shrinks to true sudden power loss; better post-mortems.
- Initial view: "risking data loss."
- Final verdict: LOW — downgraded by the validity probe. Go's *os.File.Write is unbuffered; the kernel page cache survives process death (SIGTERM and SIGKILL). Only system-level crashes
(kernel panic, power loss) lose un-Sync()'d data (probe: 500 lines all persisted without Sync()).
- Exact fix: track the *os.File returned by filenameToWriter; add (*Logger).Close() (or expose the file) that calls Sync() then Close(); invoke from the shutdown handler. Do not add a
buffered writer — that would introduce a flush dependency (new hardening surface), the opposite of the goal.
- Expected result: explicit flush+close on shutdown; system-crash durability improved; no change to the steady-state write path.
- Tests in the PR: logger/f13_sync_gap_test.go (4 — incl. the severity probe showing 500 lines persist).
- Fix-test coverage required: positive — normal logging still writes every line; negative — after Close(), the file is synced and the handle released (assert Sync() succeeds, no fd leak).
9. F10 — Response-ID validation TODO cleanup (BLOCKED)
- Found during: Task 1, #6 audit.
- System impact now: none currently — spec-compliant unicityd echoes the id correctly. Latent integrity gap: a misrouted/interleaved or socket-MITM-substituted RPC response would not be detected, so a node could act on consensus-relevant PoW data that doesn't correspond to its request.
- System impact after fix: response-to-request integrity is enforced; mismatched responses rejected. No change in normal operation.
- Initial view: stale TODO.
- Final verdict: TRIVIAL — the upstream cause was fixed in unicity-node@f7d9fdb ("Mirror request id in rpc response") on bft-snapshots. Blocked on unicity-node merging bft-snapshots → main
(otherwise FGP against a main-built unicityd would fail every call).
- Exact fix: uncomment the response-ID validation block at pow/ipc_client.go:109-112. No new logic.
- Expected result: a response whose id ≠ the request id is rejected.
- Tests in the PR: pow/ipc_client_test.go — Test_IPCClient_RequestID_ResponseMismatchAcceptedToday_DocumentsGap (+ existing …_StrictlyMonotonic / …_UniqueUnderConcurrency).
- Fix-test coverage required: positive — matching id passes; negative — mismatched id rejected with "rpc response ID mismatch" (flip the gap test).
Test PR plan
- All the tests listed above land in one PR titled e.g. "FGP hardening: gap-documentation + regression tests (F3, F10, F11, F13, F14, F16, F17, F18, F20)", linked to this issue.
- The tests are currently negative-assertion (they pass by reproducing the gap; named *_DocumentsGap / *_AcceptedToday). As each fix lands, flip its assertion and add the positive-path
assertion — that's the per-finding "fix-test coverage required" above.
Priority order
- F14 (HIGH, no upstream defense) — file/fix first; partial-fix limitations documented
- F11 (concrete consensus-gate consequence)
- F17 (small, genesis-gen-bug protection)
- F3 (operator-experience; transport config)
- F20 (parsing robustness)
- F16, F18 (one-liners; F18 waits on F21)
- F13 (low; no buffering)
- F10 (blocked on unicity-node main merge)
Current set of tests: #18
Context
During the May 2026 audit of PR #13 / PR #11 (issues #6, #8, #10, #12) and a follow-up systematic test-design pass, ten defense-in-depth gaps were identified in the FGP codebase. Each was confirmed with a test; several were probed live on bft-fgp-2sh or via integration tests through the real IPC path. The tests will land in a single PR linked to this issue (currently negative-assertion form — they pass by documenting the present-but-undesired behavior, and flip to positive assertions as each fix lands).
Most of these are narrower than they first looked — defense-in-depth at a downstream layer (kernel, BFT certificate/CBOR/signature validation, shard-config hash check, transport self-healing) catches the worst case. Per finding below: initial view → final verdict after deeper testing.
Hard rules for the fixes (please follow)
Findings (priority order)
1. F14 — Private key bytes never zeroed in memory
- Found during: Task 2, #8 audit — "Related Issues" in the body.
- System impact now: the validator's signing identity is extractable from node memory. If extracted (core dump, swap, /proc//mem, malware), an attacker can forge that validator's signatures — sign arbitrary certification requests / impersonate the node in BFT. One key compromise = ability to act as that validator in consensus. Operator note: a routine gcore for debugging writes the key to disk.
- System impact after fix: the KeyConf-held copy is wiped at shutdown, removing it from post-shutdown dumps/snapshots. Partial — libp2p-held copies remain (documented), so the extraction surface shrinks but isn't eliminated.
- Initial view: HIGH — "private key material never zeroed; recoverable from memory."
- Final verdict: HIGH and the only finding NOT narrowed by any downstream defense — the memory is the key. Live gcore on fgp-node-0 found the real keys (SIG, AUTH) at 2 distinct memory locations each → a single scrub site is insufficient (KeyConf raw bytes + libp2p/crypto-derived copies coexist).
- Exact fix (no new hardening surface): add func (c *KeyConf) Zero() that overwrites SigKey.PrivateKey and AuthKey.PrivateKey in place; call it from the quitSignalContext() cancellation path in cmd/main.go AFTER the signer/host are done. Do not build a key-vault abstraction. Document explicitly that libp2p-held copies persist (can't be scrubbed without upstream support) —
the fix is partial by nature; don't over-engineer to chase the libp2p copy.
- Expected result: KeyConf-held raw bytes are zeroed on shutdown; comment states the known libp2p-copy limitation.
- Tests in the PR: partition/f14_keyzero_gap_test.go (6 — incl. K3 unsafe-pointer readback). Live evidence: 2 copies each from gcore.
- Fix-test coverage required: positive — signer/host still work before shutdown; negative — after Zero(), the KeyConf backing arrays read all-zero (flip K3's "still readable" assertion to "zeroed at the controlled location").
2. F11 — BlockHeader not validated at the IPC boundary
- Found during: Task 1, #6 audit — "Related Issues" in the body (silently dropped by PR #11).
- System impact now: a single follower's own compromised/buggy unicityd can make that node accept an orphan/invalid PoW block as certifiable state (or reject a valid one). Scoped to one node — 2f+1 + each node querying its own unicityd prevents unilateral consensus corruption — but that node contributes a bad vote and its local finality view diverges.
- System impact after fix: the node anchors its consensus view to well-formed PoW headers; malformed data fails fast at the IPC boundary instead of flowing into the IsActive() gate and the certification path.
- Initial view: MEDIUM — "block header data lacks validation."
- Final verdict: MEDIUM, narrowed by BFT certification-request validation downstream — BUT the integration test confirmed a concrete consensus consequence: the Confirmations field (attacker-controlled, unvalidated) gates IsActive() in FollowerVerify (txsystem/fgp_txsystem.go:193). A Confirmations=1 orphan is accepted by a single follower. Cluster-level 2f+1 + per-node-own-unicityd still bounds the blast radius.
- Exact fix: add func (b *BlockHeader) Validate() error checking: hash is 64 hex chars + valid hex; Confirmations within a sane bound; Height within an upper bound; (optional) PreviousBlockHash hex/length. Call it immediately after JSON unmarshal in GetTip, GetBlockHeaderByHash, GetBlockHeaderByHeight. Reject on failure — do not transform/normalize values.
- Expected result: malformed headers (bad hash length/hex, absurd height, nonsense Confirmations) are rejected at the IPC boundary with a clear error.
- Tests in the PR: pow/ipc_client_test.go — Test_IPCClient_NoBlockHeaderValidation_DocumentsGap (6), Test_IPCClient_NonHexHash_AcceptedByClient, Test_IPCClient_BlockHeader_HashLengthBVA_DocumentsGap (7), Test_IPCClient_BlockHeader_UncheckedNumericFields_DocumentsGap (3); txsystem/fgp_txsystem_ipc_integration_test.go — TestFGP11_Integration_ConfirmationsGatesIsActive (4).
- Fix-test coverage required: positive — a well-formed 64-hex hash + sane fields passes; negative — each of the BVA/equivalence cases above now returns an error (flip require.NoError → require.Error).
3. F17 — Validators() returns zero-value peer.ID("") entries on decode failure
- Found during: Task 3, #10 audit — "Related Issues" in the body.
- System impact now: the effective validator set silently shrinks (N → N-bad), reducing Byzantine fault tolerance (fewer honest nodes counting toward 2f+1) and dropping broadcasts to phantom peers. Enough malformed IDs at genesis/setup could prevent the partition from reaching quorum at all. Operator note: cluster reports Active while participation is silently reduced.
- System impact after fix: the validator set contains only valid members; quorum math is correct; misconfiguration surfaces as a logged error instead of silent degradation.
- Initial view: LOW-MEDIUM — "causes invalid downstream sends."
- Final verdict: narrowed by the BFT shard-config hash check (e2e: a tampered nodeId got the node isolated before broadcasts mattered). Fires for real only on genesis-gen bugs or operator typos at setup. Loop testing confirmed the empty-set case is safe (returns empty slice, no panic).
- Exact fix: change make(peer.IDSlice, len(validators)) + index-assignment to make(peer.IDSlice, 0, len(validators)) + append — skip failed decodes from the result. That's the whole change.
- Expected result: Validators() returns only successfully-decoded IDs; no peer.ID("") entries.
- Tests in the PR: partition/issue10_audit_test.go — TestFGP10_RELATED_Validators_ZeroValueOnBadDecode_DocumentsGap, TestFGP17_Validators_LoopBoundaries_DocumentsGap (5).
- Fix-test coverage required: positive — all-valid input returns N IDs; negative — mixed valid/invalid returns only the valid ones (0 zero-values), and all-invalid returns an empty slice (flip the zero-value-count assertions).
4. F3 — Pool-side stale-FD partition wedge after PoW restart
- Found during: Task 2, #8 SIGKILL e2e window; also pre-existing as Scenario E / SESSION_FGP_IPC_AND_BFT_SNAPSHOTS.md §7.
- System impact now: finality (FGP round progression) stalls ~60s after each PoW node restart while the connection pool self-heals — a temporary, self-recovering liveness degradation.
Operator note: a burst of connection reset by peer warnings that looks like a hard outage.
- System impact after fix: finality continuity across PoW restarts; the liveness gap and the warning burst go away.
- Initial view: HIGH — "permanent wedge for 3+ rounds."
- Final verdict: MEDIUM operator-experience. Live probe: the transport pool self-heals within ~1 T1 cycle (~60s); the originally-observed multi-round wedge was compound with the unicityd-startmining one-shot gotcha (separate; file in aggregator-subscription). Even a hostile abrupt-close unit test could not reproduce the wedge — it self-heals. Not unit-reproducible; live e2e is the only repro.
- Exact fix: in the http.Transport config in pow/ipc_client.go, set IdleConnTimeout: 30s (or DisableKeepAlives: true if simpler) and call transport.CloseIdleConnections() on RPC error.
Optional: leader rotation on repeated failure. Do not add a custom retry/backoff loop that could mask other failures — the goal is to retire dead connections faster, not to paper over errors.
- Expected result: recovery after PoW restart drops from ~60s to ~immediate; no behavior change in the healthy path.
- Tests in the PR: partition/t1_timer_rearm_regression_test.go (4); pow/ipc_client_test.go — Test_IPCClient_StalePoolAfterServerRestart_SelfHeals,
Test_IPCClient_AbruptServerClose_ReproducesStaleFD_ThenRecovers. Live: F3-S1/S2 in E2E_FOLLOWUP_PROBES.md.
- Fix-test coverage required: positive — healthy calls still pooled/reused; negative — after a server restart, the next call succeeds without the ~60s delay (add a timing assertion once IdleConnTimeout/CloseIdleConnections is in).
5. F20 — JSON-RPC error field type confusion masks valid results
- Found during: 2026-05-20 technique pass (decision table over error×result + equivalence partitioning of the error field's JSON type).
- System impact now: if unicityd ever returns a non-object error field (false/0/""), all of that node's PoW queries fail → it can't fetch headers/tip → that node's finality stalls until unicityd is corrected. Liveness impact, single node, triggered only by an upstream regression. Operator note: near-undiagnosable error ("rpc error: ").
- System impact after fix: spurious error-field values no longer break PoW queries; node liveness is preserved and errors are diagnosable.
- Initial view: (new finding — not in any issue body).
- Final verdict: LOW-MEDIUM defense-in-depth. if rpcResp.Error != nil with Error interface{} treats false/0/"" as an RPC error and discards the valid result. Spec-compliant unicityd never triggers it, but a buggy one makes every call fail with cryptic messages ("rpc error: false", or "rpc error: " for empty string). Confirmed through FollowerVerify (integration test).
- Exact fix: type the Error field as *jsonRpcError struct (Code int, Message string) instead of interface{}. Non-object JSON values then fail to decode or decode to nil, so only a real error object triggers the error path. Wire/serialization note: this changes how the response is parsed (not the request) — confirm it still decodes spec-compliant error objects. No request-side wire change.
- Expected result: valid result with error: false/0/"" → result is returned, not masked; a real {"code":...,"message":...} → error as before.
- Tests in the PR: pow/ipc_client_test.go — Test_IPCClient_ResponseEnvelope_DecisionTable (8); txsystem/fgp_txsystem_ipc_integration_test.go — TestFGP20_Integration_ErrorFieldMasksResult (3).
- Fix-test coverage required: positive — real error object still surfaces an error; valid result with non-object error field now returns the result; negative — flip the false/0/"" PROBE rows to require.NoError + correct result.
6. F16 — printUC panics on nil InputRecord
- Found during: Task 3, #10 audit — "Related Issues" in the body.
- System impact now: a crafted/corrupt UC reaching a debug log statement panics the message-handler goroutine → node instability and dropped message processing for that handler. Higher
exposure when debug logging is on.
- System impact after fix: the UC logs safely; message processing continues; no panic at any log level.
- Initial view: LOW — nil dereference in consensus paths.
- Final verdict: LOW, narrowed by CBOR schema + BFT signature validation upstream (threat model = direct blocks.db tampering).
- Exact fix: replace the four direct field accesses with the existing nil-safe accessors — uc.GetStateHash(), uc.GetPreviousStateHash(), uc.GetBlockHash(), uc.GetFeeSum(). No new guard
needed; matches the codebase convention. (The GetRoundNumber()/GetRootRoundNumber() calls already there are already nil-safe.)
- Expected result: printUC is fully nil-safe; a UC with nil InputRecord renders zero values instead of panicking.
- Tests in the PR: partition/issue10_audit_test.go — TestFGP10_RELATED_PrintUC_NilInputRecord_PanicsCurrently_DocumentsGap.
- Fix-test coverage required: positive — a complete UC prints the real values; negative — nil InputRecord returns a string without panic (flip require.Panics → require.NotPanics).
7. F18 — sendCertificationRequest panics on nil luc.UnicitySeal
- Found during: Task 3, #10 audit — "Related Issues" in the body.
- System impact now: a corrupt/partial stored latest-UC panics the leader during proposal construction → that node drops out of leadership, stalling round progress until recovery/rotation.
- System impact after fix: the leader handles the malformed UC with an error instead of crashing; round continuity preserved.
- Initial view: LOW — nil dereference.
- Final verdict: LOW, same narrowing as F16. Refinement: sendCertificationRequest already uses the nil-safe luc.GetStateHash() right next to the unsafe luc.UnicitySeal.Timestamp — there's no GetTimestamp() accessor (see F21).
- Exact fix: depends on F21 — once GetTimestamp() exists in bft-go-base, use luc.GetTimestamp(). Until then, a one-line guard at function start: if luc == nil || luc.UnicitySeal == nil {
return fmt.Errorf("nil UnicitySeal in latest UC") }. Prefer the accessor route (no local guard).
- Expected result: no panic when UnicitySeal is nil.
- Tests in the PR: partition/issue10_audit_test.go — TestFGP10_RELATED_NilUnicitySeal_PanicsOnTimestampAccess_DocumentsGap.
- Fix-test coverage required: positive — valid luc produces a correct certification request; negative — nil UnicitySeal returns an error / uses the safe accessor without panic.
8. F13 — Logger file handle never Sync()'d / Close()'d on shutdown
- Found during: Task 2, #8 audit — "Related Issues" in the body.
- System impact now: none to consensus or liveness — observability only. On a kernel panic or power loss (not normal crashes), the last log lines may never reach disk — a forensic gap
exactly where post-mortems need data.
- System impact after fix: explicit flush on graceful shutdown; log-loss window shrinks to true sudden power loss; better post-mortems.
- Initial view: "risking data loss."
- Final verdict: LOW — downgraded by the validity probe. Go's *os.File.Write is unbuffered; the kernel page cache survives process death (SIGTERM and SIGKILL). Only system-level crashes
(kernel panic, power loss) lose un-Sync()'d data (probe: 500 lines all persisted without Sync()).
- Exact fix: track the *os.File returned by filenameToWriter; add (*Logger).Close() (or expose the file) that calls Sync() then Close(); invoke from the shutdown handler. Do not add a
buffered writer — that would introduce a flush dependency (new hardening surface), the opposite of the goal.
- Expected result: explicit flush+close on shutdown; system-crash durability improved; no change to the steady-state write path.
- Tests in the PR: logger/f13_sync_gap_test.go (4 — incl. the severity probe showing 500 lines persist).
- Fix-test coverage required: positive — normal logging still writes every line; negative — after Close(), the file is synced and the handle released (assert Sync() succeeds, no fd leak).
9. F10 — Response-ID validation TODO cleanup (BLOCKED)
- Found during: Task 1, #6 audit.
- System impact now: none currently — spec-compliant unicityd echoes the id correctly. Latent integrity gap: a misrouted/interleaved or socket-MITM-substituted RPC response would not be detected, so a node could act on consensus-relevant PoW data that doesn't correspond to its request.
- System impact after fix: response-to-request integrity is enforced; mismatched responses rejected. No change in normal operation.
- Initial view: stale TODO.
- Final verdict: TRIVIAL — the upstream cause was fixed in unicity-node@f7d9fdb ("Mirror request id in rpc response") on bft-snapshots. Blocked on unicity-node merging bft-snapshots → main
(otherwise FGP against a main-built unicityd would fail every call).
- Exact fix: uncomment the response-ID validation block at pow/ipc_client.go:109-112. No new logic.
- Expected result: a response whose id ≠ the request id is rejected.
- Tests in the PR: pow/ipc_client_test.go — Test_IPCClient_RequestID_ResponseMismatchAcceptedToday_DocumentsGap (+ existing …_StrictlyMonotonic / …_UniqueUnderConcurrency).
- Fix-test coverage required: positive — matching id passes; negative — mismatched id rejected with "rpc response ID mismatch" (flip the gap test).
Test PR plan
assertion — that's the per-finding "fix-test coverage required" above.
Priority order
Current set of tests: #18