Skip to content

Add RocksDB-backed disk SMT for BFT shard mode#156

Open
jait91 wants to merge 10 commits into
mainfrom
disk-backed-smt-port
Open

Add RocksDB-backed disk SMT for BFT shard mode#156
jait91 wants to merge 10 commits into
mainfrom
disk-backed-smt-port

Conversation

@jait91
Copy link
Copy Markdown
Contributor

@jait91 jait91 commented Jun 3, 2026

Summary

Adds an optional RocksDB-backed SMT backend for single-active BFT shard mode while keeping the existing in-memory SMT as the default.

What changed

  • Added a backend abstraction so SMT can run in memory or on RocksDB
  • Wired RocksDB SMT into BFT-shard finalization, startup verification, replay, and recovery.
  • Added proof metadata cache / optional precomputed proof storage.
  • Added RocksDB build/config support and guards that reject unsupported HA and child/parent modes for now. HA support will be added in a separate PR.

Notes

Mongo still stores finalized block history; RocksDB is the local SMT state and is verified against Mongo on startup. This keeps the data needed for future HA catch-up/rebuild work.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a disk-backed Sparse Merkle Tree (SMT) using RocksDB as an alternative to the in-memory SMT backend, along with Docker, Makefile, and configuration updates to support it. It also adds a precomputed proof cache and performance diagnostics. The code review identified critical issues that must be addressed: a bug in the FIFO cache (proofMetadataCache) where updates to existing records can lead to premature eviction, and several resource leaks where SMT snapshots or precollector snapshots are not properly discarded on error or exit paths.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread internal/round/proof_metadata_cache.go
Comment thread internal/round/proof_metadata_cache.go
Comment thread internal/round/proof_metadata_cache.go
Comment thread internal/round/recovery.go
Comment thread internal/round/recovery.go
Comment thread internal/round/round_manager.go
Comment thread internal/round/batch_processor.go Outdated
Comment thread internal/round/precollector.go
@jait91 jait91 requested a review from MastaP June 3, 2026 09:04
@MastaP MastaP requested a review from Copilot June 3, 2026 11:47
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new SMT backend abstraction and introduces an optional RocksDB-backed “disk SMT” for single-active BFT shard mode, including startup verification/replay and proof-related caches/diagnostics, while keeping the current in-memory SMT as the default.

Changes:

  • Introduces internal/smt/backend with MemoryBackend + DiskBackend (RocksDB) and updates round/service/HA codepaths to use the backend interface.
  • Adds disk-SMT startup verification + bounded replay/recovery logic and proof metadata caching / optional precomputed proof response storage.
  • Extends build & runtime configuration (Makefile, Docker, compose, config/env vars) to support building with -tags rocksdb, plus adds new metrics and trace-level logging.

Reviewed changes

Copilot reviewed 64 out of 65 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pkg/jsonrpc/handler.go Downgrades JSON-RPC request logs to a trace-level log call.
Makefile Adds configurable Go build tags/flags for build/test/bench targets.
internal/storage/redis/commitment.go Adds configurable XACK batching, optional async XDEL after ack, and Redis queue operation metrics.
internal/storage/redis/commitment_integration_test.go Integration test ensuring acked entries can be deleted asynchronously after MarkProcessed.
internal/storage/mongodb/connection.go Improves Mongo connection robustness during primary transitions (wait/retry) and ensures disconnect on failures.
internal/storage/mongodb/connection_test.go Tests primary-transition error classification behavior.
internal/storage/factory.go Wires new Redis ack/delete configuration into Redis commitment queue setup.
internal/smt/disk/tree_test.go Unit tests for disk SMT tree behavior and parity with legacy in-memory SMT roots.
internal/smt/disk/storage/types.go Defines disk SMT store/batch/snapshot/proof-response storage interfaces and counters/metrics types.
internal/smt/disk/serde.go Implements node serialization/deserialization for disk SMT nodes with validation.
internal/smt/disk/rocksstore/store_test.go RocksDB store tests (atomic batch, multiget behavior, snapshots lifecycle).
internal/smt/disk/rocksstore/batch.go Implements RocksDB write batch operations for nodes + committed-state metadata.
internal/smt/disk/path.go Implements compressed path encoding for internal nodes.
internal/smt/disk/node.go Defines disk SMT node/branch structures and hashing accessors.
internal/smt/disk/node_key.go Implements stable on-disk node-key encoding/parsing and ordering.
internal/smt/disk/model_test.go Tests key/path encoding, hashing parity, and (de)serialization round-trips.
internal/smt/disk/hash.go Implements yellowpaper-compatible hashing primitives for disk SMT.
internal/smt/backend/disk_backend.go Adds the disk SMT backend wrapper, snapshot handling, inclusion proofs with optional read snapshots, and precomputed proof storage hooks.
internal/smt/backend/disk_backend_test.go RocksDB-gated backend tests for parity, classification, snapshot lifecycle, and stats surfacing.
internal/smt/backend/disk_backend_rocksdb_test.go Additional RocksDB backend tests for snapshot usage, proof storage round-trip, and multi-commit survivability.
internal/smt/backend/backend.go Introduces the common SMT backend + snapshot interfaces and refactors in-memory SMT into MemoryBackend.
internal/smt/backend/backend_test.go Tests backend semantics (classification, snapshot commit rules, fork/commit sequencing) for memory backend.
internal/service/service.go Switches proof generation to use SMT backend (supports precomputed proof reads + proof metadata caching) and adds a non-blocking finalization read-lock path for “not-ready” responses.
internal/service/proof_diagnostics.go Adds proof-path metrics/counters and periodic summary logging.
internal/service/disk_bft_proof_rocksdb_test.go RocksDB-gated integration test around proof behavior before/after finalization in BFT shard mode.
internal/round/smt_persistence_integration_test.go Updates tests to use backend snapshot helpers rather than legacy snapshot APIs directly.
internal/round/smt_backend_rocksdb.go Adds RocksDB SMT backend wiring (build-tag gated).
internal/round/smt_backend_norocksdb.go Provides a helpful error when RocksDB backend is requested without the rocksdb build tag.
internal/round/smt_backend_factory.go Centralizes SMT backend selection and enforces current mode constraints (reject HA/child/parent for rocksdb).
internal/round/smt_backend_factory_test.go Tests backend selection defaults and rocksdb+HA rejection.
internal/round/smt_backend_factory_rocksdb_test.go RocksDB-gated test ensuring configured RocksDB backend opens and has expected empty root.
internal/round/recovery.go Refactors recovery to load recovered SMT state into the active backend (not just legacy SMT).
internal/round/recovery_test.go Updates recovery tests for the backend-based recovery path.
internal/round/proof_metadata_cache.go Adds an in-memory cache for proof block+record metadata keyed by root/state.
internal/round/precollector.go Refactors precollector to operate on the backend snapshot interface and propagate snapshot add errors.
internal/round/precollection_test.go Updates precollection tests for backend snapshots and adds a test for snapshot add error propagation.
internal/round/parent_round_manager.go Introduces GetSMTBackend() for parent mode and keeps committed block metadata in the memory backend wrapper.
internal/round/parent_round_manager_test.go Adds a test ensuring parent manager returns a persistent SMT backend.
internal/round/leaf_add.go Converts commitment-to-leaf handling to backend inputs and records batch application metrics.
internal/round/finalize_duplicate_test.go Adds a test ensuring post-finalization Redis ACK failure doesn’t fail block finalization; updates snapshot usage.
internal/round/factory.go Extends round manager interface with backend + proof-cache methods and non-blocking finalization read-lock.
internal/round/disk_smt_startup.go Implements disk SMT startup verification, bounded replay, and recovery sync behaviors.
internal/round/disk_bft_integration_rocksdb_test.go RocksDB-gated BFT shard integration tests covering empty rounds, duplicate filtering alignment, parity vs memory backend, and precollector lifecycle.
internal/round/backend_test_helpers_test.go Adds shared helpers for backend snapshot usage in tests.
internal/metrics/metrics.go Adds new Prometheus metrics for proof-path, SMT commit/apply stats, disk snapshot usage, startup recovery actions, and Redis op timings.
internal/logger/logger.go Adds trace log level support and formatting for trace in slog output.
internal/logger/async_logger.go Adds async trace logging support.
internal/ha/block_syncer.go Updates HA block syncer to apply/commit via SMT backend snapshots.
internal/ha/block_syncer_test.go Updates block syncer test to pass an SMT backend rather than a legacy SMT instance.
internal/config/config.go Adds SMT backend + RocksDB tuning options and Redis ack/delete options, plus validates trace log level.
internal/config/config_test.go Adds tests for SMT backend validation and env parsing, plus trace level validation.
Dockerfile Adds a multi-stage RocksDB build and builds the Go binary with build tags + CGO settings for RocksDB support.
docker-compose.yml Adds build args + runtime env/config for SMT RocksDB usage and mounts a local RocksDB data volume.
cmd/performance-test/types.go Replaces O(n²) sort loop with sort.Slice for proof duration statistics.
cmd/aggregator/main.go Passes the SMT backend to HA block syncer.
.gitignore Ignores Codex temp directories.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/ha/block_syncer.go
Comment thread internal/round/precollector.go
Copy link
Copy Markdown
Member

@MastaP MastaP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated-assisted deep review (disk SMT)

I reviewed this with a multi-agent pass (per-subsystem + adversarial verification) and built/ran the suites locally, including the -tags rocksdb tests that CI does not run.

Good news first: all 7 prior bot findings (FIFO cache, the various snapshot-leak sites, block_syncer discard) are genuinely fixed; the backend abstraction and disk tree are clean; startup verification is fail-closed.

However, two blocking issues are currently masked by the fact that no disk-SMT test runs in CI:

  • 🔴 The PR's own TestDiskSMTBFTShardProofBeforeAndAfterFinalization fails 5/5 deterministically (see comment on service.go).
  • 🔴 A disk-vs-memory parity defect can panic the round goroutine (see comment on disk/tree.go).

Test results I observed: default build/vet/tests ✅; -tags rocksdb for smt/disk*, smt/backend, internal/round ✅ (backend + rocksstore also clean under -race); internal/service ❌.

The 6 inline comments below cover findings ordered by severity. The in-memory default path is unaffected, so risk is scoped to SMT_BACKEND=rocksdb deployments — i.e. the feature this PR ships.

Comment thread internal/service/service.go Outdated
}
}()

unlock, ok := as.roundManager.TryFinalizationReadLock()
Copy link
Copy Markdown
Member

@MastaP MastaP Jun 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 High (blocking) — proofs for already-finalized states can return "not ready"; a latent availability race that shows up as a flaky test.

This replaces the blocking FinalizationReadLock() with the non-blocking TryFinalizationReadLock(). When finalization holds finalizationMu, the read returns an empty proof (CertificationData == nil, line ~250). But FinalizeBlock writes the AggregatorRecord to Mongo inside that same critical section. So a client that polls on record existence and then reads the proof can observe the record as finalized while the lock is still held, and gets a spurious "not ready" for a fully-provable state. The old blocking lock waited and returned the real proof.

Observable via internal/service/disk_bft_proof_rocksdb_test.go:89 (require.NotNil(... CertificationData)):

CGO_CFLAGS=-I$(brew --prefix rocksdb)/include CGO_LDFLAGS="-L$(brew --prefix rocksdb)/lib -lrocksdb" \
  go test -tags rocksdb -run TestDiskSMTBFTShardProofBeforeAndAfterFinalization ./internal/service/

Update / correction (was previously stated as "fails 5/5 deterministically"): this reproduces 5/5 locally (macOS, librocksdb 11.x) but passes on the GitHub ubuntu-latest runner (librocksdb 8.9.1) — confirmed by the new test-rocksdb CI job, which is green. So it is not a hard failure everywhere; whether the single read at line 86 lands inside the finalization write-lock window is timing-dependent (hardware / RocksDB version / I/O latency). Treat it as a flaky test masking a real availability race, not a deterministic break. The race itself is real and matters under production load (longer/more frequent finalization windows ⇒ more "record finalized but proof not yet readable" responses), and a test that is red locally yet green in CI is itself worth fixing.

Fix options: (a) keep a short bounded blocking wait before falling back to "not ready", so an already-finalized state isn't reported not-ready merely because a later round holds the lock; or (b) move record-store / precompute out of the finalizationMu window so "record durably finalized" implies "proof readable"; or (c) if the transient is intended, make the test poll line 86 instead of asserting once — but that weakens the proof-availability guarantee and is worth an explicit decision.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by removing the non-blocking proof-read fallback for disk SMT. Disk proof reads now use a published RocksDB snapshot for the last finalized root, so reads stay consistent and do not return false “not ready” while a new block is finalizing

Comment thread internal/smt/disk/tree.go Outdated
root, err := batchInsert(t.root, items, 0, len(items), 0, ctx, sem)
if err != nil {
result.Rejected = append(result.Rejected, RejectedLeaf{
Index: -1,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 High (blocking) — swallowed batchInsert error + Index:-1 ⇒ panic in the shared caller, and a phantom-accept.

On a batchInsert error (e.g. ErrStubOnPath, malformed branch, NewInternal failure) ApplyLeaves appends RejectedLeaf{Index: -1} and then returns (result, nil) (line ~188) — the error is swallowed. The shared consumer internal/round/leaf_add.go:64 then does commitments[rejected.Index]commitments[-1]runtime panic, crashing the round-processing goroutine (both live-add and precollector paths).

The memory backend never emits a negative index (its rejected indexes come from a real loop index), so this is a disk-vs-memory parity defect.

Compounding it: AcceptedIndexes is populated before the insert (line ~162) and on error t.root is left unchanged, so the failed leaves are simultaneously reported accepted while CandidateRoot is the old root — a silent consistency/data-loss hazard if the panic were ever guarded away.

Fix: return (ApplyResult{}, err) on batchInsert failure (and don't add to AcceptedIndexes until after success); add an Index >= 0 guard in leaf_add.go. See also the related comment on disk_backend.go (fromDiskApplyResult).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Internal disk apply failures now propagate as errors instead of being reported as rejected leaves, and negative indexes are rejected defensively before callers can use them.

Comment thread Dockerfile

FROM builder AS test

RUN if [ -n "$BUILD_TAGS" ]; then \
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 High — the entire disk-SMT test suite never runs in CI, which is how the two bugs above slipped through.

This FROM builder AS test stage runs go test -tags rocksdb, but it's an orphan: the final image stage COPY --from=builder (not from=test), no compose file or docker/build-push-action sets --target test, so BuildKit never builds it. Separately, .github/workflows/ci.yml and docker.yml run bare make test / make build, and the Makefile defaults GO_BUILD_TAGS empty — so every //go:build rocksdb file (the whole disk store, backend, snapshot/commit/fork, root parity, recovery/replay, and the failing internal/service test) is excluded. make vet is hardcoded go vet ./... too, so the gated files aren't even vetted → silent bit-rot risk.

Even if this stage did run, it only covers ./internal/smt/disk/... and would still miss the internal/service / internal/round gated tests.

Fix: add a CI job that installs librocksdb and runs make test GO_BUILD_TAGS=rocksdb (ideally with -race), mirroring the rocksdb-builder stage. At minimum go build -tags rocksdb ./... + go vet -tags rocksdb ./... to stop bit-rot.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. CI now exercises the RocksDB build/test path, and we added RocksDB-tagged vet as an extra guard. The one vet issue it exposed in MultiGet pointer handling was fixed.

s.metaCF = nil
}
if s.db != nil {
C.rocksdb_close(s.db)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Medium — Store.Close() frees RocksDB handles without synchronization (TOCTOU use-after-free).

Close() sets s.db = nil / frees the CF handles and readOpts without holding s.mu, while the read paths touch the same fields unlocked: getCFWithReadOptions (nil-check ~693, rocksdb_get_cf(s.db,...) ~714), NewReadSnapshot (rocksdb_create_snapshot(s.db) ~408), and ReadSnapshot.Close (~436). The s.db != nil checks are a TOCTOU window: an in-flight GetInclusionProofV2 can pass the check, then Close() frees s.db before the C.rocksdb_get_cf call → use-after-free of a freed rocksdb_t*.

Normal shutdown is safe by ordering (HTTP server stops before roundManager.Stop), so the trigger is narrow — an http.Server.Shutdown timeout with a slow proof read still running. But the field accesses are genuinely unsynchronized.

Fix: guard the db/CF lifetime — take s.mu in Close() (and RLock in the read paths), or use a closed atomic + WaitGroup/refcount so Close blocks until outstanding reads/ReadSnapshots are released.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. RocksDB store close is now synchronized with all CGO handle users. Close() cannot free RocksDB handles while reads, writes, batches, or read snapshots are still using them, and a test covers the long-lived snapshot case.

rejected := make([]RejectedLeaf, len(result.Rejected))
for i, item := range result.Rejected {
rejected[i] = RejectedLeaf{
Index: item.Index,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Medium — fromDiskApplyResult leaks the disk backend's -1 sentinel index through the abstraction boundary.

This copies item.Index verbatim with no validation, so the Index:-1 produced by disk/tree.go's internal-error path (see the tree.go comment) escapes to callers that assume every Rejected.Index is a valid slice index — which the memory backend always guarantees. This is the abstraction-layer half of the same parity defect.

Fix: treat item.Index < 0 here as a whole-batch internal error returned via the error channel of AddLeavesClassified, so disk and memory backends stay behaviorally identical at the interface (and the shared leaf_add.go consumer can't index commitments[-1]).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. fromDiskApplyResult now treats negative rejected indexes as backend errors instead of passing them through the abstraction boundary, so callers only ever see valid rejected indexes.

Comment thread bft-sharding-compose.yml
volumes:
- ./data/bft-sharding/genesis:/app/bft-config
- ./logs/bft-shard0:/app/logs
- ./data/bft-sharding/smt-rocksdb-shard0:/app/data/smt-rocksdb
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Medium — RocksDB data dir not pre-created/chmod'd ⇒ startup EACCES when SMT_BACKEND=rocksdb.

This bind-mounts ./data/bft-sharding/smt-rocksdb-shard0 (and -shard1, line ~309) into a container that runs as a non-root user (${USER_UID:-1001}), and SMT_DISK_PATH points at the mount itself. But the docker-run-bft-sh-clean* Makefile targets only mkdir/chmod 777 the genesis/mongo/redis dirs — never these smt-rocksdb-* dirs. Since they don't exist when chmod runs, Docker auto-creates the bind-mount sources at up-time as root:root 0755, and rocksstore.Open() (non-root) fails with EACCES, crashing the aggregator at startup. The default SMT_BACKEND=memory hides it, so it only bites the documented rocksdb opt-in.

Fix: add ./data/bft-sharding/smt-rocksdb-shard0 ./data/bft-sharding/smt-rocksdb-shard1 to the mkdir -p + chmod -R 777 lines in both bft-sharding make targets.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Local clean targets now create the RocksDB bind-mount directories and clean them together with Mongo/Redis state, so RocksDB cannot accidentally retain stale SMT data across clean runs.

MastaP and others added 3 commits June 4, 2026 13:33
Add a test-rocksdb job that installs librocksdb + compression libs and runs
the suite with -tags rocksdb. The disk-SMT backend (internal/smt/disk,
internal/smt/backend) and its round/service integration tests are gated behind
the rocksdb build tag, so the default 'make test' job skips them entirely;
this job builds and tests that code on every PR instead of leaving it
unexercised in CI.
Previously the job ran 'make test' (go test -tags rocksdb ./...), which re-ran
the entire suite a second time. Discover the packages and test functions gated
behind //go:build rocksdb and run only those, so the non-gated tests that share
the round/service/backend packages aren't re-executed (the default 'test' job
already covers them). Discovery is dynamic, so newly added gated tests are
picked up automatically.
@jait91 jait91 linked an issue Jun 5, 2026 that may be closed by this pull request
@jait91 jait91 requested a review from MastaP June 5, 2026 05:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Disk-backed SMT

3 participants