state-prune snapshot resume can drop nodes #3943

@jolestar

Summary

The resume path of rooch db state-prune snapshot can drop child nodes when a run is interrupted, which causes the final integrity check to fail with a missing child node error.

Impact

Snapshots built via resume may be unusable: the integrity check fails even when the source DB is healthy.

Root causes

  • Progress is persisted only every 5 minutes, so newly enqueued children are lost if the process dies before the next save.
  • Resume trusts snapshot_progress.json for both the worklist and nodes_written without reconciling them against the contents of snapshot.db.
  • Restoring nodes_written from the file masks missing nodes. A crash after children are pushed but before the write can leave the parent present in the snapshot and the child absent.

Repro (high level)

  1. Run rooch db state-prune snapshot (default resume enabled).
  2. Interrupt between progress saves (e.g., kill process after some batches).
  3. Resume; run completes but final integrity check reports missing child node.

Proposed fix (MVP)

  1. On resume, recompute nodes_written from snapshot.db (the actual count), prefer the DB value over the progress file, and warn on divergence.
  2. Make the frontier durable: persist worklist/batch_buffer much more frequently (every few seconds), or log it transactionally before batch writes.
  3. Safe resume: optionally rebuild the worklist by scanning snapshot.db from the root (enqueue parents with missing children), or force a restart when the progress file is stale.
  4. Progress hygiene: if the progress file is older or shorter than the DB, delete or ignore it to avoid resuming from a partial frontier.
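To make item 3 concrete, here is a minimal sketch of rebuilding the frontier from what is actually in the snapshot. It is not the existing rooch code: `NodeHash`, `rebuild_frontier`, and the in-memory `HashMap` standing in for snapshot.db are all hypothetical, and a real implementation would read and decode nodes from RocksDB instead. The sketch returns the missing hashes themselves rather than their parents, which is one possible reading of "enqueue parents with missing children".

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical stand-in for the tool's node hash type.
type NodeHash = u64;

/// Walk from `root` through the nodes already present in the snapshot and
/// return every reachable hash that is NOT present. Those hashes are the
/// frontier that resume still has to copy from the source DB.
///
/// `snapshot` maps a stored node's hash to the hashes of its children; it is
/// an assumption standing in for reads against snapshot.db.
fn rebuild_frontier(
    root: NodeHash,
    snapshot: &HashMap<NodeHash, Vec<NodeHash>>,
) -> Vec<NodeHash> {
    let mut missing = Vec::new();
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([root]);

    while let Some(hash) = queue.pop_front() {
        // Shared subtrees are visited once.
        if !seen.insert(hash) {
            continue;
        }
        match snapshot.get(&hash) {
            // Node is in the snapshot: keep descending into its children.
            Some(children) => queue.extend(children.iter().copied()),
            // Node is referenced but absent: it belongs on the worklist.
            None => missing.push(hash),
        }
    }
    missing
}
```

Because the frontier is derived from snapshot.db alone, a stale or truncated snapshot_progress.json cannot hide a missing child. The cost is a full scan of the partial snapshot on every resume, so it may make sense to run it only when the progress file fails a staleness check.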

Acceptance

  • Kill-and-resume cycles no longer produce missing-child errors.
  • Integrity check passes after resumed runs; logged node count matches RocksDB actual.
  • --no-resume behavior unchanged.
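
The node-count criterion above depends on resume preferring the DB count over the progress file. A rough sketch of that reconciliation follows; `Reconciled` and `reconcile_nodes_written` are hypothetical names, and how the DB count is obtained (for example, by iterating snapshot.db) is left out as an assumption.

```rust
/// Result of comparing the count in the progress file with the count
/// actually found in the snapshot DB. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
struct Reconciled {
    /// The value resume should continue from: always the DB count.
    nodes_written: u64,
    /// Set when the progress file and the DB disagree.
    warning: Option<String>,
}

/// Prefer the DB count over the progress file and report any divergence.
fn reconcile_nodes_written(from_progress_file: u64, counted_in_db: u64) -> Reconciled {
    let warning = if from_progress_file != counted_in_db {
        Some(format!(
            "nodes_written diverged: progress file says {from_progress_file}, \
             snapshot.db holds {counted_in_db}; using the DB count"
        ))
    } else {
        None
    };
    Reconciled {
        nodes_written: counted_in_db,
        warning,
    }
}
```

For example, `reconcile_nodes_written(100, 97)` yields `nodes_written == 97` together with a warning, so the logged count after a resumed run reflects what is really in the snapshot.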
