State sync worker should be only responsible for initializing and syncing the state #6308

martintomazic · 2025-08-27T13:12:26Z

WIP

Follow-up of #6306, inspired by #6299 (comment), part of the #6307.

After that we will refactor state sync worker into smaller workers (#6307) = #6299.

Finally, we will have simple PR that optimizes things.

TODO:

Move out pruner
Move out p2p server registration
Add tests
Checkpoint sync is independent worker and once it finishes we should close the checkpointsync p2p client, else we are leaking peermgr goroutine...

End goal:
Runtime storage worker should orchestrate workers, implement hooks and be responsible for the status:

Init state, by filling the possible missing rounds in nodedb and reindexed history.
Depending on the result from 1. start checkpoint sync worker, that also handles lifetime of the underlying checkpoint sync clients(!).
State is now fully initialized (update the status)
Start remaining workers:
- Checkpointer, also responsible for genesis checkpoint, even if not enabled or could be separate task.
- Availability nudger
- Diff sync worker.
Register p2p protocol servers
Register prune handler.

This way all workers that actually do work can be easily tested in isolation.

Also rename node to worker, to avoid confusion. Ideally, the parent package (storage) would have runtime as a prefix to make it clearer this is a runtime worker.

Previously, genesis checkpoint was created right after the state may be initialized from the runtime descriptor. If this is not the case it is first fetched from the peers. Thus, we should force the genesis checkpoint only after checkpoint sync finishes.

This is desirable, so that worker that initilize a new checkpointer don't require accepting context, but instead the lifetime and initialization of checkpointer is handled by the worker's Serve method.

This also serves as step towards passing the context explicitly.

Previously if the context was not canceled the fetcher might be sending the diffs on a channel that cannot be emptied, since we are already out of the main for loop, resulting in wg.Wait to never complete.

In addition, statesync worker now implements the Service interface. The corresponding BackgroundService methods (already not used) have been removed. Similarly, the storage worker was internally refactored to Service interface to ease eventual removal of the BackgroundService. Note that semantic changed slightly: Previously storage worker would wait for all state sync workers to finish. Now it will terminate when the first one finishes. Technically, this was already the case if the storage worker panicked.

Additionally, observe that the parent (storage worker) is registered as background service, thus upon error inside state sync worker there is no need to manually request the node shutdown.

Synced is a synonim for last finalized round inside the state sync worker. This change should made the code more readable. Eventually, we should ideally use either sync or finalized. Finally, the metrics use synced as last fully applied, but this would be breaking to change.

netlify · 2025-08-27T13:12:39Z

✅ Deploy Preview for oasisprotocol-oasis-core canceled.

Name	Link
🔨 Latest commit	`8bc3c32`
🔍 Latest deploy log	https://app.netlify.com/projects/oasisprotocol-oasis-core/deploys/68b093799a64b00008a3049e

This worker will be responsible for orchestrating storage operations for the given runtime. This includes: 1. Registering availability 2. Creating checkpoints if configured 3. Pruning state 4. Syncing state (internally init, checkpoint sync and diff sync) Subsequent commit will do the outlined refactor above.

This method is needed so that other worker (checkpointer, availability nudger) will be able to react.

go/worker/storage/checkpointer/checkpointer.go

martintomazic · 2025-08-27T20:24:44Z

On one hand I like granular packages, so that I can reuse simple words [package_name].New, and conveniently return Worker type. I can use [package_name].Config etc. If package has a good name this becomes very useful... Generally, the namespace is very clean.

On the other hand having workers nested under single package, I can make them private, and thus they are not seen in a documentation, which in this context is probably a good thing. update: Not true, as documentation should then be moved to the worker orchestrating... Finally, all this packages should be under internal, to make it clear sub-packaging is only for namespacing and API is not stable (e.g. storage/committee even storage). I think pkgsite documentation correctly hides everything under internal and by default shows only what the client (someone importing oasis protocol) should use and know about. Still not convinced entirely with neither of the approaches :).

If we decide to go with the latter approach, then runtime_worker.go from this PR should probably be moved inside committee package. In such case renaming committee to statesync was premature, given we will have many workers not directly related to the state sync inside this package. Regardless, we should definitely find a better name, but maybe later. update: Renaming to statesync is definitely premature given that likely we would have diffsync worker. I would however keep refactor that internally renames Node to Worker as this is not node...

The good point it is relatively easy to refactor between any of the approaches. Finally, this PR reinforced that State sync worker should be only responsible for initializing and syncing the state so factoring out checkpointer and availability nudger after #6306 is the right step to do.

Generally this worker should be tested explicitly. For this checkpointer worker should probably take finalizedCh instead of the whole state sync worker, waiting for the storage initialization should be moved to the parent worker. Finally, instead of the full common worker we should accept interface to ease mocking. Technically this worker does two things, creates and notifies. I am open to moving checkpointer creation to the parent worker, and notify there in the separate goroutine. This way this worker is only responsible for creating checkpoints. I have kept it this way as this way all the code relevant for checkpoint creation or checkpointer functionality is encapsulated together.

Same issues as with checkpointer. Possibly we would want to accept WatchFinalizedRounds as part of an interface. Co-authored-by: Peter Nose <[email protected]>

martintomazic added 13 commits August 25, 2025 23:27

go/worker/storage: Rename committee package to statesync

6476aaa

Also rename node to worker, to avoid confusion. Ideally, the parent package (storage) would have runtime as a prefix to make it clearer this is a runtime worker.

go/worker/storage/statesync: Move pruning to separate file

2a22222

go/storage/mkvs: Add a Serve method to the checkpointer

06762da

This is desirable, so that worker that initilize a new checkpointer don't require accepting context, but instead the lifetime and initialization of checkpointer is handled by the worker's Serve method.

go/storage/mkvs/checkpoint: Move methods around for readability

97728c6

go/worker/storage/statesync: Add explicit timeout for fetching diff

81a4381

go/worker/storage/statesync: Reduce the scope of workers

874bca6

This also serves as step towards passing the context explicitly.

go/worker/storage/statesync: Fix deadlock on the cleanup

2cec7ce

Previously if the context was not canceled the fetcher might be sending the diffs on a channel that cannot be emptied, since we are already out of the main for loop, resulting in wg.Wait to never complete.

go/worker/storage/statesync: Do not panic

f0172a3

Additionally, observe that the parent (storage worker) is registered as background service, thus upon error inside state sync worker there is no need to manually request the node shutdown.

go/worker/storage/statesync: Improve var grouping for syncing

f5666e8

go/worker/storage/statesync: Add timeout to checkpoint restoration

ce420f1

martintomazic changed the title ~~Martin/internal/state sync worker only syncing~~ State sync worker should be only responsible for initializing and syncing the state Aug 27, 2025

martintomazic changed the base branch from master to martin/trivial/state-sync-refactor-1 August 27, 2025 13:13

martintomazic changed the base branch from martin/trivial/state-sync-refactor-1 to martin/trivial/state-sync-refactor August 27, 2025 13:14

martintomazic force-pushed the martin/internal/state-sync-worker-only-syncing branch from 42886b7 to a53d9d0 Compare August 27, 2025 13:18

go/worker/storage: Remove unused method and config

aad4b24

martintomazic force-pushed the martin/internal/state-sync-worker-only-syncing branch from a53d9d0 to 55970b7 Compare August 27, 2025 13:59

martintomazic changed the base branch from martin/trivial/state-sync-refactor to master August 27, 2025 18:28

martintomazic force-pushed the martin/internal/state-sync-worker-only-syncing branch from 7cd6a91 to 867ed87 Compare August 27, 2025 18:28

worker/storage/statesync: Add WatchFinalizedRound method

3762e46

This method is needed so that other worker (checkpointer, availability nudger) will be able to react.

martintomazic force-pushed the martin/internal/state-sync-worker-only-syncing branch from 867ed87 to fde0e1d Compare August 27, 2025 19:16

martintomazic commented Aug 27, 2025

View reviewed changes

go/worker/storage/checkpointer/checkpointer.go Show resolved Hide resolved

martintomazic mentioned this pull request Aug 27, 2025

go/worker/storage: Refactor state sync worker pt1 #6306

Merged

martintomazic and others added 2 commits August 28, 2025 19:32

go/worker/storage: Move availability nudger to separate worker

8bc3c32

Same issues as with checkpointer. Possibly we would want to accept WatchFinalizedRounds as part of an interface. Co-authored-by: Peter Nose <[email protected]>

martintomazic force-pushed the martin/internal/state-sync-worker-only-syncing branch 2 times, most recently from 6548fb6 to 8bc3c32 Compare August 28, 2025 17:35

martintomazic mentioned this pull request Sep 9, 2025

Optimize runtime iterative state sync #6241

Open

martintomazic linked an issue Nov 13, 2025 that may be closed by this pull request

Refactor runtime storage committee worker into smaller and independent workers #6307

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

State sync worker should be only responsible for initializing and syncing the state #6308

State sync worker should be only responsible for initializing and syncing the state #6308

Uh oh!

martintomazic commented Aug 27, 2025 •

edited

Loading

Uh oh!

netlify bot commented Aug 27, 2025 •

edited

Loading

Uh oh!

Uh oh!

martintomazic commented Aug 27, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

State sync worker should be only responsible for initializing and syncing the state #6308

Are you sure you want to change the base?

State sync worker should be only responsible for initializing and syncing the state #6308

Uh oh!

Conversation

martintomazic commented Aug 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

netlify bot commented Aug 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for oasisprotocol-oasis-core canceled.

Uh oh!

Uh oh!

martintomazic commented Aug 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

martintomazic commented Aug 27, 2025 •

edited

Loading

netlify bot commented Aug 27, 2025 •

edited

Loading

martintomazic commented Aug 27, 2025 •

edited

Loading