Skip to content

Conversation

@martintomazic
Copy link
Contributor

@martintomazic martintomazic commented Aug 27, 2025

WIP

Follow-up of #6306, inspired by #6299 (comment), part of the #6307.

After that we will refactor state sync worker into smaller workers (#6307) = #6299.

Finally, we will have simple PR that optimizes things.

TODO:

  • Move out pruner
  • Move out p2p server registration
  • Add tests
  • Checkpoint sync is independent worker and once it finishes we should close the checkpointsync p2p client, else we are leaking peermgr goroutine...

End goal:
Runtime storage worker should orchestrate workers, implement hooks and be responsible for the status:

  1. Init state, by filling the possible missing rounds in nodedb and reindexed history.
  2. Depending on the result from 1. start checkpoint sync worker, that also handles lifetime of the underlying checkpoint sync clients(!).
  3. State is now fully initialized (update the status)
  4. Start remaining workers:
    • Checkpointer, also responsible for genesis checkpoint, even if not enabled or could be separate task.
    • Availability nudger
    • Diff sync worker.
  5. Register p2p protocol servers
  6. Register prune handler.

This way all workers that actually do work can be easily tested in isolation.

Also rename node to worker, to avoid confusion.

Ideally, the parent package (storage) would have runtime
as a prefix to make it clearer this is a runtime worker.
Previously, genesis checkpoint was created right after the state
may be initialized from the runtime descriptor. If this is not the
case it is first fetched from the peers.

Thus, we should force the genesis checkpoint only after checkpoint
sync finishes.
This is desirable, so that worker that initilize a new
checkpointer don't require accepting context, but instead the
lifetime and initialization of checkpointer is handled by the
worker's Serve method.
This also serves as step towards passing the context explicitly.
Previously if the context was not canceled the fetcher might
be sending the diffs on a channel that cannot be emptied,
since we are already out of the main for loop, resulting in
wg.Wait to never complete.
In addition, statesync worker now implements the Service
interface. The corresponding BackgroundService methods
(already not used) have been removed.

Similarly, the storage worker was internally refactored
to Service interface to ease eventual removal of the
BackgroundService.

Note that semantic changed slightly: Previously storage worker
would wait for all state sync workers to finish. Now it will
terminate when the first one finishes. Technically, this was
already the case if the storage worker panicked.
Additionally, observe that the parent (storage worker) is
registered as background service, thus upon error inside state
sync worker there is no need to manually request the node
shutdown.
Synced is a synonim for last finalized round inside the state
sync worker. This change should made the code more readable.
Eventually, we should ideally use either sync or finalized.

Finally, the metrics use synced as last fully applied,
but this would be breaking to change.
@netlify
Copy link

netlify bot commented Aug 27, 2025

Deploy Preview for oasisprotocol-oasis-core canceled.

Name Link
🔨 Latest commit 8bc3c32
🔍 Latest deploy log https://app.netlify.com/projects/oasisprotocol-oasis-core/deploys/68b093799a64b00008a3049e

@martintomazic martintomazic changed the title Martin/internal/state sync worker only syncing State sync worker should be only responsible for initializing and syncing the state Aug 27, 2025
@martintomazic martintomazic changed the base branch from master to martin/trivial/state-sync-refactor-1 August 27, 2025 13:13
@martintomazic martintomazic changed the base branch from martin/trivial/state-sync-refactor-1 to martin/trivial/state-sync-refactor August 27, 2025 13:14
@martintomazic martintomazic force-pushed the martin/internal/state-sync-worker-only-syncing branch from 42886b7 to a53d9d0 Compare August 27, 2025 13:18
@martintomazic martintomazic force-pushed the martin/internal/state-sync-worker-only-syncing branch from a53d9d0 to 55970b7 Compare August 27, 2025 13:59
This worker will be responsible for orchestrating storage
operations for the given runtime. This includes:
1. Registering availability
2. Creating checkpoints if configured
3. Pruning state
4. Syncing state (internally init, checkpoint sync and diff sync)

Subsequent commit will do the outlined refactor above.
@martintomazic martintomazic changed the base branch from martin/trivial/state-sync-refactor to master August 27, 2025 18:28
@martintomazic martintomazic force-pushed the martin/internal/state-sync-worker-only-syncing branch from 7cd6a91 to 867ed87 Compare August 27, 2025 18:28
This method is needed so that other worker (checkpointer,
availability nudger) will be able to react.
@martintomazic martintomazic force-pushed the martin/internal/state-sync-worker-only-syncing branch from 867ed87 to fde0e1d Compare August 27, 2025 19:16
@martintomazic
Copy link
Contributor Author

martintomazic commented Aug 27, 2025

On one hand I like granular packages, so that I can reuse simple words [package_name].New, and conveniently return Worker type. I can use [package_name].Config etc. If package has a good name this becomes very useful... Generally, the namespace is very clean.

On the other hand having workers nested under single package, I can make them private, and thus they are not seen in a documentation, which in this context is probably a good thing. update: Not true, as documentation should then be moved to the worker orchestrating... Finally, all this packages should be under internal, to make it clear sub-packaging is only for namespacing and API is not stable (e.g. storage/committee even storage). I think pkgsite documentation correctly hides everything under internal and by default shows only what the client (someone importing oasis protocol) should use and know about. Still not convinced entirely with neither of the approaches :).

If we decide to go with the latter approach, then runtime_worker.go from this PR should probably be moved inside committee package. In such case renaming committee to statesync was premature, given we will have many workers not directly related to the state sync inside this package. Regardless, we should definitely find a better name, but maybe later. update: Renaming to statesync is definitely premature given that likely we would have diffsync worker. I would however keep refactor that internally renames Node to Worker as this is not node...

The good point it is relatively easy to refactor between any of the approaches. Finally, this PR reinforced that State sync worker should be only responsible for initializing and syncing the state so factoring out checkpointer and availability nudger after #6306 is the right step to do.

martintomazic and others added 2 commits August 28, 2025 19:32
Generally this worker should be tested explicitly.

For this checkpointer worker should probably take finalizedCh
instead of the whole state sync worker, waiting for the storage
initialization should be moved to the parent worker. Finally,
instead of the full common worker we should accept interface
to ease mocking.

Technically this worker does two things, creates and notifies.
I am open to moving checkpointer creation to the parent worker,
and notify there in the separate goroutine. This way this worker
is only responsible for creating checkpoints.

I have kept it this way as this way all the code relevant for
checkpoint creation or checkpointer functionality is encapsulated
together.
Same issues as with checkpointer. Possibly we would want
to accept WatchFinalizedRounds as part of an interface.

Co-authored-by: Peter Nose <[email protected]>
@martintomazic martintomazic force-pushed the martin/internal/state-sync-worker-only-syncing branch 2 times, most recently from 6548fb6 to 8bc3c32 Compare August 28, 2025 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor runtime storage committee worker into smaller and independent workers

2 participants