go/worker/storage: Fix runtime state pruner #6446
Conversation
This was a false positive :/. Regardless, #6355 is an improvement, as we fixed the nesting of long-running operations inside database transactions that should be short-lived, and improved the overall code. Nevertheless, doing an initial sync with pruning enabled from the start seems to slow down the sync significantly. A temporary solution for this particular issue is:
A long-term solution would be to find the reason why pruning is so slow, possibly by profiling the badger/pathbadger implementations. All of this only affects archivers; validators should not retain all state and/or sync from genesis.
Codecov Report
@@            Coverage Diff             @@
##           master    #6446      +/-   ##
==========================================
- Coverage   64.93%   64.29%    -0.65%
==========================================
  Files         697      697
  Lines       68068    68105      +37
==========================================
- Hits        44198    43786     -412
- Misses      18896    19312     +416
- Partials     4974     5007      +33
Bug
Prior to this change, if you had pruning configured and then disabled it, the runtime state could still be pruned on the next restart.
This is because light history and runtime state pruning are decoupled: the state pruner simply removes all state rounds older than the last retained light history round. Since light history pruning is much faster, the state pruner could still have pending work to do even after a restart with pruning explicitly disabled.
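The failure mode described above can be sketched in a few lines of Go. All names here (`pruner`, `prune`, the fields) are illustrative, not the actual oasis-core code; the point is that the pruner derives its target from the light history's last retained round, so it must be gated on the pruning configuration first.

```go
// Sketch of the bug: without the enabled-check, leftover light-history
// progress from a previous run still triggers state pruning after a
// restart, even though the operator disabled pruning.
package main

import "fmt"

type pruner struct {
	pruningEnabled     bool
	lastRetainedLight  uint64 // earliest round kept by light history
	earliestStateRound uint64 // earliest round still in the state DB
}

// prune removes state rounds older than the last retained light history
// round, but only if pruning is still enabled in the config.
func (p *pruner) prune() (removed uint64) {
	if !p.pruningEnabled {
		// The fix: do nothing when pruning is disabled, regardless of
		// how far light history pruning advanced before the restart.
		return 0
	}
	for p.earliestStateRound < p.lastRetainedLight {
		p.earliestStateRound++
		removed++
	}
	return removed
}

func main() {
	// Operator disabled pruning, but light history already advanced.
	p := &pruner{pruningEnabled: false, lastRetainedLight: 100, earliestStateRound: 10}
	fmt.Println("rounds removed:", p.prune()) // 0: state is retained, as configured
}
```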
Context
This bug was introduced as part of #6355, which also didn't actually solve the original issue: #6446 (comment).
At the time, #6355 was benchmarked and it indeed solved the issue of sync getting effectively stalled due to slow pruning (in the case of a large state DB, #6334). We are exploring again whether background pruning can still negatively impact syncing speed, as there have been some such reports recently. This is likely the case with "hot loops", although it seems that speed is also affected a lot by the age of the storage diffs (data availability).
Additional consideration
We should rate limit the maximum number of versions that can be pruned in a given interval. This is already the case for the runtime light history. Consensus also suffers from this problem but is out of our hands; indeed, configuring pruning there aggressively on large state can cause the node to halt briefly (negligible compared to runtimes; currently mitigated with offline pruning).
Future Direction / Testing / Performance Regression
Like the previous two PRs touching the storage worker, this is impossible to test unless the worker is refactored as proposed, so that we can exercise unhappy paths and/or mock resources. This would also open the possibility of automating performance regression tests, e.g. sanity sync(s) with pruning enabled prior to each release.
Imo, we should step it up when it comes to integration tests; otherwise it is hard to guarantee correct code behavior. At the same time, this can quickly scope-creep, as there is a lot of tech debt around making our workers testable, and as with every refactor it increases the risk of introducing new bugs unless done "perfectly" (which takes resources :/).
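As an aside, the per-interval rate limiting proposed under "Additional consideration" could look roughly like the sketch below. The names (`pruneStep`, `maxVersionsPerInterval`) and the cap value are hypothetical, chosen only to show the shape of the idea: each pruning tick removes a bounded number of versions, so a large backlog drains gradually instead of monopolizing the node.

```go
// Sketch of rate-limited pruning: cap how many versions one pass may
// remove; the caller invokes pruneStep on a timer until the backlog drains.
package main

import "fmt"

const maxVersionsPerInterval = 32 // illustrative cap per pruning tick

// pruneStep removes at most maxVersionsPerInterval versions per call and
// reports the new earliest retained version plus how many were removed.
func pruneStep(earliest, target uint64) (newEarliest, removed uint64) {
	for earliest < target && removed < maxVersionsPerInterval {
		earliest++
		removed++
	}
	return earliest, removed
}

func main() {
	earliest, target := uint64(0), uint64(100)
	for earliest < target {
		var removed uint64
		earliest, removed = pruneStep(earliest, target)
		fmt.Println("pruned", removed, "versions; earliest now", earliest)
		// In a real worker, sleep until the next pruning interval here.
	}
}
```

The same pattern is what the runtime light history pruner already uses per the description above: a bounded amount of work per interval rather than an unbounded loop.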