
Conversation

martintomazic (Contributor) commented Jan 21, 2026

Bug

Prior to this change, if you had pruning configured and then disabled it, the runtime state could still be pruned on the next restart.

This is because light history and runtime state pruning are decoupled: the state pruner simply removes all state rounds older than the last retained light history round. Since light history pruning is much faster, it could happen that even after a restart with pruning explicitly disabled, the state pruner still had pending work to do.
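
To illustrate, here is a minimal sketch of the pre-fix control flow; all names are hypothetical, not the actual oasis-core API:

```go
package main

import "fmt"

// statePruner mimics the decoupled pruning described above; the fields
// and methods are illustrative assumptions, not oasis-core code.
type statePruner struct {
	enabled             bool
	lastPruned          uint64 // last state round already pruned
	lastRetainedHistory uint64 // maintained by the (faster) light history pruner
}

// prunePending drains every state round older than the last retained
// light history round. It never consults p.enabled, so a backlog left
// behind by the light history pruner is still drained after a restart
// with pruning disabled -- the bug this PR fixes.
func (p *statePruner) prunePending() {
	for p.lastPruned+1 < p.lastRetainedHistory {
		p.lastPruned++
		fmt.Printf("pruned state for round %d\n", p.lastPruned)
	}
}

func main() {
	// Pruning was disabled in the config, but light history pruning had
	// already advanced past the state pruner before the restart.
	p := &statePruner{enabled: false, lastPruned: 10, lastRetainedHistory: 14}
	p.prunePending() // still prunes rounds 11, 12 and 13
}
```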

Context

This bug was introduced as part of #6355, which, as it turns out, didn't actually solve the issue it targeted: see #6446 (comment).

At that time #6355 was benchmarked and it indeed solved the issue of sync getting effectively stopped due to slow pruning (in the case of a large state DB, #6334). I am exploring again whether background pruning can still negatively impact syncing speed, as there have been some such reports recently. Likely this is the case with "hot loops", although it seems to me that speed is also affected a lot by the age of the storage diffs (data availability).

Additional consideration

We should rate-limit the maximum number of versions that can be pruned in a given interval. This is already the case for the runtime light history. Consensus also suffers from this problem, but that is out of our hands; indeed, configuring pruning there aggressively on a large state can cause the node to halt for a bit (negligible compared to runtimes), which is currently mitigated with offline pruning.
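
A minimal sketch of such rate limiting, where the cap, names, and numbers are illustrative assumptions rather than oasis-core code:

```go
package main

import "fmt"

// maxPrunedPerInterval caps how many versions a single pruning pass may
// remove; the value is an assumption for illustration.
const maxPrunedPerInterval = 100

// pruneBatch prunes at most maxPrunedPerInterval rounds older than
// lastRetained and returns the new last pruned round; the remaining
// backlog is left for the next interval instead of monopolizing the node.
func pruneBatch(lastPruned, lastRetained uint64) uint64 {
	for n := 0; n < maxPrunedPerInterval && lastPruned+1 < lastRetained; n++ {
		lastPruned++
		// ... actually prune state for round lastPruned here ...
	}
	return lastPruned
}

func main() {
	// A 10,000-round backlog is drained 100 rounds per interval.
	fmt.Println(pruneBatch(0, 10_000))   // 100
	fmt.Println(pruneBatch(100, 10_000)) // 200
}
```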

Future Direction / Testing / Performance Regression

Like the previous two PRs touching the storage worker, this is impossible to test unless the worker is refactored as proposed, so that we can test unhappy paths and/or mock resources. This would also open the possibility of automating performance regression tests, e.g. sanity sync(s) with pruning enabled prior to each release.
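
As a rough illustration of that refactoring direction, the worker could depend on a small interface so tests can inject a failing double; everything below is a hypothetical sketch, not the actual proposal:

```go
package main

import (
	"errors"
	"fmt"
)

// Pruner abstracts the state pruning backend; the interface and names
// are hypothetical, not the actual oasis-core code.
type Pruner interface {
	Prune(round uint64) error
}

// Worker depends on the interface instead of a concrete database, so a
// test can inject a mock that fails on demand to exercise unhappy paths.
type Worker struct{ pruner Pruner }

func (w *Worker) pruneUpTo(round uint64) error {
	for r := uint64(0); r <= round; r++ {
		if err := w.pruner.Prune(r); err != nil {
			return fmt.Errorf("round %d: %w", r, err)
		}
	}
	return nil
}

// failingPruner is a test double that fails on a chosen round.
type failingPruner struct{ failAt uint64 }

func (f *failingPruner) Prune(round uint64) error {
	if round == f.failAt {
		return errors.New("simulated pruning failure")
	}
	return nil
}

func main() {
	w := &Worker{pruner: &failingPruner{failAt: 3}}
	fmt.Println(w.pruneUpTo(5)) // round 3: simulated pruning failure
}
```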

Imo, we should step it up when it comes to integration tests, else it is hard to guarantee correct code behavior. At the same time this can quickly turn into scope creep: there is a lot of tech debt around making our workers testable, and, as with every refactor, it increases the risk of introducing new bugs unless done "perfectly" (which takes resources :/).

netlify bot commented Jan 21, 2026

Deploy Preview for oasisprotocol-oasis-core canceled.

🔨 Latest commit: aba232e
🔍 Latest deploy log: https://app.netlify.com/projects/oasisprotocol-oasis-core/deploys/697739ece0d6d80008617bb6

martintomazic (Contributor, Author) commented Jan 21, 2026

At that time #6355 was benchmarked and it indeed solved the issue of sync getting effectively stopped due to slow pruning (in the case of a large state DB, #6334).

This was a false positive :/. Regardless, #6355 is an improvement, as we fixed the nesting of a long-running operation inside database transactions that should be short-lived, and improved the overall code.

Nevertheless, doing the initial sync with pruning enabled from the start seems to slow down the sync significantly; e.g. 5 rounds/s was measured once pruning was basically at the tail of the sync. It gets worse if you are syncing faster than pruning, as pruning becomes slower and slower until you are pruning 1 round every 30s+, effectively stopping the sync.

The temporary solution for this particular issue is:

The long-term solution would be to find the reason why pruning is so slow, possibly by profiling the badger/pathbadger implementations.
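
For instance, one could expose Go's standard pprof endpoint while running a sync with pruning enabled; this is plain standard-library usage, not necessarily how oasis-node exposes profiling:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// Serve profiling data on a local port while the workload runs.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... run the sync-with-pruning workload here ...
	select {}
}
```

A 30-second CPU profile of the pruning hot path can then be captured with `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.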

All this only affects archivers; validators should not retain all state and/or sync from genesis.

martintomazic force-pushed the martin/fix/runtime-state-pruner branch from 7c689c9 to a26e087 on January 22, 2026 13:15
codecov bot commented Jan 22, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.29%. Comparing base (4dcd07c) to head (aba232e).
⚠️ Report is 14 commits behind head on master.

Files with missing lines                Patch %   Lines
go/worker/storage/committee/worker.go   66.66%    1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6446      +/-   ##
==========================================
- Coverage   64.93%   64.29%   -0.65%     
==========================================
  Files         697      697              
  Lines       68068    68105      +37     
==========================================
- Hits        44198    43786     -412     
- Misses      18896    19312     +416     
- Partials     4974     5007      +33     

☔ View full report in Codecov by Sentry.

martintomazic marked this pull request as ready for review on January 22, 2026 14:10
martintomazic force-pushed the martin/fix/runtime-state-pruner branch from a26e087 to aba232e on January 26, 2026 09:54