go/worker/storage: Fix runtime state pruner #6446
Conversation
This was a false positive :/. Regardless, #6355 is an improvement, as we fixed the nesting of long-running operations inside database transactions that should be short-lived, and improved the overall code. Nevertheless, doing an initial sync with pruning enabled from the start seems to slow down the sync significantly. A temporary solution for this particular issue is:
A long-term solution would be to find the reason why pruning is so slow, possibly by profiling the badger/pathbadger implementations. All of this only affects archivers; validators should not retain all state and/or sync from genesis.
Codecov Report
@@            Coverage Diff             @@
##           master    #6446      +/-   ##
==========================================
- Coverage   64.93%   64.29%    -0.65%
==========================================
  Files         697      697
  Lines       68068    68105      +37
==========================================
- Hits        44198    43786     -412
- Misses      18896    19312     +416
- Partials     4974     5007      +33
Bug
Prior to this change, if you had pruning configured and then disabled it, the runtime state could still be pruned on the next restart.
This is because light history and runtime state pruning are decoupled: the state pruner simply removes all state rounds older than the last retained light history round. Since light history pruning is much faster, the state pruner could still have pending work to do even after a restart with pruning explicitly disabled.
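The failure mode described above can be sketched in a few lines of Go. All names here (`pruner`, `prune`, the fields) are illustrative, not the actual oasis-core code; the point is that the pruner derives its target from the light history's last retained round, so it must be gated on the pruning configuration first.

```go
// Sketch of the bug: without the enabled-check, leftover light-history
// progress from a previous run still triggers state pruning after a
// restart, even though the operator disabled pruning.
package main

import "fmt"

type pruner struct {
	pruningEnabled     bool
	lastRetainedLight  uint64 // earliest round kept by light history
	earliestStateRound uint64 // earliest round still in the state DB
}

// prune removes state rounds older than the last retained light history
// round, but only if pruning is still enabled in the config.
func (p *pruner) prune() (removed uint64) {
	if !p.pruningEnabled {
		// The fix: do nothing when pruning is disabled, regardless of
		// how far light history pruning advanced before the restart.
		return 0
	}
	for p.earliestStateRound < p.lastRetainedLight {
		p.earliestStateRound++
		removed++
	}
	return removed
}

func main() {
	// Operator disabled pruning, but light history already advanced.
	p := &pruner{pruningEnabled: false, lastRetainedLight: 100, earliestStateRound: 10}
	fmt.Println("rounds removed:", p.prune()) // 0: state is retained, as configured
}
```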
Context
This bug was introduced as part of #6355, which also didn't actually solve the original issue: #6446 (comment).
At the time, #6355 was benchmarked and it indeed solved the issue of sync getting effectively stalled due to slow pruning (in the case of a large state DB, #6334). We are exploring again whether background pruning can still negatively impact syncing speed, as there have been some such reports recently. This is likely the case with "hot loops", although it seems that speed is also affected a lot by the age of the storage diffs (data availability).
Additional consideration
We should rate limit the maximum number of versions that can be pruned in a given interval. This is already the case for the runtime light history. Consensus also suffers from this problem but is out of our hands; indeed, configuring pruning there aggressively on large state can cause the node to halt briefly (negligible compared to runtimes; currently mitigated with offline pruning).
Future Direction / Testing / Performance Regression
Like the previous two PRs touching the storage worker, this is impossible to test unless the worker is refactored as proposed, so that we can exercise unhappy paths and/or mock resources. This would also open the possibility of automating performance regression tests, e.g. sanity sync(s) with pruning enabled prior to each release.
Imo, we should step it up when it comes to integration tests; otherwise it is hard to guarantee correct code behavior. At the same time, this can quickly scope-creep, as there is a lot of tech debt around making our workers testable, and as with every refactor it increases the risk of introducing new bugs unless done "perfectly" (which takes resources :/).
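As an aside, the per-interval rate limiting proposed under "Additional consideration" could look roughly like the sketch below. The names (`pruneStep`, `maxVersionsPerInterval`) and the cap value are hypothetical, chosen only to show the shape of the idea: each pruning tick removes a bounded number of versions, so a large backlog drains gradually instead of monopolizing the node.

```go
// Sketch of rate-limited pruning: cap how many versions one pass may
// remove; the caller invokes pruneStep on a timer until the backlog drains.
package main

import "fmt"

const maxVersionsPerInterval = 32 // illustrative cap per pruning tick

// pruneStep removes at most maxVersionsPerInterval versions per call and
// reports the new earliest retained version plus how many were removed.
func pruneStep(earliest, target uint64) (newEarliest, removed uint64) {
	for earliest < target && removed < maxVersionsPerInterval {
		earliest++
		removed++
	}
	return earliest, removed
}

func main() {
	earliest, target := uint64(0), uint64(100)
	for earliest < target {
		var removed uint64
		earliest, removed = pruneStep(earliest, target)
		fmt.Println("pruned", removed, "versions; earliest now", earliest)
		// In a real worker, sleep until the next pruning interval here.
	}
}
```

The same pattern is what the runtime light history pruner already uses per the description above: a bounded amount of work per interval rather than an unbounded loop.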