feat: add failsafe to transaction replay #6212
base: develop
After some discussion, we're going to update this to use the rule of "once there are 2 burn blocks past the previous tip, clear the replay set if we're still in it". We'll use a config value for the "2 burn blocks" value.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           develop    #6212     +/-   ##
===========================================
+ Coverage    82.45%   82.55%   +0.10%
===========================================
  Files          541      541
  Lines       344244   344309      +65
  Branches       323      323
===========================================
+ Hits        283833   284251     +418
+ Misses       60403    60050     -353
  Partials         8        8
... and 41 files with indirect coverage changes
Overall the implementation looks fine!
As expected, all tx_replay_* tests are currently failing due to the two-tenure limit, for the reasons reported in the PR description. Once the current approach is finalized, these tests will likely need to be revisited and properly adjusted to align with the new logic.
I've included a few remarks throughout the review suggesting possible improvements for readability and maintainability.
wait_for(30, || {
    let tip = get_chain_info(&conf);
    Ok(tip.stacks_tip_height > tip_after_fork.stacks_tip_height + 1)
Should this be tip_after_fork.stacks_tip_height + 2?
signer_test
    .wait_for_signer_state_check(30, |state| Ok(state.get_tx_replay_set().is_some()))
    .expect("Expected replay set to still be set");

wait_for(30, || {
    let tip = get_chain_info(&conf);
    Ok(tip.stacks_tip_height > tip_after_fork.stacks_tip_height + 1)
})
.expect("Timed out waiting for a TenureChange block to be mined");
nit: Would it make sense to swap these two waits to align the structure with the surrounding code block, specifically to make this section more symmetric with the previous block ("Mining a second tenure") and the next one ("Mining a third tenure")?
I'm not sure - I think either way makes sense? Since the replay set is tied to a new burn block, it may happen first anyways.
Yeah, that makes sense! What I mean is: if ordering isn't relevant for correctness, I’d favor maintaining code symmetry with the surrounding blocks. I'd summarize the test steps involved like this:
info!("---- Waiting for two tenures, without replay set cleared ----");
// - unstall mining
// - wait for chain tip
// - wait for signer state check
info!("---- Mining a second tenure ----");
// - Mine naka block
// - wait for signer state check
// - wait for chain tip
info!("---- Mining a third tenure ----");
// - Mine naka block
// - wait for chain tip
// - wait for signer state check
As you can see, in the "Mining a second tenure" block, the two checks are inverted compared to the others.
If the order doesn’t matter functionally, I’d suggest aligning them to follow the same pattern across all blocks. It improves readability and could also help in the future if we migrate these tests to madhouse commands.
stacks-signer/src/v0/signer_state.rs
Outdated
let failsafe_height =
    replay_scope.past_tip.burn_block_height + reset_replay_set_after_fork_blocks;
If I'm understanding the scope construction correctly, I think the behavior here is roughly: if a replay set takes longer than 2 burn blocks to resolve, clear the replay set.
I think that behavior is actually fine. But I think the alternative we discussed was something like "if it's been more than 2 burn blocks since a transaction in the replay set has been processed", which is a more restrictive condition (i.e., less likely to trigger). I'm okay with the less restrictive condition, but we should make sure to communicate it.
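To make the difference between the two conditions concrete, here is a minimal sketch. It is not the signer's actual code: `ReplayScope` is a simplified stand-in, `last_replay_tx_burn_height` is a hypothetical field that the PR does not define, and the `>=` comparison is an assumption about how "2 burn blocks past the previous tip" is evaluated.

```rust
/// Simplified, hypothetical stand-in for the signer's replay scope.
struct ReplayScope {
    /// Burn block height of the tip from before the fork.
    past_tip_burn_height: u64,
    /// Assumed field (not in the PR): burn height at which a transaction
    /// from the replay set was last processed.
    last_replay_tx_burn_height: u64,
}

/// Condition as implemented in this PR (sketch): clear the replay set once
/// `reset_after` burn blocks have passed since the pre-fork tip.
fn should_clear_since_fork(scope: &ReplayScope, current_burn_height: u64, reset_after: u64) -> bool {
    let failsafe_height = scope.past_tip_burn_height.saturating_add(reset_after);
    current_burn_height >= failsafe_height
}

/// Alternative discussed in review (sketch): clear only once `reset_after`
/// burn blocks have passed without any replay transaction being processed.
fn should_clear_since_last_progress(
    scope: &ReplayScope,
    current_burn_height: u64,
    reset_after: u64,
) -> bool {
    let failsafe_height = scope.last_replay_tx_burn_height.saturating_add(reset_after);
    current_burn_height >= failsafe_height
}
```

Assuming replay progress can only happen at or after the pre-fork tip's height, the second check can never fire earlier than the first, which is why it is the more restrictive of the two.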
This reverts commit c6ca6b9.
Overall looks fine!
Added some minor remarks.
I also noticed that:
- there is a small clippy issue to be addressed: https://github.com/stacks-network/stacks-core/actions/runs/16027643852/job/45219629003?pr=6212
- and a bunch of flaky tests, one of which is tx replay related: https://github.com/stacks-network/stacks-core/actions/runs/16027643859/job/45221079154?pr=6212 (which I tested successfully locally)
@@ -6589,6 +6589,7 @@ fn signer_chainstate() {
    tenure_idle_timeout: Duration::from_secs(300),
    tenure_idle_timeout_buffer: Duration::from_secs(2),
    reorg_attempts_activity_timeout: Duration::from_secs(30),
    reset_replay_set_after_fork_blocks: 2,
Just a quick thought: in general, for this kind of test configuration, I wonder if it might be valuable to use the DEFAULT_RESET_REPLAY_SET_AFTER_FORK_BLOCKS constant. Not a strong opinion, just sharing in case it's worth considering for consistency or clarity.
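As a rough illustration of the suggestion, the sketch below builds a test config from the shared constant instead of a literal `2`. `TestSignerConfig` and `test_config` are hypothetical names, the `u64` type is an assumption, and the constant's value of 2 is taken from the PR description rather than from the real definition.

```rust
use std::time::Duration;

/// Assumed definition: the PR description says the field defaults to 2.
/// The real constant's type and location may differ.
const DEFAULT_RESET_REPLAY_SET_AFTER_FORK_BLOCKS: u64 = 2;

/// Hypothetical, trimmed-down config holding only two of the fields from the diff.
struct TestSignerConfig {
    reorg_attempts_activity_timeout: Duration,
    reset_replay_set_after_fork_blocks: u64,
}

/// Builds a test config that follows the default, so tests do not drift
/// if the default value is changed later.
fn test_config() -> TestSignerConfig {
    TestSignerConfig {
        reorg_attempts_activity_timeout: Duration::from_secs(30),
        reset_replay_set_after_fork_blocks: DEFAULT_RESET_REPLAY_SET_AFTER_FORK_BLOCKS,
    }
}
```

The trade-off is that a test which depends on the exact value 2 would then change behavior silently if the default changes, so a literal may still be preferable in such tests.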
Good suggestion, done!
Ok. I see the change in the nakamoto_integrations.rs module.
Let me know whether you want to apply the same approach in the remaining modules:
stacks-signer/src/tests/chainstate.rs
testnet/stacks-node/src/tests/signer/v0.rs
/// exits.
#[ignore]
#[test]
fn tx_replay_simple() {
nit: Consider renaming the test to better reflect the scenario being tested (e.g. tx_replay_started_after_fork or similar)
I definitely understand this suggestion, but I'm a little partial to keeping it as-is - it's really just a useful test for testing out the "base case" of tx replay. "Started after fork" is also redundant, since all tx replay happens after a fork.
Yeah, I see your point. This test essentially just verifies whether the transaction replay starts correctly.
If we can’t come up with a better name, I’m okay with keeping it as is.
I've resolved some of the remarks and updated the others.
@hstove, let me know your thoughts on the remaining older comments as well
(Opening as a draft - the test could be improved, and I'm not very confident in the "rules" implemented here)
This PR implements a 'failsafe' for transaction replay - if we've had 2 burn blocks since the new fork tip, clear the replay set. While this is very imperfect, it prioritizes liveness of the chain over guarantees about replay getting executed as expected. Most of the time, this shouldn't make a difference anyway. A new config field, reset_replay_set_after_fork_blocks, is provided to allow changing this value (which defaults to 2).

I've also refactored many of the transaction replay tests to do shallower forks, which aligns much more with reality. This actually caught a bug in the fork detection logic, which we were getting away with due to the tests using deeper forks. We now use a descendency check to determine whether a new burn block is a fork, where we previously did a simple check against the height of a new burn block.
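
For illustration only, the sketch below contrasts a height-based fork check with a descendency check. `BlockId`, `BurnBlock`, and both functions are hypothetical and do not mirror the signer's actual types; the in-memory parent map stands in for whatever chain view the real code queries, and the failing scenario shown is an assumed example, not necessarily the bug that was found.

```rust
use std::collections::HashMap;

/// Hypothetical burn block identifier.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct BlockId(u32);

/// Hypothetical burn block header: height plus parent link.
struct BurnBlock {
    height: u64,
    parent: Option<BlockId>,
}

/// Height-only check (sketch of the previous approach): a new burn block is
/// treated as a fork only if it is not higher than the prior tip.
fn is_fork_by_height(new_block: &BurnBlock, prior_tip: &BurnBlock) -> bool {
    new_block.height <= prior_tip.height
}

/// Descendency check (sketch): walk parent links from the new block; if the
/// prior tip is never reached, the new block is on a different fork.
/// Assumes the parent links in `blocks` form no cycles.
fn is_fork_by_descendency(
    blocks: &HashMap<BlockId, BurnBlock>,
    new_id: BlockId,
    prior_tip_id: BlockId,
) -> bool {
    let mut cursor = Some(new_id);
    while let Some(id) = cursor {
        if id == prior_tip_id {
            return false;
        }
        cursor = blocks.get(&id).and_then(|block| block.parent);
    }
    true
}
```

In this model, a block that builds on a sibling of the prior tip and lands one height above it passes the height check as "not a fork", while the descendency check flags it correctly.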