feat: add failsafe to transaction replay #6212

Open
wants to merge 21 commits into base: develop

Conversation


@hstove hstove commented Jun 19, 2025

(Opening as a draft - the test could be improved, and I'm not very confident in the "rules" implemented here)

This PR implements a 'failsafe' for transaction replay: if there have been 2 burn blocks since the new fork tip, clear the replay set. While this is very imperfect, it prioritizes liveness of the chain over guarantees about the replay being executed as expected. Most of the time, this shouldn't make a difference anyway. A new config field, reset_replay_set_after_fork_blocks, is provided to allow changing this value (which defaults to 2).
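For illustration only, here is a minimal sketch of the failsafe rule as described above. The field names mirror the snippet quoted later in this review (replay_scope.past_tip.burn_block_height, reset_replay_set_after_fork_blocks), but the struct definitions, the function, and the exact comparison are assumptions rather than the actual signer code:

    struct BurnBlockInfo {
        burn_block_height: u64,
    }

    struct ReplayScope {
        // Burnchain tip the signer was on before the fork was detected.
        past_tip: BurnBlockInfo,
    }

    // Sketch: returns true when the replay set should be abandoned because the
    // burnchain has advanced `reset_replay_set_after_fork_blocks` blocks past
    // the pre-fork tip. Whether the real check uses `>` or `>=` is not shown in
    // the quoted snippet, so treat the comparison as illustrative.
    fn should_clear_replay_set(
        scope: &ReplayScope,
        new_burn_block_height: u64,
        reset_replay_set_after_fork_blocks: u64,
    ) -> bool {
        let failsafe_height =
            scope.past_tip.burn_block_height + reset_replay_set_after_fork_blocks;
        new_burn_block_height > failsafe_height
    }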

I've also refactored many of the transaction replay tests to use shallower forks, which aligns much more closely with reality. This actually caught a bug in the fork detection logic, which we were getting away with because the tests used deeper forks. We now use a descendancy check to determine whether a new burn block is a fork, where we previously did a simple comparison against the height of the new burn block.
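To make the change concrete, here is a rough sketch contrasting the old height-based check with a descendancy-based check. The BurnBlock type and the lookup_parent helper are assumptions for illustration; the actual implementation lives in the signer's chainstate handling:

    #[derive(Clone)]
    struct BurnBlock {
        height: u64,
        hash: [u8; 32],
        parent_hash: [u8; 32],
    }

    // Old rule (buggy for shallow forks): anything not strictly above the
    // current tip's height was treated as a fork.
    fn is_fork_by_height(new_block: &BurnBlock, current_tip: &BurnBlock) -> bool {
        new_block.height <= current_tip.height
    }

    // New rule: a burn block starts a fork iff it does not descend from the
    // current tip. `lookup_parent` resolves a parent block by hash.
    fn is_fork_by_descendancy(
        new_block: &BurnBlock,
        current_tip: &BurnBlock,
        lookup_parent: impl Fn(&[u8; 32]) -> Option<BurnBlock>,
    ) -> bool {
        let mut cursor = new_block.clone();
        while cursor.height > current_tip.height {
            match lookup_parent(&cursor.parent_hash) {
                Some(parent) => cursor = parent,
                None => return true, // unknown ancestry: treat as a fork
            }
        }
        cursor.hash != current_tip.hash
    }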

@hstove hstove requested review from kantai, jferrant and fdefelici June 19, 2025 22:29
@aldur aldur moved this to Status: 💻 In Progress in Stacks Core Eng Jun 20, 2025
@aldur aldur added this to the 3.1.0.0.13 milestone Jun 20, 2025

hstove commented Jun 20, 2025

After some discussion, we're going to update this to use the rule of "once there are 2 burn blocks past the previous tip, clear the replay set if we're still in it". We'll use a config value for the "2 burn blocks" value.


codecov bot commented Jun 22, 2025

Codecov Report

Attention: Patch coverage is 9.94550% with 661 lines in your changes missing coverage. Please review.

Project coverage is 82.55%. Comparing base (97a96fc) to head (66bfc13).

Files with missing lines Patch % Lines
testnet/stacks-node/src/tests/signer/v0.rs 1.19% 581 Missing ⚠️
testnet/stacks-node/src/tests/signer/mod.rs 5.00% 38 Missing ⚠️
stacks-signer/src/v0/signer_state.rs 68.49% 23 Missing ⚠️
stackslib/src/net/api/postblock_proposal.rs 0.00% 10 Missing ⚠️
...-node/src/burnchains/bitcoin_regtest_controller.rs 0.00% 9 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #6212      +/-   ##
===========================================
+ Coverage    82.45%   82.55%   +0.10%     
===========================================
  Files          541      541              
  Lines       344244   344309      +65     
  Branches       323      323              
===========================================
+ Hits        283833   284251     +418     
+ Misses       60403    60050     -353     
  Partials         8        8              
Files with missing lines Coverage Δ
stacks-signer/src/chainstate.rs 94.71% <100.00%> (+0.01%) ⬆️
stacks-signer/src/client/mod.rs 99.23% <100.00%> (+<0.01%) ⬆️
stacks-signer/src/config.rs 91.03% <100.00%> (+0.14%) ⬆️
stacks-signer/src/runloop.rs 89.68% <100.00%> (+0.02%) ⬆️
stacks-signer/src/tests/chainstate.rs 100.00% <100.00%> (ø)
stacks-signer/src/v0/signer.rs 84.30% <100.00%> (-0.08%) ⬇️
...net/stacks-node/src/tests/nakamoto_integrations.rs 87.17% <100.00%> (+0.12%) ⬆️
...-node/src/burnchains/bitcoin_regtest_controller.rs 87.79% <0.00%> (-0.40%) ⬇️
stackslib/src/net/api/postblock_proposal.rs 68.73% <0.00%> (-1.15%) ⬇️
stacks-signer/src/v0/signer_state.rs 76.31% <68.49%> (-1.53%) ⬇️
... and 2 more

... and 41 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 97a96fc...66bfc13. Read the comment docs.


Contributor

@fdefelici fdefelici left a comment

Overall the implementation looks fine!

As expected, all tx_replay_* tests are currently failing due to the two-tenure limit, for the reasons reported in the PR description. Once the current approach is finalized, these tests will likely need to be revisited and properly adjusted to align with the new logic.

I've included a few remarks throughout the review suggesting possible improvements for readability and maintainability.


wait_for(30, || {
    let tip = get_chain_info(&conf);
    Ok(tip.stacks_tip_height > tip_after_fork.stacks_tip_height + 1)
Contributor

Should this be tip_after_fork.stacks_tip_height + 2?

Comment on lines +4016 to +4024
signer_test
    .wait_for_signer_state_check(30, |state| Ok(state.get_tx_replay_set().is_some()))
    .expect("Expected replay set to still be set");

wait_for(30, || {
    let tip = get_chain_info(&conf);
    Ok(tip.stacks_tip_height > tip_after_fork.stacks_tip_height + 1)
})
.expect("Timed out waiting for a TenureChange block to be mined");
Contributor

nit: Would it make sense to swap these two waits to align the structure with the surrounding code block, specifically to make this section more symmetric with the previous block ("Mining a second tenure") and the next one ("Mining a third tenure")?

Contributor Author

I'm not sure - I think either way makes sense? Since the replay set is tied to a new burn block, it may happen first anyways.

Contributor

Yeah, that makes sense! What I mean is: if ordering isn't relevant for correctness, I'd favor maintaining code symmetry with the surrounding blocks. I'll try to summarize the involved test steps like this:

info!("---- Waiting for two tenures, without replay set cleared ----";)
// - unstall mining
// - wait for chain tip
// - wait for signer state check

info!("---- Mining a second tenure ----");
// - Mine naka block
// - wait for signer state check
// - wait for chain tip

info!("---- Mining a third tenure ----");
// - Mine naka block
// - wait for chain tip
// - wait for signer state check

As you can see, in the "Mining a second tenure" block, the two checks are inverted compared to the others.

If the order doesn’t matter functionally, I’d suggest aligning them to follow the same pattern across all blocks. It improves readability and could also help in the future if we migrate these tests to madhouse commands.

Comment on lines 1236 to 1237
let failsafe_height =
    replay_scope.past_tip.burn_block_height + reset_replay_set_after_fork_blocks;
Contributor

If I'm understanding the scope construction correctly, I think the behavior here is roughly: if a replay set takes longer than 2 burn blocks to resolve, clear the replay set.

I think that behavior is actually fine. But I think the alternative we discussed was something like "if it's been more than 2 burn blocks since a transaction in the replay set has been processed", which is a more restrictive condition (i.e., less likely to trigger). I'm okay with the less restrictive condition, but we should make sure to communicate it.
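To spell out the difference between the two rules, a hypothetical side-by-side (function names and signatures are illustrative, not taken from the PR):

    // Implemented condition (per the snippet above): clear the replay set once
    // the burnchain is `reset_after` blocks past the pre-fork tip, regardless
    // of whether replay is still making progress.
    fn clear_by_fork_age(pre_fork_tip_height: u64, current_height: u64, reset_after: u64) -> bool {
        current_height > pre_fork_tip_height + reset_after
    }

    // Alternative discussed: clear only when no replay transaction has been
    // processed for `reset_after` burn blocks. This is the more restrictive
    // (less likely to trigger) variant.
    fn clear_by_replay_stall(last_progress_height: u64, current_height: u64, reset_after: u64) -> bool {
        current_height > last_progress_height + reset_after
    }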

@obycode obycode modified the milestones: 3.1.0.0.13, 3.1.0.0.14 Jul 1, 2025
@hstove hstove marked this pull request as ready for review July 2, 2025 14:11
@hstove hstove requested review from a team as code owners July 2, 2025 14:11
@hstove hstove requested review from fdefelici and kantai July 2, 2025 14:24
@kantai kantai moved this from Status: 💻 In Progress to Status: In Review in Stacks Core Eng Jul 2, 2025
Contributor

@fdefelici fdefelici left a comment

Overall looks fine!

Added some minor remarks.

I also noticed that:

@@ -6589,6 +6589,7 @@ fn signer_chainstate() {
tenure_idle_timeout: Duration::from_secs(300),
tenure_idle_timeout_buffer: Duration::from_secs(2),
reorg_attempts_activity_timeout: Duration::from_secs(30),
reset_replay_set_after_fork_blocks: 2,
Contributor

Just a quick thought: in general, for this kind of test configuration, I wonder if it might be valuable to use the DEFAULT_RESET_REPLAY_SET_AFTER_FORK_BLOCKS constant. Not a strong opinion, just sharing in case it's worth considering for consistency or clarity.

Contributor Author

Good suggestion, done!

Contributor

Ok. I see the change in the nakamoto_integrations.rs module.

Let me know whether you want to apply the same approach in the remaining modules as well:

  • stacks-signer/src/tests/chainstate.rs
  • testnet/stacks-node/src/tests/signer/v0.rs

/// exits.
#[ignore]
#[test]
fn tx_replay_simple() {
Contributor

nit: Consider renaming the test to better reflect the scenario being tested (e.g. tx_replay_started_after_fork or similar)

Contributor Author

I definitely understand this suggestion, but I'm a little partial to keeping it as-is: it's really just a useful test for the "base case" of tx replay. "Started after fork" is also redundant, since all tx replay happens after a fork.

Contributor

Yeah, I see your point. This test essentially just verifies whether the transaction replay starts correctly.
If we can’t come up with a better name, I’m okay with keeping it as is.

@hstove hstove requested a review from fdefelici July 3, 2025 14:34
@aldur aldur requested a review from rdeioris July 3, 2025 14:52
Contributor

@fdefelici fdefelici left a comment

I've resolved some of the remarks and updated the others.

@hstove, let me know your thoughts on the remaining older comments as well.

Labels: None yet
Projects: Status: In Review
5 participants