-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sweep: properly handle failed sweeping txns #9448
base: yy-sweeper-fix
Are you sure you want to change the base?
sweep: properly handle failed sweeping txns #9448
Conversation
Important Review skippedAuto reviews are limited to specific labels. 🏷️ Labels to auto review (1)
Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
48d4631
to
bd0f218
Compare
3c0c687
to
6ad2db0
Compare
Note to reviewers: the merge target for this is currently a side feature branch, and not directly master. |
sweep/sweeper.go
Outdated
|
||
// Do a non-blocking read to see if the output has been spent. | ||
select { | ||
case spend, ok := <-spendEvent.Spend: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here re hanging on the select, even if just for a moment (thereby yielding).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to the comments made here, yes, block for some million seconds does help - initially I used time.Sleep
to get the itest to pass since we can easily miss it. And I tried this approach here to wait for 100ms. Eventually I ditched that approach as I felt like it's a bit hacky (maybe it's not...), so I went to the current approach - even if the spending tx is missed here, we are fine as it will cause us to have an error saying the input is missing, and we would cross check it against our wallet to decide whether it's our tx, if so we'd mark it as swept.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so my concern here is that we just never get the spent event. If we had some other goroutine that we piped all the spend events in through (instead of only doing a non-blocking recv here, then ditching it), then we'd ensure that we always received the spend event (after historical dispatch has completed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed could be an issue if we keep missing the events. I'm thinking of two approaches,
- block for 100ms when reading the channel - since we have multiple places checking the spendness, it's unlikely we would miss all of them, plus we would fail to perform a fee bump as the tx would be rejected if the inputs are spent.
- read the spend events in a goroutine to make sure we never miss it - this does create a bit more complexity, but think it's the most safe option.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
read the spend events in a goroutine to make sure we never miss it - this does create a bit more complexity, but think it's the most safe option.
I think this option is the best route forward. We'd then pipe the spend notifications into the collector
goroutine? We should also make sure we de-dup the spend notifications (can go to the main goroutine to register perhaps?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool - so I decided to not remove the spending notification subscription in the sweeper, such that we will always get notified if an input is spent, as suggested here.
sweep/fee_bumper.go
Outdated
// For neutrino, we will publish tx instead as it's the only way to know | ||
// whether the RBF requirements are met or not. | ||
if t.isNeutrinoBackend() { | ||
err = t.cfg.Wallet.PublishTransaction( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we know? We have no way of knowing, other than remembering the past fee rates that we published at.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got the following logs from neutrino backend, PublishTransaction
returns an error when the replacement tx pays insufficient fee, which means we can use it to detect whether we need to increase the fee rate or not?
2025-02-05 20:19:40.752 [DBG] BTCN peer.go:1082: Received reject (cmd tx, code REJECT_INSUFFICIENTFEE, reason 40f6678a0f55f3babf76ad7c4d95bbf61f53cb7e94129e4c167a38f65eaee93b: replacement transaction has an insufficient fee rate: needs more than 1022, has 1014, hash 40f6678a0f55f3babf76ad7c4d95bbf61f53cb7e94129e4c167a38f65eaee93b) from 127.0.0.1:16576 (outbound)
2025-02-05 20:19:40.752 [DBG] BTCN query.go:1051: Transaction 40f6678a0f55f3babf76ad7c4d95bbf61f53cb7e94129e4c167a38f65eaee93b rejected by peer 127.0.0.1:16576: code = InsufficientFee, reason = "rejected by 127.0.0.1:16576: 40f6678a0f55f3babf76ad7c4d95bbf61f53cb7e94129e4c167a38f65eaee93b: replacement transaction has an insufficient fee rate: needs more than 1022, has 1014"
2025-02-05 20:19:40.752 [DBG] BTCN query.go:1094: Got replies from 1 peers and 1 rejections
2025-02-05 20:19:40.752 [WRN] BTCN query.go:1097: All peers rejected transaction 40f6678a0f55f3babf76ad7c4d95bbf61f53cb7e94129e4c167a38f65eaee93b checking errors
2025-02-05 20:19:40.752 [ERR] BTCN broadcaster.go:188: Broadcast attempt failed: rejected by 127.0.0.1:16576: 40f6678a0f55f3babf76ad7c4d95bbf61f53cb7e94129e4c167a38f65eaee93b: replacement transaction has an insufficient fee rate: needs more than 1022, has 1014
So the idea is, for neutrino, we just try publishing the tx, if failed, we increase the fee rate and try it again until it succeeds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's just for btcd
. We kept the reject
message, but bitcoind
removed it unfortunately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💔💔💔 I guess then there's no need to keep this commit right? Not sure how many nodes are using btcd
, but if only a few, then this would only work if the remote peer is running btcd
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change seems unrelated to the rest of the PR, so I think the commit should be dropped regardless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but if only a few, then this would only work if the remote peer is running btcd?
Yeah I think ok to leave it in, as odds are the active neutrino nodes may have at least one btcd
peer to send this special error. If none of the peers still support reject
, then this will just return nil
. Nothing we can really do in that case, other than wait till blocks are mined to see if anything has been spent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change seems unrelated to the rest of the PR, so I think the commit should be dropped regardless.
It was created as the itest was failing for the neutrino case tho.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to drop this commit instead as I realized this won't work for aux channels as it bypasses AuxSweeper
when publishing the tx. Think a bit refactor is needed to make it happen, so gonna open another PR. As for the itest, we will skip checking restart behavior for the neutrino case.
0ca8914
to
18df4fb
Compare
6ad2db0
to
81259e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 7 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @yyforyongyu)
18df4fb
to
e2a7210
Compare
81259e5
to
f4625de
Compare
e2a7210
to
7cdf369
Compare
f4625de
to
8e46111
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @Roasbeef and @yyforyongyu)
sweep/sweeper.go
Outdated
|
||
// Do a non-blocking read to see if the output has been spent. | ||
select { | ||
case spend, ok := <-spendEvent.Spend: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed could be an issue if we keep missing the events. I'm thinking of two approaches,
- block for 100ms when reading the channel - since we have multiple places checking the spendness, it's unlikely we would miss all of them, plus we would fail to perform a fee bump as the tx would be rejected if the inputs are spent.
- read the spend events in a goroutine to make sure we never miss it - this does create a bit more complexity, but think it's the most safe option.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 9 files reviewed, 8 unresolved discussions (waiting on @Roasbeef and @yyforyongyu)
sweep/fee_bumper.go
Outdated
// For neutrino, we will publish tx instead as it's the only way to know | ||
// whether the RBF requirements are met or not. | ||
if t.isNeutrinoBackend() { | ||
err = t.cfg.Wallet.PublishTransaction( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💔💔💔 I guess then there's no need to keep this commit right? Not sure how many nodes are using btcd
, but if only a few, then this would only work if the remote peer is running btcd
?
8e46111
to
d935f71
Compare
sweep/fee_bumper.go
Outdated
// For neutrino, we will publish tx instead as it's the only way to know | ||
// whether the RBF requirements are met or not. | ||
if t.isNeutrinoBackend() { | ||
err = t.cfg.Wallet.PublishTransaction( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change seems unrelated to the rest of the PR, so I think the commit should be dropped regardless.
sweep/fee_bumper.go
Outdated
continue | ||
} | ||
|
||
// Fetch the wallet to see if it's our previous sweeping tx. | ||
tx, err := t.cfg.Wallet.FetchTx(spendingTxid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a bad corner case that comes from this change:
- A previous sweep confirms, and it had only a subset of the current input set.
- Because of the change in this commit,
isUnknownSpent
returns false instead of true. - The fee bumper tells the sweeper
TxConfirmed
instead ofTxUnknownSpent
. - The sweeper fails to reaggregate the remaining inputs.
- Deadlines missed and funds lost.
Is the purpose of this commit just to improve logging when a previous sweep confirms? If there's no security reason for this change, IMO we should drop it and address the logging issue in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! So the purpose of this commit is to make sure the fee bumper can recover from a restart. For instance, a sweep tx is published, we then restart, and the tx now confirms, we would then have trouble here. In details,
- sweep tx is published, we go offline.
- we now come online, inputs are offered to the sweeper again right after the sweeping tx is confirmed.
- we won't find them in the mempool, so the sweeper will create a new sweeping tx and offer it to the fee bumper.
- the fee bumper will get an error publishing it since the inputs are already spent.
This is why we need to check the wallet again to see if it's a previous sweeping tx. Note that this could happen if RegisterSpendNtfn
doesn't give back a timely response in that subscription stream, which is found to be the case in the itest. This is also the source of the complexity here - when an input is spent in a block, and we receive a blockbeat, we expect a spending tx to be found in that block. However, due to the async design of RegisterSpendNtfn
, the spending notification won't be available immediately.
This is why previous I wanted to extend the blockbeat
to have a new method beat.HasOutpointSpent
to make things deterministic, as mentioned in this comment. The idea being that, whenever we receive a block, the block data is already there, so it shouldn't be hard to just go through the block data and check whether a given outpoint has been spent or not, which makes the control flow linear. However, this doesn't work for neutrino as it would require the full block data. Given the complexity introduced from this async design, I think I will revisit the idea of treating blockbeat
as the single source of truth.
As for this commit, I think you are right that it complicates the flow a bit - the sweeper can handle the TxUnknownSpent
perfectly except when the sweeping tx is not saved before shutting down, IsOurTx
can have issue identifying our sweeping tx. I think there's some work to do re persisting data before shutdown. And I will move the wallet tx check there in the sweeper for now to mitigate this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for this commit, I think you are right that it complicates the flow a bit - the sweeper can handle the TxUnknownSpent perfectly except when the sweeping tx is not saved before shutting down, IsOurTx can have issue identifying our sweeping tx. I think there's some work to do re persisting data before shutdown. And I will move the wallet tx check there in the sweeper for now to mitigate this issue.
Doesn't the sweeper still handle the case correctly if the old sweeping tx wasn't persisted? In that case the sweeper thinks the tx belongs to a third party, but it should still correctly remove the spent inputs from future sweeps.
This seems like a change that is only done to improve logging/reports. I would prefer to just drop it from this PR since it's ugly to poll 2 different tx DBs for only cosmetic benefit in a edge case.
Would be happy to review a separate future PR that either fixes the sweeper store persistence issues or switches to using the wallet DB and removes the sweeper store entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be happy to review a separate future PR that either fixes the sweeper store persistence issues or switches to using the wallet DB and removes the sweeper store entirely.
Cool let me try to remove it and see if the itest still passes - think given we now retain the inputs spending subscription in the sweeper, it should be fine.
sweep/sweeper.go
Outdated
} | ||
|
||
s.wg.Add(1) | ||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So perhaps we'll retain some form of this after all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree - so I ended up not touching this area in this PR to limit the scope. For now we would keep this spend notification in the sweeper and read them in a goroutine to make sure we won't miss it.
8759a4c
to
8626ca4
Compare
sweep/fee_bumper.go
Outdated
continue | ||
} | ||
|
||
// Fetch the wallet to see if it's our previous sweeping tx. | ||
tx, err := t.cfg.Wallet.FetchTx(spendingTxid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for this commit, I think you are right that it complicates the flow a bit - the sweeper can handle the TxUnknownSpent perfectly except when the sweeping tx is not saved before shutting down, IsOurTx can have issue identifying our sweeping tx. I think there's some work to do re persisting data before shutdown. And I will move the wallet tx check there in the sweeper for now to mitigate this issue.
Doesn't the sweeper still handle the case correctly if the old sweeping tx wasn't persisted? In that case the sweeper thinks the tx belongs to a third party, but it should still correctly remove the spent inputs from future sweeps.
This seems like a change that is only done to improve logging/reports. I would prefer to just drop it from this PR since it's ugly to poll 2 different tx DBs for only cosmetic benefit in a edge case.
Would be happy to review a separate future PR that either fixes the sweeper store persistence issues or switches to using the wallet DB and removes the sweeper store entirely.
// handleUnknownSpendTx takes an input and its spending tx. If the spending tx | ||
// cannot be found in the sweeper store, the input will be marked as fatal, | ||
// otherwise it will be marked as swept. | ||
func (s *UtxoSweeper) handleUnknownSpendTx(inp *SweeperInput, tx *wire.MsgTx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there might be an opportunity to merge this function with handleInputSpent
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still wanna unify the spending subscription flow here, and that method was supposed to be removed...think I'll defer it to a new PR re fixing the race notifications. As for now, these two methods serve different flows - one for handling spends from the fee bumper, the other handles the internal subscription.
0f9e82d
to
ed16f5f
Compare
dfb4776
to
225b05f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Willing to approve once CI is happy.
7cdf369
to
cfbf023
Compare
225b05f
to
ed9375d
Compare
cfbf023
to
53f84ed
Compare
This commit adds a new field `InputsSpent` to the `BumpResult` so they can be used to track inputs spent by txns not recoginized by the fee bumper.
We now start handling `TxUnknownSpend` in our sweeper to make sure the failed inputs are retried when possible.
This is a minor refactor so the `createAndPublishTx` flow becomes more clear, also prepares for the following commit where we start to handle missing inputs.
A minor refactor to break the method `handleUnknownSpent` into two steps, which prepares the following commit where we start handling missing inputs.
This commit refactors `handleInitialTxError` and `createAndCheckTx` to take a `monitorRecord` param, which prepares for the following commit where we start handling missing inputs.
This commit handles the case when the input is missing during the RBF process, which could happen when the bumped tx has inputs being spent by a third party. Normally we should be able to catch the spend early via the spending notification and never attempt to fee bump the record. However, due to the possible race between block notification and spend notification, this cannot be guaranteed. Thus, we need to handle the case during the RBF when seeing a `ErrMissingInputs`, which can only happen when the inputs are spent by others.
This commit adds the failed tx to the result when marking the input as fatal, which is used in the commit resolver when handling revoked outputs.
Previously, when a given input is found spent in the mempool, we'd mark it as Published and never offer it to the fee bumper. This is dangerous as the input will never be fee bumped. We now fix it by always initializing the input with state Init, and only use mempool to check for fee and fee rate. This changes the current restart behavior - as previously when a sweeping tx is broadcast, the node shuts down, when it starts again, the input will be offered to the sweeper again, but not to the fee bumper, which means the sweeping tx will stay in the mempool with the last-tried fee rate. After this change, after a restart, the input will be swept again, and the fee bumper will monitor its status. The restart will also behave like a fee bump if there's already an existing sweeping tx in the mempool.
So we can focus on testing normal flow vs persistence flow.
Before this commit, the only error returned from `IsOurTx` is when the root bucket was not created. In that case, we should consider the tx to be not found in our db, since technically our db is empty. A future PR may consider treating our wallet as the single source of truth and query the wallet instead to check for past sweeping txns.
ed9375d
to
5dd4566
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
We now properly handle
missing inputs
error from either mempool acceptance check or publish transaction.Depends on,
Failed
toFatal
and minor refactor #9446testAnchorThirdPartySpend
#9445This change is![Reviewable](https://camo.githubusercontent.com/1541c4039185914e83657d3683ec25920c672c6c5c7ab4240ee7bff601adec0b/68747470733a2f2f72657669657761626c652e696f2f7265766965775f627574746f6e2e737667)