Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sweep: properly handle failed sweeping txns #9448

Open
wants to merge 13 commits into
base: yy-sweeper-fix
Choose a base branch
from

Conversation

yyforyongyu
Copy link
Member

@yyforyongyu yyforyongyu commented Jan 26, 2025

We now properly handle missing inputs error from either mempool acceptance check or publish transaction.

Depends on,


This change is Reviewable

@yyforyongyu yyforyongyu added this to the v0.19.0 milestone Jan 26, 2025
Copy link
Contributor

coderabbitai bot commented Jan 26, 2025

Important

Review skipped

Auto reviews are limited to specific labels.

🏷️ Labels to auto review (1)
  • llm-review

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@Roasbeef
Copy link
Member

Roasbeef commented Feb 3, 2025

Note to reviewers: the merge target for this is currently a side feature branch, and not directly master.

sweep/fee_bumper.go Outdated Show resolved Hide resolved
sweep/fee_bumper.go Outdated Show resolved Hide resolved
sweep/fee_bumper.go Outdated Show resolved Hide resolved
sweep/sweeper.go Show resolved Hide resolved
sweep/sweeper.go Outdated Show resolved Hide resolved
sweep/sweeper.go Outdated

// Do a non-blocking read to see if the output has been spent.
select {
case spend, ok := <-spendEvent.Spend:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here re hanging on the select, even if just for a moment (thereby yielding).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the comments made here, yes, block for some million seconds does help - initially I used time.Sleep to get the itest to pass since we can easily miss it. And I tried this approach here to wait for 100ms. Eventually I ditched that approach as I felt like it's a bit hacky (maybe it's not...), so I went to the current approach - even if the spending tx is missed here, we are fine as it will cause us to have an error saying the input is missing, and we would cross check it against our wallet to decide whether it's our tx, if so we'd mark it as swept.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so my concern here is that we just never get the spent event. If we had some other goroutine that we piped all the spend events in through (instead of only doing a non-blocking recv here, then ditching it), then we'd ensure that we always received the spend event (after historical dispatch has completed).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed could be an issue if we keep missing the events. I'm thinking of two approaches,

  1. block for 100ms when reading the channel - since we have multiple places checking the spendness, it's unlikely we would miss all of them, plus we would fail to perform a fee bump as the tx would be rejected if the inputs are spent.
  2. read the spend events in a goroutine to make sure we never miss it - this does create a bit more complexity, but think it's the most safe option.

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read the spend events in a goroutine to make sure we never miss it - this does create a bit more complexity, but think it's the most safe option.

I think this option is the best route forward. We'd then pipe the spend notifications into the collector goroutine? We should also make sure we de-dup the spend notifications (can go to the main goroutine to register perhaps?).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool - so I decided to not remove the spending notification subscription in the sweeper, such that we will always get notified if an input is spent, as suggested here.

// For neutrino, we will publish tx instead as it's the only way to know
// whether the RBF requirements are met or not.
if t.isNeutrinoBackend() {
err = t.cfg.Wallet.PublishTransaction(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we know? We have no way of knowing, other than remembering the past fee rates that we published at.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got the following logs from neutrino backend, PublishTransaction returns an error when the replacement tx pays insufficient fee, which means we can use it to detect whether we need to increase the fee rate or not?

2025-02-05 20:19:40.752 [DBG] BTCN peer.go:1082: Received reject (cmd tx, code REJECT_INSUFFICIENTFEE, reason 40f6678a0f55f3babf76ad7c4d95bbf61f53cb7e94129e4c167a38f65eaee93b: replacement transaction has an insufficient fee rate: needs more than 1022, has 1014, hash 40f6678a0f55f3babf76ad7c4d95bbf61f53cb7e94129e4c167a38f65eaee93b) from 127.0.0.1:16576 (outbound)
2025-02-05 20:19:40.752 [DBG] BTCN query.go:1051: Transaction 40f6678a0f55f3babf76ad7c4d95bbf61f53cb7e94129e4c167a38f65eaee93b rejected by peer 127.0.0.1:16576: code = InsufficientFee, reason = "rejected by 127.0.0.1:16576: 40f6678a0f55f3babf76ad7c4d95bbf61f53cb7e94129e4c167a38f65eaee93b: replacement transaction has an insufficient fee rate: needs more than 1022, has 1014"
2025-02-05 20:19:40.752 [DBG] BTCN query.go:1094: Got replies from 1 peers and 1 rejections
2025-02-05 20:19:40.752 [WRN] BTCN query.go:1097: All peers rejected transaction 40f6678a0f55f3babf76ad7c4d95bbf61f53cb7e94129e4c167a38f65eaee93b checking errors
2025-02-05 20:19:40.752 [ERR] BTCN broadcaster.go:188: Broadcast attempt failed: rejected by 127.0.0.1:16576: 40f6678a0f55f3babf76ad7c4d95bbf61f53cb7e94129e4c167a38f65eaee93b: replacement transaction has an insufficient fee rate: needs more than 1022, has 1014

So the idea is, for neutrino, we just try publishing the tx, if failed, we increase the fee rate and try it again until it succeeds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's just for btcd. We kept the reject message, but bitcoind removed it unfortunately.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💔💔💔 I guess then there's no need to keep this commit right? Not sure how many nodes are using btcd, but if only a few, then this would only work if the remote peer is running btcd?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems unrelated to the rest of the PR, so I think the commit should be dropped regardless.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but if only a few, then this would only work if the remote peer is running btcd?

Yeah I think ok to leave it in, as odds are the active neutrino nodes may have at least one btcd peer to send this special error. If none of the peers still support reject, then this will just return nil. Nothing we can really do in that case, other than wait till blocks are mined to see if anything has been spent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems unrelated to the rest of the PR, so I think the commit should be dropped regardless.

It was created as the itest was failing for the neutrino case tho.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to drop this commit instead as I realized this won't work for aux channels as it bypasses AuxSweeper when publishing the tx. Think a bit refactor is needed to make it happen, so gonna open another PR. As for the itest, we will skip checking restart behavior for the neutrino case.

itest/lnd_channel_force_close_test.go Show resolved Hide resolved
Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @yyforyongyu)

Copy link
Member Author

@yyforyongyu yyforyongyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @Roasbeef and @yyforyongyu)

sweep/sweeper.go Show resolved Hide resolved
sweep/sweeper.go Outdated

// Do a non-blocking read to see if the output has been spent.
select {
case spend, ok := <-spendEvent.Spend:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed could be an issue if we keep missing the events. I'm thinking of two approaches,

  1. block for 100ms when reading the channel - since we have multiple places checking the spendness, it's unlikely we would miss all of them, plus we would fail to perform a fee bump as the tx would be rejected if the inputs are spent.
  2. read the spend events in a goroutine to make sure we never miss it - this does create a bit more complexity, but think it's the most safe option.

WDYT?

Copy link
Member Author

@yyforyongyu yyforyongyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 9 files reviewed, 8 unresolved discussions (waiting on @Roasbeef and @yyforyongyu)

// For neutrino, we will publish tx instead as it's the only way to know
// whether the RBF requirements are met or not.
if t.isNeutrinoBackend() {
err = t.cfg.Wallet.PublishTransaction(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💔💔💔 I guess then there's no need to keep this commit right? Not sure how many nodes are using btcd, but if only a few, then this would only work if the remote peer is running btcd?

sweep/fee_bumper.go Outdated Show resolved Hide resolved
sweep/sweeper.go Outdated Show resolved Hide resolved
sweep/sweeper.go Outdated Show resolved Hide resolved
sweep/sweeper.go Outdated Show resolved Hide resolved
sweep/sweeper.go Outdated Show resolved Hide resolved
sweep/sweeper.go Show resolved Hide resolved
sweep/sweeper.go Outdated Show resolved Hide resolved
// For neutrino, we will publish tx instead as it's the only way to know
// whether the RBF requirements are met or not.
if t.isNeutrinoBackend() {
err = t.cfg.Wallet.PublishTransaction(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems unrelated to the rest of the PR, so I think the commit should be dropped regardless.

continue
}

// Fetch the wallet to see if it's our previous sweeping tx.
tx, err := t.cfg.Wallet.FetchTx(spendingTxid)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a bad corner case that comes from this change:

  1. A previous sweep confirms, and it had only a subset of the current input set.
  2. Because of the change in this commit, isUnknownSpent returns false instead of true.
  3. The fee bumper tells the sweeper TxConfirmed instead of TxUnknownSpent.
  4. The sweeper fails to reaggregate the remaining inputs.
  5. Deadlines missed and funds lost.

Is the purpose of this commit just to improve logging when a previous sweep confirms? If there's no security reason for this change, IMO we should drop it and address the logging issue in a separate PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! So the purpose of this commit is to make sure the fee bumper can recover from a restart. For instance, a sweep tx is published, we then restart, and the tx now confirms, we would then have trouble here. In details,

  1. sweep tx is published, we go offline.
  2. we now come online, inputs are offered to the sweeper again right after the sweeping tx is confirmed.
  3. we won't find them in the mempool, so the sweeper will create a new sweeping tx and offer it to the fee bumper.
  4. the fee bumper will get an error publishing it since the inputs are already spent.

This is why we need to check the wallet again to see if it's a previous sweeping tx. Note that this could happen if RegisterSpendNtfn doesn't give back a timely response in that subscription stream, which is found to be the case in the itest. This is also the source of the complexity here - when an input is spent in a block, and we receive a blockbeat, we expect a spending tx to be found in that block. However, due to the async design of RegisterSpendNtfn, the spending notification won't be available immediately.

This is why previous I wanted to extend the blockbeat to have a new method beat.HasOutpointSpent to make things deterministic, as mentioned in this comment. The idea being that, whenever we receive a block, the block data is already there, so it shouldn't be hard to just go through the block data and check whether a given outpoint has been spent or not, which makes the control flow linear. However, this doesn't work for neutrino as it would require the full block data. Given the complexity introduced from this async design, I think I will revisit the idea of treating blockbeat as the single source of truth.

As for this commit, I think you are right that it complicates the flow a bit - the sweeper can handle the TxUnknownSpent perfectly except when the sweeping tx is not saved before shutting down, IsOurTx can have issue identifying our sweeping tx. I think there's some work to do re persisting data before shutdown. And I will move the wallet tx check there in the sweeper for now to mitigate this issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for this commit, I think you are right that it complicates the flow a bit - the sweeper can handle the TxUnknownSpent perfectly except when the sweeping tx is not saved before shutting down, IsOurTx can have issue identifying our sweeping tx. I think there's some work to do re persisting data before shutdown. And I will move the wallet tx check there in the sweeper for now to mitigate this issue.

Doesn't the sweeper still handle the case correctly if the old sweeping tx wasn't persisted? In that case the sweeper thinks the tx belongs to a third party, but it should still correctly remove the spent inputs from future sweeps.

This seems like a change that is only done to improve logging/reports. I would prefer to just drop it from this PR since it's ugly to poll 2 different tx DBs for only cosmetic benefit in a edge case.

Would be happy to review a separate future PR that either fixes the sweeper store persistence issues or switches to using the wallet DB and removes the sweeper store entirely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be happy to review a separate future PR that either fixes the sweeper store persistence issues or switches to using the wallet DB and removes the sweeper store entirely.

Cool let me try to remove it and see if the itest still passes - think given we now retain the inputs spending subscription in the sweeper, it should be fine.

sweep/fee_bumper.go Outdated Show resolved Hide resolved
sweep/fee_bumper.go Outdated Show resolved Hide resolved
sweep/sweeper.go Outdated
}

s.wg.Add(1)
go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So perhaps we'll retain some form of this after all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree - so I ended up not touching this area in this PR to limit the scope. For now we would keep this spend notification in the sweeper and read them in a goroutine to make sure we won't miss it.

@yyforyongyu yyforyongyu force-pushed the sweeper-fix branch 2 times, most recently from 8759a4c to 8626ca4 Compare February 12, 2025 14:09
sweep/fee_bumper.go Outdated Show resolved Hide resolved
sweep/sweeper_test.go Show resolved Hide resolved
sweep/sweeper_test.go Show resolved Hide resolved
sweep/fee_bumper.go Outdated Show resolved Hide resolved
sweep/fee_bumper.go Outdated Show resolved Hide resolved
continue
}

// Fetch the wallet to see if it's our previous sweeping tx.
tx, err := t.cfg.Wallet.FetchTx(spendingTxid)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for this commit, I think you are right that it complicates the flow a bit - the sweeper can handle the TxUnknownSpent perfectly except when the sweeping tx is not saved before shutting down, IsOurTx can have issue identifying our sweeping tx. I think there's some work to do re persisting data before shutdown. And I will move the wallet tx check there in the sweeper for now to mitigate this issue.

Doesn't the sweeper still handle the case correctly if the old sweeping tx wasn't persisted? In that case the sweeper thinks the tx belongs to a third party, but it should still correctly remove the spent inputs from future sweeps.

This seems like a change that is only done to improve logging/reports. I would prefer to just drop it from this PR since it's ugly to poll 2 different tx DBs for only cosmetic benefit in a edge case.

Would be happy to review a separate future PR that either fixes the sweeper store persistence issues or switches to using the wallet DB and removes the sweeper store entirely.

sweep/sweeper.go Outdated Show resolved Hide resolved
// handleUnknownSpendTx takes an input and its spending tx. If the spending tx
// cannot be found in the sweeper store, the input will be marked as fatal,
// otherwise it will be marked as swept.
func (s *UtxoSweeper) handleUnknownSpendTx(inp *SweeperInput, tx *wire.MsgTx) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there might be an opportunity to merge this function with handleInputSpent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still wanna unify the spending subscription flow here, and that method was supposed to be removed...think I'll defer it to a new PR re fixing the race notifications. As for now, these two methods serve different flows - one for handling spends from the fee bumper, the other handles the internal subscription.

@yyforyongyu yyforyongyu force-pushed the sweeper-fix branch 2 times, most recently from 0f9e82d to ed16f5f Compare February 13, 2025 06:24
@yyforyongyu yyforyongyu force-pushed the sweeper-fix branch 2 times, most recently from dfb4776 to 225b05f Compare February 13, 2025 07:10
Copy link
Collaborator

@morehouse morehouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Willing to approve once CI is happy.

sweep/fee_bumper.go Outdated Show resolved Hide resolved
sweep/sweeper.go Show resolved Hide resolved
This commit adds a new field `InputsSpent` to the `BumpResult` so they
can be used to track inputs spent by txns not recoginized by the fee
bumper.
We now start handling `TxUnknownSpend` in our sweeper to make sure the
failed inputs are retried when possible.
This is a minor refactor so the `createAndPublishTx` flow becomes more
clear, also prepares for the following commit where we start to handle
missing inputs.
A minor refactor to break the method `handleUnknownSpent` into two
steps, which prepares the following commit where we start handling
missing inputs.
This commit refactors `handleInitialTxError` and `createAndCheckTx` to
take a `monitorRecord` param, which prepares for the following commit
where we start handling missing inputs.
This commit handles the case when the input is missing during the RBF
process, which could happen when the bumped tx has inputs being spent by
a third party. Normally we should be able to catch the spend early via
the spending notification and never attempt to fee bump the record.
However, due to the possible race between block notification and spend
notification, this cannot be guaranteed. Thus, we need to handle the
case during the RBF when seeing a `ErrMissingInputs`, which can only
happen when the inputs are spent by others.
This commit adds the failed tx to the result when marking the input as
fatal, which is used in the commit resolver when handling revoked
outputs.
Previously, when a given input is found spent in the mempool, we'd mark
it as Published and never offer it to the fee bumper. This is dangerous
as the input will never be fee bumped. We now fix it by always
initializing the input with state Init, and only use mempool to check
for fee and fee rate.

This changes the current restart behavior - as previously when a
sweeping tx is broadcast, the node shuts down, when it starts again, the
input will be offered to the sweeper again, but not to the fee bumper,
which means the sweeping tx will stay in the mempool with the last-tried
fee rate. After this change, after a restart, the input will be swept
again, and the fee bumper will monitor its status. The restart will also
behave like a fee bump if there's already an existing sweeping tx in the
mempool.
So we can focus on testing normal flow vs persistence flow.
Before this commit, the only error returned from `IsOurTx` is when the
root bucket was not created. In that case, we should consider the tx to
be not found in our db, since technically our db is empty.

A future PR may consider treating our wallet as the single source of
truth and query the wallet instead to check for past sweeping txns.
Copy link
Collaborator

@morehouse morehouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

3 participants