Replay lost `MonitorEvent`s in some cases for closed channels (#4004)
base: main
Conversation
Force-pushed from 658c408 to dc20515.
One of the new tests is failing.
Force-pushed from dc20515 to e076dd5.
Oops, silent rebase conflict due to #4001.
Test is still failing.
Force-pushed from e076dd5 to 86a58b6.
Grr, the #4001 changes were random so I had to fix all the paths; should be fine now.
Codecov Report — coverage diff:
## main #4004 +/- ##
==========================================
+ Coverage 88.61% 88.79% +0.17%
==========================================
Files 174 175 +1
Lines 127640 128268 +628
Branches 127640 128268 +628
==========================================
+ Hits 113113 113891 +778
+ Misses 12046 11809 -237
- Partials 2481 2568 +87
lightning/src/ln/channelmanager.rs (outdated)
for (channel_id, monitor) in args.channel_monitors.iter() {
	let mut is_channel_closed = false;
	let counterparty_node_id = monitor.get_counterparty_node_id();
	if let Some(peer_state_mtx) = per_peer_state.get(&counterparty_node_id) {
		let mut peer_state_lock = peer_state_mtx.lock().unwrap();
		let peer_state = &mut *peer_state_lock;
		is_channel_closed = !peer_state.channel_by_id.contains_key(channel_id);
	}
I'm not seeing the bug that this is fixing -- if a payment is failed after being fulfilled in `outbound_payments`, we'll terminate early after noticing it's fulfilled already in `OutboundPayments::fail_htlc`.
Context from the commit message: "... we could get both a `PaymentFailed` and a `PaymentClaimed` event on startup for the same payment"
Let's say we send an MPP payment along two different channels. One is claimed and the other failed (who knows why, the recipient just decided they don't like money for whatever reason).
Going through the single loop we may first find the failed-HTLC channel - we'll add the pending payment in `insert_from_monitor_on_startup`, which will just add the one part, then we'll see it failed and generate a `PaymentFailed` event since there's only one part. Then we'll go to the second channel and repeat the same process, but now with a `PaymentSent` event.
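To make the ordering issue concrete, here is a minimal self-contained sketch of the shape of the bug. All types and names below are hypothetical stand-ins, not LDK code: resolving each part inside the same loop iteration that inserts it yields both a failure and a success event, while learning about every part before replaying outcomes yields a single terminal event.

use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum PartOutcome { Claimed, Failed }

#[derive(Debug, PartialEq)]
enum Event { PaymentSent, PaymentFailed }

#[derive(Default)]
struct PendingPayment { parts: usize, resolved: usize, any_claimed: bool }

// Resolve one MPP part; emit a terminal event once every *known* part is resolved.
fn resolve_part(
	payments: &mut HashMap<u64, PendingPayment>, events: &mut Vec<Event>, id: u64,
	outcome: PartOutcome,
) {
	let payment = payments.get_mut(&id).unwrap();
	payment.resolved += 1;
	payment.any_claimed |= outcome == PartOutcome::Claimed;
	if payment.resolved == payment.parts {
		events.push(if payment.any_claimed { Event::PaymentSent } else { Event::PaymentFailed });
	}
}

fn main() {
	// One MPP payment (id 1) with one part in each of two channel monitors.
	let monitors = [(1u64, PartOutcome::Failed), (1u64, PartOutcome::Claimed)];

	// Buggy shape: insert the part and resolve it in the same loop iteration, so the
	// first iteration only ever knows about a single part.
	let (mut payments, mut events) = (HashMap::new(), Vec::new());
	for (id, outcome) in monitors.iter().copied() {
		payments.entry(id).or_insert_with(PendingPayment::default).parts += 1;
		resolve_part(&mut payments, &mut events, id, outcome);
	}
	assert_eq!(events, vec![Event::PaymentFailed, Event::PaymentSent]); // both events!

	// Fixed shape: learn about every part first, then replay the outcomes.
	let (mut payments, mut events) = (HashMap::new(), Vec::new());
	for (id, _) in monitors.iter().copied() {
		payments.entry(id).or_insert_with(PendingPayment::default).parts += 1;
	}
	for (id, outcome) in monitors.iter().copied() {
		resolve_part(&mut payments, &mut events, id, outcome);
	}
	assert_eq!(events, vec![Event::PaymentSent]); // a single terminal event
}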
lightning/src/ln/channelmanager.rs (outdated)
for (channel_id, monitor) in args.channel_monitors.iter() {
	let mut is_channel_closed = false;
	let counterparty_node_id = monitor.get_counterparty_node_id();
	if let Some(peer_state_mtx) = per_peer_state.get(&counterparty_node_id) {
		let mut peer_state_lock = peer_state_mtx.lock().unwrap();
		let peer_state = &mut *peer_state_lock;
		is_channel_closed = !peer_state.channel_by_id.contains_key(channel_id);
	}
Would appreciate some more context on this part of the commit message: "this can lead to a pending payment getting re-added and re-claimed multiple times" (phrased as bad).
It looks like in the prior code and this new code, we'll call `OutboundPayments::insert_from_monitor_on_startup` for each `session_priv` (seems fine), and then call `claim_htlc` for each `session_priv` (seems fine since this will only generate a `PaymentSent` event on the first call to `claim_htlc`). Can you help me understand what I'm missing that was buggy in the prior code?
I rewrote the commit message.
($holder_commitment: expr, $htlc_iter: expr) => {
	for (htlc, source) in $htlc_iter {
		let filter = |v: &&IrrevocablyResolvedHTLC| {
			v.commitment_tx_output_idx == htlc.transaction_output_index
The output index can be `None` on both sides currently, is that intentional?
Uhhh, it wasn't, but I believe it is almost the correct behavior - if we have a dust HTLC that never made it to chain and the commitment is confirmed, we want to consider it resolved. I'll tweak it to make it clearer and more likely correct.
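As a rough illustration of the tweak being discussed (hypothetical, simplified types; not the actual patch), the comparison can be made explicit so that only a dust HTLC, with no output index on either side, matches by virtue of both being `None`:

// Hypothetical, simplified stand-ins for `IrrevocablyResolvedHTLC` and the HTLC
// being walked; a sketch of the explicit matching, not the code that landed.
struct ResolvedHtlc { commitment_tx_output_idx: Option<u32> }
struct WalkedHtlc { transaction_output_index: Option<u32> }

fn htlc_matches(resolved: &ResolvedHtlc, htlc: &WalkedHtlc) -> bool {
	match (resolved.commitment_tx_output_idx, htlc.transaction_output_index) {
		// Both refer to a concrete commitment output: compare the indices.
		(Some(a), Some(b)) => a == b,
		// Dust HTLC that never got its own output: treat it as resolved once the
		// commitment transaction itself is irrevocably confirmed.
		(None, None) => true,
		// One side made it to chain, the other is dust: not the same HTLC.
		_ => false,
	}
}

fn main() {
	let on_chain = ResolvedHtlc { commitment_tx_output_idx: Some(2) };
	assert!(htlc_matches(&on_chain, &WalkedHtlc { transaction_output_index: Some(2) }));
	let dust = ResolvedHtlc { commitment_tx_output_idx: None };
	assert!(htlc_matches(&dust, &WalkedHtlc { transaction_output_index: None }));
	assert!(!htlc_matches(&on_chain, &WalkedHtlc { transaction_output_index: None }));
}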
		};
		if let Some(state) = us.htlcs_resolved_on_chain.iter().filter(filter).next() {
			if let Some(source) = source {
				if state.payment_preimage.is_none() {
Tests pass when removing this check atm. Probably fine if the test is too annoying to write though.
I believe it's unobservable. There are several cases where we'd hit a theoretical `else` block, but the `ChannelManager` read logic actually replays preimages first, so we'd always actually replay the preimage first, then get here, spuriously include it, and not have anything to fail.
Force-pushed from 86a58b6 to 87ae44d.
✅ Added second reviewer: @joostjager
{
	walk_htlcs!(
		false,
		us.funding.counterparty_claimable_outpoints.get(&txid).unwrap().iter().filter_map(
Potential panic due to `unwrap()` on a `None` value. The code calls `us.funding.counterparty_claimable_outpoints.get(&txid).unwrap()` but there's no guarantee that the txid exists in the map. If the txid is not found, this will panic at runtime. Should use `if let Some(outpoints) = us.funding.counterparty_claimable_outpoints.get(&txid)` or a similar safe pattern instead of `unwrap()`.
Suggested change:
- us.funding.counterparty_claimable_outpoints.get(&txid).unwrap().iter().filter_map(
+ us.funding.counterparty_claimable_outpoints.get(&txid).unwrap_or(&Vec::new()).iter().filter_map(
"Failing HTLC with payment hash {} as it was resolved on-chain.", | ||
htlc.payment_hash | ||
); | ||
failed_htlcs.push(( |
Will we generate a `PaymentFailed` and/or `HTLCHandlingFailed` event for these HTLCs on every restart until the monitor is removed? Could be worth noting in a comment if so.
That's fixed in #3984 :)
LGTM after squash
Force-pushed from 8ae6661 to 69b281b.
Rebased to update against the new funding logic in `ChannelMonitor`.
Feel free to squash; this changed enough that I'll be doing another full pass.
Force-pushed from 69b281b to cc9db83.
Squashed without further changes, @wpaulino
`test_dup_htlc_onchain_doesnt_fail_on_reload` made reference to `ChainMonitor` persisting `ChannelMonitor`s on each new block, which hasn't been the case in some time. Instead, we update the comment and code to make it explicit that this doesn't impact the test.
During testing, we check that a `ChannelMonitor` will round-trip through serialization exactly. However, we recently added a fix to change a value in `PackageTemplate` on reload to fix some issues in the field in 0.1. This can cause the round-trip tests to fail as a field is modified during read. We fix it here by simply exempting the field from the equality test in the condition where it would be updated on read. We also make the `ChannelMonitor` `PartialEq` trait implementation non-public, as weird workarounds like this make clear that such a comparison is a brittle API at best.
On `ChannelManager` reload we rebuild the pending outbound payments list by looking for any missing payments in `ChannelMonitor`s. However, in the same loop over `ChannelMonitor`s, we also re-claim any pending payments which we see we have a payment preimage for. If we send an MPP payment across different channels, the result may be that we'll iterate the loop, and in each iteration add a pending payment with only one known path, then claim/fail it and remove the pending payment (at least for the claim case). This may result in spurious extra events, or even both a `PaymentFailed` and `PaymentSent` event on startup for the same payment.
Force-pushed from cc9db83 to 2c9934c.
Rebased and pushed with the following diff to account for the new channel closure monitor updates:
diff --git a/lightning/src/ln/monitor_tests.rs b/lightning/src/ln/monitor_tests.rs
index 609b39d336..8ac995e9b3 100644
--- a/lightning/src/ln/monitor_tests.rs
+++ b/lightning/src/ln/monitor_tests.rs
@@ -3425,7 +3425,7 @@ fn do_test_lost_preimage_monitor_events(on_counterparty_tx: bool) {
// latest state.
let mon_events = nodes[1].chain_monitor.chain_monitor.release_pending_monitor_events();
assert_eq!(mon_events.len(), 1);
- assert_eq!(mon_events[0].2.len(), 2);
+ assert_eq!(mon_events[0].2.len(), 3);
let node_ser = nodes[1].node.encode();
let mon_a_ser = get_monitor!(nodes[1], chan_a).encode();
@@ -3661,7 +3661,7 @@ fn do_test_lost_timeout_monitor_events(confirm_tx: CommitmentType, dust_htlcs: b
// latest state.
let mon_events = nodes[1].chain_monitor.chain_monitor.release_pending_monitor_events();
assert_eq!(mon_events.len(), 1);
- assert_eq!(mon_events[0].2.len(), 2);
+ assert_eq!(mon_events[0].2.len(), 3);
let node_ser = nodes[1].node.encode();
let mon_a_ser = get_monitor!(nodes[1], chan_a).encode();
lightning/src/ln/channelmanager.rs (outdated)
	}
}
for (channel_id, monitor) in args.channel_monitors.iter() {
	let mut is_channel_closed = false;
Shouldn't this be `true` in case we don't have the `per_peer_state` around either?
We're required to have a peer state around for any `ChannelMonitor` we have (to store the latest-update-id of the monitor), so it's not a bug, but I agree it should be cleaned up.
Note that this was copied from the later block.
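A small sketch of the cleanup being suggested, using hypothetical simplified types rather than the real `ChannelManager` internals: default to treating the channel as closed when no peer state exists at all, instead of defaulting to open:

use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

type ChannelId = [u8; 32];
type NodeId = [u8; 33];

struct PeerState { channel_by_id: HashSet<ChannelId> }

fn is_channel_closed(
	per_peer_state: &HashMap<NodeId, Mutex<PeerState>>, counterparty: &NodeId,
	channel_id: &ChannelId,
) -> bool {
	// Having no peer state should be unreachable here (we always store at least the
	// monitor's latest update id), but if it happens the channel is certainly not
	// live, so report it as closed instead of silently treating it as open.
	per_peer_state.get(counterparty).map_or(true, |peer_state_mtx| {
		!peer_state_mtx.lock().unwrap().channel_by_id.contains(channel_id)
	})
}

fn main() {
	let per_peer_state: HashMap<NodeId, Mutex<PeerState>> = HashMap::new();
	assert!(is_channel_closed(&per_peer_state, &[2; 33], &[1; 32]));
}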
@@ -5841,6 +5841,7 @@ impl<Signer: EcdsaChannelSigner> ChannelMonitorImpl<Signer> {
	on_to_local_output_csv: None,
},
});
+self.counterparty_fulfilled_htlcs.insert(SentHTLCId::from_source(&source), payment_preimage);
IIUC this change will now result in users seeing duplicate `PaymentSent`/`Forwarded` events after every reload once the channel closes?
Yes, that is fixed by #3984
	);
} else if confirmed_txid == funding.current_holder_commitment_tx.trust().txid() {
	walk_htlcs!(true, holder_commitment_htlcs!(us, CURRENT_WITH_SOURCES));
} else if let Some(prev_commitment_tx) = &funding.prev_holder_commitment_tx {
We could map the option to the `confirmed_txid == prev_commitment_tx.trust().txid()` bool and avoid the inner else branch, but maybe you find this less readable.
It's kinda hard to do without rustfmt vomiting all over the place :(
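For illustration, a self-contained sketch of that suggestion with hypothetical types (and ignoring how rustfmt would format the real call site): folding the `Option` into the `else if` condition via `map_or` removes the need for an inner `else` branch:

#[derive(Debug, PartialEq)]
enum Confirmed { Current, Prev, Neither }

struct CommitmentTx { txid: u64 }
struct Funding {
	current_holder_commitment_tx: CommitmentTx,
	prev_holder_commitment_tx: Option<CommitmentTx>,
}

// Mirrors the `else if` chain above: the `Option` is folded into the condition, so
// the `None` case needs no inner `else` branch.
fn which_commitment(funding: &Funding, confirmed_txid: u64) -> Confirmed {
	if confirmed_txid == funding.current_holder_commitment_tx.txid {
		Confirmed::Current
	} else if funding
		.prev_holder_commitment_tx
		.as_ref()
		.map_or(false, |prev| confirmed_txid == prev.txid)
	{
		Confirmed::Prev
	} else {
		Confirmed::Neither
	}
}

fn main() {
	let funding = Funding {
		current_holder_commitment_tx: CommitmentTx { txid: 1 },
		prev_holder_commitment_tx: Some(CommitmentTx { txid: 2 }),
	};
	assert_eq!(which_commitment(&funding, 1), Confirmed::Current);
	assert_eq!(which_commitment(&funding, 2), Confirmed::Prev);
	assert_eq!(which_commitment(&funding, 3), Confirmed::Neither);
}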
// We walk the set of HTLCs in the unrevoked counterparty commitment transactions (see
// `fail_unbroadcast_htlcs` for a description of why).
if let Some(ref txid) = us.funding.current_counterparty_commitment_txid {
	if let Some(htlcs) = us.funding.counterparty_claimable_outpoints.get(txid) {
We should `expect` this, no?
We can, yea.
While the case of not having a per-peer-state for a peer that we have a `ChannelMonitor` with should be unreachable by this point in `ChannelManager` loading (we have to at least store the latest update id of the monitor in the peer state), considering a channel as still-live and not replaying its payment state when we don't have a per-peer-state is wrong, so here we fix it.
`MonitorEvent`s aren't delivered to the `ChannelManager` in a durable fashion - if the `ChannelManager` fetches the pending `MonitorEvent`s, then the `ChannelMonitor` gets persisted (e.g. due to a block update), then the node crashes prior to persisting the `ChannelManager` again, the `MonitorEvent` and its effects on the `ChannelManager` will be lost. This isn't likely in a sync persist environment, but in an async one this could be an issue. Note that this is only an issue for closed channels - `MonitorEvent`s only inform the `ChannelManager` that a channel is closed (which the `ChannelManager` will learn on startup or when it next tries to advance the channel state), that `ChannelMonitorUpdate` writes completed (which the `ChannelManager` will detect on startup), or that HTLCs resolved on-chain post closure. Of the three, only the last is problematic to lose prior to a reload. When we restart and, during `ChannelManager` load, see a `ChannelMonitor` for a closed channel, we scan it for preimages that we passed to it and re-apply those to any pending or forwarded payments. However, we didn't scan it for preimages it learned from transactions on-chain. In cases where a `MonitorEvent` is lost, this can lead to a lost preimage. Here we fix it by simply tracking preimages we learned on-chain the same way we track preimages picked up during normal channel operation.
`MonitorEvent`s aren't delivered to the `ChannelManager` in a durable fashion - if the `ChannelManager` fetches the pending `MonitorEvent`s, then the `ChannelMonitor` gets persisted (e.g. due to a block update), then the node crashes prior to persisting the `ChannelManager` again, the `MonitorEvent` and its effects on the `ChannelManager` will be lost. This isn't likely in a sync persist environment, but in an async one this could be an issue. Note that this is only an issue for closed channels - `MonitorEvent`s only inform the `ChannelManager` that a channel is closed (which the `ChannelManager` will learn on startup or when it next tries to advance the channel state), that `ChannelMonitorUpdate` writes completed (which the `ChannelManager` will detect on startup), or that HTLCs resolved on-chain post closure. Of the three, only the last is problematic to lose prior to a reload. In a previous commit we handled the case of claimed HTLCs by replaying payment preimages on startup to avoid `MonitorEvent` loss causing us to miss an HTLC claim. Here we handle the HTLC-failed case similarly. Unlike with HTLC claims via preimage, we don't already have replay logic in `ChannelManager` startup, but it's easy enough to add one. Luckily, we already track when an HTLC reaches the permanently-failed state in `ChannelMonitor` (i.e. it has `ANTI_REORG_DELAY` confirmations on-chain on the failing transaction), so all we need to do is add the ability to query for that and fail them on `ChannelManager` startup.
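As a toy model of the shape described here — every type and function name below is hypothetical, not the LDK API — the monitor exposes its irrevocably-failed HTLCs and, on startup, the manager fails any of them it still considers pending, idempotently:

use std::collections::HashMap;

type HtlcId = u64;

#[derive(Debug, PartialEq)]
enum HtlcState { Pending, Failed }

struct ClosedChannelMonitor { onchain_failed_htlcs: Vec<HtlcId> }

impl ClosedChannelMonitor {
	// Stands in for "add the ability to query for that": HTLCs which reached the
	// permanently-failed state (the failing tx has ANTI_REORG_DELAY confirmations).
	fn get_onchain_failed_htlcs(&self) -> &[HtlcId] { &self.onchain_failed_htlcs }
}

// On startup, fail any HTLC the manager still thinks is pending but which a closed
// channel's monitor has already seen fail irrevocably on-chain. The state check
// keeps the replay idempotent across repeated restarts.
fn replay_failures_on_startup(
	monitors: &[ClosedChannelMonitor], pending_htlcs: &mut HashMap<HtlcId, HtlcState>,
) {
	for monitor in monitors {
		for htlc_id in monitor.get_onchain_failed_htlcs() {
			if let Some(state) = pending_htlcs.get_mut(htlc_id) {
				if *state == HtlcState::Pending {
					*state = HtlcState::Failed;
				}
			}
		}
	}
}

fn main() {
	let monitors = [ClosedChannelMonitor { onchain_failed_htlcs: vec![7] }];
	let mut pending = HashMap::from([(7, HtlcState::Pending), (8, HtlcState::Pending)]);
	replay_failures_on_startup(&monitors, &mut pending);
	assert_eq!(pending[&7], HtlcState::Failed);
	assert_eq!(pending[&8], HtlcState::Pending);
}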
Force-pushed from 2c9934c to be18acd.
Some tests are failing.
Huh, must be more rebase fallout hitting that.
This is the first few commits from #3984, which handles some of the trivial cases but skips actually marking `MonitorEvent`s completed, which is needed for some further edge cases.