Replay lost `MonitorEvent`s in some cases for closed channels (#4004)
base: main
Conversation
Force-pushed from 658c408 to dc20515.
One of the new tests is failing.
Force-pushed from dc20515 to e076dd5.
Oops, silent rebase conflict due to #4001.
Test is still failing.
Force-pushed from e076dd5 to 86a58b6.
Grr, the #4001 changes were random so I had to fix all the paths; should be fine now.
Codecov Report — coverage diff:
## main #4004 +/- ##
==========================================
+ Coverage 88.61% 88.79% +0.17%
==========================================
Files 174 175 +1
Lines 127640 128268 +628
Branches 127640 128268 +628
==========================================
+ Hits 113113 113891 +778
+ Misses 12046 11809 -237
- Partials 2481 2568 +87
lightning/src/ln/channelmanager.rs (outdated)
for (channel_id, monitor) in args.channel_monitors.iter() {
	let mut is_channel_closed = false;
	let counterparty_node_id = monitor.get_counterparty_node_id();
	if let Some(peer_state_mtx) = per_peer_state.get(&counterparty_node_id) {
		let mut peer_state_lock = peer_state_mtx.lock().unwrap();
		let peer_state = &mut *peer_state_lock;
		is_channel_closed = !peer_state.channel_by_id.contains_key(channel_id);
	}
I'm not seeing the bug that this is fixing -- if a payment is failed after being fulfilled in `outbound_payments`, we'll terminate early after noticing it's fulfilled already in `OutboundPayments::fail_htlc`.
Context from the commit message: "... we could get both a `PaymentFailed` and a `PaymentClaimed` event on startup for the same payment"
Let's say we send an MPP payment along two different channels. One is claimed and the other failed (who knows why, the recipient just decided they don't like money for whatever reason).
Going through the single loop we may first find the failed-HTLC channel - we'll add the pending payment in `insert_from_monitor_on_startup`, which will just add the one part, then we'll see it failed and generate a `PaymentFailed` event since there's only one part. Then we'll go to the second channel and repeat the same process, but now with a `PaymentSent` event.
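To make the ordering issue concrete, here is a minimal self-contained sketch of the shape of the bug. All types and names below are hypothetical stand-ins, not LDK code: resolving each part inside the same loop iteration that inserts it yields both a failure and a success event, while learning about every part before replaying outcomes yields a single terminal event.

use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum PartOutcome { Claimed, Failed }

#[derive(Debug, PartialEq)]
enum Event { PaymentSent, PaymentFailed }

#[derive(Default)]
struct PendingPayment { parts: usize, resolved: usize, any_claimed: bool }

// Resolve one MPP part; emit a terminal event once every *known* part is resolved.
fn resolve_part(
	payments: &mut HashMap<u64, PendingPayment>, events: &mut Vec<Event>, id: u64,
	outcome: PartOutcome,
) {
	let payment = payments.get_mut(&id).unwrap();
	payment.resolved += 1;
	payment.any_claimed |= outcome == PartOutcome::Claimed;
	if payment.resolved == payment.parts {
		events.push(if payment.any_claimed { Event::PaymentSent } else { Event::PaymentFailed });
	}
}

fn main() {
	// One MPP payment (id 1) with one part in each of two channel monitors.
	let monitors = [(1u64, PartOutcome::Failed), (1u64, PartOutcome::Claimed)];

	// Buggy shape: insert the part and resolve it in the same loop iteration, so the
	// first iteration only ever knows about a single part.
	let (mut payments, mut events) = (HashMap::new(), Vec::new());
	for (id, outcome) in monitors.iter().copied() {
		payments.entry(id).or_insert_with(PendingPayment::default).parts += 1;
		resolve_part(&mut payments, &mut events, id, outcome);
	}
	assert_eq!(events, vec![Event::PaymentFailed, Event::PaymentSent]); // both events!

	// Fixed shape: learn about every part first, then replay the outcomes.
	let (mut payments, mut events) = (HashMap::new(), Vec::new());
	for (id, _) in monitors.iter().copied() {
		payments.entry(id).or_insert_with(PendingPayment::default).parts += 1;
	}
	for (id, outcome) in monitors.iter().copied() {
		resolve_part(&mut payments, &mut events, id, outcome);
	}
	assert_eq!(events, vec![Event::PaymentSent]); // a single terminal event
}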
lightning/src/ln/channelmanager.rs (outdated)
for (channel_id, monitor) in args.channel_monitors.iter() {
	let mut is_channel_closed = false;
	let counterparty_node_id = monitor.get_counterparty_node_id();
	if let Some(peer_state_mtx) = per_peer_state.get(&counterparty_node_id) {
		let mut peer_state_lock = peer_state_mtx.lock().unwrap();
		let peer_state = &mut *peer_state_lock;
		is_channel_closed = !peer_state.channel_by_id.contains_key(channel_id);
	}
Would appreciate some more context on this part of the commit message: "this can lead to a pending payment getting re-added and re-claimed multiple times" (phrased as bad).
It looks like in the prior code and this new code, we'll call `OutboundPayments::insert_from_monitor_on_startup` for each `session_priv` (seems fine), and then call `claim_htlc` for each `session_priv` (seems fine since this will only generate a `PaymentSent` event on the first call to `claim_htlc`). Can you help me understand what I'm missing that was buggy in the prior code?
I rewrote the commit message.
($holder_commitment: expr, $htlc_iter: expr) => {
	for (htlc, source) in $htlc_iter {
		let filter = |v: &&IrrevocablyResolvedHTLC| {
			v.commitment_tx_output_idx == htlc.transaction_output_index
The output index can be `None` on both sides currently, is that intentional?
Uhhh, it wasn't, but I believe it is almost the correct behavior - if we have a dust HTLC that never made it to chain and the commitment is confirmed, we want to consider it resolved. I'll tweak it to make it clearer and more likely correct.
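As a rough illustration of the tweak being discussed (hypothetical, simplified types; not the actual patch), the comparison can be made explicit so that only a dust HTLC, with no output index on either side, matches by virtue of both being `None`:

// Hypothetical, simplified stand-ins for `IrrevocablyResolvedHTLC` and the HTLC
// being walked; a sketch of the explicit matching, not the code that landed.
struct ResolvedHtlc { commitment_tx_output_idx: Option<u32> }
struct WalkedHtlc { transaction_output_index: Option<u32> }

fn htlc_matches(resolved: &ResolvedHtlc, htlc: &WalkedHtlc) -> bool {
	match (resolved.commitment_tx_output_idx, htlc.transaction_output_index) {
		// Both refer to a concrete commitment output: compare the indices.
		(Some(a), Some(b)) => a == b,
		// Dust HTLC that never got its own output: treat it as resolved once the
		// commitment transaction itself is irrevocably confirmed.
		(None, None) => true,
		// One side made it to chain, the other is dust: not the same HTLC.
		_ => false,
	}
}

fn main() {
	let on_chain = ResolvedHtlc { commitment_tx_output_idx: Some(2) };
	assert!(htlc_matches(&on_chain, &WalkedHtlc { transaction_output_index: Some(2) }));
	let dust = ResolvedHtlc { commitment_tx_output_idx: None };
	assert!(htlc_matches(&dust, &WalkedHtlc { transaction_output_index: None }));
	assert!(!htlc_matches(&on_chain, &WalkedHtlc { transaction_output_index: None }));
}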
		};
		if let Some(state) = us.htlcs_resolved_on_chain.iter().filter(filter).next() {
			if let Some(source) = source {
				if state.payment_preimage.is_none() {
Tests pass when removing this check atm. Probably fine if the test is too annoying to write though.
I believe it's unobservable. There are several cases where we'd hit a theoretical `else` block, but the `ChannelManager` read logic actually replays preimages first, so we'd always actually replay the preimage first, then get here, spuriously include it, and not have anything to fail.
Force-pushed from 86a58b6 to 87ae44d.
✅ Added second reviewer: @joostjager
{
	walk_htlcs!(
		false,
		us.funding.counterparty_claimable_outpoints.get(&txid).unwrap().iter().filter_map(
Potential panic due to `unwrap()` on a `None` value. The code calls `us.funding.counterparty_claimable_outpoints.get(&txid).unwrap()` but there's no guarantee that the txid exists in the map. If the txid is not found, this will panic at runtime. Should use `if let Some(outpoints) = us.funding.counterparty_claimable_outpoints.get(&txid)` or a similar safe pattern instead of `unwrap()`.
Suggested change:
- us.funding.counterparty_claimable_outpoints.get(&txid).unwrap().iter().filter_map(
+ us.funding.counterparty_claimable_outpoints.get(&txid).unwrap_or(&Vec::new()).iter().filter_map(
"Failing HTLC with payment hash {} as it was resolved on-chain.", | ||
htlc.payment_hash | ||
); | ||
failed_htlcs.push(( |
Will we generate a `PaymentFailed` and/or `HTLCHandlingFailed` event for these HTLCs on every restart until the monitor is removed? Could be worth noting in a comment if so.
That's fixed in #3984 :)
LGTM after squash
Force-pushed from 8ae6661 to 69b281b.
Rebased to update against the new funding logic in `ChannelMonitor`.
Feel free to squash; this changed enough that I'll be doing another full pass.
Force-pushed from 69b281b to cc9db83.
Squashed without further changes, @wpaulino
`test_dup_htlc_onchain_doesnt_fail_on_reload` made reference to `ChainMonitor` persisting `ChannelMonitor`s on each new block, which hasn't been the case in some time. Instead, we update the comment and code to make it explicit that this doesn't impact the test.
During testing, we check that a `ChannelMonitor` will round-trip through serialization exactly. However, we recently added a fix to change a value in `PackageTemplate` on reload to fix some issues in the field in 0.1. This can cause the round-trip tests to fail as a field is modified during read. We fix it here by simply exempting the field from the equality test in the condition where it would be updated on read. We also make the `ChannelMonitor` `PartialEq` trait implementation non-public, as weird workarounds like this make clear that such a comparison is a brittle API at best.
On `ChannelManager` reload we rebuild the pending outbound payments list by looking for any missing payments in `ChannelMonitor`s. However, in the same loop over `ChannelMonitor`s, we also re-claim any pending payments which we see we have a payment preimage for. If we send an MPP payment across different channels, the result may be that we'll iterate the loop, and in each iteration add a pending payment with only one known path, then claim/fail it and remove the pending payment (at least for the claim case). This may result in spurious extra events, or even both a `PaymentFailed` and `PaymentSent` event on startup for the same payment.
Force-pushed from cc9db83 to 2c9934c.
Rebased and pushed with the following diff to account for the new channel closure monitor updates:
diff --git a/lightning/src/ln/monitor_tests.rs b/lightning/src/ln/monitor_tests.rs
index 609b39d336..8ac995e9b3 100644
--- a/lightning/src/ln/monitor_tests.rs
+++ b/lightning/src/ln/monitor_tests.rs
@@ -3425,7 +3425,7 @@ fn do_test_lost_preimage_monitor_events(on_counterparty_tx: bool) {
// latest state.
let mon_events = nodes[1].chain_monitor.chain_monitor.release_pending_monitor_events();
assert_eq!(mon_events.len(), 1);
- assert_eq!(mon_events[0].2.len(), 2);
+ assert_eq!(mon_events[0].2.len(), 3);
let node_ser = nodes[1].node.encode();
let mon_a_ser = get_monitor!(nodes[1], chan_a).encode();
@@ -3661,7 +3661,7 @@ fn do_test_lost_timeout_monitor_events(confirm_tx: CommitmentType, dust_htlcs: b
// latest state.
let mon_events = nodes[1].chain_monitor.chain_monitor.release_pending_monitor_events();
assert_eq!(mon_events.len(), 1);
- assert_eq!(mon_events[0].2.len(), 2);
+ assert_eq!(mon_events[0].2.len(), 3);
let node_ser = nodes[1].node.encode();
let mon_a_ser = get_monitor!(nodes[1], chan_a).encode();
lightning/src/ln/channelmanager.rs (outdated)
	}
}
for (channel_id, monitor) in args.channel_monitors.iter() {
	let mut is_channel_closed = false;
Shouldn't this be `true` in case we don't have the `per_peer_state` around either?
We're required to have a peer state around for any `ChannelMonitor` we have (to store the latest-update-id of the monitor), so it's not a bug, but I agree it should be cleaned up.
Note that this was copied from the later block.
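A small sketch of the cleanup being suggested, using hypothetical simplified types rather than the real `ChannelManager` internals: default to treating the channel as closed when no peer state exists at all, instead of defaulting to open:

use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

type ChannelId = [u8; 32];
type NodeId = [u8; 33];

struct PeerState { channel_by_id: HashSet<ChannelId> }

fn is_channel_closed(
	per_peer_state: &HashMap<NodeId, Mutex<PeerState>>, counterparty: &NodeId,
	channel_id: &ChannelId,
) -> bool {
	// Having no peer state should be unreachable here (we always store at least the
	// monitor's latest update id), but if it happens the channel is certainly not
	// live, so report it as closed instead of silently treating it as open.
	per_peer_state.get(counterparty).map_or(true, |peer_state_mtx| {
		!peer_state_mtx.lock().unwrap().channel_by_id.contains(channel_id)
	})
}

fn main() {
	let per_peer_state: HashMap<NodeId, Mutex<PeerState>> = HashMap::new();
	assert!(is_channel_closed(&per_peer_state, &[2; 33], &[1; 32]));
}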
@@ -5841,6 +5841,7 @@ impl<Signer: EcdsaChannelSigner> ChannelMonitorImpl<Signer> {
	on_to_local_output_csv: None,
},
});
+self.counterparty_fulfilled_htlcs.insert(SentHTLCId::from_source(&source), payment_preimage);
IIUC this change will now result in users seeing duplicate `PaymentSent`/`Forwarded` events after every reload once the channel closes?
Yes, that is fixed by #3984
	);
} else if confirmed_txid == funding.current_holder_commitment_tx.trust().txid() {
	walk_htlcs!(true, holder_commitment_htlcs!(us, CURRENT_WITH_SOURCES));
} else if let Some(prev_commitment_tx) = &funding.prev_holder_commitment_tx {
We could map the option to the `confirmed_txid == prev_commitment_tx.trust().txid()` bool and avoid the inner else branch, but maybe you find this less readable.
It's kinda hard to do without rustfmt vomiting all over the place :(
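For illustration, a self-contained sketch of that suggestion with hypothetical types (and ignoring how rustfmt would format the real call site): folding the `Option` into the `else if` condition via `map_or` removes the need for an inner `else` branch:

#[derive(Debug, PartialEq)]
enum Confirmed { Current, Prev, Neither }

struct CommitmentTx { txid: u64 }
struct Funding {
	current_holder_commitment_tx: CommitmentTx,
	prev_holder_commitment_tx: Option<CommitmentTx>,
}

// Mirrors the `else if` chain above: the `Option` is folded into the condition, so
// the `None` case needs no inner `else` branch.
fn which_commitment(funding: &Funding, confirmed_txid: u64) -> Confirmed {
	if confirmed_txid == funding.current_holder_commitment_tx.txid {
		Confirmed::Current
	} else if funding
		.prev_holder_commitment_tx
		.as_ref()
		.map_or(false, |prev| confirmed_txid == prev.txid)
	{
		Confirmed::Prev
	} else {
		Confirmed::Neither
	}
}

fn main() {
	let funding = Funding {
		current_holder_commitment_tx: CommitmentTx { txid: 1 },
		prev_holder_commitment_tx: Some(CommitmentTx { txid: 2 }),
	};
	assert_eq!(which_commitment(&funding, 1), Confirmed::Current);
	assert_eq!(which_commitment(&funding, 2), Confirmed::Prev);
	assert_eq!(which_commitment(&funding, 3), Confirmed::Neither);
}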
// We walk the set of HTLCs in the unrevoked counterparty commitment transactions (see
// `fail_unbroadcast_htlcs` for a description of why).
if let Some(ref txid) = us.funding.current_counterparty_commitment_txid {
	if let Some(htlcs) = us.funding.counterparty_claimable_outpoints.get(txid) {
We should `expect` this, no?
We can, yea.
While the case of not having a per-peer-state for a peer that we have a `ChannelMonitor` with should be unreachable by this point in `ChannelManager` loading (we have to at least store the latest update id of the monitor in the peer state), considering a channel as still-live and not replaying its payment state when we don't have a per-peer-state is wrong, so here we fix it.
`MonitorEvent`s aren't delivered to the `ChannelManager` in a durable fashion - if the `ChannelManager` fetches the pending `MonitorEvent`s, then the `ChannelMonitor` gets persisted (e.g. due to a block update), then the node crashes prior to persisting the `ChannelManager` again, the `MonitorEvent` and its effects on the `ChannelManager` will be lost. This isn't likely in a sync persist environment, but in an async one this could be an issue. Note that this is only an issue for closed channels - `MonitorEvent`s only inform the `ChannelManager` that a channel is closed (which the `ChannelManager` will learn on startup or when it next tries to advance the channel state), that `ChannelMonitorUpdate` writes completed (which the `ChannelManager` will detect on startup), or that HTLCs resolved on-chain post closure. Of the three, only the last is problematic to lose prior to a reload. When we restart and, during `ChannelManager` load, see a `ChannelMonitor` for a closed channel, we scan it for preimages that we passed to it and re-apply those to any pending or forwarded payments. However, we didn't scan it for preimages it learned from transactions on-chain. In cases where a `MonitorEvent` is lost, this can lead to a lost preimage. Here we fix it by simply tracking preimages we learned on-chain the same way we track preimages picked up during normal channel operation.
`MonitorEvent`s aren't delivered to the `ChannelManager` in a durable fashion - if the `ChannelManager` fetches the pending `MonitorEvent`s, then the `ChannelMonitor` gets persisted (e.g. due to a block update), then the node crashes prior to persisting the `ChannelManager` again, the `MonitorEvent` and its effects on the `ChannelManager` will be lost. This isn't likely in a sync persist environment, but in an async one this could be an issue. Note that this is only an issue for closed channels - `MonitorEvent`s only inform the `ChannelManager` that a channel is closed (which the `ChannelManager` will learn on startup or when it next tries to advance the channel state), that `ChannelMonitorUpdate` writes completed (which the `ChannelManager` will detect on startup), or that HTLCs resolved on-chain post closure. Of the three, only the last is problematic to lose prior to a reload. In a previous commit we handled the case of claimed HTLCs by replaying payment preimages on startup to avoid `MonitorEvent` loss causing us to miss an HTLC claim. Here we handle the HTLC-failed case similarly. Unlike with HTLC claims via preimage, we don't already have replay logic in `ChannelManager` startup, but it's easy enough to add one. Luckily, we already track when an HTLC reaches the permanently-failed state in `ChannelMonitor` (i.e. it has `ANTI_REORG_DELAY` confirmations on-chain on the failing transaction), so all we need to do is add the ability to query for that and fail them on `ChannelManager` startup.
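As a toy model of the shape described here — every type and function name below is hypothetical, not the LDK API — the monitor exposes its irrevocably-failed HTLCs and, on startup, the manager fails any of them it still considers pending, idempotently:

use std::collections::HashMap;

type HtlcId = u64;

#[derive(Debug, PartialEq)]
enum HtlcState { Pending, Failed }

struct ClosedChannelMonitor { onchain_failed_htlcs: Vec<HtlcId> }

impl ClosedChannelMonitor {
	// Stands in for "add the ability to query for that": HTLCs which reached the
	// permanently-failed state (the failing tx has ANTI_REORG_DELAY confirmations).
	fn get_onchain_failed_htlcs(&self) -> &[HtlcId] { &self.onchain_failed_htlcs }
}

// On startup, fail any HTLC the manager still thinks is pending but which a closed
// channel's monitor has already seen fail irrevocably on-chain. The state check
// keeps the replay idempotent across repeated restarts.
fn replay_failures_on_startup(
	monitors: &[ClosedChannelMonitor], pending_htlcs: &mut HashMap<HtlcId, HtlcState>,
) {
	for monitor in monitors {
		for htlc_id in monitor.get_onchain_failed_htlcs() {
			if let Some(state) = pending_htlcs.get_mut(htlc_id) {
				if *state == HtlcState::Pending {
					*state = HtlcState::Failed;
				}
			}
		}
	}
}

fn main() {
	let monitors = [ClosedChannelMonitor { onchain_failed_htlcs: vec![7] }];
	let mut pending = HashMap::from([(7, HtlcState::Pending), (8, HtlcState::Pending)]);
	replay_failures_on_startup(&monitors, &mut pending);
	assert_eq!(pending[&7], HtlcState::Failed);
	assert_eq!(pending[&8], HtlcState::Pending);
}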
Force-pushed from 2c9934c to be18acd.
Some tests are failing.
Huh, must be more rebase fallout hitting that.
This is the first few commits from #3984, which handles some of the trivial cases but skips actually marking `MonitorEvent`s completed, which is needed for some further edge cases.