
Conversation

TheBlueMatt
Collaborator

If a channel operation occurs while a peer is disconnected, we'll
still have a `PeerState` for that peer. Historically, we haven't
bothered to check if a peer is actually connected before we push
new messages onto the `PeerState::pending_msg_events` field,
leaving us sending messages into the void.

This is generally not an issue, as
`ChannelManager::get_and_clear_pending_msg_events` should be called
very regularly, removing these messages and then dropping them, as
`PeerManager` won't have anything to do with them.

Still, there is a race condition here - if a peer manages to
connect between the generation of one such message and when
`get_and_clear_pending_msg_events` is called, we may enqueue a
message to the peer which makes no sense and could lead to a
spurious channel closure (especially in the case of an async
`ChannelMonitorUpdate` completion or async signing operation, which
often leads to normal channel message generation).

Further, if a peer is slow to send their `channel_reestablish`
message after connection, this race could be substantially more
likely, as such normal channel messages may be nonsense until we've
completed the reestablish dance (i.e. the later reestablish dance
may lead us to re-send the same messages again immediately).

Here we remove most of the cases where we enqueue messages for
disconnected peers.

Note that we differentiate between two different checks for
connectedness - for cases where we're sending an `error` or gossip
message, we allow the message to be enqueued if the peer is
connected at all. For most other cases, we only allow messages to
be enqueued if the peer is connected *and* the channel has
completed its reestablish dance (if required, i.e. the channel is
"connected").

Fixes #4036

If a channel is failed while a peer is disconnected, we'll still
have a `PeerState` for that peer. Historically, we haven't bothered
to check if a peer is actually connected before we push the `error`
message onto the `PeerState::pending_msg_events` queue, leaving us
sending messages into the void.

This is generally not an issue as
`ChannelManager::get_and_clear_pending_msg_events` should be called
very regularly, removing these messages and then dropping them as
`PeerManager` won't have anything to do with them. Further, when
the message is an `error`, if a peer happens to connect between
when we push the message and when `get_and_clear_pending_msg_events`
is called, the worst that happens is they get the `error` message
we'd end up sending them when they try to reestablish the channel
anyway.

Still, it's awkward to leave the `error`s lying around in a message
queue for a disconnected peer, so we remove them here.
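A sketch of the kind of guard described here, again with illustrative
stand-in types (the real queue is `PeerState::pending_msg_events`, which holds
`MessageSendEvent`s rather than strings):

```rust
/// Illustrative stand-in for LDK's per-peer state.
struct PeerState {
    is_connected: bool,
    pending_msg_events: Vec<String>,
}

/// Only queue the `error` for a connected peer; a disconnected peer will get
/// an equivalent `error` when it tries to reestablish the channel anyway.
fn queue_channel_error(peer: &mut PeerState, error_msg: String) {
    if peer.is_connected {
        peer.pending_msg_events.push(error_msg);
    }
}
```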
In the previous two commits we stopped enqueueing messages for
disconnected peers. Here we add some basic test coverage of these
changes by asserting that there are no queued messages for a peer
when they (re-)connect.

We also add a few assertions that a peer is connected when we push
a message onto the queue, in cases where we expect a peer to be
connected.
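Roughly the shape of those assertions, once more with stand-in types (the real
tests assert on `ChannelManager::get_and_clear_pending_msg_events` and the
per-peer queue; the helpers below are hypothetical):

```rust
/// Stand-in for LDK's per-peer state, for illustration only.
struct PeerState {
    is_connected: bool,
    pending_msg_events: Vec<String>,
}

/// On (re-)connect, nothing should have been queued while the peer was away.
fn on_peer_connected(peer: &mut PeerState) {
    debug_assert!(
        peer.pending_msg_events.is_empty(),
        "no messages should be queued for a disconnected peer"
    );
    peer.is_connected = true;
}

/// For message types only generated while a peer is connected, assert that
/// invariant at the point we push onto the queue.
fn queue_msg_expecting_connected(peer: &mut PeerState, msg: String) {
    debug_assert!(peer.is_connected, "peer expected to be connected here");
    peer.pending_msg_events.push(msg);
}
```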
TheBlueMatt added this to the 0.2 milestone on Sep 19, 2025
@ldk-reviews-bot

ldk-reviews-bot commented Sep 19, 2025

👋 Thanks for assigning @tnull as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

tnull self-requested a review on September 19, 2025 at 12:40

codecov bot commented Sep 19, 2025

Codecov Report

❌ Patch coverage is 82.53012% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.60%. Comparing base (50391d3) to head (47a3e5c).
⚠️ Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| lightning/src/ln/channelmanager.rs | 79.57% | 22 Missing and 7 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4094      +/-   ##
==========================================
- Coverage   88.61%   88.60%   -0.01%     
==========================================
  Files         176      176              
  Lines      132099   132161      +62     
  Branches   132099   132161      +62     
==========================================
+ Hits       117060   117103      +43     
- Misses      12365    12386      +21     
+ Partials     2674     2672       -2     
| Flag | Coverage Δ |
| --- | --- |
| fuzzing | 21.56% <13.19%> (-0.38%) ⬇️ |
| tests | 88.44% <82.53%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

Successfully merging this pull request may close these issues.

Async Monitor Update completion while peer disconnected can generate sendable messages