Rollback "claimed" runs to "available" if worker doesn't join run channel #3566
base: main
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main    #3566      +/-   ##
==========================================
- Coverage   90.27%   90.21%    -0.07%
==========================================
  Files         386      386
  Lines       15869    15921       +52
==========================================
+ Hits        14326    14363       +37
- Misses       1543     1558       +15
@rorymckinley, @josephjclark, @stuartc, I feel pretty good about this with one straggling question... reposting it from the PR itself down here:
@taylordowns2000 I have some silly questions about the implementation and also I think there are places where we could screw down the tests a bit more.
# Notify the worker channel that this run channel has been joined
# Use PubSub for cross-node communication in clustered environments
case socket.assigns[:worker_id] do
@taylordowns2000 run_channel_test.exs does not seem to have been updated to cover this additional functionality?
"Worker socket disconnected before claim reply, rolling back transaction for runs: #{Enum.map_join(original_runs, ", ", & &1.id)}" | ||
) | ||
|
||
Runs.rollback_claimed_runs(original_runs) |
@taylordowns2000 There does not appear to be test coverage here - is there a cheap way to test the case where the Process.alive? check fails?
# Store the timeout reference and original runs in socket assigns
assign(socket,
  claim_timeout_ref: timeout_ref,
@taylordowns2000 If I am reading this correctly, the reference to the timeout is set for the socket - does that make us vulnerable to the following:
Worker 1 sends a claim request for 1 run (R1)
Lightning sends back run details and sets timeout T1
Before Worker 1 joins the run channel and within the join timeout, Worker 1 sends through a claim for another run.
Lightning sends back run details (R2) and sets timeout T2.
Now, T1 has been replaced with T2, i.e. have we lost timeout coverage of R1?
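For illustration, one way to avoid losing coverage of R1 would be to key the timers by run id instead of keeping a single ref on the socket. This is a sketch only (the helper names and the claim_timeouts assign are hypothetical) and may not be how 0a874d9 actually solves it:

```elixir
# Sketch only: one timer per claimed run instead of one per socket.
# `claim_timeouts` is a hypothetical assign mapping run_id => timer ref.
defp schedule_claim_timeouts(socket, runs, timeout_ms) do
  new_refs =
    Map.new(runs, fn run ->
      {run.id, Process.send_after(self(), {:claim_timeout, run.id}, timeout_ms)}
    end)

  existing = socket.assigns[:claim_timeouts] || %{}
  assign(socket, claim_timeouts: Map.merge(existing, new_refs))
end

defp cancel_claim_timeout(socket, run_id) do
  {ref, remaining} = Map.pop(socket.assigns[:claim_timeouts] || %{}, run_id)
  if ref, do: Process.cancel_timer(ref)
  assign(socket, claim_timeouts: remaining)
end
```

With that shape, a second claim never clobbers the timer for an earlier one, since each run carries its own ref.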
great catch. investigating now
@rorymckinley, I think you were right... does this do the trick? 0a874d9
@taylordowns2000 It looks like it - I have a question re: this comment.
Is that functionality still coming?
No, we shouldn't roll them back. We should let the Janitor have its way with them if the worker channel closes
I've removed the comment. Thanks for catching it!
Thanks @taylordowns2000 - will you be adding tests later?
Only if Joe/Stu say yes - that they want this approach in the app.
Logger.info("Successfully rolled back #{count} runs")

# Clear the timeout reference and pending runs from socket assigns
socket = assign(socket, claim_timeout_ref: nil, pending_runs: nil)
@taylordowns2000 This line does not appear to have test coverage?
{:noreply, socket}
end
else
# Still have pending runs, update the list
@taylordowns2000 There does not appear to be test coverage for the case where there are runs remaining (and the timeout does not get cancelled)?
handle_info({:run_channel_joined, run_id}, socket)

_ ->
  # Message not for this worker, ignore
@taylordowns2000 Do we have a test to cover the case where the message is not for this worker?
"Worker channel terminating with pending runs, rolling back: #{Enum.map_join(pending_runs, ", ", & &1.id)}" | ||
) | ||
|
||
Runs.rollback_claimed_runs(pending_runs) |
@taylordowns2000 This line does not appear to have test coverage?
# No worker ID available, continue normally
:ok

worker_id ->
@taylordowns2000 Are we vulnerable to a race condition here? E.g. we retrieve the run, broadcast a message that the timer can be reset, and then send data back to the worker? In the gap between the broadcast and something acting on it, the timeout could kick in, unclaiming the run and potentially allowing a situation where two workers are running it.
This feels like the 'ack-ack' scenario (although we can use the run channel response as the 'ack-ack') - and I think that addresses your question about the Worker A, Worker B scenario.
When a worker joins the run channel, I think that Lightning should confirm that the run is still claimed, lock the run, and apply a change that makes rollback impossible (it feels like the responsible way to do this would be to introduce a new state), and only then send the :ok back to the worker.
If those conditions are not met, the worker gets an error.
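As a rough illustration of that confirm-and-lock idea (the Runs.confirm_claim/1 helper and the new state it would set are assumptions, not code from this PR):

```elixir
# Sketch only: acknowledge the join only once the claim has been confirmed
# and the run moved into a hypothetical non-rollbackable state.
def join("run:" <> run_id, _payload, socket) do
  case Runs.confirm_claim(run_id) do
    {:ok, run} -> {:ok, assign(socket, :run, run)}
    {:error, reason} -> {:error, %{reason: to_string(reason)}}
  end
end
```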
@rorymckinley, would this be handled better by ensuring that a run channel can only be joined once? Or only by a single worker? (This was a thought I had in the initial PR.)
Eish, or would that prevent workers from re-connecting to the run channel if they got temporarily disconnected?
@taylordowns2000 Even if a single-connect is an option I do not think it would solve the problem entirely and may introduce a new problem:
Showing my workings, apologies for the repetition of what I said before.
I can't see any obvious place where we are checking the state of a run before we send stuff back to the worker (probably never needed it before?) - so race 1 would be that the run has already been rolled back (essentially Worker A sans Worker B). A single connect does not seem to solve this.
Race 2 would be that the run has not been rolled back, but in between the broadcast and the action the run gets rolled back. Using the Highlander approach you suggest would prevent another worker from picking up the run once it has been rolled back, but:
- The first worker would think that it is ok to work on this, but Lightning thinks it is available. What happens when the worker sends through 'started' - would Lightning need a moment to compose itself?
- If we only allow one connection to the run, would that block legitimate cases where we do want another worker to connect? So, let us say that Worker A does connect, but too late: the run has been rolled back but not claimed yet. In this case, we would tell Worker A to fly a kite and we would want Worker B to be able to connect, but the Highlander principle would block it?
And, oops: I forgot to add that I do not know enough about workers to know if it will reconnect mid-run to the run channel - @josephjclark?
Eish. Yeah, it does sound like adding another state (:locked?) might be useful. Let's see what @stuartc and @josephjclark think here before I invest more time in this.
:confirmed?
I believe the worker will just reconnect when the connection is lost. The socket should handle this. I'll test locally later and see if I can confirm.
{count, _} =
  from(r in Run, where: r.id in ^run_ids)
  |> Repo.update_all(
@taylordowns2000 To avoid race conditions, it may be worth locking each record, checking that the record should be rolled back and then rolling it back.
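Something along these lines, perhaps - a sketch only, with the state/claimed_at fields and the exact update set assumed rather than taken from this PR:

```elixir
# Sketch only: lock the rows inside a transaction, then roll back only those
# that are still :claimed, so a concurrent "started" update cannot interleave.
import Ecto.Query

def rollback_claimed_runs(runs) do
  run_ids = Enum.map(runs, & &1.id)

  Repo.transaction(fn ->
    # Take row locks so a competing update has to wait for this transaction.
    locked_ids =
      from(r in Run,
        where: r.id in ^run_ids and r.state == ^:claimed,
        select: r.id,
        lock: "FOR UPDATE"
      )
      |> Repo.all()

    from(r in Run, where: r.id in ^locked_ids)
    |> Repo.update_all(set: [state: :available, claimed_at: nil])
  end)
end
```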
Not sure, I keep changing my mind. It definitely feels like there are a few teeny tiny gaps where a claim might resolve just after the timeout, or two workers might join, or the rollback might occur just after a worker starts, or maybe the wrong worker joins. These are really tiny gaps, but possible. They might just be impossible to close?

If a worker is disconnected mid-run, it should reconnect and continue sending events. In fact, I've just tested this and it works great (although some events may time out if the connection is down for long enough).

So you could add validation that only one worker can be in the channel at a time, or that after a worker joins, only another worker with the same id can rejoin. But none of those are ironclad. I would leave it: this PR probably fixes 99% of the LostBeforeStart cases. Let's put that into the wild and come back to think about the remaining 1% if it remains a problem.
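If we did want the "same worker can rejoin" validation mentioned above, it could be as small as this (sketch only; Runs.get/1 and a worker_id field on the run are assumptions):

```elixir
# Sketch only: allow reconnects, but only by the worker that claimed the run.
def join("run:" <> run_id, _payload, socket) do
  run = Runs.get(run_id)
  worker_id = socket.assigns[:worker_id]

  cond do
    is_nil(run) -> {:error, %{reason: "not found"}}
    run.worker_id in [nil, worker_id] -> {:ok, assign(socket, :run, run)}
    true -> {:error, %{reason: "claimed by another worker"}}
  end
end
```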
One comment here; I will digest the "ack" aspect tomorrow. The idea of checking that the transport pid is alive feels odd to me: the socket and channel processes are (afaik) tied to the transport process, so if the transport has died they shouldn't be running either - i.e. that check will always be true. I haven't seen logs indicating a message failed because the transport was dead.

The original conversation was about a theory I had that transactions were continuing after the socket had closed. Since the claim handler is synchronous right now, I find it hard to believe the client is still listening for replies after 60s, yet we observed multiple claim transactions on the db still going after 60s. So I'm interested in trying to replicate this: either the client should have disconnected (and the channel process dies and takes the transaction with it), or the worker should still be waiting for a response. I really want to determine whether the worker is still connected, at least via a socket, but isn't going to consume the response.

So the logic seems OK as a safeguard for a worst-case scenario, since the reply isn't ack'd in our current design. But it shouldn't replace understanding what's happening (and I'm thinking it might get in the way of debugging the scenario above).
<% else %>
  <Common.combobox
-   items={assigns[:projects] || []}
+   items={(assigns[:projects] || []) |> Enum.sort_by(& &1.name)}
Cheeky. I think Elias has fixed this on main now?
{:reply, {:ok, %{runs: runs}}, socket}
# Check if the socket is still alive
if Process.alive?(socket.transport_pid) do
yeah I'm with @stuartc here - checking the socket is alive isn't the right thing.
Is this a hangover from the original PR that needs removing? Or is there still some value?
Does two things:
Question: Should we add some sort of state to the run_channel to make sure that it cannot be joined by more than one worker? (Imagine we roll back after 30 seconds, and right at that moment worker A joins the run channel. Worker B will then join also once it gets its own claim. It would be awesome to either invalidate the first key or block joining the run channel once we decide to roll back. Is that possible?)
Validation steps
RUN_CHANNEL_JOIN_TIMEOUT_SECONDS=15
AI Usage
Please disclose how you've used AI in this work (it's cool, we just want to know!):
AI review of changes
Add Claim Timeout Mechanism for Worker Channel
Summary
This PR implements a timeout mechanism for the WorkerChannel to prevent runs from being stuck in the :claimed state when workers disconnect or fail to join run channels after claiming runs. The mechanism works across clustered Lightning nodes using Phoenix.PubSub for cross-node communication.

Problem

Previously, when a worker claimed runs and then disconnected before joining the corresponding run channels, those runs would remain in the :claimed state indefinitely, preventing other workers from claiming them. This created a resource leak where runs became unavailable.

Solution

If a worker does not join the run channel for a claimed run within the configured timeout, the run is rolled back to the :available state so that other workers can claim it.
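A simplified sketch of the flow, reusing names that appear in the diff hunks quoted in the review above (claim_timeout_ref, pending_runs, {:run_channel_joined, run_id}, Runs.rollback_claimed_runs/1); it is illustrative only, not the exact implementation:

```elixir
# 1. After replying to a claim, remember the runs and start a timer:
#      assign(socket,
#        claim_timeout_ref: Process.send_after(self(), :claim_timeout, timeout_ms),
#        pending_runs: runs
#      )
#
# 2. RunChannel broadcasts {:run_channel_joined, run_id} over PubSub when the
#    worker joins; WorkerChannel drops that run from pending_runs and cancels
#    the timer once the list is empty.
#
# 3. If the timer fires first, the still-pending runs are rolled back:
def handle_info(:claim_timeout, socket) do
  case socket.assigns[:pending_runs] do
    [_ | _] = runs -> Runs.rollback_claimed_runs(runs)
    _ -> :ok
  end

  {:noreply, assign(socket, claim_timeout_ref: nil, pending_runs: nil)}
end
```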
Key Changes
Core Implementation

- lib/lightning_web/channels/worker_channel.ex: Added timeout logic, PubSub subscription, and rollback handling
- lib/lightning_web/channels/run_channel.ex: Added notification broadcast when run channels are joined
- lib/lightning/runs.ex: Added rollback_claimed_runs/1 function for database rollback

Configuration

- lib/lightning/config/bootstrap.ex: Added RUN_CHANNEL_JOIN_TIMEOUT_SECONDS environment variable
- lib/lightning/config.ex: Added configuration callback and renamed existing callbacks for consistency

Testing

- test/lightning_web/channels/worker_channel_test.exs: Added comprehensive tests for timeout and cancellation scenarios

Areas for Reviewers to Focus On
🔍 Critical Review Areas

- Cross-Node Communication Logic (worker_channel.ex:39-44, 168-179), including terminate/2
- Timeout and Rollback Logic (worker_channel.ex:83-102, 119-135): runs are rolled back to :available when the original_runs count > 0
- Database Rollback Function (runs.ex:350-374): the rollback_claimed_runs/1 function correctly updates run state
- Configuration Changes (config.ex:327-330, 453, 693-695)

🧪 Test Coverage

- Timeout Test (worker_channel_test.exs:139-170)
- Cancellation Test (worker_channel_test.exs:177-210)

Configuration
The timeout can be configured via the RUN_CHANNEL_JOIN_TIMEOUT_SECONDS environment variable (default: 30 seconds).
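For reference, a runtime-config sketch of how the variable might be wired up (this uses plain System.get_env/2; the PR's bootstrap.ex may read it differently):

```elixir
import Config

config :lightning, :run_channel_join_timeout_seconds,
  "RUN_CHANNEL_JOIN_TIMEOUT_SECONDS"
  |> System.get_env("30")
  |> String.to_integer()
```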
Testing

All existing tests pass, and new tests cover the timeout and cancellation scenarios.
Backward Compatibility
This change is fully backward compatible. Existing functionality remains unchanged, and the new mechanism only activates when runs are claimed.
You can read more details in our Responsible AI Policy
Pre-submission checklist
(:owner, :admin, :editor, :viewer)