
Conversation

@taylordowns2000 (Member) commented Sep 8, 2025

This PR does two things:

  1. Fixes #3565 ("Runs are marked as lost when they should be re-enqueued")
  2. Renames some variables to be less ambiguous - it was not clear whether they held seconds or milliseconds

Question: Should we add some sort of state to the run_channel to make sure that it cannot be joined by more than one worker? (Imagine we roll back after 30 seconds, and right at that moment worker A joins the run channel. Worker B will then join also once it gets its own claim. It would be awesome to either invalidate the first key or block joining the run channel once we decide to roll back. Is that possible?)

Validation steps

  1. set RUN_CHANNEL_JOIN_TIMEOUT_SECONDS=15
  2. add Process.sleep(11_000) as described in the issue (line 64/65 in worker_channel.ex in this PR)
  3. run a run and see what happens!
  4. it should be claimed, the worker should time out, and then after 30 seconds the run should be put back into the :available state.
  5. if you remove the timeout, everything should work as usual
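
A hypothetical sketch of step 2, just to show where the delay sits; the real handle_in/3 clause in worker_channel.ex is shaped differently, and claim_runs/2 is a made-up placeholder:

# Sketch only: delay the claim reply so the worker gives up before it ever
# joins the run channels for the runs it was just handed.
def handle_in("claim", %{"demand" => demand}, socket) do
  {:ok, runs} = claim_runs(demand, socket)

  # Step 2 of the validation steps above
  Process.sleep(11_000)

  {:reply, {:ok, %{runs: runs}}, socket}
end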

AI Usage

Please disclose how you've used AI in this work (it's cool, we just want to know!):

  • Code generation (copilot but not intellisense)
  • Learning or fact checking
  • Strategy / design
  • Optimisation / refactoring
  • Translation / spellchecking / doc gen
  • Other
  • I have not used AI
  • Description of changes

AI review of changes

Add Claim Timeout Mechanism for Worker Channel

Summary

This PR implements a timeout mechanism for the WorkerChannel to prevent runs from being stuck in :claimed state when workers disconnect or fail to join run channels after claiming runs. The mechanism works across clustered Lightning nodes using Phoenix.PubSub for cross-node communication.

Problem

Previously, when a worker claimed runs and then disconnected before joining the corresponding run channels, those runs would remain in :claimed state indefinitely, preventing other workers from claiming them. This created a resource leak where runs became unavailable.

Solution

  • Timeout Mechanism: After a worker claims runs, a configurable timeout (default 30 seconds) is started
  • Automatic Rollback: If the client doesn't join run channels within the timeout period, claimed runs are automatically rolled back to :available state
  • Cross-Node Communication: Uses Phoenix.PubSub to work across clustered Lightning nodes
  • Optimization: Only starts timeout when there are actually runs to track (no timeout for 0 runs)
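
A minimal sketch of the claim-timeout flow described above, assuming a :claim_timeout message name, a claim_runs/2 placeholder, and a Lightning.Config.run_channel_join_timeout_seconds/0 callback (none of these names are confirmed by the diff):

# On claim: only start a timer when runs were actually handed out.
def handle_in("claim", %{"demand" => demand}, socket) do
  {:ok, runs} = claim_runs(demand, socket)

  socket =
    if runs == [] do
      socket
    else
      timeout_ms = Lightning.Config.run_channel_join_timeout_seconds() * 1000
      ref = Process.send_after(self(), :claim_timeout, timeout_ms)
      assign(socket, claim_timeout_ref: ref, pending_runs: runs)
    end

  {:reply, {:ok, %{runs: runs}}, socket}
end

# On timeout: if no :run_channel_joined notifications have cleared the
# pending list, roll the claimed runs back to :available.
def handle_info(:claim_timeout, socket) do
  Runs.rollback_claimed_runs(socket.assigns.pending_runs)
  {:noreply, assign(socket, claim_timeout_ref: nil, pending_runs: nil)}
end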

Key Changes

Core Implementation
  • lib/lightning_web/channels/worker_channel.ex: Added timeout logic, PubSub subscription, and rollback handling
  • lib/lightning_web/channels/run_channel.ex: Added notification broadcast when run channels are joined
  • lib/lightning/runs.ex: Added rollback_claimed_runs/1 function for database rollback
Configuration
  • lib/lightning/config/bootstrap.ex: Added RUN_CHANNEL_JOIN_TIMEOUT_SECONDS environment variable
  • lib/lightning/config.ex: Added configuration callback and renamed existing callbacks for consistency
Testing
  • test/lightning_web/channels/worker_channel_test.exs: Added comprehensive tests for timeout and cancellation scenarios
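
The rollback_claimed_runs/1 function added in lib/lightning/runs.ex might look roughly like the sketch below; only the Repo.update_all over run ids is visible in the diff quoted later in this thread, and the :claimed guard and log message are assumptions:

def rollback_claimed_runs([]), do: {:ok, 0}

def rollback_claimed_runs(runs) do
  run_ids = Enum.map(runs, & &1.id)

  {count, _} =
    from(r in Run, where: r.id in ^run_ids and r.state == :claimed)
    |> Repo.update_all(set: [state: :available])

  Logger.info("Rolled back #{count} claimed runs to :available")
  {:ok, count}
end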

Areas for Reviewers to Focus On

🔍 Critical Review Areas
  1. Cross-Node Communication Logic (worker_channel.ex:39-44, 168-179)

    • Verify PubSub subscription/unsubscription is handled correctly
    • Check that worker_id filtering prevents message cross-contamination
    • Ensure proper cleanup in terminate/2
  2. Timeout and Rollback Logic (worker_channel.ex:83-102, 119-135)

    • Verify timeout only starts when original_runs count > 0
    • Check that rollback correctly resets run state to :available
    • Ensure timeout cancellation works when all runs are joined
  3. Database Rollback Function (runs.ex:350-374)

    • Verify the rollback_claimed_runs/1 function correctly updates run state
    • Check that it handles empty lists gracefully
    • Ensure proper logging of rollback operations
  4. Configuration Changes (config.ex:327-330, 453, 693-695)

    • Verify new configuration follows existing patterns
    • Check that callback renaming doesn't break existing functionality
    • Ensure proper default values
🧪 Test Coverage
  1. Timeout Test (worker_channel_test.exs:139-170)

    • Verify test correctly simulates timeout scenario
    • Check that manual timeout triggering works in test environment
    • Ensure rollback verification is comprehensive
  2. Cancellation Test (worker_channel_test.exs:177-210)

    • Verify PubSub message format matches implementation
    • Check that timeout cancellation prevents rollback
    • Ensure test cleanup removes configuration
⚠️ Potential Issues to Watch For
  1. Memory Leaks: Ensure PubSub subscriptions are properly cleaned up
  2. Race Conditions: Verify timeout cancellation is atomic
  3. Configuration: Check that environment variable naming is consistent
  4. Error Handling: Ensure graceful handling of edge cases (empty runs, nil worker_id)

Configuration

The timeout can be configured via the RUN_CHANNEL_JOIN_TIMEOUT_SECONDS environment variable (default: 30 seconds).
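
A generic illustration of reading that variable at boot (bootstrap.ex uses its own env-parsing helpers, and the application key name here is an assumption):

# In config/runtime.exs style configuration; illustrative only.
run_channel_join_timeout_seconds =
  System.get_env("RUN_CHANNEL_JOIN_TIMEOUT_SECONDS", "30")
  |> String.to_integer()

config :lightning,
  run_channel_join_timeout_seconds: run_channel_join_timeout_seconds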

Testing

All existing tests pass, and new tests cover:

  • Timeout rollback when client doesn't join run channels
  • Timeout cancellation when client successfully joins run channels
  • Cross-node communication via PubSub
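
The timeout test could take roughly this shape, assuming Phoenix.ChannelTest helpers, a :claim_timeout message the test can send directly, and a claim reply whose runs carry string "id" keys (the message name, reply shape, and exact module names are all assumptions):

test "rolls claimed runs back to :available when no run channel is joined", %{socket: socket} do
  ref = push(socket, "claim", %{"demand" => 1})
  assert_reply ref, :ok, %{runs: [%{"id" => run_id}]}

  # Trigger the timeout directly rather than waiting for Process.send_after/3
  send(socket.channel_pid, :claim_timeout)
  # Synchronise: the channel has processed the message once get_state returns
  _ = :sys.get_state(socket.channel_pid)

  assert %{state: :available} = Lightning.Repo.get!(Lightning.Run, run_id)
end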

Backward Compatibility

This change is fully backward compatible. Existing functionality remains unchanged, and the new mechanism only activates when runs are claimed.

You can read more details in our Responsible AI Policy

Pre-submission checklist

  • I have performed a self-review of my code.
  • I have implemented and tested all related authorization policies. (e.g., :owner, :admin, :editor, :viewer)
  • I have updated the changelog.
  • I have ticked a box in "AI usage" in this PR


codecov bot commented Sep 9, 2025

Codecov Report

❌ Patch coverage is 71.18644% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.21%. Comparing base (a859751) to head (9249a70).
⚠️ Report is 2 commits behind head on main.

Files with missing lines                        Patch %   Lines missing
lib/lightning_web/channels/worker_channel.ex    66.66%    14 ⚠️
lib/lightning_web/channels/run_channel.ex       50.00%     2 ⚠️
lib/lightning/runs.ex                           75.00%     1 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3566      +/-   ##
==========================================
- Coverage   90.27%   90.21%   -0.07%     
==========================================
  Files         386      386              
  Lines       15869    15921      +52     
==========================================
+ Hits        14326    14363      +37     
- Misses       1543     1558      +15     


@taylordowns2000 taylordowns2000 changed the title Only reply if the socket is still alive Rollback "claimed" runs if worker doesn't join run channel Sep 10, 2025
@taylordowns2000 taylordowns2000 changed the title Rollback "claimed" runs if worker doesn't join run channel Rollback "claimed" runs to "available" if worker doesn't join run channel Sep 10, 2025
@taylordowns2000 (Member, Author) commented:

@rorymckinley, @josephjclark, @stuartc, I feel pretty good about this, with one straggling question... reposting it from the PR description down here:

Question: Should we add some sort of state to the run_channel to make sure that it cannot be joined by more than one worker? (Imagine we roll back after 30 seconds, and right at that moment worker A joins the run channel. Worker B will then join also once it gets its own claim. It would be awesome to either invalidate the first key or block joining the run channel once we decide to roll back. Is that possible?)

@rorymckinley (Collaborator) left a comment:

@taylordowns2000 I have some silly questions about the implementation and also I think there are places where we could screw down the tests a bit more.


# Notify the worker channel that this run channel has been joined
# Use PubSub for cross-node communication in clustered environments
case socket.assigns[:worker_id] do
Collaborator:

@taylordowns2000 run_channel_test.exs does not seem to have been updated to cover this additional functionality?

"Worker socket disconnected before claim reply, rolling back transaction for runs: #{Enum.map_join(original_runs, ", ", & &1.id)}"
)

Runs.rollback_claimed_runs(original_runs)
Collaborator:

@taylordowns2000 There does not appear to be test coverage here - is there a cheap way to test that the Process.alive? check failed?


# Store the timeout reference and original runs in socket assigns
assign(socket,
claim_timeout_ref: timeout_ref,
Collaborator:

@taylordowns2000 If I am reading this correctly, the reference to the timeout is set for the socket - does that make us vulnerable to the following:

Worker 1 sends a claim request for 1 run (R1)
Lightning sends back run details and sets timeout T1
Before Worker 1 joins the run channel and within the join timeout, Worker 1 sends through a claim for another run.
Lightning sends back run details (R2) and sets timeout T2.
Now, T1 has been replaced with T2, i.e. have we lost timeout coverage of R1?
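
One way to close that gap would be to key timers by claim instead of storing a single ref on the socket; a purely illustrative fragment inside the claim handler, not necessarily what the follow-up commit does (timeout_ms and runs come from the surrounding clause):

# Track one timer per claim so a second claim cannot clobber the first.
claim_ref = make_ref()
timer = Process.send_after(self(), {:claim_timeout, claim_ref}, timeout_ms)

claim_timeouts =
  Map.put(socket.assigns[:claim_timeouts] || %{}, claim_ref, %{timer: timer, runs: runs})

assign(socket, claim_timeouts: claim_timeouts)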

Member Author (taylordowns2000):

great catch. investigating now

Member Author (@taylordowns2000), Sep 10, 2025:

@rorymckinley, I think you were right... does this do the trick? 0a874d9

Collaborator:

@taylordowns2000 It looks like it - I have a question re: this comment.

Is that functionality still coming?

Member Author (taylordowns2000):

No, we shouldn't roll them back. We should let the Janitor have its way with them if the worker channel closes

Member Author (taylordowns2000):

I've removed the comment. Thanks for catching it!

Collaborator:

Thanks @taylordowns2000 - will you be adding tests later?

Member Author (@taylordowns2000), Sep 11, 2025:

Only if Joe/Stu confirm they want this approach in the app.

Logger.info("Successfully rolled back #{count} runs")

# Clear the timeout reference and pending runs from socket assigns
socket = assign(socket, claim_timeout_ref: nil, pending_runs: nil)
Collaborator:

@taylordowns2000 This line does not appear to have test coverage?

{:noreply, socket}
end
else
# Still have pending runs, update the list
Collaborator:

@taylordowns2000 There does not appear to be test coverage for the case where there are runs remaining (and the timeout does not get cancelled)?

handle_info({:run_channel_joined, run_id}, socket)

_ ->
# Message not for this worker, ignore
Collaborator:

@taylordowns2000 Do we have a test to cover the case where the message is not for this worker?

"Worker channel terminating with pending runs, rolling back: #{Enum.map_join(pending_runs, ", ", & &1.id)}"
)

Runs.rollback_claimed_runs(pending_runs)
Collaborator:

@taylordowns2000 This line does not appear to have test coverage?

# No worker ID available, continue normally
:ok

worker_id ->
Collaborator:

@taylordowns2000 Are we vulnerable to a race condition here? E.g., we retrieve the run, broadcast a message that the timer can be reset, and then send data back to the worker? In the gap between the broadcast and something acting on it, the timeout could kick in, unclaiming the run and potentially allowing a situation where two workers are running it.

This feels like the 'ack-ack' scenario (although we can use the run channel response as the 'ack-ack'), and I think that addresses your question about the Worker A / Worker B scenario.

When a worker joins the run channel, I think that Lightning should confirm that the run is still claimed, lock the run, and apply a change that makes rollback impossible (it feels like the responsible way to do this would be to introduce a new state), and only then send the :ok back to the worker.

If any of those conditions is not met, the worker gets an error.
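
A sketch of that join-time confirmation, assuming a hypothetical new state (the :confirmed suggested further down) and a confirm_claim/1 helper, neither of which is part of this PR:

def confirm_claim(run_id) do
  Repo.transaction(fn ->
    run =
      from(r in Run, where: r.id == ^run_id, lock: "FOR UPDATE")
      |> Repo.one()

    case run do
      %Run{state: :claimed} = run ->
        # Lock held: move out of :claimed so the timeout can no longer roll it back
        run |> Ecto.Changeset.change(state: :confirmed) |> Repo.update!()

      _ ->
        # Already rolled back (or never claimed): reject the run channel join
        Repo.rollback(:not_claimed)
    end
  end)
end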

Member Author (taylordowns2000):

@rorymckinley, would this be handled better by ensuring that a run channel can only be joined once? Or only by a single worker? (This was a thought I had in the initial PR description.)

Member Author (taylordowns2000):

Eish, or would that prevent workers from re-connecting to the run channel if they got temporarily disconnected?

Collaborator:

@taylordowns2000 Even if a single-connect is an option, I do not think it would solve the problem entirely, and it may introduce a new problem:

Showing my workings, apologies for the repetition of what I said before.

I can't see any obvious place where we are checking the state of a run before we send stuff back to the worker (probably never needed it before?) - so race 1 would be that the run has already been rolled back (essentially Worker A sans Worker B). A single connect does not seem to solve this.

Race 2 would be that the run has not been rolled back, but in between the broadcast and the action the run gets rolled back. Using the Highlander approach you suggest would prevent another worker from picking up the run once it has been rolled back, but:

  • The first worker would think that it is ok to work on this, but Lightning thinks it is available. What happens when the worker sends through 'started' - would Lightning need a moment to compose itself?
  • If we only allow one connection to the run, would that block legitimate cases where we do want another worker to connect? Say Worker A does connect too late: the run has been rolled back but not yet claimed. In this case we would tell Worker A to fly a kite and we would want Worker B to be able to connect, but the Highlander principle would block it?

Collaborator:

And, oops: I forgot to add that I do not know enough about workers to know whether a worker will reconnect mid-run to the run channel - @josephjclark?

Member Author (taylordowns2000):

Eish. Yeah, it does sound like adding another state (:locked?) might be useful. Let's see what @stuartc and @josephjclark think here before I invest more time in this.

Collaborator:

:confirmed ?

Collaborator:

I believe the worker will just reconnect when the connection is lost. The socket should handle this. I'll test locally later and see if I can confirm.


{count, _} =
from(r in Run, where: r.id in ^run_ids)
|> Repo.update_all(
Collaborator:

@taylordowns2000 To avoid race conditions, it may be worth locking each record, checking that the record should be rolled back and then rolling it back.
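
A rough illustration of that lock-check-rollback variant, not the code in this PR (run_ids is assumed from the surrounding function):

Repo.transaction(fn ->
  # Re-check under a row lock which runs are still :claimed before updating
  still_claimed =
    from(r in Run, where: r.id in ^run_ids and r.state == :claimed, lock: "FOR UPDATE")
    |> Repo.all()
    |> Enum.map(& &1.id)

  from(r in Run, where: r.id in ^still_claimed)
  |> Repo.update_all(set: [state: :available])
end)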

@github-project-automation github-project-automation bot moved this from New Issues to In review in v2 Sep 10, 2025
@josephjclark (Collaborator) commented:

@taylordowns2000

Should we add some sort of state to the run_channel to make sure that it cannot be joined by more than one worker?

Not sure, I keep changing my mind.

It definitely feels like there are a few teeny tiny gaps where a claim might resolve just after the timeout, or two workers might join, or the rollback might occur just after a worker starts, or maybe the wrong worker joins. These are really tiny gaps, but possible. They might just be impossible to close?

If a worker is disconnected mid-run, it should reconnect and continue sending events. In fact I've just tested this and it works great (although some events may time out if the connection is down for long enough)

So you could add validation that only one worker can be in the channel at a time, or that after a worker joins, only another worker with the same id can rejoin. But none of those are ironclad.

I would leave it: this PR probably fixes 99% of the LostBeforeStart cases. Let's put that into the wild and come back and think about the remaining 1% if it remains a problem.

@stuartc (Member) commented Sep 11, 2025:

One comment here, will digest the "ack" aspect tomorrow.

The idea of checking that the transport pid is alive feels odd to me: the socket and channel processes are (afaik) tied to the transport process, so if the transport were dead they shouldn't be running either, i.e. that check will always be true. I haven't seen logs indicating a message failed because the transport was dead.

The original conversation was about a theory I had that transactions were continuing after the socket had closed. Since the claim handler is synchronous right now, I find it hard to believe the client is still listening for replies after 60s, yet we observed multiple claim transactions on the db still going after 60s. So I'm interested in trying to replicate this. Either the client should have disconnected (and the channel process dies and takes the transaction with it), or the worker should still be waiting for a response. I really want to determine whether the worker is still connected, at least via a socket, and just isn't going to consume the response.

So the logic seems ok as a safeguard for a worst case scenario, since the reply isn't ack'd in our current design. But it shouldn't replace understanding what's happening (and I'm thinking it might get in the way of debugging the scenario above).

<% else %>
<Common.combobox
items={assigns[:projects] || []}
items={(assigns[:projects] || []) |> Enum.sort_by(& &1.name)}
Collaborator:

Cheeky. I think Elias has fixed this on main now?


{:reply, {:ok, %{runs: runs}}, socket}
# Check if the socket is still alive
if Process.alive?(socket.transport_pid) do
Collaborator:

Yeah, I'm with @stuartc here - checking the socket is alive isn't the right thing.

Is this a hangover from the original PR that needs removing? Or is there still some value?
