Fix race condition at exit of monitoring filesystem radio #4041

benclifford · 2025-12-02T13:52:24Z

This was uncovered by removal of an unrelated sleep in PR #4037, but I think it will present itself (as missing monitoring data, rather than as an error/exception) in user-facing code.

Prior to this PR

A monitoring message could be written to the new_dir and then shortly after the exit event could be set, sufficiently close in time that the monitoring radio receiver loop exited without seeing those new message files.

This PR modifies the exit behaviour of that loop to have one final iteration with the following ordering of events:

monitoring messages are written
task completes
parsl begins to shut down
monitoring radio exit event is set by DFK
monitoring radio loop observes exit event

As of this PR

monitoring radio loop performs one final processing of directory

The new behaviour here is step 6, that a final directory processing will always happen strictly after the exit event is set, which is strictly after the monitoring messages are written in step 1, assuming directories are consistently observable from different places in the filesystem.

The misbehaviour can be observed by increasing the delay time of the loop before this PR (for example to 10 seconds) and running the test suite.

With this race condition addressed, the loop poll period can be made longer and this PR arbitrarily increases it from 1 second to 10 seconds - although it could also be made configurable.

Changed Behaviour

I expect some situations where end of task monitoring data may have been missing to now not be missing that data.

Type of change

Bug fix

khk-globus · 2025-12-02T17:24:20Z

parsl/monitoring/radios/filesystem_router.py

+        if exit_event.wait(POLL_PERIOD_S):
+            loop = False


A do-while approximation. I like it.

This was uncovered by removal of an unrelated sleep in PR #4037, but I think it will present itself (as missing monitoring data, rather than as an error/exception) in user-facing code. ### Prior to this PR A monitoring message could be written to the new_dir and then shortly after the exit event could be set, sufficiently close in time that the monitoring radio receiver loop exited without seeing those new message files. This PR modifies the exit behaviour of that loop to have one final iteration with the following ordering of events: 1. monitoring messages are written 2. task completes 3. parsl begins to shut down 4. monitoring radio exit event is set by DFK 5. monitoring radio loop observes exit event ### As of this PR 6. monitoring radio loop performs one final processing of directory The new behaviour here is step 6, that a final directory processing will always happen strictly after the exit event is set, which is strictly after the monitoring messages are written in step 1, assuming directories are consistently observable from different places in the filesystem. The misbehaviour can be observed by increasing the delay time of the loop before this PR (for example to 10 seconds) and running the test suite. With this race condition addressed, the loop poll period can be made longer and this PR arbitrarily increases it from 1 second to 10 seconds - although it could also be made configurable. # Changed Behaviour I expect some situations where end of task monitoring data may have been missing to now not be missing that data. ## Type of change - Bug fix

This was uncovered by removal of an unrelated sleep in PR #4037, but I think it will present itself (as missing monitoring data, rather than as an error/exception) in user-facing code. Prior to this PR: A monitoring message could be written to the new_dir and then shortly after the exit event could be set, sufficiently close in time that the monitoring radio receiver loop exited without seeing those new message files. This PR modifies the exit behaviour of that loop to have one final iteration with the following ordering of events: i) monitoring messages are written ii) task completes iii) parsl begins to shut down iv) monitoring radio exit event is set by DFK v) monitoring radio loop observes exit event vi) monitoring radio loop performs one final processing of directory The new behaviour here is step vi, that a final directory processing will always happen strictly after the exit event is set, which is strictly after the monitoring messages are written in step i, assuming directories are consistently observable from different places in the filesystem. The misbehaviour can be observed fairly easily by increasing the delay time of the loop before this PR for example to 10 seconds and running the test suite. With this race condition addressed, the loop poll period can be made longer and this PR arbitrarily increases it from 1 second to 10 seconds - although it could also be made configurable.

benclifford mentioned this pull request Dec 2, 2025

Remove unnecessary test sleep #4037

Merged

khk-globus approved these changes Dec 2, 2025

View reviewed changes

khk-globus added this pull request to the merge queue Dec 2, 2025

khk-globus removed this pull request from the merge queue due to a manual request Dec 2, 2025

khk-globus force-pushed the benc-monitoring-fs-radio-shutdown-race branch from 6e776bd to cdce117 Compare December 2, 2025 17:45

khk-globus enabled auto-merge December 2, 2025 17:45

khk-globus added this pull request to the merge queue Dec 2, 2025

Merged via the queue into master with commit 0a488de Dec 2, 2025
9 checks passed

khk-globus deleted the benc-monitoring-fs-radio-shutdown-race branch December 2, 2025 18:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix race condition at exit of monitoring filesystem radio #4041

Fix race condition at exit of monitoring filesystem radio #4041

Uh oh!

benclifford commented Dec 2, 2025 •

edited by khk-globus

Loading

Uh oh!

khk-globus Dec 2, 2025

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix race condition at exit of monitoring filesystem radio #4041

Fix race condition at exit of monitoring filesystem radio #4041

Uh oh!

Conversation

benclifford commented Dec 2, 2025 • edited by khk-globus Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Prior to this PR

As of this PR

Changed Behaviour

Type of change

Uh oh!

khk-globus Dec 2, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

benclifford commented Dec 2, 2025 •

edited by khk-globus

Loading