Skip to content

Conversation

@benclifford
Copy link
Collaborator

@benclifford benclifford commented Dec 2, 2025

This was uncovered by removal of an unrelated sleep in PR #4037, but I think it will present itself (as missing monitoring data, rather than as an error/exception) in user-facing code.

Prior to this PR

A monitoring message could be written to the new_dir and then shortly after the exit event could be set, sufficiently close in time that the monitoring radio receiver loop exited without seeing those new message files.

This PR modifies the exit behaviour of that loop to have one final iteration with the following ordering of events:

  1. monitoring messages are written

  2. task completes

  3. parsl begins to shut down

  4. monitoring radio exit event is set by DFK

  5. monitoring radio loop observes exit event

As of this PR

  1. monitoring radio loop performs one final processing of directory

The new behaviour here is step 6, that a final directory processing will always happen strictly after the exit event is set, which is strictly after the monitoring messages are written in step 1, assuming directories are consistently observable from different places in the filesystem.

The misbehaviour can be observed by increasing the delay time of the loop before this PR (for example to 10 seconds) and running the test suite.

With this race condition addressed, the loop poll period can be made longer and this PR arbitrarily increases it from 1 second to 10 seconds - although it could also be made configurable.

Changed Behaviour

I expect some situations where end of task monitoring data may have been missing to now not be missing that data.

Type of change

  • Bug fix

Comment on lines +51 to +52
if exit_event.wait(POLL_PERIOD_S):
loop = False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A do-while approximation. I like it.

@khk-globus khk-globus added this pull request to the merge queue Dec 2, 2025
@khk-globus khk-globus removed this pull request from the merge queue due to a manual request Dec 2, 2025
github-merge-queue bot pushed a commit that referenced this pull request Dec 2, 2025
This was uncovered by removal of an unrelated sleep in PR #4037, but I
think it will present itself (as missing monitoring data, rather than as
an error/exception) in user-facing code.

### Prior to this PR

A monitoring message could be written to the new_dir and then shortly
after the exit event could be set, sufficiently close in time that the
monitoring radio receiver loop exited without seeing those new message
files.

This PR modifies the exit behaviour of that loop to have one final
iteration with the following ordering of events:

1. monitoring messages are written

2. task completes

3. parsl begins to shut down

4. monitoring radio exit event is set by DFK

5. monitoring radio loop observes exit event

### As of this PR

6. monitoring radio loop performs one final processing of directory

The new behaviour here is step 6, that a final directory processing will
always happen strictly after the exit event is set, which is strictly
after the monitoring messages are written in step 1, assuming
directories are consistently observable from different places in the
filesystem.

The misbehaviour can be observed by increasing the delay time of the
loop before this PR (for example to 10 seconds) and running the test
suite.

With this race condition addressed, the loop poll period can be made
longer and this PR arbitrarily increases it from 1 second to 10 seconds
- although it could also be made configurable.

# Changed Behaviour

I expect some situations where end of task monitoring data may have been
missing to now not be missing that data.

## Type of change

- Bug fix
This was uncovered by removal of an unrelated sleep in PR #4037, but I
think it will present itself (as missing monitoring data, rather than
as an error/exception) in user-facing code.

Prior to this PR:

A monitoring message could be written to the new_dir and then shortly
after the exit event could be set, sufficiently close in time that the
monitoring radio receiver loop exited without seeing those new message
files.

This PR modifies the exit behaviour of that loop to have one final iteration
with the following ordering of events:

i) monitoring messages are written

ii) task completes

iii) parsl begins to shut down

iv) monitoring radio exit event is set by DFK

v) monitoring radio loop observes exit event

vi) monitoring radio loop performs one final processing of directory

The new behaviour here is step vi, that a final directory processing will
always happen strictly after the exit event is set, which is strictly after
the monitoring messages are written in step i, assuming directories are
consistently observable from different places in the filesystem.

The misbehaviour can be observed fairly easily by increasing the delay time
of the loop before this PR for example to 10 seconds and running the test
suite.

With this race condition addressed, the loop poll period can be made longer
and this PR arbitrarily increases it from 1 second to 10 seconds - although
it could also be made configurable.
@khk-globus khk-globus force-pushed the benc-monitoring-fs-radio-shutdown-race branch from 6e776bd to cdce117 Compare December 2, 2025 17:45
@khk-globus khk-globus enabled auto-merge December 2, 2025 17:45
@khk-globus khk-globus added this pull request to the merge queue Dec 2, 2025
Merged via the queue into master with commit 0a488de Dec 2, 2025
9 checks passed
@khk-globus khk-globus deleted the benc-monitoring-fs-radio-shutdown-race branch December 2, 2025 18:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants