Remove unnecessary test sleep #4037

benclifford · 2025-12-01T11:41:27Z

This removes about 15 seconds of test runtime for --config local tests.

This delay was introduced as part of new tests for resource records, in:

Test that monitoring resource rows are recorded for a long task (#1932)

but then later on, the monitoring code was modified to always send one final record, even for very fast tasks, which meant the delay was no longer needed, in:

Separate accumulation and sending of resource info (#2380)

Changed Behaviour

none

Type of change

Code maintenance/cleanup

benclifford · 2025-12-01T11:58:56Z

looks like this reveals a different test race condition - converting to draft

benclifford · 2025-12-02T14:00:36Z

see #4041 attempting to address the race condition uncovered here

This was uncovered by removal of an unrelated sleep in PR #4037, but I think it will present itself (as missing monitoring data, rather than as an error/exception) in user-facing code.

### Prior to this PR

A monitoring message could be written to the new_dir and then shortly after the exit event could be set, sufficiently close in time that the monitoring radio receiver loop exited without seeing those new message files.

This PR modifies the exit behaviour of that loop to have one final iteration with the following ordering of events:

1. monitoring messages are written
2. task completes
3. parsl begins to shut down
4. monitoring radio exit event is set by DFK
5. monitoring radio loop observes exit event

### As of this PR

6. monitoring radio loop performs one final processing of directory

The new behaviour here is step 6, that a final directory processing will always happen strictly after the exit event is set, which is strictly after the monitoring messages are written in step 1, assuming directories are consistently observable from different places in the filesystem.

The misbehaviour can be observed by increasing the delay time of the loop before this PR (for example to 10 seconds) and running the test suite.

With this race condition addressed, the loop poll period can be made longer and this PR arbitrarily increases it from 1 second to 10 seconds - although it could also be made configurable.

# Changed Behaviour

I expect some situations where end of task monitoring data may have been missing to now not be missing that data.

## Type of change

- Bug fix
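The exit ordering described in this comment can be illustrated with a minimal Python sketch. This is not the parsl implementation: the names `process_dir`, `receiver_loop` and `demo`, the use of `threading.Event`, and the write-to-`tmp`-then-rename step are all assumptions made for illustration. Only the idea of a `new_dir`, an exit event, a poll period, and one final directory scan after the exit event come from the description.

```python
import os
import tempfile
import threading


def process_dir(new_dir, seen):
    """Record every message file currently in new_dir, then delete it."""
    for name in sorted(os.listdir(new_dir)):
        seen.append(name)
        os.remove(os.path.join(new_dir, name))


def receiver_loop(new_dir, exit_event, seen, poll_period=10.0):
    """Hypothetical receiver: poll until exit_event is set, then scan once more."""
    while not exit_event.is_set():
        process_dir(new_dir, seen)
        # wait() returns as soon as the event is set, so a long poll
        # period does not delay shutdown.
        exit_event.wait(poll_period)
    # Final pass (step 6): runs strictly after the exit event was observed,
    # so it sees any message written before the event was set.
    process_dir(new_dir, seen)


def demo():
    seen = []
    exit_event = threading.Event()
    with tempfile.TemporaryDirectory() as base:
        tmp_dir = os.path.join(base, "tmp")
        new_dir = os.path.join(base, "new")
        os.mkdir(tmp_dir)
        os.mkdir(new_dir)
        receiver = threading.Thread(
            target=receiver_loop, args=(new_dir, exit_event, seen)
        )
        receiver.start()
        # Step 1: write the message elsewhere, then move it into new_dir
        # so the receiver never sees a half-written file (an assumption).
        tmp_path = os.path.join(tmp_dir, "msg-0")
        with open(tmp_path, "w") as f:
            f.write("resource record")
        os.replace(tmp_path, os.path.join(new_dir, "msg-0"))
        # Step 4: the exit event is set immediately afterwards.
        exit_event.set()
        receiver.join()
    return seen
```

Without the final `process_dir` call after the loop, whether `msg-0` is seen depends on thread timing, which is the race described above. With it, `demo()` records the message once regardless of how the two threads interleave.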

khk-globus

Amusing how 3 seconds turns into 15 seconds.

benclifford marked this pull request as draft December 1, 2025 11:59

benclifford mentioned this pull request Dec 2, 2025

Fix race condition at exit of monitoring filesystem radio #4041

Merged

Merge branch 'master' into benc-monitoring-resource-test

0aed9c3

benclifford marked this pull request as ready for review December 4, 2025 15:01

khk-globus approved these changes Dec 4, 2025

khk-globus added this pull request to the merge queue Dec 4, 2025

Merged via the queue into master with commit af32bab Dec 4, 2025
9 checks passed

khk-globus deleted the benc-monitoring-resource-test branch December 4, 2025 15:35
