Fix race condition at exit of monitoring filesystem radio (#4041)

benclifford · web-flow · commit 6b093f48bf98 · 2025-12-02T17:24:43.000Z
This was uncovered by removal of an unrelated sleep in PR #4037, but I think it will present itself (as missing monitoring data, rather than as an error/exception) in user-facing code.

### Prior to this PR

A monitoring message could be written to the new_dir and then, shortly after, the exit event could be set, sufficiently close in time that the monitoring radio receiver loop exited without seeing those new message files.

This PR modifies the exit behaviour of that loop to have one final iteration, with the following ordering of events:

1. monitoring messages are written
2. task completes
3. parsl begins to shut down
4. monitoring radio exit event is set by DFK
5. monitoring radio loop observes exit event

### As of this PR

6. monitoring radio loop performs one final processing of directory

The new behaviour here is step 6: a final directory processing will always happen strictly after the exit event is set, which is strictly after the monitoring messages are written in step 1, assuming directories are consistently observable from different places in the filesystem.

The misbehaviour can be observed by increasing the delay time of the loop before this PR (for example to 10 seconds) and running the test suite.

With this race condition addressed, the loop poll period can be made longer, and this PR arbitrarily increases it from 1 second to 10 seconds, although it could also be made configurable.

# Changed Behaviour

I expect that in some situations where end-of-task monitoring data was previously missing, that data will now be present.

## Type of change

- Bug fix
diff --git a/parsl/monitoring/radios/filesystem_router.py b/parsl/monitoring/radios/filesystem_router.py
@@ -3,7 +3,6 @@
 import logging
 import os
 import pickle
-import time
 from multiprocessing.queues import Queue
 from multiprocessing.synchronize import Event
 from typing import cast
@@ -18,6 +17,9 @@
 
 logger = logging.getLogger(__name__)
 
+# how often the router will scan the new message directory
+POLL_PERIOD_S = 10
+
 
 @wrap_with_logs
 def filesystem_router_starter(*, q: Queue[TaggedMonitoringMessage], run_dir: str, exit_event: Event) -> None:
@@ -36,9 +38,19 @@ def filesystem_router_starter(*, q: Queue[TaggedMonitoringMessage], run_dir: str
     os.makedirs(tmp_dir, exist_ok=True)
     os.makedirs(new_dir, exist_ok=True)
 
-    while not exit_event.is_set():
+    loop = True
+
+    while loop:
         logger.debug("Start filesystem radio receiver loop")
 
+        # this happens before the final poll of the directory so that
+        # one complete pass over the new_dir will happen strictly after
+        # the exit_event is set. Without forcing that final pass, the
+        # loop could exit on exit_event without seeing files that were
+        # added shortly before the event was set.
+        if exit_event.wait(POLL_PERIOD_S):
+            loop = False
+
         # iterate over files in new_dir
         for filename in os.listdir(new_dir):
             try:
@@ -53,7 +65,6 @@ def filesystem_router_starter(*, q: Queue[TaggedMonitoringMessage], run_dir: str
             except Exception:
                 logger.exception("Exception processing %s - probably will be retried next iteration", filename)
 
-        time.sleep(1)  # whats a good time for this poll?
     logger.info("Ending filesystem radio receiver")
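
The ordering guarantee described above can be illustrated outside of Parsl with a minimal sketch. This is not the router code: it assumes a `threading.Event` in place of the multiprocessing event, and the names `drain_directory`, `receiver_loop` and `demo` are hypothetical helpers introduced only for illustration. The point it shows is that checking the event before the directory scan, rather than in the `while` condition, means a file written before the event is set is always picked up by the last pass.

```python
import os
import tempfile
import threading


def drain_directory(new_dir: str, seen: list[str]) -> None:
    """Record and remove every file currently in new_dir (hypothetical helper)."""
    for filename in sorted(os.listdir(new_dir)):
        seen.append(filename)
        os.remove(os.path.join(new_dir, filename))


def receiver_loop(new_dir: str, exit_event: threading.Event,
                  seen: list[str], poll_period_s: float) -> None:
    """Poll new_dir until exit_event is set, then do one final pass."""
    loop = True
    while loop:
        # Event.wait returns True if the event is set, or False if the
        # timeout elapsed first. It doubles as the poll delay.
        if exit_event.wait(poll_period_s):
            loop = False

        # This scan still runs on the iteration that observed the event,
        # so it happens strictly after the event was set.
        drain_directory(new_dir, seen)


def demo() -> list[str]:
    """Write a message, set the exit event immediately, and return what was seen."""
    seen: list[str] = []
    with tempfile.TemporaryDirectory() as new_dir:
        exit_event = threading.Event()
        # A long poll period: without the final pass the message would be lost.
        receiver = threading.Thread(
            target=receiver_loop, args=(new_dir, exit_event, seen, 60.0))
        receiver.start()

        with open(os.path.join(new_dir, "msg-1"), "w") as f:
            f.write("payload")
        exit_event.set()  # set right after the write, as in steps 1 to 4

        receiver.join()
    return seen


if __name__ == "__main__":
    print(demo())
```

Because the write completes before `exit_event.set()`, and the final scan begins only after `wait` has returned `True`, the message is seen regardless of how long the poll period is. With the old `while not exit_event.is_set()` shape, the same sequence could exit the loop without scanning again.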