Background
Currently, Nextflow reads the .exitcode file from the work directory to determine task completion status. PR #6442 improved K8s error handling by prioritizing the scheduler's exit status for failed executions (e.g., OOMKilled, pod eviction), but still falls back to reading the .exitcode file for successful executions.
Proposed Optimization
For successful task executions (scheduler exit status == 0), we should rely solely on the scheduler's reported exit status and bypass reading the .exitcode file entirely.
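The decision logic could look roughly like the sketch below. It is written in Python for brevity and is not Nextflow's actual code: `resolve_exit_status` and `read_exit_file` are hypothetical names, and representing "scheduler reported nothing" as `None` is an assumption made for illustration.

```python
from pathlib import Path
from typing import Optional


def read_exit_file(work_dir: Path) -> Optional[int]:
    """Read the task's .exitcode file; return None if it is missing or malformed."""
    try:
        return int((work_dir / ".exitcode").read_text().strip())
    except (OSError, ValueError):
        return None


def resolve_exit_status(scheduler_status: Optional[int], work_dir: Path) -> Optional[int]:
    """Hypothetical helper: pick the task exit status with as little file I/O as possible."""
    if scheduler_status is None:
        # The scheduler reported nothing usable: fall back to the .exitcode file.
        return read_exit_file(work_dir)
    if scheduler_status != 0:
        # Failed execution: trust the scheduler (the pattern described for PR #6442).
        return scheduler_status
    # Proposed change: a successful execution is accepted as-is,
    # without touching the .exitcode file at all.
    return 0
```

With this shape, the file read only happens when the scheduler provides no exit status, so a successful task costs no extra access to the work directory.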
Benefits
- Reduced I/O pressure: Eliminates one file read operation per successful task
- Better scalability: Particularly beneficial for workloads with many fine-grained jobs
- Lower storage costs: Reduces remote file storage access (S3, Azure Blob, GCS, etc.)
- Improved performance: Faster task completion acknowledgment
Implementation Considerations
This optimization should be evaluated across all executor types:
- K8s (nf-k8s)
- AWS Batch (nf-amazon)
- Azure Batch (nf-azure)
- Google Batch (nf-google)
- Other cloud executors
Related Work
- PR #6442 (Get exit code from pod to manage OOM in k8s): Improved K8s exit code handling for error cases
- Issue #6436 (OOM do not return 137 exit code in K8s executor): Original OOM handling issue
PR #6442 establishes the pattern of prioritizing scheduler exit status for errors. This issue proposes extending that approach to successful executions as well.