Skip to content

Expose claim_backoff_max_secs as a CLI option#355

Merged
daniel-thom merged 5 commits into
mainfrom
feat/claim-backoff-max-secs
May 28, 2026
Merged

Expose claim_backoff_max_secs as a CLI option#355
daniel-thom merged 5 commits into
mainfrom
feat/claim-backoff-max-secs

Conversation

@daniel-thom
Copy link
Copy Markdown
Collaborator

Add --claim-backoff-max-secs to torc run, torc submit, and torc-slurm-job-runner so the adaptive claim/poll backoff cap can be set per invocation. When set on torc submit, the override is propagated into the generated Slurm submission scripts. torc config init now writes the 300-second default into the generated config file.

Add --claim-backoff-max-secs to torc run, torc submit, and
torc-slurm-job-runner so the adaptive claim/poll backoff cap can be set
per invocation. When set on torc submit, the override is propagated into
the generated Slurm submission scripts. torc config init now writes the
300-second default into the generated config file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a configurable cap for Torc’s adaptive claim/poll backoff, exposing it as a CLI option across local runs and Slurm-based execution, and documenting/persisting the default in generated config.

Changes:

  • Add --claim-backoff-max-secs to torc run, torc submit, and torc-slurm-job-runner, and thread the value into runner initialization.
  • Extend Slurm submission script generation to optionally include --claim-backoff-max-secs, with accompanying tests.
  • Update torc config init template and docs to include client.run.claim_backoff_max_secs = 300.0.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/test_slurm_commands.rs Updates submission-script tests for the new optional parameter; adds coverage for flag inclusion/omission.
src/run_jobs_cmd.rs Adds CLI arg and applies it via JobRunner::override_claim_backoff_max_secs.
src/main.rs Wires claim_backoff_max_secs through torc run/torc submit paths.
src/exec_cmd.rs Updates run_jobs_cmd::Args construction to include the new field.
src/config/loader.rs Adds claim_backoff_max_secs = 300.0 to the generated config template.
src/client/workflow_manager.rs Threads the new override into Slurm scheduling.
src/client/job_runner.rs Adds override method + tests; propagates value when auto-scheduling Slurm nodes.
src/client/hpc/slurm_interface.rs Appends --claim-backoff-max-secs to generated runner command when provided.
src/client/hpc/hpc_manager.rs Updates interface call to include the new parameter (currently None).
src/client/hpc/hpc_interface.rs Extends trait signature/docs for the new parameter.
src/client/commands/slurm.rs Extends schedule_slurm_nodes signature and passes the value to script generation.
src/cli.rs Adds the new flag to Run and Submit subcommands.
src/bin/torc-slurm-job-runner.rs Adds flag parsing and applies it to JobRunner.
docs/src/core/reference/configuration.md Documents the new config key and env var mapping.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/main.rs Outdated
Comment thread src/client/workflow_manager.rs Outdated
Comment thread src/client/commands/slurm.rs
Comment thread src/client/commands/slurm.rs
daniel-thom and others added 2 commits May 27, 2026 18:57
Pass the CLI Option through as-is instead of wrapping the config default
in Some(...). When no CLI flag is provided, JobRunner falls back to its
own config read, and generated Slurm scripts omit --claim-backoff-max-secs
so runners on compute nodes honor their local config. Addresses Copilot
review feedback on PR #355.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The flag was missing from this direct entry point into schedule_slurm_nodes.
When set, propagates into the generated Slurm submission scripts; when
unset, runners fall back to their own client.run.claim_backoff_max_secs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surface the resolved cap alongside job_completion_poll_interval in the
"Starting torc job runner" log line so operators can confirm whether a
CLI override took effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

Comment thread src/client/job_runner.rs
Comment thread src/cli.rs
Comment on lines +459 to +464
/// Upper bound (seconds) on the adaptive claim/poll backoff used while
/// runners are idle. Defaults to `client.run.claim_backoff_max_secs`
/// in config (300 seconds out of the box). Propagated to the
/// generated Slurm submission scripts as `--claim-backoff-max-secs`.
#[arg(long)]
claim_backoff_max_secs: Option<f64>,
Comment thread src/bin/torc-slurm-job-runner.rs
Comment thread src/run_jobs_cmd.rs
Comment thread src/cli.rs
f64 parsing accepts NaN, +/-inf, and negative values; inf in particular
survives the existing max() clamp and crashes the runner later in
Duration::from_secs_f64. Add a shared clap value_parser
(parse_finite_non_negative_secs) and apply it to every site that exposes
--claim-backoff-max-secs (torc run, submit, slurm schedule-nodes, and
torc-slurm-job-runner). Also drop non-finite overrides defensively in
JobRunner::override_claim_backoff_max_secs in case bad values arrive
via a TOML config that bypasses the CLI parser. Addresses Copilot
review feedback on PR #355.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.

Comment thread src/run_jobs_cmd.rs
Comment on lines +86 to +90
/// Upper bound (seconds) on the adaptive claim/poll backoff. After an
/// iteration with no completions and no claimed jobs, the wait between
/// iterations doubles from `poll_interval` toward this cap and resets on
/// any progress. Set to `poll_interval` or lower to disable backoff.
/// Falls back to `client.run.claim_backoff_max_secs` in config (default 300).
Comment thread src/cli.rs
Comment on lines +266 to +268
/// Upper bound (seconds) on the adaptive claim/poll backoff used while
/// the runner is idle. Defaults to `client.run.claim_backoff_max_secs`
/// in config (300 seconds out of the box).
Comment thread src/cli.rs
Comment on lines +459 to +464
/// Upper bound (seconds) on the adaptive claim/poll backoff used while
/// runners are idle. Defaults to `client.run.claim_backoff_max_secs`
/// in config (300 seconds out of the box). Propagated to the
/// generated Slurm submission scripts as `--claim-backoff-max-secs`.
#[arg(long, value_parser = crate::client::utils::parse_finite_non_negative_secs)]
claim_backoff_max_secs: Option<f64>,
Comment on lines +432 to +436
/// Upper bound (seconds) on the adaptive claim/poll backoff used while
/// runners are idle. When set, propagated to the generated Slurm
/// submission scripts as `--claim-backoff-max-secs`. When unset, the
/// runner falls back to `client.run.claim_backoff_max_secs` in its
/// own config (300 seconds out of the box).
Comment on lines +67 to +69
/// Upper bound (seconds) on the adaptive claim/poll backoff used while
/// the runner is idle. Defaults to `client.run.claim_backoff_max_secs`
/// in config (300 seconds out of the box).
Comment thread src/config/loader.rs
Comment on lines +198 to +200
# Upper bound (seconds) on the adaptive claim/poll backoff used while the
# runner is idle. After an iteration with no completions and no claimed
# jobs, the wait between iterations doubles from `poll_interval` toward
Comment on lines +45 to +53
| Option | Type | Default | Description |
| ------------------------ | ----- | ------------- | ---------------------------------------------------------------------- |
| `poll_interval` | float | `5.0` | Job completion poll interval (seconds) |
| `claim_backoff_max_secs` | float | `300.0` | Upper bound (seconds) for the adaptive idle backoff (see job-runners). |
| `output_dir` | path | `torc_output` | Output directory for job logs |
| `max_parallel_jobs` | int | (none) | Maximum parallel jobs (overrides resource-based) |
| `num_cpus` | int | (none) | Available CPUs for resource-based scheduling |
| `memory_gb` | float | (none) | Available memory (GB) for resource-based scheduling |
| `num_gpus` | int | (none) | Available GPUs for resource-based scheduling |
@daniel-thom daniel-thom merged commit c4ee72e into main May 28, 2026
10 checks passed
@daniel-thom daniel-thom deleted the feat/claim-backoff-max-secs branch May 28, 2026 02:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants