feat(orchestrator): drop groups after max_error_reschedule_attempts #2459

Open
samsja wants to merge 1 commit into main from feat/orchestrator-drop-groups-after-max-attempts
@samsja samsja commented May 9, 2026

Summary

  • Add orchestrator.max_error_reschedule_attempts: int | None = None (default: retry indefinitely, current behavior).
  • GroupState.failed_attempts tracks consecutive batches where at least one rollout came back errored or empty.
  • When the counter hits the cap, the group is dropped from the current step's batch (in-flight rollouts cancelled via existing drop_group), and the rest of the batch proceeds.
  • New dropped_groups_by_env counter on the scheduler for observability; warning log on each drop.
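The retry cap described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SchedulerSketch`, `on_batch_complete`, the `groups` dict, and the `"trajectory"` key are hypothetical names, and `drop_group` here only forgets the group rather than cancelling in-flight rollouts.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class GroupState:
    env_name: str
    # Number of batches in which at least one rollout came back errored or empty.
    failed_attempts: int = 0


class SchedulerSketch:
    """Hypothetical stand-in for the scheduler; only the retry cap is modeled."""

    def __init__(self, max_error_reschedule_attempts: int | None = None) -> None:
        self.max_error_reschedule_attempts = max_error_reschedule_attempts
        self.groups: dict[int, GroupState] = {}
        self.dropped_groups_by_env: dict[str, int] = defaultdict(int)

    def drop_group(self, group_id: int) -> None:
        # Assumption: the real drop_group also cancels in-flight rollouts.
        self.groups.pop(group_id, None)

    def on_batch_complete(self, group_id: int, rollouts: list[dict]) -> str:
        """Return "accept", "reschedule", or "drop" for a finished batch."""
        group = self.groups[group_id]
        # A batch fails if any rollout carries an error or has no trajectory.
        failed = any(r.get("error") or not r.get("trajectory") for r in rollouts)
        if not failed:
            return "accept"

        group.failed_attempts += 1
        cap = self.max_error_reschedule_attempts
        # cap is None means retry indefinitely (the current behavior).
        if cap is not None and group.failed_attempts >= cap:
            self.dropped_groups_by_env[group.env_name] += 1
            self.drop_group(group_id)
            return "drop"
        return "reschedule"
```

With a cap of 2, a group whose rollouts error on every attempt is rescheduled once and dropped on the second failure; with the default `None`, it is rescheduled every time.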

Why

Agent envs can hit single-example hangs (e.g. a sandbox poll that times out at 60s on every retry). The error surfaces as rollout["error"], which the scheduler currently treats as a normal "reschedule the group" signal — there is no give-up condition, so a single bad example blocks step progress forever. This adds an opt-in cap so step progress can continue without the bad group.

Notes

Minimal local re-implementation of the abandoned #2076 — keeps the retry-cap behavior without the larger GeneratedBatch / variable group-size / deferred-scoring refactor in that PR.

Files

  • packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py: new config field.
  • src/prime_rl/orchestrator/scheduler.py: failed_attempts on GroupState, drop logic in the rollout completion path, dropped_groups_by_env counter.
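For illustration, the new config field could look something like the sketch below. It uses a plain dataclass to stay dependency-free; `OrchestratorConfigSketch` is a hypothetical name, and the lower-bound check is an assumption rather than something this PR states.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class OrchestratorConfigSketch:
    # None keeps the current behavior: errored groups are rescheduled indefinitely.
    max_error_reschedule_attempts: int | None = None

    def __post_init__(self) -> None:
        # Assumption: a cap below 1 would drop groups before any attempt counts,
        # so reject it early instead of silently accepting it.
        cap = self.max_error_reschedule_attempts
        if cap is not None and cap < 1:
            raise ValueError("max_error_reschedule_attempts must be >= 1 or None")
```

Leaving the field unset preserves today's behavior, which is why the change is opt-in.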

🤖 Generated with Claude Code


Note

Low Risk
Low risk because default behavior is unchanged (max_error_reschedule_attempts=None), and the new logic only triggers when a rollout group repeatedly returns errors/empty trajectories with the cap enabled.

Overview
Adds an opt-in orchestrator.max_error_reschedule_attempts setting to stop retrying rollout groups that repeatedly return errors or empty trajectories.

The scheduler now tracks per-group consecutive failure attempts (GroupState.failed_attempts) and, once the configured cap is reached, drops the group from the current training step (cancelling any in-flight rollouts via existing drop_group) while allowing the rest of the batch to proceed; it also increments a new dropped_groups_by_env counter and logs a warning when a group is dropped.

Reviewed by Cursor Bugbot for commit 2502565.

Adds a per-group retry cap to the rollout scheduler. Each `GroupState`
now tracks a `failed_attempts` counter, incremented whenever a batch of
rollouts comes back with at least one errored or empty trajectory. When
the counter reaches `orchestrator.max_error_reschedule_attempts`, the
group is dropped from the current step's batch (cancelling any
in-flight rollouts for that group) and the rest of the batch proceeds.

Default is `None` (retry indefinitely, current behavior). Set a value
to unblock single-example hangs in agent envs (e.g. a sandbox poll
that times out at 60s on every retry — that loop was previously
infinite because the AgentError surfaces as `rollout["error"]`, which
the scheduler treats as a normal "reschedule the group" signal with no
give-up condition).

Minimal local re-implementation of the abandoned PR #2076 — keeps the
retry-cap behavior without the larger GeneratedBatch / variable
group-size / deferred-scoring refactor in that PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@samsja samsja marked this pull request as ready for review May 9, 2026 16:19
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

self.empty_rollouts_by_env: dict[str, int] = defaultdict(int)
self.errored_rollouts_by_env: dict[str, int] = defaultdict(int)
self.total_rollouts_by_env: dict[str, int] = defaultdict(int)
self.dropped_groups_by_env: dict[str, int] = defaultdict(int)
dropped_groups_by_env counter never reported or cleared in metrics

Medium Severity

The dropped_groups_by_env counter is incremented when groups are dropped but is never included in get_metrics() and never cleared — unlike all sibling counters (empty_rollouts_by_env, errored_rollouts_by_env, total_rollouts_by_env) which are both reported and cleared in that method. This makes the "observability" counter invisible to monitoring and causes it to accumulate indefinitely across steps rather than resetting per reporting interval.
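One way to address this would be to report and clear the new counter alongside its siblings. The sketch below assumes a simplified `get_metrics()`; the class name `SchedulerMetricsSketch` and the metric key names are hypothetical, and the real method likely reports more than this.

```python
from __future__ import annotations

from collections import defaultdict


class SchedulerMetricsSketch:
    """Hypothetical stand-in that models only the per-env counters."""

    def __init__(self) -> None:
        self.errored_rollouts_by_env: dict[str, int] = defaultdict(int)
        self.total_rollouts_by_env: dict[str, int] = defaultdict(int)
        self.dropped_groups_by_env: dict[str, int] = defaultdict(int)

    def get_metrics(self) -> dict[str, float]:
        metrics: dict[str, float] = {}
        for env, total in self.total_rollouts_by_env.items():
            errored = self.errored_rollouts_by_env.get(env, 0)
            metrics[f"errored_rollouts/{env}"] = errored / max(total, 1)
        # Report dropped groups so the counter is visible to monitoring.
        for env, dropped in self.dropped_groups_by_env.items():
            metrics[f"dropped_groups/{env}"] = float(dropped)
        # Clear every counter so each reporting interval starts from zero.
        self.errored_rollouts_by_env.clear()
        self.total_rollouts_by_env.clear()
        self.dropped_groups_by_env.clear()
        return metrics
```

Clearing in the same place as the sibling counters keeps the value per reporting interval rather than cumulative across steps.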
