feat(orchestrator): drop groups after max_error_reschedule_attempts #2459
samsja wants to merge 1 commit
Adds a per-group retry cap to the rollout scheduler. Each `GroupState` now tracks a `failed_attempts` counter, incremented whenever a batch of rollouts comes back with at least one errored or empty trajectory. When the counter reaches `orchestrator.max_error_reschedule_attempts`, the group is dropped from the current step's batch (cancelling any in-flight rollouts for that group) and the rest of the batch proceeds.

Default is `None` (retry indefinitely, current behavior). Set a value to unblock single-example hangs in agent envs, e.g. a sandbox poll that times out at 60s on every retry. That loop was previously infinite because the AgentError surfaces as `rollout["error"]`, which the scheduler treats as a normal "reschedule the group" signal with no give-up condition.

Minimal local re-implementation of the abandoned PR #2076: keeps the retry-cap behavior without the larger GeneratedBatch / variable group-size / deferred-scoring refactor in that PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
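For readers without the diff open, the retry-cap decision described in this commit message can be sketched roughly as follows. This is a toy model, not the code in `scheduler.py`: the `RetryCapScheduler` class, the `on_rollouts_completed` method, the `"trajectory"` key, and the string return values are all hypothetical names chosen for illustration. Only `GroupState`, `failed_attempts`, `max_error_reschedule_attempts`, `dropped_groups_by_env`, and `rollout["error"]` come from the PR text.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroupState:
    """Hypothetical stand-in for the scheduler's per-group state."""
    env_name: str
    failed_attempts: int = 0


class RetryCapScheduler:
    """Toy scheduler that models only the retry-cap decision."""

    def __init__(self, max_error_reschedule_attempts: Optional[int] = None):
        # None means "retry indefinitely", matching the described default.
        self.max_error_reschedule_attempts = max_error_reschedule_attempts
        self.groups = {}
        self.dropped_groups_by_env = defaultdict(int)

    def on_rollouts_completed(self, group_id, rollouts):
        """Return "accept", "reschedule", or "drop" for a finished batch."""
        group = self.groups[group_id]
        # A batch counts as failed if any rollout errored or came back empty.
        # The "trajectory" key is an assumption made for this sketch.
        failed = any(r.get("error") or not r.get("trajectory") for r in rollouts)
        if not failed:
            return "accept"
        group.failed_attempts += 1
        cap = self.max_error_reschedule_attempts
        if cap is not None and group.failed_attempts >= cap:
            self.dropped_groups_by_env[group.env_name] += 1
            # The real scheduler would also cancel in-flight rollouts here.
            del self.groups[group_id]
            return "drop"
        return "reschedule"
```

With a cap of 2, a group whose rollouts always error is rescheduled once and dropped on the second failure; with the default `None`, the same group is rescheduled every time.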
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 2502565.
    self.empty_rollouts_by_env: dict[str, int] = defaultdict(int)
    self.errored_rollouts_by_env: dict[str, int] = defaultdict(int)
    self.total_rollouts_by_env: dict[str, int] = defaultdict(int)
    self.dropped_groups_by_env: dict[str, int] = defaultdict(int)
dropped_groups_by_env counter never reported or cleared in metrics
Medium Severity
The dropped_groups_by_env counter is incremented when groups are dropped but is never included in get_metrics() and never cleared — unlike all sibling counters (empty_rollouts_by_env, errored_rollouts_by_env, total_rollouts_by_env) which are both reported and cleared in that method. This makes the "observability" counter invisible to monitoring and causes it to accumulate indefinitely across steps rather than resetting per reporting interval.
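One possible shape of the fix, sketched under assumptions: the `SchedulerMetrics` class and the metric key names below are hypothetical, and the real `get_metrics()` may compute its values differently. The point is only that the new counter is reported and then cleared alongside its siblings.

```python
from collections import defaultdict


class SchedulerMetrics:
    """Hypothetical holder for the per-env counters named in the review."""

    def __init__(self):
        self.errored_rollouts_by_env = defaultdict(int)
        self.total_rollouts_by_env = defaultdict(int)
        self.dropped_groups_by_env = defaultdict(int)

    def get_metrics(self):
        """Report the counters for this interval, then reset them."""
        metrics = {}
        for env, total in self.total_rollouts_by_env.items():
            errored = self.errored_rollouts_by_env[env]
            # Key names are illustrative, not the project's actual metric names.
            metrics[f"errored_rollouts/{env}"] = errored / max(total, 1)
        for env, dropped in self.dropped_groups_by_env.items():
            metrics[f"dropped_groups/{env}"] = dropped
        # Clear every counter, including the new one, so that values are
        # per reporting interval instead of accumulating across steps.
        self.errored_rollouts_by_env.clear()
        self.total_rollouts_by_env.clear()
        self.dropped_groups_by_env.clear()
        return metrics
```

Calling `get_metrics()` twice in a row without new activity returns an empty dict the second time, which is the per-interval behavior the review asks for.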


Summary
- New config `orchestrator.max_error_reschedule_attempts: int | None = None` (default: retry indefinitely, current behavior).
- `GroupState.failed_attempts` tracks consecutive batches where at least one rollout came back errored or empty.
- When the counter reaches the cap, the group is dropped from the current step's batch (cancelling any in-flight rollouts via the existing `drop_group`), and the rest of the batch proceeds.
- New `dropped_groups_by_env` counter on the scheduler for observability; warning log on each drop.

Why
Agent envs can hit single-example hangs (e.g. a sandbox poll that times out at 60s on every retry). The error surfaces as `rollout["error"]`, which the scheduler currently treats as a normal "reschedule the group" signal. There is no give-up condition, so a single bad example blocks step progress forever. This adds an opt-in cap so step progress can continue without the bad group.

Notes
Minimal local re-implementation of the abandoned #2076: keeps the retry-cap behavior without the larger `GeneratedBatch` / variable group-size / deferred-scoring refactor in that PR.

Files
- `packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py`: new config field.
- `src/prime_rl/orchestrator/scheduler.py`: `failed_attempts` on `GroupState`, drop logic in the rollout completion path, `dropped_groups_by_env` counter.

🤖 Generated with Claude Code
Note
Low Risk
Low risk because default behavior is unchanged (`max_error_reschedule_attempts=None`), and the new logic only triggers when a rollout group repeatedly returns errors/empty trajectories with the cap enabled.

Overview

Adds an opt-in `orchestrator.max_error_reschedule_attempts` setting to stop retrying rollout groups that repeatedly return errors or empty trajectories.

The scheduler now tracks per-group consecutive failure attempts (`GroupState.failed_attempts`) and, once the configured cap is reached, drops the group from the current training step (cancelling any in-flight rollouts via existing `drop_group`) while allowing the rest of the batch to proceed; it also increments a new `dropped_groups_by_env` counter and logs a warning when a group is dropped.

Reviewed by Cursor Bugbot for commit 2502565. Bugbot is set up for automated code reviews on this repo.