feat(orchestrator): drop groups after max_error_reschedule_attempts by samsja · Pull Request #2459 · PrimeIntellect-ai/prime-rl

samsja · 2026-05-09T16:17:19Z

Summary

Add orchestrator.max_error_reschedule_attempts: int | None = None (default: retry indefinitely, current behavior).
GroupState.failed_attempts tracks consecutive batches where at least one rollout came back errored or empty.
When the counter hits the cap, the group is dropped from the current step's batch (in-flight rollouts cancelled via existing drop_group), and the rest of the batch proceeds.
New dropped_groups_by_env counter on the scheduler for observability; warning log on each drop.
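
The behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `SchedulerSketch`, `on_group_batch_done`, the `env_name` field, and the `"trajectory"` rollout key are hypothetical names chosen for the example. Only `GroupState.failed_attempts`, `drop_group`, `dropped_groups_by_env`, `rollout["error"]`, and `max_error_reschedule_attempts` come from the description.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class GroupState:
    """Pared-down stand-in for the scheduler's per-group state (assumed shape)."""

    env_name: str
    failed_attempts: int = 0


class SchedulerSketch:
    def __init__(self, max_error_reschedule_attempts: int | None = None):
        # None means "retry indefinitely", matching the default described above.
        self.max_error_reschedule_attempts = max_error_reschedule_attempts
        self.groups: dict[int, GroupState] = {}
        self.dropped_groups_by_env: dict[str, int] = defaultdict(int)

    def drop_group(self, group_id: int) -> None:
        # Stand-in for the existing drop_group; the real one also cancels
        # in-flight rollouts for the group.
        self.groups.pop(group_id, None)

    def on_group_batch_done(self, group_id: int, rollouts: list[dict]) -> str:
        """Return "completed", "rescheduled", or "dropped" for this group."""
        group = self.groups[group_id]
        # A batch counts as failed if any rollout errored or came back empty.
        failed = any(r.get("error") or not r.get("trajectory") for r in rollouts)
        if not failed:
            self.groups.pop(group_id)
            return "completed"

        group.failed_attempts += 1
        cap = self.max_error_reschedule_attempts
        if cap is not None and group.failed_attempts >= cap:
            self.dropped_groups_by_env[group.env_name] += 1
            self.drop_group(group_id)
            return "dropped"
        return "rescheduled"
```

With a cap of 2, a group whose rollouts error on every attempt is rescheduled once and dropped on the second failure; with the default `None`, it is rescheduled every time.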

Why

Agent envs can hit single-example hangs (e.g. a sandbox poll that times out at 60s on every retry). The error surfaces as rollout["error"], which the scheduler currently treats as a normal "reschedule the group" signal — there is no give-up condition, so a single bad example blocks step progress forever. This adds an opt-in cap so step progress can continue without the bad group.

Notes

Minimal local re-implementation of the abandoned #2076 — keeps the retry-cap behavior without the larger GeneratedBatch / variable group-size / deferred-scoring refactor in that PR.

Files

packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py — new config field.
src/prime_rl/orchestrator/scheduler.py — failed_attempts on GroupState, drop logic in the rollout completion path, dropped_groups_by_env counter.
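
For illustration, the new config field might look like the sketch below. It uses a plain dataclass to stay self-contained; the class name `OrchestratorConfigSketch` is hypothetical, and the check that rejects caps below 1 is an assumption, not something the PR states.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class OrchestratorConfigSketch:
    # None keeps the current behavior: errored groups are retried indefinitely.
    max_error_reschedule_attempts: int | None = None

    def __post_init__(self) -> None:
        # Assumed validation (not confirmed by the PR): a cap below 1 would
        # drop every group before its first retry, so reject it here.
        cap = self.max_error_reschedule_attempts
        if cap is not None and cap < 1:
            raise ValueError("max_error_reschedule_attempts must be >= 1 or None")
```

Leaving the field unset preserves existing runs; setting it to, say, 3 opts in to dropping a group after three failed attempts.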

🤖 Generated with Claude Code

Note

Low Risk
Low risk because default behavior is unchanged (max_error_reschedule_attempts=None), and the new logic only triggers when a rollout group repeatedly returns errors/empty trajectories with the cap enabled.

Overview
Adds an opt-in orchestrator.max_error_reschedule_attempts setting to stop retrying rollout groups that repeatedly return errors or empty trajectories.

The scheduler now tracks per-group consecutive failure attempts (GroupState.failed_attempts) and, once the configured cap is reached, drops the group from the current training step (cancelling any in-flight rollouts via existing drop_group) while allowing the rest of the batch to proceed; it also increments a new dropped_groups_by_env counter and logs a warning when a group is dropped.

Reviewed by Cursor Bugbot for commit 2502565. Bugbot is set up for automated code reviews on this repo.

Adds a per-group retry cap to the rollout scheduler. Each `GroupState` now tracks a `failed_attempts` counter, incremented whenever a batch of rollouts comes back with at least one errored or empty trajectory. When the counter reaches `orchestrator.max_error_reschedule_attempts`, the group is dropped from the current step's batch (cancelling any in-flight rollouts for that group) and the rest of the batch proceeds.

Default is `None` (retry indefinitely, current behavior). Set a value to unblock single-example hangs in agent envs, e.g. a sandbox poll that times out at 60s on every retry. That loop was previously infinite because the AgentError surfaces as `rollout["error"]`, which the scheduler treats as a normal "reschedule the group" signal with no give-up condition.

Minimal local re-implementation of the abandoned PR #2076: keeps the retry-cap behavior without the larger GeneratedBatch / variable group-size / deferred-scoring refactor in that PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


cursor · 2026-05-09T16:26:53Z

        self.empty_rollouts_by_env: dict[str, int] = defaultdict(int)
        self.errored_rollouts_by_env: dict[str, int] = defaultdict(int)
        self.total_rollouts_by_env: dict[str, int] = defaultdict(int)
+        self.dropped_groups_by_env: dict[str, int] = defaultdict(int)


dropped_groups_by_env counter never reported or cleared in metrics

Medium Severity

The dropped_groups_by_env counter is incremented when groups are dropped but is never included in get_metrics() and never cleared — unlike all sibling counters (empty_rollouts_by_env, errored_rollouts_by_env, total_rollouts_by_env) which are both reported and cleared in that method. This makes the "observability" counter invisible to monitoring and causes it to accumulate indefinitely across steps rather than resetting per reporting interval.

Additional Locations (1)

src/prime_rl/orchestrator/scheduler.py#L469-L470
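
One way to address this finding is to report and clear the new counter alongside its siblings. The sketch below assumes a simplified `get_metrics()`; the class name `SchedulerMetricsSketch` and the metric key names are hypothetical, and the real method may compute ratios or other values not shown here.

```python
from __future__ import annotations

from collections import defaultdict


class SchedulerMetricsSketch:
    def __init__(self) -> None:
        self.total_rollouts_by_env: dict[str, int] = defaultdict(int)
        self.dropped_groups_by_env: dict[str, int] = defaultdict(int)

    def get_metrics(self) -> dict[str, float]:
        metrics: dict[str, float] = {}
        for env, count in self.total_rollouts_by_env.items():
            metrics[f"total_rollouts/{env}"] = count
        # Report dropped groups the same way as the sibling counters.
        for env, count in self.dropped_groups_by_env.items():
            metrics[f"dropped_groups/{env}"] = count
        # Clear every counter so each reporting interval starts from zero.
        self.total_rollouts_by_env.clear()
        self.dropped_groups_by_env.clear()
        return metrics
```

Clearing in the same method that reports keeps the counter per-interval instead of accumulating across steps.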


samsja marked this pull request as ready for review May 9, 2026 16:19

samsja mentioned this pull request May 9, 2026

feat(trainer): symmetric DPPO + KL on unmasked tokens (vLLM 0.19 base) #2460

Draft

cursor Bot reviewed May 9, 2026

samsja wants to merge 1 commit into main from feat/orchestrator-drop-groups-after-max-attempts
