feat(trainer): symmetric DPPO-Binary TV default loss (no KL, no advantage conditioning) #2434
Draft
Conversation
Drop the Kimi-K2.5 KL term and the advantage-conditioned mask from the default loss. Tokens are now masked symmetrically when π_train - π_infer falls outside [-dppo_diff_low, dppo_diff_high], regardless of advantage sign. The double-sided difference mask is what keeps the update inside the trust region.

Renames `dppo_mask_low/high` → `dppo_diff_low/high` and removes `kl_tau`. `adv_tau` and `teacher_tau` are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
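A minimal sketch of the symmetric difference mask this commit describes, assuming per-token probabilities for the sampled actions are already gathered; tensor and function names here are illustrative, not the trainer's actual variables:

```python
import torch

def dppo_binary_tv_mask(
    pi_train: torch.Tensor,   # per-token probs under the training policy
    pi_infer: torch.Tensor,   # per-token probs under the inference policy
    diff_low: float = 0.2,    # dppo_diff_low: lower bound on pi_train - pi_infer
    diff_high: float = 0.2,   # dppo_diff_high: upper bound on pi_train - pi_infer
) -> torch.Tensor:
    """Zero-one mask: 1 where the train/infer probability difference is
    inside [-diff_low, diff_high], 0 elsewhere. Applied regardless of
    advantage sign; out-of-range tokens are dropped, not clipped."""
    diff = pi_train - pi_infer
    return ((diff >= -diff_low) & (diff <= diff_high)).to(pi_train.dtype)

# Schematic policy-gradient term with the mask applied (no separate KL penalty):
# loss = -(mask * advantages * logprobs_train).sum() / mask.sum().clamp(min=1)
```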
Adds a per-group retry cap to the rollout scheduler. Each `GroupState` now tracks a `failed_attempts` counter, incremented whenever a batch of rollouts comes back with at least one errored or empty trajectory. When the counter reaches `orchestrator.max_error_reschedule_attempts`, the group is dropped from the current step's batch (cancelling any in-flight rollouts for that group) and the rest of the batch proceeds. Default is `None` (retry indefinitely, the current behavior).

Set a value to unblock single-example hangs in agent envs, e.g. a sandbox poll that times out at 60s on every retry: that loop was previously infinite because the AgentError surfaces as `rollout["error"]`, which the scheduler treats as a normal "reschedule the group" signal with no give-up condition.

Minimal local re-implementation of the abandoned PR #2076: keeps the retry-cap behavior without the larger GeneratedBatch / variable group-size / deferred-scoring refactor in that PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
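A hedged sketch of the counter bookkeeping. `failed_attempts` and `max_error_reschedule_attempts` are named by the commit; everything else (the dataclass layout, the result strings, the failure check) is illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class GroupState:
    group_id: str
    failed_attempts: int = 0            # incremented per failed rollout batch
    rollouts: list = field(default_factory=list)

def handle_group_result(
    group: GroupState,
    rollouts: list[dict],
    max_error_reschedule_attempts: int | None,  # None = retry indefinitely
) -> str:
    """Returns 'ok', 'reschedule', or 'drop' for the scheduler."""
    # A batch fails if any rollout errored or came back with an empty trajectory.
    failed = any(r.get("error") is not None or not r.get("trajectory") for r in rollouts)
    if not failed:
        return "ok"
    group.failed_attempts += 1
    if (max_error_reschedule_attempts is not None
            and group.failed_attempts >= max_error_reschedule_attempts):
        return "drop"        # cancel in-flight rollouts; the batch proceeds without this group
    return "reschedule"      # previous behavior: keep retrying
```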
…etween_tool_calls

Threads two new RendererConfig flags through setup_inference_pool / StaticInferencePool / ElasticInferencePool / setup_clients into vf.ClientConfig and create_renderer.

Why: GLM/Qwen renderers strip <think>...</think> from older assistant turns by default. For RL with multi-turn agents that compact context, that breaks the trajectory-step prefix property at every turn-rebuild (re-rendered tokens no longer extend the streamed tokens), causing the splitter to open extra samples (1 + 2*compactions instead of 1 + compactions). These flags let the toml opt back into preserve-thinking without forking the renderer.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
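The commit title above is truncated, so the actual flag names are not visible here. A sketch of what the toml opt-in might look like, with hypothetical key names:

```toml
# Hypothetical keys -- the real RendererConfig flag names are cut off in the
# truncated commit title above; this only illustrates the opt-in shape.
[renderer]
preserve_thinking = true                      # keep <think>...</think> in older assistant turns
preserve_thinking_between_tool_calls = true   # keep it across consecutive tool-call turns
```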
Force-pushed from fa87fa6 to 9665bd6.
Summary
Companion to #2401 (IcePop). Same structural change (drop the Kimi-K2.5 KL term and the advantage-conditioned mask from the default loss, mask tokens symmetrically), but the masking criterion uses the probability difference $\pi_{\text{train}} - \pi_{\text{infer}}$ (DPPO-Binary TV, arXiv) instead of the importance ratio $\pi_{\text{train}} / \pi_{\text{infer}}$.

Tokens whose train/infer probability difference falls outside $[-\alpha, \beta]$ get zero policy-gradient weight: they are dropped, not clipped. The mask is symmetric (no longer conditioned on advantage sign). There is no separate KL penalty; the double-sided difference mask is what keeps the update inside the trust region.
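Schematically, with $\alpha$ = `dppo_diff_low` and $\beta$ = `dppo_diff_high`, the per-token mask and objective take this form (the normalization is illustrative, not necessarily the trainer's exact estimator):

$$m_t = \mathbf{1}\!\left[-\alpha \le \pi_{\text{train}}(a_t \mid s_t) - \pi_{\text{infer}}(a_t \mid s_t) \le \beta\right], \qquad \mathcal{L} = -\frac{1}{\sum_t m_t} \sum_t m_t \, A_t \log \pi_{\text{train}}(a_t \mid s_t)$$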
Breaking config changes (`trainer.loss`)

| Old key | New key | Default |
| --- | --- | --- |
| `dppo_mask_low` | `dppo_diff_low` (α) | 0.2 |
| `dppo_mask_high` | `dppo_diff_high` (β) | 0.2 |
| `kl_tau` | removed | n/a |

`adv_tau` and `teacher_tau` are unchanged. The `mismatch_kl` / `is_masked*` metrics are retained for observability.
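A before/after config sketch under the renames above (section path `trainer.loss` as named in the heading; values are the stated defaults):

```toml
[trainer.loss]
# before this PR:
# dppo_mask_low  = 0.2
# dppo_mask_high = 0.2
# kl_tau         = ...   # removed entirely; no replacement
# after:
dppo_diff_low  = 0.2   # alpha: lower bound on pi_train - pi_infer
dppo_diff_high = 0.2   # beta: upper bound on pi_train - pi_infer
```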
Notes

- `importance_ratio` only for in-range tokens.

🤖 Generated with Claude Code