feat(trainer): symmetric DPPO-Binary TV default loss (no KL, no advantage conditioning)#2434

Draft
samsja wants to merge 5 commits into main from feat/dppo-diff-default-loss

Conversation

@samsja
Member

@samsja samsja commented May 7, 2026

Summary

Companion to #2401 (IcePop). Same structural change — drop the Kimi-K2.5 KL term and the advantage-conditioned mask from the default loss, and mask tokens symmetrically — but the masking criterion uses the probability difference $\pi_{\text{train}} - \pi_{\text{infer}}$ (DPPO-Binary TV, arXiv) instead of the importance ratio $\pi_{\text{train}} / \pi_{\text{infer}}$.

$$J(\theta) = \mathbb{E}\left[ \frac{1}{\sum_i |y_i|} \sum_{i,t} \mathbb{1}\!\left[ \pi_{\text{train}}(y_{i,t}) - \pi_{\text{infer}}(y_{i,t}) \in [-\alpha, \beta] \right] \cdot \frac{\pi_{\text{train}}(y_{i,t})}{\pi_{\text{infer}}(y_{i,t})} \cdot \widehat{A}_{i,t} \right]$$

Tokens whose train/infer probability difference falls outside $[-\alpha, \beta]$ get zero policy-gradient weight — they are dropped, not clipped. The mask is symmetric (no longer conditioned on advantage sign). There is no separate KL penalty — the double-sided difference mask is what keeps the update inside the trust region.
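For concreteness, here is a minimal PyTorch sketch of this objective over a flat token stream (shapes and names are illustrative, not the PR's actual implementation; `diff_low`/`diff_high` play the roles of α and β):

```python
import torch

def dppo_binary_tv_loss(
    logp_train: torch.Tensor,  # log pi_train(y_t) from the trainer, shape [T]
    logp_infer: torch.Tensor,  # log pi_infer(y_t) from the sampler, shape [T]
    advantages: torch.Tensor,  # A_hat per token, shape [T]
    diff_low: float = 0.2,     # alpha
    diff_high: float = 0.2,    # beta
) -> torch.Tensor:
    # Symmetric mask on the raw probability difference: outside
    # [-alpha, beta] the token is dropped entirely, not clipped.
    diff = logp_train.exp() - logp_infer.exp()
    mask = ((diff >= -diff_low) & (diff <= diff_high)).float()
    # Importance-weighted policy gradient on the surviving tokens only;
    # the infer logprobs are constants recorded by the sampler.
    ratio = (logp_train - logp_infer).exp()
    per_token = mask * ratio * advantages
    # Normalize by total token count, matching the 1 / sum_i |y_i| factor.
    return -(per_token.sum() / per_token.numel())
```

In the real trainer the sum runs over a padded batch of sequences; the flat `[T]` view here is only to keep the sketch short.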

Breaking config changes (trainer.loss)

| Removed | Added | Default |
| --- | --- | --- |
| `dppo_mask_low` | `dppo_diff_low` (α) | 0.2 |
| `dppo_mask_high` | `dppo_diff_high` (β) | 0.2 |
| `kl_tau` | (term removed) | |

`adv_tau` and `teacher_tau` are unchanged. The `mismatch_kl` / `is_masked*` metrics are retained for observability.
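A hedged toml sketch of a migrated config (the section path and the surrounding values are assumptions; only the renamed keys come from the table above):

```toml
[trainer.loss]
# Symmetric probability-difference mask bounds (alpha, beta).
dppo_diff_low = 0.2   # was dppo_mask_low
dppo_diff_high = 0.2  # was dppo_mask_high
# kl_tau is removed entirely; delete it from existing configs.
adv_tau = 1.0         # unchanged (value illustrative)
teacher_tau = 1.0     # unchanged (value illustrative)
```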

Notes

🤖 Generated with Claude Code

samsja and others added 3 commits May 8, 2026 11:19
Drop the Kimi-K2.5 KL term and the advantage-conditioned mask from the
default loss. Tokens are now masked symmetrically when
π_train - π_infer falls outside [-dppo_diff_low, dppo_diff_high],
regardless of advantage sign. The double-sided difference mask is what
keeps the update inside the trust region.

Renames `dppo_mask_low/high` → `dppo_diff_low/high` and removes
`kl_tau`. `adv_tau` and `teacher_tau` unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Adds a per-group retry cap to the rollout scheduler. Each `GroupState`
now tracks a `failed_attempts` counter, incremented whenever a batch of
rollouts comes back with at least one errored or empty trajectory. When
the counter reaches `orchestrator.max_error_reschedule_attempts`, the
group is dropped from the current step's batch (cancelling any
in-flight rollouts for that group) and the rest of the batch proceeds.

Default is `None` (retry indefinitely, current behavior). Set a value
to unblock single-example hangs in agent envs (e.g. a sandbox poll
that times out at 60s on every retry — that loop was previously
infinite because the AgentError surfaces as `rollout["error"]`, which
the scheduler treats as a normal "reschedule the group" signal with no
give-up condition).

Minimal local re-implementation of the abandoned PR #2076 — keeps the
retry-cap behavior without the larger GeneratedBatch / variable
group-size / deferred-scoring refactor in that PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
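A minimal sketch of the retry-cap bookkeeping this commit describes (everything except `failed_attempts` and `max_error_reschedule_attempts` is a hypothetical name, not the actual scheduler code):

```python
from dataclasses import dataclass

@dataclass
class GroupState:
    # ...existing scheduler fields elided...
    failed_attempts: int = 0  # batches that returned an errored/empty trajectory

def should_drop_group(group: GroupState, max_attempts: int | None) -> bool:
    # None preserves the current retry-forever behavior.
    return max_attempts is not None and group.failed_attempts >= max_attempts

# Scheduler loop, roughly:
#   if any(r.get("error") or not r.get("tokens") for r in rollouts):
#       group.failed_attempts += 1
#       if should_drop_group(group, cfg.max_error_reschedule_attempts):
#           cancel_inflight(group)    # hypothetical helper
#           drop_from_batch(group)    # hypothetical helper
#       else:
#           reschedule(group)         # previously the unconditional path
```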
…etween_tool_calls

Threads two new RendererConfig flags through setup_inference_pool /
StaticInferencePool / ElasticInferencePool / setup_clients into
vf.ClientConfig and create_renderer.

Why: GLM/Qwen renderers strip <think>...</think> from older assistant
turns by default. For RL with multi-turn agents that compact context,
that breaks the trajectory-step prefix property at every turn-rebuild
(re-rendered tokens no longer extend the streamed tokens), causing the
splitter to open extra samples (1 + 2*compactions instead of 1 +
compactions). These flags let the toml opt back into preserve-thinking
without forking the renderer.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
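A hypothetical toml opt-in for the behavior this commit describes (the flag names below are illustrative placeholders; the real RendererConfig field names are defined in the diff):

```toml
[renderer]
# Keep <think>...</think> in prior assistant turns so the re-rendered
# prompt stays a token-prefix of the streamed trajectory across
# context compactions.
preserve_thinking = true                      # hypothetical name
preserve_thinking_between_tool_calls = true   # hypothetical name
```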
@samsja samsja force-pushed the feat/dppo-diff-default-loss branch from fa87fa6 to 9665bd6 on May 9, 2026 04:44