feat(trainer): symmetric DPPO-Binary TV default loss (no KL, no advantage conditioning) #2434
Draft
Conversation
Drop the Kimi-K2.5 KL term and the advantage-conditioned mask from the default loss. Tokens are now masked symmetrically when π_train - π_infer falls outside [-dppo_diff_low, dppo_diff_high], regardless of advantage sign. The double-sided difference mask is what keeps the update inside the trust region.

Renames `dppo_mask_low/high` → `dppo_diff_low/high` and removes `kl_tau`. `adv_tau` and `teacher_tau` are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
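A minimal sketch of the symmetric difference mask this commit describes, assuming per-token probabilities for the sampled actions are already gathered; tensor and function names here are illustrative, not the trainer's actual variables:

```python
import torch

def dppo_binary_tv_mask(
    pi_train: torch.Tensor,   # per-token probs under the training policy
    pi_infer: torch.Tensor,   # per-token probs under the inference policy
    diff_low: float = 0.2,    # dppo_diff_low: lower bound on pi_train - pi_infer
    diff_high: float = 0.2,   # dppo_diff_high: upper bound on pi_train - pi_infer
) -> torch.Tensor:
    """Zero-one mask: 1 where the train/infer probability difference is
    inside [-diff_low, diff_high], 0 elsewhere. Applied regardless of
    advantage sign; out-of-range tokens are dropped, not clipped."""
    diff = pi_train - pi_infer
    return ((diff >= -diff_low) & (diff <= diff_high)).to(pi_train.dtype)

# Schematic policy-gradient term with the mask applied (no separate KL penalty):
# loss = -(mask * advantages * logprobs_train).sum() / mask.sum().clamp(min=1)
```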
Adds a per-group retry cap to the rollout scheduler. Each `GroupState` now tracks a `failed_attempts` counter, incremented whenever a batch of rollouts comes back with at least one errored or empty trajectory. When the counter reaches `orchestrator.max_error_reschedule_attempts`, the group is dropped from the current step's batch (cancelling any in-flight rollouts for that group) and the rest of the batch proceeds. Default is `None` (retry indefinitely, the current behavior).

Set a value to unblock single-example hangs in agent envs, e.g. a sandbox poll that times out at 60s on every retry: that loop was previously infinite because the AgentError surfaces as `rollout["error"]`, which the scheduler treats as a normal "reschedule the group" signal with no give-up condition.

Minimal local re-implementation of the abandoned PR #2076: keeps the retry-cap behavior without the larger GeneratedBatch / variable group-size / deferred-scoring refactor in that PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
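A hedged sketch of the counter bookkeeping. `failed_attempts` and `max_error_reschedule_attempts` are named by the commit; everything else (the dataclass layout, the result strings, the failure check) is illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class GroupState:
    group_id: str
    failed_attempts: int = 0            # incremented per failed rollout batch
    rollouts: list = field(default_factory=list)

def handle_group_result(
    group: GroupState,
    rollouts: list[dict],
    max_error_reschedule_attempts: int | None,  # None = retry indefinitely
) -> str:
    """Returns 'ok', 'reschedule', or 'drop' for the scheduler."""
    # A batch fails if any rollout errored or came back with an empty trajectory.
    failed = any(r.get("error") is not None or not r.get("trajectory") for r in rollouts)
    if not failed:
        return "ok"
    group.failed_attempts += 1
    if (max_error_reschedule_attempts is not None
            and group.failed_attempts >= max_error_reschedule_attempts):
        return "drop"        # cancel in-flight rollouts; the batch proceeds without this group
    return "reschedule"      # previous behavior: keep retrying
```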
…etween_tool_calls

Threads two new RendererConfig flags through setup_inference_pool / StaticInferencePool / ElasticInferencePool / setup_clients into vf.ClientConfig and create_renderer.

Why: GLM/Qwen renderers strip <think>...</think> from older assistant turns by default. For RL with multi-turn agents that compact context, that breaks the trajectory-step prefix property at every turn-rebuild (re-rendered tokens no longer extend the streamed tokens), causing the splitter to open extra samples (1 + 2*compactions instead of 1 + compactions). These flags let the toml opt back into preserve-thinking without forking the renderer.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
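The commit title above is truncated, so the actual flag names are not visible here. A sketch of what the toml opt-in might look like, with hypothetical key names:

```toml
# Hypothetical keys -- the real RendererConfig flag names are cut off in the
# truncated commit title above; this only illustrates the opt-in shape.
[renderer]
preserve_thinking = true                      # keep <think>...</think> in older assistant turns
preserve_thinking_between_tool_calls = true   # keep it across consecutive tool-call turns
```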
Force-pushed from fa87fa6 to 9665bd6.
Summary
Companion to #2401 (IcePop). Same structural change (drop the Kimi-K2.5 KL term and the advantage-conditioned mask from the default loss, mask tokens symmetrically), but the masking criterion uses the probability difference $\pi_{\text{train}} - \pi_{\text{infer}}$ (DPPO-Binary TV, arXiv) instead of the importance ratio $\pi_{\text{train}} / \pi_{\text{infer}}$.

Tokens whose train/infer probability difference falls outside $[-\alpha, \beta]$ get zero policy-gradient weight: they are dropped, not clipped. The mask is symmetric (no longer conditioned on advantage sign). There is no separate KL penalty; the double-sided difference mask is what keeps the update inside the trust region.
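Schematically, with $\alpha$ = `dppo_diff_low` and $\beta$ = `dppo_diff_high`, the per-token mask and objective take this form (the normalization is illustrative, not necessarily the trainer's exact estimator):

$$m_t = \mathbf{1}\!\left[-\alpha \le \pi_{\text{train}}(a_t \mid s_t) - \pi_{\text{infer}}(a_t \mid s_t) \le \beta\right], \qquad \mathcal{L} = -\frac{1}{\sum_t m_t} \sum_t m_t \, A_t \log \pi_{\text{train}}(a_t \mid s_t)$$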
Breaking config changes (`trainer.loss`)

| Old key | New key | Default |
| --- | --- | --- |
| `dppo_mask_low` | `dppo_diff_low` (α) | 0.2 |
| `dppo_mask_high` | `dppo_diff_high` (β) | 0.2 |
| `kl_tau` | removed | n/a |

`adv_tau` and `teacher_tau` are unchanged. The `mismatch_kl` / `is_masked*` metrics are retained for observability.
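A before/after config sketch under the renames above (section path `trainer.loss` as named in the heading; values are the stated defaults):

```toml
[trainer.loss]
# before this PR:
# dppo_mask_low  = 0.2
# dppo_mask_high = 0.2
# kl_tau         = ...   # removed entirely; no replacement
# after:
dppo_diff_low  = 0.2   # alpha: lower bound on pi_train - pi_infer
dppo_diff_high = 0.2   # beta: upper bound on pi_train - pi_infer
```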
Notes

- `importance_ratio` only for in-range tokens.

🤖 Generated with Claude Code