Echo#2677
Draft
snimu wants to merge 265 commits into
Draft
Conversation
Picks up configs/private 03fd21f — drops the old forth-lang TOMLs (which were running against the silent SFT-gradient bug fixed in the previous commit), and adds: - forth-lang: qwen-rl + 4 cmb-code cells (one per normalizer) + sweep/ - deepdive: qwen-rl + 4 cmb cells (no tool_names filter) + sweep/ - alpha_sweep_launcher.sh: 7 αs × 4 norms × 2 envs = 56 runs, paired with this branch's loss fix so the runs actually exercise SFT. Topology: 2 nodes / bs=256 throughout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t test ``.cuda()`` after ``requires_grad=True`` makes the device tensor a non-leaf, so ``.grad`` never populates after backward(). Switch to ``device="cuda"`` in the constructor so it's a leaf on the GPU. Caught when running the test on the cluster against the gradient fix.
prompt_attribution arrives at the orchestrator as a plain dict through
the verifiers env-server JSON boundary, even though the renderer emits
a RenderedTokens object. The previous code assumed attribute access
(``prompt_attribution.message_indices``), which crashed on the cluster:
AttributeError: 'dict' object has no attribute 'message_indices'
Handle both forms — getattr fallback for the object case, dict.get for
the wire-serialized case. Per the function's existing precondition (no
runtime type check on ``prompt_attribution: Any``), this is the right
contract: the SFT mask should be computable from whatever shape the
caller passed in as long as the data is present.
Caught when the cluster smoke test crashed at step 0 rollout generation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prompt_attribution always arrives as a plain dict through the verifiers env-server JSON boundary — the renderer's RenderedTokens object loses its identity in serialization. Drop the prior defensive isinstance/ getattr branching and just use dict access. Updates the type annotation on the parameter to ``dict | None`` to match.
The previous design flipped SFT positions into ``loss_mask`` and routed everything through ``default_loss_fn``. That had two problems: 1. Double normalization on SFT. SFT positions were divided by the per- rollout mode weight (α/X, set in ``prepare_sample``) AND by the batch-level ``loss_scale`` (= total loss_mask tokens, now including SFT). The per-SFT-token gradient ended up at α/(X × loss_scale), an ~S-times smaller magnitude than the legacy ``sft_loss_weight / N_rl`` regime the production α=0.5 was tuned against. 2. Gradient flow on SFT positions went through ``torch.where(sft_mask, zeros, log_importance_ratio)``, which blocks the gradient through ``trainer_logprobs`` on the True branch. The intent (force IS-ratio to 1 for off-policy correction skip) silently disabled the SFT signal entirely. Restructure to match the shape ECHO's algorithm actually has: - RL term: batch-level normalization by N_rl_batch (the existing ``loss_scale``, now RL-only again). - SFT term: per-rollout normalization by the chosen mode (α, α/S, α/T, α×R/S — set in ``prepare_sample`` on the advantage tensor). No batch normalization on top. Implementation: - ``prepare_sample``: stop flipping SFT into ``loss_mask``. Keep setting ``advantage[sft_pos] = mode_weight``. - ``LossInputs.sft_mask`` removed — no longer needed since SFT and RL flow through separate ``compute_loss`` calls. - ``default_loss_fn``: RL-only. All ``sft_mask``-conditional branches dropped. - New ``sft_pg_loss_fn``: ``loss = -(advantages * trainer_logprobs * loss_mask).sum()``. Caller passes the SFT mask as ``loss_mask`` and ``loss_scale=1`` so no batch division. - ``train.py``: second ``compute_loss`` call after the RL one, sharing the same ``out["logprobs"]`` tensor — single forward + single backward covers both terms. - Regression test: replaced the broken ``test_default_loss_fn_sft_mask_ gradient_flows`` with ``test_sft_pg_loss_fn_gradient_magnitude`` that exercises the new function directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three follow-ups from review of the SFT/RL split: 1. prepare_sample now clears sft_mask to None when sft_alpha is unset. Previously sft_mask propagated downstream regardless, and train.py would run sft_pg_loss_fn over the rollout's raw RL advantage on tool tokens — reward-shaped, not SFT-direction. With the mask cleared, the trainer skips the SFT compute_loss call as intended. 2. _step_sft_mask now bails to all-False when prompt_attribution lacks "message_indices" or "is_content" keys, matching the docstring's "missing/partial attribution returns all-False" contract. The DefaultRenderer leaves these unpopulated; callers shouldn't have to special-case the no-attribution path. 3. test_batch.py fixtures updated for the new design: SFT prompt tokens stay out of loss_mask (= [False, False, True, True] for the 2-prompt + 2-completion fixture). Added an assertion that prepare_sample clears sft_mask to None when sft_alpha is missing. Plus a docstring update in batch.py clarifying that "all_tokens" mode (α/T) is NOT exactly ECHO — the paper uses Z=|𝒪|, which matches "sft_tokens" when no tool_names filter is set. "all_tokens" with T (prompt + completion) is strictly stricter.
…dling" This reverts commit 6d831cd.
…lization" This reverts commit bc01f94.
Per team decision: SFT-on-tool-body is RL credit assignment with a flat positive advantage on tool body positions. Drops the four-mode normalization machinery entirely: - SFTConfig: removed ``normalization`` field. Keep on_tool_outputs, alpha (default 1.0), tool_names. - TrainingSample: removed sft_normalization field. - prepare_sample: collapsed the 4-mode dispatch to just ``advantages[k] = sft_alpha`` on SFT-mask positions. - orchestrator.py: drop sft_normalization assignment. - train.py: drop the stale doc comment about per-rollout mode dispatch. - tests: replaced 4-mode parametric tests with one ``advantage = alpha`` assertion. default_loss_fn keeps the torch.where SFT branch (advantage * trainer_logp on SFT positions, log_IR forced to 0 for KL = 0 on SFT positions) — the mechanism is correct under the RL-credit-assignment framing, just no longer dispatched over four modes. Submodule bump picks up the matching configs/private cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rth-env) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls in renderers @ 5f653df, which adds `RENDERERS_MAX_PROMPT_LEN` as an env-var override for the pre-flight overflow check in `_resolve_max_prompt_len`. Lets us re-enable the pre-flight on clusters whose vllm-router has a broken `/v1/models` handler. To activate at run time, set `RENDERERS_MAX_PROMPT_LEN=32768` (or the engine's actual `max_model_len`) in prime-rl's `.env` before launching. With it set, overlong rollouts raise `OverlongPromptError` cleanly (caught by `MultiTurnEnv.prompt_too_long`) instead of hitting vLLM and crashing the orchestrator with a raw `ValueError`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves 12 textual conflicts across the SFT-on-tool-body data path. The feature's loss math is preserved bit-identical; only the surrounding plumbing changed. Trainer-side reconciliation: - transport/types.py: drop sft_loss bool (replaced by main's training_mode literal); keep sft_mask + sft_alpha. MicroBatch carries sft_mask alongside main's new training_mode + rewards. - batch.py: prepare_sample drops gone pixel_values/image_grid_thw/sft_loss (replaced by mm_kwargs + training_mode). packed_samples_into_micro_bs combines main's existing_len capture with the sft_mask extension. pad_micro_batch keeps both sft_mask (False) and routed_experts padding. - rl/data.py: replace sft_loss with training_mode in FakeDataLoader paths; preserve sft_mask tensor conversion alongside main's new RoutedExperts struct path. - rl/loss.py: drop teacher_logprobs handling from default_loss_fn (moved to a dedicated opd_loss_fn on main). Fix auto-merge bug that placed the sft_nll metrics block inside opd_loss_fn where sft_mask is undefined — moved back to default_loss_fn's tail. compute_loss drops sft_loss bool (replaced by training_mode); keeps sft_mask threading. - rl/train.py: swap sft_loss=micro_batch["sft_loss"] for training_mode=micro_batch["training_mode"]; keep sft_mask_split. Orchestrator-side reconciliation: - configs/orchestrator.py: dedupe state_columns (both sides added it). Reformat SFTConfig + new sft: field to match main's docstring-after- field style. - orchestrator/orchestrator.py: preserve the process_rollout helper, drop now-gone vlm_cache/cache_key args from interleave_rollout call. Combine SFT alpha attachment with sample.training_mode assignment; drop the no-longer-existing config.use_sft_loss path. - orchestrator/trajectories.py: prepare_step_tokens decodes routed_experts (b64 → numpy uint8) and carries multi_modal_data alongside the per-step sft_mask. make_sample uses main's deferred-routed_experts pattern with sft_mask appended. extend_sample keeps the sft_mask extension AND main's chunk-list routed_experts state machine. Tests: - test_batch.py: take main's renamed test_prepare_batch_does_not_pack_ mixed_training_mode (was ..._sft_loss); keep the 3 SFT overlay tests. - test_trajectories.py: drop imports of gone helpers (build_vlm_image_ cache, _ImageStore, _extract_images_from_*); keep _step_sft_mask; add main's align_routed_experts. Append the 8 _step_sft_mask tests after main's renderer mm_kwargs assertion test. - test_loss.py: update SFT gradient regression test to use the new setup_loss_fns (plural) + loss_fns= dispatch API. Submodules: - configs/private kept at HEAD (a7dc3d6) for the SFT sweep cells. - deps/verifiers kept at HEAD (cbc191d2) on sebastian/prompt-tool-names- 2026-05-19. Critical: main's pointer doesn't have prompt_attribution or prompt_message_tool_names merged yet, accepting it would silently disable the feature. The .gitmodules branch pin survived intact. Behavioral impact for SFT-on-tool-body sweeps (training_mode="rl", no teacher): zero change. The only loss-math drift is that default_loss_fn no longer mixes teacher_tau * teacher_kl.detach() into advantages — that path moved to a dedicated opd_loss_fn selected via training_mode="opd" with hardcoded knobs (not a drop-in mix). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the env packages used by the SFT-on-tool-outputs experiments (forth-lang, general-agent, wordle) as workspace members + envs extras so they're installed by `uv sync --all-extras` and survive re-syncs. Previously each run's `pre_run_command` did `uv pip install -e ...` to re-add them after `uv sync` wiped them — that created a venv-mutation race window with concurrent jobs sharing this venv. Also pins fastokens via exclude-newer-package overrides for the 2026-05-17 release (>0.2.0) that's still inside the 7-day exclude-newer window. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three coordinated changes post-merge: 1. deps/renderers: 5f653df (sebastian/preflight-max-prompt-len-env-override) → 1831bcc (renderers main). The `RENDERERS_MAX_PROMPT_LEN` env-var override workaround is no longer needed now that vllm-router 0.1.25 fixes its `/v1/models` handler; pre-flight auto-discovery works again with the new router. Keeping the feature branch on the renderers repo for reference but un-using it here. 2. pyproject.toml: remove the duplicate `fastokens = false` line that tripped uv sync after the main-merge auto-resolve. The remaining entry (with the renderers-fastokens-override comment) is kept. 3. configs/private: bump to 2a42e1d, which contains the `rollouts_per_example → group_size` rename across all configs to match upstream PR #2599, plus the new deepdive 1000-step prod cells for the α-ablation winners and matching sweep cells for α∈{0.01, 0.05}. uv.lock regenerated to reflect router 0.1.25 + the new prime-rl 0.5.0 pinned vLLM wheel + prime-pydantic-config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops forth-lang, general-agent, and wordle from `envs` extras, workspace members, and `tool.uv.sources`. Those packages live outside the prime-rl repo (forth-lang) or are project-specific to the SFT-on-tool-outputs experiments rather than generally useful; pinning them in prime-rl's pyproject broke CI (`uv sync --all-extras --locked` can't resolve the absolute path source for forth-lang on a non-cluster checkout). Install responsibility now lives in `configs/private`: each experiment cell's `start.sh` does `uv pip install -e <env_paths>` on the launching machine and each `prod.toml`'s `[slurm].pre_run_command` does the same on the compute head node, after the sbatch's own `uv sync`. deepdive stays in prime-rl (already a workspace member upstream). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prevents extraneous packages (those installed via \`uv pip install -e\` outside the lockfile) from being uninstalled on each new compute job's sbatch startup. The shared NFS venv was being briefly stripped of forth-lang/general-agent/wordle on every job startup, which crashed other running jobs' Python subprocesses that re-imported in that window. With --inexact, uv sync preserves packages not tracked by the workspace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve conflicts so the diff over main is echo-only. v2 (#2639) is now on main, so take main for v2-drift (advantage/dispatcher/types, Kind→RolloutKind rename, SFT cache_salt #2687) and packaging. Align to main's config hard-break — drop the filters→post_batch_filters alias and the [orchestrator.buffer] FutureWarning shim (echo filter machinery untouched). verifiers stays at 67f958538 (ours) per request; uv.lock reflects that (no nemogym extra). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#2684 (verifiers as an editable path dep, not a workspace member) is now on this branch, so verifiers' nemogym extra-vs-extra conflicts no longer break `uv sync --all-extras` — it only enables prime-rl's own extras. Verified --all-extras resolves cleanly with verifiers main. 67f958538 -> 7f5109e4 picks up: v1 endpoint port pre-probe race fix (#1513), SandboxConfig.labels via #1512 (supersedes local hotfix 76afcac), NeMo Gym integration (#1396), unbounded v1 max_turns (#1517). uv.lock restores verifiers' nemogym/tasksets extra metadata. renderers was already at newest main (942449c37 / v0.1.8.dev39); only synced the working tree, which had drifted to the older v0.1.8.dev37 tag. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collapse the per-role / per-scenario _step_echo_alpha and apply_echo_filter tests into parametrized tests (echo tests 50 -> 16 functions, ~-320 lines) with identical coverage. Merge the 3 prepare_sample echo-overlay batch tests into one parametrized test (and fix an `is True` identity check). Drop 4 pydantic-default framework-glue tests (kwargs default/explicit, required import_path, filter nesting) — the real kwargs=None contract stays covered; keep the custom at-least-one-role validator. Drop the vestigial echo_tokens-mode framing in the loss gradient test. Could not run pytest locally (darwin venv can't import prime_rl); verified via py_compile + ruff. Run on the remote before relying on it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cut the per-role / EchoConfig / EchoFilterConfig docstrings and the echo comment blocks in transport/types, batch, loss, and trajectories down to essentials (keeping loss.py's gradient-flow rationale). No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Roles default to None (disabled); when one is activated it should contribute gradient out of the box, so per-role alpha now defaults to 1.0 instead of 0.0 (which required also setting a positive alpha to do anything). The alpha=0 "kill RL on assistant tokens" case stays available explicitly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The echo PR carries no dependency changes — it uses main's pins for verifiers (e1d4f2593), renderers (35c2407e9), research-environments (c752781), configs/private (70c3503e1), plus main's pyproject/uv.lock. Reverts the earlier verifiers bump (7f5109e4). Private configs aren't used in this PR (experiments run elsewhere), so main's stuff is fine; issues to be reported separately (see prime-rl-research-env-bump-issue.md). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop a gratuitous reword introduced during the echo doc-trim pass — only the body's echo-filter wiring should differ from main, not the docstring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ho_config Bind the echo-filter kwargs into the resolved callable via functools.partial in TrainEnv setup, so process_rollout no longer threads kwargs back through env.config.echo. That removes the type-narrowing assert and the env_echo_config alias (echo_config is now inlined into the interleave_rollout call), and apply_echo_filter drops its filter_kwargs param. Removed test_apply_echo_filter_kwargs — kwargs binding is now functools.partial + the EchoFilterConfig.kwargs default, not apply_echo_filter's concern. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the in-place "null out where the mask is False" loop in _step_echo_alpha with an equivalent select-comprehension (keep the baseline alpha where the mask is True, None where False). No behavior change — a True still preserves a None baseline, so the filter only narrows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
apply_echo_filter already validates the user filter's output (outer length vs trajectory, inner length vs each step's tokens, bool types), so the downstream re-checks were redundant: drop the outer-length check in interleave_rollout and the inner-length check in _step_echo_alpha. The narrowing zip uses strict=True so an internal misalignment still fails loud instead of silently truncating. Removed the two now-dead length-mismatch tests; the real validation stays covered by test_apply_echo_filter_invalid_raises. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The function builds the per-token echo_alpha array for one trajectory step; the verb-prefixed name reads as a builder (and keeps the echo_alpha field name) rather than looking like a scalar. Updates the def, the interleave_rollout call, the :func: ref, the transport/types comment, and the test import + test names + calls + docstrings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ases Cleanup pass over the per-role echo feature: simplifications, several correctness fixes, and restored test coverage. Loss: - Decouple echo cross-entropy from default_loss_fn via split_loss_masks (RL mask = loss_mask & ~echo_mask, echo mask = loss_mask & echo_mask). - Normalize RL and echo terms by separate global denominators (rl_loss_scale / echo_loss_scale) so echoing prompt/tool tokens no longer dilutes ordinary RL completion gradients. RL-only runs are numerically unchanged (rl_loss_scale == old loss_scale, echo term zero). - Reject echo with a custom RL loss and in opd/sft modes (explicit errors). Fixes: - Exclude token 0 from echo: its shifted logprob is a placeholder, not a valid CE target, so it must not be supervised. - Validate echo_alpha length in prepare_sample and reject misaligned samples in the packer. - Restore the train_sink echo-filter guard (skip empty-trajectory rollouts) so a configured filter can't crash on degenerate rollouts; keep running the filter when prompt_attribution is absent. - tool_names: list -> set with min_length>=1 (empty allow-list rejected; None still means all tools); alpha gets allow_inf_nan=False. - token_export: split RL/echo masks, export echo_mask + rl_loss_mask, and mask RL diagnostics to RL positions. Cleanup: - Flatten _build_step_echo_alpha; add _extend_optional_token_field helper; trim verbose echo docstrings/comments. - Exclude echo positions from RL metrics and entropy/mismatch_kl logging. Tests: - Consolidate echo unit tests; re-add per-role/content baseline cases, mixed-roles, filter outer-length validation, and filter exception / empty-trajectory coverage; add a token-export test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The cleanup commit reworded EchoConfig's role validator to "requires at least one of system, user, assistant, or tool" but left the test asserting match="at least one role", so test_echo_config_rejects_filter_without_role failed in CI (the rejection itself is correct). Match "at least one of". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…p run) Temporary experiment pin to launch the alpha=0.005 forth-lang run. Revert to the main-aligned configs/private pin before the echo PR merges. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up the InterceptionServer port-restart race fix (#1526) plus current main: Scale-SWE taskset, v1 sandbox runtime lifecycle, unbounded max_turns, multi-SWE patch extraction, SWE taskset reorg. verifiers is an editable path dep, so this is the submodule pointer only — run `uv sync --all-extras` on the remote to pick up any shifted verifiers deps. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…air_rl cadence align) Picks up 4e7aebf in research-configs: migrates [inference.kv_cache_offload] from flat cpu_bytes to the new tiered/discriminated shape (PR #2689) on the three forth-lang cells in active use (cmb_code_a0p005, rl_muon, rl), and aligns glm45air_rl ckpt+eval cadence with the muon cell. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings deps/research-environments up to current main (was 23 commits behind), including the forth-lang env (#358) used by the current runs, websearch query fixes, and the legacy RLM environment removal (#426). Submodule pointer only; run `uv sync --all-extras` on the remote to re-resolve workspace members. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nments #426) research-environments main (b07ace376) deleted this env in #426, so its pyproject refs (envs extra + [tool.uv.sources]) must go after the bump. uv.lock still carries stale entries — regenerate with `uv lock` / `uv sync --all-extras` on the remote before any --locked use. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
research-environments main routes envs through `verifiers[packages]`, which pulls verifiers' internal editable path-packages `harnesses` and `tasksets`. uv requires transitively-pulled path deps to be direct requirements of the workspace root, so declare both (envs extra + [tool.uv.sources]). uv.lock regen on the remote; if tasksets' [openenv,openreward,ta] extras drag in further path deps, `uv lock` will name them next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Introduce the proper echo configs