Skip to content

Echo#2677

Draft
snimu wants to merge 265 commits into
mainfrom
feat/per-role-echo
Draft

Echo#2677
snimu wants to merge 265 commits into
mainfrom
feat/per-role-echo

Conversation

@snimu
Copy link
Copy Markdown
Collaborator

@snimu snimu commented May 31, 2026

Introduce the proper echo configs

snimu and others added 30 commits May 22, 2026 16:30
Picks up configs/private 03fd21f — drops the old forth-lang TOMLs (which
were running against the silent SFT-gradient bug fixed in the previous
commit), and adds:

- forth-lang: qwen-rl + 4 cmb-code cells (one per normalizer) + sweep/
- deepdive:   qwen-rl + 4 cmb cells (no tool_names filter) + sweep/
- alpha_sweep_launcher.sh: 7 αs × 4 norms × 2 envs = 56 runs, paired
  with this branch's loss fix so the runs actually exercise SFT.

Topology: 2 nodes / bs=256 throughout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t test

``.cuda()`` after ``requires_grad=True`` makes the device tensor a
non-leaf, so ``.grad`` never populates after backward(). Switch to
``device="cuda"`` in the constructor so it's a leaf on the GPU.

Caught when running the test on the cluster against the gradient fix.
prompt_attribution arrives at the orchestrator as a plain dict through
the verifiers env-server JSON boundary, even though the renderer emits
a RenderedTokens object. The previous code assumed attribute access
(``prompt_attribution.message_indices``), which crashed on the cluster:

    AttributeError: 'dict' object has no attribute 'message_indices'

Handle both forms — getattr fallback for the object case, dict.get for
the wire-serialized case. Per the function's existing precondition (no
runtime type check on ``prompt_attribution: Any``), this is the right
contract: the SFT mask should be computable from whatever shape the
caller passed in as long as the data is present.

Caught when the cluster smoke test crashed at step 0 rollout generation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prompt_attribution always arrives as a plain dict through the verifiers
env-server JSON boundary — the renderer's RenderedTokens object loses
its identity in serialization. Drop the prior defensive isinstance/
getattr branching and just use dict access. Updates the type annotation
on the parameter to ``dict | None`` to match.
The previous design flipped SFT positions into ``loss_mask`` and routed
everything through ``default_loss_fn``. That had two problems:

1. Double normalization on SFT. SFT positions were divided by the per-
   rollout mode weight (α/X, set in ``prepare_sample``) AND by the
   batch-level ``loss_scale`` (= total loss_mask tokens, now including
   SFT). The per-SFT-token gradient ended up at α/(X × loss_scale), an
   ~S-times smaller magnitude than the legacy ``sft_loss_weight / N_rl``
   regime the production α=0.5 was tuned against.

2. Gradient flow on SFT positions went through ``torch.where(sft_mask,
   zeros, log_importance_ratio)``, which blocks the gradient through
   ``trainer_logprobs`` on the True branch. The intent (force IS-ratio
   to 1 for off-policy correction skip) silently disabled the SFT
   signal entirely.

Restructure to match the shape ECHO's algorithm actually has:
- RL term: batch-level normalization by N_rl_batch (the existing
  ``loss_scale``, now RL-only again).
- SFT term: per-rollout normalization by the chosen mode (α, α/S, α/T,
  α×R/S — set in ``prepare_sample`` on the advantage tensor). No batch
  normalization on top.

Implementation:
- ``prepare_sample``: stop flipping SFT into ``loss_mask``. Keep setting
  ``advantage[sft_pos] = mode_weight``.
- ``LossInputs.sft_mask`` removed — no longer needed since SFT and RL
  flow through separate ``compute_loss`` calls.
- ``default_loss_fn``: RL-only. All ``sft_mask``-conditional branches
  dropped.
- New ``sft_pg_loss_fn``: ``loss = -(advantages * trainer_logprobs *
  loss_mask).sum()``. Caller passes the SFT mask as ``loss_mask`` and
  ``loss_scale=1`` so no batch division.
- ``train.py``: second ``compute_loss`` call after the RL one, sharing
  the same ``out["logprobs"]`` tensor — single forward + single backward
  covers both terms.
- Regression test: replaced the broken ``test_default_loss_fn_sft_mask_
  gradient_flows`` with ``test_sft_pg_loss_fn_gradient_magnitude`` that
  exercises the new function directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three follow-ups from review of the SFT/RL split:

1. prepare_sample now clears sft_mask to None when sft_alpha is unset.
   Previously sft_mask propagated downstream regardless, and train.py
   would run sft_pg_loss_fn over the rollout's raw RL advantage on
   tool tokens — reward-shaped, not SFT-direction. With the mask
   cleared, the trainer skips the SFT compute_loss call as intended.

2. _step_sft_mask now bails to all-False when prompt_attribution lacks
   "message_indices" or "is_content" keys, matching the docstring's
   "missing/partial attribution returns all-False" contract. The
   DefaultRenderer leaves these unpopulated; callers shouldn't have
   to special-case the no-attribution path.

3. test_batch.py fixtures updated for the new design: SFT prompt
   tokens stay out of loss_mask (= [False, False, True, True] for the
   2-prompt + 2-completion fixture). Added an assertion that
   prepare_sample clears sft_mask to None when sft_alpha is missing.

Plus a docstring update in batch.py clarifying that "all_tokens" mode
(α/T) is NOT exactly ECHO — the paper uses Z=|𝒪|, which matches
"sft_tokens" when no tool_names filter is set. "all_tokens" with T
(prompt + completion) is strictly stricter.
Per team decision: SFT-on-tool-body is RL credit assignment with a flat
positive advantage on tool body positions. Drops the four-mode
normalization machinery entirely:

- SFTConfig: removed ``normalization`` field. Keep on_tool_outputs,
  alpha (default 1.0), tool_names.
- TrainingSample: removed sft_normalization field.
- prepare_sample: collapsed the 4-mode dispatch to just
  ``advantages[k] = sft_alpha`` on SFT-mask positions.
- orchestrator.py: drop sft_normalization assignment.
- train.py: drop the stale doc comment about per-rollout mode dispatch.
- tests: replaced 4-mode parametric tests with one ``advantage = alpha``
  assertion.

default_loss_fn keeps the torch.where SFT branch (advantage * trainer_logp
on SFT positions, log_IR forced to 0 for KL = 0 on SFT positions) — the
mechanism is correct under the RL-credit-assignment framing, just no
longer dispatched over four modes.

Submodule bump picks up the matching configs/private cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rth-env)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls in renderers @ 5f653df, which adds `RENDERERS_MAX_PROMPT_LEN`
as an env-var override for the pre-flight overflow check in
`_resolve_max_prompt_len`. Lets us re-enable the pre-flight on
clusters whose vllm-router has a broken `/v1/models` handler.

To activate at run time, set `RENDERERS_MAX_PROMPT_LEN=32768` (or the
engine's actual `max_model_len`) in prime-rl's `.env` before launching.
With it set, overlong rollouts raise `OverlongPromptError` cleanly
(caught by `MultiTurnEnv.prompt_too_long`) instead of hitting vLLM and
crashing the orchestrator with a raw `ValueError`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves 12 textual conflicts across the SFT-on-tool-body data path.
The feature's loss math is preserved bit-identical; only the surrounding
plumbing changed.

Trainer-side reconciliation:
- transport/types.py: drop sft_loss bool (replaced by main's training_mode
  literal); keep sft_mask + sft_alpha. MicroBatch carries sft_mask alongside
  main's new training_mode + rewards.
- batch.py: prepare_sample drops gone pixel_values/image_grid_thw/sft_loss
  (replaced by mm_kwargs + training_mode). packed_samples_into_micro_bs
  combines main's existing_len capture with the sft_mask extension.
  pad_micro_batch keeps both sft_mask (False) and routed_experts padding.
- rl/data.py: replace sft_loss with training_mode in FakeDataLoader paths;
  preserve sft_mask tensor conversion alongside main's new RoutedExperts
  struct path.
- rl/loss.py: drop teacher_logprobs handling from default_loss_fn (moved
  to a dedicated opd_loss_fn on main). Fix auto-merge bug that placed
  the sft_nll metrics block inside opd_loss_fn where sft_mask is
  undefined — moved back to default_loss_fn's tail. compute_loss drops
  sft_loss bool (replaced by training_mode); keeps sft_mask threading.
- rl/train.py: swap sft_loss=micro_batch["sft_loss"] for
  training_mode=micro_batch["training_mode"]; keep sft_mask_split.

Orchestrator-side reconciliation:
- configs/orchestrator.py: dedupe state_columns (both sides added it).
  Reformat SFTConfig + new sft: field to match main's docstring-after-
  field style.
- orchestrator/orchestrator.py: preserve the process_rollout helper,
  drop now-gone vlm_cache/cache_key args from interleave_rollout call.
  Combine SFT alpha attachment with sample.training_mode assignment;
  drop the no-longer-existing config.use_sft_loss path.
- orchestrator/trajectories.py: prepare_step_tokens decodes routed_experts
  (b64 → numpy uint8) and carries multi_modal_data alongside the
  per-step sft_mask. make_sample uses main's deferred-routed_experts
  pattern with sft_mask appended. extend_sample keeps the sft_mask
  extension AND main's chunk-list routed_experts state machine.

Tests:
- test_batch.py: take main's renamed test_prepare_batch_does_not_pack_
  mixed_training_mode (was ..._sft_loss); keep the 3 SFT overlay tests.
- test_trajectories.py: drop imports of gone helpers (build_vlm_image_
  cache, _ImageStore, _extract_images_from_*); keep _step_sft_mask; add
  main's align_routed_experts. Append the 8 _step_sft_mask tests after
  main's renderer mm_kwargs assertion test.
- test_loss.py: update SFT gradient regression test to use the new
  setup_loss_fns (plural) + loss_fns= dispatch API.

Submodules:
- configs/private kept at HEAD (a7dc3d6) for the SFT sweep cells.
- deps/verifiers kept at HEAD (cbc191d2) on sebastian/prompt-tool-names-
  2026-05-19. Critical: main's pointer doesn't have prompt_attribution
  or prompt_message_tool_names merged yet, accepting it would silently
  disable the feature. The .gitmodules branch pin survived intact.

Behavioral impact for SFT-on-tool-body sweeps (training_mode="rl",
no teacher): zero change. The only loss-math drift is that
default_loss_fn no longer mixes teacher_tau * teacher_kl.detach() into
advantages — that path moved to a dedicated opd_loss_fn selected via
training_mode="opd" with hardcoded knobs (not a drop-in mix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the env packages used by the SFT-on-tool-outputs experiments
(forth-lang, general-agent, wordle) as workspace members + envs
extras so they're installed by `uv sync --all-extras` and survive
re-syncs. Previously each run's `pre_run_command` did `uv pip
install -e ...` to re-add them after `uv sync` wiped them — that
created a venv-mutation race window with concurrent jobs sharing
this venv.

Also pins fastokens via exclude-newer-package overrides for the
2026-05-17 release (>0.2.0) that's still inside the 7-day
exclude-newer window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three coordinated changes post-merge:

1. deps/renderers: 5f653df (sebastian/preflight-max-prompt-len-env-override)
   → 1831bcc (renderers main). The `RENDERERS_MAX_PROMPT_LEN` env-var
   override workaround is no longer needed now that vllm-router 0.1.25
   fixes its `/v1/models` handler; pre-flight auto-discovery works
   again with the new router. Keeping the feature branch on the
   renderers repo for reference but un-using it here.

2. pyproject.toml: remove the duplicate `fastokens = false` line that
   tripped uv sync after the main-merge auto-resolve. The remaining
   entry (with the renderers-fastokens-override comment) is kept.

3. configs/private: bump to 2a42e1d, which contains the
   `rollouts_per_example → group_size` rename across all configs to
   match upstream PR #2599, plus the new deepdive 1000-step prod cells
   for the α-ablation winners and matching sweep cells for α∈{0.01,
   0.05}.

uv.lock regenerated to reflect router 0.1.25 + the new prime-rl 0.5.0
pinned vLLM wheel + prime-pydantic-config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops forth-lang, general-agent, and wordle from `envs` extras,
workspace members, and `tool.uv.sources`. Those packages live outside
the prime-rl repo (forth-lang) or are project-specific to the
SFT-on-tool-outputs experiments rather than generally useful; pinning
them in prime-rl's pyproject broke CI (`uv sync --all-extras --locked`
can't resolve the absolute path source for forth-lang on a non-cluster
checkout).

Install responsibility now lives in `configs/private`: each
experiment cell's `start.sh` does `uv pip install -e <env_paths>` on
the launching machine and each `prod.toml`'s `[slurm].pre_run_command`
does the same on the compute head node, after the sbatch's own
`uv sync`. deepdive stays in prime-rl (already a workspace member
upstream).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prevents extraneous packages (those installed via \`uv pip install -e\`
outside the lockfile) from being uninstalled on each new compute job's
sbatch startup. The shared NFS venv was being briefly stripped of
forth-lang/general-agent/wordle on every job startup, which crashed
other running jobs' Python subprocesses that re-imported in that
window. With --inexact, uv sync preserves packages not tracked by the
workspace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
snimu and others added 6 commits June 1, 2026 19:48
Resolve conflicts so the diff over main is echo-only. v2 (#2639) is now on main, so take main for v2-drift (advantage/dispatcher/types, Kind→RolloutKind rename, SFT cache_salt #2687) and packaging. Align to main's config hard-break — drop the filters→post_batch_filters alias and the [orchestrator.buffer] FutureWarning shim (echo filter machinery untouched). verifiers stays at 67f958538 (ours) per request; uv.lock reflects that (no nemogym extra).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#2684 (verifiers as an editable path dep, not a workspace member) is now on this branch, so verifiers' nemogym extra-vs-extra conflicts no longer break `uv sync --all-extras` — it only enables prime-rl's own extras. Verified --all-extras resolves cleanly with verifiers main.

67f958538 -> 7f5109e4 picks up: v1 endpoint port pre-probe race fix (#1513), SandboxConfig.labels via #1512 (supersedes local hotfix 76afcac), NeMo Gym integration (#1396), unbounded v1 max_turns (#1517). uv.lock restores verifiers' nemogym/tasksets extra metadata.

renderers was already at newest main (942449c37 / v0.1.8.dev39); only synced the working tree, which had drifted to the older v0.1.8.dev37 tag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collapse the per-role / per-scenario _step_echo_alpha and apply_echo_filter tests into parametrized tests (echo tests 50 -> 16 functions, ~-320 lines) with identical coverage. Merge the 3 prepare_sample echo-overlay batch tests into one parametrized test (and fix an `is True` identity check). Drop 4 pydantic-default framework-glue tests (kwargs default/explicit, required import_path, filter nesting) — the real kwargs=None contract stays covered; keep the custom at-least-one-role validator. Drop the vestigial echo_tokens-mode framing in the loss gradient test.

Could not run pytest locally (darwin venv can't import prime_rl); verified via py_compile + ruff. Run on the remote before relying on it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cut the per-role / EchoConfig / EchoFilterConfig docstrings and the echo comment blocks in transport/types, batch, loss, and trajectories down to essentials (keeping loss.py's gradient-flow rationale). No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Roles default to None (disabled); when one is activated it should contribute gradient out of the box, so per-role alpha now defaults to 1.0 instead of 0.0 (which required also setting a positive alpha to do anything). The alpha=0 "kill RL on assistant tokens" case stays available explicitly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@snimu snimu force-pushed the feat/per-role-echo branch from 52bee52 to 418e264 Compare June 2, 2026 11:45
snimu and others added 23 commits June 2, 2026 11:50
The echo PR carries no dependency changes — it uses main's pins for verifiers (e1d4f2593), renderers (35c2407e9), research-environments (c752781), configs/private (70c3503e1), plus main's pyproject/uv.lock. Reverts the earlier verifiers bump (7f5109e4). Private configs aren't used in this PR (experiments run elsewhere), so main's stuff is fine; issues to be reported separately (see prime-rl-research-env-bump-issue.md).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop a gratuitous reword introduced during the echo doc-trim pass — only the body's echo-filter wiring should differ from main, not the docstring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ho_config

Bind the echo-filter kwargs into the resolved callable via functools.partial in TrainEnv setup, so process_rollout no longer threads kwargs back through env.config.echo. That removes the type-narrowing assert and the env_echo_config alias (echo_config is now inlined into the interleave_rollout call), and apply_echo_filter drops its filter_kwargs param. Removed test_apply_echo_filter_kwargs — kwargs binding is now functools.partial + the EchoFilterConfig.kwargs default, not apply_echo_filter's concern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the in-place "null out where the mask is False" loop in _step_echo_alpha with an equivalent select-comprehension (keep the baseline alpha where the mask is True, None where False). No behavior change — a True still preserves a None baseline, so the filter only narrows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
apply_echo_filter already validates the user filter's output (outer length vs trajectory, inner length vs each step's tokens, bool types), so the downstream re-checks were redundant: drop the outer-length check in interleave_rollout and the inner-length check in _step_echo_alpha. The narrowing zip uses strict=True so an internal misalignment still fails loud instead of silently truncating. Removed the two now-dead length-mismatch tests; the real validation stays covered by test_apply_echo_filter_invalid_raises.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The function builds the per-token echo_alpha array for one trajectory step; the verb-prefixed name reads as a builder (and keeps the echo_alpha field name) rather than looking like a scalar. Updates the def, the interleave_rollout call, the :func: ref, the transport/types comment, and the test import + test names + calls + docstrings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ases

Cleanup pass over the per-role echo feature: simplifications, several
correctness fixes, and restored test coverage.

Loss:
- Decouple echo cross-entropy from default_loss_fn via split_loss_masks
  (RL mask = loss_mask & ~echo_mask, echo mask = loss_mask & echo_mask).
- Normalize RL and echo terms by separate global denominators
  (rl_loss_scale / echo_loss_scale) so echoing prompt/tool tokens no longer
  dilutes ordinary RL completion gradients. RL-only runs are numerically
  unchanged (rl_loss_scale == old loss_scale, echo term zero).
- Reject echo with a custom RL loss and in opd/sft modes (explicit errors).

Fixes:
- Exclude token 0 from echo: its shifted logprob is a placeholder, not a
  valid CE target, so it must not be supervised.
- Validate echo_alpha length in prepare_sample and reject misaligned
  samples in the packer.
- Restore the train_sink echo-filter guard (skip empty-trajectory rollouts)
  so a configured filter can't crash on degenerate rollouts; keep running
  the filter when prompt_attribution is absent.
- tool_names: list -> set with min_length>=1 (empty allow-list rejected;
  None still means all tools); alpha gets allow_inf_nan=False.
- token_export: split RL/echo masks, export echo_mask + rl_loss_mask, and
  mask RL diagnostics to RL positions.

Cleanup:
- Flatten _build_step_echo_alpha; add _extend_optional_token_field helper;
  trim verbose echo docstrings/comments.
- Exclude echo positions from RL metrics and entropy/mismatch_kl logging.

Tests:
- Consolidate echo unit tests; re-add per-role/content baseline cases,
  mixed-roles, filter outer-length validation, and filter exception /
  empty-trajectory coverage; add a token-export test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The cleanup commit reworded EchoConfig's role validator to "requires at
least one of system, user, assistant, or tool" but left the test asserting
match="at least one role", so test_echo_config_rejects_filter_without_role
failed in CI (the rejection itself is correct). Match "at least one of".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…p run)

Temporary experiment pin to launch the alpha=0.005 forth-lang run.
Revert to the main-aligned configs/private pin before the echo PR merges.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up the InterceptionServer port-restart race fix (#1526) plus current
main: Scale-SWE taskset, v1 sandbox runtime lifecycle, unbounded max_turns,
multi-SWE patch extraction, SWE taskset reorg. verifiers is an editable path
dep, so this is the submodule pointer only — run `uv sync --all-extras` on the
remote to pick up any shifted verifiers deps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…air_rl cadence align)

Picks up 4e7aebf in research-configs: migrates [inference.kv_cache_offload]
from flat cpu_bytes to the new tiered/discriminated shape (PR #2689) on
the three forth-lang cells in active use (cmb_code_a0p005, rl_muon, rl),
and aligns glm45air_rl ckpt+eval cadence with the muon cell.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings deps/research-environments up to current main (was 23 commits behind),
including the forth-lang env (#358) used by the current runs, websearch query
fixes, and the legacy RLM environment removal (#426). Submodule pointer only;
run `uv sync --all-extras` on the remote to re-resolve workspace members.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nments #426)

research-environments main (b07ace376) deleted this env in #426, so its
pyproject refs (envs extra + [tool.uv.sources]) must go after the bump.
uv.lock still carries stale entries — regenerate with `uv lock` /
`uv sync --all-extras` on the remote before any --locked use.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
research-environments main routes envs through `verifiers[packages]`, which
pulls verifiers' internal editable path-packages `harnesses` and `tasksets`.
uv requires transitively-pulled path deps to be direct requirements of the
workspace root, so declare both (envs extra + [tool.uv.sources]). uv.lock
regen on the remote; if tasksets' [openenv,openreward,ta] extras drag in
further path deps, `uv lock` will name them next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants