Draft

Echo #2677

267 commits
2dbe8d5
fix(test): create trainer_logprobs as leaf cuda tensor in SFT gradien…
snimu May 22, 2026
1bd5ff3
chore(configs): bump configs/private (start.sh-style sweep wrapper)
snimu May 22, 2026
565ea6b
chore(configs): bump configs/private (PATH fix for non-login SSH)
snimu May 22, 2026
658f6c9
fix(orchestrator): _step_sft_mask handles dict prompt_attribution
snimu May 22, 2026
537c5af
fix(orchestrator): _step_sft_mask reads prompt_attribution as dict
snimu May 22, 2026
bc01f94
fix(loss): split SFT and RL loss paths to remove double normalization
snimu May 22, 2026
6d831cd
fix(sft): clear sft_mask without sft_alpha + missing-attr handling
snimu May 22, 2026
9cde634
Revert "fix(sft): clear sft_mask without sft_alpha + missing-attr han…
snimu May 22, 2026
acc3365
Revert "fix(loss): split SFT and RL loss paths to remove double norma…
snimu May 22, 2026
8b7ab3d
chore(configs): bump configs/private (none-only sweep launcher)
snimu May 22, 2026
767df36
chore(configs): bump configs/private (3 nodes/run, 2 infer replicas)
snimu May 22, 2026
7ea4cb7
sft: drop four-mode normalization, treat as RL credit assignment
snimu May 22, 2026
4a72045
chore: remove accidentally committed .DS_Store
snimu May 22, 2026
a18d339
chore(gitignore): ignore .DS_Store
snimu May 22, 2026
76cd355
chore(configs): bump configs/private (launcher polls squeue)
snimu May 22, 2026
671657f
chore(configs): bump configs/private (deepdive drops zero_advantage f…
snimu May 22, 2026
87e8621
chore(configs): bump configs/private (eval=100, cmb-all variant)
snimu May 23, 2026
c7f3540
chore(configs): bump configs/private (forth-lang prod cells)
snimu May 23, 2026
fb605d0
chore(configs): bump configs/private (dd sweep cells dd_a0p1 + dd_a1p0)
snimu May 23, 2026
bb5c179
chore(deps): bump research-environments (merge main into sebastian/fo…
snimu May 24, 2026
4a3b246
chore(deps): bump renderers to preflight-max-prompt-len-env-override
snimu May 25, 2026
cee0668
Merge main into feat/sft-on-tool-outputs
snimu May 25, 2026
3689db4
chore(deps): add forth-lang, general-agent, wordle envs to workspace
snimu May 25, 2026
3e0d3f4
chore(deps): revert renderers feature branch + bump configs/private
snimu May 25, 2026
a358c2d
chore(deps): revert cluster-specific env additions to pyproject
snimu May 25, 2026
4514139
chore(configs): bump configs/private for env install ownership
snimu May 25, 2026
1f65ec8
chore(configs): bump configs/private for deepdive launcher env install
snimu May 25, 2026
38fa64e
fix(slurm): use `uv sync --inexact` in sbatch templates
snimu May 25, 2026
6d5a6b1
chore(configs): bump configs/private for --inexact uv sync
snimu May 25, 2026
beb1528
fix(slurm): add --no-install-project to uv sync in sbatch templates
snimu May 26, 2026
5304c8e
chore(configs): bump for --no-install-project sync
snimu May 26, 2026
b44c854
feat(orchestrator): add orchestrator v2 with shared train+eval scheduler
mikasenghaas May 26, 2026
59d45c4
chore(configs): bump configs/private for nfpt forth-lang variants
snimu May 26, 2026
a3917c2
refactor(orchestrator_v2): address review feedback
mikasenghaas May 26, 2026
2609b01
refactor(orchestrator_v2): decompose batcher into single-purpose comp…
mikasenghaas May 26, 2026
aeecbbd
refactor(orchestrator_v2): rollout-as-atom + train/eval sinks with fl…
mikasenghaas May 27, 2026
766a407
refactor(orchestrator_v2): three-level sinks (rollout / group / batch)
mikasenghaas May 27, 2026
d2ef0f5
refactor(orchestrator_v2): unify is_batch_done trigger across train a…
mikasenghaas May 27, 2026
7274ab6
chore(configs): replace orch_v2 debug overlay with hendrycks_math sho…
mikasenghaas May 27, 2026
6bc11e1
refactor(orchestrator_v2/ckpt): drop unused last_eval_step from Progress
mikasenghaas May 27, 2026
d3b70cd
refactor(orchestrator_v2): drop pre-baked metrics from batch result t…
mikasenghaas May 27, 2026
75da0a3
refactor(orchestrator_v2): consolidate dataclasses + types into types.py
mikasenghaas May 27, 2026
685c24a
chore(configs): drop max_completion_tokens cap from orch_v2_demo
mikasenghaas May 27, 2026
15953cb
refactor(orchestrator_v2): move group/batch boundary signals into the…
mikasenghaas May 27, 2026
649ab6a
fix(configs): allow matching shared+sub values in propagator conflict…
mikasenghaas May 27, 2026
36f7105
refactor(orchestrator): replace legacy orchestrator with v2 implement…
mikasenghaas May 27, 2026
e812e0a
refactor(orchestrator): UUID group IDs, pool-owned client selection, …
mikasenghaas May 27, 2026
56f8cc1
revert: restore validation.py to main
mikasenghaas May 27, 2026
3730cd6
refactor(orchestrator): factor TrainSource + EvalSource out of dispat…
mikasenghaas May 27, 2026
7b2eacc
refactor(orchestrator): slim dispatcher, move eval triggers to per-ba…
mikasenghaas May 27, 2026
3c90b78
refactor(configs): move off-policy caps to per-side {train,eval}.max_…
mikasenghaas May 27, 2026
e447373
refactor(eval_sink): build metrics in process_batch + rename expected…
mikasenghaas May 27, 2026
5e76a5b
revert: collapse per-side off-policy caps back to global max_off_poli…
mikasenghaas May 27, 2026
c4f5634
Merge remote-tracking branch 'origin/main' into exp/orchestrator-v2
mikasenghaas May 27, 2026
d702343
Merge remote-tracking branch 'origin/main' into exp/orchestrator-v2
mikasenghaas May 27, 2026
f49dd80
chore(config): remove wandb.shared deprecation shim
mikasenghaas May 27, 2026
50e7ec7
fix(config): honor --no-wandb / --no-ckpt across shared and sub-configs
mikasenghaas May 27, 2026
df222bd
chore(changelog): drop --no-wandb fix entry
mikasenghaas May 27, 2026
4f8407e
chore(comment): trim bare-block enablement comment
mikasenghaas May 27, 2026
6279052
docs(wandb): drop legacy-mode reference; add AGENTS rule
mikasenghaas May 27, 2026
c7729a1
chore(agents): trim docs rule
mikasenghaas May 27, 2026
60766f6
Merge remote-tracking branch 'origin/chore/remove-wandb-shared-deprec…
mikasenghaas May 27, 2026
907a0a6
Merge remote-tracking branch 'origin/main' into exp/orchestrator-v2
mikasenghaas May 27, 2026
25517ce
refactor(eval_sink): drop redundant batch_arrivals counter
mikasenghaas May 27, 2026
329252d
refactor(sinks): symmetrize TrainSink + EvalSink API; non-optional Ev…
mikasenghaas May 27, 2026
0ef56ae
refactor(orchestrator): quiet filter per-group logs; add per-group su…
mikasenghaas May 27, 2026
6ae5f8d
refactor(rollout): key sinks by group_id UUID, not (env, example_id)
mikasenghaas May 27, 2026
205ee25
refactor(types): rename ProcessResult→TrainBatchMetrics, add typed Ev…
mikasenghaas May 27, 2026
f3a42fd
chore(filters): drop per-group detection log from apply_filters
mikasenghaas May 27, 2026
32719d0
refactor(periodic_logger): distribute per-component; no defensive wandb
mikasenghaas May 27, 2026
4717624
refactor(metrics): drop dispatcher gauges/drain from step-aligned Met…
mikasenghaas May 27, 2026
545d2df
chore: revert stray uv.lock edit to match origin/main
mikasenghaas May 27, 2026
e3f2326
refactor(orchestrator): tighten APIs + collapse dead state
mikasenghaas May 27, 2026
aedb452
fix(dispatcher): emit Cancelled markers for un-scheduled rollouts on …
mikasenghaas May 27, 2026
8bc6a4a
refactor(sources): unify next_example API + relax over-strict train p…
mikasenghaas May 27, 2026
1f59407
refactor(orchestrator): log overhaul + drain-cancel + multi-env support
mikasenghaas May 27, 2026
bef17a6
refactor(eval_source): unify trigger + trigger_at_start
mikasenghaas May 27, 2026
f91d21f
refactor(orchestrator): tighten attr typing, drop event_loop_lag, polish
mikasenghaas May 27, 2026
31398ee
refactor(orchestrator): consolidate periodic logs, per-env breakdown,…
mikasenghaas May 27, 2026
6b63884
fix(orchestrator): pipeline log per-env reconciliation + step-0 step …
mikasenghaas May 27, 2026
d3b0fcd
chore(orchestrator): move Error/Truncation to end of success log lines
mikasenghaas May 27, 2026
ddfca0c
chore(trainer): use format_time for per-step success log
mikasenghaas May 27, 2026
b35565c
fix(dispatcher): keep dispatching eval group tails after queue empties
mikasenghaas May 27, 2026
34f5a4f
feat(configs): rlm_swe qwen3-4b-thinking on 2-node slurm
mikasenghaas May 27, 2026
1f86c7e
fix(orchestrator): pipeline-view accounting + drain-switch overlap fi…
mikasenghaas May 27, 2026
8b21317
fix(sinks): pipeline view fills per-rollout for non-group-scoring envs
mikasenghaas May 27, 2026
5b7be19
chore(configs): drop hendrycks_math/sanity seq_len 8192 → 4096
mikasenghaas May 27, 2026
5617925
refactor(eval_sink): rename epoch_progress → batch_progress
mikasenghaas May 27, 2026
0b69dd4
chore(orchestrator): reformat pipeline log + drop avg@k wandb metric
mikasenghaas May 28, 2026
ed06dda
chore(configs): bump default LogConfig.interval 5s → 10s
mikasenghaas May 28, 2026
f1aa527
chore(orchestrator): reword train + eval success log prefixes
mikasenghaas May 28, 2026
fb78cee
chore(periodic_logger): drop name-column prefix from console emit
mikasenghaas May 28, 2026
4b6251e
chore(orchestrator): pipeline log puts batch progress before inflight
mikasenghaas May 28, 2026
601423f
chore(orchestrator): show batch-progress percentages in pipeline log
mikasenghaas May 28, 2026
2c86b48
chore(configs): hendrycks_math/sanity batch 512 → 256, keep max_infli…
mikasenghaas May 28, 2026
dcaa3a1
chore(orchestrator): drop /max from train-inflight pipeline log
mikasenghaas May 28, 2026
cd34551
chore(orchestrator): drop env list from eval-trigger mode-flip reason
mikasenghaas May 28, 2026
a601dd9
chore(configs): drop verbose comment on EvalConfig.validate_non_empty…
mikasenghaas May 28, 2026
7687dcf
chore(configs): drop verbose docstring on OrchestratorExperimentalConfig
mikasenghaas May 28, 2026
a16f35c
chore(configs): drop orch-v2 references from pre_batch_filters docstring
mikasenghaas May 28, 2026
2fa639b
chore(configs): trim verbose comment on train group_size propagation
mikasenghaas May 28, 2026
be50608
chore(configs): generic LogConfig.interval docstring
mikasenghaas May 28, 2026
c43363a
chore(eval_source): drop TrainSource cross-reference from docstring
mikasenghaas May 28, 2026
3af5c66
chore(filters): revert setup_filters docstring to one-liner
mikasenghaas May 28, 2026
1220034
refactor(orchestrator): typed rollout dataclasses, raw stays pristine
mikasenghaas May 28, 2026
bfc94ab
fix(trajectories): thread env_name as kwarg into interleave_rollout
mikasenghaas May 28, 2026
10ed23d
chore(types): drop ``_``-prefix from rollout metadata keys in to_dict
mikasenghaas May 28, 2026
31e71f9
feat(types): add FinishedRollout.rollout_id; fix advantage grouping
mikasenghaas May 28, 2026
e6316de
chore(orchestrator): reword eval success log 'Finished evaluating' ->…
mikasenghaas May 28, 2026
ffb913c
refactor(advantage): drop grouping logic, cache advantage_fn on the sink
mikasenghaas May 28, 2026
286601d
chore(orchestrator): drop Valid X/Y from eval success log (redundant …
mikasenghaas May 28, 2026
ae9d267
chore(orchestrator): bump per-env success-log indent by one space
mikasenghaas May 28, 2026
a158164
fix(orchestrator): drop_group orphan, empty-batch spin, resume eval dup
mikasenghaas May 28, 2026
4b04c91
chore(orchestrator): drop 'Starting orchestrator step N' info log
mikasenghaas May 28, 2026
e92345f
refactor(dispatcher): rename + slim down wandb metrics
mikasenghaas May 28, 2026
b7f0dee
chore(orchestrator): drop verbose attribute-schema comment block
mikasenghaas May 28, 2026
ffa6905
chore: trim excessive code comments + docstrings
mikasenghaas May 28, 2026
15bc7f1
refactor(types): trim types.py 255 → 212 lines
mikasenghaas May 28, 2026
62d1ba7
test: fix unit tests for typed rollouts + dropped modules
mikasenghaas May 28, 2026
80837b6
refactor(configs): drop obsolete buffer + cancel-inflight fields
mikasenghaas May 28, 2026
7d45c2d
refactor: read seq/completion lens from vf.RolloutOutput.token_usage
mikasenghaas May 28, 2026
4f1a301
Merge remote-tracking branch 'origin/main' into feat/sft-on-tool-outputs
snimu May 28, 2026
8d1732a
chore(deps): bump deps/verifiers to main
snimu May 28, 2026
94276cf
chore(deps): bump renderers to message_tool_names feature branch
snimu May 28, 2026
e60e83b
chore(configs): bump configs/private to feat/sft-on-tool-outputs-fort…
snimu May 28, 2026
386ed5c
uv lock
snimu May 28, 2026
4faeaa9
chore(deps): bump research-environments to sebastian/forth-lang-2026-…
snimu May 28, 2026
337065a
chore(configs): bump configs/private (add GLM-5.1 forth-lang cells)
snimu May 28, 2026
b999766
feat(orchestrator): read message_tool_names off prompt_attribution
snimu May 28, 2026
674cca0
chore(deps): bump research-environments (forth-lang holdout + v1 fixes)
snimu May 28, 2026
f4a30c2
chore(configs): bump configs/private (forth-lang holdout + v1 nesting)
snimu May 28, 2026
2188b10
chore(deps,configs): bump for forth-lang sandbox_labels
snimu May 28, 2026
64a5535
chore(configs): bump configs/private (deepdive install + research-pro…
snimu May 28, 2026
f9a4199
chore(configs): bump configs/private (drop max_async_level)
snimu May 28, 2026
d10d12d
chore(deps): bump renderers to main (message_tool_names PR merged)
snimu May 28, 2026
4f18cbf
chore(configs): bump configs/private (sanity cell + pre_run_command d…
snimu May 28, 2026
55e0e1a
chore(configs): bump configs/private (GLM-4.5-Air forth-lang cells)
snimu May 28, 2026
59f6722
chore(configs): bump configs/private (glm-4.5 renderer on GLM-4.5-Air…
snimu May 28, 2026
2d307bc
build(uv): exempt verifiers from exclude-newer for env editable installs
snimu May 28, 2026
893084d
fix(orchestrator): revert eval reward key to avg@N; migrate example b…
mikasenghaas May 28, 2026
c97c86c
Merge remote-tracking branch 'origin/main' into exp/orchestrator-v2
mikasenghaas May 28, 2026
018861f
feat(configs): warn + migration guide for removed [orchestrator.buffer]
mikasenghaas May 28, 2026
be0a498
chore(configs): restore unchanged docstrings/comments to minimize diff
mikasenghaas May 28, 2026
36c705b
uv.lock
snimu May 28, 2026
b3dd725
chore(configs): bump configs/private (drop pre_run_command, no instal…
snimu May 28, 2026
4bac4cd
fix(slurm): extend --no-install-project to prime-rl-configs + prime-p…
snimu May 28, 2026
3628fe6
Revert "fix(slurm): extend --no-install-project to prime-rl-configs +…
snimu May 28, 2026
8cc5d1b
chore(configs): bump configs/private (switch to pre_run_command for e…
snimu May 28, 2026
bc235d2
feat(rlm-swe): single-node qwen3.5-4b config; pin rlm-swe as workspac…
mikasenghaas May 28, 2026
56e8926
feat(rlm-swe): qwen3.5-4b config contents + workspace install for rlm…
mikasenghaas May 28, 2026
a0e1a2e
fix: eval metric correctness + integration-test log parsing
mikasenghaas May 28, 2026
aeac9d5
refactor(orchestrator): pipeline-log clarity, per-rollout offload, li…
mikasenghaas May 28, 2026
6684fc6
feat(rlm-swe): multinode 1 train + 1 infer node, inference dp=8
mikasenghaas May 28, 2026
30e6a9a
feat(rlm-swe): add 1h per-rollout timeout on swebench eval
mikasenghaas May 28, 2026
288ddfc
fix(rlm-swe): cp=4 to fix trainer OOM on step 1
mikasenghaas May 29, 2026
303cd9d
chore: update uv.lock to match pyproject (fix `uv sync --locked` in CI)
mikasenghaas May 29, 2026
9214d62
refactor(trainer): match orchestrator log format (drop colons after l…
mikasenghaas May 29, 2026
866415d
refactor(trainer): drop "Time" label from step line (match orch)
mikasenghaas May 29, 2026
c03b27f
feat(rlm-swe): 2 inference nodes (dp=16) to double decode throughput
mikasenghaas May 29, 2026
6aa4d92
fix(rlm-swe): inference dp is per-node — set dp=8 (not 16)
mikasenghaas May 29, 2026
516abcc
feat(rlm-swe): 2 independent inference replicas instead of cross-node DP
mikasenghaas May 29, 2026
f1f9e65
chore(configs): bump configs/private (glm45air KV offload 960 GB → 96…
snimu May 29, 2026
76d7652
chore(configs): bump configs/private (glm45air → 4 infer nodes)
snimu May 29, 2026
7dc7962
chore(configs): bump configs/private (glm45air batch 256, oversamplin…
snimu May 29, 2026
f2ff7b6
chore(configs): bump configs/private (glm45air batch 128, oversamplin…
snimu May 29, 2026
6354568
refactor(orchestrator): fan out temperature at batch build
mikasenghaas May 26, 2026
d6e4f34
fix(orchestrator): warn (not success) when cleanup is forced
mikasenghaas May 29, 2026
d07b387
Merge remote-tracking branch 'origin/main' into exp/orchestrator-v2
mikasenghaas May 29, 2026
ff4f74f
chore(orchestrator): remove dead print_benchmark from utils
mikasenghaas May 29, 2026
11be6fc
fix(orchestrator): eval the student (not teacher) in SFT; ckpt drain …
mikasenghaas May 29, 2026
7fa8384
fix(orchestrator): per-env group_size in solve rates; group by group_id
mikasenghaas May 29, 2026
826a3b4
chore(configs): consolidate multimodal configs under configs/debug
mikasenghaas May 29, 2026
7619727
fix(orchestrator): eval through the eval (chat) client, not the renderer
mikasenghaas May 29, 2026
0f5da88
test(reverse_text): lower min reward threshold 0.65 -> 0.6
mikasenghaas May 29, 2026
59817d6
fix(orchestrator): restore checkpoint resume compat with pre-rewrite …
mikasenghaas May 29, 2026
7789b3c
chore(configs): bump configs/private (glm45air rollouts_per_example 1…
snimu May 29, 2026
fe123b2
fix(orchestrator): gate dispatcher on lead instead of blocking ship
mikasenghaas May 30, 2026
a562280
fix(dispatcher): claim drop_group tasks atomically before emitting
mikasenghaas May 30, 2026
8ce8d9b
feat(rlm-swe): enable prefix caching + language-model-only inference
mikasenghaas May 30, 2026
6706bb7
feat(rlm-swe): disable thinking in the qwen3.5 renderer
mikasenghaas May 30, 2026
9963b88
feat(rlm-swe): bump main run to 400 steps (non-thinking)
mikasenghaas May 30, 2026
8c36d26
chore(configs): bump configs/private (glm45air optimizer muon → sign_…
snimu May 30, 2026
f24181c
chore(configs): bump configs/private (glm45air rollouts_per_example 8…
snimu May 30, 2026
b5fa04c
feat(configs): introduce EchoConfig (per-role per-token echo overlay)
snimu May 30, 2026
f3fb14d
refactor(configs): rename per-role echo classes with explicit Role su…
snimu May 30, 2026
23a4f61
refactor(trainer): rename sft_mask → echo_mask on MicroBatch + downst…
snimu May 30, 2026
ca258e0
feat(orchestrator,trainer): per-token echo_alpha — sft_mask → echo_al…
snimu May 30, 2026
36ea66f
chore(configs): remove SFTConfig + TrainEnvConfig.sft + bump configs/…
snimu May 30, 2026
b87bc65
feat(configs): EchoConfig — alpha defaults to 0, require at least one…
snimu May 30, 2026
234adc2
feat(rlm-swe): re-enable thinking on the main config
mikasenghaas May 30, 2026
0cda2ea
Merge remote-tracking branch 'origin/main' into exp/orchestrator-v2
mikasenghaas May 31, 2026
3ce4264
fix(orchestrator): only skip ckpt at the drain-entry step, not max_st…
mikasenghaas May 31, 2026
9967815
filter design doc
snimu May 31, 2026
1b50b60
feat(echo): user-pluggable per-token filter on top of role baseline
snimu May 31, 2026
d58af91
feat(echo): add echo_filter_examples module + bump configs/private (s…
snimu May 31, 2026
1b2d3d1
chore(configs): bump configs/private (echo_smoke: Qwen3-1.7B + bs=4)
snimu May 31, 2026
257e20b
chore(configs): bump configs/private (echo_smoke: Qwen3-4B + tier 0 +…
snimu May 31, 2026
f2d694b
chore(configs): bump configs/private (echo_smoke start.sh wipes outpu…
snimu May 31, 2026
04fde81
remove design doc
snimu May 31, 2026
449f3f2
chore(echo): remove example filter module + bump configs/private (dro…
snimu May 31, 2026
2fa5224
chore(deps): pin verifiers + renderers + configs/private to main
snimu May 31, 2026
a1ef625
Merge main into feat/per-role-echo
snimu May 31, 2026
ce08848
Merge origin/exp/orchestrator-v2 — port echo + filter to async sink a…
snimu May 31, 2026
2eb053c
chore(configs): bump configs/private (migrate prod cells to new echo …
snimu May 31, 2026
93a0c98
chore(deps): bump research-environments (forth_lang refactor + verifi…
snimu May 31, 2026
ca01e26
fix(pyproject): remove duplicate verifiers = false in exclude-newer-p…
snimu May 31, 2026
d619ef4
fix(test): merge pydantic import into third-party group (ruff I001)
snimu May 31, 2026
740d2eb
uv lock
snimu May 31, 2026
a93f04a
chore(deps): revert verifiers to main's pin (43016ea6)
snimu May 31, 2026
c264751
uv lock
snimu May 31, 2026
1ef1ca0
style(echo): apply ruff format — collapse multi-line statements that …
snimu May 31, 2026
a37116b
chore(deps): bump research-environments (forth_lang cleanup + ForthLa…
snimu May 31, 2026
5d50dca
uv lock
snimu May 31, 2026
b93a02e
chore(deps): pin verifiers at cf4028c5 — BindingsConfig before nemogy…
snimu May 31, 2026
721b4b6
uv lock
snimu May 31, 2026
62e9654
chore(configs): bump configs/private (drop holdout_side after forth-l…
snimu May 31, 2026
76afcac
chore(deps): bump verifiers — restore SandboxConfig.labels (fix/v1-sa…
snimu May 31, 2026
97ad1ac
fix(train_sink): fan out env temperature to sample.completion_tempera…
snimu May 31, 2026
5eab2bf
chore(configs): bump configs/private (glm45air muon ckpt+eval cadence…
snimu Jun 1, 2026
2cc912c
chore(configs): bump configs/private (glm45air muon ckpt.keep_last 1→2)
snimu Jun 1, 2026
791549b
Merge origin/main into feat/per-role-echo
snimu Jun 2, 2026
5096f03
chore(deps): bump verifiers to main (7f5109e4)
snimu Jun 2, 2026
f7c7a86
test(echo): consolidate echo unit tests via parametrization
snimu Jun 2, 2026
56898f4
chore(configs): trim verbose echo docstrings and comments
snimu Jun 2, 2026
418e264
feat(configs): echo role alpha defaults to 1.0
snimu Jun 2, 2026
550c6bf
uv lock
snimu Jun 2, 2026
5e4f477
chore(deps): align all deps and submodules with prime-rl main
snimu Jun 2, 2026
a6612fa
chore(train_sink): restore main's process_rollout docstring
snimu Jun 2, 2026
326cc12
refactor(echo): bind filter kwargs at env setup, drop assert + env_ec…
snimu Jun 2, 2026
913becc
refactor(echo): express filter narrowing as a comprehension
snimu Jun 2, 2026
e31428b
refactor(echo): validate filter-mask length once, in apply_echo_filter
snimu Jun 2, 2026
9f76b7c
refactor(echo): rename _step_echo_alpha -> _build_step_echo_alpha
snimu Jun 2, 2026
79a087e
refactor(echo): simplify echo path, decouple echo loss, harden edge c…
snimu Jun 2, 2026
c6e9308
Merge remote-tracking branch 'origin/main' into feat/per-role-echo
snimu Jun 2, 2026
9a0fe88
fix(echo): align test regex with validate_roles error message
snimu Jun 2, 2026
c732d9b
cleanup
snimu Jun 2, 2026
f2f3ecf
cleanup token export float conversion
snimu Jun 2, 2026
e00ede3
refactor echo annotations
snimu Jun 2, 2026
b8aa41e
ruff
snimu Jun 2, 2026
cd2f98d
rename echo nll metric
snimu Jun 2, 2026
83b1903
Merge remote-tracking branch 'origin/feat/per-role-echo' into feat/pe…
snimu Jun 2, 2026
358e205
chore(configs): bump configs/private to glm45air cmb-code a0p005 (tem…
snimu Jun 2, 2026
f2bc7df
chore(deps): bump verifiers to main (dacceced) — port-restart race fix
snimu Jun 2, 2026
f73d66e
chore(configs): bump configs/private (kv_cache_offload schema + glm45…
snimu Jun 2, 2026
fecebd8
chore(deps): bump research-environments to main (b07ace376)
snimu Jun 2, 2026
c1ac211
chore(deps): drop mini_swe_agent_plus_rlm (removed in research-enviro…
snimu Jun 2, 2026
e9f2702
chore(deps): declare harnesses + tasksets for the research-env-main bump
snimu Jun 2, 2026
77069ad
uv lock
snimu Jun 2, 2026
7cec09b
chore(configs): bump configs/private (glm45air_cmb_code_a0p005 cadenc…
snimu Jun 3, 2026
d7a8f33
fix(trainer): guard ckpt mkdir behind rank-master + barrier
snimu Jun 3, 2026
2 changes: 1 addition & 1 deletion configs/private
2 changes: 1 addition & 1 deletion deps/research-environments
2 changes: 1 addition & 1 deletion deps/verifiers
Submodule verifiers updated 33 files
+1 −1 docs/reference.md
+11 −0 tests/test_v1_config_extension.py
+836 −141 tests/test_v1_runtime_lifecycle.py
+2 −0 verifiers/envs/experimental/composable/tasksets/__init__.py
+2 −0 verifiers/envs/experimental/composable/tasksets/swe/__init__.py
+0 −36 verifiers/envs/experimental/composable/tasksets/swe/create_fix_patch.sh
+3 −0 verifiers/envs/experimental/composable/tasksets/swe/multi_swe/__init__.py
+55 −0 verifiers/envs/experimental/composable/tasksets/swe/multi_swe/extract_fix_patch.sh
+10 −5 verifiers/envs/experimental/composable/tasksets/swe/multi_swe/taskset.py
+3 −0 verifiers/envs/experimental/composable/tasksets/swe/openswe/__init__.py
+0 −0 verifiers/envs/experimental/composable/tasksets/swe/openswe/taskset.py
+3 −0 verifiers/envs/experimental/composable/tasksets/swe/r2e_gym/__init__.py
+0 −0 verifiers/envs/experimental/composable/tasksets/swe/r2e_gym/log_parser.py
+0 −0 verifiers/envs/experimental/composable/tasksets/swe/r2e_gym/taskset.py
+3 −0 verifiers/envs/experimental/composable/tasksets/swe/scale_swe/__init__.py
+608 −0 verifiers/envs/experimental/composable/tasksets/swe/scale_swe/taskset.py
+3 −0 verifiers/envs/experimental/composable/tasksets/swe/shared/__init__.py
+0 −0 verifiers/envs/experimental/composable/tasksets/swe/shared/test_patch.py
+3 −0 verifiers/envs/experimental/composable/tasksets/swe/swe_bench/__init__.py
+0 −0 verifiers/envs/experimental/composable/tasksets/swe/swe_bench/taskset.py
+3 −0 verifiers/envs/experimental/composable/tasksets/swe/swe_lego/__init__.py
+1 −1 verifiers/envs/experimental/composable/tasksets/swe/swe_lego/taskset.py
+3 −0 verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2/__init__.py
+0 −0 verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2/log_parsers.py
+4 −5 verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2/taskset.py
+3 −0 verifiers/envs/experimental/composable/tasksets/swe/swe_smith/__init__.py
+0 −0 verifiers/envs/experimental/composable/tasksets/swe/swe_smith/taskset.py
+10 −0 verifiers/envs/experimental/composable/tasksets/swe/swe_tasksets.py
+4 −0 verifiers/utils/interception_utils.py
+1 −1 verifiers/v1/harness.py
+336 −107 verifiers/v1/runtime.py
+25 −1 verifiers/v1/sandbox.py
+261 −111 verifiers/v1/utils/sandbox_utils.py
69 changes: 69 additions & 0 deletions packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py
@@ -206,6 +206,72 @@ def resolve_timeout(self):
        return self


class SystemRoleEchoConfig(BaseConfig):
    """Echo supervision for system-message content tokens."""

    alpha: float = Field(1.0, allow_inf_nan=False)
    """Per-token echo weight."""


class UserRoleEchoConfig(BaseConfig):
    """Echo supervision for user-message content tokens."""

    alpha: float = Field(1.0, allow_inf_nan=False)
    """Per-token echo weight."""


class AssistantRoleEchoConfig(BaseConfig):
    """Echo supervision for assistant-message content and completion tokens."""

    alpha: float = Field(1.0, allow_inf_nan=False)
    """Per-token echo weight. ``alpha=0`` keeps the token supervised but gives it zero gradient."""


class ToolRoleEchoConfig(BaseConfig):
    """Echo supervision for tool-message content tokens."""

    alpha: float = Field(1.0, allow_inf_nan=False)
    """Per-token echo weight."""

    tool_names: set[str] | None = Field(None, min_length=1)
    """Restrict echo to these tool function names; None = all tools."""


class EchoFilterConfig(BaseConfig):
    """Optional callable that narrows role-selected echo tokens per rollout."""

    import_path: str
    """Dotted import path to the filter callable, e.g. ``"my_module.filter_warnings"``."""

    kwargs: dict[str, Any] = Field(default_factory=dict)
    """Keyword arguments forwarded to the filter as ``**kwargs``."""


class EchoConfig(BaseConfig):
    """Enable CE echo on selected message roles for this training env."""

    system: SystemRoleEchoConfig | None = None
    """System-message echo (default: disabled)."""

    user: UserRoleEchoConfig | None = None
    """User-message echo (default: disabled)."""

    assistant: AssistantRoleEchoConfig | None = None
    """Assistant-message echo (default: disabled)."""

    tool: ToolRoleEchoConfig | None = None
    """Tool-message echo (default: disabled)."""

    filter: EchoFilterConfig | None = None
    """Optional per-token filter on top of the role baseline."""

    @model_validator(mode="after")
    def validate_roles(self) -> "EchoConfig":
        if self.system is self.user is self.assistant is self.tool is None:
            raise ValueError("EchoConfig requires at least one of system, user, assistant, or tool.")
        return self


class TrainEnvConfig(EnvConfig):
    sampling: TrainSamplingConfig = TrainSamplingConfig()
    """Per-env sampling overrides. Unset fields inherit from the group-level train sampling config."""
@@ -214,6 +280,9 @@ class TrainEnvConfig(EnvConfig):
    """Rollouts generated per example for GRPO group-relative advantages.
    Inherits from ``orchestrator.group_size`` when unset."""

    echo: EchoConfig | None = None
    """Per-env per-role echo config."""


class EvalEnvConfig(EnvConfig):
    sampling: EvalSamplingConfig = EvalSamplingConfig()
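
The role semantics in the hunk above (every role disabled by default, `alpha` defaulting to 1.0, at least one role required) can be sketched without pydantic. This is a minimal stand-in, not the PR's classes: `RoleEcho` and `EchoSketch` are hypothetical names, the four per-role classes are collapsed into one, and `tool_names` and `filter` are left out. It mainly shows that the chained `is` comparison in `validate_roles` is true only when all four roles are `None`.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleEcho:
    """Hypothetical stand-in for the per-role config classes."""

    alpha: float = 1.0  # same default as the diff

    def __post_init__(self) -> None:
        # Mirrors the intent of allow_inf_nan=False: reject inf and NaN.
        if not math.isfinite(self.alpha):
            raise ValueError("alpha must be finite")


@dataclass(frozen=True)
class EchoSketch:
    """Hypothetical stand-in for EchoConfig (no filter, no tool_names)."""

    system: RoleEcho | None = None
    user: RoleEcho | None = None
    assistant: RoleEcho | None = None
    tool: RoleEcho | None = None

    def __post_init__(self) -> None:
        # `a is b is c is d is None` expands to
        # (a is b) and (b is c) and (c is d) and (d is None),
        # which holds only when all four are None.
        if self.system is self.user is self.assistant is self.tool is None:
            raise ValueError("EchoConfig requires at least one of system, user, assistant, or tool.")

    def role_alphas(self) -> dict[str, float | None]:
        """Map each role to its alpha, or None when the role is disabled."""
        roles = {"system": self.system, "user": self.user, "assistant": self.assistant, "tool": self.tool}
        return {name: (cfg.alpha if cfg is not None else None) for name, cfg in roles.items()}


if __name__ == "__main__":
    print(EchoSketch(tool=RoleEcho(alpha=0.5)).role_alphas())
    try:
        EchoSketch()  # no role enabled
    except ValueError as err:
        print("rejected:", err)
```

A role that is present with `alpha=0` is still distinct from a role that is absent: the first stays in the map with weight `0.0`, the second maps to `None`.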
6 changes: 4 additions & 2 deletions pyproject.toml
@@ -69,6 +69,7 @@ envs = [
    "deepdive",
    "general-agent",
    "gpqa",
    "harnesses",
    "hle",
    "ifeval",
    "livecodebench",
@@ -77,7 +78,6 @@ envs = [
    "math-python",
    "math500",
    "mini-swe-agent-plus",
    "mini-swe-agent-plus-rlm",
    "mmlu-pro",
    "opencode-cp",
    "opencode-deepdive",
@@ -88,6 +88,7 @@ envs = [
    "rlm-swe",
    "science-env",
    "simpleqa-verified",
    "tasksets",
    "tau2-bench",
    "wiki-search",
]
@@ -197,6 +198,8 @@ prime-rl-configs = { path = "packages/prime-rl-configs", editable = true }
verifiers = { path = "deps/verifiers", editable = true }
renderers = { path = "deps/renderers", editable = true }
prime-pydantic-config = { path = "deps/pydantic-config", editable = true }
harnesses = { path = "deps/verifiers/packages/harnesses", editable = true }
tasksets = { path = "deps/verifiers/packages/tasksets", editable = true }
aime2024 = { path = "deps/research-environments/environments/aime2024", editable = true }
aime2025 = { path = "deps/research-environments/environments/aime2025", editable = true }
alphabet-sort = { path = "deps/verifiers/environments/alphabet_sort", editable = true }
@@ -213,7 +216,6 @@ math-env = { path = "deps/research-environments/environments/math_env", editable
math-python = { path = "deps/verifiers/environments/math_python", editable = true }
math500 = { path = "deps/research-environments/environments/math500", editable = true }
mini-swe-agent-plus = { path = "deps/research-environments/environments/mini_swe_agent_plus", editable = true }
mini-swe-agent-plus-rlm = { path = "deps/research-environments/environments/mini_swe_agent_plus_rlm", editable = true }
mmlu-pro = { path = "deps/research-environments/environments/mmlu_pro", editable = true }
opencode-cp = { path = "deps/research-environments/environments/opencode_cp", editable = true }
opencode-deepdive = { path = "deps/research-environments/environments/opencode_deepdive", editable = true }
138 changes: 138 additions & 0 deletions src/prime_rl/orchestrator/echo.py
@@ -0,0 +1,138 @@
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass

import verifiers as vf

from prime_rl.configs.orchestrator import EchoConfig


@dataclass(frozen=True)
class EchoAnnotations:
    step_alpha: list[list[float | None]]

    def initial_sample_alpha(self, step_idx: int) -> list[float | None] | None:
        alpha = self.step_alpha[step_idx]
        return list(alpha) if any(a is not None for a in alpha) else None

    def extension_alpha(self, step_idx: int, prefix_len: int, prompt_len: int) -> list[float | None]:
        alpha = self.step_alpha[step_idx]
        return alpha[prefix_len:prompt_len] + alpha[prompt_len:]
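
The two accessors on `EchoAnnotations` are easiest to read with concrete numbers. The sketch below copies the dataclass verbatim so it runs without `verifiers`. The alpha values are made up, and the reading of `prefix_len` as "prompt tokens already present in the sample being extended" is an assumption, since the call site is not part of this hunk.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EchoAnnotations:
    step_alpha: list[list[float | None]]

    def initial_sample_alpha(self, step_idx: int) -> list[float | None] | None:
        alpha = self.step_alpha[step_idx]
        return list(alpha) if any(a is not None for a in alpha) else None

    def extension_alpha(self, step_idx: int, prefix_len: int, prompt_len: int) -> list[float | None]:
        alpha = self.step_alpha[step_idx]
        return alpha[prefix_len:prompt_len] + alpha[prompt_len:]


# Made-up values: step 0 has 2 prompt tokens + 1 completion token,
# step 1 has 4 prompt tokens + 1 completion token.
ann = EchoAnnotations(step_alpha=[[None, None, 1.0], [None, 0.5, 0.5, None, 1.0]])

# A step with at least one weighted token yields a copy of its alpha list.
print(ann.initial_sample_alpha(0))  # [None, None, 1.0]

# Extending with step 1 when its first 3 prompt tokens are assumed to be
# covered already: one remaining prompt token, then the completion.
print(ann.extension_alpha(1, prefix_len=3, prompt_len=4))  # [None, 1.0]

# For prefix_len <= prompt_len the two slices are contiguous, so the result
# matches a single slice from prefix_len onward.
assert ann.extension_alpha(1, 3, 4) == ann.step_alpha[1][3:]

# A step with no weighted token at all returns None, not a list of None.
print(EchoAnnotations(step_alpha=[[None, None]]).initial_sample_alpha(0))  # None
```

If `prefix_len` ever exceeded `prompt_len`, the first slice would be empty and the whole completion would still be returned, so the two-slice form and `alpha[prefix_len:]` differ only in that case.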


def build_echo_annotations(
    rollout: vf.RolloutOutput,
    echo_config: EchoConfig | None,
    filter_fn: Callable[..., list[list[bool]]] | None = None,
) -> EchoAnnotations | None:
    if echo_config is None:
        return None

    trajectory = rollout["trajectory"]
    step_tokens = []
    for step in trajectory:
        tokens = step["tokens"]
        if tokens is None:
            return None
        step_tokens.append(tokens)

    filter_masks = apply_echo_filter(rollout, filter_fn) if filter_fn is not None and trajectory else None
    return EchoAnnotations(
        step_alpha=[
            _build_step_echo_alpha(
                prompt_attribution=tokens.get("prompt_attribution"),
                prompt_len=len(tokens["prompt_ids"]),
                completion_len=len(tokens["completion_ids"]),
                echo_config=echo_config,
                filter_mask=filter_masks[step_idx] if filter_masks is not None else None,
            )
            for step_idx, tokens in enumerate(step_tokens)
        ]
    )


def _build_step_echo_alpha(
    prompt_attribution: dict | None,
    prompt_len: int,
    completion_len: int,
    echo_config: EchoConfig | None,
    filter_mask: list[bool] | None = None,
) -> list[float | None]:
    expected_total_len = prompt_len + completion_len
    out: list[float | None] = [None] * expected_total_len
    if echo_config is not None:
        if echo_config.assistant is not None:
            out[prompt_len:expected_total_len] = [echo_config.assistant.alpha] * completion_len

        if prompt_attribution is not None:
            message_roles = prompt_attribution.get("message_roles")
            message_indices = prompt_attribution.get("message_indices")
            is_content = prompt_attribution.get("is_content")
            if message_roles is not None and is_content and message_indices:
                if len(is_content) == prompt_len and len(message_indices) == prompt_len:
                    role_alphas = {
                        "system": echo_config.system.alpha if echo_config.system is not None else None,
                        "user": echo_config.user.alpha if echo_config.user is not None else None,
                        "assistant": echo_config.assistant.alpha if echo_config.assistant is not None else None,
                    }
                    tool_config = echo_config.tool
                    tool_alpha = tool_config.alpha if tool_config is not None else None
                    enabled_tools = tool_config.tool_names if tool_config is not None else None
                    message_tool_names = prompt_attribution.get("message_tool_names") or []

                    for k, mi in enumerate(message_indices):
                        if mi < 0 or not is_content[k] or mi >= len(message_roles):
                            continue
                        role = message_roles[mi]
                        if role == "tool":
                            tool_name = message_tool_names[mi] if mi < len(message_tool_names) else None
                            if tool_alpha is not None and (enabled_tools is None or tool_name in enabled_tools):
                                out[k] = tool_alpha
                            continue

                        alpha = role_alphas.get(role)
                        if alpha is not None:
                            out[k] = alpha

    if filter_mask is not None:
        out = [alpha if keep else None for alpha, keep in zip(out, filter_mask, strict=True)]

    return out


def apply_echo_filter(
    rollout: vf.RolloutOutput,
    filter_fn: Callable[..., list[list[bool]]],
) -> list[list[bool]]:
    trajectory = rollout["trajectory"]
    result = filter_fn(rollout)

    if not isinstance(result, list):
        raise TypeError(f"echo filter must return list[list[bool]], got {type(result).__name__}")
    if len(result) != len(trajectory):
        raise ValueError(
            f"echo filter returned {len(result)} per-step masks but the rollout has {len(trajectory)} trajectory steps"
        )

    for step_idx, (step, mask) in enumerate(zip(trajectory, result)):
        tokens = step["tokens"]
        prompt_len = len(tokens["prompt_ids"])
        completion_len = len(tokens["completion_ids"])
        expected = prompt_len + completion_len

        if not isinstance(mask, list):
            raise TypeError(f"echo filter step {step_idx}: mask must be a list, got {type(mask).__name__}")
        if len(mask) != expected:
            raise ValueError(
                f"echo filter step {step_idx}: mask length {len(mask)} "
                f"!= expected {expected} "
                f"(prompt_len={prompt_len}, completion_len={completion_len})"
            )
        for k, v in enumerate(mask):
            if type(v) is not bool:
                raise TypeError(
                    f"echo filter step {step_idx}: mask[{k}] must be a plain bool, got {type(v).__name__} ({v!r})"
                )

    return result
7 changes: 6 additions & 1 deletion src/prime_rl/orchestrator/envs.py
@@ -2,6 +2,7 @@

import asyncio
import atexit
import functools
import multiprocessing as mp
import time
from collections.abc import Awaitable, Callable, Iterator, Sequence
@@ -18,7 +19,7 @@
from prime_rl.orchestrator.eval_utils import compute_pass_at_k
from prime_rl.utils.logger import ProgressTracker, get_logger
from prime_rl.utils.monitor import get_monitor
- from prime_rl.utils.utils import capitalize
+ from prime_rl.utils.utils import capitalize, import_object

REQUIRED_STATE_COLUMNS = ["trajectory"]

@@ -170,6 +171,10 @@ class TrainEnv(Env):
    def __init__(self, config: TrainEnvConfig):
        super().__init__(config)
        self.sampling_args = config.sampling.to_sampling_args()
        self.echo_filter_fn: Callable[..., list[list[bool]]] | None = None
        if config.echo is not None and config.echo.filter is not None:
            fn = import_object(config.echo.filter.import_path)
            self.echo_filter_fn = functools.partial(fn, **config.echo.filter.kwargs)

    def get_dataset(self, seed: int | None = None):
        return self.env.get_dataset(seed=seed)
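
Review note: the pattern in `TrainEnv.__init__` (resolve a callable from a configured import path, then pre-bind its keyword arguments with `functools.partial`) can be sketched as follows. `load_object` is a hypothetical stand-in written for this example; the actual behavior and accepted path format of `prime_rl.utils.utils.import_object` are not shown in this diff and are assumed here to be a dotted `module.attr` path.

```python
import functools
import importlib
from collections.abc import Callable
from typing import Any


def load_object(import_path: str) -> Any:
    """Hypothetical resolver: 'pkg.module.attr' -> the attribute object."""
    module_name, _, attr = import_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def bind_callable(import_path: str, kwargs: dict[str, Any]) -> Callable[..., Any]:
    """Resolve the callable once and freeze its keyword arguments."""
    return functools.partial(load_object(import_path), **kwargs)


# Stdlib demonstration: bind sort_keys=True onto json.dumps.
dumps_sorted = bind_callable("json.dumps", {"sort_keys": True})
result = dumps_sorted({"b": 1, "a": 2})
```

Binding at construction time means an unresolvable import path fails when the environment is created rather than on the first rollout, and the bound callable can then be invoked with just the rollout as its positional argument.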
6 changes: 6 additions & 0 deletions src/prime_rl/orchestrator/train_sink.py
@@ -19,6 +19,7 @@

from prime_rl.configs.orchestrator import AdvantageConfig, OrchestratorConfig
from prime_rl.orchestrator.advantage import assign_advantages, setup_advantage_fn
from prime_rl.orchestrator.echo import build_echo_annotations
from prime_rl.orchestrator.envs import TrainEnvs
from prime_rl.orchestrator.filters import RolloutFilter, apply_filters
from prime_rl.orchestrator.trajectories import (
@@ -160,11 +161,16 @@ async def process_rollout(self, rollout: TrainRollout) -> None:
        needs_backfill = any(s["tokens"] is None for s in raw.get("trajectory") or [])
        if needs_backfill:
            await asyncio.to_thread(backfill_rollout_tokens, raw, self.tokenizer, renderer=self.renderer)

        env = self.train_envs.get(rollout.env_name)
        echo_annotations = await asyncio.to_thread(build_echo_annotations, raw, env.config.echo, env.echo_filter_fn)

        samples = await asyncio.to_thread(
            interleave_rollout,
            raw,
            mm_token_type_ids_mapping=self.mm_token_type_ids_mapping,
            env_name=rollout.env_name,
            echo_annotations=echo_annotations,
        )
        rollout.samples = samples or []
        # Offload base64 image bytes to disk as soon as the rollout is
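
Review note: `build_echo_annotations` is synchronous and loops over every token, so the hunk above runs it through `asyncio.to_thread` like the neighboring calls. A minimal sketch of that offloading pattern is below; `build_annotations` and its arguments are hypothetical placeholders, not the PR's function.

```python
from __future__ import annotations

import asyncio


def build_annotations(raw: dict, scale: float | None, tag: str | None = None) -> list[float] | None:
    """Hypothetical synchronous, CPU-bound helper."""
    if scale is None:
        return None
    return [scale * x for x in raw["values"]]


async def process(raw: dict) -> list[float] | None:
    # Arguments after the function are forwarded to it in a worker thread,
    # so the event loop stays free to service other coroutines meanwhile.
    return await asyncio.to_thread(build_annotations, raw, 2.0, None)


def run_demo() -> list[float] | None:
    return asyncio.run(process({"values": [1.0, 2.0]}))


demo_result = run_demo()
```

Note that `asyncio.to_thread` helps with responsiveness of the event loop; for pure-Python loops it does not add parallelism under the GIL.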
19 changes: 19 additions & 0 deletions src/prime_rl/orchestrator/trajectories.py
@@ -9,6 +9,7 @@
import verifiers as vf
from transformers.tokenization_utils import PreTrainedTokenizer

from prime_rl.orchestrator.echo import EchoAnnotations
from prime_rl.transport import RoutedExperts, TrainingSample
from prime_rl.utils.chat_template import (
common_prefix_len,
@@ -206,6 +207,7 @@ def interleave_rollout(
    mm_token_type_ids_mapping: dict[int, int] | None = None,
    *,
    env_name: str = "",
    echo_annotations: EchoAnnotations | None = None,
) -> list[TrainingSample] | None:
    """
    Convert vf.RolloutOutput to trainable rollouts by interleaving trajectory steps
@@ -225,6 +227,12 @@ def interleave_rollout(
    For VLM models, each renderer-produced trajectory step carries its
    per-image processed tensors inline on ``multi_modal_data``; the last
    merged step's sidecar covers every image in the sample.

    Args:
        output: vf.RolloutOutput containing trajectory data
        mm_token_type_ids_mapping: Maps prompt-token ids to mm_token_type_ids
            (1 = image, 2 = video, 0 otherwise). Renderer-supplied.
        echo_annotations: Optional per-step echo alpha annotations.
    """
    logger = get_logger()

@@ -238,6 +246,7 @@ def interleave_rollout(
return None

    has_error = output["error"] is not None
    # completion_temperatures is left empty; the train sink fills it per-env later.

    def prepare_step_tokens(step: vf.TrajectoryStep, step_idx: int) -> dict[str, Any] | None:
        tokens = step["tokens"]
@@ -308,6 +317,7 @@ def make_sample(tokens: dict[str, Any], step_idx: int) -> TrainingSample:
env_name=env_name,
mm_token_type_ids=None,
routed_experts=None, # deferred — finalized at end of interleave_rollout
echo_alpha=echo_annotations.initial_sample_alpha(step_idx) if echo_annotations is not None else None,
)
# Initialize routed-experts state for this sample. First chunk is the
# raw step routed_experts (no pad, no copy). running_len is the
@@ -385,6 +395,15 @@ def extend_sample(
        sample.completion_mask.extend(tokens["completion_mask"])
        sample.completion_logprobs.extend(tokens["completion_logprobs"])

        if echo_annotations is not None:
            step_prompt_len = len(tokens["prompt_ids"])
            extension = echo_annotations.extension_alpha(step_idx, prefix_len, step_prompt_len)
            if any(a is not None for a in extension) or sample.echo_alpha is not None:
                if sample.echo_alpha is None:
                    existing_len = len(sample.prompt_ids) + len(sample.completion_ids) - len(extension)
                    sample.echo_alpha = [None] * existing_len
                sample.echo_alpha.extend(extension)

        step_routed = tokens.get("routed_experts")
        state = sample_routed_state.get(id(sample))
        if state is not None:
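
Review note: the `extend_sample` hunk keeps `echo_alpha` as `None` until some step actually carries a non-`None` alpha, then backfills `None` for every token already in the sample. A standalone sketch of that lazy backfill follows; `extend_echo_alpha` and its parameters are hypothetical names for illustration, with `total_len_after_extend` standing in for `len(sample.prompt_ids) + len(sample.completion_ids)` after the step's tokens were appended.

```python
from __future__ import annotations


def extend_echo_alpha(
    echo_alpha: list[float | None] | None,
    total_len_after_extend: int,
    extension: list[float | None],
) -> list[float | None] | None:
    """Return the per-token alpha list after merging one more step."""
    if echo_alpha is None and all(a is None for a in extension):
        # Nothing annotated so far: stay None instead of allocating a list.
        return None
    if echo_alpha is None:
        # First annotated step: pad the tokens merged earlier with None.
        existing_len = total_len_after_extend - len(extension)
        echo_alpha = [None] * existing_len
    return echo_alpha + extension


still_none = extend_echo_alpha(None, 5, [None, None])
backfilled = extend_echo_alpha(None, 5, [None, 0.5])
appended = extend_echo_alpha([0.1, None], 4, [None, None])
```

Once the list exists it is always extended, even by an all-`None` extension, which keeps its length aligned with the sample's token count.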