Skip to content

Add rollout traces to built-in envs#749

Open
derek-tml wants to merge 9 commits into
gh/derek-tml/5/basefrom
gh/derek-tml/5/head
Open

Add rollout traces to built-in envs#749
derek-tml wants to merge 9 commits into
gh/derek-tml/5/basefrom
gh/derek-tml/5/head

Conversation

@derek-tml

@derek-tml derek-tml commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Adds RolloutTrace payloads to built-in rollout envs that can decode prompt/response text.
  • Covers problem, prompt-only distillation, rubric, preference, message-env, and multiplayer rollout paths.
  • Keeps reward and state details in structured fields where useful.

Stack from ghstack (oldest at bottom):

Co-authored-by: Cursor cursoragent@cursor.com

[ghstack-poisoned]
derek-tml added 6 commits May 31, 2026 23:11
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
@derek-tml derek-tml changed the title Trace prompt-only distillation rollouts Add rollout traces to built-in envs Jun 1, 2026
derek-tml added a commit that referenced this pull request Jun 1, 2026
Co-authored-by: Cursor <cursoragent@cursor.com>
ghstack-source-id: 5998d8d
Pull-Request: #749
[ghstack-poisoned]
derek-tml added a commit that referenced this pull request Jun 1, 2026
Co-authored-by: Cursor <cursoragent@cursor.com>
ghstack-source-id: d4201f2
Pull-Request: #749
@derek-tml derek-tml requested review from dphuang2 and joschu and removed request for dphuang2 June 1, 2026 07:37
@derek-tml

Copy link
Copy Markdown
Collaborator Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d34086f40

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tinker_cookbook/rl/message_env.py Outdated
[ghstack-poisoned]
derek-tml added a commit that referenced this pull request Jun 1, 2026
Co-authored-by: Cursor <cursoragent@cursor.com>
ghstack-source-id: f01b354
Pull-Request: #749
@derek-tml

Copy link
Copy Markdown
Collaborator Author

@claude review

@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Claude encountered an error after 0s —— View job


I'll analyze this and get back to you.

@dphuang2 dphuang2 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, the structured trace is much easier to consume than detokenizing ob/ac. Few things below, mostly test coverage. Non-blocking except maybe the message_env error-path coverage.

Two broader questions on the trace being unconditional (the field lives in #748 but it's this PR that populates it everywhere):

  1. It's built on every step and always serialized into the summaries, plus the per-step deepcopy in message_env, so long multi-turn rollouts pay CPU + a bigger summaries file. Is always-on intended or should it be gated?
  2. log_formatter already stashes to_data() into *_logtree.json, so for the log_formatter envs the prompt/response is somewhat duplicated. Fine if it's deliberate (summaries as the machine-readable per-traj record vs the HTML logtree), just want to confirm it's a choice.


if self._last_messages is None:
self._last_messages = await self.message_env.initial_observation()
prompt_messages = deepcopy(self._last_messages)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pre-step deepcopy is the right call since MessageEnv.step() mutates history in place. test_trace_prompt_snapshots_messages_before_step_mutation only covers the happy path though. The parse-error early return and the context-overflow return both emit a trace too, can we get the snapshot assertion on those paths as well? Those are the ones most likely to regress quietly.

response_messages,
is_valid_list,
):
trajectory.transitions[0].trace = RolloutTrace(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes every preference trajectory has exactly one transition, and that trajectory_group/response_messages/is_valid_list stay aligned. safezip catches a length mismatch but nothing asserts the right response lands on the right trajectory. Worth a small test, this is the one spot here with real back-fill logic.

trace=RolloutTrace(
prompt=prompt_formatter.to_data(),
policy_response=response_formatter.to_data(),
reward_data=reward_terms,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since reward_terms feeds both table_from_dict and reward_data, the trace ends up with display strings (f"{...:.3f}") rather than numbers. For a durable record I think we want raw floats so downstream can compute on them. Cheap fix: keep a numeric dict and format only for the table. Same thing in problem_env.py with reward_table.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants