Add rollout traces to built-in envs by derek-tml · Pull Request #749 · thinking-machines-lab/tinker-cookbook

derek-tml · 2026-06-01T06:07:27Z

Summary

Adds RolloutTrace payloads to built-in rollout envs that can decode prompt/response text.
Covers problem, prompt-only distillation, rubric, preference, message-env, and multiplayer rollout paths.
Keeps reward and state details in structured fields where useful.

Stack from ghstack (oldest at bottom):

Co-authored-by: Cursor cursoragent@cursor.com

[ghstack-poisoned]

Co-authored-by: Cursor <cursoragent@cursor.com> ghstack-source-id: 5998d8d Pull-Request: #749

[ghstack-poisoned]

Co-authored-by: Cursor <cursoragent@cursor.com> ghstack-source-id: d4201f2 Pull-Request: #749

derek-tml · 2026-06-01T09:00:34Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6d34086f40

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

[ghstack-poisoned]

Co-authored-by: Cursor <cursoragent@cursor.com> ghstack-source-id: f01b354 Pull-Request: #749

derek-tml · 2026-06-01T23:53:53Z

@claude review

github-actions · 2026-06-01T23:54:09Z

Claude encountered an error after 0s —— View job

I'll analyze this and get back to you.

dphuang2

Nice, the structured trace is much easier to consume than detokenizing ob/ac. Few things below, mostly test coverage. Non-blocking except maybe the message_env error-path coverage.

Two broader questions on the trace being unconditional (the field lives in #748 but it's this PR that populates it everywhere):

It's built on every step and always serialized into the summaries, plus the per-step deepcopy in message_env, so long multi-turn rollouts pay CPU + a bigger summaries file. Is always-on intended or should it be gated?
log_formatter already stashes to_data() into *_logtree.json, so for the log_formatter envs the prompt/response is somewhat duplicated. Fine if it's deliberate (summaries as the machine-readable per-traj record vs the HTML logtree), just want to confirm it's a choice.

dphuang2 · 2026-06-04T22:12:08Z


+        if self._last_messages is None:
+            self._last_messages = await self.message_env.initial_observation()
+        prompt_messages = deepcopy(self._last_messages)


The pre-step deepcopy is the right call since MessageEnv.step() mutates history in place. test_trace_prompt_snapshots_messages_before_step_mutation only covers the happy path though. The parse-error early return and the context-overflow return both emit a trace too, can we get the snapshot assertion on those paths as well? Those are the ones most likely to regress quietly.

dphuang2 · 2026-06-04T22:12:08Z

+            response_messages,
+            is_valid_list,
+        ):
+            trajectory.transitions[0].trace = RolloutTrace(


This assumes every preference trajectory has exactly one transition, and that trajectory_group/response_messages/is_valid_list stay aligned. safezip catches a length mismatch but nothing asserts the right response lands on the right trajectory. Worth a small test, this is the one spot here with real back-fill logic.

dphuang2 · 2026-06-04T22:12:08Z

+            trace=RolloutTrace(
+                prompt=prompt_formatter.to_data(),
+                policy_response=response_formatter.to_data(),
+                reward_data=reward_terms,


Since reward_terms feeds both table_from_dict and reward_data, the trace ends up with display strings (f"{...:.3f}") rather than numbers. For a durable record I think we want raw floats so downstream can compute on them. Cheap fix: keep a numeric dict and format only for the table. Same thing in problem_env.py with reward_table.

Update

f13015f

[ghstack-poisoned]

This was referenced Jun 1, 2026

Write rollout traces to summaries #748

Open

Expose math RL logtree group limit #750

Closed

Write rollout traces to summaries #745

Closed

Export on-policy distillation rollouts #746

Closed

Expose math RL logtree group limit #747

Closed

derek-tml added 6 commits May 31, 2026 23:11

Update

5208f66

[ghstack-poisoned]

Update

6acdf66

[ghstack-poisoned]

Update

bbd20df

[ghstack-poisoned]

Update

96ded05

[ghstack-poisoned]

Update

42ae1c3

[ghstack-poisoned]

Update

7217664

[ghstack-poisoned]

derek-tml changed the title ~~Trace prompt-only distillation rollouts~~ Add rollout traces to built-in envs Jun 1, 2026

derek-tml added a commit that referenced this pull request Jun 1, 2026

Trace built-in rollout envs

b862fb9

Co-authored-by: Cursor <cursoragent@cursor.com> ghstack-source-id: 5998d8d Pull-Request: #749

Update

6d34086

[ghstack-poisoned]

derek-tml added a commit that referenced this pull request Jun 1, 2026

Trace built-in rollout envs

0821bd6

Co-authored-by: Cursor <cursoragent@cursor.com> ghstack-source-id: d4201f2 Pull-Request: #749

derek-tml requested review from dphuang2 and joschu and removed request for dphuang2 June 1, 2026 07:37

chatgpt-codex-connector Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread tinker_cookbook/rl/message_env.py Outdated

Update

90e1601

[ghstack-poisoned]

derek-tml added a commit that referenced this pull request Jun 1, 2026

Trace built-in rollout envs

61067f5

Co-authored-by: Cursor <cursoragent@cursor.com> ghstack-source-id: f01b354 Pull-Request: #749

dphuang2 reviewed Jun 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add rollout traces to built-in envs#749

Add rollout traces to built-in envs#749
derek-tml wants to merge 9 commits into
gh/derek-tml/5/basefrom
gh/derek-tml/5/head

derek-tml commented Jun 1, 2026 •

edited

Loading

Uh oh!

derek-tml commented Jun 1, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

derek-tml commented Jun 1, 2026

Uh oh!

github-actions Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

dphuang2 left a comment

Uh oh!

dphuang2 Jun 4, 2026

Uh oh!

dphuang2 Jun 4, 2026

Uh oh!

dphuang2 Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

derek-tml commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

derek-tml commented Jun 1, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

derek-tml commented Jun 1, 2026

Uh oh!

github-actions Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dphuang2 left a comment

Choose a reason for hiding this comment

Uh oh!

dphuang2 Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

dphuang2 Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

dphuang2 Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

derek-tml commented Jun 1, 2026 •

edited

Loading

github-actions Bot commented Jun 1, 2026 •

edited

Loading