Skip to content

test(e2e): accept multiturn Kimi tool calls#5413

Merged
cv merged 1 commit into
mainfrom
fix/kimi-trajectory-multiturn-tools
Jun 14, 2026
Merged

test(e2e): accept multiturn Kimi tool calls#5413
cv merged 1 commit into
mainfrom
fix/kimi-trajectory-multiturn-tools

Conversation

@cv

@cv cv commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

Summary

Stabilizes the live Kimi inference compatibility E2E by accepting Kimi tool calls emitted across multiple assistant turns. The full nightly run showed Kimi successfully executing all three expected tools and producing the final answer, but the trajectory checker only inspected the last assistant tool-call message and failed when the model emitted one command per assistant turn.

Related Issue

Related to #5401.

Changes

  • Flatten tool-call content from all assistant tool-call messages in test/e2e/test-kimi-inference-compat.sh before checking the expected hostname, date, uptime order.
  • Preserve the existing checks that the trace artifacts contain exactly three exec tool metas, no combined semicolon command remains, final status is success, and the final assistant response follows all tool results.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • Git hooks passed during commit and push, or npx prek run --from-ref main --to-ref HEAD passes
  • Targeted tests pass for changed behavior
  • Full npm test passes (broad runtime changes only)
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Targeted verification:

  • bash -n test/e2e/test-kimi-inference-compat.sh

Nightly evidence:

  • Main full nightly run 27487294753 failed kimi-inference-compat-e2e only because sourceAssistantCommands was ['uptime'] while the same trajectory showed toolMetasCount: 3, command set date/hostname/uptime, final status success, and final assistant text was correct.

Docs review: no user-facing docs changes needed; this is E2E harness stabilization only.


Signed-off-by: Carlos Villela cvillela@nvidia.com

Summary by CodeRabbit

  • Tests
    • Enhanced end-to-end test validation to comprehensively examine tool call handling across all assistant messages with tool call blocks, ensuring more thorough assertion checks for command behavior.

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
@cv cv self-assigned this Jun 14, 2026
@coderabbitai

coderabbitai Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c7b6fbce-3bde-4deb-b316-1d5bce48864c

📥 Commits

Reviewing files that changed from the base of the PR and between 1f75d03 and 94d7b0a.

📒 Files selected for processing (1)
  • test/e2e/test-kimi-inference-compat.sh

📝 Walkthrough

Walkthrough

The embedded Python inside check_trajectory_acceptance in the E2E test script is updated so that source_calls is built by iterating over all assistant messages containing tool-call blocks, rather than extracting content only from the last such message.

Changes

Trajectory Acceptance Tool-Call Aggregation

Layer / File(s) Summary
Aggregate all assistant tool-call messages into source_calls
test/e2e/test-kimi-inference-compat.sh
Replaces the single last-message extraction with a loop that appends tool-call content from every qualifying assistant message, producing a complete flattened source_calls list used for command-order and semicolon-join assertions.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐇 One message was never enough for me,
I hopped through each assistant turn to see,
All tool calls collected, none left behind,
The trajectory checks now cover each find.
✨ A loop replaces a single peek — hooray!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'test(e2e): accept multiturn Kimi tool calls' directly and specifically summarizes the main change: updating the E2E test to accept tool calls emitted across multiple assistant turns instead of just the last message.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/kimi-trajectory-multiturn-tools

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-code-quality

github-code-quality Bot commented Jun 14, 2026

Copy link
Copy Markdown

Code Coverage Overview

Languages: TypeScript

TypeScript / code-coverage/plugin

The overall coverage in the branch is 96%. Coverage data for the branch is not yet available.

Show a code coverage summary of the most covered files.
File 94d7b0a +/-
nemoclaw/src/se...cret-scanner.ts 100%
nemoclaw/src/commands/slash.ts 100%
nemoclaw/src/li...bprocess-env.ts 100%
nemoclaw/src/bl...eprint/state.ts 98%
nemoclaw/src/onboard/config.ts 98%
nemoclaw/src/bl...int/snapshot.ts 97%
nemoclaw/src/bl...print/runner.ts 95%
nemoclaw/src/co...ration-state.ts 94%
nemoclaw/src/bl...ate-networks.ts 94%
nemoclaw/src/index.ts 94%

TypeScript / code-coverage/cli

The overall coverage in the branch is 44%. Coverage data for the branch is not yet available.

Show a code coverage summary of the most covered files.
File 94d7b0a +/-
src/lib/state/o...oard-session.ts 90%
src/lib/inference/local.ts 77%
src/lib/sandbox/config.ts 72%
src/lib/inference/nim.ts 72%
src/lib/onboard/preflight.ts 64%
src/lib/state/sandbox.ts 55%
src/lib/onboard...er-gpu-patch.ts 50%
src/lib/actions...licy-channel.ts 49%
src/lib/policy/index.ts 48%
src/lib/onboard.ts 17%

Updated June 14, 2026 04:12 UTC
Code Coverage is in Public Preview. Learn more and provide us with your feedback.

@github-actions

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: None
Optional E2E: kimi-inference-compat-e2e

Dispatch hint: kimi-inference-compat-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • None. No merge-blocking E2E is required because this is a tests-only change and does not affect NemoClaw runtime or user flows. Running the touched Kimi compatibility E2E is still useful to validate the updated test acceptance logic.

Optional E2E

  • kimi-inference-compat-e2e (medium): Optional confidence check for the modified E2E script itself. This job runs test/e2e/test-kimi-inference-compat.sh and validates Kimi K2.6 safe exec splitting through OpenClaw trajectories, including the acceptance logic changed by this PR.

New E2E recommendations

  • None.

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: kimi-inference-compat-e2e

@github-actions

Copy link
Copy Markdown
Contributor

Vitest E2E Scenario Recommendation

Required Vitest E2E scenarios: None
Optional Vitest E2E scenarios: None

Workflow run

Full Vitest E2E advisor summary

Vitest E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required Vitest E2E scenarios

  • None. Only a legacy test/e2e shell test changed outside test/e2e-scenario/ and the Vitest scenario workflow; this does not affect Vitest scenario behavior or dispatchable Vitest scenario coverage.

Optional Vitest E2E scenarios

  • None.

Relevant changed files

  • None.

@github-actions

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 2 worth checking, 0 nice ideas
Top item: Filter flattened source blocks to toolCall entries

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Source-of-truth review needed: Kimi trajectory acceptance verifier: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: The diff changes last assistant tool-call message inspection to aggregating all selected messages, but does not filter the flattened blocks to `type == "toolCall"`.
  • Filter flattened source blocks to toolCall entries (test/e2e/test-kimi-inference-compat.sh:652): The new loop extends each selected assistant message's entire content array. Because messages are selected when they contain at least one toolCall, any valid assistant message that also includes text, reasoning, or another non-tool content block will add an entry whose command is None, making the strict source_commands == ["hostname", "date", "uptime"] assertion fail even though the actual tool calls are correct.
    • Recommendation: When flattening across assistant messages, append only blocks with block.get("type") == "toolCall" before deriving source_commands. A small fixture/helper test for mixed text-plus-tool-call content would make the intended acceptance rule explicit.
    • Evidence: The prior predicate already checks any(block.get("type") == "toolCall" ...), but the new code does source_calls.extend(message.get("content", [])) without applying that same block-type filter.

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Acceptance clause:** No deterministic linked issue clauses or linked issue comments were available in the validation context. — add test evidence or identify existing coverage. validation_context.linkedIssues is empty. The PR body says it is related to fix(e2e): stabilize nightly recovery coverage #5401, but no linked issue clauses/comments were provided by the deterministic context to map literally.
  • **Acceptance clause:** Flatten tool-call content from all assistant tool-call messages in `test/e2e/test-kimi-inference-compat.sh` before checking the expected `hostname`, `date`, `uptime` order. — add test evidence or identify existing coverage. The diff now aggregates content from all assistant messages with tool calls, replacing last-message-only behavior. However, it flattens all content blocks from those messages rather than only tool-call blocks.
  • **Kimi trajectory acceptance verifier** — No dedicated fixture or helper test was added for split multi-turn source commands or mixed content-plus-tool-call messages; coverage currently depends on running this E2E script against live/mock behavior.. The diff changes last assistant tool-call message inspection to aggregating all selected messages, but does not filter the flattened blocks to `type == "toolCall"`.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27487923357
Target ref: fix/kimi-trajectory-multiturn-tools
Requested jobs: kimi-inference-compat-e2e
Summary: 1 passed, 0 failed, 0 cancelled, 0 skipped

Job Result
kimi-inference-compat-e2e ✅ success

@cv cv merged commit 618e978 into main Jun 14, 2026
116 of 117 checks passed
@cv cv deleted the fix/kimi-trajectory-multiturn-tools branch June 14, 2026 16:11
@cv cv added the v0.0.65 Release target label Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

v0.0.65 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant