Add MiniSWECodeAgentV2#88

Open
rowan-assistant wants to merge 15 commits into main from add-mini-swe-agent-v2

Conversation

Collaborator

@rowan-assistant rowan-assistant commented Feb 11, 2026

User description

(Actually authored by Josh)

This PR adds msweav2, along with a preset for it.
Full benchmarking TBD, but initially it looks on-par with the reference.

I needed to expand tool availability in the LLM request/response model, which makes this PR quite a lot larger. It might be worth breaking it up; hopefully once mux is available we should be able to do that.


Generated description

Below is a concise technical summary of the changes proposed in this PR:
Introduces the MiniSWECodeAgentV2 and its configuration presets, enabling advanced tool-call based interaction models for software engineering tasks. This agent leverages significant architectural enhancements to the core LLM request and response models, which now support ToolUseData and parallel tool execution across various LLM providers like OpenAI and Anthropic.

LLM Client Adapters Updates LLM client converters for OpenAI Chat Completions, OpenAI Responses, and Anthropic Messages to correctly handle the new ToolUseData and API-specific response fields. This includes modifying cost accounting to support diverse API response structures and ensuring seamless conversion between internal and external LLM formats.
Modified files (12)
  • src/ares/contrib/llama_cpp.py
  • src/ares/contrib/transformers_client.py
  • src/ares/contrib/transformers_client_test.py
  • src/ares/llms/accounting.py
  • src/ares/llms/anthropic_converter.py
  • src/ares/llms/anthropic_converter_test.py
  • src/ares/llms/chat_completions_compatible.py
  • src/ares/llms/openai_chat_converter.py
  • src/ares/llms/openai_chat_converter_test.py
  • src/ares/llms/openai_responses_converter.py
  • src/ares/llms/openai_responses_converter_test.py
  • src/ares/llms/request_test.py
Latest Contributors (2)
User | Commit | Date
[email protected] | Add-a-transformers-cli... | February 24, 2026
sarvanithin | Refactor-LLMRequest-co... | February 13, 2026
MiniSWEAgentV2 Implements the MiniSWECodeAgentV2, a new agent designed for software engineering tasks that utilizes OpenAI-style tool calls and supports parallel command execution. This includes defining its configuration presets and registering it within the system's available agents.
Modified files (3)
  • src/ares/code_agents/mini_swe_agent_v2.py
  • src/ares/code_agents/mini_swe_agent_v2_configs.py
  • src/ares/presets.py
Latest Contributors (2)
User | Commit | Date
Narmeen07 | Add-mechanistic-interp... | February 19, 2026
[email protected] | Removed-repeated-indiv... | January 29, 2026
LLM Response Model Enhances the core LLM response model to natively support ToolUseData for representing tool calls and expands the LLMResponse dataclass with API-specific metadata. This foundational change allows for richer interaction patterns with LLMs, including parallel tool execution, and updates the extract_text_content utility for broader compatibility.
Modified files (15)
  • src/ares/llms/__init__.py
  • src/ares/llms/response.py
  • src/ares/llms/response_test.py
  • src/ares/llms/test_data/anthropic_messages_mixed_content.json
  • src/ares/llms/test_data/anthropic_messages_text_only.json
  • src/ares/llms/test_data/anthropic_messages_tool_use_parallel.json
  • src/ares/llms/test_data/anthropic_messages_tool_use_single.json
  • src/ares/llms/test_data/openai_empty.json
  • src/ares/llms/test_data/openai_responses_text_only.json
  • src/ares/llms/test_data/openai_responses_tool_call_single.json
  • src/ares/llms/test_data/openai_text_only.json
  • src/ares/llms/test_data/openai_tool_call_malformed.json
  • src/ares/llms/test_data/openai_tool_call_mixed.json
  • src/ares/llms/test_data/openai_tool_call_parallel.json
  • src/ares/llms/test_data/openai_tool_call_single.json
Latest Contributors (1)
User | Commit | Date
[email protected] | Massively-simplify-the... | January 29, 2026
Infrastructure & Docs Adds runtime checks to DockerContainer methods to ensure the container is started before attempting operations, preventing potential RuntimeError exceptions. Also updates pyproject.toml to exclude test files and data from the built wheel package, and adds type annotation guidelines to CLAUDE.md.
Modified files (3)
  • CLAUDE.md
  • pyproject.toml
  • src/ares/containers/docker.py
Latest Contributors (2)
User | Commit | Date
[email protected] | Add-a-transformers-cli... | February 24, 2026
Narmeen07 | Add-mechanistic-interp... | February 19, 2026
Agent & Utility Adaptations Updates existing agents and utility functions to align with the new LLMResponse structure, specifically by adopting the response.extract_text_content helper and handling ToolUseData where appropriate. This ensures compatibility and proper data extraction across various agent implementations and environment interactions.
Modified files (6)
  • examples/07_mech_interp_hooked_transformer.py
  • examples/utils.py
  • src/ares/code_agents/mini_swe_agent.py
  • src/ares/code_agents/terminus2/terminus2_agent.py
  • src/ares/environments/twenty_questions.py
  • src/ares/testing/mock_llm_test.py
Latest Contributors (2)
User | Commit | Date
Narmeen07 | Add-mechanistic-interp... | February 19, 2026
marathan24 | Add-terminus2_agent-su... | February 12, 2026
This pull request is reviewed by Baz.

Summary by CodeRabbit

  • New Features

    • New async code agent (tool-call based) with configurable templates, env, and execution flow.
    • LLM responses now support mixed text and tool/function calls with bidirectional converters for multiple APIs.
    • Added presets/configs for the new agent.
  • Bug Fixes

    • Added runtime guards to container operations and strengthened runtime type validations.
  • Tests

    • Added/expanded round-trip and parsing tests across converters and response types.
  • Documentation

    • Clarified type-annotation guidance: prefer progressive fixes over blanket ignores.

Rowan and others added 13 commits February 11, 2026 17:05
Introduce v2 text-based code agent that uses markdown parsing and
COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT submission sentinel. Based on
mini-swe-agent's mini_textbased.yaml configuration.

Key features:
- Text-based interaction (parsing markdown code blocks)
- Jinja2-rendered system/instance/observation templates
- Timeout and format error handling
- Cost and step limit tracking

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@joshgreaves joshgreaves requested a review from rsmith49 February 25, 2026 00:37
@joshgreaves joshgreaves changed the title from "Add MiniSWECodeAgentV2 implementation" to "Add MiniSWECodeAgentV2" on Feb 25, 2026
@joshgreaves joshgreaves marked this pull request as ready for review February 27, 2026 17:50
Comment thread src/ares/llms/response.py
Comment on lines +111 to +112
```python
if not first_block.content:
    raise ValueError("TextData block has empty content")
```

Important

[Logic] Raising on empty TextData makes extract_text_content unusable for valid empty responses (e.g., openai_empty.json fixture). If any caller uses this helper on an empty but otherwise valid response, it will now throw unexpectedly. Consider allowing empty strings and only erroring when the first block isn't TextData.

Example:

```python
# Allow empty text content
return first_block.content
```

File: src/ares/llms/response.py
Line: 112
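The fix the reviewer proposes can be sketched as follows; `TextData` and `LLMResponse` here are simplified stand-ins for the real ares types, not the actual implementation:

```python
import dataclasses


@dataclasses.dataclass
class TextData:
    content: str


@dataclasses.dataclass
class LLMResponse:
    data: list


def extract_text_content(response: LLMResponse) -> str:
    if not response.data:
        raise ValueError("Response has no data blocks")
    first_block = response.data[0]
    if not isinstance(first_block, TextData):
        raise ValueError(
            f"Expected TextData as first block, got {type(first_block).__name__}"
        )
    # Empty string is a legitimate completion (e.g. the openai_empty.json
    # fixture), so there is deliberately no emptiness check here.
    return first_block.content
```

With this version, a valid empty completion round-trips to `""` instead of raising.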

Comment on lines 105 to +110
```python
if workdir is None:
    workdir = self.default_workdir

if self._container is None:
    raise RuntimeError("Container not started - call start() first")
```

if self._container is None: raise RuntimeError(...) is duplicated in exec_run, upload_files, and download_files, so the same guard/error appears thrice; can we share a single _ensure_started() helper to avoid the copy-paste and keep the message consistent?

Finding type: Code Dedup and Conventions


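The shared guard could look like the sketch below; `_ensure_started` is the reviewer's proposed helper name, and the method bodies are elided:

```python
class DockerContainer:
    """Sketch: one guard helper instead of three copy-pasted checks."""

    def __init__(self):
        self._container = None

    def _ensure_started(self):
        if self._container is None:
            raise RuntimeError("Container not started - call start() first")

    def exec_run(self, command, workdir=None):
        self._ensure_started()
        ...

    def upload_files(self, files):
        self._ensure_started()
        ...

    def download_files(self, paths):
        self._ensure_started()
        ...
```

Keeping the message in one place means it can't drift between the three call sites.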

Comment thread src/ares/llms/response.py
Comment on lines +104 to +113
```python
if not response.data:
    raise ValueError("Response has no data blocks")

first_block = response.data[0]
if not isinstance(first_block, TextData):
    raise ValueError(f"Expected TextData as first block, got {type(first_block).__name__}")

if not first_block.content:
    raise ValueError("TextData block has empty content")
```

extract_text_content now raises if the first data block isn't TextData or the content string is empty, but ares_response_from_external (openai_chat_converter/openai_tool_call_* fixtures) produces ToolUseData-only responses when the assistant issues a function call; MiniSWECodeAgent.query/execute_action and Terminus2Agent._unwrap_single_response now hit this code and throw ValueError any time the LLM returns a pure tool call (a legitimate response), aborting the agent loop—can we scan for the first TextData block (or allow empty strings) instead of forcing the first block to be TextData?

Finding type: Logical Bugs



Comment on lines 367 to 373
```python
def _extract_question_from_response(self, action: response.LLMResponse) -> str:
    """Extract the question text from the agent's response."""
    # Get the text content from the first data element
    if not isinstance(action.data[0], response.TextData):
        return ""
    question = action.data[0].content if action.data else ""
    return question.strip()
```

_extract_question_from_response now short-circuits to "" whenever action.data[0] is a ToolUseData, but OpenAI Responses outputs can validly emit a function_call (ToolUseData) before any message (TextData) block, so the environment drops the agent's question, keeps calling the oracle with an empty prompt, and every step just penalizes the agent without recording the real question; can we scan the list for the first TextData instead of only checking data[0]?

Finding type: Logical Bugs


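The scan-based approach the reviewer suggests can be sketched like this; the dataclasses are simplified stand-ins for the ares response types:

```python
import dataclasses


@dataclasses.dataclass
class TextData:
    content: str


@dataclasses.dataclass
class ToolUseData:
    id: str
    name: str
    input: dict


def extract_question(data: list) -> str:
    # Scan for the first TextData block instead of only checking data[0],
    # so a leading function_call block no longer drops the question.
    for block in data:
        if isinstance(block, TextData):
            return block.content.strip()
    return ""
```

This also removes the latent IndexError on an empty `data` list, since the loop simply falls through to `""`.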

Comment on lines 16 to 22
```python
import asyncio

import ares
from ares import llms
from ares.contrib import mech_interp
import transformer_lens
```

ares and ares.contrib imports currently precede import transformer_lens, so third-party imports are in the wrong group; CLAUDE.md requires stdlib → third-party → local (ARES final). Can we move import transformer_lens ahead of the ARES imports so the local group is kept last?

Finding type: AI Coding Guidelines




coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Adds heterogeneous LLM response support (tool-use blocks), bidirectional converters for OpenAI/Responses/Anthropic, a new async tool-call MiniSWECodeAgentV2 with configs, widespread callers updated to use centralized conversions/extractors, new tests and fixtures, doc/example/type-guidance, and minor runtime safety and packaging adjustments.

Changes

Cohort / File(s) Summary
Response types & exports
src/ares/llms/response.py, src/ares/llms/__init__.py
Introduce ToolUseData and expand LLMResponse.data to `TextData | ToolUseData` blocks.
OpenAI Chat converter
src/ares/llms/openai_chat_converter.py, src/ares/llms/openai_chat_converter_test.py
Rename request conversion APIs to ares_*, add ares_response_from_external / ares_response_to_external (text + tool calls) and round-trip tests/fixtures.
OpenAI Responses converter
src/ares/llms/openai_responses_converter.py, src/ares/llms/openai_responses_converter_test.py
Add ares_response_from_external / ares_response_to_external for Responses API (text + tool calls), including argument JSON handling and round-trip tests.
Anthropic converter
src/ares/llms/anthropic_converter.py, src/ares/llms/anthropic_converter_test.py
Add ares_response_from_external / ares_response_to_external mapping Anthropic messages ↔ ARES LLMResponse, with strict/non-strict handling and tests.
Accounting changes
src/ares/llms/accounting.py
Broaden get_llm_cost() to accept Chat/Responses/Anthropic types and branch on different usage fields; widen cost_mapping type.
Transformers / llama client adaptions
src/ares/contrib/transformers_client.py, src/ares/contrib/llama_cpp.py, src/ares/contrib/transformers_client_test.py
Switch call sites to ares_request_to_external(...) conversions; strengthen tests to assert TextData in responses.
Existing agent updates
src/ares/code_agents/mini_swe_agent.py, src/ares/code_agents/terminus2/terminus2_agent.py
Replace direct data[0].content indexing with extract_text_content() and rename local variables to llm_response.
New agent + configs
src/ares/code_agents/mini_swe_agent_v2.py, src/ares/code_agents/mini_swe_agent_v2_configs.py
Add MiniSWECodeAgentV2 (async tool-call agent) and MiniSWEAgentV2Config dataclass with factories and templates.
Presets registration
src/ares/presets.py
Register msweav2 preset factories for Harbor datasets using mini_swe_agent_v2.
Examples & utils
examples/07_mech_interp_hooked_transformer.py, examples/utils.py, CLAUDE.md
Add llms import and runtime type assertions in examples/util preview; update CLAUDE.md type-annotation guidance to discourage # type: ignore and recommend stepwise fixes.
Tests & fixtures
src/ares/llms/response_test.py, src/ares/llms/*_test.py, src/ares/llms/test_data/*.json
Add comprehensive parsing tests and ~14 JSON fixtures covering OpenAI Chat, Responses, and Anthropic message shapes; update related tests to use new converters and extractors.
Mock LLM tests
src/ares/testing/mock_llm_test.py
Rename local variables to llm_response, assert TextData types, and use extract_text_content() in assertions.
Docker safety checks
src/ares/containers/docker.py
Add guard clauses raising RuntimeError if container not started in exec_run, upload_files, download_files.
Packaging
pyproject.toml
Exclude **/*_test.py and **/test_data/ from wheel builds.

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Agent as MiniSWECodeAgentV2
    participant LLM as LLMClient
    participant Container
    participant History as MessageHistory

    User->>Agent: run(task)
    loop step loop
        Agent->>History: render & read conversation
        Agent->>LLM: query(messages, tool_choice="any")
        LLM-->>Agent: LLMResponse (ToolUseData / TextData)
        Agent->>History: append assistant message (with tool_calls)
        loop for each ToolUseData
            Agent->>Container: exec_run(bash command)
            Container-->>Agent: stdout/stderr, returncode
            Agent->>Agent: render observation, append tool observation to History
            alt submission detected (sentinel + rc=0)
                Agent-->>User: return submission / raise _SubmittedError
            end
        end
        Agent->>Agent: check step_limit & cost_limit
        alt limits exceeded
            Agent-->>User: raise _LimitsExceededError
        end
    end
```

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 With types refined and tool calls bright,

I hop through messages into the night,
Converters hum and agents run free,
Containers answer with bash and glee,
A tiny rabbit cheers: "Ship it to me!"

🚥 Pre-merge checks: ✅ 3 passed
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title 'Add MiniSWECodeAgentV2' is directly related to the main change of the pull request, which introduces the MiniSWECodeAgentV2 agent and its configuration components.
Docstring Coverage | ✅ Passed | Docstring coverage is 94.62%, which is sufficient; the required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@mwilliammyers (Collaborator)

@coderabbitai full review


coderabbitai bot commented Apr 9, 2026

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

♻️ Duplicate comments (1)
src/ares/llms/response.py (1)

104-112: ⚠️ Potential issue | 🟠 Major

Allow empty TextData here.

The converters in this PR normalize valid empty completions to TextData(content=""), so raising on not first_block.content turns a legitimate empty response into an exception for every caller of this helper.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/llms/response.py` around lines 104 - 112, The code currently treats
an empty TextData.content as an error; change the check so legitimate empty
string completions are allowed: in the block handling response.data and
first_block (the TextData check), remove or replace the falsy check "if not
first_block.content" with a None-only check (e.g., "if first_block.content is
None") so that TextData(content="") is accepted; ensure you keep the earlier
checks for presence of response.data and that first_block is a TextData
instance.
🧹 Nitpick comments (3)
src/ares/llms/openai_responses_converter_test.py (1)

442-448: Add a parallel tool-call round-trip case to this matrix.

This only exercises text-only and single-call fixtures, so it will miss regressions in the new multi-tool Responses flow introduced by this PR. Please add a fixture with multiple function_call output items to keep Responses coverage aligned with the chat and Anthropic suites.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/llms/openai_responses_converter_test.py` around lines 442 - 448, The
parametrize matrix for "filename" only includes text-only and single-call
fixtures, so add a parallel multi-call fixture filename (e.g.,
"openai_responses_tool_call_multi.json") to the list so the test case exercises
multiple function_call output items; update the pytest.mark.parametrize list in
src/ares/llms/openai_responses_converter_test.py to include that new filename
and ensure a corresponding fixture JSON is added with multiple function_call
entries to mirror the new multi-tool Responses flow.
src/ares/llms/accounting.py (2)

74-135: Add direct tests for the three usage schemas.

This function now has provider-specific billing branches, but the converter round-trip tests won't catch a mix-up between prompt_tokens/completion_tokens and input_tokens/output_tokens. A small unit matrix around get_llm_cost() would make cost regressions much easier to catch in this shared path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/llms/accounting.py` around lines 74 - 135, Add unit tests that
directly exercise get_llm_cost for the three provider usage schemas to prevent
token-field mixups: create three small tests (e.g.,
test_get_llm_cost_chat_completion, test_get_llm_cost_responses_api,
test_get_llm_cost_anthropic_message) that construct minimal mock objects
matching openai.types.chat.chat_completion.ChatCompletion (with
usage.prompt_tokens and usage.completion_tokens),
openai.types.responses.Response (with usage.input_tokens and
usage.output_tokens), and anthropic.types.Message (with usage.input_tokens and
usage.output_tokens), supply a cost_mapping entry for the model_id with known
ModelCost.pricing values (request, prompt, completion) and assert the returned
decimal.Decimal equals the expected calculation ((request) +
prompt*input/prompt_tokens + completion*output/completion_tokens); ensure to
cover missing usage raising ValueError as well.
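The arithmetic those tests would pin down can be illustrated with plain `Decimal` math; the attribute names follow the usage schemas described above (`prompt_tokens`/`completion_tokens` for Chat Completions, `input_tokens`/`output_tokens` for Responses and Anthropic), while the per-token rates are made-up illustrative values:

```python
import decimal
from types import SimpleNamespace


def cost_from_usage(prompt_tokens, completion_tokens,
                    prompt_rate, completion_rate):
    """Per-token billing: rates are cost per token (hypothetical values)."""
    return prompt_rate * prompt_tokens + completion_rate * completion_tokens


# Chat Completions usage exposes prompt_tokens / completion_tokens...
chat_usage = SimpleNamespace(prompt_tokens=100, completion_tokens=50)
# ...while Responses/Anthropic usage exposes input_tokens / output_tokens.
responses_usage = SimpleNamespace(input_tokens=100, output_tokens=50)

prompt_rate = decimal.Decimal("0.000001")
completion_rate = decimal.Decimal("0.000002")

chat_cost = cost_from_usage(chat_usage.prompt_tokens,
                            chat_usage.completion_tokens,
                            prompt_rate, completion_rate)
responses_cost = cost_from_usage(responses_usage.input_tokens,
                                 responses_usage.output_tokens,
                                 prompt_rate, completion_rate)
assert chat_cost == responses_cost == decimal.Decimal("0.0002")
```

A test matrix that swaps the field names per provider would catch exactly the prompt/input mix-up the reviewer warns about.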

3-6: Use module imports for collections.abc and typing here.

This file now mixes the repo’s qualified-module import style with direct symbol imports for Mapping and assert_never. Keeping these qualified makes the module consistent with the rest of the codebase.

♻️ Minimal cleanup
```diff
-from collections.abc import Mapping
+import collections.abc
 import decimal
 import functools
-from typing import assert_never
+import typing
 ...
-    cost_mapping: Mapping[str, ModelCost],
+    cost_mapping: collections.abc.Mapping[str, ModelCost],
 ...
-        assert_never(completion)
+        typing.assert_never(completion)
```

As per coding guidelines, "Always import modules, not classes or functions. For external consumers (examples, docs), use import ares or from ares import llms and access via qualified names. For internal code, use from ares.llms import request and access as request.LLMRequest."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/llms/accounting.py` around lines 3 - 6, The file imports Mapping and
assert_never directly; update to module imports to match project style: replace
"from collections.abc import Mapping" with "import collections.abc" (use
collections.abc.Mapping where used) and replace "from typing import
assert_never" with "import typing" (use typing.assert_never where used), and
update any references to the direct symbols (Mapping, assert_never) accordingly
in src/ares/llms/accounting.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/utils.py`:
- Around line 42-50: The print_step code accesses action.data[0] without
checking for empty .data and uses len(action_content) which is wrong when
action_content is a dict for llms.ToolUseData; update the logic in print_step to
first guard for empty action.data (handle by setting action_content = "" or a
placeholder like "<empty response>"), then for llms.ToolUseData convert the
input to a string (e.g., str(action.data[0].input)) so truncation operates on
characters, compute preview using the string length and slice that string into
action_preview, and keep the existing ellipsis logic to append "..." only when
the string length exceeds the cutoff; reference llms.TextData, llms.ToolUseData,
action_content and action_preview to locate the changes.

In `@src/ares/code_agents/mini_swe_agent_v2.py`:
- Around line 396-403: The prompt's displayed working directory and the value
passed to exec_run() are inconsistent; change the code so both use the same
variable (do not coerce None to "/" for the template while passing None to
exec_run()). Use a single workdir variable set directly from
self.working_directory (allowing None) and pass that workdir into the
jinja2.Template.render(...) and into any exec_run(...) calls so the advertised
directory matches actual exec behavior; repeat the same change for the other
occurrence around the instance_template use at the 562-567 section.
- Around line 315-316: The parallel_tool_calls flag is defined but never used;
update the query(...) caller/definition to accept and forward
parallel_tool_calls into the tool-execution path and into the LLM request
options (where applicable), then modify the tool-execution helpers (e.g., the
functions named _call_tools, _execute_tool_calls, run_tool_calls or similar) to
branch on parallel_tool_calls and run tool calls concurrently (use
asyncio.gather / asyncio.create_task or concurrent.futures for sync tools) when
True and await serially when False; ensure the flag is threaded through all call
sites you referenced (around the symbols query,
_call_tools/_execute_tool_calls/run_tool_calls and any LLM request builder) so
the behavior actually toggles between parallel and serial execution.
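One way to thread the flag through is sketched below with a stand-in executor; `run_tool_calls` and `execute` are illustrative names, not the agent's actual API:

```python
import asyncio


async def run_tool_calls(tool_calls, execute, parallel_tool_calls: bool):
    # Run every tool call concurrently when the flag is set, serially otherwise.
    if parallel_tool_calls:
        return list(await asyncio.gather(*(execute(tc) for tc in tool_calls)))
    results = []
    for tc in tool_calls:
        results.append(await execute(tc))
    return results


async def _demo():
    async def execute(cmd):
        await asyncio.sleep(0)  # stands in for container exec_run
        return f"ran {cmd}"

    par = await run_tool_calls(["ls", "pwd"], execute, parallel_tool_calls=True)
    ser = await run_tool_calls(["ls", "pwd"], execute, parallel_tool_calls=False)
    return par, ser


par, ser = asyncio.run(_demo())
assert par == ser == ["ran ls", "ran pwd"]
```

`asyncio.gather` preserves argument order in its result list, so observations can still be appended to history in a deterministic order either way.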

In `@src/ares/code_agents/mini_swe_agent.py`:
- Around line 222-231: The agent currently strips tool-call responses via
response.extract_text_content(...) and falls through to parse_action("") causing
wasted retries; update execute_action to explicitly detect tool-call outputs by
checking llm_response.data for ToolUseData (or the tool-use discriminant) before
calling response.extract_text_content, and reject them (e.g., raise a clear
exception or return an error observation) so callers know the model returned a
tool invocation the agent can't handle; ensure to reference/ import the
ToolUseData type and update execute_action (and any callers expecting
parse_action) to handle this explicit rejection instead of silently parsing an
empty string.
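A minimal sketch of the explicit rejection, using stand-in dataclasses and a hypothetical `ToolCallNotSupportedError` (the real agent would surface this as an error observation or exception of its choosing):

```python
import dataclasses


@dataclasses.dataclass
class TextData:
    content: str


@dataclasses.dataclass
class ToolUseData:
    id: str
    name: str
    input: dict


class ToolCallNotSupportedError(Exception):
    """Raised when a text-only agent receives a tool-call response."""


def text_or_reject(data: list) -> str:
    # Fail loudly on tool calls instead of silently parsing an empty string.
    if any(isinstance(block, ToolUseData) for block in data):
        raise ToolCallNotSupportedError(
            "Model returned a tool call; this agent only parses text responses"
        )
    return "".join(block.content for block in data if isinstance(block, TextData))
```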

In `@src/ares/containers/docker.py`:
- Around line 108-110: The methods only check self._container is not None but
not whether the container is still running; add a single reusable guard method
(e.g., _ensure_running()) that calls self._container.reload() (or otherwise
refreshes state) and asserts self._container is not None and
self._container.status == "running" (or checks
container.attrs["State"]["Running"]) and raise a clear RuntimeError like
"Container not running - call start() or restart the container" if not; replace
the simple None checks in start-related operations and in upload_files and
download_files with calls to _ensure_running() so operations fail fast with a
clear message when the container has been externally stopped or exited.

In `@src/ares/environments/twenty_questions.py`:
- Around line 359-361: The code dereferences oracle_response.data[0] without
checking for an empty list; update the block that sets answer_text to first
verify oracle_response.data is truthy (non-empty) before accessing index 0, and
raise a clear RuntimeError if it's empty or None; keep the existing type check
for response.TextData and assignment to answer_text so replace the single
dereference with an initial empty-check on oracle_response.data, then proceed to
isinstance(oracle_response.data[0], response.TextData) and answer_text =
oracle_response.data[0].content.strip().
- Around line 370-372: The code indexes action.data[0] before ensuring
action.data is non-empty which can raise IndexError in step(); change the check
to first confirm action.data is truthy and that its first element is a
response.TextData (e.g., if not action.data or not isinstance(action.data[0],
response.TextData): return ""), then set question = action.data[0].content;
update the block around step(), action.data and response.TextData to follow this
order to avoid indexing empty lists.

In `@src/ares/llms/openai_chat_converter.py`:
- Around line 515-517: The code currently takes only the first TextData block
from response.data (variables: response.data, llm_response.TextData,
text_blocks, content) which drops any subsequent assistant text; change it to
preserve all TextData blocks by concatenating their contents in order (e.g.,
join all b.content from text_blocks with a separator such as "\n\n") or
otherwise propagate the full list where the downstream logic expects multiple
blocks, and update any downstream consumers that assume a single string (search
for uses of the content variable) to handle the combined text or list
accordingly.
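Concatenation in order might look like this; the `TextData` dataclass is a simplified stand-in, and the `"\n\n"` separator is an assumption:

```python
import dataclasses


@dataclasses.dataclass
class TextData:
    content: str


def combined_text(data: list) -> str:
    # Join every TextData block in order rather than keeping only the first.
    text_blocks = [block for block in data if isinstance(block, TextData)]
    return "\n\n".join(block.content for block in text_blocks)
```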

In `@src/ares/llms/openai_responses_converter.py`:
- Around line 549-581: The code currently collects all response_lib.TextData
into text_blocks and all response_lib.ToolUseData into tool_blocks which
reorders mixed blocks; change the logic to iterate over response.data in
original sequence and for each block check isinstance(block,
response_lib.TextData) to build a ResponseOutputMessage (using
openai.types.responses.ResponseOutputMessage / ResponseOutputText) or
isinstance(block, response_lib.ToolUseData) to build a ResponseFunctionToolCall
(using openai.types.responses.ResponseFunctionToolCall), then append each
created output to output immediately so original ordering is preserved (refer to
response.data, response_lib.TextData, response_lib.ToolUseData,
ResponseOutputMessage, ResponseOutputText, ResponseFunctionToolCall).
- Around line 477-495: The forward conversion currently maps
ResponseFunctionToolCall.call_id into response_lib.ToolUseData.id and drops the
original item.id; update the ToolUseData model to include both fields (e.g., id
and call_id), then change the forward converter handling
ResponseFunctionToolCall to populate both ToolUseData.id = item.id and
ToolUseData.call_id = item.call_id (preserving parsed arguments into input), and
update the reverse conversion that reconstructs ResponseFunctionToolCall to
restore both fields (write item.id from ToolUseData.id and item.call_id from
ToolUseData.call_id) so round-trip semantics and the distinction between item
identity and invocation correlation are preserved.
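The order-preserving emission from the earlier comment can be sketched with dicts standing in for the `openai.types.responses` output classes:

```python
import dataclasses


@dataclasses.dataclass
class TextData:
    content: str


@dataclasses.dataclass
class ToolUseData:
    id: str
    name: str
    input: dict


def to_output_items(data: list) -> list:
    """Emit one output item per block, preserving the original order."""
    output = []
    for block in data:
        if isinstance(block, TextData):
            # Stand-in for ResponseOutputMessage / ResponseOutputText.
            output.append({"type": "message", "text": block.content})
        elif isinstance(block, ToolUseData):
            # Stand-in for ResponseFunctionToolCall.
            output.append({"type": "function_call", "call_id": block.id,
                           "name": block.name, "arguments": block.input})
    return output
```

Interleaving the branches inside one loop is what keeps a tool call that precedes text from being reordered behind it.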

In `@src/ares/llms/response_test.py`:
- Around line 34-38: Tests are constructing TextData/ToolUseData manually from
completion payloads and thus bypass the real parsing logic; update the tests to
feed the same raw completion payloads into the actual parsing helpers (e.g.,
call ares_response_from_external(...) and/or extract_text_content(...) with the
completion/choice message instead of rebuilding objects via json.loads) so the
test exercises the production parsing path (replace local construction around
completion.choices[0].message and similar blocks at the other ranges with calls
to those helper functions and assert on their outputs).

---

Duplicate comments:
In `@src/ares/llms/response.py`:
- Around line 104-112: The code currently treats an empty TextData.content as an
error; change the check so legitimate empty string completions are allowed: in
the block handling response.data and first_block (the TextData check), remove or
replace the falsy check "if not first_block.content" with a None-only check
(e.g., "if first_block.content is None") so that TextData(content="") is
accepted; ensure you keep the earlier checks for presence of response.data and
that first_block is a TextData instance.

---

Nitpick comments:
In `@src/ares/llms/accounting.py`:
- Around line 74-135: Add unit tests that directly exercise get_llm_cost for the
three provider usage schemas to prevent token-field mixups: create three small
tests (e.g., test_get_llm_cost_chat_completion, test_get_llm_cost_responses_api,
test_get_llm_cost_anthropic_message) that construct minimal mock objects
matching openai.types.chat.chat_completion.ChatCompletion (with
usage.prompt_tokens and usage.completion_tokens),
openai.types.responses.Response (with usage.input_tokens and
usage.output_tokens), and anthropic.types.Message (with usage.input_tokens and
usage.output_tokens), supply a cost_mapping entry for the model_id with known
ModelCost.pricing values (request, prompt, completion) and assert the returned
decimal.Decimal equals the expected calculation ((request) +
prompt*input/prompt_tokens + completion*output/completion_tokens); ensure to
cover missing usage raising ValueError as well.
- Around line 3-6: The file imports Mapping and assert_never directly; update to
module imports to match project style: replace "from collections.abc import
Mapping" with "import collections.abc" (use collections.abc.Mapping where used)
and replace "from typing import assert_never" with "import typing" (use
typing.assert_never where used), and update any references to the direct symbols
(Mapping, assert_never) accordingly in src/ares/llms/accounting.py.
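The suggested tests reduce to checking one arithmetic identity across three usage schemas. A minimal, self-contained sketch of that check (using `types.SimpleNamespace` stand-ins instead of the real `openai`/`anthropic` types, and a hypothetical `cost` helper in place of `get_llm_cost`):

```python
# Hedged sketch: `extract_tokens` and `cost` are illustrative stand-ins
# for ares' get_llm_cost / ModelCost, not the real API.
import decimal
from types import SimpleNamespace

def extract_tokens(usage) -> tuple[int, int]:
    """Normalize the provider usage schemas to (input, output) token counts."""
    if hasattr(usage, "prompt_tokens"):  # Chat Completions naming
        return usage.prompt_tokens, usage.completion_tokens
    return usage.input_tokens, usage.output_tokens  # Responses / Anthropic naming

def cost(usage, request, prompt_rate, completion_rate) -> decimal.Decimal:
    if usage is None:
        raise ValueError("response has no usage data")
    inp, out = extract_tokens(usage)
    return request + prompt_rate * inp + completion_rate * out

d = decimal.Decimal
chat = SimpleNamespace(prompt_tokens=100, completion_tokens=50)
responses = SimpleNamespace(input_tokens=100, output_tokens=50)

# Same tokens under either schema must yield the same cost.
assert cost(chat, d("0"), d("0.001"), d("0.002")) == d("0.2")
assert cost(chat, d("0"), d("0.001"), d("0.002")) == cost(
    responses, d("0"), d("0.001"), d("0.002")
)
```

The real tests would construct the actual provider types and call `get_llm_cost`, but the assertion shape is the same.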

In `@src/ares/llms/openai_responses_converter_test.py`:
- Around line 442-448: The parametrize matrix for "filename" only includes
text-only and single-call fixtures, so add a parallel multi-call fixture
filename (e.g., "openai_responses_tool_call_multi.json") to the list so the test
case exercises multiple function_call output items; update the
pytest.mark.parametrize list in src/ares/llms/openai_responses_converter_test.py
to include that new filename and ensure a corresponding fixture JSON is added
with multiple function_call entries to mirror the new multi-tool Responses flow.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f6e79f40-d0e3-4ee6-831d-f91a2d0dd53b

📥 Commits

Reviewing files that changed from the base of the PR and between c804aa2 and 66617e6.

📒 Files selected for processing (39)
  • CLAUDE.md
  • examples/07_mech_interp_hooked_transformer.py
  • examples/utils.py
  • pyproject.toml
  • src/ares/code_agents/mini_swe_agent.py
  • src/ares/code_agents/mini_swe_agent_v2.py
  • src/ares/code_agents/mini_swe_agent_v2_configs.py
  • src/ares/code_agents/terminus2/terminus2_agent.py
  • src/ares/containers/docker.py
  • src/ares/contrib/llama_cpp.py
  • src/ares/contrib/transformers_client.py
  • src/ares/contrib/transformers_client_test.py
  • src/ares/environments/twenty_questions.py
  • src/ares/llms/__init__.py
  • src/ares/llms/accounting.py
  • src/ares/llms/anthropic_converter.py
  • src/ares/llms/anthropic_converter_test.py
  • src/ares/llms/chat_completions_compatible.py
  • src/ares/llms/openai_chat_converter.py
  • src/ares/llms/openai_chat_converter_test.py
  • src/ares/llms/openai_responses_converter.py
  • src/ares/llms/openai_responses_converter_test.py
  • src/ares/llms/request_test.py
  • src/ares/llms/response.py
  • src/ares/llms/response_test.py
  • src/ares/llms/test_data/anthropic_messages_mixed_content.json
  • src/ares/llms/test_data/anthropic_messages_text_only.json
  • src/ares/llms/test_data/anthropic_messages_tool_use_parallel.json
  • src/ares/llms/test_data/anthropic_messages_tool_use_single.json
  • src/ares/llms/test_data/openai_empty.json
  • src/ares/llms/test_data/openai_responses_text_only.json
  • src/ares/llms/test_data/openai_responses_tool_call_single.json
  • src/ares/llms/test_data/openai_text_only.json
  • src/ares/llms/test_data/openai_tool_call_malformed.json
  • src/ares/llms/test_data/openai_tool_call_mixed.json
  • src/ares/llms/test_data/openai_tool_call_parallel.json
  • src/ares/llms/test_data/openai_tool_call_single.json
  • src/ares/presets.py
  • src/ares/testing/mock_llm_test.py

Comment thread examples/utils.py
Comment on lines +42 to 50
if isinstance(action.data[0], llms.TextData):
action_content = action.data[0].content
elif isinstance(action.data[0], llms.ToolUseData):
action_content = action.data[0].input
else:
assert_never(action.data[0])
action_preview = str(action_content)[:200]
if len(action_content) > 200:
action_preview += "..."

⚠️ Potential issue | 🟠 Major

print_step can crash on empty responses and mis-handle truncation for tool input.

Line 42 indexes action.data[0] without guarding for empty data, and Line 49 uses len(action_content) (dict key count for tool input) instead of string length.

Proposed fix
-    if isinstance(action.data[0], llms.TextData):
-        action_content = action.data[0].content
-    elif isinstance(action.data[0], llms.ToolUseData):
-        action_content = action.data[0].input
-    else:
-        assert_never(action.data[0])
+    if not action.data:
+        action_content = "(no response data)"
+    else:
+        first_block = action.data[0]
+        if isinstance(first_block, llms.TextData):
+            action_content = first_block.content
+        elif isinstance(first_block, llms.ToolUseData):
+            action_content = str(first_block.input)
+        else:
+            assert_never(first_block)
     action_preview = str(action_content)[:200]
-    if len(action_content) > 200:
+    if len(str(action_content)) > 200:
         action_preview += "..."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/utils.py` around lines 42 - 50, The print_step code accesses
action.data[0] without checking for empty .data and uses len(action_content)
which is wrong when action_content is a dict for llms.ToolUseData; update the
logic in print_step to first guard for empty action.data (handle by setting
action_content = "" or a placeholder like "<empty response>"), then for
llms.ToolUseData convert the input to a string (e.g., str(action.data[0].input))
so truncation operates on characters, compute preview using the string length
and slice that string into action_preview, and keep the existing ellipsis logic
to append "..." only when the string length exceeds the cutoff; reference
llms.TextData, llms.ToolUseData, action_content and action_preview to locate the
changes.

Comment on lines +315 to +316
parallel_tool_calls: bool = True # Enable parallel tool calls


⚠️ Potential issue | 🟠 Major

parallel_tool_calls is currently a no-op.

The flag is configurable, but query() never forwards it to the LLM request and command execution still awaits each tool call serially. Both settings therefore behave the same, so the advertised parallel-tool-call behavior is not actually enabled.

Also applies to: 451-458, 495-498

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/code_agents/mini_swe_agent_v2.py` around lines 315 - 316, The
parallel_tool_calls flag is defined but never used; update the query(...)
caller/definition to accept and forward parallel_tool_calls into the
tool-execution path and into the LLM request options (where applicable), then
modify the tool-execution helpers (e.g., the functions named _call_tools,
_execute_tool_calls, run_tool_calls or similar) to branch on parallel_tool_calls
and run tool calls concurrently (use asyncio.gather / asyncio.create_task or
concurrent.futures for sync tools) when True and await serially when False;
ensure the flag is threaded through all call sites you referenced (around the
symbols query, _call_tools/_execute_tool_calls/run_tool_calls and any LLM
request builder) so the behavior actually toggles between parallel and serial
execution.
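The prompt's suggested branch between concurrent and serial execution is a standard `asyncio.gather` pattern. A sketch under assumed names (`run_tool`, `execute_tool_calls` are hypothetical stand-ins for the agent's tool-execution helpers):

```python
# Sketch of how parallel_tool_calls could actually toggle behavior.
import asyncio

async def run_tool(call: dict) -> str:
    await asyncio.sleep(0.01)  # simulate I/O-bound tool work
    return f"{call['name']}: done"

async def execute_tool_calls(calls: list[dict], parallel: bool) -> list[str]:
    if parallel:
        # Fan out all calls concurrently; gather preserves input order.
        return list(await asyncio.gather(*(run_tool(c) for c in calls)))
    results = []
    for c in calls:  # serial: await one call at a time
        results.append(await run_tool(c))
    return results

calls = [{"name": "read_file"}, {"name": "grep"}]
out = asyncio.run(execute_tool_calls(calls, parallel=True))
assert out == ["read_file: done", "grep: done"]
```

The flag would still need to be threaded into the LLM request so the provider knows parallel calls are permitted.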

Comment on lines +396 to +403
# Determine working directory
workdir = self.working_directory or "/"

# Add initial task message
instance_message = jinja2.Template(self.instance_template, undefined=jinja2.StrictUndefined).render(
task=task,
working_directory=workdir,
)

⚠️ Potential issue | 🟠 Major

Keep the advertised workdir in sync with exec_run().

When working_directory is unset, the prompt tells the model it is operating in /, but exec_run() receives None and falls back to the container's default directory instead. That mismatch can send reads/edits to the wrong tree.

Also applies to: 562-567

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/code_agents/mini_swe_agent_v2.py` around lines 396 - 403, The
prompt's displayed working directory and the value passed to exec_run() are
inconsistent; change the code so both use the same variable (do not coerce None
to "/" for the template while passing None to exec_run()). Use a single workdir
variable set directly from self.working_directory (allowing None) and pass that
workdir into the jinja2.Template.render(...) and into any exec_run(...) calls so
the advertised directory matches actual exec behavior; repeat the same change
for the other occurrence around the instance_template use at the 562-567
section.

Comment on lines +222 to +231
message_content = response.extract_text_content(llm_response)

self._add_message("assistant", message_content)

return response
return llm_response

async def execute_action(self, response: response.LLMResponse) -> None:
async def execute_action(self, llm_response: response.LLMResponse) -> None:
"""Execute the action and return the observation."""
_LOGGER.debug("[%d] Executing action.", id(self))
response_text = response.data[0].content
assert response_text is not None
response_text = response.extract_text_content(llm_response)

⚠️ Potential issue | 🟠 Major

Reject tool-call responses explicitly instead of silently dropping them.

If llm_response.data contains ToolUseData, extract_text_content(...) strips that information and this agent falls through to parse_action(""). On tool-capable models that means empty assistant turns, wasted retries, and cost burn until the limits trip, even though the real issue is that MiniSWECodeAgent only understands text/bash output.

🛠️ Suggested guard
+    def _extract_response_text(self, llm_response: response.LLMResponse) -> str:
+        if any(isinstance(block, response.ToolUseData) for block in llm_response.data):
+            raise _FormatError(
+                "MiniSWECodeAgent only supports text actions. Use MiniSWECodeAgentV2 for tool-call responses."
+            )
+        return response.extract_text_content(llm_response)
+
     async def query(self) -> response.LLMResponse:
         ...
-        message_content = response.extract_text_content(llm_response)
+        message_content = self._extract_response_text(llm_response)
         self._add_message("assistant", message_content)
         return llm_response
 
     async def execute_action(self, llm_response: response.LLMResponse) -> None:
         ...
-        response_text = response.extract_text_content(llm_response)
+        response_text = self._extract_response_text(llm_response)
         action = self.parse_action(response_text)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/code_agents/mini_swe_agent.py` around lines 222 - 231, The agent
currently strips tool-call responses via response.extract_text_content(...) and
falls through to parse_action("") causing wasted retries; update execute_action
to explicitly detect tool-call outputs by checking llm_response.data for
ToolUseData (or the tool-use discriminant) before calling
response.extract_text_content, and reject them (e.g., raise a clear exception or
return an error observation) so callers know the model returned a tool
invocation the agent can't handle; ensure to reference/ import the ToolUseData
type and update execute_action (and any callers expecting parse_action) to
handle this explicit rejection instead of silently parsing an empty string.

Comment on lines +108 to +110
if self._container is None:
raise RuntimeError("Container not started - call start() first")


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether running-state checks (reload/status) exist in guarded methods.
rg -n --type=py -C3 'def (exec_run|upload_files|download_files)|reload\(|\.status\b' src/ares/containers/docker.py

Repository: withmartian/ares

Length of output: 985


🏁 Script executed:

# Get full method implementations for the three guarded methods
sed -n '96,132p' src/ares/containers/docker.py
echo "---"
sed -n '133,163p' src/ares/containers/docker.py
echo "---"
sed -n '164,200p' src/ares/containers/docker.py

Repository: withmartian/ares

Length of output: 4337


🏁 Script executed:

# Check the stop() method to see how container state is managed
rg -n --type=py -A 10 'async def stop' src/ares/containers/docker.py

Repository: withmartian/ares

Length of output: 437


🏁 Script executed:

# Check if there are any status checks or reload calls anywhere in the file
rg -n --type=py 'reload|\.status|is_running' src/ares/containers/docker.py

Repository: withmartian/ares

Length of output: 42


State guards only check if container was started, not whether it is still running.

These checks miss externally stopped/exited containers (_container can be non-None but unusable), causing operations to fail deeper in the docker API with unclear errors instead of early with a clear message.

🔧 Suggested pattern (single running-state guard reused across methods)
+    async def _require_running_container(self) -> docker.models.containers.Container:
+        if self._container is None:
+            raise RuntimeError("Container not started - call start() first")
+        await asyncio.to_thread(self._container.reload)
+        if self._container.status != "running":
+            raise RuntimeError("Container is not running - call start() first")
+        return self._container
+
     async def exec_run(
         self,
         command: str,
@@
-        if self._container is None:
-            raise RuntimeError("Container not started - call start() first")
+        container = await self._require_running_container()
@@
-                self._container.exec_run,
+                container.exec_run,

Also applies to: lines 135-137 (upload_files), 166-168 (download_files)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/containers/docker.py` around lines 108 - 110, The methods only check
self._container is not None but not whether the container is still running; add
a single reusable guard method (e.g., _ensure_running()) that calls
self._container.reload() (or otherwise refreshes state) and asserts
self._container is not None and self._container.status == "running" (or checks
container.attrs["State"]["Running"]) and raise a clear RuntimeError like
"Container not running - call start() or restart the container" if not; replace
the simple None checks in start-related operations and in upload_files and
download_files with calls to _ensure_running() so operations fail fast with a
clear message when the container has been externally stopped or exited.
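The reusable guard suggested above can be exercised without Docker at all; `FakeContainer` below is a stand-in for the SDK's `Container` object, which exposes `.reload()` and `.status`:

```python
# Sketch of a single liveness guard shared by exec_run/upload_files/download_files.
import asyncio

class FakeContainer:
    def __init__(self, status: str):
        self.status = status
    def reload(self) -> None:
        pass  # the real SDK refreshes attrs from the Docker daemon here

class Manager:
    def __init__(self, container):
        self._container = container

    async def _require_running_container(self):
        if self._container is None:
            raise RuntimeError("Container not started - call start() first")
        # reload() is blocking, so run it off the event loop.
        await asyncio.to_thread(self._container.reload)
        if self._container.status != "running":
            raise RuntimeError("Container not running - call start() or restart it")
        return self._container

async def main():
    ok = await Manager(FakeContainer("running"))._require_running_container()
    assert ok.status == "running"
    for bad in (None, FakeContainer("exited")):
        try:
            await Manager(bad)._require_running_container()
        except RuntimeError:
            pass
        else:
            raise AssertionError("expected RuntimeError")

asyncio.run(main())
```

Each guarded method would then call `container = await self._require_running_container()` instead of the bare `None` check.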

Comment on lines +370 to 372
if not isinstance(action.data[0], response.TextData):
return ""
question = action.data[0].content if action.data else ""

⚠️ Potential issue | 🟠 Major

Handle empty action.data before checking the first block type.

Line 370 indexes action.data[0] unconditionally. Empty agent responses will crash step() with IndexError.

Proposed fix
-        if not isinstance(action.data[0], response.TextData):
+        if not action.data:
+            return ""
+        if not isinstance(action.data[0], response.TextData):
             return ""
-        question = action.data[0].content if action.data else ""
+        question = action.data[0].content
         return question.strip()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/environments/twenty_questions.py` around lines 370 - 372, The code
indexes action.data[0] before ensuring action.data is non-empty which can raise
IndexError in step(); change the check to first confirm action.data is truthy
and that its first element is a response.TextData (e.g., if not action.data or
not isinstance(action.data[0], response.TextData): return ""), then set question
= action.data[0].content; update the block around step(), action.data and
response.TextData to follow this order to avoid indexing empty lists.

Comment on lines +515 to +517
# Extract text content (first TextData block)
text_blocks = [block for block in response.data if isinstance(block, llm_response.TextData)]
content = text_blocks[0].content if text_blocks else None

⚠️ Potential issue | 🟠 Major

Only the first text block survives ChatCompletion conversion.

LLMResponse.data can now contain multiple TextData blocks, but this code serializes only text_blocks[0]. Converting an Anthropic/Responses-derived response through Chat Completions will silently drop the rest of the assistant text.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/llms/openai_chat_converter.py` around lines 515 - 517, The code
currently takes only the first TextData block from response.data (variables:
response.data, llm_response.TextData, text_blocks, content) which drops any
subsequent assistant text; change it to preserve all TextData blocks by
concatenating their contents in order (e.g., join all b.content from text_blocks
with a separator such as "\n\n") or otherwise propagate the full list where the
downstream logic expects multiple blocks, and update any downstream consumers
that assume a single string (search for uses of the content variable) to handle
the combined text or list accordingly.

Comment on lines +477 to +495
elif isinstance(item, openai.types.responses.ResponseFunctionToolCall):
call_id = item.call_id
name = item.name
arguments_str = item.arguments

# Parse arguments JSON
try:
input_dict = json.loads(arguments_str)
except json.JSONDecodeError:
input_dict = {
"_raw_arguments": arguments_str,
"_parse_error": "Invalid JSON",
}

data.append(
response_lib.ToolUseData(
id=call_id,
name=name,
input=input_dict,

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In the OpenAI Responses API, what is the difference between ResponseFunctionToolCall.id and ResponseFunctionToolCall.call_id, and should both be preserved when reconstructing a Response object?

💡 Result:

  • ResponseFunctionToolCall.call_id is the tool-invocation correlation ID you must echo back when you submit the tool result (by sending a function_call_output item with the same call_id). The docs explicitly describe call_id as “used later to submit the function result,” and all examples of returning tool results key off call_id. [1][2]
  • ResponseFunctionToolCall.id is the ID of the function-call output item itself (the “item”), used to identify that specific output element in the response/streaming protocol (e.g., streaming events refer to an item_id that is “the id of the function call item”). [3]

Should you preserve both when reconstructing a Response?
Yes—preserve both. You need call_id to correctly associate function_call_output with the right call in subsequent turns, and preserving id maintains the identity of the output item (important for faithfully reconstructing/streaming/traceability). [1][2][3]

Sources
[1] OpenAI Developers – Function calling guide (Responses API): call_id used later to submit function result; sample includes both id and call_id. (platform.openai.com)
[2] OpenAI Developers – Function calling guide (Responses API): function_call_output uses call_id. (platform.openai.com)
[3] OpenAI Developers – Function calling guide (Responses API): streaming events reference item_id as the id of the function-call item. (platform.openai.com)


🏁 Script executed:

# First, check the exact code at lines 477-495
sed -n '477,495p' src/ares/llms/openai_responses_converter.py

Repository: withmartian/ares

Length of output: 718


🏁 Script executed:

# Check the code at lines 573-579
sed -n '573,579p' src/ares/llms/openai_responses_converter.py

Repository: withmartian/ares

Length of output: 336


🏁 Script executed:

# Find and examine the ToolUseData definition
rg -A 5 'class ToolUseData' --type py

Repository: withmartian/ares

Length of output: 344


🏁 Script executed:

# Check how ares_response_to_external handles tool blocks - look for the reverse conversion
rg -B 5 -A 10 'ares_response_to_external' src/ares/llms/openai_responses_converter.py | head -80

Repository: withmartian/ares

Length of output: 457


Preserve both Responses tool identifiers (id and call_id).

ResponseFunctionToolCall has two distinct identifiers that serve different purposes:

  • call_id: the tool-invocation correlation ID, must be echoed back in function_call_output submissions
  • id: the identity of the output item itself, used for streaming, traceability, and item references

Currently:

  1. At lines 477-495 (forward conversion), only call_id is extracted and assigned to ToolUseData.id, discarding the original item id
  2. At lines 573-579 (reverse conversion), the single stored value is written to both id and call_id, creating semantic misalignment on round-trip

This breaks round-trip conversion and can cause downstream failures in consumers that rely on the distinction between item identity and call correlation. Extend ToolUseData to hold both values separately, then update both conversions to preserve and restore them correctly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/llms/openai_responses_converter.py` around lines 477 - 495, The
forward conversion currently maps ResponseFunctionToolCall.call_id into
response_lib.ToolUseData.id and drops the original item.id; update the
ToolUseData model to include both fields (e.g., id and call_id), then change the
forward converter handling ResponseFunctionToolCall to populate both
ToolUseData.id = item.id and ToolUseData.call_id = item.call_id (preserving
parsed arguments into input), and update the reverse conversion that
reconstructs ResponseFunctionToolCall to restore both fields (write item.id from
ToolUseData.id and item.call_id from ToolUseData.call_id) so round-trip
semantics and the distinction between item identity and invocation correlation
are preserved.
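The round-trip property the comment asks for can be sketched with a minimal stand-in for the real `ToolUseData` model (field names and converter signatures below are assumptions, not the ares API):

```python
# Sketch: carry both Responses identifiers through a lossless round-trip.
from dataclasses import dataclass

@dataclass
class ToolUseData:
    id: str        # identity of the function_call output item itself
    call_id: str   # correlation id echoed back in function_call_output
    name: str
    input: dict

def from_external(item: dict) -> ToolUseData:
    return ToolUseData(id=item["id"], call_id=item["call_id"],
                       name=item["name"], input=item["arguments"])

def to_external(block: ToolUseData) -> dict:
    return {"id": block.id, "call_id": block.call_id,
            "name": block.name, "arguments": block.input}

item = {"id": "fc_1", "call_id": "call_9", "name": "grep", "arguments": {"q": "x"}}
assert to_external(from_external(item)) == item  # lossless round-trip
```

A round-trip assertion like the last line makes a good regression test once the real model holds both fields.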

Comment thread src/ares/llms/openai_responses_converter.py
Comment on lines +34 to +38
# Extract data as we would in the LLM client
message = completion.choices[0].message
data = []
if message.content:
data.append(response.TextData(content=message.content))

⚠️ Potential issue | 🟠 Major

These tests never hit the production parsing path.

Most of the new cases rebuild TextData/ToolUseData inline with local json.loads(...) logic instead of calling the actual converter/helper code added in this PR. That means ares_response_from_external() or extract_text_content() can regress without failing this suite.

Also applies to: 49-55, 167-267

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/llms/response_test.py` around lines 34 - 38, Tests are constructing
TextData/ToolUseData manually from completion payloads and thus bypass the real
parsing logic; update the tests to feed the same raw completion payloads into
the actual parsing helpers (e.g., call ares_response_from_external(...) and/or
extract_text_content(...) with the completion/choice message instead of
rebuilding objects via json.loads) so the test exercises the production parsing
path (replace local construction around completion.choices[0].message and
similar blocks at the other ranges with calls to those helper functions and
assert on their outputs).

@rsmith49
Contributor

@coderabbitai full review

@coderabbitai

coderabbitai bot commented Apr 10, 2026

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/ares/contrib/llama_cpp.py (1)

89-105: ⚠️ Potential issue | 🔴 Critical

Parameter mismatch: max_completion_tokens will cause runtime error in llama-cpp-python.

ares_request_to_external() returns max_completion_tokens, but llama_cpp.Llama.create_chat_completion() does not recognize this parameter—it expects max_tokens instead. Any request with max_output_tokens set will fail with a TypeError. Additionally, the response path discards tool_calls even though both the converter and llama-cpp-python support tool-calling. Filter the kwargs to match llama.cpp's actual signature (max_tokens, drop metadata/service_tier if unsupported) and update the response handler to preserve tool_calls when present, or use the same OpenAI response converter used by other clients.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/contrib/llama_cpp.py` around lines 89 - 105, The completion_kwargs
produced by ares_request_to_external contains max_completion_tokens and other
keys that llama_cpp.Llama.create_chat_completion doesn't accept, so before
calling self._llm.create_chat_completion map/rename max_completion_tokens ->
max_tokens, remove unsupported keys (e.g., metadata, service_tier) and only pass
keys that match llama.cpp's signature; after awaiting create_chat_completion and
model-validating chat_completion update the response assembly (where
content/usage and response.LLMResponse are created) to also preserve tool_calls
if present on chat_completion. Ensure you modify the code around
completion_kwargs, the call to self._llm.create_chat_completion, and the
chat_completion -> response conversion (references: ares_request_to_external,
completion_kwargs, self._llm.create_chat_completion, chat_completion,
response.LLMResponse, response.TextData, tool_calls, max_completion_tokens ->
max_tokens).
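The kwargs adaptation itself is a small dict transform. A sketch (the allowed/unsupported key sets are assumptions and should be checked against the installed llama-cpp-python signature):

```python
# Sketch: adapt OpenAI-style kwargs for llama_cpp.Llama.create_chat_completion.
def adapt_kwargs(openai_kwargs: dict) -> dict:
    kwargs = dict(openai_kwargs)
    # llama.cpp expects max_tokens rather than max_completion_tokens.
    if "max_completion_tokens" in kwargs:
        kwargs["max_tokens"] = kwargs.pop("max_completion_tokens")
    # Drop keys llama.cpp does not accept (assumed set; verify locally).
    for unsupported in ("metadata", "service_tier"):
        kwargs.pop(unsupported, None)
    return kwargs

out = adapt_kwargs({"max_completion_tokens": 256, "temperature": 0.2,
                    "metadata": {"run": "1"}, "service_tier": "auto"})
assert out == {"max_tokens": 256, "temperature": 0.2}
```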
♻️ Duplicate comments (3)
src/ares/containers/docker.py (1)

108-110: ⚠️ Potential issue | 🟠 Major

Container guards still allow operations on externally stopped containers.

These checks only verify initialization, not liveness. If the container exited after start(), these methods fail deeper in Docker calls with less actionable errors.

🔧 Suggested fix (single reusable running-state guard)
+    async def _require_running_container(self) -> docker.models.containers.Container:
+        if self._container is None:
+            raise RuntimeError("Container not started - call start() first")
+        await asyncio.to_thread(self._container.reload)
+        if self._container.status != "running":
+            raise RuntimeError("Container not running - call start() or restart the container")
+        return self._container
+
     async def exec_run(
         self,
         command: str,
@@
-        if self._container is None:
-            raise RuntimeError("Container not started - call start() first")
+        container = await self._require_running_container()
@@
-                self._container.exec_run,
+                container.exec_run,
@@
     async def upload_files(self, local_paths: list[pathlib.Path], remote_paths: list[str]) -> None:
@@
-        if self._container is None:
-            raise RuntimeError("Container not started - call start() first")
+        container = await self._require_running_container()
@@
-                self._container.exec_run,
+                container.exec_run,
@@
-                self._container.put_archive,
+                container.put_archive,
@@
     async def download_files(self, remote_paths: list[str], local_paths: list[pathlib.Path]) -> None:
@@
-        if self._container is None:
-            raise RuntimeError("Container not started - call start() first")
+        container = await self._require_running_container()
@@
-                self._container.get_archive,
+                container.get_archive,

Also applies to: 135-137, 166-168

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/containers/docker.py` around lines 108 - 110, The current guards
only check that self._container is not None but don't verify the container's
liveness; add a reusable helper (e.g. _ensure_running or
_check_container_running) that calls self._container.reload() and checks
self._container.status == "running" and raises a clear RuntimeError like
"Container not running - call start() or restart the container" if not running;
replace the simple None checks in the affected methods (the ones using
self._container at lines shown) to call this helper before making Docker SDK
calls so operations fail fast with an actionable message.
examples/utils.py (1)

42-50: ⚠️ Potential issue | 🟠 Major

print_step still needs empty-data guarding and string-based truncation for tool input.

action.data[0] can raise on empty responses, and len(action_content) is incorrect when action_content is a dict.

✅ Suggested robust handling
-    if isinstance(action.data[0], llms.TextData):
-        action_content = action.data[0].content
-    elif isinstance(action.data[0], llms.ToolUseData):
-        action_content = action.data[0].input
-    else:
-        assert_never(action.data[0])
-    action_preview = str(action_content)[:200]
-    if len(action_content) > 200:
+    if not action.data:
+        action_content = "(no response data)"
+    else:
+        first_block = action.data[0]
+        if isinstance(first_block, llms.TextData):
+            action_content = first_block.content
+        elif isinstance(first_block, llms.ToolUseData):
+            action_content = str(first_block.input)
+        else:
+            assert_never(first_block)
+    action_preview = str(action_content)[:200]
+    if len(str(action_content)) > 200:
         action_preview += "..."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/utils.py` around lines 42 - 50, Guard against empty action.data in
print_step by checking if action.data is truthy and has at least one item before
accessing action.data[0]; if empty, set action_content = "" (or a sensible
default). When handling llms.ToolUseData, coerce the input to a string (e.g.,
action_content = str(action.data[0].input)) so truncation and length checks
operate on a string; likewise for llms.TextData use str(action.data[0].content).
Compute action_preview from the stringified content (action_str =
str(action_content); action_preview = action_str[:200]) and append "..." only
when len(action_str) > 200. Ensure assert_never is still called for unexpected
types.
src/ares/llms/openai_chat_converter.py (1)

515-517: ⚠️ Potential issue | 🟡 Minor

Only the first text block is preserved in ChatCompletion conversion.

When converting back to OpenAI format, only text_blocks[0].content is used. If the LLMResponse contains multiple TextData blocks (e.g., from Anthropic responses), the additional text is silently dropped.

This appears to be a limitation of the OpenAI ChatCompletion format which only supports a single content field. Consider adding a warning log when multiple text blocks exist.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/llms/openai_chat_converter.py` around lines 515 - 517, The code
currently extracts only the first TextData block from response (text_blocks =
[block for block in response.data if isinstance(block, llm_response.TextData)];
content = text_blocks[0].content if text_blocks else None) which silently drops
any additional TextData entries; update the conversion to detect when
len(text_blocks) > 1 and emit a warning log (including a clear message like
"Multiple TextData blocks detected; only the first will be used" and include
metadata such as number of blocks) before continuing to use
text_blocks[0].content so callers are alerted to dropped content; reference the
existing logger in this module (e.g., logger or module-level logger) when adding
the warning.
🧹 Nitpick comments (7)
src/ares/code_agents/terminus2/terminus2_agent.py (1)

530-531: Handle non-text LLM blocks explicitly instead of relying on generic exception flow.

response.extract_text_content() now raises on empty responses and on leading ToolUseData. In run(), that falls into the broad except Exception path; in _unwrap_single_response(), it aborts summarization. A small targeted ValueError handler here would give this text-only agent a predictable failure mode if a heterogeneous response slips through.

Also applies to: 883-883

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/code_agents/terminus2/terminus2_agent.py` around lines 530 - 531,
The call to response.extract_text_content(llm_resp) can raise ValueError for
empty or non-text (ToolUseData) blocks, but presently those flow into a broad
except; update the run() flow around the self._query_llm() /
response.extract_text_content(llm_resp) sequence to catch ValueError explicitly,
log or surface a clear predictable text-only failure (e.g., set
assistant_message to a specific error/abort state or raise a more specific
exception), while letting other exceptions continue to the existing broad
handler; apply the same targeted ValueError handling in
_unwrap_single_response() where response.extract_text_content is used so
summarization aborts clearly on non-text blocks rather than relying on the
generic exception path.
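The shape of the targeted handler might look like the following sketch (`extract_text_content` and the loop body are stand-ins, not the actual terminus2 code):

```python
# Hypothetical sketch: a text-only extractor that raises ValueError on
# empty/non-text responses, and a caller that handles that case explicitly
# instead of letting it fall into a broad `except Exception`.
def extract_text_content(blocks: list[str]) -> str:
    if not blocks:
        raise ValueError("empty response")
    return blocks[0]


def step(blocks: list[str]) -> str:
    try:
        return extract_text_content(blocks)
    except ValueError as exc:
        # Predictable, text-only failure mode for heterogeneous responses;
        # other exceptions still propagate to the existing broad handler.
        return f"(aborted: non-text or empty LLM response: {exc})"
```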
src/ares/llms/openai_chat_converter_test.py (1)

36-41: Specify encoding when opening JSON files.

For consistency and cross-platform compatibility, specify encoding="utf-8".

♻️ Suggested fix
 def _load_test_completion(filename: str) -> openai.types.chat.ChatCompletion:
     """Load a test JSON file and convert to OpenAI ChatCompletion."""
     test_data_dir = Path(__file__).parent / "test_data"
-    with open(test_data_dir / filename) as f:
+    with open(test_data_dir / filename, encoding="utf-8") as f:
         data = json.load(f)
     return openai.types.chat.ChatCompletion(**data)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/llms/openai_chat_converter_test.py` around lines 36 - 41, The helper
function _load_test_completion currently opens JSON test files without
specifying an encoding; update the open call inside _load_test_completion (the
with open(test_data_dir / filename) as f:) to explicitly use encoding="utf-8" so
json.load reads files consistently across platforms and locales.
src/ares/llms/openai_responses_converter_test.py (1)

431-436: Specify encoding when opening JSON files.

The open() call should specify encoding="utf-8" to ensure consistent behavior across platforms.

♻️ Suggested fix
 def _load_test_response(filename: str) -> openai.types.responses.Response:
     """Load a test JSON file and convert to OpenAI Response."""
     test_data_dir = pathlib.Path(__file__).parent / "test_data"
-    with open(test_data_dir / filename) as f:
+    with open(test_data_dir / filename, encoding="utf-8") as f:
         data = json.load(f)
     return openai.types.responses.Response(**data)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/llms/openai_responses_converter_test.py` around lines 431 - 436, The
helper _load_test_response currently opens JSON test files without specifying an
encoding, which can cause platform-dependent behavior; update the open call
inside _load_test_response to pass encoding="utf-8" (i.e., open(test_data_dir /
filename, encoding="utf-8")) so json.load reads UTF-8 consistently when
constructing the openai.types.responses.Response.
src/ares/llms/anthropic_converter_test.py (1)

305-310: Specify encoding when opening JSON files.

Same as other test files - the open() call should specify encoding="utf-8".

♻️ Suggested fix
 def _load_test_message(filename: str) -> anthropic.types.Message:
     """Load a test JSON file and convert to Anthropic Message."""
     test_data_dir = pathlib.Path(__file__).parent / "test_data"
-    with open(test_data_dir / filename) as f:
+    with open(test_data_dir / filename, encoding="utf-8") as f:
         data = json.load(f)
     return anthropic.types.Message(**data)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/llms/anthropic_converter_test.py` around lines 305 - 310, The helper
function _load_test_message currently opens JSON test files without specifying
an encoding; update the open(...) call in _load_test_message to include
encoding="utf-8" so json.load reads files with the correct text encoding (keep
the rest of the function logic, return anthropic.types.Message(**data)
unchanged).
src/ares/code_agents/mini_swe_agent_v2.py (1)

610-614: Move json import to module level.

The json import inside _dict_to_json should be at the module level per Python conventions.

♻️ Proposed fix

Add to top-level imports:

import json

Then simplify the function:

 def _dict_to_json(d: dict) -> str:
     """Convert a dictionary to a JSON string."""
-    import json
-
     return json.dumps(d)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/code_agents/mini_swe_agent_v2.py` around lines 610 - 614, The json
import is inside the helper function _dict_to_json which violates module-level
import conventions; move the import to the top of the module (add import json to
the file's imports) and remove the inline import from inside _dict_to_json so
the function simply calls json.dumps(d); ensure no other inline json imports
remain in this module.
src/ares/code_agents/mini_swe_agent_v2_configs.py (1)

12-166: Template duplication with mini_swe_agent_v2.py.

The templates and environment variables defined here (_DEFAULT_ENV_VARS, _TOOLCALL_SYSTEM_TEMPLATE, _TOOLCALL_INSTANCE_TEMPLATE, _SHARED_OBSERVATION_TEMPLATE, _TOOLCALL_FORMAT_ERROR_TEMPLATE) are duplicated from mini_swe_agent_v2.py. This creates a maintenance burden where changes must be synchronized across both files.

Consider having the agent import from this config module, or vice versa, to maintain a single source of truth.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/code_agents/mini_swe_agent_v2_configs.py` around lines 12 - 166, The
two modules duplicate environment and template constants (_DEFAULT_ENV_VARS,
_TOOLCALL_SYSTEM_TEMPLATE, _TOOLCALL_INSTANCE_TEMPLATE,
_SHARED_OBSERVATION_TEMPLATE, _TOOLCALL_FORMAT_ERROR_TEMPLATE); consolidate them
into a single source of truth by keeping these definitions in this config module
and updating mini_swe_agent_v2.py to import those symbols instead of redefining
them (or vice versa if you prefer the other file as canonical); remove the
duplicate definitions from the non-canonical file, update all references to use
the imported names (e.g., replace local references to _DEFAULT_ENV_VARS,
_TOOLCALL_SYSTEM_TEMPLATE, _TOOLCALL_INSTANCE_TEMPLATE,
_SHARED_OBSERVATION_TEMPLATE, _TOOLCALL_FORMAT_ERROR_TEMPLATE with imports), and
run tests to ensure no import cycles or name changes break usage.
src/ares/llms/response.py (1)

7-8: Module-level imports create coupling to external API types.

Importing anthropic.types and openai.types.responses at the module level for type annotations couples this core response module to specific external SDKs. This means importing response.py requires both SDKs to be installed.

Consider using TYPE_CHECKING guards if these are only needed for type hints:

♻️ Proposed refactor
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    import anthropic.types
+    import openai.types.responses

Then use string annotations for the types, or keep as-is if runtime access is needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/llms/response.py` around lines 7 - 8, The module-level imports of
external SDK types (anthropic.types and openai.types.responses) cause runtime
coupling; wrap those imports inside a TYPE_CHECKING guard (from typing import
TYPE_CHECKING) and move any references to those types into either string
annotations or enable postponed evaluation with "from __future__ import
annotations" so runtime imports are not required; update the annotations in this
module (e.g., any functions or classes that currently reference anthropic.types
or openai.types.responses) to use quoted type names or rely on TYPE_CHECKING to
import them only for type checking.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Around line 7-10: The wheel exclusion list in pyproject.toml currently only
excludes files matching "**/*_test.py" which misses the "test_*.py" convention
(e.g., integration_tests/test_default_workdir.py); update the exclude array to
also include the pattern "**/test_*.py" so both naming variants are excluded
from the built wheel, ensuring the exclude list contains "**/*_test.py",
"**/test_*.py", and any existing entries like "**/test_data/".

In `@src/ares/code_agents/mini_swe_agent_v2_configs.py`:
- Around line 450-464: Update the text_based() docstring in MiniSWEAgentV2Config
to remove or correct the reference to the non-existent
MiniSWECodeAgentV2TextBased class: change the sentence to state that the
text_based() config is compatible with the existing MiniSWECodeAgentV2 agent (or
simply remove the sentence), keeping the rest of the docstring intact;
specifically edit the method text_based() and replace the mention of
MiniSWECodeAgentV2TextBased with MiniSWECodeAgentV2 (or delete that clause) so
the docstring accurately reflects supported agent classes.

In `@src/ares/llms/openai_chat_converter.py`:
- Line 142: The converter function names in openai_chat_converter need to match
the documented Protocol: rename the functions ares_request_to_external and
ares_request_from_external to to_external and from_external respectively
(preserving their signatures, parameters like request: llm_request.LLMRequest
and strict: bool = True, return types, and internal logic), update all internal
references/calls in this module to use the new names, and ensure any imports or
registrations that expect the Protocol methods are updated so the class/module
implements to_external() and from_external() exactly as defined in request.py.

---

Outside diff comments:
In `@src/ares/contrib/llama_cpp.py`:
- Around line 89-105: The completion_kwargs produced by ares_request_to_external
contains max_completion_tokens and other keys that
llama_cpp.Llama.create_chat_completion doesn't accept, so before calling
self._llm.create_chat_completion map/rename max_completion_tokens -> max_tokens,
remove unsupported keys (e.g., metadata, service_tier) and only pass keys that
match llama.cpp's signature; after awaiting create_chat_completion and
model-validating chat_completion update the response assembly (where
content/usage and response.LLMResponse are created) to also preserve tool_calls
if present on chat_completion. Ensure you modify the code around
completion_kwargs, the call to self._llm.create_chat_completion, and the
chat_completion -> response conversion (references: ares_request_to_external,
completion_kwargs, self._llm.create_chat_completion, chat_completion,
response.LLMResponse, response.TextData, tool_calls, max_completion_tokens ->
max_tokens).
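A hedged sketch of the kwarg adaptation this outside-diff comment describes; the key names come from the comment itself (`max_completion_tokens` -> `max_tokens`, dropping `metadata` and `service_tier`), while the drop-list being exhaustive is an assumption, not llama.cpp's verified signature:

```python
# Illustrative adapter: rename/drop keys before calling
# llama_cpp.Llama.create_chat_completion. The set of unsupported keys is an
# assumption for this sketch, not an authoritative list.


def adapt_completion_kwargs(kwargs: dict) -> dict:
    adapted = dict(kwargs)
    # OpenAI-style name -> llama.cpp name.
    if "max_completion_tokens" in adapted:
        adapted["max_tokens"] = adapted.pop("max_completion_tokens")
    # Keys llama.cpp's chat completion is not expected to accept.
    for unsupported in ("metadata", "service_tier"):
        adapted.pop(unsupported, None)
    return adapted
```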


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0cb46145-1c48-4397-8c75-712293fee9be

📥 Commits

Reviewing files that changed from the base of the PR and between c804aa2 and 66617e6.

📒 Files selected for processing (39)
  • CLAUDE.md
  • examples/07_mech_interp_hooked_transformer.py
  • examples/utils.py
  • pyproject.toml
  • src/ares/code_agents/mini_swe_agent.py
  • src/ares/code_agents/mini_swe_agent_v2.py
  • src/ares/code_agents/mini_swe_agent_v2_configs.py
  • src/ares/code_agents/terminus2/terminus2_agent.py
  • src/ares/containers/docker.py
  • src/ares/contrib/llama_cpp.py
  • src/ares/contrib/transformers_client.py
  • src/ares/contrib/transformers_client_test.py
  • src/ares/environments/twenty_questions.py
  • src/ares/llms/__init__.py
  • src/ares/llms/accounting.py
  • src/ares/llms/anthropic_converter.py
  • src/ares/llms/anthropic_converter_test.py
  • src/ares/llms/chat_completions_compatible.py
  • src/ares/llms/openai_chat_converter.py
  • src/ares/llms/openai_chat_converter_test.py
  • src/ares/llms/openai_responses_converter.py
  • src/ares/llms/openai_responses_converter_test.py
  • src/ares/llms/request_test.py
  • src/ares/llms/response.py
  • src/ares/llms/response_test.py
  • src/ares/llms/test_data/anthropic_messages_mixed_content.json
  • src/ares/llms/test_data/anthropic_messages_text_only.json
  • src/ares/llms/test_data/anthropic_messages_tool_use_parallel.json
  • src/ares/llms/test_data/anthropic_messages_tool_use_single.json
  • src/ares/llms/test_data/openai_empty.json
  • src/ares/llms/test_data/openai_responses_text_only.json
  • src/ares/llms/test_data/openai_responses_tool_call_single.json
  • src/ares/llms/test_data/openai_text_only.json
  • src/ares/llms/test_data/openai_tool_call_malformed.json
  • src/ares/llms/test_data/openai_tool_call_mixed.json
  • src/ares/llms/test_data/openai_tool_call_parallel.json
  • src/ares/llms/test_data/openai_tool_call_single.json
  • src/ares/presets.py
  • src/ares/testing/mock_llm_test.py

Comment thread pyproject.toml
Comment on lines +7 to +10
exclude = [
"**/*_test.py",
"**/test_data/",
]

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n pyproject.toml | head -20

Repository: withmartian/ares

Length of output: 692


🏁 Script executed:

# Find all test files to understand naming patterns
find . -name '*test*.py' -type f | head -30

Repository: withmartian/ares

Length of output: 650


🏁 Script executed:

# Search for the coding guidelines mentioned
rg -i "naming pattern|test.*\.py" --type py --type md --type txt | head -20

Repository: withmartian/ares

Length of output: 768


Add test_*.py pattern to wheel exclusions.

The current pattern **/*_test.py excludes only the *_test.py naming variant. The file ./integration_tests/test_default_workdir.py (which follows the test_*.py naming pattern mentioned in CLAUDE.md as an acceptable alternative) would still be packaged in the wheel. Update the exclude list to include **/test_*.py:

Suggested pattern update
 exclude = [
     "**/*_test.py",
+    "**/test_*.py",
     "**/test_data/",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 7 - 10, The wheel exclusion list in
pyproject.toml currently only excludes files matching "**/*_test.py" which
misses the "test_*.py" convention (e.g.,
integration_tests/test_default_workdir.py); update the exclude array to also
include the pattern "**/test_*.py" so both naming variants are excluded from the
built wheel, ensuring the exclude list contains "**/*_test.py", "**/test_*.py",
and any existing entries like "**/test_data/".

Comment on lines +450 to +464
@classmethod
def text_based(cls) -> "MiniSWEAgentV2Config":
"""Create config for text-based agent (backwards compatibility).

This configuration uses regex parsing of markdown code blocks
instead of tool calls. Matches swebench_backticks.yaml.

Note: This requires using MiniSWECodeAgentV2TextBased instead
of MiniSWECodeAgentV2, as the parsing logic is different.
"""
return cls(
system_template=_TEXTBASED_SYSTEM_TEMPLATE,
instance_template=_TEXTBASED_INSTANCE_TEMPLATE,
format_error_template=_TEXTBASED_FORMAT_ERROR_TEMPLATE,
parallel_tool_calls=False, # Text-based doesn't support parallel

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for MiniSWECodeAgentV2TextBased class definition
rg -n 'class MiniSWECodeAgentV2TextBased' --type py

# Also check for any references to it
rg -n 'MiniSWECodeAgentV2TextBased' --type py

Repository: withmartian/ares

Length of output: 183


🏁 Script executed:

# Search for MiniSWECodeAgentV2 classes (with or without TextBased)
rg -n 'class MiniSWECodeAgentV2' --type py

# Search for TextBased references to understand naming conventions
rg -n 'TextBased' --type py | head -20

# Check what's in the file around the referenced method
cat -n src/ares/code_agents/mini_swe_agent_v2_configs.py | sed -n '440,470p'

Repository: withmartian/ares

Length of output: 1451


🏁 Script executed:

# Search for any agent classes that might be text-based variants
rg -n 'class.*Agent.*' src/ares/code_agents/ --type py | grep -i text

# Check for backticks or markdown parsing related agents
rg -n 'backtick' --type py

# Look for TODO/FIXME comments related to TextBased
rg -n 'TextBased|text.based' --type py

Repository: withmartian/ares

Length of output: 953


🏁 Script executed:

# Search more broadly for all agent classes in the codebase
rg -n 'class.*\(.*Agent' src/ares/code_agents/ --type py

# Check the actual agent implementation file
cat -n src/ares/code_agents/mini_swe_agent_v2.py | head -50

Repository: withmartian/ares

Length of output: 2796


Remove or correct the docstring reference to the non-existent MiniSWECodeAgentV2TextBased class.

The docstring states "This requires using MiniSWECodeAgentV2TextBased instead of MiniSWECodeAgentV2", but this class does not exist in the codebase. The text_based() method is a config factory that should work with the same MiniSWECodeAgentV2 agent class, just with different templates for text-based (backticks) parsing. Either clarify that users should still use MiniSWECodeAgentV2 with this config, or remove the misleading reference.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/code_agents/mini_swe_agent_v2_configs.py` around lines 450 - 464,
Update the text_based() docstring in MiniSWEAgentV2Config to remove or correct
the reference to the non-existent MiniSWECodeAgentV2TextBased class: change the
sentence to state that the text_based() config is compatible with the existing
MiniSWECodeAgentV2 agent (or simply remove the sentence), keeping the rest of
the docstring intact; specifically edit the method text_based() and replace the
mention of MiniSWECodeAgentV2TextBased with MiniSWECodeAgentV2 (or delete that
clause) so the docstring accurately reflects supported agent classes.



-def to_external(request: llm_request.LLMRequest, *, strict: bool = True) -> dict[str, Any]:
+def ares_request_to_external(request: llm_request.LLMRequest, *, strict: bool = True) -> dict[str, Any]:

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if the documented Protocol pattern is still accurate and if other converters follow it
rg -n "def to_external|def from_external" --type=py src/ares/llms/

Repository: withmartian/ares

Length of output: 693


Rename functions to match documented Protocol.

The functions ares_request_to_external and ares_request_from_external deviate from the documented converter interface defined in request.py (lines 201, 216), which specifies to_external() and from_external(). Other converters in the same module (openai_responses_converter, anthropic_converter) follow this naming convention.

Rename to to_external() and from_external() for consistency, or update the Protocol documentation and all other converters to match the new naming.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ares/llms/openai_chat_converter.py` at line 142, The converter function
names in openai_chat_converter need to match the documented Protocol: rename the
functions ares_request_to_external and ares_request_from_external to to_external
and from_external respectively (preserving their signatures, parameters like
request: llm_request.LLMRequest and strict: bool = True, return types, and
internal logic), update all internal references/calls in this module to use the
new names, and ensure any imports or registrations that expect the Protocol
methods are updated so the class/module implements to_external() and
from_external() exactly as defined in request.py.
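The Protocol-conformance point can be illustrated with a sketch (the exact signatures in `request.py` may differ; this only shows why module-level converter names must match the Protocol's method names for structural typing to hold):

```python
# Hypothetical converter Protocol in the shape the review describes; the
# OpenAIChatConverter body is a trivial stand-in, not the real converter.
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class RequestConverter(Protocol):
    def to_external(self, request: Any, *, strict: bool = True) -> dict[str, Any]: ...

    def from_external(self, payload: dict[str, Any]) -> Any: ...


class OpenAIChatConverter:
    """Satisfies RequestConverter structurally: the names must match exactly."""

    def to_external(self, request: Any, *, strict: bool = True) -> dict[str, Any]:
        return {"messages": request}

    def from_external(self, payload: dict[str, Any]) -> Any:
        return payload.get("messages")
```

A converter named `ares_request_to_external` would silently stop satisfying the Protocol, which is the inconsistency flagged above.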
