Introduce v2 text-based code agent that uses markdown parsing and the COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT submission sentinel. Based on mini-swe-agent's mini_textbased.yaml configuration.

Key features:
- Text-based interaction (parsing markdown code blocks)
- Jinja2-rendered system/instance/observation templates
- Timeout and format error handling
- Cost and step limit tracking

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
```python
if not first_block.content:
    raise ValueError("TextData block has empty content")
```
[Logic] Raising on empty `TextData` makes `extract_text_content` unusable for valid empty responses (e.g., the `openai_empty.json` fixture). Any caller that uses this helper on an empty but otherwise valid response will now throw unexpectedly. Consider allowing empty strings and only erroring when the first block isn't `TextData`.

Example:

```python
# Allow empty text content
return first_block.content
```
File: src/ares/llms/response.py
Line: 112

```python
if workdir is None:
    workdir = self.default_workdir

if self._container is None:
    raise RuntimeError("Container not started - call start() first")
```
`if self._container is None: raise RuntimeError(...)` is duplicated in `exec_run`, `upload_files`, and `download_files`, so the same guard and error message appear three times. Can we share a single `_ensure_started()` helper to avoid the copy-paste and keep the message consistent?

Finding type: Code Dedup and Conventions
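A minimal sketch of the shared guard, assuming the wrapper stores the Docker handle in `self._container` as in the snippet above (the surrounding class is a simplified stand-in, not the real implementation):

```python
class DockerContainer:
    """Stand-in for the real container wrapper under review."""

    def __init__(self) -> None:
        self._container = None

    def _ensure_started(self) -> None:
        # Single source of truth for the guard and its error message.
        if self._container is None:
            raise RuntimeError("Container not started - call start() first")

    def exec_run(self, cmd: str) -> str:
        self._ensure_started()
        return f"ran: {cmd}"

    def upload_files(self, paths: list[str]) -> None:
        self._ensure_started()

    def download_files(self, paths: list[str]) -> None:
        self._ensure_started()
```

Each public method then opens with one call to `_ensure_started()`, so the message can only ever drift in one place.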
```python
if not response.data:
    raise ValueError("Response has no data blocks")

first_block = response.data[0]
if not isinstance(first_block, TextData):
    raise ValueError(f"Expected TextData as first block, got {type(first_block).__name__}")

if not first_block.content:
    raise ValueError("TextData block has empty content")
```
`extract_text_content` now raises if the first data block isn't `TextData` or the content string is empty. However, `ares_response_from_external` (see the `openai_chat_converter` / `openai_tool_call_*` fixtures) produces `ToolUseData`-only responses when the assistant issues a function call. `MiniSWECodeAgent.query`/`execute_action` and `Terminus2Agent._unwrap_single_response` now hit this code and throw `ValueError` whenever the LLM returns a pure tool call, which is a legitimate response, aborting the agent loop. Can we scan for the first `TextData` block (or allow empty strings) instead of forcing the first block to be `TextData`?
Finding type: Logical Bugs
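One way to sketch the suggested fix, with `TextData`/`ToolUseData` as simplified stand-ins for the real block types: scan for the first `TextData` block, accept an empty string as valid content, and only raise when no text block exists at all.

```python
import dataclasses


@dataclasses.dataclass
class TextData:
    content: str


@dataclasses.dataclass
class ToolUseData:
    name: str
    input: dict


def extract_text_content(blocks: list) -> str:
    """Return the first text block's content; empty strings are valid."""
    for block in blocks:
        if isinstance(block, TextData):
            return block.content
    raise ValueError("Response contains no TextData block")
```

Under this shape, a tool call followed by text still yields the text, and a pure tool-call response raises a targeted error the caller can handle instead of aborting on `data[0]`.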
```python
def _extract_question_from_response(self, action: response.LLMResponse) -> str:
    """Extract the question text from the agent's response."""
    # Get the text content from the first data element
    if not isinstance(action.data[0], response.TextData):
        return ""
    question = action.data[0].content if action.data else ""
    return question.strip()
```
`_extract_question_from_response` now short-circuits to `""` whenever `action.data[0]` is a `ToolUseData`. But OpenAI Responses outputs can validly emit a `function_call` (`ToolUseData`) before any `message` (`TextData`) block, so the environment drops the agent's question, keeps calling the oracle with an empty prompt, and every step penalizes the agent without recording the real question. Can we scan the list for the first `TextData` instead of only checking `data[0]`?
Finding type: Logical Bugs
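A sketch of the scan-based variant; the block types here are simplified stand-ins for the ones in `ares.llms.response`, and returning `""` when no text exists preserves the method's current fallback behavior:

```python
import dataclasses


@dataclasses.dataclass
class TextData:
    content: str


@dataclasses.dataclass
class ToolUseData:
    name: str
    input: dict


def extract_question(blocks: list) -> str:
    """Return the first TextData content, stripped; "" if none is present."""
    for block in blocks:
        if isinstance(block, TextData):
            return block.content.strip()
    return ""
```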
```python
import asyncio

import ares
from ares import llms
from ares.contrib import mech_interp
import transformer_lens
```
The `ares` and `ares.contrib` imports currently precede `import transformer_lens`, so the third-party imports are in the wrong group; CLAUDE.md requires stdlib → third-party → local (ARES last). Can we move `import transformer_lens` ahead of the ARES imports so the local group stays last?
Finding type: AI Coding Guidelines
📝 Walkthrough

Adds heterogeneous LLM response support (tool-use blocks), bidirectional converters for OpenAI/Responses/Anthropic, a new async tool-call MiniSWECodeAgentV2 with configs, widespread callers updated to use centralized conversions/extractors, new tests and fixtures, doc/example/type-guidance, and minor runtime safety and packaging adjustments.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Agent as MiniSWECodeAgentV2
    participant LLM as LLMClient
    participant Container
    participant History as MessageHistory
    User->>Agent: run(task)
    loop step loop
        Agent->>History: render & read conversation
        Agent->>LLM: query(messages, tool_choice="any")
        LLM-->>Agent: LLMResponse (ToolUseData / TextData)
        Agent->>History: append assistant message (with tool_calls)
        loop for each ToolUseData
            Agent->>Container: exec_run(bash command)
            Container-->>Agent: stdout/stderr, returncode
            Agent->>Agent: render observation, append tool observation to History
            alt submission detected (sentinel + rc=0)
                Agent-->>User: return submission / raise _SubmittedError
            end
        end
        Agent->>Agent: check step_limit & cost_limit
        alt limits exceeded
            Agent-->>User: raise _LimitsExceededError
        end
    end
```
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
@coderabbitai full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 11
♻️ Duplicate comments (1)
src/ares/llms/response.py (1)
104-112: ⚠️ Potential issue | 🟠 Major

Allow empty `TextData` here.

The converters in this PR normalize valid empty completions to `TextData(content="")`, so raising on `not first_block.content` turns a legitimate empty response into an exception for every caller of this helper.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ares/llms/response.py` around lines 104 - 112, The code currently treats an empty TextData.content as an error; change the check so legitimate empty string completions are allowed: in the block handling response.data and first_block (the TextData check), remove or replace the falsy check "if not first_block.content" with a None-only check (e.g., "if first_block.content is None") so that TextData(content="") is accepted; ensure you keep the earlier checks for presence of response.data and that first_block is a TextData instance.
🧹 Nitpick comments (3)

src/ares/llms/openai_responses_converter_test.py (1)

442-448: Add a parallel tool-call round-trip case to this matrix.

This only exercises text-only and single-call fixtures, so it will miss regressions in the new multi-tool Responses flow introduced by this PR. Please add a fixture with multiple `function_call` output items to keep Responses coverage aligned with the chat and Anthropic suites.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/ares/llms/openai_responses_converter_test.py` around lines 442 - 448, The parametrize matrix for "filename" only includes text-only and single-call fixtures, so add a parallel multi-call fixture filename (e.g., "openai_responses_tool_call_multi.json") to the list so the test case exercises multiple function_call output items; update the pytest.mark.parametrize list in src/ares/llms/openai_responses_converter_test.py to include that new filename and ensure a corresponding fixture JSON is added with multiple function_call entries to mirror the new multi-tool Responses flow.

src/ares/llms/accounting.py (2)
74-135: Add direct tests for the three usage schemas.

This function now has provider-specific billing branches, but the converter round-trip tests won't catch a mix-up between `prompt_tokens`/`completion_tokens` and `input_tokens`/`output_tokens`. A small unit matrix around `get_llm_cost()` would make cost regressions much easier to catch in this shared path.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/ares/llms/accounting.py` around lines 74 - 135, Add unit tests that directly exercise get_llm_cost for the three provider usage schemas to prevent token-field mixups: create three small tests (e.g., test_get_llm_cost_chat_completion, test_get_llm_cost_responses_api, test_get_llm_cost_anthropic_message) that construct minimal mock objects matching openai.types.chat.chat_completion.ChatCompletion (with usage.prompt_tokens and usage.completion_tokens), openai.types.responses.Response (with usage.input_tokens and usage.output_tokens), and anthropic.types.Message (with usage.input_tokens and usage.output_tokens), supply a cost_mapping entry for the model_id with known ModelCost.pricing values (request, prompt, completion) and assert the returned decimal.Decimal equals the expected calculation ((request) + prompt*input/prompt_tokens + completion*output/completion_tokens); ensure to cover missing usage raising ValueError as well.

3-6: Use module imports for `collections.abc` and `typing` here.

This file now mixes the repo's qualified-module import style with direct symbol imports for `Mapping` and `assert_never`. Keeping these qualified makes the module consistent with the rest of the codebase.

♻️ Minimal cleanup

```diff
-from collections.abc import Mapping
+import collections.abc
 import decimal
 import functools
-from typing import assert_never
+import typing
 ...
-    cost_mapping: Mapping[str, ModelCost],
+    cost_mapping: collections.abc.Mapping[str, ModelCost],
 ...
-    assert_never(completion)
+    typing.assert_never(completion)
```

As per coding guidelines, "Always import modules, not classes or functions. For external consumers (examples, docs), use `import ares` or `from ares import llms` and access via qualified names. For internal code, use `from ares.llms import request` and access as `request.LLMRequest`."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/ares/llms/accounting.py` around lines 3 - 6, The file imports Mapping and assert_never directly; update to module imports to match project style: replace "from collections.abc import Mapping" with "import collections.abc" (use collections.abc.Mapping where used) and replace "from typing import assert_never" with "import typing" (use typing.assert_never where used), and update any references to the direct symbols (Mapping, assert_never) accordingly in src/ares/llms/accounting.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/utils.py`:
- Around line 42-50: The print_step code accesses action.data[0] without
checking for empty .data and uses len(action_content) which is wrong when
action_content is a dict for llms.ToolUseData; update the logic in print_step to
first guard for empty action.data (handle by setting action_content = "" or a
placeholder like "<empty response>"), then for llms.ToolUseData convert the
input to a string (e.g., str(action.data[0].input)) so truncation operates on
characters, compute preview using the string length and slice that string into
action_preview, and keep the existing ellipsis logic to append "..." only when
the string length exceeds the cutoff; reference llms.TextData, llms.ToolUseData,
action_content and action_preview to locate the changes.
In `@src/ares/code_agents/mini_swe_agent_v2.py`:
- Around line 396-403: The prompt's displayed working directory and the value
passed to exec_run() are inconsistent; change the code so both use the same
variable (do not coerce None to "/" for the template while passing None to
exec_run()). Use a single workdir variable set directly from
self.working_directory (allowing None) and pass that workdir into the
jinja2.Template.render(...) and into any exec_run(...) calls so the advertised
directory matches actual exec behavior; repeat the same change for the other
occurrence around the instance_template use at the 562-567 section.
- Around line 315-316: The parallel_tool_calls flag is defined but never used;
update the query(...) caller/definition to accept and forward
parallel_tool_calls into the tool-execution path and into the LLM request
options (where applicable), then modify the tool-execution helpers (e.g., the
functions named _call_tools, _execute_tool_calls, run_tool_calls or similar) to
branch on parallel_tool_calls and run tool calls concurrently (use
asyncio.gather / asyncio.create_task or concurrent.futures for sync tools) when
True and await serially when False; ensure the flag is threaded through all call
sites you referenced (around the symbols query,
_call_tools/_execute_tool_calls/run_tool_calls and any LLM request builder) so
the behavior actually toggles between parallel and serial execution.
In `@src/ares/code_agents/mini_swe_agent.py`:
- Around line 222-231: The agent currently strips tool-call responses via
response.extract_text_content(...) and falls through to parse_action("") causing
wasted retries; update execute_action to explicitly detect tool-call outputs by
checking llm_response.data for ToolUseData (or the tool-use discriminant) before
calling response.extract_text_content, and reject them (e.g., raise a clear
exception or return an error observation) so callers know the model returned a
tool invocation the agent can't handle; ensure to reference/ import the
ToolUseData type and update execute_action (and any callers expecting
parse_action) to handle this explicit rejection instead of silently parsing an
empty string.
In `@src/ares/containers/docker.py`:
- Around line 108-110: The methods only check self._container is not None but
not whether the container is still running; add a single reusable guard method
(e.g., _ensure_running()) that calls self._container.reload() (or otherwise
refreshes state) and asserts self._container is not None and
self._container.status == "running" (or checks
container.attrs["State"]["Running"]) and raise a clear RuntimeError like
"Container not running - call start() or restart the container" if not; replace
the simple None checks in start-related operations and in upload_files and
download_files with calls to _ensure_running() so operations fail fast with a
clear message when the container has been externally stopped or exited.
In `@src/ares/environments/twenty_questions.py`:
- Around line 359-361: The code dereferences oracle_response.data[0] without
checking for an empty list; update the block that sets answer_text to first
verify oracle_response.data is truthy (non-empty) before accessing index 0, and
raise a clear RuntimeError if it's empty or None; keep the existing type check
for response.TextData and assignment to answer_text so replace the single
dereference with an initial empty-check on oracle_response.data, then proceed to
isinstance(oracle_response.data[0], response.TextData) and answer_text =
oracle_response.data[0].content.strip().
- Around line 370-372: The code indexes action.data[0] before ensuring
action.data is non-empty which can raise IndexError in step(); change the check
to first confirm action.data is truthy and that its first element is a
response.TextData (e.g., if not action.data or not isinstance(action.data[0],
response.TextData): return ""), then set question = action.data[0].content;
update the block around step(), action.data and response.TextData to follow this
order to avoid indexing empty lists.
In `@src/ares/llms/openai_chat_converter.py`:
- Around line 515-517: The code currently takes only the first TextData block
from response.data (variables: response.data, llm_response.TextData,
text_blocks, content) which drops any subsequent assistant text; change it to
preserve all TextData blocks by concatenating their contents in order (e.g.,
join all b.content from text_blocks with a separator such as "\n\n") or
otherwise propagate the full list where the downstream logic expects multiple
blocks, and update any downstream consumers that assume a single string (search
for uses of the content variable) to handle the combined text or list
accordingly.
In `@src/ares/llms/openai_responses_converter.py`:
- Around line 549-581: The code currently collects all response_lib.TextData
into text_blocks and all response_lib.ToolUseData into tool_blocks which
reorders mixed blocks; change the logic to iterate over response.data in
original sequence and for each block check isinstance(block,
response_lib.TextData) to build a ResponseOutputMessage (using
openai.types.responses.ResponseOutputMessage / ResponseOutputText) or
isinstance(block, response_lib.ToolUseData) to build a ResponseFunctionToolCall
(using openai.types.responses.ResponseFunctionToolCall), then append each
created output to output immediately so original ordering is preserved (refer to
response.data, response_lib.TextData, response_lib.ToolUseData,
ResponseOutputMessage, ResponseOutputText, ResponseFunctionToolCall).
- Around line 477-495: The forward conversion currently maps
ResponseFunctionToolCall.call_id into response_lib.ToolUseData.id and drops the
original item.id; update the ToolUseData model to include both fields (e.g., id
and call_id), then change the forward converter handling
ResponseFunctionToolCall to populate both ToolUseData.id = item.id and
ToolUseData.call_id = item.call_id (preserving parsed arguments into input), and
update the reverse conversion that reconstructs ResponseFunctionToolCall to
restore both fields (write item.id from ToolUseData.id and item.call_id from
ToolUseData.call_id) so round-trip semantics and the distinction between item
identity and invocation correlation are preserved.
In `@src/ares/llms/response_test.py`:
- Around line 34-38: Tests are constructing TextData/ToolUseData manually from
completion payloads and thus bypass the real parsing logic; update the tests to
feed the same raw completion payloads into the actual parsing helpers (e.g.,
call ares_response_from_external(...) and/or extract_text_content(...) with the
completion/choice message instead of rebuilding objects via json.loads) so the
test exercises the production parsing path (replace local construction around
completion.choices[0].message and similar blocks at the other ranges with calls
to those helper functions and assert on their outputs).
---
Duplicate comments:
In `@src/ares/llms/response.py`:
- Around line 104-112: The code currently treats an empty TextData.content as an
error; change the check so legitimate empty string completions are allowed: in
the block handling response.data and first_block (the TextData check), remove or
replace the falsy check "if not first_block.content" with a None-only check
(e.g., "if first_block.content is None") so that TextData(content="") is
accepted; ensure you keep the earlier checks for presence of response.data and
that first_block is a TextData instance.
---
Nitpick comments:
In `@src/ares/llms/accounting.py`:
- Around line 74-135: Add unit tests that directly exercise get_llm_cost for the
three provider usage schemas to prevent token-field mixups: create three small
tests (e.g., test_get_llm_cost_chat_completion, test_get_llm_cost_responses_api,
test_get_llm_cost_anthropic_message) that construct minimal mock objects
matching openai.types.chat.chat_completion.ChatCompletion (with
usage.prompt_tokens and usage.completion_tokens),
openai.types.responses.Response (with usage.input_tokens and
usage.output_tokens), and anthropic.types.Message (with usage.input_tokens and
usage.output_tokens), supply a cost_mapping entry for the model_id with known
ModelCost.pricing values (request, prompt, completion) and assert the returned
decimal.Decimal equals the expected calculation ((request) +
prompt*input/prompt_tokens + completion*output/completion_tokens); ensure to
cover missing usage raising ValueError as well.
- Around line 3-6: The file imports Mapping and assert_never directly; update to
module imports to match project style: replace "from collections.abc import
Mapping" with "import collections.abc" (use collections.abc.Mapping where used)
and replace "from typing import assert_never" with "import typing" (use
typing.assert_never where used), and update any references to the direct symbols
(Mapping, assert_never) accordingly in src/ares/llms/accounting.py.
In `@src/ares/llms/openai_responses_converter_test.py`:
- Around line 442-448: The parametrize matrix for "filename" only includes
text-only and single-call fixtures, so add a parallel multi-call fixture
filename (e.g., "openai_responses_tool_call_multi.json") to the list so the test
case exercises multiple function_call output items; update the
pytest.mark.parametrize list in src/ares/llms/openai_responses_converter_test.py
to include that new filename and ensure a corresponding fixture JSON is added
with multiple function_call entries to mirror the new multi-tool Responses flow.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f6e79f40-d0e3-4ee6-831d-f91a2d0dd53b
📒 Files selected for processing (39)
- CLAUDE.md
- examples/07_mech_interp_hooked_transformer.py
- examples/utils.py
- pyproject.toml
- src/ares/code_agents/mini_swe_agent.py
- src/ares/code_agents/mini_swe_agent_v2.py
- src/ares/code_agents/mini_swe_agent_v2_configs.py
- src/ares/code_agents/terminus2/terminus2_agent.py
- src/ares/containers/docker.py
- src/ares/contrib/llama_cpp.py
- src/ares/contrib/transformers_client.py
- src/ares/contrib/transformers_client_test.py
- src/ares/environments/twenty_questions.py
- src/ares/llms/__init__.py
- src/ares/llms/accounting.py
- src/ares/llms/anthropic_converter.py
- src/ares/llms/anthropic_converter_test.py
- src/ares/llms/chat_completions_compatible.py
- src/ares/llms/openai_chat_converter.py
- src/ares/llms/openai_chat_converter_test.py
- src/ares/llms/openai_responses_converter.py
- src/ares/llms/openai_responses_converter_test.py
- src/ares/llms/request_test.py
- src/ares/llms/response.py
- src/ares/llms/response_test.py
- src/ares/llms/test_data/anthropic_messages_mixed_content.json
- src/ares/llms/test_data/anthropic_messages_text_only.json
- src/ares/llms/test_data/anthropic_messages_tool_use_parallel.json
- src/ares/llms/test_data/anthropic_messages_tool_use_single.json
- src/ares/llms/test_data/openai_empty.json
- src/ares/llms/test_data/openai_responses_text_only.json
- src/ares/llms/test_data/openai_responses_tool_call_single.json
- src/ares/llms/test_data/openai_text_only.json
- src/ares/llms/test_data/openai_tool_call_malformed.json
- src/ares/llms/test_data/openai_tool_call_mixed.json
- src/ares/llms/test_data/openai_tool_call_parallel.json
- src/ares/llms/test_data/openai_tool_call_single.json
- src/ares/presets.py
- src/ares/testing/mock_llm_test.py
```python
if isinstance(action.data[0], llms.TextData):
    action_content = action.data[0].content
elif isinstance(action.data[0], llms.ToolUseData):
    action_content = action.data[0].input
else:
    assert_never(action.data[0])
action_preview = str(action_content)[:200]
if len(action_content) > 200:
    action_preview += "..."
```
print_step can crash on empty responses and mis-handle truncation for tool input.
Line 42 indexes action.data[0] without guarding for empty data, and Line 49 uses len(action_content) (dict key count for tool input) instead of string length.
Proposed fix

```diff
-    if isinstance(action.data[0], llms.TextData):
-        action_content = action.data[0].content
-    elif isinstance(action.data[0], llms.ToolUseData):
-        action_content = action.data[0].input
-    else:
-        assert_never(action.data[0])
+    if not action.data:
+        action_content = "(no response data)"
+    else:
+        first_block = action.data[0]
+        if isinstance(first_block, llms.TextData):
+            action_content = first_block.content
+        elif isinstance(first_block, llms.ToolUseData):
+            action_content = str(first_block.input)
+        else:
+            assert_never(first_block)
     action_preview = str(action_content)[:200]
-    if len(action_content) > 200:
+    if len(str(action_content)) > 200:
         action_preview += "..."
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if not action.data:
    action_content = "(no response data)"
else:
    first_block = action.data[0]
    if isinstance(first_block, llms.TextData):
        action_content = first_block.content
    elif isinstance(first_block, llms.ToolUseData):
        action_content = str(first_block.input)
    else:
        assert_never(first_block)
action_preview = str(action_content)[:200]
if len(str(action_content)) > 200:
    action_preview += "..."
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/utils.py` around lines 42 - 50, The print_step code accesses
action.data[0] without checking for empty .data and uses len(action_content)
which is wrong when action_content is a dict for llms.ToolUseData; update the
logic in print_step to first guard for empty action.data (handle by setting
action_content = "" or a placeholder like "<empty response>"), then for
llms.ToolUseData convert the input to a string (e.g., str(action.data[0].input))
so truncation operates on characters, compute preview using the string length
and slice that string into action_preview, and keep the existing ellipsis logic
to append "..." only when the string length exceeds the cutoff; reference
llms.TextData, llms.ToolUseData, action_content and action_preview to locate the
changes.
```python
parallel_tool_calls: bool = True  # Enable parallel tool calls
```
parallel_tool_calls is currently a no-op.
The flag is configurable, but query() never forwards it to the LLM request and command execution still awaits each tool call serially. Both settings therefore behave the same, so the advertised parallel-tool-call behavior is not actually enabled.
Also applies to: 451-458, 495-498
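A minimal sketch of threading the flag into the execution path; the helper name and signature are illustrative, not the agent's actual API. `asyncio.gather` fans the calls out concurrently while preserving input order in the results, and the serial branch awaits each call in turn.

```python
import asyncio


async def run_tool_calls(calls, execute, *, parallel_tool_calls: bool):
    """Run `execute` over tool calls concurrently or serially per the flag."""
    if parallel_tool_calls:
        # Fan out all tool calls at once; gather preserves input order.
        return list(await asyncio.gather(*(execute(call) for call in calls)))
    results = []
    for call in calls:
        results.append(await execute(call))
    return results
```

The flag would also need to be forwarded into the LLM request options so the provider is told whether parallel calls are allowed; this sketch only covers the execution side.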
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ares/code_agents/mini_swe_agent_v2.py` around lines 315 - 316, The
parallel_tool_calls flag is defined but never used; update the query(...)
caller/definition to accept and forward parallel_tool_calls into the
tool-execution path and into the LLM request options (where applicable), then
modify the tool-execution helpers (e.g., the functions named _call_tools,
_execute_tool_calls, run_tool_calls or similar) to branch on parallel_tool_calls
and run tool calls concurrently (use asyncio.gather / asyncio.create_task or
concurrent.futures for sync tools) when True and await serially when False;
ensure the flag is threaded through all call sites you referenced (around the
symbols query, _call_tools/_execute_tool_calls/run_tool_calls and any LLM
request builder) so the behavior actually toggles between parallel and serial
execution.
```python
# Determine working directory
workdir = self.working_directory or "/"

# Add initial task message
instance_message = jinja2.Template(self.instance_template, undefined=jinja2.StrictUndefined).render(
    task=task,
    working_directory=workdir,
)
```
Keep the advertised workdir in sync with exec_run().
When working_directory is unset, the prompt tells the model it is operating in /, but exec_run() receives None and falls back to the container's default directory instead. That mismatch can send reads/edits to the wrong tree.
Also applies to: 562-567
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ares/code_agents/mini_swe_agent_v2.py` around lines 396 - 403, The
prompt's displayed working directory and the value passed to exec_run() are
inconsistent; change the code so both use the same variable (do not coerce None
to "/" for the template while passing None to exec_run()). Use a single workdir
variable set directly from self.working_directory (allowing None) and pass that
workdir into the jinja2.Template.render(...) and into any exec_run(...) calls so
the advertised directory matches actual exec behavior; repeat the same change
for the other occurrence around the instance_template use at the 562-567
section.
```diff
         message_content = response.extract_text_content(llm_response)

         self._add_message("assistant", message_content)

-        return response
+        return llm_response

-    async def execute_action(self, response: response.LLMResponse) -> None:
+    async def execute_action(self, llm_response: response.LLMResponse) -> None:
         """Execute the action and return the observation."""
         _LOGGER.debug("[%d] Executing action.", id(self))
-        response_text = response.data[0].content
-        assert response_text is not None
+        response_text = response.extract_text_content(llm_response)
```
Reject tool-call responses explicitly instead of silently dropping them.
If llm_response.data contains ToolUseData, extract_text_content(...) strips that information and this agent falls through to parse_action(""). On tool-capable models that means empty assistant turns, wasted retries, and cost burn until the limits trip, even though the real issue is that MiniSWECodeAgent only understands text/bash output.
🛠️ Suggested guard

```diff
+    def _extract_response_text(self, llm_response: response.LLMResponse) -> str:
+        if any(isinstance(block, response.ToolUseData) for block in llm_response.data):
+            raise _FormatError(
+                "MiniSWECodeAgent only supports text actions. Use MiniSWECodeAgentV2 for tool-call responses."
+            )
+        return response.extract_text_content(llm_response)
+
     async def query(self) -> response.LLMResponse:
         ...
-        message_content = response.extract_text_content(llm_response)
+        message_content = self._extract_response_text(llm_response)
         self._add_message("assistant", message_content)
         return llm_response

     async def execute_action(self, llm_response: response.LLMResponse) -> None:
         ...
-        response_text = response.extract_text_content(llm_response)
+        response_text = self._extract_response_text(llm_response)
         action = self.parse_action(response_text)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ares/code_agents/mini_swe_agent.py` around lines 222 - 231, The agent
currently strips tool-call responses via response.extract_text_content(...) and
falls through to parse_action("") causing wasted retries; update execute_action
to explicitly detect tool-call outputs by checking llm_response.data for
ToolUseData (or the tool-use discriminant) before calling
response.extract_text_content, and reject them (e.g., raise a clear exception or
return an error observation) so callers know the model returned a tool
invocation the agent can't handle; ensure to reference/ import the ToolUseData
type and update execute_action (and any callers expecting parse_action) to
handle this explicit rejection instead of silently parsing an empty string.
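The rejection logic above can be sketched in isolation. A minimal, self-contained version; the `TextData`, `ToolUseData`, and `FormatError` classes below are illustrative stand-ins for the real `ares` types, not the project's actual definitions:

```python
from dataclasses import dataclass, field


@dataclass
class TextData:
    content: str


@dataclass
class ToolUseData:
    id: str
    name: str
    input: dict = field(default_factory=dict)


class FormatError(Exception):
    """Raised when the model returns output the agent cannot act on."""


def extract_text_for_text_only_agent(blocks: list) -> str:
    """Reject tool calls up front instead of silently parsing an empty string."""
    if any(isinstance(block, ToolUseData) for block in blocks):
        raise FormatError(
            "This agent only supports text actions; got a tool-call response."
        )
    return "".join(block.content for block in blocks if isinstance(block, TextData))
```

Failing loudly here surfaces the misconfiguration on the first step instead of burning retries until the cost or step limit trips.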
if self._container is None:
    raise RuntimeError("Container not started - call start() first")
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify whether running-state checks (reload/status) exist in guarded methods.
rg -n --type=py -C3 'def (exec_run|upload_files|download_files)|reload\(|\.status\b' src/ares/containers/docker.py
Repository: withmartian/ares
Length of output: 985
🏁 Script executed:
# Get full method implementations for the three guarded methods
sed -n '96,132p' src/ares/containers/docker.py
echo "---"
sed -n '133,163p' src/ares/containers/docker.py
echo "---"
sed -n '164,200p' src/ares/containers/docker.py
Length of output: 4337
🏁 Script executed:
# Check the stop() method to see how container state is managed
rg -n --type=py -A 10 'async def stop' src/ares/containers/docker.py
Length of output: 437
🏁 Script executed:
# Check if there are any status checks or reload calls anywhere in the file
rg -n --type=py 'reload|\.status|is_running' src/ares/containers/docker.py
Length of output: 42
State guards only check if container was started, not whether it is still running.
These checks miss externally stopped/exited containers (_container can be non-None but unusable), causing operations to fail deeper in the docker API with unclear errors instead of early with a clear message.
🔧 Suggested pattern (single running-state guard reused across methods)
+ async def _require_running_container(self) -> docker.models.containers.Container:
+ if self._container is None:
+ raise RuntimeError("Container not started - call start() first")
+ await asyncio.to_thread(self._container.reload)
+ if self._container.status != "running":
+ raise RuntimeError("Container is not running - call start() first")
+ return self._container
+
async def exec_run(
self,
command: str,
@@
- if self._container is None:
- raise RuntimeError("Container not started - call start() first")
+ container = await self._require_running_container()
@@
- self._container.exec_run,
+ container.exec_run,
Also applies to: lines 135-137 (upload_files), 166-168 (download_files)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ares/containers/docker.py` around lines 108 - 110, The methods only check
self._container is not None but not whether the container is still running; add
a single reusable guard method (e.g., _ensure_running()) that calls
self._container.reload() (or otherwise refreshes state) and asserts
self._container is not None and self._container.status == "running" (or checks
container.attrs["State"]["Running"]) and raise a clear RuntimeError like
"Container not running - call start() or restart the container" if not; replace
the simple None checks in start-related operations and in upload_files and
download_files with calls to _ensure_running() so operations fail fast with a
clear message when the container has been externally stopped or exited.
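The guard the scripts were probing for can be illustrated with a stand-in container object; `FakeContainer` below is a hypothetical substitute for `docker.models.containers.Container`, whose real `reload()` would re-query the Docker daemon:

```python
class FakeContainer:
    """Stand-in for docker.models.containers.Container (illustrative only)."""

    def __init__(self, status: str = "running"):
        self.status = status

    def reload(self):
        # The real SDK re-fetches container state from the daemon here.
        pass


def require_running(container):
    """Single guard shared by exec_run/upload_files/download_files."""
    if container is None:
        raise RuntimeError("Container not started - call start() first")
    container.reload()  # refresh status before trusting it
    if container.status != "running":
        raise RuntimeError("Container not running - call start() or restart the container")
    return container
```

Centralizing the check keeps the error message consistent and makes externally stopped containers fail fast with an actionable message.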
if not isinstance(action.data[0], response.TextData):
    return ""
question = action.data[0].content if action.data else ""
Handle empty action.data before checking the first block type.
Line 370 indexes action.data[0] unconditionally. Empty agent responses will crash step() with IndexError.
Proposed fix
- if not isinstance(action.data[0], response.TextData):
+ if not action.data:
+ return ""
+ if not isinstance(action.data[0], response.TextData):
return ""
- question = action.data[0].content if action.data else ""
+ question = action.data[0].content
  return question.strip()
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Original:
  if not isinstance(action.data[0], response.TextData):
      return ""
  question = action.data[0].content if action.data else ""
Suggested:
  if not action.data:
      return ""
  if not isinstance(action.data[0], response.TextData):
      return ""
  question = action.data[0].content
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ares/environments/twenty_questions.py` around lines 370 - 372, The code
indexes action.data[0] before ensuring action.data is non-empty which can raise
IndexError in step(); change the check to first confirm action.data is truthy
and that its first element is a response.TextData (e.g., if not action.data or
not isinstance(action.data[0], response.TextData): return ""), then set question
= action.data[0].content; update the block around step(), action.data and
response.TextData to follow this order to avoid indexing empty lists.
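The suggested ordering (emptiness check before indexing) can be shown as a self-contained sketch; the `TextData` dataclass here is a stand-in for `response.TextData`:

```python
from dataclasses import dataclass


@dataclass
class TextData:
    content: str


def extract_question(blocks: list) -> str:
    """Return the first text block's content, or '' for empty/non-text actions."""
    if not blocks:  # guard BEFORE indexing, so [] never raises IndexError
        return ""
    first = blocks[0]
    if not isinstance(first, TextData):
        return ""
    return first.content.strip()
```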
# Extract text content (first TextData block)
text_blocks = [block for block in response.data if isinstance(block, llm_response.TextData)]
content = text_blocks[0].content if text_blocks else None
Only the first text block survives ChatCompletion conversion.
LLMResponse.data can now contain multiple TextData blocks, but this code serializes only text_blocks[0]. Converting an Anthropic/Responses-derived response through Chat Completions will silently drop the rest of the assistant text.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ares/llms/openai_chat_converter.py` around lines 515 - 517, The code
currently takes only the first TextData block from response.data (variables:
response.data, llm_response.TextData, text_blocks, content) which drops any
subsequent assistant text; change it to preserve all TextData blocks by
concatenating their contents in order (e.g., join all b.content from text_blocks
with a separator such as "\n\n") or otherwise propagate the full list where the
downstream logic expects multiple blocks, and update any downstream consumers
that assume a single string (search for uses of the content variable) to handle
the combined text or list accordingly.
| elif isinstance(item, openai.types.responses.ResponseFunctionToolCall): | ||
| call_id = item.call_id | ||
| name = item.name | ||
| arguments_str = item.arguments | ||
|
|
||
| # Parse arguments JSON | ||
| try: | ||
| input_dict = json.loads(arguments_str) | ||
| except json.JSONDecodeError: | ||
| input_dict = { | ||
| "_raw_arguments": arguments_str, | ||
| "_parse_error": "Invalid JSON", | ||
| } | ||
|
|
||
| data.append( | ||
| response_lib.ToolUseData( | ||
| id=call_id, | ||
| name=name, | ||
| input=input_dict, |
🧩 Analysis chain
🌐 Web query:
In the OpenAI Responses API, what is the difference between ResponseFunctionToolCall.id and ResponseFunctionToolCall.call_id, and should both be preserved when reconstructing a Response object?
💡 Result:
`ResponseFunctionToolCall.call_id` is the tool-invocation correlation ID you must echo back when you submit the tool result (by sending a `function_call_output` item with the same `call_id`). The docs explicitly describe `call_id` as "used later to submit the function result," and all examples of returning tool results key off `call_id`. [1][2]
`ResponseFunctionToolCall.id` is the ID of the function-call output item itself (the "item"), used to identify that specific output element in the response/streaming protocol (e.g., streaming events refer to an `item_id` that is "the id of the function call item"). [3]
Should you preserve both when reconstructing a Response?
Yes—preserve both. You need call_id to correctly associate function_call_output with the right call in subsequent turns, and preserving id maintains the identity of the output item (important for faithfully reconstructing/streaming/traceability). [1][2][3]
Sources
[1] OpenAI Developers – Function calling guide (Responses API): call_id used later to submit function result; sample includes both id and call_id. (platform.openai.com)
[2] OpenAI Developers – Function calling guide (Responses API): function_call_output uses call_id. (platform.openai.com)
[3] OpenAI Developers – Function calling guide (Responses API): streaming events reference item_id as the id of the function-call item. (platform.openai.com)
Citations:
- 1: https://platform.openai.com/docs/guides/function-calling?api-mode=responses
- 2: https://platform.openai.com/docs/guides/function-calling?api-mode=responses
- 3: https://platform.openai.com/docs/guides/function-calling?api-mode=responses
🏁 Script executed:
# First, check the exact code at lines 477-495
sed -n '477,495p' src/ares/llms/openai_responses_converter.py
Length of output: 718
🏁 Script executed:
# Check the code at lines 573-579
sed -n '573,579p' src/ares/llms/openai_responses_converter.py
Length of output: 336
🏁 Script executed:
# Find and examine the ToolUseData definition
rg -A 5 'class ToolUseData' --type py
Length of output: 344
🏁 Script executed:
# Check how ares_response_to_external handles tool blocks - look for the reverse conversion
rg -B 5 -A 10 'ares_response_to_external' src/ares/llms/openai_responses_converter.py | head -80
Length of output: 457
Preserve both Responses tool identifiers (id and call_id).
ResponseFunctionToolCall has two distinct identifiers that serve different purposes:
- `call_id`: the tool-invocation correlation ID, must be echoed back in `function_call_output` submissions
- `id`: the identity of the output item itself, used for streaming, traceability, and item references
Currently:
- At lines 477-495 (forward conversion), only `call_id` is extracted and assigned to `ToolUseData.id`, discarding the original item `id`
- At lines 573-579 (reverse conversion), the single stored value is written to both `id` and `call_id`, creating semantic misalignment on round-trip
This breaks round-trip conversion and can cause downstream failures in consumers that rely on the distinction between item identity and call correlation. Extend ToolUseData to hold both values separately, then update both conversions to preserve and restore them correctly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ares/llms/openai_responses_converter.py` around lines 477 - 495, The
forward conversion currently maps ResponseFunctionToolCall.call_id into
response_lib.ToolUseData.id and drops the original item.id; update the
ToolUseData model to include both fields (e.g., id and call_id), then change the
forward converter handling ResponseFunctionToolCall to populate both
ToolUseData.id = item.id and ToolUseData.call_id = item.call_id (preserving
parsed arguments into input), and update the reverse conversion that
reconstructs ResponseFunctionToolCall to restore both fields (write item.id from
ToolUseData.id and item.call_id from ToolUseData.call_id) so round-trip
semantics and the distinction between item identity and invocation correlation
are preserved.
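A sketch of the suggested shape, with hypothetical field names (`id`, `call_id`) on a stand-in `ToolUseData` and a hand-rolled reverse conversion to a plain dict; the real converter works on `openai.types.responses.ResponseFunctionToolCall` objects rather than dicts:

```python
import json
from dataclasses import dataclass


@dataclass
class ToolUseData:
    id: str       # output-item identity (streaming, traceability)
    call_id: str  # correlation ID echoed back in function_call_output
    name: str
    input: dict


def to_responses_item(block: ToolUseData) -> dict:
    """Reverse conversion: restore both identifiers instead of writing one value to both."""
    return {
        "type": "function_call",
        "id": block.id,
        "call_id": block.call_id,
        "name": block.name,
        "arguments": json.dumps(block.input),
    }
```

Carrying both fields through the round trip keeps item identity and call correlation distinct, as the web-query result above describes.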
# Extract data as we would in the LLM client
message = completion.choices[0].message
data = []
if message.content:
    data.append(response.TextData(content=message.content))
These tests never hit the production parsing path.
Most of the new cases rebuild TextData/ToolUseData inline with local json.loads(...) logic instead of calling the actual converter/helper code added in this PR. That means ares_response_from_external() or extract_text_content() can regress without failing this suite.
Also applies to: 49-55, 167-267
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ares/llms/response_test.py` around lines 34 - 38, Tests are constructing
TextData/ToolUseData manually from completion payloads and thus bypass the real
parsing logic; update the tests to feed the same raw completion payloads into
the actual parsing helpers (e.g., call ares_response_from_external(...) and/or
extract_text_content(...) with the completion/choice message instead of
rebuilding objects via json.loads) so the test exercises the production parsing
path (replace local construction around completion.choices[0].message and
similar blocks at the other ranges with calls to those helper functions and
assert on their outputs).
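The distinction can be sketched with a toy parser; `parse_message` below is a hypothetical stand-in for the production helper, and the point is that the test feeds the raw fixture through it instead of re-implementing the parse inline:

```python
import json


def parse_message(raw: dict) -> list[str]:
    """Hypothetical stand-in for the production converter under test."""
    parts: list[str] = []
    if raw.get("content"):
        parts.append(raw["content"])
    for call in raw.get("tool_calls") or []:
        # Round-trip the arguments the way a real converter would.
        parts.append(json.dumps(json.loads(call["function"]["arguments"])))
    return parts


# Production-path style: the fixture goes into the helper itself, so a
# regression in parse_message makes the assertion below fail.
fixture = {"content": "hi", "tool_calls": [{"function": {"arguments": '{"x": 1}'}}]}
parsed = parse_message(fixture)
```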
@coderabbitai full review
✅ Actions performed
Full review triggered.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/ares/contrib/llama_cpp.py (1)
89-105: ⚠️ Potential issue | 🔴 Critical
Parameter mismatch: `max_completion_tokens` will cause a runtime error in llama-cpp-python.
`ares_request_to_external()` returns `max_completion_tokens`, but `llama_cpp.Llama.create_chat_completion()` does not recognize this parameter; it expects `max_tokens` instead. Any request with `max_output_tokens` set will fail with a TypeError. Additionally, the response path discards `tool_calls` even though both the converter and llama-cpp-python support tool-calling. Filter the kwargs to match llama.cpp's actual signature (`max_tokens`, drop `metadata`/`service_tier` if unsupported) and update the response handler to preserve `tool_calls` when present, or use the same OpenAI response converter used by other clients.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ares/contrib/llama_cpp.py` around lines 89 - 105, The completion_kwargs produced by ares_request_to_external contains max_completion_tokens and other keys that llama_cpp.Llama.create_chat_completion doesn't accept, so before calling self._llm.create_chat_completion map/rename max_completion_tokens -> max_tokens, remove unsupported keys (e.g., metadata, service_tier) and only pass keys that match llama.cpp's signature; after awaiting create_chat_completion and model-validating chat_completion update the response assembly (where content/usage and response.LLMResponse are created) to also preserve tool_calls if present on chat_completion. Ensure you modify the code around completion_kwargs, the call to self._llm.create_chat_completion, and the chat_completion -> response conversion (references: ares_request_to_external, completion_kwargs, self._llm.create_chat_completion, chat_completion, response.LLMResponse, response.TextData, tool_calls, max_completion_tokens -> max_tokens).
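A sketch of the kwarg adaptation, assuming `max_tokens` is the accepted name and that `metadata`/`service_tier` are the only keys to drop; the real filter should be checked against llama-cpp-python's actual `create_chat_completion` signature:

```python
def adapt_kwargs_for_llama_cpp(kwargs: dict) -> dict:
    """Rename/drop OpenAI-style kwargs before calling create_chat_completion."""
    adapted = dict(kwargs)  # don't mutate the caller's dict
    if "max_completion_tokens" in adapted:
        adapted["max_tokens"] = adapted.pop("max_completion_tokens")
    for unsupported in ("metadata", "service_tier"):
        adapted.pop(unsupported, None)
    return adapted
```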
♻️ Duplicate comments (3)
src/ares/containers/docker.py (1)
108-110: ⚠️ Potential issue | 🟠 Major
Container guards still allow operations on externally stopped containers.
These checks only verify initialization, not liveness. If the container exited after `start()`, these methods fail deeper in Docker calls with less actionable errors.
🔧 Suggested fix (single reusable running-state guard)
+ async def _require_running_container(self) -> docker.models.containers.Container:
+     if self._container is None:
+         raise RuntimeError("Container not started - call start() first")
+     await asyncio.to_thread(self._container.reload)
+     if self._container.status != "running":
+         raise RuntimeError("Container not running - call start() or restart the container")
+     return self._container
+
  async def exec_run(
      self,
      command: str,
@@
-     if self._container is None:
-         raise RuntimeError("Container not started - call start() first")
+     container = await self._require_running_container()
@@
-     self._container.exec_run,
+     container.exec_run,
@@
  async def upload_files(self, local_paths: list[pathlib.Path], remote_paths: list[str]) -> None:
@@
-     if self._container is None:
-         raise RuntimeError("Container not started - call start() first")
+     container = await self._require_running_container()
@@
-     self._container.exec_run,
+     container.exec_run,
@@
-     self._container.put_archive,
+     container.put_archive,
@@
  async def download_files(self, remote_paths: list[str], local_paths: list[pathlib.Path]) -> None:
@@
-     if self._container is None:
-         raise RuntimeError("Container not started - call start() first")
+     container = await self._require_running_container()
@@
-     self._container.get_archive,
+     container.get_archive,
Also applies to: 135-137, 166-168
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ares/containers/docker.py` around lines 108 - 110, The current guards only check that self._container is not None but don't verify the container's liveness; add a reusable helper (e.g. _ensure_running or _check_container_running) that calls self._container.reload() and checks self._container.status == "running" and raises a clear RuntimeError like "Container not running - call start() or restart the container" if not running; replace the simple None checks in the affected methods (the ones using self._container at lines shown) to call this helper before making Docker SDK calls so operations fail fast with an actionable message.
examples/utils.py (1)
42-50: ⚠️ Potential issue | 🟠 Major
`print_step` still needs empty-data guarding and string-based truncation for tool input.
`action.data[0]` can raise on empty responses, and `len(action_content)` is incorrect when `action_content` is a dict.
✅ Suggested robust handling
- if isinstance(action.data[0], llms.TextData):
-     action_content = action.data[0].content
- elif isinstance(action.data[0], llms.ToolUseData):
-     action_content = action.data[0].input
- else:
-     assert_never(action.data[0])
- action_preview = str(action_content)[:200]
- if len(action_content) > 200:
+ if not action.data:
+     action_content = "(no response data)"
+ else:
+     first_block = action.data[0]
+     if isinstance(first_block, llms.TextData):
+         action_content = first_block.content
+     elif isinstance(first_block, llms.ToolUseData):
+         action_content = str(first_block.input)
+     else:
+         assert_never(first_block)
+ action_preview = str(action_content)[:200]
+ if len(str(action_content)) > 200:
      action_preview += "..."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/utils.py` around lines 42 - 50, Guard against empty action.data in print_step by checking if action.data is truthy and has at least one item before accessing action.data[0]; if empty, set action_content = "" (or a sensible default). When handling llms.ToolUseData, coerce the input to a string (e.g., action_content = str(action.data[0].input)) so truncation and length checks operate on a string; likewise for llms.TextData use str(action.data[0].content). Compute action_preview from the stringified action_content (action_preview = action_str[:200]) and append "..." only when len(action_str) > 200. Ensure to still call assert_never for unexpected types.
src/ares/llms/openai_chat_converter.py (1)
515-517: ⚠️ Potential issue | 🟡 Minor
Only the first text block is preserved in ChatCompletion conversion.
When converting back to OpenAI format, only `text_blocks[0].content` is used. If the `LLMResponse` contains multiple `TextData` blocks (e.g., from Anthropic responses), the additional text is silently dropped.
This appears to be a limitation of the OpenAI ChatCompletion format, which only supports a single `content` field. Consider adding a warning log when multiple text blocks exist.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ares/llms/openai_chat_converter.py` around lines 515 - 517, The code currently extracts only the first TextData block from response (text_blocks = [block for block in response.data if isinstance(block, llm_response.TextData)]; content = text_blocks[0].content if text_blocks else None) which silently drops any additional TextData entries; update the conversion to detect when len(text_blocks) > 1 and emit a warning log (including a clear message like "Multiple TextData blocks detected; only the first will be used" and include metadata such as number of blocks) before continuing to use text_blocks[0].content so callers are alerted to dropped content; reference the existing logger in this module (e.g., logger or module-level logger) when adding the warning.
🧹 Nitpick comments (7)
src/ares/code_agents/terminus2/terminus2_agent.py (1)
530-531: Handle non-text LLM blocks explicitly instead of relying on generic exception flow.
`response.extract_text_content()` now raises on empty responses and on leading `ToolUseData`. In `run()`, that falls into the broad `except Exception` path; in `_unwrap_single_response()`, it aborts summarization. A small targeted `ValueError` handler here would give this text-only agent a predictable failure mode if a heterogeneous response slips through.
Also applies to: 883-883
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ares/code_agents/terminus2/terminus2_agent.py` around lines 530 - 531, The call to response.extract_text_content(llm_resp) can raise ValueError for empty or non-text (ToolUseData) blocks, but presently those flow into a broad except; update the run() flow around the self._query_llm() / response.extract_text_content(llm_resp) sequence to catch ValueError explicitly, log or surface a clear predictable text-only failure (e.g., set assistant_message to a specific error/abort state or raise a more specific exception), while letting other exceptions continue to the existing broad handler; apply the same targeted ValueError handling in _unwrap_single_response() where response.extract_text_content is used so summarization aborts clearly on non-text blocks rather than relying on the generic exception path.
src/ares/llms/openai_chat_converter_test.py (1)
36-41: Specify encoding when opening JSON files.
For consistency and cross-platform compatibility, specify `encoding="utf-8"`.
♻️ Suggested fix
  def _load_test_completion(filename: str) -> openai.types.chat.ChatCompletion:
      """Load a test JSON file and convert to OpenAI ChatCompletion."""
      test_data_dir = Path(__file__).parent / "test_data"
-     with open(test_data_dir / filename) as f:
+     with open(test_data_dir / filename, encoding="utf-8") as f:
          data = json.load(f)
      return openai.types.chat.ChatCompletion(**data)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ares/llms/openai_chat_converter_test.py` around lines 36 - 41, The helper function _load_test_completion currently opens JSON test files without specifying an encoding; update the open call inside _load_test_completion (the with open(test_data_dir / filename) as f:) to explicitly use encoding="utf-8" so json.load reads files consistently across platforms and locales.
src/ares/llms/openai_responses_converter_test.py (1)
431-436: Specify encoding when opening JSON files.
The `open()` call should specify `encoding="utf-8"` to ensure consistent behavior across platforms.
♻️ Suggested fix
  def _load_test_response(filename: str) -> openai.types.responses.Response:
      """Load a test JSON file and convert to OpenAI Response."""
      test_data_dir = pathlib.Path(__file__).parent / "test_data"
-     with open(test_data_dir / filename) as f:
+     with open(test_data_dir / filename, encoding="utf-8") as f:
          data = json.load(f)
      return openai.types.responses.Response(**data)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ares/llms/openai_responses_converter_test.py` around lines 431 - 436, The helper _load_test_response currently opens JSON test files without specifying an encoding, which can cause platform-dependent behavior; update the open call inside _load_test_response to pass encoding="utf-8" (i.e., open(test_data_dir / filename, encoding="utf-8")) so json.load reads UTF-8 consistently when constructing the openai.types.responses.Response.
src/ares/llms/anthropic_converter_test.py (1)
305-310: Specify encoding when opening JSON files.
Same as other test files: the `open()` call should specify `encoding="utf-8"`.
♻️ Suggested fix
  def _load_test_message(filename: str) -> anthropic.types.Message:
      """Load a test JSON file and convert to Anthropic Message."""
      test_data_dir = pathlib.Path(__file__).parent / "test_data"
-     with open(test_data_dir / filename) as f:
+     with open(test_data_dir / filename, encoding="utf-8") as f:
          data = json.load(f)
      return anthropic.types.Message(**data)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ares/llms/anthropic_converter_test.py` around lines 305 - 310, The helper function _load_test_message currently opens JSON test files without specifying an encoding; update the open(...) call in _load_test_message to include encoding="utf-8" so json.load reads files with the correct text encoding (keep the rest of the function logic, return anthropic.types.Message(**data) unchanged).
src/ares/code_agents/mini_swe_agent_v2.py (1)
610-614: Move `json` import to module level.
The `json` import inside `_dict_to_json` should be at the module level per Python conventions.
♻️ Proposed fix
Add to top-level imports:
  import json
Then simplify the function:
  def _dict_to_json(d: dict) -> str:
      """Convert a dictionary to a JSON string."""
-     import json
-
      return json.dumps(d)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ares/code_agents/mini_swe_agent_v2.py` around lines 610 - 614, The json import is inside the helper function _dict_to_json which violates module-level import conventions; move the import to the top of the module (add import json to the file's imports) and remove the inline import from inside _dict_to_json so the function simply calls json.dumps(d); ensure no other inline json imports remain in this module.
src/ares/code_agents/mini_swe_agent_v2_configs.py (1)
12-166: Template duplication with `mini_swe_agent_v2.py`.
The templates and environment variables defined here (`_DEFAULT_ENV_VARS`, `_TOOLCALL_SYSTEM_TEMPLATE`, `_TOOLCALL_INSTANCE_TEMPLATE`, `_SHARED_OBSERVATION_TEMPLATE`, `_TOOLCALL_FORMAT_ERROR_TEMPLATE`) are duplicated from `mini_swe_agent_v2.py`. This creates a maintenance burden where changes must be synchronized across both files.
Consider having the agent import from this config module, or vice versa, to maintain a single source of truth.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ares/code_agents/mini_swe_agent_v2_configs.py` around lines 12 - 166, The two modules duplicate environment and template constants (_DEFAULT_ENV_VARS, _TOOLCALL_SYSTEM_TEMPLATE, _TOOLCALL_INSTANCE_TEMPLATE, _SHARED_OBSERVATION_TEMPLATE, _TOOLCALL_FORMAT_ERROR_TEMPLATE); consolidate them into a single source of truth by keeping these definitions in this config module and updating mini_swe_agent_v2.py to import those symbols instead of redefining them (or vice versa if you prefer the other file as canonical); remove the duplicate definitions from the non-canonical file, update all references to use the imported names (e.g., replace local references to _DEFAULT_ENV_VARS, _TOOLCALL_SYSTEM_TEMPLATE, _TOOLCALL_INSTANCE_TEMPLATE, _SHARED_OBSERVATION_TEMPLATE, _TOOLCALL_FORMAT_ERROR_TEMPLATE with imports), and run tests to ensure no import cycles or name changes break usage.
src/ares/llms/response.py (1)
7-8: Module-level imports create coupling to external API types.
Importing `anthropic.types` and `openai.types.responses` at the module level for type annotations couples this core response module to specific external SDKs. This means importing `response.py` requires both SDKs to be installed.
Consider using `TYPE_CHECKING` guards if these are only needed for type hints:
♻️ Proposed refactor
+ from typing import TYPE_CHECKING
+
+ if TYPE_CHECKING:
+     import anthropic.types
+     import openai.types.responses
Then use string annotations for the types, or keep as-is if runtime access is needed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ares/llms/response.py` around lines 7 - 8, The module-level imports of external SDK types (anthropic.types and openai.types.responses) cause runtime coupling; wrap those imports inside a TYPE_CHECKING guard (from typing import TYPE_CHECKING) and move any references to those types into either string annotations or enable postponed evaluation with "from __future__ import annotations" so runtime imports are not required; update the annotations in this module (e.g., any functions or classes that currently reference anthropic.types or openai.types.responses) to use quoted type names or rely on TYPE_CHECKING to import them only for type checking.
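A minimal sketch of the guarded-import pattern; `message_id` and its body are purely illustrative, and with `from __future__ import annotations` the annotation is never evaluated at runtime, so the module imports even without the SDK installed:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by type checkers; never imported at runtime.
    import anthropic.types


def message_id(message: anthropic.types.Message) -> str:
    """The annotation stays a string at runtime, so the SDK is optional."""
    return getattr(message, "id", "<no id>")
```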
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Around line 7-10: The wheel exclusion list in pyproject.toml currently only
excludes files matching "**/*_test.py" which misses the "test_*.py" convention
(e.g., integration_tests/test_default_workdir.py); update the exclude array to
also include the pattern "**/test_*.py" so both naming variants are excluded
from the built wheel, ensuring the exclude list contains "**/*_test.py",
"**/test_*.py", and any existing entries like "**/test_data/".
In `@src/ares/code_agents/mini_swe_agent_v2_configs.py`:
- Around line 450-464: Update the text_based() docstring in MiniSWEAgentV2Config
to remove or correct the reference to the non-existent
MiniSWECodeAgentV2TextBased class: change the sentence to state that the
text_based() config is compatible with the existing MiniSWECodeAgentV2 agent (or
simply remove the sentence), keeping the rest of the docstring intact;
specifically edit the method text_based() and replace the mention of
MiniSWECodeAgentV2TextBased with MiniSWECodeAgentV2 (or delete that clause) so
the docstring accurately reflects supported agent classes.
In `@src/ares/llms/openai_chat_converter.py`:
- Line 142: The converter function names in openai_chat_converter need to match
the documented Protocol: rename the functions ares_request_to_external and
ares_request_from_external to to_external and from_external respectively
(preserving their signatures, parameters like request: llm_request.LLMRequest
and strict: bool = True, return types, and internal logic), update all internal
references/calls in this module to use the new names, and ensure any imports or
registrations that expect the Protocol methods are updated so the class/module
implements to_external() and from_external() exactly as defined in request.py.
---
Outside diff comments:
In `@src/ares/contrib/llama_cpp.py`:
- Around line 89-105: The completion_kwargs produced by ares_request_to_external
contains max_completion_tokens and other keys that
llama_cpp.Llama.create_chat_completion doesn't accept, so before calling
self._llm.create_chat_completion map/rename max_completion_tokens -> max_tokens,
remove unsupported keys (e.g., metadata, service_tier) and only pass keys that
match llama.cpp's signature; after awaiting create_chat_completion and
model-validating chat_completion update the response assembly (where
content/usage and response.LLMResponse are created) to also preserve tool_calls
if present on chat_completion. Ensure you modify the code around
completion_kwargs, the call to self._llm.create_chat_completion, and the
chat_completion -> response conversion (references: ares_request_to_external,
completion_kwargs, self._llm.create_chat_completion, chat_completion,
response.LLMResponse, response.TextData, tool_calls, max_completion_tokens ->
max_tokens).
---
Duplicate comments:
In `@examples/utils.py`:
- Around line 42-50: Guard against empty action.data in print_step by checking
if action.data is truthy and has at least one item before accessing
action.data[0]; if empty, set action_content = "" (or a sensible default). When
handling llms.ToolUseData, coerce the input to a string (e.g., action_content =
str(action.data[0].input)) so truncation and length checks operate on a string;
likewise for llms.TextData use str(action.data[0].content). Compute
action_preview from the stringified action_content (action_preview =
action_str[:200]) and append "..." only when len(action_str) > 200. Ensure to
still call assert_never for unexpected types.
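One way to implement the guard, sketched with duck typing instead of the real `llms.ToolUseData`/`llms.TextData` dispatch (the attribute names and the 200-character limit follow the comment above; the helper name is hypothetical):

```python
def action_preview(data: list) -> str:
    """Truncated preview of the first action block; tolerates empty data."""
    if not data:
        return ""
    first = data[0]
    # Stand-in for the ToolUseData / TextData dispatch: coerce whichever
    # payload attribute exists to a string before slicing.
    payload = getattr(first, "input", None)
    if payload is None:
        payload = getattr(first, "content", first)
    action_str = str(payload)
    preview = action_str[:200]
    if len(action_str) > 200:
        preview += "..."
    return preview
```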
In `@src/ares/containers/docker.py`:
- Around line 108-110: The current guards only check that self._container is not
None but don't verify the container's liveness; add a reusable helper (e.g.
_ensure_running or _check_container_running) that calls self._container.reload()
and checks self._container.status == "running" and raises a clear RuntimeError
like "Container not running - call start() or restart the container" if not
running; replace the simple None checks in the affected methods (the ones using
self._container at lines shown) to call this helper before making Docker SDK
calls so operations fail fast with an actionable message.
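A sketch of the shared guard. The Docker SDK's `Container.reload()` refreshes the cached `status` attribute; the surrounding class shape is hypothetical and simplified.

```python
class DockerContainer:
    """Minimal sketch; self._container stands in for a docker SDK Container."""

    def __init__(self, container=None):
        self._container = container  # set by start() in the real class

    def _ensure_running(self) -> None:
        """Fail fast with an actionable message before any Docker SDK call."""
        if self._container is None:
            raise RuntimeError("Container not started - call start() first")
        self._container.reload()  # refresh cached state from the daemon
        if self._container.status != "running":
            raise RuntimeError(
                "Container not running - call start() or restart the container"
            )

    def exec_run(self, cmd):
        self._ensure_running()
        return self._container.exec_run(cmd)
```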
In `@src/ares/llms/openai_chat_converter.py`:
- Around line 515-517: The code currently extracts only the first TextData block
from response (text_blocks = [block for block in response.data if
isinstance(block, llm_response.TextData)]; content = text_blocks[0].content if
text_blocks else None) which silently drops any additional TextData entries;
update the conversion to detect when len(text_blocks) > 1 and emit a warning log
(including a clear message like "Multiple TextData blocks detected; only the
first will be used" and include metadata such as number of blocks) before
continuing to use text_blocks[0].content so callers are alerted to dropped
content; reference the existing logger in this module (e.g., logger or
module-level logger) when adding the warning.
---
Nitpick comments:
In `@src/ares/code_agents/mini_swe_agent_v2_configs.py`:
- Around line 12-166: The two modules duplicate environment and template
constants (_DEFAULT_ENV_VARS, _TOOLCALL_SYSTEM_TEMPLATE,
_TOOLCALL_INSTANCE_TEMPLATE, _SHARED_OBSERVATION_TEMPLATE,
_TOOLCALL_FORMAT_ERROR_TEMPLATE); consolidate them into a single source of truth
by keeping these definitions in this config module and updating
mini_swe_agent_v2.py to import those symbols instead of redefining them (or vice
versa if you prefer the other file as canonical); remove the duplicate
definitions from the non-canonical file, update all references to use the
imported names (e.g., replace local references to _DEFAULT_ENV_VARS,
_TOOLCALL_SYSTEM_TEMPLATE, _TOOLCALL_INSTANCE_TEMPLATE,
_SHARED_OBSERVATION_TEMPLATE, _TOOLCALL_FORMAT_ERROR_TEMPLATE with imports), and
run tests to ensure no import cycles or name changes break usage.
In `@src/ares/code_agents/mini_swe_agent_v2.py`:
- Around line 610-614: The json import is inside the helper function
_dict_to_json which violates module-level import conventions; move the import to
the top of the module (add import json to the file's imports) and remove the
inline import from inside _dict_to_json so the function simply calls
json.dumps(d); ensure no other inline json imports remain in this module.
In `@src/ares/code_agents/terminus2/terminus2_agent.py`:
- Around line 530-531: The call to response.extract_text_content(llm_resp) can
raise ValueError for empty or non-text (ToolUseData) blocks, but presently those
flow into a broad except; update the run() flow around the self._query_llm() /
response.extract_text_content(llm_resp) sequence to catch ValueError explicitly,
log or surface a clear predictable text-only failure (e.g., set
assistant_message to a specific error/abort state or raise a more specific
exception), while letting other exceptions continue to the existing broad
handler; apply the same targeted ValueError handling in
_unwrap_single_response() where response.extract_text_content is used so
summarization aborts clearly on non-text blocks rather than relying on the
generic exception path.
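Sketched with hypothetical callables standing in for `self._query_llm()` and `response.extract_text_content`, the targeted handling might look like:

```python
def get_assistant_text(query_llm, extract_text_content):
    """Catch text-extraction failures explicitly; other exceptions propagate
    to the caller's existing broad handler."""
    llm_resp = query_llm()
    try:
        return extract_text_content(llm_resp)
    except ValueError as exc:
        # Predictable text-only failure (empty or tool-use-only response):
        # surface a clear abort state instead of the generic error path.
        raise RuntimeError(f"LLM returned no usable text content: {exc}") from exc
```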
In `@src/ares/llms/anthropic_converter_test.py`:
- Around line 305-310: The helper function _load_test_message currently opens
JSON test files without specifying an encoding; update the open(...) call in
_load_test_message to include encoding="utf-8" so json.load reads files with the
correct text encoding (keep the rest of the function logic, return
anthropic.types.Message(**data) unchanged).
In `@src/ares/llms/openai_chat_converter_test.py`:
- Around line 36-41: The helper function _load_test_completion currently opens
JSON test files without specifying an encoding; update the open call inside
_load_test_completion (the with open(test_data_dir / filename) as f:) to
explicitly use encoding="utf-8" so json.load reads files consistently across
platforms and locales.
In `@src/ares/llms/openai_responses_converter_test.py`:
- Around line 431-436: The helper _load_test_response currently opens JSON test
files without specifying an encoding, which can cause platform-dependent
behavior; update the open call inside _load_test_response to pass
encoding="utf-8" (i.e., open(test_data_dir / filename, encoding="utf-8")) so
json.load reads UTF-8 consistently when constructing the
openai.types.responses.Response.
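The same fix applies to all three test helpers; a minimal sketch of the pattern (helper name hypothetical):

```python
import json
from pathlib import Path


def load_test_json(path: Path) -> dict:
    """Read a JSON fixture with an explicit encoding so results don't
    depend on the platform's locale default."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```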
In `@src/ares/llms/response.py`:
- Around line 7-8: The module-level imports of external SDK types
(anthropic.types and openai.types.responses) cause runtime coupling; wrap those
imports inside a TYPE_CHECKING guard (from typing import TYPE_CHECKING) and move
any references to those types into either string annotations or enable postponed
evaluation with "from __future__ import annotations" so runtime imports are not
required; update the annotations in this module (e.g., any functions or classes
that currently reference anthropic.types or openai.types.responses) to use
quoted type names or rely on TYPE_CHECKING to import them only for type
checking.
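A sketch of the guarded-import pattern; the function and the attribute access on the message are illustrative, not the module's real code.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by static type checkers; no runtime SDK dependency.
    import anthropic.types


def message_text(message: "anthropic.types.Message") -> str:
    """The quoted annotation keeps the SDK out of runtime imports."""
    return message.content[0].text
```

Alternatively, `from __future__ import annotations` makes every annotation lazily evaluated, so the quotes can be dropped module-wide.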
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0cb46145-1c48-4397-8c75-712293fee9be
📒 Files selected for processing (39)
CLAUDE.md, examples/07_mech_interp_hooked_transformer.py, examples/utils.py, pyproject.toml, src/ares/code_agents/mini_swe_agent.py, src/ares/code_agents/mini_swe_agent_v2.py, src/ares/code_agents/mini_swe_agent_v2_configs.py, src/ares/code_agents/terminus2/terminus2_agent.py, src/ares/containers/docker.py, src/ares/contrib/llama_cpp.py, src/ares/contrib/transformers_client.py, src/ares/contrib/transformers_client_test.py, src/ares/environments/twenty_questions.py, src/ares/llms/__init__.py, src/ares/llms/accounting.py, src/ares/llms/anthropic_converter.py, src/ares/llms/anthropic_converter_test.py, src/ares/llms/chat_completions_compatible.py, src/ares/llms/openai_chat_converter.py, src/ares/llms/openai_chat_converter_test.py, src/ares/llms/openai_responses_converter.py, src/ares/llms/openai_responses_converter_test.py, src/ares/llms/request_test.py, src/ares/llms/response.py, src/ares/llms/response_test.py, src/ares/llms/test_data/anthropic_messages_mixed_content.json, src/ares/llms/test_data/anthropic_messages_text_only.json, src/ares/llms/test_data/anthropic_messages_tool_use_parallel.json, src/ares/llms/test_data/anthropic_messages_tool_use_single.json, src/ares/llms/test_data/openai_empty.json, src/ares/llms/test_data/openai_responses_text_only.json, src/ares/llms/test_data/openai_responses_tool_call_single.json, src/ares/llms/test_data/openai_text_only.json, src/ares/llms/test_data/openai_tool_call_malformed.json, src/ares/llms/test_data/openai_tool_call_mixed.json, src/ares/llms/test_data/openai_tool_call_parallel.json, src/ares/llms/test_data/openai_tool_call_single.json, src/ares/presets.py, src/ares/testing/mock_llm_test.py
```toml
exclude = [
    "**/*_test.py",
    "**/test_data/",
]
```
🧩 Analysis chain
🏁 Script executed:
```shell
cat -n pyproject.toml | head -20
```

Repository: withmartian/ares
Length of output: 692
🏁 Script executed:
```shell
# Find all test files to understand naming patterns
find . -name '*test*.py' -type f | head -30
```

Repository: withmartian/ares
Length of output: 650
🏁 Script executed:
```shell
# Search for the coding guidelines mentioned
rg -i "naming pattern|test.*\.py" --type py --type md --type txt | head -20
```

Repository: withmartian/ares
Length of output: 768
Add test_*.py pattern to wheel exclusions.
The current pattern **/*_test.py excludes only the *_test.py naming variant. The file ./integration_tests/test_default_workdir.py (which follows the test_*.py naming pattern mentioned in CLAUDE.md as an acceptable alternative) would still be packaged in the wheel. Update the exclude list to include **/test_*.py:
Suggested pattern update
```diff
 exclude = [
     "**/*_test.py",
+    "**/test_*.py",
     "**/test_data/",
 ]
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```toml
exclude = [
    "**/*_test.py",
    "**/test_*.py",
    "**/test_data/",
]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 7 - 10, The wheel exclusion list in
pyproject.toml currently only excludes files matching "**/*_test.py" which
misses the "test_*.py" convention (e.g.,
integration_tests/test_default_workdir.py); update the exclude array to also
include the pattern "**/test_*.py" so both naming variants are excluded from the
built wheel, ensuring the exclude list contains "**/*_test.py", "**/test_*.py",
and any existing entries like "**/test_data/".
```python
@classmethod
def text_based(cls) -> "MiniSWEAgentV2Config":
    """Create config for text-based agent (backwards compatibility).

    This configuration uses regex parsing of markdown code blocks
    instead of tool calls. Matches swebench_backticks.yaml.

    Note: This requires using MiniSWECodeAgentV2TextBased instead
    of MiniSWECodeAgentV2, as the parsing logic is different.
    """
    return cls(
        system_template=_TEXTBASED_SYSTEM_TEMPLATE,
        instance_template=_TEXTBASED_INSTANCE_TEMPLATE,
        format_error_template=_TEXTBASED_FORMAT_ERROR_TEMPLATE,
        parallel_tool_calls=False,  # Text-based doesn't support parallel
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Search for MiniSWECodeAgentV2TextBased class definition
rg -n 'class MiniSWECodeAgentV2TextBased' --type py
# Also check for any references to it
rg -n 'MiniSWECodeAgentV2TextBased' --type py
```

Repository: withmartian/ares
Length of output: 183
🏁 Script executed:
```shell
# Search for MiniSWECodeAgentV2 classes (with or without TextBased)
rg -n 'class MiniSWECodeAgentV2' --type py
# Search for TextBased references to understand naming conventions
rg -n 'TextBased' --type py | head -20
# Check what's in the file around the referenced method
cat -n src/ares/code_agents/mini_swe_agent_v2_configs.py | sed -n '440,470p'
```

Repository: withmartian/ares
Length of output: 1451
🏁 Script executed:
```shell
# Search for any agent classes that might be text-based variants
rg -n 'class.*Agent.*' src/ares/code_agents/ --type py | grep -i text
# Check for backticks or markdown parsing related agents
rg -n 'backtick' --type py
# Look for TODO/FIXME comments related to TextBased
rg -n 'TextBased|text.based' --type py
```

Repository: withmartian/ares
Length of output: 953
🏁 Script executed:
```shell
# Search more broadly for all agent classes in the codebase
rg -n 'class.*\(.*Agent' src/ares/code_agents/ --type py
# Check the actual agent implementation file
cat -n src/ares/code_agents/mini_swe_agent_v2.py | head -50
```

Repository: withmartian/ares
Length of output: 2796
Remove or correct the docstring reference to the non-existent MiniSWECodeAgentV2TextBased class.
The docstring states "This requires using MiniSWECodeAgentV2TextBased instead of MiniSWECodeAgentV2", but this class does not exist in the codebase. The text_based() method is a config factory that should work with the same MiniSWECodeAgentV2 agent class, just with different templates for text-based (backticks) parsing. Either clarify that users should still use MiniSWECodeAgentV2 with this config, or remove the misleading reference.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ares/code_agents/mini_swe_agent_v2_configs.py` around lines 450 - 464,
Update the text_based() docstring in MiniSWEAgentV2Config to remove or correct
the reference to the non-existent MiniSWECodeAgentV2TextBased class: change the
sentence to state that the text_based() config is compatible with the existing
MiniSWECodeAgentV2 agent (or simply remove the sentence), keeping the rest of
the docstring intact; specifically edit the method text_based() and replace the
mention of MiniSWECodeAgentV2TextBased with MiniSWECodeAgentV2 (or delete that
clause) so the docstring accurately reflects supported agent classes.
```diff
-def to_external(request: llm_request.LLMRequest, *, strict: bool = True) -> dict[str, Any]:
+def ares_request_to_external(request: llm_request.LLMRequest, *, strict: bool = True) -> dict[str, Any]:
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Check if the documented Protocol pattern is still accurate and if other converters follow it
rg -n "def to_external|def from_external" --type=py src/ares/llms/
```

Repository: withmartian/ares
Length of output: 693
Rename functions to match documented Protocol.
The functions ares_request_to_external and ares_request_from_external deviate from the documented converter interface defined in request.py (lines 201, 216), which specifies to_external() and from_external(). Other converters in the same module (openai_responses_converter, anthropic_converter) follow this naming convention.
Rename to to_external() and from_external() for consistency, or update the Protocol documentation and all other converters to match the new naming.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ares/llms/openai_chat_converter.py` at line 142, The converter function
names in openai_chat_converter need to match the documented Protocol: rename the
functions ares_request_to_external and ares_request_from_external to to_external
and from_external respectively (preserving their signatures, parameters like
request: llm_request.LLMRequest and strict: bool = True, return types, and
internal logic), update all internal references/calls in this module to use the
new names, and ensure any imports or registrations that expect the Protocol
methods are updated so the class/module implements to_external() and
from_external() exactly as defined in request.py.
User description
(Actually authored by Josh)
This PR adds msweav2, along with a preset for it.
Full benchmarking TBD, but initially it looks on-par with the reference.
I needed to expand tool availability in the LLM request/response model, which makes this PR quite a lot larger. It might be worth breaking up. Hopefully once mux is available we should be able to add it.
Generated description
Below is a concise technical summary of the changes proposed in this PR:
Introduces the `MiniSWECodeAgentV2` and its configuration presets, enabling advanced tool-call based interaction models for software engineering tasks. This agent leverages significant architectural enhancements to the core LLM request and response models, which now support `ToolUseData` and parallel tool execution across various LLM providers like OpenAI and Anthropic. The change also adds `ToolUseData` and API-specific response fields, which includes modifying cost accounting to support diverse API response structures and ensuring seamless conversion between internal and external LLM formats. (Modified files: 12)
- Introduces `MiniSWECodeAgentV2`, a new agent designed for software engineering tasks that utilizes OpenAI-style tool calls and supports parallel command execution. This includes defining its configuration presets and registering it within the system's available agents. (Modified files: 3)
- Adds `ToolUseData` for representing tool calls and expands the `LLMResponse` dataclass with API-specific metadata. This foundational change allows for richer interaction patterns with LLMs, including parallel tool execution, and updates the `extract_text_content` utility for broader compatibility. (Modified files: 15)
- Adds guards to `DockerContainer` methods to ensure the container is started before attempting operations, preventing potential `RuntimeError` exceptions. Also updates `pyproject.toml` to exclude test files and data from the built wheel package, and adds type annotation guidelines to `CLAUDE.md`. (Modified files: 3)
- Updates agents and environments to the new `LLMResponse` structure, specifically by adopting the `response.extract_text_content` helper and handling `ToolUseData` where appropriate. This ensures compatibility and proper data extraction across various agent implementations and environment interactions. (Modified files: 6)