Conversation
Force-pushed from 74c099d to 9b62a2d
develra
left a comment
LGTM - as someone fairly unfamiliar with this code, it's hard for me to tell whether the test coverage is sufficient to be confident these changes are safe. It would be good to think through what might break as a result of these changes and make sure we have test coverage for it, especially given the somewhat sensitive timing of a new launch.
Force-pushed from f83f2db to 08587e3
Force-pushed from 742b3d2 to 840d3b3
Force-pushed from 7802a61 to 611f3c1
Force-pushed from 11a88ce to 03e858d
```python
yield tool_utils.ToolInvocationResult(
    name=part.function_response.name,
    call_id=f"call_{part.function_response.name}",
    arguments=calls.pop(0).args,
```
qq: If this is None, will it throw a TypeError in `invoke_tool` for functions without arguments? If so, we should use `calls.pop(0).args or {}` instead.
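A minimal sketch of why `None` arguments break invocation, assuming the arguments are eventually unpacked with `**` (the `invoke_tool` helper here is a hypothetical stand-in, not the actual implementation):

```python
def invoke_tool(fn, arguments):
    # fn(**None) raises TypeError ("argument after ** must be a mapping");
    # coalescing to an empty dict is safe for tools with no parameters.
    return fn(**(arguments or {}))

def ping() -> str:
    """A tool that takes no arguments."""
    return "pong"

print(invoke_tool(ping, None))  # → pong
```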
```python
    This class may include workarounds for specific proxy behaviors.
    """

    def __init__(self, client: genai.Client, model: str, **kwargs):
```
If `support_tool_calling` is True, would the tool be called twice: once by the API's native support, and once by our "simulated call"? For example, would this work?
```python
# %%
# --- Test Case: Tool called twice ---
COUNTER = 0

def increment_counter() -> int:
    global COUNTER
    COUNTER += 1
    return COUNTER

@benchmark_test(include=[
    "google/gemini-2.5-pro",
])
@kbench.task()
def test_stateful_tool_double_execution(llm):
    global COUNTER
    COUNTER = 0  # Reset for each test run
    llm.prompt("Call the increment_counter tool.", tools=[increment_counter])
    # If the bug exists, this will fail because COUNTER will be 2 (or more).
    kbench.assertions.assert_equal(
        1, COUNTER, expectation="Tool should be executed exactly once."
    )
```
```python
call = message.content
return [
    {
        "role": self.roles_mapping.get("system", "system"),
```
It seems the `system` role will NOT be recognized as a "tool result" by some models like Gemini. This causes the tool to be called in an infinite loop until ToolInvocationLimitExhausted is reached. Shall we change the role to "user"?
For example, this seems to fail the same test as in https://github.com/Kaggle/kaggle-benchmarks/pull/12/changes#r3018051338
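A hedged sketch of the suggested fix, assuming tool results are sent back as plain chat messages (the helper name and the `roles_mapping` lookup key are hypothetical):

```python
def tool_result_message(roles_mapping: dict, content: str) -> dict:
    # Fall back to "user" rather than "system": models like Gemini treat a
    # "system" message as instructions, not as a tool result, and may keep
    # re-issuing the same tool call.
    role = roles_mapping.get("tool_result", "user")
    return {"role": role, "content": content}

print(tool_result_message({}, "result: 42"))  # → {'role': 'user', 'content': 'result: 42'}
```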
```python
yield tool_utils.ToolInvocationResult(
    name=part.function_response.name,
    call_id=f"call_{part.function_response.name}",
    arguments=calls.pop(0).args,
```
Same line here: shall we match `function_response` to `function_call` by id (or name) instead of `pop(0)`? See here - it seems to be the documented way to associate function results with function calls.
`pop(0)` might produce misaligned results (I remember seeing this before):

```
function_call(add, 2, 3)
function_call(times, 4, 5)
function_response(times, 20)  # this will be mis-assigned to add
function_response(add, 5)
```
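A minimal sketch of matching by name rather than positionally, using dict-shaped stand-ins for the genai types (field names are assumptions):

```python
def match_responses_to_calls(calls, responses):
    # Pair each function_response with its function_call by name rather than
    # by arrival order, so out-of-order results stay aligned.
    calls_by_name = {call["name"]: call for call in calls}
    return [(calls_by_name[resp["name"]], resp) for resp in responses]

calls = [{"name": "add", "args": (2, 3)}, {"name": "times", "args": (4, 5)}]
responses = [{"name": "times", "result": 20}, {"name": "add", "result": 5}]
for call, resp in match_responses_to_calls(calls, responses):
    print(call["name"], call["args"], "->", resp["result"])
```

Matching on a unique call id instead of the name would additionally handle the same tool being called twice in one turn.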
```python
return AssertionResult(
    passed=passed,
    expectation=expectation or "Expected to call `{tool}`",
```
Missing f-string? e.g. `f"Expected to call `{tool}`"`
```python
    return fields
    # return pydantic.create_model(f"{func.__name__}", **fields)
    return TypedDict(
```
This code is never reached? The `return fields` above exits the function first.
```python
return TypedDict(
    func.__name__, {field: annotation for field, (annotation, _) in fields.items()}
)
# return pydantic.create_model(f"{func.__name__}", **fields)
```
Also, there are commented-out code snippets like this scattered throughout the code - can we remove them? :)
```python
    return tool_calls


def _iter_tool_calls(self, response):
    # TODO: review this function for potentiall issues
```
nit: "potentiall" is misspelled (should be "potential")
```python
raw_messages = self._convert_to_genai_types(messages)

config_params = {}
if tools and schema is not str:
```
This check seems a bit odd/unconventional to me. Can we maybe default `schema` to None and check for that instead?
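A sketch of the suggested convention (hypothetical standalone signature, not the actual method): default `schema` to `None` and test for that, rather than comparing against the `str` type object:

```python
def build_config(tools=None, schema=None):
    # An explicit None check reads more conventionally than `schema is not str`.
    config_params = {}
    if tools and schema is None:
        config_params["tools"] = tools
    return config_params

print(build_config(tools=["increment_counter"]))  # → {'tools': ['increment_counter']}
```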
```python
tool_calls = self.extract_tool_calls(response)

for tool_invocation in self._iter_tool_calls(response):
```
I think `_iter_tool_calls` is already consumed internally by `extract_tool_calls`, so this for loop never actually runs (because the iterator is exhausted).
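The exhaustion behavior is easy to demonstrate: if `_iter_tool_calls` returns a generator and an earlier call consumes it, a second pass over the same object yields nothing (generator name below is a stand-in):

```python
def iter_tool_calls():
    # Stand-in for a generator-based _iter_tool_calls.
    yield "call_a"
    yield "call_b"

gen = iter_tool_calls()
first_pass = list(gen)   # consumes the generator: ['call_a', 'call_b']
second_pass = list(gen)  # already exhausted: []
print(first_pass, second_pass)
```

Calling `self._iter_tool_calls(response)` again creates a fresh generator, which may be the intent here, but then the tool calls are parsed twice.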
```python
    # The proxy returns a 400 error if tools are set with this model.
    kwargs["support_tool_calling"] = False

elif "deepseek" in model:
```
Could we perhaps move this into a config instead? (to make it easier to add a new model or a new capability, etc.) e.g.:

```python
MODEL_CONFIGS = {
    "gemini": {"support_structured_outputs": True, "support_tool_calling": True},
    "deepseek": ...
}
```
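A possible lookup over such a config, keeping the substring matching the existing `elif "deepseek" in model` chain uses (a sketch; everything beyond the gemini entry is an assumption):

```python
MODEL_CONFIGS = {
    "gemini": {"support_structured_outputs": True, "support_tool_calling": True},
    # Further entries ("deepseek", ...) would follow the same shape.
}

def model_capabilities(model: str) -> dict:
    # Mirror the existing substring checks like `"deepseek" in model`.
    for fragment, caps in MODEL_CONFIGS.items():
        if fragment in model:
            return caps
    return {}

print(model_capabilities("google/gemini-2.5-pro"))
```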
mohami2000
left a comment
I think the intermediate state (what tools were called, what they returned, how many rounds it took) is now hidden inside `llm_message.chat.messages`. I know that info is sometimes useful for benchmark publishers, so maybe let's add some documentation about how users can display/see it.
Force-pushed from 3d185d5 to b219a4c
This commit introduces a new simulation capability for models that do not natively support tool/function calling, allowing them to act as tool-using agents through structured prompting.

- `simulate_agent`: An iterative agent loop that automatically prompts the LLM with available tools, parses its requested tool calls, executes them locally, and feeds the results back until a final answer is reached (or `max_iterations` is hit).
- `simulate_respond_with_tools`: A single-turn simulation that wraps the LLM call with instructions and a structured output schema built dynamically from the provided Python functions.
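The iterative loop described above could be sketched roughly like this. This is not the actual implementation; the message shapes and the `llm_call` contract are assumptions made for illustration:

```python
def simulate_agent(llm_call, tools, prompt, max_iterations=5):
    """Iterative tool-calling loop: ask the model, run any requested tool
    locally, feed the result back, repeat until a final answer or the limit."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_iterations):
        # Assumed contract: returns {"tool": ..., "args": ...} or {"answer": ...}.
        reply = llm_call(messages)
        if "answer" in reply:
            return reply["answer"]
        result = tools[reply["tool"]](**reply.get("args", {}))
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    raise RuntimeError("max_iterations reached without a final answer")

# Fake model for demonstration: request one tool call, then echo its result.
def fake_llm(messages):
    if any("Tool result" in m["content"] for m in messages):
        return {"answer": messages[-1]["content"]}
    return {"tool": "add", "args": {"a": 2, "b": 3}}

print(simulate_agent(fake_llm, {"add": lambda a, b: a + b}, "What is 2+3?"))
# → Tool result: 5
```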
Force-pushed from 353a0ca to 0974215
Major refactor of the LLM chat architecture to improve code organization, maintainability, and type safety.

Key changes:
- Split `LLMChat` subclasses into distinct non-streaming and streaming implementations. Streaming logic (primarily for notebooks) was complicating the core classes; this split makes the primary actors more concise and less error-prone.
- Moved provider-specific implementations into separate files: `openai.py` and `genai.py`.
- Replaced the generic `LLMResponse` with a strictly typed version, specifically enforcing types for `tool_usage` and `token_usage`.
- Updated the `invoke` method to accept explicit arguments.
- Migrated the OpenAI integration from the `completion` API to the more user-friendly `responses` API.

Testing:
- Added coverage for common use cases using real APIs (tests run conditionally if environment keys are present).
Major refactoring of the LLM interaction layer, significantly enhancing the `llm.prompt` method to establish it as the primary, unified entry point for all model communications. The goal is to abstract away model-specific logic, providing seamless support for structured outputs, automatic tool calling, and vision capabilities across all integrated models. This simplifies task definitions and enhances the user experience by providing a consistent, high-level API.

Enhanced `llm.prompt` with automatic tool calling:
- Automatic tool-calling emulation for models that do not natively support tool/function calling.

Refactored Actor model:
- The `llms.py` module has been streamlined; API-specific logic has been moved into dedicated `actors/genai.py` and `actors/openai.py` modules.
- Streaming logic was split into dedicated classes (`StreamingGoogleGenAI`, `StreamingOpenAIResponsesAPI`) to isolate it from the core API used for scheduled runs.

Improved vision and image support:
- Enhanced support for multimodal inputs, particularly for the Gemini API. The framework now correctly handles image content, including captions and various data formats (URLs and base64).

New agentic assertion:
- Added `assert_tool_was_invoked` to allow testing and evaluation of agentic behavior by verifying that a specific tool was used during a task.

Updated examples & tests:
- Examples updated to use the `llm.prompt` API for tool use.
- Added integration tests (`test_api_integration.py`) that run against live OpenAI, Google, and Model Proxy endpoints (when API keys are available) to ensure cross-model consistency.