
Refactor LLM chats: separate streaming logic and enforce strict typing#12

Open
s-alexey wants to merge 2 commits into ci from genai

Conversation

Contributor

@s-alexey s-alexey commented Jan 7, 2026

This is a major refactoring of the LLM interaction layer that significantly enhances the llm.prompt method, establishing it as the primary, unified entry point for all model communications. The goal is to abstract away model-specific logic, providing seamless support for structured outputs, automatic tool calling, and vision capabilities across all integrated models.

This simplifies task definitions and enhances the user experience by providing a consistent, high-level API.

  • Enhanced llm.prompt with automatic tool calling:

    • The llm.prompt method has been upgraded to manage the entire conversation turn, now including a built-in, multi-step tool-calling loop.
    • When tools are provided, prompt automatically orchestrates the interaction: it invokes the LLM, executes requested tools, sends the results back, and repeats this cycle until a final answer is generated. This eliminates the need for manual tool-handling logic in task definitions.
  • Automatic tool-calling emulation:

    • For models that lack native tool-calling support, a new emulation layer transparently provides this functionality by wrapping the requests with structured prompts, making the feature available across all models.
  • Refactored Actor model:

    • The llms.py module has been streamlined. API-specific logic has been moved into dedicated actors/genai.py and actors/openai.py modules.
    • Experimental streaming functionality is now encapsulated in separate classes (e.g., StreamingGoogleGenAI, StreamingOpenAIResponsesAPI) to isolate it from the core API used for scheduled runs.
  • Improved vision and image support:

    • Enhanced support for multimodal inputs, particularly for the Gemini API. The framework now correctly handles image content, including captions and various data formats (URLs and base64).
  • New agentic assertion:

    • Added assert_tool_was_invoked to allow for testing and evaluation of agentic behavior by verifying that a specific tool was used during a task.
  • Updated examples and tests:

    • Revised examples to demonstrate the simplified llm.prompt API for tool use.
    • Added a comprehensive suite of API integration tests (test_api_integration.py) that run against live OpenAI, Google, and Model Proxy endpoints (when API keys are available) to ensure cross-model consistency.
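The multi-step tool-calling loop described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual kaggle-benchmarks API: `tool_calling_loop`, `invoke_model`, `stub_model`, and the message shapes are all hypothetical names chosen for the sketch.

```python
import json


def add(a: int, b: int) -> int:
    """Toy tool for the sketch."""
    return a + b


def tool_calling_loop(invoke_model, user_prompt, tools, max_rounds=5):
    """Call the model, execute any requested tools, feed the results
    back, and repeat until the model produces a final answer."""
    registry = {fn.__name__: fn for fn in tools}
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_rounds):
        response = invoke_model(messages)
        calls = response.get("tool_calls", [])
        if not calls:
            return response["content"]  # final answer, loop ends
        for call in calls:
            result = registry[call["name"]](**call["arguments"])
            messages.append({
                "role": "tool",
                "name": call["name"],
                "content": json.dumps(result),
            })
    raise RuntimeError("tool-invocation limit exhausted")


def stub_model(messages):
    """Stand-in LLM: requests one tool call, then answers."""
    if any(m["role"] == "tool" for m in messages):
        return {"content": "2 + 3 = 5", "tool_calls": []}
    return {"content": "", "tool_calls": [
        {"name": "add", "arguments": {"a": 2, "b": 3}},
    ]}


print(tool_calling_loop(stub_model, "What is 2 + 3?", [add]))  # 2 + 3 = 5
```

The point of the sketch is that the caller only supplies a prompt and plain Python functions; the orchestration (invoke, execute, resend, repeat) happens inside the loop.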

@s-alexey s-alexey requested a review from dolaameng January 7, 2026 16:04
@s-alexey s-alexey added the wip Work in progress label Jan 7, 2026
Collaborator

@dolaameng dolaameng left a comment

Thanks for the refactoring, which is really helpful and makes things clearer! I left some questions/comments to understand more.

We can keep iterating it.

@s-alexey s-alexey force-pushed the genai branch 3 times, most recently from 74c099d to 9b62a2d Compare January 14, 2026 16:00
Contributor

@develra develra left a comment

LGTM. As someone pretty ignorant of this code, it's hard for me to tell whether the test coverage is sufficient to be confident these changes are safe. I think it would be good to think through what might break as a result of these changes and make sure we have test coverage for it, especially given the somewhat sensitive timing of a new launch.

@s-alexey s-alexey force-pushed the genai branch 5 times, most recently from f83f2db to 08587e3 Compare February 10, 2026 17:05
@s-alexey s-alexey force-pushed the genai branch 8 times, most recently from 742b3d2 to 840d3b3 Compare February 13, 2026 20:29
@s-alexey s-alexey force-pushed the genai branch 2 times, most recently from 7802a61 to 611f3c1 Compare February 23, 2026 19:27
@s-alexey s-alexey force-pushed the genai branch 2 times, most recently from 11a88ce to 03e858d Compare March 10, 2026 16:52
@s-alexey s-alexey removed the wip Work in progress label Mar 25, 2026
yield tool_utils.ToolInvocationResult(
name=part.function_response.name,
call_id=f"call_{part.function_response.name}",
arguments=calls.pop(0).args,
Collaborator

qq: If this is None, will it throw a TypeError in invoke_tool for functions without arguments? If so, we should use calls.pop(0).args or {} instead.
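The guard the reviewer suggests can be illustrated in isolation; `invoke_tool` and `ping` below are hypothetical names for the sketch, not the PR's actual functions:

```python
def invoke_tool(fn, arguments):
    """fn(**None) raises TypeError, so fall back to an empty
    mapping for tools that take no arguments."""
    return fn(**(arguments or {}))


def ping() -> str:
    """A tool with no parameters."""
    return "pong"


print(invoke_tool(ping, None))  # pong
```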

This class may include workarounds for specific proxy behaviors.
"""

def __init__(self, client: genai.Client, model: str, **kwargs):
Collaborator

If support_tool_calling is True, would the tool be called twice: once by native API support and once by our "simulated call"? For example, would this work?

# %%
# --- Test Case: Tool called twice ---

COUNTER = 0


def increment_counter() -> int:
    global COUNTER
    COUNTER += 1
    return COUNTER


@benchmark_test(include=[
  "google/gemini-2.5-pro",
])
@kbench.task()
def test_stateful_tool_double_execution(llm):
    global COUNTER
    COUNTER = 0  # Reset for each test run

    llm.prompt("Call the increment_counter tool.", tools=[increment_counter])

    # If the bug exists, this will fail because COUNTER will be 2 (or more).
    kbench.assertions.assert_equal(
        1, COUNTER, expectation="Tool should be executed exactly once."
    )

call = message.content
return [
{
"role": self.roles_mapping.get("system", "system"),
Collaborator

It seems the system role will NOT be recognized as a "tool result" by some models like Gemini. This causes tools to be called infinitely until ToolInvocationLimitExhausted is reached. Shall we change the role to "user"?

For example, this seems to fail the same test as in https://github.com/Kaggle/kaggle-benchmarks/pull/12/changes#r3018051338

yield tool_utils.ToolInvocationResult(
name=part.function_response.name,
call_id=f"call_{part.function_response.name}",
arguments=calls.pop(0).args,
Collaborator

Same lines here: shall we match function_response to function_call by id (or name) instead of pop(0)? See here; that seems to be the way to associate function results with function calls.

This might result in misaligned results (I remember seeing it before):

function_call(add, 2, 3)
function_call(times, 4, 5)

function_response(times, 20) # this will be mis-assigned to add
function_response(add, 5)
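Matching by name rather than by position avoids this mis-assignment. A minimal sketch (the function name `pair_responses_to_calls` and the dict shapes are illustrative; real code should prefer call ids when the API provides them, since names alone assume no duplicate calls):

```python
def pair_responses_to_calls(calls, responses):
    """Associate each function_response with its function_call by
    name rather than by position, so out-of-order responses are
    not mis-assigned."""
    pending = {c["name"]: c for c in calls}
    return [(pending.pop(r["name"]), r) for r in responses]


calls = [
    {"name": "add", "args": {"a": 2, "b": 3}},
    {"name": "times", "args": {"a": 4, "b": 5}},
]
responses = [  # arrive out of order, as in the trace above
    {"name": "times", "result": 20},
    {"name": "add", "result": 5},
]
for call, resp in pair_responses_to_calls(calls, responses):
    print(call["name"], call["args"], "->", resp["result"])
```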


return AssertionResult(
passed=passed,
expectation=expectation or "Expected to call `{tool}`",
Contributor

Missing f-string? e.g. f"Expected to call {tool}"


return fields
# return pydantic.create_model(f"{func.__name__}", **fields)
return TypedDict(
Contributor

This code is never reached?

return TypedDict(
func.__name__, {field: annotation for field, (annotation, _) in fields.items()}
)
# return pydantic.create_model(f"{func.__name__}", **fields)
Contributor

Also, there are commented-out code snippets scattered throughout the code; can we remove them? :)

return tool_calls

def _iter_tool_calls(self, response):
# TODO: review this function for potentiall issues
Contributor

nit: potentiall is misspelled

raw_messages = self._convert_to_genai_types(messages)

config_params = {}
if tools and schema is not str:
Contributor

This check seems a bit odd/unconventional to me. Can we maybe set the default to None and check for that?


tool_calls = self.extract_tool_calls(response)

for tool_invocation in self._iter_tool_calls(response):
Contributor

I think _iter_tool_calls is called internally by extract_tool_calls, so this for loop never actually runs? (Because the iterator is exhausted.)
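The pitfall the reviewer suspects is easy to reproduce in isolation: a Python generator can only be consumed once, so a second loop over the same generator object runs zero times. The generator below is a stand-in, not the PR's actual _iter_tool_calls:

```python
def iter_tool_calls():
    """Stand-in for a _iter_tool_calls-style generator."""
    yield {"name": "add"}
    yield {"name": "times"}


gen = iter_tool_calls()
first_pass = list(gen)   # consumes the generator completely
second_pass = list(gen)  # the same generator now yields nothing

print(len(first_pass), len(second_pass))  # 2 0
```

If both extract_tool_calls and the later loop need the calls, materializing them once with list() (or calling _iter_tool_calls again to get a fresh generator) avoids the silent no-op.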

# The proxy returns a 400 error if tools are set with this model.
kwargs["support_tool_calling"] = False

elif "deepseek" in model:
Contributor

Could we perhaps use a config file instead? (That would make it easier to add a new model or a new capability, etc.) For example:
MODEL_CONFIGS = {
"gemini": {"support_structured_outputs": True, "support_tool_calling": True},
"deepseek"

Contributor

@mohami2000 mohami2000 left a comment

I think the intermediate state (what tools were called, what they returned, how many rounds it took) is now hidden inside llm_message.chat.messages, and I know that info is sometimes useful for benchmark publishers, so maybe let's add some documentation on how users can display/see this info.

This commit introduces a new simulation capability for models that do
not natively support tool/function calling, allowing them to act as
tool-using agents through structured prompting.

- `simulate_agent`: An iterative agent loop that automatically prompts
  the LLM with available tools, parses its requested tool calls, executes
  them locally, and feeds the results back until a final answer is
  reached (or `max_iterations` is hit).
- `simulate_respond_with_tools`: A single-turn simulation that wraps the
  LLM call with instructions and a structured output schema built dynamically
  from the provided Python functions.
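The schema-building step that `simulate_respond_with_tools` is described as doing (deriving a structured output schema from the provided Python functions) can be sketched with the standard inspect module. `tool_schema` is a hypothetical name for illustration, not the commit's API, and a real implementation would handle nested and missing annotations more carefully:

```python
import inspect


def tool_schema(fn):
    """Derive a minimal schema-like description of a tool from its
    Python signature, suitable for embedding in an emulation prompt."""
    params = {
        name: getattr(p.annotation, "__name__", "any")
        for name, p in inspect.signature(fn).parameters.items()
    }
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": params,
    }


def increment_counter(step: int) -> int:
    """Increase the counter by `step`."""
    return step


schema = tool_schema(increment_counter)
print(schema["name"], schema["parameters"])  # increment_counter {'step': 'int'}
```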
@s-alexey s-alexey force-pushed the genai branch 3 times, most recently from 353a0ca to 0974215 Compare April 9, 2026 17:11
Major refactor of the LLM chat architecture to improve code organization,
maintainability, and type safety.

Key Changes:
- Split `LLMChat` subclasses into distinct Non-Streaming and Streaming
  implementations. Streaming logic (primarily for notebooks) was
  complicating the core classes; this split makes primary actors more
  concise and less error-prone.
- Moved provider-specific implementations into separate files:
  `openai.py` and `genai.py`.
- Replaced the generic `LLMResponse` with a strictly typed version,
  specifically enforcing types for `tool_usage` and `token_usage`.
- Updated `invoke` method to accept explicit arguments.
- Migrated OpenAI integration from the `completion` API to the more
  user-friendly `responses` API.

Testing:
- Added coverage for common use cases using real APIs (tests run
  conditionally if environment keys are present).

4 participants