refactor: unify sync/async evaluation by defaulting aevaluate to asyncio.to_thread by afarntrog · Pull Request #173 · strands-agents/evals

afarntrog · 2026-03-20T18:13:49Z

Description

refactor: unify sync/async evaluation by defaulting aevaluate to asyncio.to_thread

- Change base Evaluator.aevaluate() to delegate to evaluate() via asyncio.to_thread instead of raising NotImplementedError
- Remove duplicated sync/async helper methods from the Experiment class (_record_evaluator_result, _run_task, and related run logic)
- Consolidate experiment execution paths to reduce code duplication
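
For readers unfamiliar with the pattern, here is a minimal sketch of what "defaulting aevaluate to asyncio.to_thread" could look like. BaseEvaluator and ContainsCapital are hypothetical stand-ins written for illustration, not the actual strands_evals classes, and the real method signatures may differ.

```python
import asyncio
import threading


class BaseEvaluator:
    """Hypothetical stand-in for the base Evaluator (not the real strands_evals class)."""

    def evaluate(self, case: str) -> bool:
        raise NotImplementedError

    async def aevaluate(self, case: str) -> bool:
        # Default async path: run the blocking evaluate() in a worker thread
        # so the event loop is not blocked while it runs.
        return await asyncio.to_thread(self.evaluate, case)


class ContainsCapital(BaseEvaluator):
    """Sync-only evaluator; it inherits the default aevaluate()."""

    def __init__(self) -> None:
        self.thread_name = None

    def evaluate(self, case: str) -> bool:
        # Record which thread ran the check, purely for demonstration.
        self.thread_name = threading.current_thread().name
        return "capital" in case


evaluator = ContainsCapital()
sync_result = evaluator.evaluate("The capital of France is Paris.")
async_result = asyncio.run(evaluator.aevaluate("The capital of France is Paris."))
# After the async call, evaluate() last ran on a worker thread, not the main thread.
ran_off_main_thread = evaluator.thread_name != threading.main_thread().name
```

With this default, a subclass only has to implement evaluate(); an evaluator that has a native async implementation can still override aevaluate().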

Related Issues

#172

Documentation PR

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

I ran hatch run prepare

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


src/strands_evals/experiment.py

Add semantic convention attributes to evaluation spans for improved observability. Wraps task execution in a dedicated span and enriches case, evaluator, and score spans with structured attributes such as case name, input, evaluator type, and score results.
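
As a rough illustration of the structured attributes described above, the sketch below builds attribute dictionaries for a case span and an evaluator span. The attribute keys are the ones that appear in the span dump posted in this PR; the helper names are hypothetical, and JSON-encoding the case input is an assumption inferred from the quoted values in that dump.

```python
import json


def case_span_attributes(name: str, case_input: object) -> dict:
    """Attributes for an execute_case span (hypothetical helper)."""
    return {
        "gen_ai.evaluation.case.name": name,
        # Assumption: the input is JSON-encoded so non-string inputs
        # still fit into a string-valued span attribute.
        "gen_ai.evaluation.case.input": json.dumps(case_input),
    }


def evaluator_span_attributes(
    case_name: str,
    evaluator_name: str,
    score: float,
    passed: bool,
    label: str,
    explanation: str,
) -> dict:
    """Attributes for an evaluator span (hypothetical helper)."""
    return {
        "gen_ai.evaluation.case.name": case_name,
        "gen_ai.evaluation.name": evaluator_name,
        "gen_ai.evaluation.score.value": score,
        "gen_ai.evaluation.score.label": label,
        "gen_ai.evaluation.test_pass": passed,
        "gen_ai.evaluation.explanation": explanation,
    }


case_attrs = case_span_attributes("france", "What is the capital of France?")
eval_attrs = evaluator_span_attributes(
    "france", "ToolCalled", 1.0, True, "YES", "tool 'lookup_capital' was called"
)
```

Dictionaries like these could then be passed to an OpenTelemetry span, for example through its set_attributes method.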

afarntrog · 2026-03-20T19:33:59Z

The following script runs the experiment through both the sync and async paths to show the generated spans.

"""Script to prove the OpenTelemetry span hierarchy in Experiment.

Runs both sync and async evaluation paths using StrandsEvalsTelemetry's
InMemorySpanExporter, then dumps the full span tree to a timestamped
span_hierarchy_proof_<timestamp>.txt file next to this script.
"""

import asyncio
from pathlib import Path

from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from strands import Agent, tool

from strands_evals import Case, Experiment
from strands_evals.evaluators.deterministic import Contains, ToolCalled
from strands_evals.extractors.tools_use_extractor import extract_agent_tools_used_from_messages
from strands_evals.telemetry import StrandsEvalsTelemetry

# Set up telemetry with in-memory exporter
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
exporter: InMemorySpanExporter = telemetry.in_memory_exporter


# --- Tool ---
@tool
def lookup_capital(country: str) -> str:
    """Look up the capital city of a country.

    Args:
        country: Name of the country to look up.
    """
    capitals = {
        "France": "Paris",
        "Japan": "Tokyo",
        "Brazil": "Brasilia",
        "Australia": "Canberra",
    }
    return capitals.get(country, f"Unknown capital for {country}")


# --- Task function using a real agent ---
def geography_agent_task(case: Case) -> dict:
    """Agent with a capital-lookup tool answering geography questions."""
    agent = Agent(
        system_prompt="You are a geography assistant. Use the lookup_capital tool when asked about capitals. "
        "Always include the capital city name in your response.",
        tools=[lookup_capital],
        callback_handler=None,
    )
    response = agent(case.input)
    trajectory = extract_agent_tools_used_from_messages(agent.messages)
    tool_names = [t["name"] for t in trajectory]
    return {"output": str(response), "trajectory": tool_names}


# --- Build test cases ---
cases = [
    Case(name="france", input="What is the capital of France?", expected_output=None),
    Case(name="japan", input="What is the capital of Japan?", expected_output=None),
    Case(name="brazil", input="What is the capital of Brazil?", expected_output=None),
]

evaluators = [
    ToolCalled(tool_name="lookup_capital"),
    Contains(value="capital"),
]


def build_span_tree(spans):
    """Organize flat span list into a parent-child tree."""
    by_id = {}
    children = {}

    for s in spans:
        sid = s.get_span_context().span_id
        by_id[sid] = s
        children.setdefault(sid, [])

    roots = []
    for s in spans:
        sid = s.get_span_context().span_id
        pid = s.parent.span_id if s.parent else None
        if pid and pid in by_id:
            children[pid].append(sid)
        else:
            roots.append(sid)

    return by_id, children, roots


def format_tree(by_id, children, roots, indent=0):
    """Recursively format the span tree as indented text."""
    lines = []
    for sid in roots:
        s = by_id[sid]
        attrs = dict(s.attributes) if s.attributes else {}
        prefix = "    " * indent
        marker = "|-- " if indent > 0 else ""
        lines.append(f"{prefix}{marker}[SPAN] {s.name}")
        if attrs:
            for k, v in sorted(attrs.items()):
                lines.append(f"{prefix}{'|   ' if indent > 0 else ''}    {k} = {v}")
        for child_id in children.get(sid, []):
            lines.extend(format_tree(by_id, children, [child_id], indent + 1))
    return lines


def dump_spans(label):
    """Dump current spans, return lines, and clear the exporter."""
    spans = exporter.get_finished_spans()
    print('begin non formatted finished spans')
    print(100*'=')
    for s in spans:
        print(s.to_json(indent=2))
    print(100*'=')
    print('end non formatted finished spans')
    by_id, children_map, roots = build_span_tree(spans)

    lines = [
        "=" * 70,
        f"  {label}",
        f"  Total spans captured: {len(spans)}",
        "=" * 70,
        "",
    ]
    lines.extend(format_tree(by_id, children_map, roots))
    lines.append("")
    exporter.clear()
    return lines


# --- Run both paths and collect output ---
output_lines = [
    "SPAN HIERARCHY PROOF",
    f"Generated by: {Path(__file__).name}",
    "",
    "This file proves the OpenTelemetry span parent-child relationships",
    "produced by Experiment.run_evaluations() and run_evaluations_async().",
    "",
    "Indented spans are CHILDREN of the span above them.",
    "Top-level (non-indented) spans have no parent in the evaluation scope.",
    "",
]

# 1. Sync path
exp_sync = Experiment(cases=cases, evaluators=evaluators)
exp_sync.run_evaluations(geography_agent_task)
output_lines.extend(dump_spans("SYNC PATH: Experiment.run_evaluations()"))

# 2. Async path
exp_async = Experiment(cases=cases, evaluators=evaluators)
asyncio.run(exp_async.run_evaluations_async(geography_agent_task, max_workers=1))
output_lines.extend(dump_spans("ASYNC PATH: Experiment.run_evaluations_async()"))

# --- Write to file ---
# Timestamp the filename so repeated runs do not overwrite each other
from datetime import datetime
dt_str = datetime.now().strftime("%Y%m%d_%H%M%S")
out_path = Path(__file__).parent / f"span_hierarchy_proof_{dt_str}.txt"
text = "\n".join(output_lines)
out_path.write_text(text)
print(text)
print(f"\nWritten to {out_path}")

Output:

SPAN HIERARCHY PROOF
Generated by: prove_span_hierarchy.py

This file proves the OpenTelemetry span parent-child relationships
produced by Experiment.run_evaluations() and run_evaluations_async().

Indented spans are CHILDREN of the span above them.
Top-level (non-indented) spans have no parent in the evaluation scope.

======================================================================
  SYNC PATH: Experiment.run_evaluations()
  Total spans captured: 30
======================================================================

[SPAN] execute_case france
    gen_ai.evaluation.case.input = "What is the capital of France?"
    gen_ai.evaluation.case.name = france
    |-- [SPAN] task_execution france
    |       gen_ai.evaluation.case.name = france
    |       gen_ai.evaluation.data.actual_output = "The capital of France is Paris.\n"
    |       gen_ai.evaluation.data.expected_output = null
    |       gen_ai.evaluation.data.has_interactions = False
    |       gen_ai.evaluation.data.has_trajectory = True
    |       gen_ai.evaluation.data.input = "What is the capital of France?"
    |       gen_ai.evaluation.task.type = agent_task
        |-- [SPAN] invoke_agent Strands Agents
        |       gen_ai.agent.name = Strands Agents
        |       gen_ai.agent.tools = ["lookup_capital"]
        |       gen_ai.event.end_time = 2026-03-20T18:46:29.410186+00:00
        |       gen_ai.event.start_time = 2026-03-20T18:46:27.306517+00:00
        |       gen_ai.operation.name = invoke_agent
        |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
        |       gen_ai.system = strands-agents
        |       gen_ai.usage.cache_read_input_tokens = 0
        |       gen_ai.usage.cache_write_input_tokens = 0
        |       gen_ai.usage.completion_tokens = 63
        |       gen_ai.usage.input_tokens = 913
        |       gen_ai.usage.output_tokens = 63
        |       gen_ai.usage.prompt_tokens = 913
        |       gen_ai.usage.total_tokens = 976
        |       system_prompt = You are a geography assistant. Use the lookup_capital tool when asked about capitals. Always include the capital city name in your response.
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = c7ac92f5-40b5-4cdd-a159-b131aa8b6ebf
            |       event_loop.parent_cycle_id = 21687439-2fe9-4741-846f-bb33916b4e02
            |       gen_ai.event.end_time = 2026-03-20T18:46:29.410113+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:28.648806+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:29.410019+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:28.648874+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 749
                |       gen_ai.server.time_to_first_token = 654
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 10
                |       gen_ai.usage.input_tokens = 490
                |       gen_ai.usage.output_tokens = 10
                |       gen_ai.usage.prompt_tokens = 490
                |       gen_ai.usage.total_tokens = 500
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = 21687439-2fe9-4741-846f-bb33916b4e02
            |       gen_ai.event.end_time = 2026-03-20T18:46:28.648727+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:27.307047+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:28.647986+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:27.307089+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 1249
                |       gen_ai.server.time_to_first_token = 1017
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 53
                |       gen_ai.usage.input_tokens = 423
                |       gen_ai.usage.output_tokens = 53
                |       gen_ai.usage.prompt_tokens = 423
                |       gen_ai.usage.total_tokens = 476
                |-- [SPAN] execute_tool lookup_capital
                |       gen_ai.event.end_time = 2026-03-20T18:46:28.648670+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:28.648290+00:00
                |       gen_ai.operation.name = execute_tool
                |       gen_ai.system = strands-agents
                |       gen_ai.tool.call.id = tooluse_sEAVGZ0ybLNWYVWcVuZx98
                |       gen_ai.tool.description = Look up the capital city of a country.
                |       gen_ai.tool.json_schema = {"properties": {"country": {"description": "Name of the country to look up.", "type": "string"}}, "required": ["country"], "type": "object"}
                |       gen_ai.tool.name = lookup_capital
                |       gen_ai.tool.status = success
    |-- [SPAN] evaluator ToolCalled
    |       gen_ai.evaluation.case.name = france
    |       gen_ai.evaluation.explanation = tool 'lookup_capital' was called
    |       gen_ai.evaluation.name = ToolCalled
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
    |-- [SPAN] evaluator Contains
    |       gen_ai.evaluation.case.name = france
    |       gen_ai.evaluation.explanation = actual_output contains 'capital'
    |       gen_ai.evaluation.name = Contains
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
[SPAN] execute_case japan
    gen_ai.evaluation.case.input = "What is the capital of Japan?"
    gen_ai.evaluation.case.name = japan
    |-- [SPAN] task_execution japan
    |       gen_ai.evaluation.case.name = japan
    |       gen_ai.evaluation.data.actual_output = "The capital of Japan is Tokyo.\n"
    |       gen_ai.evaluation.data.expected_output = null
    |       gen_ai.evaluation.data.has_interactions = False
    |       gen_ai.evaluation.data.has_trajectory = True
    |       gen_ai.evaluation.data.input = "What is the capital of Japan?"
    |       gen_ai.evaluation.task.type = agent_task
        |-- [SPAN] invoke_agent Strands Agents
        |       gen_ai.agent.name = Strands Agents
        |       gen_ai.agent.tools = ["lookup_capital"]
        |       gen_ai.event.end_time = 2026-03-20T18:46:33.622556+00:00
        |       gen_ai.event.start_time = 2026-03-20T18:46:29.531875+00:00
        |       gen_ai.operation.name = invoke_agent
        |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
        |       gen_ai.system = strands-agents
        |       gen_ai.usage.cache_read_input_tokens = 0
        |       gen_ai.usage.cache_write_input_tokens = 0
        |       gen_ai.usage.completion_tokens = 75
        |       gen_ai.usage.input_tokens = 924
        |       gen_ai.usage.output_tokens = 75
        |       gen_ai.usage.prompt_tokens = 924
        |       gen_ai.usage.total_tokens = 999
        |       system_prompt = You are a geography assistant. Use the lookup_capital tool when asked about capitals. Always include the capital city name in your response.
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = 44cc2550-89d2-4b00-8ddb-2a8c8a0314c7
            |       event_loop.parent_cycle_id = 0910e1cf-d22c-4ea8-a286-1dfb87af4266
            |       gen_ai.event.end_time = 2026-03-20T18:46:33.622495+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:31.968513+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:33.622410+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:31.968577+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 1642
                |       gen_ai.server.time_to_first_token = 1564
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 10
                |       gen_ai.usage.input_tokens = 501
                |       gen_ai.usage.output_tokens = 10
                |       gen_ai.usage.prompt_tokens = 501
                |       gen_ai.usage.total_tokens = 511
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = 0910e1cf-d22c-4ea8-a286-1dfb87af4266
            |       gen_ai.event.end_time = 2026-03-20T18:46:31.968427+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:29.531980+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:31.967888+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:29.532021+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 2392
                |       gen_ai.server.time_to_first_token = 2013
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 65
                |       gen_ai.usage.input_tokens = 423
                |       gen_ai.usage.output_tokens = 65
                |       gen_ai.usage.prompt_tokens = 423
                |       gen_ai.usage.total_tokens = 488
                |-- [SPAN] execute_tool lookup_capital
                |       gen_ai.event.end_time = 2026-03-20T18:46:31.968374+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:31.968074+00:00
                |       gen_ai.operation.name = execute_tool
                |       gen_ai.system = strands-agents
                |       gen_ai.tool.call.id = tooluse_T2RysxFyiNXdlxSxsrRkq7
                |       gen_ai.tool.description = Look up the capital city of a country.
                |       gen_ai.tool.json_schema = {"properties": {"country": {"description": "Name of the country to look up.", "type": "string"}}, "required": ["country"], "type": "object"}
                |       gen_ai.tool.name = lookup_capital
                |       gen_ai.tool.status = success
    |-- [SPAN] evaluator ToolCalled
    |       gen_ai.evaluation.case.name = japan
    |       gen_ai.evaluation.explanation = tool 'lookup_capital' was called
    |       gen_ai.evaluation.name = ToolCalled
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
    |-- [SPAN] evaluator Contains
    |       gen_ai.evaluation.case.name = japan
    |       gen_ai.evaluation.explanation = actual_output contains 'capital'
    |       gen_ai.evaluation.name = Contains
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
[SPAN] execute_case brazil
    gen_ai.evaluation.case.input = "What is the capital of Brazil?"
    gen_ai.evaluation.case.name = brazil
    |-- [SPAN] task_execution brazil
    |       gen_ai.evaluation.case.name = brazil
    |       gen_ai.evaluation.data.actual_output = "The capital of Brazil is Brasilia.\n"
    |       gen_ai.evaluation.data.expected_output = null
    |       gen_ai.evaluation.data.has_interactions = False
    |       gen_ai.evaluation.data.has_trajectory = True
    |       gen_ai.evaluation.data.input = "What is the capital of Brazil?"
    |       gen_ai.evaluation.task.type = agent_task
        |-- [SPAN] invoke_agent Strands Agents
        |       gen_ai.agent.name = Strands Agents
        |       gen_ai.agent.tools = ["lookup_capital"]
        |       gen_ai.event.end_time = 2026-03-20T18:46:35.948398+00:00
        |       gen_ai.event.start_time = 2026-03-20T18:46:33.714875+00:00
        |       gen_ai.operation.name = invoke_agent
        |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
        |       gen_ai.system = strands-agents
        |       gen_ai.usage.cache_read_input_tokens = 0
        |       gen_ai.usage.cache_write_input_tokens = 0
        |       gen_ai.usage.completion_tokens = 77
        |       gen_ai.usage.input_tokens = 926
        |       gen_ai.usage.output_tokens = 77
        |       gen_ai.usage.prompt_tokens = 926
        |       gen_ai.usage.total_tokens = 1003
        |       system_prompt = You are a geography assistant. Use the lookup_capital tool when asked about capitals. Always include the capital city name in your response.
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = 14c3ca78-57bd-41f8-8e9c-051987584d4d
            |       event_loop.parent_cycle_id = 8984506b-d7e4-43a8-b5c6-a0f158cb2e91
            |       gen_ai.event.end_time = 2026-03-20T18:46:35.948338+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:34.910111+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:35.948252+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:34.910181+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 1012
                |       gen_ai.server.time_to_first_token = 886
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 12
                |       gen_ai.usage.input_tokens = 503
                |       gen_ai.usage.output_tokens = 12
                |       gen_ai.usage.prompt_tokens = 503
                |       gen_ai.usage.total_tokens = 515
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = 8984506b-d7e4-43a8-b5c6-a0f158cb2e91
            |       gen_ai.event.end_time = 2026-03-20T18:46:34.910034+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:33.714969+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:34.909537+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:33.715057+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 1151
                |       gen_ai.server.time_to_first_token = 723
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 65
                |       gen_ai.usage.input_tokens = 423
                |       gen_ai.usage.output_tokens = 65
                |       gen_ai.usage.prompt_tokens = 423
                |       gen_ai.usage.total_tokens = 488
                |-- [SPAN] execute_tool lookup_capital
                |       gen_ai.event.end_time = 2026-03-20T18:46:34.909984+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:34.909705+00:00
                |       gen_ai.operation.name = execute_tool
                |       gen_ai.system = strands-agents
                |       gen_ai.tool.call.id = tooluse_ocmMJitzIwGonot169jSZy
                |       gen_ai.tool.description = Look up the capital city of a country.
                |       gen_ai.tool.json_schema = {"properties": {"country": {"description": "Name of the country to look up.", "type": "string"}}, "required": ["country"], "type": "object"}
                |       gen_ai.tool.name = lookup_capital
                |       gen_ai.tool.status = success
    |-- [SPAN] evaluator ToolCalled
    |       gen_ai.evaluation.case.name = brazil
    |       gen_ai.evaluation.explanation = tool 'lookup_capital' was called
    |       gen_ai.evaluation.name = ToolCalled
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
    |-- [SPAN] evaluator Contains
    |       gen_ai.evaluation.case.name = brazil
    |       gen_ai.evaluation.explanation = actual_output contains 'capital'
    |       gen_ai.evaluation.name = Contains
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True

======================================================================
  ASYNC PATH: Experiment.run_evaluations_async()
  Total spans captured: 30
======================================================================

[SPAN] execute_case france
    gen_ai.evaluation.case.input = "What is the capital of France?"
    gen_ai.evaluation.case.name = france
    |-- [SPAN] task_execution france
    |       gen_ai.evaluation.case.name = france
    |       gen_ai.evaluation.data.actual_output = "The capital of France is Paris.\n"
    |       gen_ai.evaluation.data.expected_output = null
    |       gen_ai.evaluation.data.has_interactions = False
    |       gen_ai.evaluation.data.has_trajectory = True
    |       gen_ai.evaluation.data.input = "What is the capital of France?"
    |       gen_ai.evaluation.task.type = agent_task
        |-- [SPAN] invoke_agent Strands Agents
        |       gen_ai.agent.name = Strands Agents
        |       gen_ai.agent.tools = ["lookup_capital"]
        |       gen_ai.event.end_time = 2026-03-20T18:46:39.663648+00:00
        |       gen_ai.event.start_time = 2026-03-20T18:46:36.086926+00:00
        |       gen_ai.operation.name = invoke_agent
        |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
        |       gen_ai.system = strands-agents
        |       gen_ai.usage.cache_read_input_tokens = 0
        |       gen_ai.usage.cache_write_input_tokens = 0
        |       gen_ai.usage.completion_tokens = 75
        |       gen_ai.usage.input_tokens = 924
        |       gen_ai.usage.output_tokens = 75
        |       gen_ai.usage.prompt_tokens = 924
        |       gen_ai.usage.total_tokens = 999
        |       system_prompt = You are a geography assistant. Use the lookup_capital tool when asked about capitals. Always include the capital city name in your response.
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = 062d635e-98b8-4605-b397-f76ca2c10e3c
            |       event_loop.parent_cycle_id = ec1bb302-425e-4412-8e0e-9f768b1c249f
            |       gen_ai.event.end_time = 2026-03-20T18:46:39.663587+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:37.503761+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:39.663504+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:37.503822+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 2140
                |       gen_ai.server.time_to_first_token = 2159
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 10
                |       gen_ai.usage.input_tokens = 501
                |       gen_ai.usage.output_tokens = 10
                |       gen_ai.usage.prompt_tokens = 501
                |       gen_ai.usage.total_tokens = 511
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = ec1bb302-425e-4412-8e0e-9f768b1c249f
            |       gen_ai.event.end_time = 2026-03-20T18:46:37.503678+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:36.087018+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:37.503181+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:36.087054+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 1372
                |       gen_ai.server.time_to_first_token = 965
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 65
                |       gen_ai.usage.input_tokens = 423
                |       gen_ai.usage.output_tokens = 65
                |       gen_ai.usage.prompt_tokens = 423
                |       gen_ai.usage.total_tokens = 488
                |-- [SPAN] execute_tool lookup_capital
                |       gen_ai.event.end_time = 2026-03-20T18:46:37.503627+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:37.503349+00:00
                |       gen_ai.operation.name = execute_tool
                |       gen_ai.system = strands-agents
                |       gen_ai.tool.call.id = tooluse_QtWdg2gxN8uaRbFAYLb3pr
                |       gen_ai.tool.description = Look up the capital city of a country.
                |       gen_ai.tool.json_schema = {"properties": {"country": {"description": "Name of the country to look up.", "type": "string"}}, "required": ["country"], "type": "object"}
                |       gen_ai.tool.name = lookup_capital
                |       gen_ai.tool.status = success
    |-- [SPAN] evaluator ToolCalled
    |       gen_ai.evaluation.case.name = france
    |       gen_ai.evaluation.explanation = tool 'lookup_capital' was called
    |       gen_ai.evaluation.name = ToolCalled
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
    |-- [SPAN] evaluator Contains
    |       gen_ai.evaluation.case.name = france
    |       gen_ai.evaluation.explanation = actual_output contains 'capital'
    |       gen_ai.evaluation.name = Contains
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
[SPAN] execute_case japan
    gen_ai.evaluation.case.input = "What is the capital of Japan?"
    gen_ai.evaluation.case.name = japan
    |-- [SPAN] task_execution japan
    |       gen_ai.evaluation.case.name = japan
    |       gen_ai.evaluation.data.actual_output = "The capital of Japan is Tokyo.\n"
    |       gen_ai.evaluation.data.expected_output = null
    |       gen_ai.evaluation.data.has_interactions = False
    |       gen_ai.evaluation.data.has_trajectory = True
    |       gen_ai.evaluation.data.input = "What is the capital of Japan?"
    |       gen_ai.evaluation.task.type = agent_task
        |-- [SPAN] invoke_agent Strands Agents
        |       gen_ai.agent.name = Strands Agents
        |       gen_ai.agent.tools = ["lookup_capital"]
        |       gen_ai.event.end_time = 2026-03-20T18:46:42.864078+00:00
        |       gen_ai.event.start_time = 2026-03-20T18:46:39.751310+00:00
        |       gen_ai.operation.name = invoke_agent
        |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
        |       gen_ai.system = strands-agents
        |       gen_ai.usage.cache_read_input_tokens = 0
        |       gen_ai.usage.cache_write_input_tokens = 0
        |       gen_ai.usage.completion_tokens = 75
        |       gen_ai.usage.input_tokens = 924
        |       gen_ai.usage.output_tokens = 75
        |       gen_ai.usage.prompt_tokens = 924
        |       gen_ai.usage.total_tokens = 999
        |       system_prompt = You are a geography assistant. Use the lookup_capital tool when asked about capitals. Always include the capital city name in your response.
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = 193cd4c3-48d8-418b-9f26-dd20a4a28f47
            |       event_loop.parent_cycle_id = 7e0d09d7-69aa-46e7-878b-2f09daef00d4
            |       gen_ai.event.end_time = 2026-03-20T18:46:42.864018+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:41.996427+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:42.863933+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:41.996482+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 856
                |       gen_ai.server.time_to_first_token = 715
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 10
                |       gen_ai.usage.input_tokens = 501
                |       gen_ai.usage.output_tokens = 10
                |       gen_ai.usage.prompt_tokens = 501
                |       gen_ai.usage.total_tokens = 511
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = 7e0d09d7-69aa-46e7-878b-2f09daef00d4
            |       gen_ai.event.end_time = 2026-03-20T18:46:41.996356+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:39.751403+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:41.995833+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:39.751489+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 2202
                |       gen_ai.server.time_to_first_token = 1593
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 65
                |       gen_ai.usage.input_tokens = 423
                |       gen_ai.usage.output_tokens = 65
                |       gen_ai.usage.prompt_tokens = 423
                |       gen_ai.usage.total_tokens = 488
                |-- [SPAN] execute_tool lookup_capital
                |       gen_ai.event.end_time = 2026-03-20T18:46:41.996308+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:41.996007+00:00
                |       gen_ai.operation.name = execute_tool
                |       gen_ai.system = strands-agents
                |       gen_ai.tool.call.id = tooluse_9lok4ouJShIuwS5d7Wxdvx
                |       gen_ai.tool.description = Look up the capital city of a country.
                |       gen_ai.tool.json_schema = {"properties": {"country": {"description": "Name of the country to look up.", "type": "string"}}, "required": ["country"], "type": "object"}
                |       gen_ai.tool.name = lookup_capital
                |       gen_ai.tool.status = success
    |-- [SPAN] evaluator ToolCalled
    |       gen_ai.evaluation.case.name = japan
    |       gen_ai.evaluation.explanation = tool 'lookup_capital' was called
    |       gen_ai.evaluation.name = ToolCalled
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
    |-- [SPAN] evaluator Contains
    |       gen_ai.evaluation.case.name = japan
    |       gen_ai.evaluation.explanation = actual_output contains 'capital'
    |       gen_ai.evaluation.name = Contains
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
[SPAN] execute_case brazil
    gen_ai.evaluation.case.input = "What is the capital of Brazil?"
    gen_ai.evaluation.case.name = brazil
    |-- [SPAN] task_execution brazil
    |       gen_ai.evaluation.case.name = brazil
    |       gen_ai.evaluation.data.actual_output = "The capital of Brazil is Brasilia.\n"
    |       gen_ai.evaluation.data.expected_output = null
    |       gen_ai.evaluation.data.has_interactions = False
    |       gen_ai.evaluation.data.has_trajectory = True
    |       gen_ai.evaluation.data.input = "What is the capital of Brazil?"
    |       gen_ai.evaluation.task.type = agent_task
        |-- [SPAN] invoke_agent Strands Agents
        |       gen_ai.agent.name = Strands Agents
        |       gen_ai.agent.tools = ["lookup_capital"]
        |       gen_ai.event.end_time = 2026-03-20T18:46:46.372294+00:00
        |       gen_ai.event.start_time = 2026-03-20T18:46:42.954357+00:00
        |       gen_ai.operation.name = invoke_agent
        |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
        |       gen_ai.system = strands-agents
        |       gen_ai.usage.cache_read_input_tokens = 0
        |       gen_ai.usage.cache_write_input_tokens = 0
        |       gen_ai.usage.completion_tokens = 77
        |       gen_ai.usage.input_tokens = 926
        |       gen_ai.usage.output_tokens = 77
        |       gen_ai.usage.prompt_tokens = 926
        |       gen_ai.usage.total_tokens = 1003
        |       system_prompt = You are a geography assistant. Use the lookup_capital tool when asked about capitals. Always include the capital city name in your response.
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = f9109e29-9fe0-4153-8d04-2366494ba9d1
            |       event_loop.parent_cycle_id = c724fb50-146f-4d15-b358-430690ed7107
            |       gen_ai.event.end_time = 2026-03-20T18:46:46.372235+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:44.126301+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:46.372148+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:44.126359+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 2234
                |       gen_ai.server.time_to_first_token = 2095
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 12
                |       gen_ai.usage.input_tokens = 503
                |       gen_ai.usage.output_tokens = 12
                |       gen_ai.usage.prompt_tokens = 503
                |       gen_ai.usage.total_tokens = 515
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = c724fb50-146f-4d15-b358-430690ed7107
            |       gen_ai.event.end_time = 2026-03-20T18:46:44.126225+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:42.954465+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:44.125724+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:42.954505+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 1097
                |       gen_ai.server.time_to_first_token = 891
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 65
                |       gen_ai.usage.input_tokens = 423
                |       gen_ai.usage.output_tokens = 65
                |       gen_ai.usage.prompt_tokens = 423
                |       gen_ai.usage.total_tokens = 488
                |-- [SPAN] execute_tool lookup_capital
                |       gen_ai.event.end_time = 2026-03-20T18:46:44.126174+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:44.125893+00:00
                |       gen_ai.operation.name = execute_tool
                |       gen_ai.system = strands-agents
                |       gen_ai.tool.call.id = tooluse_uSx0DCWCDjgxbnJoD1tiyY
                |       gen_ai.tool.description = Look up the capital city of a country.
                |       gen_ai.tool.json_schema = {"properties": {"country": {"description": "Name of the country to look up.", "type": "string"}}, "required": ["country"], "type": "object"}
                |       gen_ai.tool.name = lookup_capital
                |       gen_ai.tool.status = success
    |-- [SPAN] evaluator ToolCalled
    |       gen_ai.evaluation.case.name = brazil
    |       gen_ai.evaluation.explanation = tool 'lookup_capital' was called
    |       gen_ai.evaluation.name = ToolCalled
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
    |-- [SPAN] evaluator Contains
    |       gen_ai.evaluation.case.name = brazil
    |       gen_ai.evaluation.explanation = actual_output contains 'capital'
    |       gen_ai.evaluation.name = Contains
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
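
For readers mapping the trace back to the refactor, the following is a minimal sketch of the pattern the PR title describes: a base `aevaluate()` that delegates to the synchronous `evaluate()` via `asyncio.to_thread`. The `Result` dataclass, the simplified `Evaluator` signature, and the `Contains` subclass are hypothetical stand-ins written for illustration; they are not the actual `strands_evals` classes, whose signatures may differ.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical result type; fields loosely mirror the score.value,
    # test_pass, and explanation attributes seen on the evaluator spans.
    score: float
    test_pass: bool
    explanation: str


class Evaluator:
    # Simplified, assumed base class (not the real strands_evals API).
    def evaluate(self, actual_output: str) -> Result:
        raise NotImplementedError

    async def aevaluate(self, actual_output: str) -> Result:
        # Default async path: run the sync evaluate() in a worker thread
        # so subclasses only need to implement evaluate().
        return await asyncio.to_thread(self.evaluate, actual_output)


class Contains(Evaluator):
    # Hypothetical evaluator that checks for a substring in the output.
    def __init__(self, needle: str) -> None:
        self.needle = needle

    def evaluate(self, actual_output: str) -> Result:
        found = self.needle in actual_output
        verb = "contains" if found else "does not contain"
        return Result(
            score=1.0 if found else 0.0,
            test_pass=found,
            explanation=f"actual_output {verb} '{self.needle}'",
        )
```

With this shape, `asyncio.run(Contains("capital").aevaluate("The capital of Japan is Tokyo.\n"))` goes through the inherited `aevaluate()` without `Contains` defining any async code itself. An evaluator that is natively async could still override `aevaluate()` directly.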

afarntrog added 2 commits March 20, 2026 14:02

afarntrog had a problem deploying to auto-approve March 20, 2026 18:14 — with GitHub Actions Failure

afarntrog commented Mar 20, 2026

src/strands_evals/experiment.py (resolved)

afarntrog commented Mar 20, 2026

src/strands_evals/experiment.py (resolved)

afarntrog temporarily deployed to auto-approve March 20, 2026 19:31 — with GitHub Actions Inactive

poshinchen approved these changes Mar 23, 2026

afarntrog merged commit ae0e434 into strands-agents:main Mar 23, 2026
13 checks passed

afarntrog merged 3 commits into strands-agents:main from afarntrog:wip/refactor-run-evaluations-delegates-to-async

2 participants