
refactor: unify sync/async evaluation by defaulting aevaluate to asyncio.to_thread#173

Merged
afarntrog merged 3 commits into strands-agents:main from afarntrog:wip/refactor-run-evaluations-delegates-to-async
Mar 23, 2026

@afarntrog
Contributor

Description

refactor: unify sync/async evaluation by defaulting aevaluate to asyncio.to_thread

  • Change base Evaluator.aevaluate() to delegate to evaluate() via
    asyncio.to_thread instead of raising NotImplementedError
  • Remove duplicated sync/async helper methods from Experiment class
    (_record_evaluator_result, _run_task, and related run logic)
  • Consolidate experiment execution paths to reduce code duplication
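
A minimal sketch of the first bullet, assuming a simplified base class. The `Evaluator`, `ContainsWord`, and `score_all` names and signatures here are hypothetical stand-ins for illustration, not the actual strands_evals classes. It shows how a default `aevaluate()` can hand the blocking `evaluate()` to a worker thread via `asyncio.to_thread`, so sync-only evaluators work on the async path without extra code:

```python
import asyncio
from abc import ABC, abstractmethod


class Evaluator(ABC):
    """Hypothetical stand-in for the base evaluator; not the strands_evals class."""

    @abstractmethod
    def evaluate(self, output: str) -> float:
        """Synchronous scoring; subclasses must implement this."""

    async def aevaluate(self, output: str) -> float:
        # Default async path: run the blocking evaluate() in a worker thread
        # so the event loop stays free, instead of raising NotImplementedError.
        return await asyncio.to_thread(self.evaluate, output)


class ContainsWord(Evaluator):
    """Sync-only evaluator (hypothetical); inherits the default aevaluate()."""

    def __init__(self, word: str) -> None:
        self.word = word

    def evaluate(self, output: str) -> float:
        return 1.0 if self.word in output else 0.0


async def score_all(evaluator: Evaluator, outputs: list[str]) -> list[float]:
    # The calls are awaited together; the sync work itself happens in threads.
    return list(await asyncio.gather(*(evaluator.aevaluate(o) for o in outputs)))


scores = asyncio.run(
    score_all(ContainsWord("capital"), ["The capital is Paris.", "No idea."])
)
print(scores)
```

An evaluator that is natively asynchronous can still override `aevaluate()`; the default only covers the case where a subclass provides `evaluate()` alone.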

Related Issues

#172

Documentation PR

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Add semantic convention attributes to evaluation spans for improved
observability. Wraps task execution in a dedicated span
and enriches case, evaluator, and score spans with structured attributes
such as case name, input, evaluator type, and score results.
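
A small sketch of why spans started inside a thread-delegated `evaluate()` can still nest under the caller's span: `asyncio.to_thread` propagates the current `contextvars` context into the worker thread. This assumes the tracer tracks the active span through `contextvars` (to my knowledge the default in OpenTelemetry's Python implementation). The `current_span` variable below is a hypothetical stand-in, not an OpenTelemetry API:

```python
import asyncio
import contextvars

# Hypothetical stand-in for "the current span"; not an OpenTelemetry API.
current_span = contextvars.ContextVar("current_span", default="<none>")


def blocking_evaluate() -> str:
    # Runs in a worker thread but still sees the caller's context variable.
    return current_span.get()


async def run_case(name: str) -> str:
    current_span.set(f"execute_case {name}")
    # asyncio.to_thread copies the current context into the worker thread.
    return await asyncio.to_thread(blocking_evaluate)


async def main() -> list[str]:
    # gather() wraps each coroutine in its own task, and each task gets its own
    # copy of the context, so values do not leak between cases.
    return list(await asyncio.gather(run_case("france"), run_case("japan")))


seen = asyncio.run(main())
print(seen)
```

Each worker thread reports the value set by its own case, which mirrors how a per-case span would remain the parent of the evaluator spans created in the thread.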
@afarntrog
Contributor Author

The following script runs the experiment through both the sync and async methods to show the generated spans.

"""Script to prove the OpenTelemetry span hierarchy in Experiment.

Runs both sync and async evaluation paths using StrandsEvalsTelemetry's
InMemorySpanExporter, then writes the full span tree to a timestamped
span_hierarchy_proof_<timestamp>.txt file next to this script.
"""

import asyncio
from pathlib import Path

from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from strands import Agent, tool

from strands_evals import Case, Experiment
from strands_evals.evaluators.deterministic import Contains, ToolCalled
from strands_evals.extractors.tools_use_extractor import extract_agent_tools_used_from_messages
from strands_evals.telemetry import StrandsEvalsTelemetry

# Set up telemetry with in-memory exporter
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
exporter: InMemorySpanExporter = telemetry.in_memory_exporter


# --- Tool ---
@tool
def lookup_capital(country: str) -> str:
    """Look up the capital city of a country.

    Args:
        country: Name of the country to look up.
    """
    capitals = {
        "France": "Paris",
        "Japan": "Tokyo",
        "Brazil": "Brasilia",
        "Australia": "Canberra",
    }
    return capitals.get(country, f"Unknown capital for {country}")


# --- Task function using a real agent ---
def geography_agent_task(case: Case) -> dict:
    """Agent with a capital-lookup tool answering geography questions."""
    agent = Agent(
        system_prompt="You are a geography assistant. Use the lookup_capital tool when asked about capitals. "
        "Always include the capital city name in your response.",
        tools=[lookup_capital],
        callback_handler=None,
    )
    response = agent(case.input)
    trajectory = extract_agent_tools_used_from_messages(agent.messages)
    tool_names = [t["name"] for t in trajectory]
    return {"output": str(response), "trajectory": tool_names}


# --- Build test cases ---
cases = [
    Case(name="france", input="What is the capital of France?", expected_output=None),
    Case(name="japan", input="What is the capital of Japan?", expected_output=None),
    Case(name="brazil", input="What is the capital of Brazil?", expected_output=None),
]

evaluators = [
    ToolCalled(tool_name="lookup_capital"),
    Contains(value="capital"),
]


def build_span_tree(spans):
    """Organize flat span list into a parent-child tree."""
    by_id = {}
    children = {}

    for s in spans:
        sid = s.get_span_context().span_id
        by_id[sid] = s
        children.setdefault(sid, [])

    roots = []
    for s in spans:
        sid = s.get_span_context().span_id
        pid = s.parent.span_id if s.parent else None
        if pid and pid in by_id:
            children[pid].append(sid)
        else:
            roots.append(sid)

    return by_id, children, roots


def format_tree(by_id, children, roots, indent=0):
    """Recursively format the span tree as indented text."""
    lines = []
    for sid in roots:
        s = by_id[sid]
        attrs = dict(s.attributes) if s.attributes else {}
        prefix = "    " * indent
        marker = "|-- " if indent > 0 else ""
        lines.append(f"{prefix}{marker}[SPAN] {s.name}")
        if attrs:
            for k, v in sorted(attrs.items()):
                lines.append(f"{prefix}{'|   ' if indent > 0 else ''}    {k} = {v}")
        for child_id in children.get(sid, []):
            lines.extend(format_tree(by_id, children, [child_id], indent + 1))
    return lines


def dump_spans(label):
    """Dump current spans, return lines, and clear the exporter."""
    spans = exporter.get_finished_spans()
    print('begin non formatted finished spans')
    print(100*'=')
    for s in spans:
        print(s.to_json(indent=2))
    print(100*'=')
    print('end non formatted finished spans')
    by_id, children_map, roots = build_span_tree(spans)

    lines = [
        "=" * 70,
        f"  {label}",
        f"  Total spans captured: {len(spans)}",
        "=" * 70,
        "",
    ]
    lines.extend(format_tree(by_id, children_map, roots))
    lines.append("")
    exporter.clear()
    return lines


# --- Run both paths and collect output ---
output_lines = [
    "SPAN HIERARCHY PROOF",
    f"Generated by: {Path(__file__).name}",
    "",
    "This file proves the OpenTelemetry span parent-child relationships",
    "produced by Experiment.run_evaluations() and run_evaluations_async().",
    "",
    "Indented spans are CHILDREN of the span above them.",
    "Top-level (non-indented) spans have no parent in the evaluation scope.",
    "",
]

# 1. Sync path
exp_sync = Experiment(cases=cases, evaluators=evaluators)
exp_sync.run_evaluations(geography_agent_task)
output_lines.extend(dump_spans("SYNC PATH: Experiment.run_evaluations()"))

# 2. Async path
exp_async = Experiment(cases=cases, evaluators=evaluators)
asyncio.run(exp_async.run_evaluations_async(geography_agent_task, max_workers=1))
output_lines.extend(dump_spans("ASYNC PATH: Experiment.run_evaluations_async()"))

# --- Write to file ---
# Add a timestamp to the filename so repeated runs do not overwrite earlier output
from datetime import datetime
dt_str = datetime.now().strftime("%Y%m%d_%H%M%S")
out_path = Path(__file__).parent / f"span_hierarchy_proof_{dt_str}.txt"
text = "\n".join(output_lines)
out_path.write_text(text)
print(text)
print(f"\nWritten to {out_path}")

Output==>

SPAN HIERARCHY PROOF
Generated by: prove_span_hierarchy.py

This file proves the OpenTelemetry span parent-child relationships
produced by Experiment.run_evaluations() and run_evaluations_async().

Indented spans are CHILDREN of the span above them.
Top-level (non-indented) spans have no parent in the evaluation scope.

======================================================================
  SYNC PATH: Experiment.run_evaluations()
  Total spans captured: 30
======================================================================

[SPAN] execute_case france
    gen_ai.evaluation.case.input = "What is the capital of France?"
    gen_ai.evaluation.case.name = france
    |-- [SPAN] task_execution france
    |       gen_ai.evaluation.case.name = france
    |       gen_ai.evaluation.data.actual_output = "The capital of France is Paris.\n"
    |       gen_ai.evaluation.data.expected_output = null
    |       gen_ai.evaluation.data.has_interactions = False
    |       gen_ai.evaluation.data.has_trajectory = True
    |       gen_ai.evaluation.data.input = "What is the capital of France?"
    |       gen_ai.evaluation.task.type = agent_task
        |-- [SPAN] invoke_agent Strands Agents
        |       gen_ai.agent.name = Strands Agents
        |       gen_ai.agent.tools = ["lookup_capital"]
        |       gen_ai.event.end_time = 2026-03-20T18:46:29.410186+00:00
        |       gen_ai.event.start_time = 2026-03-20T18:46:27.306517+00:00
        |       gen_ai.operation.name = invoke_agent
        |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
        |       gen_ai.system = strands-agents
        |       gen_ai.usage.cache_read_input_tokens = 0
        |       gen_ai.usage.cache_write_input_tokens = 0
        |       gen_ai.usage.completion_tokens = 63
        |       gen_ai.usage.input_tokens = 913
        |       gen_ai.usage.output_tokens = 63
        |       gen_ai.usage.prompt_tokens = 913
        |       gen_ai.usage.total_tokens = 976
        |       system_prompt = You are a geography assistant. Use the lookup_capital tool when asked about capitals. Always include the capital city name in your response.
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = c7ac92f5-40b5-4cdd-a159-b131aa8b6ebf
            |       event_loop.parent_cycle_id = 21687439-2fe9-4741-846f-bb33916b4e02
            |       gen_ai.event.end_time = 2026-03-20T18:46:29.410113+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:28.648806+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:29.410019+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:28.648874+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 749
                |       gen_ai.server.time_to_first_token = 654
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 10
                |       gen_ai.usage.input_tokens = 490
                |       gen_ai.usage.output_tokens = 10
                |       gen_ai.usage.prompt_tokens = 490
                |       gen_ai.usage.total_tokens = 500
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = 21687439-2fe9-4741-846f-bb33916b4e02
            |       gen_ai.event.end_time = 2026-03-20T18:46:28.648727+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:27.307047+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:28.647986+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:27.307089+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 1249
                |       gen_ai.server.time_to_first_token = 1017
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 53
                |       gen_ai.usage.input_tokens = 423
                |       gen_ai.usage.output_tokens = 53
                |       gen_ai.usage.prompt_tokens = 423
                |       gen_ai.usage.total_tokens = 476
                |-- [SPAN] execute_tool lookup_capital
                |       gen_ai.event.end_time = 2026-03-20T18:46:28.648670+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:28.648290+00:00
                |       gen_ai.operation.name = execute_tool
                |       gen_ai.system = strands-agents
                |       gen_ai.tool.call.id = tooluse_sEAVGZ0ybLNWYVWcVuZx98
                |       gen_ai.tool.description = Look up the capital city of a country.
                |       gen_ai.tool.json_schema = {"properties": {"country": {"description": "Name of the country to look up.", "type": "string"}}, "required": ["country"], "type": "object"}
                |       gen_ai.tool.name = lookup_capital
                |       gen_ai.tool.status = success
    |-- [SPAN] evaluator ToolCalled
    |       gen_ai.evaluation.case.name = france
    |       gen_ai.evaluation.explanation = tool 'lookup_capital' was called
    |       gen_ai.evaluation.name = ToolCalled
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
    |-- [SPAN] evaluator Contains
    |       gen_ai.evaluation.case.name = france
    |       gen_ai.evaluation.explanation = actual_output contains 'capital'
    |       gen_ai.evaluation.name = Contains
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
[SPAN] execute_case japan
    gen_ai.evaluation.case.input = "What is the capital of Japan?"
    gen_ai.evaluation.case.name = japan
    |-- [SPAN] task_execution japan
    |       gen_ai.evaluation.case.name = japan
    |       gen_ai.evaluation.data.actual_output = "The capital of Japan is Tokyo.\n"
    |       gen_ai.evaluation.data.expected_output = null
    |       gen_ai.evaluation.data.has_interactions = False
    |       gen_ai.evaluation.data.has_trajectory = True
    |       gen_ai.evaluation.data.input = "What is the capital of Japan?"
    |       gen_ai.evaluation.task.type = agent_task
        |-- [SPAN] invoke_agent Strands Agents
        |       gen_ai.agent.name = Strands Agents
        |       gen_ai.agent.tools = ["lookup_capital"]
        |       gen_ai.event.end_time = 2026-03-20T18:46:33.622556+00:00
        |       gen_ai.event.start_time = 2026-03-20T18:46:29.531875+00:00
        |       gen_ai.operation.name = invoke_agent
        |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
        |       gen_ai.system = strands-agents
        |       gen_ai.usage.cache_read_input_tokens = 0
        |       gen_ai.usage.cache_write_input_tokens = 0
        |       gen_ai.usage.completion_tokens = 75
        |       gen_ai.usage.input_tokens = 924
        |       gen_ai.usage.output_tokens = 75
        |       gen_ai.usage.prompt_tokens = 924
        |       gen_ai.usage.total_tokens = 999
        |       system_prompt = You are a geography assistant. Use the lookup_capital tool when asked about capitals. Always include the capital city name in your response.
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = 44cc2550-89d2-4b00-8ddb-2a8c8a0314c7
            |       event_loop.parent_cycle_id = 0910e1cf-d22c-4ea8-a286-1dfb87af4266
            |       gen_ai.event.end_time = 2026-03-20T18:46:33.622495+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:31.968513+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:33.622410+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:31.968577+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 1642
                |       gen_ai.server.time_to_first_token = 1564
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 10
                |       gen_ai.usage.input_tokens = 501
                |       gen_ai.usage.output_tokens = 10
                |       gen_ai.usage.prompt_tokens = 501
                |       gen_ai.usage.total_tokens = 511
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = 0910e1cf-d22c-4ea8-a286-1dfb87af4266
            |       gen_ai.event.end_time = 2026-03-20T18:46:31.968427+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:29.531980+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:31.967888+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:29.532021+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 2392
                |       gen_ai.server.time_to_first_token = 2013
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 65
                |       gen_ai.usage.input_tokens = 423
                |       gen_ai.usage.output_tokens = 65
                |       gen_ai.usage.prompt_tokens = 423
                |       gen_ai.usage.total_tokens = 488
                |-- [SPAN] execute_tool lookup_capital
                |       gen_ai.event.end_time = 2026-03-20T18:46:31.968374+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:31.968074+00:00
                |       gen_ai.operation.name = execute_tool
                |       gen_ai.system = strands-agents
                |       gen_ai.tool.call.id = tooluse_T2RysxFyiNXdlxSxsrRkq7
                |       gen_ai.tool.description = Look up the capital city of a country.
                |       gen_ai.tool.json_schema = {"properties": {"country": {"description": "Name of the country to look up.", "type": "string"}}, "required": ["country"], "type": "object"}
                |       gen_ai.tool.name = lookup_capital
                |       gen_ai.tool.status = success
    |-- [SPAN] evaluator ToolCalled
    |       gen_ai.evaluation.case.name = japan
    |       gen_ai.evaluation.explanation = tool 'lookup_capital' was called
    |       gen_ai.evaluation.name = ToolCalled
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
    |-- [SPAN] evaluator Contains
    |       gen_ai.evaluation.case.name = japan
    |       gen_ai.evaluation.explanation = actual_output contains 'capital'
    |       gen_ai.evaluation.name = Contains
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
[SPAN] execute_case brazil
    gen_ai.evaluation.case.input = "What is the capital of Brazil?"
    gen_ai.evaluation.case.name = brazil
    |-- [SPAN] task_execution brazil
    |       gen_ai.evaluation.case.name = brazil
    |       gen_ai.evaluation.data.actual_output = "The capital of Brazil is Brasilia.\n"
    |       gen_ai.evaluation.data.expected_output = null
    |       gen_ai.evaluation.data.has_interactions = False
    |       gen_ai.evaluation.data.has_trajectory = True
    |       gen_ai.evaluation.data.input = "What is the capital of Brazil?"
    |       gen_ai.evaluation.task.type = agent_task
        |-- [SPAN] invoke_agent Strands Agents
        |       gen_ai.agent.name = Strands Agents
        |       gen_ai.agent.tools = ["lookup_capital"]
        |       gen_ai.event.end_time = 2026-03-20T18:46:35.948398+00:00
        |       gen_ai.event.start_time = 2026-03-20T18:46:33.714875+00:00
        |       gen_ai.operation.name = invoke_agent
        |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
        |       gen_ai.system = strands-agents
        |       gen_ai.usage.cache_read_input_tokens = 0
        |       gen_ai.usage.cache_write_input_tokens = 0
        |       gen_ai.usage.completion_tokens = 77
        |       gen_ai.usage.input_tokens = 926
        |       gen_ai.usage.output_tokens = 77
        |       gen_ai.usage.prompt_tokens = 926
        |       gen_ai.usage.total_tokens = 1003
        |       system_prompt = You are a geography assistant. Use the lookup_capital tool when asked about capitals. Always include the capital city name in your response.
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = 14c3ca78-57bd-41f8-8e9c-051987584d4d
            |       event_loop.parent_cycle_id = 8984506b-d7e4-43a8-b5c6-a0f158cb2e91
            |       gen_ai.event.end_time = 2026-03-20T18:46:35.948338+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:34.910111+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:35.948252+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:34.910181+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 1012
                |       gen_ai.server.time_to_first_token = 886
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 12
                |       gen_ai.usage.input_tokens = 503
                |       gen_ai.usage.output_tokens = 12
                |       gen_ai.usage.prompt_tokens = 503
                |       gen_ai.usage.total_tokens = 515
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = 8984506b-d7e4-43a8-b5c6-a0f158cb2e91
            |       gen_ai.event.end_time = 2026-03-20T18:46:34.910034+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:33.714969+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:34.909537+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:33.715057+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 1151
                |       gen_ai.server.time_to_first_token = 723
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 65
                |       gen_ai.usage.input_tokens = 423
                |       gen_ai.usage.output_tokens = 65
                |       gen_ai.usage.prompt_tokens = 423
                |       gen_ai.usage.total_tokens = 488
                |-- [SPAN] execute_tool lookup_capital
                |       gen_ai.event.end_time = 2026-03-20T18:46:34.909984+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:34.909705+00:00
                |       gen_ai.operation.name = execute_tool
                |       gen_ai.system = strands-agents
                |       gen_ai.tool.call.id = tooluse_ocmMJitzIwGonot169jSZy
                |       gen_ai.tool.description = Look up the capital city of a country.
                |       gen_ai.tool.json_schema = {"properties": {"country": {"description": "Name of the country to look up.", "type": "string"}}, "required": ["country"], "type": "object"}
                |       gen_ai.tool.name = lookup_capital
                |       gen_ai.tool.status = success
    |-- [SPAN] evaluator ToolCalled
    |       gen_ai.evaluation.case.name = brazil
    |       gen_ai.evaluation.explanation = tool 'lookup_capital' was called
    |       gen_ai.evaluation.name = ToolCalled
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
    |-- [SPAN] evaluator Contains
    |       gen_ai.evaluation.case.name = brazil
    |       gen_ai.evaluation.explanation = actual_output contains 'capital'
    |       gen_ai.evaluation.name = Contains
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True

======================================================================
  ASYNC PATH: Experiment.run_evaluations_async()
  Total spans captured: 30
======================================================================

[SPAN] execute_case france
    gen_ai.evaluation.case.input = "What is the capital of France?"
    gen_ai.evaluation.case.name = france
    |-- [SPAN] task_execution france
    |       gen_ai.evaluation.case.name = france
    |       gen_ai.evaluation.data.actual_output = "The capital of France is Paris.\n"
    |       gen_ai.evaluation.data.expected_output = null
    |       gen_ai.evaluation.data.has_interactions = False
    |       gen_ai.evaluation.data.has_trajectory = True
    |       gen_ai.evaluation.data.input = "What is the capital of France?"
    |       gen_ai.evaluation.task.type = agent_task
        |-- [SPAN] invoke_agent Strands Agents
        |       gen_ai.agent.name = Strands Agents
        |       gen_ai.agent.tools = ["lookup_capital"]
        |       gen_ai.event.end_time = 2026-03-20T18:46:39.663648+00:00
        |       gen_ai.event.start_time = 2026-03-20T18:46:36.086926+00:00
        |       gen_ai.operation.name = invoke_agent
        |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
        |       gen_ai.system = strands-agents
        |       gen_ai.usage.cache_read_input_tokens = 0
        |       gen_ai.usage.cache_write_input_tokens = 0
        |       gen_ai.usage.completion_tokens = 75
        |       gen_ai.usage.input_tokens = 924
        |       gen_ai.usage.output_tokens = 75
        |       gen_ai.usage.prompt_tokens = 924
        |       gen_ai.usage.total_tokens = 999
        |       system_prompt = You are a geography assistant. Use the lookup_capital tool when asked about capitals. Always include the capital city name in your response.
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = 062d635e-98b8-4605-b397-f76ca2c10e3c
            |       event_loop.parent_cycle_id = ec1bb302-425e-4412-8e0e-9f768b1c249f
            |       gen_ai.event.end_time = 2026-03-20T18:46:39.663587+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:37.503761+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:39.663504+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:37.503822+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 2140
                |       gen_ai.server.time_to_first_token = 2159
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 10
                |       gen_ai.usage.input_tokens = 501
                |       gen_ai.usage.output_tokens = 10
                |       gen_ai.usage.prompt_tokens = 501
                |       gen_ai.usage.total_tokens = 511
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = ec1bb302-425e-4412-8e0e-9f768b1c249f
            |       gen_ai.event.end_time = 2026-03-20T18:46:37.503678+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:36.087018+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:37.503181+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:36.087054+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 1372
                |       gen_ai.server.time_to_first_token = 965
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 65
                |       gen_ai.usage.input_tokens = 423
                |       gen_ai.usage.output_tokens = 65
                |       gen_ai.usage.prompt_tokens = 423
                |       gen_ai.usage.total_tokens = 488
                |-- [SPAN] execute_tool lookup_capital
                |       gen_ai.event.end_time = 2026-03-20T18:46:37.503627+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:37.503349+00:00
                |       gen_ai.operation.name = execute_tool
                |       gen_ai.system = strands-agents
                |       gen_ai.tool.call.id = tooluse_QtWdg2gxN8uaRbFAYLb3pr
                |       gen_ai.tool.description = Look up the capital city of a country.
                |       gen_ai.tool.json_schema = {"properties": {"country": {"description": "Name of the country to look up.", "type": "string"}}, "required": ["country"], "type": "object"}
                |       gen_ai.tool.name = lookup_capital
                |       gen_ai.tool.status = success
    |-- [SPAN] evaluator ToolCalled
    |       gen_ai.evaluation.case.name = france
    |       gen_ai.evaluation.explanation = tool 'lookup_capital' was called
    |       gen_ai.evaluation.name = ToolCalled
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
    |-- [SPAN] evaluator Contains
    |       gen_ai.evaluation.case.name = france
    |       gen_ai.evaluation.explanation = actual_output contains 'capital'
    |       gen_ai.evaluation.name = Contains
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
[SPAN] execute_case japan
    gen_ai.evaluation.case.input = "What is the capital of Japan?"
    gen_ai.evaluation.case.name = japan
    |-- [SPAN] task_execution japan
    |       gen_ai.evaluation.case.name = japan
    |       gen_ai.evaluation.data.actual_output = "The capital of Japan is Tokyo.\n"
    |       gen_ai.evaluation.data.expected_output = null
    |       gen_ai.evaluation.data.has_interactions = False
    |       gen_ai.evaluation.data.has_trajectory = True
    |       gen_ai.evaluation.data.input = "What is the capital of Japan?"
    |       gen_ai.evaluation.task.type = agent_task
        |-- [SPAN] invoke_agent Strands Agents
        |       gen_ai.agent.name = Strands Agents
        |       gen_ai.agent.tools = ["lookup_capital"]
        |       gen_ai.event.end_time = 2026-03-20T18:46:42.864078+00:00
        |       gen_ai.event.start_time = 2026-03-20T18:46:39.751310+00:00
        |       gen_ai.operation.name = invoke_agent
        |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
        |       gen_ai.system = strands-agents
        |       gen_ai.usage.cache_read_input_tokens = 0
        |       gen_ai.usage.cache_write_input_tokens = 0
        |       gen_ai.usage.completion_tokens = 75
        |       gen_ai.usage.input_tokens = 924
        |       gen_ai.usage.output_tokens = 75
        |       gen_ai.usage.prompt_tokens = 924
        |       gen_ai.usage.total_tokens = 999
        |       system_prompt = You are a geography assistant. Use the lookup_capital tool when asked about capitals. Always include the capital city name in your response.
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = 193cd4c3-48d8-418b-9f26-dd20a4a28f47
            |       event_loop.parent_cycle_id = 7e0d09d7-69aa-46e7-878b-2f09daef00d4
            |       gen_ai.event.end_time = 2026-03-20T18:46:42.864018+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:41.996427+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:42.863933+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:41.996482+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 856
                |       gen_ai.server.time_to_first_token = 715
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 10
                |       gen_ai.usage.input_tokens = 501
                |       gen_ai.usage.output_tokens = 10
                |       gen_ai.usage.prompt_tokens = 501
                |       gen_ai.usage.total_tokens = 511
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = 7e0d09d7-69aa-46e7-878b-2f09daef00d4
            |       gen_ai.event.end_time = 2026-03-20T18:46:41.996356+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:39.751403+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:41.995833+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:39.751489+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 2202
                |       gen_ai.server.time_to_first_token = 1593
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 65
                |       gen_ai.usage.input_tokens = 423
                |       gen_ai.usage.output_tokens = 65
                |       gen_ai.usage.prompt_tokens = 423
                |       gen_ai.usage.total_tokens = 488
                |-- [SPAN] execute_tool lookup_capital
                |       gen_ai.event.end_time = 2026-03-20T18:46:41.996308+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:41.996007+00:00
                |       gen_ai.operation.name = execute_tool
                |       gen_ai.system = strands-agents
                |       gen_ai.tool.call.id = tooluse_9lok4ouJShIuwS5d7Wxdvx
                |       gen_ai.tool.description = Look up the capital city of a country.
                |       gen_ai.tool.json_schema = {"properties": {"country": {"description": "Name of the country to look up.", "type": "string"}}, "required": ["country"], "type": "object"}
                |       gen_ai.tool.name = lookup_capital
                |       gen_ai.tool.status = success
    |-- [SPAN] evaluator ToolCalled
    |       gen_ai.evaluation.case.name = japan
    |       gen_ai.evaluation.explanation = tool 'lookup_capital' was called
    |       gen_ai.evaluation.name = ToolCalled
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
    |-- [SPAN] evaluator Contains
    |       gen_ai.evaluation.case.name = japan
    |       gen_ai.evaluation.explanation = actual_output contains 'capital'
    |       gen_ai.evaluation.name = Contains
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
[SPAN] execute_case brazil
    gen_ai.evaluation.case.input = "What is the capital of Brazil?"
    gen_ai.evaluation.case.name = brazil
    |-- [SPAN] task_execution brazil
    |       gen_ai.evaluation.case.name = brazil
    |       gen_ai.evaluation.data.actual_output = "The capital of Brazil is Brasilia.\n"
    |       gen_ai.evaluation.data.expected_output = null
    |       gen_ai.evaluation.data.has_interactions = False
    |       gen_ai.evaluation.data.has_trajectory = True
    |       gen_ai.evaluation.data.input = "What is the capital of Brazil?"
    |       gen_ai.evaluation.task.type = agent_task
        |-- [SPAN] invoke_agent Strands Agents
        |       gen_ai.agent.name = Strands Agents
        |       gen_ai.agent.tools = ["lookup_capital"]
        |       gen_ai.event.end_time = 2026-03-20T18:46:46.372294+00:00
        |       gen_ai.event.start_time = 2026-03-20T18:46:42.954357+00:00
        |       gen_ai.operation.name = invoke_agent
        |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
        |       gen_ai.system = strands-agents
        |       gen_ai.usage.cache_read_input_tokens = 0
        |       gen_ai.usage.cache_write_input_tokens = 0
        |       gen_ai.usage.completion_tokens = 77
        |       gen_ai.usage.input_tokens = 926
        |       gen_ai.usage.output_tokens = 77
        |       gen_ai.usage.prompt_tokens = 926
        |       gen_ai.usage.total_tokens = 1003
        |       system_prompt = You are a geography assistant. Use the lookup_capital tool when asked about capitals. Always include the capital city name in your response.
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = f9109e29-9fe0-4153-8d04-2366494ba9d1
            |       event_loop.parent_cycle_id = c724fb50-146f-4d15-b358-430690ed7107
            |       gen_ai.event.end_time = 2026-03-20T18:46:46.372235+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:44.126301+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:46.372148+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:44.126359+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 2234
                |       gen_ai.server.time_to_first_token = 2095
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 12
                |       gen_ai.usage.input_tokens = 503
                |       gen_ai.usage.output_tokens = 12
                |       gen_ai.usage.prompt_tokens = 503
                |       gen_ai.usage.total_tokens = 515
            |-- [SPAN] execute_event_loop_cycle
            |       event_loop.cycle_id = c724fb50-146f-4d15-b358-430690ed7107
            |       gen_ai.event.end_time = 2026-03-20T18:46:44.126225+00:00
            |       gen_ai.event.start_time = 2026-03-20T18:46:42.954465+00:00
                |-- [SPAN] chat
                |       gen_ai.event.end_time = 2026-03-20T18:46:44.125724+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:42.954505+00:00
                |       gen_ai.operation.name = chat
                |       gen_ai.request.model = us.anthropic.claude-sonnet-4-20250514-v1:0
                |       gen_ai.server.request.duration = 1097
                |       gen_ai.server.time_to_first_token = 891
                |       gen_ai.system = strands-agents
                |       gen_ai.usage.completion_tokens = 65
                |       gen_ai.usage.input_tokens = 423
                |       gen_ai.usage.output_tokens = 65
                |       gen_ai.usage.prompt_tokens = 423
                |       gen_ai.usage.total_tokens = 488
                |-- [SPAN] execute_tool lookup_capital
                |       gen_ai.event.end_time = 2026-03-20T18:46:44.126174+00:00
                |       gen_ai.event.start_time = 2026-03-20T18:46:44.125893+00:00
                |       gen_ai.operation.name = execute_tool
                |       gen_ai.system = strands-agents
                |       gen_ai.tool.call.id = tooluse_uSx0DCWCDjgxbnJoD1tiyY
                |       gen_ai.tool.description = Look up the capital city of a country.
                |       gen_ai.tool.json_schema = {"properties": {"country": {"description": "Name of the country to look up.", "type": "string"}}, "required": ["country"], "type": "object"}
                |       gen_ai.tool.name = lookup_capital
                |       gen_ai.tool.status = success
    |-- [SPAN] evaluator ToolCalled
    |       gen_ai.evaluation.case.name = brazil
    |       gen_ai.evaluation.explanation = tool 'lookup_capital' was called
    |       gen_ai.evaluation.name = ToolCalled
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
    |-- [SPAN] evaluator Contains
    |       gen_ai.evaluation.case.name = brazil
    |       gen_ai.evaluation.explanation = actual_output contains 'capital'
    |       gen_ai.evaluation.name = Contains
    |       gen_ai.evaluation.score.label = YES
    |       gen_ai.evaluation.score.value = 1.0
    |       gen_ai.evaluation.test_pass = True
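
For readers following the trace, here is a minimal sketch of the pattern this PR describes: a base `aevaluate()` that delegates to the synchronous `evaluate()` via `asyncio.to_thread`. The method signatures, the dict-shaped result, and this `Contains` class are assumptions made for illustration; they are not the library's actual API, and only the score, label, and pass fields mirror the attributes shown in the spans above.

```python
import asyncio


class Evaluator:
    """Hypothetical stand-in for the base class (signature is assumed)."""

    def evaluate(self, actual_output: str) -> dict:
        raise NotImplementedError

    async def aevaluate(self, actual_output: str) -> dict:
        # Default async path: run the synchronous evaluate() in a worker
        # thread so subclasses only need to implement evaluate().
        return await asyncio.to_thread(self.evaluate, actual_output)


class Contains(Evaluator):
    """Illustrative evaluator: passes when the output contains a substring."""

    def __init__(self, needle: str) -> None:
        self.needle = needle

    def evaluate(self, actual_output: str) -> dict:
        passed = self.needle in actual_output
        return {
            "name": "Contains",
            "score": 1.0 if passed else 0.0,
            "label": "YES" if passed else "NO",
            "test_pass": passed,
        }


# Only evaluate() is overridden, yet the async entry point still works.
result = asyncio.run(
    Contains("capital").aevaluate("The capital of Japan is Tokyo.\n")
)
print(result)
```

An evaluator with a natively asynchronous implementation could still override `aevaluate()` directly; the thread-based default only covers the case where nothing async is provided.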

@afarntrog afarntrog merged commit ae0e434 into strands-agents:main Mar 23, 2026
13 checks passed