refactor: unify sync/async evaluation by defaulting aevaluate to asyncio.to_thread #173
Merged
afarntrog merged 3 commits into strands-agents:main on Mar 23, 2026
…cio.to_thread
- Change base Evaluator.aevaluate() to delegate to evaluate() via asyncio.to_thread instead of raising NotImplementedError
- Remove duplicated sync/async helper methods from Experiment class (_record_evaluator_result, _run_task, and related run logic)
- Consolidate experiment execution paths to reduce code duplication
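The delegation described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the strands_evals source: `BaseEvaluator`, `ContainsWord`, and the dict-shaped case and result are hypothetical stand-ins, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import threading


class BaseEvaluator:
    """Hypothetical stand-in for an evaluator base class (not the strands_evals source)."""

    def evaluate(self, case: dict) -> dict:
        # Subclasses supply the synchronous implementation.
        raise NotImplementedError

    async def aevaluate(self, case: dict) -> dict:
        # Default async path: run the sync method in a worker thread so the
        # event loop is not blocked. A subclass with native async logic can
        # still override this method.
        return await asyncio.to_thread(self.evaluate, case)


class ContainsWord(BaseEvaluator):
    """Sync-only evaluator; it inherits a working aevaluate()."""

    def __init__(self, word: str) -> None:
        self.word = word

    def evaluate(self, case: dict) -> dict:
        passed = self.word in case["output"]
        # Record the thread name to show where the sync code actually ran.
        return {"passed": passed, "thread": threading.current_thread().name}


async def main() -> dict:
    return await ContainsWord("Paris").aevaluate({"output": "The capital is Paris."})


result = asyncio.run(main())
print(result["passed"], result["thread"])
```

With a default like this, an evaluator that only defines `evaluate()` can be driven from either path, which is presumably what lets the duplicated sync/async helpers be removed.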
afarntrog commented Mar 20, 2026
Add semantic convention attributes to evaluation spans for improved observability. Wraps task execution in a dedicated span and enriches case, evaluator, and score spans with structured attributes such as case name, input, evaluator type, and score results.
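One reason a thread-based default can keep the span hierarchy intact is that `asyncio.to_thread` propagates the caller's `contextvars` context into the worker thread. The sketch below shows that mechanism in isolation, assuming the tracer's current span is carried in a context variable (as OpenTelemetry's Python implementation does by default). The `current_span` variable and the span name are hypothetical stand-ins, not identifiers from strands_evals.

```python
import asyncio
import contextvars

# Hypothetical stand-in for the tracer's "current span" slot.
current_span = contextvars.ContextVar("current_span", default=None)


def evaluate_sync() -> str:
    # Runs in a worker thread, yet sees the value set by the caller,
    # because asyncio.to_thread copies the current context.
    return f"child of {current_span.get()}"


async def run_case() -> str:
    # Pretend a case-level span has just been started.
    current_span.set("execute_case france")
    return await asyncio.to_thread(evaluate_sync)


seen = asyncio.run(run_case())
print(seen)
```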
The following script runs the experiment through both the sync and async paths to show the generated spans.

"""Script to prove the OpenTelemetry span hierarchy in Experiment.

Runs both sync and async evaluation paths using StrandsEvalsTelemetry's
InMemorySpanExporter, then dumps the full span tree to a timestamped
span_hierarchy_proof_<timestamp>.txt file next to this script.
"""
import asyncio
from pathlib import Path
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from strands import Agent, tool
from strands_evals import Case, Experiment
from strands_evals.evaluators.deterministic import Contains, ToolCalled
from strands_evals.extractors.tools_use_extractor import extract_agent_tools_used_from_messages
from strands_evals.telemetry import StrandsEvalsTelemetry
# Set up telemetry with in-memory exporter
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
exporter: InMemorySpanExporter = telemetry.in_memory_exporter
# --- Tool ---
@tool
def lookup_capital(country: str) -> str:
    """Look up the capital city of a country.

    Args:
        country: Name of the country to look up.
    """
    capitals = {
        "France": "Paris",
        "Japan": "Tokyo",
        "Brazil": "Brasilia",
        "Australia": "Canberra",
    }
    return capitals.get(country, f"Unknown capital for {country}")
# --- Task function using a real agent ---
def geography_agent_task(case: Case) -> dict:
    """Agent with a capital-lookup tool answering geography questions."""
    agent = Agent(
        system_prompt="You are a geography assistant. Use the lookup_capital tool when asked about capitals. "
        "Always include the capital city name in your response.",
        tools=[lookup_capital],
        callback_handler=None,
    )
    response = agent(case.input)
    trajectory = extract_agent_tools_used_from_messages(agent.messages)
    tool_names = [t["name"] for t in trajectory]
    return {"output": str(response), "trajectory": tool_names}
# --- Build test cases ---
cases = [
    Case(name="france", input="What is the capital of France?", expected_output=None),
    Case(name="japan", input="What is the capital of Japan?", expected_output=None),
    Case(name="brazil", input="What is the capital of Brazil?", expected_output=None),
]
evaluators = [
    ToolCalled(tool_name="lookup_capital"),
    Contains(value="capital"),
]
def build_span_tree(spans):
    """Organize flat span list into a parent-child tree."""
    by_id = {}
    children = {}
    # First pass: index every span by its id.
    for s in spans:
        sid = s.get_span_context().span_id
        by_id[sid] = s
        children.setdefault(sid, [])
    roots = []
    # Second pass: attach each span to its parent, or treat it as a root.
    for s in spans:
        sid = s.get_span_context().span_id
        pid = s.parent.span_id if s.parent else None
        if pid and pid in by_id:
            children[pid].append(sid)
        else:
            roots.append(sid)
    return by_id, children, roots
def format_tree(by_id, children, roots, indent=0):
    """Recursively format the span tree as indented text."""
    lines = []
    for sid in roots:
        s = by_id[sid]
        attrs = dict(s.attributes) if s.attributes else {}
        prefix = " " * indent
        marker = "|-- " if indent > 0 else ""
        lines.append(f"{prefix}{marker}[SPAN] {s.name}")
        if attrs:
            for k, v in sorted(attrs.items()):
                lines.append(f"{prefix}{'| ' if indent > 0 else ''} {k} = {v}")
        for child_id in children.get(sid, []):
            lines.extend(format_tree(by_id, children, [child_id], indent + 1))
    return lines
def dump_spans(label):
    """Dump current spans, return lines, and clear the exporter."""
    spans = exporter.get_finished_spans()
    # Print the raw span JSON first, for reference.
    print('begin non formatted finished spans')
    print(100*'=')
    for s in spans:
        print(s.to_json(indent=2))
    print(100*'=')
    print('end non formatted finished spans')
    by_id, children_map, roots = build_span_tree(spans)
    lines = [
        "=" * 70,
        f" {label}",
        f" Total spans captured: {len(spans)}",
        "=" * 70,
        "",
    ]
    lines.extend(format_tree(by_id, children_map, roots))
    lines.append("")
    # Reset the exporter so the next run starts from an empty span list.
    exporter.clear()
    return lines
# --- Run both paths and collect output ---
output_lines = [
    "SPAN HIERARCHY PROOF",
    f"Generated by: {Path(__file__).name}",
    "",
    "This file proves the OpenTelemetry span parent-child relationships",
    "produced by Experiment.run_evaluations() and run_evaluations_async().",
    "",
    "Indented spans are CHILDREN of the span above them.",
    "Top-level (non-indented) spans have no parent in the evaluation scope.",
    "",
]
# 1. Sync path
exp_sync = Experiment(cases=cases, evaluators=evaluators)
exp_sync.run_evaluations(geography_agent_task)
output_lines.extend(dump_spans("SYNC PATH: Experiment.run_evaluations()"))
# 2. Async path
exp_async = Experiment(cases=cases, evaluators=evaluators)
asyncio.run(exp_async.run_evaluations_async(geography_agent_task, max_workers=1))
output_lines.extend(dump_spans("ASYNC PATH: Experiment.run_evaluations_async()"))
# --- Write to file ---
# add datetime to the path
from datetime import datetime
dt_str = datetime.now().strftime("%Y%m%d_%H%M%S")
out_path = Path(__file__).parent / f"span_hierarchy_proof_{dt_str}.txt"
text = "\n".join(output_lines)
out_path.write_text(text)
print(text)
print(f"\nWritten to {out_path}")
Output==> |
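The grouping logic in build_span_tree can also be checked without running an agent. The sketch below is a minimal, hedged illustration: `FakeSpan`, `FakeContext`, the `render` helper, and the span names are all hypothetical stand-ins that expose only the fields the tree builder reads. They are not OpenTelemetry classes and the names are not the ones the PR emits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeContext:
    span_id: int


@dataclass
class FakeSpan:
    """Hypothetical stand-in exposing only what the tree builder reads."""
    name: str
    span_id: int
    parent: Optional[FakeContext] = None

    def get_span_context(self) -> FakeContext:
        return FakeContext(self.span_id)


def build_span_tree(spans):
    """Same grouping as the script: index by id, then attach each span to its parent."""
    by_id = {s.get_span_context().span_id: s for s in spans}
    children = {sid: [] for sid in by_id}
    roots = []
    for s in spans:
        sid = s.get_span_context().span_id
        pid = s.parent.span_id if s.parent else None
        if pid and pid in by_id:
            children[pid].append(sid)
        else:
            roots.append(sid)
    return by_id, children, roots


def render(by_id, children, ids, depth=0):
    """Flatten the tree into indented names, parents before children."""
    lines = []
    for sid in ids:
        lines.append("    " * depth + by_id[sid].name)
        lines.extend(render(by_id, children, children[sid], depth + 1))
    return lines


# Children usually finish before their parents, so they come first here.
spans = [
    FakeSpan("task", 3, FakeContext(2)),
    FakeSpan("evaluator", 4, FakeContext(2)),
    FakeSpan("case", 2, FakeContext(1)),
    FakeSpan("experiment", 1),
]
tree = render(*build_span_tree(spans))
print("\n".join(tree))
```

Indexing every span before linking is what makes the result independent of export order: a child that appears before its parent in the list is still attached correctly.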
poshinchen approved these changes Mar 23, 2026
Description
refactor: unify sync/async evaluation by defaulting aevaluate to asyncio.to_thread
- Change base Evaluator.aevaluate() to delegate to evaluate() via asyncio.to_thread instead of raising NotImplementedError
- Remove duplicated sync/async helper methods from Experiment class (_record_evaluator_result, _run_task, and related run logic)
Related Issues
#172
Documentation PR
Type of Change
New feature
Testing
How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli
hatch run prepare
Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.