feat(detectors): added basic interfaces, and failure_detector agent by poshinchen · Pull Request #175 · strands-agents/evals

poshinchen · 2026-03-23T03:12:07Z

Description

This PR is Phase 1 of the detectors work. It adds the basic interfaces and the first detector, failure_detector.
Summary

Add the strands_evals.detectors package, which brings single-trace detection capabilities into strands-evals
Implement detect_failures() for LLM-based semantic failure detection (hallucinations, tool misuse, goal deviation, etc.), with an automatic chunking fallback for sessions that exceed the model's context limit
Add shared chunking utilities (token estimation, span splitting, chunk merging) used by all LLM-based detectors
Add all Pydantic types for the detectors package (FailureOutput, RCAOutput, DiagnosisResult, and the LLM structured-output schemas)
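The shared chunking utilities are only named above. As a minimal sketch of the idea, assuming a rough 4-characters-per-token estimate and spans already serialized to strings, token estimation, greedy span splitting and an order-preserving merge could look like this. The names estimate_tokens, split_spans and merge_chunk_results are hypothetical and are not the package's API.

```python
"""Hypothetical sketch of token estimation, span splitting and chunk merging."""
from typing import Iterable, List

CHARS_PER_TOKEN = 4  # assumed heuristic, not taken from the PR


def estimate_tokens(text: str) -> int:
    """Roughly estimate tokens from character count (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def split_spans(spans: List[str], max_tokens: int) -> List[List[str]]:
    """Greedily pack consecutive spans into chunks that fit under max_tokens.

    A single span larger than max_tokens gets a chunk of its own.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for span in spans:
        span_tokens = estimate_tokens(span)
        # Close the current chunk if adding this span would exceed the budget.
        if current and current_tokens + span_tokens > max_tokens:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(span)
        current_tokens += span_tokens
    if current:
        chunks.append(current)
    return chunks


def merge_chunk_results(results: Iterable[List[str]]) -> List[str]:
    """Concatenate per-chunk findings, dropping duplicates while keeping order."""
    seen = set()
    merged: List[str] = []
    for chunk_result in results:
        for item in chunk_result:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged
```

Greedy packing keeps consecutive spans together, so each chunk still reads as a contiguous slice of the trace; an oversized single span is passed through as its own chunk rather than truncated.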

What's included:

types/detector
- All detector Pydantic models (output types + LLM schemas)
detectors/chunking.py
- Shared context window management and span chunking
- Pure-parsing user request extraction (no LLM)
detectors/failure_detector.py
- detect_failures() with direct → chunked fallback
detectors/prompt_templates/failure_detection/
- Failure detection prompt (verbatim from Lens)
detectors/__init__.py
- Public exports
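chunking.py is described as extracting the user request by pure parsing, with no LLM call. A minimal sketch, assuming a hypothetical span shape (a dict with start_time and a messages list of role/content pairs) rather than the real span type, is to return the earliest non-empty user message. extract_user_request is an illustrative name, not the module's function.

```python
"""Hypothetical sketch: extract the user's request from spans without an LLM."""
from typing import Any, Dict, List, Optional

# Assumed span shape: {"start_time": int, "messages": [{"role": str, "content": str}]}
Span = Dict[str, Any]


def extract_user_request(spans: List[Span]) -> Optional[str]:
    """Return the earliest non-empty user message, or None if there is none."""
    # Walk spans in chronological order so the first user turn wins.
    for span in sorted(spans, key=lambda s: s.get("start_time", 0)):
        for message in span.get("messages", []):
            if message.get("role") == "user":
                content = str(message.get("content", "")).strip()
                if content:
                    return content
    return None
```

Because this is plain parsing, it is deterministic and free, which suits a value every detector prompt needs.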

Design decisions

Standalone functions, not classes: each detector is a module-level function with _-prefixed helpers, following the design doc
Agent(model=..., system_prompt=..., callback_handler=None) for all LLM calls, the same pattern the existing evaluators use
Default model: Haiku (us.anthropic.claude-haiku-4-5-20251001-v1:0), chosen because it is fast and cost-effective for detection tasks
Context overflow handling: a pre-flight token estimate, a catch for ContextWindowOverflowException, and a string-matching fallback
Prompt text is copied verbatim from the Lens Jinja2 templates; only the rendering mechanism changed ({{ var }} → f-string)
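The three layers of overflow handling can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: ContextWindowOverflowException is defined locally as a stand-in, the overflow message fragments in OVERFLOW_MARKERS are guesses, call_llm is a hypothetical callable returning a list of findings, and chunk_tokens is an illustrative parameter for the smaller per-chunk budget.

```python
"""Hypothetical sketch of direct -> chunked fallback on context overflow."""
from typing import Callable, List


class ContextWindowOverflowException(Exception):
    """Local stand-in for the overflow exception the PR catches (assumption)."""


# Assumed error-message fragments that signal an overflow; not taken from the PR.
OVERFLOW_MARKERS = ("context window", "too many tokens", "input is too long")


def _estimate_tokens(text: str) -> int:
    """Rough estimate, assuming about 4 characters per token."""
    return -(-len(text) // 4)


def _is_overflow(error: Exception) -> bool:
    """Recognise overflow by exception type first, then by message text."""
    if isinstance(error, ContextWindowOverflowException):
        return True
    message = str(error).lower()
    return any(marker in message for marker in OVERFLOW_MARKERS)


def _chunk(spans: List[str], chunk_tokens: int) -> List[str]:
    """Greedily join consecutive spans into texts of at most chunk_tokens (estimated)."""
    chunks: List[str] = []
    current: List[str] = []
    for span in spans:
        candidate = current + [span]
        if current and _estimate_tokens("\n".join(candidate)) > chunk_tokens:
            chunks.append("\n".join(current))
            current = [span]
        else:
            current = candidate
    if current:
        chunks.append("\n".join(current))
    return chunks


def detect_with_fallback(
    spans: List[str],
    call_llm: Callable[[str], List[str]],
    max_tokens: int,
    chunk_tokens: int,
) -> List[str]:
    """Try one direct LLM call; fall back to per-chunk calls on (predicted) overflow."""
    full_text = "\n".join(spans)
    if _estimate_tokens(full_text) <= max_tokens:  # pre-flight estimate
        try:
            return call_llm(full_text)
        except Exception as error:
            if not _is_overflow(error):
                raise  # unrelated failures propagate unchanged
    # Chunked path: one call per chunk, findings concatenated in order.
    findings: List[str] = []
    for chunk_text in _chunk(spans, chunk_tokens):
        findings.extend(call_llm(chunk_text))
    return findings
```

The pre-flight estimate avoids a wasted call when the session is obviously too large, while the exception and string checks catch the cases where the estimate was too optimistic.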

How to Use failure_detector

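from strands_evals.detectors import detect_failures
from strands_evals.mappers import StrandsInMemorySessionMapper

# memory_exporter, create_agents_and_graph and case come from the integration-test setup
memory_exporter.clear()
create_agents_and_graph(case)
finished_spans = memory_exporter.get_finished_spans()
mapper = StrandsInMemorySessionMapper()
flawed_session = mapper.map_to_session(finished_spans, session_id="flawed-session")
failures = detect_failures(flawed_session, confidence_threshold="medium")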
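The example does not spell out how confidence_threshold filters results. One plausible reading is sketched below, under the assumption that confidence labels are ordered low < medium < high and that the threshold keeps findings at or above it. Finding and filter_by_confidence are hypothetical names, not types from this PR.

```python
"""Hypothetical sketch of confidence-threshold filtering."""
from dataclasses import dataclass
from typing import List

# Assumed ordering of confidence labels; not taken from the PR.
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


@dataclass
class Finding:
    category: str    # e.g. "hallucination", "tool_misuse" (illustrative labels)
    confidence: str  # one of CONFIDENCE_RANK's keys


def filter_by_confidence(findings: List[Finding], threshold: str = "low") -> List[Finding]:
    """Keep findings whose confidence is at or above the threshold."""
    if threshold not in CONFIDENCE_RANK:
        raise ValueError(f"unknown confidence threshold: {threshold!r}")
    minimum = CONFIDENCE_RANK[threshold]
    return [f for f in findings if CONFIDENCE_RANK[f.confidence] >= minimum]
```

Under this reading, confidence_threshold="medium" drops only low-confidence findings, which is why a variable holding the result is better named for what it contains than "high_conf".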

Test Scenarios

Unit tests for all Pydantic types (test_types.py)
Unit tests for chunking utilities (test_chunking.py)
Unit tests for user request extraction (test_user_request_extractor.py)
Unit tests for failure detector with mocked LLM (test_failure_detector.py)
Manual integration test: ran detect_failures() on a flawed multi-agent graph session — correctly detected goal deviation, tool schema errors, and task non-compliance
Manual integration test: ran detect_failures() on a clean math agent session
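The unit tests exercise the failure detector against a mocked LLM. The general pattern is sketched here with a toy detect() function (hypothetical, not detect_failures()) and unittest.mock.Mock standing in for the model: fix the model's reply, then assert on both the parsed result and the prompt the model received.

```python
"""Hypothetical sketch of testing a detector against a mocked LLM."""
import json
from typing import Callable, List
from unittest import mock


def detect(session_text: str, llm: Callable[[str], str]) -> List[str]:
    """Toy detector: ask the LLM, then parse a JSON list of failure categories."""
    raw = llm(f"List failures in this session:\n{session_text}")
    return [str(item) for item in json.loads(raw)]


def test_detect_parses_llm_output() -> None:
    # The mock returns a canned reply, so the test is fast and deterministic.
    fake_llm = mock.Mock(return_value='["goal_deviation", "tool_schema_error"]')
    assert detect("user: add 2+2", fake_llm) == ["goal_deviation", "tool_schema_error"]
    # Check the prompt actually carried the session content.
    fake_llm.assert_called_once()
    assert "add 2+2" in fake_llm.call_args.args[0]
```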

Related Issues

N/A

Documentation PR

N/A

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

I ran hatch run prepare

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

poshinchen had a problem deploying to auto-approve on March 23, 2026 at 03:12 with GitHub Actions (Failure)

poshinchen mentioned this pull request on Mar 23, 2026 from feat(detectors): added basic interfaces, and failure_detector agent #162 (Closed, 7 tasks)

Commit 08359d0: feat(detectors): added basic interfaces, and failure_detector agent

poshinchen force-pushed the feat/interface-and-detector branch from 7165296 to 08359d0 on March 23, 2026 at 03:29

poshinchen temporarily deployed to auto-approve on March 23, 2026 at 03:29 with GitHub Actions (Inactive)

poshinchen wants to merge 1 commit into strands-agents:main from poshinchen:feat/interface-and-detector
