feat: Add EvaluationPlugin for agent invocation evaluation and retry#166

Open
afarntrog wants to merge 4 commits into strands-agents:main from afarntrog:wip/evaluation-plugin

Conversation

@afarntrog afarntrog commented Mar 18, 2026

Description

Adds a new EvaluationPlugin that integrates with the Strands agents plugin system to evaluate agent output after each invocation and automatically retry with improved system prompts on failure.

Changes

New: strands_evals.plugins module

  • EvaluationPlugin — A Strands Plugin that wraps agent invocations to:
    • Run configured evaluators against agent output after each call
    • On evaluation failure, use an LLM to generate an improved system prompt based on the failure reasons
    • Automatically retry the invocation with the improved prompt (configurable max_retries)
    • Restore the original system prompt after all attempts complete
    • Return a structured ImprovementSuggestion (reasoning + system prompt) from the improvement LLM
  • improvement_suggestion.py — Prompt template for the improvement suggestion agent that analyzes evaluation failures and proposes system prompt modifications
  • Expected output and trajectory can be set at construction time or overridden per-invocation via invocation_state
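The evaluate-and-retry flow described above can be sketched as a standalone loop. Everything below is an illustrative stand-in, not the actual strands_evals API: `invoke_with_evaluation`, the evaluator signature, and `suggest` are hypothetical names chosen to mirror the behavior the PR describes (run evaluators, ask an LLM for a better system prompt on failure, retry up to max_retries, restore the original prompt afterwards).

```python
# Minimal sketch of the evaluate-and-retry pattern, assuming a callable
# agent with a mutable `system_prompt`. All names are illustrative
# stand-ins, not the plugin's real interface.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ImprovementSuggestion:
    reasoning: str
    system_prompt: str


def invoke_with_evaluation(
    agent,
    prompt: str,
    evaluators: list[Callable[[str], tuple[bool, str]]],
    suggest: Callable[[str, list[str]], ImprovementSuggestion],
    max_retries: int = 2,
):
    original_prompt = agent.system_prompt
    try:
        result = agent(prompt)
        for _ in range(max_retries):
            # Collect failure reasons from every evaluator that rejected the output.
            failures = [reason for ok, reason in (e(result) for e in evaluators) if not ok]
            if not failures:
                break  # all evaluators passed
            # Ask an LLM (stubbed here) for an improved system prompt, then retry.
            suggestion = suggest(agent.system_prompt, failures)
            agent.system_prompt = suggestion.system_prompt
            result = agent(prompt)
        return result
    finally:
        # Restore the original system prompt after all attempts complete.
        agent.system_prompt = original_prompt
```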

Other changes

  • Bumped strands-agents minimum version from >=1.0.0 to >=1.28.0 (required for the plugin interface)
  • Added ruff line-length exemption for plugins/prompt_templates/

Testing

  • 480 lines of unit tests covering: initialization, agent wrapping, evaluator execution, retry behavior, system prompt restoration, improvement suggestion generation, partial failure handling, max retries, and multi-evaluator scenarios


Related Issues

Documentation PR

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Introduce an EvaluationPlugin that hooks into agent invocations to
evaluate outputs against expected results and automatically retries
with improved system prompts on failure.

- Add EvaluationPlugin class that wraps agent __call__ to intercept
  invocations, run evaluators, and retry with LLM-suggested prompt
  improvements when evaluations fail
- Add improvement suggestion prompt template for generating better
  system prompts based on evaluation feedback
- Add comprehensive test suite covering plugin initialization, wrapping,
  evaluation execution, retry logic, and edge cases
- Update ruff config to ignore line-length in plugin prompt templates
- Add comprehensive docstrings to all methods in EvaluationPlugin
- Change _suggest_improvements to return ImprovementSuggestion instead of str
- Add debug logging with reasoning when applying improved system prompt
- Replace Union[X, Y] with modern X | Y syntax
- Use dict default in kwargs.get() instead of
- Export new plugins module from the package's public API
- Bump minimum strands-agents dependency from 1.0.0 to 1.28.0
  to support functionality required by the plugins module
Comment on lines +74 to +83

def wrapped_call(self_agent: Any, prompt: Any = None, **kwargs: Any) -> Any:
    return plugin._invoke_with_evaluation(self_agent, original_call, prompt, **kwargs)

wrapped_class = type(
    agent.__class__.__name__,
    (agent.__class__,),
    {"__call__": wrapped_call},
)
agent.__class__ = wrapped_class
@afarntrog afarntrog Mar 18, 2026


We need to do this monkey patching because hooks alone will not give us the retry pattern we need:

What EvaluationPlugin needs to do:

  1. Intercept an agent invocation
  2. Let it run
  3. Evaluate the result
  4. If evaluation fails, modify the system prompt and run the agent again
  5. Repeat up to max_retries times

What hooks give you:

  • BeforeInvocationEvent fires before the agent runs. You can modify messages. No result yet.
  • AfterInvocationEvent fires after the agent runs. You get the result. But you're inside _run_loop's finally block, and stream_async still holds the invocation lock. If you call agent() from here → deadlock.

Hooks give you "before" and "after", not "around". There's no way to say "evaluate this result and, if it's bad, run the whole thing again" from within a hook. The retry requires calling agent(), which requires the invocation lock, which is held by the very invocation that is firing your hook. The __class__ swap solves this because it replaces __call__: every call to agent(prompt) now goes through _invoke_with_evaluation, which runs the evaluators and handles retries.
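The class-swap mechanics can be demonstrated in isolation. This is a self-contained sketch: `Agent` and `wrap_with_evaluation` are stand-ins (not the SDK's classes), showing how a one-off subclass created with type() gives "around" interception for a single instance, without touching other instances of the same class.

```python
# Demo of the per-instance __class__ swap used in the diff above.
from typing import Any


class Agent:
    def __call__(self, prompt: str) -> str:
        return f"answer to {prompt}"


def wrap_with_evaluation(agent: Agent) -> None:
    original_call = agent.__class__.__call__

    def wrapped_call(self_agent: Any, prompt: Any = None, **kwargs: Any) -> Any:
        # An "around" interception: code can run before AND after the
        # original call, and could re-invoke original_call for retries.
        result = original_call(self_agent, prompt, **kwargs)
        return f"[evaluated] {result}"

    # Build a one-off subclass whose __call__ is the wrapper, then swap
    # this instance's class so only this agent is affected.
    wrapped_class = type(
        agent.__class__.__name__,
        (agent.__class__,),
        {"__call__": wrapped_call},
    )
    agent.__class__ = wrapped_class
```

Note that Python looks up dunder methods like __call__ on the type, not the instance, which is why the subclass-and-swap is needed rather than simply assigning `agent.__call__ = wrapped_call`.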

@afarntrog

Although there is some overlap with steering, the primary difference between the two is:

  • Steering: hooks into the agent's inner loop (per model call, per tool call: AfterModelCallEvent)
  • EvaluationPlugin: wraps the agent's outer loop (per full invocation: AfterInvocationEvent)

@afarntrog

see strands-agents/sdk-python#1711

