feat: Add EvaluationPlugin for agent invocation evaluation and retry#166

Open
afarntrog wants to merge 4 commits into strands-agents:main from afarntrog:wip/evaluation-plugin

Conversation

@afarntrog afarntrog commented Mar 18, 2026

Description

Adds a new EvaluationPlugin that integrates with the Strands agents plugin system to evaluate agent output after each invocation and automatically retry with improved system prompts on failure.

Changes

New: strands_evals.plugins module

  • EvaluationPlugin — A Strands Plugin that wraps agent invocations to:
    • Run configured evaluators against agent output after each call
    • On evaluation failure, use an LLM to generate an improved system prompt based on the failure reasons
    • Automatically retry the invocation with the improved prompt (configurable max_retries)
    • Restore the original system prompt after all attempts complete
    • Return a structured ImprovementSuggestion (reasoning + system prompt) from the improvement LLM
  • improvement_suggestion.py — Prompt template for the improvement suggestion agent that analyzes evaluation failures and proposes system prompt modifications
  • Expected output and trajectory can be set at construction time or overridden per-invocation via invocation_state
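The evaluate-and-retry flow described above can be sketched as a standalone loop. Everything below is an illustrative stand-in, not the actual strands_evals API: `invoke_with_evaluation`, the evaluator signature, and `suggest` are hypothetical names chosen to mirror the behavior the PR describes (run evaluators, ask an LLM for a better system prompt on failure, retry up to max_retries, restore the original prompt afterwards).

```python
# Minimal sketch of the evaluate-and-retry pattern, assuming a callable
# agent with a mutable `system_prompt`. All names are illustrative
# stand-ins, not the plugin's real interface.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ImprovementSuggestion:
    reasoning: str
    system_prompt: str


def invoke_with_evaluation(
    agent,
    prompt: str,
    evaluators: list[Callable[[str], tuple[bool, str]]],
    suggest: Callable[[str, list[str]], ImprovementSuggestion],
    max_retries: int = 2,
):
    original_prompt = agent.system_prompt
    try:
        result = agent(prompt)
        for _ in range(max_retries):
            # Collect failure reasons from every evaluator that rejected the output.
            failures = [reason for ok, reason in (e(result) for e in evaluators) if not ok]
            if not failures:
                break  # all evaluators passed
            # Ask an LLM (stubbed here) for an improved system prompt, then retry.
            suggestion = suggest(agent.system_prompt, failures)
            agent.system_prompt = suggestion.system_prompt
            result = agent(prompt)
        return result
    finally:
        # Restore the original system prompt after all attempts complete.
        agent.system_prompt = original_prompt
```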

Other changes

  • Bumped strands-agents minimum version from >=1.0.0 to >=1.28.0 (required for the plugin interface)
  • Added ruff line-length exemption for plugins/prompt_templates/

Testing

  • 480 lines of unit tests covering: initialization, agent wrapping, evaluator execution, retry behavior, system prompt restoration, improvement suggestion generation, partial failure handling, max retries, and multi-evaluator scenarios


Related Issues

Documentation PR

Type of Change

New feature

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Introduce an EvaluationPlugin that hooks into agent invocations to
evaluate outputs against expected results and automatically retries
with improved system prompts on failure.

- Add EvaluationPlugin class that wraps agent __call__ to intercept
  invocations, run evaluators, and retry with LLM-suggested prompt
  improvements when evaluations fail
- Add improvement suggestion prompt template for generating better
  system prompts based on evaluation feedback
- Add comprehensive test suite covering plugin initialization, wrapping,
  evaluation execution, retry logic, and edge cases
- Update ruff config to ignore line-length in plugin prompt templates
- Add comprehensive docstrings to all methods in EvaluationPlugin
- Change _suggest_improvements to return ImprovementSuggestion instead of str
- Add debug logging with reasoning when applying improved system prompt
- Replace Union[X, Y] with modern X | Y syntax
- Use dict default in kwargs.get() instead of
- Export new plugins module from the package's public API
- Bump minimum strands-agents dependency from 1.0.0 to 1.28.0
  to support functionality required by the plugins module
Comment on lines +74 to +83

def wrapped_call(self_agent: Any, prompt: Any = None, **kwargs: Any) -> Any:
    return plugin._invoke_with_evaluation(self_agent, original_call, prompt, **kwargs)

wrapped_class = type(
    agent.__class__.__name__,
    (agent.__class__,),
    {"__call__": wrapped_call},
)
agent.__class__ = wrapped_class
@afarntrog afarntrog Mar 18, 2026


We need to do this monkey patching because hooks alone will not give us the retry pattern we need:

What EvaluationPlugin needs to do:

  1. Intercept an agent invocation
  2. Let it run
  3. Evaluate the result
  4. If evaluation fails, modify the system prompt and run the agent again
  5. Repeat up to max_retries times

What hooks give you:

  • BeforeInvocationEvent fires before the agent runs. You can modify messages. No result yet.
  • AfterInvocationEvent fires after the agent runs. You get the result. But you're inside _run_loop's finally block, and stream_async still holds the invocation lock. If you call agent() from here → deadlock.

Hooks give you "before" and "after", not "around". There's no way to say "evaluate this result and, if it's bad, run the whole thing again" from within a hook. The retry requires calling agent(), which requires the invocation lock, which is held by the very invocation that is firing your hook. The __class__ swap solves this because it replaces __call__: every call to agent(prompt) now goes through _invoke_with_evaluation, which runs the evaluators and handles retries.
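The class-swap mechanics can be demonstrated in isolation. This is a self-contained sketch: `Agent` and `wrap_with_evaluation` are stand-ins (not the SDK's classes), showing how a one-off subclass created with type() gives "around" interception for a single instance, without touching other instances of the same class.

```python
# Demo of the per-instance __class__ swap used in the diff above.
from typing import Any


class Agent:
    def __call__(self, prompt: str) -> str:
        return f"answer to {prompt}"


def wrap_with_evaluation(agent: Agent) -> None:
    original_call = agent.__class__.__call__

    def wrapped_call(self_agent: Any, prompt: Any = None, **kwargs: Any) -> Any:
        # An "around" interception: code can run before AND after the
        # original call, and could re-invoke original_call for retries.
        result = original_call(self_agent, prompt, **kwargs)
        return f"[evaluated] {result}"

    # Build a one-off subclass whose __call__ is the wrapper, then swap
    # this instance's class so only this agent is affected.
    wrapped_class = type(
        agent.__class__.__name__,
        (agent.__class__,),
        {"__call__": wrapped_call},
    )
    agent.__class__ = wrapped_class
```

Note that Python looks up dunder methods like __call__ on the type, not the instance, which is why the subclass-and-swap is needed rather than simply assigning `agent.__call__ = wrapped_call`.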

@afarntrog

Although there is some overlap with steering, the primary difference between the two is:

  • Steering: hooks into the agent's inner loop (per model call, per tool call: AfterModelCallEvent)
  • EvaluationPlugin: wraps the agent's outer loop (per full invocation: AfterInvocationEvent)

@afarntrog

see strands-agents/sdk-python#1711

