docs: add chaos testing and resilience evaluator doc pages and example scripts#2697
ybdarrenwang wants to merge 6 commits into
Issue: These four example scripts aren't referenced by any documentation page or navigation entry. Every existing example under
Suggestion: Add at least one
Assessment: Comment
Solid set of chaos-testing examples: clear scenarios, good system prompts modeling realistic failure handling, and all four scripts compile cleanly. The earlier feedback on root-level imports and
Review Categories
Nice work demonstrating pre-hook, post-hook, and compound chaos with the
    for config_name, system_prompt in configs.items():
        def make_task(prompt):
            def task_function(case: ChaosCase) -> dict:
Issue: These two advanced-pattern snippets (Pattern 1 here, and Pattern 3 "Multi-turn" further down) still use the old manual task style: `def task_function(case) -> dict:` that calls `agent(case.input)` and returns `{"output": str(response)}`. Every other snippet in this PR was migrated to `@eval_task(TracedHandler())` per the earlier review feedback, so these two now stand out as inconsistent.
More importantly, the resilience evaluators are `SESSION_LEVEL` and rely on the conversation trace/trajectory. Pattern 1 wires up `PartialCompletionEvaluator`, but the task returns only `{"output": ...}` with no trajectory, which is exactly what `TracedHandler` captures automatically. As written, the evaluation here would likely be degraded or fail to find the trace.
Suggestion: Convert both patterns to the `@eval_task(TracedHandler())` form used elsewhere (decorate the task and return `Agent(...)`), so they're consistent and actually feed the session trace to the evaluators.
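To make the concern concrete, here is a library-free sketch of why the return shape matters. Nothing in it comes from `strands_evals`: `FakeAgent`, `traced_task`, and `recovered_from_failure` are hypothetical stand-ins, and the real `@eval_task(TracedHandler())` decorator may well capture traces differently. It only illustrates the general idea that a session-level check has nothing to inspect when the task hands back the output alone.

```python
"""Stand-in sketch: a task that returns only {"output": ...} starves a
session-level check. None of these names are part of strands_evals."""
from dataclasses import dataclass, field
from functools import wraps


@dataclass
class FakeAgent:
    """Hypothetical agent that records every tool call it makes."""
    trajectory: list = field(default_factory=list)

    def __call__(self, prompt: str) -> str:
        # Simulate one injected failure followed by a successful retry.
        self.trajectory.append({"tool": "search", "status": "timeout"})
        self.trajectory.append({"tool": "search", "status": "ok"})
        return f"answer to {prompt!r}"


def traced_task(fn):
    """Hypothetical decorator: runs the returned agent and keeps its trace."""
    @wraps(fn)
    def wrapper(case_input: str) -> dict:
        agent = fn(case_input)
        output = agent(case_input)
        return {"output": output, "trajectory": list(agent.trajectory)}
    return wrapper


def manual_task(case_input: str) -> dict:
    # Old style: the trajectory is discarded before any evaluator sees it.
    agent = FakeAgent()
    return {"output": str(agent(case_input))}


@traced_task
def decorated_task(case_input: str) -> FakeAgent:
    # New style: return the agent and let the wrapper capture the trace.
    return FakeAgent()


def recovered_from_failure(result: dict) -> bool:
    """Toy session-level check: a failed call followed by a final success."""
    statuses = [step["status"] for step in result.get("trajectory", [])]
    return "timeout" in statuses and statuses[-1] == "ok"
```

With these stand-ins, the toy check cannot see the retry in `manual_task`'s result because the trajectory never leaves the function, while `decorated_task` exposes it. The real evaluators presumably face the same limitation in the two unconverted patterns.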
Assessment: Comment (very close to approve)
Thanks for the thorough revision. All of my earlier feedback is fully addressed: doc pages added for the chaos framework and all three resilience evaluators, navigation wired in, links pointing at
Remaining item
Nice work. Once the two advanced snippets are aligned, this is ready to ship.
Description
Adds documentation pages and example scripts for the chaos testing module and three new resilience evaluators in Strands Evals.
Doc pages:
- `chaos_testing.mdx`: Overview and guide for the chaos testing framework, covering `ChaosPlugin`, `ChaosCase`, `ChaosExperiment`, chaos effects (pre-hook and post-hook), `ChaosCase.expand()` for Cartesian product generation, integration with `ToolSimulator`, resilience evaluators, advanced patterns (agent comparison, degradation sweep, multi-turn with user simulator), and best practices

Example scripts:
- `chaos_testing.py`: Demonstrates end-to-end chaos testing with named effect maps, `ChaosCase.expand()`, `ChaosPlugin`, `ToolSimulator`, and `GoalSuccessRateEvaluator`
- `chaos_failure_communication_evaluator.py`: Evaluating agent failure communication under timeout and network errors
- `chaos_partial_completion_evaluator.py`: Measuring partial task completion when tools return corrupted or failed responses
- `chaos_recovery_strategy_evaluator.py`: Assessing agent recovery strategies (tool pivoting, retry discipline)

Also updates:
- `index.mdx`: Adds a new "Resilience Evaluators" subsection
- `navigation.yml`: Wires the new chaos testing and evaluator pages into the evals-sdk sidebar
- `strands_evals.chaos` (the canonical re-export path) consistently across all four scripts

Related Issues
strands-agents/evals#114
strands-agents/docs#836 (original PR, repo archived)
#2724 (PR for doc pages only; combined)
Documentation PR
Type of Change
Documentation update
Testing
How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli
npm run dev
Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.