docs: add chaos testing and resilience evaluator doc pages and example scripts #2697

Open
ybdarrenwang wants to merge 6 commits into strands-agents:main from ybdarrenwang:docs/chaos-tool

Conversation

@ybdarrenwang ybdarrenwang commented Jun 9, 2026

Contributor

Description

Adds documentation pages and example scripts for the chaos testing module and three new resilience evaluators in Strands Evals.

Doc pages:

  • Chaos Testing page (chaos_testing.mdx): Overview and guide for the chaos testing framework, covering ChaosPlugin, ChaosCase, ChaosExperiment, chaos effects (pre-hook and post-hook), ChaosCase.expand() for Cartesian product generation, integration with ToolSimulator, resilience evaluators, advanced patterns (agent comparison, degradation sweep, multi-turn with user simulator), and best practices
  • FailureCommunicationEvaluator page: Documents the evaluator that assesses how well an agent communicates tool failures to users, using a five-level scoring rubric (No Communication → Excellent Communication)
  • PartialCompletionEvaluator page: Documents the evaluator that scores what fraction of a user's goal was achieved despite failures, returning a continuous 0.0–1.0 score
  • RecoveryStrategyEvaluator page: Documents the evaluator that scores the quality of an agent's recovery actions (retries, fallbacks, tool switching) when tools fail
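The Cartesian-product expansion that the chaos testing page covers for ChaosCase.expand() can be pictured with plain Python. The sketch below is not the strands_evals API: `Case`, `expand_cases`, and the effect names are hypothetical stand-ins, and it assumes that expansion pairs every base case with every named effect plus one unmodified baseline.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Case:
    """Hypothetical stand-in for a test case; not the real ChaosCase."""
    name: str
    input: str
    effect: Optional[str] = None  # None means the no-chaos baseline


def expand_cases(base_cases, effect_names):
    """Pair every base case with every effect, plus a baseline (assumed behavior)."""
    variants = [None, *effect_names]
    return [
        Case(name=f"{case.name}[{effect or 'baseline'}]", input=case.input, effect=effect)
        for case, effect in product(base_cases, variants)
    ]


base = [
    Case("weather", "What is the weather in Paris?"),
    Case("booking", "Book a table for two."),
]
expanded = expand_cases(base, ["timeout", "network_error"])
for case in expanded:
    print(case.name)
```

Two base cases and two effects give six variants here, so the grid grows multiplicatively as effects are added. Keeping a baseline variant is what makes a baseline-vs-injected comparison possible.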

Example scripts:

  • chaos_testing.py: Demonstrates end-to-end chaos testing with named effect maps, ChaosCase.expand(), ChaosPlugin, ToolSimulator, and GoalSuccessRateEvaluator
  • chaos_failure_communication_evaluator.py: Evaluating agent failure communication under timeout and network errors
  • chaos_partial_completion_evaluator.py: Measuring partial task completion when tools return corrupted or failed responses
  • chaos_recovery_strategy_evaluator.py: Assessing agent recovery strategies (tool pivoting, retry discipline)

Also updates:

  • Evaluators index (index.mdx): Adds a new "Resilience Evaluators" subsection
  • navigation.yml: Wires the new chaos testing and evaluator pages into the evals-sdk sidebar
  • Fixes inconsistent import paths: all effect classes are now imported from strands_evals.chaos (the canonical re-export path) in all four scripts

Related Issues

strands-agents/evals#114
strands-agents/docs#836 (original PR, repo archived)
#2724 (PR for doc pages only; combined)

Documentation PR

Type of Change

Documentation update

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran npm run dev

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@github-actions github-actions Bot added the size/l label Jun 9, 2026
@ybdarrenwang ybdarrenwang changed the title from "add chaos testing and resilience evaluator example scripts" to "docs: add chaos testing and resilience evaluator example scripts" Jun 9, 2026
Comment thread site/docs/examples/evals-sdk/chaos_failure_communication_evaluator.py Outdated
Comment thread site/docs/examples/evals-sdk/chaos_failure_communication_evaluator.py Outdated
@github-actions github-actions Bot added size/l and removed size/l labels Jun 9, 2026
@yonib05 yonib05 added the python, enhancement, and documentation labels Jun 9, 2026
@github-actions github-actions Bot added size/l and removed size/l labels Jun 10, 2026
poshinchen previously approved these changes Jun 10, 2026
@github-actions

Contributor

Issue: These four example scripts aren't referenced by any documentation page or navigation entry. Every existing example under site/docs/examples/evals-sdk/ (e.g. goal_success_rate_evaluator.py, trajectory_evaluator.py) has a companion .mdx page in src/content/docs/user-guide/evals-sdk/evaluators/ that links to it with a "A complete example can be found [here]" reference. As-is, the chaos examples are undiscoverable from the docs site. The PR checklist marks "I have updated the documentation accordingly," but no doc page or nav entry was added.

Suggestion: Add at least one .mdx page (e.g. a "Chaos Testing" / resilience-evaluators page) that introduces the chaos module and links to these scripts, and wire it into the evals-sdk navigation. Note that the existing example links point to github.com/strands-agents/docs/blob/main/... (the archived repo) — any new page should reference the harness-sdk paths instead.
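A discoverability check like the one described above could be automated. The following is a minimal sketch, assuming that an example counts as "referenced" when its file name appears anywhere in some .mdx page. The function name `unreferenced_examples` and the temporary directory layout are hypothetical, and nothing here is part of the repository's tooling.

```python
import tempfile
from pathlib import Path


def unreferenced_examples(examples_dir: Path, docs_dir: Path) -> list[str]:
    """Return example script names that no .mdx page under docs_dir mentions."""
    pages = [p.read_text(encoding="utf-8") for p in sorted(docs_dir.rglob("*.mdx"))]
    return [
        script.name
        for script in sorted(examples_dir.glob("*.py"))
        if not any(script.name in page for page in pages)
    ]


# Demonstration on a throwaway directory tree (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    examples = root / "examples"
    docs = root / "docs"
    examples.mkdir()
    docs.mkdir()
    (examples / "chaos_testing.py").write_text("# example\n", encoding="utf-8")
    (examples / "orphan.py").write_text("# example\n", encoding="utf-8")
    (docs / "chaos_testing.mdx").write_text(
        "A complete example can be found in chaos_testing.py\n", encoding="utf-8"
    )
    missing = unreferenced_examples(examples, docs)
    print(missing)
```

Against the real tree, the two arguments would be the example directory and the docs content directory named in the comment above. Substring matching is a simplification; a stricter check would parse the link targets.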

Comment thread site/docs/examples/evals-sdk/chaos_failure_communication_evaluator.py Outdated
@github-actions

Contributor

Assessment: Comment

Solid set of chaos-testing examples — clear scenarios, good system prompts modeling realistic failure handling, and all four scripts compile cleanly. The earlier feedback on root-level imports and EvaluationReport.flatten has been addressed. Two items remain worth resolving before merge.

Review Categories
  • Documentation integration: The new examples are not linked from any .mdx page or nav entry, unlike every other example in this directory — they're currently undiscoverable from the docs site.
  • Consistency: Effect-class import paths vary within and across the four files (root strands_evals.chaos vs. strands_evals.chaos.effects); worth standardizing since examples model canonical usage.
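The import-path consistency point could be checked mechanically. This is a rough sketch using only the standard-library `ast` module: it flags imports that reach into submodules such as strands_evals.chaos.effects instead of the root strands_evals.chaos path. The helper name `non_canonical_imports` and the class names in the sample string are placeholders, not real exports.

```python
import ast

CANONICAL = "strands_evals.chaos"


def non_canonical_imports(source: str) -> list[tuple[int, str]]:
    """Return (line, module) pairs for imports from submodules of CANONICAL."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            if node.module.startswith(CANONICAL + "."):
                findings.append((node.lineno, node.module))
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.startswith(CANONICAL + "."):
                    findings.append((node.lineno, alias.name))
    return sorted(findings)


# The source is only parsed, never imported, so placeholder names are fine.
sample = (
    "from strands_evals.chaos import SomeEffect\n"
    "from strands_evals.chaos.effects import OtherEffect\n"
)
findings = non_canonical_imports(sample)
print(findings)
```

Running this over each example script's text would list the lines that still use the submodule path.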

Nice work demonstrating pre-hook, post-hook, and compound chaos with the ChaosCase.expand Cartesian-product pattern — the baseline-vs-injected comparison is a great teaching example.

@github-actions github-actions Bot added size/xl and removed size/l labels Jun 11, 2026
@ybdarrenwang ybdarrenwang changed the title from "docs: add chaos testing and resilience evaluator example scripts" to "docs: add chaos testing and resilience evaluator doc pages and example scripts" Jun 11, 2026
Comment thread site/=4.8.2 Outdated

for config_name, system_prompt in configs.items():
def make_task(prompt):
def task_function(case: ChaosCase) -> dict:

Contributor

Issue: These two advanced-pattern snippets (Pattern 1 here, and Pattern 3 "Multi-turn" further down) still use the old manual task style: a def task_function(case) -> dict: function that calls agent(case.input) and returns {"output": str(response)}. Every other snippet in this PR was migrated to @eval_task(TracedHandler()) per the earlier review feedback, so these two now stand out as inconsistent.

More importantly, the resilience evaluators are SESSION_LEVEL and rely on the conversation trace/trajectory. Pattern 1 wires up PartialCompletionEvaluator, but the task returns only {"output": ...} with no trajectory — which is exactly what TracedHandler captures automatically. As written, the evaluation here would likely be degraded or fail to find the trace.

Suggestion: Convert both patterns to the @eval_task(TracedHandler()) form used elsewhere (decorate the task and return Agent(...)), so they're consistent and actually feed the session trace to the evaluators.
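The difference between the two task styles can be illustrated with a toy model. Everything below is a hypothetical stand-in: `ToyAgent` and the `traced` decorator only mimic the idea behind @eval_task(TracedHandler()) as described in this comment, and the real strands_evals implementation may capture traces quite differently.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Stand-in agent that records each tool call it makes."""
    trajectory: list = field(default_factory=list)

    def __call__(self, prompt: str) -> str:
        # Simulate one failed tool call followed by a successful retry.
        self.trajectory.append({"tool": "search", "status": "error"})
        self.trajectory.append({"tool": "search", "status": "ok"})
        return f"Answer to: {prompt}"


def manual_task(case_input: str) -> dict:
    """Old style: run the agent here and return only the final text."""
    agent = ToyAgent()
    return {"output": str(agent(case_input))}


def traced(task):
    """Toy decorator: runs the returned agent and keeps its trace."""
    def wrapper(case_input: str) -> dict:
        agent = task(case_input)
        output = agent(case_input)
        return {"output": output, "trajectory": list(agent.trajectory)}
    return wrapper


@traced
def traced_task(case_input: str) -> ToyAgent:
    """New style: build and return the agent; the wrapper does the rest."""
    return ToyAgent()


manual_result = manual_task("Find flights")
traced_result = traced_task("Find flights")
print(sorted(manual_result))
print(sorted(traced_result))
```

In this toy version the manual task discards the retry history, while the decorated task hands the agent back so the wrapper can keep it. That mirrors why a session-level evaluator would have nothing to inspect in the manual form.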

@github-actions

Contributor

Assessment: Comment (very close to approve)

Thanks for the thorough revision — all of my earlier feedback is fully addressed: doc pages added for the chaos framework and all three resilience evaluators, navigation wired in, links pointing at harness-sdk, imports unified to strands_evals.chaos, and the examples migrated to @eval_task(TracedHandler()). The chaos_testing.mdx guide is genuinely excellent — the effect-type tables, the "Interpreting Results" score-combination matrix, and the best-practices section are very useful. All four example scripts compile cleanly.

Remaining item
  • Consistency / correctness: Two of the three "Advanced Chaos Testing Patterns" snippets (Pattern 1 and Pattern 3) still use the old manual -> dict task style without a trajectory, while everything else now uses @eval_task(TracedHandler()). Since the resilience evaluators are SESSION_LEVEL, those snippets should be migrated too (inline comment above).

Nice work — once the two advanced snippets are aligned, this is ready to ship.

Labels

documentation, enhancement, python, size/xl

3 participants