docs: add chaos testing and resilience evaluator doc pages and example scripts by ybdarrenwang · Pull Request #2697 · strands-agents/harness-sdk

ybdarrenwang · 2026-06-09T16:23:30Z

Description

Adds documentation pages and example scripts for the chaos testing module and three new resilience evaluators in Strands Evals.

Doc pages:

Chaos Testing page (chaos_testing.mdx): Overview and guide for the chaos testing framework, covering ChaosPlugin, ChaosCase, ChaosExperiment, chaos effects (pre-hook and post-hook), ChaosCase.expand() for Cartesian product generation, integration with ToolSimulator, resilience evaluators, advanced patterns (agent comparison, degradation sweep, multi-turn with user simulator), and best practices
FailureCommunicationEvaluator page: Documents the evaluator that assesses how well an agent communicates tool failures to users, using a five-level scoring rubric (No Communication → Excellent Communication)
PartialCompletionEvaluator page: Documents the evaluator that scores what fraction of a user's goal was achieved despite failures, returning a continuous 0.0–1.0 score
RecoveryStrategyEvaluator page: Documents the evaluator that scores the quality of an agent's recovery actions (retries, fallbacks, tool switching) when tools fail
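
The Cartesian-product expansion mentioned for ChaosCase.expand() can be pictured with a minimal standard-library sketch. It does not use the strands_evals API: `Scenario`, `expand`, and the effect names below are hypothetical stand-ins, and the real method's signature may differ.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Scenario:
    """Hypothetical stand-in for one expanded chaos case (not a strands_evals class)."""

    name: str
    input: str
    effects: tuple[str, ...]


def expand(inputs: dict[str, str], effect_maps: dict[str, tuple[str, ...]]) -> list[Scenario]:
    """Pair every input with every named effect map (Cartesian product)."""
    return [
        Scenario(name=f"{input_name}/{effects_name}", input=text, effects=effects)
        for (input_name, text), (effects_name, effects) in product(
            inputs.items(), effect_maps.items()
        )
    ]


# Illustrative inputs and effect names only; none of these come from the PR.
inputs = {"booking": "Book a flight to Seattle", "refund": "Refund my last order"}
effect_maps = {
    "baseline": (),  # no injected failure, kept for comparison
    "timeout": ("timeout",),
    "compound": ("timeout", "corrupt_response"),
}

scenarios = expand(inputs, effect_maps)
for scenario in scenarios:
    print(scenario.name, scenario.effects)
```

Two inputs and three effect maps give six scenarios. Keeping an empty "baseline" map in the product is what makes a baseline-versus-injected comparison possible for every input.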

Example scripts:

chaos_testing.py: Demonstrates end-to-end chaos testing with named effect maps, ChaosCase.expand(), ChaosPlugin, ToolSimulator, and GoalSuccessRateEvaluator
chaos_failure_communication_evaluator.py: Evaluating agent failure communication under timeout and network errors
chaos_partial_completion_evaluator.py: Measuring partial task completion when tools return corrupted or failed responses
chaos_recovery_strategy_evaluator.py: Assessing agent recovery strategies (tool pivoting, retry discipline)

Also updates:

Evaluators index (index.mdx): Adds a new "Resilience Evaluators" subsection
navigation.yml: Wires the new chaos testing and evaluator pages into the evals-sdk sidebar
Fixes inconsistent import paths: all effect classes now import from strands_evals.chaos (the canonical re-export path) consistently across all four scripts

Related Issues

strands-agents/evals#114
strands-agents/docs#836 (original PR, repo archived)
#2724 (PR for doc pages only; combined)

Documentation PR

Type of Change

Documentation update

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

I ran npm run dev

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

github-actions · 2026-06-10T20:40:27Z

Issue: These four example scripts aren't referenced by any documentation page or navigation entry. Every existing example under site/docs/examples/evals-sdk/ (e.g. goal_success_rate_evaluator.py, trajectory_evaluator.py) has a companion .mdx page in src/content/docs/user-guide/evals-sdk/evaluators/ that links to it with an "A complete example can be found [here]" reference. As-is, the chaos examples are undiscoverable from the docs site. The PR checklist marks "I have updated the documentation accordingly," but no doc page or nav entry was added.

Suggestion: Add at least one .mdx page (e.g. a "Chaos Testing" / resilience-evaluators page) that introduces the chaos module and links to these scripts, and wire it into the evals-sdk navigation. Note that the existing example links point to github.com/strands-agents/docs/blob/main/... (the archived repo) — any new page should reference the harness-sdk paths instead.

github-actions · 2026-06-10T20:40:30Z

Assessment: Comment

Solid set of chaos-testing examples — clear scenarios, good system prompts modeling realistic failure handling, and all four scripts compile cleanly. The earlier feedback on root-level imports and EvaluationReport.flatten has been addressed. Two items remain worth resolving before merge.

Review Categories

Documentation integration: The new examples are not linked from any .mdx page or nav entry, unlike every other example in this directory — they're currently undiscoverable from the docs site.
Consistency: Effect-class import paths vary within and across the four files (root strands_evals.chaos vs. strands_evals.chaos.effects); worth standardizing since examples model canonical usage.
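
One way to catch this kind of import drift is a small check built on the standard-library `ast` module. This is only a sketch and not part of the PR: it flags `from strands_evals.chaos.<submodule> import ...` statements so they can be rewritten against the root `strands_evals.chaos` path, and `SomeEffect` is a placeholder name, not a real class.

```python
import ast

CANONICAL = "strands_evals.chaos"


def non_canonical_imports(source: str) -> list[tuple[int, str]]:
    """Return (line, module) for absolute from-imports that target a submodule of CANONICAL."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            if node.module.startswith(CANONICAL + "."):
                hits.append((node.lineno, node.module))
    return sorted(hits)


# The source is only parsed, never imported, so strands_evals need not be installed.
sample = (
    "from strands_evals.chaos import ChaosCase, ChaosPlugin\n"
    "from strands_evals.chaos.effects import SomeEffect  # placeholder name\n"
)
print(non_canonical_imports(sample))
```

Running the function over the text of each of the four example scripts would list any remaining submodule imports by line number.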

Nice work demonstrating pre-hook, post-hook, and compound chaos with the ChaosCase.expand Cartesian-product pattern — the baseline-vs-injected comparison is a great teaching example.

github-actions · 2026-06-12T19:24:37Z

+
+    for config_name, system_prompt in configs.items():
+        def make_task(prompt):
+            def task_function(case: ChaosCase) -> dict:


Issue: These two advanced-pattern snippets (Pattern 1 here, and Pattern 3 "Multi-turn" further down) still use the old manual task style — def task_function(case) -> dict: that calls agent(case.input) and returns {"output": str(response)}. Every other snippet in this PR was migrated to @eval_task(TracedHandler()) per the earlier review feedback, so these two now stand out as inconsistent.

More importantly, the resilience evaluators are SESSION_LEVEL and rely on the conversation trace/trajectory. Pattern 1 wires up PartialCompletionEvaluator, but the task returns only {"output": ...} with no trajectory — which is exactly what TracedHandler captures automatically. As written, the evaluation here would likely be degraded or fail to find the trace.

Suggestion: Convert both patterns to the @eval_task(TracedHandler()) form used elsewhere (decorate the task and return Agent(...)), so they're consistent and actually feed the session trace to the evaluators.
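
Why the missing trajectory matters can be shown with a toy model in plain Python. Nothing here is the strands_evals API: `fake_agent`, `manual_task`, `traced_task`, and `recovery_score` are hypothetical stand-ins, and the assumption is simply that a session-level evaluator needs the tool-call history rather than only the final text.

```python
from typing import Optional


def fake_agent(prompt: str, trace: list[dict]) -> str:
    """Stand-in agent: records one failed tool call and one retry, then answers."""
    trace.append({"tool": "search", "status": "error"})
    trace.append({"tool": "search", "status": "ok"})
    return f"Done: {prompt}"


def manual_task(prompt: str) -> dict:
    """Manual style: only the final text is returned; the tool-call history is discarded."""
    return {"output": fake_agent(prompt, trace=[])}


def traced_task(prompt: str) -> dict:
    """Traced style: the tool-call history travels with the result."""
    trace: list[dict] = []
    output = fake_agent(prompt, trace)
    return {"output": output, "trajectory": trace}


def recovery_score(result: dict) -> Optional[float]:
    """Toy session-level check: 1.0 if a failed call was eventually followed by a success."""
    trajectory = result.get("trajectory")
    if not trajectory:
        return None  # nothing to evaluate without the trace
    statuses = [step["status"] for step in trajectory]
    recovered = "error" in statuses and statuses[-1] == "ok"
    return 1.0 if recovered else 0.0


print(recovery_score(manual_task("find flights")))
print(recovery_score(traced_task("find flights")))
```

The manual task yields no score because the evaluator has nothing to inspect, while the traced task is scored on its retry. In the real SDK, per the comment above, TracedHandler is what captures that history automatically.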

github-actions · 2026-06-12T19:24:38Z

Assessment: Comment (very close to approve)

Thanks for the thorough revision — all of my earlier feedback is fully addressed: doc pages added for the chaos framework and all three resilience evaluators, navigation wired in, links pointing at harness-sdk, imports unified to strands_evals.chaos, and the examples migrated to @eval_task(TracedHandler()). The chaos_testing.mdx guide is genuinely excellent — the effect-type tables, the "Interpreting Results" score-combination matrix, and the best-practices section are very useful. All four example scripts compile cleanly.

Remaining item

Consistency / correctness: Two of the three "Advanced Chaos Testing Patterns" snippets (Pattern 1 and Pattern 3) still use the old manual -> dict task style without a trajectory, while everything else now uses @eval_task(TracedHandler()). Since the resilience evaluators are SESSION_LEVEL, those snippets should be migrated too (inline comment above).

Nice work — once the two advanced snippets are aligned, this is ready to ship.

github-actions Bot added the size/l label Jun 9, 2026

ybdarrenwang requested a deployment to manual-approval June 9, 2026 16:23 — with GitHub Actions Waiting

ybdarrenwang had a problem deploying to manual-approval June 9, 2026 16:23 — with GitHub Actions Error

ybdarrenwang changed the title ~~add chaos testing and resilience evaluator example scripts~~ docs: add chaos testing and resilience evaluator example scripts Jun 9, 2026

poshinchen reviewed Jun 9, 2026

Comment thread site/docs/examples/evals-sdk/chaos_failure_communication_evaluator.py Outdated

poshinchen reviewed Jun 9, 2026

Comment thread site/docs/examples/evals-sdk/chaos_failure_communication_evaluator.py Outdated

docs: add chaos testing and resilience evaluator example scripts

99089be

ybdarrenwang force-pushed the docs/chaos-tool branch from 466b341 to 99089be Compare June 9, 2026 17:10

github-actions Bot added size/l and removed size/l labels Jun 9, 2026

ybdarrenwang requested a deployment to manual-approval June 9, 2026 17:10 — with GitHub Actions Waiting

ybdarrenwang had a problem deploying to manual-approval June 9, 2026 17:10 — with GitHub Actions Error

yonib05 added python Pull requests that update python code enhancement New feature or request documentation Documentation changes, improvements, additions, content updates, site improvements, examples, guides labels Jun 9, 2026

github-actions Bot added size/l and removed size/l labels Jun 10, 2026

ybdarrenwang requested a deployment to manual-approval June 10, 2026 16:19 — with GitHub Actions Waiting

ybdarrenwang had a problem deploying to manual-approval June 10, 2026 16:19 — with GitHub Actions Error

use run_evaluations_async as example

82a8d36

ybdarrenwang force-pushed the docs/chaos-tool branch from 53ad645 to 82a8d36 Compare June 10, 2026 16:24

github-actions Bot added size/l and removed size/l labels Jun 10, 2026

ybdarrenwang temporarily deployed to manual-approval June 10, 2026 16:24 — with GitHub Actions Inactive

ybdarrenwang had a problem deploying to manual-approval June 10, 2026 16:24 — with GitHub Actions Error

poshinchen previously approved these changes Jun 10, 2026

github-actions Bot added the strands-running label Jun 10, 2026

github-actions Bot reviewed Jun 10, 2026

Comment thread site/docs/examples/evals-sdk/chaos_failure_communication_evaluator.py Outdated

github-actions Bot removed the strands-running label Jun 10, 2026

ybdarrenwang mentioned this pull request Jun 11, 2026

docs: pages for chaos testing and resilience evaluators #2724

Open

7 tasks

ybdarrenwang added 2 commits June 11, 2026 16:26

docs: pages for chaos testing and resilience evaluators

edcb08c

fix import pattern and link display

eba2989

ybdarrenwang dismissed poshinchen’s stale review via eba2989 June 11, 2026 16:44

github-actions Bot added size/xl and removed size/l labels Jun 11, 2026

ybdarrenwang requested a deployment to manual-approval June 11, 2026 16:44 — with GitHub Actions Waiting

ybdarrenwang had a problem deploying to manual-approval June 11, 2026 16:44 — with GitHub Actions Error

ybdarrenwang changed the title ~~docs: add chaos testing and resilience evaluator example scripts~~ docs: add chaos testing and resilience evaluator doc pages and example scripts Jun 11, 2026

poshinchen reviewed Jun 11, 2026

Comment thread site/=4.8.2 Outdated

poshinchen reviewed Jun 11, 2026

Comment thread site/src/content/docs/user-guide/evals-sdk/evaluators/failure_communication_evaluator.mdx Outdated

poshinchen reviewed Jun 11, 2026

Comment thread site/src/content/docs/user-guide/evals-sdk/evaluators/failure_communication_evaluator.mdx Outdated

simplify test codes with @eval_task

c835eb3

github-actions Bot added size/xl and removed size/xl labels Jun 12, 2026

ybdarrenwang requested a deployment to manual-approval June 12, 2026 17:43 — with GitHub Actions Waiting

ybdarrenwang had a problem deploying to manual-approval June 12, 2026 17:43 — with GitHub Actions Error

polish chaos testing docand snippets

c906fda

github-actions Bot added size/xl and removed size/xl labels Jun 12, 2026

ybdarrenwang temporarily deployed to manual-approval June 12, 2026 18:18 — with GitHub Actions Inactive

ybdarrenwang requested a deployment to manual-approval June 12, 2026 18:18 — with GitHub Actions Waiting

github-actions Bot added the strands-running label Jun 12, 2026

github-actions Bot reviewed Jun 12, 2026

github-actions Bot removed the strands-running label Jun 12, 2026

ybdarrenwang wants to merge 6 commits into strands-agents:main from ybdarrenwang:docs/chaos-tool

3 participants
