chaos-engineer: add execution guardrails, separate security chaos scope, and define metrics #181


## Summary

The `chaos-engineer` agent definition has three issues that could lead to unsafe or ambiguous behavior when it is used against production systems. The proposed changes are backward-compatible additions; the only possible removal is the undefined resilience score example (see Problem 3).

## Problem 1: No Execution Guardrails

The agent declares `Read, Write, Edit, Bash, Glob, Grep` tools and states it will "design and **execute** controlled failure experiments" and "**Implement** chaos experiments". However, the definition lacks:

- Approval gates before execution
- Environment restrictions (nothing prevents running in production)
- Stop conditions or kill switches
- TTL limits on experiment duration

**Proposed addition:** A required safety configuration block placed after the existing checklist:

```
Safety configuration (required before execution):
- environment_allowlist: [non-prod, staging]  # production requires explicit override
- required_approvals: 2  # minimum approvals from system owner + SRE
- stop_conditions: [error_rate > 5%, SLO breach, manual abort]
- kill_switch: reference to the mechanism that halts all active experiments
- rollback_verification: confirm rollback procedure works BEFORE starting
- ttl: 2h  # experiments must not run longer without re-approval
- slo_guardrails: [availability > 99.9%, latency_p99 < 500ms]
```
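
One way the `stop_conditions`, `slo_guardrails`, and `ttl` entries above could be evaluated during a run is sketched below. The `Snapshot` type and `should_stop` function are hypothetical names, not part of the agent definition. The thresholds are copied from the proposed block; how the metrics are collected is left open and would depend on the monitoring system in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time metrics for the system under test."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    availability: float     # fraction of successful checks, 0.0 to 1.0
    latency_p99_ms: float   # 99th percentile latency in milliseconds
    manual_abort: bool = False


def should_stop(snapshot: Snapshot, started_at: datetime, now: datetime,
                ttl: timedelta = timedelta(hours=2)) -> list[str]:
    """Return the reasons to halt; an empty list means the run may continue."""
    reasons = []
    if snapshot.manual_abort:
        reasons.append("manual abort")
    if snapshot.error_rate > 0.05:
        reasons.append("error_rate > 5%")
    # The guardrails state what must hold, so a breach is the negation.
    if not snapshot.availability > 0.999:
        reasons.append("SLO breach: availability")
    if not snapshot.latency_p99_ms < 500:
        reasons.append("SLO breach: latency_p99")
    if now - started_at > ttl:
        reasons.append("ttl exceeded")
    return reasons


if __name__ == "__main__":
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    degraded = Snapshot(error_rate=0.08, availability=0.9995, latency_p99_ms=120.0)
    print(should_stop(degraded, start, start + timedelta(minutes=30)))
```

Returning every triggered reason, rather than only the first, keeps the abort record useful for the post-experiment review.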

Default behavior should be `plan-only`; execution requires explicit `approved-execution` context.
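
The gating described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the agent's actual mechanism: `ExecutionRequest`, `resolve_mode`, and the `production_override` flag are hypothetical names, while the allowlist and approval count mirror the proposed safety configuration.

```python
from dataclasses import dataclass

# Values mirror the proposed safety configuration block.
ENVIRONMENT_ALLOWLIST = frozenset({"non-prod", "staging"})
REQUIRED_APPROVALS = 2


@dataclass(frozen=True)
class ExecutionRequest:
    """Hypothetical description of what the caller is asking the agent to do."""
    environment: str
    approvals: frozenset = frozenset()   # identities of approvers
    context: str = "plan-only"           # default: never execute
    production_override: bool = False    # explicit opt-in for other environments


def resolve_mode(request: ExecutionRequest) -> tuple[str, list[str]]:
    """Return ("execute" or "plan-only", list of blockers)."""
    if request.context != "approved-execution":
        # Without explicit context the agent only plans; nothing is blocked.
        return "plan-only", []

    blockers = []
    if (request.environment not in ENVIRONMENT_ALLOWLIST
            and not request.production_override):
        blockers.append(f"environment '{request.environment}' is not allowlisted")
    if len(request.approvals) < REQUIRED_APPROVALS:
        blockers.append(
            f"{len(request.approvals)} of {REQUIRED_APPROVALS} approvals present"
        )
    return ("plan-only", blockers) if blockers else ("execute", [])


if __name__ == "__main__":
    request = ExecutionRequest(
        environment="production",
        approvals=frozenset({"system-owner", "sre"}),
        context="approved-execution",
    )
    print(resolve_mode(request))
```

Falling back to `plan-only` with a list of blockers, instead of raising an error, keeps the agent useful for planning while still refusing to run anything.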

## Problem 2: Security Chaos Scope is Too Broad

The "Security chaos" section mixes resilience-adjacent items with security-specific scenarios:

**Resilience testing (appropriate here):**
- Certificate rotation
- Key rotation
- Firewall changes
- DDoS simulation

**Security-specific (should be delegated):**
- Authorization bypass
- Breach scenarios

The security-specific items require different expertise, authorization levels, and legal review than resilience testing does. Keeping them in the same section risks the chaos-engineer agent being used for unauthorized security testing.

**Proposed fix:** Add a delegation note within the section:

> Note: `authorization bypass` and `breach scenarios` require security-engineer expertise and explicit authorization. Delegate these to the `security-engineer` agent; do not execute them within a chaos experiment without legal review.
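
The split could also be expressed as an explicit routing table, shown here as a minimal sketch. The scenario names come from the lists above; the `route_scenario` function and the choice to reject unknown scenarios are assumptions made for illustration.

```python
# Scenario names taken from the "Security chaos" section.
RESILIENCE_SCENARIOS = frozenset({
    "certificate rotation",
    "key rotation",
    "firewall changes",
    "ddos simulation",
})
DELEGATED_SCENARIOS = frozenset({
    "authorization bypass",
    "breach scenarios",
})


def route_scenario(name: str) -> str:
    """Return the agent that should own a security chaos scenario."""
    key = name.strip().lower()
    if key in DELEGATED_SCENARIOS:
        return "security-engineer"
    if key in RESILIENCE_SCENARIOS:
        return "chaos-engineer"
    # Fail closed: an unclassified scenario is never run by default.
    raise ValueError(f"unclassified scenario: {name!r}")


if __name__ == "__main__":
    for scenario in ("Key rotation", "Authorization bypass"):
        print(scenario, "->", route_scenario(scenario))
```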

## Problem 3: Metrics Are Undefined

The progress tracking example shows `"mttr_reduction": "65%"`, and the delivery notification includes: *"improving system resilience score from 2.3 to 4.1"*.

Neither metric is defined:
- `resilience score`: no scale, formula, or components
- `mttr_reduction`: no baseline period, incident type, or scope

**Proposed fix:**
- Remove the specific `2.3 → 4.1` example from the delivery notification, or define the score's scale and formula alongside it
- Add a note near `mttr_reduction`: "compare against pre-program baseline, scoped to experiments run"
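
To make the baseline requirement concrete, here is one possible definition of `mttr_reduction`. The agent definition does not give a formula, so this is an assumption: the percentage drop in mean time to recovery between a pre-program baseline set of incidents and a current set with the same scope. The function name and the use of minutes are illustrative.

```python
from statistics import mean
from typing import Sequence


def mttr_reduction(baseline_minutes: Sequence[float],
                   current_minutes: Sequence[float]) -> float:
    """Percentage reduction in mean time to recovery versus the baseline.

    Both inputs are recovery times, in minutes, for incidents of the same
    type and scope; the baseline comes from before the chaos program began.
    """
    if not baseline_minutes or not current_minutes:
        raise ValueError("both periods need at least one incident")
    baseline = mean(baseline_minutes)
    if baseline <= 0:
        raise ValueError("baseline MTTR must be positive")
    return (baseline - mean(current_minutes)) / baseline * 100


if __name__ == "__main__":
    # Baseline MTTR of 50 minutes, current MTTR of 17.5 minutes.
    print(f"{mttr_reduction([60, 40], [20, 15]):.0f}%")
```

Reporting the two incident sets alongside the percentage lets a reader check that a figure such as `65%` was measured against a comparable baseline.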

## Proposed Changes

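1. Add "Safety configuration (required)" section after the chaos engineering checklist
2. Make `plan-only` the default mode; execution requires explicit `approved-execution` context
3. Add delegation note in the "Security chaos" section
4. Update delivery notification to remove or define the resilience score example
5. Add a baseline and scope note near `mttr_reduction`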

## Impact

- No breaking change to existing planning-mode use
- Conservative defaults protect against accidental production chaos
- Clearer scope reduces misuse risk for unauthorized security testing
- All changes are instructional edits to the agent's markdown definition

## References

- [Chaos Engineering Principles](https://principlesofchaos.org/)
- [AWS Well-Architected: Reliability Pillar — failure injection](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/failure-management.html)
