Summary
The chaos-engineer agent has three issues that could lead to unsafe or ambiguous behavior when used in production systems. All proposed changes are backward-compatible additions; nothing is removed.
Problem 1: No Execution Guardrails
The agent declares the Read, Write, Edit, Bash, Glob, and Grep tools and states that it will "design and execute controlled failure experiments" and "Implement chaos experiments". However, the definition lacks:
- Approval gates before execution
- Environment restrictions (nothing prevents running in production)
- Stop conditions or kill switches
- TTL limits on experiment duration
Proposed addition: A required safety configuration block placed after the existing checklist:
Safety configuration (required before execution):
- environment_allowlist: [non-prod, staging] # production requires explicit override
- required_approvals: 2 # minimum approvals from system owner + SRE
- stop_conditions: [error_rate > 5%, SLO breach, manual abort]
- kill_switch: reference to the mechanism that halts all active experiments
- rollback_verification: confirm rollback procedure works BEFORE starting
- ttl: 2h # experiments must not run longer without re-approval
- slo_guardrails: [availability > 99.9%, latency_p99 < 500ms]
Default behavior should be plan-only; execution requires an explicit approved-execution context.
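To illustrate how the plan-only default could be enforced, here is a minimal Python sketch of a pre-execution gate. It is not part of the agent definition: the `SafetyConfig` class, the `resolve_mode` function, and the field types are hypothetical names chosen for illustration, and the sketch covers only the checks that can be evaluated before an experiment starts (stop conditions and SLO guardrails are runtime concerns, and the production override is not modeled).

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class SafetyConfig:
    # Field names mirror the proposed block; the types are assumptions.
    environment_allowlist: list = field(default_factory=lambda: ["non-prod", "staging"])
    required_approvals: int = 2
    kill_switch: str = ""            # reference to the mechanism that halts experiments
    rollback_verified: bool = False  # set to True only after a rollback dry run
    ttl: timedelta = timedelta(hours=2)


def resolve_mode(config, environment, approvals, requested_duration,
                 approved_execution=False):
    """Return ("execute", []) only if every check passes, else ("plan-only", reasons)."""
    reasons = []
    if not approved_execution:
        reasons.append("no approved-execution context")
    if environment not in config.environment_allowlist:
        reasons.append(f"environment '{environment}' is not in the allowlist")
    if len(set(approvals)) < config.required_approvals:
        reasons.append("not enough distinct approvals")
    if not config.kill_switch:
        reasons.append("no kill switch configured")
    if not config.rollback_verified:
        reasons.append("rollback procedure not verified")
    if requested_duration > config.ttl:
        reasons.append("requested duration exceeds ttl")
    return ("execute", []) if not reasons else ("plan-only", reasons)
```

Calling `resolve_mode` with no approvals and no approved-execution flag falls back to plan-only and lists every unmet requirement, which matches the conservative default described above.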
Problem 2: Security Chaos Scope is Too Broad
The "Security chaos" section mixes resilience-adjacent items with security-specific scenarios:
Resilience testing (appropriate here):
- Certificate rotation, Key rotation, Firewall changes, DDoS simulation
Security-specific (should be delegated):
- Authorization bypass
- Breach scenarios
The security-specific scenarios require different expertise, authorization levels, and legal review. Keeping both categories in one section risks the chaos-engineer agent being used for unauthorized security testing.
Proposed fix: Add a delegation note within the section:
Note: authorization bypass and breach scenarios require security-engineer expertise and explicit authorization. Delegate these to the security-engineer agent; do not execute them within a chaos experiment without legal review.
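The split above could also be expressed as a simple routing table. The following Python sketch is illustrative only: the scenario names come from the "Security chaos" section, while `route_scenario`, the two sets, and the "needs-review" fallback are assumptions and not part of the agent definition.

```python
# Scenario names are taken from the "Security chaos" section;
# the routing table itself is a hypothetical illustration.
RESILIENCE_SCENARIOS = {
    "certificate rotation",
    "key rotation",
    "firewall changes",
    "ddos simulation",
}
DELEGATED_SCENARIOS = {"authorization bypass", "breach scenarios"}


def route_scenario(name):
    """Return the agent that should own a scenario, or "needs-review" if unknown."""
    key = name.strip().lower()
    if key in DELEGATED_SCENARIOS:
        return "security-engineer"
    if key in RESILIENCE_SCENARIOS:
        return "chaos-engineer"
    # Unlisted scenarios are not assumed safe; a human decides.
    return "needs-review"
```

Sending unknown scenarios to review rather than to the chaos-engineer agent keeps the default conservative.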
Problem 3: Metrics Are Undefined
The progress tracking example shows "mttr_reduction": "65%", and the delivery notification includes: "improving system resilience score from 2.3 to 4.1".
Neither metric is defined:
- resilience score: no scale, formula, or components
- mttr_reduction: no baseline period, incident type, or scope
Proposed fix:
- Remove the specific 2.3 → 4.1 example from the delivery notification (or add formula/scale)
- Add a note near mttr_reduction: "compare against pre-program baseline, scoped to experiments run"
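One possible way to make mttr_reduction well defined is to compute it as the percentage drop in mean recovery time between a baseline period and the current period. The Python sketch below assumes that definition; the function name, the use of minutes, and the choice of a simple mean are illustrative assumptions, not something the agent definition specifies.

```python
from statistics import mean


def mttr_reduction(baseline_minutes, current_minutes):
    """Percent reduction in mean time to recovery relative to the baseline period.

    Both arguments are lists of recovery times (in minutes) for incidents of the
    same type and scope; the baseline list covers the pre-program period.
    """
    if not baseline_minutes or not current_minutes:
        raise ValueError("both periods need at least one incident")
    baseline = mean(baseline_minutes)
    current = mean(current_minutes)
    if baseline <= 0:
        raise ValueError("baseline MTTR must be positive")
    return (baseline - current) / baseline * 100
```

For example, a baseline of [60, 100, 80] minutes (mean 80) against current recoveries of [30, 26, 28] minutes (mean 28) corresponds to a 65% reduction under this definition.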
Proposed Changes
- Add "Safety configuration (required)" section after the chaos engineering checklist
- Add delegation note in the "Security chaos" section
- Update the delivery notification to remove or define the resilience score example
- Add a baseline note near mttr_reduction in the progress tracking example
Impact
- No breaking change to existing planning-mode use
- Conservative defaults protect against accidental production chaos
- Clearer scope reduces misuse risk for unauthorized security testing
- All changes are purely instructional additions to the markdown