chaos-engineer: add execution guardrails, separate security chaos scope, and define metrics #181

@bamboocity

Summary

The chaos-engineer agent has three issues that could lead to unsafe or ambiguous behavior when used in production systems. All proposed changes are backward-compatible and almost entirely additive; the only possible removal is one illustrative metric value (see Problem 3).

Problem 1: No Execution Guardrails

The agent declares Read, Write, Edit, Bash, Glob, Grep tools and states it will "design and execute controlled failure experiments" and "Implement chaos experiments". However, the definition lacks:

  • Approval gates before execution
  • Environment restrictions (nothing prevents running in production)
  • Stop conditions or kill switches
  • TTL limits on experiment duration

Proposed addition: A required safety configuration block placed after the existing checklist:

Safety configuration (required before execution):
- environment_allowlist: [non-prod, staging]  # production requires explicit override
- required_approvals: 2  # minimum approvals from system owner + SRE
- stop_conditions: [error_rate > 5%, SLO breach, manual abort]
- kill_switch: reference to the mechanism that halts all active experiments
- rollback_verification: confirm rollback procedure works BEFORE starting
- ttl: 2h  # experiments must not run longer without re-approval
- slo_guardrails: [availability > 99.9%, latency_p99 < 500ms]
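As a rough illustration of how the stop conditions and SLO guardrails above might be evaluated during an experiment, here is a minimal Python sketch. The function name `should_abort` and the metric keys (`error_rate`, `availability`, `latency_p99_ms`) are hypothetical and not part of the agent definition; only the thresholds come from the proposed block.

```python
def should_abort(metrics, manual_abort=False):
    """Return the tripped stop conditions; a non-empty list means halt.

    `metrics` is an assumed dict with fractions for error_rate and
    availability (0.05 == 5%) and milliseconds for latency_p99_ms.
    """
    tripped = []
    if manual_abort:
        tripped.append("manual abort")
    if metrics["error_rate"] > 0.05:
        tripped.append("error_rate > 5%")
    # Guardrails state what must hold, so a breach is the negation.
    if not metrics["availability"] > 0.999:
        tripped.append("SLO breach: availability")
    if not metrics["latency_p99_ms"] < 500:
        tripped.append("SLO breach: latency_p99")
    return tripped


healthy = {"error_rate": 0.01, "availability": 0.9995, "latency_p99_ms": 320}
degraded = {"error_rate": 0.08, "availability": 0.9995, "latency_p99_ms": 650}
print(should_abort(healthy))
print(should_abort(degraded))
```

Returning the list of tripped conditions, rather than a bare boolean, lets whatever implements the kill switch record why an experiment was halted.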

Default behavior should be plan-only; execution requires an explicit approved-execution context.
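One way to picture the plan-only default is a gate that returns "execute" only when every precondition passes. The sketch below is an assumption about how such a gate could look, not the agent's actual mechanism: `SafetyConfig`, `decide_mode`, and the `approved_execution` flag are hypothetical names, and only three of the proposed fields are modeled.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class SafetyConfig:
    # Field names mirror the proposed block; the types are assumptions.
    environment_allowlist: list = field(
        default_factory=lambda: ["non-prod", "staging"]
    )
    required_approvals: int = 2
    ttl: timedelta = timedelta(hours=2)


def decide_mode(config, environment, approvals, duration,
                approved_execution=False):
    """Return ("execute", []) if all gates pass, else ("plan-only", reasons)."""
    reasons = []
    if not approved_execution:
        reasons.append("no approved-execution context")
    if environment not in config.environment_allowlist:
        reasons.append(f"environment '{environment}' is not in the allowlist")
    distinct = len(set(approvals))  # the same approver twice counts once
    if distinct < config.required_approvals:
        reasons.append(
            f"need {config.required_approvals} distinct approvals, got {distinct}"
        )
    if duration > config.ttl:
        reasons.append("requested duration exceeds ttl")
    return ("execute", []) if not reasons else ("plan-only", reasons)


cfg = SafetyConfig()
print(decide_mode(cfg, "staging", ["owner", "sre"], timedelta(hours=1),
                  approved_execution=True))
print(decide_mode(cfg, "production", ["owner"], timedelta(hours=3)))
```

Because `approved_execution` defaults to `False`, a caller that supplies nothing beyond the experiment details always lands in plan-only mode, which matches the conservative default proposed here.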

Problem 2: Security Chaos Scope is Too Broad

The "Security chaos" section mixes resilience-adjacent items with security-specific scenarios:

Resilience testing (appropriate here):

  • Certificate rotation
  • Key rotation
  • Firewall changes
  • DDoS simulation

Security-specific (should be delegated):

  • Authorization bypass
  • Breach scenarios

These require different expertise, authorization levels, and legal review. Mixing them risks using the chaos-engineer agent for unauthorized security testing.

Proposed fix: Add a delegation note within the section:

Note: authorization bypass and breach scenarios require security-engineer expertise and explicit authorization. Delegate these to the security-engineer agent; do not execute them within a chaos experiment without legal review.
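The delegation rule can also be expressed as a small routing table. This is only a sketch: the scenario names come from the "Security chaos" section as quoted in this issue, while `route_scenario` and the `needs-review` fallback for unlisted scenarios are hypothetical additions.

```python
# Scenario names as listed in this issue; matching is case-insensitive.
DELEGATED_TO_SECURITY = {"authorization bypass", "breach scenarios"}
RESILIENCE_SCOPE = {
    "certificate rotation",
    "key rotation",
    "firewall changes",
    "ddos simulation",
}


def route_scenario(name):
    """Return the agent that should own a security-chaos scenario."""
    key = name.strip().lower()
    if key in DELEGATED_TO_SECURITY:
        return "security-engineer"
    if key in RESILIENCE_SCOPE:
        return "chaos-engineer"
    # Assumption: anything unlisted is held for a human decision.
    return "needs-review"


for scenario in ["Key rotation", "Authorization bypass", "DNS poisoning"]:
    print(scenario, "->", route_scenario(scenario))
```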

Problem 3: Metrics Are Undefined

The progress tracking example shows "mttr_reduction": "65%", and the delivery notification includes: "improving system resilience score from 2.3 to 4.1".

Neither metric is defined:

  • resilience score: no scale, formula, or components
  • mttr_reduction: no baseline period, incident type, or scope

Proposed fix:

  • Remove the specific 2.3 → 4.1 example from the delivery notification (or add formula/scale)
  • Add a note near mttr_reduction: "compare against pre-program baseline, scoped to experiments run"
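To make the baseline note concrete, here is one possible definition of `mttr_reduction`, written as a hedged sketch: it assumes MTTR is the arithmetic mean of recovery times in minutes, and that both samples cover the same incident types and scope. The function signature and the sample numbers are illustrative only and are not taken from the agent definition.

```python
def mttr_reduction(baseline_minutes, current_minutes):
    """Percentage reduction in mean time to recovery versus a baseline.

    Both arguments are lists of recovery times in minutes. They should
    cover the same incident types and scope, or the comparison is
    meaningless.
    """
    if not baseline_minutes or not current_minutes:
        raise ValueError("both samples must contain at least one incident")
    baseline = sum(baseline_minutes) / len(baseline_minutes)
    current = sum(current_minutes) / len(current_minutes)
    if baseline == 0:
        raise ValueError("baseline MTTR must be greater than zero")
    return 100 * (baseline - current) / baseline


# Illustrative numbers only: baseline mean 50 min, current mean 17.5 min.
print(mttr_reduction([60, 40], [20, 15]))  # 65.0
```

Stating the baseline period and incident scope next to the reported figure is what makes a value such as "65%" reproducible.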

Proposed Changes

  1. Add "Safety configuration (required)" section after the chaos engineering checklist
  2. Add delegation note in the "Security chaos" section
  3. Update delivery notification to remove or define the resilience score example, and add the baseline note near mttr_reduction

Impact

  • No breaking change to existing planning-mode use
  • Conservative defaults protect against accidental production chaos
  • Clearer scope reduces misuse risk for unauthorized security testing
  • All changes are instructional edits to the markdown; no tools or capabilities are removed
