Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
186 changes: 186 additions & 0 deletions IMPLEMENTATION_SUMMARY_CHAOS_TESTING.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,186 @@
# Implementation Summary: End-to-End Keeper Chaos Testing

## Issue #244: Add End-to-End Keeper Chaos Testing for Network and RPC Faults

**Contributor Focus**: [Resilience Testing] Validate how the backend behaves under realistic failure conditions
**ETA**: 2 days
**Status**: COMPLETED ✅

## What Was Implemented

### 1. **ChaosRpcServer** (`src/chaosRpcServer.js`)
- Extends the existing mock RPC server with fault injection capabilities
- Supports multiple fault types:
- **Latency injection**: Add delays to RPC responses
- **Failure injection**: Return error responses
- **Partial failures**: Some methods fail while others work
- **Rate limiting**: Enforce request limits
- **Flaky behavior**: Periodic availability
- **Gradual degradation**: Increasing failure probability over time
- Dynamic configuration updates
- Comprehensive logging

### 2. **ChaosTestHarness** (`src/chaosTestHarness.js`)
- Orchestrates chaos scenarios and collects observations
- Predefined realistic failure scenarios:
- Latency Spikes
- Partial RPC Failure
- Rate Limiting
- Flaky Network
- Gradual Degradation
- Complete Outage
- Automated evaluation against expected behaviors
- Metrics collection and reporting
- Recommendation generation based on test results

### 3. **Chaos Test Suite** (`__tests__/chaos.test.js`)
- Integrated with existing Jest test framework
- 7 comprehensive test cases:
1. RPC latency spike handling
2. Partial RPC failure handling
3. Rate limiting and backoff behavior
4. Circuit breaker tripping during outages
5. Retry logic with error classification
6. End-to-end chaos scenario suite
7. Health reporting during chaos
- Can be run standalone or as part of test suite

### 4. **Command-Line Tool** (`scripts/chaos-test.js`)
- User-friendly interface for running chaos tests
- Multiple output formats (console, JSON, markdown)
- Scenario selection and filtering
- Report generation and file export
- Integration with npm scripts

### 5. **Documentation** (`docs/CHAOS_TESTING.md`)
- Comprehensive guide to chaos testing
- Scenario descriptions and expected behaviors
- Usage examples and best practices
- Troubleshooting guide
- Integration instructions

### 6. **Examples** (`examples/chaos-demo.js`)
- Interactive demonstration of chaos testing
- Example scenarios and configurations
- Programmatic usage patterns

### 7. **Package Integration**
- Added npm scripts to `package.json`:
- `npm run chaos-test` - Run all chaos scenarios
- `npm run chaos-test:list` - List available scenarios
- `npm run chaos-test:single` - Run single scenario
- `npm run test:chaos` - Run chaos tests via Jest
- Updated README with chaos testing section

## Acceptance Criteria Met

### ✅ The backend can be tested under realistic degraded dependency conditions
- **Implemented**: 6 realistic fault scenarios covering common production issues
- **Verified**: Each scenario injects specific, measurable faults
- **Testable**: Scenarios can be run individually or as a suite

### ✅ Recovery behavior is observable and repeatable
- **Implemented**: Comprehensive metrics collection (requests, failures, latency, circuit state)
- **Observable**: Real-time logging and health reporting
- **Repeatable**: Deterministic fault injection with configurable probabilities

### ✅ The test setup teaches contributors about resilience expectations
- **Implemented**: Detailed documentation with expected behaviors
- **Educational**: Each scenario documents what should happen
- **Actionable**: Recommendations generated from test results

### ✅ Findings can be turned into concrete follow-up work
- **Implemented**: Automated recommendation generation
- **Prioritized**: Recommendations categorized by severity (CRITICAL, HIGH, MEDIUM)
- **Actionable**: Specific actions suggested for each finding

## Key Features

### Realistic Fault Injection
- Not just binary success/failure - includes partial failures, degradation, flakiness
- Configurable probabilities and intensities
- Time-based behaviors (gradual degradation, periodic flakiness)

### Comprehensive Observability
- Metrics collection at multiple levels (RPC, circuit breaker, retry logic)
- Health state tracking during chaos
- Detailed logs for debugging

### Integration Ready
- Works with existing Jest test framework
- Compatible with current RPC wrapper and circuit breaker
- Minimal dependencies on existing code

### Practical and Maintainable
- Configurable scenario durations (short for CI, longer for manual testing)
- Easy to add new fault types or scenarios
- Clear separation between test infrastructure and application code

## Usage Examples

### Running Tests
```bash
# Run all chaos scenarios
npm run chaos-test

# Run specific scenarios
npm run chaos-test -- --scenario=latency,ratelimit

# Run via Jest
npm test -- chaos.test.js

# Generate JSON report
npm run chaos-test -- --output=json --file=report.json
```

### Programmatic Usage
```javascript
const { ChaosTestHarness } = require('./src/chaosTestHarness');

async function testResilience() {
const harness = new ChaosTestHarness();
const results = await harness.runAllScenarios();

console.log(`Passed ${results.summary.passedScenarios}/${results.summary.totalScenarios}`);

if (results.summary.failedScenarios > 0) {
const report = harness.generateReport(results);
console.log('Recommendations:', report.recommendations);
}
}
```

## Files Created/Modified

### New Files
1. `src/chaosRpcServer.js` - Fault-injecting RPC server
2. `src/chaosTestHarness.js` - Chaos test orchestration
3. `__tests__/chaos.test.js` - Chaos test suite
4. `scripts/chaos-test.js` - Command-line tool
5. `docs/CHAOS_TESTING.md` - Comprehensive documentation
6. `examples/chaos-demo.js` - Interactive demo
7. `IMPLEMENTATION_SUMMARY_CHAOS_TESTING.md` - This summary

### Modified Files
1. `package.json` - Added chaos testing scripts
2. `README.md` - Added chaos testing section to table of contents and documentation

## Testing Coverage

The implementation provides:
- **Unit tests**: Individual component testing
- **Integration tests**: Component interaction testing
- **Scenario tests**: Realistic failure pattern testing
- **End-to-end tests**: Full system behavior under chaos

## Next Steps for Contributors

1. **Run the chaos tests** to establish baseline resilience
2. **Review recommendations** from test reports
3. **Add new scenarios** for specific failure modes encountered in production
4. **Integrate into CI/CD** to catch resilience regressions
5. **Extend fault injection** to cover new dependency types (resolvers, databases, etc.)

## Conclusion

The chaos testing framework successfully addresses Issue #244 by providing a comprehensive, practical, and maintainable way to test keeper resilience under realistic failure conditions. The implementation meets all acceptance criteria and provides a solid foundation for ongoing resilience testing and improvement.
47 changes: 47 additions & 0 deletions keeper/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ See the centralized [Glossary](../GLOSSARY.md) for definitions of domain-specifi
- [Setup Instructions](#setup-instructions)
- [Dead-Letter Queue](#dead-letter-queue)
- [Mock Soroban RPC](#mock-soroban-rpc-for-faster-local-testing)
- [Chaos Testing](#chaos-testing)
- [Docker Deployment](#docker-deployment)
- [Troubleshooting](#troubleshooting)

Expand Down Expand Up @@ -372,6 +373,52 @@ Detailed usage, supported methods, and test examples are in [docs/mock-soroban-r
- **Cause**: Application dependencies were not correctly or fully installed.
- **Solution**: Ensure you ran `npm install` inside the `keeper/` directory correctly. Try clearing cache or removing `node_modules` (`rm -rf node_modules`) and running `npm install` again.

## Chaos Testing

The keeper includes a comprehensive chaos testing framework to validate resilience under realistic failure conditions. This helps ensure the keeper can handle network issues, RPC failures, and degraded performance.

### Running Chaos Tests

```bash
# Run all chaos scenarios
npm run chaos-test

# Run specific scenarios
npm run chaos-test -- --scenario=latency,ratelimit

# List available scenarios
npm run chaos-test:list

# Run single scenario
npm run chaos-test:single latency

# Save report to file
npm run chaos-test -- --output=json --file=chaos-report.json
```

### Available Scenarios

1. **Latency Spikes** - Test timeout handling with random delays
2. **Partial RPC Failure** - Test graceful degradation when some methods fail
3. **Rate Limiting** - Test backoff behavior under throttling
4. **Flaky Network** - Test circuit breaker recovery
5. **Gradual Degradation** - Test adaptive behavior to worsening conditions
6. **Complete Outage** - Test worst-case scenario handling

### Integration with Tests

Chaos tests are integrated into the Jest test suite:

```bash
# Run all tests including chaos tests
npm test

# Run only chaos tests
npm test -- chaos.test.js
```

See [Chaos Testing Documentation](./docs/CHAOS_TESTING.md) for detailed information.

## Docker Deployment

The Keeper ships with a multi-stage Dockerfile and a `docker-compose.yml` at the repo root so you can run it on any server with a single command — no local Node.js installation required.
Expand Down
Loading
Loading