Distributed Incident Diagnosis for modern service systems.
Ops Commander does not stop at "an alert fired". It connects signals across services, builds causal context, and produces a report that explains:
- what failed
- why it likely failed
- how the failure propagated
- which competing root-cause hypotheses were considered
Imagine checkout failures spike.
Ops Commander links DB latency, service errors, API degradation, and worker backlog into one diagnosis chain.
Instead of "Service X looks red", you get "Failure likely started here, spread through these services, and these are the most likely root causes with confidence."
| Audience | What you get |
|---|---|
| Non-technical stakeholders | A readable incident narrative with propagation paths and confidence-ranked hypotheses |
| Engineering teams | End-to-end streaming pipeline, event normalization, correlation, causal graphing, and replay evaluation |
| Recruiters and interviewers | Real runnable system with curated showcase artifacts, not a static slide deck |
- Multi-service diagnosis, not single-service alert aggregation
- Mixed-signal evidence (metric, log, alert)
- Explicit propagation edges and path metadata
- Ranked competing hypotheses with component-level scoring
- Operational endpoints (health, readiness, metrics) and replay workflow
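To make "ranked competing hypotheses with component-level scoring" concrete, here is a minimal sketch of one way such a ranking could work. The component names, the weights, and the `Hypothesis` class are hypothetical and chosen for illustration only; the project's actual scoring weights live in its config.

```python
from dataclasses import dataclass

# Hypothetical component weights (assumption, not the project's real values).
WEIGHTS = {
    "temporal_precedence": 0.40,
    "topology_support": 0.35,
    "signal_strength": 0.25,
}


@dataclass
class Hypothesis:
    root_service: str
    components: dict[str, float]  # each component score is expected in [0, 1]

    @property
    def score(self) -> float:
        # Weighted sum of component scores; missing components count as 0.
        return sum(w * self.components.get(name, 0.0) for name, w in WEIGHTS.items())


def rank(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Return hypotheses ordered from most to least likely."""
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)


candidates = [
    Hypothesis("api-gateway", {"temporal_precedence": 0.2, "topology_support": 0.5, "signal_strength": 0.9}),
    Hypothesis("orders-db", {"temporal_precedence": 0.9, "topology_support": 0.8, "signal_strength": 0.7}),
]
ranked = rank(candidates)
for h in ranked:
    print(f"{h.root_service}: {h.score:.3f} {h.components}")
```

Keeping the per-component scores next to the total is what lets a report explain why one hypothesis outranks another, rather than showing only a single opaque number.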
flowchart LR
SIM[Simulator or Real Inputs] --> RAWLOG[events.raw.logs]
SIM --> RAWMET[events.raw.metrics]
SIM --> RAWALERT[events.raw.alerts]
RAWLOG --> ING[Ingestor]
RAWMET --> ING
RAWALERT --> ING
ING --> NORM[events.normalized]
NORM --> CORR[Correlator]
CORR --> CAND[incidents.candidates]
CAND --> CAUSAL[Causal Engine]
CAUSAL --> DIAG[incidents.diagnosed]
CAUSAL --> PG[(Postgres)]
ING --> PG
API[Reporter API] --> PG
API --> REPORT[incidents.reports]
API --> REPLAYQ[(replay_jobs)]
REPLAYQ --> RW[Replay Worker]
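The ingest step in the flowchart above (three raw topics feeding one normalized topic) can be sketched as follows. The topic names come from the diagram; the envelope fields (`event_id`, `signal_type`, `service`, `normalized_at`, `payload`) and the `normalize` function are assumptions for illustration, not the project's actual schema.

```python
import hashlib
import json
from datetime import datetime, timezone

# Topic names are taken from the flowchart; everything else is illustrative.
RAW_TOPIC_TO_SIGNAL = {
    "events.raw.logs": "log",
    "events.raw.metrics": "metric",
    "events.raw.alerts": "alert",
}
NORMALIZED_TOPIC = "events.normalized"


def normalize(raw_topic: str, payload: dict) -> dict:
    """Wrap a raw payload in a common envelope (hypothetical field names)."""
    signal = RAW_TOPIC_TO_SIGNAL[raw_topic]  # raises KeyError for unknown topics
    body = json.dumps(payload, sort_keys=True)
    return {
        # A content-derived ID means a repeated payload maps to the same ID,
        # which is one simple way to support deduplication downstream.
        "event_id": hashlib.sha256(f"{signal}:{body}".encode()).hexdigest()[:16],
        "signal_type": signal,
        "service": payload.get("service", "unknown"),
        "normalized_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }


event = normalize("events.raw.logs", {"service": "orders-service", "message": "timeout"})
print(NORMALIZED_TOPIC, event["signal_type"], event["event_id"])
```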
Detailed architecture notes: docs/architecture.md
Full docs index: docs/README.md
This repo includes a curated distributed-incident artifact bundle:
- artifacts/showcase/sample_distributed_incident/SHOWCASE_SUMMARY.md
- artifacts/showcase/sample_distributed_incident/incident_inc-038827426dfd4958.json
- artifacts/showcase/sample_distributed_incident/report_inc-038827426dfd4958.json
Sample highlights:
- affected_services: api-gateway, orders-db, orders-service, worker-service
- non-empty how_failure_propagated
- signal_coverage: alert, log, metric
- multiple ranked hypotheses with confidence
Folder guide: artifacts/showcase/README.md
- Python 3.11+
- Docker Desktop
Install and bootstrap:
python -m pip install -e ".[dev]"
./scripts/bootstrap.ps1

Run the services:
python -m apps.ingestor.main
python -m apps.simulator.main
python -m apps.correlator.main
python -m apps.causal_engine.main
python -m apps.reporter_api.main
python -m apps.replay_worker.main

Inject a distributed incident:
python scripts/inject_distributed_incident.py --chain-id elite-chain-demo

Capture showcase artifacts:
./scripts/capture_showcase.ps1 -WaitSeconds 45

API endpoints:
- GET /health
- GET /ops/readiness
- GET /ops/metrics
- GET /incidents
- GET /incidents/{id}
- GET /incidents/{id}/report
- POST /replay/jobs
- POST /ingest/alerts/alertmanager
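As a rough illustration of calling the read endpoints from Python, the sketch below builds the report URL for an incident and fetches it with the standard library. The base URL (`http://localhost:8000`) and the assumption that responses are JSON are guesses; adjust both to match your running Reporter API.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: where the Reporter API listens. Not confirmed by this README.
BASE_URL = "http://localhost:8000"


def report_url(incident_id: str, base_url: str = BASE_URL) -> str:
    """Build the URL for GET /incidents/{id}/report."""
    return f"{base_url.rstrip('/')}/incidents/{quote(incident_id, safe='')}/report"


def get_json(url: str, timeout: float = 5.0):
    """Fetch a URL and decode the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def fetch_report(incident_id: str):
    """Convenience wrapper: fetch the report for one incident."""
    return get_json(report_url(incident_id))
```

With the services running, `fetch_report("inc-038827426dfd4958")` would request the report for the showcase incident ID, and `get_json(f"{BASE_URL}/health")` would hit the health endpoint.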
- Log file tail to raw logs:
python scripts/feed_logs_from_file.py --file C:/path/to/app.log --service payments-api --follow
- Prometheus to raw metrics:
python scripts/feed_metrics_from_prom.py --prom-url http://localhost:9090 --service payments-api
- Alertmanager webhook to raw alerts:
POST /ingest/alerts/alertmanager
Detailed runbook: docs/real_world_inputs.md
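For manual testing of the alert path, the sketch below builds a payload in the general shape of Alertmanager's webhook format (version 4) and prepares a POST request for the ingest endpoint. The base URL, the receiver name, the `service` label, and the example timestamps are assumptions; which payload fields the endpoint actually reads is not specified here, so treat this as a starting point rather than a contract.

```python
import json
from urllib.request import Request

# Assumption: where the Reporter API listens. Adjust to your deployment.
BASE_URL = "http://localhost:8000"


def build_alertmanager_payload(service: str, alertname: str, severity: str = "critical") -> dict:
    """Build a firing-alert payload shaped like an Alertmanager webhook body."""
    labels = {"alertname": alertname, "service": service, "severity": severity}
    return {
        "version": "4",
        "groupKey": f'{{}}:{{alertname="{alertname}"}}',
        "status": "firing",
        "receiver": "ops-commander",  # hypothetical receiver name
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": {},
        "externalURL": "http://alertmanager.example",  # placeholder
        "alerts": [
            {
                "status": "firing",
                "labels": labels,
                "annotations": {"summary": f"{alertname} on {service}"},
                "startsAt": "2024-01-01T00:00:00Z",  # example timestamp
                "endsAt": "0001-01-01T00:00:00Z",  # zero time: alert still firing
            }
        ],
    }


def build_request(payload: dict, base_url: str = BASE_URL) -> Request:
    """Prepare (but do not send) the POST to the ingest endpoint."""
    return Request(
        f"{base_url.rstrip('/')}/ingest/alerts/alertmanager",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(build_alertmanager_payload("payments-api", "HighErrorRate"))
```

Passing `request` to `urllib.request.urlopen` sends it once the API is up.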
- Retry-backed producer and consumer commits
- Runtime metrics snapshots in each service
- Kafka lag visibility in processing services
- Chaos drills (scripts/chaos_drill.ps1)
- Replay evaluation metrics (top-1, top-3, MRR, calibration bins)
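For readers unfamiliar with the replay evaluation metrics, here is a small sketch of how top-k accuracy, mean reciprocal rank (MRR), and calibration bins are commonly computed. The function names and sample data are illustrative and are not taken from the replay worker's code.

```python
def top_k_accuracy(ranked_lists, truths, k):
    """Fraction of incidents whose true root cause is in the top k hypotheses."""
    hits = sum(1 for ranked, truth in zip(ranked_lists, truths) if truth in ranked[:k])
    return hits / len(truths)


def mean_reciprocal_rank(ranked_lists, truths):
    """Average of 1/rank of the true root cause (0 when it is not ranked at all)."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, truths):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(truths)


def calibration_bins(confidences, correct, n_bins=5):
    """Group top-1 predictions by confidence and compare confidence to accuracy."""
    buckets = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        buckets[index].append((conf, ok))
    rows = []
    for index, items in enumerate(buckets):
        if items:  # skip empty bins
            rows.append({
                "bin": (index / n_bins, (index + 1) / n_bins),
                "count": len(items),
                "mean_confidence": sum(c for c, _ in items) / len(items),
                "accuracy": sum(1 for _, ok in items if ok) / len(items),
            })
    return rows


# Illustrative replay results: four incidents, true root cause is always orders-db.
ranked_lists = [
    ["orders-db", "api-gateway", "worker-service"],
    ["api-gateway", "orders-db", "orders-service"],
    ["worker-service", "orders-service", "api-gateway", "orders-db"],
    ["api-gateway", "worker-service"],
]
truths = ["orders-db"] * 4
top1 = top_k_accuracy(ranked_lists, truths, k=1)
top3 = top_k_accuracy(ranked_lists, truths, k=3)
mrr = mean_reciprocal_rank(ranked_lists, truths)
bins = calibration_bins([0.9, 0.7, 0.5, 0.25], [True, False, False, False], n_bins=4)
```

A well-calibrated system shows `mean_confidence` close to `accuracy` in every bin; large gaps mean the reported confidence is over- or under-stated.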
Troubleshooting guide: docs/troubleshooting.md
- apps/simulator: synthetic telemetry and fault generation
- apps/ingestor: validation, normalization, dedupe, quarantine
- apps/correlator: event-time windows, cluster linking, candidate emission
- apps/causal_engine: causal graph build, ranking, explainable diagnosis
- apps/reporter_api: incident/report/replay APIs and ingest endpoint
- apps/replay_worker: replay jobs and diagnosis evaluation artifacts
- libs/common: shared models, kafka helpers, observability, resilience
- libs/storage: SQLAlchemy models and repositories
- schemas: versioned event and diagnosis schemas
- config: service topology, scenarios, scoring weights
- scripts: bootstrap, demo, chaos, ingestion helpers, showcase capture
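The correlator's "event-time windows" can be pictured with the simplified sketch below: events are bucketed by their own timestamps (not by arrival time) into fixed tumbling windows, and a window becomes an incident candidate when it touches several services. The window size, the event fields (`ts`, `service`), and the threshold are hypothetical; the real correlator also performs cluster linking, which this sketch omits.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical window size


def window_start(event_ts: float, size: int = WINDOW_SECONDS) -> int:
    """Start of the tumbling window that contains the event's own timestamp."""
    return int(event_ts // size) * size


def find_candidates(events, min_services: int = 2):
    """Emit one candidate per window that involves at least `min_services` services."""
    windows = defaultdict(list)
    for event in events:
        windows[window_start(event["ts"])].append(event)

    candidates = []
    for start in sorted(windows):
        services = sorted({e["service"] for e in windows[start]})
        if len(services) >= min_services:
            candidates.append({
                "window_start": start,
                "services": services,
                "event_count": len(windows[start]),
            })
    return candidates


events = [
    {"ts": 120, "service": "orders-db"},
    {"ts": 130, "service": "orders-service"},
    {"ts": 150, "service": "api-gateway"},
    {"ts": 200, "service": "worker-service"},  # alone in its window, so no candidate
]
found = find_candidates(events)
```

Because bucketing uses the timestamp carried by each event, late-arriving events still land in the window where they occurred.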
- schemas/event_envelope_v1.json
- schemas/log_event_v1.json
- schemas/metric_event_v1.json
- schemas/alert_event_v1.json
- schemas/incident_candidate_v1.json
- schemas/diagnosis_v1.json
Regenerate schema files:
python scripts/export_schemas.py