476 changes: 476 additions & 0 deletions docs/tutorials/fault-correlation.rst

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions docs/tutorials/index.rst
@@ -11,6 +11,7 @@ Step-by-step guides for common use cases with ros2_medkit.
authentication
https
snapshots
fault-correlation
docker
devcontainer
integration
@@ -39,6 +40,9 @@ Basic Tutorials
:doc:`snapshots`
Configure snapshot capture for fault debugging.

:doc:`fault-correlation`
Configure fault correlation for root-cause analysis and noise reduction.

:doc:`docker`
Deploy ros2_medkit in Docker containers.

83 changes: 81 additions & 2 deletions postman/collections/ros2-medkit-gateway.postman_collection.json
@@ -1015,6 +1015,7 @@
},
{
"name": "Faults",
"description": "Fault management API for reporting, querying, and clearing faults.\n\n**Features:**\n- Query faults by status (pending/confirmed/cleared)\n- Real-time SSE streaming for fault events\n- Topic snapshots captured at fault confirmation\n- **Fault Correlation**: Automatic root-cause analysis and fault clustering (configure via correlation.config_file parameter)",
"item": [
{
"name": "GET Fault Events Stream (SSE)",
@@ -1049,7 +1050,7 @@
"faults"
]
},
"description": "List all faults across the entire system. Convenience API for dashboards and monitoring tools. By default returns PENDING and CONFIRMED faults."
"description": "List all faults across the entire system. Convenience API for dashboards and monitoring tools. By default returns PENDING and CONFIRMED faults.\n\n**Correlation fields (always included when correlation is enabled):**\n- `muted_count`: Number of symptom faults muted by correlation rules\n- `cluster_count`: Number of active fault clusters\n\nUse `include_muted=true` or `include_clusters=true` for detailed correlation data."
},
"response": []
},
@@ -1161,7 +1162,7 @@
"SENSOR_OVERTEMP"
]
},
"description": "Clear a fault by fault_code (REQ_INTEROP_015). Changes fault status to CLEARED. Returns success status with component_id and fault_code."
"description": "Clear a fault by fault_code (REQ_INTEROP_015). Changes fault status to CLEARED. Returns success status with component_id and fault_code.\n\n**With correlation enabled:**\nIf the cleared fault is a root cause with `auto_clear_with_root=true`, the response includes `auto_cleared_codes[]` listing all symptom faults that were automatically cleared."
},
"response": []
},
@@ -1186,6 +1187,84 @@
},
"response": []
},
{
"name": "GET Faults with Correlation Data",
"request": {
"method": "GET",
"header": [],
"url": {
"raw": "{{base_url}}/faults?include_muted=true&include_clusters=true",
"host": [
"{{base_url}}"
],
"path": [
"faults"
],
"query": [
{
"key": "include_muted",
"value": "true",
"description": "Include detailed muted fault information"
},
{
"key": "include_clusters",
"value": "true",
"description": "Include detailed cluster information"
}
]
},
"description": "List faults with full correlation data. When correlation is enabled:\n\n**Response always includes:**\n- `muted_count`: Number of faults muted as symptoms of root causes\n- `cluster_count`: Number of active fault clusters\n\n**With include_muted=true:**\n- `muted_faults[]`: Array of muted fault details (fault_code, root_cause_code, rule_id, delay_ms)\n\n**With include_clusters=true:**\n- `clusters[]`: Array of cluster details (cluster_id, representative_code, fault_codes[], rule_name, first_at, last_at)\n\n**Correlation Modes:**\n1. HIERARCHICAL: Identifies root cause → symptom relationships. Symptoms are muted.\n2. AUTO_CLUSTER: Groups similar faults within a time window into clusters."
},
"response": []
},
{
"name": "GET Faults with Muted Details Only",
"request": {
"method": "GET",
"header": [],
"url": {
"raw": "{{base_url}}/faults?include_muted=true",
"host": [
"{{base_url}}"
],
"path": [
"faults"
],
"query": [
{
"key": "include_muted",
"value": "true"
}
]
},
"description": "List faults with muted fault details. Muted faults are symptoms that were suppressed because a root cause was detected.\n\nEach muted fault includes:\n- `fault_code`: The symptom fault code\n- `root_cause_code`: The root cause that triggered muting\n- `rule_id`: Which correlation rule matched\n- `delay_ms`: Time delay from root cause to symptom"
},
"response": []
},
{
"name": "GET Faults with Cluster Details Only",
"request": {
"method": "GET",
"header": [],
"url": {
"raw": "{{base_url}}/faults?include_clusters=true",
"host": [
"{{base_url}}"
],
"path": [
"faults"
],
"query": [
{
"key": "include_clusters",
"value": "true"
}
]
},
"description": "List faults with cluster details. Clusters are groups of similar faults that occurred within a time window.\n\nEach cluster includes:\n- `cluster_id`: Unique cluster identifier\n- `rule_id`, `rule_name`: The auto-cluster rule that created this cluster\n- `representative_code`: The fault shown as the cluster representative (based on rule config: first, most_recent, or highest_severity)\n- `representative_severity`: Severity of the representative fault\n- `fault_codes[]`: All fault codes in the cluster\n- `first_at`, `last_at`: Timestamps of first and last faults in cluster"
},
"response": []
},
{
"name": "GET Fault Snapshots (System-wide)",
"request": {
22 changes: 22 additions & 0 deletions src/ros2_medkit_fault_manager/CMakeLists.txt
@@ -46,6 +46,10 @@ add_library(fault_manager_lib STATIC
src/fault_storage.cpp
src/sqlite_fault_storage.cpp
src/snapshot_capture.cpp
src/correlation/types.cpp
src/correlation/config_parser.cpp
src/correlation/pattern_matcher.cpp
src/correlation/correlation_engine.cpp
)

target_include_directories(fault_manager_lib PUBLIC
@@ -119,6 +123,18 @@ if(BUILD_TESTING)
target_link_libraries(test_snapshot_capture fault_manager_lib)
ament_target_dependencies(test_snapshot_capture rclcpp ros2_medkit_msgs)

# Correlation config parser tests
ament_add_gtest(test_correlation_config_parser test/test_correlation_config_parser.cpp)
target_link_libraries(test_correlation_config_parser fault_manager_lib)

# Pattern matcher tests
ament_add_gtest(test_pattern_matcher test/test_pattern_matcher.cpp)
target_link_libraries(test_pattern_matcher fault_manager_lib)

# Correlation engine tests
ament_add_gtest(test_correlation_engine test/test_correlation_engine.cpp)
target_link_libraries(test_correlation_engine fault_manager_lib)

# Apply coverage flags to test targets
if(ENABLE_COVERAGE)
target_compile_options(test_fault_manager PRIVATE --coverage -O0 -g)
@@ -127,6 +143,12 @@ if(BUILD_TESTING)
target_link_options(test_sqlite_storage PRIVATE --coverage)
target_compile_options(test_snapshot_capture PRIVATE --coverage -O0 -g)
target_link_options(test_snapshot_capture PRIVATE --coverage)
target_compile_options(test_correlation_config_parser PRIVATE --coverage -O0 -g)
target_link_options(test_correlation_config_parser PRIVATE --coverage)
target_compile_options(test_pattern_matcher PRIVATE --coverage -O0 -g)
target_link_options(test_pattern_matcher PRIVATE --coverage)
target_compile_options(test_correlation_engine PRIVATE --coverage -O0 -g)
target_link_options(test_correlation_engine PRIVATE --coverage)
endif()

# Integration tests
151 changes: 151 additions & 0 deletions src/ros2_medkit_fault_manager/README.md
@@ -46,6 +46,7 @@ ros2 service call /fault_manager/clear_fault ros2_medkit_msgs/srv/ClearFault \
- **Persistent storage**: SQLite backend ensures faults survive node restarts
- **Debounce filtering** (optional): AUTOSAR DEM-style counter-based fault confirmation
- **Snapshot capture**: Captures topic data when faults are confirmed for debugging (snapshots are deleted when fault is cleared)
- **Fault correlation** (optional): Root cause analysis with symptom muting and auto-clear

## Parameters

@@ -181,6 +182,156 @@ Event types: `0` = EVENT_FAILED, `1` = EVENT_PASSED

CRITICAL severity faults bypass debounce and are immediately CONFIRMED, regardless of threshold.

## Advanced: Fault Correlation

Fault correlation reduces noise by identifying relationships between faults. When enabled, symptom faults
(effects of a root cause) can be muted, and they can be cleared automatically when the root cause is cleared.

### Correlation Modes

**Hierarchical**: Defines explicit root cause → symptoms relationships. When a root cause fault occurs,
subsequent matching symptom faults within a time window are correlated and optionally muted.

**Auto-Cluster**: Automatically groups related faults that match a pattern within a time window.
Useful for detecting "storms" of related faults (e.g., communication errors).

### Configuration

Enable correlation by providing a YAML configuration file:

```bash
ros2 run ros2_medkit_fault_manager fault_manager_node --ros-args \
-p correlation.config_file:=/path/to/correlation.yaml
```

### Correlation Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `correlation.config_file` | string | `""` | Path to correlation YAML config (empty = disabled) |
| `correlation.cleanup_interval_sec` | double | `5.0` | Interval for cleaning up expired pending correlations (seconds) |

### Configuration File Format

```yaml
correlation:
  enabled: true
  default_window_ms: 500  # Default time window for symptom detection

  # Reusable fault patterns (supports wildcards with *)
  patterns:
    motor_errors:
      codes: ["MOTOR_COMM_*", "MOTOR_TIMEOUT_*"]
    drive_faults:
      codes: ["DRIVE_*"]
    comm_errors:
      codes: ["*_COMM_*", "*_TIMEOUT"]

  rules:
    # Hierarchical rule: E-Stop causes motor and drive faults
    - id: estop_cascade
      name: "E-Stop Cascade"
      mode: hierarchical
      root_cause:
        codes: ["ESTOP_001", "ESTOP_002"]
      symptoms:
        - pattern: motor_errors
        - pattern: drive_faults
      window_ms: 1000             # Symptoms within 1s of root cause
      mute_symptoms: true         # Don't publish symptom events
      auto_clear_with_root: true  # Clear symptoms when root cause clears

    # Auto-cluster rule: Group communication errors
    - id: comm_storm
      name: "Communication Storm"
      mode: auto_cluster
      match:
        - pattern: comm_errors
      min_count: 3                      # Need 3 faults to form cluster
      window_ms: 500                    # Within 500ms
      show_as_single: true              # Only show representative fault
      representative: highest_severity  # first | most_recent | highest_severity
```
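
As a rough illustration of what the `comm_storm` rule above asks for, the following Python sketch groups matching faults and picks a representative. It is not the fault manager's implementation: the `Fault` class, the numeric `severity` field (higher means more severe), and the choice to anchor the window at the earliest fault in the group are all assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Fault:
    code: str
    severity: int  # assumption: larger number = more severe
    at_ms: int     # report time in milliseconds


def _matches(pattern: str, code: str) -> bool:
    # Treat '*' as "any run of characters"; everything else is literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, code) is not None


def find_cluster(faults, patterns, min_count, window_ms):
    """Return the first group of >= min_count matching faults that fall
    within window_ms of the group's earliest fault, or None."""
    matching = sorted(
        (f for f in faults if any(_matches(p, f.code) for p in patterns)),
        key=lambda f: f.at_ms,
    )
    for i, first in enumerate(matching):
        group = [f for f in matching[i:] if f.at_ms - first.at_ms <= window_ms]
        if len(group) >= min_count:
            return group
    return None


def pick_representative(group, strategy):
    """Select the fault shown for the cluster, per the `representative` key."""
    if strategy == "first":
        return min(group, key=lambda f: f.at_ms)
    if strategy == "most_recent":
        return max(group, key=lambda f: f.at_ms)
    if strategy == "highest_severity":
        return max(group, key=lambda f: f.severity)
    raise ValueError(f"unknown representative strategy: {strategy}")


# Hypothetical fault reports; BATTERY_LOW matches no pattern and is ignored.
reports = [
    Fault("MOTOR_COMM_FL", 2, 0),
    Fault("BATTERY_LOW", 1, 100),
    Fault("SENSOR_TIMEOUT", 3, 200),
    Fault("DRIVE_COMM_ERR", 2, 333),
]
cluster = find_cluster(reports, ["*_COMM_*", "*_TIMEOUT"], min_count=3, window_ms=500)
representative = pick_representative(cluster, "highest_severity")
```

With `show_as_single: true`, only `representative` would be shown for the whole group; the other members remain part of the cluster's `fault_codes`.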

### Pattern Wildcards

Patterns support `*` wildcard matching:
- `MOTOR_*` matches `MOTOR_COMM`, `MOTOR_TIMEOUT`, `MOTOR_DRIVE_FAULT`
- `*_COMM_*` matches `MOTOR_COMM_FL`, `SENSOR_COMM_TIMEOUT`
- `*_TIMEOUT` matches `MOTOR_TIMEOUT`, `SENSOR_TIMEOUT`
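
The matching rules listed above can be reproduced with a few lines of Python. This is a minimal sketch, assuming that `*` matches any run of characters (including none), that all other characters are literal, and that matching is case-sensitive and covers the whole fault code. The helper name `wildcard_match` is made up for this example and is not part of ros2_medkit.

```python
import re


def wildcard_match(pattern: str, fault_code: str) -> bool:
    """Return True if fault_code matches pattern, where '*' is a wildcard."""
    # Escape each literal piece so characters like '.' are not treated
    # as regex syntax, then join the pieces with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, fault_code) is not None


def first_matching_pattern(patterns, fault_code):
    """Return the first pattern that matches fault_code, or None."""
    for pattern in patterns:
        if wildcard_match(pattern, fault_code):
            return pattern
    return None
```

For instance, `wildcard_match("*_TIMEOUT", "MOTOR_TIMEOUT_FL")` is false under these assumptions, because the pattern has no trailing `*` and the whole code must match.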

### Querying Correlation Data

Use `include_muted` and `include_clusters` to retrieve correlation information:

```bash
# Get confirmed faults with muted fault and cluster details
ros2 service call /fault_manager/get_faults ros2_medkit_msgs/srv/GetFaults \
"{statuses: ['CONFIRMED'], include_muted: true, include_clusters: true}"
```

Response includes:
- `muted_count`: Number of muted symptom faults
- `cluster_count`: Number of active fault clusters
- `muted_faults[]`: Details of muted faults (when `include_muted=true`)
- `clusters[]`: Details of active clusters (when `include_clusters=true`)

### REST API (via Gateway)

Query parameters for GET `/api/v1/faults`:
- `include_muted=true`: Include muted fault details in response
- `include_clusters=true`: Include cluster details in response

Response fields:
```json
{
  "faults": [...],
  "count": 5,
  "muted_count": 2,
  "cluster_count": 1,
  "muted_faults": [
    {
      "fault_code": "MOTOR_COMM_FL",
      "root_cause_code": "ESTOP_001",
      "rule_id": "estop_cascade",
      "delay_ms": 50
    }
  ],
  "clusters": [
    {
      "cluster_id": "comm_storm_1",
      "rule_id": "comm_storm",
      "rule_name": "Communication Storm",
      "representative_code": "SENSOR_TIMEOUT",
      "representative_severity": "CRITICAL",
      "fault_codes": ["MOTOR_COMM_FL", "SENSOR_TIMEOUT", "DRIVE_COMM_ERR"],
      "count": 3,
      "first_at": 1705678901.123,
      "last_at": 1705678901.456
    }
  ]
}
```
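
A client could condense such a response as in the sketch below. It assumes the field names shown above and treats `muted_faults` and `clusters` as optional, since they appear only when the matching `include_*` parameter is set. The function name `summarize_correlation` is hypothetical, and the sample payload is a trimmed copy of the example with an empty `faults` array.

```python
import json


def summarize_correlation(response: dict) -> dict:
    """Condense the correlation fields of a GET /faults response."""
    # Map each muted symptom to the root cause that muted it.
    symptom_to_root = {
        m["fault_code"]: m["root_cause_code"]
        for m in response.get("muted_faults", [])
    }
    # Map each cluster to its representative and member count.
    clusters = {
        c["cluster_id"]: (c["representative_code"], len(c["fault_codes"]))
        for c in response.get("clusters", [])
    }
    return {
        "muted_count": response.get("muted_count", 0),
        "cluster_count": response.get("cluster_count", 0),
        "symptom_to_root": symptom_to_root,
        "clusters": clusters,
    }


sample = json.loads("""
{
  "faults": [],
  "count": 5,
  "muted_count": 2,
  "cluster_count": 1,
  "muted_faults": [
    {"fault_code": "MOTOR_COMM_FL", "root_cause_code": "ESTOP_001",
     "rule_id": "estop_cascade", "delay_ms": 50}
  ],
  "clusters": [
    {"cluster_id": "comm_storm_1", "rule_id": "comm_storm",
     "representative_code": "SENSOR_TIMEOUT",
     "fault_codes": ["MOTOR_COMM_FL", "SENSOR_TIMEOUT", "DRIVE_COMM_ERR"]}
  ]
}
""")
summary = summarize_correlation(sample)
```

Using `.get()` with defaults keeps the same function usable for responses requested without `include_muted` or `include_clusters`.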

When clearing a root cause fault, `auto_cleared_codes` lists symptoms that were auto-cleared:
```json
{
  "status": "success",
  "fault_code": "ESTOP_001",
  "message": "Fault cleared",
  "auto_cleared_codes": ["MOTOR_COMM_FL", "MOTOR_COMM_FR", "DRIVE_FAULT"]
}
```

### Example: E-Stop Cascade

1. E-Stop is triggered → `ESTOP_001` fault reported
2. Motors lose power → `MOTOR_COMM_FL`, `MOTOR_COMM_FR` faults reported
3. Correlation engine detects motor faults are symptoms of E-Stop
4. Motor faults are muted (not published as events, but stored)
5. Dashboard shows only `ESTOP_001` (root cause)
6. When E-Stop is cleared → Motor faults are auto-cleared
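
The sequence above can be modelled in a few lines of Python. This is a simplified sketch of the idea, not the code in `correlation_engine.cpp`: the class and method names are invented for illustration, timestamps are plain millisecond integers, and a symptom is linked to the first active root cause whose window it falls into.

```python
import re
from dataclasses import dataclass


def _matches(pattern: str, code: str) -> bool:
    # '*' matches any run of characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, code) is not None


@dataclass
class HierarchicalRule:
    rule_id: str
    root_codes: list
    symptom_patterns: list
    window_ms: int
    mute_symptoms: bool = True
    auto_clear_with_root: bool = True


class HierarchicalCorrelator:
    def __init__(self, rule: HierarchicalRule):
        self.rule = rule
        self.active_roots = {}  # root cause code -> report time (ms)
        self.symptoms = {}      # symptom code -> root cause code

    def report(self, code: str, now_ms: int) -> bool:
        """Record a fault; return True if it is muted as a symptom."""
        if code in self.rule.root_codes:
            self.active_roots[code] = now_ms
            return False
        if any(_matches(p, code) for p in self.rule.symptom_patterns):
            for root, reported_at in self.active_roots.items():
                if 0 <= now_ms - reported_at <= self.rule.window_ms:
                    self.symptoms[code] = root
                    return self.rule.mute_symptoms
        return False

    def clear(self, code: str) -> list:
        """Clear a fault; return the symptom codes auto-cleared with it."""
        if code not in self.active_roots:
            self.symptoms.pop(code, None)
            return []
        del self.active_roots[code]
        linked = sorted(s for s, r in self.symptoms.items() if r == code)
        for symptom in linked:
            del self.symptoms[symptom]  # the link is dropped either way
        return linked if self.rule.auto_clear_with_root else []


# Replay the E-Stop cascade with the estop_cascade rule from the config.
rule = HierarchicalRule(
    rule_id="estop_cascade",
    root_codes=["ESTOP_001", "ESTOP_002"],
    symptom_patterns=["MOTOR_COMM_*", "MOTOR_TIMEOUT_*", "DRIVE_*"],
    window_ms=1000,
)
engine = HierarchicalCorrelator(rule)
root_muted = engine.report("ESTOP_001", 0)         # step 1: root cause
fl_muted = engine.report("MOTOR_COMM_FL", 50)      # step 2: symptom in window
fr_muted = engine.report("MOTOR_COMM_FR", 60)      # step 2: symptom in window
late_muted = engine.report("MOTOR_COMM_RL", 5000)  # outside the 1 s window
auto_cleared = engine.clear("ESTOP_001")           # step 6: auto-clear
```

In this model a fault reported after `window_ms` has elapsed is treated as an independent fault, so it is neither muted nor auto-cleared.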

## Building

```bash