-
Notifications
You must be signed in to change notification settings - Fork 2
feat(#105): add fault correlation engine for root-cause analysis #121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements a fault correlation engine for root-cause analysis that reduces noise by identifying hierarchical relationships between faults and grouping related faults into clusters. The feature addresses issue #105 by enabling automatic detection of fault cascades where symptom faults are muted when their root cause is identified.
Changes:
- Adds hierarchical correlation mode to detect root cause → symptom relationships with automatic symptom muting
- Adds auto-cluster mode to group similar faults occurring within time windows
- Extends GetFaults.srv, ClearFault.srv, and FaultEvent.msg with correlation fields (muted faults, clusters, auto-cleared codes)
- Implements YAML-based configuration with wildcard pattern matching
- Integrates correlation engine into fault_manager_node with parameter-based enablement
- Adds comprehensive unit and integration tests for pattern matching, config parsing, and correlation logic
- Updates REST API to expose correlation data via query parameters
- Includes complete documentation and Postman collection updates
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ros2_medkit_msgs/msg/MutedFaultInfo.msg | New message type for muted symptom fault metadata |
| src/ros2_medkit_msgs/msg/ClusterInfo.msg | New message type for fault cluster information |
| src/ros2_medkit_msgs/srv/GetFaults.srv | Extended with include_muted/include_clusters flags and response fields |
| src/ros2_medkit_msgs/srv/ClearFault.srv | Extended with auto_cleared_codes response field |
| src/ros2_medkit_msgs/msg/FaultEvent.msg | Extended with auto_cleared_codes for EVENT_CLEARED |
| src/ros2_medkit_msgs/CMakeLists.txt | Registers new message types |
| src/ros2_medkit_fault_manager/include//correlation/.hpp | Core correlation engine headers (types, config parser, pattern matcher, engine) |
| src/ros2_medkit_fault_manager/src/correlation/*.cpp | Core correlation engine implementation |
| src/ros2_medkit_fault_manager/src/fault_manager_node.cpp | Integrates correlation engine into fault processing and service handlers |
| src/ros2_medkit_fault_manager/include/*/fault_manager_node.hpp | Adds correlation engine member and helper methods |
| src/ros2_medkit_fault_manager/CMakeLists.txt | Adds correlation source files and unit test targets |
| src/ros2_medkit_fault_manager/test/test_*.cpp | Unit tests for pattern matcher, config parser, and correlation engine |
| src/ros2_medkit_fault_manager/test/test_integration.test.py | Integration tests for hierarchical and auto-cluster modes |
| src/ros2_medkit_fault_manager/test/test_correlation.yaml | Test correlation configuration |
| src/ros2_medkit_gateway/include/*/fault_manager.hpp | Extended get_faults signature with correlation parameters |
| src/ros2_medkit_gateway/src/fault_manager.cpp | Passes correlation parameters to service and formats response |
| src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp | Parses correlation query params and includes correlation data in responses |
| src/ros2_medkit_fault_manager/README.md | Documents correlation configuration and usage |
| docs/tutorials/fault-correlation.rst | Complete tutorial for configuring and using correlation |
| docs/tutorials/index.rst | Adds correlation tutorial to index |
| postman/collections/*.json | Updates API documentation with correlation endpoints |
src/ros2_medkit_fault_manager/src/correlation/correlation_engine.cpp
Outdated
Show resolved
Hide resolved
73ecb79 to
d4f6574
Compare
Implement hierarchical and auto-cluster correlation modes to reduce fault noise by identifying root causes and grouping related faults. Key features: - Hierarchical mode: detect root cause → symptom relationships - Auto-cluster mode: group similar faults within time window - YAML-based configuration with pattern wildcards - Muted faults tracking and auto-clear on root cause resolution - Gateway API support with include_muted/include_clusters params New messages: MutedFaultInfo.msg, ClusterInfo.msg Extended: GetFaults.srv, ClearFault.srv, FaultEvent.msg
Add new requests demonstrating fault correlation query parameters: - GET Faults with Correlation Data (include_muted + include_clusters) - GET Faults with Muted Details Only - GET Faults with Cluster Details Only Update existing endpoint descriptions to document: - muted_count/cluster_count fields in GET /faults response - auto_cleared_codes in DELETE clear fault response - Faults folder description with correlation feature overview
Add comprehensive tutorial covering: - Hierarchical mode (root cause → symptom relationships) - Auto-cluster mode (grouping similar faults) - YAML configuration with patterns and rules - REST API query parameters for correlation data - Complete example configuration - Troubleshooting guide
d4f6574 to
5c9cf0e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 29 out of 29 changed files in this pull request and generated 3 comments.
- Fix data race in PatternMatcher::get_compiled() by adding mutex protection for the mutable compiled_cache_ member - Add retroactive_mute_codes field to ProcessFaultResult for tracking faults that should have been muted before cluster activation - Populate retroactive_mute_codes when cluster transitions from pending to active (reaches min_count threshold) - Add debug logging for retroactive muting in fault_manager_node - Add unit test AutoClusterRetroactiveMuting to verify behavior
Add 5-second timer to periodically call cleanup_expired() on the correlation engine. This prevents stale pending root causes and pending clusters from accumulating in memory when faults are reported infrequently. Add unit tests: - CleanupExpiredRemovesPendingRootCauses - CleanupExpiredRemovesPendingClusters
- Skip regex compilation for non-wildcard patterns (perf improvement) - Add validation error for min_count=0 in auto-cluster rules - Add TODO documenting known limitation: pending_clusters_ not cleaned up when fault is cleared before cluster reaches min_count
a6b7781 to
e3f2c4d
Compare
src/ros2_medkit_fault_manager/src/correlation/config_parser.cpp
Outdated
Show resolved
Hide resolved
src/ros2_medkit_fault_manager/src/correlation/correlation_engine.cpp
Outdated
Show resolved
Hide resolved
src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/correlation/pattern_matcher.hpp
Show resolved
Hide resolved
- Add inline symptom codes support in YAML config (codes: [...] syntax) - Refactor severity_to_string/severity_rank to shared types.hpp/cpp - Add correlation.cleanup_interval_sec parameter (default 5.0s) - Update TODO(#105) → TODO(#127) for pending_clusters cleanup - Add PlantUML sequence diagrams to fault-correlation tutorial Addresses review comments: - Implement missing inline codes parsing (#1) - Fix orphaned TODO reference (#2) - Consolidate duplicated severity functions (#3) - Parameterize magic number 5s (#4) - Add flow diagrams to docs (#6)
bburda
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Pull Request
Summary
Implement hierarchical and auto-cluster correlation modes to reduce fault noise by identifying root causes and grouping related faults.
Key features:
New messages: MutedFaultInfo.msg, ClusterInfo.msg
Extended: GetFaults.srv, ClearFault.srv, FaultEvent.msg
Issue
Link the related issue (required):
Type
Testing
How was this tested / how should reviewers verify it?
Checklist