Skip to content

Conversation

@mfaferek93
Copy link
Collaborator

@mfaferek93 mfaferek93 commented Jan 19, 2026

Pull Request

Summary

Implement hierarchical and auto-cluster correlation modes to reduce fault noise by identifying root causes and grouping related faults.

Key features:

  • Hierarchical mode: detect root cause → symptom relationships
  • Auto-cluster mode: group similar faults within time window
  • YAML-based configuration with pattern wildcards
  • Muted faults tracking and auto-clear on root cause resolution
  • Gateway API support with include_muted/include_clusters params

New messages: MutedFaultInfo.msg, ClusterInfo.msg
Extended: GetFaults.srv, ClearFault.srv, FaultEvent.msg


Issue

Link the related issue (required):


Type

  • Bug fix
  • New feature or tests
  • Breaking change
  • Documentation only

Testing

How was this tested / how should reviewers verify it?


Checklist

  • Breaking changes are clearly described (and announced in docs / changelog if needed)
  • Tests were added or updated if needed
  • Docs were updated if behavior or public API changed

@mfaferek93 mfaferek93 marked this pull request as ready for review January 19, 2026 19:55
Copilot AI review requested due to automatic review settings January 19, 2026 19:55
@mfaferek93 mfaferek93 marked this pull request as draft January 19, 2026 19:56
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a fault correlation engine for root-cause analysis that reduces noise by identifying hierarchical relationships between faults and grouping related faults into clusters. The feature addresses issue #105 by enabling automatic detection of fault cascades where symptom faults are muted when their root cause is identified.

Changes:

  • Adds hierarchical correlation mode to detect root cause → symptom relationships with automatic symptom muting
  • Adds auto-cluster mode to group similar faults occurring within time windows
  • Extends GetFaults.srv, ClearFault.srv, and FaultEvent.msg with correlation fields (muted faults, clusters, auto-cleared codes)
  • Implements YAML-based configuration with wildcard pattern matching
  • Integrates correlation engine into fault_manager_node with parameter-based enablement
  • Adds comprehensive unit and integration tests for pattern matching, config parsing, and correlation logic
  • Updates REST API to expose correlation data via query parameters
  • Includes complete documentation and Postman collection updates

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/ros2_medkit_msgs/msg/MutedFaultInfo.msg New message type for muted symptom fault metadata
src/ros2_medkit_msgs/msg/ClusterInfo.msg New message type for fault cluster information
src/ros2_medkit_msgs/srv/GetFaults.srv Extended with include_muted/include_clusters flags and response fields
src/ros2_medkit_msgs/srv/ClearFault.srv Extended with auto_cleared_codes response field
src/ros2_medkit_msgs/msg/FaultEvent.msg Extended with auto_cleared_codes for EVENT_CLEARED
src/ros2_medkit_msgs/CMakeLists.txt Registers new message types
src/ros2_medkit_fault_manager/include//correlation/.hpp Core correlation engine headers (types, config parser, pattern matcher, engine)
src/ros2_medkit_fault_manager/src/correlation/*.cpp Core correlation engine implementation
src/ros2_medkit_fault_manager/src/fault_manager_node.cpp Integrates correlation engine into fault processing and service handlers
src/ros2_medkit_fault_manager/include/*/fault_manager_node.hpp Adds correlation engine member and helper methods
src/ros2_medkit_fault_manager/CMakeLists.txt Adds correlation source files and unit test targets
src/ros2_medkit_fault_manager/test/test_*.cpp Unit tests for pattern matcher, config parser, and correlation engine
src/ros2_medkit_fault_manager/test/test_integration.test.py Integration tests for hierarchical and auto-cluster modes
src/ros2_medkit_fault_manager/test/test_correlation.yaml Test correlation configuration
src/ros2_medkit_gateway/include/*/fault_manager.hpp Extended get_faults signature with correlation parameters
src/ros2_medkit_gateway/src/fault_manager.cpp Passes correlation parameters to service and formats response
src/ros2_medkit_gateway/src/http/handlers/fault_handlers.cpp Parses correlation query params and includes correlation data in responses
src/ros2_medkit_fault_manager/README.md Documents correlation configuration and usage
docs/tutorials/fault-correlation.rst Complete tutorial for configuring and using correlation
docs/tutorials/index.rst Adds correlation tutorial to index
postman/collections/*.json Updates API documentation with correlation endpoints

@mfaferek93 mfaferek93 force-pushed the 105/faultsMutating branch 2 times, most recently from 73ecb79 to d4f6574 Compare January 19, 2026 20:14
@mfaferek93 mfaferek93 self-assigned this Jan 19, 2026
@mfaferek93 mfaferek93 added the enhancement New feature or request label Jan 19, 2026
Implement hierarchical and auto-cluster correlation modes to reduce
fault noise by identifying root causes and grouping related faults.

Key features:
- Hierarchical mode: detect root cause → symptom relationships
- Auto-cluster mode: group similar faults within time window
- YAML-based configuration with pattern wildcards
- Muted faults tracking and auto-clear on root cause resolution
- Gateway API support with include_muted/include_clusters params

New messages: MutedFaultInfo.msg, ClusterInfo.msg
Extended: GetFaults.srv, ClearFault.srv, FaultEvent.msg
Add new requests demonstrating fault correlation query parameters:
- GET Faults with Correlation Data (include_muted + include_clusters)
- GET Faults with Muted Details Only
- GET Faults with Cluster Details Only

Update existing endpoint descriptions to document:
- muted_count/cluster_count fields in GET /faults response
- auto_cleared_codes in DELETE clear fault response
- Faults folder description with correlation feature overview
Add comprehensive tutorial covering:
- Hierarchical mode (root cause → symptom relationships)
- Auto-cluster mode (grouping similar faults)
- YAML configuration with patterns and rules
- REST API query parameters for correlation data
- Complete example configuration
- Troubleshooting guide
@mfaferek93 mfaferek93 marked this pull request as ready for review January 20, 2026 18:18
@mfaferek93 mfaferek93 requested a review from Copilot January 20, 2026 18:19
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated 3 comments.

- Fix data race in PatternMatcher::get_compiled() by adding mutex
  protection for the mutable compiled_cache_ member
- Add retroactive_mute_codes field to ProcessFaultResult for tracking
  faults that should have been muted before cluster activation
- Populate retroactive_mute_codes when cluster transitions from
  pending to active (reaches min_count threshold)
- Add debug logging for retroactive muting in fault_manager_node
- Add unit test AutoClusterRetroactiveMuting to verify behavior
Add 5-second timer to periodically call cleanup_expired() on the
correlation engine. This prevents stale pending root causes and
pending clusters from accumulating in memory when faults are
reported infrequently.

Add unit tests:
- CleanupExpiredRemovesPendingRootCauses
- CleanupExpiredRemovesPendingClusters
@mfaferek93
Copy link
Collaborator Author

hey @bburda @nnarain PR grew a bit 😅

where to start: docs/tutorials/fault_correlation.rst - explains how it works (hierarchical vs auto-cluster)

The benefit is that 55 new tests should give peace of mind 🙂

- Skip regex compilation for non-wildcard patterns (perf improvement)
- Add validation error for min_count=0 in auto-cluster rules
- Add TODO documenting known limitation: pending_clusters_ not cleaned
  up when fault is cleared before cluster reaches min_count
- Add inline symptom codes support in YAML config (codes: [...] syntax)
- Refactor severity_to_string/severity_rank to shared types.hpp/cpp
- Add correlation.cleanup_interval_sec parameter (default 5.0s)
- Update TODO(#105) → TODO(#127) for pending_clusters cleanup
- Add PlantUML sequence diagrams to fault-correlation tutorial

Addresses review comments:
- Implement missing inline codes parsing (#1)
- Fix orphaned TODO reference (#2)
- Consolidate duplicated severity functions (#3)
- Parameterize magic number 5s (#4)
- Add flow diagrams to docs (#6)
@mfaferek93 mfaferek93 requested a review from bburda January 21, 2026 19:02
Copy link
Collaborator

@bburda bburda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mfaferek93 mfaferek93 merged commit 8414140 into main Jan 21, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add fault correlation/muting capability to FaultManager to handle "expected cascades" of faults.

4 participants