-
Notifications
You must be signed in to change notification settings - Fork 3
impl: Fault Lib Initial Impl #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
impl: Fault Lib Initial Impl #4
Conversation
First implementation proposal for faultlib
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a solid early draft - the overall architecture makes sense :)
That said, there's still quite a bit of work needed before this can be merged:
- Several critical issues (memory leaks, potential deadlocks, panic-happy error handling)
- Many todo!() and unimplemented!() in production code paths
- Mismatch between what's documented and what's actually implemented
- Some design decisions that need team discussion (e.g. FaultCatalog location, debouncing scope, Weak vs Arc for global state)
As it stands, this PR needs a new owner to drive it to completion - the current state isn't mergeable but it's a great foundation to build on
| BadDescriptor(&'static str), | ||
| #[error("other: {0}")] | ||
| Other(&'static str), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using &'static str forces callers to Box::leak dynamic strings, causing memory leaks
Suggestion: Change to Cow<'static, str> and remove Copy derive:
| BadDescriptor(&'static str), | |
| #[error("other: {0}")] | |
| Other(&'static str), | |
| BadDescriptor(Cow<'static, str>), | |
| Other(Cow<'static, str>), |
| impl From<FaultManagerError> for SinkError { | ||
| fn from(err: FaultManagerError) -> Self { | ||
| match err { | ||
| FaultManagerError::SendError(msg) => SinkError::Other(Box::leak(msg.into_boxed_str())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Box::leak here causes memory leak on every SendError!
Fix: After changing SinkError to Cow (see my coomet in sink_error.rs):
FaultManagerError::SendError(msg) => SinkError::Other(Cow::Owned(msg))
|
|
||
| impl FaultManagerSink { | ||
| pub(crate) fn new() -> Self { | ||
| let (tx, rx) = mpsc::channel(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unbounded channel - can grow forever if IPC is slow or DFM is down
|
|
||
| pub fn run<S: SovdFaultStateStorage>(&self, mut processor: FaultRecordProcessor<S>) { | ||
| info!("FaultLibCommunicator listening..."); | ||
| while self.node.wait(DIAGNOSTIC_FAULT_MANAGER_LISTENER_CYCLE_TIME).is_ok() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This loop runs forever - no way to stop it gracefully.
when DiagnosticFaultManager is dropped, it will deadlock on join().
| fn drop(&mut self) { | ||
| println!("Joining fault_lib_receiver_thread"); | ||
| let handle = self.fault_lib_receiver_thread.take().unwrap(); | ||
| if let Err(err) = handle.join() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop tries to join the worker thread, but run() loops forever (see comment in fault_lib_communicator.rs). This will deadlock! Need a shutdown signal sent before join.
| // Copyright (c) 2025 Contributors to the Eclipse Foundation | ||
| // | ||
| // See the NOTICE file(s) distributed with this work for additional | ||
| // information regarding copyright ownership. | ||
| // | ||
| // This program and the accompanying materials are made available under the | ||
| // terms of the Apache License Version 2.0 which is available at | ||
| // <https://www.apache.org/licenses/LICENSE-2.0> | ||
| // | ||
| // SPDX-License-Identifier: Apache-2.0 | ||
| // | ||
|
|
||
| // Copyright (c) 2025 Contributors to the Eclipse Foundation | ||
| // | ||
| // See the NOTICE file(s) distributed with this work for additional | ||
| // information regarding copyright ownership. | ||
| // | ||
| // This program and the accompanying materials are made available under the | ||
| // terms of the Apache License Version 2.0 which is available at | ||
| // <https://www.apache.org/licenses/LICENSE-2.0> | ||
| // | ||
| // SPDX-License-Identifier: Apache-2.0 | ||
| // |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate copyright header
| fn listen_hash_check_response(&self) -> Result<bool, SinkError> { | ||
| let start = Instant::now(); | ||
| loop { | ||
| if let Some(msg) = self | ||
| .hash_check_response_subscriber | ||
| .as_ref() | ||
| .unwrap() | ||
| .receive() | ||
| .map_err(|_| SinkError::TransportDown)? | ||
| { | ||
| return Ok(*msg.payload()); | ||
| } | ||
| if start.elapsed() >= TIMEOUT { | ||
| return Err(SinkError::Timeout); | ||
| } | ||
| std::thread::sleep(Duration::from_millis(10)); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Busy-wait with sleep(10ms) burns CPU. Could use waitset or condvar
| [dependencies] | ||
| env_logger.workspace = true | ||
| common = { path = "../common" } | ||
| fault_lib = { path = "../fault_lib" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dfm_lib depends on fault_lib - this creates tight coupling. FaultCatalog and FaultCatalogBuilder should probably live in common crate, since both sides need them. Would allow cleaner separation.
| Use cases | ||
| --------- | ||
|
|
||
| Following usecases are valid for the S-CORE application using the Fault Library: | ||
|
|
||
| - registering new fault in the system | ||
| - depending on car configuration variant, enabled features etc. the number of faults detected and reported by the app can change | ||
| - depending on the current status and state of the car electronic system the APP can report different faults | ||
| - configuring debouncing for the fault | ||
| - different test can require the results to be filtered over time or debounced, to prevent setting the faults by glitches or false positives | ||
| - configuring enabling conditions for the fault | ||
| - each test can require different system conditions to be fulfilled before the test can be performed (e.g. the communication test can be done only if the power supply is in expected range) | ||
| - reporting results of diagnostic tests (fail / pass) | ||
| - reporting status of enabling conditions (if done in the app) | ||
| - the application can report only status on the enabling condition and does not report any faults | ||
| - react to the SOVD Fault Handling actions (e.g. delete faults can cause the test to restart) | ||
| - react to change in the enabling conditions (some tests could be impossible to be process when enabling conditions are not fulfilled) | ||
| - provide interface to the user which allow to provide additional environmental data to be stored with the fault | ||
|
|
||
| Following usecases applies for the Fault Library (FL) and Diagnostic Fault Manager (DFM): | ||
|
|
||
| - validate the consistency of the fault catalog shared between DFM and FL | ||
| - DFM maintain global fault catalog based on the information from each FL | ||
| - FL reports state changes in the faults to DFM over IPC | ||
| - FL reports enabling condition state change to the DFM over IPC | ||
| - DFM reports over IPC to FL enabling condition state change reported by another FL | ||
| - DFM requests restart of the test for the faults reported by FL | ||
| - DFM reports cleaning of the faults in the DFM by the SOVD client | ||
| - DFM receives and maintain current status of the environment conditions to be stored together with faults | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These use cases are documented but NOT implemented yet:
- enabling conditions (subscribe, notify, local/global)
- dynamic fault registration ("registering new fault in the system")
- reset policies / debouncing on manager side
- react to SOVD Fault Handling actions
Consider adding a "MVP Scope" section listing what's actually implemented vs planned.
| pub struct FaultDescriptor { | ||
| pub id: FaultId, | ||
|
|
||
| pub name: ShortString, | ||
| pub summary: Option<LongString>, | ||
|
|
||
| pub category: FaultType, | ||
| pub severity: FaultSeverity, | ||
| pub compliance: ComplianceVec, | ||
|
|
||
| pub reporter_side_debounce: Option<DebounceMode>, | ||
| pub reporter_side_reset: Option<ResetPolicy>, | ||
| pub manager_side_debounce: Option<DebounceMode>, | ||
| pub manager_side_reset: Option<ResetPolicy>, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fields:
reporter_side_debounce
reporter_side_reset
manager_side_debounce
manager_side_reset
are defined but none of them are actually used in the code.
Either implement debouncing/reset logic or document these as "reserved for future".
@mfaferek93 Would you take over this task? |
|
@lh-sag I need to review the priorities and check internally with 42dot team, there’s a chance @bburda42dot will handle this on our side. If it ends up being on me, I’ll work on it in my free time, so it may take a bit longer. |
|
My overall review:
|
| } | ||
|
|
||
| #[derive(Clone, Default, Debug, PartialEq, Eq)] | ||
| pub struct SovdFault { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice that three fields are missing, symptom, symptom_translation_id and schema
| impl<S: SovdFaultStateStorage> Drop for DiagnosticFaultManager<S> { | ||
| fn drop(&mut self) { | ||
| println!("Joining fault_lib_receiver_thread"); | ||
| let handle = self.fault_lib_receiver_thread.take().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unwrap on take will panic if called twice
| /// Diagnostic Fault Manager. This trait shall never become public | ||
| impl FaultSinkApi for FaultManagerSink { | ||
| fn publish(&self, path: &str, record: FaultRecord) -> Result<(), SinkError> { | ||
| let event = DiagnosticEvent::Fault((to_static_long_string(path).expect("Failed to serialize path"), record)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
path strings are converted to LongString without proper validation
First implementation proposal for faultlib from Qorix.
Summary
Implementation proposal of fault lib .
No further work will be done on this PR by the original team.
If this should move forward, it needs a new owner/volunteer.
Checklist
Related
Notes for Reviewers