Skip to content

Conversation

@vinodreddy-g
Copy link

@vinodreddy-g vinodreddy-g commented Jan 19, 2026

First implementation proposal for faultlib from Qorix.

Summary

Implementation proposal of fault lib .
No further work will be done on this PR by the original team.
If this should move forward, it needs a new owner/volunteer.

Checklist

  • I have tested my changes locally
  • I have added or updated documentation
  • I have linked related issues or discussions
  • I have added or updated tests

Related

Notes for Reviewers

First implementation proposal for faultlib
@vinodreddy-g vinodreddy-g requested a review from a team as a code owner January 19, 2026 06:25
@vinodreddy-g vinodreddy-g marked this pull request as draft January 19, 2026 07:07
Copy link

@mfaferek93 mfaferek93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a solid early draft - the overall architecture makes sense :)

That said, there's still quite a bit of work needed before this can be merged:

  • Several critical issues (memory leaks, potential deadlocks, panic-happy error handling)
  • Many todo!() and unimplemented!() in production code paths
  • Mismatch between what's documented and what's actually implemented
  • Some design decisions that need team discussion (e.g. FaultCatalog location, debouncing scope, Weak vs Arc for global state)

As it stands, this PR needs a new owner to drive it to completion - the current state isn't mergeable but it's a great foundation to build on

Comment on lines +22 to +24
BadDescriptor(&'static str),
#[error("other: {0}")]
Other(&'static str),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using &'static str forces callers to Box::leak dynamic strings, causing memory leaks
Suggestion: Change to Cow<'static, str> and remove Copy derive:

Suggested change
BadDescriptor(&'static str),
#[error("other: {0}")]
Other(&'static str),
BadDescriptor(Cow<'static, str>),
Other(Cow<'static, str>),

impl From<FaultManagerError> for SinkError {
fn from(err: FaultManagerError) -> Self {
match err {
FaultManagerError::SendError(msg) => SinkError::Other(Box::leak(msg.into_boxed_str())),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Box::leak here causes memory leak on every SendError!
Fix: After changing SinkError to Cow (see my coomet in sink_error.rs):
FaultManagerError::SendError(msg) => SinkError::Other(Cow::Owned(msg))


impl FaultManagerSink {
pub(crate) fn new() -> Self {
let (tx, rx) = mpsc::channel();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unbounded channel - can grow forever if IPC is slow or DFM is down


pub fn run<S: SovdFaultStateStorage>(&self, mut processor: FaultRecordProcessor<S>) {
info!("FaultLibCommunicator listening...");
while self.node.wait(DIAGNOSTIC_FAULT_MANAGER_LISTENER_CYCLE_TIME).is_ok() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop runs forever - no way to stop it gracefully.
when DiagnosticFaultManager is dropped, it will deadlock on join().

fn drop(&mut self) {
println!("Joining fault_lib_receiver_thread");
let handle = self.fault_lib_receiver_thread.take().unwrap();
if let Err(err) = handle.join() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop tries to join the worker thread, but run() loops forever (see comment in fault_lib_communicator.rs). This will deadlock! Need a shutdown signal sent before join.

Comment on lines +1 to +23
// Copyright (c) 2025 Contributors to the Eclipse Foundation
//
// See the NOTICE file(s) distributed with this work for additional
// information regarding copyright ownership.
//
// This program and the accompanying materials are made available under the
// terms of the Apache License Version 2.0 which is available at
// <https://www.apache.org/licenses/LICENSE-2.0>
//
// SPDX-License-Identifier: Apache-2.0
//

// Copyright (c) 2025 Contributors to the Eclipse Foundation
//
// See the NOTICE file(s) distributed with this work for additional
// information regarding copyright ownership.
//
// This program and the accompanying materials are made available under the
// terms of the Apache License Version 2.0 which is available at
// <https://www.apache.org/licenses/LICENSE-2.0>
//
// SPDX-License-Identifier: Apache-2.0
//

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate copyright header

Comment on lines +107 to +125
fn listen_hash_check_response(&self) -> Result<bool, SinkError> {
let start = Instant::now();
loop {
if let Some(msg) = self
.hash_check_response_subscriber
.as_ref()
.unwrap()
.receive()
.map_err(|_| SinkError::TransportDown)?
{
return Ok(*msg.payload());
}
if start.elapsed() >= TIMEOUT {
return Err(SinkError::Timeout);
}
std::thread::sleep(Duration::from_millis(10));
}
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Busy-wait with sleep(10ms) burns CPU. Could use waitset or condvar

[dependencies]
env_logger.workspace = true
common = { path = "../common" }
fault_lib = { path = "../fault_lib" }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dfm_lib depends on fault_lib - this creates tight coupling. FaultCatalog and FaultCatalogBuilder should probably live in common crate, since both sides need them. Would allow cleaner separation.

Comment on lines +104 to +133
Use cases
---------

Following usecases are valid for the S-CORE application using the Fault Library:

- registering new fault in the system
- depending on car configuration variant, enabled features etc. the number of faults detected and reported by the app can change
- depending on the current status and state of the car electronic system the APP can report different faults
- configuring debouncing for the fault
- different test can require the results to be filtered over time or debounced, to prevent setting the faults by glitches or false positives
- configuring enabling conditions for the fault
- each test can require different system conditions to be fulfilled before the test can be performed (e.g. the communication test can be done only if the power supply is in expected range)
- reporting results of diagnostic tests (fail / pass)
- reporting status of enabling conditions (if done in the app)
- the application can report only status on the enabling condition and does not report any faults
- react to the SOVD Fault Handling actions (e.g. delete faults can cause the test to restart)
- react to change in the enabling conditions (some tests could be impossible to be process when enabling conditions are not fulfilled)
- provide interface to the user which allow to provide additional environmental data to be stored with the fault

Following usecases applies for the Fault Library (FL) and Diagnostic Fault Manager (DFM):

- validate the consistency of the fault catalog shared between DFM and FL
- DFM maintain global fault catalog based on the information from each FL
- FL reports state changes in the faults to DFM over IPC
- FL reports enabling condition state change to the DFM over IPC
- DFM reports over IPC to FL enabling condition state change reported by another FL
- DFM requests restart of the test for the faults reported by FL
- DFM reports cleaning of the faults in the DFM by the SOVD client
- DFM receives and maintain current status of the environment conditions to be stored together with faults

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These use cases are documented but NOT implemented yet:

  • enabling conditions (subscribe, notify, local/global)
  • dynamic fault registration ("registering new fault in the system")
  • reset policies / debouncing on manager side
  • react to SOVD Fault Handling actions
    Consider adding a "MVP Scope" section listing what's actually implemented vs planned.

Comment on lines +91 to +105
pub struct FaultDescriptor {
pub id: FaultId,

pub name: ShortString,
pub summary: Option<LongString>,

pub category: FaultType,
pub severity: FaultSeverity,
pub compliance: ComplianceVec,

pub reporter_side_debounce: Option<DebounceMode>,
pub reporter_side_reset: Option<ResetPolicy>,
pub manager_side_debounce: Option<DebounceMode>,
pub manager_side_reset: Option<ResetPolicy>,
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fields:

reporter_side_debounce
reporter_side_reset
manager_side_debounce                                                                                       
manager_side_reset 

are defined but none of them are actually used in the code.
Either implement debouncing/reset logic or document these as "reserved for future".

@lh-sag
Copy link

lh-sag commented Jan 26, 2026

It's a solid early draft - the overall architecture makes sense :)

That said, there's still quite a bit of work needed before this can be merged:

  • Several critical issues (memory leaks, potential deadlocks, panic-happy error handling)
  • Many todo!() and unimplemented!() in production code paths
  • Mismatch between what's documented and what's actually implemented
  • Some design decisions that need team discussion (e.g. FaultCatalog location, debouncing scope, Weak vs Arc for global state)

As it stands, this PR needs a new owner to drive it to completion - the current state isn't mergeable but it's a great foundation to build on

@mfaferek93 Would you take over this task?

@mfaferek93
Copy link

@lh-sag I need to review the priorities and check internally with 42dot team, there’s a chance @bburda42dot will handle this on our side. If it ends up being on me, I’ll work on it in my free time, so it may take a bit longer.
I’ll get back to you later this week with an update.

@IoanAndrasi
Copy link

My overall review:

  • Good architecture that matches design documentation in most of the cases
  • Well organized and separations between libs
  • Trait-based abstraction offering clean separation
  • Still couple of TODO's but I think its sufficient for initial implementation

}

#[derive(Clone, Default, Debug, PartialEq, Eq)]
pub struct SovdFault {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice that three fields are missing, symptom, symptom_translation_id and schema

impl<S: SovdFaultStateStorage> Drop for DiagnosticFaultManager<S> {
fn drop(&mut self) {
println!("Joining fault_lib_receiver_thread");
let handle = self.fault_lib_receiver_thread.take().unwrap();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unwrap on take will panic if called twice

/// Diagnostic Fault Manager. This trait shall never become public
impl FaultSinkApi for FaultManagerSink {
fn publish(&self, path: &str, record: FaultRecord) -> Result<(), SinkError> {
let event = DiagnosticEvent::Fault((to_static_long_string(path).expect("Failed to serialize path"), record));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

path strings are converted to LongString without proper validation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants