Skip to content

Release 3.7.1#1034

Merged
fcmonteiro merged 59 commits into
COVESA:masterfrom
andrefesilva:release_3.7.1
Apr 9, 2026
Merged

Release 3.7.1#1034
fcmonteiro merged 59 commits into
COVESA:masterfrom
andrefesilva:release_3.7.1

Conversation

@andrefesilva
Copy link
Copy Markdown
Contributor

No description provided.

danielsantiago and others added 30 commits April 8, 2026 14:50
Fix selective event subscription race
When we receive multiple selective event subscriptions simultaneously for the same Service/Instance/Eventgroup,
the first thread adds the client to the map and uses that map to process the subscription, but another thread may
start processing the next subscription, adding the client to the map and starting to process with the current state of
the map, causing the ACK subscription not to be sent because the second subscription, when it started, did not have
the client from the first subscription marked as ACK, leaving one client always in PENDING.
To solve the issue, it was verified that for each service/instance/eventgroup, an its_parent is created, which is
the reference to the first subscription received, and this is always updated by all subscriptions, thus having the
current status of all clients. Given this, to decide whether SUBSCRIBE_ACK is sent or not, we now check whether
the subscription's parent has pending clients instead of checking whether the subscription has pending clients.
However, in the case of the first subscription, there is no parent, so in that case, we must continue to check the
subscription's pending clients.
Additionally, a fake_socket was created to replicate the issue, and helpers were updated to support the delay in
sending messages from Service Discovery to UDP.
It was decreased in 2c212f7 but this is fragile, memory_test
especially is very "intensive" and can have variable runtimes,
especially under sanitizers
Since there is no fake_socket test yet to check for
"shadow events", add some.
Drop second registration phase

Deprecates REGISTER_APPLICATION_ID, REGISTER_APPLICATION_ACK and RIE_ADD_CLIENT
Registration is now completed as soon as the routing endpoint is created and the ASSIGN_CLIENT_ACK
command is received by the application
CONFIG_ID is now sent before ASSIGN_CLIENT_ACK
As a consequence the registration is no longer processed by the client_registration_thread_
New state transition flow is ST_DEREGISTERED->ST_REGISTERING->ST_REGISTERED
Fixes flaky boardnet tests
The call to app::subscribe does actually subscribe to both event and field.
Happens that subscription_record_ would sometimes catch event as last record,
other times field as last record.
Event or field is not relevant for the purpose of both test cases, therefore,
subscription is now done only to an event.
while at it,
added helpers to simplify test's implementation
Fix queue size not being reset when clearing queue
Before this change there was a chance of a deleted routing
info if a client could faster reconnect then the router call the
error handler.
The exact sequence was:

endpoint_manager_impl::add_local_routing_endpoint
-> notices that there is still an endpoint
-> queues the error_handler
endpoint_manager_impl::add_local_routing_endpoint
-> unlocks the mutex and queues the registration of the application
(even though the error handler is still in flight for the old one)
registration_func (register application) would add the routing info
the error handler would queue a deregister func -> remove the new routing info
new endpoint is started and offers are received, but no routing info
is given to be distributed.

this race is resolved by noticing:
The router is deducing the routing_info from the endpoint, therefore
the available routing info should be bound to the active endpoint.
The same observation is not true for a routing client for which reason
the map has to be maintained differently.
This problem was noticed in the course of: 8b067c7 and should
be given since: 245d390
It is a recent change from 2c212f7; good intentions (small timeouts
means faster feedback), but not a good idea with parallel test execution
Both the fake socket tests and the memory tests were hitting the
timeout, and others are too near for comfort
Enable and fix test_vlan_toggle_registration
While working on 29fb750 found that test_vlan_toggle_registration test suite was not being compiled/executed
Also fix the failing tests based on the changes made on 29fb750
It is an interesting sequence:

SOME/IP-SD usei is running, doing both unicast+multicast "read loops"
absence of SOME/IP-SD messages leads to triggering of rejoin
mechanism from 17d3af7
rejoin starts another multicast "read loop"
there are now several "read loops" on the same multicast socket,
reading SOME/IP-SD messages
mere scheduling delay now leads to SOME/IP messages read
out-of-order, which leads to "reboot detection", broken availability

From the code, one would usually expect 3) to stop the previous
"multicast read loop", but there is no dedicated "lifecycle counter" for
the multicast socket, which allows the previous loop to continue
Fix it by adding a lifecycle counter, incremented on every multicast
socket open/close
The deregistration process was a soft protection against
clean-up ordering problems and the router distributing outdated
routing-info, but this is no required because a stronger
guarantee was delivered with: 8b830d8
When processing a subscription, 3 distinct steps are done:

insert subscription state
send subscription ack
send initial notification

These are very granular steps, and there is nothing that protects state
changes from happening (see 74512fd), so it is possible for a notify
to slip between step 2) and 3), leading to double initial notifications
The double initial notification normally is no problem (see 97a98a6),
but the notification in 3) is problematic as it ignores debouncing
filters. That causes debouncing test failures
Use the subscription mutex from 61f26b1 to cover the entire
subscription sequence, and use it in event notifications as well. It
also solves other hypothetical issues, e.g., a notification slipping
between step 1) and 2)
Create the switch client id test with fake sockets
This test replicates an issue where a client
would not correctly subscribe to a service
that was previously offered by a vsomeip app
that has changed its client id from when it
first offered the service.
Already in the codebase since 611e7fd but somehow managed to evade
valgrind/sanitizers..
Dynamic Service Releases can Block the Router
Because we:

release service endpoints dynamically
flush on release
block the thread until flush is done

it can happen that a all io threads are blocked, for some time:
time = (flush timeout x floor(number of service releases / number of io-threads))
One possible scenario where this would be encountered is when Android is suspended, with external communication
being active.
Fix deadlock on routingmanagerd
The deadlock was caused by the usage of mutexes and condition variables which are not signal safe.
Pass the ownership of "detached" dispatcher threads to a new static object called thread_manager
instead ofactually detaching it.
This PR aims to fix the issue of detached stop threads resulting in memory leaks when outliving the program.
The fix was done by passing such threads' ownership to a new static object called thread_manager, which blocks
in its destructor to ensure such dispatcher threads are properly joined when terminating the program.
This PR also create 5 tests to avoid regression.
Fix valgrind errors not being detected in parallel tests.
Modify the scripts responsible for running tests and detecting failures to properly identify valgrind errors.
This PR does not address other things such as all parallel tests having the same .xml output.
In the specific case of this issue, we have two different processes A and B, which share the same VSOMEIP
app configuration, specifically the same client id.
Assuming A spawns and registers before B, the sequence is:

A registers its application instance with client id X (ST_REGISTERED)
Eventually, B starts the registration process (ST_REGISTERING), even though A is still registered (ST_REGISTERED),
but has already called application_impl->stop().
B receives ST_DEREGISTERED because the daemon (routing host) is sending a message intended for the first
process to the second. Process A finishes, but process B now has to recover its state, eventually recovering to
(ST_REGISTERED).
Process B now cannot send OFFERs, due to an edge case in state management.
Update async_accept to use the peer-endpoint overload
When a client connects, sends the necessary notifications, and disconnects immediately, all quickly, the client is able
to connect, send notifications, and disconnect before the routing manager executes accept_cbk asynchronously.
So, when accept_cbk is executed, the remote address no longer exists, so it is entered as 0.0.0.0 (unspecified).
Then, when reading the messages in rmi::on_message, it drops the messages because they are from an unspecified IP.
This issue is a limitation of boost with regard to async_accept.
To resolve the issue, it was necessary to use async_accept, which uses peer-endpoint overload to capture the remote
address by the kernel at the accept( ) time, then passing the remote address to accept_cbk. This way, even if the
client disconnects, it will continue to use the remote address and read all messages.
Additionally, shutdown_tests have been updated to send notifications with and without payload to the client to
connect, send notifications, and disconnect in less time, increasing the likelihood of this issue being reproduced.
It was also necessary to add a magic sleep on the client side to give time for the last message to be read in router,
before the clean-up starts the forceful stop of the “server” connection within the route, otherwise, the Routing
endpoint is stopped/Client deregistered -> end all the related endpoints -> reading is stopped -> messages are not
processed.
Otherwise reading the logs is hell
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The .xml output often mangles the output, especially whenever sanitizers
find something. Therefore, drop .xml, switch to .log, which is otherwise
equivalent
Remove minor comparison when requesting or releasing a service.
In scenarios where two clients require the same service though with different minors, with only one of them
fully matching the offered service, both would get the service availability and be able to manipulate the
service interface. When the client that required the correct minor releases the service, the remaining one
would no longer be able to use the boardnet service.
During tests, it was seen that the endpoint would be destroyed and then created again on the following offer.
This PR introduces a fix where the minor is no longer of interest when requesting/releasing a service, i.e.,
When a client requests a service with a distinct minor, the client will still be added to the client list of the service
info and thus always keep a reference to it.
Fix thread ctor race against join.
A very small race condition exists between the start of a std::thread's handler and the assigning of it to t0.
As such, a small cv is necessary to ensure that the thread is properly moved to t0 before we try to do thread.join().
Fix compilation with gcc15
Without this change, compilation fails because of "stringop-overflow"
Upgrade CI to Ubuntu 24.04.4 (Noble Numbat)

Defaults to GCC-13 and Clang-18
Previous default were GCC-11 and Clang-12

Added jobs for GCC-10 and 14
Added jobs for Clang 16 and 20
Added memcheck suppresions for ip and tc commands
From build:
ld.lld: error: undefined symbol: vsomeip_v3::thread_manager::get()
referenced by application_impl.cpp:489
The router/stub has a few more threads than a normal libvsomeip
application. The threads that process events MUST also use the nice
level from the configuration, or we risk scheduling delays and priority
inversion in the router
is an example of this, of registration delays happening
because the processing happens in a dedicated thread that had a
lower priority than the io threads
The registration thread was already removed with 29fb750 + 8b067c7, so
nothing to do there. The only other case is the multicast thread
Split the responsibility of a vsomeip routing application internally into two.
The current vsomeip-application API implies this split already:
Every function that internally can only forward if the routing is the routing_manager_impl
implies this is a "router only" function. All the others are "client only" functions.
But at the moment the routing_manager_impl serves both purposes which makes
the code more complex and couples different concerns. To resolve this
tension a vsomeip-application will with this PR unconditionally have a routing_manager_client
instance, and an optional routing part.
From a high-level this will mean that the routing process will contain a routing_client that
will connect to the router part of the same process. This overhead is considered worth
the complexity reduction.
To ensure that the routing_manager_impl is now a purely concerned with the routing,
it receives a new implementation of the routing_manager_host interface, the
"routing_application".
This new wrapper logs errors whenever the routing_manager_impl tries to act as a client,
but forward calls received from the vsomeip application.
One additional change is that the routing_manager_impl will no longer posses any "real"
client id, but only the "VSOMEIP_ROUTING_CLIENT_ID".
Adjust the initial timings of the suspend_resume_initial setup,
to work around the debounce of service requests in routing_manager_client.
ricardolleite and others added 29 commits April 8, 2026 17:43
These null checks do nothing, and it's not even coherent - rmc has
them, but neither rms nor rmi do
A few of the tests use boost udp/tcp sockets, spawn threads, read/write
into the sockets, with no concern for synchronization, so tsan catches a
few races
The boardnet variant (boardnet_routing_host) only contains the couple
of methods that the boardnet endpoints need, very little
Besides it being cleaner, it also allows to improve the rest of the
routing_host interface, namely the on_message, which currently is a
horrifying amalgamation
There are a few situations where we send SubNack but log nothing at all,
even though this is a "major" event that can lead to a TCP connection
break
Clear remote_subscription_state_ entry when endpoint receives error
Previously, when a tcp client endpoint detected a timeout, the app
subscribed to the service would get the ON_UNAVAILABLE and
the stop offer service, but the remote state subscription for that
service would remain unchanged, with the subscription
acknowledged state.
This would lead to a situation where after the connection
recovered, the client would subscribe again to the service
but not receive a subscribe ack and not triggering the
subscription status handler.
With this change, when the connection recovers the
client will subscribe to the service and receive the
subscribe ack.
In case of funky business when parsing local messages, dump the entire
payload (with some limitations)
Add helper, generalize it
Add a dedicated event_dispatcher interface, to remove
the last dependency towards the routing_manager interface.
With this smaller interface, the routing_manager is no longer required.
Reducing the amount of left-over complexity and freeing the name.
Ease the writing of boardnet tests without fixture duplication.
This simplification is achieved by two new helpers:

ecu_config: models the configuration for apps running on one ecu
(or in a guest system).
ecu_setup: Bundles app access + ecu_config + config writing for one ecu
Refactor atomic_bool in dispatch app tests.
This refactor removes atomics from dispatch-app-stop tests where the shared flags are already synchronized through
a condition variable. Protecting the write with the same mutex as the waiter is safer than using atomics,
since otherwise there is a risk of a lost wakeup if the write/notify occurs between predicate evaluation and
the next wait. The new mutex + cv usage preserves the code's correct behavior.
Fix build
routing_application.cpp:13:10: fatal error: 'logger_ext.hpp' file not found
Increase timeout for clients when waiting any notification from the test manager.
Reuse client id tests startup is time consuming and registration phase can take up
to 10/12s as each client process is started and registration awaited sequentially.
Follow up to b11118f. The client config needs bumping
as well.
byteorder.hpp already does a decent job to check at compile time, so
use it, and make it a compile-time failure if we cannot check
(because the runtime check is just silly)
We cannot use it without an equivalent for the Android build, therefore
remove it
Accept TCP incoming connections before rejecting a subscription.
The TCP connection must be accepted before processing subscription
messages related to it. And we cannot reorder the service discovery
messages.
With Linux, this is not a problem, because we can control the order in
which network messages are processed. But with Boost ASIO, this is
not possible: we don't know in which order the messages are processed.
In the method tcp_server_endpoint_impl::is_established_to, which is
called while processing the service discovery messages, we check two
conditions:

the queue of incoming TCP connection is empty ;
all the incoming TCP connections have been processed.

The last condition requires to manage the incoming TCP connections
in a separate thread, so they cannot blocked by another job and we
can check if they have been processed.
The solution is to create an io_service dedicated to incoming
connections that allows to synchronize the management of incoming
connections and service discovery messages.
However this solution isn't supported by Boost ASIO 1.6.
Added a test: stress_tcp_subscriptions` creates multiple thread
to connect via TCP to the server and then sends subscription
requests. It verifies that the notification is received.
Add predicate to condition_variable.
dispatch_app_outlives_program and dispatch_app_outlives_global were both using wait(lock) instead of
wait(lock, predicate), meaning they could lose the notify_one().
This PR properly adds a predicate and a bool to ensure the condition variable doesn't get stuck on wait().
This PR also defines a timeout of 20 seconds for each of the dispatch_app tests.
Most of the protocol commands are already correct-by-construction, and
serialization is anyhow caller-controlled - we are not parsing a message
from outside
Therefore, drop all of the pointless error checking, streamline the code
It is necessary to start another thread, otherwise there's nothing to
run events on
Furthermore, start the thread only on start, so that we do not
immediately begin executing a thread on application::init (due to
construction of auxiliary_context)
While at it, add unit tests
Otherwise every boardnet test need to specify these ports,
otherwise boardnet communication would not work.
This file grew by too much, containing different sort of tests.
Within this change this test suite is split into:

protocol tests - focusing on the "on-the-wire" messages exchanged
client lifecycle tests - focusing of variations of start/stop, offer/stop-offer
connection restoration - focusing on the reparation after errors.

This Change is a "mechanical" change: No new tests, no tests removed,
only moving tests from one place into another one.
Remove assert from common helper functions.
It would happen especially for StopOffers, leading to this sequence:
ON_STOP_OFFER_SERVICE (0001): [0001.0001:1.1]
usei#1::111.111.111.111:111::send_queued: skipped, no socket!

It of course makes no sense to do any SD communication while suspended,
so suppress the messages generally
While at it, remove use of sd::runtime, which anyhow is only necessary
to create the service_discovery_impl, nothing more
Use existing classes from the lib and print more information.
This eases debugging from fake_socket_tests for boardnet
tests.
Restructure the data storage of the "known clients".
Instead of keeping a central map that adds book keeping overhead,
keep the data local to the endpoint. Meanwhile also
differentiate the usage of lazy_loading vs. the actual usage of relating
the client id with the environment data.

A client does not seem to require the environment data of the server.
A server needs to lazy_load (covered via the local_server), but also have the
the mapping of the  client <-> environment during message parsing.
Therefore the on_message signature was adjusted to provide both data points
together
The router does not seem to need it for something else then the lazy load.

One side finding was that the "reconnect" functionality of the rmc
was racy and was risking having some "lingering" endpoints.
or routing_info or client to sec_client mapping.
This side finding triggered some movement of code from routing_manager_base
to the routing_manager_client, and only a fraction of that to the
routing_manager_impl.
Enable shutdown_tests
shutdown_tests has been disabled in until the issue is resolved. It can now be enabled again.
Remove dead code
Moved documentation to test directory
Unused options, weird use of threads, missing expectations
Moved existing test documentation to new file on test directory
Test is taking 3-4 seconds less to execute too!
As of fe1f962 (router split), it is no longer necessary. While at it
remove an unused version of register_event
Otherwise the coverage check anyhow can fail because of missing basic
libc symbols (since it is a dummy lib that has no dependencies)
Changes:
routing: cleanup rmi notification hack
Remove npdu test dead code
Enable shutdown_tests
Move the client environment into local_endpoint
Interpret more data on the wire (fake_socket_tests)
sd: no messages while suspended
Remove assert from common
split the test_connection_restoration file
Specify ports by default
endpoints: fix auxiliary_context restart
protocol: serialization cannot fail
Solve dispatch_app_outlives_global failures
accept tcp incoming connections before subscriptions + test
add daemon check
build: remove use of target_include_directories
build: fix runtime endianness check
Bump initial timeout in suspend_resume_initial test COVESA#2
Fix reuse client id test
Fix logger include
Refactor bool in dispatch app tests
introduce config writer helper
Remove the routing_manager interface
misc: dump message on parsing error
Remove subscription state on disconnect
sd: improve subnack logging
endpoints: split routing_host into a boardnet variant
tests: fix race(s) in malicious_data_test
routing: cleanup policy_manager usage
Bump initial phase timeout for suspend_resume_initial test
Split the routing app in two
misc: set thread priority on multicast thread
Fix android compilation issues
Upgrade CI to Noble
Fix compilation with gcc15
Fix thread ctor race against join
Remove minor for request/release service
zuul: switch from ctest .xml to .log output
tests: use boardnet addresses for boardnet tests
Update async_accept to use the peer-endpoint overload
Make verify-icts-compilance job voting
test concurrent fixed client-id app registration
Fix script to properly detect valgrind errors
Fix detached dispatcher thread
fix routingmanagerd deadlock
Revert flush_queue for boardnet
conf: fix uninitialized variable
Switch client id test with fake sockets
routing: fix subscription race
Drop graceful de-registration
endpoints: fix multicast read loop race
fix test_vlan_toggle_registration
zuul: set valgrind as voting
Add ICTS check
tests: remove small timeouts
avoid routing info race on reconnect
reset queue size when dropping messages
Fixes flaky boardnet tests
drop second registration phase
more boardnet tests
tests: increase timeout for memory_test
Fix selective event subscription race
@fcmonteiro fcmonteiro merged commit fed7a25 into COVESA:master Apr 9, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants