Release 3.7.1#1034
Merged
Merged
Conversation
Fix selective event subscription race When we receive multiple selective event subscriptions simultaneously for the same Service/Instance/Eventgroup, the first thread adds the client to the map and uses that map to process the subscription, but another thread may start processing the next subscription, adding the client to the map and starting to process with the current state of the map, causing the ACK subscription not to be sent because the second subscription, when it started, did not have the client from the first subscription marked as ACK, leaving one client always in PENDING. To solve the issue, it was verified that for each service/instance/eventgroup, an its_parent is created, which is the reference to the first subscription received, and this is always updated by all subscriptions, thus having the current status of all clients. Given this, to decide whether SUBSCRIBE_ACK is sent or not, we now check whether the subscription's parent has pending clients instead of checking whether the subscription has pending clients. However, in the case of the first subscription, there is no parent, so in that case, we must continue to check the subscription's pending clients. Additionally, a fake_socket was created to replicate the issue, and helpers were updated to support the delay in sending messages from Service Discovery to UDP.
It was decreased in 2c212f7 but this is fragile, memory_test especially is very "intensive" and can have variable runtimes, especially under sanitizers
Since there is no fake_socket test yet to check for "shadow events", add some.
Drop second registration phase Deprecates REGISTER_APPLICATION_ID, REGISTER_APPLICATION_ACK and RIE_ADD_CLIENT Registration is now completed as soon as the routing endpoint is created and the ASSIGN_CLIENT_ACK command is received by the application CONFIG_ID is now sent before ASSIGN_CLIENT_ACK As a consequence the registration is no longer processed by the client_registration_thread_ New state transition flow is ST_DEREGISTERED->ST_REGISTERING->ST_REGISTERED
Fixes flaky boardnet tests The call to app::subscribe does actually subscribe to both event and field. Happens that subscription_record_ would sometimes catch event as last record, other times field as last record. Event or field is not relevant for the purpose of both test cases, therefore, subscription is now done only to an event. while at it, added helpers to simplify test's implementation
Fix queue size not being reset when clearing queue
Before this change there was a chance of a deleted routing info if a client could faster reconnect then the router call the error handler. The exact sequence was: endpoint_manager_impl::add_local_routing_endpoint -> notices that there is still an endpoint -> queues the error_handler endpoint_manager_impl::add_local_routing_endpoint -> unlocks the mutex and queues the registration of the application (even though the error handler is still in flight for the old one) registration_func (register application) would add the routing info the error handler would queue a deregister func -> remove the new routing info new endpoint is started and offers are received, but no routing info is given to be distributed. this race is resolved by noticing: The router is deducing the routing_info from the endpoint, therefore the available routing info should be bound to the active endpoint. The same observation is not true for a routing client for which reason the map has to be maintained differently. This problem was noticed in the course of: 8b067c7 and should be given since: 245d390
It is a recent change from 2c212f7; good intentions (small timeouts means faster feedback), but not a good idea with parallel test execution Both the fake socket tests and the memory tests were hitting the timeout, and others are too near for comfort
It is an interesting sequence: SOME/IP-SD usei is running, doing both unicast+multicast "read loops" absence of SOME/IP-SD messages leads to triggering of rejoin mechanism from 17d3af7 rejoin starts another multicast "read loop" there are now several "read loops" on the same multicast socket, reading SOME/IP-SD messages mere scheduling delay now leads to SOME/IP messages read out-of-order, which leads to "reboot detection", broken availability From the code, one would usually expect 3) to stop the previous "multicast read loop", but there is no dedicated "lifecycle counter" for the multicast socket, which allows the previous loop to continue Fix it by adding a lifecycle counter, incremented on every multicast socket open/close
The deregistration process was a soft protection against clean-up ordering problems and the router distributing outdated routing-info, but this is no required because a stronger guarantee was delivered with: 8b830d8
When processing a subscription, 3 distinct steps are done: insert subscription state send subscription ack send initial notification These are very granular steps, and there is nothing that protects state changes from happening (see 74512fd), so it is possible for a notify to slip between step 2) and 3), leading to double initial notifications The double initial notification normally is no problem (see 97a98a6), but the notification in 3) is problematic as it ignores debouncing filters. That causes debouncing test failures Use the subscription mutex from 61f26b1 to cover the entire subscription sequence, and use it in event notifications as well. It also solves other hypothetical issues, e.g., a notification slipping between step 1) and 2)
Create the switch client id test with fake sockets This test replicates an issue where a client would not correctly subscribe to a service that was previously offered by a vsomeip app that has changed its client id from when it first offered the service.
Already in the codebase since 611e7fd but somehow managed to evade valgrind/sanitizers..
Dynamic Service Releases can Block the Router Because we: release service endpoints dynamically flush on release block the thread until flush is done it can happen that a all io threads are blocked, for some time: time = (flush timeout x floor(number of service releases / number of io-threads)) One possible scenario where this would be encountered is when Android is suspended, with external communication being active.
Fix deadlock on routingmanagerd The deadlock was caused by the usage of mutexes and condition variables which are not signal safe.
Pass the ownership of "detached" dispatcher threads to a new static object called thread_manager instead ofactually detaching it. This PR aims to fix the issue of detached stop threads resulting in memory leaks when outliving the program. The fix was done by passing such threads' ownership to a new static object called thread_manager, which blocks in its destructor to ensure such dispatcher threads are properly joined when terminating the program. This PR also create 5 tests to avoid regression.
Fix valgrind errors not being detected in parallel tests. Modify the scripts responsible for running tests and detecting failures to properly identify valgrind errors. This PR does not address other things such as all parallel tests having the same .xml output.
In the specific case of this issue, we have two different processes A and B, which share the same VSOMEIP app configuration, specifically the same client id. Assuming A spawns and registers before B, the sequence is: A registers its application instance with client id X (ST_REGISTERED) Eventually, B starts the registration process (ST_REGISTERING), even though A is still registered (ST_REGISTERED), but has already called application_impl->stop(). B receives ST_DEREGISTERED because the daemon (routing host) is sending a message intended for the first process to the second. Process A finishes, but process B now has to recover its state, eventually recovering to (ST_REGISTERED). Process B now cannot send OFFERs, due to an edge case in state management.
Update async_accept to use the peer-endpoint overload When a client connects, sends the necessary notifications, and disconnects immediately, all quickly, the client is able to connect, send notifications, and disconnect before the routing manager executes accept_cbk asynchronously. So, when accept_cbk is executed, the remote address no longer exists, so it is entered as 0.0.0.0 (unspecified). Then, when reading the messages in rmi::on_message, it drops the messages because they are from an unspecified IP. This issue is a limitation of boost with regard to async_accept. To resolve the issue, it was necessary to use async_accept, which uses peer-endpoint overload to capture the remote address by the kernel at the accept( ) time, then passing the remote address to accept_cbk. This way, even if the client disconnects, it will continue to use the remote address and read all messages. Additionally, shutdown_tests have been updated to send notifications with and without payload to the client to connect, send notifications, and disconnect in less time, increasing the likelihood of this issue being reproduced. It was also necessary to add a magic sleep on the client side to give time for the last message to be read in router, before the clean-up starts the forceful stop of the “server” connection within the route, otherwise, the Routing endpoint is stopped/Client deregistered -> end all the related endpoints -> reading is stopped -> messages are not processed.
Otherwise reading the logs is hell
MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The .xml output often mangles the output, especially whenever sanitizers find something. Therefore, drop .xml, switch to .log, which is otherwise equivalent
Remove minor comparison when requesting or releasing a service. In scenarios where two clients require the same service though with different minors, with only one of them fully matching the offered service, both would get the service availability and be able to manipulate the service interface. When the client that required the correct minor releases the service, the remaining one would no longer be able to use the boardnet service. During tests, it was seen that the endpoint would be destroyed and then created again on the following offer. This PR introduces a fix where the minor is no longer of interest when requesting/releasing a service, i.e., When a client requests a service with a distinct minor, the client will still be added to the client list of the service info and thus always keep a reference to it.
Fix thread ctor race against join. A very small race condition exists between the start of a std::thread's handler and the assigning of it to t0. As such, a small cv is necessary to ensure that the thread is properly moved to t0 before we try to do thread.join().
Fix compilation with gcc15 Without this change, compilation fails because of "stringop-overflow"
Upgrade CI to Ubuntu 24.04.4 (Noble Numbat) Defaults to GCC-13 and Clang-18 Previous default were GCC-11 and Clang-12 Added jobs for GCC-10 and 14 Added jobs for Clang 16 and 20 Added memcheck suppresions for ip and tc commands
From build: ld.lld: error: undefined symbol: vsomeip_v3::thread_manager::get() referenced by application_impl.cpp:489
The router/stub has a few more threads than a normal libvsomeip application. The threads that process events MUST also use the nice level from the configuration, or we risk scheduling delays and priority inversion in the router is an example of this, of registration delays happening because the processing happens in a dedicated thread that had a lower priority than the io threads The registration thread was already removed with 29fb750 + 8b067c7, so nothing to do there. The only other case is the multicast thread
Split the responsibility of a vsomeip routing application internally into two. The current vsomeip-application API implies this split already: Every function that internally can only forward if the routing is the routing_manager_impl implies this is a "router only" function. All the others are "client only" functions. But at the moment the routing_manager_impl serves both purposes which makes the code more complex and couples different concerns. To resolve this tension a vsomeip-application will with this PR unconditionally have a routing_manager_client instance, and an optional routing part. From a high-level this will mean that the routing process will contain a routing_client that will connect to the router part of the same process. This overhead is considered worth the complexity reduction. To ensure that the routing_manager_impl is now a purely concerned with the routing, it receives a new implementation of the routing_manager_host interface, the "routing_application". This new wrapper logs errors whenever the routing_manager_impl tries to act as a client, but forward calls received from the vsomeip application. One additional change is that the routing_manager_impl will no longer posses any "real" client id, but only the "VSOMEIP_ROUTING_CLIENT_ID".
Adjust the initial timings of the suspend_resume_initial setup, to work around the debounce of service requests in routing_manager_client.
These null checks do nothing, and it's not even coherent - rmc has them, but neither rms nor rmi do
A few of the tests use boost udp/tcp sockets, spawn threads, read/write into the sockets, with no concern for synchronization, so tsan catches a few races
The boardnet variant (boardnet_routing_host) only contains the couple of methods that the boardnet endpoints need, very little Besides it being cleaner, it also allows to improve the rest of the routing_host interface, namely the on_message, which currently is a horrifying amalgamation
There are a few situations where we send SubNack but log nothing at all, even though this is a "major" event that can lead to a TCP connection break
Clear remote_subscription_state_ entry when endpoint receives error Previously, when a tcp client endpoint detected a timeout, the app subscribed to the service would get the ON_UNAVAILABLE and the stop offer service, but the remote state subscription for that service would remain unchanged, with the subscription acknowledged state. This would lead to a situation where after the connection recovered, the client would subscribe again to the service but not receive a subscribe ack and not triggering the subscription status handler. With this change, when the connection recovers the client will subscribe to the service and receive the subscribe ack.
In case of funky business when parsing local messages, dump the entire payload (with some limitations) Add helper, generalize it
Add a dedicated event_dispatcher interface, to remove the last dependency towards the routing_manager interface. With this smaller interface, the routing_manager is no longer required. Reducing the amount of left-over complexity and freeing the name.
Ease the writing of boardnet tests without fixture duplication. This simplification is achieved by two new helpers: ecu_config: models the configuration for apps running on one ecu (or in a guest system). ecu_setup: Bundles app access + ecu_config + config writing for one ecu
Refactor atomic_bool in dispatch app tests. This refactor removes atomics from dispatch-app-stop tests where the shared flags are already synchronized through a condition variable. Protecting the write with the same mutex as the waiter is safer than using atomics, since otherwise there is a risk of a lost wakeup if the write/notify occurs between predicate evaluation and the next wait. The new mutex + cv usage preserves the code's correct behavior.
Fix build routing_application.cpp:13:10: fatal error: 'logger_ext.hpp' file not found
Increase timeout for clients when waiting any notification from the test manager. Reuse client id tests startup is time consuming and registration phase can take up to 10/12s as each client process is started and registration awaited sequentially.
Follow up to b11118f. The client config needs bumping as well.
byteorder.hpp already does a decent job to check at compile time, so use it, and make it a compile-time failure if we cannot check (because the runtime check is just silly)
We cannot use it without an equivalent for the Android build, therefore remove it
Accept TCP incoming connections before rejecting a subscription. The TCP connection must be accepted before processing subscription messages related to it. And we cannot reorder the service discovery messages. With Linux, this is not a problem, because we can control the order in which network messages are processed. But with Boost ASIO, this is not possible: we don't know in which order the messages are processed. In the method tcp_server_endpoint_impl::is_established_to, which is called while processing the service discovery messages, we check two conditions: the queue of incoming TCP connection is empty ; all the incoming TCP connections have been processed. The last condition requires to manage the incoming TCP connections in a separate thread, so they cannot blocked by another job and we can check if they have been processed. The solution is to create an io_service dedicated to incoming connections that allows to synchronize the management of incoming connections and service discovery messages. However this solution isn't supported by Boost ASIO 1.6. Added a test: stress_tcp_subscriptions` creates multiple thread to connect via TCP to the server and then sends subscription requests. It verifies that the notification is received.
Add predicate to condition_variable. dispatch_app_outlives_program and dispatch_app_outlives_global were both using wait(lock) instead of wait(lock, predicate), meaning they could lose the notify_one(). This PR properly adds a predicate and a bool to ensure the condition variable doesn't get stuck on wait(). This PR also defines a timeout of 20 seconds for each of the dispatch_app tests.
Most of the protocol commands are already correct-by-construction, and serialization is anyhow caller-controlled - we are not parsing a message from outside Therefore, drop all of the pointless error checking, streamline the code
It is necessary to start another thread, otherwise there's nothing to run events on Furthermore, start the thread only on start, so that we do not immediately begin executing a thread on application::init (due to construction of auxiliary_context) While at it, add unit tests
Otherwise every boardnet test need to specify these ports, otherwise boardnet communication would not work.
This file grew by too much, containing different sort of tests. Within this change this test suite is split into: protocol tests - focusing on the "on-the-wire" messages exchanged client lifecycle tests - focusing of variations of start/stop, offer/stop-offer connection restoration - focusing on the reparation after errors. This Change is a "mechanical" change: No new tests, no tests removed, only moving tests from one place into another one.
Remove assert from common helper functions.
It would happen especially for StopOffers, leading to this sequence: ON_STOP_OFFER_SERVICE (0001): [0001.0001:1.1] usei#1::111.111.111.111:111::send_queued: skipped, no socket! It of course makes no sense to do any SD communication while suspended, so suppress the messages generally While at it, remove use of sd::runtime, which anyhow is only necessary to create the service_discovery_impl, nothing more
Use existing classes from the lib and print more information. This eases debugging from fake_socket_tests for boardnet tests.
Restructure the data storage of the "known clients". Instead of keeping a central map that adds book keeping overhead, keep the data local to the endpoint. Meanwhile also differentiate the usage of lazy_loading vs. the actual usage of relating the client id with the environment data. A client does not seem to require the environment data of the server. A server needs to lazy_load (covered via the local_server), but also have the the mapping of the client <-> environment during message parsing. Therefore the on_message signature was adjusted to provide both data points together The router does not seem to need it for something else then the lazy load. One side finding was that the "reconnect" functionality of the rmc was racy and was risking having some "lingering" endpoints. or routing_info or client to sec_client mapping. This side finding triggered some movement of code from routing_manager_base to the routing_manager_client, and only a fraction of that to the routing_manager_impl.
Enable shutdown_tests shutdown_tests has been disabled in until the issue is resolved. It can now be enabled again.
Remove dead code Moved documentation to test directory Unused options, weird use of threads, missing expectations Moved existing test documentation to new file on test directory Test is taking 3-4 seconds less to execute too!
As of fe1f962 (router split), it is no longer necessary. While at it remove an unused version of register_event
Otherwise the coverage check anyhow can fail because of missing basic libc symbols (since it is a dummy lib that has no dependencies)
Changes: routing: cleanup rmi notification hack Remove npdu test dead code Enable shutdown_tests Move the client environment into local_endpoint Interpret more data on the wire (fake_socket_tests) sd: no messages while suspended Remove assert from common split the test_connection_restoration file Specify ports by default endpoints: fix auxiliary_context restart protocol: serialization cannot fail Solve dispatch_app_outlives_global failures accept tcp incoming connections before subscriptions + test add daemon check build: remove use of target_include_directories build: fix runtime endianness check Bump initial timeout in suspend_resume_initial test COVESA#2 Fix reuse client id test Fix logger include Refactor bool in dispatch app tests introduce config writer helper Remove the routing_manager interface misc: dump message on parsing error Remove subscription state on disconnect sd: improve subnack logging endpoints: split routing_host into a boardnet variant tests: fix race(s) in malicious_data_test routing: cleanup policy_manager usage Bump initial phase timeout for suspend_resume_initial test Split the routing app in two misc: set thread priority on multicast thread Fix android compilation issues Upgrade CI to Noble Fix compilation with gcc15 Fix thread ctor race against join Remove minor for request/release service zuul: switch from ctest .xml to .log output tests: use boardnet addresses for boardnet tests Update async_accept to use the peer-endpoint overload Make verify-icts-compilance job voting test concurrent fixed client-id app registration Fix script to properly detect valgrind errors Fix detached dispatcher thread fix routingmanagerd deadlock Revert flush_queue for boardnet conf: fix uninitialized variable Switch client id test with fake sockets routing: fix subscription race Drop graceful de-registration endpoints: fix multicast read loop race fix test_vlan_toggle_registration zuul: set valgrind as voting Add ICTS check tests: remove small timeouts avoid routing info race on reconnect reset queue size when dropping messages Fixes flaky boardnet tests drop second registration phase more boardnet tests tests: increase timeout for memory_test Fix selective event subscription race
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.