Deadlock in multicast_observer #555

@mxgrey

Description

I've run into a deadlock that I haven't been able to reproduce with a minimal example. It appears to be a very rare race condition, and the only way I've found to trigger it reliably is by repeatedly running a large set of convoluted unit tests (written for an application I'm working on) until one of the runs happens to hit it. I often have to leave the tests running on repeat for 1-2 hours (potentially hundreds of reruns) before the deadlock appears. I still don't know exactly which conditions need to align to cause it, but luckily I do know what the stack trace looks like when it happens (ordered from the bottom of the stack to the top):

  1. multicast_observer::add
  2. subscriber::add
  3. composite_subscription::add
  4. composite_subscription_inner::add
  5. composite_subscription_state::add
  6. subscription::unsubscribe
  7. subscription_state::unsubscribe
  8. static_subscription::unsubscribe
  9. multicast_observer::add::<lambda>

The deadlock happens because this mutex gets locked twice on the same thread (as shown in the stack trace above), once at [i] and again at [ii].

In most cases this won't happen, because that whole branch is guarded by the condition that the observer is subscribed, so we can usually rely on that check to keep frame [5] in the stack trace from being reached.

The race condition appears to be that, somewhere between frame [1] and frame [5], another thread changes the observer's state from subscribed to unsubscribed. As I mentioned at the start, I haven't figured out how to reproduce this minimally, but assuming it is possible for another thread to flip the observer to unsubscribed at that point, the stack trace should make it clear that what I've described is a deadlock hazard.
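To make the hazard concrete, here is a minimal sketch of the shape of the problem. The types below (`subscription_like`, `multicast_like`) are made-up stand-ins, not RxCpp's actual classes: an object locks its own non-recursive mutex in `add()` and then registers a teardown lambda with a subscription whose state another thread may already have flipped to unsubscribed; in that case the teardown runs inline on the same thread and tries to take the same mutex again.

```cpp
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in, not RxCpp code: a subscription whose state
// another thread can flip to unsubscribed.
struct subscription_like {
    std::mutex state_lock;
    bool subscribed = true;
    std::vector<std::function<void()>> teardowns;

    void unsubscribe() {
        std::vector<std::function<void()>> to_run;
        {
            std::unique_lock<std::mutex> guard(state_lock);
            subscribed = false;
            to_run.swap(teardowns);
        }
        for (auto& t : to_run) t();   // run teardowns outside the state lock
    }

    void add(std::function<void()> teardown) {
        bool run_now = false;
        {
            std::unique_lock<std::mutex> guard(state_lock);
            if (subscribed) teardowns.push_back(std::move(teardown));
            else run_now = true;
        }
        // Already unsubscribed: the teardown runs inline, on the caller's
        // thread, while the caller may still hold its own locks.
        if (run_now) teardown();
    }
};

// The observer-like object that locks its own non-recursive mutex in add().
struct multicast_like {
    std::mutex lock;             // plays the role of the mutex in frames [1] and [9]
    subscription_like lifetime;  // shared with other threads

    void add() {
        std::unique_lock<std::mutex> guard(lock);   // frame [1]: first lock
        // Register cleanup with the shared subscription. If another thread
        // has unsubscribed `lifetime` in the meantime, the lambda is invoked
        // immediately on this same thread (frames [5]-[8])...
        lifetime.add([this] {
            std::unique_lock<std::mutex> g(lock);   // frame [9]: second lock -> deadlock
        });
    }
};
```

In this toy version the deadlock is deterministic once `lifetime` has been unsubscribed first; in RxCpp the same shape only shows up under the race described above.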

This race condition was happening for me on release v4.1.0, which I understand is a few years behind master, but the problematic code path seems to still exist, as the lines I linked above are from the latest master.

A very easy way to fix this problem is to change this std::mutex to a std::recursive_mutex (and, of course, to change the template parameter on the locking mechanisms that use it). I'm happy to provide a PR with the fix, but I don't know how to write a regression test that proves it.
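For illustration only, this is roughly what the proposed change looks like on a toy type (again not RxCpp's actual code; the real change would be to the mutex member used by multicast_observer and the lock guard sites that reference it). A std::recursive_mutex can be re-locked by the thread that already owns it, so an inline teardown no longer self-deadlocks.

```cpp
#include <mutex>

// Illustrative stand-in, not RxCpp code: the member is switched from
// std::mutex to std::recursive_mutex, and the lock guards' template
// parameter is updated to match.
struct multicast_like_fixed {
    std::recursive_mutex lock;   // was: std::mutex

    void add() {
        std::unique_lock<std::recursive_mutex> guard(lock);       // frame [1]
        // Even if a registered teardown lambda is invoked inline on this
        // thread (frames [5]-[9] above), re-locking a recursive_mutex that
        // this thread already owns succeeds instead of deadlocking.
        auto teardown = [this] {
            std::unique_lock<std::recursive_mutex> inner(lock);   // frame [9]
        };
        teardown();   // safe here; with std::mutex this would deadlock
    }
};
```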
