Deadlock in multicast_observer #555

@mxgrey

Description

I've run into a deadlock that I haven't been able to reproduce with a minimal example. It appears to be a very rare race condition, and the only way I've found to trigger it reliably is by repeatedly running a large set of convoluted unit tests (written for an application I'm working on) until one of the runs happens to hit it. I often have to leave the tests running on repeat for 1-2 hours (potentially hundreds of reruns) before the deadlock appears. I still don't know exactly which conditions need to align to cause it, but luckily I do know what the stack trace looks like when it happens (ordered from the bottom of the stack to the top):

  1. multicast_observer::add
  2. subscriber::add
  3. composite_subscription::add
  4. composite_subscription_inner::add
  5. composite_subscription_state::add
  6. subscription::unsubscribe
  7. subscription_state::unsubscribe
  8. static_subscription::unsubscribe
  9. multicast_observer::add::<lambda>

The deadlock happens because this mutex gets locked twice on the same thread (as shown in the stack trace above), once at [i] and again at [ii].

In most cases this won't happen, because that whole branch is guarded by the condition that the observer is subscribed, so we can usually rely on that check to keep frame [5] in the stack trace from being reached.

The race condition appears to be that, somewhere between frame [1] and frame [5], another thread changes the observer's state from subscribed to unsubscribed. As I mentioned at the start, I haven't figured out how to reproduce this minimally, but assuming it is possible for another thread to flip the observer to unsubscribed at that point, the stack trace should make it clear that what I've described is a deadlock hazard.
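To make the hazard concrete, here is a minimal sketch of the shape of the problem. The types below (`subscription_like`, `multicast_like`) are made-up stand-ins, not RxCpp's actual classes: an object locks its own non-recursive mutex in `add()` and then registers a teardown lambda with a subscription whose state another thread may already have flipped to unsubscribed; in that case the teardown runs inline on the same thread and tries to take the same mutex again.

```cpp
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in, not RxCpp code: a subscription whose state
// another thread can flip to unsubscribed.
struct subscription_like {
    std::mutex state_lock;
    bool subscribed = true;
    std::vector<std::function<void()>> teardowns;

    void unsubscribe() {
        std::vector<std::function<void()>> to_run;
        {
            std::unique_lock<std::mutex> guard(state_lock);
            subscribed = false;
            to_run.swap(teardowns);
        }
        for (auto& t : to_run) t();   // run teardowns outside the state lock
    }

    void add(std::function<void()> teardown) {
        bool run_now = false;
        {
            std::unique_lock<std::mutex> guard(state_lock);
            if (subscribed) teardowns.push_back(std::move(teardown));
            else run_now = true;
        }
        // Already unsubscribed: the teardown runs inline, on the caller's
        // thread, while the caller may still hold its own locks.
        if (run_now) teardown();
    }
};

// The observer-like object that locks its own non-recursive mutex in add().
struct multicast_like {
    std::mutex lock;             // plays the role of the mutex in frames [1] and [9]
    subscription_like lifetime;  // shared with other threads

    void add() {
        std::unique_lock<std::mutex> guard(lock);   // frame [1]: first lock
        // Register cleanup with the shared subscription. If another thread
        // has unsubscribed `lifetime` in the meantime, the lambda is invoked
        // immediately on this same thread (frames [5]-[8])...
        lifetime.add([this] {
            std::unique_lock<std::mutex> g(lock);   // frame [9]: second lock -> deadlock
        });
    }
};
```

In this toy version the deadlock is deterministic once `lifetime` has been unsubscribed first; in RxCpp the same shape only shows up under the race described above.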

This race condition was happening for me on release v4.1.0, which I understand is a few years behind master, but the problematic code path seems to still exist, as the lines I linked above are from the latest master.

A very easy way to fix this problem is to change this std::mutex to a std::recursive_mutex (and, of course, to change the template parameter on the locking mechanisms that use it). I'm happy to provide a PR with the fix, but I don't know how to write a regression test that proves it.
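For illustration only, this is roughly what the proposed change looks like on a toy type (again not RxCpp's actual code; the real change would be to the mutex member used by multicast_observer and the lock guard sites that reference it). A std::recursive_mutex can be re-locked by the thread that already owns it, so an inline teardown no longer self-deadlocks.

```cpp
#include <mutex>

// Illustrative stand-in, not RxCpp code: the member is switched from
// std::mutex to std::recursive_mutex, and the lock guards' template
// parameter is updated to match.
struct multicast_like_fixed {
    std::recursive_mutex lock;   // was: std::mutex

    void add() {
        std::unique_lock<std::recursive_mutex> guard(lock);       // frame [1]
        // Even if a registered teardown lambda is invoked inline on this
        // thread (frames [5]-[9] above), re-locking a recursive_mutex that
        // this thread already owns succeeds instead of deadlocking.
        auto teardown = [this] {
            std::unique_lock<std::recursive_mutex> inner(lock);   // frame [9]
        };
        teardown();   // safe here; with std::mutex this would deadlock
    }
};
```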
