MQTT5 client silently stops processing messages with no lifecycle events #623

@ahmadaal

Description

Describe the bug

Our AWS IoT MQTT5 client intermittently stops receiving messages with no error logs or lifecycle events, and only an application restart fixes it. The client appears connected (no disconnection logs), but all message delivery fails silently. This occurs in production on a PC connected to a commercial WiFi network through a USB WiFi dongle, conditions that could plausibly cause intermittent network issues.

It's hard to say exactly when the issue happens (there are no error logs or lifecycle events), but if I leave the device alone for 2-4 hours, it usually manifests.

After the issue has triggered, any publishes to a topic (from the outside world) that the client SHOULD be subscribed to are never received by the client. I confirmed that at that point, the client DOES have general network connectivity.

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

  • When network issues prevent message delivery, the client should detect PINGRESP timeouts after 30 seconds (default ping_timeout_ms) and trigger lifecycle connection failure events. The client should then attempt reconnection with exponential backoff.
  • Application should receive on_lifecycle_connection_failure callbacks
  • Any background thread crashes should be logged or cause the client to fail gracefully
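To make the expected reconnection behaviour concrete, here is a minimal sketch of exponential backoff with a cap. It is not the SDK's implementation; the function name `reconnect_delays` and the delay values are hypothetical and chosen only for illustration.

```python
import random


def reconnect_delays(min_delay_ms=1000, max_delay_ms=120000, attempts=8,
                     jitter=False, rng=None):
    """Yield the wait before each reconnect attempt, doubling up to a cap.

    All parameter names and defaults are illustrative assumptions, not
    values taken from the SDK.
    """
    rng = rng or random.Random()
    delay = min_delay_ms
    for _ in range(attempts):
        capped = min(delay, max_delay_ms)
        # With jitter enabled, pick a random wait in [0, capped] so that many
        # clients do not retry in lockstep.
        yield rng.uniform(0, capped) if jitter else capped
        delay *= 2


if __name__ == "__main__":
    print(list(reconnect_delays(min_delay_ms=1000, max_delay_ms=8000, attempts=5)))
```

The point of the sketch is that each failed attempt should be observable: in the behaviour described above, every one of these waits would be preceded by a connection failure callback, which is exactly what is missing in this report.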

Either the PING/PINGRESP keep-alive feature is working, in which case I'd expect some sort of message or lifecycle callback when the network fails (but I don't get one). If the callback isn't firing in that case, that's a bug.

Or keep-alive is not working because something has silently crashed or hung, so no callback will ever fire after that point. In that case, the bug is that an internal SDK thread crash produces no lifecycle callback.
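The keep-alive expectation in the first case can be written down as a small model. This is only a sketch of the logic I'd expect, not how the SDK or the underlying CRT tracks pings; the class `KeepAliveState` and its method names are hypothetical, and the 30-second value mirrors the ping timeout figure quoted above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KeepAliveState:
    """Tracks one outstanding PINGREQ and decides when it has timed out."""

    ping_timeout_s: float = 30.0           # figure quoted in this issue
    ping_sent_at: Optional[float] = None   # send time of the unanswered PINGREQ

    def on_pingreq_sent(self, now: float) -> None:
        self.ping_sent_at = now

    def on_pingresp(self) -> None:
        # A PINGRESP clears the outstanding ping.
        self.ping_sent_at = None

    def timed_out(self, now: float) -> bool:
        # Timed out only if a PINGREQ is still unanswered after the timeout.
        return (self.ping_sent_at is not None
                and now - self.ping_sent_at >= self.ping_timeout_s)
```

If `timed_out` ever becomes true, I'd expect a connection failure callback followed by reconnect attempts. If the thread that would evaluate this check is no longer running, nothing fires, which matches the second case.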

Current Behavior

  • Messages stop being received completely
  • Zero lifecycle events triggered (no connection_failure, no reconnection attempts)
  • No error logs from the MQTT5 client
  • Only application restart fixes the issue (not OS restart)

Reproduction Steps

It's hard to give repro steps as this is an environmental issue, but I can share my MQTT client Python code:

    from awsiot import mqtt5_client_builder

    # Paths and the on_publish_received arguments are elided in this report.
    client = mqtt5_client_builder.mtls_from_path(
        endpoint=IOT_ENDPOINT,
        cert_filepath=...,
        pri_key_filepath=...,
        ca_filepath=...,
        client_id=f"{thing_name}",
        on_publish_received=on_publish_received(
           ...
        ),
        on_lifecycle_stopped=on_lifecycle_stopped,
        on_lifecycle_connection_success=on_lifecycle_connection_success,
        on_lifecycle_connection_failure=on_lifecycle_connection_failure,
    )

I subscribe to my topics within on_lifecycle_connection_success, so that I always resubscribe on connection success.

Possible Solution

  • The SDK should recover if there's an internal hang.
  • The SDK should log if there's an internal hang.
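Until the SDK does this itself, an application-level watchdog could at least make the stall visible. The sketch below is a possible workaround, not a fix: `MessageWatchdog` and its methods are hypothetical names, and it assumes the subscribed topics carry regular traffic (for example a periodic heartbeat publish), otherwise normal silence would look like a stall.

```python
import time


class MessageWatchdog:
    """Flags a stall when no message has been seen for max_silence_s seconds."""

    def __init__(self, max_silence_s, clock=time.monotonic):
        self._max_silence_s = max_silence_s
        self._clock = clock            # injectable so the logic can be tested
        self._last_seen = clock()

    def touch(self):
        # Intended to be called from the publish-received callback.
        self._last_seen = self._clock()

    def stalled(self):
        return self._clock() - self._last_seen > self._max_silence_s


if __name__ == "__main__":
    watchdog = MessageWatchdog(max_silence_s=300)
    watchdog.touch()
    print("stalled:", watchdog.stalled())
```

The application would call `touch()` whenever a publish arrives and poll `stalled()` from its own timer. What to do on a stall is an open question: I have only confirmed that restarting the whole application recovers, so stopping and recreating the client may or may not be enough.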

Additional Information/Context

No response

SDK version used

1.22.1

Environment details (OS name and version, etc.)

Ubuntu 24.04

Metadata

Assignees

No one assigned

Labels

  • closed-for-staleness
  • response-requested: Waiting on additional info and feedback. Will move to "closing-soon" in 2 days.
