MQTT publish seems to prevent detection of connection loss #802

nadrm · 2025-05-01T16:46:49Z

I'm using the Aws::Crt::Mqtt::MqttConnection::Publish method (aws-iot-device-sdk-cpp-v2 v1.32.0) to send data to the AWS MQTT backend approximately every second. Both OnConnectionInterrupted and OnDisconnect callbacks of MqttConnectionHandlers are registered to handle connection events.

The connection is configured with the following settings:

Keep-alive interval: 60 seconds
Ping timeout: 5 seconds
Protocol operation timeout: 25 seconds

The issue arises when the device enters suspend mode (e.g., system sleep). While suspended, the application is frozen and unable to perform any MQTT/TCP activity. During this period, the backend closes the connection due to timeout. However, because the application is frozen, it misses the disconnection.

Upon resume, the application continues calling Publish() as usual. At this point:

Each Publish() call fails with AWS_ERROR_MQTT_TIMEOUT, which is expected.
However, no MqttConnectionHandler events (e.g., OnConnectionInterrupted) are triggered.
The MQTT ping task appears to be continuously rescheduled at each publish and never executed.

This state persists for up to 30 minutes, until a socket timeout finally occurs. At that point, the SDK emits a connection interrupted event and reconnects.

I would expect the keep-alive ping to be sent even during publish operations to allow timely detection of a dead connection.
Is this the expected behavior or should I assume that AWS_ERROR_MQTT_TIMEOUT on Publish implies that the connection is no longer alive, even in the absence of a disconnect event?
Is there any way to change or customize this behavior so that the connection loss can be detected earlier, even when publish operations are failing?

A log excerpt is provided below showing repeated publish failures, ping postponements, and eventual disconnection due to a socket timeout.
logs.txt

bretambrose (Maintainer) · 2025-05-01T17:02:53Z

There's no way to control this behavior, but I agree that it is undesirable. Ultimately, what's broken is that the ping push-out should only occur on receipt of ACKs for submitted operations (and the push-out should only be relative to the timestamp at which the acknowledged operation was sent). In your case, with that rule, there would never be a push-out because nothing ever gets acked.
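
To make the difference concrete, here is a toy simulation of the two push-out rules described above. It is only a sketch under simplifying assumptions (one publish per second, ACKs arriving instantly whenever the broker is alive) and is not the aws-c-mqtt implementation; `PushOutPolicy` and `firstPingTime` are hypothetical names chosen for illustration.

```cpp
#include <iostream>

// Toy model only: NOT the aws-c-mqtt scheduler. All names are hypothetical.
enum class PushOutPolicy {
    OnEveryPublish, // ping is pushed out whenever a publish is submitted
    OnAckOnly       // ping is pushed out only when a publish is acknowledged
};

// Simulates an application publishing once per second.
// Returns the second at which the first ping would be sent,
// or -1 if no ping is sent within horizonSecs.
int firstPingTime(PushOutPolicy policy, int keepAliveSecs, int horizonSecs, bool brokerAcks) {
    int nextPing = keepAliveSecs; // first ping scheduled relative to t = 0
    for (int now = 1; now <= horizonSecs; ++now) {
        if (now >= nextPing) {
            return now; // the ping task is due and runs
        }
        const int sentAt = now; // a publish is submitted at this second
        if (policy == PushOutPolicy::OnEveryPublish) {
            nextPing = sentAt + keepAliveSecs;
        } else if (brokerAcks) {
            // Assumption: the ACK arrives immediately; push out from the send time.
            nextPing = sentAt + keepAliveSecs;
        }
        // OnAckOnly with a silent broker: nextPing is left untouched.
    }
    return -1;
}

void printComparison() {
    std::cout << "push-out on every publish, no ACKs: "
              << firstPingTime(PushOutPolicy::OnEveryPublish, 60, 1800, false) << "\n";
    std::cout << "push-out on ACK only, no ACKs: "
              << firstPingTime(PushOutPolicy::OnAckOnly, 60, 1800, false) << "\n";
}
```

With a 60-second keep-alive and a broker that never acks, the first call returns -1 (no ping within 30 minutes) and the second returns 60, which is the point at which a dead connection could start to be detected in this model.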

The MQTT5 client does not have this behavior as near as I can tell, and given that it is also a superior protocol (and client implementation), I would switch to it.

xiazhvera (Maintainer) · 2025-05-01T22:11:13Z

We addressed the ping request scheduling logic last year, and with the current implementation (assuming the fix is working as expected), ping requests should only be scheduled after receiving successful ACKs. Based on that, the behavior you're seeing shouldn’t happen—or it may be unrelated to the publish operation itself.

I ran a local test using a mock server that never sends PUBACKs or PINGRESPs, and in that case, the client was able to initiate pings with the ongoing publishes.

One area worth checking is whether the device clock might have been affected by a system suspend/resume event. The client uses timestamps to determine when to send a ping (code link: https://github.com/awslabs/aws-c-mqtt/blob/main/source/client_channel_handler.c#L96). That said, this still would not explain why the ping task ran ~7 seconds after a publish operation.
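
One rough way to check from application code whether a suspend happened between two points in time is to compare how far a monotonic clock and the wall clock advanced. The sketch below assumes a platform where `std::chrono::steady_clock` does not advance during suspend (commonly the case on Linux, where it is typically backed by `CLOCK_MONOTONIC`). It says nothing about which clock the MQTT client itself uses, and `ClockSample`, `unaccountedTime`, and `looksLikeSuspend` are hypothetical helper names.

```cpp
#include <chrono>

// A paired reading of the monotonic clock and the wall clock.
struct ClockSample {
    std::chrono::steady_clock::time_point steady;
    std::chrono::system_clock::time_point wall;
};

// Takes both readings back to back.
ClockSample sampleClocks() {
    return {std::chrono::steady_clock::now(), std::chrono::system_clock::now()};
}

// Wall-clock time that passed between two samples but was not
// counted by the monotonic clock.
std::chrono::milliseconds unaccountedTime(const ClockSample& before, const ClockSample& after) {
    using std::chrono::duration_cast;
    using std::chrono::milliseconds;
    const auto wallDelta = duration_cast<milliseconds>(after.wall - before.wall);
    const auto steadyDelta = duration_cast<milliseconds>(after.steady - before.steady);
    return wallDelta - steadyDelta;
}

// Heuristic: a large positive gap suggests a suspend (or a wall-clock
// adjustment, e.g. by NTP, which this check cannot tell apart).
bool looksLikeSuspend(const ClockSample& before, const ClockSample& after,
                      std::chrono::milliseconds threshold) {
    return unaccountedTime(before, after) > threshold;
}
```

An application could call `sampleClocks()` once per publish loop iteration and log a warning when `looksLikeSuspend` returns true for consecutive samples, which would help correlate the TRACE log with suspend/resume events.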

To investigate further, could you please share the log with TRACE level enabled? That would give us more insight into what’s happening.

2 replies

nadrm May 2, 2025
Author

Thank you for your response!

I'll try to collect logs with TRACE level enabled sometime next week.

In the meantime, I wanted to mention two points I hadn’t included before:

1. I noticed that if the Protocol Operation Timeout is greater than the sum of the Keep-Alive Interval and the Ping Timeout, everything seems to work as expected. In this case, the client manages to send a ping before the publish times out, and the connection interrupted event is triggered correctly.
2. I'm currently using an older version of the SDK: 1.32.0. That was the latest version available when I started development, and I haven’t upgraded since, as everything seemed to work fine initially (we were using both Keep-Alive and Ping Timeout set to 5 seconds, which, as per point 1, avoided the issue).
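
The relation in the first point can be written down as a small check. This is only a sketch of the observation reported here for SDK 1.32.0, not a documented guarantee of the SDK; `TimeoutSettings` and `satisfiesObservedWorkaround` are hypothetical names.

```cpp
#include <chrono>

// Hypothetical container for the three connection timeouts discussed above.
struct TimeoutSettings {
    std::chrono::seconds keepAlive;
    std::chrono::seconds pingTimeout;
    std::chrono::seconds protocolOperationTimeout;
};

// True when the protocol operation timeout is greater than the sum of the
// keep-alive interval and the ping timeout, i.e. the configuration for
// which the interruption event was observed to fire as expected.
bool satisfiesObservedWorkaround(const TimeoutSettings& s) {
    return s.protocolOperationTimeout > s.keepAlive + s.pingTimeout;
}
```

For the originally reported settings (60 s keep-alive, 5 s ping timeout, 25 s operation timeout) the check is false; with keep-alive and ping timeout both at 5 seconds and the same 25 s operation timeout it is true.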

From what you describe, it sounds like the latest SDK version (1.36.0) should already include the fix for this issue. Is that correct?

bretambrose May 2, 2025
Maintainer

The issue was fixed in April 2024: https://github.com/awslabs/aws-c-mqtt/releases/tag/v0.10.4
So if you're using 1.32.0 (which is from January 2024), you have the old, bad behavior; updating should fix it.

Answer selected by nadrm
