fix: consecutive open timeout errors #3282

victorboissiere · 2025-08-20T12:32:57Z

Sometimes AWS closes TCP connections without the Ruby connection pool being aware of it. This leads to failure after failure while trying to reach the service over the same existing connection.

We run many services in production. Only a few containers hit this issue from time to time, and only in Ruby services. My investigation led to this issue.

This update ensures that when that happens, instead of retrying on the same stalled TCP connection, we create a new one, which should work right away on the second try.
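
A minimal sketch of the intended behaviour, using a toy pool rather than the SDK's own classes. `ToyConnection`, `ToyPool`, and `request_with_fresh_retry` are hypothetical names made up for illustration; only `Net::OpenTimeout` comes from Ruby's standard library. The idea is that a connection that timed out is never handed back to the pool, so the retry has to open a new one.

```ruby
require "net/http" # provides Net::OpenTimeout

# Hypothetical stand-in for a pooled connection; not an SDK class.
class ToyConnection
  attr_reader :id

  def initialize(id, stale: false)
    @id = id
    @stale = stale
  end

  # Simulates a request; a stale connection always times out.
  def request
    raise Net::OpenTimeout, "execution expired" if @stale

    "ok from connection #{@id}"
  end
end

# Hypothetical pool that prefers an idle connection when one exists.
class ToyPool
  attr_reader :created

  def initialize
    @idle = []
    @created = 0
  end

  # Pre-load the pool with an existing (possibly stale) connection.
  def seed(connection)
    @idle << connection
  end

  def checkout
    @idle.pop || fresh
  end

  def checkin(connection)
    @idle << connection
  end

  private

  def fresh
    @created += 1
    ToyConnection.new("new-#{@created}")
  end
end

# Retries an open timeout on a different connection: the failed one is
# dropped (never checked back in), so the next checkout builds a new one.
def request_with_fresh_retry(pool, max_attempts: 2)
  attempts = 0
  begin
    attempts += 1
    connection = pool.checkout
    body = connection.request
    pool.checkin(connection) # only reached when the request succeeded
    [body, attempts]
  rescue Net::OpenTimeout
    retry if attempts < max_attempts
    raise
  end
end

pool = ToyPool.new
pool.seed(ToyConnection.new("stale-1", stale: true))
body, attempts = request_with_fresh_retry(pool)
puts "#{body} after #{attempts} attempts"
```

In this toy model the first attempt fails on the seeded stale connection and the second attempt succeeds on a newly created one. Whether the real pool behaves differently today is exactly what the discussion below is about.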

mullermp · 2025-08-20T13:04:51Z

Thanks for opening a pull request. Can you be specific about your threading and process setup, why you are generally seeing timeouts, and for what services/use cases? The connection pool code is already taking out the endpoint from the pool before the connection is tried. If there is any failure like a timeout, it's not checked back in, so I'm skeptical that this is doing anything different. The endpoint is normalized without a path and query, and that is used as the key. You can also try rescuing in your code and calling this method https://github.com/aws/aws-sdk-ruby/blob/version-3/gems%2Faws-sdk-core%2Flib%2Faws-sdk-core.rb#L148.

I recall a recent customer issue relating to pooling, but it turned out that their system could only open so many connections at once. Are you by chance using passenger/nginx?
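
The checkout behaviour described in this comment can be modelled roughly as follows. This is a simplified sketch under stated assumptions, not the SDK's implementation: `pool_key`, `KeyedPool`, `session_for`, and `idle_count` are hypothetical names, and sessions are plain strings. It illustrates two points from the comment: the pool key ignores path and query, and a session whose request raised is not checked back in.

```ruby
require "uri"
require "timeout" # provides Timeout::Error

# Hypothetical helper: reduce a request URL to scheme://host:port,
# dropping the path and query so all requests to one endpoint share a key.
def pool_key(url)
  uri = URI.parse(url)
  "#{uri.scheme}://#{uri.host}:#{uri.port}"
end

# Hypothetical pool keyed by normalized endpoint; sessions are just strings.
class KeyedPool
  attr_reader :opened

  def initialize
    @idle = Hash.new { |hash, key| hash[key] = [] }
    @opened = 0
  end

  # Takes a session out of the pool before yielding it. The session is
  # returned to the pool only if the block finishes without raising.
  def session_for(url)
    key = pool_key(url)
    session = @idle[key].pop || open_session(key)
    result = yield session
    @idle[key] << session # skipped when the block raised
    result
  end

  def idle_count(url)
    @idle[pool_key(url)].size
  end

  private

  def open_session(key)
    @opened += 1
    "session-#{@opened} for #{key}"
  end
end

pool = KeyedPool.new
first = "https://sts.eu-west-3.amazonaws.com/?Action=GetCallerIdentity"
second = "https://sts.eu-west-3.amazonaws.com/other/path"

pool.session_for(first) { |session| session }
pool.session_for(second) { |session| session } # same key, reuses the session

begin
  pool.session_for(first) { |_session| raise Timeout::Error, "execution expired" }
rescue Timeout::Error
  # The failed session was checked out and never returned.
end

puts pool.opened            # sessions opened so far
puts pool.idle_count(first) # idle sessions left for this endpoint
```

If the real pool follows this model, a timed-out session should not be reused on the next attempt, which is the reason for the skepticism expressed above.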

victorboissiere · 2025-08-20T16:28:43Z

Thanks for your reply! We're seeing a lot of these errors:

 ServiceToken::Errors::STSError (Failed to open TCP connection to sts.eu-west-3.amazonaws.com:443 (execution expired))

and it keeps failing until the pod restarts. It's always a few pods, never all of them (this service, for instance, runs at least 20). In the most recent case, the STS endpoint failed for more than 12 hours in a single pod that makes very frequent requests to STS, every one of them ending in "execution expired".
With retries, the same underlying connection is always used. This only happens in Ruby services that use this library, not in other services such as those written in Go. The applications are all deployed in EKS as Docker containers, and there is no proxy between the app and the STS API on our side (there may be one on the AWS side).

Other recent cases are sporadic failures on one pod at a time, which last until the pod restarts or stop on their own after a while.

jterapin · 2025-08-21T17:53:43Z

Hi @victorboissiere - we will need more information to investigate this further. Here's what we need from you:

Ruby version
aws-sdk-core and aws-sdk-sts versions
Full error stack trace

Also...

> With retries, the same underlying connection is always used.

How did you determine that the connection was reused? Also, have you been running the proposed fix, and did it mitigate the issue?

fix: consecutive open timeout errors

8b67867
