@victorboissiere victorboissiere commented Aug 20, 2025

Sometimes AWS closes TCP connections without the Ruby connection pool being aware of it. This leads to failure after failure, because every request keeps trying to reach the service on the same existing connection.

We run many services in production. Only a few containers hit this issue from time to time, and only in Ruby services. My investigation pointed to this problem.

This update ensures that when this happens, instead of retrying on the same stalled TCP connection, we create a new one, which should work right away on the second try.

@mullermp
Contributor

Thanks for opening a pull request. Can you be specific about your threading and process setup, why you are generally seeing timeouts, and for what services/use cases? The connection pool code is already taking out the endpoint from the pool before the connection is tried. If there is any failure like a timeout, it's not checked back in, so I'm skeptical that this is doing anything different. The endpoint is normalized without a path and query, and that is used as the key. You can also try rescuing in your code and calling this method https://github.com/aws/aws-sdk-ruby/blob/version-3/gems%2Faws-sdk-core%2Flib%2Faws-sdk-core.rb#L148.
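
To make the check-out behavior described above concrete, here is a toy model in Ruby. It is not the SDK's implementation: `ToyPool`, `session_for`, and the other names are hypothetical, and sessions are stand-in objects rather than real connections. It only illustrates two points from the description: the pool key is the endpoint without path or query, and a session goes back into the pool only when the request block finishes without raising.

```ruby
require 'uri'

# Toy model only. All names here are hypothetical, not the SDK's API.
class ToyPool
  attr_reader :created

  def initialize
    # Idle sessions, grouped by normalized endpoint.
    @idle = Hash.new { |hash, key| hash[key] = [] }
    @created = 0
  end

  # Normalize a URL to scheme, host, and port; path and query are dropped.
  def key_for(url)
    uri = URI.parse(url)
    "#{uri.scheme}://#{uri.host}:#{uri.port}"
  end

  # Check a session out, yield it, and check it back in only on success.
  def session_for(url)
    key = key_for(url)
    session = @idle[key].pop || new_session
    result = yield session
    @idle[key] << session # skipped when the block raises
    result
  end

  def idle_count(url)
    @idle[key_for(url)].size
  end

  private

  def new_session
    @created += 1
    Object.new # stand-in for a real connection
  end
end
```

In this model, a block that raises (for example on a timeout) leaves the pool empty for that endpoint, so the next call has to create a new session instead of reusing the failed one.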

I recall a recent customer issue relating to pooling, but it turned out that their system could only open so many connections at once. Are you by chance using passenger/nginx?
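
The "rescue in your code" suggestion could look roughly like the sketch below. It assumes the linked method is `Aws.empty_connection_pools!`, which closes the SDK's pooled connections; the helper name `with_fresh_connection_retry` and its parameters are hypothetical. The reset step is passed in as a callable so the retry logic itself does not depend on the SDK being loaded.

```ruby
require 'net/http'

# Hypothetical helper: run the block and, on a connection-level error,
# call `reset` (for example to empty the connection pools) before retrying.
DEFAULT_CONNECTION_ERRORS = [
  Net::OpenTimeout, Net::ReadTimeout, Errno::ECONNRESET, EOFError
].freeze

def with_fresh_connection_retry(reset:, errors: DEFAULT_CONNECTION_ERRORS, attempts: 2)
  tries = 0
  begin
    tries += 1
    yield
  rescue *errors
    raise if tries >= attempts # out of attempts: re-raise the original error

    reset.call # drop pooled connections so the retry opens a new one
    retry
  end
end

# Assumed usage with the SDK loaded (not executed here):
#   with_fresh_connection_retry(
#     reset:  -> { Aws.empty_connection_pools! },
#     errors: [Seahorse::Client::NetworkingError]
#   ) { sts_client.get_caller_identity }
```

Note that `Aws.empty_connection_pools!` affects every pooled connection in the process, so in a multi-threaded application it may be preferable to call it sparingly.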

@victorboissiere
Author

victorboissiere commented Aug 20, 2025

Thanks for your reply! We're seeing a lot of these errors:

 ServiceToken::Errors::STSError (Failed to open TCP connection to sts.eu-west-3.amazonaws.com:443 (execution expired))

and it keeps failing until the pod restarts. It only ever affects some pods, not all of them (this service, for instance, runs at least 20). In the most recent case, the STS endpoint failed in a single pod for more than 12 hours, with very frequent requests made to STS all ending in "execution expired".
With retries, the same underlying connection is always used. This is only happening on Ruby services that use this library (not on other services, such as those written in Go). The applications are all deployed in EKS as Docker containers, and there isn't any proxy between the app and the STS API (on our side at least; there may be one on the AWS side).

Other recent cases are sporadic failures on one pod at a time, lasting until the pod restarts or stopping on their own after a while.

@jterapin
Contributor

Hi @victorboissiere - we will need more information to investigate this further. Here's what we need from you:

  • Ruby version
  • aws-sdk-core and aws-sdk-sts versions
  • Full error stack trace

Also...

"With retries, the same underlying connection is always used."

How do you determine that the connection was re-used? Also, have you been running the proposed fix and did it mitigate the issue?
