Skip to content

A94: OTel metrics for Subchannels #485

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

yashykt
Copy link
Member

@yashykt yashykt commented Mar 19, 2025

@yashykt
Copy link
Member Author

yashykt commented Mar 19, 2025

cc @joybestourous

Copy link

@joybestourous joybestourous left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for including the open_connection metric, this is super helpful for us.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing this up, Yash!

subchannel shutdown | The subchannel was shutdown. This can happen due to reasons such as the parent channel shutting down, channel becoming idle, the load balancing policy changing due to a resolver update or a change in list of endpoint addresses.
connection reset | Connection was reset (eg. ECONNRESET, WSAECONNERESET)
connection timed out | Connection timed out, also includes connections closed due to gRPC keepalives.
connection aborted | Connection was aborted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember discussing this one -- not sure if we didn't, or if I just forgot.

What exactly triggers this one? How is it different from "connection reset"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added "connection timed out", "connection reset" and "connection aborted" as the most common socket errors and implemented across platforms. There are some differences across platforms as to the exact implementation but overall -
"connection reset" - a TCP RST was received
"connection timed out" - keepalive or tcp user timeout
"connection aborted" - The connection was aborted/closed by the host itself (software initiated - application/OS). This can happen due to reasons like resource exhaustion, unexpected/internal error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In C-core, how are we going to differentiate between these cases? We would presumably need to expose the difference between them in the EventEngine API, and then we need to plumb that data through the transport and up to the subchannel. How exactly will this work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EventEngine callbacks have a status parameter that we can use. The endpoint will annotate the status object with the correct socket error message.

For the subchannel, in Core, we have connectivity state watchers on the transport that also have a status object that is part of the callback, so the same status object that can be propagated up to the subchannel that can decode the error message.

@yashykt yashykt changed the title A94: gRPC OTel metrics for Subchannels A94: OTel metrics for Subchannels Apr 3, 2025
| ----------------------- | ----------- | ------------------------------------ |
| grpc.target | Required | Indicates the target of the gRPC |
: : : channel (defined in [A66].) :
| grpc.lb.backend_service | Optional | The backend service to which the RPC |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to discuss how backend_service and locality will be plumbed from the LB tree for the metrics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreed, we need a plan for this. Unfortunately, due to pre-existing differences in attribute semantics across languages, the details will probably be a little different in each language.

In C-core, we'll just pass these through as channel args that are part of the subchannel key. I think we actually already have most of this, but we'll need to modify it slightly to be non-xDS-specific.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add connection metrics to standard Otel Stats
4 participants