-
Notifications
You must be signed in to change notification settings - Fork 244
A94: OTel metrics for Subchannels #485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for including the open_connection metric, this is super helpful for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for writing this up, Yash!
A94-subchannel-otel-metrics.md
Outdated
subchannel shutdown | The subchannel was shutdown. This can happen due to reasons such as the parent channel shutting down, channel becoming idle, the load balancing policy changing due to a resolver update or a change in list of endpoint addresses. | ||
connection reset | Connection was reset (eg. ECONNRESET, WSAECONNERESET) | ||
connection timed out | Connection timed out, also includes connections closed due to gRPC keepalives. | ||
connection aborted | Connection was aborted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember discussing this one -- not sure if we didn't, or if I just forgot.
What exactly triggers this one? How is it different from "connection reset"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added "connection timed out", "connection reset" and "connection aborted" as the most common socket errors and implemented across platforms. There are some differences across platforms as to the exact implementation but overall -
"connection reset" - a TCP RST was received
"connection timed out" - keepalive or tcp user timeout
"connection aborted" - The connection was aborted/closed by the host itself (software initiated - application/OS). This can happen due to reasons like resource exhaustion, unexpected/internal error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In C-core, how are we going to differentiate between these cases? We would presumably need to expose the difference between them in the EventEngine API, and then we need to plumb that data through the transport and up to the subchannel. How exactly will this work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EventEngine callbacks have a status parameter that we can use. The endpoint will annotate the status object with the correct socket error message.
For the subchannel, in Core, we have connectivity state watchers on the transport that also have a status object that is part of the callback, so the same status object that can be propagated up to the subchannel that can decode the error message.
A94-subchannel-otel-metrics.md
Outdated
| ----------------------- | ----------- | ------------------------------------ | | ||
| grpc.target | Required | Indicates the target of the gRPC | | ||
: : : channel (defined in [A66].) : | ||
| grpc.lb.backend_service | Optional | The backend service to which the RPC | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to discuss how backend_service and locality will be plumbed from the LB tree for the metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, agreed, we need a plan for this. Unfortunately, due to pre-existing differences in attribute semantics across languages, the details will probably be a little different in each language.
In C-core, we'll just pass these through as channel args that are part of the subchannel key. I think we actually already have most of this, but we'll need to modify it slightly to be non-xDS-specific.
Closes grpc/grpc#34886