Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add telemetry event span for socket dispatch #6019

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

danschultzer
Copy link

@danschultzer danschultzer commented Dec 16, 2024

The reason for adding this telemetry event is that my OTel spans are full of WebSocket connections that just reads GET. This is because Bandit will just use the method which is correct as the URL may be high cardinality, however the socket path is low cardinality and should show up as e.g. GET /live/websocket.

Since socket requests are handled and halted before the rest of the endpoint plugs I can't depend on Plug.Telemetry to report them and use the [:phoenix, :endpoint] events. I could depend on [:phoenix, :socket_connected] but this will only be for successful connections.

So I'm proposing that we add a [:phoenix, :socket_dispatch] event matching the [:phoenix, :router_dispatch] events. It somewhat conflicts with the existing [:phoenix, :socket_connected] though this event only have high level metadata vs socket connected having connection metadata including the result of setting up the socket.

With the ability to attach to this telemetry event we can update the :opentelemetry_phoenix to update the span name with the route template in the same way it currently does for router dispatch events.

@josevalim
Copy link
Member

Hi @danschultzer. Our long term goal is to move the socket dispatch out of being hardcoded and make it part of the regular plug pipeline. This means we would eventually "break" this event. Therefore, if we want to automate it, I would do it for the upgrade command itself, if possible, otherwise we wait.

@danschultzer
Copy link
Author

Thanks @josevalim, makes sense to not introduce new events if they have to be removed later.

Therefore, if we want to automate it, I would do it for the upgrade command itself, if possible, otherwise we wait.

Unfortunately I don't see any way to resolve this because the socket handling happens before anything else. This is a gotcha for everyone implementing OTel tracing where they need to set up custom sampling to filter out all these stray GET traces as it can't be properly attributed to the socket route (and only filter out those as there can be other generic HTTP traces you want to keep). The socket HTTP requests are the most common HTTP requests with Phoenix LiveView.

There are several issues:

  1. It's for any socket transport, not just websocket, e.g. longpoll does not send any events at all
  2. The upgrade event is only sent when it's succesful, which means there may be stray events for unsuccesful upgrades with websocket
  3. The socket can have path variables like /:foo/socket which means we can't depend on the url attribute alone (from bandit/cowboy), we need the URL template route, and Plug.Telemetry won't know about the route template

Our long term goal is to move the socket dispatch out of being hardcoded and make it part of the regular plug pipeline. This means we would eventually "break" this event.

If moved to the plug pipeline how would the routing work? Would the socket definitions be pushed to the router instead so it becomes part of the router dispatch events? The problem is that we receive telemetry events from the web server (bandit/cowboy), and when it reaches Phoenix we need to enrich the traces with the route template so it can show up with an appropriate span name and attributes for context.

I also think it's preferable to have events/spans for all requests happening in the Phoenix endpoint.

If the idea is to add it to the router, then we could maybe make it a router_dispatch event for now? If not, maybe we can somehow leverage Plug.Telemetry events inferring it during socket dispatch, but the (lesser) problem with that is that Plug.Telemetry events doesn't have any route metadata so it has to be handled on the consumer side with introspecting of the endpoint.

I'm trying to think of ways to use existing events to resolve this for now, while not making it be a thing that will break in the future. Long term can mean it'll be a very long time before sockets becomes part of the plug pipeline (and even then Plug.Telemetry doesn't have the low cardinailty route in metadata).

@josevalim
Copy link
Member

Is this something that could be pushed to the transport and each transport emits its own event? The benefit is that they could attach metadata relevant to them?

@danschultzer
Copy link
Author

Yeah, that could work, but not sure there's much benefit to it. The route needs to be passed to the transport plug, the transport config is already handled in the endpoint instead of in the plug, and we have all metadata in the returned conn. The only thing I can think of that may be missing would be metadata for exceptions raised, e.g. connect_info, but I don't think any exceptions are raised unless it's a config error.

Alternatively we can have the span around the plug call as in this PR, but change the event name to [:phoenix, :websocket] and [:phoenix, :longpoll]? Does that naming make sense longterm?

@josevalim
Copy link
Member

Alternatively we can have the span around the plug call as in this PR

The point is that this call may no longer exist or no longer be the main entry point in the future, so we should not do anything additional in it. :)

@josevalim
Copy link
Member

We should also double check if Bandit is already not emitting events for these.

@danschultzer
Copy link
Author

We should also double check if Bandit is already not emitting events for these.

The span is actually from Bandit/Cowboy. The way OpenTelemetryPhoenix works is that it attaches to the Phoenix.Router telemetry event and then enrich the span set by Bandit/Cowboy, as Bandit/Cowboy won't know about route template and will just use method as the span name. That's why you'll see a bunch of GET spans for the sockets because the spans are not enriched.

The point is that this call may no longer exist or no longer be the main entry point in the future, so we should not do anything additional in it. :)

Yeah, I see that, but how will the socket routing work in the future? With the regular routes OpenTelemetryPhoenix listens to the Phoenix.Router telemetry event to populate the OTel span name with the template route. It seems that the sockets still need a telemetry events that has route template metadata. Is the idea that the sockets will live in the router or the endpoint pipeline? I don't think the latter will have any route template info as the event would be triggered before routing. If the former then we could maybe make this a [:phoenix, :router_dispatch] event.

The bottom line is that the socket requests becomes stray Bandit/Cowboy spans in OTel with little context. You either have to filter them out entirely with a custom sampler, or find some alternative way to enrich the spans (which would mean you have to hardcode the sockets routes somewhere else to match them).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants