Add OTEL Tracing to Scheduling/Consolidation #2005
Comments
/priority important-soon
/triage accepted
One other thing that we've talked about is recording the current cluster state and periodically pushing it to some logging backend. That might work, though it's unclear how much data this would be. That option would at least let us reconstruct the inputs to our consolidation and provisioning loops, but it would require re-running the functions rather than giving us a historical record of what has already happened.
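To get a rough feel for the data-volume question, the snapshot idea can be sketched as below. This is a minimal illustration only: `Snapshot`, `NodeState`, `encodeSnapshot`, and `decodeSnapshot` are hypothetical names, not Karpenter types, and a real cluster state would carry far more fields than this.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NodeState is a hypothetical, heavily simplified view of one node.
type NodeState struct {
	Name string   `json:"name"`
	Pods []string `json:"pods"`
}

// Snapshot is a hypothetical record of the inputs to a scheduling or
// consolidation loop at one point in time.
type Snapshot struct {
	TakenAtUnix int64       `json:"takenAtUnix"`
	Nodes       []NodeState `json:"nodes"`
}

// encodeSnapshot serializes a snapshot so it can be shipped to a log backend.
func encodeSnapshot(s Snapshot) ([]byte, error) {
	return json.Marshal(s)
}

// decodeSnapshot restores a snapshot so the loop could be re-run against it.
func decodeSnapshot(b []byte) (Snapshot, error) {
	var s Snapshot
	err := json.Unmarshal(b, &s)
	return s, err
}

func main() {
	snap := Snapshot{
		TakenAtUnix: 1700000000,
		Nodes: []NodeState{
			{Name: "node-a", Pods: []string{"web-1", "web-2"}},
			{Name: "node-b", Pods: []string{"db-1"}},
		},
	}
	payload, err := encodeSnapshot(snap)
	if err != nil {
		panic(err)
	}
	// The payload size per snapshot, times the push frequency, gives a
	// first estimate of how much data the logging backend would receive.
	fmt.Printf("snapshot: %d nodes, %d bytes\n", len(snap.Nodes), len(payload))
}
```

Decoding a stored payload gives back the loop inputs, which is the "re-run the functions" path described above rather than a record of what actually executed.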
There is also an example of this being done for K8s system components -- one of particular interest (that would probably have traces closest to Karpenter) would be the kubelet tracing.
Hey, I'm one of the TLs for SIG Instrumentation, and have been working on the tracing integration in the Kubelet and APIServer. I'm not very familiar with Karpenter, but I'm assuming it follows the normal "operator" pattern, rather than serving requests directly. The good aspect of tracing for operators is that it provides very detailed information about the operator's behavior, especially if it is complex (e.g. multiple steps, parallelism), or if it involves making requests to external systems (e.g. cloud provider APIs). The challenges are:
I'm also a maintainer of the OpenTelemetry-Go project, so if you have any general questions about it, I'm happy to help.
I think the question is less about Karpenter itself not doing anything and more about Karpenter reacting to something and then deciding that nothing needs to be changed on the cluster. In that scenario, I feel like tracing would be appropriate. The big problem today is that Karpenter will log when something is executed, but it won't log when nothing is done. As it stands today, so many operations are taking place that logging every case where nothing happened may not be practical. I would be interested to hear what you perceive as the trade-offs between using tracing and using logging w.r.t. tracking Karpenter's decision making.
Can't you force it to sample always? My understanding was that sampling is opt-in and that the default out-of-the-box experience with OTEL is that it would keep track and forward all spans back from the application. I was referring back to this documentation here. I guess it's mostly a question of the trade-offs between the performance impact of having this data and wanting things to scale well in production.
Right. Your only option if you really want to be sure something is sampled is to turn it up to 100% sampling. That is probably ok if you are developing or testing, but depending on the number of spans you generate, it might be too much for prod (or maybe it isn't!). Logs can be structured, and can also attach a trace context (although this is odd to do without tracing), so the main differences between spans and logs are:
You should also consider using Kubernetes events. If what you are trying to expose is relatively high-level, many tools/UIs already integrate well with events, and they integrate nicely with kubectl.
Yeah, we already have events enabled but we are looking for something that we could give to folks who are looking to really understand the behavior and decision making of the system. If we were to fire that as events, it would be too much and would overwhelm the apiserver.
Tracing sounds like a reasonable fit for your needs, then.
/assign jonathan-innis
I'm POC-ing something and if anyone has thoughts or wants to help me out, let me know!
Something interesting and relevant to the conversation as well from Kubecon: https://www.youtube.com/watch?v=kzXT0WlTBpw
Description
What problem are you trying to solve?
Right now, there isn't a good way to trace through what scheduling or consolidation is doing, particularly when the scheduling or consolidation loops produce no output. It would be really nice if we built out OTEL-based tracing that generated spans for the different function blocks and recorded important information, like which nodes were attempted within each block.
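A rough sketch of the shape this could take is below. It deliberately uses a small stdlib-only stand-in rather than the OpenTelemetry-Go API, so it runs without extra modules; `Span`, `StartSpan`, `consolidate`, and the attribute keys `nodes.attempted` and `outcome` are all hypothetical names chosen for illustration, not Karpenter or OTEL identifiers. The point it shows is that a span is recorded for the block even when the decision is to do nothing.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a stand-in for a tracing span (hypothetical type, not the
// OpenTelemetry API): a named block with timing, attributes, and children.
type Span struct {
	Name       string
	Start, End time.Time
	Attrs      map[string][]string
	Children   []*Span
}

// StartSpan opens a span; pass a parent to nest it, or nil for a root span.
func StartSpan(parent *Span, name string) *Span {
	s := &Span{Name: name, Start: time.Now(), Attrs: map[string][]string{}}
	if parent != nil {
		parent.Children = append(parent.Children, s)
	}
	return s
}

// Finish stamps the end time of the span.
func (s *Span) Finish() { s.End = time.Now() }

// consolidate is a hypothetical function block. It records every candidate
// node it looked at and the final outcome, including the "no-action" case
// that would otherwise leave no trace in the logs.
func consolidate(root *Span, candidates []string, removable func(string) bool) []string {
	span := StartSpan(root, "consolidation.evaluate")
	defer span.Finish()

	var chosen []string
	for _, n := range candidates {
		span.Attrs["nodes.attempted"] = append(span.Attrs["nodes.attempted"], n)
		if removable(n) {
			chosen = append(chosen, n)
		}
	}
	outcome := "no-action"
	if len(chosen) > 0 {
		outcome = "consolidate"
	}
	span.Attrs["outcome"] = []string{outcome}
	return chosen
}

func main() {
	root := StartSpan(nil, "consolidation.loop")
	// No node is removable here, so the loop decides to do nothing.
	chosen := consolidate(root, []string{"node-a", "node-b"}, func(string) bool { return false })
	root.Finish()

	child := root.Children[0]
	fmt.Println(child.Name, child.Attrs["nodes.attempted"], child.Attrs["outcome"], len(chosen))
}
```

In a real implementation the stand-in would presumably be replaced by an OTEL tracer and span attributes, with an exporter and sampler configured to match the production overhead the cluster can tolerate.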
How important is this feature to you?
This could provide critical insight into how the application is running, helping users debug for themselves what Karpenter is doing without having to dive too deep into the code internals and guess at what the scheduler is doing.