Add metric to track the size and labels of individual disks #91
Conversation
@paulajulve I believe it still does. That particular comment refers to adding the disk ID as a label as the main culprit leading to a cardinality explosion of metrics. Say, for instance, you have 1,000 unused disks on each provider: you'll end up with a label that has 1,000 times the number of providers in unique values. As per this article, label cardinality should be kept as low as possible if we don't want to bog down Prometheus.
As explained in the main comments section, I'm concerned about the cardinality of the new metric, in particular in the `disk` label.
Given that the information was already being logged, we could use Loki and its `unwrap` LogQL keyword to derive values from those logs, instead of loading Prometheus.
I could update the "unused disk found" log to include the `k8s_namespace` too (and maybe use bytes instead of GB/GiB) and use `unwrap` more or less like the sketch below.
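A minimal sketch of such a query, assuming the "unused disk found" line is logfmt-formatted and carries hypothetical `size_bytes`, `k8s_namespace`, and `zone` fields (the stream selector is made up too):

```logql
# One value per disk (its latest reported size in the window),
# then summed per namespace and zone.
sum by (k8s_namespace, zone) (
  max_over_time(
    {job="unused-disks"} |= "unused disk found"
      | logfmt
      | unwrap size_bytes [1h]
  ) by (disk, k8s_namespace, zone)
)
```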
I'm unclear on next steps though 🤔 we want to add this to our TCO recording rules. Can we add this expression to a Prometheus recording rule? I've seen Loki-based alerts, but I'm not sure we can mix and match like this out of the box.
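For what it's worth, Loki's ruler does support recording rules, evaluating LogQL on a schedule and remote-writing the resulting samples to a Prometheus-compatible store, so this may be possible without mixing query languages on the Prometheus side. A hedged sketch of such a rule, reusing the query above (group, rule, and field names are assumptions):

```yaml
groups:
  - name: unused-disks-tco
    rules:
      - record: k8s_namespace_zone:unused_disk_size_bytes:sum
        expr: |
          sum by (k8s_namespace, zone) (
            max_over_time(
              {job="unused-disks"} |= "unused disk found"
                | logfmt
                | unwrap size_bytes [1h]
            ) by (disk, k8s_namespace, zone)
          )
```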
As these are metrics we're adding, it'll be a matter of juggling how we could come up with a recording rule to "fill the gap(s)" of the missing labels to join, i.e. crafting a PromQL query that joins the metrics we're adding onto some existing ones (in my test I mocked 3 existing disks, ab-using an existing metric as the join target).
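A hedged sketch of that kind of join; `some_existing_metric` and the `cluster` label are placeholders, not what was actually used in the test:

```promql
# Attach the cluster label from an existing metric that carries it,
# joining on the labels both series share. Multiplying by the `group`
# aggregation (which yields 1 per group) preserves the left-hand values.
unused_disk_size_bytes
  * on (k8s_namespace, zone) group_left (cluster)
    group by (k8s_namespace, zone, cluster) (some_existing_metric)
```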
IMO these disk IDs live in the ~same playground as pod IDs (which of course we already carry from KSM metrics), and I think we may need them as metrics, ready there to be able to create tables.
LGTM semantically and syntactically, up to my (Golang) reading ability, but we'd better also get @inkel's approval for the latter.
LGTM.
Re: https://github.com/grafana/deployment_tools/issues/111858
Once the disks are detached from the cluster, the only way back to knowing where they belonged is by using the namespace and the zone. But some namespaces exist in different zones/clusters (i.e., generic names like `hosted-exporters`). The current `unused_disks_size_gb` metric was adding up the sizes of all disks under the same namespace. In the case of unique namespaces, that's alright. But for those that span multiple regions, it can lead to costs being attributed to the wrong cluster, and to data that's somewhat difficult to interpret.

To make it easier for future us, I'm adding a new `unused_disk_size_bytes` metric here, which publishes the size data per disk instead of added up. It still carries the `k8s_namespace`, `type`, and `zone` labels, since we'll need all of them to be able to attribute costs appropriately.

I've also opted for bytes as the unit, since the size field is not uniform across providers: Azure and GCP report the size of the volume in GB, while AWS uses GiB. Even though they all end up billing per GB 🙄
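A minimal Go sketch of the shape of this change, using the prometheus/client_golang gauge API; the `Disk` type, the `record` helper, and the label set are assumptions for illustration, not the PR's actual code:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Disk is a hypothetical stand-in for the provider-specific disk structs.
type Disk struct {
	ID, Namespace, Type, Zone string
	Size                      int64 // GB on Azure/GCP, GiB on AWS
}

// unusedDiskSize publishes one sample per disk rather than a per-namespace sum.
var unusedDiskSize = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "unused_disk_size_bytes",
	Help: "Size in bytes of an individual unused disk.",
}, []string{"disk", "k8s_namespace", "type", "zone"})

const (
	bytesPerGB  = 1_000_000_000 // Azure and GCP report volume sizes in GB
	bytesPerGiB = 1 << 30       // AWS reports volume sizes in GiB
)

// record normalizes the provider-reported size to bytes before exporting it.
func record(d Disk, bytesPerUnit int64) {
	unusedDiskSize.
		WithLabelValues(d.ID, d.Namespace, d.Type, d.Zone).
		Set(float64(d.Size * bytesPerUnit))
}

func main() {
	prometheus.MustRegister(unusedDiskSize)
	// e.g. an AWS volume, whose size arrives in GiB:
	record(Disk{ID: "vol-0abc", Namespace: "hosted-exporters", Type: "gp3", Zone: "us-east-1a"}, bytesPerGiB)
}
```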
I've tested this out locally both for AWS and GCP. I have not tested it for Azure. I'll work on that now.