Add metric to track the size and labels of individual disks #91
Conversation
@paulajulve I believe it still does. That particular comment refers to adding the disk ID as a label as the main culprit leading to a cardinality explosion of metrics. Say, for instance, you have 1,000 unused disks on each provider: you'll end up with a label that has 1,000 times the number of providers in unique values. As per this article, label cardinality should be kept as low as possible if we don't want to bog down Prometheus.
As explained in the main comments section, I'm concerned about the cardinality of the new metric, in particular in the `disk` label.
Given that the information was already being logged, we could use Loki and its `unwrap` LogQL keyword to derive values from those logs, instead of loading Prometheus.
I could update the "unused disk found" log to include the `k8s_namespace` too (and maybe use bytes instead of GB/GiB) and use `unwrap` more or less like the sketch below.
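A minimal sketch of such a query, assuming the "unused disk found" line is logfmt-formatted and carries hypothetical `size_bytes`, `k8s_namespace`, and `zone` fields (the stream selector is made up too):

```logql
# One value per disk (its latest reported size in the window),
# then summed per namespace and zone.
sum by (k8s_namespace, zone) (
  max_over_time(
    {job="unused-disks"} |= "unused disk found"
      | logfmt
      | unwrap size_bytes [1h]
  ) by (disk, k8s_namespace, zone)
)
```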
I'm unclear on next steps though 🤔 we want to add this to our TCO recording rules. Can we add this expression to a Prometheus recording rule? I've seen Loki-based alerts, but I'm not sure we can mix and match like this out of the box.
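For what it's worth, Loki's ruler does support recording rules, evaluating LogQL on a schedule and remote-writing the resulting samples to a Prometheus-compatible store, so this may be possible without mixing query languages on the Prometheus side. A hedged sketch of such a rule, reusing the query above (group, rule, and field names are assumptions):

```yaml
groups:
  - name: unused-disks-tco
    rules:
      - record: k8s_namespace_zone:unused_disk_size_bytes:sum
        expr: |
          sum by (k8s_namespace, zone) (
            max_over_time(
              {job="unused-disks"} |= "unused disk found"
                | logfmt
                | unwrap size_bytes [1h]
            ) by (disk, k8s_namespace, zone)
          )
```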
As these are metrics we're adding, it'll be a matter of juggling how we could come up with a recording rule to "fill the gap(s)" of the missing labels to join, i.e. crafting a PromQL query that joins the metrics we're adding onto some existing ones (in my test I mocked 3 existing disks, ab-using an existing metric as the join target).
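A hedged sketch of that kind of join; `some_existing_metric` and the `cluster` label are placeholders, not what was actually used in the test:

```promql
# Attach the cluster label from an existing metric that carries it,
# joining on the labels both series share. Multiplying by the `group`
# aggregation (which yields 1 per group) preserves the left-hand values.
unused_disk_size_bytes
  * on (k8s_namespace, zone) group_left (cluster)
    group by (k8s_namespace, zone, cluster) (some_existing_metric)
```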
IMO these disk IDs live in the ~same playground as pod IDs (which of course we already carry from KSM metrics), and I think we may need them as metrics, ready there to be able to create tables.
LGTM semantically and syntactically, up to my (Golang) reading ability, but we'd better also get @inkel's approval for the latter.
LGTM.
Re: https://github.com/grafana/deployment_tools/issues/111858
Once the disks are detached from the cluster, the only way back to knowing where they belonged is by using the namespace and the zone. But some namespaces exist in different zones/clusters (i.e., generic names like `hosted-exporters`). The current `unused_disks_size_gb` metric was adding up the sizes of all disks under the same namespace. In the case of unique namespaces, that's alright. But for those that span multiple regions, it can lead to costs being attributed to the wrong cluster, and to data that's somewhat difficult to interpret.

To make it easier for future us, I'm adding a new `unused_disk_size_bytes` metric here, which publishes the size data per disk instead of added up. It still carries the `k8s_namespace`, `type`, and `zone` labels, since we'll need all of them to be able to attribute costs appropriately.

I've also opted for bytes as the unit, since the size field is not uniform across providers: Azure and GCP report the size of the volume in GB, while AWS uses GiB. Even though they all end up billing per GB 🙄
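A minimal Go sketch of the shape of this change, using the prometheus/client_golang gauge API; the `Disk` type, the `record` helper, and the label set are assumptions for illustration, not the PR's actual code:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Disk is a hypothetical stand-in for the provider-specific disk structs.
type Disk struct {
	ID, Namespace, Type, Zone string
	Size                      int64 // GB on Azure/GCP, GiB on AWS
}

// unusedDiskSize publishes one sample per disk rather than a per-namespace sum.
var unusedDiskSize = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "unused_disk_size_bytes",
	Help: "Size in bytes of an individual unused disk.",
}, []string{"disk", "k8s_namespace", "type", "zone"})

const (
	bytesPerGB  = 1_000_000_000 // Azure and GCP report volume sizes in GB
	bytesPerGiB = 1 << 30       // AWS reports volume sizes in GiB
)

// record normalizes the provider-reported size to bytes before exporting it.
func record(d Disk, bytesPerUnit int64) {
	unusedDiskSize.
		WithLabelValues(d.ID, d.Namespace, d.Type, d.Zone).
		Set(float64(d.Size * bytesPerUnit))
}

func main() {
	prometheus.MustRegister(unusedDiskSize)
	// e.g. an AWS volume, whose size arrives in GiB:
	record(Disk{ID: "vol-0abc", Namespace: "hosted-exporters", Type: "gp3", Zone: "us-east-1a"}, bytesPerGiB)
}
```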
I've tested this out locally both for AWS and GCP. I have not tested it for Azure. I'll work on that now.