@djeebus djeebus commented Nov 27, 2025

This roughly triples the number of series we collect from hostmetrics, which increases the total series count of my whole dev cluster by ~30%. It provides lots of useful information that's worth tracking, though.


Note

Expands otel-collector hostmetrics (CPU/load/filesystem/memory/network/processes), adds CPU time aggregation, refines filters/labels, and removes an unused feature gate.

  • Observability (otel-collector):
    • Hostmetrics expansion in iac/provider-gcp/nomad/configs/otel-collector.yaml:
      • Enable additional CPU (time, counts), load (adds 15m), filesystem (usage/inodes/utilization), memory (usage/utilization/dirty/hugepages), network (connections/dropped/errors/io/conntrack), and processes (count/created) metrics; disable unneeded modules via comments.
      • Add metricstransform/single_cpu to aggregate system.cpu.time by node.id and state; include in metrics/host pipeline.
      • Broaden filter/drop_by_device to regex-match all system.network.* on veth/docker/lo and drop all loop-device metrics; strip filesystem labels via explicit list.
      • Minor pipeline processor list formatting; remove otelcol.* from OTLP include set.
    • Nomad job in iac/provider-gcp/nomad/jobs/otel-collector.hcl:
      • Remove --feature-gates=pkg.translator.prometheus.NormalizeName arg.
  • Tooling:
    • Add .editorconfig for YAML (2-space, spaces).

Written by Cursor Bugbot for commit c93a45a.
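
For readers who have not opened the diff, the aggregation and filter changes summarized above could look roughly like the fragment below. This is a sketch, not the contents of otel-collector.yaml: the processor names `metricstransform/single_cpu` and `filter/drop_by_device` and the pipeline name `metrics/host` come from the summary, while the regexes, the OTTL-style filter syntax, the assumption that `node.id` is available as a datapoint label, and the `batch` and `otlp` components in the pipeline are illustrative guesses.

```yaml
# Illustrative fragment only; receivers and exporters are assumed to be defined elsewhere.
processors:
  # Collapse per-core system.cpu.time datapoints into one series per node and state.
  # aggregate_labels keeps only the labels in label_set and sums across the rest (e.g. cpu).
  metricstransform/single_cpu:
    transforms:
      - include: system.cpu.time
        match_type: strict
        action: update
        operations:
          - action: aggregate_labels
            label_set: [node.id, state]   # assumes node.id is a datapoint label here
            aggregation_type: sum

  # Drop network datapoints reported for container and loopback interfaces.
  # The device patterns are assumptions based on the summary (veth/docker/lo).
  filter/drop_by_device:
    error_mode: ignore
    metrics:
      datapoint:
        - 'IsMatch(metric.name, "^system[.]network[.]") and IsMatch(attributes["device"], "^(veth|docker|lo)")'

service:
  pipelines:
    metrics/host:
      receivers: [hostmetrics]
      # Filtering first means the aggregation step never sees the dropped datapoints.
      processors: [filter/drop_by_device, metricstransform/single_cpu, batch]
      exporters: [otlp]
```

Summing across the `cpu` label is what keeps the series count for `system.cpu.time` from scaling with the number of cores on each host.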

    enabled: false
  system.cpu.utilization:
    enabled: false
  # default, only need to be disabled
Member

Can you keep it explicit? Or is there a reason not to?

Contributor Author

I can do that; it just seemed arbitrary which series we redundantly defined and which ones we omitted. Mind if I add all of them explicitly?

Member

I agree it can be harder to maintain, but I don’t think the host metrics change often.

Having them explicitly defined makes reviews simpler, and it also helps anyone browsing the repo understand exactly which metrics we expose without needing to check the OTEL docs or deploy the observability stack.

It also makes our intent clear (what we've chosen to enable or disable), so if new metrics appear later, it's obvious that they weren't turned on intentionally and are safe to turn off.
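
As a rough illustration of the explicit style being asked for, a scraper block might list every metric with its intended state, including ones whose value matches the upstream default. The fragment below is a sketch rather than the merged config: the metric names are standard hostmetrics names, but which ones this repo actually enables is an assumption, apart from `system.cpu.utilization` being disabled as in the snippet under review.

```yaml
# Illustrative fragment only; the enabled/disabled choices here are assumptions.
receivers:
  hostmetrics:
    scrapers:
      cpu:
        metrics:
          # Enabled by default upstream; listed anyway so the intent is visible.
          system.cpu.time:
            enabled: true
          # Optional metric, disabled by default upstream; kept off explicitly.
          system.cpu.utilization:
            enabled: false
      load:
        metrics:
          system.cpu.load_average.1m:
            enabled: true
          system.cpu.load_average.5m:
            enabled: true
          system.cpu.load_average.15m:
            enabled: true
```

With every metric spelled out, a metric that shows up in the backend but not in this file can be recognized as a new upstream default rather than a deliberate choice.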

@djeebus djeebus enabled auto-merge (squash) December 2, 2025 19:27
@djeebus djeebus merged commit 50e24f6 into main Dec 2, 2025
27 checks passed
@djeebus djeebus deleted the add-more-metrics branch December 2, 2025 19:34