Commit 1a4d91e

CP-35087: Upgrade Prometheus to 3.x with backward compatibility
Prometheus 3.0 was released and includes performance improvements and new features. We want to upgrade to the latest stable version (3.7.3) while maintaining backward compatibility for customers who may be using custom Prometheus 2.x images.

Impact:

Before: Chart used Prometheus 2.55.1 by default. Prometheus 3.x images would produce warnings about unknown flag "--enable-feature=agent" since the agent mode flag changed between versions.

After: Chart uses Prometheus 3.7.3 by default. Both 2.x and 3.x versions work correctly with automatic flag detection based on image version.

Scope: All customers, but particularly those using custom Prometheus images via server.image.tag or components.prometheus.image.tag overrides.

Implementation Approach:

Prometheus 3.x changed the agent mode flag from `--enable-feature=agent` to just `--agent`. To support both versions seamlessly, we needed version-aware flag selection that respects the same image tag fallback chain used for actual image generation.

Solution:

1. Updated Chart.AppVersion from v2.55.1 to v3.7.3 in helm/Chart.yaml
2. Created cloudzero-agent.prometheusAgentFlag helper in helm/templates/_helpers.tpl that:
   - Checks if mode is "agent" or "federated" (using existing mode derivation logic)
   - Uses same tag fallback as image generation: server.image.tag → components.prometheus.image.tag → Chart.AppVersion
   - Returns "--enable-feature=agent" for v2.x tags
   - Returns "--agent" for v3.x and newer tags
   - Returns empty string for server/clustered modes
3. Updated helm/templates/agent-deploy.yaml and helm/templates/agent-daemonset.yaml to use the new helper instead of hardcoded flags
4. Updated helm/tests/agent_mode_derivation_test.yaml to expect "--agent" flag for default Chart.AppVersion (now 3.7.3)
5. Added comprehensive test suite in helm/tests/prometheus_version_flag_test.yaml covering:
   - Prometheus 2.x uses --enable-feature=agent
   - Prometheus 3.x uses --agent
   - server.image.tag precedence over components.prometheus.image.tag
   - Both agent and federated modes
   - Server mode (no agent flag)
   - Fallback to Chart.AppVersion

Validation:

- All 255 Helm unit tests pass (added 8 new tests)
- Schema validation tests pass
- Go unit tests pass
- Deployed to brahms cluster successfully
- Verified Prometheus 3.7.3 starts correctly in agent mode with no warnings
- Confirmed all expected metrics (container, node, pod, GPU) flowing correctly
- Zero dropped metrics in production deployment
- Manually tested both v2.x and v3.x tags produce correct flags
- Verified server.image.tag override precedence works correctly
1 parent 635a1b3 commit 1a4d91e

9 files changed (+183, -151 lines changed)


helm/Chart.yaml

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ kubeVersion: ">= 1.21.0-0"
 maintainers:
   - name: CloudZero
 
-appVersion: "v2.55.1"
+appVersion: "v3.7.3"
 dependencies:
   - name: kube-state-metrics
     version: "5.36.*"

helm/templates/_helpers.tpl

Lines changed: 31 additions & 0 deletions
@@ -1294,3 +1294,34 @@ alloy-config.river
 prometheus.yml
 {{- end -}}
 {{- end -}}
+
+{{/*
+Get the appropriate Prometheus agent mode flag based on version and mode
+
+Determines whether Prometheus should run in agent mode and which flag to use:
+- Prometheus 2.x uses --enable-feature=agent
+- Prometheus 3.x uses --agent
+- Returns empty string if not in agent/federated mode
+
+The cloudzero-agent.Values.components.agent.mode helper already handles all the
+complex mode derivation logic, so we just check if it returns "agent" or "federated"
+and then determine the appropriate version-specific flag.
+
+Uses the same tag fallback chain as image generation:
+server.image.tag -> components.prometheus.image.tag -> Chart.AppVersion
+
+Usage: {{ include "cloudzero-agent.prometheusAgentFlag" . }}
+Returns: string (either "--agent", "--enable-feature=agent", or empty string)
+*/}}
+{{- define "cloudzero-agent.prometheusAgentFlag" -}}
+{{- $mode := include "cloudzero-agent.Values.components.agent.mode" . -}}
+{{- if or (eq $mode "agent") (eq $mode "federated") -}}
+{{- /* Use same fallback chain as image generation: server.image.tag -> components.prometheus.image.tag -> Chart.AppVersion */ -}}
+{{- $tag := .Values.server.image.tag | default .Values.components.prometheus.image.tag | default .Chart.AppVersion -}}
+{{- if hasPrefix "v2." $tag -}}
+--enable-feature=agent
+{{- else -}}
+--agent
+{{- end -}}
+{{- end -}}
+{{- end -}}
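
The selection logic of the helper above can be mirrored outside of Helm to reason about its behaviour. The following Python sketch is illustrative only: the function name `prometheus_agent_flag` and its parameters are hypothetical stand-ins for the template values, and it assumes Sprig's `default` falls through on empty strings the same way Python's `or` does. It reproduces the fallback chain and the `hasPrefix "v2."` check as written in the diff.

```python
def prometheus_agent_flag(mode, server_tag="", component_tag="", app_version="v3.7.3"):
    """Hypothetical mirror of the cloudzero-agent.prometheusAgentFlag helper.

    mode          -- derived agent mode ("agent", "federated", "server", "clustered")
    server_tag    -- stand-in for .Values.server.image.tag
    component_tag -- stand-in for .Values.components.prometheus.image.tag
    app_version   -- stand-in for .Chart.AppVersion
    """
    # Only agent and federated modes get an agent flag at all.
    if mode not in ("agent", "federated"):
        return ""
    # Same fallback chain as the template: first non-empty value wins.
    tag = server_tag or component_tag or app_version
    # Only tags that literally start with "v2." select the 2.x flag.
    if tag.startswith("v2."):
        return "--enable-feature=agent"
    return "--agent"


if __name__ == "__main__":
    for case in (
        {"mode": "agent"},
        {"mode": "federated", "server_tag": "v2.55.1"},
        {"mode": "agent", "server_tag": "v3.7.3", "component_tag": "v2.55.1"},
        {"mode": "server", "server_tag": "v2.55.1"},
    ):
        print(case, "->", repr(prometheus_agent_flag(**case)))
```

Because the check is a plain prefix match, a tag without the leading `v` (for example `2.55.1`) does not match `v2.` and falls into the `--agent` branch, both in this sketch and in the template as written.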

helm/templates/agent-daemonset.yaml

Lines changed: 3 additions & 3 deletions
@@ -111,9 +111,9 @@ spec:
             - --web.enable-lifecycle
             - --web.console.libraries=/etc/prometheus/console_libraries
             - --web.console.templates=/etc/prometheus/consoles
-            {{- /* In federated mode, default to Prometheus agent unless server.agentMode is explicitly false */ -}}
-            {{- if or (not .Values.server) (eq .Values.server.agentMode nil) (eq .Values.server.agentMode true) }}
-            - --enable-feature=agent
+            {{- $agentFlag := include "cloudzero-agent.prometheusAgentFlag" . }}
+            {{- if $agentFlag }}
+            - {{ $agentFlag }}
             {{- end }}
           ports:
             - containerPort: 9090

helm/templates/agent-deploy.yaml

Lines changed: 3 additions & 2 deletions
@@ -148,8 +148,9 @@ spec:
             - /checks/app/config/validator.yml
           args:
 {{ toYaml .Values.server.args | nindent 12}}
-            {{- if or (eq (include "cloudzero-agent.Values.components.agent.mode" .) "agent") (eq (include "cloudzero-agent.Values.components.agent.mode" .) "federated") }}
-            - --enable-feature=agent
+            {{- $agentFlag := include "cloudzero-agent.prometheusAgentFlag" . }}
+            {{- if $agentFlag }}
+            - {{ $agentFlag }}
             {{- end }}
             - --log.level={{ .Values.server.logging.level | default "info" }}
           ports:
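
Both templates now follow the same pattern: compute the flag once, then emit it only when it is non-empty. A rough Python sketch of how the resulting argument list is assembled in agent-deploy.yaml is shown below; `build_prometheus_args` and its parameters are hypothetical names, and the sample arguments are placeholders rather than the chart's actual defaults.

```python
def build_prometheus_args(server_args, agent_flag, log_level=None):
    """Hypothetical sketch of the args list rendered by agent-deploy.yaml.

    server_args -- stand-in for .Values.server.args
    agent_flag  -- result of the prometheusAgentFlag helper ("" when not in agent mode)
    log_level   -- stand-in for .Values.server.logging.level
    """
    args = list(server_args)
    # Mirrors `{{- if $agentFlag }}`: an empty string emits no list item at all.
    if agent_flag:
        args.append(agent_flag)
    # Mirrors `| default "info"` for the log level.
    args.append("--log.level=" + (log_level or "info"))
    return args


if __name__ == "__main__":
    print(build_prometheus_args(["--web.enable-lifecycle"], "--agent"))
    print(build_prometheus_args(["--web.enable-lifecycle"], ""))
```

Guarding on the empty string matters because an unconditional `- {{ $agentFlag }}` would render an empty list item in server and clustered modes.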

helm/tests/agent_mode_derivation_test.yaml

Lines changed: 5 additions & 5 deletions
@@ -68,7 +68,7 @@ tests:
       # Should have agent mode flag (default)
       - contains:
           path: spec.template.spec.containers[1].args
-          content: --enable-feature=agent
+          content: --agent
         template: templates/agent-deploy.yaml
 
   # Test automatic mode: server.agentMode=false derives mode as "server"
@@ -85,7 +85,7 @@ tests:
       # Should NOT have agent mode flag (server mode)
       - notContains:
           path: spec.template.spec.containers[1].args
-          content: --enable-feature=agent
+          content: --agent
         template: templates/agent-deploy.yaml
       # Should have Prometheus config (not Alloy)
       - isNotEmpty:
@@ -106,7 +106,7 @@ tests:
       # Should have agent mode flag (default)
       - contains:
           path: spec.template.spec.containers[1].args
-          content: --enable-feature=agent
+          content: --agent
         template: templates/agent-deploy.yaml
       # Should have Prometheus config (not Alloy)
       - isNotEmpty:
@@ -155,7 +155,7 @@ tests:
       # Should have agent mode flag
       - contains:
           path: spec.template.spec.containers[1].args
-          content: --enable-feature=agent
+          content: --agent
         template: templates/agent-deploy.yaml
 
   # Test that components.agent.mode="server" works
@@ -170,7 +170,7 @@ tests:
       # Should NOT have agent mode flag
       - notContains:
           path: spec.template.spec.containers[1].args
-          content: --enable-feature=agent
+          content: --agent
         template: templates/agent-deploy.yaml
 
   # Test that components.agent.mode="clustered" works
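
The updated assertions look for `--agent` as an element of the container's `args` list. The sketch below models that check in Python, assuming `contains` and `notContains` compare whole list elements rather than substrings. The helper names and sample argument lists are hypothetical and only illustrate the matching rule.

```python
def contains(args, content):
    """Whole-element membership, the matching rule assumed here for `contains`."""
    return content in args


def not_contains(args, content):
    """Inverse check, the matching rule assumed here for `notContains`."""
    return content not in args


# Hypothetical rendered argument lists for three scenarios.
v3_agent_args = ["--web.enable-lifecycle", "--agent"]
v2_agent_args = ["--web.enable-lifecycle", "--enable-feature=agent"]
server_args = ["--web.enable-lifecycle"]

if __name__ == "__main__":
    print(contains(v3_agent_args, "--agent"))
    print(contains(v2_agent_args, "--agent"))
    print(not_contains(server_args, "--agent"))
```

Under that assumption, `--enable-feature=agent` does not satisfy `contains: --agent`, so a case that pins a 2.x tag has to assert the older flag explicitly.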
