fix(site-agent): use mTLS for Flow gRPC and protect temporal-certs secret#2942
Conversation
Summary by CodeRabbit
WalkthroughThe Helm chart now preserves the Temporal client TLS placeholder secret across upgrades and adds Flow gRPC environment values. The Flow gRPC client now returns its client certificate through ChangesFlow deployment and startup
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-06-27 00:16:55 UTC | Commit: 22fb1d3 |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
rest-api/site-workflow/pkg/grpc/client/flow_client.go (1)
169-173: 🔒 Security & Privacy | 🔵 Trivial | 💤 Low valueOptional hardening: consider pinning
MinVersion. Static analysis notes the absence ofMinVersion. As a client, Go already negotiates a TLS 1.2 floor, so this is not a defect — but explicitly settingMinVersion: tls.VersionTLS13would make the security posture intentional rather than implicit, provided the Flow server supports it.🔒 Optional diff
mutualTLSConfig := &tls.Config{ + MinVersion: tls.VersionTLS13, GetClientCertificate: func(*tls.CertificateRequestInfo) (*tls.Certificate, error) { return &clientCert, nil }, RootCAs: capool, }Please confirm the Flow gRPC server negotiates TLS 1.3 before pinning, to avoid handshake regressions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rest-api/site-workflow/pkg/grpc/client/flow_client.go` around lines 169 - 173, The mutual TLS setup in flow client construction should explicitly set a minimum TLS version if the Flow gRPC server supports it. Review the tls.Config used with mutualTLSConfig and, before pinning, confirm server-side TLS 1.3 negotiation is available; if it is, add MinVersion to the tls.Config to make the security posture explicit without regressing handshakes.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@helm-prereqs/setup.sh`:
- Around line 807-816: The Docker config JSON in the registry pull secret setup
is being built with printf in a way that can emit invalid JSON when
REGISTRY_PULL_USERNAME or REGISTRY_PULL_SECRET contains special characters.
Update the shell logic in the registry secret construction block to use a proper
JSON encoder or equivalent escaping step before base64 encoding, keeping the
existing NICO_FLOW_ARGS and image-pull-secret wiring intact.
In `@helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml`:
- Around line 4-6: Clarify the keep-policy behavior in the Secret template
comments for temporal-certs-secret.yaml: `helm.sh/resource-policy: keep`
preserves the Secret on uninstall, so a reinstall will reuse the existing object
unless it is manually deleted first. Update the explanatory text near the kept
Secret to reflect this behavior accurately, using the Secret template and its
bootstrap placeholder context as the reference point.
---
Nitpick comments:
In `@rest-api/site-workflow/pkg/grpc/client/flow_client.go`:
- Around line 169-173: The mutual TLS setup in flow client construction should
explicitly set a minimum TLS version if the Flow gRPC server supports it. Review
the tls.Config used with mutualTLSConfig and, before pinning, confirm
server-side TLS 1.3 negotiation is available; if it is, add MinVersion to the
tls.Config to make the security posture explicit without regressing handshakes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b4eeeb77-48f7-4aa0-836d-db000ecd1635
📒 Files selected for processing (6)
helm-prereqs/setup.shhelm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yamlhelm/rest/nico-rest-site-agent/values.yamlrest-api/site-agent/pkg/components/managers/flowgrpc/init.gorest-api/site-agent/pkg/components/managers/flowgrpc/metrics.gorest-api/site-workflow/pkg/grpc/client/flow_client.go
22fb1d3 to
2f521ef
Compare
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml (1)
4-6: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winCorrect the reinstall behavior note for the kept Secret.
Line 6 still implies a release delete/reinstall creates a fresh placeholder. With
helm.sh/resource-policy: keep, Helm skips deletion when an operation would remove the resource, so the kept Secret can survive and be reused unless the Secret itself is deleted first. (v3-1-0.helm.sh)As per path instructions,
helm/**: “Review Helm charts for template correctness, values compatibility, Kubernetes API compatibility, security context, RBAC, probes, and upgrade behavior.”Suggested wording
-# bootstrap process; deleting and re-installing the release creates a fresh placeholder. +# bootstrap process. The kept Secret also survives uninstall; delete the Secret +# explicitly before reinstalling if a fresh placeholder is required.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml` around lines 4 - 6, The comment in temporal-certs-secret.yaml overstates reinstall behavior for the kept Secret. Update the explanatory text around the helm.sh/resource-policy: keep note to say that Helm preserves the Secret across delete/reinstall flows unless the Secret itself is removed first, and avoid claiming a fresh placeholder is automatically created. Use the existing Temporal client TLS certs / bootstrap placeholder wording so the guidance matches the template’s actual behavior.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml`:
- Around line 4-6: The comment in temporal-certs-secret.yaml overstates
reinstall behavior for the kept Secret. Update the explanatory text around the
helm.sh/resource-policy: keep note to say that Helm preserves the Secret across
delete/reinstall flows unless the Secret itself is removed first, and avoid
claiming a fresh placeholder is automatically created. Use the existing Temporal
client TLS certs / bootstrap placeholder wording so the guidance matches the
template’s actual behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 43ce878f-8cda-407c-9e1f-e31238e9e4e8
📒 Files selected for processing (3)
helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yamlhelm/rest/nico-rest-site-agent/values.yamlrest-api/site-workflow/pkg/grpc/client/flow_client.go
🚧 Files skipped from review as they are similar to previous changes (1)
- helm/rest/nico-rest-site-agent/values.yaml
…cret
Three related issues caused the site-agent to fail connecting to Flow gRPC:
1. FLOW_GRPC_SEC_OPT was missing from the site-agent Helm chart values, so
the Flow gRPC client defaulted to FlowServerTLS (1). Flow requires mTLS
(RequireAndVerifyClientCert); the server rejected the connection with
"tls: certificate required". Fix: add FLOW_GRPC_SEC_OPT: "2" as the
chart default so FlowMutualTLS is used by default.
2. The FlowMutualTLS case in NewFlowGrpcClient used tls.Config{Certificates: ...}
to present the client cert. Go's TLS stack only selects a cert from that
field when the cert's issuer matches the server's CertificateRequest acceptable
CA list; when there is no match it silently sends no cert, and the server
again rejects with "certificate required". Fix: use GetClientCertificate
instead, which unconditionally returns the cert — the same approach used in
rest-api/flow/pkg/certs/certs.go TLSConfig().
3. temporal-certs-secret.yaml created the secret with empty placeholder values
on every helm upgrade, overwriting certs written by the bootstrap process.
On the next pod restart the site-agent found the OTP gone, re-attempted
bootstrap (which fails because the OTP was already consumed), and blocked
forever — preventing FlowGrpc().Start() from running. Fix: add
helm.sh/resource-policy: keep so upgrades never touch this secret after
initial creation.
80d7f90 to
309fe8c
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml`:
- Around line 5-15: The Temporal cert Secret template still renders blank data
on every upgrade, so the bootstrap-populated values can be overwritten even with
helm.sh/resource-policy: keep. Update temporal-certs-secret.yaml to use lookup
or conditional rendering in the Secret manifest so the existing Secret data is
preserved on helm upgrade, and only emit the placeholder empty data when the
Secret does not already exist.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6e160872-7a87-4d6a-a485-51998db8d079
📒 Files selected for processing (3)
helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yamlhelm/rest/nico-rest-site-agent/values.yamlrest-api/site-workflow/pkg/grpc/client/flow_client.go
✅ Files skipped from review due to trivial changes (1)
- helm/rest/nico-rest-site-agent/values.yaml
| # resource-policy: keep prevents helm upgrade from overwriting certs written by the | ||
| # bootstrap process; deleting and re-installing the release creates a fresh placeholder. | ||
| apiVersion: v1 | ||
| kind: Secret | ||
| metadata: | ||
| name: {{ .Values.secrets.temporalClientCerts }} | ||
| namespace: {{ include "nico-rest-site-agent.namespace" . }} | ||
| labels: | ||
| {{- include "nico-rest-site-agent.labels" . | nindent 4 }} | ||
| annotations: | ||
| helm.sh/resource-policy: keep |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect whether this chart has any preserve-existing-data logic around the kept Secret.
rg -n --hidden --glob 'helm/rest/nico-rest-site-agent/**' \
'temporalClientCerts|temporal-client-site-agent-certs|helm.sh/resource-policy|lookup|stringData' \
helm/rest/nico-rest-site-agentRepository: NVIDIA/infra-controller
Length of output: 797
🏁 Script executed:
#!/bin/bash
set -euo pipefail
cat -n helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yamlRepository: NVIDIA/infra-controller
Length of output: 1062
🌐 Web query:
Helm "helm.sh/resource-policy: keep" upgrade patch behavior Secret
💡 Result:
Helm annotation helm.sh/resource-policy: keep on a Secret tells Helm to skip deleting that Secret during helm upgrade when the upgrade would otherwise result in its deletion. However, once kept, the Secret becomes “orphaned” and Helm “will no longer manage it in any way”, which means Helm will not reliably patch/update its content on subsequent upgrades. [1][2] So for “upgrade patch behavior” on Secrets: - If the Secret is still rendered by the chart, Helm may patch it normally as part of the upgrade; the main guarantee of keep is about avoiding deletion when Helm would remove it. - If the Secret would be removed because it no longer appears in the rendered manifests, keep prevents deletion, but then Helm stops managing it; patching/updating it from the chart is not expected. [1][2] Notes from Helm issue history: there were versions where the annotation was not respected during helm upgrade (leading to deletion/data loss), with reports that it worked again in later versions. [3] Reference: Helm docs for helm.sh/resource-policy: keep explicitly state it’s used to skip deleting resources during operations like helm upgrade and that the resource becomes orphaned / unmanaged afterward. [1][2]
Citations:
- 1: https://helm.sh/docs/howto/charts_tips_and_tricks
- 2: https://helm.sh/docs/howto/charts_tips_and_tricks/
- 3: Annotation "helm.sh/resource-policy: keep" is ignored on helm upgrade helm/helm#7677
Prevent Helm upgrades from blanking the Temporal cert Secret
helm.sh/resource-policy: keep only avoids deletion; it does not make this Secret install-only. Because the template still renders empty stringData, a routine helm upgrade can reapply those blanks and overwrite the bootstrap-populated certs. Use lookup/conditional rendering so upgrades preserve the existing data.
🧰 Tools
🪛 YAMLlint (1.37.1)
[error] 13-13: syntax error: expected the node content, but found '-'
(syntax)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml` around
lines 5 - 15, The Temporal cert Secret template still renders blank data on
every upgrade, so the bootstrap-populated values can be overwritten even with
helm.sh/resource-policy: keep. Update temporal-certs-secret.yaml to use lookup
or conditional rendering in the Secret manifest so the existing Secret data is
preserved on helm upgrade, and only emit the placeholder empty data when the
Secret does not already exist.
Source: Path instructions
Three fixes for site-agent → Flow gRPC connectivity:
flow_client.go: switchFlowMutualTLSfromtls.Config{Certificates}toGetClientCertificate— Go silently drops the cert when the issuer doesn't match the server's CA list, causingtls: certificate requiredvalues.yaml: addFLOW_GRPC_SEC_OPT: "2"as chart default (was unset, defaulted to server-TLS only)temporal-certs-secret.yaml: addhelm.sh/resource-policy: keep— template was overwriting bootstrap-written certs on everyhelm upgrade, causing bootstrap to retry forever and block Flow gRPC startupRelated issues
Type of Change
Breaking Changes
Testing
Verified on dev6: site-agent pods reach
1/1 Runningwith 0 restarts and logs showFlowGrpcClient: Successfully connected to server.helm upgradeno longer overwritestemporal-client-site-agent-certs.Additional Notes