Skip to content

fix(site-agent): use mTLS for Flow gRPC and protect temporal-certs secret#2942

Merged
shayan1995 merged 2 commits into
NVIDIA:mainfrom
shayan1995:fix/flowgrpc-mtls-client-cert
Jun 29, 2026
Merged

fix(site-agent): use mTLS for Flow gRPC and protect temporal-certs secret#2942
shayan1995 merged 2 commits into
NVIDIA:mainfrom
shayan1995:fix/flowgrpc-mtls-client-cert

Conversation

@shayan1995

Copy link
Copy Markdown
Contributor

Three fixes for site-agent → Flow gRPC connectivity:

  • flow_client.go: switch FlowMutualTLS from tls.Config{Certificates} to GetClientCertificate — Go silently drops the cert when the issuer doesn't match the server's CA list, causing tls: certificate required
  • values.yaml: add FLOW_GRPC_SEC_OPT: "2" as chart default (was unset, defaulted to server-TLS only)
  • temporal-certs-secret.yaml: add helm.sh/resource-policy: keep — template was overwriting bootstrap-written certs on every helm upgrade, causing bootstrap to retry forever and block Flow gRPC startup

Related issues

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Verified on dev6: site-agent pods reach 1/1 Running with 0 restarts and logs show FlowGrpcClient: Successfully connected to server. helm upgrade no longer overwrites temporal-client-site-agent-certs.

Additional Notes

@shayan1995 shayan1995 requested review from a team as code owners June 27, 2026 00:13
@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Summary by CodeRabbit

  • New Features
    • Added support for additional Flow gRPC environment settings in the deployment configuration.
  • Bug Fixes
    • Improved mutual TLS client certificate handling for Flow gRPC connections.
    • Preserved placeholder certificate secrets across Helm upgrades to prevent overwriting certificates during deployments.

Walkthrough

The Helm chart now preserves the Temporal client TLS placeholder secret across upgrades and adds Flow gRPC environment values. The Flow gRPC client now returns its client certificate through tls.Config.GetClientCertificate while retaining the existing root CAs.

Changes

Flow deployment and startup

Layer / File(s) Summary
Flow phase, certs, and chart bootstrap
helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml, helm/rest/nico-rest-site-agent/values.yaml
The Temporal client TLS placeholder secret gains a helm.sh/resource-policy: keep annotation and updated comment, and envConfig adds FLOW_GRPC_ENABLED and FLOW_GRPC_SEC_OPT.
Flow client certificate selection
rest-api/site-workflow/pkg/grpc/client/flow_client.go
The Flow gRPC client now supplies its mutual TLS certificate through GetClientCertificate instead of Certificates, while keeping RootCAs unchanged.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the two main changes: Flow gRPC mTLS and preserving the Temporal cert secret.
Description check ✅ Passed The description is directly aligned with the implemented fixes and testing notes in the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@github-actions

Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-06-27 00:16:55 UTC | Commit: 22fb1d3

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
rest-api/site-workflow/pkg/grpc/client/flow_client.go (1)

169-173: 🔒 Security & Privacy | 🔵 Trivial | 💤 Low value

Optional hardening: consider pinning MinVersion. Static analysis notes the absence of MinVersion. As a client, Go already negotiates a TLS 1.2 floor, so this is not a defect — but explicitly setting MinVersion: tls.VersionTLS13 would make the security posture intentional rather than implicit, provided the Flow server supports it.

🔒 Optional diff
 		mutualTLSConfig := &tls.Config{
+			MinVersion: tls.VersionTLS13,
 			GetClientCertificate: func(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
 				return &clientCert, nil
 			},
 			RootCAs: capool,
 		}

Please confirm the Flow gRPC server negotiates TLS 1.3 before pinning, to avoid handshake regressions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/site-workflow/pkg/grpc/client/flow_client.go` around lines 169 -
173, The mutual TLS setup in flow client construction should explicitly set a
minimum TLS version if the Flow gRPC server supports it. Review the tls.Config
used with mutualTLSConfig and, before pinning, confirm server-side TLS 1.3
negotiation is available; if it is, add MinVersion to the tls.Config to make the
security posture explicit without regressing handshakes.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@helm-prereqs/setup.sh`:
- Around line 807-816: The Docker config JSON in the registry pull secret setup
is being built with printf in a way that can emit invalid JSON when
REGISTRY_PULL_USERNAME or REGISTRY_PULL_SECRET contains special characters.
Update the shell logic in the registry secret construction block to use a proper
JSON encoder or equivalent escaping step before base64 encoding, keeping the
existing NICO_FLOW_ARGS and image-pull-secret wiring intact.

In `@helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml`:
- Around line 4-6: Clarify the keep-policy behavior in the Secret template
comments for temporal-certs-secret.yaml: `helm.sh/resource-policy: keep`
preserves the Secret on uninstall, so a reinstall will reuse the existing object
unless it is manually deleted first. Update the explanatory text near the kept
Secret to reflect this behavior accurately, using the Secret template and its
bootstrap placeholder context as the reference point.

---

Nitpick comments:
In `@rest-api/site-workflow/pkg/grpc/client/flow_client.go`:
- Around line 169-173: The mutual TLS setup in flow client construction should
explicitly set a minimum TLS version if the Flow gRPC server supports it. Review
the tls.Config used with mutualTLSConfig and, before pinning, confirm
server-side TLS 1.3 negotiation is available; if it is, add MinVersion to the
tls.Config to make the security posture explicit without regressing handshakes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b4eeeb77-48f7-4aa0-836d-db000ecd1635

📥 Commits

Reviewing files that changed from the base of the PR and between 45cb7a1 and 22fb1d3.

📒 Files selected for processing (6)
  • helm-prereqs/setup.sh
  • helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml
  • helm/rest/nico-rest-site-agent/values.yaml
  • rest-api/site-agent/pkg/components/managers/flowgrpc/init.go
  • rest-api/site-agent/pkg/components/managers/flowgrpc/metrics.go
  • rest-api/site-workflow/pkg/grpc/client/flow_client.go

Comment thread helm-prereqs/setup.sh
Comment thread helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml
@shayan1995 shayan1995 force-pushed the fix/flowgrpc-mtls-client-cert branch from 22fb1d3 to 2f521ef Compare June 27, 2026 00:23
@github-actions

github-actions Bot commented Jun 27, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
nico-flow 43 3 21 9 2 8
nico-nsm 49 0 10 31 8 0
nico-psm 43 3 21 9 2 8
nico-rest-api 43 3 21 9 2 8
nico-rest-cert-manager 42 3 21 9 1 8
nico-rest-db 43 3 21 9 2 8
nico-rest-site-agent 42 3 21 9 1 8
nico-rest-site-manager 42 3 21 9 1 8
nico-rest-workflow 43 3 21 9 2 8
TOTAL 390 24 178 103 21 64

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml (1)

4-6: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Correct the reinstall behavior note for the kept Secret.

Line 6 still implies a release delete/reinstall creates a fresh placeholder. With helm.sh/resource-policy: keep, Helm skips deletion when an operation would remove the resource, so the kept Secret can survive and be reused unless the Secret itself is deleted first. (v3-1-0.helm.sh)

As per path instructions, helm/**: “Review Helm charts for template correctness, values compatibility, Kubernetes API compatibility, security context, RBAC, probes, and upgrade behavior.”

Suggested wording
-# bootstrap process; deleting and re-installing the release creates a fresh placeholder.
+# bootstrap process. The kept Secret also survives uninstall; delete the Secret
+# explicitly before reinstalling if a fresh placeholder is required.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml` around
lines 4 - 6, The comment in temporal-certs-secret.yaml overstates reinstall
behavior for the kept Secret. Update the explanatory text around the
helm.sh/resource-policy: keep note to say that Helm preserves the Secret across
delete/reinstall flows unless the Secret itself is removed first, and avoid
claiming a fresh placeholder is automatically created. Use the existing Temporal
client TLS certs / bootstrap placeholder wording so the guidance matches the
template’s actual behavior.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml`:
- Around line 4-6: The comment in temporal-certs-secret.yaml overstates
reinstall behavior for the kept Secret. Update the explanatory text around the
helm.sh/resource-policy: keep note to say that Helm preserves the Secret across
delete/reinstall flows unless the Secret itself is removed first, and avoid
claiming a fresh placeholder is automatically created. Use the existing Temporal
client TLS certs / bootstrap placeholder wording so the guidance matches the
template’s actual behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 43ce878f-8cda-407c-9e1f-e31238e9e4e8

📥 Commits

Reviewing files that changed from the base of the PR and between 22fb1d3 and 2f521ef.

📒 Files selected for processing (3)
  • helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml
  • helm/rest/nico-rest-site-agent/values.yaml
  • rest-api/site-workflow/pkg/grpc/client/flow_client.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • helm/rest/nico-rest-site-agent/values.yaml

Comment thread helm/rest/nico-rest-site-agent/values.yaml Outdated
…cret

Three related issues caused the site-agent to fail connecting to Flow gRPC:

1. FLOW_GRPC_SEC_OPT was missing from the site-agent Helm chart values, so
   the Flow gRPC client defaulted to FlowServerTLS (1). Flow requires mTLS
   (RequireAndVerifyClientCert); the server rejected the connection with
   "tls: certificate required". Fix: add FLOW_GRPC_SEC_OPT: "2" as the
   chart default so FlowMutualTLS is used by default.

2. The FlowMutualTLS case in NewFlowGrpcClient used tls.Config{Certificates: ...}
   to present the client cert. Go's TLS stack only selects a cert from that
   field when the cert's issuer matches the server's CertificateRequest acceptable
   CA list; when there is no match it silently sends no cert, and the server
   again rejects with "certificate required". Fix: use GetClientCertificate
   instead, which unconditionally returns the cert — the same approach used in
   rest-api/flow/pkg/certs/certs.go TLSConfig().

3. temporal-certs-secret.yaml created the secret with empty placeholder values
   on every helm upgrade, overwriting certs written by the bootstrap process.
   On the next pod restart the site-agent found the OTP gone, re-attempted
   bootstrap (which fails because the OTP was already consumed), and blocked
   forever — preventing FlowGrpc().Start() from running. Fix: add
   helm.sh/resource-policy: keep so upgrades never touch this secret after
   initial creation.
@shayan1995 shayan1995 force-pushed the fix/flowgrpc-mtls-client-cert branch from 80d7f90 to 309fe8c Compare June 29, 2026 19:53

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml`:
- Around line 5-15: The Temporal cert Secret template still renders blank data
on every upgrade, so the bootstrap-populated values can be overwritten even with
helm.sh/resource-policy: keep. Update temporal-certs-secret.yaml to use lookup
or conditional rendering in the Secret manifest so the existing Secret data is
preserved on helm upgrade, and only emit the placeholder empty data when the
Secret does not already exist.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6e160872-7a87-4d6a-a485-51998db8d079

📥 Commits

Reviewing files that changed from the base of the PR and between 80d7f90 and 309fe8c.

📒 Files selected for processing (3)
  • helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml
  • helm/rest/nico-rest-site-agent/values.yaml
  • rest-api/site-workflow/pkg/grpc/client/flow_client.go
✅ Files skipped from review due to trivial changes (1)
  • helm/rest/nico-rest-site-agent/values.yaml

Comment on lines +5 to +15
# resource-policy: keep prevents helm upgrade from overwriting certs written by the
# bootstrap process; deleting and re-installing the release creates a fresh placeholder.
apiVersion: v1
kind: Secret
metadata:
name: {{ .Values.secrets.temporalClientCerts }}
namespace: {{ include "nico-rest-site-agent.namespace" . }}
labels:
{{- include "nico-rest-site-agent.labels" . | nindent 4 }}
annotations:
helm.sh/resource-policy: keep

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect whether this chart has any preserve-existing-data logic around the kept Secret.
rg -n --hidden --glob 'helm/rest/nico-rest-site-agent/**' \
  'temporalClientCerts|temporal-client-site-agent-certs|helm.sh/resource-policy|lookup|stringData' \
  helm/rest/nico-rest-site-agent

Repository: NVIDIA/infra-controller

Length of output: 797


🏁 Script executed:

#!/bin/bash
set -euo pipefail

cat -n helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml

Repository: NVIDIA/infra-controller

Length of output: 1062


🌐 Web query:

Helm "helm.sh/resource-policy: keep" upgrade patch behavior Secret

💡 Result:

Helm annotation helm.sh/resource-policy: keep on a Secret tells Helm to skip deleting that Secret during helm upgrade when the upgrade would otherwise result in its deletion. However, once kept, the Secret becomes “orphaned” and Helm “will no longer manage it in any way”, which means Helm will not reliably patch/update its content on subsequent upgrades. [1][2] So for “upgrade patch behavior” on Secrets: - If the Secret is still rendered by the chart, Helm may patch it normally as part of the upgrade; the main guarantee of keep is about avoiding deletion when Helm would remove it. - If the Secret would be removed because it no longer appears in the rendered manifests, keep prevents deletion, but then Helm stops managing it; patching/updating it from the chart is not expected. [1][2] Notes from Helm issue history: there were versions where the annotation was not respected during helm upgrade (leading to deletion/data loss), with reports that it worked again in later versions. [3] Reference: Helm docs for helm.sh/resource-policy: keep explicitly state it’s used to skip deleting resources during operations like helm upgrade and that the resource becomes orphaned / unmanaged afterward. [1][2]

Citations:


Prevent Helm upgrades from blanking the Temporal cert Secret
helm.sh/resource-policy: keep only avoids deletion; it does not make this Secret install-only. Because the template still renders empty stringData, a routine helm upgrade can reapply those blanks and overwrite the bootstrap-populated certs. Use lookup/conditional rendering so upgrades preserve the existing data.

🧰 Tools
🪛 YAMLlint (1.37.1)

[error] 13-13: syntax error: expected the node content, but found '-'

(syntax)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helm/rest/nico-rest-site-agent/templates/temporal-certs-secret.yaml` around
lines 5 - 15, The Temporal cert Secret template still renders blank data on
every upgrade, so the bootstrap-populated values can be overwritten even with
helm.sh/resource-policy: keep. Update temporal-certs-secret.yaml to use lookup
or conditional rendering in the Secret manifest so the existing Secret data is
preserved on helm upgrade, and only emit the placeholder empty data when the
Secret does not already exist.

Source: Path instructions

@shayan1995 shayan1995 merged commit 5df2d5f into NVIDIA:main Jun 29, 2026
118 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants