Skip to content

Conversation

@alimaazamat
Copy link
Contributor

@alimaazamat alimaazamat commented Jul 29, 2025

Fixes issue #107
Socket path names in the format {baseName}{suffix} works fine for legacy deployments but if we have seamless upgrades then it doesn't work because the socket path names include uid {baseName}-{uid}{suffix}.

Solution:
In health.go change socket path name based on config, which gets populated by helm chart value which is false by default.

How to test:

# Build the kubelet plugin binary with the changes in health.go
make cmd-dra-example-kubeletplugin
# in demo dir and build/load image
cd demo
./scripts/build-driver-image.sh
./scripts/load-driver-image-into-kind.sh
cd ..
# made a new namespace
kubectl create namespace dra-example-driver

# Test Default Config where seamless upgrades disabled
helm install dra-example-driver deployments/helm/dra-example-driver --namespace dra-example-driver
# Check deployment
kubectl get pods --namespace dra-example-driver
# I verified DaemonSet update strategy correctly has **maxSurge: 0, maxUnavailable: 1**
# Check logs for legacy socket paths
kubectl logs -n dra-example-driver $POD_NAME | grep -E "(socket|endpoint)"
 
# Upgrade deployment to enable seamless upgrades
helm upgrade dra-example-driver deployments/helm/dra-example-driver --namespace dra-example-driver --set kubeletPlugin.seamlessUpgrades.enabled=true
# Here I verified the updated DaemonSet configuration has **maxSurge: 1, maxUnavailable: 0**
 
# Check pod status and verify POD_UID is set in env
kubectl get pods --namespace dra-example-driver
kubectl exec -n dra-example-driver $(kubectl get pods -n dra-example-driver -o jsonpath='{.items[0].metadata.name}') -- env | grep -E "(POD_UID|NODE_NAME)" | sort
# Logs correctly have UID sockets path names
kubectl logs -n dra-example-driver $NEW_POD_NAME | grep -E "(socket|endpoint|health|UID)"
 
# Patch to trigger a rolling update to test seamless upgrade
kubectl patch daemonset dra-example-driver-kubeletplugin -n dra-example-driver -p '{"spec":{"template":{"metadata":{"annotations":{"test-rolling-update":"'$(date +%s)'"}}}}}'
# Verified new pod uses UID sockets path names
kubectl get pods -n dra-example-driver
kubectl logs -n dra-example-driver $(kubectl get pods -n dra-example-driver -o jsonpath='{.items[0].metadata.name}') | grep -E "(socket|endpoint)" | tail -3
# new pod UID matches socket paths
kubectl get pod -n dra-example-driver $(kubectl get pods -n dra-example-driver -o jsonpath='{.items[0].metadata.name}') -o jsonpath='{.metadata.uid}'

Results:
Seamless Upgrades Disabled:

I0730 22:03:40.970503       1 health.go:73] "connecting to registration socket" path="unix:///var/lib/kubelet/plugins_registry/gpu.example.com-reg.sock"
I0730 22:03:40.970563       1 health.go:91] "connecting to DRA socket" path="unix:///var/lib/kubelet/plugins/gpu.example.com/dra.sock"

Seamless Upgrades Enabled:

I0730 22:09:55.489910       1 health.go:73] "connecting to registration socket" path="unix:///var/lib/kubelet/plugins_registry/gpu.example.com-a06912b8-f2ef-4645-a609-17050f3902e6-reg.sock"
I0730 22:09:55.489983       1 health.go:91] "connecting to DRA socket" path="unix:///var/lib/kubelet/plugins/gpu.example.com/dra-a06912b8-f2ef-4645-a609-17050f3902e6.sock"

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 29, 2025
@k8s-ci-robot k8s-ci-robot requested a review from klueska July 29, 2025 23:24
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alimaazamat
Once this PR has been reviewed and has the lgtm label, please assign elezar for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from nojnhuh July 29, 2025 23:24
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 29, 2025
)
}

// Enable seamless upgrades when POD_UID is available (which gets set by Helm when seamlessUpgrades.enabled=true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be some other reason down the road that the pod UID is useful, so I think I would prefer to add a separate toggle to the config (exposed by a command line flag) that only sets the RollingUpdate option.

Copy link
Contributor Author

@alimaazamat alimaazamat Aug 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we instead have seamless upgrades flag instead? like isnt the user choice seamless upgrades which is done with rollingUpdate?

rollingUpdate:
maxSurge: 1
maxUnavailable: 0
{{- else }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting kubeletPlugin.seamlessUpgrades.enabled will silently ignore the kubeletPlugin.updateStrategy value which isn't ideal.

Maybe we could keep the existing updateStrategy template block and use a different default for maxSurge and maxUnavailable when seamless upgrades are enabled? Then kubeletPlugin.updateStrategy is still authoritative if users express an opinion on that, but only setting kubeletPlugin.seamlessUpgrades.enabled is enough to set them to values that actually exercise seamless upgrades if they don't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, if user provides updateStrategy but seamlessUpgrades is enabled, then overwrite just the maxSurge and maxUnavailable of the given updateStrategy?

- name: HEALTHCHECK_PORT
value: {{ .Values.kubeletPlugin.containers.plugin.healthcheckPort | quote }}
{{- end }}
{{- if .Values.kubeletPlugin.seamlessUpgrades.enabled }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't enable seamless upgrades based on whether the pod UID is available, then I think it's reasonable to always unconditionally set this environment variable for the pods.

Copy link
Contributor Author

@alimaazamat alimaazamat Aug 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so POD_UID unconditionally set and then I added SEAMLESS_UPGRADES to yaml when enabled so that way it can be a CLI flag.

@alimaazamat alimaazamat changed the title fix: supports seamless upgrade socket path name with uid [WIP] fix: supports seamless upgrade socket path name with uid Aug 15, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 15, 2025
@alimaazamat
Copy link
Contributor Author

/assign

@pohly pohly moved this from 🆕 New to 🏗 In progress in Dynamic Resource Allocation Aug 22, 2025
@alimaazamat alimaazamat force-pushed the seamless-upgrade-socket-discovery branch 2 times, most recently from 39da3da to 28f1e83 Compare August 26, 2025 23:54
@alimaazamat alimaazamat changed the title [WIP] fix: supports seamless upgrade socket path name with uid fix: supports seamless upgrade socket path name with uid Aug 27, 2025
@alimaazamat alimaazamat requested a review from nojnhuh August 27, 2025 22:16
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 27, 2025
@alimaazamat alimaazamat force-pushed the seamless-upgrade-socket-discovery branch from 28f1e83 to ce936de Compare August 27, 2025 23:04
@pohly pohly moved this from 🏗 In progress to 👀 In review in Dynamic Resource Allocation Sep 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants