Encapsulate Checkpoint internal state #90

bg-chun · 2025-04-03T21:10:40Z

This PR encapsulates the internal state of Checkpoint and prevents direct access to CheckpointV1 from DeviceState.

nojnhuh

I left a couple nits but definitely a positive change overall I think. Thanks!

nojnhuh · 2025-04-04T16:46:54Z

cmd/dra-example-kubeletplugin/checkpoint.go

 	V1       *CheckpointV1     `json:"v1,omitempty"`
 }

+var _ checkpointmanager.Checkpoint = (*Checkpoint)(nil)


nit: this check isn't strictly necessary since the compiler will already verify this type satisfies the interface where checkpointmanager.CreateCheckpoint and checkpointmanager.GetCheckpoint are called.

True, it’s not strictly necessary. But it helps readers new to the code understand which interface checkpoint must implement.

I agree that these interface assertions can help readers of the code.

I think = &Checkpoint{} is a bit more idiomatic.

updated in e14b2a7
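For reference, a minimal sketch of the two assertion forms discussed above. It is not the PR's code: `Checkpointer` is a hypothetical local stand-in for `checkpointmanager.Checkpoint`, and the `Claims` field is invented for illustration. Both `var _` lines are checked at compile time, so the build fails if `*Checkpoint` stops satisfying the interface.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Checkpointer is a hypothetical stand-in for the real checkpoint interface,
// declared locally so this sketch has no k8s.io/kubernetes dependency.
type Checkpointer interface {
	MarshalCheckpoint() ([]byte, error)
	UnmarshalCheckpoint(blob []byte) error
}

// Checkpoint is illustrative only; the Claims field is an assumption.
type Checkpoint struct {
	Claims map[string][]string `json:"claims,omitempty"`
}

// Compile-time assertions that *Checkpoint implements Checkpointer.
var _ Checkpointer = (*Checkpoint)(nil) // typed nil pointer form
var _ Checkpointer = &Checkpoint{}      // composite literal form suggested in review

// MarshalCheckpoint serializes the checkpoint as JSON.
func (c *Checkpoint) MarshalCheckpoint() ([]byte, error) { return json.Marshal(c) }

// UnmarshalCheckpoint restores the checkpoint from JSON.
func (c *Checkpoint) UnmarshalCheckpoint(blob []byte) error { return json.Unmarshal(blob, c) }

func main() {
	in := &Checkpoint{Claims: map[string][]string{"claim-uid-1": {"gpu-0"}}}
	blob, err := in.MarshalCheckpoint()
	if err != nil {
		panic(err)
	}
	out := &Checkpoint{}
	if err := out.UnmarshalCheckpoint(blob); err != nil {
		panic(err)
	}
	fmt.Println(out.Claims["claim-uid-1"])
}
```

Removing either method from `*Checkpoint` turns both assertions into compile errors that name the missing method, which is the readability benefit mentioned in the thread.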

nojnhuh · 2025-04-04T17:00:01Z

cmd/dra-example-kubeletplugin/state.go


-	if preparedClaims[claimUID] != nil {
-		return preparedClaims[claimUID].GetDevices(), nil
+	if exist := checkpoint.GetPreparedDevices(claimUID); exist != nil {


nit: Could we call this existingDevices or something like that? exist is a good name for a boolean, but since this isn't a bool then exist != nil looks a little more confusing to me than something like existingDevices != nil.

updated to reuse preparedDevices in 96e98a6
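A rough sketch of the accessor pattern this thread is about, assuming hypothetical types: `PreparedDevices` and the `PreparedClaims` field name are placeholders, not the driver's actual definitions. The point is that callers go through `GetPreparedDevices` and compare a descriptively named value against nil, rather than reaching into `V1`.

```go
package main

import "fmt"

// PreparedDevices is a hypothetical placeholder for the driver's device list type.
type PreparedDevices []string

// CheckpointV1 holds the versioned state; the field name is an assumption.
type CheckpointV1 struct {
	PreparedClaims map[string]PreparedDevices `json:"preparedClaims,omitempty"`
}

// Checkpoint wraps the versioned state behind accessor methods.
type Checkpoint struct {
	V1 *CheckpointV1 `json:"v1,omitempty"`
}

// GetPreparedDevices returns the devices recorded for claimUID,
// or nil when the claim has not been prepared.
func (c *Checkpoint) GetPreparedDevices(claimUID string) PreparedDevices {
	if c == nil || c.V1 == nil {
		return nil
	}
	return c.V1.PreparedClaims[claimUID]
}

func main() {
	cp := &Checkpoint{V1: &CheckpointV1{
		PreparedClaims: map[string]PreparedDevices{"uid-1": {"gpu-0"}},
	}}

	// The variable name says what the value is, so the nil check reads naturally.
	if preparedDevices := cp.GetPreparedDevices("uid-1"); preparedDevices != nil {
		fmt.Println("already prepared:", preparedDevices)
	}
	if preparedDevices := cp.GetPreparedDevices("uid-2"); preparedDevices == nil {
		fmt.Println("uid-2 not prepared yet")
	}
}
```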

nojnhuh

LGTM pending squashing to one commit.

Signed-off-by: Byonggon Chun <[email protected]>

bg-chun · 2025-04-11T05:17:22Z

Commits have been squashed.

nojnhuh

> Commits have been squashed.

Thanks!

/lgtm

k8s-ci-robot · 2025-04-11T16:06:00Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bg-chun, nojnhuh
Once this PR has been reviewed and has the lgtm label, please assign byako for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pohly · 2025-04-22T07:24:40Z

cmd/dra-example-kubeletplugin/checkpoint.go

 import (
 	"encoding/json"

+	"k8s.io/kubernetes/pkg/kubelet/checkpointmanager"


k8s.io/kubernetes should not be imported. Any code inside it is considered internal and not meant for public consumption. There are some exceptions (most notably the scheduler framework for building custom schedulers), but not this one here.

It's not even a particularly good package. We had huge issues with figuring out how checksumming was meant to be used and what the purpose of checksumming was in the first place.

Can we perhaps use this opportunity to drop the dependency?

cc @pohly

> Can we perhaps use this opportunity to drop the dependency?

On the dependency, I'm on the same page. I will update the PR to introduce a simple checkpoint util to drop the dependency.
As for the checksum, do you mean we don't need one here? It seems there were some issues with dra_manager_state in the past. Or do you mean we need a well-designed checkpoint implementation that includes a checksum? Since this is an example DRA driver, simple checkpointing without a checksum may be fine.

The only purpose of checksumming that I could imagine is to detect bit flips in the file. As a DRA driver author, is that important to you on top of whatever potential checksumming and error correcting the OS might do?

Are there other reasons for it?
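To make the bit-flip case concrete, here is a minimal sketch of what a checkpoint checksum buys. It does not reproduce the kubelet checkpointmanager's checksum scheme; it swaps in CRC32 from the Go standard library, and `envelope`, `newEnvelope`, and `verify` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"hash/crc32"
)

// envelope pairs a payload with a checksum computed when it was written.
// Hypothetical type; not the kubelet's on-disk format.
type envelope struct {
	Data     []byte `json:"data"`
	Checksum uint32 `json:"checksum"`
}

var errChecksumMismatch = errors.New("checkpoint checksum mismatch")

// newEnvelope records the CRC32 of the payload alongside it.
func newEnvelope(payload []byte) envelope {
	return envelope{Data: payload, Checksum: crc32.ChecksumIEEE(payload)}
}

// verify recomputes the checksum and reports whether the payload changed.
func (e envelope) verify() error {
	if crc32.ChecksumIEEE(e.Data) != e.Checksum {
		return errChecksumMismatch
	}
	return nil
}

func main() {
	e := newEnvelope([]byte(`{"v1":{}}`))
	fmt.Println("fresh:", e.verify())

	e.Data[0] ^= 0x01 // simulate a single flipped bit in the stored payload
	fmt.Println("after bit flip:", e.verify())
}
```

A check like this only detects that the stored bytes changed after the checksum was computed; it does not repair the data or say how it was damaged.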

I haven’t thought about it seriously.
The checkpoint manager already existed when I was involved with resource managers in kubelet around 2019.

It seems the checksum originated in dockershim's PodSandbox checkpointer.

kubernetes/kubernetes@d62bd9e#diff-7cf7a43176686c06054024ad9d7226062787c188c66fd17d837d7e821b164ab0

Approaching conservatively, there seem to be a few possible cases:

The process could restart while writing a file (e.g., OOM, kill, restart).

The file might be corrupted unintentionally by a person or script.

> The process could restart while writing a file (e.g., OOM, kill, restart).

The usual approach is to write a temp file, sync, then rename. But I suppose a checksum is easier.

> The file might be corrupted unintentionally by a person or script.

True, albeit a bit unlikely.

> I will update the PR to introduce a simple checkpoint util to drop the dependency.

Gentle reminder that this is pending.

/lgtm cancel

cc @pohly
I will resume this PR soon. Before I do, I have a couple of quick questions:

My understanding from your feedback is that checksums might be unnecessary here. Would implementing a simple checkpointing mechanism without checksums address this concern?

Aside from that, do you have any additional suggestions for an ideal checkpointing mechanism?

I'm fine without a checkpoint checksum. For the "writing file fails" case I think the "write tmp file, sync, close, remove, rename" approach would be useful. I don't know about other existing mechanisms that could be used here.
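The write-then-rename sequence mentioned above could look roughly like the following. This is a sketch under assumptions, not the PR's implementation: `writeFileAtomic` and `demo` are hypothetical names, the temp file is created in the destination directory so the rename stays on one filesystem, and the explicit "remove" step is left out because `os.Rename` replaces an existing non-directory destination. Syncing the parent directory is also omitted.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temp file next to path, syncs and closes it,
// then renames it over path. Readers see either the old or the new content.
func writeFileAtomic(path string, data []byte) (err error) {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	// On any failure, close (if still open) and delete the temp file.
	defer func() {
		if err != nil {
			tmp.Close()
			os.Remove(tmpName)
		}
	}()

	if _, err = tmp.Write(data); err != nil {
		return err
	}
	if err = tmp.Sync(); err != nil { // flush file contents to stable storage
		return err
	}
	if err = tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmpName, path)
}

// demo writes two versions of a checkpoint and returns what ends up on disk.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "checkpoint-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "checkpoint.json")
	for _, content := range []string{`{"v1":{}}`, `{"v1":{"preparedClaims":{}}}`} {
		if err := writeFileAtomic(path, []byte(content)); err != nil {
			return "", err
		}
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(got)
}
```

If the process dies before the rename, the previous checkpoint file is still intact and only a stray temp file is left behind, which covers the "restart while writing" case without a checksum.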

cc @nojnhuh

k8s-triage-robot · 2025-10-15T10:03:46Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

github-project-automation bot added this to Dynamic Resource Allocation Apr 3, 2025

github-project-automation bot moved this to 🆕 New in Dynamic Resource Allocation Apr 3, 2025

k8s-ci-robot requested review from byako and pohly April 3, 2025 21:10

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 3, 2025

bg-chun mentioned this pull request Apr 3, 2025

Add unit testing #41

Closed

nojnhuh reviewed Apr 4, 2025

nojnhuh moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Apr 5, 2025

bg-chun requested a review from nojnhuh April 9, 2025 01:26

nojnhuh reviewed Apr 10, 2025

encapsulate Checkpoint internal state

a8b18a7

Signed-off-by: Byonggon Chun <[email protected]>

bg-chun force-pushed the encapsulate_checkpoint branch from e14b2a7 to a8b18a7 Compare April 11, 2025 03:49

bg-chun requested a review from nojnhuh April 11, 2025 05:16

nojnhuh approved these changes Apr 11, 2025

k8s-ci-robot assigned nojnhuh Apr 11, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 11, 2025

pohly reviewed Apr 22, 2025

k8s-ci-robot assigned pohly May 13, 2025

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 13, 2025

nojnhuh mentioned this pull request Jul 31, 2025

DRAAdminAccess: add example #112

Open

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2025


5 participants
