@bg-chun bg-chun commented Apr 3, 2025

This PR encapsulates the internal state of Checkpoint and prevents direct access to CheckpointV1 from DeviceState.

@k8s-ci-robot k8s-ci-robot requested review from byako and pohly April 3, 2025 21:10
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 3, 2025
@bg-chun bg-chun mentioned this pull request Apr 3, 2025
Contributor

@nojnhuh nojnhuh left a comment

I left a couple of nits, but I think this is definitely a positive change overall. Thanks!

V1 *CheckpointV1 `json:"v1,omitempty"`
}

var _ checkpointmanager.Checkpoint = (*Checkpoint)(nil)
Contributor

nit: this check isn't strictly necessary since the compiler will already verify this type satisfies the interface where checkpointmanager.CreateCheckpoint and checkpointmanager.GetCheckpoint are called.

Author

True, it’s not strictly necessary. But it helps readers new to the code understand which interface Checkpoint must implement.

Contributor

I agree that these interface assertions can help readers of the code.

I think = &Checkpoint{} is a bit more idiomatic.
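
For readers unfamiliar with the pattern being discussed, here is a minimal sketch of a compile-time interface assertion in both spellings. The `Marshaler` interface and the simplified `Checkpoint` type are hypothetical stand-ins, not the definitions from this PR or from checkpointmanager.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Marshaler is a hypothetical stand-in for the interface the checkpoint
// type is expected to satisfy.
type Marshaler interface {
	MarshalCheckpoint() ([]byte, error)
}

// Checkpoint is a simplified, hypothetical version of the type under review.
type Checkpoint struct {
	Version string `json:"version"`
}

// MarshalCheckpoint serializes the checkpoint as JSON.
func (c *Checkpoint) MarshalCheckpoint() ([]byte, error) {
	return json.Marshal(c)
}

// Either declaration stops the build if *Checkpoint no longer satisfies
// Marshaler. Only one of them is needed in practice.
var _ Marshaler = (*Checkpoint)(nil) // typed nil pointer
var _ Marshaler = &Checkpoint{}      // pointer to a zero value

func main() {
	var m Marshaler = &Checkpoint{Version: "v1"}
	out, err := m.MarshalCheckpoint()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Both forms are checked entirely at compile time; the difference between them is stylistic.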

Author

updated in e14b2a7


-	if preparedClaims[claimUID] != nil {
-		return preparedClaims[claimUID].GetDevices(), nil
+	if exist := checkpoint.GetPreparedDevices(claimUID); exist != nil {
Contributor

nit: Could we call this existingDevices or something like that? exist is a good name for a boolean, but since this isn't a bool, exist != nil looks a little more confusing to me than something like existingDevices != nil.

Author

updated to reuse preparedDevices in 96e98a6
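
As an illustration of the encapsulation this PR is aiming for, the sketch below shows a checkpoint type whose callers go through accessor methods instead of reaching into the versioned payload. All type definitions, the `AddPreparedDevices` method, and the `GetPreparedDevices` signature are assumptions made for this example; the actual code in the PR may differ.

```go
package main

import "fmt"

// PreparedDevices is a hypothetical placeholder for the driver's device list type.
type PreparedDevices []string

// CheckpointV1 is a simplified, hypothetical version of the versioned payload.
type CheckpointV1 struct {
	PreparedClaims map[string]PreparedDevices `json:"preparedClaims,omitempty"`
}

// Checkpoint wraps the versioned payload. Callers are expected to use the
// accessor methods below rather than touching V1 directly.
type Checkpoint struct {
	V1 *CheckpointV1 `json:"v1,omitempty"`
}

// NewCheckpoint returns an empty checkpoint with an initialized V1 payload.
func NewCheckpoint() *Checkpoint {
	return &Checkpoint{V1: &CheckpointV1{PreparedClaims: map[string]PreparedDevices{}}}
}

// GetPreparedDevices returns the devices recorded for claimUID, or nil if
// the claim has not been prepared.
func (c *Checkpoint) GetPreparedDevices(claimUID string) PreparedDevices {
	if c.V1 == nil {
		return nil
	}
	return c.V1.PreparedClaims[claimUID]
}

// AddPreparedDevices records the devices prepared for claimUID.
func (c *Checkpoint) AddPreparedDevices(claimUID string, devices PreparedDevices) {
	if c.V1 == nil {
		c.V1 = &CheckpointV1{}
	}
	if c.V1.PreparedClaims == nil {
		c.V1.PreparedClaims = map[string]PreparedDevices{}
	}
	c.V1.PreparedClaims[claimUID] = devices
}

func main() {
	checkpoint := NewCheckpoint()
	checkpoint.AddPreparedDevices("claim-a", PreparedDevices{"gpu-0"})

	// The non-boolean result gets a noun name, so the nil check reads naturally.
	if preparedDevices := checkpoint.GetPreparedDevices("claim-a"); preparedDevices != nil {
		fmt.Println("already prepared:", preparedDevices)
	}
}
```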

@nojnhuh nojnhuh moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Apr 5, 2025
@bg-chun bg-chun requested a review from nojnhuh April 9, 2025 01:26
Contributor

@nojnhuh nojnhuh left a comment

LGTM pending squashing to one commit.

@bg-chun bg-chun force-pushed the encapsulate_checkpoint branch from e14b2a7 to a8b18a7 Compare April 11, 2025 03:49
@bg-chun bg-chun requested a review from nojnhuh April 11, 2025 05:16
Author

bg-chun commented Apr 11, 2025

Commits have been squashed.

Contributor

@nojnhuh nojnhuh left a comment

Commits have been squashed.

Thanks!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 11, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bg-chun, nojnhuh
Once this PR has been reviewed and has the lgtm label, please assign byako for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

import (
"encoding/json"

"k8s.io/kubernetes/pkg/kubelet/checkpointmanager"
Contributor

@pohly pohly Apr 22, 2025

k8s.io/kubernetes should not be imported. Any code inside it is considered internal and not meant for public consumption. There are some exceptions (most notably the scheduler framework for building custom schedulers), but not this one here.

It's not even a particularly good package. We had huge issues with figuring out how checksumming was meant to be used and what the purpose of checksumming was in the first place.

Can we perhaps use this opportunity to drop the dependency?

Author

cc @pohly

Can we perhaps use this opportunity to drop the dependency?

=> Regarding the dependency, I'm on the same page. I will update the PR to introduce a simple checkpoint util and drop the dependency.
As for the checksum, do you mean we don't need a checksum here? It seems there were some issues with dra_manager_state in the past. Or do you mean we need a well-designed checkpoint implementation along with a checksum? Since this is an example DRA driver, maybe simple checkpointing without a checksum is fine.

Contributor

The only purpose of checksumming that I could imagine is to detect bit flips in the file. As a DRA driver author, is that important to you on top of whatever potential checksumming and error correcting the OS might do?

Are there other reasons for it?

Author

I haven’t thought about it seriously.
The checkpoint manager already existed when I was involved with resource managers in kubelet around 2019.

It seems the checksum originated from the PodSandbox checkpointer in dockershim.

Approaching conservatively, there seem to be a few possible cases:

  • The process could restart while writing a file (e.g., OOM, kill, restart).
  • The file might be corrupted unintentionally by a person or script.

Contributor

The process could restart while writing a file (e.g., OOM, kill, restart).

The usual approach is to write a temp file, sync, then rename. But I suppose a checksum is easier.

The file might be corrupted unintentionally by a person or script.

True, albeit a bit unlikely.

Contributor

I will update PR to introduce simple checkpoint util to drop the dependency.

Gentle reminder that this is pending.

/lgtm cancel

Author

cc @pohly
I will resume this PR soon. Before I do, I have a couple of quick questions:

  1. My understanding from your feedback is that checksums might be unnecessary here. Would implementing a simple checkpointing mechanism without checksums address this concern?

  2. Aside from that, do you have any additional suggestions for an ideal checkpointing mechanism?

Contributor

I'm fine without a checkpoint checksum. For the "writing file fails" case I think the "write tmp file, sync, close, remove, rename" approach would be useful. I don't know about other existing mechanisms that could be used here.

cc @nojnhuh

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 13, 2025
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2025
Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

5 participants