-
Notifications
You must be signed in to change notification settings - Fork 75
Encapsulate Checkpoint internal state #90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a couple nits but definitely a positive change overall I think. Thanks!
| V1 *CheckpointV1 `json:"v1,omitempty"` | ||
| } | ||
|
|
||
| var _ checkpointmanager.Checkpoint = (*Checkpoint)(nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this check isn't strictly necessary since the compiler will already verify this type satisfies the interface where checkpointmanager.CreateCheckpoint and checkpointmanager.GetCheckpoint are called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, it’s not strictly necessary. But it helps readers new to the code understand which interface checkpoint must implement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this interface assertions can help readers of the code.
I think = &Checkpoint{} is a bit more idiomatic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated in e14b2a7
|
|
||
| if preparedClaims[claimUID] != nil { | ||
| return preparedClaims[claimUID].GetDevices(), nil | ||
| if exist := checkpoint.GetPreparedDevices(claimUID); exist != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could we call this existingDevices or something like that? exist is a good name for a boolean, but since this isn't a bool then exist != nil looks a little more confusing to me than something like existingDevices != nil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to reuse preparedDevices in 96e98a6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM pending squashing to one commit.
Signed-off-by: Byonggon Chun <[email protected]>
e14b2a7 to
a8b18a7
Compare
|
commits has been squashed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commits has been squashed
Thanks!
/lgtm
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: bg-chun, nojnhuh The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| import ( | ||
| "encoding/json" | ||
|
|
||
| "k8s.io/kubernetes/pkg/kubelet/checkpointmanager" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
k8s.io/kubernetes should not be imported. Any code inside it is considered internal and not meant for public consumption. There are some exceptions (most notably the scheduler framework for building custom schedulers), but not this one here.
It's not even a particularly good package. We had huge issues with figuring out how checksumming was meant to be used and what the purpose of checksumming was in the first place.
Can we perhaps use this opportunity to drop the dependency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @pohly
Can we perhaps use this opportunity to drop the dependency?
=> In terms of dependency, I'm on the same page. I will update PR to introduce simple checkpoint util to drop the dependency.
But for checksum do you mean we don't need checksum here? Seems there was some issues with dra_manager_state in the past. Or do you mean we need well designed checkpoint implementation along with checksum. Since it is example dra driver, maybe simple checkpointing without checksum is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only purpose of checksumming that I could imagine is to detect bit flips in the file. As a DRA driver author, is that important to you on top of whatever potential checksumming and error correcting the OS might do?
Are there other reasons for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven’t thought about it seriously.
The checkpoint manager already existed when I was involved with resource managers in kubelet around 2019.
Seems, checksum is originated from PodSandbox checkpointer of dockershim.
Approaching conservatively, there seem to be a few possible cases:
- The process could restart while writing a file (e.g., OOM, kill, restart).
- The file might be corrupted unintentionally by a person or script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The process could restart while writing a file (e.g., OOM, kill, restart).
The usual approach is to write a temp file, sync, then rename. But I suppose a checksum is easier.
The file might be corrupted unintentionally by a person or script.
True, albeit a bit unlikely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will update PR to introduce simple checkpoint util to drop the dependency.
Gentle reminder that this is pending.
/lgtm cancel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @pohly
I will resume this PR soon. Before I do, I have a couple of quick questions:
-
My understanding from your feedback is that checksums might be unnecessary here. Would implementing a simple checkpointing mechanism without checksums address this concern?
-
Aside from that, do you have any additional suggestions for an ideal checkpointing mechanism?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine without a checkpoint checksum. For the "writing file fails" case I think the "write tmp file, sync, close, remove, rename" approach would be useful. I don't know about other existing mechanisms that could be used here.
cc @nojnhuh
|
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
This PR encapsulates the internal state of Checkpoint and prevents direct access to CheckpointV1 from DeviceState.