MCO-1807: Add CPMS support in the MCO's boot image controller #5332
base: main
Skipping CI for Draft Pull Request.
@djoshy: This pull request references MCO-1807 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: djoshy The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test all
502f475 to a52885f
/test verify
This captures updates for the ManagedBootImages API
a52885f to 4aa7aca
4aa7aca to 42d5df2
Opening this up for initial review; I've integrated the API from openshift/api#2396.
	go func() { ctrl.syncMAPIMachineSets("MAPIMachinesetDeleted") }()
}

func (ctrl *Controller) addControlPlaneMachineSet(obj interface{}) {
non-blocking nit: comments overviewing the functions throughout this file might be useful (though the function names are pretty self-explanatory)
thanks, will update on my next pass 😄
	// Update/Check all ControlPlaneMachineSets instead of just this one. This prevents needing to maintain a local
	// store of machineset conditions. As this is using a lister, it is relatively inexpensive to do
	// this.
	go func() { ctrl.syncControlPlaneMachineSets("ControlPlaneMachineSetAdded") }()
I'm surprised by how the event handling has been done, not in a bad way, but I do think it's unconventional and it may have some caveats. Do we care about event arrival order? If so (and I think we do), the ideal approach would be the usual pattern of a channel consumed by a single goroutine that pulls events from the channel as soon as they arrive. That would preserve the order.
Yeah - I went a bit unconventional here because I wanted all the resource reconciliations to happen in a single thread, one resource after the other; mainly to preserve the condition updates since each update writes to the status of the MachineConfiguration resource. And any event results in the same action i.e. loop through all the machine resources. I added mutexes in the actual sync functions to help with the ordering, so follow-up syncs won't step on each other.
Even so, the ordering itself isn't all that important, since the controller is listening on all the machine resources for any deviations. We do this because we want to alert the admin if we're hot looping when there is another actor on the boot image field. This does cause a quirk though: right after the MCO does perform an update, it immediately does a no-op loop through the resources.
I'm happy to rework this into a channel based system as a follow-up - this was a bit of an experiment I did at the time because I was curious about using mutexes instead of the channels 😅
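For reference, the channel-based alternative discussed in this thread could look roughly like the sketch below. It is not code from this PR: the syncer type and its methods are hypothetical stand-ins for the controller, and the loop body only records the sync reason where the real controller would walk all machine resources. The two reason strings are the ones visible in the diff above.

```go
package main

import (
	"fmt"
	"sync"
)

// syncer is a hypothetical stand-in for the boot image controller.
// It funnels every event into one channel that a single goroutine drains.
type syncer struct {
	events chan string    // sync reasons, in arrival order
	done   sync.WaitGroup // tracks the consumer goroutine
	seen   []string       // reasons handled so far (written only by run)
}

// newSyncer starts the single consumer goroutine.
func newSyncer(buffer int) *syncer {
	s := &syncer{events: make(chan string, buffer)}
	s.done.Add(1)
	go s.run()
	return s
}

// run is the only place where syncs happen, so they execute one at a time
// and in the order they were enqueued. No mutex is needed around the sync.
func (s *syncer) run() {
	defer s.done.Done()
	for reason := range s.events {
		// A real controller would loop through all machine resources here.
		s.seen = append(s.seen, reason)
	}
}

// enqueue is what an informer event handler would call instead of
// spawning a new goroutine per event.
func (s *syncer) enqueue(reason string) { s.events <- reason }

// stop closes the channel and waits for the consumer to drain it.
func (s *syncer) stop() {
	close(s.events)
	s.done.Wait()
}

func main() {
	s := newSyncer(8)
	s.enqueue("ControlPlaneMachineSetAdded")
	s.enqueue("MAPIMachinesetDeleted")
	s.stop()
	fmt.Println(s.seen) // prints [ControlPlaneMachineSetAdded MAPIMachinesetDeleted]
}
```

Compared with a goroutine per event guarded by a mutex, this keeps arrival order explicit, at the cost of a bounded queue that handlers can block on when it is full.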
@djoshy: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
- What I did
This PR adds support for boot image updates to ControlPlaneMachineSet for the AWS, Azure and GCP platforms. A couple of key points to know about CPMS:
[...] cluster. The boot images are stored under spec, in a field similar to MachineSets. For example, in AWS (abbreviated to only important fields):
[...] RollingUpdate, Recreate or OnDelete. In RollingUpdate mode, this meant that any deviation in the spec of the CPMS from the nodes will cause a complete control plane replacement, which is undesirable if the only deviation was boot images. This is because the nodes pivot to the latest RHCOS image described by the OCP release image, and it would effectively be no-op, adding to upgrade time. To avoid this issue, the CPMS operator was updated to ignore boot image fields during control plane machine reconciliation.

- How to verify it
[...] TechPreview featureset.
[...] cluster for comparison purposes.
[...] MachineConfiguration object:
[...] ami-00abe7f9c6bd85a77.
[...] projects/rhcos-cloud/global/images/, for example projects/rhcos-cloud/global/images/test.
[...] MachineConfiguration object's status to see if the CPMS was reconciled successfully. The CPMS boot image fields should reflect the values you initially saw post-install. These are the values described in the coreos-bootimages configmap. The machine-config-controller logs should also mention that a boot image update took place.
[...] spec.replicas value, and it should be able to do so successfully. This process might take a while (took about 10-15 minutes on GCP for me) to complete as the CPMS controller will first scale up the replacement and then drain and delete the older control plane machine. I think this is to maintain etcd quorum at all points of the process.
[...] MachineConfiguration object's status to see if the CPMS object was reconciled successfully. The CPMS boot image fields should reflect the values you set, and not the values described in the coreos-bootimages configmap. The machine-config-controller logs should also mention that a boot image update did not take place.

Note: Since these are singleton objects, the Partial selection mode is not permitted while specifying boot image configuration. Hence, that mode does not need to be tested. The APIServer will reject any attempt to set Partial for CPMS objects, so I suppose that is something to test as well! 😄
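To make the Partial restriction in the note concrete, here is a minimal Go sketch of the rule. It only illustrates the logic and is not how the APIServer enforces it. The machineManager type, the mode constants other than Partial, and the resource name string are simplified assumptions, not the real ManagedBootImages API types from openshift/api.

```go
package main

import (
	"errors"
	"fmt"
)

// selectionMode and machineManager are hypothetical, simplified types.
type selectionMode string

const (
	modeAll     selectionMode = "All"     // assumed mode name
	modePartial selectionMode = "Partial" // mode named in the PR description
)

type machineManager struct {
	Resource string        // assumed resource name, e.g. "controlplanemachinesets"
	Mode     selectionMode // how machine resources are selected for updates
}

var errPartialNotAllowed = errors.New("selection mode Partial is not permitted for ControlPlaneMachineSet")

// validate rejects Partial for the singleton CPMS resource: with only one
// object to select, a label-based partial selection has no meaning.
func validate(m machineManager) error {
	if m.Resource == "controlplanemachinesets" && m.Mode == modePartial {
		return errPartialNotAllowed
	}
	return nil
}

func main() {
	candidates := []machineManager{
		{Resource: "controlplanemachinesets", Mode: modeAll},
		{Resource: "controlplanemachinesets", Mode: modePartial},
		{Resource: "machinesets", Mode: modePartial},
	}
	for _, m := range candidates {
		fmt.Printf("%s/%s: %v\n", m.Resource, m.Mode, validate(m))
	}
}
```

When verifying on a real cluster, the equivalent check is simply that an attempt to set Partial for the CPMS entry in the MachineConfiguration object is rejected on apply.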