Fix: bug: bug: False positive readiness check when API server watch channels close unexpectedly#670
Conversation
Signed-off-by: gabrnavarro <gabrnavarro@users.noreply.github.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
👋 Welcome to the KubeStellar community! 💖 Thanks and congrats 🎉 for opening your first PR here! We're excited to have you contributing. Before merge, please ensure:
📬 If you're using KubeStellar in your organization, please add your name to our Adopters list. 🙏 It really helps the project gain momentum and credibility — a small contribution back with a big impact. Resources:
A maintainer will review your PR soon. Hope you have a great time here! 🌟 ~~~~~~~~~~ 🌟 📬 If you like KubeStellar, please ⭐ star ⭐ our repo to support it! 🙏 It really helps the project gain momentum and credibility — a small contribution back with a big impact. |
|
Hi @gabrnavarro. Thanks for your PR. I'm waiting for a kubestellar member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Pull request overview
This pull request fixes a critical bug where watch-based status check functions (WaitForDeploymentReady, WaitForStatefulSetReady, WaitForNamespaceDeletion) in pkg/util/status_check.go were returning nil when watch channels closed unexpectedly, creating false-positive success signals that led to race conditions. The fix changes these functions to return explicit errors instead. The parameter types are also changed from concrete kubernetes.Clientset to kubernetes.Interface to enable dependency injection in tests. A comprehensive test file has been added to verify both the error case and happy paths.
Changes:
- Modified three watch-based status check functions to return errors when watch channels close prematurely instead of returning nil
- Changed function signatures to accept
kubernetes.Interfaceinstead ofkubernetes.Clientsetfor better testability - Added comprehensive test coverage for all three functions with both error and success scenarios
- Updated call sites across multiple files to use the new function signatures
- Fixed formatting issues and import ordering in several files
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/util/status_check.go | Fixed three watch functions to return explicit errors when channels close unexpectedly; changed parameter types to interfaces |
| pkg/util/status_check_test.go | Added new test file with comprehensive coverage of error and success paths |
| cmd/kflex/init/init.go | Updated function calls to new signatures; removed unnecessary dereference; fixed formatting |
| cmd/kflex/delete/delete.go | Updated function call to new signature; removed unnecessary dereference |
| cmd/kflex/create/create.go | Updated three function calls to use pointer variable instead of dereferenced value |
| cmd/cmupdate/main.go | Updated function calls to remove dereferencing |
| internal/controller/controlplane_controller.go | Fixed indentation in switch statement |
| pkg/reconcilers/shared/postcreate_hook.go | Fixed import ordering to place stdlib imports before third-party imports |
| pkg/reconcilers/shared/reconciler.go | Added blank line for formatting consistency |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| util.WaitForStatefulSetReady( | ||
| clientset, | ||
| clientsetp, | ||
| util.GeneratePSReplicaSetName(util.DBReleaseName), | ||
| util.SystemNamespace) |
There was a problem hiding this comment.
The call to WaitForStatefulSetReady() does not check the returned error. Since the function signature was changed to return an error when the watch channel closes unexpectedly, this error must be checked and handled. The function can now fail and the caller needs to respond appropriately rather than silently ignoring the error.
| util.WaitForDeploymentReady( | ||
| clientset, | ||
| clientsetp, | ||
| util.GenerateOperatorDeploymentName(), | ||
| util.SystemNamespace) |
There was a problem hiding this comment.
The call to WaitForDeploymentReady() does not check the returned error. Since the function signature was changed to return an error when the watch channel closes unexpectedly, this error must be checked and handled. The function can now fail and the caller needs to respond appropriately rather than silently ignoring the error.
| } | ||
| util.PrintStatus(fmt.Sprintf("Waiting for control plane %s to be deleted...", cp.Name), done, &wg, chattyStatus) | ||
| util.WaitForNamespaceDeletion(*clientsetKflex, util.GenerateNamespaceFromControlPlaneName(cp.Name)) | ||
| util.WaitForNamespaceDeletion(clientsetKflex, util.GenerateNamespaceFromControlPlaneName(cp.Name)) |
There was a problem hiding this comment.
The call to WaitForNamespaceDeletion() does not check the returned error. Since the function signature was changed to return an error when the watch channel closes unexpectedly, this error must be checked and handled. The function can now fail and the caller needs to respond appropriately rather than silently ignoring the error.
| util.WaitForNamespaceDeletion(clientsetKflex, util.GenerateNamespaceFromControlPlaneName(cp.Name)) | |
| if err := util.WaitForNamespaceDeletion(clientsetKflex, util.GenerateNamespaceFromControlPlaneName(cp.Name)); err != nil { | |
| done <- true | |
| wg.Wait() | |
| return fmt.Errorf("error waiting for control plane namespace deletion: %v", err) | |
| } |
| t.Fatal("expected error when watch channel closes before statefulset is ready, got nil") | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
The test file is missing a test for the happy path of WaitForStatefulSetReady(). While there is a test for WaitForDeploymentReady_ReadyBeforeChannelClose and TestWaitForNamespaceDeletion_DeletedSuccessfully, there is no corresponding positive test for WaitForStatefulSetReady. This leaves the success path for StatefulSet readiness without explicit test coverage.
| // TestWaitForStatefulSetReady_ReadyBeforeChannelClose verifies the happy path: | |
| // statefulset reports ready replicas and the function returns nil. | |
| func TestWaitForStatefulSetReady_ReadyBeforeChannelClose(t *testing.T) { | |
| fakeClient := fake.NewSimpleClientset() | |
| fw := watch.NewFake() | |
| fakeClient.PrependWatchReactor("statefulsets", func(action k8stesting.Action) (bool, watch.Interface, error) { | |
| return true, fw, nil | |
| }) | |
| sts := &v1.StatefulSet{ | |
| ObjectMeta: metav1.ObjectMeta{ | |
| Name: "test-sts", | |
| Namespace: "test-ns", | |
| }, | |
| Spec: v1.StatefulSetSpec{ | |
| Replicas: int32Ptr(3), | |
| }, | |
| Status: v1.StatefulSetStatus{ | |
| ReadyReplicas: 3, | |
| }, | |
| } | |
| go func() { | |
| fw.Modify(sts) | |
| }() | |
| err := WaitForStatefulSetReady(fakeClient, "test-sts", "test-ns") | |
| if err != nil { | |
| t.Fatalf("expected nil when statefulset becomes ready, got: %v", err) | |
| } | |
| } |
What changed
In
pkg/util/status_check.go, the three watch-based functions now return an explicit error instead ofnilwhen the watch channel closes before the expected condition is met:WaitForDeploymentReady: returnsfmt.Errorf("watch channel closed before deployment %s/%s became ready", ...)WaitForStatefulSetReady: returnsfmt.Errorf("watch channel closed before statefulset %s/%s became ready", ...)WaitForNamespaceDeletion: returnsfmt.Errorf("watch channel closed before namespace %s was deleted", ...)The parameter types were also changed from
kubernetes.Clientset(concrete struct) tokubernetes.Interfaceto allow injection of a fake client in tests.A new test file
pkg/util/status_check_test.gocovers both the bug scenario (channel closed early → error) and the happy path (condition met → nil) for all three functions.Why
When the Kubernetes API server closes a watch channel prematurely (timeout, dropped connection, restart), the original code returned
nil, signaling success to callers. This caused the reconciler to proceed as if the Deployment/StatefulSet was ready or the Namespace was deleted, leading to race conditions and incorrect downstream behavior.Fixes #669
Testing
TestWaitForDeploymentReady_ChannelClosedBeforeReady: closes watch channel before deployment is ready, asserts error returnedTestWaitForDeploymentReady_ReadyBeforeChannelClose: sends a ready deployment event, asserts nil returnedTestWaitForStatefulSetReady_ChannelClosedBeforeReady: same pattern for StatefulSetTestWaitForNamespaceDeletion_ChannelClosedBeforeDeletion: closes watch channel before namespace deleted, asserts errorTestWaitForNamespaceDeletion_DeletedSuccessfully: sends Terminating then Deleted events, asserts nilAll existing tests continue to pass (
go test ./...).