Skip to content

Fix: bug: bug: False positive readiness check when API server watch channels close unexpectedly#670

Open
gabrnavarro wants to merge 1 commit into
kubestellar:mainfrom
gabrnavarro:bot/task-256-issue-669
Open

Fix: bug: bug: False positive readiness check when API server watch channels close unexpectedly#670
gabrnavarro wants to merge 1 commit into
kubestellar:mainfrom
gabrnavarro:bot/task-256-issue-669

Conversation

@gabrnavarro

Copy link
Copy Markdown

What changed

In pkg/util/status_check.go, the three watch-based functions now return an explicit error instead of nil when the watch channel closes before the expected condition is met:

  • WaitForDeploymentReady: returns fmt.Errorf("watch channel closed before deployment %s/%s became ready", ...)
  • WaitForStatefulSetReady: returns fmt.Errorf("watch channel closed before statefulset %s/%s became ready", ...)
  • WaitForNamespaceDeletion: returns fmt.Errorf("watch channel closed before namespace %s was deleted", ...)

The parameter types were also changed from kubernetes.Clientset (concrete struct) to kubernetes.Interface to allow injection of a fake client in tests.

A new test file pkg/util/status_check_test.go covers both the bug scenario (channel closed early → error) and the happy path (condition met → nil) for all three functions.

Why

When the Kubernetes API server closes a watch channel prematurely (timeout, dropped connection, restart), the original code returned nil, signaling success to callers. This caused the reconciler to proceed as if the Deployment/StatefulSet was ready or the Namespace was deleted, leading to race conditions and incorrect downstream behavior.

Fixes #669

Testing

  • TestWaitForDeploymentReady_ChannelClosedBeforeReady: closes watch channel before deployment is ready, asserts error returned
  • TestWaitForDeploymentReady_ReadyBeforeChannelClose: sends a ready deployment event, asserts nil returned
  • TestWaitForStatefulSetReady_ChannelClosedBeforeReady: same pattern for StatefulSet
  • TestWaitForNamespaceDeletion_ChannelClosedBeforeDeletion: closes watch channel before namespace deleted, asserts error
  • TestWaitForNamespaceDeletion_DeletedSuccessfully: sends Terminating then Deleted events, asserts nil

All existing tests continue to pass (go test ./...).

Signed-off-by: gabrnavarro <gabrnavarro@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 3, 2026 15:28
@kubestellar-prow kubestellar-prow Bot added the dco-signoff: yes Indicates the PR's author has signed the DCO. label Apr 3, 2026
@kubestellar-prow

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign clubanderson for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

github-actions Bot commented Apr 3, 2026

Copy link
Copy Markdown

👋 Welcome to the KubeStellar community! 💖

Thanks and congrats 🎉 for opening your first PR here! We're excited to have you contributing.

Before merge, please ensure:

  • DCO Sign-off — All commits signed with git commit -s (DCO)
  • PR Title — Starts with an emoji: ✨ feature | 🐛 bug fix | 📖 docs | 🌱 infra/tests | ⚠️ breaking

📬 If you're using KubeStellar in your organization, please add your name to our Adopters list. 🙏 It really helps the project gain momentum and credibility — a small contribution back with a big impact.

Resources:

A maintainer will review your PR soon. Hope you have a great time here!

🌟 ~~~~~~~~~~ 🌟

📬 If you like KubeStellar, please ⭐ star ⭐ our repo to support it!

🙏 It really helps the project gain momentum and credibility — a small contribution back with a big impact.

@kubestellar-prow

Copy link
Copy Markdown
Contributor

Hi @gabrnavarro. Thanks for your PR.

I'm waiting for a kubestellar member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubestellar-prow kubestellar-prow Bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 3, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request fixes a critical bug where watch-based status check functions (WaitForDeploymentReady, WaitForStatefulSetReady, WaitForNamespaceDeletion) in pkg/util/status_check.go were returning nil when watch channels closed unexpectedly, creating false-positive success signals that led to race conditions. The fix changes these functions to return explicit errors instead. The parameter types are also changed from concrete kubernetes.Clientset to kubernetes.Interface to enable dependency injection in tests. A comprehensive test file has been added to verify both the error case and happy paths.

Changes:

  • Modified three watch-based status check functions to return errors when watch channels close prematurely instead of returning nil
  • Changed function signatures to accept kubernetes.Interface instead of kubernetes.Clientset for better testability
  • Added comprehensive test coverage for all three functions with both error and success scenarios
  • Updated call sites across multiple files to use the new function signatures
  • Fixed formatting issues and import ordering in several files

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pkg/util/status_check.go Fixed three watch functions to return explicit errors when channels close unexpectedly; changed parameter types to interfaces
pkg/util/status_check_test.go Added new test file with comprehensive coverage of error and success paths
cmd/kflex/init/init.go Updated function calls to new signatures; removed unnecessary dereference; fixed formatting
cmd/kflex/delete/delete.go Updated function call to new signature; removed unnecessary dereference
cmd/kflex/create/create.go Updated three function calls to use pointer variable instead of dereferenced value
cmd/cmupdate/main.go Updated function calls to remove dereferencing
internal/controller/controlplane_controller.go Fixed indentation in switch statement
pkg/reconcilers/shared/postcreate_hook.go Fixed import ordering to place stdlib imports before third-party imports
pkg/reconcilers/shared/reconciler.go Added blank line for formatting consistency

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cmd/kflex/init/init.go
Comment on lines 146 to 149
util.WaitForStatefulSetReady(
clientset,
clientsetp,
util.GeneratePSReplicaSetName(util.DBReleaseName),
util.SystemNamespace)

Copilot AI Apr 3, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call to WaitForStatefulSetReady() does not check the returned error. Since the function signature was changed to return an error when the watch channel closes unexpectedly, this error must be checked and handled. The function can now fail and the caller needs to respond appropriately rather than silently ignoring the error.

Copilot uses AI. Check for mistakes.
Comment thread cmd/kflex/init/init.go
Comment on lines 160 to 163
util.WaitForDeploymentReady(
clientset,
clientsetp,
util.GenerateOperatorDeploymentName(),
util.SystemNamespace)

Copilot AI Apr 3, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call to WaitForDeploymentReady() does not check the returned error. Since the function signature was changed to return an error when the watch channel closes unexpectedly, this error must be checked and handled. The function can now fail and the caller needs to respond appropriately rather than silently ignoring the error.

Copilot uses AI. Check for mistakes.
}
util.PrintStatus(fmt.Sprintf("Waiting for control plane %s to be deleted...", cp.Name), done, &wg, chattyStatus)
util.WaitForNamespaceDeletion(*clientsetKflex, util.GenerateNamespaceFromControlPlaneName(cp.Name))
util.WaitForNamespaceDeletion(clientsetKflex, util.GenerateNamespaceFromControlPlaneName(cp.Name))

Copilot AI Apr 3, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call to WaitForNamespaceDeletion() does not check the returned error. Since the function signature was changed to return an error when the watch channel closes unexpectedly, this error must be checked and handled. The function can now fail and the caller needs to respond appropriately rather than silently ignoring the error.

Suggested change
util.WaitForNamespaceDeletion(clientsetKflex, util.GenerateNamespaceFromControlPlaneName(cp.Name))
if err := util.WaitForNamespaceDeletion(clientsetKflex, util.GenerateNamespaceFromControlPlaneName(cp.Name)); err != nil {
done <- true
wg.Wait()
return fmt.Errorf("error waiting for control plane namespace deletion: %v", err)
}

Copilot uses AI. Check for mistakes.
t.Fatal("expected error when watch channel closes before statefulset is ready, got nil")
}
}

Copilot AI Apr 3, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test file is missing a test for the happy path of WaitForStatefulSetReady(). While there is a test for WaitForDeploymentReady_ReadyBeforeChannelClose and TestWaitForNamespaceDeletion_DeletedSuccessfully, there is no corresponding positive test for WaitForStatefulSetReady. This leaves the success path for StatefulSet readiness without explicit test coverage.

Suggested change
// TestWaitForStatefulSetReady_ReadyBeforeChannelClose verifies the happy path:
// statefulset reports ready replicas and the function returns nil.
func TestWaitForStatefulSetReady_ReadyBeforeChannelClose(t *testing.T) {
fakeClient := fake.NewSimpleClientset()
fw := watch.NewFake()
fakeClient.PrependWatchReactor("statefulsets", func(action k8stesting.Action) (bool, watch.Interface, error) {
return true, fw, nil
})
sts := &v1.StatefulSet{
ObjectMeta: metav1.ObjectMeta{
Name: "test-sts",
Namespace: "test-ns",
},
Spec: v1.StatefulSetSpec{
Replicas: int32Ptr(3),
},
Status: v1.StatefulSetStatus{
ReadyReplicas: 3,
},
}
go func() {
fw.Modify(sts)
}()
err := WaitForStatefulSetReady(fakeClient, "test-sts", "test-ns")
if err != nil {
t.Fatalf("expected nil when statefulset becomes ready, got: %v", err)
}
}

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the DCO. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: bug: False positive readiness check when API server watch channels close unexpectedly

2 participants