
Missing processes cause the FoundationDBClusterStatus to be out of sync with actual cluster status #2292

Open
@hxu

Description


What happened?

In our k8s cluster, we sometimes have processes / nodes that are killed by other systems running in the cluster. When this happens, the node can disappear from the cluster. Sometimes, this seems to result in the process remaining in the FDBClusterStatus, despite no longer being part of the cluster.

We see these log lines:

skip updating fault domain for process group with missing process in FoundationDB cluster status

from the updateStatus reconciler, and the processGroupID is one that no longer exists in the cluster. This then causes problems in certain operations, such as updating pods, because some of the reconcilers seem to iterate over the processes from the FDBClusterStatus and try to fetch their details from k8s, but then they cannot find that pod.

We encountered this on operator version 2.3.0.

What did you expect to happen?

I would expect processes that are no longer reported in the machine-readable status to be removed from the FDBClusterStatus.
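The expected pruning behavior can be sketched as a pure function: keep only the status entries whose process group is still reported by the cluster's machine-readable status. This is a minimal illustration with hypothetical names (`pruneStale`, plain string IDs), not the operator's actual types or API.

```go
package main

import "fmt"

// pruneStale returns only the process group IDs from the operator's
// status that are still reported in the FDB machine-readable status.
// Entries for processes that vanished (e.g. a pod killed out of band)
// are dropped. Hypothetical helper, not the operator's real code.
func pruneStale(statusGroups []string, reported map[string]bool) []string {
	kept := make([]string, 0, len(statusGroups))
	for _, id := range statusGroups {
		if reported[id] {
			kept = append(kept, id)
		}
	}
	return kept
}

func main() {
	// storage-2 was killed by another system and is no longer reported.
	statusGroups := []string{"storage-1", "storage-2", "log-1"}
	reported := map[string]bool{"storage-1": true, "log-1": true}
	fmt.Println(pruneStale(statusGroups, reported))
}
```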

How can we reproduce it (as minimally and precisely as possible)?

I'm not totally sure because I don't have an exact reproduction, but I think you can just delete a pod or node from k8s while the cluster is running.

Anything else we need to know?

No response

FDB Kubernetes operator

FDB version 7.1.67
Operator version v2.3.0

Kubernetes version

$ kubectl version
Client Version: v1.32.1
Kustomize Version: v5.5.0
Server Version: v1.31.601

Cloud provider

AWS


    Labels

    documentation: Improvements or additions to documentation
