Description
What happened?
In our k8s cluster, processes or nodes are sometimes killed by other systems running in the cluster. When this happens, the node can disappear from the cluster. Sometimes this seems to result in the process staying in the FDBClusterStatus, despite no longer being part of the cluster.
We see log lines like this from the updateStatus reconciler:
skip updating fault domain for process group with missing process in FoundationDB cluster status
The processGroupID in these lines is one that no longer exists in the cluster. This then causes problems in certain operations, such as updating pods, because some of the reconcilers iterate over the processes in the FDBClusterStatus and try to fetch their details from k8s, but then cannot find the pod.
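For reference, this is roughly how we check for the stale entries (the cluster name is a placeholder, and the jsonpath assumes the v1beta2 status layout with a processGroups list):
# List the process group IDs the operator still tracks in the status:
$ kubectl get foundationdbcluster sample-cluster -o jsonpath='{.status.processGroups[*].processGroupID}'
In our case the output still includes IDs for processes that the machine-readable status no longer reports.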
We encountered this on operator version 2.3.0.
What did you expect to happen?
I would expect processes that are no longer reported in the machine-readable status to be removed from the FDBClusterStatus.
How can we reproduce it (as minimally and precisely as possible)?
I'm not entirely sure, since I don't have an exact reproduction, but I believe you can trigger it by deleting a pod or node from k8s while the cluster is running (see the sketch below).
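A rough sketch of the steps we'd try (the pod, node, and cluster names are placeholders):
# Forcibly remove a pod out from under the operator:
$ kubectl delete pod sample-cluster-storage-1 --grace-period=0 --force
# Or remove the node the pod was scheduled on:
$ kubectl delete node <node-name>
# Then check whether the process group lingers in the FDBClusterStatus,
# using the same jsonpath query as above:
$ kubectl get foundationdbcluster sample-cluster -o jsonpath='{.status.processGroups[*].processGroupID}'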
Anything else we need to know?
No response
FDB Kubernetes operator
FDB version: 7.1.67
Operator version: v2.3.0
Kubernetes version
$ kubectl version
Client Version: v1.32.1
Kustomize Version: v5.5.0
Server Version: v1.31.601