Kubernetes reschedules master pod which causes lost data #68
Results of initial analysis of the issue. All these points are made under the following assumptions (which are true as of this writing):
- Currently, the dashboard starts a background Redis Job that uses […]

But this problem also has a wider scope: Fault Tolerance in general. Workers can't write metrics while the dashboard is down, and workers could fail, which is pretty much guaranteed for long-running experiments. A solution to one of these issues should go hand-in-hand with an overall solution for Fault Tolerance, since that is needed to permit long-running experiments. Otherwise mlbench would be limited to experiments that take on the order of minutes, not hours, which would severely reduce its usefulness.

After some analysis, a workable solution with current K8s capabilities (similar to the approach adopted by the kubeflow mpi-operator, see https://github.com/kubeflow/community/blob/master/proposals/mpi-operator-proposal.md) would consist of the following components:
- Dashboard/Master: The dashboard persists its Postgresql and Redis state to persistent storage (possibly the same volumes as the workers use, to limit the amount of resources needed). The current long-running background task with the SPDY […]
- K8s Job: The background task in the dashboard is replaced by a new K8s […] (see the sketch after this list for one possible shape).
- Worker StatefulSet: Works mostly as before, but we have to make sure that all workers […]

Normal execution would look like this: […]
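As a rough illustration of the K8s Job component (not taken from the original issue; the Job name, image and command below are placeholders), the dashboard could create a short-lived Job through the official kubernetes Python client instead of keeping a long-running task inside its own process:

```python
from kubernetes import client, config

# Running inside the dashboard pod; use config.load_kube_config() for local development.
config.load_incluster_config()

batch = client.BatchV1Api()

# Hypothetical Job that collects/flushes worker metrics once and then exits.
job = client.V1Job(
    metadata=client.V1ObjectMeta(name="mlbench-metrics-sync"),
    spec=client.V1JobSpec(
        backoff_limit=4,  # let K8s retry the pod a few times before marking the Job failed
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="metrics-sync",
                        image="mlbench/metrics-sync:latest",  # placeholder image
                        command=["python", "sync_metrics.py"],  # placeholder entrypoint
                    )
                ],
            )
        ),
    ),
)

batch.create_namespaced_job(namespace="default", body=job)
```

Because the Job is an object managed by K8s rather than a thread inside the dashboard process, it survives a dashboard restart and can simply be re-created on the next periodic run.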
- On Dashboard Failure/Eviction: The Dashboard is started by K8s on a different node. Data is not lost, since it is saved to a Persistent Volume. Workers write all metrics that could not be written (see the buffering sketch after this list) and the background task is periodically executed again, grabbing the […]
- On […]
- On Worker Failure/Eviction: […]

In no case is relevant data lost: training continues as if nothing happened, and this approach is robust towards failure of any of the components. Metrics themselves should not be affected by restarting from a checkpoint.

Implementation of this entails several changes that should be done in their own tickets (since there are few dependencies between them and they can be worked on in parallel): […]
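A minimal sketch of the "workers write all metrics that could not be written" behaviour, assuming a shared volume mounted at /checkpoints and a dashboard REST endpoint; both the path and the URL are placeholders, not actual mlbench interfaces:

```python
import json
import os

import requests

BUFFER_FILE = "/checkpoints/unsent_metrics.jsonl"  # placeholder path on the shared volume
DASHBOARD_URL = os.environ.get("DASHBOARD_URL", "http://mlbench-master/api/metrics/")  # placeholder endpoint


def post_metric(metric: dict) -> None:
    """Try to send a metric; if the dashboard is unreachable, buffer it on disk."""
    try:
        requests.post(DASHBOARD_URL, json=metric, timeout=5).raise_for_status()
    except requests.RequestException:
        with open(BUFFER_FILE, "a") as f:
            f.write(json.dumps(metric) + "\n")


def flush_buffered_metrics() -> None:
    """Re-send metrics buffered while the dashboard was down; keep whatever still fails."""
    if not os.path.exists(BUFFER_FILE):
        return
    with open(BUFFER_FILE) as f:
        pending = [json.loads(line) for line in f if line.strip()]
    still_unsent = []
    for metric in pending:
        try:
            requests.post(DASHBOARD_URL, json=metric, timeout=5).raise_for_status()
        except requests.RequestException:
            still_unsent.append(metric)
    with open(BUFFER_FILE, "w") as f:
        for metric in still_unsent:
            f.write(json.dumps(metric) + "\n")
```

Since the buffer lives on the same persistent volume as the checkpoints, it also survives a worker being rescheduled.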
Kubernetes can reschedule the master pod arbitrarily, mostly due to resource constraints. When that happens, all active and past jobs in the Redis queue used for job management are lost.

See https://estl.tech/deploying-redis-with-persistence-on-google-kubernetes-engine-c1d60f70a043 for more details.

Postgresql persistence is also needed in a similar fashion.
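The general fix amounts to backing Redis (and Postgresql) with persistent volumes so their state survives the master pod being rescheduled. A minimal sketch of creating such a claim with the official kubernetes Python client, with placeholder name, namespace and size:

```python
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster

core = client.CoreV1Api()

# Hypothetical claim; the Redis (or Postgresql) container would mount it at its
# data directory so queue/job state outlives any single pod.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="mlbench-redis-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        resources=client.V1ResourceRequirements(requests={"storage": "1Gi"}),
    ),
)

core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)
```

Redis itself would additionally need persistence enabled (e.g. appendonly) and its data directory pointed at the mounted volume; the same pattern applies to the Postgresql data directory.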