# Lifecycle of a VM pod

This document describes the lifecycle of a VM pod managed by Virtlet.

This description omits the details of volume setup (using
[flexvolumes](https://kubernetes.io/docs/concepts/storage/volumes/#flexvolume)),
as well as the handling of logs, the VM console and port forwarding (done by the
[streaming server](https://github.com/Mirantis/virtlet/tree/master/pkg/stream)).

## Assumptions

Communication between kubelet and Virtlet goes through [criproxy](https://github.com/Mirantis/criproxy),
which directs requests to Virtlet only if they concern a pod that has
the Virtlet-specific annotation or an image that has the Virtlet-specific prefix.
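
To make the routing concrete, here is a minimal sketch of the kind of check such a
dispatcher could perform. The function name and the `virtlet.cloud/` image prefix are
illustrative assumptions; only the annotation key and value are taken from the steps
described below.

```go
package dispatch

import "strings"

// useVirtlet is a hypothetical version of the dispatch check described
// above: a request is routed to Virtlet either because the pod carries
// the Virtlet runtime annotation or because the image name uses a
// Virtlet-specific prefix (assumed here to be "virtlet.cloud/").
func useVirtlet(podAnnotations map[string]string, image string) bool {
	if podAnnotations["kubernetes.io/target-runtime"] == "virtlet.cloud" {
		return true
	}
	return strings.HasPrefix(image, "virtlet.cloud/")
}
```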

## Lifecycle

### VM Pod Startup

 * A pod is created in the Kubernetes cluster, either directly by the user or via
   some other mechanism such as a higher-level Kubernetes object managed by
   `kube-controller-manager` (ReplicaSet, DaemonSet etc.).
 * The scheduler places the pod on a node based on the requested resources
   (CPU, memory, etc.) as well as the pod's nodeSelector and pod/node affinity
   constraints, taints/tolerations and so on.
 * `kubelet` running on the target node accepts the pod.
 * `kubelet` invokes the [CRI](https://contributor.kubernetes.io/contributors/devel/container-runtime-interface/)
   call `RunPodSandbox` to create the pod sandbox which will enclose all the
   containers in the pod definition. Note that at this point no information
   about the containers within the pod is passed to the call. `kubelet` can
   later request information about the pod by means of `PodSandboxStatus` calls.
   A sketch of the full kubelet-side CRI call sequence is given after this list.
 * If there's the Virtlet-specific annotation `kubernetes.io/target-runtime: virtlet.cloud`,
   the CRI proxy passes the call to Virtlet.
 * Virtlet saves the sandbox metadata in its internal database, sets up the
   network namespace and then uses its internal `tapmanager` mechanism to invoke
   the `ADD` operation via the CNI plugin, as specified by the
   CNI configuration on the node.
 * The CNI plugin configures the network namespace by setting up
   network interfaces, IP addresses, routes, iptables rules and so on,
   and returns the network configuration information to the caller as described
   in the [CNI spec](https://github.com/containernetworking/cni/blob/master/SPEC.md#result).
 * Virtlet's [`tapmanager`](https://github.com/Mirantis/virtlet/tree/master/pkg/tapmanager)
   mechanism adjusts the configuration of the network namespace to make it work with the VM.
 * After creating the sandbox, kubelet starts the containers defined in
   the pod sandbox. Currently, Virtlet supports just one container per VM pod,
   so the VM pod startup steps after this one describe the startup of that single container.
 * Depending on the container's image pull policy, kubelet checks whether
   the image needs to be pulled by means of an `ImageStatus` call and then uses
   the `PullImage` CRI call to pull the image if it doesn't exist or if
   `imagePullPolicy: Always` is used.
 * If `PullImage` is invoked, Virtlet resolves the image location based on the
   [image name translation configuration](https://github.com/Mirantis/virtlet/blob/master/docs/image-name-translation.md),
   then downloads the file and stores it in the image store.
 * After the image is ready (no pull was needed or the `PullImage` call completed
   successfully), kubelet uses the `CreateContainer` CRI call to create
   the container in the pod sandbox using the specified image.
 * Virtlet uses the sandbox and container metadata to generate a libvirt domain definition,
   using the [`vmwrapper`](https://github.com/Mirantis/virtlet/tree/master/cmd/vmwrapper)
   binary as the emulator and without specifying any network configuration in the domain.
 * After the `CreateContainer` call completes, `kubelet` invokes the `StartContainer` call
   on the newly created container.
 * Virtlet starts the libvirt domain. libvirt invokes `vmwrapper` as the emulator,
   passing it the necessary command line arguments as well as environment variables
   set by Virtlet. `vmwrapper` uses the environment variable values passed
   by Virtlet to communicate with `tapmanager` over a Unix domain socket,
   retrieving a file descriptor for a tap device and/or the PCI address of an SR-IOV
   device set up by `tapmanager`. `tapmanager` uses its own simple protocol to
   communicate with `vmwrapper` because it needs to send file descriptors over
   the socket, which is not usually supported by RPC libraries (see e.g.
   [grpc/grpc#11417](https://github.com/grpc/grpc/issues/11417)); a sketch of this
   technique follows this list. `vmwrapper` then updates the command line arguments
   to include the network interface information and execs the actual emulator (`qemu`).
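
The file descriptor handoff in the last step relies on `SCM_RIGHTS` ancillary data,
a standard feature of Unix domain sockets. Below is a minimal sketch of the technique
using only Go's standard library; the function names are hypothetical and this is not
`tapmanager`'s actual wire protocol, but it shows why plain byte-oriented RPC framing
can't carry this kind of data.

```go
package fdpass

import (
	"errors"
	"net"
	"syscall"
)

// sendFD passes an open file descriptor (e.g. for a tap device) to the
// peer of a Unix domain socket as SCM_RIGHTS ancillary data, alongside
// an ordinary data payload.
func sendFD(conn *net.UnixConn, fd int, payload []byte) error {
	oob := syscall.UnixRights(fd)
	_, _, err := conn.WriteMsgUnix(payload, oob, nil)
	return err
}

// recvFD receives a single file descriptor sent by sendFD.
func recvFD(conn *net.UnixConn) (int, error) {
	buf := make([]byte, 1)
	oob := make([]byte, syscall.CmsgSpace(4)) // room for one 4-byte fd
	_, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
	if err != nil {
		return -1, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return -1, err
	}
	if len(msgs) != 1 {
		return -1, errors.New("expected exactly one SCM_RIGHTS message")
	}
	fds, err := syscall.ParseUnixRights(&msgs[0])
	if err != nil {
		return -1, err
	}
	if len(fds) != 1 {
		return -1, errors.New("expected exactly one file descriptor")
	}
	return fds[0], nil
}
```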

At this point the VM is running and accessible via the network, and the pod,
as well as its only container, is in the `Running` state.
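
The kubelet-side startup sequence above maps onto a handful of CRI calls. The
following is a rough, hand-written sketch of that sequence issued directly against
a CRI runtime's gRPC endpoint, using the current `k8s.io/cri-api` v1 types (Virtlet
itself predates these and used an earlier CRI revision). The socket path, names and
field values are illustrative assumptions, not Virtlet's actual configuration.

```go
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	// With criproxy in place, this would be the proxy's endpoint
	// rather than Virtlet's own socket (path is an assumption).
	conn, err := grpc.Dial("unix:///run/criproxy.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	rt := runtimeapi.NewRuntimeServiceClient(conn)
	img := runtimeapi.NewImageServiceClient(conn)
	ctx := context.Background()

	sandboxConfig := &runtimeapi.PodSandboxConfig{
		Metadata: &runtimeapi.PodSandboxMetadata{
			Name: "cirros-vm", Namespace: "default", Uid: "pod-uid",
		},
		// The annotation that causes criproxy to route calls to Virtlet.
		Annotations: map[string]string{
			"kubernetes.io/target-runtime": "virtlet.cloud",
		},
	}

	// 1. RunPodSandbox: no container information is passed yet.
	sb, err := rt.RunPodSandbox(ctx,
		&runtimeapi.RunPodSandboxRequest{Config: sandboxConfig})
	if err != nil {
		log.Fatal(err)
	}

	// 2. PullImage (skipped by kubelet when the image is already present
	// and imagePullPolicy doesn't force a pull).
	imageSpec := &runtimeapi.ImageSpec{Image: "virtlet.cloud/cirros"}
	if _, err := img.PullImage(ctx,
		&runtimeapi.PullImageRequest{Image: imageSpec}); err != nil {
		log.Fatal(err)
	}

	// 3. CreateContainer and 4. StartContainer for the single VM container.
	c, err := rt.CreateContainer(ctx, &runtimeapi.CreateContainerRequest{
		PodSandboxId: sb.PodSandboxId,
		Config: &runtimeapi.ContainerConfig{
			Metadata: &runtimeapi.ContainerMetadata{Name: "vm"},
			Image:    imageSpec,
		},
		SandboxConfig: sandboxConfig,
	})
	if err != nil {
		log.Fatal(err)
	}
	if _, err := rt.StartContainer(ctx,
		&runtimeapi.StartContainerRequest{ContainerId: c.ContainerId}); err != nil {
		log.Fatal(err)
	}
}
```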

### Deleting the pod

This sequence is initiated when the pod is deleted, either by means of `kubectl delete`
or by a controller manager action due to the deletion or downscaling of a higher-level object.

 * `kubelet` notices the pod being deleted.
 * `kubelet` invokes the `StopContainer` CRI call, which is forwarded
   to Virtlet based on the containing pod sandbox's annotations.
 * Virtlet stops the libvirt domain. libvirt sends a signal to `qemu`, which initiates
   the shutdown. If `qemu` doesn't quit within a reasonable time determined by the pod's
   termination grace period, Virtlet forcibly terminates the domain,
   thus killing the `qemu` process (see the sketch after this list).
 * After all the containers in the pod (the single container in the case of a
   Virtlet VM pod) are stopped, kubelet invokes the `StopPodSandbox` CRI call.
 * Virtlet asks its `tapmanager` to remove the pod from the network by means of
   a CNI `DEL` operation.
 * After `StopPodSandbox` returns, the pod sandbox will eventually be GC'd
   by `kubelet` by means of the `RemovePodSandbox` CRI call.
 * Upon `RemovePodSandbox`, Virtlet removes the pod metadata from its internal database.
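
The domain stop step can be pictured with the libvirt Go bindings: request a guest
shutdown, wait up to the grace period, then force-destroy. This is a minimal sketch
under the assumption of the `libvirt.org/go/libvirt` package; the helper name and
polling interval are made up, and Virtlet's real implementation differs in detail.

```go
package domainstop

import (
	"time"

	libvirt "libvirt.org/go/libvirt"
)

// stopDomain is a hypothetical helper mirroring the stop sequence above:
// ask the guest to shut down, wait up to the pod's termination grace
// period, and destroy (force-kill) the domain if it is still running.
func stopDomain(dom *libvirt.Domain, gracePeriod time.Duration) error {
	// Shutdown sends the guest a shutdown request (e.g. via ACPI);
	// a cooperative guest will quit on its own.
	if err := dom.Shutdown(); err != nil {
		return err
	}
	deadline := time.Now().Add(gracePeriod)
	for time.Now().Before(deadline) {
		active, err := dom.IsActive()
		if err != nil {
			return err
		}
		if !active {
			return nil // the guest shut down cleanly
		}
		time.Sleep(time.Second)
	}
	// Grace period expired: forcibly terminate the domain, which
	// kills the qemu process.
	return dom.Destroy()
}
```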