TIPS

Shutting down or killing a container
------------------------------------

From the host, the inject utility can be used to run an appropriate command
within the container to start a graceful shut down. For example

  inject PID /bin/halt

To immediately kill a container and all its processes, it is sufficient to
send the init process a SIGKILL from the host using

  pkill -KILL -P PID

where PID is the process ID of a running container supervisor. It is very
important not to SIGKILL the container supervisor itself or the container
will be orphaned, continuing to run unsupervised as a child of the host
init.


Using cgroups to limit memory and processes available to a container
--------------------------------------------------------------------

If cgroup support, the memory controller and the pids controller are
compiled into the kernel, a mounted cgroup2 filesystem can be used to apply
memory and process-count limits to a container as it is started. For
example, the shell script

  #!/bin/sh -e
  echo +memory +pids >/sys/fs/cgroup/cgroup.subtree_control
  mkdir /sys/fs/cgroup/mycontainer
  echo $$ >/sys/fs/cgroup/mycontainer/tasks
  echo 2G >/sys/fs/cgroup/mycontainer/memory.high
  echo 3G >/sys/fs/cgroup/mycontainer/memory.max
  echo 2G >/sys/fs/cgroup/mycontainer/memory.swap.max
  echo 256 >sys/fs/cgroup/mycontainer/pids.max
  exec contain [...]

applies a best-efforts limit of 2GB memory with a hard limit of 3GB. Swap
usage is restricted to at most 2G, and no more than 256 process can be
forked within the container.

In addition, if contain is built and run on Linux 4.6 or later, a cgroup
namespace will be used to virtualise the container's view of the cgroup
hierarchy in /sys/fs/cgroup and /proc/*/cgroup. /sys/fs/cgroup/mycontainer
will appear as the root of the hierarchy at /sys/fs/cgroup within the
container.

See linux/kernel/Documentation/cgroup-v2.txt for detailed info on the
available controllers and configuration parameters.


Troubleshooting
---------------

The contain/psuedo error message 'Failed to unshare user namespace: Invalid
argument' typically means that your kernel is not compiled with support for
user namespaces, i.e. CONFIG_USER_NS is not set. The contain tool will also
die with a similar message referring to one of the other required namespaces
if support for that is not available in the kernel.

To run these tools you need to be running Linux 3.8 or later with

  CONFIG_CGROUPS=y
  CONFIG_UTS_NS=y
  CONFIG_TIME_NS=y
  CONFIG_IPC_NS=y
  CONFIG_USER_NS=y
  CONFIG_PID_NS=y
  CONFIG_NET_NS=y

set in the kernel build config. Note that before Linux 3.12, CONFIG_XFS_FS
conflicted with CONFIG_USER_NS, so these tools could not be used where XFS
support was compiled either into the kernel or as a module.

The contain tool will fail to mount /dev/pts unless

  CONFIG_DEVPTS_MULTIPLE_INSTANCES=y

is set in the kernel build config. Both container and host /dev/pts must be
mounted with -o newinstance, with /dev/ptmx symlinked to pts/ptmx.

Linux 3.12 introduced tighter restrictions on mounting proc and sysfs, which
broke older versions of contain. To comply with these new rules, contain
now ensures that procfs and sysfs are mounted in the new mount namespace
before pivoting into the container and detaching the host root.

A bug in Linux 3.12 will prevent contain from mounting /proc in a container
if binfmt_misc is mounted on /proc/sys/fs/binfmt_misc in the host
filesystem. This was fixed in Linux 3.13.

Linux 3.19 introduced restrictions on writing a user namespace GID map as an
unprivileged user unless setgroups() has been permanently disabled, which
broke older versions of contain. Run non-setuid and unprivileged, contain
and pseudo must now disable setgroups() to create containers, but if they
are installed setuid, they will bypass this kernel restriction and leave
setgroups() enabled in the resulting containers.