Linux containers in 500 lines of code

int capabilities()
{
	fprintf(stderr, "=> dropping capabilities...");

CAP_AUDIT_CONTROL, _READ, and _WRITE allow access to the audit system of the kernel (i.e. functions like audit_set_enabled, usually used with auditctl). The kernel prevents messages that normally require CAP_AUDIT_CONTROL outside of the first pid namespace, but it does allow messages that would require CAP_AUDIT_READ and CAP_AUDIT_WRITE from any namespace.[12] So let's drop them all. We especially want to drop CAP_AUDIT_READ, since the audit log it exposes isn't namespaced[13] and may contain important information, but CAP_AUDIT_WRITE may also allow the contained process to falsify logs or DoS the audit system.


CAP_BLOCK_SUSPEND lets programs prevent the system from suspending, either with EPOLLWAKEUP or /proc/sys/wake_lock.[14] Suspend isn't namespaced, so we'd like to prevent this.


CAP_DAC_READ_SEARCH lets programs call open_by_handle_at with an arbitrary struct file_handle *. struct file_handle is in theory an opaque type, but in practice it corresponds to inode numbers. So it's easy to brute-force them and read arbitrary files. This was used by Sebastian Krahmer to write a program to read arbitrary system files from within Docker in 2014.[15]


CAP_FSETID, without user namespacing, allows the process to modify a setuid executable without removing the setuid bit. This is pretty dangerous! It means that if we include a setuid binary in a container, it's easy for us to accidentally leave a dangerous setuid root binary on our disk, which any user can use to escalate privileges.[16]
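To see the default behavior that CAP_FSETID overrides, here is a minimal sketch. It is not part of the container program; the helper name setuid_bit_survives_write and the /tmp scratch path are made up for illustration. It marks a scratch file setuid, writes one byte to it, and reports whether the kernel kept the bit.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the setuid bit survived a write,
 * 0 if the kernel cleared it, and -1 if the experiment could not run. */
int setuid_bit_survives_write(void)
{
	char path[] = "/tmp/fsetid-demo-XXXXXX"; /* assumed scratch location */
	struct stat st;
	int result = -1;

	int fd = mkstemp(path);
	if (fd < 0)
		return -1;

	/* Set the setuid bit on our own file, modify the file, then re-check. */
	if (fchmod(fd, S_ISUID | 0755) == 0
	    && write(fd, "x", 1) == 1
	    && fstat(fd, &st) == 0)
		result = (st.st_mode & S_ISUID) ? 1 : 0;

	close(fd);
	unlink(path);
	return result;
}
```

A process without CAP_FSETID should normally see 0 here, while one that holds the capability (for example, root outside a user namespace) should see 1.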


CAP_IPC_LOCK can be used to lock more of a process's own memory than would normally be allowed,[17] which could be a way to deny service.
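The "normally allowed" amount is, as far as I know, the RLIMIT_MEMLOCK resource limit, which CAP_IPC_LOCK lets a process exceed. A small sketch for reading it follows; the helper name memlock_soft_limit is invented here and is not part of the container program.

```c
#include <sys/resource.h>

/* Hypothetical helper: stores the soft RLIMIT_MEMLOCK value (in bytes) in
 * *bytes. Returns 0 on success, -1 on failure. The value may be
 * RLIM_INFINITY if no limit is configured. */
int memlock_soft_limit(rlim_t *bytes)
{
	struct rlimit lim;

	if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0)
		return -1;
	*bytes = lim.rlim_cur;
	return 0;
}
```

With the capability dropped, mlock calls past this limit should fail instead of pinning host memory.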


CAP_MAC_ADMIN and CAP_MAC_OVERRIDE are used by the mandatory access control systems AppArmor, SELinux, and SMACK to restrict access to their settings. These aren't namespaced, so they could be used by the contained programs to circumvent system-wide access control.


CAP_MKNOD, without user namespacing, allows programs to create device files corresponding to real-world devices. This includes creating new device files for existing hardware. If this capability were not dropped, a contained process could re-create the hard disk device, remount it, and read or write to it.[18]


I was worried that CAP_SETFCAP could be used to add a capability to an executable and execve it, but it's not actually possible for a process to set capabilities it doesn't have.[19] But! An executable altered this way could be executed by any unsandboxed user, so I think it unacceptably undermines the security of the system.


CAP_SYSLOG lets users perform destructive actions against the syslog. Importantly, dropping it doesn't prevent contained processes from reading the syslog, which could be risky. The capability also exposes kernel addresses, which could be used to circumvent kernel address space layout randomization.[20]


CAP_SYS_ADMIN allows many behaviors! We don't want most of them (mount, vm86, etc). Some would be nice to have (sethostname, mount for bind mounts…) but the extra complexity doesn't seem worth it.


CAP_SYS_BOOT allows programs to restart the system (the reboot syscall) and load new kernels (the kexec_load and kexec_file_load syscalls).[21] We absolutely don't want this. reboot is user-namespaced, and the kexec* functions only work in the root user namespace, but neither of those helps us.


CAP_SYS_MODULE is used by the syscalls delete_module, init_module, and finit_module,[22] by the code for kmod,[23] and by the code for loading device modules with ioctl.[24]


CAP_SYS_NICE allows processes to set higher priority on given pids than the default.[25] The default kernel scheduler doesn't know anything about pid namespaces, so it's possible for a contained process to deny service to the rest of the system.[26]
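One way to check the effect of dropping CAP_SYS_NICE is to ask for a lower nice value (that is, a higher priority) and look at the error. This is only a sketch: try_raise_priority is a hypothetical helper, and a generous RLIMIT_NICE setting can also permit the change without the capability.

```c
#include <errno.h>
#include <sys/resource.h>

/* Hypothetical helper: tries to lower this process's nice value by one.
 * Returns 0 if the kernel accepted the request, otherwise the errno value
 * (EACCES on Linux when the caller lacks the privilege). */
int try_raise_priority(void)
{
	errno = 0;
	int cur = getpriority(PRIO_PROCESS, 0);
	/* -1 is a legitimate nice value, so errno tells us about failure. */
	if (cur == -1 && errno != 0)
		return errno;

	if (setpriority(PRIO_PROCESS, 0, cur - 1) != 0)
		return errno;
	return 0;
}
```

Inside the container, after the capability is dropped, the expected result is an error code, not 0.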


CAP_SYS_RAWIO allows full access to the host system's memory with /proc/kcore, /dev/mem, and /dev/kmem,[27] but a contained process would need mknod to access these within the namespace.[28] But it also allows things like iopl and ioperm, which give raw access to the IO ports.[29]


CAP_SYS_RESOURCE specifically allows circumventing kernel-wide limits, so we probably should drop it.[30] But I don't think this can do more than DoS the kernel, in general.[31]
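One of the limits this capability guards is the hard resource limit: setrlimit only lets a process raise a hard limit if it holds CAP_SYS_RESOURCE. The sketch below probes that with RLIMIT_NOFILE; the helper name try_raise_hard_nofile is invented for illustration and is not part of the container program.

```c
#include <errno.h>
#include <sys/resource.h>

/* Hypothetical helper: tries to raise the hard RLIMIT_NOFILE limit by one.
 * Returns 0 on success, the errno value on failure, or -1 if the hard
 * limit is already unlimited and there is nothing to raise. */
int try_raise_hard_nofile(void)
{
	struct rlimit lim;

	if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
		return errno;
	if (lim.rlim_max == RLIM_INFINITY)
		return -1;

	lim.rlim_max += 1; /* soft limit is left unchanged */
	if (setrlimit(RLIMIT_NOFILE, &lim) != 0)
		return errno;
	return 0;
}
```

A process that has lost CAP_SYS_RESOURCE should get EPERM back. Note that on success the call really does raise the limit for the calling process.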


CAP_SYS_TIME: setting the time isn't namespaced, so we should prevent contained processes from altering the system-wide time.[32]


CAP_WAKE_ALARM, like CAP_BLOCK_SUSPEND, lets the contained process interfere with suspend,[33] and we'd like to prevent that.
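The code that follows iterates over a drop_caps array. As a sketch, assuming that array lists exactly the capabilities discussed above, it could look like the block below. It is written at file scope so that it compiles on its own; the count variable num_drop_caps and the helper caps_left_in_bounding_set are hypothetical additions that use PR_CAPBSET_READ to report how many of the listed capabilities are still in the bounding set.

```c
#include <stddef.h>
#include <linux/capability.h>
#include <sys/prctl.h>

/* Assumed contents: one entry per capability described in the text. */
static const int drop_caps[] = {
	CAP_AUDIT_CONTROL,
	CAP_AUDIT_READ,
	CAP_AUDIT_WRITE,
	CAP_BLOCK_SUSPEND,
	CAP_DAC_READ_SEARCH,
	CAP_FSETID,
	CAP_IPC_LOCK,
	CAP_MAC_ADMIN,
	CAP_MAC_OVERRIDE,
	CAP_MKNOD,
	CAP_SETFCAP,
	CAP_SYSLOG,
	CAP_SYS_ADMIN,
	CAP_SYS_BOOT,
	CAP_SYS_MODULE,
	CAP_SYS_NICE,
	CAP_SYS_RAWIO,
	CAP_SYS_RESOURCE,
	CAP_SYS_TIME,
	CAP_WAKE_ALARM,
};
static const size_t num_drop_caps = sizeof(drop_caps) / sizeof(*drop_caps);

/* Hypothetical helper: counts how many listed capabilities are still in the
 * bounding set. Returns -1 if the kernel rejects one of the queries. */
int caps_left_in_bounding_set(void)
{
	int left = 0;

	for (size_t i = 0; i < num_drop_caps; i++) {
		/* PR_CAPBSET_READ yields 1 if present, 0 if dropped. */
		int r = prctl(PR_CAPBSET_READ, (unsigned long)drop_caps[i],
			      0UL, 0UL, 0UL);
		if (r < 0)
			return -1;
		left += r;
	}
	return left;
}
```

Once the bounding-set loop has run successfully, the helper should return 0.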

	size_t num_caps = sizeof(drop_caps) / sizeof(*drop_caps);
	fprintf(stderr, "bounding...");
	for (size_t i = 0; i < num_caps; i++) {
		if (prctl(PR_CAPBSET_DROP, drop_caps[i], 0, 0, 0)) {
			fprintf(stderr, "prctl failed: %m\n");
			return 1;
		}
	}
	fprintf(stderr, "inheritable...");
	cap_t caps = NULL;
	if (!(caps = cap_get_proc())
	    || cap_set_flag(caps, CAP_INHERITABLE, num_caps, drop_caps, CAP_CLEAR)
	    || cap_set_proc(caps)) {
		fprintf(stderr, "failed: %m\n");
		if (caps) cap_free(caps);
		return 1;
	}
	cap_free(caps);
	fprintf(stderr, "done.\n");
	return 0;