“We are not makers of history. We are made by history.” Martin Luther King
Docker is one of the best-known container platforms today, and it was released in 2013. However, isolation and containerization were in use well before that. Let’s go back to 1979, when the chroot jail appeared, and walk through the best-known containerization technologies that followed. This history also helps explain the technical concepts behind today's tools.
It all started when the chroot system call, and with it the chroot jail, was introduced during the development of Version 7 Unix in 1979. Chroot stands for “change root”, and it is considered one of the first containerization technologies. It lets you isolate a process and its children from the rest of the operating system by changing their view of the filesystem root. The main problem with this isolation is that a root process can easily escape the chroot; it was never intended as a security mechanism. FreeBSD Jails were introduced in FreeBSD in 2000 to bring more security to the simple chroot file isolation. Unlike chroot, the FreeBSD implementation also isolates processes and their activities within a particular view of the filesystem.
Linux-VServer was introduced in 2001 as a kernel patch that adds operating system-level virtualization capabilities to Linux. It combines a chroot-like mechanism with “security contexts” and operating system-level virtualization (containerization) to provide a virtualization solution. It is more advanced than a simple chroot and lets you run multiple Linux distributions on a single host as virtual private servers (VPS).
In February 2004, Sun Microsystems (acquired by Oracle in 2010) introduced Solaris Containers, a comparable OS-level virtualization technology for x86 and SPARC processors.
SPARC is a RISC (reduced instruction set computing) architecture developed by Sun Microsystems.
A Solaris Container is a combination of system resource controls and the boundary separation provided by “zones”.
OpenVZ followed, with a first release in 2005. Like Linux-VServer, OpenVZ uses OS-level virtualization, and many hosting companies adopted it to isolate and sell VPSs. OS-level virtualization has limits: containers share the host's architecture and kernel version, which is a disadvantage when guests require a different kernel version than the host's.
Linux-VServer and OpenVZ require patching the kernel to add the control mechanisms used to create an isolated container. The OpenVZ patches were never integrated into the mainline kernel.
In 2007, Google engineers contributed cgroups (control groups), a mechanism that limits and isolates the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes. Unlike the OpenVZ patches, cgroups were merged into the mainline Linux kernel, in version 2.6.24 (released in January 2008).
In 2008, the first version of LXC (Linux Containers) was released. LXC is similar to OpenVZ, Solaris Containers, and Linux-VServer, but it relies on cgroups, which are already part of the Linux kernel. Cloud Foundry then created Warden in 2011, an API to manage isolated, ephemeral, and resource-controlled environments. In its first versions, Warden used LXC.
In 2013, the first version of Docker was introduced. Like OpenVZ and Solaris Containers, it performs operating-system-level virtualization.
Also in 2013, Google introduced LMCTFY (“Let Me Contain That For You”), the open source version of Google’s container stack, which provides Linux application containers. Google engineers then collaborated with Docker on libcontainer, porting the core concepts and abstractions of LMCTFY to it. As a result, LMCTFY is no longer actively developed, and its core was expected to be replaced by libcontainer.
LMCTFY runs applications in isolated environments on the same kernel, without patching it, since it uses cgroups, namespaces, and other Linux kernel features.
Google is a leader in the container industry. Everything at Google runs in containers, and the company has said it starts more than 2 billion containers every week.
In December 2014, CoreOS released and started to support rkt (initially released as Rocket) as an alternative to Docker.
Isolation and resource control are the common goals behind jails, zones, VPSs, VMs, and containers, but each technology achieves them in a different way and has its own limits and advantages.
So far, we have briefly seen how a jail works and how Linux-VServer runs isolated user spaces, in which programs run directly on the host operating system’s kernel but have access to only a restricted subset of its resources.
Linux-VServer runs “Virtual Private Servers”, and the host kernel must be patched to use it. (Consider VPS the commercial name.)
Solaris containers are called Zones.
A “Virtual Machine” is a generic term for an emulated machine running on top of a “real hardware machine”. The term was originally defined by Popek and Goldberg as an efficient, isolated duplicate of a real computer machine.
Virtual machines can be either “System Virtual Machines” or “Process Virtual Machines”. In everyday use, “VM” usually means a system virtual machine, which emulates hardware in order to run an entire operating system. A “Process Virtual Machine”, sometimes called an “Application Virtual Machine”, instead emulates the programming environment needed to execute an individual process; the Java Virtual Machine is an example.
OS-level virtualization is also called containerization. Technologies like Linux-VServer and OpenVZ can run multiple isolated operating system environments that share the same architecture and kernel version.
Sharing the same architecture and kernel is a limitation in situations where guests require a different kernel version than the host's.
System containers (e.g. LXC) offer an environment as close as possible to the one you’d get from a VM, but without the overhead that comes with running a separate kernel and simulating all the hardware.
OS-level virtualization helps us create containers. Technologies like LXC and Docker use this type of isolation. We have two types of containers here:
- OS containers, where the operating system is packaged with the whole application stack (for example, LEMP).
- Application containers, which usually run a single process per container.
In the case of application containers, we would need 3 containers to create a LEMP stack:
- PHP server (or PHP-FPM).
- Web server (Nginx).
- Database server (MySQL).
Short answer: Both.
When Docker started, it used LXC as its container runtime. The idea was to create an API to manage the container runtime, isolate single processes running applications, and supervise the container life cycle and the resources it uses.
In early 2013, the goal of the Docker project was to build a “standard container”, as described in its manifesto.
The standard container manifesto has since been removed.
Docker started by building a monolithic application with multiple features, from launching cloud servers to building and running images and containers.
Docker later replaced LXC with its own library, “libcontainer”, to interface with Linux kernel facilities like control groups and namespaces.
I am using Ubuntu in this example, but the steps should be the same for most distros. Start by installing the cgroup tools and the stress utility, as we are going to run some stress tests.
sudo apt install cgroup-tools
sudo apt install stress
This command starts bash in a new execution context, with its own PID namespace and a freshly mounted /proc:
sudo unshare --fork --pid --mount-proc bash
The `unshare` command disassociates parts of the process execution context. From the unshare(2) manual page:
unshare() allows a process (or thread) to disassociate parts of its execution context that are currently being shared with other processes (or threads). Part of the execution context, such as the mount namespace, is shared implicitly when a new process is created using fork(2) or vfork(2), while other parts, such as virtual memory, may be shared by explicit request when creating a process or thread using clone(2).
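To see what `unshare` actually changed, one option is to compare the namespace identifiers that the kernel exposes under /proc. The sketch below assumes a Linux system with /proc mounted; `show_namespaces` is a helper name made up for this illustration.

```shell
# Hypothetical helper: print the namespace identifiers of a process.
# Defaults to the current shell. Two processes share a namespace
# when the identifiers printed for it are identical.
show_namespaces() (
    pid="${1:-$$}"
    for ns in /proc/"$pid"/ns/*; do
        # Skip the unexpanded glob if the directory does not exist.
        [ -e "$ns" ] || [ -L "$ns" ] || continue
        # Each entry is a symlink such as "pid:[4026531836]".
        printf '%s -> %s\n' "${ns##*/}" "$(readlink "$ns")"
    done
)
```

Running `show_namespaces` on the host and again inside the shell started by `unshare` should show different identifiers for the `pid` and `mnt` entries, while the other namespaces stay shared.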
Using cgcreate, we can create a control group and attach two controllers to it, one for memory and the other for CPU.
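The paths used in this walkthrough (`/sys/fs/cgroup/memory/...`) belong to cgroup v1. Newer distributions often mount the unified cgroup v2 hierarchy instead, so it can help to check which one is in use before creating the group. A minimal sketch, assuming cgroup-tools is installed: `cgroup_version` and `create_mygroup` are illustrative names, and the `cgcreate` invocation is one plausible form of this step, not necessarily the exact command used here.

```shell
# Report which cgroup hierarchy is mounted.
# Optional first argument: alternative mount point (default /sys/fs/cgroup).
cgroup_version() (
    root="${1:-/sys/fs/cgroup}"
    if [ -f "$root/cgroup.controllers" ]; then
        echo v2      # unified hierarchy
    elif [ -d "$root/memory" ]; then
        echo v1      # one directory per controller
    else
        echo none
    fi
)

# Assumed form of the creation step: a group named "mygroup" with the
# memory and cpu controllers. Needs root; defined but not called here.
create_mygroup() {
    cgcreate -g memory,cpu:mygroup
}
```

On a v1 system, calling `create_mygroup` as root should make a `mygroup` directory appear under both the memory and the cpu controller directories.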
The next step is to define a memory limit and activate it:
echo 3000000 > /sys/fs/cgroup/memory/mygroup/memory.kmem.limit_in_bytes
cgexec -g memory:mygroup bash
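Note that `memory.kmem.limit_in_bytes` caps kernel memory and exists only on cgroup v1. To cap the memory used by the processes themselves, the usual files are `memory.limit_in_bytes` on v1 and `memory.max` on v2. The helper below is a sketch that prints the file to write to; `memory_limit_file` is an invented name, and the v2 path assumes the group sits directly under `/sys/fs/cgroup`.

```shell
# Print the file that holds the user-memory limit for a cgroup.
#   $1: cgroup version ("v1" or "v2")
#   $2: group name, e.g. "mygroup"
memory_limit_file() {
    case "$1" in
        v1) echo "/sys/fs/cgroup/memory/$2/memory.limit_in_bytes" ;;
        v2) echo "/sys/fs/cgroup/$2/memory.max" ;;
        *)  echo "unknown cgroup version: $1" >&2; return 1 ;;
    esac
}
```

As root, a limit could then be set with something like `echo 3000000 > "$(memory_limit_file v1 mygroup)"`.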
Now let’s stress the isolated namespace we created, with the memory limit in place.
stress --vm 1 --vm-bytes 1G --timeout 10s
We can see that the execution fails, which tells us the memory limit is working.
If we do the same thing on the host machine (no limitation, on a machine with 16 GB of RAM), the test will not fail unless you really do not have enough free memory.
Following these steps helps in understanding how Linux facilities like cgroups and other resource control features can create and manage isolated environments on Linux systems.
libcontainer interfaces with these facilities to manage and run Docker containers.
In 2015, Docker announced runC: a lightweight, portable container runtime.
runC is basically a little command-line tool to leverage libcontainer directly, without going through the Docker Engine.
The goal of runC is to make standard containers available everywhere.
This project was donated to the Open Container Initiative (OCI).
The libcontainer repository has since been archived.
In reality, libcontainer was not abandoned; it was moved into the runC repository.
Let’s move to the practical part and create a container using runC.
Start by installing the runC runtime:
Let’s create a directory (/mycontainer) into which we are going to export the content of the BusyBox image.
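One way to carry out these steps is sketched below, following the bundle layout runC expects (a `rootfs` directory next to `config.json`). It assumes Docker and runC are already installed (on Ubuntu, runC is typically packaged as `runc`); the function names are illustrative.

```shell
# Create the bundle directory with an empty rootfs inside it.
prepare_bundle_dir() {
    mkdir -p "$1/rootfs"
}

# Export the filesystem of a BusyBox container into the bundle's rootfs.
# Requires a running Docker daemon.
export_busybox_rootfs() {
    docker export "$(docker create busybox)" | tar -C "$1/rootfs" -xf -
}

# Generate a default config.json inside the bundle.
generate_spec() (
    cd "$1" && runc spec
)
```

Calling the three functions in order with `/mycontainer` as the argument (as a user allowed to write there), then running `sudo runc run mycontainerid` from inside that directory, should start a shell in the BusyBox container.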
Coming in somewhere between 1 and 5 Mb in on-disk size (depending on the variant), BusyBox is a very good ingredient to craft space-efficient distributions.
BusyBox combines tiny versions of many common UNIX utilities into a single small executable. It provides replacements for most of the utilities you usually find in GNU fileutils, shellutils, etc. The utilities in BusyBox generally have fewer options than their full-featured GNU cousins; however, the options that are included provide the expected functionality and behave very much like their GNU counterparts. BusyBox provides a fairly complete environment for any small or embedded system.
source: Docker Hub.
Using the runc command, we can run the BusyBox container, which uses the extracted image and a spec file (config.json).
The `runc spec` command creates this JSON file initially:
An alternative for generating a customized spec is “oci-runtime-tool”. Its sub-command “oci-runtime-tool generate” has many options for customization.
For more information see runtime-tools.
Using the generated specification JSON file, you can customize the runtime of the container. We can, for example, change the arguments of the application to execute.
Let’s view the change between the original config.json file and the new one:
Let’s now run the container again and notice how it sleeps for 10 seconds before exiting.
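The edit to config.json can also be scripted. The sketch below assumes the default file produced by `runc spec`, where the only process argument is `"sh"`, and rewrites that entry with `sed`. `set_sleep_args` is a made-up helper; for anything beyond this simple case, a JSON-aware tool would be a safer choice.

```shell
# Replace the default "sh" argument in a runC config.json with
# "sleep", "<seconds>". The result goes through a temporary file so
# the original is only overwritten when sed succeeds.
set_sleep_args() (
    config="$1"
    seconds="${2:-10}"
    tmp="$(mktemp)" || exit 1
    sed "s/\"sh\"/\"sleep\", \"$seconds\"/" "$config" > "$tmp" &&
        cat "$tmp" > "$config"
    status=$?
    rm -f "$tmp"
    exit "$status"
)
```

For example, `set_sleep_args /mycontainer/config.json 10` should turn the `args` array into `"sleep", "10"`.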
Since containers became mainstream, the different actors in the container ecosystem have been working on standardization.
Standardization is a key to automation and generalization of best practices.
While giving the runC project to the OCI, Docker started using containerd in 2016 as a container runtime that interfaces with the underlying low-level runtime, runC.
Containerd has full support for starting OCI bundles and managing their lifecycle. Containerd (as well as other runtimes like cri-o) uses runC to run containers but also implements higher-level features like image management and high-level APIs.
runC is built on libcontainer, which is the same container library that previously powered the Docker engine.
Prior to version 1.11, the Docker engine was used to manage volumes, networks, containers, images, etc.
Now, the Docker architecture is broken into four components:
- Docker engine,
- containerd,
- containerd-shim,
- and runC.
The binaries are respectively called docker, docker-containerd, docker-containerd-shim, and docker-runc.
Let’s enumerate the steps to run a container using the new architecture of Docker:
- Docker engine creates the container (from an image) and passes it to containerd.
- Containerd calls containerd-shim
- Containerd-shim uses runC to run the container
- Containerd-shim allows the runtime (runC in this case) to exit after it starts the container
Using this new architecture, we can run “daemon-less containers”, which gives us two advantages:
- runC can exit after starting the container, so no runtime process has to keep running for the container's lifetime.
- containerd-shim keeps the file descriptors like stdin, stdout and stderr open even when Docker and/or containerd die.
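One way to observe this chain on a running host is to look for the runtime processes in the process table. This is a rough sketch: `filter_runtime_procs` is a hypothetical helper that reads `ps` output on standard input and keeps the lines whose command mentions dockerd, containerd, or runc. Binary names vary between Docker versions, so the pattern is an assumption.

```shell
# Keep only process-table lines that belong to the Docker runtime stack.
# Expects lines of the form "<pid> <ppid> <command ...>" on stdin.
filter_runtime_procs() {
    awk '$3 ~ /dockerd|containerd|runc/ { print }'
}

# List matching processes on the current machine.
show_runtime_procs() {
    ps -e -o pid= -o ppid= -o args= | filter_runtime_procs
}
```

With a container running, `show_runtime_procs` should list the daemon, containerd, and one shim per container, and normally no long-lived runc process, which matches the point that runC exits once the container has started.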
This is probably one of the most frequently asked questions. Once you understand why Docker split its architecture into runC and containerd, you realize that both are runtimes.
If you have been following the story from the beginning, you have probably noticed the terms high-level and low-level runtime. This is the practical difference between the two.
Both can be called runtimes, but each has a different purpose and different features. To keep the container ecosystem standardized, a low-level container runtime only runs containers.
A low-level runtime (like runC) should be light and fast, and it should not conflict with the higher levels of container management.
When you create a Docker container, it is in reality managed by both runtimes, containerd and runC.
You can find many container runtimes. Some of them follow the OCI standards and others do not; some are low-level runtimes, while others are more than just runtimes and implement a tooling layer to manage the lifecycle of containers and more:
- image transfer and storage,
- container execution and supervision,
- low-level storage,
- network attachments,
We can add a new runtime to Docker by executing:
sudo dockerd --add-runtime=<runtime-name>=<runtime-path>
For example, to install and register the NVIDIA container runtime:
sudo apt-get install nvidia-container-runtime
sudo dockerd --add-runtime=nvidia=/usr/bin/nvidia-container-runtime
Kubernetes is one of the most popular orchestration systems. With the growing number of container runtimes, Kubernetes aims to be more extensible and to interface with container runtimes other than Docker.
Originally, Kubernetes used the Docker runtime to run containers, and it remained the default for years; the built-in Docker integration (dockershim) was removed in Kubernetes 1.24.
However, CoreOS wanted to use Kubernetes with the rkt runtime and offered patches to Kubernetes to use it as an alternative to Docker.
Instead of changing the Kubernetes code base every time a new container runtime is added, the Kubernetes upstream decided to create the CRI, or Container Runtime Interface, a set of APIs and libraries that allows running different container runtimes in Kubernetes.
Any interaction between the Kubernetes core and a supported runtime is performed through the CRI API.
These are some of the CRI plugins:
CRI-O:
CRI-O is the first container runtime created for the Kubernetes CRI interface. It is not intended to replace Docker, but it can be used instead of the Docker runtime in the specific context of Kubernetes.
Containerd CRI:
With cri-containerd, users can run Kubernetes clusters using containerd as the underlying runtime without Docker installed.
gVisor is a project developed by Google that implements around 200 Linux system calls in userspace, for additional security compared to Docker containers, which run directly on top of the Linux kernel and are isolated with namespaces.
Google Cloud App Engine uses gVisor to isolate customers from one another.
The gVisor runtime integrates with Docker and Kubernetes, making it simple to run sandboxed containers.
CRI-O Kata Containers
Kata Containers is an open source project building lightweight virtual machines that plug into the container ecosystem. CRI-O Kata Containers allows running Kata Containers on Kubernetes instead of Docker's default runtime.
The project of building a single monolithic Docker platform has been more or less abandoned and gave birth to the Moby project, in which Docker is composed of many components like runC.
Moby is a project to organize and modularize the development of Docker.
It is an ecosystem of development and production. Regular users of Docker will notice no change.
Moby helps in developing and running Docker CE and EE (Moby is Docker upstream) as well as creating a development and production environment for other runtimes and platforms.
As we have seen, Docker donated runC to the Open Container Initiative (OCI), but what is this initiative?
The OCI is a lightweight, open governance structure launched in 2015 by Docker, CoreOS, and other leaders in the container industry.
The Open Container Initiative (OCI) aims to establish common standards for software containers in order to avoid potential fragmentation and divisions inside the container ecosystem.
It contains two specifications:
- runtime-spec: The runtime specification
- image-spec: The image specification
Thanks to these specifications, a container using a different runtime can be used through the Docker API, and a container created using Docker should run with any other compliant engine.