The Firecracker virtual machine monitor

This article brought to you by LWN subscribers

Subscribers to made this article — and everything that surrounds it — possible. If you appreciate our content, please buy a subscription and make the next set of articles possible.

Cloud computing services that run customer code in short-lived processes are often called "serverless". But under the hood, virtual machines (VMs) are usually launched to run that isolated code on demand. The boot times for these VMs can be slow. This is the cause of noticeable start-up latency in a serverless platform like Amazon Web Services (AWS) Lambda. To address the start-up latency, AWS developed Firecracker, a lightweight virtual machine monitor (VMM), which it recently released as open-source software. Firecracker emulates a minimal device model to launch Linux guest VMs more quickly. It's an interesting exploration of improving security and hardware utilization by using a minimal VMM built with almost no legacy emulation.

A pared-down VMM

Firecracker began as a fork of Google's crosvm from ChromeOS. It runs Linux guest VMs using the Kernel-based Virtual Machine (KVM) API and emulates a minimal set of devices. Currently, it supports only Intel processors, but AMD and Arm are planned to follow. In contrast to the QEMU code base of well over one million lines of C, which supports much more than just qemu-kvm, Firecracker is around 50 thousand lines of Rust.

The small size lets Firecracker meet its specifications for minimal overhead and fast startup times. Serverless workloads can be significantly delayed by slow cold boots, so integration tests are used to enforce the specifications. The VMM process starts up in around 12ms on AWS EC2 I3.metal instances. Though this time varies, it stays under 60ms. Once the guest VM is configured, it takes a further 125ms to launch the init process in the guest. Firecracker spawns a thread for each VM vCPU to use via the KVM API along with a separate management thread. The memory overhead of each thread (excluding guest memory) is less than 5MB.

Performance aside, paring down the VMM emulation reduces the attack surface exposed to untrusted guest virtual machines. Though there were notable VM emulation bugs [PDF] before it, qemu-kvm has demonstrated this risk well. Nelson Elhage published a qemu-kvm guest-to-host breakout [PDF] in 2011. It exploited a quirk in the PCI device hotplugging emulation, which would always act on unplug requests for guest hardware devices — even if the device didn't support being unplugged. Back then, Elhage correctly expected more vulnerabilities to come in the KVM user-space emulation. There have been other exploits since then, but perhaps the clearest example of the risk from obsolete device emulation is the vulnerability Jason Geffner discovered in the QEMU virtual floppy disk controller in 2015.

Running Firecracker

Freed from the need to support lots of legacy devices, Firecracker ships as a single static binary linked against the musl C library. Each run of Firecracker is a one-shot launch of a single VM. Firecracker VMs aren't rebooted. The VM either shuts down or ends when its Firecracker process is killed. Re-launching a VM is as simple as killing the Firecracker process and running Firecracker again. Multiple VMs are launched by running separate instances of Firecracker, each running one VM.

Firecracker can be run without arguments. The VM is configured after Firecracker starts via a RESTful API over a Unix socket. The guest kernel, its boot arguments, and the root filesystem are configured over this API. The root filesystem is a raw disk image. Multiple disks can be attached to the VM, but only before the VM is started. These curl commands from the Getting Started guide configure a VM with the provided demo kernel and an Alpine Linux root filesystem image:

 curl --unix-socket /tmp/firecracker.socket -i \ -X PUT 'http://localhost/boot-source' \ -H 'Accept: application/json' \ -H 'Content-Type: application/json' \ -d '{ "kernel_image_path": "./hello-vmlinux.bin", "boot_args": "console=ttyS0 reboot=k panic=1 pci=off" }'

 curl --unix-socket /tmp/firecracker.socket -i \ -X PUT 'http://localhost/drives/rootfs' \ -H 'Accept: application/json' \ -H 'Content-Type: application/json' \ -d '{ "drive_id": "rootfs", "path_on_host": "./hello-rootfs.ext4", "is_root_device": true, "is_read_only": false }'

The configured VM can then be started with a final call:

 curl --unix-socket /tmp/firecracker.socket -i \ -X PUT 'http://localhost/actions' \ -H 'Accept: application/json' \ -H 'Content-Type: application/json' \ -d '{"action_type": "InstanceStart"}'

Each such Firecracker process runs a single KVM instance (a "microVM" in the documentation). The serial console of the guest VM is mapped to Firecracker's standard input/output. Networking can be configured for the guest via a TAP interface on the host. As an example for host-only networking, create a TAP interface on the host with:

 # ip tuntap add dev tap0 mode tap # ip addr add dev tap0 # ip link set tap0 up

Then configure the VM to use the TAP interface before starting the VM:

 curl --unix-socket /tmp/firecracker.socket -i \ -X PUT 'http://localhost/network-interfaces/eth0' \ -H 'Content-Type: application/json' \ -d '{ "iface_id": "eth0", "host_dev_name": "tap0", "state": "Attached" }'

Finally, start the VM and configure networking inside the guest as needed:

 # ip addr add dev eth0 # ip link set eth0 up # ping -c 3

Emulation for networking and block devices uses Virtio. The only other emulated device is an i8042 PS/2 keyboard controller supporting a single key for the guest to request a reset. No BIOS is emulated as the VMM boots a Linux kernel directly, loading the kernel into guest VM memory and starting the vCPU at the kernel's entry point.

The Firecracker demo runs 4000 such microVMs on an AWS EC2 I3.metal instance with 72 vCPUs (on 36 physical cores) and 512 GB of memory. As shown by the demo, Firecracker will gladly oversubscribe host CPU and memory to maximize hardware utilization.

Once a microVM has started, the API supports almost no actions. Unlike a more general purpose VMM, there's intentionally no support for live migration or VM snapshots since serverless workloads are short lived. The main supported action is triggering a block device rescan. This is useful since Firecracker doesn't support hotplugging disks; they need to be attached before the VM starts. If the disk contents are not known at boot, a secondary empty disk can be attached. Later the disk can be resized and filled on the host. A block device rescan will then let the Linux guest pick up the changes to the disk.

The Firecracker VMM can rate-limit its guest VM I/O to contain misbehaving workloads. This limits bytes per second and I/O operations per second on the disk and over the network. Firecracker doesn't enforce this as a static maximum I/O rate per second. Instead, token buckets are used to bound usage. This lets guest VMs do I/O as fast as needed until the token bucket for bytes or operations empties. The buckets are continuously replenished at a fixed rate. The bucket size and replenishment rate are configurable depending on how large of bursts should be allowed in guest VM usage. This particular token bucket implementation also allows for a large initial burst on startup.

Cloud computing services typically provide a metadata HTTP service reachable from inside the VM. Often it's available at the well-known non-routable IP address, like it is for AWS, Google Cloud, and Azure. The metadata HTTP service offers details specific to the cloud provider and the service on which the code is run. Typical examples are the host networking configuration and temporary credentials the virtualized code can use. The Firecracker VMM supports emulating a metadata HTTP service for its guest VM. The VMM handles traffic to the metadata IP itself, rather than via the TAP interface. This is supported by a small user-mode TCP stack and a tiny HTTP server built into Firecracker. The metadata available is entirely configurable.

Deploying Firecracker in production

In 2007 Tavis Ormandy studied [PDF] the security exposure of hosts running hostile virtualized code. He recommended treating VMMs as services that could be compromised. Firecracker's guide for safe production deployment shows what that looks like a decade later.

Being written in Rust mitigates some risk to the Firecracker VMM process from malicious guests. But Firecracker also ships with a separate jailer used to reduce the blast radius of a compromised VMM process. The jailer isolates the VMM in a chroot, in its own namespaces, and imposes a tight seccomp filter. The filter whitelists system calls by number and optionally limits system-call arguments, such as limiting ioctl() commands to the necessary KVM calls. Control groups version 1 are used to prevent PID exhaustion and to prevent workloads sharing CPU cores and NUMA nodes to reduce the likelihood of exploitable side channels.

The recommendations include a list of host security configurations. These are meant to mitigate side channels enabled by CPU features, host kernel features, and recent hardware vulnerabilities.

Possible future as a container runtime

Originally, Firecracker was intended to be a faster way to run serverless workloads while keeping the isolation of VMs, but there are other possible uses for it. An actively developed prototype in Go uses Firecracker as a container runtime. The goal is a drop-in containerd replacement with the needed interfaces to meet the Open Container Initiative (OCI) and Container Network Interface (CNI) standards. Though there are already containerd shims like Kata Containers that can run containers in VMs, Firecracker's unusually pared-down design is hoped to be more lightweight and trustworthy. Currently, each container runs in a single VM, but the project plans to batch multiple containers into single VMs as well.

Commands to manage containers get sent from front-end tools (like ctr). In the prototype's architecture, the runtime passes the commands to an agent inside the Firecracker guest VM using Firecracker's experimental vsock support. Inside the guest VM, the agent in turn proxies the commands to runc to spawn containers. The prototype also implements a snapshotter for creating and restoring container images as disk images.

The initial goal of Firecracker was a faster serverless platform running code isolated in VMs. But Firecracker's use as a container runtime might prove its design more versatile than that. As an open-source project, it's a useful public exploration of what minimal application-specific KVM implementations can achieve when built without the need for legacy emulation.

(Log in to post comments)