# Block device IO engine

For all Firecracker versions prior to v1.0.0, the emulated block device uses a
synchronous IO engine for executing the device requests, based on blocking
system calls.

Firecracker 1.0.0 adds support for an asynchronous block device IO engine.
Support is currently in **developer preview**. See
[this section](#developer-preview-status) for more info.

The `Async` engine leverages [`io_uring`](https://kernel.dk/io_uring.pdf) to
execute requests asynchronously. This yields higher overall throughput by
taking better advantage of the block device hardware, which typically supports
queue depths greater than 1.

The block IO engine is configured via the `PUT /drives` API call (pre-boot
only), with the `io_engine` field taking one of two values:

- `Sync` (default)
- `Async` (in [developer preview](../RELEASE_POLICY.md))

The `Sync` variant is the default, in order to provide backwards compatibility
with older Firecracker versions.

## Example configuration

```bash
curl --unix-socket ${socket} -i \
     -X PUT "http://localhost/drives/rootfs" \
     -H "accept: application/json" \
     -H "Content-Type: application/json" \
     -d "{
             \"drive_id\": \"rootfs\",
             \"path_on_host\": \"${drive_path}\",
             \"is_root_device\": true,
             \"is_read_only\": false,
             \"io_engine\": \"Sync\"
         }"
```
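
To opt into the `Async` engine, only the `io_engine` value changes. The sketch
below wraps the same request in two small helpers; `drive_payload` and
`put_drive` are hypothetical names introduced for illustration, not part of
Firecracker, and the drive is assumed to be a writable, non-root data drive.

```bash
# Hypothetical helper: build the JSON body for PUT /drives/{drive_id}.
# Arguments: drive id, host path of the backing file, IO engine (Sync or Async).
# Note: the arguments are not JSON-escaped, so keep them free of quotes.
drive_payload() {
    printf '{"drive_id": "%s", "path_on_host": "%s", "is_root_device": false, "is_read_only": false, "io_engine": "%s"}' \
        "$1" "$2" "$3"
}

# Hypothetical helper: send the payload to the Firecracker API socket.
# Arguments: socket path, then the same three arguments as drive_payload.
# Must be called before the microVM boots.
put_drive() {
    socket=$1
    shift
    curl --unix-socket "$socket" -i \
         -X PUT "http://localhost/drives/$1" \
         -H "accept: application/json" \
         -H "Content-Type: application/json" \
         -d "$(drive_payload "$@")"
}
```

For example, `put_drive "${socket}" scratch "${drive_path}" Async` would
configure a drive named `scratch` with the `Async` engine.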

## Host requirements

Firecracker requires a minimum host kernel version of 5.10.51 for the `Async`
IO engine.

This requirement is based on the availability of the `io_uring` subsystem, as
well as a couple of features and bugfixes that were added in newer kernel
versions.

If a block device is configured with the `Async` io_engine on a host kernel
older than 5.10.51, the API call returns a 400 Bad Request with a descriptive
error message.
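
Firecracker performs this check itself, but a deployment script may want to
pick the engine before calling the API. The following is a minimal sketch of
such a pre-flight check; `version_at_least` and `supported_io_engine` are
hypothetical helper names. It assumes the version string starts with the
upstream `major.minor.patch` numbers, which is not true for every
distribution kernel.

```bash
# Returns 0 when version $1 is at least version $2, comparing the numeric
# major.minor.patch fields; suffixes such as "-generic" are ignored.
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        split(have, h, ".")
        split(want, w, ".")
        for (i = 1; i <= 3; i++) {
            # Adding 0 keeps only the leading digits of each field.
            if (h[i] + 0 > w[i] + 0) exit 0
            if (h[i] + 0 < w[i] + 0) exit 1
        }
        exit 0
    }'
}

# Prints "Async" if the given kernel version (default: the running kernel)
# meets the 5.10.51 minimum, and "Sync" otherwise.
supported_io_engine() {
    if version_at_least "${1:-$(uname -r)}" 5.10.51; then
        echo "Async"
    else
        echo "Sync"
    fi
}
```

Calling `supported_io_engine` with no arguments checks the running kernel as
reported by `uname -r`.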

## Performance considerations

Performance is closely tied to the host kernel version. The gathered data may
not apply to modified kernels or to kernels newer than 5.10.

### Device creation

When using the `Async` variant, there is added latency on device creation (up
to ~110 ms), caused by the extra io_uring system calls performed by
Firecracker.
This translates to higher latencies on any of these operations:

- API call duration for block device config
- Boot time for VMs started via JSON config files
- Snapshot restore time

For use-cases where the lowest latency on the aforementioned operations is
desired, it is recommended to use the `Sync` IO engine.

### Block IOPS and efficiency

The performance potential of the `Async` engine shows when the block device
backing files are placed on a physical disk that supports efficient parallel
execution of requests, such as an NVMe drive.
It is also recommended to distribute the backing files evenly across the
available drives of a host, to limit contention in high-density scenarios.

Our performance measurements were made on NVMe drives and showed the
following:

For __read__ workloads which operate on data that is not present in the
host page cache, the performance improvement for `Async` is about 1.5x-3x in
overall efficiency (IOPS per CPU load) and up to 30x in total IOPS.

For __write__ workloads, the `Async` engine brings an improvement of about
20-45% in total IOPS but performs worse than the `Sync` engine in total
efficiency (IOPS per CPU load).
This means that while Firecracker achieves higher IOPS, it does so at the cost
of more CPU consumed by the kernel workers. In this case, the VMM CPU load is
also reduced, which should translate into a performance increase in hybrid
workloads (block+net+vsock).

Whether the `Async` engine is a good choice performance-wise depends on the
workloads and on the amount of spare CPU available on a host.
In our NVMe experiments, io_uring always brought performance improvements,
provided that enough CPU resources were available.

It is recommended that users run tests with examples of their expected
workloads and measure efficiency as IOPS per CPU load.
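
As a rough illustration of that metric, the sketch below divides a measured
IOPS figure by a measured CPU load. The helper name `efficiency` is
hypothetical, and the unit of CPU load (for example, percent of one core
summed over the VMM and the kernel workers) is an assumption; what matters is
using the same unit for both engines.

```bash
# efficiency IOPS CPU_LOAD
# Prints IOPS per unit of CPU load with two decimals.
# Fails (exit status 1) if the CPU load is not a positive number.
efficiency() {
    awk -v iops="$1" -v cpu="$2" 'BEGIN {
        if (cpu + 0 <= 0) exit 1
        printf "%.2f\n", iops / cpu
    }'
}

# Usage with placeholder numbers (not measurements):
#   efficiency 120000 150
```

Comparing the value obtained for `Sync` and for `Async` on the same workload
shows which engine does more IO per unit of CPU spent.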

## Developer preview status

View the [release policy](../RELEASE_POLICY.md) for information about developer
preview terminology.

The `Async` io_engine is not yet suitable for production use. It will be made
available for production once Firecracker has support for a host kernel that
implements mitigation mechanisms for the following threats:

### Threat 1: PID exhaustion

The number of io_uring kernel workers assigned to one Firecracker block device
is upper-bounded by:

```
(1 + NUMA_COUNT * min(size_of_ring, 4 * NUMBER_OF_CPUS))
```

This formula is derived from the 5.10 Linux kernel code, while `size_of_ring`
is hardcoded to `128` in Firecracker.

Depending on the number of microVMs that can concurrently live on a host and
the number of block devices configured for each microVM, the kernel PID limit
may be reached, resulting in failure to create any new process.
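
To get a feel for the numbers, the bound can be evaluated for a given host.
This is a sketch that reads the formula as
`1 + NUMA_COUNT * min(size_of_ring, 4 * NUMBER_OF_CPUS)`; the function names
`max_uring_workers` and `host_worker_bound` are hypothetical, and the host-wide
figure assumes every block device reaches its individual bound at once.

```bash
# max_uring_workers NUMA_COUNT NUMBER_OF_CPUS [SIZE_OF_RING]
# Upper bound of io_uring kernel workers for one block device.
# SIZE_OF_RING defaults to 128, the value hardcoded in Firecracker.
max_uring_workers() {
    numa=$1
    cpus=$2
    ring=${3:-128}
    cap=$((4 * cpus))
    # min(size_of_ring, 4 * NUMBER_OF_CPUS)
    if [ "$ring" -lt "$cap" ]; then
        cap=$ring
    fi
    echo $((1 + numa * cap))
}

# host_worker_bound MICROVMS DRIVES_PER_VM NUMA_COUNT NUMBER_OF_CPUS
# Worst-case worker count across all block devices on the host.
host_worker_bound() {
    per_device=$(max_uring_workers "$3" "$4")
    echo $(($1 * $2 * per_device))
}

# Usage, e.g. 100 microVMs with 2 drives each on a 1-node, 16-CPU host:
#   host_worker_bound 100 2 1 16
# The result can be compared against /proc/sys/kernel/pid_max.
```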

Kernels starting with 5.15 expose a configuration option for customising this
upper bound. Once possible, we plan on exposing this in the Firecracker drive
configuration interface.

### Threat 2: worker thread resource consumption

The io_uring kernel workers are spawned in the root cgroup of the system.
They don't inherit the Firecracker cgroup, cannot be moved out of the root
cgroup, and their names don't contain any information about the microVM's PID.
This makes it impossible to attribute a worker to a specific Firecracker VM
and limit the CPU and memory consumption of said workers via cgroups.

Starting with kernel 5.12 (currently unsupported), the Firecracker cgroup is
inherited by the io_uring workers.

### Path to GA

We plan on marking the `Async` engine as production-ready once an LTS Linux
kernel including mitigations for the aforementioned threats is released and
support for it is added in Firecracker.

Read more about Firecracker's [kernel support policy](../kernel-policy.md).