# Host file isolation

firecracker-containerd has a number of host-based components, based on files,
that are accessible to a microVM and need to be isolated.  Isolation means that
one microVM does not have the ability to influence the execution of another
microVM or to gain access to data held by another microVM.  We believe this is
important both for security (one VM tenant cannot influence the execution of
another tenant's VM) and for repeatability (a container runs the same way every
time).

## File components

### Kernel

The kernel is the bootable component of the microVM.  We expect that users of
firecracker-containerd will either use the same kernel for all microVMs, or have
a small number of kernels shared among multiple microVMs.  Importantly, we do
not expect that each microVM will use a unique kernel and we do not want to
require a user of firecracker-containerd to provide separate kernels for each
microVM.

Because the kernel is shared among multiple microVMs, it must not be mutable by
any microVM.

### Root filesystem

Similar to the kernel, the root filesystem contains the user-mode components
that are invoked upon boot of a microVM.  Like the kernel, we expect users of
firecracker-containerd to either use the same root filesystem for all microVMs
or share a small number of root filesystems among multiple microVMs.  Also like
the kernel, we do not want to require a user of firecracker-containerd to
provide separate root filesystems for each microVM.

Because the backing root filesystem is shared among multiple microVMs, it must
not be mutable by any microVM.  MicroVMs that need a mutable root filesystem
would either need firecracker-containerd to provide that facility or need to
use a technique like the one used by "live CD" Linux distributions to allow
some ephemeral mutability.

### Firecracker and Jailer binaries

The Firecracker and Jailer binaries are used in the launch process and runtime
management of a VM.  We expect that every VM will use the same version of
Firecracker and Jailer.  While we do not expect there to be a mechanism by which
a running VM can write to either of these files, we do believe it's important to
enforce that expectation external to Firecracker and Jailer.

## Snapshotter and content store

The snapshotter and content store are technically backed by files, but differ
enough from the components above to warrant a separate document.

## Techniques

File components that need to be shared can be shared in three basic ways:

* Reusing the same backing storage
* Copying (duplicating the content)
* Copy-on-write

There are a variety of techniques to accomplish each of these.

### Reusing storage

* Opening the same file
* Creating a hard link to the same file and opening it
* Creating a symbolic link to the same file and opening it
* Bind-mounting the file elsewhere on the filesystem and opening it
* Opening an existing file descriptor from the `/proc` filesystem

All of these allow for mutation of the original storage, unless protected by
another mechanism layered on top.  These are all very efficient mechanisms; they
do not incur additional latency or storage space.  However, we still need a
mechanism for preventing mutation.
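
To illustrate why reuse alone is not enough, here is a minimal sketch using a
hard link.  The file names (`rootfs.img`, `vm1-rootfs.img`) and both function
names are hypothetical, chosen only for this example.  A write through the
second name is visible through the first because both names refer to the same
inode.

```python
import os
import tempfile


def same_backing_storage(path_a: str, path_b: str) -> bool:
    """Return True if both paths resolve to the same device and inode."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def demo_hard_link_mutation() -> bytes:
    """Write through a hard link and return what the original name now holds."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "rootfs.img")      # hypothetical shared image
        alias = os.path.join(tmp, "vm1-rootfs.img")     # hypothetical per-VM name
        with open(original, "wb") as f:
            f.write(b"pristine")

        os.link(original, alias)                        # second name, same inode
        assert same_backing_storage(original, alias)

        with open(alias, "r+b") as f:                   # write via the alias
            f.write(b"tampered")

        with open(original, "rb") as f:                 # read via the original
            return f.read()
```

The same outcome would be expected with a symbolic link or a bind mount, since
each of them also resolves to the original storage.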

### Copying

Copying ahead of launching a microVM provides assurance that the original file
will not be modified, as the original backing storage is not opened and not
accessible.  However, it has the downsides of being expensive in terms of time
(a latency impact on pulling an image or launching a new microVM) and expensive
in terms of space (scaling linearly with the number of microVMs we launch).

Copies can be performed to local, durable storage, to network-attached storage,
or to memory with a tmpfs or memfd.
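
As a rough sketch of the copying approach, the helper below duplicates a shared
file into a per-microVM directory before launch.  The helper names and the
directory layout are assumptions made for this example, not something
firecracker-containerd defines.

```python
import os
import shutil
import tempfile


def private_copy(source: str, vm_dir: str) -> str:
    """Copy `source` into a per-microVM directory and return the new path."""
    os.makedirs(vm_dir, exist_ok=True)
    dest = os.path.join(vm_dir, os.path.basename(source))
    shutil.copyfile(source, dest)   # independent file with its own inode
    return dest


def demo_copy_isolation() -> bytes:
    """Mutate a microVM's private copy and return the original's content."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "rootfs.img")      # hypothetical shared image
        with open(original, "wb") as f:
            f.write(b"pristine")

        copy = private_copy(original, os.path.join(tmp, "vm1"))
        with open(copy, "r+b") as f:                    # the microVM's writes land here
            f.write(b"tampered")

        with open(original, "rb") as f:                 # the original is unaffected
            return f.read()
```

The cost is visible in the sketch: every launch pays for a full
`shutil.copyfile`, and every microVM holds its own copy of the data.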

### Copy-on-write

Copy-on-write techniques aim for efficiency close to that of reusing storage
while still protecting the original storage from modification.  There are a
variety of copy-on-write solutions:

* Overlay filesystem (file-based copy-on-write storage, copy-up is performed
  when a file is opened for write)
* Devicemapper thin devices (block-based copy-on-write, copy-up is performed
  when a block is written)
* Filesystem-integrated copy-on-write (ZFS, Btrfs, XFS, etc.)

Overlay has broad support as it is integrated with the Linux kernel and
Firecracker already requires a new-enough kernel.  Overlay is also simple to set
up.  We are already using devicemapper as our snapshotter, even though it is
more challenging to set up.  Filesystem-integrated copy-on-write approaches
require a filesystem that supports the feature; we do not currently require a
specific filesystem to run firecracker-containerd.
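
To give a sense of how little setup overlay needs, the sketch below builds the
`mount(8)` command line for an overlay with one read-only lower directory.  The
function name and the example paths are hypothetical.  Actually running the
command requires root, and the upper and work directories must live on the same
filesystem; paths containing commas or colons would need escaping that this
sketch does not attempt.

```python
from typing import List


def overlay_mount_args(lower: str, upper: str, work: str, merged: str) -> List[str]:
    """Build the argv for mounting an overlay filesystem at `merged`.

    lower:  shared, read-only directory (never written by the overlay)
    upper:  per-microVM directory that receives copied-up files
    work:   scratch directory required by overlay, on the same filesystem as upper
    """
    options = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", options, merged]


# Hypothetical layout: one shared lower directory, one upper/work pair per microVM.
args = overlay_mount_args(
    "/srv/images/shared", "/srv/vm1/upper", "/srv/vm1/work", "/srv/vm1/merged"
)
```

The resulting list could be passed to `subprocess.run(args, check=True)` by a
privileged process.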

### Preventing mutation

Linux provides different capabilities for preventing mutation to files:

* POSIX permissions - These are the simplest to understand and rely on the
  kernel for enforcement.  If a file is read-only and the opening UID/GID does
  not have permission to write or to change the permissions, Linux will
  effectively prevent writes.  However, if a process is run as root or has the
  ability to escalate its privileges, it may be able to change the permission
  bitmask and make the file writable.
* Mounting a filesystem as read-only - Similar to POSIX permissions, this relies
  on the Linux kernel and the underlying filesystem implementation for
  enforcement.  If a filesystem is mounted as read-only, Linux will effectively
  prevent writes.  However, if a process is run as root or has the ability to
  escalate its privileges, it may be able to re-mount the filesystem as
  writable.
* Ensuring files are opened with `O_RDONLY` - The `open(2)` syscall takes an
  access mode with which to open the file.  With `O_RDONLY`, the Linux kernel
  rejects attempts to change the file through that file descriptor, whether by
  `write(2)` or by other syscalls.
* File sealing - Supported with memory-backed file descriptors only (memfd),
  sealing enforces stronger restrictions that prevent modification, truncation,
  growth, or further changes to the set of seals.

Firecracker also provides some capabilities:

* Attaching the device as read-only - The emulated device visible to the microVM
  is presented as read-only, and the Firecracker VMM should prevent
  modification.  When attaching read-only, Firecracker opens the backing file as
  `O_RDONLY`, allowing the Linux kernel on the host to also enforce this
  restriction.
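
As an illustration, a read-only drive is requested through the Firecracker API
by setting `is_read_only` in the drive configuration.  The sketch below only
builds the request path and JSON body; the helper name and the example host
path are hypothetical, and sending the request over the API socket is left out.

```python
import json
from typing import Tuple


def read_only_drive_request(
    drive_id: str, path_on_host: str, is_root_device: bool = False
) -> Tuple[str, str]:
    """Return the (path, JSON body) for a PUT that attaches a read-only drive."""
    body = {
        "drive_id": drive_id,
        "path_on_host": path_on_host,
        "is_root_device": is_root_device,
        "is_read_only": True,   # the setting this section is about
    }
    return f"/drives/{drive_id}", json.dumps(body)


# Hypothetical shared root filesystem image on the host.
path, body = read_only_drive_request(
    "rootfs", "/srv/images/rootfs.img", is_root_device=True
)
```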

## Recommendation

For the initial implementation, I recommend that we use POSIX permissions and
audit `open(2)` calls to ensure that shared files are opened with `O_RDONLY`.
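
One possible shape for such an audit, offered as a sketch rather than a
settled design, is to check the access mode of already-open descriptors with
`fcntl(2)` and `F_GETFL`.  Both function names are hypothetical.

```python
import fcntl
import os
from typing import Iterable, List


def is_read_only_fd(fd: int) -> bool:
    """Return True if `fd` was opened with the O_RDONLY access mode."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    return (flags & os.O_ACCMODE) == os.O_RDONLY


def writable_fds(fds: Iterable[int]) -> List[int]:
    """Return the descriptors that fail the read-only audit."""
    return [fd for fd in fds if not is_read_only_fd(fd)]
```

A descriptor that passes this check cannot be used to write: `write(2)` on it
fails with `EBADF` regardless of the file's permission bits.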

If for some reason we decide later that this is insufficient, we can look at a
copy-on-write technique.  I lean toward using overlay filesystems due to their
wide availability and support.