Benchmarking Linux process sandboxing mechanisms
22 October, 2023
Build systems are in the business of computing what are nominally pure functions, but in the messy world of Unix files and processes. If you care about your builds being correct, reliable, reproducible, or secure, you probably would like them to avoid arbitrarily accessing the network or filesystem of the host. Sources of nondeterminism such as time would ideally be controlled as well. Maybe someday we'll perform all builds in a deterministic WebAssembly sandbox, but for now the state of the art is mostly Unix process isolation mechanisms.
Despite having worked on build systems for a good chunk of my career, I realized that I actually had a pretty weak intuition for the performance costs of any kind of sandboxing.
This is a very important question if you would like to design a build system or optimize the use of one. It could impact both what forms of sandboxing are viable, and at what granularity to apply them.
If sandboxing adds a lot of overhead on top of regular process invocations, then maybe you want sandboxing to apply on the level of "projects" in a multi-project build graph (as in Nix[^1]). If, on the other hand, sandboxing adds almost no overhead, then you can use it anywhere that already has a process boundary. Build systems like Bazel[^2] use a minimal form of sandboxing at the level of individual translation units (e.g. object files in C++), but as far as I know there is little precedent for "full" sandboxing at this granularity[^3].
I couldn't find any good reference for the costs of the different sandboxing technologies provided by Linux, so I decided to measure the performance of a few popular tools.
Sandboxing on Linux
In this post I'm only going to look at Linux. There are three major components of a build-grade process sandbox on Linux.
- Namespaces give the sandboxed process a "clean room" environment separated from the user accounts, network, process IDs, etc. of the host system.
- `pivot_root` isolates the filesystem root to a sandbox directory.
- Bind mounts make it possible to virtualize the filesystem tree in the sandbox, so it doesn't need to correspond directly to some on-disk tree externally.
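To make the division of labor concrete, here's a rough sketch of how the three pieces fit together, using the Rust `nix` crate. The crate choice and all of the paths are my own illustration, and a real sandbox needs more steps (writing uid/gid maps for the user namespace, making the new root a mount point, unmounting the old root):

```rust
use nix::mount::{mount, MsFlags};
use nix::sched::{unshare, CloneFlags};
use nix::unistd::{chdir, pivot_root};

fn enter_sandbox(new_root: &str) -> nix::Result<()> {
    // 1. Namespaces: detach from the host's mount table, PIDs, network, users.
    unshare(
        CloneFlags::CLONE_NEWNS
            | CloneFlags::CLONE_NEWPID
            | CloneFlags::CLONE_NEWNET
            | CloneFlags::CLONE_NEWUSER,
    )?;

    // 2. Bind mount: graft a host path into the sandbox tree without
    //    copying anything (both paths here are hypothetical).
    let usr = format!("{new_root}/usr");
    mount(
        Some("/opt/toolchain"),
        usr.as_str(),
        None::<&str>,
        MsFlags::MS_BIND,
        None::<&str>,
    )?;

    // 3. pivot_root: make the sandbox directory the new filesystem root.
    let old = format!("{new_root}/old-root");
    pivot_root(new_root, old.as_str())?;
    chdir("/")
}

fn main() -> nix::Result<()> {
    enter_sandbox("/tmp/sandbox-root") // hypothetical sandbox directory
}
```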
Process sandboxing has become prominent mainly thanks to Linux containers, an ecosystem and movement primarily oriented toward the needs of running backend services. Sadly, adoption in build systems still lags far behind the world of services. Since service startup latency isn't usually a huge concern, it also wouldn't surprise me if the relevant kernel code paths are not as optimized as they could be.
The benchmarks
In each benchmark, the underlying process is `true`, which exits immediately.
The measured time always includes launching the process and waiting for its completion.
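Concretely, the measurement loop might look something like this (the iteration count and the NixOS-style absolute path are assumptions, not taken from the actual benchmark code):

```rust
use std::process::Command;
use std::time::Instant;

fn main() {
    let iters: u32 = 1000; // assumed; enough runs to amortize timer overhead
    let start = Instant::now();
    for _ in 0..iters {
        // Launch the process and wait for its completion, as the benchmarks do.
        Command::new("/run/current-system/sw/bin/true")
            .status()
            .expect("failed to run true");
    }
    println!("{:?} per invocation", start.elapsed() / iters);
}
```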
The following setups are tested:
- No sandboxing, just a regular Unix process
- No sandboxing, but executed under `sh -c`
- Minimal Bubblewrap sandbox (just remounting `/`)
- Full Bubblewrap sandbox with all namespaces unshared
- The `unshare` command with all namespaces unshared
- The Rust `unshare` crate with all namespaces unshared
- `docker run` (with the `alpine` image)
- `podman run` (with the `alpine` image)
Just a regular `true` process gives us the baseline for process invocation latency.
Running under `sh` doesn't tell us anything about sandboxing, but it gives a useful sense of scale, because wrapper shell scripts are extremely common in most build setups.
Bubblewrap / `bwrap` is a very nice low-level sandboxing tool which supports a huge number of isolation features.
Comparing the second configuration against the first should isolate the cost of the namespace setup specifically.
Since Bubblewrap requires running under an explicit filesystem sandbox, I add `unshare` (which just does namespaces) to try to isolate the cost of the bind mount.
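For reference, here's roughly how the two Bubblewrap configurations could be invoked from the harness (my reconstruction; the exact flags and the resolved path are assumptions):

```rust
use std::process::Command;

fn main() -> std::io::Result<()> {
    let true_path = "/run/current-system/sw/bin/true"; // assumed resolved path

    // Minimal sandbox: just remount the host root into the sandbox.
    Command::new("bwrap")
        .args(["--ro-bind", "/", "/", true_path])
        .status()?;

    // Full sandbox: additionally unshare every namespace bwrap supports.
    Command::new("bwrap")
        .args(["--ro-bind", "/", "/", "--unshare-all", true_path])
        .status()?;
    Ok(())
}
```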
In a real build system which natively understands sandboxing, it might make sense to perform the sandbox setup system calls directly in the build system process.
This would incur only one level of process overhead, rather than adding an extra process by using something like `bwrap`.
To simulate this, I test a Rust library which sets up Linux namespaces in-process and then invokes the target process (still `true`).
Since the benchmark script is written in Rust and the measurements are taken inside the process, this avoids the extra process overhead.
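In sketch form, the in-process variant with the `unshare` crate looks something like this (the namespace list mirrors the "all namespaces unshared" configuration; treat the details as an assumption rather than the actual benchmark code):

```rust
use unshare::{Command, Namespace};

fn main() {
    // Spawn `true` directly from the calling process, unsharing namespaces
    // in the child without an intermediate wrapper process.
    let status = Command::new("/run/current-system/sw/bin/true")
        .unshare(&[
            Namespace::Mount,
            Namespace::Uts,
            Namespace::Ipc,
            Namespace::User,
            Namespace::Pid,
            Namespace::Net,
        ])
        .status()
        .expect("failed to spawn sandboxed process");
    assert!(status.success());
}
```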
For completeness, I test Docker and Podman since they provide most of the isolation features relevant to a build system (plus a lot more features relevant to services).
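The container invocations presumably look something like the following; `--rm` is my addition to avoid accumulating stopped containers, and the exact flags are an assumption:

```rust
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Run `true` inside a fresh Alpine container under each runtime.
    for runtime in ["docker", "podman"] {
        Command::new(runtime)
            .args(["run", "--rm", "alpine", "true"])
            .status()?;
    }
    Ok(())
}
```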
A note on `$PATH` resolution
I ran the benchmarks in a Nix shell, which sets a very long `$PATH` list, so I (correctly) guessed that searching for `true` by name might affect the results.
Except for the container tests (where this shouldn't be as much of an issue with the `alpine` image), each mechanism invokes `true` through a resolved path.
While I'm not super interested in the overhead incurred by long `$PATH` searches, I also added a benchmark which just invokes `true` as is.
Like `sh -c`, this can help give a useful sense of scale.
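The up-front resolution can be as simple as walking `$PATH` once and reusing the hit (a sketch; the `resolve` helper is hypothetical):

```rust
use std::env;
use std::path::PathBuf;

// Walk each $PATH entry once and return the first match, so later
// invocations can use the absolute path directly.
fn resolve(name: &str) -> Option<PathBuf> {
    let paths = env::var_os("PATH")?;
    env::split_paths(&paths)
        .map(|dir| dir.join(name))
        .find(|candidate| candidate.is_file())
}

fn main() {
    let true_path = resolve("true").expect("true not found on $PATH");
    println!("{}", true_path.display());
}
```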
Results
My machine is a Ryzen 7 2700X desktop running NixOS 23.05 on ZFS. None of the benchmarks had any significant variance (after warm-up time), so the numbers here are all just averages.
| Mechanism | Total time (ms) |
| --- | --- |
| normal process, resolved | 0.653 |
| normal process, unresolved | 1.067 |
| sh -c | 3.670 |
| bwrap | 2.171 |
| bwrap with namespaces | 2.903 |
| unshare command | 3.110 |
| unshare library | 2.293 |
| docker run | 248 |
| podman run | 402 |
Omitting the container tests for reasons of scale, I've plotted both the total time of each mechanism and the marginal overhead on top of the baseline process invocation time.
The benchmark code can be found on GitHub.
Conclusions
Sandboxing via Linux namespaces adds an overhead of a few times the cost of an unsandboxed process invocation.
However, it's still quite fast in absolute terms, so sandboxing should be used whenever possible in build systems.
In these benchmarks we used `true`, which should have negligible run time, but a real compiler or code generator likely dwarfs the process invocation time, sandboxed or not.
It's important to remember that any process invocation is still about five orders of magnitude slower than a function call (a function call costs on the order of nanoseconds, versus the ~0.65 ms baseline measured above). It would be really cool to design a compiler which can use a build system to cache the compilation of individual functions, for example, but the overhead would be prohibitive.
Bubblewrap
Bubblewrap looks pretty well optimized.
It's wholly superior to the `unshare` command, and its namespacing overhead (relative to the filesystem-only configuration) is significantly less than that of the `unshare` library, even though the library avoids a whole process layer.
This is impressive considering Bubblewrap's focus is security.
Containers
I was quite surprised that Docker and Podman had such high latency. They obviously do a lot more bookkeeping, but I thought they would be maybe an order of magnitude slower at worst. Widespread familiarity with container technology may make it seem like an attractive sandboxing mechanism for builds (and technically it checks most of the right boxes), but the overhead makes it problematic. You certainly could not use it for translation-unit level granularity, but the fact that it's in the realm of human-observable latency makes me want to rule it out even for very coarse granularity.
Shell
I was not surprised to find that adding a shell wrapper has an overhead significantly greater than the theoretical minimum (i.e. the cost of one extra process). However, the fact that it was roughly on par with the sandboxing mechanisms is a useful heuristic: if you wouldn't think twice about wrapping a command in a shell script (very common in build systems), you probably shouldn't care about sandboxing overhead.
`$PATH` resolution
Resolving `true` on each invocation added a very significant overhead.
I really expected filesystem caching in the kernel to take care of this after the initial run, but apparently not.
[^1]: Technically this isn't enforced by Nix itself, but it's the canonical way to use it as defined by nixpkgs.

[^2]: It's always bothered me that Bazel includes "correct" in its motto, when it doesn't even try to stop you from using tools from the host.

[^3]: Bazel has `--experimental_use_hermetic_linux_sandbox`, but it's not widely used and very off the beaten path.