Benchmarking Linux process sandboxing mechanisms
22 October, 2023
Build systems are in the business of computing what are nominally pure functions, but in the messy world of Unix files and processes. If you care about your builds being correct, reliable, reproducible, or secure, you probably would like them to avoid arbitrarily accessing the network or filesystem of the host. Sources of nondeterminism such as time would ideally be controlled as well. Maybe someday we'll perform all builds in a deterministic WebAssembly sandbox, but for now the state of the art is mostly Unix process isolation mechanisms.
Despite having worked on build systems for a good chunk of my career, I realized that I actually had a pretty weak intuition for the performance costs of any kind of sandboxing.
This is a very important question if you would like to design a build system or optimize the use of one. It could impact both what forms of sandboxing are viable, and at what granularity to apply them.
If sandboxing adds a lot of overhead on top of regular process invocations, then maybe you want sandboxing to apply on the level of "projects" in a multi-project build graph (as in Nix[^1]). If, on the other hand, sandboxing adds almost no overhead, then you can use it anywhere that already has a process boundary. Build systems like Bazel[^2] use a minimal form of sandboxing at the level of individual translation units (e.g. object files in C++), but as far as I know there is little precedent for "full" sandboxing at this granularity[^3].
I couldn't find any good reference for the costs of the different sandboxing technologies provided by Linux, so I decided to measure the performance of a few popular tools.
Sandboxing on Linux
In this post I'm only going to look at Linux. There are three major components of a build-grade process sandbox on Linux.
- Namespaces give the sandboxed process a "clean room" environment separated from the user accounts, network, process IDs, etc. of the host system.
- `pivot_root` isolates the filesystem root to a sandbox directory.
- Bind mounts make it possible to virtualize the filesystem tree in the sandbox, so it doesn't need to correspond directly to some on-disk tree externally.
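To make the division of labor concrete, here's a rough sketch of how the three pieces fit together, using the Rust `nix` crate. The crate choice and all of the paths are my own illustration, and a real sandbox needs more steps (writing uid/gid maps for the user namespace, making the new root a mount point, unmounting the old root):

```rust
use nix::mount::{mount, MsFlags};
use nix::sched::{unshare, CloneFlags};
use nix::unistd::{chdir, pivot_root};

fn enter_sandbox(new_root: &str) -> nix::Result<()> {
    // 1. Namespaces: detach from the host's mount table, PIDs, network, users.
    unshare(
        CloneFlags::CLONE_NEWNS
            | CloneFlags::CLONE_NEWPID
            | CloneFlags::CLONE_NEWNET
            | CloneFlags::CLONE_NEWUSER,
    )?;

    // 2. Bind mount: graft a host path into the sandbox tree without
    //    copying anything (both paths here are hypothetical).
    let usr = format!("{new_root}/usr");
    mount(
        Some("/opt/toolchain"),
        usr.as_str(),
        None::<&str>,
        MsFlags::MS_BIND,
        None::<&str>,
    )?;

    // 3. pivot_root: make the sandbox directory the new filesystem root.
    let old = format!("{new_root}/old-root");
    pivot_root(new_root, old.as_str())?;
    chdir("/")
}

fn main() -> nix::Result<()> {
    enter_sandbox("/tmp/sandbox-root") // hypothetical sandbox directory
}
```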
Process sandboxing has become prominent mainly thanks to Linux containers, an ecosystem and movement primarily oriented toward the needs of running backend services. Sadly, adoption in build systems still lags far behind the world of services. Since service startup latency isn't usually a huge concern, it also wouldn't surprise me if the relevant kernel code paths are not as optimized as they could be.
The benchmarks
In each benchmark, the underlying process is `true`, which exits immediately.
The measured time always includes launching the process and waiting for its completion.
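Concretely, the measurement loop might look something like this (the iteration count and the NixOS-style absolute path are assumptions, not taken from the actual benchmark code):

```rust
use std::process::Command;
use std::time::Instant;

fn main() {
    let iters: u32 = 1000; // assumed; enough runs to amortize timer overhead
    let start = Instant::now();
    for _ in 0..iters {
        // Launch the process and wait for its completion, as the benchmarks do.
        Command::new("/run/current-system/sw/bin/true")
            .status()
            .expect("failed to run true");
    }
    println!("{:?} per invocation", start.elapsed() / iters);
}
```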
The following setups are tested:
- No sandboxing, just a regular Unix process
- No sandboxing, but executed under `sh -c`
- Minimal Bubblewrap sandbox (just remounting `/`)
- Full Bubblewrap sandbox with all namespaces unshared
- The `unshare` command with all namespaces unshared
- The Rust `unshare` crate with all namespaces unshared
- `docker run` (with the `alpine` image)
- `podman run` (with the `alpine` image)
Just a regular `true` process gives us the baseline for process invocation latency.
Running under `sh` doesn't tell us anything about sandboxing, but it gives a useful sense of scale, because wrapper shell scripts are extremely common in most build setups.
Bubblewrap / `bwrap` is a very nice low-level sandboxing tool which supports a huge number of isolation features.
Comparing the second configuration against the first should isolate the cost of the namespace setup specifically.
Since Bubblewrap requires running under an explicit filesystem sandbox, I add `unshare` (which just does namespaces) to try to isolate the cost of the bind mount.
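For reference, here's roughly how the two Bubblewrap configurations could be invoked from the harness (my reconstruction; the exact flags and the resolved path are assumptions):

```rust
use std::process::Command;

fn main() -> std::io::Result<()> {
    let true_path = "/run/current-system/sw/bin/true"; // assumed resolved path

    // Minimal sandbox: just remount the host root into the sandbox.
    Command::new("bwrap")
        .args(["--ro-bind", "/", "/", true_path])
        .status()?;

    // Full sandbox: additionally unshare every namespace bwrap supports.
    Command::new("bwrap")
        .args(["--ro-bind", "/", "/", "--unshare-all", true_path])
        .status()?;
    Ok(())
}
```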
In a real build system which natively understands sandboxing, it might make sense to perform the sandbox setup system calls directly in the build system process.
This would incur only one level of process overhead, rather than adding an extra process by using something like `bwrap`.
To simulate this, I test a Rust library which sets up Linux namespaces in-process and then invokes the target process (still `true`).
Since the benchmark script is written in Rust and the measurements are taken inside the process, this avoids the extra process overhead.
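In sketch form, the in-process variant with the `unshare` crate looks something like this (the namespace list mirrors the "all namespaces unshared" configuration; treat the details as an assumption rather than the actual benchmark code):

```rust
use unshare::{Command, Namespace};

fn main() {
    // Spawn `true` directly from the calling process, unsharing namespaces
    // in the child without an intermediate wrapper process.
    let status = Command::new("/run/current-system/sw/bin/true")
        .unshare(&[
            Namespace::Mount,
            Namespace::Uts,
            Namespace::Ipc,
            Namespace::User,
            Namespace::Pid,
            Namespace::Net,
        ])
        .status()
        .expect("failed to spawn sandboxed process");
    assert!(status.success());
}
```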
For completeness, I test Docker and Podman since they provide most of the isolation features relevant to a build system (plus a lot more features relevant to services).
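The container invocations presumably look something like the following; `--rm` is my addition to avoid accumulating stopped containers, and the exact flags are an assumption:

```rust
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Run `true` inside a fresh Alpine container under each runtime.
    for runtime in ["docker", "podman"] {
        Command::new(runtime)
            .args(["run", "--rm", "alpine", "true"])
            .status()?;
    }
    Ok(())
}
```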
A note on `$PATH` resolution
I ran the benchmarks in a Nix shell, which sets a very long `$PATH` list, so I (correctly) guessed that searching for `true` by name might affect the results.
Except for the container tests (where this shouldn't be as much of an issue with the `alpine` image), each mechanism invokes `true` through a resolved path.
While I'm not super interested in the overhead incurred by long `$PATH` searches, I also added a benchmark which just invokes `true` as is.
Like `sh -c`, this can help give a useful sense of scale.
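The up-front resolution can be as simple as walking `$PATH` once and reusing the hit (a sketch; the `resolve` helper is hypothetical):

```rust
use std::env;
use std::path::PathBuf;

// Walk each $PATH entry once and return the first match, so later
// invocations can use the absolute path directly.
fn resolve(name: &str) -> Option<PathBuf> {
    let paths = env::var_os("PATH")?;
    env::split_paths(&paths)
        .map(|dir| dir.join(name))
        .find(|candidate| candidate.is_file())
}

fn main() {
    let true_path = resolve("true").expect("true not found on $PATH");
    println!("{}", true_path.display());
}
```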
Results
My machine is a Ryzen 7 2700X desktop running NixOS 23.05 on ZFS. None of the benchmarks had any significant variance (after warm-up time), so the numbers here are all just averages.
| Mechanism | Total time (ms) |
| --- | --- |
| normal process, resolved | 0.653 |
| normal process, unresolved | 1.067 |
| sh -c | 3.670 |
| bwrap | 2.171 |
| bwrap with namespaces | 2.903 |
| unshare command | 3.110 |
| unshare library | 2.293 |
| docker run | 248 |
| podman run | 402 |
Omitting the container tests for reasons of scale, I've plotted both the total time of each mechanism and the marginal overhead on top of the baseline process invocation time.
The benchmark code can be found on GitHub.
Conclusions
Sandboxing via Linux namespaces adds an overhead of a few times the cost of an unsandboxed process invocation.
However, it's still quite fast in absolute terms, so sandboxing should be used whenever possible in build systems.
In these benchmarks we used `true`, which should have negligible run time, but a real compiler or code generator likely dwarfs the process invocation time, sandboxed or not.
It's important to remember that any process invocation is still about five orders of magnitude slower than a function call (a function call costs on the order of nanoseconds, versus the ~0.65 ms baseline measured above). It would be really cool to design a compiler which can use a build system to cache the compilation of individual functions, for example, but the overhead would be prohibitive.
Bubblewrap
Bubblewrap looks pretty well optimized.
It's wholly superior to the `unshare` command, and its namespacing overhead (relative to the filesystem-only configuration) is significantly less than that of the `unshare` library, even though the library avoids a whole process layer.
This is impressive considering Bubblewrap's focus is security.
Containers
I was quite surprised that Docker and Podman had such high latency. They obviously do a lot more bookkeeping, but I thought they would be maybe an order of magnitude slower at worst. Widespread familiarity with container technology may make it seem like an attractive sandboxing mechanism for builds (and technically it checks most of the right boxes), but the overhead makes it problematic. You certainly could not use it for translation-unit level granularity, but the fact that it's in the realm of human-observable latency makes me want to rule it out even for very coarse granularity.
Shell
I was not surprised to find that adding a shell wrapper has an overhead significantly greater than the theoretical minimum (i.e. the cost of one extra process). However, the fact that it was roughly on par with the sandboxing mechanisms is a useful heuristic: if you wouldn't think twice about wrapping a command in a shell script (very common in build systems), you probably shouldn't care about sandboxing overhead.
`$PATH` resolution
Resolving `true` on each invocation added a very significant overhead.
I really expected filesystem caching in the kernel to take care of this after the initial run, but apparently not.
[^1]: Technically this isn't enforced by Nix itself, but it's the canonical way to use it as defined by nixpkgs.

[^2]: It's always bothered me that Bazel includes "correct" in its motto, when it doesn't even try to stop you from using tools from the host.

[^3]: Bazel has `--experimental_use_hermetic_linux_sandbox`, but it's not widely used and very off the beaten path.