Safe by Default: Building a Minimal, Rootless Sandbox on Linux

Trevor Woollacott on 2025-09-08

People often reach for Docker, Podman or even a virtual machine when they need isolation. But what if you don’t need to ship a whole operating system? What if you just want to run a single command, like ffmpeg on a user-uploaded video, pdftotext on an unknown PDF, or a third-party analysis tool in CI, with strong guardrails?

You don’t always need a container; a sandbox may do.

A container is a full package: a root filesystem, libraries, and your application, all bundled up. It’s like a self-contained apartment. A sandbox is a set of kernel fences around a single process running on your existing system.

We’ll build a rootless sandbox around a command using the isolation features Linux already gives you. Then I’ll show even simpler, one-liner ways to get 90% of the value with zero code.

Why Bother with a Sandbox? The Limits of Prevention

The core idea here is defense in depth. No single defense is perfect.

Sanitize your inputs, validate images, escape strings. That’s the outer wall of your defense.

But what happens when that wall is breached?

The Fallacy of the Perfect Sanitizer

Relying only on input sanitization is like believing your front-door lock is unpickable. Reality check:

  1. Unknown Unknowns (Zero-Days): Before disclosure, nobody was “sanitizing for” Log4Shell. A malicious log line could trigger code execution. You can’t filter what you don’t yet know exists.
  2. Complexity Bypasses: Formats like ZIP, PDF, video containers, and image codecs are labyrinths of edge cases. A file can pass “valid PNG” checks but still hit an image-parser edge case. The ZIP format is no exception: its spec includes local file headers and a central directory (which can disagree), ZIP64 for >4 GB files, encryption, and nesting, all of which make reliable sanitization hard. Early ZIP parsing ambiguity (e.g., CVE-2003-1154) showed how malformed ZIPs could bypass scanners; since then, ZIP parser discrepancies and path-traversal issues have continued to surface. And decompression bombs like the infamous 42.zip (kilobytes in, petabytes out) can overwhelm systems without strict resource limits.
  3. Bugs in the Sanitizer: Even carefully written regexes can be driven into catastrophic backtracking (ReDoS), and Unicode oddities such as normalization and look-alike characters can slip past your filters.
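
The “strict resource limits” mentioned in item 2 can be as simple as the shell’s ulimit builtin. The sketch below is a minimal illustration and not part of the sandbox built later; run_capped is a hypothetical helper name and the cap of 8 blocks is arbitrary.

```bash
#!/usr/bin/env bash
# run_capped <blocks> <cmd> [args...]
# Runs <cmd> in a subshell whose file-size limit (RLIMIT_FSIZE) is lowered
# first, so no single file it writes can grow past the cap.
# Units depend on the shell: bash counts 1024-byte blocks for `ulimit -f`,
# while POSIX sh (and bash in posix mode) counts 512-byte blocks.
run_capped() {
  local blocks="$1"; shift
  (
    ulimit -f "$blocks" || exit 125   # could not set the limit
    exec "$@"
  )
}

# Example (commented out so sourcing this file has no side effects):
# an "extractor" that tries to write ~1 MB is stopped once the file hits the cap.
# run_capped 8 sh -c 'head -c 1000000 /dev/zero > "$1"' _ /tmp/bomb.out
```

The writer is stopped by the kernel rather than by any check in the tool itself, which is the point: the limit holds even if the parser is fooled about the archive’s declared size.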

A sandbox assumes the process is hostile. It’s not about preventing the exploit; it’s about making the exploit useless. If it fires, it hits no network, no sensitive files, no privileged syscalls. The blast radius is tiny.

The Sandbox as a Blast Shield: Three Scenarios

1) Malicious PDF on your web service

2) Compromised dependency in CI/CD

3) Decompression “zip bomb”

Don’t just try to block the punch. Take it inside a padded room.

The Isolation Layers You’ll Need

Namespaces (the walls):

Security latches (the rules):

A From-Scratch Demo That Works

Requirements
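
A quick way to check prerequisites before building anything is a small preflight script. The sketch below assumes the requirements implied by the code in this post: a kernel with Landlock (5.13 or newer, and listed among the active LSMs), plus unshare from util-linux and gcc. The helper names (kernel_at_least, lsm_listed, preflight) are made up for illustration, and the script does not check whether unprivileged user namespaces are permitted, which some distributions restrict.

```bash
#!/usr/bin/env bash
# Preflight helpers (hypothetical names). Each takes its input as an argument
# so it can be exercised with sample data as well as the live system.

# kernel_at_least <major> <minor> <release string, e.g. "$(uname -r)">
kernel_at_least() {
  local want_major="$1" want_minor="$2" release="$3" major minor
  major="${release%%.*}"                 # text before the first dot
  minor="${release#*.}"                  # text after the first dot...
  minor="${minor%%[!0-9]*}"              # ...cut at the first non-digit
  [ "$major" -gt "$want_major" ] ||
    { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# lsm_listed <name> <comma-separated list, e.g. contents of /sys/kernel/security/lsm>
lsm_listed() {
  case ",$2," in
    *",$1,"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# preflight: print a warning on stderr for each missing piece.
preflight() {
  kernel_at_least 5 13 "$(uname -r)" ||
    echo "kernel older than 5.13: no Landlock" >&2
  if [ -r /sys/kernel/security/lsm ]; then
    lsm_listed landlock "$(cat /sys/kernel/security/lsm)" ||
      echo "Landlock is not in the active LSM list" >&2
  fi
  command -v unshare >/dev/null 2>&1 || echo "unshare (util-linux) not found" >&2
  command -v gcc >/dev/null 2>&1 || echo "gcc not found" >&2
}
```

Source the file and run preflight; no output means nothing obvious is missing.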

Step 1: Landlock loader (ABI-aware, no extra headers)

I’ve sprinkled comments through the Landlock loader (and the sandbox script) so you can skim the flow without pausing to read headers or man pages. If there’s interest, I’m happy to write a follow-up post that walks line-by-line through the C code and the Bash wrapper, why each flag exists, how the ABI probing works, and where to tweak the policy for your own tools.

landlock-loader.c:

// landlock-loader.c
// Build:   gcc landlock-loader.c -O2 -Wall -o landlock-loader
// Usage:   ./landlock-loader <cmd> [args...]
//
// Notes:
// - Probes Landlock ABI and enables all *known* FS rights up to that ABI.
// - RO trees (e.g., /usr, /bin, /lib*, /etc/ssl/certs) get READ/EXEC (if available).
// - RW trees (e.g., /tmp) get create/remove/write + read (+ truncate if ABI>=3).
// - Sets no_new_privs before restricting self.
// - If Landlock isn’t available, prints a clear error and exits.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/landlock.h>   // Requires a recent kernel UAPI; if missing, install linux-headers
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// ---- syscalls (avoid glibc version issues) ----
static inline int ll_create(const struct landlock_ruleset_attr *a, size_t s, __u32 f) {
    return syscall(__NR_landlock_create_ruleset, a, s, f);
}
static inline int ll_add(int rs, enum landlock_rule_type t, const void *attr, __u32 f) {
    return syscall(__NR_landlock_add_rule, rs, t, attr, f);
}
static inline int ll_restrict(int rs, __u32 f) {
    return syscall(__NR_landlock_restrict_self, rs, f);
}

// ---- helpers ----
static int add_path_rule(int ruleset_fd, const char *path, __u64 access) {
    int pfd = open(path, O_PATH | O_CLOEXEC);
    if (pfd < 0) {
        fprintf(stderr, "landlock: open('%s') failed: %s\n", path, strerror(errno));
        return -1;
    }
    struct landlock_path_beneath_attr pb = {
        .parent_fd = pfd,
        .allowed_access = access
    };
    int rc = ll_add(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH, &pb, 0);
    int saved = errno;
    close(pfd);
    if (rc < 0) {
        fprintf(stderr, "landlock: add_rule('%s') failed: %s\n", path, strerror(saved));
        return -1;
    }
    return 0;
}

// Compose a best-effort handled_access_fs bitmask for the detected ABI.
// We guard additions by ABI version and by feature #ifdef (if headers are older/newer).
static __u64 handled_fs_mask_for_abi(int abi) {
    if (abi < 1) abi = 1;

    __u64 m = 0;

    // ---- ABI v1 rights (Linux 5.13+) ----
    // These should exist in all modern headers.
    m |= LANDLOCK_ACCESS_FS_EXECUTE;
    m |= LANDLOCK_ACCESS_FS_WRITE_FILE;
    m |= LANDLOCK_ACCESS_FS_READ_FILE;
    m |= LANDLOCK_ACCESS_FS_READ_DIR;
    m |= LANDLOCK_ACCESS_FS_REMOVE_DIR;
    m |= LANDLOCK_ACCESS_FS_REMOVE_FILE;
    m |= LANDLOCK_ACCESS_FS_MAKE_CHAR;
    m |= LANDLOCK_ACCESS_FS_MAKE_DIR;
    m |= LANDLOCK_ACCESS_FS_MAKE_REG;
    m |= LANDLOCK_ACCESS_FS_MAKE_SOCK;
    m |= LANDLOCK_ACCESS_FS_MAKE_FIFO;
    m |= LANDLOCK_ACCESS_FS_MAKE_BLOCK;
    m |= LANDLOCK_ACCESS_FS_MAKE_SYM;

    if (abi >= 2) {
#ifdef LANDLOCK_ACCESS_FS_REFER
        m |= LANDLOCK_ACCESS_FS_REFER;      // cross-dir rename/link
#endif
    }
    if (abi >= 3) {
#ifdef LANDLOCK_ACCESS_FS_TRUNCATE
        m |= LANDLOCK_ACCESS_FS_TRUNCATE;   // truncate file content/size
#endif
    }
    // ABI v4 adds TCP network rights (separate from FS), keep FS mask unchanged.
    if (abi >= 5) {
#ifdef LANDLOCK_ACCESS_FS_IOCTL_DEV
        m |= LANDLOCK_ACCESS_FS_IOCTL_DEV;  // device-specific ioctls
#endif
    }
    // Future ABIs: rights not listed here are simply left unhandled (unrestricted).
    // The kernel rejects unknown bits with EINVAL, hence the ABI guards above.

    return m;
}

// Build RO access set: read files/dirs + (optionally) execute, for binary trees.
static __u64 ro_access_for_abi(int abi) {
    __u64 a = LANDLOCK_ACCESS_FS_READ_FILE | LANDLOCK_ACCESS_FS_READ_DIR;
    // EXECUTE has been available since ABI v1
    (void)abi; // silence unused warning; kept for symmetry
    a |= LANDLOCK_ACCESS_FS_EXECUTE;
    return a;
}

// Build RW workdir access: read/write plus create/remove (and truncate if available).
static __u64 rw_access_for_abi(int abi) {
    __u64 a = 0;
    a |= LANDLOCK_ACCESS_FS_READ_FILE | LANDLOCK_ACCESS_FS_READ_DIR;
    a |= LANDLOCK_ACCESS_FS_WRITE_FILE;
    a |= LANDLOCK_ACCESS_FS_MAKE_REG | LANDLOCK_ACCESS_FS_MAKE_DIR;
    a |= LANDLOCK_ACCESS_FS_REMOVE_FILE | LANDLOCK_ACCESS_FS_REMOVE_DIR;
#ifdef LANDLOCK_ACCESS_FS_TRUNCATE
    if (abi >= 3) a |= LANDLOCK_ACCESS_FS_TRUNCATE;
#endif
    return a;
}

static void usage(const char *argv0) {
    fprintf(stderr, "usage: %s <cmd> [args...]\n", argv0);
}

int main(int argc, char **argv) {
    if (argc < 2) { usage(argv[0]); return 1; }

    // Probe ABI version. Fails with ENOSYS if Landlock is not built into the
    // kernel, or EOPNOTSUPP if it is built in but disabled at boot.
    int abi = ll_create(NULL, 0, LANDLOCK_CREATE_RULESET_VERSION);
    if (abi < 0) {
        fprintf(stderr, "landlock: unavailable on this kernel (create_ruleset: %s)\n", strerror(errno));
        return 1;
    }
    if (abi == 0) abi = 1;  // older kernels may return 0; treat as v1
    // Cap to a known max if desired (optional). We just use what's reported.

    // Compose handled mask for this ABI.
    __u64 handled = handled_fs_mask_for_abi(abi);

    struct landlock_ruleset_attr a = { .handled_access_fs = handled };
    int rs = ll_create(&a, sizeof(a), 0);
    if (rs < 0) {
        fprintf(stderr, "landlock: create_ruleset failed: %s\n", strerror(errno));
        return 1;
    }

    // ---- Define your policy here ----
    // Typical dynamic binaries need RO access to system trees and certs:
    const __u64 RO = ro_access_for_abi(abi);
    (void) add_path_rule(rs, "/usr", RO);
    (void) add_path_rule(rs, "/bin", RO);
    (void) add_path_rule(rs, "/lib", RO);
    (void) add_path_rule(rs, "/lib64", RO);
    (void) add_path_rule(rs, "/etc/ssl/certs", RO);
    (void) add_path_rule(rs, "/proc", RO);

    // Writable work area (adjust to your sandbox’s bind-mount):
    const __u64 RW = rw_access_for_abi(abi);
    (void) add_path_rule(rs, "/tmp", RW);

    // Safety latch: prevent privilege gains across execve.
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        fprintf(stderr, "landlock: PR_SET_NO_NEW_PRIVS failed: %s\n", strerror(errno));
        close(rs);
        return 1;
    }

    // Restrict current thread + descendants.
    if (ll_restrict(rs, 0) < 0) {
        fprintf(stderr, "landlock: restrict_self failed: %s\n", strerror(errno));
        close(rs);
        return 1;
    }
    close(rs);

    // Now exec the target under this Landlock domain.
    execvp(argv[1], &argv[1]);
    fprintf(stderr, "landlock: execvp('%s') failed: %s\n", argv[1], strerror(errno));
    return 1;
}

Compile:

gcc landlock-loader.c -O2 -Wall -o landlock-loader

Step 2: Sandbox script

sandbox.sh:

#!/usr/bin/env bash
# sandbox.sh — rootless sandbox wrapper for landlock-loader
# Usage: ./sandbox.sh <cmd> [args...]
set -Eeuo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
LOADER_PATH="${LOADER:-$SCRIPT_DIR/landlock-loader}"

if [[ ! -x "$LOADER_PATH" ]]; then
  echo "ERROR: build landlock-loader first (expected at $LOADER_PATH)" >&2
  exit 1
fi

sandbox_init() {
  set -Eeuo pipefail

  # argv[1] is the absolute path to landlock-loader
  local LOADER_PATH="$1"; shift

  # Defaults resolved inside this shell (avoids 'unbound variable')
  local TMPFS_SIZE="${TMPFS_SIZE:-256M}"
  local USE_CGROUPS="${CGROUP_LIMITS:-0}"
  local CG_MEM_MAX="${CG_MEM_MAX:-512M}"
  local CG_PIDS_MAX="${CG_PIDS_MAX:-64}"

  # Private mount namespace
  mount --make-rprivate /

  # Hardened /proc
  mkdir -p /proc
  umount -l /proc 2>/dev/null || true
  mount -t proc proc /proc -o nosuid,nodev,noexec,hidepid=0

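  # Private /dev: tmpfs, then bind a few safe char devices read-only from host.
  # Caveat: after the /proc remount above, /proc/1 is this shell (PID 1 of the new
  # PID namespace), so /proc/1/root/dev resolves to the tmpfs mounted on the next
  # line rather than the host /dev. Check `ls -l /dev` inside the sandbox; if the
  # nodes are missing, bind them from a bind mount of the host /dev made before
  # this tmpfs covers it.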
  mount -t tmpfs -o mode=0755 tmpfs /dev
  for n in null zero random urandom; do
    if [[ -e "/proc/1/root/dev/$n" ]]; then
      : >"/dev/$n" 2>/dev/null || true
      mount --bind "/proc/1/root/dev/$n" "/dev/$n"
      mount -o remount,ro,bind "/dev/$n"
    fi
  done
  ln -snf /proc/self/fd   /dev/fd
  ln -snf /proc/self/fd/0 /dev/stdin
  ln -snf /proc/self/fd/1 /dev/stdout
  ln -snf /proc/self/fd/2 /dev/stderr

  # Private /tmp (tmpfs)
  [[ -d /tmp ]] || mkdir -p /tmp
  mount -t tmpfs -o "mode=1777,size=${TMPFS_SIZE}" tmpfs /tmp

  # Optional cgroup2 ceilings (best-effort rootless)
  if [[ "$USE_CGROUPS" == "1" && -f /sys/fs/cgroup/cgroup.controllers ]]; then
    local cg="/sys/fs/cgroup/sbx.$$"
    mkdir -p "$cg" 2>/dev/null || true
    { echo $$ >"$cg/cgroup.procs"; } 2>/dev/null || true
    { printf '%s' "$CG_MEM_MAX" >"$cg/memory.max"; } 2>/dev/null || true
    { printf '%s' "$CG_PIDS_MAX" >"$cg/pids.max"; } 2>/dev/null || true
  fi

  # Keep netns isolated; ensure loopback is down (best-effort)
  command -v ip >/dev/null 2>&1 && { ip link set lo down 2>/dev/null || true; }

  # Apply Landlock policy and exec target
  exec "$LOADER_PATH" "$@"
}

# Spawn private namespaces and pass LOADER_PATH explicitly
exec unshare \
  --map-root-user \
  --mount --pid --net --uts --fork \
  /bin/bash -c "$(declare -f sandbox_init); sandbox_init \"$LOADER_PATH\" \"\$@\"" -- "$@"

Make it executable:

chmod +x sandbox.sh
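
The last line of sandbox.sh relies on a Bash idiom that is easy to misread: a function defined in the parent is handed to a fresh shell as text. The sketch below isolates that idiom without any namespaces; greet and run_in_child are illustrative names only.

```bash
#!/usr/bin/env bash
# Demonstrates the `declare -f` hand-off used by sandbox.sh, minus the namespaces.

greet() {
  printf 'child got %d arg(s); first=<%s>\n' "$#" "${1-}"
}

run_in_child() {
  # $(declare -f greet) expands to the function's source text, so the child
  # bash re-defines greet before calling it.
  # The escaped \"\$@\" is expanded by the child, not the parent, which keeps
  # arguments containing spaces intact.
  # The bare `--` becomes $0 of the child, so the real arguments start at $1.
  bash -c "$(declare -f greet); greet \"\$@\"" -- "$@"
}
```

Calling run_in_child 'a b' c prints a line reporting two arguments with the first one still containing its space. sandbox.sh does the same thing with unshare in front of /bin/bash, which is why sandbox_init can run inside the new namespaces without a second script file.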

Step 3: Try it

1) Quick “hello”

./sandbox.sh /bin/sh -lc 'echo hello'
hello

2) Filesystem: can write to /tmp

./sandbox.sh /bin/sh -lc 'echo ok >/tmp/dummy_file && ls -l /tmp'
total 4
-rw-r--r-- 1 0 0 3 Sep 10 11:52 dummy_file

3) Filesystem: /etc/shadow is denied (Landlock)

./sandbox.sh /bin/sh -lc 'cat /etc/shadow'
cat: /etc/shadow: Permission denied

4) Networking: no outbound (isolated netns)

./sandbox.sh /bin/sh -lc 'curl -sS https://example.com || echo "network: blocked"'
network: blocked
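
A failed curl shows the symptom; listing interfaces shows the cause. The helper below is a small sketch (list_ifaces is a hypothetical name) that reads a /proc/net/dev-style table, which always reflects the network namespace of the process reading it.

```bash
#!/usr/bin/env bash
# list_ifaces [file]: print interface names from a /proc/net/dev-style table.
# Defaults to the live /proc/net/dev.
list_ifaces() {
  # The first two lines are column headers; every other line is "  name: counters..."
  awk -F: 'NR > 2 { gsub(/[ \t]/, "", $1); print $1 }' "${1:-/proc/net/dev}"
}
```

On the host this typically prints lo plus your real interfaces. Running the same awk line inside ./sandbox.sh should print only lo, since a freshly unshared network namespace starts with nothing but a loopback device.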

5) PID namespace: your shell is PID 1

./sandbox.sh /bin/sh -lc 'echo "PID is $$"; ps -o pid,comm'
PID is 1
  PID COMMAND
    1 sh
    2 ps
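
Being PID 1 is evidence for the PID namespace only. sandbox.sh also unshares user (via --map-root-user), mount, network and UTS namespaces, and the kernel exposes an identifier for each under /proc/self/ns. The helpers below are a sketch with made-up names (ns_id, ns_report) for comparing those identifiers between host and sandbox.

```bash
#!/usr/bin/env bash
# ns_id <type>: print the kernel's identifier for one of the caller's
# namespaces, in the form "net:[<inode number>]".
ns_id() {
  readlink "/proc/self/ns/$1"
}

# ns_report: one line per namespace type that sandbox.sh unshares,
# suitable for diffing host output against sandbox output.
ns_report() {
  local t
  for t in user mnt pid net uts; do
    printf '%s\n' "$(ns_id "$t")"
  done
}
```

Run ns_report on the host, then run the equivalent loop inside the sandbox, for example ./sandbox.sh /bin/sh -c 'for t in user mnt pid net uts; do readlink /proc/self/ns/$t; done'. All five identifiers should differ from the host’s.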

6) Hardened /proc

./sandbox.sh /bin/sh -lc 'mount | grep " on /proc "'
proc on /proc type proc (nosuid,nodev,noexec,relatime)

7) Minimal /dev (read-only binds)

./sandbox.sh /bin/sh -lc ': > /dev/zero || echo "/dev/zero is read-only (good)"'
/bin/sh: 1: cannot create /dev/zero: Permission denied

Simpler, No-Code (90% of the value, one line)

If you just want results, use tools you already have.

Option A: systemd-run (rootless under --user)

Here’s a single command that gives you namespaces, syscall filters, a read-only root, a private /tmp, and a private network:

systemd-run --user --wait --collect --pty \
  -p NoNewPrivileges=yes \
  -p PrivateUsers=yes \
  -p PrivateTmp=yes \
  -p PrivateNetwork=yes \
  -p ProtectSystem=strict \
  -p ProtectHome=yes \
  -p ProtectKernelTunables=yes \
  -p ProtectControlGroups=yes \
  -p LockPersonality=yes \
  -p MemoryDenyWriteExecute=yes \
  -p 'SystemCallFilter=@system-service @basic-io @file-system' \
  -p 'SystemCallFilter=~@mount @keyring @module @raw-io @reboot' \
  -p ReadWritePaths=/tmp \
  -p 'ReadOnlyPaths=/usr /bin /lib /lib64 /etc/ssl/certs' \
  /usr/bin/busybox sh -c 'cat /etc/shadow'

Running as unit: run-u175.service; invocation ID: ed8783dfc6af4a7fbecdc0189f3f6780
cat: can't open '/etc/shadow': Permission denied
Finished with result: exit-code
Main processes terminated with: code=exited/status=1
Service runtime: 75ms
CPU time consumed: 57ms
Memory peak: 260.0K
Memory swap peak: 0B
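
If you use this often, the long command is easier to maintain as a shell function. The sketch below is one possible shape, with hypothetical names (sandbox_run, DRY_RUN) and only a subset of the properties from the one-liner; extend the props array to match your policy.

```bash
#!/usr/bin/env bash
# sandbox_run <cmd> [args...]: run a command under systemd-run --user with a
# fixed set of sandboxing properties. With DRY_RUN set, print the command
# line instead of running it.
sandbox_run() {
  local props=(
    NoNewPrivileges=yes
    PrivateUsers=yes
    PrivateTmp=yes
    PrivateNetwork=yes
    ProtectSystem=strict
    ProtectHome=yes
  )
  local args=() p
  for p in "${props[@]}"; do
    args+=( -p "$p" )          # one -p flag per property
  done
  if [[ -n "${DRY_RUN:-}" ]]; then
    # %q re-quotes each word so the printed line can be pasted into a shell.
    printf '%q ' systemd-run --user --wait --collect "${args[@]}" "$@"
    printf '\n'
  else
    systemd-run --user --wait --collect "${args[@]}" "$@"
  fi
}
```

DRY_RUN=1 sandbox_run /usr/bin/busybox sh -c 'cat /etc/shadow' prints the full command so you can review the policy before anything executes.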

Option B: bubblewrap (bwrap)

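Great rootless filesystem isolation; pair with seccomp later if needed. Note that --bind /tmp /tmp shares the host’s /tmp with the sandbox; use --tmpfs /tmp instead if you want a private one.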

bwrap \
  --unshare-user --unshare-pid --unshare-net --unshare-uts \
  --ro-bind /usr /usr \
  --ro-bind /bin /bin \
  --ro-bind /lib /lib \
  --ro-bind /lib64 /lib64 \
  --ro-bind /etc/ssl/certs /etc/ssl/certs \
  --dir /tmp --bind /tmp /tmp \
  --dev-bind /dev/null /dev/null \
  --dev-bind /dev/zero /dev/zero \
  --dev-bind /dev/random /dev/random \
  --dev-bind /dev/urandom /dev/urandom \
  --proc /proc \
  /usr/bin/busybox sh -c 'echo hi >/tmp/x && cat /etc/shadow || true'
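
bwrap refuses to start if a bind source does not exist, and not every distribution has all of these directories (/lib64 is a common gap). The sketch below builds the --ro-bind arguments only for paths that are present. ro_bind_args and bwrap_min are hypothetical names, and the wrapper swaps in a private tmpfs for /tmp and bwrap’s own minimal /dev, which differs from the command above.

```bash
#!/usr/bin/env bash
# ro_bind_args <path>...: print "--ro-bind P P" (one token per line) for each
# path that exists on this host, silently skipping the ones that do not.
ro_bind_args() {
  local p
  for p in "$@"; do
    [ -e "$p" ] && printf '%s\n' --ro-bind "$p" "$p"
  done
  return 0
}

# bwrap_min <cmd> [args...]: run a command with read-only system trees,
# a private tmpfs on /tmp, a minimal /dev and a fresh /proc.
bwrap_min() {
  local -a binds
  mapfile -t binds < <(ro_bind_args /usr /bin /lib /lib64 /etc/ssl/certs)
  bwrap \
    --unshare-user --unshare-pid --unshare-net --unshare-uts \
    "${binds[@]}" \
    --tmpfs /tmp --dev /dev --proc /proc \
    --die-with-parent \
    "$@"
}
```

Usage mirrors the one-liner: bwrap_min /usr/bin/busybox sh -c 'echo hi >/tmp/x && cat /etc/shadow || true'.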

Optional: Tiny seccomp allowlist

Once your sandbox works with namespaces and Landlock, you can add a syscall allowlist. Here is an example where we limit the syscalls available to pdftotext:

  1. Record what the tool actually does:

strace -f -o /tmp/syscalls.log pdftotext sample.pdf /tmp/out.txt
cut -d\( -f1 /tmp/syscalls.log | awk '{print $NF}' | sort -u

  2. Start small and allow just the usual suspects for pdftotext (tweak from your own strace):

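// build: gcc -O2 -Wall secf.c -lseccomp -o secf
#include <seccomp.h>
#include <stdio.h>
#include <string.h>   // strerror
#include <unistd.h>   // execvp
int main(int argc, char** argv){
  if (argc < 2) { fprintf(stderr,"usage: %s <cmd> [args]\n", argv[0]); return 1; }
  scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
  if (!ctx) { fprintf(stderr,"seccomp_init failed\n"); return 1; }
  // execve must be allowed: the filter is loaded before execvp() below, so the
  // launcher's own exec of the target is filtered too.
  // newfstatat is what recent glibc uses to implement stat()/fstat().
  int ok[] = { SCMP_SYS(execve),SCMP_SYS(newfstatat),
               SCMP_SYS(read),SCMP_SYS(write),SCMP_SYS(close),SCMP_SYS(exit),SCMP_SYS(exit_group),
               SCMP_SYS(fstat),SCMP_SYS(stat),SCMP_SYS(openat),SCMP_SYS(access),SCMP_SYS(lseek),
               SCMP_SYS(pread64),SCMP_SYS(readlink),SCMP_SYS(getrandom),SCMP_SYS(brk),
               SCMP_SYS(mmap),SCMP_SYS(mprotect),SCMP_SYS(munmap),SCMP_SYS(arch_prctl),
               SCMP_SYS(set_tid_address),SCMP_SYS(set_robust_list),SCMP_SYS(rt_sigaction),
               SCMP_SYS(rt_sigprocmask),SCMP_SYS(prlimit64),SCMP_SYS(clock_gettime),
               SCMP_SYS(getpid),SCMP_SYS(getuid),SCMP_SYS(getgid),SCMP_SYS(geteuid),SCMP_SYS(getegid) };
  for (size_t i=0;i<sizeof(ok)/sizeof(ok[0]);++i)
    if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, ok[i], 0) < 0)  // e.g. syscall absent on this arch
      fprintf(stderr,"warning: could not allow syscall %d\n", ok[i]);
  int rc = seccomp_load(ctx);  // returns a negative errno value on failure
  if (rc < 0) { fprintf(stderr,"seccomp_load: %s\n", strerror(-rc)); return 1; }
  execvp(argv[1], &argv[1]); perror("execvp"); return 1;
}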

Use it like:

./sandbox.sh ./secf pdftotext sample.pdf /tmp/out.txt
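
Typing the allowlist by hand gets tedious once the trace grows. The sketch below turns an strace log into the SCMP_SYS(...) entries that secf.c expects; syscalls_from_strace and scmp_list are hypothetical helper names, and the pattern assumes the log format produced by strace -f -o as used in step 1.

```bash
#!/usr/bin/env bash
# syscalls_from_strace <log>: print the unique syscall names found in an
# `strace -f -o` log. Lines that are not syscall entries (signal notices,
# "+++ exited +++" markers, "<... resumed>" continuations) are skipped.
syscalls_from_strace() {
  # Optional leading PID, then an identifier immediately followed by "(".
  sed -nE 's/^([0-9]+ +)?([a-z_][a-z0-9_]*)\(.*/\2/p' "$1" | sort -u
}

# scmp_list <log>: format those names as a C initializer fragment.
scmp_list() {
  syscalls_from_strace "$1" | sed 's/.*/SCMP_SYS(&),/'
}
```

scmp_list /tmp/syscalls.log prints one SCMP_SYS(name), line per syscall, ready to paste into the ok[] array. Treat the result as a starting point to prune, since a trace of one sample file only covers the code paths that file happened to exercise.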

Tips

When implementing security profiles for your applications, it’s best to start with a foundational set of allowed system calls. If an application crashes, you can investigate the cause by checking the system’s error logs, such as dmesg or the application’s standard error output (stderr), for any blocked system calls. Once identified, you can then deliberately add the necessary call to its allowlist.

To make this tuning process smoother and avoid constant crashes, you can temporarily switch the default rule from terminating the process (SCMP_ACT_KILL_PROCESS) to a non-fatal error (SCMP_ACT_ERRNO(EPERM)). This allows the application to continue running even when it attempts a blocked call, simply receiving an “operation not permitted” error. This approach enables you to identify all required system calls more efficiently as you test the application’s functionality.

It is also important to maintain a separate allowlist for each tool, as their needs will differ. For instance, a simple tool like pdftotext is unlikely to require network access, whereas a multimedia tool like ffmpeg might.
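
When the default action is SCMP_ACT_KILL_PROCESS, the kernel normally writes an audit record (type 1326, shown as SECCOMP by auditd) for each kill, and that record carries the syscall number. The sketch below pulls those numbers out of log text; blocked_syscalls is a hypothetical helper, and whether the records reach dmesg or the audit log depends on how auditing is configured on your system.

```bash
#!/usr/bin/env bash
# blocked_syscalls [file]: read kernel or audit log text (a file, or stdin when
# no file is given) and print the syscall numbers found in seccomp records,
# most frequent first.
blocked_syscalls() {
  grep -E 'type=(1326|SECCOMP)' "${1:--}" |
    grep -oE 'syscall=[0-9]+' |
    cut -d= -f2 |
    sort -n | uniq -c | sort -rn |
    awk '{ print $2 }'
}
```

Typical use is dmesg | blocked_syscalls right after a crash. The numbers are architecture-specific; if the audit userspace tools are installed, ausyscall can translate a number to a name.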

The Big Picture

If you’re shipping an app with its own libraries and OS image, containers are the right tool. But for the very common task of running a single untrusted command on your host, Linux already gives you the parts to build a fast, lightweight sandbox.

Use the from-scratch script when you want full control and understanding. Use systemd-run / bwrap when you just want to get it done, predictably, in one line.

So, is this setup foolproof? No, but it’s a crucial layer of defense-in-depth. A sandbox is a powerful containment strategy. Your defenses are only as strong as the kernel they run on, and a vulnerability there can bypass these fences. Similarly, a small misconfiguration or a change in the sandboxed tool can reopen attack paths. The goal isn’t perfect, unbreakable security; it’s to shrink an exploit’s blast radius from “catastrophic” down to “a failed job.” And that’s a huge win.