pivot_root: How a Container Gets Its Own Filesystem

Written by:

Igor Gorovyy
DevOps Engineer Lead & Senior Solutions Architect

A container sees its own filesystem, not the host's. How does this work? Through the pivot_root(2) system call, which swaps the root mount of the container's mount namespace. Not chroot -- a privileged process can walk out of that one. pivot_root operates at the mount namespace level, so the host's root can be unmounted and is simply gone from the container's view.

This is part 3 of the Sheep & Shepherd series. Previous parts: namespaces gave the container its own view of the system, and re-exec worked around Go's threading model. Next up: cgroups v2 caps its resources and OverlayFS makes the filesystem cheap to spin up.

The sequence of steps

Before doing pivot_root, I need to prepare the new root filesystem. Here's what happens inside the container init process (set up via the re-exec pattern in part 2):

graph TD
    A["ContainerInit()"] --> B["Set hostname"]
    B --> C["Mount /proc, /sys, /tmp, /dev<br/>inside new rootfs"]
    C --> D["Create /dev/null, /dev/zero,<br/>/dev/random, /dev/urandom, /dev/tty"]
    D --> E["pivotRoot(rootfs)"]
    E --> F["syscall.Exec(target_command)"]

Mounting filesystems

Before pivot_root I mount the required filesystems inside the new root:

mounts := []struct {
    source string
    target string
    fstype string
    flags  uintptr
    data   string
}{
    {"proc", "proc", "proc", 0, ""},
    {"sysfs", "sys", "sysfs", 0, ""},
    {"tmpfs", "tmp", "tmpfs", 0, ""},
    {"tmpfs", "dev", "tmpfs",
        syscall.MS_NOSUID | syscall.MS_STRICTATIME, "mode=755"},
}

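for _, m := range mounts {
    target := filepath.Join(rootfs, m.target)
    // Fail fast: a half-mounted rootfs is worse than no container.
    if err := os.MkdirAll(target, 0755); err != nil {
        log.Fatalf("mkdir %s: %v", target, err)
    }
    if err := syscall.Mount(m.source, target, m.fstype, m.flags, m.data); err != nil {
        log.Fatalf("mount %s on %s: %v", m.fstype, target, err)
    }
}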

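/proc -- needed for ps, top, and other utilities that read process information. /sys -- kernel and device information. /tmp -- temporary files, kept in memory by tmpfs. /dev -- an empty tmpfs that the device nodes below are created in.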

Creating device nodes

The container needs basic devices. Without /dev/null, for example, every redirect to /dev/null fails and a lot of programs won't even start:

func createDevices(rootfs string) {
    devPath := filepath.Join(rootfs, "dev")

    devices := []struct {
        name  string
        major uint32
        minor uint32
        mode  uint32
    }{
        {"null", 1, 3, 0666},
        {"zero", 1, 5, 0666},
        {"random", 1, 8, 0666},
        {"urandom", 1, 9, 0666},
        {"tty", 5, 0, 0666},
    }

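    // mknod applies the umask; clear it so 0666 is applied as written.
    oldMask := unix.Umask(0)
    defer unix.Umask(oldMask)
    for _, d := range devices {
        path := filepath.Join(devPath, d.name)
        dev := unix.Mkdev(d.major, d.minor)
        if err := unix.Mknod(path, syscall.S_IFCHR|d.mode, int(dev)); err != nil {
            log.Printf("mknod %s: %v", path, err)
        }
    }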

    // Symlinks for stdin/stdout/stderr
    os.Symlink("/proc/self/fd", filepath.Join(devPath, "fd"))
    os.Symlink("/proc/self/fd/0", filepath.Join(devPath, "stdin"))
    os.Symlink("/proc/self/fd/1", filepath.Join(devPath, "stdout"))
    os.Symlink("/proc/self/fd/2", filepath.Join(devPath, "stderr"))
}

Mknod creates special device files with specific major/minor numbers. The numbers, not the file name, tell the kernel which driver to use: (1, 3) is the null device, a "black hole" that discards everything written to it and returns end-of-file on every read.
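
The major and minor numbers are packed into a single device number before they reach mknod. A small sketch of that encoding, written out by hand to show what unix.Mkdev computes on Linux; the local mkdev function is only an illustration, in real code you would keep calling unix.Mkdev:

```go
package main

import "fmt"

// mkdev mirrors the Linux device number encoding: the low 12 bits of
// major and the low 8 bits of minor share the low 20 bits, and the
// remaining bits of both are placed in the upper part of the value.
func mkdev(major, minor uint32) uint64 {
	dev := (uint64(major) & 0x00000fff) << 8
	dev |= (uint64(major) & 0xfffff000) << 32
	dev |= (uint64(minor) & 0x000000ff) << 0
	dev |= (uint64(minor) & 0xffffff00) << 12
	return dev
}

func main() {
	fmt.Printf("null: %#x\n", mkdev(1, 3)) // null: 0x103
	fmt.Printf("tty:  %#x\n", mkdev(5, 0)) // tty:  0x500
}
```

For the small numbers used in this article the result is simply major shifted left by 8 bits plus minor.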

The pivot_root itself

Here's the key function:

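func pivotRoot(newRoot string) error {
    putOld := filepath.Join(newRoot, ".pivot_old")
    if err := os.MkdirAll(putOld, 0700); err != nil {
        return fmt.Errorf("mkdir %s: %w", putOld, err)
    }

    // Bind mount newRoot onto itself (pivot_root requirement)
    if err := syscall.Mount(newRoot, newRoot, "",
        syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
        return fmt.Errorf("bind mount %s: %w", newRoot, err)
    }

    // Swap the root
    if err := unix.PivotRoot(newRoot, putOld); err != nil {
        return fmt.Errorf("pivot_root: %w", err)
    }

    // Move into the new root
    if err := os.Chdir("/"); err != nil {
        return fmt.Errorf("chdir /: %w", err)
    }

    // Unmount the old root
    if err := syscall.Unmount("/.pivot_old", syscall.MNT_DETACH); err != nil {
        return fmt.Errorf("unmount old root: %w", err)
    }

    // Remove the mount point. Only reached after a successful unmount,
    // so RemoveAll cannot recurse into the host filesystem.
    if err := os.RemoveAll("/.pivot_old"); err != nil {
        return fmt.Errorf("remove /.pivot_old: %w", err)
    }

    return nil
}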

Step by step:

graph LR
    subgraph "Before pivot_root"
        OLD_ROOT["/ (host FS)"]
        NEW_ROOT["/var/lib/sheep/overlay/abc/merged"]
    end

    subgraph "Bind mount"
        BM["newRoot mounted onto itself"]
    end

    subgraph "After pivot_root"
        REAL_ROOT["/ (container FS)"]
        PIVOT_OLD["/.pivot_old (old host FS)"]
    end

    subgraph "After unmount"
        CLEAN_ROOT["/ (container only)"]
    end

    OLD_ROOT --> BM
    BM --> REAL_ROOT
    REAL_ROOT --> CLEAN_ROOT

Bind mount -- pivot_root requires the new root to be a mount point. Bind-mounting a directory onto itself satisfies this requirement, and MS_REC carries along the /proc, /sys, /tmp, and /dev mounts made earlier.

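PivotRoot -- makes newRoot the root mount of the mount namespace and moves the old root mount to .pivot_old. Every process in the namespace whose root or working directory pointed at the old root is switched to the new one.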

Unmount -- after pivot_root the old FS is still reachable through /.pivot_old. I unmount it with the MNT_DETACH flag (lazy unmount: the mount is detached from the tree immediately and cleaned up once nothing is using it anymore).

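RemoveAll -- remove the now-empty .pivot_old directory. This is only safe once the unmount has succeeded: with the host FS still mounted there, RemoveAll would recurse into it.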
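
The pivot_root(2) man page also describes a variant that needs no temporary directory: change into the new root, call pivot_root(".", "."), then lazily unmount ".". The old root ends up stacked on top of the new one at /, and the unmount removes it. A minimal sketch of that variant, assuming it runs with CAP_SYS_ADMIN inside a fresh mount namespace. pivotRootInPlace is a hypothetical name, and the step that makes / private is my assumption about what the namespace setup may still need; it is not taken from the series' code:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// pivotRootInPlace is a hypothetical variant of pivotRoot that passes
// "." as both new_root and put_old instead of using a .pivot_old dir.
func pivotRootInPlace(newRoot string) error {
	// Check the path first so nothing is mounted for a bad argument.
	if _, err := os.Stat(newRoot); err != nil {
		return fmt.Errorf("new root: %w", err)
	}
	// Assumption: propagation may still be shared. pivot_root rejects
	// shared mounts with EINVAL, and private propagation also keeps the
	// later unmount from reaching the host.
	if err := syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
		return fmt.Errorf("make / private: %w", err)
	}
	// Same requirement as before: newRoot has to be a mount point.
	if err := syscall.Mount(newRoot, newRoot, "", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
		return fmt.Errorf("bind mount %s: %w", newRoot, err)
	}
	if err := os.Chdir(newRoot); err != nil {
		return fmt.Errorf("chdir %s: %w", newRoot, err)
	}
	// The old root is stacked on top of the new root at "/".
	if err := syscall.PivotRoot(".", "."); err != nil {
		return fmt.Errorf("pivot_root: %w", err)
	}
	// "." now resolves to the top of that stack, which is the old root.
	if err := syscall.Unmount(".", syscall.MNT_DETACH); err != nil {
		return fmt.Errorf("unmount old root: %w", err)
	}
	return os.Chdir("/")
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: pivot <rootfs>  (run only inside a new mount namespace)")
		return
	}
	if err := pivotRootInPlace(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, "pivot failed:", err)
		os.Exit(1)
	}
	// A real init would exec the target command here.
}
```

The trade-off is readability: the .pivot_old version makes each stage visible in the mount table, while this one leaves no directory to clean up inside the container's rootfs.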

Why not chroot?

chroot simply changes what the process considers the root directory. But:

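a process with CAP_SYS_CHROOT can escape: it calls chroot again on a subdirectory while its working directory stays outside, then walks up with ../../../ to the real root
chroot doesn't affect already-open file descriptors, so a directory fd opened before the call still points outside the new root
chroot doesn't change the mount table: every host mount stays attached in the mount namespace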

pivot_root operates at the mount namespace level and reliably isolates the filesystem. After the old root is unmounted, no path in the container's mount namespace leads to the host FS, as long as no file descriptor into it was left open. And on top of this root, OverlayFS (part 5) stacks copy-on-write layers so 10 containers can share one base image.

Limitations

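pivot_root requires the new root to be a mount point. If you forget the bind mount, the call fails with EINVAL (or EBUSY when the directory sits directly on the current root mount). EINVAL is also what you get when the mounts involved use shared propagation, so the error alone doesn't tell you which rule you broke, and debugging it takes time. Docker/runc have had these edge cases handled through years of production use.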

Try it yourself

# Start a container and check mount points:
sudo ./sheep run --name pivot-test minimal /bin/sh
# Inside the container:
mount | head -5
ls /     # you only see the container FS

Namespaces isolate but don't limit. Next up -- cgroups v2 for resource limits.

Source code for the series: github.com/igorgorovoy/sheep-shepherd-meadow
