pivot_root: How a Container Gets Its Own Filesystem

Written by:

Igor Gorovyy
DevOps Engineer Lead & Senior Solutions Architect

Stanislav Gorovyy
HBO-ICT Student at Hogeschool Saxion | Software Engineer


A container sees its own filesystem, not the host's. How does this work? Through the pivot_root(2) system call, which swaps the root mount of the calling process's mount namespace. Not chroot -- a privileged process can escape that one. pivot_root operates at the mount namespace level, so once the old root is unmounted the host filesystem is no longer part of the container's mount tree.

The sequence of steps

Before doing pivot_root, we need to prepare the new root filesystem. Here's what happens inside the container init process:

graph TD
    A["ContainerInit()"] --> B["Set hostname"]
    B --> C["Mount /proc, /sys, /tmp, /dev<br/>inside new rootfs"]
    C --> D["Create /dev/null, /dev/zero,<br/>/dev/random, /dev/urandom, /dev/tty"]
    D --> E["pivotRoot(rootfs)"]
    E --> F["syscall.Exec(target_command)"]

Mounting filesystems

Before pivot_root we mount the required filesystems inside the new root:

mounts := []struct {
    source string
    target string
    fstype string
    flags  uintptr
    data   string
}{
    {"proc", "proc", "proc", 0, ""},
    {"sysfs", "sys", "sysfs", 0, ""},
    {"tmpfs", "tmp", "tmpfs", 0, ""},
    {"tmpfs", "dev", "tmpfs",
        syscall.MS_NOSUID | syscall.MS_STRICTATIME, "mode=755"},
}

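for _, m := range mounts {
    target := filepath.Join(rootfs, m.target)
    // A failed mkdir or mount here leaves the container half-built, so stop early.
    if err := os.MkdirAll(target, 0755); err != nil {
        log.Fatalf("mkdir %s: %v", target, err)
    }
    if err := syscall.Mount(m.source, target, m.fstype, m.flags, m.data); err != nil {
        log.Fatalf("mount %s on %s: %v", m.fstype, target, err)
    }
}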

/proc -- needed for ps, top, and other utilities. /sys -- device information. /tmp -- temporary files. /dev -- devices.

Creating device nodes

The container needs basic devices. Without /dev/null, for example, many programs fail as soon as they try to open it:

func createDevices(rootfs string) error {
    devPath := filepath.Join(rootfs, "dev")

    devices := []struct {
        name  string
        major uint32
        minor uint32
        mode  uint32
    }{
        {"null", 1, 3, 0666},
        {"zero", 1, 5, 0666},
        {"random", 1, 8, 0666},
        {"urandom", 1, 9, 0666},
        {"tty", 5, 0, 0666},
    }

    for _, d := range devices {
        path := filepath.Join(devPath, d.name)
        dev := unix.Mkdev(d.major, d.minor)
        // The requested mode is filtered through the process umask.
        if err := unix.Mknod(path, syscall.S_IFCHR|d.mode, int(dev)); err != nil {
            return fmt.Errorf("mknod %s: %w", path, err)
        }
    }

    // Symlinks for stdin/stdout/stderr
    links := [][2]string{
        {"/proc/self/fd", "fd"},
        {"/proc/self/fd/0", "stdin"},
        {"/proc/self/fd/1", "stdout"},
        {"/proc/self/fd/2", "stderr"},
    }
    for _, l := range links {
        if err := os.Symlink(l[0], filepath.Join(devPath, l[1])); err != nil {
            return fmt.Errorf("symlink %s: %w", l[1], err)
        }
    }
    return nil
}

Mknod creates special device files with specific major/minor numbers. The kernel picks the driver from that pair, not from the file name: character device (1, 3) is the /dev/null "black hole", which discards everything written to it and returns end-of-file on every read.
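To make the major/minor pair less abstract, here is a minimal sketch of how Linux packs the two numbers into the single dev value that Mknod receives. The helper names mkdev, devMajor and devMinor are made up for this example; the bit layout follows glibc's makedev and is what unix.Mkdev computes.

```go
package main

import "fmt"

// mkdev packs a major/minor pair into a Linux dev_t value.
// Hypothetical helper for illustration; same bit layout as glibc's makedev.
func mkdev(major, minor uint32) uint64 {
	dev := (uint64(major) & 0x00000fff) << 8
	dev |= (uint64(major) & 0xfffff000) << 32
	dev |= uint64(minor) & 0x000000ff
	dev |= (uint64(minor) & 0xffffff00) << 12
	return dev
}

// devMajor extracts the major number from a packed dev_t value.
func devMajor(dev uint64) uint32 {
	major := uint32((dev & 0x00000000000fff00) >> 8)
	major |= uint32((dev & 0xfffff00000000000) >> 32)
	return major
}

// devMinor extracts the minor number from a packed dev_t value.
func devMinor(dev uint64) uint32 {
	minor := uint32(dev & 0x00000000000000ff)
	minor |= uint32((dev & 0x00000ffffff00000) >> 12)
	return minor
}

func main() {
	// The same device table as in createDevices above.
	devices := []struct {
		name         string
		major, minor uint32
	}{
		{"null", 1, 3},
		{"zero", 1, 5},
		{"random", 1, 8},
		{"urandom", 1, 9},
		{"tty", 5, 0},
	}
	for _, d := range devices {
		dev := mkdev(d.major, d.minor)
		fmt.Printf("%-8s %d:%d -> %#x\n", d.name, devMajor(dev), devMinor(dev), dev)
	}
	// First line printed: "null     1:3 -> 0x103"
}
```

For small numbers the encoding is simply major<<8 | minor, which is why /dev/null shows up as 0x103. The extra shifts only matter for majors above 4095 or minors above 255.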

The pivot_root itself

Here's the key function:

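func pivotRoot(newRoot string) error {
    putOld := filepath.Join(newRoot, ".pivot_old")
    if err := os.MkdirAll(putOld, 0700); err != nil {
        return fmt.Errorf("mkdir %s: %w", putOld, err)
    }

    // Bind mount newRoot onto itself (pivot_root requirement).
    // MS_REC carries along the /proc, /sys, /tmp and /dev mounts made earlier.
    if err := syscall.Mount(newRoot, newRoot, "",
        syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
        return fmt.Errorf("bind mount %s: %w", newRoot, err)
    }

    // Swap the root
    if err := unix.PivotRoot(newRoot, putOld); err != nil {
        return fmt.Errorf("pivot_root: %w", err)
    }

    // Move into the new root
    if err := os.Chdir("/"); err != nil {
        return fmt.Errorf("chdir /: %w", err)
    }

    // Unmount the old root. Stop here on failure: the host FS would still
    // be mounted at /.pivot_old and the RemoveAll below would delete from it.
    if err := syscall.Unmount("/.pivot_old", syscall.MNT_DETACH); err != nil {
        return fmt.Errorf("unmount old root: %w", err)
    }

    // Remove the (now empty) mount point
    if err := os.RemoveAll("/.pivot_old"); err != nil {
        return fmt.Errorf("remove /.pivot_old: %w", err)
    }

    return nil
}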

Step by step:

graph LR
    subgraph "Before pivot_root"
        OLD_ROOT["/ (host FS)"]
        NEW_ROOT["/var/lib/sheep/overlay/abc/merged"]
    end

    subgraph "Bind mount"
        BM["newRoot mounted onto itself"]
    end

    subgraph "After pivot_root"
        REAL_ROOT["/ (container FS)"]
        PIVOT_OLD["/.pivot_old (old host FS)"]
    end

    subgraph "After unmount"
        CLEAN_ROOT["/ (container only)"]
    end

    OLD_ROOT --> BM
    BM --> REAL_ROOT
    REAL_ROOT --> CLEAN_ROOT

Bind mount -- pivot_root requires the new root to be a mount point. Bind-mounting a directory onto itself satisfies this requirement.

PivotRoot -- makes newRoot the root mount of the process's mount namespace and, in the same call, moves the old root mount to .pivot_old.

Unmount -- after pivot_root the old FS is accessible through /.pivot_old. We unmount it with the MNT_DETACH flag (lazy unmount, doesn't wait for all files to close).

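RemoveAll -- removes the now-empty .pivot_old directory. Only do this once the unmount has succeeded: if the old root were still mounted there, a recursive delete would walk into the host filesystem.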

Why not chroot?

chroot simply changes what the process considers the root directory. But:

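  • a process with CAP_SYS_CHROOT (typically root) can escape: chroot into a subdirectory while the working directory stays outside it, then chdir("..") up to the real root
  • chroot doesn't affect already-open file descriptors, so a directory descriptor opened earlier still leads outside via fchdir
  • chroot doesn't change the mount table: every host mount is still part of the process's mount namespace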

pivot_root operates at the mount namespace level and reliably isolates the filesystem. After the old root is unmounted, the host FS is no longer part of the container's mount tree, so no path can resolve to it.

Limitations

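pivot_root requires the new root to be a mount point. If you forget the bind mount, you'll get EINVAL. The same EINVAL comes back when the new root's mount or its parent mount has shared propagation, which is the default on most systemd-based hosts, so the mounts in the new namespace have to be made private first. Neither cause is obvious from the error, and debugging it takes time. Docker/runc have had these edge cases handled through years of production use.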

Try it yourself

# Start a container and check mount points:
sudo ./sheep run --name pivot-test minimal /bin/sh
# Inside the container:
mount | head -5
ls /     # you only see the container FS

Namespaces isolate but don't limit. Next up -- cgroups v2 for resource limits.
