pivot_root: How a Container Gets Its Own Filesystem
Written by:
Igor Gorovyy
DevOps Engineer Lead & Senior Solutions Architect
A container sees its own filesystem, not the host's. How does this work? Through the pivot_root(2) system call, which swaps the root mount in the calling process's mount namespace. Not chroot -- that one only changes where path lookups start, and a privileged process can escape it. pivot_root operates at the mount namespace level, so the old root can be detached entirely instead of merely hidden.
This is part 3 of the Sheep & Shepherd series. Previous parts: namespaces gave the container its own view of the system, and re-exec worked around Go's threading model. Next up: cgroups v2 caps its resources and OverlayFS makes the filesystem cheap to spin up.
The sequence of steps
Before doing pivot_root, I need to prepare the new root filesystem. Here's what happens inside the container init process (set up via the re-exec pattern in part 2):
graph TD
A["ContainerInit()"] --> B["Set hostname"]
B --> C["Mount /proc, /sys, /tmp, /dev<br/>inside new rootfs"]
C --> D["Create /dev/null, /dev/zero,<br/>/dev/random, /dev/urandom, /dev/tty"]
D --> E["pivotRoot(rootfs)"]
E --> F["syscall.Exec(target_command)"]
Mounting filesystems
Before pivot_root I mount the required filesystems inside the new root:
mounts := []struct {
	source string
	target string
	fstype string
	flags  uintptr
	data   string
}{
	{"proc", "proc", "proc", 0, ""},
	{"sysfs", "sys", "sysfs", 0, ""},
	{"tmpfs", "tmp", "tmpfs", 0, ""},
	{"tmpfs", "dev", "tmpfs",
		syscall.MS_NOSUID | syscall.MS_STRICTATIME, "mode=755"},
}
for _, m := range mounts {
	target := filepath.Join(rootfs, m.target)
	os.MkdirAll(target, 0755)
	syscall.Mount(m.source, target, m.fstype, m.flags, m.data)
}
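/proc -- needed for ps, top, and other utilities that read process information. /sys -- device and kernel information. /tmp -- temporary files. /dev -- a fresh tmpfs that holds the device nodes created in the next step.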
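The loop above discards the return values of os.MkdirAll and syscall.Mount, so a failed mount only shows up later as a confusing error inside the container. Here is a minimal sketch of an error-checked variant. The names mountSpec, mountFunc, mkdirFunc, mountAll and errSimulated are assumptions of this sketch, not part of the series' code. The mount function is passed in as a parameter so the logic can be exercised as a dry run without root.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// mountSpec mirrors the anonymous struct used in the article.
type mountSpec struct {
	source, target, fstype string
	flags                  uintptr
	data                   string
}

// mountFunc has the same signature as syscall.Mount, so the real syscall
// can be passed in on Linux. The type name is an assumption of this sketch.
type mountFunc func(source, target, fstype string, flags uintptr, data string) error

// mkdirFunc creates a mount target directory (hypothetical helper type).
type mkdirFunc func(path string) error

// errSimulated is a stand-in failure used only by the dry run in main.
var errSimulated = errors.New("simulated mount failure")

// mountAll creates each target under rootfs and mounts onto it. It stops at
// the first failure and reports which step failed.
func mountAll(rootfs string, specs []mountSpec, mkdir mkdirFunc, mount mountFunc) error {
	for _, m := range specs {
		target := filepath.Join(rootfs, m.target)
		if err := mkdir(target); err != nil {
			return fmt.Errorf("mkdir %s: %w", target, err)
		}
		if err := mount(m.source, target, m.fstype, m.flags, m.data); err != nil {
			return fmt.Errorf("mount %s on %s: %w", m.fstype, target, err)
		}
	}
	return nil
}

func main() {
	specs := []mountSpec{
		{source: "proc", target: "proc", fstype: "proc"},
		{source: "sysfs", target: "sys", fstype: "sysfs"},
		{source: "tmpfs", target: "tmp", fstype: "tmpfs"},
	}
	// Dry run: print instead of mounting, and fail on sysfs to show the error path.
	dryMkdir := func(path string) error { return nil }
	dryMount := func(source, target, fstype string, flags uintptr, data string) error {
		fmt.Printf("would mount %s (%s) on %s\n", source, fstype, target)
		if fstype == "sysfs" {
			return errSimulated
		}
		return nil
	}
	if err := mountAll("/tmp/rootfs", specs, dryMkdir, dryMount); err != nil {
		fmt.Fprintln(os.Stderr, "setup failed:", err)
	}
}
```

Inside the real init process, the dry-run functions would be replaced with a wrapper around os.MkdirAll(path, 0755) and with syscall.Mount itself, which already matches the mountFunc signature.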
Creating device nodes
The container needs basic devices. Without /dev/null, for example, a lot of programs will just crash:
func createDevices(rootfs string) {
	devPath := filepath.Join(rootfs, "dev")
	devices := []struct {
		name  string
		major uint32
		minor uint32
		mode  uint32
	}{
		{"null", 1, 3, 0666},
		{"zero", 1, 5, 0666},
		{"random", 1, 8, 0666},
		{"urandom", 1, 9, 0666},
		{"tty", 5, 0, 0666},
	}
	for _, d := range devices {
		path := filepath.Join(devPath, d.name)
		dev := unix.Mkdev(d.major, d.minor)
		unix.Mknod(path, syscall.S_IFCHR|d.mode, int(dev))
	}
	// Symlinks for stdin/stdout/stderr
	os.Symlink("/proc/self/fd", filepath.Join(devPath, "fd"))
	os.Symlink("/proc/self/fd/0", filepath.Join(devPath, "stdin"))
	os.Symlink("/proc/self/fd/1", filepath.Join(devPath, "stdout"))
	os.Symlink("/proc/self/fd/2", filepath.Join(devPath, "stderr"))
}
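Mknod creates special device files with specific major/minor numbers. The kernel knows that /dev/null (1, 3) is a "black hole": writes are discarded and reads return end-of-file immediately. One caveat: mknod applies the process umask to the requested mode, so 0666 only takes effect in full if the umask has been cleared beforehand.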
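To see what unix.Mkdev actually hands to Mknod, here is a small sketch of the Linux device-number encoding written out by hand. It assumes the same bit layout that glibc and golang.org/x/sys/unix use. The function names mkdev, devMajor and devMinor are local to this example.

```go
package main

import "fmt"

// mkdev packs a major/minor pair into a Linux dev_t value.
// Low 12 bits of major go to bits 8-19, low 8 bits of minor to bits 0-7;
// the remaining high bits are spread above that.
func mkdev(major, minor uint32) uint64 {
	dev := (uint64(major) & 0x00000fff) << 8
	dev |= (uint64(major) & 0xfffff000) << 32
	dev |= (uint64(minor) & 0x000000ff) << 0
	dev |= (uint64(minor) & 0xffffff00) << 12
	return dev
}

// devMajor extracts the major number from a dev_t value.
func devMajor(dev uint64) uint32 {
	return uint32((dev&0x00000000000fff00)>>8) | uint32((dev&0xfffff00000000000)>>32)
}

// devMinor extracts the minor number from a dev_t value.
func devMinor(dev uint64) uint32 {
	return uint32(dev&0x00000000000000ff) | uint32((dev&0x00000ffffff00000)>>12)
}

func main() {
	// The same device table as in createDevices above.
	devices := []struct {
		name         string
		major, minor uint32
	}{
		{"null", 1, 3},
		{"zero", 1, 5},
		{"random", 1, 8},
		{"urandom", 1, 9},
		{"tty", 5, 0},
	}
	for _, d := range devices {
		dev := mkdev(d.major, d.minor)
		fmt.Printf("%-8s %d:%d -> dev_t %#x\n", d.name, devMajor(dev), devMinor(dev), dev)
	}
}
```

For the small numbers used here the encoding reduces to major<<8 | minor, so /dev/null (1, 3) becomes 0x103. The extra masks only matter for majors above 4095 or minors above 255.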
The pivot_root itself
Here's the key function:
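func pivotRoot(newRoot string) error {
	putOld := filepath.Join(newRoot, ".pivot_old")
	if err := os.MkdirAll(putOld, 0700); err != nil {
		return fmt.Errorf("mkdir %s: %w", putOld, err)
	}
	// Bind mount newRoot onto itself (pivot_root requirement)
	if err := syscall.Mount(newRoot, newRoot, "",
		syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
		return fmt.Errorf("bind mount %s: %w", newRoot, err)
	}
	// Swap the root
	if err := unix.PivotRoot(newRoot, putOld); err != nil {
		return fmt.Errorf("pivot_root: %w", err)
	}
	// Move into the new root
	if err := os.Chdir("/"); err != nil {
		return fmt.Errorf("chdir /: %w", err)
	}
	// Unmount the old root
	if err := syscall.Unmount("/.pivot_old", syscall.MNT_DETACH); err != nil {
		return fmt.Errorf("unmount old root: %w", err)
	}
	// Remove the mount point. This is only reached after a successful unmount,
	// so RemoveAll can no longer descend into the host filesystem.
	return os.RemoveAll("/.pivot_old")
}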
Step by step:
graph LR
subgraph "Before pivot_root"
OLD_ROOT["/ (host FS)"]
NEW_ROOT["/var/lib/sheep/overlay/abc/merged"]
end
subgraph "Bind mount"
BM["newRoot mounted onto itself"]
end
subgraph "After pivot_root"
REAL_ROOT["/ (container FS)"]
PIVOT_OLD["/.pivot_old (old host FS)"]
end
subgraph "After unmount"
CLEAN_ROOT["/ (container only)"]
end
OLD_ROOT --> BM
BM --> REAL_ROOT
REAL_ROOT --> CLEAN_ROOT
Bind mount -- pivot_root requires the new root to be a mount point. Bind-mounting a directory onto itself satisfies this requirement.
PivotRoot -- makes newRoot the root mount of the mount namespace and moves the old root mount to .pivot_old, all in one call.
Unmount -- after pivot_root the old FS is still accessible through /.pivot_old. I unmount it with the MNT_DETACH flag (lazy unmount: the mount is detached from the tree immediately and released once nothing is using it).
RemoveAll -- remove the .pivot_old directory.
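The pivot_root(2) man page also describes a variant that needs no temporary directory: chdir into the new root, call pivot_root(".", "."), then lazily unmount ".". Below is a hedged sketch of that alternative using only the standard syscall package. It is not the code the series uses, and the name pivotRootInPlace is made up for this example. It also marks "/" as recursively private first, on the assumption that this has not already been done earlier in the init process. Linux only, and only safe inside a fresh mount namespace.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// pivotRootInPlace switches the root to newRoot without a put_old directory.
// It must run inside a new mount namespace; on the host it would change the
// mount propagation of "/" and the caller's root.
func pivotRootInPlace(newRoot string) error {
	// Validate the argument before touching any mounts.
	info, err := os.Stat(newRoot)
	if err != nil {
		return fmt.Errorf("stat new root: %w", err)
	}
	if !info.IsDir() {
		return fmt.Errorf("new root %s is not a directory", newRoot)
	}
	// Assumption: propagation has not been set up yet. pivot_root rejects
	// shared parent mounts, and private mounts keep changes out of the host.
	if err := syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
		return fmt.Errorf("make / private: %w", err)
	}
	// The new root has to be a mount point.
	if err := syscall.Mount(newRoot, newRoot, "", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
		return fmt.Errorf("bind mount %s: %w", newRoot, err)
	}
	if err := os.Chdir(newRoot); err != nil {
		return fmt.Errorf("chdir %s: %w", newRoot, err)
	}
	// The old root is stacked on the same path as the new root.
	if err := syscall.PivotRoot(".", "."); err != nil {
		return fmt.Errorf("pivot_root: %w", err)
	}
	// "." still refers to the old root here, so this detaches it.
	if err := syscall.Unmount(".", syscall.MNT_DETACH); err != nil {
		return fmt.Errorf("unmount old root: %w", err)
	}
	return os.Chdir("/")
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: pivot <new-root> (run only inside a fresh mount namespace)")
		return
	}
	if err := pivotRootInPlace(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, "pivot failed:", err)
		os.Exit(1)
	}
	entries, err := os.ReadDir("/")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read new root:", err)
		os.Exit(1)
	}
	for _, e := range entries {
		fmt.Println(e.Name())
	}
}
```

The trade-off is readability against cleanup: the .pivot_old version makes each step visible in the filesystem, while this one leaves nothing behind to remove.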
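Why not chroot?
chroot simply changes what the process considers the root directory. But:
- a privileged process can escape chroot by calling chroot again on a subdirectory and walking up with ../../../ tricks
- chroot doesn't affect already-open file descriptors
- chroot doesn't touch the mount namespace: the host's mounts stay attached, just out of sight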
pivot_root operates at the mount namespace level and reliably isolates the filesystem. After unmounting the old root, the process physically can't reach the host FS. And on top of this root, OverlayFS (part 5) stacks copy-on-write layers so 10 containers can share one base image.
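Limitations
pivot_root requires the new root to be a mount point. If you forget the bind mount, you'll get EINVAL. The same EINVAL comes back when the parent mount of the new root (or of the current root) has shared propagation, so the error code alone doesn't tell you which rule you broke. This isn't an obvious error, and debugging it takes time. Docker/runc have had these edge cases handled through years of production use.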
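One way to narrow down an EINVAL is to look at /proc/self/mountinfo before calling pivot_root: it shows whether a path is a mount point at all, and whether a mount carries a shared:N propagation tag (pivot_root(2) also returns EINVAL for shared parent mounts). The sketch below is a simplified parser, not a complete one. The names mountEntry, parseMountinfo and findMount are assumptions of this example, and paths containing escaped characters such as \040 for a space are not decoded.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// mountEntry holds the few mountinfo fields this check needs.
type mountEntry struct {
	MountPoint string
	FSType     string
	Shared     bool // true if the mount has a "shared:N" optional field
}

// parseMountinfo parses the text of /proc/self/mountinfo.
// Line layout: ID parent major:minor root mountpoint options [optional...] - fstype source superopts
func parseMountinfo(data string) []mountEntry {
	var entries []mountEntry
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Optional fields start at index 6 and end at the "-" separator.
		sep := -1
		for i := 6; i < len(fields); i++ {
			if fields[i] == "-" {
				sep = i
				break
			}
		}
		if sep < 0 || sep+1 >= len(fields) {
			continue // malformed line, skip it
		}
		e := mountEntry{MountPoint: fields[4], FSType: fields[sep+1]}
		for _, opt := range fields[6:sep] {
			if strings.HasPrefix(opt, "shared:") {
				e.Shared = true
			}
		}
		entries = append(entries, e)
	}
	return entries
}

// findMount returns the last entry mounted exactly at path, since later
// mounts are stacked on top of earlier ones.
func findMount(entries []mountEntry, path string) (mountEntry, bool) {
	path = filepath.Clean(path)
	var found mountEntry
	ok := false
	for _, e := range entries {
		if e.MountPoint == path {
			found, ok = e, true
		}
	}
	return found, ok
}

func main() {
	path := "/"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	data, err := os.ReadFile("/proc/self/mountinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read mountinfo:", err)
		return
	}
	e, ok := findMount(parseMountinfo(string(data)), path)
	if !ok {
		fmt.Printf("%s is not a mount point; pivot_root would return EINVAL\n", path)
		return
	}
	fmt.Printf("%s is a mount point (fstype=%s, shared=%v)\n", path, e.FSType, e.Shared)
}
```

Run it with the rootfs path as the argument right before pivotRoot would be called. If it reports "not a mount point", the bind mount is missing. If a parent mount shows shared=true, propagation is the more likely culprit.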
Try it yourself
# Start a container and check mount points:
sudo ./sheep run --name pivot-test minimal /bin/sh
# Inside the container:
mount | head -5
ls / # you only see the container FS
Namespaces isolate but don't limit. Next up -- cgroups v2 for resource limits.
Source code for the series: github.com/igorgorovoy/sheep-shepherd-meadow
Previous: Re-Exec Pattern | Next: Cgroups v2
