pivot_root: How a Container Gets Its Own Filesystem
Written by:
Igor Gorovyy
DevOps Engineer Lead & Senior Solutions Architect
Stanislav Gorovyy
HBO-ICT Student at Hogeschool Saxion | Software Engineer
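A container sees its own filesystem, not the host's. How does this work? Through the pivot_root(2) system call, which replaces the root mount in the calling process's mount namespace. Not chroot -- that one only changes where path lookup starts and is easy to escape. pivot_root operates at the mount namespace level, so once the old root is unmounted, the host filesystem is no longer part of what the process can see.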
The sequence of steps
Before doing pivot_root, we need to prepare the new root filesystem. Here's what happens inside the container init process:
graph TD
A["ContainerInit()"] --> B["Set hostname"]
B --> C["Mount /proc, /sys, /tmp, /dev<br/>inside new rootfs"]
C --> D["Create /dev/null, /dev/zero,<br/>/dev/random, /dev/tty"]
D --> E["pivotRoot(rootfs)"]
E --> F["syscall.Exec(target_command)"]
Mounting filesystems
Before pivot_root we mount the required filesystems inside the new root:
mounts := []struct {
	source string
	target string
	fstype string
	flags  uintptr
	data   string
}{
	{"proc", "proc", "proc", 0, ""},
	{"sysfs", "sys", "sysfs", 0, ""},
	{"tmpfs", "tmp", "tmpfs", 0, ""},
	{"tmpfs", "dev", "tmpfs",
		syscall.MS_NOSUID | syscall.MS_STRICTATIME, "mode=755"},
}
for _, m := range mounts {
	target := filepath.Join(rootfs, m.target)
	// Return values are ignored in this snippet; real code should check both errors.
	os.MkdirAll(target, 0755)
	syscall.Mount(m.source, target, m.fstype, m.flags, m.data)
}
/proc -- needed by ps, top, and other utilities that read process information. /sys -- device and kernel information. /tmp -- temporary files. /dev -- a tmpfs that will hold the device nodes created in the next step.
Creating device nodes
The container needs basic devices. Without /dev/null, for example, a lot of programs will just crash:
// Needs: fmt, os, path/filepath, syscall, golang.org/x/sys/unix
func createDevices(rootfs string) error {
	devPath := filepath.Join(rootfs, "dev")
	devices := []struct {
		name  string
		major uint32
		minor uint32
		mode  uint32
	}{
		{"null", 1, 3, 0666},
		{"zero", 1, 5, 0666},
		{"random", 1, 8, 0666},
		{"urandom", 1, 9, 0666},
		{"tty", 5, 0, 0666},
	}
	for _, d := range devices {
		path := filepath.Join(devPath, d.name)
		dev := unix.Mkdev(d.major, d.minor)
		// The requested mode is still filtered through the process umask.
		if err := unix.Mknod(path, syscall.S_IFCHR|d.mode, int(dev)); err != nil {
			return fmt.Errorf("mknod %s: %w", path, err)
		}
	}
	// Symlinks for fd, stdin, stdout, stderr
	links := [][2]string{
		{"/proc/self/fd", "fd"},
		{"/proc/self/fd/0", "stdin"},
		{"/proc/self/fd/1", "stdout"},
		{"/proc/self/fd/2", "stderr"},
	}
	for _, l := range links {
		if err := os.Symlink(l[0], filepath.Join(devPath, l[1])); err != nil {
			return fmt.Errorf("symlink %s: %w", l[1], err)
		}
	}
	return nil
}
Mknod creates special device files with specific major/minor numbers. The kernel maps the character device (1, 3) to the null driver, which is why /dev/null behaves like a "black hole": writes are discarded and reads return end-of-file immediately.
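To make the major/minor pair less abstract, here is a small sketch of how the two numbers are packed into a single device number on Linux. The local mkdev helper is illustrative and reimplements the bit layout that unix.Mkdev from golang.org/x/sys/unix produces on Linux; in real code you would call the library function instead.

```go
package main

import "fmt"

// mkdev packs a major and minor number into a Linux device number.
// Layout: the low 12 bits of major go to bits 8-19, the low 8 bits of
// minor go to bits 0-7, and the remaining high bits are placed above.
func mkdev(major, minor uint32) uint64 {
	dev := (uint64(major) & 0x00000fff) << 8
	dev |= (uint64(major) & 0xfffff000) << 32
	dev |= uint64(minor) & 0x000000ff
	dev |= (uint64(minor) & 0xffffff00) << 12
	return dev
}

func main() {
	devices := []struct {
		name  string
		major uint32
		minor uint32
	}{
		{"null", 1, 3},
		{"zero", 1, 5},
		{"random", 1, 8},
		{"urandom", 1, 9},
		{"tty", 5, 0},
	}
	for _, d := range devices {
		fmt.Printf("%-8s %d:%d -> 0x%x\n", d.name, d.major, d.minor, mkdev(d.major, d.minor))
	}
}
```

For /dev/null the pair (1, 3) becomes 0x103, which is the value passed as the last argument to Mknod.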
The pivot_root itself
Here's the key function:
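func pivotRoot(newRoot string) error {
	putOld := filepath.Join(newRoot, ".pivot_old")
	if err := os.MkdirAll(putOld, 0700); err != nil {
		return fmt.Errorf("create %s: %w", putOld, err)
	}
	// Bind mount newRoot onto itself (pivot_root requirement)
	if err := syscall.Mount(newRoot, newRoot, "",
		syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
		return fmt.Errorf("bind mount %s: %w", newRoot, err)
	}
	// Swap the root
	if err := unix.PivotRoot(newRoot, putOld); err != nil {
		return fmt.Errorf("pivot_root: %w", err)
	}
	// Move into the new root
	if err := os.Chdir("/"); err != nil {
		return fmt.Errorf("chdir /: %w", err)
	}
	// Unmount the old root; bail out on failure so the cleanup below
	// never runs while the host filesystem is still mounted there
	if err := syscall.Unmount("/.pivot_old", syscall.MNT_DETACH); err != nil {
		return fmt.Errorf("unmount old root: %w", err)
	}
	// Remove the mount point
	if err := os.RemoveAll("/.pivot_old"); err != nil {
		return fmt.Errorf("remove /.pivot_old: %w", err)
	}
	return nil
}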
Step by step:
graph LR
subgraph "Before pivot_root"
OLD_ROOT["/ (host FS)"]
NEW_ROOT["/var/lib/sheep/overlay/abc/merged"]
end
subgraph "Bind mount"
BM["newRoot mounted onto itself"]
end
subgraph "After pivot_root"
REAL_ROOT["/ (container FS)"]
PIVOT_OLD["/.pivot_old (old host FS)"]
end
subgraph "After unmount"
CLEAN_ROOT["/ (container only)"]
end
OLD_ROOT --> BM
BM --> REAL_ROOT
REAL_ROOT --> CLEAN_ROOT
Bind mount -- pivot_root requires the new root to be a mount point. Bind-mounting a directory onto itself satisfies this requirement.
PivotRoot -- atomically swaps the process root. The old root filesystem ends up in .pivot_old.
Unmount -- after pivot_root the old FS is accessible through /.pivot_old. We unmount it with the MNT_DETACH flag (lazy unmount, doesn't wait for all files to close).
RemoveAll -- remove the .pivot_old directory.
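One defensive variation on the last step: after a successful lazy unmount the mount point is an empty directory, so os.Remove is sufficient. Unlike os.RemoveAll, it refuses to delete a directory that still has content, which matters if the unmount did not happen and /.pivot_old still shows the host filesystem. The sketch below simulates both cases in a temporary directory. The helper names removeMountPoint and demo are hypothetical, and nothing is actually mounted.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// removeMountPoint deletes dir only if it is empty.
// os.Remove returns an error for a non-empty directory and leaves it alone.
func removeMountPoint(dir string) error {
	return os.Remove(dir)
}

// demo simulates two situations inside a temporary directory:
//   - "stale": the old root is still populated (unmount did not happen)
//   - "clean": the old root is an empty directory (unmount succeeded)
//
// It reports whether the stale content survived and whether the clean
// directory was removed.
func demo() (staleKept bool, cleanRemoved bool, err error) {
	tmp, err := os.MkdirTemp("", "pivot-old-demo")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(tmp)

	stale := filepath.Join(tmp, "stale", ".pivot_old")
	if err := os.MkdirAll(filepath.Join(stale, "etc"), 0700); err != nil {
		return false, false, err
	}
	clean := filepath.Join(tmp, "clean", ".pivot_old")
	if err := os.MkdirAll(clean, 0700); err != nil {
		return false, false, err
	}

	// The stale directory must be refused and its content must still exist.
	staleErr := removeMountPoint(stale)
	_, statErr := os.Stat(filepath.Join(stale, "etc"))
	staleKept = staleErr != nil && statErr == nil

	// The clean directory must be removed without error.
	cleanErr := removeMountPoint(clean)
	_, statErr = os.Stat(clean)
	cleanRemoved = cleanErr == nil && os.IsNotExist(statErr)

	return staleKept, cleanRemoved, nil
}

func main() {
	staleKept, cleanRemoved, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("populated directory left untouched:", staleKept)
	fmt.Println("empty mount point removed:", cleanRemoved)
}
```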
Why not chroot?
chroot simply changes what the process considers the root directory. But:
- you can escape chroot with ../../../ tricks (given enough privileges inside it)
- chroot doesn't affect already-open file descriptors
- chroot doesn't touch the mount namespace, so the host's mounts remain in place
pivot_root operates at the mount namespace level and reliably isolates the filesystem. After unmounting the old root, the process physically can't reach the host FS.
Limitations
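pivot_root requires the new root to be a mount point. If you forget the bind mount, you'll get EINVAL. This isn't an obvious error: pivot_root(2) returns the same EINVAL for several other conditions (shared mount propagation on the mounts involved is a common one), so the error code alone doesn't tell you which rule was broken, and debugging it takes time. Docker/runc have had these edge cases handled through years of production use.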
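When chasing that EINVAL, one quick check is whether the kernel actually lists the new root as a mount point. The sketch below assumes the /proc/self/mountinfo format, where the fifth whitespace-separated field of each line is the mount point. The isMountPoint helper and the sample data are hypothetical, and the parser does not decode the octal escapes (such as \040 for a space) that mountinfo uses for unusual path characters.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isMountPoint reports whether path appears as a mount point in the given
// mountinfo content. The mount point is the fifth field of every line.
func isMountPoint(mountinfo, path string) bool {
	clean := filepath.Clean(path)
	sc := bufio.NewScanner(strings.NewReader(mountinfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 5 && fields[4] == clean {
			return true
		}
	}
	return false
}

func main() {
	// On Linux this reads the mount table of the current mount namespace.
	data, err := os.ReadFile("/proc/self/mountinfo")
	if err != nil {
		fmt.Println("cannot read mountinfo:", err)
		return
	}
	for _, p := range []string{"/", "/proc", "/nonexistent"} {
		fmt.Printf("%-14s mount point: %v\n", p, isMountPoint(string(data), p))
	}
}
```

Calling isMountPoint with the rootfs path right before pivotRoot would show whether the bind mount took effect. It does not cover the other EINVAL causes.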
Try it yourself
# Start a container and check mount points:
sudo ./sheep run --name pivot-test minimal /bin/sh
# Inside the container:
mount | head -5
ls / # you only see the container FS
Namespaces isolate but don't limit. Next up -- cgroups v2 for resource limits.
