Перейти до змісту

Linux Namespaces: Isolating a Process in 50 Lines of Go

Linux Namespaces: Isolating a Process in 50 Lines of Go

Written by:

Igor Gorovyy
DevOps Engineer Lead & Senior Solutions Architect

LinkedIn


A container is a process with a restricted view of the system. That's it. No magic, no virtual machines. The Linux kernel gives a process its own "namespace," and the process thinks it's alone on the machine.

I built the Sheep container runtime from scratch in Go, and in this series I'll show how it works from the inside. Let's start with the most important piece -- namespaces.

Five namespaces that make a container a container

Linux provides several types of namespaces. Each one isolates a different aspect:

Namespace Flag What it isolates
UTS CLONE_NEWUTS hostname -- the container gets its own name
PID CLONE_NEWPID processes -- the container only sees its own PIDs
Mount CLONE_NEWNS filesystem -- its own root, its own mounts
IPC CLONE_NEWIPC inter-process communication -- shared memory, semaphores
Network CLONE_NEWNET network -- its own network stack, IP, ports

What it looks like in code

Here's the key function from Sheep that starts an isolated process:

func startContainer(c *Container) (int, error) {
    cmd := reexecCommand(c)

    cmd.SysProcAttr = &syscall.SysProcAttr{
        Cloneflags: syscall.CLONE_NEWUTS |
            syscall.CLONE_NEWPID |
            syscall.CLONE_NEWNS |
            syscall.CLONE_NEWIPC |
            syscall.CLONE_NEWNET,
        Unshareflags: syscall.CLONE_NEWNS,
    }

    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    if err := cmd.Start(); err != nil {
        return 0, fmt.Errorf("start namespaced process: %w", err)
    }

    pid := cmd.Process.Pid

    if err := setupCgroups(c, pid); err != nil {
        cmd.Process.Kill()
        return 0, fmt.Errorf("setup cgroups: %w", err)
    }

    if err := setupNetworkForContainer(c, pid); err != nil {
        fmt.Fprintf(os.Stderr, "warning: network setup failed: %v\n", err)
    }

    go cmd.Wait()
    return pid, nil
}

Everything hinges on one line -- SysProcAttr.Cloneflags. We pass the kernel a set of flags, and it creates a child process in separate namespaces.

What each flag does

CLONE_NEWUTS -- hostname isolation

A process in a new UTS namespace gets its own hostname. Without this, sethostname() would change the hostname of the entire machine.

if hostname != "" {
    if err := syscall.Sethostname([]byte(hostname)); err != nil {
        return fmt.Errorf("sethostname: %w", err)
    }
}

CLONE_NEWPID -- process isolation

Inside a PID namespace, the process thinks its PID is 1. It can't see any host processes. It's like a separate operating system, just sharing one kernel.

CLONE_NEWNS -- filesystem isolation

A mount namespace gives the process its own mount table. The container can mount /proc, /sys, /tmp without affecting the host.

CLONE_NEWIPC -- IPC isolation

Shared memory segments, semaphores, message queues -- all separate. The container can't eavesdrop on the host's IPC.

CLONE_NEWNET -- network isolation

Its own network stack: its own interfaces, IP addresses, routing table, iptables. Until we set up a veth pair and bridge, the container has no network at all.

How it all fits together

graph TB
    subgraph "Host"
        HOST_PROC["Host processes<br/>PID 1, 2, 3..."]
        HOST_NET["Host network<br/>eth0, 192.168.1.x"]
        HOST_FS["Host filesystem<br/>/, /home, /var"]
    end

    subgraph "Container (new namespaces)"
        C_PROC["PID 1 (init)"]
        C_NET["Container network<br/>eth0, 10.20.0.x"]
        C_FS["Own filesystem<br/>/, /proc, /tmp"]
        C_HOST["hostname: sheep-abc123"]
    end

    HOST_PROC -.->|"CLONE_NEWPID"| C_PROC
    HOST_NET -.->|"CLONE_NEWNET"| C_NET
    HOST_FS -.->|"CLONE_NEWNS"| C_FS

The host sees the container process as a regular process with a high PID. But from the inside, the container only sees itself.

What can go wrong

Namespaces provide isolation but not resource limits. A process in a new PID namespace can still eat all the machine's memory. For limits, you need cgroups -- that's coming in part 4.

And one more thing. Go spawns several threads before main() even runs. This means that clone() with namespace flags doesn't work quite right if you call it directly. You need the re-exec pattern, which we'll cover in the next part.

Try it yourself

git clone https://github.com/igorovh/sheep && cd sheep
go build ./cmd/sheep
sudo ./sheep bootstrap minimal
sudo ./sheep run --name test minimal /bin/sh
# Inside the container:
hostname        # you'll see the shortened container ID
ps aux          # only one process -- /bin/sh
ip addr         # its own network interface

Next up, we'll look at why Go and clone() clash and how the re-exec pattern fixes it.

Next: Re-Exec Pattern