Re-Exec Pattern: Why Go and clone() Don't Get Along

Written by:

Igor Gorovyy
DevOps Engineer Lead & Senior Solutions Architect

In part 1 I showed how namespaces isolate a process. But there's a problem: Go and the clone() system call aren't very compatible. Here's why and how to fix it.

This is part 2 of the Sheep & Shepherd series. After this, pivot_root gives the container its own filesystem root, cgroups v2 caps its resources, and OverlayFS makes the filesystem cheap to spin up.

The problem: goroutines and clone()

Go starts several OS threads before your main() even begins. The Go runtime needs them for the garbage collector, the scheduler's background monitor (sysmon), network polling, and similar housekeeping.

When you call clone() with the CLONE_NEWPID flag, the child is created as PID 1 in the new namespace. Here's the catch: like fork(), clone() without CLONE_THREAD copies only the calling thread. The other runtime threads that were already running in the parent don't exist in the child. The child has a single thread, while its copy of the runtime's memory still describes several, including any locks those threads were holding. The result is a crash, a deadlock, or otherwise unpredictable behavior.
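To see those extra threads, a Linux process can read its own thread count from /proc. The sketch below is only an illustration: threadCount is a hypothetical helper, and the number you get depends on the Go version and the machine.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// threadCount extracts the "Threads:" field from the text of /proc/<pid>/status.
func threadCount(status string) (int, error) {
	for _, line := range strings.Split(status, "\n") {
		if strings.HasPrefix(line, "Threads:") {
			value := strings.TrimSpace(strings.TrimPrefix(line, "Threads:"))
			return strconv.Atoi(value)
		}
	}
	return 0, fmt.Errorf("no Threads: field found")
}

func main() {
	// Linux-only: /proc/self/status describes the calling process.
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status (not Linux?):", err)
		return
	}
	n, err := threadCount(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	// main() has done nothing yet, so any extra threads belong to the runtime.
	fmt.Printf("OS threads at the top of main(): %d\n", n)
}
```

Anything above 1 here is a thread that a raw clone() would leave behind in the parent.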

The fix: self re-exec

Instead of calling clone() directly, I do this:

1. Launch the binary itself as a child process with new namespaces.
2. Pass a special init command so the child knows it is the container init.
3. The child process sets up the filesystem, hostname, and so on.
4. After setup, the child replaces itself with the target program.

sequenceDiagram
    participant P as sheep (parent)
    participant C as sheep init (child)
    participant T as target process

    P->>P: cmd.SysProcAttr.Cloneflags = CLONE_NEW*
    P->>C: exec("sheep", "init", "--rootfs", "/path", "--", "cmd")
    Note over C: New process in new namespaces
    C->>C: sethostname()
    C->>C: mount proc, sys, dev, tmp
    C->>C: pivotRoot(rootfs)
    C->>T: syscall.Exec(target_command)
    Note over T: PID 1 in the container
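
The control flow in the diagram can be tried without root by leaving the namespace flags out. This is a minimal sketch of the re-exec skeleton only, not the Sheep code; isInit and childArgs are hypothetical helpers.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// isInit reports whether this invocation is the re-executed child.
func isInit(args []string) bool {
	return len(args) > 1 && args[1] == "init"
}

// childArgs builds the argument list passed to the re-executed binary.
func childArgs(target []string) []string {
	return append([]string{"init", "--"}, target...)
}

func main() {
	if isInit(os.Args) {
		// Child: started by a normal exec, so it has a fresh Go runtime.
		fmt.Printf("child: pid=%d args=%v\n", os.Getpid(), os.Args[1:])
		return
	}

	// Parent: launch this same binary again with the "init" command.
	self, err := os.Executable()
	if err != nil {
		fmt.Fprintln(os.Stderr, "executable:", err)
		os.Exit(1)
	}
	cmd := exec.Command(self, childArgs([]string{"/bin/sh"})...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "child failed:", err)
		os.Exit(1)
	}
}
```

The child here only prints its arguments; in the real flow this is where hostname, mounts, and the final Exec would happen.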

What it looks like in code

Here's how the parent process creates the re-exec command:

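// reexecCommand builds the command that re-launches this binary as the
// container init process.
func reexecCommand(c *Container) *exec.Cmd {
    self, err := os.Executable()
    if err != nil {
        // Linux fallback: this symlink always points at the running binary.
        self = "/proc/self/exe"
    }
    cmd := exec.Command(self,
        append([]string{
            "init",
            "--rootfs", c.RootFS,
            "--hostname", c.Config.Hostname,
            "--", // everything after this is the target command
        }, c.Command...)...)
    // Copy the env first so append never writes into c.Config.Env's backing array.
    env := append([]string{}, c.Config.Env...)
    cmd.Env = append(env, fmt.Sprintf("SHEEP_CONTAINER_ID=%s", c.ID))
    return cmd
}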

os.Executable() returns the path to the current binary. I launch the binary itself with the init argument.

Handling "init" in main()

In main(), the first thing I check is whether this is the child process:

func main() {
    if len(os.Args) < 2 {
        printUsage()
        os.Exit(1)
    }

    switch os.Args[1] {
    case "init":
        handleInit()
        return
    case "version":
        fmt.Println("sheep v0.1.0")
        return
    // ... other commands
    }
}

The init command is handled before the container manager is initialized. This matters because the child is already running inside the new namespaces and is about to pivot away from the host filesystem, so it should not initialize the manager's state in host directories.

What handleInit() does

func handleInit() {
    var rootfs, hostname string
    var command []string

    args := os.Args[2:]
    for i := 0; i < len(args); i++ {
        switch args[i] {
        case "--rootfs":
            i++
            if i < len(args) { rootfs = args[i] }
        case "--hostname":
            i++
            if i < len(args) { hostname = args[i] }
        case "--":
            command = args[i+1:]
            i = len(args)
        }
    }

    if rootfs == "" {
        fatal("init: --rootfs is required")
    }

    if err := container.ContainerInit(rootfs, hostname, command); err != nil {
        fatal("init: %v", err)
    }
}
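
The loop above silently skips a flag that has no value. One possible variation, shown here as a sketch rather than the Sheep implementation, moves the parsing into a pure function that reports such mistakes and is easy to unit-test. initArgs and parseInitArgs are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// initArgs holds the parsed arguments of "sheep init".
type initArgs struct {
	RootFS   string
	Hostname string
	Command  []string
}

// parseInitArgs parses everything after the "init" word.
func parseInitArgs(args []string) (initArgs, error) {
	var out initArgs
	for i := 0; i < len(args); i++ {
		switch args[i] {
		case "--rootfs", "--hostname":
			if i+1 >= len(args) {
				return out, fmt.Errorf("%s needs a value", args[i])
			}
			if args[i] == "--rootfs" {
				out.RootFS = args[i+1]
			} else {
				out.Hostname = args[i+1]
			}
			i++ // skip the value we just consumed
		case "--":
			out.Command = args[i+1:]
			i = len(args) // everything after "--" belongs to the target command
		default:
			return out, fmt.Errorf("unknown argument %q", args[i])
		}
	}
	if out.RootFS == "" {
		return out, errors.New("--rootfs is required")
	}
	return out, nil
}

func main() {
	a, err := parseInitArgs([]string{"--rootfs", "/path", "--hostname", "abc123", "--", "/bin/sh"})
	fmt.Println(a.RootFS, a.Hostname, a.Command, err)
}
```

Rejecting unknown arguments is a design choice of this sketch; the original loop ignores them.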

I parse the arguments and call ContainerInit. This function sets everything up and does exec():

func ContainerInit(rootfs, hostname string, command []string) error {
    if hostname != "" {
        if err := syscall.Sethostname([]byte(hostname)); err != nil {
            return fmt.Errorf("sethostname: %w", err)
        }
    }

    // Mount /proc, /sys, /tmp, /dev
    mounts := []struct {
        source, target, fstype string
        flags                  uintptr
    }{
        {"proc", "proc", "proc", 0},
        {"sysfs", "sys", "sysfs", 0},
        {"tmpfs", "tmp", "tmpfs", 0},
        {"tmpfs", "dev", "tmpfs", syscall.MS_NOSUID | syscall.MS_STRICTATIME},
    }

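    for _, m := range mounts {
        target := filepath.Join(rootfs, m.target)
        if err := os.MkdirAll(target, 0755); err != nil {
            return fmt.Errorf("mkdir %s: %w", target, err)
        }
        if err := syscall.Mount(m.source, target, m.fstype, m.flags, ""); err != nil {
            return fmt.Errorf("mount %s on %s: %w", m.fstype, target, err)
        }
    }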

    createDevices(rootfs)
    pivotRoot(rootfs) // see part 3: ../2026-05-09-pivot-root/

    if len(command) == 0 {
        command = []string{"/bin/sh"}
    }

    binary, err := exec.LookPath(command[0])
    if err != nil {
        binary = command[0]
    }

    return syscall.Exec(binary, command, os.Environ())
}

The last line, syscall.Exec(), replaces the current process with the target command. This isn't fork+exec; it's a full replacement: same PID, new program. After this the Go runtime is gone and only the target program is running. The filesystem setup that happens just before Exec is detailed in part 3 on pivot_root.
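
The replacement is easy to observe in isolation. This sketch is a standalone toy, not part of Sheep: it prints its PID and then execs whatever command it was given. resolve is a hypothetical helper that mirrors the LookPath fallback used in ContainerInit.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// resolve searches PATH for name and falls back to the name as given.
func resolve(name string) string {
	if path, err := exec.LookPath(name); err == nil {
		return path
	}
	return name
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: prog <command> [args...]")
		return
	}
	command := os.Args[1:]
	fmt.Printf("before exec: pid=%d\n", os.Getpid())

	// On success Exec never returns: the same PID now runs the target program.
	err := syscall.Exec(resolve(command[0]), command, os.Environ())

	// Reached only if the exec itself failed (for example, binary not found).
	fmt.Fprintln(os.Stderr, "exec failed:", err)
	os.Exit(1)
}
```

Running it with sh -c 'echo $$' as the command should show the shell reporting the same PID that the Go program printed, since no new process is created.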

Why not just fork()?

In C you can call fork() and then execve() right away. Go doesn't expose a bare fork(), because the forked child would have only one thread while its copy of the Go runtime still expects several.

exec.Command with SysProcAttr.Cloneflags is the "proper" way to do clone+exec in Go. The new process starts with a clean runtime.

Data flow

graph LR
    A["sheep run nginx /bin/sh"] --> B["Create container"]
    B --> C["reexecCommand()"]
    C --> D["sheep init<br/>--rootfs /var/lib/sheep/overlay/abc/merged<br/>--hostname abc123<br/>-- /bin/sh"]
    D --> E["ContainerInit()"]
    E --> F["syscall.Exec(/bin/sh)"]

That's why ps on the host shows /bin/sh instead of sheep init: Exec() completely replaced the process.

What about Docker?

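Docker solves the same problem differently. containerd launches runc, which is written in Go but relies on a trick in its nsenter package: a small piece of C code, pulled in through cgo, that runs as a constructor before the Go runtime starts. runc also re-executes itself (as runc init), but the namespace work is done in C while the process is still single-threaded. It's more complex, but it gives finer control over namespace setup.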

In Sheep I chose the simpler approach with re-exec, which works without cgo.

Try it yourself

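# sheep init exists only until Exec() replaces it, so run these while a container is starting.
# Watch sheep re-exec itself (the [s] keeps grep from matching its own command line):
ps aux | grep '[s]heep init'
# See the init process arguments (-n picks the newest match if there are several):
cat /proc/$(pgrep -n -f 'sheep init')/cmdline | tr '\0' ' '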
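
The same cmdline check can be done from Go, which is handy in a test for the series. This is a minimal sketch: splitCmdline is a hypothetical helper, and the program reads its own cmdline unless a PID is passed as the first argument.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// splitCmdline splits the NUL-separated contents of /proc/<pid>/cmdline.
func splitCmdline(raw []byte) []string {
	raw = bytes.TrimRight(raw, "\x00") // the kernel terminates the last argument with NUL
	if len(raw) == 0 {
		return nil
	}
	parts := bytes.Split(raw, []byte{0})
	args := make([]string, len(parts))
	for i, p := range parts {
		args[i] = string(p)
	}
	return args
}

func main() {
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}
	raw, err := os.ReadFile("/proc/" + pid + "/cmdline")
	if err != nil {
		fmt.Println("cannot read cmdline:", err)
		return
	}
	fmt.Printf("%q\n", splitCmdline(raw))
}
```

Pointed at a container's PID after startup, it should report the target command rather than sheep init, for the reason described in the data flow section.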

Next I'll look at pivot_root: how a container gets its own filesystem.

Source code for the series: github.com/igorgorovoy/sheep-shepherd-meadow

Previous: Linux Namespaces | Next: pivot_root