Re-Exec Pattern: Why Go and clone() Don't Get Along
Written by:
Igor Gorovyy
DevOps Engineer Lead & Senior Solutions Architect
Stanislav Gorovyy
HBO-ICT Student at Hogeschool Saxion | Software Engineer
In part one we saw how namespaces isolate a process. But there's a problem: Go and the clone() system call aren't very compatible. Here's why and how to fix it.
The problem: goroutines and clone()
Go starts several OS threads before your main() even begins. The runtime uses them for the scheduler, garbage collection, and network polling.
When you call clone() with the CLONE_NEWPID flag, a child process is created with PID 1 in the new namespace. But here's the catch: only the thread that called clone() exists in the child. The other runtime threads that were running in the parent are not copied, yet the child's copy of the Go runtime still expects them. The result is a crash or unpredictable behavior.
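To see the extra threads for yourself, a minimal sketch like the one below counts the entries under /proc/self/task, which lists one directory per OS thread on Linux. The function name countOSThreads is made up for this illustration and is not part of Sheep.

```go
package main

import (
	"fmt"
	"os"
)

// countOSThreads returns the number of OS threads in the current process
// by counting the entries under /proc/self/task (Linux only).
// The name is hypothetical; it exists only for this illustration.
func countOSThreads() (int, error) {
	entries, err := os.ReadDir("/proc/self/task")
	if err != nil {
		return 0, fmt.Errorf("reading /proc/self/task: %w", err)
	}
	return len(entries), nil
}
```

Calling countOSThreads() as the first statement of main() will typically report more than one thread, even though your program has not started a single goroutine yet.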
The fix: self re-exec
Instead of calling clone() directly, we do this:
- Launch our own binary as a child process with new namespaces
- Pass a special init command so the child knows it's the container init
- The child process sets up the filesystem, hostname, and so on
- After setup, the child does exec() of the target program
sequenceDiagram
participant P as sheep (parent)
participant C as sheep init (child)
participant T as target process
P->>P: cmd.SysProcAttr.Cloneflags = CLONE_NEW*
P->>C: exec("sheep", "init", "--rootfs", "/path", "--", "cmd")
Note over C: New process in new namespaces
C->>C: sethostname()
C->>C: mount proc, sys, dev, tmp
C->>C: pivotRoot(rootfs)
C->>T: syscall.Exec(target_command)
Note over T: PID 1 in the container
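What it looks like in code
Here's how the parent process creates the re-exec command:
func reexecCommand(c *Container) *exec.Cmd {
	// Path to the currently running binary; on Linux, /proc/self/exe
	// is a usable fallback if the lookup fails.
	self, err := os.Executable()
	if err != nil {
		self = "/proc/self/exe"
	}

	// Everything after "--" is the command to run inside the container.
	args := append([]string{
		"init",
		"--rootfs", c.RootFS,
		"--hostname", c.Config.Hostname,
		"--",
	}, c.Command...)
	cmd := exec.Command(self, args...)

	// Copy the configured environment so that appending the container ID
	// cannot write into the backing array of c.Config.Env.
	cmd.Env = append(append([]string{}, c.Config.Env...),
		fmt.Sprintf("SHEEP_CONTAINER_ID=%s", c.ID))
	return cmd
}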
os.Executable() returns the path to the current binary. We launch ourselves with the init argument.
Handling "init" in main()
In main(), the first thing we check is whether we're the child process:
func main() {
	if len(os.Args) < 2 {
		printUsage()
		os.Exit(1)
	}
	switch os.Args[1] {
	case "init":
		handleInit()
		return
	case "version":
		fmt.Println("sheep v0.1.0")
		return
	// ... other commands
	}
}
The init command is handled before the container manager is initialized. This matters because the child process is already running in a new namespace and doesn't have access to host directories.
What handleInit() does
func handleInit() {
	var rootfs, hostname string
	var command []string

	// os.Args[0] is the binary and os.Args[1] is "init"; flags start at index 2.
	args := os.Args[2:]
	for i := 0; i < len(args); i++ {
		switch args[i] {
		case "--rootfs":
			i++
			if i < len(args) {
				rootfs = args[i]
			}
		case "--hostname":
			i++
			if i < len(args) {
				hostname = args[i]
			}
		case "--":
			// Everything after "--" is the target command; stop parsing.
			command = args[i+1:]
			i = len(args)
		}
	}
	if rootfs == "" {
		fatal("init: --rootfs is required")
	}
	if err := container.ContainerInit(rootfs, hostname, command); err != nil {
		fatal("init: %v", err)
	}
}
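The parent builds the argument list and the child parses it, so the two sides have to agree on the layout. One way to keep them in sync is to pull both into pure functions that can be unit tested without root or namespaces. The sketch below is an illustration only: initConfig, buildInitArgs, and parseInitArgs are hypothetical names, and the parser is deliberately stricter than handleInit() above (it rejects unknown flags and flags with a missing value instead of skipping them).

```go
package main

import (
	"errors"
	"fmt"
)

// initConfig holds what the child needs to set up the container.
// The type is hypothetical; Sheep passes these values as separate arguments.
type initConfig struct {
	RootFS   string
	Hostname string
	Command  []string
}

// buildInitArgs mirrors the argument layout the parent passes to the child:
// init --rootfs <path> --hostname <name> -- <command...>
func buildInitArgs(cfg initConfig) []string {
	args := []string{"init", "--rootfs", cfg.RootFS, "--hostname", cfg.Hostname, "--"}
	return append(args, cfg.Command...)
}

// parseInitArgs parses everything after "init" (os.Args[2:] in the child).
func parseInitArgs(args []string) (initConfig, error) {
	var cfg initConfig
	for i := 0; i < len(args); i++ {
		switch args[i] {
		case "--rootfs", "--hostname":
			if i+1 >= len(args) {
				return cfg, fmt.Errorf("%s needs a value", args[i])
			}
			if args[i] == "--rootfs" {
				cfg.RootFS = args[i+1]
			} else {
				cfg.Hostname = args[i+1]
			}
			i++ // skip the value we just consumed
		case "--":
			// Everything after "--" is the target command; stop parsing.
			cfg.Command = args[i+1:]
			i = len(args)
		default:
			return cfg, fmt.Errorf("unknown argument %q", args[i])
		}
	}
	if cfg.RootFS == "" {
		return cfg, errors.New("--rootfs is required")
	}
	return cfg, nil
}
```

A round trip such as parseInitArgs(buildInitArgs(cfg)[1:]) should give back the same configuration, which makes a convenient regression test whenever a new flag is added.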
We parse the arguments and call ContainerInit. This function sets everything up and does exec():
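func ContainerInit(rootfs, hostname string, command []string) error {
	if hostname != "" {
		if err := syscall.Sethostname([]byte(hostname)); err != nil {
			return fmt.Errorf("sethostname: %w", err)
		}
	}

	// Mount /proc, /sys, /tmp, /dev inside the new root.
	mounts := []struct {
		source, target, fstype string
		flags                  uintptr
	}{
		{"proc", "proc", "proc", 0},
		{"sysfs", "sys", "sysfs", 0},
		{"tmpfs", "tmp", "tmpfs", 0},
		{"tmpfs", "dev", "tmpfs", syscall.MS_NOSUID | syscall.MS_STRICTATIME},
	}
	for _, m := range mounts {
		target := filepath.Join(rootfs, m.target)
		if err := os.MkdirAll(target, 0755); err != nil {
			return fmt.Errorf("mkdir %s: %w", target, err)
		}
		if err := syscall.Mount(m.source, target, m.fstype, m.flags, ""); err != nil {
			return fmt.Errorf("mount %s: %w", target, err)
		}
	}

	// Project helpers defined elsewhere: create device nodes, then switch the root.
	createDevices(rootfs)
	pivotRoot(rootfs)

	if len(command) == 0 {
		command = []string{"/bin/sh"}
	}
	// Resolve the command against PATH inside the new root;
	// fall back to the name as given if the lookup fails.
	binary, err := exec.LookPath(command[0])
	if err != nil {
		binary = command[0]
	}
	// On success Exec never returns: the process image is replaced.
	return syscall.Exec(binary, command, os.Environ())
}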
The last line, syscall.Exec(), replaces the current process image with the target command. This isn't fork+exec; it's a full replacement. The PID stays the same, but after this call the Go runtime is gone and only the target program is running.
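The "never returns on success" behavior is easy to get wrong, so here is a minimal sketch of the pattern on its own. The helper name replaceProcess is hypothetical. If the function returns at all, the exec did not happen and the caller is still the original Go program.

```go
package main

import (
	"errors"
	"os"
	"os/exec"
	"syscall"
)

// replaceProcess (hypothetical name) swaps the current process image for argv.
// On success it never returns. A returned error means the exec did not
// happen and the calling Go program is still running.
func replaceProcess(argv []string) error {
	if len(argv) == 0 {
		return errors.New("empty command")
	}
	// Resolve argv[0] against PATH, or verify it if it already contains a slash.
	path, err := exec.LookPath(argv[0])
	if err != nil {
		return err
	}
	// Same PID, same open file descriptors (unless close-on-exec), new program.
	return syscall.Exec(path, argv, os.Environ())
}
```

Any cleanup that must happen before the target starts (closing files, flushing logs) has to run before this call, because deferred functions in the caller will not run once the exec succeeds.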
Why not just fork()?
In C you can do fork() and immediately execve(). But Go doesn't give you direct access to fork(), because after a fork the child process has only one thread while the Go runtime needs several.
exec.Command with SysProcAttr.Cloneflags is the supported way to do clone+exec in Go. The standard library creates the child with the requested namespace flags and execs the new binary before any regular Go code runs in the child, so the new process starts with a clean runtime.
Data flow
graph LR
A["sheep run nginx /bin/sh"] --> B["Create container"]
B --> C["reexecCommand()"]
C --> D["sheep init<br/>--rootfs /var/lib/sheep/overlay/abc/merged<br/>--hostname abc123<br/>-- /bin/sh"]
D --> E["ContainerInit()"]
E --> F["syscall.Exec(/bin/sh)"]
That's why when you run ps on the host, you see /bin/sh instead of sheep init -- because Exec() completely replaced the process.
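What about Docker?
Docker solves this same problem differently. containerd launches runc, which is written in Go but uses a trick with nsenter: a small piece of C code that cgo links into the binary as a constructor function, so it runs before the Go runtime starts. runc still re-executes itself (as runc init), but the namespace work happens in C while the process is still single-threaded. It's more complex, but it gives runc full control over namespace setup before any Go code runs.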
In Sheep we chose the simpler approach with re-exec, which works without cgo.
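Try it yourself
# Watch sheep re-exec itself. The "sheep init" name is only visible during
# setup, before Exec() replaces it with the target command.
# The [s] keeps grep from matching its own command line:
ps aux | grep '[s]heep init'
# See the init process arguments (-n picks the newest match if there are several):
cat /proc/$(pgrep -n -f 'sheep init')/cmdline | tr '\0' ' '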
Now let's look at pivot_root -- how a container gets its own filesystem.
