Cgroups v2: Limiting Memory, CPU, and PIDs

Written by:

Igor Gorovyy
DevOps Engineer Lead & Senior Solutions Architect

Namespaces isolate but don't limit. A process in a separate PID namespace can still eat all the machine's memory. For limits you need cgroups (control groups) -- a Linux kernel mechanism for controlling resources.

This is part 4 of the Sheep & Shepherd series. Previous parts: namespaces gave the container its own view of the system, re-exec worked around Go's threading model, and pivot_root gave it its own filesystem root. Next: OverlayFS makes that filesystem cheap to spin up.

Cgroups v1 vs v2

Cgroups v1 had a separate hierarchy for each controller (memory, cpu, pids), so a process could belong to a different group in each one. That was confusing and prone to race conditions. Cgroups v2 uses a single hierarchy where all controllers live under one tree. I use v2.

graph TD
    subgraph "cgroups v2"
        ROOT["/sys/fs/cgroup"]
        SHEEP["sheep/"]
        C1["container_abc123/"]
        C2["container_def456/"]

        ROOT --> SHEEP
        SHEEP --> C1
        SHEEP --> C2

        C1 --> M1["memory.max = 256M"]
        C1 --> P1["pids.max = 100"]
        C1 --> CPU1["cpu.max = 50000 100000"]

        C2 --> M2["memory.max = 512M"]
        C2 --> P2["pids.max = 200"]
        C2 --> CPU2["cpu.max = max 100000"]
    end

How Sheep sets up cgroups

After the container process starts (via the re-exec pattern from part 2), the parent process adds the container's PID to a cgroup:

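import (
    "fmt"
    "os"
    "path/filepath"
    "strconv"
)

const cgroupBase = "/sys/fs/cgroup"

// setupCgroups creates the container's cgroup, moves the process into it,
// and applies the configured limits.
// writeFile is Sheep's helper; it is assumed here to return an error.
func setupCgroups(c *Container, pid int) error {
    parentPath := filepath.Join(cgroupBase, "sheep")
    cgroupPath := filepath.Join(parentPath, c.ID)
    if err := os.MkdirAll(cgroupPath, 0755); err != nil {
        return fmt.Errorf("create cgroup: %w", err)
    }

    // Add process to cgroup
    if err := writeFile(filepath.Join(cgroupPath, "cgroup.procs"),
        strconv.Itoa(pid)); err != nil {
        return fmt.Errorf("add pid %d: %w", pid, err)
    }

    // Enable controllers for the children of sheep/.
    // This fails if the root cgroup has not enabled them for sheep/.
    parentCtrl := filepath.Join(parentPath, "cgroup.subtree_control")
    if err := writeFile(parentCtrl, "+memory +pids +cpu"); err != nil {
        return fmt.Errorf("enable controllers: %w", err)
    }

    // Memory limit in bytes
    if c.Config.Memory > 0 {
        if err := writeFile(filepath.Join(cgroupPath, "memory.max"),
            strconv.FormatInt(c.Config.Memory, 10)); err != nil {
            return fmt.Errorf("set memory.max: %w", err)
        }
    }

    // PIDs limit
    if c.Config.PidsLimit > 0 {
        if err := writeFile(filepath.Join(cgroupPath, "pids.max"),
            strconv.FormatInt(c.Config.PidsLimit, 10)); err != nil {
            return fmt.Errorf("set pids.max: %w", err)
        }
    }

    // CPU quota in microseconds per 100ms period
    if c.Config.CPUQuota > 0 {
        quota := fmt.Sprintf("%d 100000", c.Config.CPUQuota)
        if err := writeFile(filepath.Join(cgroupPath, "cpu.max"), quota); err != nil {
            return fmt.Errorf("set cpu.max: %w", err)
        }
    }

    // CPU weight (shares), clamped to the valid cpu.weight range 1..10000
    if c.Config.CPUShares > 0 {
        weight := (c.Config.CPUShares * 10000) / 262144
        if weight < 1 {
            weight = 1
        }
        if weight > 10000 {
            weight = 10000
        }
        if err := writeFile(filepath.Join(cgroupPath, "cpu.weight"),
            strconv.FormatInt(weight, 10)); err != nil {
            return fmt.Errorf("set cpu.weight: %w", err)
        }
    }

    return nil
}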

Everything goes through files. Cgroups v2 is exposed as a virtual filesystem: write a number to a file and the limit is set.

memory.max -- memory limit

I write the byte count to memory.max. If the container's usage reaches the limit and the kernel cannot reclaim enough memory, the OOM killer kills a process inside the cgroup.

# Start a container with a 256MB limit
sheep run --name test -m 256m minimal /bin/sh
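To find out afterwards whether a container actually hit its limit, the cgroup's memory.events file keeps counters, including oom_kill. The sketch below parses that counter from the file's contents; the function name oomKills is invented for this example, and Sheep itself does not read this file in the code shown here.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// oomKills extracts the oom_kill counter from the contents of a cgroup's
// memory.events file, which consists of "key value" lines.
// Hypothetical helper, not part of Sheep.
func oomKills(events string) (int64, error) {
	sc := bufio.NewScanner(strings.NewReader(events))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && fields[0] == "oom_kill" {
			return strconv.ParseInt(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("oom_kill counter not found")
}
```

Feeding it the result of os.ReadFile on /sys/fs/cgroup/sheep/<id>/memory.events before the cgroup is removed would tell you whether a container exited because of the memory limit.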

The CLI parses suffixes:

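// parseMemory converts a size such as "256m" or "1G" into bytes.
// A number without a suffix is taken as bytes. Needs fmt and strconv.
func parseMemory(s string) (int64, error) {
    multiplier := int64(1)
    num := s
    if n := len(s); n > 0 {
        switch s[n-1] {
        case 'g', 'G':
            multiplier = 1024 * 1024 * 1024
        case 'm', 'M':
            multiplier = 1024 * 1024
        case 'k', 'K':
            multiplier = 1024
        }
        if multiplier != 1 {
            num = s[:n-1]
        }
    }
    // Report bad input instead of silently returning 0, which
    // setupCgroups would treat as "no limit".
    v, err := strconv.ParseInt(num, 10, 64)
    if err != nil || v < 0 {
        return 0, fmt.Errorf("invalid memory size %q", s)
    }
    return v * multiplier, nil
}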

pids.max -- process limit

A fork bomb is a classic attack where a process endlessly spawns child processes. pids.max caps the number of tasks (processes and threads) in a cgroup. Once the limit is reached, fork() and clone() fail with EAGAIN.

sheep run --pids-limit 100 minimal /bin/sh
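When reading the limit back, note that pids.max holds either a decimal number or the literal string max (no limit), so a plain integer parse is not enough. A small sketch of a reader that handles both; parseLimit is a hypothetical name, not something Sheep defines.

```go
package main

import (
	"strconv"
	"strings"
)

// parseLimit parses the contents of a limit file such as pids.max.
// The kernel reports either a decimal number or the literal "max".
// Hypothetical helper, not part of Sheep.
func parseLimit(content string) (limit int64, unlimited bool, err error) {
	s := strings.TrimSpace(content)
	if s == "max" {
		return 0, true, nil
	}
	limit, err = strconv.ParseInt(s, 10, 64)
	return limit, false, err
}
```

Comparing the parsed limit with the number in pids.current shows how close a container is to its cap.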

cpu.max -- CPU quota

Format: $QUOTA $PERIOD, both in microseconds. With cpu.max = 50000 100000, the container gets 50ms of CPU time out of every 100ms, which is 50% of one core. A quota of max means no limit.

sheep run --cpu-quota 50000 minimal /bin/sh
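Raw microseconds are easy to get wrong, so a runtime could accept a number of cores instead and derive the quota from it. The sketch below does that arithmetic for the fixed 100ms period used in this article; cpuMaxValue and the idea of a cores-based option are assumptions for illustration, since Sheep's CLI takes the quota directly.

```go
package main

import "fmt"

// cpuPeriodUs is the 100ms period the article uses, in microseconds.
const cpuPeriodUs = 100000

// cpuMaxValue renders the string to write to cpu.max for a limit given
// in cores (0.5 = half a core, 2 = two full cores). A non-positive value
// is treated as "no limit". Hypothetical helper, not part of Sheep.
func cpuMaxValue(cores float64) string {
	if cores <= 0 {
		return fmt.Sprintf("max %d", cpuPeriodUs)
	}
	quota := int64(cores * cpuPeriodUs)
	return fmt.Sprintf("%d %d", quota, cpuPeriodUs)
}
```

With this, cpuMaxValue(0.5) yields the same 50000 100000 as the example above, and a quota larger than the period (for instance 200000 100000) allows more than one core.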

cpu.weight -- relative weight

Docker uses --cpu-shares (from 2 to 262144, default 1024). Cgroups v2 uses cpu.weight (from 1 to 10000, default 100). A conversion is needed:

weight := (c.Config.CPUShares * 10000) / 262144

Weight only matters when there's contention for CPU. If the machine is idle, a container with weight 1 gets just as much CPU as a container with weight 10000.
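One property of the simple proportional formula is worth seeing with numbers: very small share values round down to 0 and need clamping, and Docker's default of 1024 shares lands at weight 39, below the cgroup v2 default of 100. The sketch below is an alternative linear mapping that sends the ends of one range exactly onto the ends of the other. It is an illustration, not Sheep's code, and sharesToWeight is a made-up name.

```go
package main

// sharesToWeight maps cpu.shares in [2, 262144] linearly onto cpu.weight
// in [1, 10000], clamping out-of-range input first.
// Hypothetical alternative to the proportional formula in the article.
func sharesToWeight(shares int64) int64 {
	const (
		minShares, maxShares = 2, 262144
		minWeight, maxWeight = 1, 10000
	)
	if shares < minShares {
		shares = minShares
	}
	if shares > maxShares {
		shares = maxShares
	}
	return minWeight + (shares-minShares)*(maxWeight-minWeight)/(maxShares-minShares)
}
```

Here 2 shares map to weight 1, 262144 shares map to 10000, and 1024 shares still map to 39. Because weights are relative, that only matters when Sheep containers compete with cgroups that were left at the default weight of 100.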

Cleanup after stop

When a container stops, I remove its cgroup:

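// cleanupCgroups removes the container's cgroup. A cgroup directory is
// removed with rmdir, which only succeeds once no processes are left in it.
func cleanupCgroups(c *Container) error {
    cgroupPath := filepath.Join(cgroupBase, "sheep", c.ID)
    return os.Remove(cgroupPath)
}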

subtree_control -- here's a gotcha

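Cgroups v2 requires you to explicitly enable controllers for child cgroups. Without writing +memory +pids +cpu to the parent cgroup's cgroup.subtree_control, the memory.max, pids.max, and cpu.max files won't appear in the children. The same rule applies one level up: sheep/ can only pass on controllers that the root cgroup has enabled for it.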

This is a common pitfall when working with cgroups v2 -- everything runs, but limits don't apply because you forgot to enable the controllers.
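One way to catch this early is to read the child's cgroup.controllers file, which lists the controllers available to that cgroup, and compare it with what you need. A minimal sketch, assuming the file's contents have already been read into a string; missingControllers is a name invented for this example.

```go
package main

import "strings"

// missingControllers returns the wanted controllers that are absent from
// the contents of a cgroup's cgroup.controllers file, which is a
// space-separated list of names. Hypothetical helper, not part of Sheep.
func missingControllers(available string, want ...string) []string {
	have := make(map[string]bool)
	for _, name := range strings.Fields(available) {
		have[name] = true
	}
	var missing []string
	for _, name := range want {
		if !have[name] {
			missing = append(missing, name)
		}
	}
	return missing
}
```

Checking missingControllers(contents, "memory", "pids", "cpu") right after creating the container's cgroup turns a silent "limits don't apply" into an explicit error.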

What's missing

My implementation is simplified. Docker also has:

- memory.swap.max for limiting swap
- cpu.max with burst for short-lived spikes
- io.max for limiting disk I/O
- OOM score adjustment

For a learning project, three controllers are enough.

Try it yourself

# Start a container with a memory limit:
sudo ./sheep run --name cg-test -m 128m --pids-limit 50 minimal /bin/sh
# Check the cgroup on the host:
cat /sys/fs/cgroup/sheep/*/memory.max
cat /sys/fs/cgroup/sheep/*/pids.max

Resources are limited. Next up -- OverlayFS, which gives each container its own filesystem without copying gigabytes (on top of the pivot_root we built in part 3).
