Cgroups v2: Limiting Memory, CPU, and PIDs
Written by:
Igor Gorovyy
DevOps Engineer Lead & Senior Solutions Architect
Namespaces isolate but don't limit. A process in a separate PID namespace can still eat all the machine's memory. For limits you need cgroups (control groups) -- a Linux kernel mechanism for controlling resources.
This is part 4 of the Sheep & Shepherd series. Previous parts: namespaces gave the container its own view of the system, re-exec worked around Go's threading model, and pivot_root gave it its own filesystem root. Next: OverlayFS makes that filesystem cheap to spin up.
Cgroups v1 vs v2
Cgroups v1 had a separate hierarchy for each controller (memory, cpu, pids), so the same process could sit in a different cgroup in each one. That was confusing and prone to race conditions. Cgroups v2 uses a single hierarchy where all controllers live under one tree. I use v2.
graph TD
subgraph "cgroups v2"
ROOT["/sys/fs/cgroup"]
SHEEP["sheep/"]
C1["container_abc123/"]
C2["container_def456/"]
ROOT --> SHEEP
SHEEP --> C1
SHEEP --> C2
C1 --> M1["memory.max = 256M"]
C1 --> P1["pids.max = 100"]
C1 --> CPU1["cpu.max = 50000 100000"]
C2 --> M2["memory.max = 512M"]
C2 --> P2["pids.max = 200"]
C2 --> CPU2["cpu.max = max 100000"]
end
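Before writing anything under /sys/fs/cgroup, it helps to confirm that the host actually runs the unified hierarchy. A minimal sketch, assuming the usual mount point: the root of a v2 mount contains a cgroup.controllers file, which a v1 mount point lacks. The helper name isCgroupV2 is made up for this example and is not part of Sheep.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// isCgroupV2 reports whether root looks like a cgroup v2 (unified) mount.
// The v2 root exposes a cgroup.controllers file; a v1 mount point does not.
// Hypothetical helper for illustration, not taken from Sheep.
func isCgroupV2(root string) bool {
	info, err := os.Stat(filepath.Join(root, "cgroup.controllers"))
	return err == nil && !info.IsDir()
}

func main() {
	fmt.Println("cgroup v2:", isCgroupV2("/sys/fs/cgroup"))
}
```

Failing early with a clear message is friendlier than letting later writes to memory.max fail on a v1 host.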
How Sheep sets up cgroups
After the container process starts (via the re-exec pattern from part 2), the parent process adds the container's PID to a cgroup:
const cgroupBase = "/sys/fs/cgroup"
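// setupCgroups creates the container's cgroup, applies the limits, and then
// moves the container process into it. writeFile is assumed to return an error.
func setupCgroups(c *Container, pid int) error {
	parentPath := filepath.Join(cgroupBase, "sheep")
	cgroupPath := filepath.Join(parentPath, c.ID)
	if err := os.MkdirAll(cgroupPath, 0755); err != nil {
		return fmt.Errorf("create cgroup: %w", err)
	}
	// Enable controllers for the children of sheep/ first. Until this
	// succeeds, memory.max, pids.max, and cpu.* don't exist in cgroupPath.
	parentCtrl := filepath.Join(parentPath, "cgroup.subtree_control")
	if err := writeFile(parentCtrl, "+memory +pids +cpu"); err != nil {
		return fmt.Errorf("enable controllers: %w", err)
	}
	// Memory limit in bytes
	if c.Config.Memory > 0 {
		if err := writeFile(filepath.Join(cgroupPath, "memory.max"),
			strconv.FormatInt(c.Config.Memory, 10)); err != nil {
			return fmt.Errorf("set memory.max: %w", err)
		}
	}
	// PIDs limit
	if c.Config.PidsLimit > 0 {
		if err := writeFile(filepath.Join(cgroupPath, "pids.max"),
			strconv.FormatInt(c.Config.PidsLimit, 10)); err != nil {
			return fmt.Errorf("set pids.max: %w", err)
		}
	}
	// CPU quota: microseconds of CPU time per 100000-microsecond period
	if c.Config.CPUQuota > 0 {
		quota := fmt.Sprintf("%d 100000", c.Config.CPUQuota)
		if err := writeFile(filepath.Join(cgroupPath, "cpu.max"), quota); err != nil {
			return fmt.Errorf("set cpu.max: %w", err)
		}
	}
	// CPU weight (shares), clamped to the 1..10000 range cpu.weight accepts
	if c.Config.CPUShares > 0 {
		weight := (c.Config.CPUShares * 10000) / 262144
		if weight < 1 {
			weight = 1
		} else if weight > 10000 {
			weight = 10000
		}
		if err := writeFile(filepath.Join(cgroupPath, "cpu.weight"),
			strconv.FormatInt(weight, 10)); err != nil {
			return fmt.Errorf("set cpu.weight: %w", err)
		}
	}
	// Add the process last, so the limits are in place before it joins
	if err := writeFile(filepath.Join(cgroupPath, "cgroup.procs"),
		strconv.Itoa(pid)); err != nil {
		return fmt.Errorf("add pid %d to cgroup: %w", pid, err)
	}
	return nil
}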
Everything through files. Cgroups v2 is a virtual filesystem. Write a number to a file -- set a limit.
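The writeFile helper isn't shown in this post. A minimal sketch of what it could look like, assuming it takes a path and a value and reports failures: it opens the file without creating it, because cgroup interface files are created by the kernel, and a missing file usually means the controller isn't enabled.

```go
package main

import (
	"fmt"
	"os"
)

// writeFile writes value to an existing cgroup interface file.
// Assumed shape of the helper; the real one in Sheep may differ.
func writeFile(path, value string) error {
	// O_WRONLY without O_CREATE: never create the file ourselves.
	f, err := os.OpenFile(path, os.O_WRONLY, 0)
	if err != nil {
		return fmt.Errorf("open %s: %w", path, err)
	}
	defer f.Close()
	if _, err := f.WriteString(value); err != nil {
		return fmt.Errorf("write %q to %s: %w", value, path, err)
	}
	return nil
}

func main() {
	// Illustrative path; without root and an existing cgroup this prints an error.
	err := writeFile("/sys/fs/cgroup/sheep/demo/memory.max", "268435456")
	fmt.Println("result:", err)
}
```

Returning the error matters here: the kernel rejects invalid values at write time, so the write is the only place where a bad limit becomes visible.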
memory.max -- memory limit
I write the byte count to memory.max. When the container's usage reaches the limit, the kernel first tries to reclaim memory. If that isn't enough, the OOM killer kills a process inside the cgroup.
# Start a container with a 256MB limit
sheep run --name test -m 256m minimal /bin/sh
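To tell whether a container was OOM-killed rather than exiting on its own, the cgroup's memory.events file keeps counters. A sketch, assuming the flat-keyed format (one "key value" pair per line) and using sample contents instead of a real file; parseMemoryEvents is a hypothetical name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryEvents parses the flat-keyed contents of memory.events
// ("key value" per line) into a map. Malformed lines are skipped.
func parseMemoryEvents(data string) map[string]uint64 {
	events := make(map[string]uint64)
	for _, line := range strings.Split(strings.TrimSpace(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		n, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		events[fields[0]] = n
	}
	return events
}

func main() {
	// Sample contents; on a real host, read memory.events from the container's cgroup.
	sample := "low 0\nhigh 0\nmax 12\noom 1\noom_kill 1\n"
	ev := parseMemoryEvents(sample)
	fmt.Printf("hit the limit %d times, oom_kill=%d\n", ev["max"], ev["oom_kill"])
}
```

A non-zero oom_kill counter means at least one process in that cgroup was killed by the OOM killer.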
The CLI parses suffixes:
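// parseMemory converts a size such as "256m" or "1G" into bytes.
// Suffixes k, m, g (either case) are powers of 1024; no suffix means bytes.
// It returns an error instead of silently treating bad input as 0 (no limit).
func parseMemory(s string) (int64, error) {
	num := s
	multiplier := int64(1)
	switch {
	case strings.HasSuffix(s, "g"), strings.HasSuffix(s, "G"):
		multiplier = 1024 * 1024 * 1024
	case strings.HasSuffix(s, "m"), strings.HasSuffix(s, "M"):
		multiplier = 1024 * 1024
	case strings.HasSuffix(s, "k"), strings.HasSuffix(s, "K"):
		multiplier = 1024
	}
	if multiplier > 1 {
		num = s[:len(s)-1] // drop the suffix
	}
	v, err := strconv.ParseInt(num, 10, 64)
	if err != nil || v < 0 {
		return 0, fmt.Errorf("invalid memory size %q", s)
	}
	return v * multiplier, nil
}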
pids.max -- process limit
A fork bomb is a classic attack where a process endlessly spawns child processes. pids.max caps the number of tasks (processes and threads) in a cgroup. When the limit is reached, fork() and clone() fail with EAGAIN.
sheep run --pids-limit 100 minimal /bin/sh
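To see how close a container is to its limit, compare pids.current with pids.max. A small sketch, assuming the raw file contents are passed in as strings; pids.max holds either a number or the literal "max" (no limit). The name pidsHeadroom is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pidsHeadroom returns how many more tasks the cgroup may create, given the
// raw contents of pids.current and pids.max. unlimited is true when pids.max
// contains "max". The result is clamped at 0 because pids.current can exceed
// pids.max (for example, when the limit is lowered below current usage).
func pidsHeadroom(current, limit string) (left int64, unlimited bool, err error) {
	cur, err := strconv.ParseInt(strings.TrimSpace(current), 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("parse pids.current: %w", err)
	}
	limit = strings.TrimSpace(limit)
	if limit == "max" {
		return 0, true, nil
	}
	lim, err := strconv.ParseInt(limit, 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("parse pids.max: %w", err)
	}
	if lim < cur {
		return 0, false, nil
	}
	return lim - cur, false, nil
}

func main() {
	// Sample values; on a real host, read both files from the container's cgroup.
	left, unlimited, err := pidsHeadroom("37\n", "100\n")
	fmt.Println(left, unlimited, err)
}
```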
cpu.max -- CPU quota
Format: $QUOTA $PERIOD in microseconds. If cpu.max = 50000 100000, the container gets 50ms out of every 100ms -- that's 50% of one core.
sheep run --cpu-quota 50000 minimal /bin/sh
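The quota is easier to reason about as a number of cores. A sketch of the arithmetic, assuming a hypothetical helper cpuMaxValue that is not part of Sheep: quota = cores × period, and a quota larger than the period allows more than one core.

```go
package main

import (
	"fmt"
	"math"
)

// cpuMaxValue formats a cpu.max line for the given number of cores.
// cores <= 0 means "no limit" and yields "max <period>".
// periodUs is the period in microseconds (100000 in this post).
func cpuMaxValue(cores float64, periodUs int64) string {
	if cores <= 0 {
		return fmt.Sprintf("max %d", periodUs)
	}
	quota := int64(math.Round(cores * float64(periodUs)))
	return fmt.Sprintf("%d %d", quota, periodUs)
}

func main() {
	fmt.Println(cpuMaxValue(0.5, 100000)) // half a core
	fmt.Println(cpuMaxValue(2, 100000))   // two full cores
	fmt.Println(cpuMaxValue(0, 100000))   // no limit
}
```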
cpu.weight -- relative weight
Docker uses --cpu-shares (from 2 to 262144). Cgroups v2 uses cpu.weight (from 1 to 10000). A conversion is needed:
weight := (c.Config.CPUShares * 10000) / 262144
Weight only matters when there's contention for CPU. If the machine is idle, a container with weight 1 gets just as much CPU as a container with weight 10000.
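The conversion is worth checking at the edges. The sketch below compares the proportional formula from this post (with clamping added) against a linear mapping that pins both endpoints exactly, 2 -> 1 and 262144 -> 10000. Both function names are made up for the example. Note that either way Docker's default of 1024 shares lands on 39, below the cgroups v2 default weight of 100.

```go
package main

import "fmt"

// sharesToWeightSimple is the proportional mapping used in this post,
// clamped to the range cpu.weight accepts (1..10000).
func sharesToWeightSimple(shares int64) int64 {
	w := shares * 10000 / 262144
	if w < 1 {
		return 1
	}
	if w > 10000 {
		return 10000
	}
	return w
}

// sharesToWeightLinear maps [2, 262144] linearly onto [1, 10000],
// so both endpoints are hit exactly. Out-of-range input is clamped first.
func sharesToWeightLinear(shares int64) int64 {
	if shares < 2 {
		shares = 2
	}
	if shares > 262144 {
		shares = 262144
	}
	return 1 + (shares-2)*9999/262142
}

func main() {
	for _, s := range []int64{2, 1024, 262144} {
		fmt.Printf("shares=%d simple=%d linear=%d\n",
			s, sharesToWeightSimple(s), sharesToWeightLinear(s))
	}
}
```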
Cleanup after stop
When a container stops, I remove its cgroup:
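// cleanupCgroups removes the container's cgroup. The kernel only allows this
// once the cgroup has no processes left, so call it after the container exits.
func cleanupCgroups(c *Container) error {
	cgroupPath := filepath.Join(cgroupBase, "sheep", c.ID)
	// A cgroup goes away with a plain rmdir; its interface files can't be
	// unlinked one by one, so os.Remove is enough and surfaces EBUSY.
	return os.Remove(cgroupPath)
}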
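Removal fails with EBUSY while any process is still in the cgroup. A more defensive sketch, assuming a kernel with cgroup.kill (Linux 5.14+) and a hypothetical helper removeCgroup: kill whatever is left, then retry the rmdir a few times while the kernel reaps the processes.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// removeCgroup kills any remaining processes in the cgroup and removes its
// directory, retrying because process teardown is asynchronous.
// Hypothetical helper; not the code Sheep uses.
func removeCgroup(path string, attempts int, delay time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	// Writing "1" to cgroup.kill sends SIGKILL to every process in the cgroup.
	// On kernels without the file the open fails and the step is skipped.
	if f, err := os.OpenFile(filepath.Join(path, "cgroup.kill"), os.O_WRONLY, 0); err == nil {
		_, _ = f.WriteString("1")
		f.Close()
	}
	var err error
	for i := 0; i < attempts; i++ {
		// os.Remove falls back to rmdir(2) for directories.
		err = os.Remove(path)
		if err == nil || errors.Is(err, os.ErrNotExist) {
			return nil
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("remove cgroup %s: %w", path, err)
}

func main() {
	// Illustrative path; an already-missing cgroup counts as success.
	err := removeCgroup("/sys/fs/cgroup/sheep/demo", 5, 10*time.Millisecond)
	fmt.Println("result:", err)
}
```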
subtree_control -- here's a gotcha
Cgroups v2 requires you to explicitly enable controllers for child cgroups. Without writing +memory +pids +cpu to the parent cgroup's cgroup.subtree_control, the memory.max, pids.max, and cpu.max files won't appear in the children.
This is a common pitfall when working with cgroups v2: the writes to the missing files fail, and if you ignore those errors, everything runs but no limits apply.
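A quick way to catch this is to compare the controllers you need against what a cgroup.subtree_control file actually lists. A sketch with a hypothetical missingControllers helper. The check has to pass at every level: the root's cgroup.subtree_control must list the controllers before sheep/ can enable them for its own children.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// missingControllers returns the controllers from want that are absent from
// enabled, the space-separated list read from cgroup.controllers or
// cgroup.subtree_control.
func missingControllers(enabled string, want []string) []string {
	have := make(map[string]bool)
	for _, c := range strings.Fields(enabled) {
		have[c] = true
	}
	var missing []string
	for _, c := range want {
		if !have[c] {
			missing = append(missing, c)
		}
	}
	return missing
}

func main() {
	data, err := os.ReadFile("/sys/fs/cgroup/sheep/cgroup.subtree_control")
	if err != nil {
		fmt.Println("cannot read subtree_control:", err)
		return
	}
	missing := missingControllers(string(data), []string{"memory", "pids", "cpu"})
	fmt.Println("missing controllers:", missing)
}
```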
What's missing
My implementation is simplified. Docker also has:
- memory.swap.max for limiting swap
- cpu.max.burst for short-lived spikes
- io.max for limiting disk I/O
- OOM score adjustment
For a learning project, three controllers are enough.
Try it yourself
# Start a container with memory and PID limits:
sudo ./sheep run --name cg-test -m 128m --pids-limit 50 minimal /bin/sh
# Check the cgroup on the host:
cat /sys/fs/cgroup/sheep/*/memory.max
cat /sys/fs/cgroup/sheep/*/pids.max
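You can also check from the process side which cgroup it landed in. On a v2 host, /proc/<pid>/cgroup contains a single entry of the form 0::<path>. A sketch with a hypothetical cgroupV2Path helper. If the container runs in its own cgroup namespace, the path is shown relative to that namespace, so reading the host's /proc/<pid>/cgroup gives the clearer answer.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// cgroupV2Path extracts the cgroup path from the contents of /proc/<pid>/cgroup.
// The v2 entry has the form "0::<path>"; ok is false if no such line exists.
func cgroupV2Path(procCgroup string) (path string, ok bool) {
	for _, line := range strings.Split(procCgroup, "\n") {
		if strings.HasPrefix(line, "0::") {
			return strings.TrimSpace(strings.TrimPrefix(line, "0::")), true
		}
	}
	return "", false
}

func main() {
	data, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		fmt.Println("cannot read /proc/self/cgroup:", err)
		return
	}
	path, ok := cgroupV2Path(string(data))
	fmt.Println(path, ok)
}
```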
Resources are limited. Next up -- OverlayFS, which gives each container its own filesystem without copying gigabytes (on top of the pivot_root we built in part 3).
