Scheduler: How to Pick a Node for a Pod
Written by:
Igor Gorovyy
DevOps Engineer Lead & Senior Solutions Architect
When a pod is created, it's in a Pending state with no assigned node. The Scheduler has to find the best node for it. In Shepherd, this works in two stages: filtering (eliminate unfit nodes) and scoring (pick the best one).
graph LR
A["All nodes"] --> B["Filter<br/>Ready? Heartbeat?<br/>Resources? Labels?"]
B --> C["Feasible nodes"]
C --> D["Score<br/>Least-loaded?<br/>Resource balance?"]
D --> E["Selected node"]
Reconciliation Loop
The Scheduler runs a ticker-driven loop that fires every 2 seconds:
func (s *Scheduler) Run(stopCh <-chan struct{}) {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stopCh:
			return
		case <-ticker.C:
			s.reconcile()
		}
	}
}

func (s *Scheduler) reconcile() {
	pods, err := s.store.ListPods("")
	if err != nil {
		return // store unavailable; try again on the next tick
	}
	for _, pod := range pods {
		// Only Pending pods without a node still need scheduling.
		if pod.Status.Phase == PodPending && pod.Spec.NodeName == "" {
			s.SchedulePod(pod)
		}
	}
}
We look for all pods in Pending state without an assigned node. For each one, we call SchedulePod.
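The selection rule is easier to test when it is pulled out of the loop into a pure function. The sketch below uses simplified stand-in types (flat `Pod` fields, the helper names `needsScheduling` and `unscheduled`), all of which are assumptions for illustration rather than Shepherd's actual API.

```go
package main

// PodPhase and Pod are minimal stand-ins for the Shepherd types;
// the flat field layout is an assumption made for this sketch.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodRunning PodPhase = "Running"
)

type Pod struct {
	Name     string
	NodeName string
	Phase    PodPhase
}

// needsScheduling reports whether a reconcile pass should act on a pod:
// it must be Pending and not yet bound to a node.
func needsScheduling(p Pod) bool {
	return p.Phase == PodPending && p.NodeName == ""
}

// unscheduled returns the pods that a reconcile pass would schedule.
func unscheduled(pods []Pod) []Pod {
	var out []Pod
	for _, p := range pods {
		if needsScheduling(p) {
			out = append(out, p)
		}
	}
	return out
}
```

A pod that is still Pending but already has a node name is skipped, so a pod waiting for its agent is not rescheduled on every tick.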
Filtering Nodes
func (s *Scheduler) filterNodes(nodes []*Node, pod *Pod) []*Node {
	// The pod's requests don't depend on the node, so compute them once.
	totalReq := podResourceRequests(pod)

	var feasible []*Node
	for _, node := range nodes {
		// Node must be Ready
		if node.Status.Condition != NodeReady {
			continue
		}
		// Heartbeat must be fresh (< 30 seconds)
		if time.Since(node.Status.LastHeartbeat) > 30*time.Second {
			continue
		}
		// NodeSelector must match
		if !matchLabels(node.Metadata.Labels, pod.Spec.NodeSelector) {
			continue
		}
		// Resource check: requests must fit within allocatable capacity
		alloc := node.Status.Allocatable
		if totalReq.CPU > 0 && totalReq.CPU > alloc.CPU {
			continue
		}
		if totalReq.Memory > 0 && totalReq.Memory > alloc.Memory {
			continue
		}
		// Pod-count limit
		if node.Status.PodCount >= alloc.Pods {
			continue
		}
		feasible = append(feasible, node)
	}
	return feasible
}
Four checks:
1. Node is Ready (not down)
2. Heartbeat is fresh (agent is responding)
3. Labels match the pod's nodeSelector
4. Enough resources (CPU, memory, pod count)
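When a pod ends up with "no feasible nodes", it helps to know which check rejected each node. Below is a minimal sketch of the same four checks returning a reason string instead of silently skipping. The flattened `Node` and `Pod` types, the `rejectReason` name, the reason strings, and the `at` helper are all hypothetical; the time is passed in so the heartbeat check can be tested without a real clock.

```go
package main

import "time"

// Flattened stand-ins for the Shepherd types (assumed for this sketch).
type Node struct {
	Name          string
	Ready         bool
	LastHeartbeat time.Time
	Labels        map[string]string
	AllocCPU      int64
	AllocMemory   int64
	AllocPods     int
	PodCount      int
}

type Pod struct {
	NodeSelector map[string]string
	CPU          int64
	Memory       int64
}

const heartbeatTimeout = 30 * time.Second

// at builds a timestamp from Unix seconds; handy for examples and tests.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// rejectReason returns "" if the node passes every check, otherwise a
// short description of the first check that failed.
func rejectReason(n Node, p Pod, now time.Time) string {
	if !n.Ready {
		return "node not ready"
	}
	if now.Sub(n.LastHeartbeat) > heartbeatTimeout {
		return "stale heartbeat"
	}
	for k, v := range p.NodeSelector {
		if got, ok := n.Labels[k]; !ok || got != v {
			return "node selector mismatch"
		}
	}
	if p.CPU > n.AllocCPU {
		return "insufficient cpu"
	}
	if p.Memory > n.AllocMemory {
		return "insufficient memory"
	}
	if n.PodCount >= n.AllocPods {
		return "pod limit reached"
	}
	return ""
}
```

Collecting these reasons per node would let the "no feasible nodes" message say, for example, how many nodes failed on CPU and how many on labels.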
Scoring
func (s *Scheduler) scoreNodes(nodes []*Node, pod *Pod) []scoredNode {
	req := podResourceRequests(pod)
	scored := make([]scoredNode, len(nodes))
	for i, node := range nodes {
		alloc := node.Status.Allocatable
		score := 0
		// Prefer nodes with more free pod slots
		score += (alloc.Pods - node.Status.PodCount) * 10
		// Prefer nodes where this pod leaves a larger share of
		// allocatable CPU free
		if alloc.CPU > 0 {
			freeRatio := float64(alloc.CPU-req.CPU) / float64(alloc.CPU)
			score += int(freeRatio * 50)
		}
		// Same for memory
		if alloc.Memory > 0 {
			freeRatio := float64(alloc.Memory-req.Memory) / float64(alloc.Memory)
			score += int(freeRatio * 50)
		}
		scored[i] = scoredNode{node: node, score: score}
	}
	return scored
}
Two criteria:
- Least-loaded -- fewer pods = higher score
- Resource balance -- more free resources = higher score
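To see how the two criteria interact, here is the same arithmetic as a standalone function applied to two made-up nodes. The flat `Node` and `Requests` types and the numbers are assumptions for illustration only.

```go
package main

// Flat stand-ins for the node status and pod requests (assumed).
type Node struct {
	AllocCPU    int64 // millicores
	AllocMemory int64 // MiB
	AllocPods   int
	PodCount    int
}

type Requests struct {
	CPU    int64
	Memory int64
}

// score mirrors the formula from scoreNodes: 10 points per free pod
// slot, plus up to 50 points each for the share of CPU and memory
// that stays free once this pod is placed.
func score(n Node, req Requests) int {
	s := (n.AllocPods - n.PodCount) * 10
	if n.AllocCPU > 0 {
		s += int(float64(n.AllocCPU-req.CPU) / float64(n.AllocCPU) * 50)
	}
	if n.AllocMemory > 0 {
		s += int(float64(n.AllocMemory-req.Memory) / float64(n.AllocMemory) * 50)
	}
	return s
}
```

For a pod requesting 500 millicores and 1024 MiB, a large node (4000 millicores, 8192 MiB, 8 of 10 pod slots used) scores 20 + 43 + 43 = 106, while a small node (2000 millicores, 4096 MiB, 2 of 10 slots used) scores 80 + 37 + 37 = 154. The pod-slot term is unbounded while each resource term tops out at 50, so with these weights the pod count tends to dominate.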
Assigning a Pod
func (s *Scheduler) SchedulePod(pod *Pod) {
	nodes, err := s.store.ListNodes()
	if err != nil {
		return // pod stays Pending; the next tick retries
	}
	feasible := s.filterNodes(nodes, pod)
	if len(feasible) == 0 {
		pod.Status.Message = "no feasible nodes"
		s.store.UpdatePod(pod)
		return
	}
	scored := s.scoreNodes(feasible, pod)
	// Highest score first
	sort.Slice(scored, func(i, j int) bool {
		return scored[i].score > scored[j].score
	})
	selected := scored[0].node
	pod.Spec.NodeName = selected.Metadata.Name
	pod.Status.Phase = PodPending // still Pending!
	pod.Status.Message = fmt.Sprintf(
		"scheduled to node %s", selected.Metadata.Name)
	pod.Status.HostIP = selected.Spec.Address
	s.store.UpdatePod(pod)
	s.store.RecordEvent(Event{
		Type:   "Normal",
		Reason: "Scheduled",
		Message: fmt.Sprintf("Pod %s scheduled to node %s",
			pod.Metadata.Name, selected.Metadata.Name),
	})
}
Note: the pod stays Pending even after scheduling. It only becomes Running when the agent on the node actually starts the containers. This is eventual consistency.
matchLabels
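func matchLabels(nodeLabels, selector map[string]string) bool {
	if len(selector) == 0 {
		return true
	}
	for k, v := range selector {
		// Check presence explicitly: a missing key reads as "", which
		// would otherwise satisfy a selector with an empty value.
		got, ok := nodeLabels[k]
		if !ok || got != v {
			return false
		}
	}
	return true
}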
Every key/value pair in the selector must be present on the node. An empty selector matches any node.
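A few concrete cases make the rules easier to check. The sketch below is a self-contained variant that tests key presence explicitly with the `value, ok` form, so a selector with an empty value does not match a node that lacks the label; the label names are made up.

```go
package main

// matchLabels reports whether every key/value pair in selector is
// present in nodeLabels. An empty or nil selector matches any node.
func matchLabels(nodeLabels, selector map[string]string) bool {
	for k, v := range selector {
		got, ok := nodeLabels[k]
		if !ok || got != v {
			return false
		}
	}
	return true
}

// Example label sets used in the cases below (hypothetical values).
var (
	ssdNode   = map[string]string{"disk": "ssd", "zone": "a"}
	plainNode = map[string]string{"zone": "a"}
)
```

With these label sets, a selector `{"disk": "ssd"}` matches `ssdNode` but not `plainNode`, `{"disk": "ssd", "zone": "b"}` matches neither, a nil selector matches both, and `{"disk": ""}` matches neither because the label has to exist with exactly that value.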
Comparison with Real Kubernetes
The K8s scheduler is far more complex:
- Dozens of predicates (our filters)
- Scoring with configurable weights
- Preemption (evicting lower-priority pods)
- Pod affinity/anti-affinity
- Taints and tolerations
But the base model is the same: filter, score, bind.
Something to Keep in Mind
Scoring doesn't account for already-scheduled but not-yet-running pods. If 10 pods are created at the same time, they can all land on the same node (it looks like the least loaded at the time of scheduling). In Kubernetes, reserved resources handle this.
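One way to close this gap is for the scheduler to remember what it has already promised to each node and subtract that from what the node reports. The following is a minimal in-memory sketch, not how Shepherd or Kubernetes implements it: the `Resources` and `reservations` types and their methods are hypothetical, and it assumes a single scheduling goroutine, so there is no locking.

```go
package main

// Resources is a hypothetical bundle of the quantities being tracked.
type Resources struct {
	CPU    int64 // millicores
	Memory int64 // bytes
	Pods   int
}

// reservations maps a node name to what has been assigned to it but
// is not yet reflected in the node's reported status.
type reservations map[string]Resources

// reserve records one more pod with the given requests on the node.
func (r reservations) reserve(node string, req Resources) {
	cur := r[node]
	cur.CPU += req.CPU
	cur.Memory += req.Memory
	cur.Pods++
	r[node] = cur
}

// release drops one pod's reservation, for example once the agent
// reports the pod as Running or the pod is deleted.
func (r reservations) release(node string, req Resources) {
	cur, ok := r[node]
	if !ok {
		return
	}
	cur.CPU -= req.CPU
	cur.Memory -= req.Memory
	cur.Pods--
	if cur.Pods <= 0 {
		delete(r, node)
		return
	}
	r[node] = cur
}

// remaining subtracts the node's reservations from its reported free
// capacity; filtering and scoring would use this instead of raw status.
func (r reservations) remaining(node string, reported Resources) Resources {
	held := r[node]
	return Resources{
		CPU:    reported.CPU - held.CPU,
		Memory: reported.Memory - held.Memory,
		Pods:   reported.Pods - held.Pods,
	}
}
```

With this in place, the second and third pod of a burst see a node that is already partly taken, so they stop piling onto whichever node looked emptiest at the first decision.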
Two more traps. The 30-second heartbeat window means a dead node can still count as feasible for up to half a minute after its last heartbeat -- a pod can be assigned to a node that's already gone. And sort.Slice isn't stable: nodes with equal scores can come out in any order, so the "first best" can change from run to run.
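The ordering trap can be avoided by making the comparison a total order: compare by score first, then by node name. A minimal sketch follows, using a simplified `scoredNode` that holds a name instead of a `*Node` (an assumption for brevity); `rank` is a hypothetical helper name.

```go
package main

import "sort"

// scoredNode is simplified here: the real one holds a *Node.
type scoredNode struct {
	name  string
	score int
}

// rank sorts by descending score and breaks ties by ascending name,
// so the result no longer depends on the order of the input slice.
func rank(nodes []scoredNode) {
	sort.SliceStable(nodes, func(i, j int) bool {
		if nodes[i].score != nodes[j].score {
			return nodes[i].score > nodes[j].score
		}
		return nodes[i].name < nodes[j].name
	})
}
```

As long as node names are unique, the tie-break alone already makes the result deterministic; `sort.SliceStable` is a safety net for the case where two entries compare as fully equal.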
💡 Fun facts
- The two-phase filter → score model came into Kubernetes straight from Borg and Omega -- Google's internal systems. The Omega paper specifically describes how moving from a monolithic scheduler to "shared state" let multiple schedulers run in parallel without locking.
- In the real kube-scheduler, the scoring phase normalizes all scores to a 0–100 range before combining them with weights -- otherwise a plugin emitting large absolute numbers would "drown out" all the others.
- Kubernetes' default balancing plugin is called NodeResourcesBalancedAllocation -- it prefers nodes where CPU and memory are used evenly, not just nodes with lots of free room. Our scoring doesn't do that.
- Preemption (evicting lower-priority pods to free up space) wasn't there from the start -- early Kubernetes just left the pod Pending until space freed up on its own.
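The normalization point from the list above is easy to demonstrate with numbers. The sketch below is a simplified illustration of the idea, not kube-scheduler's code: `normalize` rescales each scorer's output so its best node gets 100, and `combine` applies weights. Both function names are hypothetical, raw scores are assumed to be non-negative, and `combine` assumes one weight per scorer and equal-length rows.

```go
package main

const maxNodeScore = 100

// normalize rescales raw per-node scores so that the highest one
// becomes maxNodeScore. Raw scores are assumed to be non-negative.
func normalize(raw []int64) []int64 {
	out := make([]int64, len(raw))
	var highest int64
	for _, v := range raw {
		if v > highest {
			highest = v
		}
	}
	if highest == 0 {
		return out
	}
	for i, v := range raw {
		out[i] = v * maxNodeScore / highest
	}
	return out
}

// combine returns, for each node n, the sum over scorers p of
// weights[p] * scores[p][n].
func combine(weights []int64, scores [][]int64) []int64 {
	if len(scores) == 0 {
		return nil
	}
	total := make([]int64, len(scores[0]))
	for p, perNode := range scores {
		for n, s := range perNode {
			total[n] += weights[p] * s
		}
	}
	return total
}
```

Take one scorer that emits `[1000, 500]` and another that emits `[1, 2]`, with weights 1 and 2. Combining the raw values gives `[1002, 504]`, so the second scorer has no real say despite its higher weight. After normalizing to `[100, 50]` and `[50, 100]`, the totals are `[200, 250]` and the weights decide the outcome.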
What I figured out while digging into this
My biggest insight was that "scheduled" and "running" are different states, and the scheduler only owns the first one. At first I wrote the code as if assigning a node equaled starting the pod, and I couldn't figure out why the status never matched. Then it clicked: the scheduler only sets NodeName, and from there the agent on the node decides when to actually start the containers. The separation of concerns here isn't theoretical -- it directly determines where you go looking for a bug.
What could be improved
- Add reserved resources: count already-assigned but not-yet-running pods so a burst of pods doesn't all pile onto one node.
- Make sorting deterministic -- sort.SliceStable plus a tie-break on node name.
- Move the scoring weights (*10, *50) into config instead of hardcoding them -- the first step toward a pluggable scheduler like Kubernetes has.
- Add the simplest anti-affinity: don't place two replicas of the same deployment on the same node.
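As a sketch of the weights idea from the list above, the multipliers can live in a small struct that is filled from JSON, with the current hardcoded values as defaults. The `Weights` type, its JSON field names, and the helper functions are all hypothetical; only `encoding/json` from the standard library is used.

```go
package main

import "encoding/json"

// Weights holds the scoring multipliers. Field and JSON names are
// assumptions made for this sketch.
type Weights struct {
	FreePodSlots int `json:"freePodSlots"`
	FreeCPU      int `json:"freeCPU"`
	FreeMemory   int `json:"freeMemory"`
}

// defaultWeights reproduces the values hardcoded in scoreNodes.
func defaultWeights() Weights {
	return Weights{FreePodSlots: 10, FreeCPU: 50, FreeMemory: 50}
}

// parseWeights starts from the defaults and overrides only the fields
// present in the JSON document.
func parseWeights(data []byte) (Weights, error) {
	w := defaultWeights()
	if err := json.Unmarshal(data, &w); err != nil {
		return Weights{}, err
	}
	return w, nil
}

// score applies the weights to precomputed inputs: the number of free
// pod slots and the free CPU and memory ratios in the range 0..1.
func score(w Weights, freeSlots int, freeCPU, freeMem float64) int {
	return freeSlots*w.FreePodSlots +
		int(freeCPU*float64(w.FreeCPU)) +
		int(freeMem*float64(w.FreeMemory))
}
```

With the defaults, a node with 8 free slots and 75% of CPU and memory free scores 154. Loading `{"freePodSlots": 1}` keeps the resource weights at 50 and drops the same node to 82, which shifts the balance toward free resources without touching the code.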
Try It Yourself
# Create a pod and check events:
sheepctl apply -f examples/pod.json
sheepctl events | head -5
# You'll see: Created, Scheduled, pod → node
sheepctl get pods
The Scheduler assigns nodes. Next up -- the reconciliation loop, the heart of the entire system.
Resources
- kube-scheduler concepts — official scheduler docs
- Scheduling, Preemption and Eviction — full chapter
- Borg paper (Google Research) — the system Kubernetes inherits from
- Omega paper (Google Research) — flexible scheduler architecture
Source code for the series: github.com/igorgorovoy/sheep-shepherd-meadow
