Cgroups

System resources, such as CPU and memory, are distributed by means of a Linux Kernel feature called "cgroups".

A Pod requests resources through its spec.containers[*].resources fields.
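For example, a container's CPU and memory requests and limits might be declared like this (a minimal sketch; the pod, container, and image names are placeholders):

```shell
# Print a minimal Pod manifest that sets resource requests and limits.
# The names "demo", "app", and "nginx" are placeholders for illustration.
manifest='apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 250m        # 0.25 CPU cores
        memory: 64Mi
      limits:
        cpu: 500m
        memory: 128Mi'
printf '%s\n' "$manifest"
```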

MetaKube nodes use systemd as the cgroup manager. Cgroups are organized in a hierarchy through "slices".

Depending on a Pod's Quality of Service (QoS) class, its cgroup is placed either directly under kubepods.slice (Guaranteed) or one level lower, under kubepods-burstable.slice or kubepods-besteffort.slice:

$ systemctl status
...
├─init.scope 
├─user.slice 
│ └─user-<uid>.slice
├─system.slice 
│ ├─containerd.service
│ ├─kubelet.service 
│ └─<other system services>.service
└─kubepods.slice 
  ├─kubepods-pod<id>.slice 
  │ ├─cri-containerd-<container id>.scope
  │ │ └─<pid> <command>
  │ └─cri-containerd-<container id>.scope
  │   └─<pid> /pause
  ├─kubepods-burstable.slice 
  │ └─kubepods-burstable-pod<pod id>.slice 
  └─kubepods-besteffort.slice 
    └─kubepods-besteffort-pod<pod id>.slice 
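Given the layout above, a pod's slice name can be derived from its QoS class and UID. The sketch below follows the kubelet's naming convention (dashes in the pod UID are escaped to underscores for systemd); the example UID is made up:

```shell
# Derive the systemd slice name for a pod's cgroup from its QoS class
# and UID. Dashes in the UID become underscores (systemd unit escaping).
pod_slice() {
  qos=$1
  uid=$(printf '%s' "$2" | tr '-' '_')
  case $qos in
    Guaranteed) printf 'kubepods-pod%s.slice\n' "$uid" ;;
    Burstable)  printf 'kubepods-burstable-pod%s.slice\n' "$uid" ;;
    BestEffort) printf 'kubepods-besteffort-pod%s.slice\n' "$uid" ;;
  esac
}

pod_slice Burstable 0f52f50e-94cb-4a0e-9d02-1234abcd5678
```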

CPU

CPU time allocation is implemented by the Linux Kernel's completely fair scheduler (CFS).

A container's resources.requests.cpu is taken into account during scheduling to ensure there is enough CPU available on the node. It also determines the container's cgroup's cpu.weight value.

The weight determines a process's position in the CFS's weighted run queue and thus how often its threads are scheduled onto a CPU core.

By setting weights proportional to the requests, the kubelet ensures that a container process gets at least its requested CPU time (from resources.requests.cpu). Any spare CPU time (either not requested, or yielded by idle processes) is free for waiting processes to use. It is again distributed proportionally, based on the processes' cgroups' CPU weights and up to each cgroup's cpu.max value (from resources.limits.cpu).
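The arithmetic behind these cgroup values can be sketched as follows. This is based on the well-known cpu.shares-to-cpu.weight conversion used for cgroup v2; exact kubelet internals may differ between versions:

```shell
# Map requests.cpu (in millicores) to a cgroup v2 cpu.weight value.
milli_to_weight() {
  shares=$(( $1 * 1024 / 1000 ))                 # legacy cgroup v1 cpu.shares
  echo $(( 1 + (shares - 2) * 9999 / 262142 ))   # map [2..262144] to [1..10000]
}

# Map limits.cpu (in millicores) to cpu.max ("<quota> <period>" in microseconds).
milli_to_quota() {
  echo "$(( $1 * 100000 / 1000 )) 100000"
}

milli_to_weight 1000   # requests.cpu: 1000m -> weight 39
milli_to_quota 500     # limits.cpu: 500m -> "50000 100000"
```

With these values, a container requesting one full core gets roughly 39/10000 of the node's total weight, and a 500m limit allows at most 50 ms of CPU time per 100 ms period.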

Warning

Kubernetes doesn't prevent you from over-committing a node (limits > available resources). Containers with low CPU requests may be throttled and become too slow to answer requests, including liveness or readiness probes.

Memory

A container's resources.requests.memory is only taken into account during scheduling to ensure the node has enough memory available.

A container's resources.limits.memory corresponds to the cgroup's memory.max value. Once that limit is reached, the process will be "OOM" (out of memory) killed.
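Since memory.max is expressed in bytes, the binary-suffix quantities used in Kubernetes manifests translate directly. A quick conversion sketch:

```shell
# Convert a Kubernetes memory quantity with a binary suffix (Ki/Mi/Gi)
# into bytes, i.e. the value that would land in the cgroup's memory.max.
mem_to_bytes() {
  case $1 in
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "$1" ;;                 # plain byte count, pass through
  esac
}

mem_to_bytes 256Mi   # limits.memory: 256Mi -> memory.max of 268435456
```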

Warning

Kubernetes doesn't prevent you from over-committing a node (limits > available resources) and exhausting the total available RAM. If a higher-level slice's memory, or the system's memory as a whole, is exhausted, the kernel's OOM killer chooses and kills a process within that slice, or anywhere on the system, respectively.