Enable the GPU time slicing feature
The NVIDIA device plugin supports GPU oversubscription through either Time-Slicing or MPS; the two methods are mutually exclusive. Time-Slicing uses CUDA time-slicing to run concurrent workloads on a single GPU. The workloads share GPU memory and run in the same fault domain, so a crash in one workload affects all the others. MPS, by contrast, uses a control daemon to manage access to the GPU and enforces space partitioning: each workload is allocated an explicit share of memory and compute resources, which isolates workloads and prevents resource contention. In both modes the same sharing configuration applies to every GPU on a node; per-GPU customization is not supported.
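Regardless of the mode chosen, workloads consume a shared GPU through the ordinary nvidia.com/gpu resource. As an illustrative sketch (the pod name and CUDA image below are placeholders, not part of this setup), a pod claiming one replica of a shared GPU might look like:

```shell
# Hypothetical pod requesting one replica of a shared GPU.
# With Time-Slicing or MPS enabled, "nvidia.com/gpu: 1" claims
# one of the N advertised replicas, not a whole physical GPU.
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-sharing-demo
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```

Because replicas of the same physical GPU are indistinguishable to the scheduler, several such pods can land on one device and will share it according to the configured mode.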
Prerequisites:
- NVIDIA drivers and the NVIDIA container toolkit installed on the GPU nodes
- The device plugin Helm repository added: helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && helm repo update
Time-Slicing
1. Create the config
cat << EOF > /tmp/gpu-time-slicing-config.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: "envvar"
    deviceIDStrategy: "uuid"
sharing:
  timeSlicing:
    failRequestsGreaterThanOne: true
    resources:
    - name: nvidia.com/gpu
      replicas: 10
EOF
2. Apply the Helm upgrade
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --version=0.17.1 \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set gfd.enabled=true \
  --set-file config.map.config=/tmp/gpu-time-slicing-config.yaml
3. Verify the change was applied
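One way to check (a sketch; substitute your own node name, and note that exact label names depend on the GPU Feature Discovery version) is to inspect the node's advertised capacity and GPU labels:

```shell
# Capacity/Allocatable should now show replicas x physical GPUs,
# e.g. "nvidia.com/gpu: 10" on a node with one GPU.
kubectl describe node <node-name> | grep nvidia.com/gpu

# With gfd.enabled=true, sharing is also reflected in node labels
# (label names and values may vary by GFD version).
kubectl get node <node-name> -o jsonpath='{.metadata.labels}' | tr ',' '\n' | grep nvidia.com/gpu
```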
Multi-Process Service (MPS)
1. Create the config
cat << EOF > /tmp/gpu-mps-config.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: "envvar"
    deviceIDStrategy: "uuid"
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 12
EOF
2. Apply the Helm upgrade
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --version=0.17.1 \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set gfd.enabled=true \
  --set-file config.map.config=/tmp/gpu-mps-config.yaml
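3. Verify the change was applied

As with time-slicing, the node's capacity can be inspected (a sketch; substitute your own node name, and the exact workload names in the plugin namespace may vary by chart version):

```shell
# Capacity should now show 12 replicas per physical GPU.
kubectl describe node <node-name> | grep nvidia.com/gpu

# In MPS mode the chart also runs an MPS control daemon alongside
# the device plugin; confirm its pods are running on GPU nodes.
kubectl get pods -n nvidia-device-plugin
```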