Add GPU node to a MetaKube cluster
Warning
At the moment you cannot select a GPU flavor for your machine-deployment in the UI. The guide below shows a workaround for this. We are currently working on supporting GPU flavors for machine-deployments.
Prerequisites:
- Access to a GPU flavor
- A completed setup of the GPU node
- SSH access to the GPU node
Create a machine-deployment
As mentioned above, the UI currently does not allow selecting a GPU flavor for machine-deployments, so we use a workaround: create a machine-deployment with the needed number of nodes in any flavor (preferably m2.small). After the node is visible in your cluster, change the flavor of the machine-deployment from the CLI, for example by editing the MachineDeployment object with kubectl.
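A minimal sketch of that edit, assuming the MachineDeployment object lives in the kube-system namespace of your cluster; the deployment name and the exact field path are placeholders and depend on your setup:

# Open the machine-deployment for editing (my-machine-deployment is a placeholder)
kubectl -n kube-system edit machinedeployment my-machine-deployment
# In the editor, set the OpenStack flavor to your GPU flavor, e.g. under
# spec.template.spec.providerSpec.value.cloudProviderSpec.flavor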
If you manage your cluster with the MetaKube Terraform provider instead, the following snippet shows the node_deployment for a GPU flavor node. Replace GPU_FLAVOR_NAME with your GPU flavor.
data "openstack_images_image_v2" "image" {
most_recent = true
visibility = "public"
properties = {
os_distro = "ubuntu"
os_version = "20.04"
}
}
resource "metakube_node_deployment" "node_deployment" {
cluster_id = metakube_cluster.cluster.id
project_id = var.project_id
spec {
replicas = 1
template {
cloud {
openstack {
flavor = "GPU_FLAVOR_NAME"
image = data.openstack_images_image_v2.image.name
use_floating_ip = true
instance_ready_check_period = "5s"
instance_ready_check_timeout = "100s"
}
}
operating_system {
ubuntu {}
}
versions {
kubelet = data.metakube_k8s_version.cluster.version
}
}
}
}
Info
GPU-flavored nodes take longer than regular nodes to become ready.
Enable GPU to be used within the cluster
Now we have to install the NVIDIA Container Toolkit on the GPU node itself. Connect to the node via SSH and run the following steps there.
1. Configure production repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
2. Update packages
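For example, with apt:

sudo apt-get update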
3. Install the NVIDIA Container Toolkit packages
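The base package installation looks like this (NVIDIA's install guide optionally pins an explicit version):

sudo apt-get install -y nvidia-container-toolkit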
4. Configure the NVIDIA Container Toolkit for containerd
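The toolkit ships the nvidia-ctk helper, which writes the required runtime configuration into the containerd config:

sudo nvidia-ctk runtime configure --runtime=containerd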
5. Restart containerd
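For example, via systemd:

sudo systemctl restart containerd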
With that, the setup of the GPU node itself is finished. The next steps continue inside the cluster.
6. Add Helm repo for the nvidia-device-plugin in the cluster and update it
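For example, using the official chart repository (the alias nvdp matches the chart reference used in step 8):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update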
Info
Continue with the following steps only if you do not want time-slicing or MPS enabled.
We are going to use the nvidia-device-plugin. It runs as a DaemonSet and advertises the node's GPUs to the cluster so that pods can request them through the nvidia.com/gpu resource.
7. Verify the version
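To list the chart versions available in the repo, you can run for example:

helm search repo nvdp --devel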
8. Deploy the Helm chart (adjust version according to step 7)
Info
The setting gfd.enabled=true enables automatic node labeling by gpu-feature-discovery, which labels the nodes according to their GPU specs and simplifies node management.
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --version=0.17.1 \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set gfd.enabled=true
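Before running the test pod, you can check that the plugin's DaemonSet pods are up, for example:

kubectl get pods -n nvidia-device-plugin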
9. Verify the installation with an example pod
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
The output should look similar to this:
$ kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
You have successfully enabled the use of the GPU node within your cluster 🎉. With the following resource snippet, you can schedule your pod(s) to use GPU computation.
...
resources:
  limits:
    nvidia.com/gpu: 1 # requesting 1 GPU
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
Remove GPU from cluster
1. Delete machine-deployment
Refer to the delete machine-deployment section.