NVIDIA GPU Enablement (H100)
This section describes how NVIDIA GPUs (specifically NVIDIA H100 NVL) are enabled on a single-node Talos Kubernetes cluster.
GPU enablement is intentionally split into two distinct layers:
- Talos OS layer – NVIDIA drivers and container runtime integration via Talos system extensions
- Kubernetes layer – RuntimeClass and NVIDIA device plugin to expose GPUs as schedulable resources
This documentation explains what is required and why.
The concrete configuration, patches, and automation live in the separate infrastructure repository.
Scope and assumptions
This guide assumes:
- A single-node, on-prem Talos cluster
- Kubernetes already bootstrapped and reachable
- Cilium installed and functional
- NVIDIA GPUs physically present in the node
This setup is intended for:
- AI workloads
- Development and PoC environments
- Internal platforms (OS2AI / AarhusAI)
It is not a high-availability production GPU platform.
1. Prerequisites and quick validation
Before enabling GPUs, verify basic cluster health.
Talos health
talosctl health --nodes <NODE_IP>
Kubernetes access
kubectl get nodes
kubectl get pods -A
The node must be Ready and the Kubernetes API reachable.
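As an optional sanity check that the GPUs are visible on the PCI bus before any driver work, recent Talos versions expose PCI devices as resources; the grep pattern below (vendor name or NVIDIA's PCI vendor ID 10de) is only illustrative and the exact output columns depend on the Talos version:
talosctl --nodes <NODE_IP> get pcidevices | grep -iE 'nvidia|10de'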
2. Talos GPU support overview
Talos Linux is immutable and does not support installing drivers via SSH or package managers.
GPU support is provided via Talos system extensions, managed through Image Factory and applied using a Talos upgrade.
For NVIDIA GPUs, the following extensions are required:
- siderolabs/nvidia-open-gpu-kernel-modules-production
- siderolabs/nvidia-container-toolkit-production
Important
Always keep the NVIDIA kernel modules and container toolkit on the same version track (do not mix lts and production).
3. Updating Talos with NVIDIA extensions
3.1 Generate a new Image Factory schematic
Using Image Factory, generate a new schematic ID that includes:
- NVIDIA open GPU kernel modules
- NVIDIA container toolkit
This schematic defines the Talos installer image used for upgrades.
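A schematic for this setup would look roughly like the following, using the production-track extensions described in section 2:
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/nvidia-open-gpu-kernel-modules-production
      - siderolabs/nvidia-container-toolkit-production
The schematic ID can be obtained from the Image Factory web UI at factory.talos.dev, or by posting the schematic file to the factory, for example:
curl -X POST --data-binary @schematic.yaml https://factory.talos.dev/schematics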
3.2 Upgrade Talos using the Image Factory installer image
From the infrastructure repository (or anywhere with access to talos/generated/talosconfig):
export TALOSCONFIG="$(pwd)/talos/generated/talosconfig"
talosctl --talosconfig "$TALOSCONFIG" --nodes <NODE_IP> upgrade --image factory.talos.dev/installer/<SCHEMATIC_ID>:v1.11.5
During the upgrade:
- Talos reboots the node
- NVIDIA extensions are installed
- Existing configuration is preserved
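To block until the node reports healthy again before verifying, talosctl health can wait with a timeout (10 minutes here is an arbitrary choice):
talosctl --talosconfig "$TALOSCONFIG" --nodes <NODE_IP> health --wait-timeout 10m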
After the node comes back, verify the extensions:
talosctl --talosconfig "$TALOSCONFIG" --nodes <NODE_IP> get extensions
You should see:
nvidia-open-gpu-kernel-modules-*
nvidia-container-toolkit-*
4. Loading NVIDIA kernel modules in Talos
Installing the extensions alone is not sufficient.
The NVIDIA kernel modules must be explicitly loaded via a Talos machine configuration patch.
In the infrastructure repository, this patch is located at:
talos/patches/nvidia-oss-modules.yaml
Example patch
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
Apply the patch:
export TALOSCONFIG="$(pwd)/talos/generated/talosconfig"
talosctl --talosconfig "$TALOSCONFIG" --nodes <NODE_IP> patch mc --patch @talos/patches/nvidia-oss-modules.yaml
Verify driver and modules
talosctl --talosconfig "$TALOSCONFIG" --nodes <NODE_IP> read /proc/driver/nvidia/version
talosctl --talosconfig "$TALOSCONFIG" --nodes <NODE_IP> read /proc/modules | grep -E 'nvidia|nvidia_uvm|nvidia_drm|nvidia_modeset'
If the driver version is shown and modules are loaded, Talos GPU support is complete.
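As an optional extra check, the kernel log should contain NVRM messages from the driver once the modules are loaded:
talosctl --talosconfig "$TALOSCONFIG" --nodes <NODE_IP> dmesg | grep -i nvidia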
4.1 NVIDIA container runtime (no-cgroups)
Talos uses cgroups v2.
In this setup, the NVIDIA container runtime is configured with no-cgroups = true.
Patch location:
talos/patches/nvidia-no-cgroups.yaml
Example:
machine:
  files:
    - op: create
      path: /etc/nvidia-container-runtime/config.toml
      permissions: 0o644
      content: |
        [nvidia-container-cli]
        no-cgroups = true
Without this setting, GPU workloads may fail to start with container runtime errors.
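The patch is applied with the same talosctl patch mc command shown in section 4. To confirm the file landed on the node, read it back:
talosctl --talosconfig "$TALOSCONFIG" --nodes <NODE_IP> read /etc/nvidia-container-runtime/config.toml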
5. Kubernetes RuntimeClass for NVIDIA
Kubernetes must be instructed to use the NVIDIA runtime when running GPU workloads.
This is done using a RuntimeClass.
In the infrastructure repository, the manifest is stored at:
k8s/manifests/runtimeclass-nvidia.yaml
RuntimeClass definition
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
GitOps note
In this setup, the RuntimeClass is applied via ArgoCD.
Avoid creating it manually with kubectl except for temporary debugging.
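For reference, a minimal ArgoCD Application tracking the manifests directory could look like the sketch below; the repository URL, Application name, project, and sync policy are placeholders, since the real definition lives in the infrastructure repository:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: runtimeclass-nvidia
  namespace: argocd
spec:
  project: default
  source:
    repoURL: <INFRA_REPO_URL>
    targetRevision: main
    path: k8s/manifests
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated: {}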
Verify:
kubectl get runtimeclass nvidia
6. NVIDIA device plugin (Kubernetes)
The NVIDIA device plugin exposes GPUs to Kubernetes as schedulable resources (e.g. nvidia.com/gpu).
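Once the plugin is running, workloads request GPUs through the standard resource mechanism rather than by mounting devices directly; an illustrative container spec fragment:
resources:
  limits:
    nvidia.com/gpu: 1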
Important note (single-node clusters)
The upstream NVIDIA device plugin chart may include node affinity rules that depend on Node Feature Discovery (NFD).
In this setup:
- NFD is not installed
- Affinity must be disabled explicitly
6.1 Device plugin Helm values
In the infrastructure repository:
k8s/nvidia-device-plugin/values.yaml
Example values:
runtimeClassName: nvidia
# Disable NFD/GFD-based scheduling requirements
affinity: null
nodeSelector:
  kubernetes.io/os: linux
tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Exists"
    effect: "NoSchedule"
  - key: "CriticalAddonsOnly"
    operator: "Exists"
6.2 Install or upgrade the device plugin
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin -n kube-system -f k8s/nvidia-device-plugin/values.yaml
Verify scheduling
kubectl get ds -n kube-system | grep -i nvidia
kubectl get pods -n kube-system | grep -i nvidia
The DaemonSet should show DESIRED = 1.
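If the DaemonSet exists but GPUs never show up on the node, the plugin logs usually explain why. Assuming the chart's default labels, they can be fetched like this; a healthy plugin logs that it registered the nvidia.com/gpu resource with the kubelet:
kubectl logs -n kube-system -l app.kubernetes.io/name=nvidia-device-plugin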
7. Verify GPU availability in Kubernetes
Check node resources:
kubectl describe node <NODE_NAME> | grep -A10 -i nvidia.com/gpu
Expected output:
nvidia.com/gpu: 2 (for two H100 GPUs)
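The same value can be read directly from the node's allocatable resources, which is convenient for scripting:
kubectl get node <NODE_NAME> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'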
8. Test GPU access from a pod
Run a simple CUDA container and execute nvidia-smi:
kubectl run nvidia-test --restart=Never -ti --rm --image nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04 --overrides '{"spec": {"runtimeClassName": "nvidia"}}' -- nvidia-smi
If the command prints GPU information, GPU enablement is successful.
A PodSecurity warning may appear for this test pod.
This does not affect GPU functionality.
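An equivalent manifest-based smoke test, which also exercises the device plugin's resource accounting (the pod name is illustrative), looks like this:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
Apply it with kubectl apply -f, check the output with kubectl logs gpu-smoke-test, and delete the pod afterwards.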
Next steps
Once GPU enablement is complete, the cluster is ready for GPU workloads such as:
- 09 – Sealed Secrets (secure secret management)
- 10 – vLLM (GPU-based inference and embeddings, requires runtimeClassName: nvidia)
Summary
GPU enablement is complete when:
- Talos reports an NVIDIA driver version
- NVIDIA kernel modules are loaded
- Kubernetes advertises nvidia.com/gpu
- nvidia-smi runs successfully inside a pod
At this point, the cluster is ready for GPU-accelerated workloads.