Troubleshooting and FAQ
This section collects the most common failure scenarios when operating a single-node Talos Kubernetes cluster, along with proven solutions drawn from the project's scripts, configuration patterns, and known platform limitations.
The content reflects hands-on experience gathered during installation and operation.
1. General troubleshooting checklist
Always start with the basics:
talosctl health
kubectl get nodes
kubectl get pods -A
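The three checks above can be wrapped in a small status script that prints a PASS/FAIL summary instead of raw output. This is a sketch, not part of the project's tooling; the helper name run_check is an assumption:

```shell
#!/bin/sh
# Run a command silently and report PASS/FAIL based on its exit status.
run_check() {
  desc=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $desc"
  else
    echo "FAIL: $desc"
  fi
}

run_check "Talos API healthy" talosctl health
run_check "Nodes registered"  kubectl get nodes
run_check "Pods listable"     kubectl get pods -A
```

Any FAIL line points at the layer (Talos vs. Kubernetes) to investigate first.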
If Talos does not respond:
- The server may still be running from the Talos ISO (maintenance mode)
- Static IP configuration in network.yaml may be incorrect
- Firewall or routing between WSL2 and the server may block traffic
2. Talos-related issues
2.1 connection refused or timeout from talosctl
Common causes:
- Talos endpoint or node not configured:
talosctl config endpoint <NODE_IP>
talosctl config node <NODE_IP>
- Node is still running from the ISO
- Static IP was not applied correctly
Validate:
ping <NODE_IP>
talosctl version
2.2 Talos health reports errors
talosctl health --nodes <NODE_IP>
Typical problems:
Kubernetes API unavailable
- Cilium not running or misconfigured
- Certificate issues during bootstrap
- Invalid controlplane configuration
etcd issues
- Bootstrap not executed:
talosctl bootstrap --nodes <NODE_IP>
- System time out of sync (check NTP)
3. Kubernetes API errors (context deadline exceeded)
kubectl get pods -A
Possible causes:
- Cilium not installed or failing
- kube-apiserver in restart loop
- kubeconfig not generated yet
Inspect logs:
talosctl logs kube-apiserver
talosctl logs kubelet
4. Cilium issues
4.1 Cilium pods not ready
cilium status
kubectl -n kube-system get pods
Common causes:
- Wrong k8sServiceHost / k8sServicePort
- Talos default CNI not disabled
- Kernel or BPF limitations
4.2 Recovery
helm upgrade --install cilium ./k8s/charts/cilium --namespace kube-system --values ./k8s/charts/cilium/values.yaml
Connectivity test:
cilium connectivity test
5. Rook Ceph issues (single-node)
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph get pods
5.1 OSDs not starting
- Disks already formatted or mounted
- Wrong device filters
- Permission issues
5.2 MON / MGR instability
- Insufficient resources
- Expected warnings in single-node setups
5.3 StorageClass missing
kubectl get storageclass
Set default if needed:
kubectl patch storageclass rook-ceph-block -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
6. kubeconfig problems
6.1 API server unreachable
talosctl kubeconfig --nodes <NODE_IP> --force
Verify:
kubectl config view
kubectl get nodes
6.2 WSL2 cannot reach node
Common causes:
- Firewall blocking WSL2 subnet (often 172.x.x.x)
- WSL2 IP changed after reboot
- Missing routing between VLANs
ip route
ping <NODE_IP>
7. Pods cannot schedule on single-node cluster
Cause:
allowSchedulingOnControlPlanes: false
Fix by enabling scheduling:
cluster:
  allowSchedulingOnControlPlanes: true
8. NVIDIA GPU issues
This section covers common problems related to NVIDIA GPU enablement on Talos-based Kubernetes clusters.
GPU support spans both Talos OS and Kubernetes, so issues may appear at either layer.
8.1 NVIDIA device plugin DaemonSet shows DESIRED = 0
Check:
kubectl get ds -n kube-system | grep -i nvidia
If DESIRED is 0, the DaemonSet is not matching any nodes.
Common cause
The NVIDIA device plugin chart may include a nodeAffinity that depends on Node Feature Discovery (NFD) labels, for example:
- feature.node.kubernetes.io/pci-10de.present=true
- nvidia.com/gpu.present=true
In this setup, NFD is not installed, so the affinity prevents scheduling.
Fix
Disable affinity in the Helm values:
affinity: null
Then upgrade the release:
helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin -n kube-system -f k8s/charts/nvidia-device-plugin/values.yaml
Verify:
kubectl get ds -n kube-system | grep -i nvidia
kubectl get pods -n kube-system | grep -i nvidia
8.2 GPUs not visible on the node (nvidia.com/gpu missing)
Check node resources:
kubectl describe node <NODE_NAME> | grep -A10 -i nvidia.com/gpu
If no GPU resources are shown:
- Verify the NVIDIA device plugin pod is running
- Verify Talos has loaded NVIDIA kernel modules
Check on Talos:
talosctl read /proc/driver/nvidia/version
talosctl read /proc/modules | grep nvidia
If modules are missing, reapply the NVIDIA kernel module patch and reboot.
8.3 nvidia-smi fails inside a pod
Example test:
kubectl run nvidia-test --restart=Never -ti --rm --image nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04 --overrides '{"spec": {"runtimeClassName": "nvidia"}}' nvidia-smi
Common causes
- runtimeClassName: nvidia missing
- NVIDIA container toolkit extension not installed in Talos
- Device plugin not running
Notes
- A PodSecurity warning may appear for this test pod
- The warning does not prevent GPU access
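The kubectl run test above can also be written as a plain Pod manifest. This is a sketch: the pod name and the explicit GPU resource limit are assumptions, but requesting nvidia.com/gpu: 1 makes the device plugin explicitly allocate a device to the container:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

Apply it with kubectl apply -f and check the pod logs for the nvidia-smi output.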
8.4 NVIDIA driver missing after Talos upgrade
After upgrading Talos, GPUs may stop working if the installer image does not include the required NVIDIA extensions.
Verify extensions:
talosctl get extensions
If NVIDIA extensions are missing, upgrade Talos using the correct Image Factory installer image:
talosctl upgrade --image factory.talos.dev/installer/<SCHEMATIC_ID>:<TALOS_VERSION>
8.5 Quick GPU health checklist
talosctl read /proc/driver/nvidia/version
kubectl get ds -n kube-system | grep -i nvidia
kubectl describe node <NODE_NAME> | grep -A10 -i nvidia.com/gpu
If all three checks succeed, GPU support is operational.
9. Helm-related issues
9.1 Chart version conflicts
helm repo update
9.2 Values not applied
helm get values <release>
helm upgrade --install <release> <chart> --values values.yaml
10. Talos logging and diagnostics
Talos does not support SSH; use talosctl for logs and diagnostics:
talosctl logs <service>
talosctl dmesg
talosctl service list
Examples:
talosctl logs containerd
talosctl logs kubelet
11. Emergency reboot
talosctl reboot --nodes <NODE_IP>
12. Known limitations
- Single-node Ceph is not highly available
- Control-plane and workloads share resources
- Cilium requires kernel BPF support
- Correct networking is critical during first apply
13. Full recovery procedure
If the cluster becomes unrecoverable:
task talos:build
task talos:apply-insecure
task talos:bootstrap
task talos:kubeconfig
14. Further reading
- Talos CLI reference: https://docs.siderolabs.com/talos/reference/cli/
- Cilium troubleshooting: https://docs.cilium.io/en/stable/
- Rook Ceph troubleshooting: https://rook.io/docs/rook/latest/Troubleshooting/
15. CloudNativePG volume expansion and instance rebuild
This section covers a common recovery pattern for CloudNativePG when a Postgres instance fails because the PVC is too small or the cluster enters a degraded state after storage pressure.
This procedure was validated during recovery of application databases running on Rook Ceph block storage (ceph-block).
15.1 Symptoms
Typical symptoms:
kubectl get pods -A
kubectl get cluster -A
Examples:
- A CNPG instance is in CrashLoopBackOff
- Application pods are running but not ready
- kubectl get cluster reports messages such as "Not enough disk space", "Waiting for the instances to become active", or "Cluster Is Not Ready"
CNPG logs may show:
Detected low-disk space condition, avoid starting the instance
15.2 Verify the problem
Check cluster state:
kubectl -n <namespace> get cluster
kubectl -n <namespace> get cluster <cluster-name> -o yaml
kubectl -n <namespace> get pods -o wide
Check current PVC sizes:
kubectl -n <namespace> get pvc
kubectl -n <namespace> get pvc <pvc-name> -o yaml
Check whether the StorageClass supports expansion:
kubectl get storageclass ceph-block -o yaml
Expected:
allowVolumeExpansion: true
If a healthy CNPG instance still exists, inspect filesystem usage:
kubectl -n <namespace> exec -it <healthy-db-pod> -c postgres -- df -h
kubectl -n <namespace> exec -it <healthy-db-pod> -c postgres -- du -sh /var/lib/postgresql/data/pgdata
kubectl -n <namespace> exec -it <healthy-db-pod> -c postgres -- du -sh /var/lib/postgresql/data/pgdata/pg_wal
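When scripting the low-disk check, it helps to reduce the df output to a single usage number. A sketch; the function name usage_pct is an assumption:

```shell
# Print the use% (as a bare number, no % sign) of the filesystem holding a path.
# -P forces POSIX single-line output so awk can rely on field positions.
usage_pct() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Run inside the postgres container against the data directory, e.g.:
# usage_pct /var/lib/postgresql/data/pgdata
```

A value approaching 100 explains CNPG's low-disk-space protection kicking in.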
15.3 Update desired storage in Git
Increase the CNPG storage size in Git before doing instance recovery.
Recommended pattern:
- keep application values in overrides/<app>/values.yaml
- keep CNPG-specific settings in a dedicated file: overrides/<app>/cloudnative-pg-values.yaml
Example:
cloudnative-pg:
  cluster:
    storage:
      size: 20Gi
Then ensure the Argo CD application includes the file as the last values file so it overrides vendor defaults.
Example pattern:
valueFiles:
- cloudnative-pg-values.yaml
- values.yaml
- ../../../overrides/<app>/values.yaml
- ../../../overrides/<app>/cloudnative-pg-values.yaml
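Putting the pattern together, the relevant part of the Argo CD Application source might look like this (a sketch; repoURL and paths are placeholders — in Argo CD, later entries in valueFiles win):

```yaml
source:
  repoURL: <repo-url>
  path: vendor/applications/<app>
  helm:
    valueFiles:
      - cloudnative-pg-values.yaml
      - values.yaml
      - ../../../overrides/<app>/values.yaml
      - ../../../overrides/<app>/cloudnative-pg-values.yaml
```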
15.4 Validate with Helm before merge
Render the chart locally before merging:
helm dependency build ./vendor/applications/<app>
helm template <release-name> ./vendor/applications/<app> \
-f ./vendor/applications/<app>/cloudnative-pg-values.yaml \
-f ./vendor/applications/<app>/values.yaml \
-f ./overrides/<app>/values.yaml \
-f ./overrides/<app>/cloudnative-pg-values.yaml \
> /tmp/<app>-rendered.yaml
Inspect the rendered CNPG cluster:
sed -n '/kind: Cluster/,+160p' /tmp/<app>-rendered.yaml
Verify:
storage:
  size: 20Gi
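This verification can be automated so that a wrong size fails loudly before merge. A sketch; the function name check_storage_size is an assumption:

```shell
# Exit non-zero unless the rendered manifest contains the expected storage size.
check_storage_size() {
  # $1 = rendered manifest file, $2 = expected size (e.g. 20Gi)
  grep -q "size: $2" "$1"
}

# Example (placeholder path):
# check_storage_size /tmp/<app>-rendered.yaml 20Gi || echo "storage size mismatch"
```

This makes a good pre-merge or CI step alongside helm template.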
15.5 If PVCs do not resize automatically
Even when Argo CD is synced and the CNPG cluster spec shows the new size, the PVCs may still remain at the old size.
Check:
kubectl -n <namespace> get cluster <cluster-name> -o yaml | grep -A3 'storage:'
kubectl -n <namespace> get pvc
If the cluster shows the new desired size but the PVCs still show the old size, patch the PVCs manually:
kubectl -n <namespace> patch pvc <pvc-1> --type merge -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
kubectl -n <namespace> patch pvc <pvc-2> --type merge -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
Verify:
kubectl -n <namespace> get pvc
kubectl -n <namespace> get pvc <pvc-1> -o yaml | grep -A6 'resources:'
kubectl -n <namespace> get pvc <pvc-2> -o yaml | grep -A6 'resources:'
15.6 Recover a broken instance by scaling down and back up
If one instance remains broken after storage expansion, recover the cluster by temporarily reducing replicas.
Scale down to one instance:
kubectl -n <namespace> patch cluster <cluster-name> --type merge -p '{"spec":{"instances":1}}'
Verify the spec changed:
kubectl -n <namespace> get cluster <cluster-name> -o jsonpath='{.spec.instances}{"\n"}'
If the broken instance does not disappear automatically, delete the Pod:
kubectl -n <namespace> delete pod <broken-db-pod>
Wait until the cluster is stable with one ready instance:
kubectl -n <namespace> get pods
kubectl -n <namespace> get cluster <cluster-name> -o yaml | grep -E 'readyInstances|currentPrimary|phase'
Expected result:
- only one DB pod remains
- phase becomes healthy
- applications depending on the database may recover
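The waiting step can be scripted with a generic polling helper. A sketch; wait_for is an assumed name, not project tooling, and .status.readyInstances is the CNPG Cluster status field the document already inspects:

```shell
# Poll a command once per second until it succeeds or the timeout (seconds) elapses.
wait_for() {
  timeout=$1; shift
  elapsed=0
  until "$@"; do
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Wait up to 5 minutes for CNPG to report one ready instance (placeholder names):
# wait_for 300 sh -c \
#   "kubectl -n <namespace> get cluster <cluster-name> -o jsonpath='{.status.readyInstances}' | grep -qx 1"
```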
Then restore high availability:
kubectl -n <namespace> patch cluster <cluster-name> --type merge -p '{"spec":{"instances":2}}'
kubectl -n <namespace> get pods -w
CNPG should create a fresh replica automatically.
15.7 If the old instance still blocks recovery
If scaling down is not enough and the broken instance identity remains stuck, it may be necessary to delete the old PVC after the cluster is stable at one instance.
Only do this after confirming that one healthy instance remains.
Example:
kubectl -n <namespace> delete pvc <broken-instance-pvc>
This forces CNPG to rebuild the replica from the remaining healthy instance.
15.8 Verification after recovery
Check cluster health:
kubectl -n <namespace> get pods
kubectl -n <namespace> get cluster <cluster-name> -o yaml | grep -E 'readyInstances|currentPrimary|phase'
Expected:
- primary is running
- replica is running
- readyInstances: 2
- phase: Cluster in healthy state
Check application pods:
kubectl -n <namespace> get pods
For Open WebUI specifically, the web pods should return to 1/1 Running once the database becomes healthy again.
15.9 Notes
- This recovery pattern was used successfully for both Authentik and Open WebUI
- The issue may present differently:
  - invalid storage shrink in GitOps
  - PVC too small / low-disk-space protection
  - broken replica stuck in restart loop
- On ceph-block, volume expansion is supported, but PVCs may still need manual patching
- Always update the GitOps source of truth before performing manual recovery actions