Operating on GPU Compute

The cluster has two GPU paths: an NVIDIA RTX 5070 Ti on gpu-1 managed by the GPU Operator, and Intel Arc iGPUs on the three mini nodes exposed through DRA (Dynamic Resource Allocation). Both are operational, both have Talos-specific quirks, and both need different tools to inspect and troubleshoot.

This post covers the day-to-day commands for checking GPU state, managing workloads, and debugging the issues you will eventually hit. For the build story, see GPU Compute — NVIDIA and Intel and GPU Containers on Talos — The Validation Fix.

Observing State

NVIDIA GPU (gpu-1)

The GPU Operator runs several pods on gpu-1. Check that they are all healthy:

kubectl get pods -n gpu-operator -o wide

You should see pods for the device plugin, feature discovery, DCGM exporter, and the validation markers DaemonSet — all Running and 1/1. If any pod is stuck at Init:0/1, the validation markers are likely missing (see Debugging below).

To run nvidia-smi, exec into the DCGM exporter pod (it has the nvidia tools available):

kubectl exec -n gpu-operator $(kubectl get pod -n gpu-operator \
  -l app=nvidia-dcgm-exporter -o jsonpath='{.items[0].metadata.name}') \
  -- nvidia-smi

Sample output from the DCGM exporter pod:

$ POD=$(kubectl get pod -n gpu-operator -l app=nvidia-dcgm-exporter -o jsonpath='{.items[0].metadata.name}'); kubectl exec -n gpu-operator "$POD" -c nvidia-dcgm-exporter -- nvidia-smi
Mon Apr 20 16:55:31 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.211.01             Driver Version: 570.211.01     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   33C    P8             19W /  300W |    7956MiB /  16303MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

For a quick check of what the node reports as allocatable:

kubectl describe node gpu-1 | grep -A 10 "Allocated resources"

Look for nvidia.com/gpu in the capacity and allocatable fields:

kubectl get node gpu-1 -o jsonpath='{.status.capacity.nvidia\.com/gpu}'
# Should return: 1
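For reference, a workload gets the GPU through a resource limit. A minimal smoke-test pod might look like this (the pod name is hypothetical and the CUDA image tag is an assumption; pick one matching the driver's CUDA 12.8):

```shell
# Hypothetical smoke-test pod: requests the single GPU, runs nvidia-smi,
# and exits. The nvidia.com/gpu limit is what counts against the node's
# capacity of 1.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.8.0-base-ubuntu24.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"
EOF
```

If scheduling works, kubectl logs gpu-smoke should show the same table as above. Delete the pod afterwards so the single GPU is freed for real workloads.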

Intel iGPU (mini nodes)

The Intel DRA driver runs as a DaemonSet — one pod per mini node:

kubectl get pods -n intel-gpu-resource-driver -o wide

Check that ResourceSlices are published for each node:

$ kubectl get resourceslice -o wide
NAME                         NODE     DRIVER          POOL     AGE
mini-1-gpu.intel.com-wnt98   mini-1   gpu.intel.com   mini-1   28d
mini-2-gpu.intel.com-ssz5r   mini-2   gpu.intel.com   mini-2   28d
mini-3-gpu.intel.com-ch2jg   mini-3   gpu.intel.com   mini-3   28d

You should see three slices, one per mini node, all with driver gpu.intel.com. The DeviceClass should also exist:

kubectl get deviceclass gpu.intel.com

To see active ResourceClaims (pods currently using an Intel GPU):

kubectl get resourceclaim -A
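For reference, a workload claims an Intel GPU by referencing the gpu.intel.com DeviceClass through a ResourceClaimTemplate. A minimal sketch, with hypothetical names and image; the resource.k8s.io API version is an assumption and depends on your Kubernetes release:

```shell
# Hypothetical smoke-test pod claiming one Intel GPU via DRA.
# The API version below is an assumption -- check what your cluster serves
# with: kubectl api-versions | grep resource.k8s.io
kubectl apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-intel-gpu
spec:
  spec:
    devices:
      requests:
        - name: gpu
          deviceClassName: gpu.intel.com
---
apiVersion: v1
kind: Pod
metadata:
  name: igpu-smoke
spec:
  restartPolicy: Never
  containers:
    - name: check
      image: busybox
      command: ["ls", "-la", "/dev/dri"]
      resources:
        claims:
          - name: gpu
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-intel-gpu
EOF
```

While the pod exists, kubectl get resourceclaim -A should show the generated claim, and the pod's logs should list the render node under /dev/dri if allocation succeeded.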

Routine Operations

Check GPU Utilization

For NVIDIA, the quickest way to see what is running on the GPU:

# GPU utilization, memory usage, running processes
kubectl exec -n ollama $(kubectl get pod -n ollama \
  -o jsonpath='{.items[0].metadata.name}') -- nvidia-smi

# Or just check Ollama's model status
kubectl exec -n ollama $(kubectl get pod -n ollama \
  -o jsonpath='{.items[0].metadata.name}') -- ollama ps

The ollama ps output tells you the model name, size, processor allocation (look for 100% GPU), and context window size.

Check Which Pods Use the GPU

# NVIDIA — find pods requesting nvidia.com/gpu
kubectl get pods -A -o json | jq -r '
  .items[] | select(.spec.containers[].resources.limits."nvidia.com/gpu" != null)
  | "\(.metadata.namespace)/\(.metadata.name)"'

# Intel DRA — find pods with ResourceClaims
kubectl get pods -A -o json | jq -r '
  .items[] | select(.spec.resourceClaims != null)
  | "\(.metadata.namespace)/\(.metadata.name)"'
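The first filter can be sanity-checked offline against a hand-written pod list (hypothetical pod names), which is handy when tweaking the jq expression:

```shell
# Hypothetical two-pod list: one requesting nvidia.com/gpu, one not.
cat > /tmp/pods.json <<'EOF'
{"items":[
  {"metadata":{"namespace":"ollama","name":"ollama-abc"},
   "spec":{"containers":[{"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}},
  {"metadata":{"namespace":"default","name":"web-xyz"},
   "spec":{"containers":[{"resources":{"limits":{}}}]}}
]}
EOF

# Same selector as above, applied to the static file.
jq -r '.items[]
  | select(.spec.containers[].resources.limits."nvidia.com/gpu" != null)
  | "\(.metadata.namespace)/\(.metadata.name)"' /tmp/pods.json
# Only the GPU pod prints: ollama/ollama-abc
```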

Pull Models via kubectl exec

On Talos with NVIDIA, postStart lifecycle hooks fail due to the nvidia-container-cli exec hook. Models must be pulled manually after the pod is running:

kubectl exec -n ollama $(kubectl get pod -n ollama \
  -o jsonpath='{.items[0].metadata.name}') -- ollama pull qwen3.5:9b

kubectl exec -n ollama $(kubectl get pod -n ollama \
  -o jsonpath='{.items[0].metadata.name}') -- ollama pull deepseek-coder:6.7b

Models persist on the Longhorn PVC, so this is a one-time operation unless the PVC is lost.

Manage GPU Memory

If Ollama holds a model in VRAM that you want to unload:

# List loaded models
kubectl exec -n ollama $(kubectl get pod -n ollama \
  -o jsonpath='{.items[0].metadata.name}') -- ollama ps

# Unload by running a different model, or restart the pod
kubectl delete pod -n ollama $(kubectl get pod -n ollama \
  -o jsonpath='{.items[0].metadata.name}')

The pod will be recreated by the Deployment. Models on the PVC remain available — they just need to be loaded back into VRAM on the next request.

Debugging

GPU Not Allocating

If a pod requesting nvidia.com/gpu stays Pending:

# 1. Check that the device plugin registered the GPU
kubectl get node gpu-1 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
# Should return 1. If empty or 0, the device plugin is not running.

# 2. Check GPU Operator pods
kubectl get pods -n gpu-operator -o wide
# All should be Running. Look for Init:0/1 or CrashLoopBackOff.

# 3. Check validation markers
kubectl exec -n gpu-operator $(kubectl get pod -n gpu-operator \
  -l app=nvidia-validation-markers -o jsonpath='{.items[0].metadata.name}') \
  -- ls -la /run/nvidia/validations/
# Should show driver-ready and toolkit-ready files

If the markers are missing, check that the nvidia-validation-markers DaemonSet is running. If it is running but the files are gone, the node may have rebooted (files are on tmpfs). The DaemonSet loop recreates them within 30 seconds.
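The recreation logic is simple; an assumed sketch (the real DaemonSet script may differ, and may manage more marker files than the two named above):

```shell
# Assumed sketch of the marker-recreation loop: recreate the marker files
# if they vanish, checking on each pass. On the node this targets tmpfs
# at /run/nvidia/validations.
ensure_markers() {
  dir="$1"
  mkdir -p "$dir"
  for marker in driver-ready toolkit-ready; do
    [ -f "$dir/$marker" ] || touch "$dir/$marker"
  done
}

# In the DaemonSet this runs forever:
#   while true; do ensure_markers /run/nvidia/validations; sleep 30; done
ensure_markers /tmp/validations && ls /tmp/validations
```

Because the files live on tmpfs, a reboot always clears them; the loop is what brings them back without operator intervention.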

Containerd Issues on Talos

If GPU pods are stuck at ContainerCreating with PodReadyToStartContainers: False:

# Check containerd runtime config on gpu-1
talosctl -n 192.168.55.31 read /etc/cri/conf.d/20-customization.part

The file should contain the nvidia runtime as default and the base_runtime_spec:

[plugins."io.containerd.cri.v1.runtime"]
  cdi_spec_dirs = ["/var/cdi/static", "/var/cdi/dynamic"]
  [plugins."io.containerd.cri.v1.runtime".containerd]
    default_runtime_name = "nvidia"
  [plugins."io.containerd.cri.v1.runtime".containerd.runtimes.nvidia]
    base_runtime_spec = "/etc/cri/conf.d/base-spec.json"

If base_runtime_spec is missing, kubelet cannot track the GPU container lifecycle. See the Talos validation fix for the full story.
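A small helper can check the two critical keys in one go; this is a hypothetical convenience function, run against a copy of the file fetched with talosctl read:

```shell
# Hypothetical helper: verify the nvidia default runtime and the
# base_runtime_spec key are both present in the containerd customization.
check_cri_conf() {
  conf="$1"
  ok=0
  grep -q 'default_runtime_name = "nvidia"' "$conf" || { echo "MISSING: nvidia default runtime"; ok=1; }
  grep -q 'base_runtime_spec' "$conf" || { echo "MISSING: base_runtime_spec"; ok=1; }
  return $ok
}

# Usage on gpu-1:
#   talosctl -n 192.168.55.31 read /etc/cri/conf.d/20-customization.part > /tmp/cri.part
#   check_cri_conf /tmp/cri.part && echo "containerd config OK"
```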

Talos Reboot Loops from Conflicting Patches

If a node enters a ~35-minute reboot loop after applying a config patch, the likely cause is two patches creating the same file at /etc/cri/conf.d/20-customization.part. Talos cannot merge them and throws:

resource EtcFileSpecs.files.talos.dev(files/cri/conf.d/20-customization.part@undefined) already exists

The fix: each node must have its own machine-specific patch. Delete the cluster-wide patch before applying machine-specific ones. To recover a looping node:

# Remove the conflicting cluster-wide patch from Omni
omnictl delete configpatch <cluster-wide-patch-id>

# Watch the node recover (it will complete its current reboot cycle)
kubectl get node <node-name> -w

Force-Delete GPU Pods Carefully

Force-deleting a GPU pod (kubectl delete pod --force --grace-period=0) leaves stale containers holding the GPU allocation inside containerd. The device plugin still sees the GPU as allocated. New GPU pods will stay Pending with Insufficient nvidia.com/gpu.

If you must force-delete:

# Force delete (last resort)
kubectl delete pod -n <namespace> <pod> --force --grace-period=0

# Check if the GPU is still shown as allocated
kubectl describe node gpu-1 | grep -A 5 "Allocated resources"

# If GPU is still stuck as allocated, a clean node reboot clears it
talosctl -n 192.168.55.31 reboot

A clean reboot is the only reliable way to clear stale GPU allocations from containerd. Budget about 90 seconds for Talos to come back Ready.

Quick Reference

Task | Command
---- | -------
GPU Operator health | kubectl get pods -n gpu-operator -o wide
nvidia-smi | kubectl exec -n gpu-operator $(kubectl get pod -n gpu-operator -l app=nvidia-dcgm-exporter -o jsonpath='{.items[0].metadata.name}') -- nvidia-smi
Ollama model status | kubectl exec -n ollama $(kubectl get pod -n ollama -o jsonpath='{.items[0].metadata.name}') -- ollama ps
Pull a model | kubectl exec -n ollama $(kubectl get pod -n ollama -o jsonpath='{.items[0].metadata.name}') -- ollama pull <model>
Node GPU capacity | kubectl get node gpu-1 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
Intel DRA pods | kubectl get pods -n intel-gpu-resource-driver -o wide
Intel ResourceSlices | kubectl get resourceslice -o wide
Active ResourceClaims | kubectl get resourceclaim -A
Validation markers | kubectl exec -n gpu-operator $(kubectl get pod -n gpu-operator -l app=nvidia-validation-markers -o jsonpath='{.items[0].metadata.name}') -- ls /run/nvidia/validations/
Containerd config | talosctl -n 192.168.55.31 read /etc/cri/conf.d/20-customization.part
Reboot gpu-1 | talosctl -n 192.168.55.31 reboot

References