Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
A pod can schedule onto a GPU node and still not see the GPU, because of a runtime misconfiguration, a missing capability, or a driver mismatch. The fastest way to check whether a container has real GPU access is to run nvidia-smi inside it: if it lists the device, then the host driver, the container runtime, and the device injection are all working. Making this check a reflex saves hours: instead of debugging why a training job is mysteriously slow, you run one command and learn whether the container sees the hardware at all. This is the 'is it plugged in' check for every GPU workload.
Exec into a GPU pod and run nvidia-smi. A correct setup prints the GPU model, the driver version, the highest CUDA version that driver supports, memory usage, and a process list. If it fails with 'command not found' or 'No devices were found', the runtime isn't injecting the GPU into the container.
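The outcomes described above can be told apart mechanically once the output is captured. What follows is a minimal sketch rather than part of the lesson's demo: `classify_gpu_check` is a hypothetical helper name, and the patterns it matches are assumptions based on the messages quoted here plus the `Driver Version` label in the nvidia-smi header. Adjust them if your image words its errors differently.

```shell
#!/bin/sh
# Hypothetical helper: classify text captured from running nvidia-smi in a pod.
# Prints one of: ok, no-devices, no-binary, unknown.
classify_gpu_check() {
    case "$1" in
        *"No devices were found"*)
            # nvidia-smi ran, but no GPU was exposed to the container.
            echo "no-devices" ;;
        *"not found"*)
            # The nvidia-smi binary itself is missing from the container.
            echo "no-binary" ;;
        *"Driver Version"*)
            # The header row of the normal nvidia-smi table is present.
            echo "ok" ;;
        *)
            echo "unknown" ;;
    esac
}

# Example use against a running pod (pod name taken from this lesson):
#   out=$(kubectl exec cuda-vectoradd -- nvidia-smi 2>&1)
#   classify_gpu_check "$out"
```

The "No devices were found" pattern is tested first so that it can never be confused with the looser "not found" match.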
Use these three prompts in order. Each builds on the one before.
In one paragraph, what does running nvidia-smi inside a pod tell me, and why is it the first thing to check?
Walk me through how the NVIDIA container runtime injects the GPU device and driver into a container so that nvidia-smi works inside it.
A pod is scheduled onto a GPU node, but nvidia-smi reports 'No devices were found'. Given the runtime injection mechanism, what are the most likely causes in priority order?
# Run a throwaway pod that just calls nvidia-smi
kubectl run gpu-check --rm -it --restart=Never \
  --image=nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04 \
  --overrides='{"spec":{"containers":[{"name":"gpu-check","image":"nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":1}}}]}}'
# Or exec into an existing GPU pod
kubectl exec -it cuda-vectoradd -- nvidia-smi
# Healthy output shows a table with the GPU name, driver/CUDA version,
# memory usage, and a process list. "No devices were found" means the
# container did not get GPU access despite scheduling onto a GPU node.
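For scripted checks, the full table is awkward to parse. nvidia-smi also has a one-line-per-device listing (`nvidia-smi -L`) and a CSV query mode (`--query-gpu=... --format=csv,noheader`). The sketch below counts visible GPUs from the `-L` listing. `gpu_count` is a hypothetical helper name, and the sketch assumes each device line begins with `GPU <index>:`, which is how `nvidia-smi -L` normally prints them.

```shell
#!/bin/sh
# Hypothetical helper: count GPUs in `nvidia-smi -L` output read from stdin.
# Assumes each device is listed on its own line starting with "GPU <index>:".
gpu_count() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the helper safe to use under "set -e".
    grep -c '^GPU [0-9]' || true
}

# Example use against a running pod (pod name taken from this lesson):
#   kubectl exec cuda-vectoradd -- nvidia-smi -L | gpu_count
#
# CSV query alternative for name, driver version, and total memory:
#   kubectl exec cuda-vectoradd -- \
#     nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
```

A count of 0 from a pod that requested `nvidia.com/gpu` is the same failure as "No devices were found": the pod was scheduled, but the container did not get the device.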