Verifying GPU access with nvidia-smi

Difficulty: hard

Why this matters

A pod can be scheduled onto a GPU node and still not see the GPU because of a runtime misconfiguration, a missing capability, or a driver mismatch. The fastest way to prove a container has real GPU access is to run nvidia-smi inside it: if it lists the device, then the driver, the container runtime, and device injection are all working. Making this check a reflex saves hours. Instead of debugging why a training job is mysteriously slow, you run one command and learn immediately whether the container sees the hardware at all. This is the 'is it plugged in' check for every GPU workload.

Demo

Exec into a GPU pod and run nvidia-smi. A correct setup prints the GPU model, the driver and CUDA versions, memory usage, and any running processes. If the command fails with 'command not found' or 'No devices were found', the runtime isn't injecting the GPU into the container.
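
The check above can be scripted. What follows is a minimal Python sketch, assuming kubectl is on the PATH and can reach the cluster. The helper names classify and check_pod and the diagnosis strings are hypothetical, not part of any NVIDIA or Kubernetes tooling, and the error substrings it matches are assumptions that may vary across driver and runtime versions.

```python
import subprocess


def classify(returncode: int, output: str) -> str:
    """Map the result of running nvidia-smi to a rough diagnosis.

    The substrings matched here are assumptions based on common error text.
    """
    text = output.lower()
    if "no devices were found" in text:
        return "driver present but no GPU device visible to the container"
    if returncode == 0:
        return "GPU visible"
    if "nvidia-smi" in text and "not found" in text:
        return "nvidia-smi missing: driver utilities were not injected"
    return "nvidia-smi failed: check output for a driver or library mismatch"


def check_pod(pod: str, namespace: str = "default") -> str:
    """Run nvidia-smi inside the pod via kubectl exec and classify the result."""
    result = subprocess.run(
        ["kubectl", "exec", "-n", namespace, pod, "--", "nvidia-smi"],
        capture_output=True,
        text=True,
    )
    # kubectl reports exec failures on stderr, so inspect both streams.
    return classify(result.returncode, result.stdout + result.stderr)
```

Calling check_pod("gpu-check") against a running pod returns one of the four diagnosis strings. Keeping classify separate from the kubectl call means the matching logic can be exercised without a cluster.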

Try it yourself

Run the gpu-check pod and confirm nvidia-smi prints a populated GPU table.
Read off the driver and CUDA versions and note them — version skew is a common failure cause.
Remove the nvidia.com/gpu limit from the override and confirm nvidia-smi now reports 'No devices were found'.
Run nvidia-smi inside an actual workload pod (not a throwaway) and confirm your process appears in its process list.
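
The version and process checks in this list can also be done against nvidia-smi's CSV query mode instead of reading the table by eye. The sketch below assumes you have captured the text output of `nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader` and `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader`. The helper names are hypothetical. It covers the driver version only, so read the CUDA version from the header of the default nvidia-smi table.

```python
import csv
import io


def parse_gpus(csv_text: str) -> list[dict]:
    """Parse rows of 'name, driver_version, memory.total' into dicts."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [
        {
            "name": r[0].strip(),
            "driver_version": r[1].strip(),
            "memory_total": r[2].strip(),
        }
        for r in rows
        if len(r) >= 3  # skips blank lines and one-field error messages
    ]


def parse_compute_apps(csv_text: str) -> list[dict]:
    """Parse rows of 'pid, process_name, used_memory' into dicts."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [
        {
            "pid": int(r[0]),
            "process_name": r[1].strip(),
            "used_memory": r[2].strip(),
        }
        for r in rows
        if len(r) >= 3
    ]


def process_is_listed(apps: list[dict], name_fragment: str) -> bool:
    """Return True if any listed GPU process name contains the fragment."""
    return any(name_fragment in app["process_name"] for app in apps)
```

An empty result from parse_gpus corresponds to the 'No devices were found' case. An empty process list is weaker evidence: depending on the driver version and PID namespace settings, nvidia-smi inside a container may not list processes even while a job is using the GPU.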

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, what does running nvidia-smi inside a pod tell me, and why is it the first thing to check?

2. Why it works (the mechanism)

Walk me through how the NVIDIA container runtime injects the GPU device and driver into a container so that nvidia-smi works inside it.

3. Advanced — application & what's next

A pod was scheduled onto a GPU node, but nvidia-smi reports 'No devices were found'. Given the runtime injection mechanism, what are the most likely causes, in priority order?