The device plugin is the bridge between vendor hardware and the kubelet, and it follows a precise contract. A small DaemonSet pod runs on each GPU node, registers with the kubelet over a gRPC Unix socket, streams the list of GPUs it found together with their health (the ListAndWatch call), and answers the kubelet's Allocate call when a container that requests a GPU is about to start. Understanding this contract demystifies a whole class of failures (a missing socket, a crashed plugin pod, a driver mismatch) that otherwise all present as 'my GPU pod is stuck Pending forever'. Once you know the plugin is the component that advertises and allocates devices, you know where to look when GPUs vanish from a node.
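The device plugin runs as a DaemonSet in kube-system (or the GPU operator's namespace). Each plugin pod creates its own Unix socket under /var/lib/kubelet/device-plugins/, registers it with the kubelet through kubelet.sock in the same directory, and reports the node's GPUs as the extended resource nvidia.com/gpu. If that pod is unhealthy, the kubelet can no longer reach the plugin and the node's allocatable nvidia.com/gpu count drops to zero.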
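To see that effect from the API side, here is a minimal sketch that lists each node's capacity and allocatable count for nvidia.com/gpu. The function name gpu_counts is made up for this lesson; it only wraps a standard kubectl custom-columns query. A node whose plugin pod is unhealthy would typically show 0 or <none> under ALLOCATABLE.

```shell
# Hypothetical helper for this lesson: per-node GPU capacity vs. allocatable.
# The backslash escapes the dot inside the resource name nvidia.com/gpu so
# that kubectl does not treat it as a path separator.
gpu_counts() {
  kubectl get nodes -o custom-columns='NODE:.metadata.name,CAPACITY:.status.capacity.nvidia\.com/gpu,ALLOCATABLE:.status.allocatable.nvidia\.com/gpu'
}
```

Paste the function into a shell that already has cluster access, then run gpu_counts. Comparing the two columns is the quickest way to tell "no GPUs were ever advertised" apart from "GPUs were advertised but are currently not allocatable".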
Use these three prompts in order. Each builds on the one before.
In one paragraph, explain what the NVIDIA device plugin is and why Kubernetes needs it to use GPUs.
Walk me through the device plugin lifecycle: how it registers with the kubelet, lists devices, and allocates a specific GPU to a pod at startup.
A GPU node suddenly shows nvidia.com/gpu: 0 while the GPUs are physically fine. Using the device plugin model, give me an ordered debugging plan.
# Find the device plugin DaemonSet and its per-node pods
kubectl get daemonset -n kube-system | grep -i nvidia
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o wide
# Read a plugin pod's logs to see registration + device discovery
kubectl logs -n kube-system -l name=nvidia-device-plugin-ds --tail=40
# Healthy logs include lines like:
# "Starting FS watcher"
# "Registered device plugin for 'nvidia.com/gpu' with Kubelet"
# "Devices: GPU-... healthy"
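# Confirm the registration socket exists on the node (a node debug pod mounts the host filesystem at /host)
kubectl debug node/<node-name> -it --image=busybox -- ls /host/var/lib/kubelet/device-plugins/
# The listing should show nvidia-gpu.sock next to the kubelet's own kubelet.sock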
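The commands above cover registration and discovery. For the allocation half of the contract, one possible sketch is a throwaway pod that requests a single GPU. The pod name gpu-probe, the busybox image, and the helper name gpu_probe_manifest are all assumptions made for this example, not part of the lesson's demo. Nodes that taint their GPUs would additionally need a matching toleration.

```shell
# Hypothetical helper: print a Pod manifest that requests exactly one GPU.
# The pod name and image are placeholders chosen for this sketch.
gpu_probe_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-probe
spec:
  restartPolicy: Never
  containers:
  - name: probe
    image: busybox
    command: ["sleep", "300"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}

# Usage against a real cluster:
#   gpu_probe_manifest | kubectl apply -f -
#   kubectl get pod gpu-probe -o wide
#   kubectl describe pod gpu-probe   # the Events section explains a Pending pod
#   kubectl delete pod gpu-probe
```

If no node has an allocatable GPU, the pod stays Pending and the scheduler event reports Insufficient nvidia.com/gpu, which points back to the plugin checks earlier in the demo.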