For CPU and memory, Kubernetes lets you set requests (the guaranteed minimum the scheduler reserves) separately from limits (the ceiling), and the gap between them drives bin-packing and overcommit. GPUs break that model: nvidia.com/gpu is an extended resource, so it is allocated in whole units and cannot be overcommitted. You can set only a limit, in which case the request defaults to the same value, or set both, in which case they must be equal. Builders who carry their CPU intuition over to GPUs write specs that get rejected or, more subtly, end up requesting more than they meant, because a limit of 2 is also a request for 2. Knowing that GPUs are an all-or-nothing, request-equals-limit resource is what keeps your scheduling predictable.
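The contrast can be sketched as a container-level resources fragment. The quantities are illustrative assumptions, not values from the lesson; the point is that CPU and memory may leave a gap between request and limit, while the GPU is declared once, under limits.

```yaml
# Illustrative fragment of a container spec; all quantities are assumed values.
resources:
  requests:
    cpu: "500m"          # minimum the scheduler reserves for this container
    memory: "1Gi"
  limits:
    cpu: "2"             # ceiling; the gap above the request is what allows overcommit
    memory: "2Gi"
    nvidia.com/gpu: 1    # limit only: the request defaults to 1, and only whole GPUs are allowed
```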
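Try to set a GPU request smaller than its limit and the API server rejects the pod at validation time, because extended resources require request == limit. That rule alone does not put the pod in the Guaranteed QoS class: QoS is computed from CPU and memory only, so a GPU pod is Guaranteed only when every container also sets CPU and memory requests equal to their limits.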
Use these three prompts in order. Each builds on the one before.
In one paragraph, why must GPU requests equal GPU limits in Kubernetes when CPU and memory don't have that rule?
Walk me through how Kubernetes handles requests vs limits for extended resources like nvidia.com/gpu, and whether that affects the pod's QoS class.
Given that GPUs can't be overcommitted via request/limit gaps, how do MIG and time-slicing reintroduce sharing, and what do they trade away compared to true overcommit?
# This spec is REJECTED: for extended resources, request must equal limit.
apiVersion: v1
kind: Pod
metadata:
  name: bad-gpu-request
spec:
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04
      command: ["sleep", "3600"]
      resources:
        requests:
          nvidia.com/gpu: 1   # mismatch: the request is smaller than the limit ...
        limits:
          nvidia.com/gpu: 2   # ... so the API server rejects it: requests must equal limits
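For comparison, here is a minimal sketch of a version the API server should accept. It assumes a cluster where a device plugin advertises nvidia.com/gpu; the pod name good-gpu-request and the CPU and memory quantities are hypothetical, chosen only for illustration. The GPU appears under limits alone, and CPU and memory are given matching requests and limits so that the pod also qualifies for the Guaranteed QoS class.

```yaml
# Sketch of an accepted spec: the GPU is set under limits only, so request == limit by default.
apiVersion: v1
kind: Pod
metadata:
  name: good-gpu-request          # hypothetical name
spec:
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "1"                # assumed value; equals the limit below
          memory: "2Gi"           # assumed value; equals the limit below
        limits:
          cpu: "1"
          memory: "2Gi"
          nvidia.com/gpu: 1       # request is defaulted to 1 by the API server
```

Once the pod is created, `kubectl get pod good-gpu-request -o jsonpath='{.status.qosClass}'` shows the QoS class the cluster assigned.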