Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Theory becomes real the moment you run a workload that uses the GPU and you can watch it happen. Tying together everything from this module — a node advertising GPUs, a pod requesting one, the runtime injecting it, and nvidia-smi confirming utilization — gives you the baseline deployment you will harden and scale in every later module. Doing it once, observing the GPU light up under load, removes the mystery: GPU scheduling is just resources, requests, runtimes, and verification composed together. This is the working foundation the rest of the course builds on.
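The first link in that chain, a node advertising GPUs, can be checked before you deploy anything. The sketch below is one way to do it, assuming `kubectl` is on your PATH and the device plugin exposes the resource as `nvidia.com/gpu` (the name used in the Job manifest). The helper names `gpu_allocatable` and `fetch_nodes` are illustrative, not part of any library.

```python
import json
import subprocess

GPU_RESOURCE = "nvidia.com/gpu"  # assumed resource name, matching the Job manifest


def gpu_allocatable(nodes: dict) -> dict:
    """Map node name -> allocatable GPU count, given `kubectl get nodes -o json` output."""
    counts = {}
    for node in nodes.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Nodes without a working device plugin simply lack the key.
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


def fetch_nodes() -> dict:
    """Ask the cluster for its node list as parsed JSON (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling `gpu_allocatable(fetch_nodes())` against a live cluster returns a dictionary of per-node counts. A node that reports 0 will never receive a pod that requests a GPU, so this is a reasonable first thing to look at when a GPU job stays Pending.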
Deploy a GPU workload that does real compute (a tight CUDA loop or a tiny PyTorch job), then run nvidia-smi while it works and watch GPU-Util climb toward 100%. That moving number is proof the whole stack functions.
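If you would rather log the numbers than watch them by eye, a small poller can sample `nvidia-smi` while the job runs. This is a minimal sketch, assuming it runs somewhere `nvidia-smi` is available (the GPU node itself, or a container with the GPU injected) and that every queried field comes back numeric; some configurations report `[N/A]`, which this parser does not handle. The function names are hypothetical.

```python
import subprocess
import time

# Query flags supported by nvidia-smi; csv,noheader,nounits yields lines like "0, 97, 10342".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text: str) -> list:
    """Turn nvidia-smi CSV output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return rows


def watch(samples: int = 10, interval_s: float = 1.0) -> None:
    """Print utilization once per interval for a fixed number of samples."""
    for _ in range(samples):
        out = subprocess.run(QUERY, check=True, capture_output=True, text=True).stdout
        for row in parse_gpu_rows(out):
            print(f"GPU {row['index']}: {row['util_pct']}% util, {row['mem_mib']} MiB used")
        time.sleep(interval_s)
```

Start `watch()` just before the job launches so the samples cover the transition from idle to loaded.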
Use these three in order. Each builds on the one before.
In one paragraph, summarize the full chain that lets a Kubernetes pod actually use a GPU, from node to container.
Walk me through each layer a GPU request passes through — scheduler, device plugin allocation, runtime injection — when this job starts.
Given this working end-to-end GPU job, identify every layer where a misconfiguration could cause it to fail or silently run on CPU, and how you'd verify each.
Working within free-tier limits. Free / low-tier provider keys rate-limit aggressively, and eval or agent loops that fan out calls will hit 429 Too Many Requests fast. Survive it: read Retry-After and the x-ratelimit-* headers and back off (exponential backoff with jitter + a max-retry cap) instead of hammering; cap in-flight requests with a small concurrency limiter so you stay under the RPM/TPM ceiling; cache identical requests so retries don't re-spend quota; downshift to a smaller/cheaper model for practice runs; use the provider Batch API for non-interactive jobs; or sidestep hosted limits entirely by running a small model locally (Ollama / llama.cpp) or on a free Colab/Kaggle GPU while you learn.
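The backoff advice can be sketched in a few lines. This is an illustration under stated assumptions, not any provider's client: `RateLimited` is a hypothetical exception your own request wrapper would raise on a 429, and it assumes Retry-After arrives as a number of seconds (the header may also be an HTTP date, which is not handled here).

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error for a 429 response; carries Retry-After in seconds if present."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after


def backoff_delay(attempt, base=1.0, cap=30.0, retry_after=None, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        return float(retry_after)  # the server told us how long to wait
    # Exponential backoff with full jitter: random point in [0, min(cap, base * 2^attempt)).
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retries(fn, max_retries=5, sleep=time.sleep):
    """Call fn(), retrying on RateLimited up to max_retries times, then re-raise."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_retries:
                raise  # retry cap reached; surface the error instead of hammering
            sleep(backoff_delay(attempt, retry_after=err.retry_after))
```

The `sleep` and `rng` parameters exist only so the behaviour can be tested without real waiting or randomness. In practice you would pair this with a concurrency limiter so that retries from parallel workers do not pile up.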
# gpu-burn-job.yml: a Job that exercises the GPU so you can watch utilization
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-burn
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: torch
          image: pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime
          command:
            - python
            - -c
            - |
              import torch, time
              assert torch.cuda.is_available(), "CUDA not visible to the container"
              x = torch.randn(8192, 8192, device="cuda")
              t = time.time()
              for _ in range(200):
                  # Keep x fixed so values stay finite across iterations.
                  y = x @ x
              # Wait for all queued kernels to finish before stopping the timer.
              torch.cuda.synchronize()
              print("done in", round(time.time() - t, 2), "s on", torch.cuda.get_device_name(0))
          resources:
            limits:
              nvidia.com/gpu: 1
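
After `kubectl apply -f gpu-burn-job.yml`, the Job's status tells you whether the workload finished or failed; because `backoffLimit` is 0, a single pod failure marks the whole Job as failed. The sketch below classifies the JSON you would get from `kubectl get job gpu-burn -o json`. The helper name `job_outcome` and its return labels are illustrative choices, not Kubernetes terms.

```python
def job_outcome(job: dict) -> str:
    """Classify a Job object (parsed JSON) as complete, failed, running, or pending."""
    status = job.get("status", {})
    # A finished Job carries a condition of type Complete or Failed with status "True".
    for cond in status.get("conditions", []):
        if cond.get("status") == "True" and cond.get("type") in ("Complete", "Failed"):
            return cond["type"].lower()
    # No terminal condition yet: either a pod is active or nothing has started.
    if status.get("active", 0) > 0:
        return "running"
    return "pending"
```

A result of "pending" that persists usually means the pod could not be scheduled, for example because no node has a free GPU. On "failed", `kubectl logs job/gpu-burn` should show whether the `torch.cuda.is_available()` assertion was the cause.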