Your first end-to-end GPU workload

Difficulty: hard

Learn with your AI

Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.

Why this matters

Theory becomes real the moment you run a workload that uses the GPU and you can watch it happen. Tying together everything from this module — a node advertising GPUs, a pod requesting one, the runtime injecting it, and nvidia-smi confirming utilization — gives you the baseline deployment you will harden and scale in every later module. Doing it once, observing the GPU light up under load, removes the mystery: GPU scheduling is just resources, requests, runtimes, and verification composed together. This is the working foundation the rest of the course builds on.

Demo

Deploy a GPU workload that does real compute (a tight CUDA loop or a tiny PyTorch job), then run nvidia-smi while it works and watch GPU-Util climb toward 100%. That moving number is proof the whole stack functions.
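One way to put the demo together is sketched below, not as the course's official manifest but as a minimal starting point. It builds a Kubernetes Job as a plain Python dict and prints it as JSON, which kubectl accepts alongside YAML. The job name `gpu-smoke-test`, the `pytorch/pytorch:latest` image, the matrix size, and the iteration count are all assumptions made for illustration; `nvidia.com/gpu` is the resource name advertised by the NVIDIA device plugin.

```python
import json

# Tiny PyTorch job run inside the container. The matrix size and loop count
# are illustrative assumptions; raise them if the job finishes too quickly
# to observe in nvidia-smi.
WORKLOAD = """
import time
import torch

assert torch.cuda.is_available(), "CUDA not visible"
print("GPU:", torch.cuda.get_device_name(0))

x = torch.randn(4096, 4096, device="cuda")
start = time.time()
for _ in range(2000):
    y = x @ x  # dense matmul keeps the GPU busy
torch.cuda.synchronize()  # wait for queued kernels before timing
print(f"completed in {time.time() - start:.1f}s")
"""


def build_gpu_job(name="gpu-smoke-test", image="pytorch/pytorch:latest", gpus=1):
    """Return a batch/v1 Job manifest that requests `gpus` NVIDIA GPUs.

    The name and image defaults are hypothetical placeholders.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # fail fast instead of retrying
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "workload",
                            "image": image,
                            "command": ["python", "-c", WORKLOAD],
                            # The GPU limit is what makes the scheduler and
                            # device plugin allocate a device to this pod.
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_gpu_job(), indent=2))
```

Assuming the script is saved as `make_job.py` (a hypothetical filename), `python make_job.py | kubectl apply -f -` submits the job and `kubectl logs job/gpu-smoke-test` shows the GPU name and completion time once it finishes.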

Try it yourself

Run the job and confirm its logs report the GPU name and a completion time.
While the job runs, watch nvidia-smi and confirm GPU-Util climbs well above idle.
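Remove the GPU limit from the pod spec so the job's CUDA-availability assertion fails, then confirm the job errors with 'CUDA not visible'.
Scale to two parallel job pods on a multi-GPU node and confirm each lands on a different GPU, for example by comparing the GPU UUID that nvidia-smi -L prints inside each pod.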
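The utilization and placement checks above can also be scripted rather than eyeballed. The sketch below polls nvidia-smi through its CSV query mode and parses index, UUID, and GPU-Util per device. The helper names and the 50% "busy" threshold are assumptions chosen for illustration, not values from the lesson.

```python
import shutil
import subprocess
import time

# nvidia-smi's CSV query mode prints one line per GPU, e.g. "0, GPU-..., 97".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,uuid,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(text):
    """Parse CSV query output into a list of dicts, one per GPU."""
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, uuid, util = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "uuid": uuid,
            # Some devices report a non-numeric value such as "[N/A]".
            "util": int(util) if util.isdigit() else None,
        })
    return rows


def busy_gpus(rows, threshold=50):
    """Return UUIDs of GPUs at or above `threshold` percent (assumed cutoff)."""
    return [r["uuid"] for r in rows
            if r["util"] is not None and r["util"] >= threshold]


def sample():
    """Run nvidia-smi once and return the parsed rows."""
    out = subprocess.run(QUERY, check=True, capture_output=True, text=True).stdout
    return parse_gpu_rows(out)


def watch(samples=5, interval=1.0):
    """Print utilization for every visible GPU a few times in a row."""
    for _ in range(samples):
        for row in sample():
            print(f"GPU {row['index']} ({row['uuid']}): {row['util']}% util")
        time.sleep(interval)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        watch()
```

Run on the node, this shows every GPU. Run inside a pod, it should show only the device allocated to that pod, so two parallel pods reporting different UUIDs is the evidence that each landed on its own GPU.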

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, summarize the full chain that lets a Kubernetes pod actually use a GPU, from node to container.

2. Why it works (the mechanism)

Walk me through each layer a GPU request passes through — scheduler, device plugin allocation, runtime injection — when this job starts.

3. Advanced — application & what's next

Given this working end-to-end GPU job, identify every layer where a misconfiguration could cause it to fail or silently run on CPU, and how you'd verify each.

References

Working within free-tier limits. Free / low-tier provider keys rate-limit aggressively, and eval or agent loops that fan out calls will hit 429 Too Many Requests fast. Survive it: read Retry-After and the x-ratelimit-* headers and back off (exponential backoff with jitter + a max-retry cap) instead of hammering; cap in-flight requests with a small concurrency limiter so you stay under the RPM/TPM ceiling; cache identical requests so retries don't re-spend quota; downshift to a smaller/cheaper model for practice runs; use the provider Batch API for non-interactive jobs; or sidestep hosted limits entirely by running a small model locally (Ollama / llama.cpp) or on a free Colab/Kaggle GPU while you learn.
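The backoff advice above can be sketched in a few lines. This is a minimal illustration, assuming the caller wraps its provider request in a function that raises a hypothetical `RateLimited` exception on HTTP 429 and passes along the `Retry-After` value in seconds (the header may also carry an HTTP date, which this sketch does not handle). The base delay, cap, and retry count are arbitrary example values.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised by the caller's request wrapper on HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds, if the server sent Retry-After


def backoff_delay(attempt, retry_after=None, base=1.0, cap=30.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        return float(retry_after)  # the server's hint wins over our guess
    # Exponential backoff with full jitter: random point in [0, min(cap, base * 2^attempt)).
    return random.random() * min(cap, base * 2 ** attempt)


def call_with_retries(fn, max_retries=5, sleep=time.sleep):
    """Call `fn`, retrying on RateLimited up to `max_retries` times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited as exc:
            if attempt == max_retries:
                raise  # retry cap reached, surface the error
            sleep(backoff_delay(attempt, exc.retry_after))
```

Passing `sleep` as a parameter keeps the retry loop easy to test without real waiting. A concurrency limiter and a request cache would sit around this function rather than inside it.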