Schedule, partition, autoscale, and operate NVIDIA GPUs on Kubernetes for AI workloads. From the device plugin to MIG, gang scheduling, and on-call runbooks.
Ten modules, YAML + kubectl + helm throughout, taking you from 'how does Kubernetes even see a GPU' to running a multi-tenant, autoscaling, observable GPU cluster. Covers the NVIDIA GPU Operator, MIG and time-slicing/MPS partitioning, topology-aware placement, gang scheduling with Volcano and Kueue, autoscaling with Cluster Autoscaler and Karpenter, multi-tenancy and cost control, and DCGM-based monitoring with an incident runbook. Reflects 2026 practice with current operator, device-plugin, and scheduler tooling.
Built by Lakshya Kumar
Paste this into any AI chat. Fill in the bracketed parts with your context — you'll get back a straight answer on whether this belongs on your plate.
We grant free access case-by-case: students, career-switchers, builders on a tight budget.
Complete all modules, then submit the required number of capstone projects. Each must earn a passing rating from an admin reviewer.
Configure a node with MIG slices for one workload and time-slicing for another, run both concurrently, add an autoscaling policy that scales the GPU node pool on a DCGM utilization or queue-depth metric, and stand up a DCGM + Grafana dashboard. Submit manifests, the autoscaling config, dashboard screenshots, and a writeup explaining the partitioning and scaling decisions.
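As a starting point for the time-slicing half of this project, the sketch below shows the general shape of a device-plugin sharing config wrapped in a ConfigMap. The ConfigMap name `time-slicing-config`, the data key `any`, the `gpu-operator` namespace, and the replica count of 4 are illustrative assumptions, not values the course prescribes; check them against the device plugin version you run.

```yaml
# Hypothetical ConfigMap: name, namespace, and the "any" key are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        # Each physical GPU is advertised as 4 schedulable nvidia.com/gpu units.
        # Time-slicing gives no memory or fault isolation between the sharers.
        - name: nvidia.com/gpu
          replicas: 4
```

Time-slicing oversubscribes a GPU without isolation, while MIG carves it into hardware-isolated slices, which is why the project asks you to justify which workload gets which.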
I'm taking a "GPU Orchestration & Kubernetes for AI" course covering the device plugin model, the NVIDIA GPU Operator, scheduling (taints/affinity/priority), MIG and time-slicing/MPS partitioning, topology-aware placement, gang scheduling with Volcano/Kueue, autoscaling with Cluster Autoscaler/Karpenter, multi-tenancy + cost, and DCGM-based monitoring.

My context:
1. My cluster: [cloud GKE/EKS/AKS, on-prem, or local kind/minikube]
2. GPU types I have or want: [e.g. A100 80GB, L4, H100, consumer RTX]
3. My workloads: [training / batch inference / online serving / notebooks]
4. My biggest GPU pain right now: [low utilization / Pending pods / cost / contention / failures]

Given that, answer:
- Which module should I prioritize for my situation?
- Should I be using MIG, time-slicing, MPS, or whole GPUs for my workloads, and why?
- Name 3 concrete wins this course would unlock for me.
- If I only had 2 hours this week, which single technique gives me the biggest lift?
Install the NVIDIA GPU Operator via Helm on a fresh GPU node pool, validate every operand (driver, toolkit, device plugin, DCGM, node feature discovery), run the operator's validation workload, and document the node labels it produced. Submit the values file, validation output, and a rollback plan.
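The values file for this project could look roughly like the sketch below, which explicitly enables each operand the brief asks you to validate. The key names follow the GPU Operator Helm chart as commonly documented, but they can change between chart versions, so treat them as assumptions and confirm them with `helm show values` for the chart version you install.

```yaml
# Sketch of a GPU Operator values file; verify every key against your chart version.
driver:
  enabled: true        # set to false if the node image ships a preinstalled driver
toolkit:
  enabled: true        # NVIDIA container toolkit
devicePlugin:
  enabled: true        # advertises nvidia.com/gpu to the kubelet
dcgmExporter:
  enabled: true        # Prometheus metrics from DCGM
nfd:
  enabled: true        # Node Feature Discovery, source of the node labels
migManager:
  enabled: true        # only relevant on MIG-capable GPUs
mig:
  strategy: single     # assumption: every GPU on a node uses the same MIG layout
```

Keeping the file under version control next to the validation output also gives the rollback plan something concrete to roll back to.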
Run an all-or-nothing multi-GPU distributed job using Volcano or Kueue so that it only starts when every worker can be placed, and demonstrate that partial scheduling never occurs under contention. Submit the job spec, queue/quota config, and evidence of gang behavior under load.
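A minimal sketch of the all-or-nothing behavior, using a Volcano Job, is shown below. The job name, queue name, worker count, and container image are hypothetical placeholders; the part that matters is that `minAvailable` equals the total number of replicas, so the scheduler places either every worker or none.

```yaml
# Hypothetical 4-worker job; name, queue, and image are placeholders.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ddp-train
spec:
  schedulerName: volcano
  queue: default
  minAvailable: 4            # gang size: equal to replicas, so no partial start
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: example.com/trainer:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
```

To produce evidence under contention, one option is to submit two such jobs onto a pool that can only fit one and show that the second stays fully Pending rather than holding a subset of GPUs.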
Partition a GPU cluster across at least two tenants using namespaces, ResourceQuotas, and fractional GPU (MIG or time-slicing), back it with spot/preemptible GPU nodes plus on-demand fallback, and produce a chargeback report. Submit the tenancy manifests and a cost breakdown.
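The per-tenant quota part of this project can be sketched with a standard ResourceQuota. The namespace `team-a`, the quota name, and the limits are illustrative assumptions. Extended resources such as GPUs are quota'd with the `requests.` prefix, and the MIG resource name shown applies only if the device plugin exposes slices as separately named resources (the "mixed" MIG strategy).

```yaml
# Hypothetical quota for one tenant; namespace and numbers are placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    # Whole (or time-sliced) GPUs this namespace may request in total.
    requests.nvidia.com/gpu: "4"
    # Assumption: mixed MIG strategy on A100 80GB, slices exposed by profile name.
    requests.nvidia.com/mig-1g.10gb: "7"
```

Note that a quota on time-sliced `nvidia.com/gpu` counts replicas, not physical GPUs, which the chargeback report should account for.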
Deploy the DCGM exporter with Prometheus + Grafana, build alerts for XID errors, ECC faults, thermal throttling, and idle GPUs, and write an on-call runbook covering detection, node cordon/drain, and recovery for each failure class. Submit dashboards, alert rules, and the runbook.
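Two of the alerts in this project might be sketched as the Prometheus rule file below. The thresholds, durations, severity labels, and the `gpu` and `Hostname` label names are assumptions to adapt to your exporter's output. ECC and throttling alerts follow the same pattern, but the corresponding DCGM fields may need to be enabled in the exporter's metrics configuration first.

```yaml
# Sketch of Prometheus alert rules on dcgm-exporter metrics; thresholds are placeholders.
groups:
  - name: gpu-health
    rules:
      - alert: GpuXidError
        # The metric reports the most recent XID code; any non-zero value means an error occurred.
        expr: DCGM_FI_DEV_XID_ERRORS > 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "XID {{ $value }} on GPU {{ $labels.gpu }} ({{ $labels.Hostname }})"
      - alert: GpuIdle
        # Average utilization below 5% over 30 minutes suggests a wasted allocation.
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 5
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} has been idle for over 30 minutes"
```

Paging on XID errors while only ticketing on idle GPUs mirrors the split the runbook needs: the first is a failure class requiring cordon and drain, the second is a cost signal.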
Source, install instructions, time-slicing and MIG config. Referenced across Modules 1, 4, 5.