GPU Orchestration & Kubernetes for AI

Schedule, partition, autoscale, and operate NVIDIA GPUs on Kubernetes for AI workloads. From the device plugin to MIG, gang scheduling, and on-call runbooks.

Free preview

Certificate: 1 of 5 capstones

Ten modules, YAML + kubectl + helm throughout, taking you from 'how does Kubernetes even see a GPU' to running a multi-tenant, autoscaling, observable GPU cluster. Covers the NVIDIA GPU Operator, MIG and time-slicing/MPS partitioning, topology-aware placement, gang scheduling with Volcano and Kueue, autoscaling with Cluster Autoscaler and Karpenter, multi-tenancy and cost control, and DCGM-based monitoring with an incident runbook. Reflects 2026 practice with current operator, device-plugin, and scheduler tooling.

Built by Lakshya Kumar

kubernetes

gpu

nvidia

mig

scheduling

autoscaling

devops

Before you start (4 items)

Comfortable with Kubernetes basics (pods, deployments, services, kubectl).
Access to a cluster with at least one NVIDIA GPU (cloud GKE/EKS/AKS or local with a GPU); kind/minikube works for some labs.
Basic Helm familiarity.
Comfortable on the Linux command line.

Is this course for you? Ask an AI

Paste the prompt below into any AI chat and fill in the bracketed parts with your context. You'll get back a straight answer on whether this course belongs on your plate.

Get access to GPU Orchestration & Kubernetes for AI

$3.99

30-day access

Prefer the whole catalog? See all-access membership.

Ask for access

We grant free access case-by-case — students, career-switchers, builders on a tight budget. Sign in to send us a note.

Capstone projects

Submit any 1 of 5 to earn the certificate

Complete all modules, then submit the required number of capstone projects. Each must earn a passing rating from an admin reviewer.

capstone

Shared GPU cluster: MIG + time-slicing + autoscaling + DCGM

Configure a node with MIG slices for one workload and time-slicing for another, run both concurrently, add an autoscaling policy that scales the GPU node pool on a DCGM utilization or queue-depth metric, and stand up a DCGM + Grafana dashboard. Submit manifests, the autoscaling config, dashboard screenshots, and a writeup explaining the partitioning and scaling decisions.
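As a rough starting point for the partitioning half of this capstone, the sketch below shows a device-plugin time-slicing config and a pod that requests a MIG slice. The names `time-slicing-config`, `mig-workload`, the `gpu-operator` namespace, the `any` key, the replica count, and the container image are all assumptions for illustration. The MIG resource name assumes the mixed MIG strategy on an A100 80GB and will differ on other hardware.

```yaml
# Time-slicing config consumed by the NVIDIA device plugin.
# With the GPU Operator, this ConfigMap is typically referenced from the
# ClusterPolicy via devicePlugin.config.name (assumed setup).
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config      # hypothetical name
  namespace: gpu-operator        # assumes the operator's default namespace
data:
  any: |-                        # hypothetical config key
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4            # each physical GPU is advertised 4 times
---
# Pod requesting one MIG slice instead of a whole GPU.
apiVersion: v1
kind: Pod
metadata:
  name: mig-workload             # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: example.com/cuda-workload:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1   # assumes mixed strategy on an A100 80GB
```

Time-slicing gives no memory or fault isolation between replicas, while MIG slices are hardware-isolated, which is usually the core of the writeup's partitioning argument.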

Submit cluster config

Minimum rating for approval: 3/5

Further reading & study material (6 sources)

Prompt

I'm taking a "GPU Orchestration & Kubernetes for AI" course covering the device plugin model, the NVIDIA GPU Operator, scheduling (taints/affinity/priority), MIG and time-slicing/MPS partitioning, topology-aware placement, gang scheduling with Volcano/Kueue, autoscaling with Cluster Autoscaler/Karpenter, multi-tenancy + cost, and DCGM-based monitoring.

My context:
1. My cluster: [cloud GKE/EKS/AKS, on-prem, or local kind/minikube]
2. GPU types I have or want: [e.g. A100 80GB, L4, H100, consumer RTX]
3. My workloads: [training / batch inference / online serving / notebooks]
4. My biggest GPU pain right now: [low utilization / Pending pods / cost / contention / failures]

Given that, answer:
- Which module should I prioritize for my situation?
- Should I be using MIG, time-slicing, MPS, or whole GPUs for my workloads, and why?
- Name 3 concrete wins this course would unlock for me.
- If I only had 2 hours this week, which single technique gives me the biggest lift?

gpu-operator-install

GPU Operator install + validation

Install the NVIDIA GPU Operator via Helm on a fresh GPU node pool, validate every operand (driver, toolkit, device plugin, DCGM, node feature discovery), run the operator's validation workload, and document the node labels it produced. Submit the values file, validation output, and a rollback plan.
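A values file for this install might look like the minimal sketch below. The file name `values.yaml` and the release name in the comments are assumptions, and the defaults of the chart version you install may already match these settings, so check them against that chart's own values.

```yaml
# values.yaml (hypothetical file name) for the nvidia/gpu-operator chart.
# Assumed install commands:
#   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
#   helm repo update
#   helm install gpu-operator nvidia/gpu-operator \
#     -n gpu-operator --create-namespace -f values.yaml
driver:
  enabled: true        # set to false if the node image ships its own driver
toolkit:
  enabled: true        # NVIDIA container toolkit
devicePlugin:
  enabled: true        # advertises nvidia.com/gpu to the kubelet
dcgmExporter:
  enabled: true        # Prometheus metrics from DCGM
nfd:
  enabled: true        # node feature discovery, source of the GPU node labels
```

After the operands are running, `kubectl get pods -n gpu-operator` and `kubectl get node <node-name> --show-labels` are reasonable ways to collect the validation output and node labels the report asks for.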

Submit install report

Minimum rating for approval: 3/5

gang-scheduled-job

Gang-scheduled distributed training job

Run an all-or-nothing multi-GPU distributed job using Volcano or Kueue so that it only starts when every worker can be placed, and demonstrate that partial scheduling never occurs under contention. Submit the job spec, queue/quota config, and evidence of gang behavior under load.
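One way to express the all-or-nothing requirement with Volcano is to set `minAvailable` equal to the worker count, as in the sketch below. The job name, worker count, and image are assumptions for illustration, and the `default` queue assumes a stock Volcano install. A Kueue-based submission would look different.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ddp-train                # hypothetical name
spec:
  schedulerName: volcano
  queue: default                 # assumes Volcano's default queue exists
  minAvailable: 4                # gang size: no pod starts unless all 4 can be placed
  tasks:
  - name: worker
    replicas: 4                  # must equal minAvailable for strict all-or-nothing
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: trainer
          image: example.com/trainer:latest   # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: 1
```

To gather evidence of gang behavior, submitting this job while fewer than four GPUs are free should leave every worker Pending rather than starting a subset.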

Submit job

Minimum rating for approval: 3/5

multi-tenant-cluster

Multi-tenant GPU cluster with quotas + spot

Partition a GPU cluster across at least two tenants using namespaces, ResourceQuotas, and fractional GPU (MIG or time-slicing), back it with spot/preemptible GPU nodes plus on-demand fallback, and produce a chargeback report. Submit the tenancy manifests and a cost breakdown.
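The per-tenant quota part of this design can be sketched with a plain ResourceQuota. The namespace `team-a`, the quota name, and the limits are assumptions. The MIG resource name again assumes the mixed strategy on an A100 80GB.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota                # hypothetical name
  namespace: team-a              # hypothetical tenant namespace
spec:
  hard:
    # Extended resources are quota'd with the requests. prefix only.
    requests.nvidia.com/gpu: "4"
    # Optional cap on MIG slices, if the tenant uses them.
    requests.nvidia.com/mig-1g.10gb: "2"
```

With time-slicing enabled, each unit of `nvidia.com/gpu` counted here is a shared replica rather than a whole device, which matters when the quota feeds a chargeback report.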

Submit tenancy design

Minimum rating for approval: 3/5

gpu-monitoring-runbook

GPU monitoring + incident runbook

Deploy the DCGM exporter with Prometheus + Grafana, build alerts for XID errors, ECC faults, thermal throttling, and idle GPUs, and write an on-call runbook covering detection, node cordon/drain, and recovery for each failure class. Submit dashboards, alert rules, and the runbook.
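A Prometheus rule file for some of these alerts could start from the sketch below. The group name, alert names, thresholds, durations, and severity labels are assumptions. The metric names are ones the DCGM exporter commonly exposes, but confirm they are enabled in your exporter's counters configuration, and that the `Hostname` and `gpu` labels are present in your setup. ECC alerts are left out because the relevant counters are often not exported by default.

```yaml
groups:
- name: gpu-health               # hypothetical group name
  rules:
  - alert: GpuXidError
    # DCGM_FI_DEV_XID_ERRORS reports the last XID error code seen on the GPU.
    expr: DCGM_FI_DEV_XID_ERRORS > 0
    for: 1m
    labels:
      severity: page             # assumed severity scheme
    annotations:
      summary: "XID {{ $value }} on GPU {{ $labels.gpu }} ({{ $labels.Hostname }})"
  - alert: GpuHighTemperature
    expr: DCGM_FI_DEV_GPU_TEMP > 85      # assumed threshold in degrees C
    for: 5m
    labels:
      severity: warn
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} is running hot"
  - alert: GpuIdle
    expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 5   # assumed idle cutoff
    for: 30m
    labels:
      severity: info
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} has been idle"
```

Each alert maps naturally to one runbook section: detection query, `kubectl cordon` and `kubectl drain` of the affected node, then the recovery steps for that failure class.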

Submit runbook

Minimum rating for approval: 3/5

Source, install instructions, time-slicing and MIG config. Referenced across Modules 1, 4, 5.

GPU Orchestration & Kubernetes for AI

GPUs in Kubernetes

The NVIDIA GPU Operator

GPU Scheduling Fundamentals

Hardware Partitioning: MIG

Time-Slicing & MPS

Topology-Aware Scheduling

Advanced Cluster Scheduling

Autoscaling GPU Workloads

Multi-Tenancy & Cost

Operating GPU Clusters