Serve a real fleet of models the way companies actually do it: NVIDIA Triton, TensorRT-LLM, and NIM, plus graph compilation, ensembles/BLS, multi-model GPU sharing, observability, and production deployment.
Ten modules, ~100 challenges that go beyond a single vLLM process to the enterprise serving stack of 2026: NVIDIA Triton (model repository, backends, dynamic batching, ensembles/BLS), TensorRT-LLM (engine builds, in-flight batching, FP8/INT4 quantization, multi-GPU), graph compilation (TensorRT, ONNX Runtime, torch.compile), NVIDIA NIM microservices, multi-model orchestration on shared GPUs, full observability (Prometheus, DCGM, tracing, SLO alerts), and production deployment (packaging, canary, blue/green, Kubernetes, autoscaling). Hands-on throughout with Python clients, config.pbtxt, bash for tritonserver/docker, and Kubernetes YAML.
Built by Lakshya Kumar
Paste this into any AI chat. Fill in the bracketed parts with your context — you'll get back a straight answer on whether this belongs on your plate.
We grant free access case-by-case — students, career-switchers, builders on a tight budget. Sign in to send us a note.
Complete all modules, then submit the required number of capstone projects. Each must earn a passing rating from an admin reviewer.
Build and ship a two-model pipeline — embed-then-rerank or guard-then-LLM — as a Triton ensemble/BLS deployment (or a NIM plus custom logic) with in-server tokenization, dynamic batching tuned to an SLA, and full Prometheus + DCGM metrics. Submit the running endpoint, the config/repository, and a dashboard screenshot proving the SLA holds under load.
I'm taking an "Enterprise Serving Stacks" course covering NVIDIA Triton, TensorRT-LLM, NIM, graph compilation (TensorRT/ONNX Runtime/torch.compile), ensembles + Business Logic Scripting, multi-model GPU sharing, observability (Prometheus/DCGM/tracing), and production deployment (canary/blue-green/Kubernetes/autoscaling). It's hands-on with Python clients, config.pbtxt, bash for tritonserver/docker, and Kubernetes YAML.

Here's my context:
1. My models to serve: [list models, frameworks, sizes]
2. My GPUs: [type, count, memory, interconnect]
3. My SLAs: [TTFT/TPOT/p99 latency targets per model]
4. My current serving setup: [nothing yet / a single vLLM process / something else]
5. My deployment target: [single host / Kubernetes / managed]

Given that, answer:
- Which module should I prioritize first, and why, for my situation?
- For my model fleet and GPUs, should I lean toward self-built Triton, TensorRT-LLM engines, or NIM, and why?
- Name 3 concrete wins this course would unlock for my deployment.
- Name 1 thing the course won't help me with so I set expectations correctly.
- If I only had one weekend, which single technique (batching tuning, quantization, GPU sharing, or observability) would give me the biggest lift, and how would I measure that it worked?
Build FP16 and quantized (FP8 or INT4 AWQ) TensorRT-LLM engines for an open model, serve them through Triton, and benchmark throughput, TTFT, and TPOT against each other and a hosted baseline, with a small quality eval on the quantized variant. Submit the build commands, the benchmark table, and the eval results.
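As a rough illustration of the metrics this capstone asks for, the sketch below computes TTFT, TPOT, and aggregate throughput from per-token arrival timestamps. The `RequestTiming` type and the helper names are hypothetical and not part of any course material or NVIDIA tooling. It assumes you have already recorded a send time and one timestamp per streamed token with your own client.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    """Hypothetical record of one streamed request (all times in seconds)."""
    sent_at: float            # when the client sent the request
    token_times: List[float]  # arrival time of each output token, in order


def ttft(t: RequestTiming) -> float:
    """Time to first token: first token arrival minus send time."""
    return t.token_times[0] - t.sent_at


def tpot(t: RequestTiming) -> float:
    """Time per output token: mean gap between tokens after the first."""
    n = len(t.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (t.token_times[-1] - t.token_times[0]) / (n - 1)


def throughput(timings: List[RequestTiming]) -> float:
    """Output tokens per second across the whole run (wall-clock)."""
    total_tokens = sum(len(t.token_times) for t in timings)
    start = min(t.sent_at for t in timings)
    end = max(t.token_times[-1] for t in timings)
    return total_tokens / (end - start)
```

Running the same functions over the FP16 engine, the quantized engine, and the hosted baseline gives comparable rows for the benchmark table, provided the prompts and output lengths are held constant.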
Deploy at least five models sharing GPUs with instance groups, priorities/rate limits (or MIG isolation), and a router (by tenant/difficulty with failover). Load-test to prove the latency-critical model holds its SLA while a bulk model saturates the same hardware. Submit the fleet plan and the load-test results.
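One way to picture the router with failover is the minimal sketch below. It is an assumption-laden illustration, not the course's design: the `route` function, the `routes` table, and the `send` callable are all hypothetical, and `send` stands in for whatever client call you use to reach a backend.

```python
from typing import Callable, Dict, List, Tuple


class AllBackendsFailed(RuntimeError):
    """Raised when every candidate backend for a tenant has failed."""


def route(
    tenant: str,
    routes: Dict[str, List[str]],
    send: Callable[[str], str],
    default: str = "default",
) -> Tuple[str, str]:
    """Try the tenant's backends in priority order and fail over on errors.

    `routes` maps a tenant name to an ordered list of backend names
    (hypothetical); unknown tenants fall back to the `default` entry.
    Returns the backend that answered and its response.
    """
    candidates = routes.get(tenant, routes[default])
    errors = []
    for backend in candidates:
        try:
            return backend, send(backend)
        except ConnectionError as exc:
            # Record the failure and move on to the next candidate.
            errors.append(f"{backend}: {exc}")
    raise AllBackendsFailed("; ".join(errors))
```

A difficulty-based router has the same shape: replace the tenant lookup with a classifier that picks the ordered candidate list.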
Build an in-server Business Logic Scripting pipeline with real control flow — e.g. a cheap guard/router that conditionally invokes an expensive model, or a refinement loop gated by a quality-check model. Prove the expensive model is skipped when it should be and the whole thing is a single client call. Submit the BLS code, the model repository, and a trace.
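The control flow described here can be sketched in plain Python before it is moved into a Triton Python-backend model. The version below is a minimal, hypothetical illustration: `guard` and `expensive` are injected callables standing in for the two model calls, and the threshold and refusal text are made up. Inside Triton, each callable would instead be an in-server inference request issued from the BLS model.

```python
from typing import Callable, Dict, Union


def guarded_generate(
    prompt: str,
    guard: Callable[[str], float],
    expensive: Callable[[str], str],
    threshold: float = 0.5,
) -> Dict[str, Union[str, bool]]:
    """Call the cheap guard first; invoke the expensive model only if allowed.

    `guard` returns a risk score in [0, 1] (an assumption for this sketch).
    The returned dict records whether the expensive model actually ran,
    which is the fact the capstone asks you to prove.
    """
    risk = guard(prompt)
    if risk >= threshold:
        # Short-circuit: the expensive model is never invoked.
        return {"text": "Request declined by guard.", "expensive_called": False}
    return {"text": expensive(prompt), "expensive_called": True}
```

Because the branch lives in one function, the client still makes a single call, and a trace should show the expensive model's span only on the allowed path.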
Instrument a serving deployment end to end (Triton + DCGM metrics, tracing, Grafana dashboard, SLO alerts) and ship a model-version change through a canary (or blue/green) rollout with a tested rollback. Induce a regression and show your observability caught it and the rollout aborted. Submit the dashboard, alert rules, and a deploy/rollback log.
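The abort decision in a canary rollout can be reduced to a small gate over latency samples. The sketch below is one hedged way to express it, assuming you have pulled per-request latencies for the baseline and canary versions from your metrics store. The function names, the nearest-rank p99, and the 10% regression budget are illustrative choices, not values taken from the course.

```python
from typing import List


def p99(samples_ms: List[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    # ceil(0.99 * n) computed with integers to avoid float rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def canary_decision(
    baseline_ms: List[float],
    canary_ms: List[float],
    slo_ms: float,
    max_regression: float = 1.10,  # hypothetical 10% budget over baseline
) -> str:
    """Return "abort" if the canary breaks the SLO or regresses too far."""
    canary_p99 = p99(canary_ms)
    if canary_p99 > slo_ms:
        return "abort"
    if canary_p99 > p99(baseline_ms) * max_regression:
        return "abort"
    return "promote"
```

In practice the same check is usually written as an alert rule or a rollout analysis step; the Python form is just a compact way to test the thresholds before wiring them into the pipeline.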
Module 4's home base: engine builds, in-flight batching, quantization, and multi-GPU.