Enterprise Serving Stacks

Serve a real fleet of models the way companies actually do it: NVIDIA Triton, TensorRT-LLM, and NIM, plus graph compilation, ensembles/BLS, multi-model GPU sharing, observability, and production deployment.

Certificate: 1 of 5 capstones

Ten modules, ~100 challenges that go beyond a single vLLM process to the enterprise serving stack of 2026: NVIDIA Triton (model repository, backends, dynamic batching, ensembles/BLS), TensorRT-LLM (engine builds, in-flight batching, FP8/INT4 quantization, multi-GPU), graph compilation (TensorRT, ONNX Runtime, torch.compile), NVIDIA NIM microservices, multi-model orchestration on shared GPUs, full observability (Prometheus, DCGM, tracing, SLO alerts), and production deployment (packaging, canary, blue/green, Kubernetes, autoscaling). Hands-on throughout with Python clients, config.pbtxt, bash for tritonserver/docker, and Kubernetes YAML.

Built by Lakshya Kumar

triton

tensorrt-llm

nim

serving

gpu

nvidia

mlops

Before you start (4 items)

Completed or comfortable with LLM inference basics (KV cache, batching, TTFT/TPOT).
Comfortable with Docker and the command line.
Access to an NVIDIA GPU (cloud or local) for the hands-on labs.
Basic Kubernetes familiarity helps for the deployment modules.

Is this course for you? Ask an AI

Paste this into any AI chat. Fill in the bracketed parts with your context — you'll get back a straight answer on whether this belongs on your plate.

Get access to Enterprise Serving Stacks

$3.99

30-day access

Capstone projects

Submit any 1 of 5 to earn the certificate

Complete all modules, then submit the required number of capstone projects. Each must earn a passing rating from an admin reviewer.

capstone: Package a two-model pipeline as a serving stack

Build and ship a two-model pipeline — embed-then-rerank or guard-then-LLM — as a Triton ensemble/BLS deployment (or a NIM plus custom logic) with in-server tokenization, dynamic batching tuned to an SLA, and full Prometheus + DCGM metrics. Submit the running endpoint, the config/repository, and a dashboard screenshot proving the SLA holds under load.

Minimum rating for approval: 3/5
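One way to approach "dynamic batching tuned to an SLA" is to treat the batcher's queue delay as whatever latency budget is left once compute time and client/network overhead are subtracted from the target. The sketch below is a rough heuristic, not the course's method: the helper names and the example numbers (50 ms SLA, 32 ms compute, 8 ms overhead) are hypothetical, and the compute latency should be measured at the batch size you expect to run. The `max_batch_size`, `dynamic_batching`, `preferred_batch_size`, and `max_queue_delay_microseconds` fields are standard Triton `config.pbtxt` settings.

```python
from typing import Sequence


def queue_delay_budget_us(sla_p99_ms: float, compute_p99_ms: float,
                          overhead_ms: float = 0.0) -> int:
    """Microseconds a request may wait in the batcher queue within the SLA.

    Assumes latency is roughly queue delay + compute + overhead.
    """
    budget_ms = sla_p99_ms - compute_p99_ms - overhead_ms
    if budget_ms <= 0:
        raise ValueError("compute latency already exceeds the SLA")
    return int(round(budget_ms * 1000))


def dynamic_batching_stanza(max_batch_size: int,
                            preferred: Sequence[int],
                            max_delay_us: int) -> str:
    """Render the batching part of a config.pbtxt as text."""
    sizes = ", ".join(str(s) for s in preferred)
    return (
        f"max_batch_size: {max_batch_size}\n"
        "dynamic_batching {\n"
        f"  preferred_batch_size: [ {sizes} ]\n"
        f"  max_queue_delay_microseconds: {max_delay_us}\n"
        "}\n"
    )


# Hypothetical measurements for illustration only.
budget = queue_delay_budget_us(sla_p99_ms=50.0, compute_p99_ms=32.0,
                               overhead_ms=8.0)
stanza = dynamic_batching_stanza(16, [4, 8, 16], budget)
print(stanza)
```

The computed value is only a starting point; the capstone still asks for a load test that shows the SLA actually holds.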

Further reading & study material (6 sources)

Prompt (for "Is this course for you?")

I'm taking an "Enterprise Serving Stacks" course covering NVIDIA Triton, TensorRT-LLM, NIM, graph compilation (TensorRT/ONNX Runtime/torch.compile), ensembles + Business Logic Scripting, multi-model GPU sharing, observability (Prometheus/DCGM/tracing), and production deployment (canary/blue-green/Kubernetes/autoscaling). It's hands-on with Python clients, config.pbtxt, bash for tritonserver/docker, and Kubernetes YAML.

Here's my context:
1. My models to serve: [list models, frameworks, sizes]
2. My GPUs: [type, count, memory, interconnect]
3. My SLAs: [TTFT/TPOT/p99 latency targets per model]
4. My current serving setup: [nothing yet / a single vLLM process / something else]
5. My deployment target: [single host / Kubernetes / managed]

Given that, answer:
- Which module should I prioritize first, and why, for my situation?
- For my model fleet and GPUs, should I lean toward self-built Triton, TensorRT-LLM engines, or NIM, and why?
- Name 3 concrete wins this course would unlock for my deployment.
- Name 1 thing the course won't help me with so I set expectations correctly.
- If I only had one weekend, which single technique (batching tuning, quantization, GPU sharing, or observability) would give me the biggest lift, and how would I measure that it worked?

trtllm-engine-benchmark: TensorRT-LLM engine build + benchmark

Build FP16 and quantized (FP8 or INT4 AWQ) TensorRT-LLM engines for an open model, serve them through Triton, and benchmark throughput, TTFT, and TPOT against each other and a hosted baseline, with a small quality eval on the quantized variant. Submit the build commands, the benchmark table, and the eval results.

Minimum rating for approval: 3/5
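For the benchmark table, TTFT and TPOT can be derived from per-token arrival timestamps recorded by a streaming client. The following is a minimal sketch using the common definitions (TTFT is the time to the first output token; TPOT is the mean gap between subsequent tokens). The `StreamTiming` type and the sample timestamps are hypothetical, and how you collect the timestamps depends on your client.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class StreamTiming:
    request_sent: float       # seconds, e.g. time.perf_counter() before sending
    token_times: List[float]  # arrival time of each streamed output token


def ttft(t: StreamTiming) -> float:
    """Time to first token, in seconds."""
    return t.token_times[0] - t.request_sent


def tpot(t: StreamTiming) -> float:
    """Mean time per output token after the first, in seconds."""
    n = len(t.token_times)
    if n < 2:
        raise ValueError("need at least two tokens to compute TPOT")
    return (t.token_times[-1] - t.token_times[0]) / (n - 1)


def throughput_tokens_per_s(timings: Sequence[StreamTiming],
                            wall_clock_s: float) -> float:
    """Aggregate output tokens per second over a whole benchmark run."""
    return sum(len(t.token_times) for t in timings) / wall_clock_s


# Hypothetical timings for one request: first token at 0.25 s,
# then one token every 0.25 s.
sample = StreamTiming(request_sent=0.0,
                      token_times=[0.25, 0.5, 0.75, 1.0, 1.25])
print(ttft(sample), tpot(sample), throughput_tokens_per_s([sample], 1.25))
```

Running the same measurement against the FP16 engine, the quantized engine, and the hosted baseline keeps the three columns of the table comparable.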

bls-pipeline: A conditional BLS pipeline

Build an in-server Business Logic Scripting pipeline with real control flow — e.g. a cheap guard/router that conditionally invokes an expensive model, or a refinement loop gated by a quality-check model. Prove the expensive model is skipped when it should be and the whole thing is a single client call. Submit the BLS code, the model repository, and a trace.

Minimum rating for approval: 3/5
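The guard-then-LLM control flow can be prototyped in plain Python before moving it into a Triton Python-backend model. In the sketch below the two models are stand-in callables and every name (`make_pipeline`, `toy_guard`, `toy_llm`, the 0.5 threshold) is hypothetical. Inside Triton, each call would typically become a BLS inference request issued from the model's `execute` method, which is not shown here. The call counters are one simple way to show that the expensive model is skipped when the guard fires.

```python
from typing import Callable, Dict, Tuple


def make_pipeline(guard: Callable[[str], float],
                  expensive: Callable[[str], str],
                  threshold: float = 0.5
                  ) -> Tuple[Callable[[str], Dict[str, object]], Dict[str, int]]:
    """Return a single-call handler plus counters of per-model invocations."""
    stats = {"guard": 0, "expensive": 0}

    def handle(prompt: str) -> Dict[str, object]:
        stats["guard"] += 1
        risk = guard(prompt)
        if risk >= threshold:
            # Short-circuit: the expensive model is never invoked.
            return {"answer": None, "blocked": True, "risk": risk}
        stats["expensive"] += 1
        return {"answer": expensive(prompt), "blocked": False, "risk": risk}

    return handle, stats


# Stand-ins for the cheap guard model and the expensive LLM.
def toy_guard(prompt: str) -> float:
    return 1.0 if "forbidden" in prompt.lower() else 0.0


def toy_llm(prompt: str) -> str:
    return f"echo: {prompt}"


handle, stats = make_pipeline(toy_guard, toy_llm)
blocked = handle("a forbidden request")
allowed = handle("summarize this log")
print(stats)
```

In a real deployment the equivalent evidence would come from the submitted trace rather than from in-process counters.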

observability-and-canary: Observability + canary deployment

Instrument a serving deployment end to end (Triton + DCGM metrics, tracing, Grafana dashboard, SLO alerts) and ship a model-version change through a canary (or blue/green) rollout with a tested rollback. Induce a regression and show your observability caught it and the rollout aborted. Submit the dashboard, alert rules, and a deploy/rollback log.

Minimum rating for approval: 3/5
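The "rollout aborted" step comes down to a gate that compares canary metrics with the baseline. Below is a minimal sketch of such a gate, assuming latency samples and error counts have already been pulled from your metrics store. The function names, the thresholds (a 1.2x p99 ratio and a 1% error rate), and the sample data are hypothetical; a real rollout controller would evaluate something similar on a schedule.

```python
from typing import Sequence


def p99(samples: Sequence[float]) -> float:
    """Nearest-rank 99th percentile (integer arithmetic for the rank)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


def canary_decision(baseline_ms: Sequence[float],
                    canary_ms: Sequence[float],
                    canary_errors: int,
                    canary_requests: int,
                    max_p99_ratio: float = 1.2,
                    max_error_rate: float = 0.01) -> str:
    """Return 'promote', 'abort', or 'hold' for the current canary window."""
    if canary_requests == 0 or not canary_ms:
        return "hold"  # not enough traffic to judge
    if canary_errors / canary_requests > max_error_rate:
        return "abort"
    if p99(canary_ms) > max_p99_ratio * p99(baseline_ms):
        return "abort"
    return "promote"


# Hypothetical latency windows in milliseconds.
baseline = [20.0] * 100
healthy = [21.0] * 100
regressed = [20.0] * 90 + [60.0] * 10

print(canary_decision(baseline, healthy, 0, 100))
print(canary_decision(baseline, regressed, 0, 100))
print(canary_decision(baseline, healthy, 5, 100))
```

Inducing the regression for the capstone then amounts to making the canary's metrics trip one of these conditions and showing the rollback in the deploy log.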

Module 4's home base: engine builds, in-flight batching, quantization, and multi-GPU.

Enterprise Serving Stacks modules

1. The Enterprise Serving Problem
2. NVIDIA Triton Inference Server
3. Model Config & Dynamic Batching in Triton
4. TensorRT-LLM
5. Graph Compilation & Optimization
6. Business Logic Scripting & Ensembles
7. Multi-Model Orchestration
8. NVIDIA NIM
9. Observability for Serving Stacks
10. Deploying an Enterprise Stack