Run ML and LLM systems in production the way it's actually done: experiment tracking, model/prompt registries with lineage, CI/CD with eval gates, safe deployment patterns, automated LLM evals, observability, guardrails, and a continuous improvement loop. The discipline that turns a working demo into a system you can trust at scale.
Ten modules, ~100 challenges on the operational discipline behind reliable ML and LLM products. Python-first and 2026-current: it bridges classic MLOps (MLflow experiment tracking, DVC data versioning, Airflow/Dagster pipelines, model registries, GitHub Actions CI/CD, canary and blue/green deploys) to LLMOps, where for many apps you never train a model and the real artifacts are prompts, RAG configs, and agents. You'll build automated LLM evals (golden sets, LLM-as-judge done right, regression gates, evals that survive rate limits), production observability with Langfuse-style tracing and drift detection, a guardrail layer with red-team tests, and finally close the lifecycle loop end to end. Every module ships runnable code and a project; the through-line is treating prompts and models as versioned, tested, monitored artifacts instead of strings you edit and pray over.
Built by Lakshya Kumar
Paste this into any AI chat. Fill in the bracketed parts with your context — you'll get back a straight answer on whether this belongs on your plate.
We grant free access case-by-case — students, career-switchers, builders on a tight budget. Sign in to send us a note.
Complete all modules, then submit the required number of capstone projects. Each must earn a passing rating from an admin reviewer.
Take one real ML model or LLM feature all the way around the loop: register it (with a model card, provenance, and lineage), gate it with an automated eval suite in CI, deploy it via canary behind feature flags with automated rollback, wrap it in an observability + guardrail layer, and write the runbooks. Submit the repo, a CI run showing the eval gate, a deployment + rollback demo, an observability dashboard, and the runbooks.
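To make the "eval gate in CI" step concrete, here is a minimal sketch of a regression gate. It assumes every metric is higher-is-better and that both runs are plain name-to-score dictionaries. `eval_gate`, `GateResult`, and the `max_drop` tolerance are hypothetical names chosen for illustration, not part of the course materials or any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    failures: list[str]


def eval_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.02,
) -> GateResult:
    """Compare a candidate eval run against the registered baseline.

    Assumption: every metric is higher-is-better. The gate fails when a
    baseline metric is missing from the candidate run or drops by more
    than max_drop.
    """
    failures = []
    for name, base in sorted(baseline.items()):
        cand = candidate.get(name)
        if cand is None:
            failures.append(f"{name}: missing from candidate run")
        elif base - cand > max_drop:
            failures.append(f"{name}: {base:.3f} -> {cand:.3f}")
    return GateResult(passed=not failures, failures=failures)


result = eval_gate(
    baseline={"accuracy": 0.90, "faithfulness": 0.80},
    candidate={"accuracy": 0.91, "faithfulness": 0.70},
)
```

In a CI job, the script would typically print `result.failures` and exit with a non-zero status when `result.passed` is false, which is what makes the pipeline step fail.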
I'm taking an "MLOps & LLMOps: The Production Lifecycle" course: the operational discipline behind reliable ML and LLM systems. It covers experiment tracking (MLflow/W&B), data & feature pipelines (DVC/Airflow), model & prompt registries with lineage, CI/CD with eval gates (GitHub Actions), deployment patterns (shadow/canary/blue-green/A-B), automated LLM evals (golden sets, LLM-as-judge, regression gates), production observability (Langfuse tracing, drift detection), guardrails & safety (injection defense, PII, moderation), and closing the continuous loop. Python-first, bridging classic MLOps to 2026 LLMOps.

Here's my context:
1. What I'm building/operating: [describe the ML model or LLM feature/product]
2. Do I train/fine-tune anything, or is every "model" a hosted API call? [train / fine-tune / pure API / mix]
3. My current maturity: [level 0 notebook + manual deploy / level 1 some automation / level 2 full CI-CD]
4. Where it hurts most: [can't reproduce results / no evals / regressions ship silently / drift / cost / safety / slow iteration]

Given that, answer:
- Which module should I prioritize first and why, given my maturity level?
- Which is my single highest-leverage gap (tracking / data versioning / registry / CI eval gate / observability / guardrails / the loop)?
- Name 3 concrete changes I could make this week, and how I'd measure that each one helped.
- Name 1 thing this course won't fix so I have the right expectations.
Build a reusable eval pack another team could drop onto their LLM app: a versioned golden dataset, reference-based + reference-free checks, a calibrated LLM-as-judge with bias controls, a resilient runner (concurrency caps, retries, checkpointing, dead-letter) that survives rate limits, and a CI regression gate. Submit it as a small package or repo with docs and a sample run.
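The "resilient runner" in this brief could look roughly like the sketch below. It is one possible shape under stated assumptions: `RateLimited` is a hypothetical exception standing in for a provider's rate-limit error, `call` is whatever async function invokes your model, and the checkpoint is an in-memory dictionary (writing it to disk between runs is left out).

```python
import asyncio
import random


class RateLimited(Exception):
    """Hypothetical error standing in for a provider's rate-limit response."""


async def run_evals(cases, call, *, done=None, max_concurrency=4,
                    max_retries=3, base_delay=0.5):
    """Run call(payload) for every case id not already present in done.

    Returns (done, dead_letter): results keyed by case id, plus the ids
    that exhausted their retries.
    """
    done = {} if done is None else done  # checkpoint: finished cases are skipped
    dead_letter = []
    slots = asyncio.Semaphore(max_concurrency)  # concurrency cap

    async def run_one(case_id, payload):
        # The slot is held during backoff, so retries count against the cap.
        async with slots:
            for attempt in range(max_retries + 1):
                try:
                    done[case_id] = await call(payload)
                    return
                except RateLimited:
                    if attempt == max_retries:
                        break
                    # Exponential backoff with jitter.
                    delay = base_delay * 2 ** attempt * (1 + random.random())
                    await asyncio.sleep(delay)
            dead_letter.append(case_id)

    pending = [(cid, p) for cid, p in cases.items() if cid not in done]
    await asyncio.gather(*(run_one(cid, p) for cid, p in pending))
    return done, dead_letter
```

Passing the previous run's `done` dictionary back in is what lets an interrupted run resume without repeating finished cases. The dead-letter list keeps persistent failures visible instead of silently dropping them.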
Instrument a real LLM app end to end: nested-span tracing, responsible logging (sampling + PII redaction), error-type classification, data/concept/output drift detection, online quality signals, and a dashboard with symptom-level alerts. Submit the live dashboard, a debugged trace from a failing request, and proof that a drift or cost spike fires the right alert.
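As one illustration of the drift-detection piece, the sketch below computes a Population Stability Index (PSI) between a reference sample and a live sample of a numeric signal, for example response length or a retrieval score. PSI is a swapped-in example technique, not necessarily the method the course uses. The function name, the equal-width binning, and the `eps` floor are assumptions.

```python
import math


def psi(reference, live, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the reference range. Live values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            i = int((v - lo) / width)
            counts[min(max(i, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(reference), proportions(live)
    return sum((b - a) * math.log(b / a) for a, b in zip(p, q))


reference = [float(x) for x in range(100)]
shifted = [x + 50.0 for x in reference]
score = psi(reference, shifted)
```

A common rule of thumb treats PSI below roughly 0.1 as stable and above roughly 0.25 as a significant shift. Any alert threshold should be calibrated against your own traffic.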
Wrap an LLM feature in a complete guardrail layer (input sanitization, PII redaction, prompt-injection defense, content moderation, schema validation/repair, fallbacks + human-in-the-loop) and red-team it with attack tests for injections, PII leaks, malformed output, and policy violations. Submit the implementation and the red-team suite proving the guardrails hold.
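Two of the guardrails listed here, PII redaction and schema validation with repair and fallback, can be sketched in a few lines. This is a deliberately small illustration: the email pattern covers only one kind of PII, the "repair" is limited to extracting the first JSON object from surrounding text, and `redact_pii` and `parse_or_fallback` are hypothetical helper names.

```python
import json
import re

# Simplified email pattern, for illustration only. Real PII detection
# needs far broader coverage (phone numbers, addresses, IDs, and so on).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_pii(text):
    """Replace email addresses with a placeholder before logging or prompting."""
    return EMAIL.sub("[EMAIL]", text)


def parse_or_fallback(raw, required, fallback):
    """Validate model output as JSON with the required typed keys.

    Tries the raw text first, then one cheap repair (the outermost
    {...} span, which strips chatter and code fences). Returns
    fallback if nothing validates.
    """
    candidates = [raw]
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        candidates.append(match.group(0))
    for text in candidates:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and all(
            isinstance(data.get(key), kind) for key, kind in required.items()
        ):
            return data
    return fallback


SCHEMA = {"label": str, "score": float}
FALLBACK = {"label": "needs_human_review", "score": 0.0}
wrapped = 'Sure! ```json\n{"label": "spam", "score": 0.9}\n```'
parsed = parse_or_fallback(wrapped, SCHEMA, FALLBACK)
```

A red-team suite for this layer would feed in adversarial inputs (malformed JSON, wrong types, embedded addresses) and assert that the output is either valid or the fallback, never an unvalidated pass-through.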
Close the lifecycle loop for a real system: production feedback feeds the golden/training set, an automated trigger drives iteration, the eval gate guards quality, a gated deploy ships behind guardrails, and observability feeds the next turn — plus a unified deployment manifest, cost governance with a spend cap, and a tested runbook. Submit a demo of one full automated loop turn and the manifest/runbook artifacts.
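The "cost governance with a spend cap" item can be illustrated with a small in-process budget guard. This is a sketch only: `Budget` and `SpendCapExceeded` are hypothetical names, the per-1k-token prices are placeholders rather than real provider rates, and a production cap would normally live in shared storage rather than in a single process.

```python
class SpendCapExceeded(RuntimeError):
    """Raised when a call would push spend past the configured cap."""


class Budget:
    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, input_tokens, output_tokens, usd_per_1k_in, usd_per_1k_out):
        """Record the cost of one call, refusing it if the cap would be exceeded."""
        cost = (input_tokens / 1000 * usd_per_1k_in
                + output_tokens / 1000 * usd_per_1k_out)
        if self.spent_usd + cost > self.cap_usd:
            raise SpendCapExceeded(
                f"cap {self.cap_usd:.2f} USD would be exceeded "
                f"(spent {self.spent_usd:.2f}, next call {cost:.2f})"
            )
        self.spent_usd += cost
        return cost


budget = Budget(cap_usd=1.0)
# Placeholder prices: 0.25 USD per 1k input tokens, 0.50 per 1k output tokens.
first_cost = budget.charge(1000, 1000, usd_per_1k_in=0.25, usd_per_1k_out=0.5)
```

Checking the cap before recording the charge means a refused call leaves `spent_usd` unchanged, so the caller can route to a cheaper fallback or a human escalation path.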
The tracking + registry tool used across Modules 2, 4, and 5. Keep it open while you build.