Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
RLHF is the technique that turned base language models into chat assistants like ChatGPT and Claude. The pipeline has three stages: SFT (which you've done), reward model training (train a model to score responses so that human-preferred ones score higher), and RL optimization (use PPO to make the LLM maximize the reward model's score without drifting too far from the SFT checkpoint). The RL stage is where almost all the complexity lives: PPO on a 7B model keeps four models in play (policy, reference, reward model, and a value model or value head), and you also have to manage the KV cache across rollouts and tune a fragile optimization loop. Knowing how RLHF works at this level means you can diagnose training instability, understand why DPO is increasingly preferred, and read the RLHF literature.
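One common way to express "maximize the reward model's score without drifting from the SFT checkpoint" is to subtract a per-token KL penalty against the reference model and add the reward model's score on the final token. The sketch below illustrates that shaping under stated assumptions: the function name `kl_shaped_rewards`, the coefficient `beta=0.1`, and the toy log-probabilities are illustrative choices, not values from this lesson or from any particular library.

```python
import torch

def kl_shaped_rewards(policy_logprobs, ref_logprobs, rm_score, beta=0.1):
    """Per-token rewards for one sampled response (hypothetical helper).

    policy_logprobs, ref_logprobs: shape (T,), log-probs of the sampled tokens
        under the current policy and the frozen reference (SFT) model
    rm_score: scalar reward-model score for the whole response
    beta: assumed KL coefficient; larger values keep the policy closer to SFT
    """
    # Sample-based estimate of the per-token KL between policy and reference
    kl_per_token = policy_logprobs - ref_logprobs
    # Every token pays the KL penalty
    rewards = -beta * kl_per_token
    # The reward-model score is credited on the last token only
    rewards[-1] = rewards[-1] + rm_score
    return rewards

policy_lp = torch.tensor([-1.0, -0.5, -2.0])  # toy values
ref_lp = torch.tensor([-1.2, -0.5, -3.0])     # toy values
rewards = kl_shaped_rewards(policy_lp, ref_lp, rm_score=1.5)
print(rewards)
```

Tokens where the policy assigns more probability than the reference get a small negative reward, so the policy can only collect the reward model's score by paying for how far it has moved.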
RLHF has two distinct learning problems stapled together. First, a reward model is trained on human preference pairs using a Bradley-Terry loss: the chosen response should score higher than the rejected one, optionally by a margin. Second, the policy is updated with PPO, which clips the probability ratio between the new and old policy to prevent destructively large updates. Understanding both loss functions explains why RLHF is fragile: the reward model must generalize to outputs it never saw during training, and PPO's clipping range must be tuned carefully or training can collapse.
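To see what "clipping prevents destructively large updates" means for gradients, the following sketch runs autograd through the clipped objective for three hand-picked samples. The values are toy numbers chosen for illustration, and the setup treats the log-ratio itself as the trainable quantity, which is a simplification of a real policy network.

```python
import torch

eps = 0.2
# Leaf tensor so we can inspect the gradient of the loss w.r.t. each log-ratio
log_ratio = torch.tensor([0.1, 2.0, 2.0], requires_grad=True)
# Sample 0: small change, good action
# Sample 1: large change, good action
# Sample 2: large change, bad action
advantages = torch.tensor([1.0, 1.0, -1.0])

ratio = log_ratio.exp()
clipped = ratio.clamp(1 - eps, 1 + eps)
loss = -torch.min(ratio * advantages, clipped * advantages).mean()
loss.backward()

grad = log_ratio.grad
print(grad)
# Sample 0: ratio is inside [1 - eps, 1 + eps], so the gradient is nonzero
# Sample 1: ratio is above 1 + eps with a positive advantage, so the clipped
#           term is selected and the gradient is exactly zero
# Sample 2: with a negative advantage the unclipped term is selected, so the
#           gradient still pushes the ratio back down
```

The clip is one-sided by design: it removes the incentive to keep moving in a direction that already looks good, but it does not block corrections when the policy has moved toward a bad action.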
import torch
import torch.nn.functional as F
# ── Part 1: Reward Model Training ──────────────────────────────────────────
# The reward model is a language model with a scalar head (regression).
# It's trained on (chosen, rejected) pairs: score(chosen) > score(rejected).
def reward_model_loss(score_chosen, score_rejected, margin=0.0):
    """Bradley-Terry preference model loss."""
    # Push score_chosen - score_rejected to be positive
    return -F.logsigmoid(score_chosen - score_rejected - margin).mean()
scores_c = torch.tensor([1.5, 2.0, 1.8]) # reward for chosen responses
scores_r = torch.tensor([0.5, 0.8, 0.3]) # reward for rejected responses
loss = reward_model_loss(scores_c, scores_r)
print(f"Reward model loss: {loss:.4f}") # should be low (chosen > rejected)
loss_bad = reward_model_loss(scores_r, scores_c) # flip chosen/rejected
print(f"Reward model loss (wrong order): {loss_bad:.4f}") # high
# ── Part 2: PPO Clipping Objective ─────────────────────────────────────────
def ppo_clip_loss(log_ratio, advantages, eps=0.2):
    """
    log_ratio: log(pi_theta(a|s) / pi_old(a|s)), how much the policy changed
    advantages: A_t = R_t - baseline (reward - expected reward)
    eps: clipping threshold that limits how far the policy can move per step
    """
    ratio = log_ratio.exp()
    clipped = ratio.clamp(1 - eps, 1 + eps)
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    return loss
log_ratios = torch.tensor([0.1, 0.3, -0.1, 0.5]) # policy changed moderately
advantages = torch.tensor([1.0, 0.5, 0.8, 2.0]) # all positive (good actions)
loss_ppo = ppo_clip_loss(log_ratios, advantages)
print(f"\nPPO clip loss: {loss_ppo:.4f}")
print(f"Unclipped ratios: {log_ratios.exp().round(decimals=3)}")
print(f"Clipped ratios (eps=0.2): {log_ratios.exp().clamp(0.8, 1.2).round(decimals=3)}")

Run it with: python3 main.py

Set log_ratios = torch.tensor([2.0, 2.0, 2.0, 2.0]) (policy changed a lot). See how clipping limits the update even though advantages are positive. This is why PPO is often described as a trust-region-style method: it prevents too-large policy updates.

Set advantages = torch.tensor([-1.0, -0.5, -0.8, -2.0]) (all negative: the model took bad actions). Does the PPO loss go up or down? The loss should increase, pushing the policy away from these actions.

Compare PPOTrainer and DPOTrainer config complexity: count the number of required hyperparameters for each. Which has more? This gives you a sense of why DPO is easier to use in practice.

Use these three in order. Each builds on the one before.
In one paragraph, explain the three stages of RLHF training: SFT, reward model training, and RL optimization. What is the goal of each stage, and what data does each require?
Walk me through why PPO is used for RLHF instead of simpler policy gradient algorithms. What is the clipping objective protecting against, and what happens to training stability if you remove the clip (set eps=1000)?
I'm comparing RLHF (with PPO) to DPO for a preference alignment task. I have 100K preference pairs and a 13B model. Walk me through: (1) the infrastructure difference (how many GPU hours, how many model copies), (2) expected quality comparison on MT-Bench at the same compute budget, (3) failure modes unique to each method, and (4) which I should use given a 2-week timeline and 4×A100 80GB cluster.