RLHF overview — reward model, PPO loop, why it's hard

hard

Why this matters

RLHF is the technique that turned base language models into chat assistants such as ChatGPT and Claude. The pipeline has three stages: SFT (which you've done), reward model training (fit a model that scores responses so that human-preferred ones rank higher), and RL optimization (use PPO to make the LLM maximize the reward model's score, with a KL penalty that keeps it close to the SFT checkpoint). The RL stage is where almost all the complexity lives: PPO on a 7B model typically keeps four models in memory (policy, frozen reference, reward model, and a value model, which some implementations reduce to a value head on the policy), generates rollouts with a KV cache inside the training loop, and depends on a fragile optimization loop. Knowing how RLHF works at this level lets you diagnose training instability, understand why DPO is increasingly preferred, and read the RLHF literature.

Demo

RLHF has two distinct learning problems stapled together. First, a reward model is trained on human preference pairs with a Bradley-Terry loss: the loss is small when the chosen response scores higher than the rejected one, and an optional margin demands a larger gap. Second, the policy is updated with PPO, which clips the probability ratio between the new and old policy so that a single batch cannot cause a destructively large update. Understanding both loss functions explains why RLHF is fragile: the reward model must generalize to outputs it never saw during training, and PPO's clipping threshold must be tuned carefully or training can collapse.
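
Written out, the two objectives look roughly like this. The notation is assumed here for illustration: r_phi is the reward model, m the margin, rho_t the probability ratio, and epsilon the clip threshold (eps in the code below).

```latex
\mathcal{L}_{\mathrm{RM}}(\phi)
  = -\,\mathbb{E}_{(x,\,y_c,\,y_r)}
    \left[\log \sigma\big(r_\phi(x, y_c) - r_\phi(x, y_r) - m\big)\right]

\mathcal{L}_{\mathrm{CLIP}}(\theta)
  = -\,\mathbb{E}_t
    \left[\min\big(\rho_t A_t,\ \operatorname{clip}(\rho_t,\, 1-\epsilon,\, 1+\epsilon)\, A_t\big)\right],
\qquad
\rho_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)}
```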

import torch
import torch.nn.functional as F

# ── Part 1: Reward Model Training ──────────────────────────────────────────
# The reward model is a language model with a scalar head (regression).
# It's trained on (chosen, rejected) pairs: score(chosen) > score(rejected).

def reward_model_loss(score_chosen, score_rejected, margin=0.0):
    """Bradley-Terry preference model loss."""
    # Push score_chosen - score_rejected to be positive
    return -F.logsigmoid(score_chosen - score_rejected - margin).mean()

scores_c = torch.tensor([1.5, 2.0, 1.8])   # reward for chosen responses
scores_r = torch.tensor([0.5, 0.8, 0.3])   # reward for rejected responses
loss = reward_model_loss(scores_c, scores_r)
print(f"Reward model loss: {loss:.4f}")     # should be low (chosen > rejected)

loss_bad = reward_model_loss(scores_r, scores_c)  # flip chosen/rejected
print(f"Reward model loss (wrong order): {loss_bad:.4f}")  # high

# ── Part 2: PPO Clipping Objective ─────────────────────────────────────────
def ppo_clip_loss(log_ratio, advantages, eps=0.2):
    """
    log_ratio: log(pi_theta(a|s) / pi_old(a|s)) — how much the policy changed
    advantages: A_t = R_t - baseline (reward - expected reward)
    eps: clipping threshold — limit how far the policy can move per step
    """
    ratio    = log_ratio.exp()
    clipped  = ratio.clamp(1 - eps, 1 + eps)
    loss     = -torch.min(ratio * advantages, clipped * advantages).mean()
    return loss

log_ratios = torch.tensor([0.1, 0.3, -0.1, 0.5])   # policy changed moderately
advantages = torch.tensor([1.0, 0.5,  0.8, 2.0])    # all positive (good actions)

loss_ppo = ppo_clip_loss(log_ratios, advantages)
print(f"\nPPO clip loss: {loss_ppo:.4f}")
print(f"Unclipped ratios: {log_ratios.exp().round(decimals=3)}")
print(f"Clipped ratios (eps=0.2): {log_ratios.exp().clamp(0.8, 1.2).round(decimals=3)}")

Run: python3 main.py
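
The demo leaves out the piece that keeps the policy close to the SFT checkpoint. A common way to do this is to subtract a per-token KL penalty from the reward before PPO sees it. The sketch below is a minimal illustration under stated assumptions: the function name kl_shaped_rewards, the coefficient beta=0.1, and the toy log-probabilities are all hypothetical, and padding is ignored.

```python
import torch

def kl_shaped_rewards(logp_policy, logp_ref, rm_score, beta=0.1):
    """Per-token rewards: KL penalty on every token, RM score on the last one.

    logp_policy, logp_ref: (batch, seq_len) log-probs of the sampled tokens
    rm_score: (batch,) one reward-model score per response
    beta: KL coefficient (hypothetical value, tune for your setup)
    """
    # Sample-based KL estimate per token: log pi(a|s) - log pi_ref(a|s)
    kl = logp_policy - logp_ref
    rewards = -beta * kl
    # The reward model scores the whole response, so add it at the final token
    # (assumes every sequence in the batch has the same length, no padding)
    rewards[:, -1] += rm_score
    return rewards

logp_policy = torch.tensor([[-1.0, -0.5, -2.0]])  # toy values
logp_ref    = torch.tensor([[-1.2, -0.5, -1.0]])  # frozen SFT model, toy values
rm_score    = torch.tensor([1.5])

rewards = kl_shaped_rewards(logp_policy, logp_ref, rm_score)
print(f"KL-shaped rewards: {rewards}")
```

Tokens where the policy assigns more probability than the reference get a small negative reward, so drifting away from the SFT checkpoint costs something even when the reward model likes the output.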

Try it yourself

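Set log_ratios = torch.tensor([2.0, 2.0, 2.0, 2.0]) (the policy changed a lot: each ratio is exp(2.0) ≈ 7.39). With positive advantages, torch.min selects the clipped term, so each sample contributes at most 1.2 times its advantage and the gradient through the ratio is zero. The policy gets no further push in a direction it has already moved far in. This is why PPO is described as an approximation to a trust-region method: it discourages overly large policy updates rather than strictly forbidding them.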

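Set advantages = torch.tensor([-1.0, -0.5, -0.8, -2.0]) (all negative: the model took bad actions). Does the PPO loss go up or down? The loss should increase, pushing the policy away from these actions; with the original log_ratios it moves from about -1.21 to about 1.45. Notice that for ratios above 1.2 with a negative advantage, torch.min selects the unclipped term, so clipping does not cap the penalty for raising the probability of a bad action.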

Research the four models you need to hold in memory simultaneously during RLHF PPO training: policy model, reference model (frozen SFT checkpoint), reward model, and value model (sometimes implemented as a value head on the policy). For a 7B model in bf16 (2 bytes per parameter), estimate the total GPU memory required. This is why RLHF is expensive and why DPO (which only needs policy + reference) is so appealing.

Compare the TRL PPOTrainer and DPOTrainer config complexity: count the number of required hyperparameters for each. Which has more? This gives you a sense of why DPO is easier to use in practice.
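
For the memory exercise above, a rough starting point is the weights-only arithmetic. This sketch assumes the reward and value models are full copies the same size as the policy and counts 1 GB as 1e9 bytes. The helper name weights_gb is hypothetical. Gradients, optimizer state, activations, and the KV cache all come on top of these numbers.

```python
def weights_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

n_params = 7e9                     # 7B parameters
per_copy = weights_gb(n_params)    # bf16 stores 2 bytes per parameter

# Assumption: every model is a full-size copy of the 7B network
ppo_weights = 4 * per_copy         # policy, reference, reward, value
dpo_weights = 2 * per_copy         # policy, reference

print(f"One bf16 copy:      {per_copy:.0f} GB")
print(f"PPO (4 copies):     {ppo_weights:.0f} GB")
print(f"DPO (2 copies):     {dpo_weights:.0f} GB")
```

Only the policy and value model are trained, so only those two also need gradients and optimizer state. Working out that extra cost is the rest of the exercise.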

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain the three stages of RLHF training: SFT, reward model training, and RL optimization. What is the goal of each stage, and what data does each require?

2. Why it works (the mechanism)

Walk me through why PPO is used for RLHF instead of simpler policy gradient algorithms. What is the clipping objective protecting against, and what happens to training stability if you remove the clip (set eps=1000)?

3. Advanced — application & what's next

I'm comparing RLHF (with PPO) to DPO for a preference alignment task. I have 100K preference pairs and a 13B model. Walk me through: (1) the infrastructure difference (how many GPU hours, how many model copies), (2) expected quality comparison on MT-Bench at the same compute budget, (3) failure modes unique to each method, and (4) which I should use given a 2-week timeline and 4×A100 80GB cluster.