BERT vs GPT — encoder-only vs decoder-only tradeoffs

easy

Why this matters

Not all transformers are the same. BERT uses bidirectional attention (each token sees the full context in both directions), which makes it well suited to understanding tasks such as classification, NER, and extractive question answering, but it is not designed to generate text autoregressively. GPT uses causal (left-to-right) attention, which makes it a natural text generator, but no position can attend to the tokens that come after it. T5 combines both: an encoder reads the input bidirectionally and a decoder generates the output causally. Choosing the architecture that fits the task is the first design decision in any NLP project. Getting it wrong often means fine-tuning a generative model for a classification task, which is usually slower and worse than using a BERT-family model.
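The difference between bidirectional and causal attention comes down to which positions each token is allowed to look at. The snippet below is a minimal sketch of that idea, not the actual BERT or GPT-2 implementation: it builds the two masks for a four-token sequence and applies them to random stand-in attention scores. The helper name masked_attention_weights and the random scores are assumptions made for illustration.

```python
import torch

seq_len = 4  # e.g. the four GPT-2 tokens of "The quick brown fox"

# Bidirectional (BERT-style): every position may attend to every position.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Causal (GPT-style): position i may attend only to positions j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))


def masked_attention_weights(scores, mask):
    """Softmax over keys after setting disallowed positions to -inf."""
    return torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)


torch.manual_seed(0)
scores = torch.randn(seq_len, seq_len)  # random stand-in for Q @ K.T / sqrt(d)

bert_weights = masked_attention_weights(scores, bidirectional_mask)
gpt_weights = masked_attention_weights(scores, causal_mask)

print(causal_mask.int())  # lower-triangular pattern of allowed positions
print(gpt_weights)        # entries above the diagonal are exactly zero
```

Note that the mask itself has no learnable parameters: the same attention layer becomes bidirectional or causal depending only on which mask is applied.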

Demo

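BertModel returns last_hidden_state, a (batch, seq_len, 768) tensor of contextual embeddings: one vector per input token, useful for classification and NER. GPT2LMHeadModel returns logits of shape (batch, seq_len, 50257): one score per vocabulary entry at each position, useful for generation. The difference in output shape is not cosmetic; it reflects what each model is built to do. Part of the difference comes from the head: BertModel is the bare encoder, while GPT2LMHeadModel includes the language-modeling head that projects hidden states onto the vocabulary.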

# pip install transformers torch
from transformers import (
    BertTokenizer, BertModel,
    GPT2Tokenizer, GPT2LMHeadModel,
)
import torch

text = "The quick brown fox"

# BERT — encoder (bidirectional, masked LM pretraining)
bert_tok   = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")
bert_inputs = bert_tok(text, return_tensors="pt")
with torch.no_grad():
    bert_out = bert_model(**bert_inputs)

print("BERT last_hidden_state:", bert_out.last_hidden_state.shape)
# (1, 6, 768) — 1 batch, 6 tokens (incl. [CLS]/[SEP]), 768-dim vectors

# GPT-2 — decoder (causal, autoregressive pretraining)
gpt_tok   = GPT2Tokenizer.from_pretrained("gpt2")
gpt_model = GPT2LMHeadModel.from_pretrained("gpt2")
gpt_inputs = gpt_tok(text, return_tensors="pt")
with torch.no_grad():
    gpt_out = gpt_model(**gpt_inputs)

print("GPT-2 logits:", gpt_out.logits.shape)
# (1, 4, 50257) — 1 batch, 4 tokens, 50257 vocab logits per token

next_token_id = gpt_out.logits[0, -1].argmax().item()
print("GPT-2 predicted next token:", gpt_tok.decode([next_token_id]))

Run: python3 main.py
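
To see how the two output shapes from the demo are typically consumed, here is a minimal sketch that uses random tensors in place of real model outputs, so nothing is downloaded. The three-class setup (num_classes = 3) and the untrained nn.Linear head are hypothetical and chosen only for illustration; a real classifier would be fine-tuned on labeled data.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, hidden, vocab = 1, 768, 50257
num_classes = 3  # hypothetical label count, not from the lesson

# Random stand-ins with the same shapes the demo prints (no pretrained weights).
encoder_out = torch.randn(batch, 6, hidden)    # like bert_out.last_hidden_state
decoder_logits = torch.randn(batch, 4, vocab)  # like gpt_out.logits

# Encoder-style use: take one vector per sequence, then classify it.
cls_vector = encoder_out[:, 0]               # (batch, 768), the [CLS] position
classifier = nn.Linear(hidden, num_classes)  # untrained, illustrative head
class_logits = classifier(cls_vector)        # (batch, num_classes)

# Decoder-style use: read the scores at the last position and pick a token id.
last_position_logits = decoder_logits[:, -1]   # (batch, vocab)
next_id = last_position_logits.argmax(dim=-1)  # (batch,)

print("class logits shape:", tuple(class_logits.shape))
print("next token id shape:", tuple(next_id.shape))
```

The encoder path collapses the sequence into a fixed-size vector before predicting a label, while the decoder path keeps a full vocabulary distribution at every position so that generation can continue one token at a time.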

Try it yourself

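Extract the [CLS] embedding from BERT: bert_out.last_hidden_state[0, 0]. This 768-dim vector is commonly used as the sentence representation. Print its norm. Fine-tuning BERT for classification adds a linear layer on top of the [CLS] representation (Hugging Face's BertForSequenceClassification applies it to the pooled [CLS] output).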

Generate a full continuation with GPT-2: output_ids = gpt_model.generate(**gpt_inputs, max_new_tokens=20, pad_token_id=gpt_tok.eos_token_id). Decode it with gpt_tok.decode(output_ids[0]) and print the result. Then pass do_sample=True with different temperature values (0.1, 1.0, 2.0) and observe how temperature controls randomness.

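Run BERT on a masked sentence: bert_tok('The [MASK] brown fox', return_tensors='pt'). The tokenizer adds [CLS] and [SEP] itself, so do not write them into the string, or they will appear twice. Load BertForMaskedLM and print the top-5 predicted tokens at the mask position. Masked-token prediction is BERT's main pretraining task.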

Compare parameter counts: sum(p.numel() for p in bert_model.parameters()) vs the same for gpt_model. You should see roughly 110M for BERT-base and roughly 124M for GPT-2 small, so the two models are in the same size class. What does this tell you about how many parameters separate BERT's bidirectional attention from GPT-2's causal attention?
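
For the temperature exercise above, it can help to see the arithmetic in isolation. The sketch below assumes the usual formulation, in which logits are divided by the temperature before the softmax; the three logit values and the helper name temperature_probs are hypothetical, and this is not the internals of generate.

```python
import torch


def temperature_probs(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    return torch.softmax(logits / temperature, dim=-1)


logits = torch.tensor([2.0, 1.0, 0.0])  # hypothetical scores for three tokens

for t in (0.1, 1.0, 2.0):
    probs = temperature_probs(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs.tolist()])
```

Low temperatures concentrate almost all probability on the highest-scoring token, so sampling behaves nearly greedily; high temperatures flatten the distribution, so less likely tokens are sampled more often.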

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain the key architectural difference between BERT and GPT-2. Which tasks is each suited for, and why can't you easily use GPT-2 for text classification the same way you'd use BERT?

2. Why it works (the mechanism)

Walk me through how BERT's masked language model (MLM) pretraining works: what is masked, what does the model predict, and how does this force it to learn bidirectional context? Compare to GPT's causal language model: predict the next token given all previous tokens.

3. Advanced — application & what's next

I'm building a document classification system (50 classes, 100K labeled documents). Walk me through the decision of whether to use BERT-base, RoBERTa-large, a BERT variant fine-tuned on my domain (e.g., BioBERT, LegalBERT), or GPT-4 via API. Consider: fine-tuning cost, inference latency, data size, and expected accuracy tradeoffs.