Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Not all transformers are the same. BERT uses bidirectional attention (each token sees the full context in both directions), which makes it excellent for understanding tasks such as classification, NER, and extractive question answering, but it is not designed to generate text left to right. GPT uses causal (left-to-right) attention, which makes it a natural text generator, but no position can look ahead at later tokens. T5 uses both: an encoder reads the input bidirectionally, and a decoder generates the output causally while attending to the encoder's output. Choosing the architecture that fits the task is the first design decision in any NLP project. Getting it wrong can mean, for example, fine-tuning a generative model for a classification task, which is usually slower and worse than a BERT-family model.
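To make "bidirectional" versus "causal" concrete, here is a minimal sketch of the two attention patterns as 0/1 grids. The helper attention_mask is hypothetical (it is not part of any library), and real models also fold in padding masks, which this sketch ignores.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len grid: 1 if query row i may attend to key column j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


tokens = ["The", "quick", "brown", "fox"]
bidirectional = attention_mask(len(tokens), causal=False)  # encoder-style (BERT)
causal = attention_mask(len(tokens), causal=True)          # decoder-style (GPT)

for name, mask in [("bidirectional", bidirectional), ("causal", causal)]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>6} sees {row}")
```

Each row is one token acting as the query. In the causal grid the row for "quick" ends in zeros because later tokens are hidden from it; in the bidirectional grid every row is all ones.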
In the demo, BertModel (bert-base) returns a (batch, seq_len, 768) tensor of contextual embeddings: one vector per input token, useful for classification and NER. GPT2LMHeadModel returns a (batch, seq_len, 50257) tensor of logits over the vocabulary: one unnormalised distribution per position, useful for generation. The output shape difference is not cosmetic; it encodes the entire architectural purpose of each model.
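As a rough illustration of how each output shape gets consumed, the sketch below uses random tensors as stand-ins for real model outputs, so nothing is downloaded. The three-class linear head and the names classifier, hidden_states, and lm_logits are assumptions made for the example, not part of either model.

```python
import torch

torch.manual_seed(0)
batch, seq_len, hidden, vocab, num_classes = 2, 6, 768, 50257, 3

# Stand-in for an encoder output: one 768-dim vector per token.
hidden_states = torch.randn(batch, seq_len, hidden)
# Hypothetical classification head: a linear layer on the first ([CLS]) position.
classifier = torch.nn.Linear(hidden, num_classes)
class_logits = classifier(hidden_states[:, 0])
print("class logits:", tuple(class_logits.shape))  # one score per class per sequence

# Stand-in for a decoder LM output: one vocabulary-sized logit vector per position.
lm_logits = torch.randn(batch, seq_len, vocab)
# Only the last position is needed to choose the next token.
next_token_probs = torch.softmax(lm_logits[:, -1], dim=-1)
next_token_ids = next_token_probs.argmax(dim=-1)
print("next-token ids:", tuple(next_token_ids.shape))  # one token id per sequence
```

The encoder path collapses the sequence to a single vector before predicting a label, while the language-model path keeps a full vocabulary distribution and reads off the final position.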
# pip install transformers torch
from transformers import (
    BertTokenizer, BertModel,
    GPT2Tokenizer, GPT2LMHeadModel,
)
import torch

text = "The quick brown fox"

# BERT: encoder (bidirectional, masked LM pretraining)
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")
bert_inputs = bert_tok(text, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    bert_out = bert_model(**bert_inputs)
print("BERT last_hidden_state:", bert_out.last_hidden_state.shape)
# (1, 6, 768): 1 batch, 6 tokens (incl. [CLS]/[SEP]), 768-dim vectors

# GPT-2: decoder (causal, autoregressive pretraining)
gpt_tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt_model = GPT2LMHeadModel.from_pretrained("gpt2")
gpt_inputs = gpt_tok(text, return_tensors="pt")
with torch.no_grad():
    gpt_out = gpt_model(**gpt_inputs)
print("GPT-2 logits:", gpt_out.logits.shape)
# (1, 4, 50257): 1 batch, 4 tokens, 50257 vocab logits per token

# Greedy choice: the highest-scoring token at the last position
next_token_id = gpt_out.logits[0, -1].argmax().item()
print("GPT-2 predicted next token:", gpt_tok.decode([next_token_id]))

python3 main.py

Extract the [CLS] embedding from BERT: bert_out.last_hidden_state[0, 0]. This 768-dim vector is commonly used as the sentence representation. Print its norm. Fine-tuning BERT for classification adds a linear layer on top of this vector.

Generate a continuation with GPT-2: gpt_model.generate(gpt_inputs['input_ids'], max_new_tokens=20). Decode and print. Try different temperature values (0.1, 1.0, 2.0) with do_sample=True and observe how temperature controls randomness.

Fill in a masked token: bert_tok('The [MASK] brown fox', return_tensors='pt'). The tokenizer adds [CLS] and [SEP] itself, so leave them out of the string. Use BertForMaskedLM and print the top-5 predicted tokens for the mask. This is BERT's pretraining task.

Count parameters: sum(p.numel() for p in bert_model.parameters()) vs the same for gpt_model. BERT-base comes to roughly 110M and GPT-2 small to roughly 124M. What does this tell you about the size difference between BERT's bidirectional attention and GPT-2's causal attention?

Use these three in order. Each builds on the one before.
In one paragraph, explain the key architectural difference between BERT and GPT-2. Which tasks is each suited for, and why can't you easily use GPT-2 for text classification the same way you'd use BERT?
Walk me through how BERT's masked language model (MLM) pretraining works: what is masked, what does the model predict, and how does this force it to learn bidirectional context? Compare to GPT's causal language model: predict the next token given all previous tokens.
I'm building a document classification system (50 classes, 100K labeled documents). Walk me through the decision of whether to use BERT-base, RoBERTa-large, a BERT variant fine-tuned on my domain (e.g., BioBERT, LegalBERT), or GPT-4 via API. Consider: fine-tuning cost, inference latency, data size, and expected accuracy tradeoffs.
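The contrast the second prompt asks about, masked versus causal language modelling, can be sketched as the way training targets are built from a sentence. This is a simplified illustration: the helpers mlm_example and causal_examples are hypothetical, and the masking here always substitutes [MASK], whereas BERT's published recipe also sometimes swaps in a random token or leaves the chosen token unchanged.

```python
import random

MASK = "[MASK]"


def mlm_example(tokens, mask_prob=0.15, seed=0):
    """Mask random positions; targets are the original tokens at those positions."""
    rng = random.Random(seed)
    inputs, targets = list(tokens), {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            inputs[i] = MASK
            targets[i] = token
    return inputs, targets


def causal_examples(tokens):
    """Every prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


sentence = ["the", "quick", "brown", "fox", "jumps"]

# Higher mask_prob than usual so a short sentence is likely to get a mask.
masked_inputs, masked_targets = mlm_example(sentence, mask_prob=0.4)
print("MLM input:  ", masked_inputs)
print("MLM targets:", masked_targets)

for prefix, target in causal_examples(sentence):
    print("causal:", prefix, "->", target)
```

In the masked setup the model sees tokens on both sides of each gap, which is what pushes it to use bidirectional context. In the causal setup each prediction sees only the prefix.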