You have three main options when adapting a pretrained model: full fine-tune (update all weights), LoRA (update low-rank adapters, freeze everything else), and QLoRA (LoRA on a 4-bit quantized base model). Full fine-tune gives the highest quality ceiling but requires the most GPU memory and time — often impossible on a single GPU for 7B+ models. LoRA can fine-tune a 7B model on 24 GB and produces quality within a few points of full fine-tune for most tasks. QLoRA can fit a 65B model on a single 48 GB GPU at the cost of some accuracy and slower training. Knowing when to use each is the first decision in any fine-tuning project.
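To make "update low-rank adapters, freeze everything else" concrete, here is a minimal sketch of a single LoRA-adapted linear layer in plain Python. The toy sizes (`d_in`, `d_out`, `r`, `alpha`) and the helper names `matvec` and `lora_forward` are assumptions chosen for illustration, not part of any library. It uses the usual LoRA formulation, y = W x + (alpha / r) · B (A x), with B initialised to zero so the adapter starts as a no-op.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x). W stays frozen; only A and B would train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 64, 64, 4, 8  # toy sizes, assumed for illustration

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, zero init

x = [random.gauss(0, 1) for _ in range(d_in)]
y_base = matvec(W, x)
y_lora = lora_forward(W, A, B, x, alpha, r)

full_params = d_in * d_out        # what full fine-tuning would update
lora_params = r * (d_in + d_out)  # what LoRA updates instead
print(f"full: {full_params} trainable, LoRA: {lora_params} trainable")
print("adapter is a no-op at init:", y_base == y_lora)
```

The saving grows with layer size: the adapter costs r × (d_in + d_out) parameters against d_in × d_out for the full matrix, so a small rank on a 4096-wide layer trains a tiny fraction of the weights.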
The difference between full fine-tuning, LoRA, and QLoRA is measured in gigabytes, not percentages. Full fine-tuning an 8B model with Adam occupies roughly 96 GB (16 GB of bf16 weights, 16 GB of bf16 gradients, and 64 GB for the two fp32 optimizer moments) before a single activation is stored. LoRA cuts the trainable parameter count by about 99%, collapsing the optimizer state to under 2 GB while the frozen weights stay in bf16. QLoRA then shrinks the base-weight footprint to a quarter, from 2 bytes to 0.5 bytes per parameter, by storing the frozen weights in 4-bit NormalFloat, making a 65B-parameter fine-tune possible on a single 48 GB GPU.
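The "about 99%" figure can be sanity-checked by counting adapter parameters directly. The sketch below assumes the publicly documented Llama-3 8B shapes (hidden size 4096, MLP size 14336, 32 layers, key/value projection width 1024) and the projection names used in the Hugging Face implementation. Treat those values, and the helper `lora_param_count`, as assumptions for illustration. Each adapted projection adds rank × (d_in + d_out) parameters, and Adam keeps 8 bytes of fp32 state per trainable parameter.

```python
# Assumed Llama-3 8B shapes; verify against the model config before relying on them.
HIDDEN, MLP, KV_DIM, LAYERS = 4096, 14336, 1024, 32
TOTAL_PARAMS = 8e9

# (d_in, d_out) of each linear projection inside one transformer block
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_param_count(rank, targets):
    """Trainable LoRA parameters when `targets` are adapted in every layer."""
    per_layer = sum(rank * sum(PROJECTIONS[name]) for name in targets)
    return per_layer * LAYERS

TARGET_SETS = {
    "q_proj + v_proj": ["q_proj", "v_proj"],
    "all linear layers": list(PROJECTIONS),
}

for rank in (8, 16, 64):
    for label, targets in TARGET_SETS.items():
        n = lora_param_count(rank, targets)
        adam_gb = n * 8 / 1e9  # two fp32 moments per trainable parameter
        share = 100 * n / TOTAL_PARAMS
        print(f"r={rank:<3} {label:<18} {n:>12,} trainable "
              f"({share:.2f}%)  Adam state ~{adam_gb:.2f} GB")
```

Under these assumptions, rank 16 on all linear layers gives 41,943,040 trainable parameters, roughly 0.5% of the model, so the 1% used in the estimate that follows is a conservative round number.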
# Memory estimation for Llama-3 8B fine-tuning approaches
# Model: 8B params, bfloat16 weights
PARAMS = 8e9  # 8 billion


def memory_estimate(approach):
    """Return a rough training-memory estimate in GB for one approach."""
    if approach == "full_ft":
        weights = PARAMS * 2      # bf16 weights: 2 bytes/param
        grads = PARAMS * 2        # gradients: same size as weights
        optimizer = PARAMS * 8    # Adam: 2 moment estimates × 4 bytes each (fp32)
        activations = 4e9         # rough estimate for batch_size=1
    elif approach == "lora_bf16":
        weights = PARAMS * 2      # frozen bf16 weights
        lora_p = PARAMS * 0.01    # ~1% trainable
        grads = lora_p * 2
        optimizer = lora_p * 8
        activations = 4e9
    elif approach == "qlora_4bit":
        weights = PARAMS * 0.5    # 4-bit: 0.5 bytes/param
        lora_p = PARAMS * 0.01
        grads = lora_p * 4        # stored in fp32 for stability
        optimizer = lora_p * 8
        activations = 2e9         # smaller with gradient checkpointing
    else:
        raise ValueError(f"unknown approach: {approach}")
    total = weights + grads + optimizer + activations
    return total / 1e9  # GB


for approach in ["full_ft", "lora_bf16", "qlora_4bit"]:
    gb = memory_estimate(approach)
    print(f"{approach:<15} ~{gb:5.1f} GB "
          f"{'fits on 8×A100 80GB' if gb < 640 else 'multi-node required'}"
          f"{' fits on 1×A100 80GB' if gb < 80 else ''}"
          f"{' fits on 1×4090 24GB' if gb < 24 else ''}")

Run it with: python3 main.py

Change PARAMS = 8e9 to PARAMS = 70e9 (Llama-3 70B). Which approaches still fit on a single A100 80GB? Which require multiple GPUs? This is why QLoRA was revolutionary when it appeared: it made 65B fine-tuning accessible to individuals.

Use these three prompts in order. Each builds on the one before.
In one paragraph, explain why QLoRA can fine-tune a 65B model on a single GPU when full fine-tuning cannot. What is 4-bit quantization doing, and why are the LoRA adapter weights kept in higher precision?
Walk me through the memory breakdown of full fine-tuning with Adam: model weights, gradients, first moment (m), second moment (v), and activations. At 8B parameters with bf16 weights and fp32 optimizer moments, compute each component's size in GB. This is why you need ~64 GB just for the optimizer state.
I want to fine-tune Llama-3 8B for a customer support task. I have access to 1×A100 80GB, 10K instruction pairs, and a 1-week timeline. Walk me through: (1) whether I need LoRA or can do full fine-tune, (2) which target_modules to apply LoRA to and why, (3) which PEFT library settings to start with (rank, alpha, dropout), and (4) how to decide if the fine-tuned model is better than just prompting the base model.