Full fine-tune vs LoRA vs QLoRA — memory, time, quality tradeoffs

easy

Why this matters

You have three main options when adapting a pretrained model: full fine-tuning (update all weights), LoRA (train small low-rank adapters and freeze everything else), and QLoRA (LoRA on top of a 4-bit quantized base model). Full fine-tuning has the highest quality ceiling but needs the most GPU memory and time, and is often impossible on a single GPU for 7B+ models. LoRA can fine-tune a 7B model in 24 GB and typically lands within a few points of full fine-tuning on most tasks. QLoRA can fit a 65B model on a single 48 GB GPU at the cost of some accuracy and slower training. Knowing when to use each is the first decision in any fine-tuning project.

Demo

The difference between full fine-tuning, LoRA, and QLoRA is measured in gigabytes, not percentages. Full fine-tuning an 8B model with Adam occupies roughly 96 GB (16 GB of bf16 weights, 16 GB of gradients, and 64 GB for the two fp32 optimizer moments) before a single activation is stored. LoRA cuts the trainable parameter count by about 99%, collapsing the optimizer state to under 1 GB while the frozen weights stay in bf16. QLoRA shrinks the base-weight footprint to a quarter of the bf16 size (about 4 GB for 8B parameters) by storing the weights in 4-bit NormalFloat, which is what makes a 65B-parameter fine-tune possible on a single 48 GB GPU.

# Memory estimation for Llama-3 8B fine-tuning approaches
# Model: 8B params, bfloat16 weights

PARAMS = 8e9  # 8 billion

def memory_estimate(approach):
    if approach == "full_ft":
        weights   = PARAMS * 2          # bf16 weights: 2 bytes/param
        grads     = PARAMS * 2          # gradient: same size as weights
        optimizer = PARAMS * 8          # Adam: 2 moment states (m and v) × 4 bytes each (fp32)
        activations = 4e9               # rough estimate for batch_size=1
        total     = weights + grads + optimizer + activations
    elif approach == "lora_bf16":
        weights   = PARAMS * 2          # frozen bf16 weights
        lora_p    = PARAMS * 0.01       # ~1% trainable
        grads     = lora_p * 2
        optimizer = lora_p * 8
        activations = 4e9
        total     = weights + grads + optimizer + activations
    elif approach == "qlora_4bit":
        weights   = PARAMS * 0.5        # 4-bit: 0.5 bytes/param
        lora_p    = PARAMS * 0.01
        grads     = lora_p * 4          # stored in fp32 for stability
        optimizer = lora_p * 8
        activations = 2e9               # smaller with gradient checkpointing
        total     = weights + grads + optimizer + activations
    else:
        raise ValueError(f"unknown approach: {approach!r}")
    return total / 1e9  # GB

for approach in ["full_ft", "lora_bf16", "qlora_4bit"]:
    gb = memory_estimate(approach)
    print(f"{approach:<15}  ~{gb:5.1f} GB  "
          f"{'fits on 8×A100 80GB' if gb < 640 else 'multi-node required'}"
          f"{'  fits on 1×A100 80GB' if gb < 80 else ''}"
          f"{'  fits on 1×4090 24GB' if gb < 24 else ''}")

Run: python3 main.py
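
The ~1% trainable fraction in the demo is a round number. A rough way to sanity-check it is to count adapter parameters directly: LoRA adds two matrices of shape (r × d_in) and (d_out × r) to each adapted linear layer, which is r × (d_in + d_out) extra parameters. The sketch below assumes Llama-3 8B's published shapes (hidden size 4096, MLP size 14336, 32 layers, 1024-wide key/value projections) and an illustrative rank of 16. The projection names and the helper `lora_param_count` are hypothetical choices for this example, not part of the lesson's demo.

```python
# Sketch: count LoRA adapter parameters for a Llama-style decoder.
# All shapes are ASSUMED Llama-3 8B values; rank and target list are illustrative.

HIDDEN = 4096    # model (residual stream) width
KV_DIM = 1024    # key/value projection width under grouped-query attention
MLP = 14336      # feed-forward intermediate width
LAYERS = 32      # number of decoder layers

# (d_in, d_out) for each linear projection inside one decoder layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_param_count(rank, targets=None):
    """Total adapter parameters: rank * (d_in + d_out) per adapted matrix."""
    targets = list(PROJECTIONS) if targets is None else targets
    per_layer = sum(rank * sum(PROJECTIONS[name]) for name in targets)
    return per_layer * LAYERS


for label, targets in [("all linear layers", None),
                       ("q_proj + v_proj only", ["q_proj", "v_proj"])]:
    n = lora_param_count(16, targets)
    print(f"rank 16, {label:<22} {n / 1e6:6.1f}M params  "
          f"({100 * n / 8e9:.2f}% of 8B)")
```

Under these assumptions, rank 16 on every projection comes to about 42M adapter parameters, roughly 0.5% of the base model, so the demo's 1% is a conservative round-up rather than a precise figure.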

Try it yourself

Change PARAMS = 8e9 to PARAMS = 70e9 (Llama-3 70B). Which approaches still fit on a single A100 80GB? Which require multiple GPUs? This is why QLoRA was revolutionary when it appeared — it made 65B fine-tuning accessible to individuals.

For QLoRA, what changes if you double the LoRA rank, so that the trainable fraction goes from about 1% to about 2% of parameters? Recompute the total memory. Is the increase significant relative to the base model size?

Research DeepSpeed ZeRO-3: it shards weights, gradients, and optimizer states across GPUs. How does this change the 'full fine-tune' memory per GPU on 8×A100s? (You can estimate it per GPU as weights/8 + grads/8 + optimizer/8 + activations.)

Look up the actual peak memory reported by the QLoRA paper for LLaMA 65B on a single 48 GB GPU. Does it match what the estimate above gives for 65B parameters? What overhead did the paper's measurement include that the estimate misses?
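
The exercises above all amount to re-running the demo with different inputs. One way to avoid editing constants by hand is a parametrised variant like the sketch below. It reuses the demo's byte counts unchanged. The function name `estimate_gb`, its arguments, and the even 1/n split used to model ZeRO-3 sharding are assumptions for illustration only: real runs add communication buffers, fragmentation, and framework overhead that this ignores.

```python
# Sketch: the demo's estimate, parametrised for the exercises.
# Byte counts mirror the demo; the ZeRO-3 model (even split) is an assumption.

def estimate_gb(params, approach, trainable_frac=0.01, n_gpus=1):
    """Rough per-GPU memory in GB.

    params          total base-model parameters (e.g. 8e9)
    approach        "full_ft", "lora_bf16", or "qlora_4bit"
    trainable_frac  LoRA trainable share of params (ignored for full_ft)
    n_gpus          >1 models ZeRO-3 style sharding of weights, gradients,
                    and optimizer state; activations are not sharded
    """
    if approach == "full_ft":
        weights, trainable = params * 2, params             # bf16 weights
        grad_bytes, activations = 2, 4e9
    elif approach == "lora_bf16":
        weights, trainable = params * 2, params * trainable_frac
        grad_bytes, activations = 2, 4e9
    elif approach == "qlora_4bit":
        weights, trainable = params * 0.5, params * trainable_frac  # 4-bit
        grad_bytes, activations = 4, 2e9                    # fp32 grads
    else:
        raise ValueError(f"unknown approach: {approach!r}")

    grads = trainable * grad_bytes
    optimizer = trainable * 8        # Adam: m and v in fp32
    sharded = (weights + grads + optimizer) / n_gpus
    return (sharded + activations) / 1e9


if __name__ == "__main__":
    for approach in ["full_ft", "lora_bf16", "qlora_4bit"]:
        print(f"{approach:<12} 8B, 1 GPU: ~{estimate_gb(8e9, approach):.1f} GB")
```

For the exercises, vary one argument at a time, for example `estimate_gb(70e9, "qlora_4bit")`, `estimate_gb(8e9, "qlora_4bit", trainable_frac=0.02)`, or `estimate_gb(8e9, "full_ft", n_gpus=8)`, and compare against the single-GPU 8B baseline.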

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain why QLoRA can fine-tune a 65B model on a single GPU when full fine-tuning cannot. What is 4-bit quantization doing, and why are the LoRA adapter weights kept in higher precision?

2. Why it works (the mechanism)

Walk me through the memory breakdown of full fine-tuning with Adam: model weights, gradients, first moment (m), second moment (v), and activations. At 8B parameters in bf16, compute each component's size in GB, and show why the two fp32 Adam moments alone need about 64 GB.

3. Advanced — application & what's next

I want to fine-tune Llama-3 8B for a customer support task. I have access to 1×A100 80GB, 10K instruction pairs, and a 1-week timeline. Walk me through: (1) whether I need LoRA or can do full fine-tune, (2) which target_modules to apply LoRA to and why, (3) which PEFT library settings to start with (rank, alpha, dropout), and (4) how to decide if the fine-tuned model is better than just prompting the base model.