QLoRA combines quantization with LoRA for extreme efficiency.
How it works:
Quantize the base model to 4-bit (NF4 format)
Add small LoRA adapters in FP16/BF16
Train only the adapters; the base stays frozen and quantized
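The steps above map directly onto the Hugging Face stack. A minimal sketch, assuming the `transformers`, `bitsandbytes`, and `peft` libraries are installed; the model name and LoRA hyperparameters (r, alpha, target modules) are illustrative, not prescriptive:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Step 1: quantize the base model to 4-bit NF4 at load time
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls computed in BF16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative model choice
    quantization_config=bnb_config,
)

# Step 2: attach LoRA adapters to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Step 3: only the adapters require gradients; the 4-bit base is frozen
model.print_trainable_parameters()
```

This is a configuration sketch rather than a full training loop; in practice you would pass `model` to a standard trainer.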
Result: Fine-tune a 65B model on a single 48GB GPU. Previously impossible without multi-GPU setups.
Quality: Surprisingly close to full fine-tuning, with only a small degradation from quantization.
Interview question: "How does QLoRA save memory?"
Base model in 4-bit (4× smaller than FP16). Only the small adapters are in FP16. Optimizer states are kept only for the adapters.
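The arithmetic behind that answer can be made concrete. A back-of-envelope sketch in plain Python, assuming a 7B-parameter model, Adam with FP32 moment estimates (8 bytes per trained parameter), and roughly 40M adapter parameters (an illustrative adapter size, not a fixed number):

```python
def full_finetune_gb(n_params):
    # FP16 weights (2 B) + FP16 grads (2 B) + FP32 Adam m and v (8 B)
    return n_params * (2 + 2 + 8) / 1e9

def qlora_gb(n_params, adapter_params):
    base = n_params * 0.5 / 1e9  # 4-bit base: 0.5 byte per parameter
    # only the adapters carry FP16 grads and FP32 Adam states
    adapters = adapter_params * (2 + 2 + 8) / 1e9
    return base + adapters

print(full_finetune_gb(7e9))    # 84.0 GB just for weights/grads/optimizer
print(qlora_gb(7e9, 40e6))      # ~3.98 GB: 3.5 GB base + 0.48 GB adapters
```

Activations and KV cache add more on top, but the weight/optimizer gap alone shows why the base-in-4-bit plus adapters-only-optimizer design is the source of the savings.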