SGD: Update parameters using the mini-batch gradient. Simple but converges slowly and is sensitive to the learning rate.
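A minimal sketch of the plain SGD update (names like `params`, `grads`, and `lr` are illustrative, not from a specific library):

```python
import numpy as np

def sgd_step(params, grads, lr=0.01):
    """One SGD update; params and grads are lists of matching np.ndarrays."""
    for p, g in zip(params, grads):
        p -= lr * g  # theta <- theta - lr * grad
```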
Momentum: Add a velocity term that accumulates past gradients. Accelerates progress along directions where gradients are consistent and damps oscillation.
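A sketch of the momentum update, assuming one velocity buffer per parameter (the decay coefficient `beta` is commonly around 0.9):

```python
import numpy as np

def momentum_step(params, grads, velocities, lr=0.01, beta=0.9):
    """One momentum update; velocities persist across steps, initialized to zeros."""
    for p, g, v in zip(params, grads, velocities):
        v[:] = beta * v + g  # accumulate velocity in the consistent direction
        p -= lr * v          # step along the smoothed gradient, not the raw one
```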
Adam: Combines momentum with adaptive per-parameter learning rates. Tracks first and second moments of the gradients, with bias correction for their zero initialization. Common default choice.
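A sketch of one Adam step for a single parameter array (defaults `beta1=0.9`, `beta2=0.999`, `eps=1e-8` follow the original paper; the function name is illustrative):

```python
import numpy as np

def adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m, v are running moment estimates; t is the 1-based step count."""
    m[:] = beta1 * m + (1 - beta1) * g       # first moment: momentum-like average
    v[:] = beta2 * v + (1 - beta2) * g * g   # second moment: per-parameter scale
    m_hat = m / (1 - beta1 ** t)             # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    p -= lr * m_hat / (np.sqrt(v_hat) + eps) # adaptive per-parameter step size
```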
Learning rate scheduling: Anneal the rate over training via step decay, cosine annealing, or warmup-then-decay.
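A sketch of a warmup-then-decay schedule (linear warmup into cosine annealing; hyperparameters like `warmup_steps` and `base_lr` are illustrative):

```python
import math

def lr_at_step(step, warmup_steps=1000, total_steps=100_000,
               base_lr=3e-4, min_lr=3e-5):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear ramp from 0 to base_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```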
Interview question: "Why use Adam over SGD?"
Adam adapts the learning rate per parameter and typically converges faster with less tuning. But well-tuned SGD with momentum sometimes generalizes better.