Weight decay penalizes large weights, acting as a regularizer. In its classic L2 form, it adds a term to the loss proportional to the squared magnitude of the weights.
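For concreteness, the standard L2 formulation (one common convention; some variants drop the factor of 1/2) is:

$$\mathcal{L}_{\text{total}}(w) = \mathcal{L}(w) + \frac{\lambda}{2}\lVert w \rVert_2^2$$

where $\lambda$ is the weight decay coefficient.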
Typical values for fine-tuning: 0.01 to 0.1 (exact choices vary by model and task).
Weight decay prevents weights from growing too large, which helps generalization. AdamW applies weight decay as a separate, decoupled step rather than folding it into the gradient as classic L2 regularization does; with adaptive optimizers such as Adam, this decoupled form works better in practice.
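A minimal sketch of how this looks in practice, assuming PyTorch (the model, `lr`, and `wd` values here are illustrative placeholders, not taken from the text above):

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for any model being fine-tuned
lr, wd = 1e-4, 0.01             # illustrative learning rate and weight decay

# Standard usage: AdamW applies decoupled weight decay internally.
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)

# Manual equivalent of the decay part of one AdamW step: each weight is
# shrunk by a factor of (1 - lr * wd), independently of the gradient update.
with torch.no_grad():
    for p in model.parameters():
        p.mul_(1 - lr * wd)
```

By contrast, classic L2 regularization adds the penalty to the loss, so it flows through the gradient and gets rescaled by Adam's per-parameter adaptive step sizes; the decoupled form avoids exactly that interaction.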