SFT continues the pre-training objective on your data. The model sees an input sequence and predicts the next token at each position. Cross-entropy loss measures how wrong those predictions are.
Gradients flow backward through the network, adjusting weights to reduce loss. Over many iterations, the model's predictions align with your training examples. Simple but effective.
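The loop above can be sketched end to end with a toy model. This is a minimal NumPy illustration, not a real transformer: the "model" is just a bigram table of logits (one row per previous token), but the objective is the same shifted next-token cross-entropy, and the update is plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 5
# Toy "model": a table of next-token logits indexed by the previous token.
# Real SFT updates transformer weights the same way, just with far more parameters.
logits_table = rng.normal(size=(vocab, vocab))

# One training sequence of token ids; inputs are the tokens,
# targets are those same tokens shifted left by one (the "next token").
seq = np.array([1, 3, 2, 4, 0])
inputs, targets = seq[:-1], seq[1:]

def loss_and_grad(table):
    logits = table[inputs]                       # (T, vocab) predicted logits
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Cross-entropy: negative log-probability assigned to each true next token.
    loss = -np.log(probs[np.arange(len(targets)), targets]).mean()
    # Gradient of mean cross-entropy w.r.t. logits: (softmax - one_hot) / T.
    grad_logits = probs.copy()
    grad_logits[np.arange(len(targets)), targets] -= 1.0
    grad_logits /= len(targets)
    grad = np.zeros_like(table)
    np.add.at(grad, inputs, grad_logits)         # scatter back to the rows used
    return loss, grad

before, _ = loss_and_grad(logits_table)
for _ in range(50):                              # plain gradient descent
    _, grad = loss_and_grad(logits_table)
    logits_table -= 0.5 * grad                   # step weights against the gradient
after, _ = loss_and_grad(logits_table)
print(f"loss: {before:.3f} -> {after:.3f}")      # loss falls toward the training data
```

After a few dozen steps the loss drops, i.e. the table now assigns high probability to exactly the next-token transitions in the training sequence; this is the "predictions align with your training examples" behavior in miniature.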