Decoder-only models use causal masking: position t can attend only to positions 1 through t, never to later positions. This prevents the model from "cheating" by looking at future tokens during training.
It is implemented by setting the attention scores for future positions to negative infinity before the softmax, which drives their attention weights to exactly zero.
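A minimal NumPy sketch of this mechanism (the function name and the random score matrix are illustrative, not from the original):

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to a (T, T) score matrix, then softmax each row."""
    T = scores.shape[0]
    # Entries above the diagonal (j > i) are future positions: mask with -inf.
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    masked = np.where(mask, -np.inf, scores)
    # Row-wise softmax; exp(-inf) = 0, so masked positions get zero weight.
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

w = causal_attention_weights(np.random.randn(4, 4))
print(np.triu(w, k=1))  # strictly upper triangle is all zeros
```

Each row still sums to 1, so position t's probability mass is redistributed over positions 1..t only.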
Causal masking lets us train on full sequences in parallel while ensuring the model only uses past context when predicting each token.
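The parallel-vs-sequential equivalence can be checked directly: one masked pass over the whole sequence gives, at every position, the same output as attention computed over just that position's prefix. A single-head sketch without learned projections (all names here are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(q, k, v):
    """Full-sequence attention with a causal mask, computed in one parallel pass."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores)) @ v

rng = np.random.default_rng(0)
T, d = 5, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))

parallel = masked_attention(q, k, v)

# Sequential check: recompute position t's output using only positions 0..t.
for t in range(T):
    s = q[t:t + 1] @ k[:t + 1].T / np.sqrt(d)
    assert np.allclose(parallel[t], softmax(s) @ v[:t + 1])
```

Because every position's output depends only on its prefix, one forward pass yields T valid next-token training signals at once.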