Here's how self-attention works:
Create Q, K, V matrices by multiplying input by learned weight matrices.
Compute attention scores: multiply Q by K transpose. Each score measures the similarity between one query and one key.
Scale the scores by dividing by the square root of the key dimension. This keeps the dot products from growing large and saturating the softmax.
Apply softmax row-wise to get attention weights (each row sums to 1).
Multiply the weights by V to get the output: each output vector is a weighted sum of the value vectors.
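The steps above can be sketched as a single-head attention function in NumPy. The shapes (4 tokens, dimension 8) and random weight matrices are illustrative assumptions, not part of any particular model:

```python
# Minimal sketch of scaled dot-product self-attention (single head).
# All shapes here are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q = X @ Wq                           # queries
    K = X @ Wk                           # keys
    V = X @ Wv                           # values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # scaled similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # 4 tokens, model dim 8
Wq = rng.normal(size=(8, 8))
Wk = rng.normal(size=(8, 8))
Wv = rng.normal(size=(8, 8))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one output vector per token
```

Note that the output has the same number of rows as the input: attention mixes information across tokens but produces one vector per token.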