After attention, each position passes through a feed-forward network (FFN): two linear layers with an activation function between them:
FFN(x) = W_2 * ReLU(W_1 * x + b_1) + b_2
The hidden dimension is typically 4× the model dimension. For a 512-dimensional model, the FFN hidden layer has 2048 dimensions.
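A minimal NumPy sketch of this formula, using hypothetical sizes (d_model=512, d_ff=2048, the usual 4× expansion). The code uses the row-vector convention `x @ W1` rather than `W_1 * x`, so the weight shapes are transposed relative to the equation:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Position-wise feed-forward: expand to the hidden dimension,
    # apply ReLU, then project back to the model dimension.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Hypothetical sizes: d_model=512, d_ff=2048 (a 4x expansion).
d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.standard_normal((10, d_model))  # 10 positions, each d_model wide
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (10, 512)
```

Note that the same weights are applied to every position independently: the matrix multiply treats each of the 10 rows as a separate input vector.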
Attention routes information between positions; FFNs transform it, and are thought to store much of the model's factual knowledge.