Standard attention is slow because it materializes the full N×N attention matrix and streams it through slow GPU high-bandwidth memory (HBM); those reads and writes, not the arithmetic, dominate the runtime. Flash Attention fixes this by computing attention in tiles.
Instead of materializing the full attention matrix, Flash Attention processes small blocks of queries, keys, and values that fit in fast on-chip SRAM, using an online softmax to merge the partial results from each block. This dramatically reduces traffic to HBM.
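The block-merging trick above can be sketched in plain Python. This is a minimal, pure-CPU illustration of the online-softmax bookkeeping, not the real kernel: per query row it keeps only a running max, a running softmax denominator, and a running output accumulator, so the full score matrix is never stored. The function names and the toy Q/K/V data are invented for the example.

```python
import math

def naive_attention(Q, K, V):
    """Reference: materialize all scores per row (what Flash Attention avoids)."""
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e * v[d] for e, v in zip(exps, V)) / z
                    for d in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block=2):
    """Process K/V in blocks with an online softmax; no full score matrix."""
    out = []
    for q in Q:
        m = float("-inf")        # running max of scores seen so far
        z = 0.0                  # running softmax denominator
        acc = [0.0] * len(V[0])  # running unnormalized output
        for start in range(0, len(K), block):
            Kb, Vb = K[start:start + block], V[start:start + block]
            scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in Kb]
            m_new = max(m, max(scores))
            scale = math.exp(m - m_new)  # rescale old accumulators to new max
            z *= scale
            acc = [a * scale for a in acc]
            for s, v in zip(scores, Vb):
                e = math.exp(s - m_new)
                z += e
                acc = [a + e * vd for a, vd in zip(acc, v)]
            m = m_new
        out.append([a / z for a in acc])  # normalize once at the end
    return out

# Toy inputs (invented for the example)
Q = [[1.0, 0.0], [0.5, 0.5]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block=2)
```

Both paths compute the same result up to floating-point rounding, which is the point: tiling changes the memory access pattern, not the math. The real kernel does this same bookkeeping per SRAM tile, fused into one GPU kernel.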
The result is a severalfold speedup with outputs that match standard attention up to floating-point rounding: Flash Attention is exact, not an approximation. It is now standard in most inference frameworks, and when fine-tuning, enabling it (for example via `attn_implementation="flash_attention_2"` in Hugging Face Transformers) is usually a free performance win.