Flash Attention restructures the attention computation to be memory-efficient. Instead of materializing the full `seq_len × seq_len` attention matrix, it processes keys and values in tiles that fit in fast on-chip memory, combining tiles with an online (running) softmax.
Results:
- Faster wall-clock attention (the exact speedup depends on hardware, sequence length, and head dimension)
- Linear memory in sequence length instead of quadratic
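The tiling idea can be sketched in plain NumPy. This is an illustrative re-implementation of the tiled computation with an online softmax, not the fused CUDA kernel that gives Flash Attention its speed; the function and parameter names here are my own.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Flash-Attention-style tiled attention (illustrative sketch).

    Processes K/V in blocks of `block_size` rows, maintaining a running
    softmax per query row, so the full (seq_len x seq_len) score matrix
    is never materialized -- memory is linear in sequence length.
    """
    seq_len, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(seq_len, -np.inf)  # running max per query row
    row_sum = np.zeros(seq_len)          # running softmax normalizer

    for start in range(0, seq_len, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        scores = (Q @ Kb.T) * scale              # one (seq_len, block) tile
        new_max = np.maximum(row_max, scores.max(axis=1))
        # Rescale previously accumulated output/normalizer to the new max,
        # keeping the softmax numerically stable across tiles.
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        row_max = new_max

    return out / row_sum[:, None]
```

The final division by `row_sum` applies the softmax normalization once, after all tiles have been folded in; the result matches standard (non-tiled) attention up to floating-point error.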
Most modern training frameworks enable Flash Attention automatically, but it can silently fall back to a slower path (e.g. unsupported dtype, head dimension, or hardware), so verify it is actually active in your config.
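As one concrete way to check, PyTorch 2.x exposes flags for its scaled-dot-product-attention backends (the exact API varies by version, so treat this as a sketch rather than a universal recipe):

```python
import torch
import torch.nn.functional as F

# Report whether PyTorch will consider the Flash Attention backend for
# F.scaled_dot_product_attention (a True value means it is allowed, not
# that every call will use it -- dtype/shape/hardware must also qualify).
print("flash SDP enabled:", torch.backends.cuda.flash_sdp_enabled())

# q, k, v: (batch, heads, seq_len, head_dim). On supported CUDA hardware
# with fp16/bf16 inputs, this dispatches to the Flash Attention kernel.
q = torch.randn(1, 8, 128, 64)
out = F.scaled_dot_product_attention(q, q, q)
print(out.shape)
```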