Full attention lets every token attend to every other token. Sliding window attention restricts each token to attend only to nearby tokens within a fixed window.
With a window size of w, token i can only attend to tokens i−w+1 through i. This reduces attention memory from O(n²) to O(n·w), where n is the sequence length and w is the window size.
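The causal sliding window can be expressed as a boolean mask: token i sees only positions j with i−w+1 ≤ j ≤ i. A minimal NumPy sketch (the function name `sliding_window_mask` is illustrative, not from any library):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # mask[i, j] is True when token i may attend to token j:
    # causal (j <= i) and within the window (j > i - window).
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 3)
print(mask.astype(int))
# Each row has at most 3 ones: the token itself plus the 2 before it.
```

In practice this mask is added (as −inf on the False entries) to the attention scores before the softmax, so only O(n·w) score entries ever need to be materialized.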
Mistral uses sliding window attention. Information still propagates across the full sequence through stacked layers. You trade some long-range attention for much better efficiency on long sequences.
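Why information still propagates: each layer lets a token pull in information from up to w−1 positions further back, so after L layers the effective receptive field grows to roughly L·(w−1)+1 positions. A quick back-of-the-envelope check, using the window size (4096) and layer count (32) reported for Mistral 7B:

```python
def receptive_field(num_layers, window):
    # Each layer extends a token's reach by (window - 1) positions,
    # so stacking layers propagates information roughly this far back.
    return num_layers * (window - 1) + 1

# Illustrative values reported for Mistral 7B: 32 layers, window 4096.
print(receptive_field(32, 4096))  # ≈ 131k positions
```

This matches the roughly 131K-token theoretical attention span cited for that model, even though any single layer only looks 4096 tokens back.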