Before transformers, models processed text sequentially: a recurrent network reads one token at a time, so each step's representation depends on a chain of earlier hidden states. Information about distant tokens had to survive many updates, which made long-range dependencies hard to capture.
Attention changed everything. It lets the model look at all positions simultaneously and learn which ones matter for the current prediction: the word "it" can attend directly to the "the cat" tokens earlier in the sentence, however far back they are. And because positions are processed in parallel rather than one after another, training also maps efficiently onto GPUs.
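The "look at all positions and decide which ones matter" step can be made concrete with a minimal NumPy sketch of scaled dot-product attention, the core operation inside a transformer. This is an illustrative toy (the shapes, seed, and function name are my own choices, not from any particular library); each query token scores every key token, the scores become a softmax distribution, and the output is a weighted sum of the value vectors.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Similarity between every query and every key; scaling by
    # sqrt(d_k) keeps the softmax from saturating for large d_k.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over key positions: each row sums to 1, giving each
    # token a distribution over which other tokens to attend to.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output for each token is a weighted mix of all value vectors.
    return weights @ V, weights

# Toy self-attention: 3 tokens, embedding dimension 4, Q = K = V.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(x, x, x)
```

Note that the score matrix is computed for all token pairs in one matrix multiply, which is exactly why the mechanism parallelizes so well compared with a step-by-step recurrence.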