Before transformers, RNNs processed sequences by maintaining hidden state.
RNN: Hidden state updated at each step: h_t = tanh(W_h h_{t-1} + W_x x_t + b). Suffers from vanishing gradients on long sequences, since backprop repeatedly multiplies by W_h.
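A minimal NumPy sketch of the recurrence (the tanh activation, sizes, and random init are illustrative, not a specific library's API):

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x, b):
    # One recurrence step: h_t = tanh(W_h @ h_{t-1} + W_x @ x_t + b)
    return np.tanh(W_h @ h_prev + W_x @ x + b)

rng = np.random.default_rng(0)
H, D = 4, 3  # hidden size, input size (arbitrary for the sketch)
W_h = 0.1 * rng.normal(size=(H, H))
W_x = 0.1 * rng.normal(size=(H, D))
b = np.zeros(H)

h = np.zeros(H)
for x in rng.normal(size=(10, D)):  # process a 10-step sequence, one step at a time
    h = rnn_step(h, x, W_h, W_x, b)
```

The loop is the key point: each step depends on the previous hidden state, so the sequence cannot be processed in parallel.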
LSTM: Long Short-Term Memory. Gates control information flow: forget, input, output. Handles longer dependencies.
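A sketch of one LSTM cell step showing the three gates (weight packing and sizes are illustrative assumptions; real implementations fuse these the same way but differ in layout):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x, W, b):
    # W maps [h_prev; x] to four stacked pre-activations, one per gate.
    z = W @ np.concatenate([h_prev, x]) + b
    H = h_prev.shape[0]
    f = sigmoid(z[0 * H:1 * H])  # forget gate: how much of c_prev to keep
    i = sigmoid(z[1 * H:2 * H])  # input gate: how much new content to write
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much of the cell to expose
    g = np.tanh(z[3 * H:4 * H])  # candidate cell update
    c = f * c_prev + i * g       # additive cell path eases gradient flow
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(1)
H, D = 4, 3
W = 0.1 * rng.normal(size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = lstm_step(np.zeros(H), np.zeros(H), rng.normal(size=D), W, b)
```

The additive update `c = f * c_prev + i * g` is what mitigates vanishing gradients: when f is near 1, the cell state passes through nearly unchanged.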
BiLSTM: Process sequence forward and backward. Captures context from both directions.
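The bidirectional wiring can be sketched with plain tanh-RNN cells (LSTM cells would slot in identically; the parameter tuples here are illustrative):

```python
import numpy as np

def run_rnn(xs, W_h, W_x, b):
    # Run a simple RNN over the whole sequence, returning all hidden states.
    h = np.zeros(W_h.shape[0])
    outs = []
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x + b)
        outs.append(h)
    return np.stack(outs)

def bidirectional(xs, fwd_params, bwd_params):
    hf = run_rnn(xs, *fwd_params)            # left-to-right pass
    hb = run_rnn(xs[::-1], *bwd_params)[::-1]  # right-to-left, re-aligned in time
    return np.concatenate([hf, hb], axis=-1)   # each step sees both contexts

rng = np.random.default_rng(2)
H, D, T = 4, 3, 6
make = lambda: (0.1 * rng.normal(size=(H, H)),
                0.1 * rng.normal(size=(H, D)),
                np.zeros(H))
out = bidirectional(rng.normal(size=(T, D)), make(), make())
```

Each output step concatenates a forward state (summarizing the past) with a backward state (summarizing the future), which is why BiLSTMs were standard for tagging tasks.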
Why transformers won: Attention is parallelizable. RNNs process sequentially. Transformers scale better with compute.
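The parallelism claim in one sketch: scaled dot-product attention computes all positions with matrix multiplies, versus the T sequential steps an RNN needs (single-head, no masking, for illustration):

```python
import numpy as np

def attention(Q, K, V):
    # All T query positions attend to all T keys in one matmul pair;
    # an RNN would need T dependent steps to cover the same sequence.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V  # each row is a convex combination of the value rows

rng = np.random.default_rng(3)
T, d = 5, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
out = attention(Q, K, V)
```

Because nothing here depends on the previous position's result, the whole computation maps onto a few large matmuls, which is exactly what GPUs are good at.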
Interview tip: Know LSTM gates conceptually. Transformers dominate but RNN questions still appear.