The context window is the maximum number of tokens the model can process at once (prompt plus generated output).
Typical sizes (vary by model generation): GPT-4: 8K–128K tokens depending on variant. Claude 3: 200K tokens. LLaMA: 4K–128K depending on version.
Limitations:
- Attention is O(n²) in sequence length. Long contexts are slow and expensive.
- "Lost in the middle": Models attend poorly to middle content.
- Cost scales with context length.
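The quadratic cost above can be seen directly: the attention score matrix alone holds n×n entries, so doubling the context quadruples compute and memory for that step. A minimal illustration (function name is mine, not from any library):

```python
def attention_score_entries(n_tokens: int) -> int:
    """Number of pairwise attention scores for one head in one layer:
    every token attends to every token, hence n * n entries."""
    return n_tokens * n_tokens

# Doubling context length quadruples the score-matrix size.
for n in (1_000, 2_000, 4_000):
    print(n, attention_score_entries(n))
```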
Solutions:
- Sparse attention patterns
- Retrieval augmentation (RAG)
- Sliding window attention
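Sliding window attention is the simplest of these to sketch: each token attends only to itself and the previous few tokens, so cost grows linearly in sequence length instead of quadratically. A toy boolean mask (my own helper, assuming a causal window):

```python
def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j.
    Causal sliding window: token i sees itself and the previous
    (window - 1) tokens, giving ~n * window scores instead of n * n."""
    return [[(i - window) < j <= i for j in range(n)] for i in range(n)]
```

Each row has at most `window` True entries, which is where the linear scaling comes from.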
Interview question: "How would you handle documents longer than the context window?"
Chunk and summarize, use RAG, or use hierarchical approaches (summaries of summaries).
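The chunk-and-summarize idea can be sketched as a recursive loop. This is a hedged outline, not a production pipeline: `summarize` here is a placeholder that truncates text, standing in for a real LLM call.

```python
def chunk(text: str, chunk_size: int) -> list[str]:
    """Split text into fixed-size pieces (a real system would split
    on semantic boundaries such as paragraphs)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def summarize(text: str, budget: int) -> str:
    # Placeholder: a real implementation would call an LLM here.
    return text[:budget]

def hierarchical_summary(text: str, chunk_size: int, budget: int) -> str:
    """Summarize each chunk, concatenate the summaries, and repeat
    until the result fits in one context window (chunk_size)."""
    while len(text) > chunk_size:
        parts = chunk(text, chunk_size)
        text = " ".join(summarize(p, budget) for p in parts)
    return text
```

The loop terminates as long as `budget` is well below `chunk_size`, since each pass shrinks the text; the final string fits in a single window.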