Decoder-only models won for several reasons:
- A simpler architecture (no separate encoder or cross-attention)
- Better scaling with data and compute
- A single model that handles both understanding and generation
- Efficient autoregressive inference via KV caching
Today's major LLMs (GPT, Llama, Mistral, and, by most public accounts, Claude) are decoder-only. This is the architecture you'll fine-tune.
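To make the KV-caching point concrete, here is a minimal NumPy sketch of autoregressive decoding for a single attention head. The dimensions and random activations are hypothetical stand-ins for real model states; the point is that each step only computes the key and value for the *new* token and appends them to a cache, rather than recomputing attention inputs for the whole sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (toy value)

def attend(q, K, V):
    # Scaled dot-product attention: one query vector over all cached keys/values.
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())  # numerically stable softmax
    w /= w.sum()
    return w @ V

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
outputs = []
for step in range(5):
    # In a real model these come from projecting the new token's hidden state;
    # here they are random placeholders.
    q = rng.normal(size=d)
    k = rng.normal(size=d)
    v = rng.normal(size=d)
    # Append only the new token's key/value; past entries are reused as-is.
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    outputs.append(attend(q, K_cache, V_cache))

print(len(outputs), K_cache.shape)  # cache grows by one row per generated token
```

Because the causal mask means token *t* only attends to tokens 1..*t*, the cached keys and values never change once computed, which is what makes per-token generation cheap.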