You may already have usable data:
- Customer support logs (sanitized)
- Internal documentation and wikis
- Code repositories and comments
- Email threads (with consent)
- Public Q&A sites like Stack Overflow
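
Mined data needs heavy filtering because real-world data is messy: expect duplicates, boilerplate, and personal details that must be removed before use. In exchange, it captures authentic patterns that synthetic data might miss.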
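
As a rough illustration of what that filtering can look like, here is a minimal sketch of a first-pass cleaner. It assumes records arrive as plain strings. The function name `filter_mined`, the email pattern, and the `MIN_WORDS` threshold are hypothetical choices for this example, not a recommended configuration. A real pipeline would likely need broader PII detection, near-duplicate detection, and source-specific rules.

```python
import hashlib
import re

# Hypothetical pattern and threshold, chosen for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
MIN_WORDS = 5


def normalize(text: str) -> str:
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return " ".join(text.split())


def filter_mined(records: list[str]) -> list[str]:
    """Redact email addresses, drop very short records, remove exact duplicates."""
    seen: set[str] = set()
    kept: list[str] = []
    for raw in records:
        # Redact before hashing so the stored text never contains the address.
        text = EMAIL_RE.sub("[EMAIL]", normalize(raw))
        if len(text.split()) < MIN_WORDS:
            continue  # too short to carry a useful pattern
        # Case-insensitive exact-match deduplication via a content hash.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


if __name__ == "__main__":
    sample = [
        "How do I reset my password?  Contact me at jane.doe@example.com please.",
        "how do I reset my password? Contact me at jane.doe@example.com please.",
        "Thanks!",
        "The build fails on CI when the cache directory is missing.",
    ]
    for line in filter_mined(sample):
        print(line)
```

Running the script on the sample keeps two of the four records: the first password question with its address redacted, and the CI failure report. The second record is dropped as a case-insensitive duplicate and "Thanks!" is dropped for length.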