Common cleaning steps:
- Remove HTML tags and escape sequences
- Normalize Unicode characters
- Fix encoding issues (mojibake)
- Remove excessive whitespace
- Strip boilerplate headers and footers
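
The steps above can be sketched as a small pipeline using only the Python standard library. This is a minimal illustration under stated assumptions, not a definitive implementation: the names `fix_mojibake` and `clean_text`, the `boilerplate_patterns` parameter, and the "Page N of M" pattern are hypothetical; "escape sequences" is assumed to mean HTML entities; the regex tag stripper is crude (a real HTML parser is safer on messy markup); and the mojibake repair only covers UTF-8 text that was mis-decoded as Windows-1252.

```python
import html
import re
import unicodedata
from typing import Iterable, Pattern

TAG_RE = re.compile(r"<[^>]+>")  # crude tag matcher, assumed good enough for a sketch


def fix_mojibake(text: str) -> str:
    """Try to undo UTF-8 text that was mis-decoded as Windows-1252.

    If the round trip fails, the text is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def clean_text(raw: str, boilerplate_patterns: Iterable[Pattern[str]] = ()) -> str:
    patterns = tuple(boilerplate_patterns)

    text = fix_mojibake(raw)                   # fix encoding issues
    text = TAG_RE.sub(" ", text)               # remove HTML tags
    text = html.unescape(text)                 # decode entities such as &amp;
    text = unicodedata.normalize("NFC", text)  # normalize Unicode

    # Strip boilerplate: drop any line matching a caller-supplied pattern.
    lines = [
        line for line in text.splitlines()
        if not any(p.search(line) for p in patterns)
    ]
    text = "\n".join(lines)

    # Remove excessive whitespace while keeping paragraph breaks.
    text = re.sub(r"[^\S\n]+", " ", text)   # runs of spaces/tabs/NBSP -> one space
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()


if __name__ == "__main__":
    page_footer = re.compile(r"^\s*Page \d+ of \d+\s*$")  # hypothetical footer
    sample = "<p>Caf\u00c3\u00a9&nbsp;&amp; bar</p>\n\n\n\nPage 3 of 10\n  Hello   world  "
    print(clean_text(sample, [page_footer]))
```

Tags are removed before entities are decoded so that escaped markup such as `&lt;b&gt;` survives as literal text instead of being stripped. For mojibake beyond this one case, a dedicated library such as `ftfy` is likely a better choice than the round trip shown here.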
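
Be careful not to over-clean. Removing all punctuation can destroy meaning: sentence boundaries, decimal points, and contractions disappear. Lowercasing discards case distinctions such as "US" versus "us". Clean just enough to remove noise.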
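
To make the cost of over-cleaning concrete, the short sketch below applies an aggressive cleaner (lowercase plus removal of all ASCII punctuation) and shows that inputs with different meanings collapse to the same string. The function name `over_clean` and the sample sentences are hypothetical, chosen only for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def over_clean(text: str) -> str:
    """Deliberately aggressive cleaning: lowercase and drop all punctuation."""
    return text.lower().translate(_STRIP_PUNCT)


if __name__ == "__main__":
    pairs = [
        ("The US team won.", "The us team won"),  # case carried meaning
        ("Pay $3.50", "Pay $350"),                # the decimal point carried meaning
    ]
    for a, b in pairs:
        # Each pair becomes indistinguishable after over-cleaning.
        print(repr(a), repr(b), over_clean(a) == over_clean(b))
```

If a downstream step really needs case-insensitive matching, one option is to keep the original text and derive a lowercased copy only for that step.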