Raw text usually needs cleaning before modeling. Which preprocessing steps to apply depends on your task and model.
Common steps:
- Lowercasing (usually yes for classification; be careful with NER, where capitalization helps separate proper nouns such as "Apple" from "apple")
- Punctuation removal (task-dependent)
- Stopword removal (for bag-of-words, not for sequence models)
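- Stemming: Strip suffixes with heuristic rules (running → run). Fast but crude: the result may not be a real word (studies → studi).
- Lemmatization: Dictionary-based normalization to the base form (better → good). More accurate, but slower and often needs the part of speech.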
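The steps above can be sketched as one function with a switch per step. This is a minimal illustration using only the standard library: `STOPWORDS`, `LEMMAS`, `toy_stem`, and `preprocess` are hypothetical names made up for this example, the stopword list and lemma dictionary are tiny stand-ins, and the stemmer is a crude suffix stripper, not the Porter algorithm. In practice you would likely reach for a library such as NLTK or spaCy.

```python
import string

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "it"}

# Toy lemma dictionary (an assumption) standing in for a real lexicon.
LEMMAS = {"better": "good", "running": "run", "was": "be", "mice": "mouse"}


def toy_stem(token):
    """Crude suffix stripping for illustration; NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip if at least 3 characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Collapse a doubled final consonant: "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return token


def preprocess(text, lowercase=True, strip_punct=True,
               drop_stopwords=False, normalize=None):
    """Apply the selected steps in order and return a list of tokens.

    normalize: None, "stem" (toy_stem), or "lemma" (LEMMAS lookup).
    """
    if lowercase:
        text = text.lower()
    if strip_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # whitespace tokenization, the simplest option
    if drop_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if normalize == "stem":
        tokens = [toy_stem(t) for t in tokens]
    elif normalize == "lemma":
        tokens = [LEMMAS.get(t, t) for t in tokens]
    return tokens


sentence = "She was running faster; it felt better."
stemmed = preprocess(sentence, normalize="stem")
lemmatized = preprocess(sentence, normalize="lemma")
print(stemmed)     # ['she', 'was', 'run', 'faster', 'it', 'felt', 'better']
print(lemmatized)  # ['she', 'be', 'run', 'faster', 'it', 'felt', 'good']
```

Note the difference on "better": the suffix stripper leaves it alone, while the dictionary lookup maps it to "good". For NER, calling `preprocess(text, lowercase=False)` keeps the capitalization.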
Interview tip: Know when NOT to preprocess. For sentiment, "not good" collapses to "good" if you remove stopwords, because many stopword lists include "not", and the polarity flips.
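A quick way to show the negation problem in an interview is a sketch like the one below. The stopword set and the `remove_stopwords` helper are assumptions made up for this example; the point is only that a list containing "not" erases the negation, and that removing negation words from the list is one simple fix.

```python
# Illustrative stopword list (an assumption) that, like many real lists,
# contains negation words.
STOPWORDS = {"the", "a", "is", "was", "not", "no"}
NEGATIONS = {"not", "no"}


def remove_stopwords(text, stopwords):
    """Lowercase, split on whitespace, and drop tokens in `stopwords`."""
    return [t for t in text.lower().split() if t not in stopwords]


review = "the movie was not good"

# Naive removal: the negation disappears and the review looks positive.
naive = remove_stopwords(review, STOPWORDS)
print(naive)  # ['movie', 'good']

# Sentiment-aware removal: keep negation words by excluding them from the list.
safe = remove_stopwords(review, STOPWORDS - NEGATIONS)
print(safe)  # ['movie', 'not', 'good']
```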