Crawlers must not overwhelm websites:
Robots.txt: Fetch and check it before crawling a domain. Respect Disallow rules, and cache the parsed rules per domain so you don't re-fetch robots.txt for every URL.
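A minimal sketch of this check-and-cache flow, using Python's standard-library `urllib.robotparser`; the agent name and 24-hour cache TTL are illustrative assumptions, not part of the notes above:

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

AGENT = "ExampleBot"          # hypothetical crawler name
CACHE_TTL = 24 * 3600         # assumed: re-fetch robots.txt daily

_robots_cache: dict[str, tuple[RobotFileParser, float]] = {}

def allowed(url: str) -> bool:
    """Return True if robots.txt permits AGENT to fetch url."""
    domain = urlparse(url).netloc
    cached = _robots_cache.get(domain)
    if cached is None or time.time() - cached[1] > CACHE_TTL:
        rp = RobotFileParser(f"https://{domain}/robots.txt")
        rp.read()                      # fetch and parse the rules
        _robots_cache[domain] = (rp, time.time())
    else:
        rp = cached[0]
    return rp.can_fetch(AGENT, url)
```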
Rate limiting (see the sketch after this list):
- Per-domain limits (e.g., 1 request per second per domain)
- Global limits to cap the crawler's total bandwidth
- Exponential backoff on errors, so a struggling server gets progressively more breathing room
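A sketch of the per-domain limit and error backoff (the global bandwidth cap is omitted for brevity); the one-request-per-second default, backoff ceiling, and jitter factor are assumptions:

```python
import random
import time
from collections import defaultdict

PER_DOMAIN_DELAY = 1.0    # assumed default: 1 request/second/domain
MAX_BACKOFF = 300.0       # assumed ceiling on backoff (seconds)

next_allowed = defaultdict(float)   # domain -> earliest next-fetch time
errors = defaultdict(int)           # domain -> consecutive error count

def wait_turn(domain: str) -> None:
    """Block until this domain's rate limit allows another request."""
    delay = next_allowed[domain] - time.time()
    if delay > 0:
        time.sleep(delay)

def record_result(domain: str, ok: bool) -> None:
    """Schedule the next fetch; back off exponentially on errors."""
    if ok:
        errors[domain] = 0
        gap = PER_DOMAIN_DELAY
    else:
        errors[domain] += 1
        # 2s, 4s, 8s, ... up to the ceiling, with jitter so many
        # workers don't all retry at the same instant
        gap = min(MAX_BACKOFF, 2.0 ** errors[domain]) * random.uniform(1.0, 1.5)
    next_allowed[domain] = time.time() + gap
```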
Crawl-delay: Some robots.txt files specify a Crawl-delay directive (in seconds). When present, honor it in place of your default per-domain delay.
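One way to fold this into the sketches above, using `urllib.robotparser` (whose `crawl_delay()` returns `None` when the directive is absent); the default value is an assumption:

```python
from urllib.robotparser import RobotFileParser

def domain_delay(rp: RobotFileParser, agent: str, default: float = 1.0) -> float:
    """Prefer the site's Crawl-delay (seconds) over our default gap."""
    site_delay = rp.crawl_delay(agent)   # None if robots.txt has no Crawl-delay
    return float(site_delay) if site_delay is not None else default
```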
Distributed politeness: In a multi-node crawler, use either a centralized rate limiter or consistent hashing that assigns each domain to exactly one crawler node, so per-domain limits hold across the whole cluster.
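One possible shape for the consistent-hashing option: a small hash ring that deterministically maps each domain to a single crawler node, so per-domain rate state never needs central coordination. Node names and the replica count are placeholders:

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes: list[str], replicas: int = 100):
        # Place `replicas` virtual points per node on the ring so load
        # stays balanced when nodes join or leave.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, domain: str) -> str:
        """Return the crawler node responsible for this domain."""
        idx = bisect.bisect(self._keys, self._hash(domain)) % len(self._keys)
        return self._ring[idx][1]

ring = HashRing(["crawler-1", "crawler-2", "crawler-3"])
assert ring.owner("example.com") == ring.owner("example.com")  # stable assignment
```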
Prioritization: Crawl important pages first. Politeness limits cap per-domain throughput, so don't waste that quota on low-value pages (e.g., deep pagination or near-duplicate URLs).
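A minimal priority-ordered frontier using the standard `heapq` module; the scores here are stand-ins (a real crawler would combine signals like link authority, freshness, and depth):

```python
import heapq

frontier: list[tuple[float, str]] = []   # min-heap of (-score, url)

def enqueue(url: str, score: float) -> None:
    # Negate the score: heapq is a min-heap, we want highest score first.
    heapq.heappush(frontier, (-score, url))

def next_url() -> str:
    return heapq.heappop(frontier)[1]

enqueue("https://example.com/", 0.9)             # e.g., high-authority homepage
enqueue("https://example.com/?page=412", 0.05)   # low-value deep pagination
assert next_url() == "https://example.com/"
```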
User-agent: Identify your crawler with a descriptive User-Agent string, include contact info (a URL and/or email), and respond promptly to abuse complaints.
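For illustration, a polite, identifiable request might look like the following; the bot name, info URL, and contact address are placeholders:

```python
import urllib.request

USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot; crawler@example.com)"

req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": USER_AGENT},
)
with urllib.request.urlopen(req) as resp:
    body = resp.read()
```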