Components: URL frontier (queue of URLs to crawl), fetcher (downloads pages), parser (extracts links and content), URL filter (removes duplicates and bad URLs), storage (pages and index).
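The component flow above can be sketched as a single loop; this is a minimal in-memory model (the fake `PAGES` link graph and the `fetch` stub are assumptions standing in for real HTTP fetching):

```python
from collections import deque

# Hypothetical link graph standing in for the web (assumption for the sketch).
PAGES = {
    "a.com/": ["a.com/x", "b.com/"],
    "a.com/x": ["a.com/"],
    "b.com/": ["a.com/"],
}

def fetch(url):
    """Fetcher: would issue an HTTP GET; here it reads the fake graph."""
    return PAGES.get(url, [])

def crawl(seed):
    frontier = deque([seed])   # URL frontier: queue of URLs to crawl
    seen = {seed}              # URL filter: drop already-queued URLs
    store = {}                 # storage: fetched "pages"
    while frontier:
        url = frontier.popleft()
        links = fetch(url)             # fetcher + parser (links extracted)
        store[url] = links             # storage
        for link in links:
            if link not in seen:       # URL filter
                seen.add(link)
                frontier.append(link)
    return store
```

In a real crawler each box is a separate service and the `seen` set becomes the deduplication structure discussed under Scale below; the loop shape stays the same.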
Politeness: Respect robots.txt. Rate limit per domain. Don't overwhelm servers.
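A politeness gate might combine both checks; this sketch uses the stdlib `urllib.robotparser`, and the `"mybot"` agent name, 1-second delay, and requeue-on-False behavior are assumptions:

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

CRAWL_DELAY = 1.0  # assumed minimum seconds between requests to one domain

_last_hit = {}  # domain -> monotonic time of last request

def polite_to_fetch(url, robots_txt):
    """Allow a fetch only if robots.txt permits it and the domain is not rate-limited."""
    domain = urlsplit(url).netloc
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())  # normally fetched once per domain and cached
    if not rp.can_fetch("mybot", url):
        return False                   # disallowed by robots.txt
    now = time.monotonic()
    if now - _last_hit.get(domain, -CRAWL_DELAY) < CRAWL_DELAY:
        return False                   # too soon: requeue the URL, don't drop it
    _last_hit[domain] = now
    return True
```

A production frontier usually enforces this by sharding the queue per domain, so one slow host never blocks URLs for other hosts.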
Scale: Billions of pages. Distributed fetchers. Consistent hashing to assign URLs to crawlers. Bloom filter for URL deduplication.
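Both scale techniques fit in a short sketch; the vnode count, MD5 as the hash, and the Bloom sizing (m = 2^20 bits, k = 4) are illustrative assumptions, not tuned values:

```python
import hashlib
from bisect import bisect

def _h(s):
    """128-bit hash of a string (MD5 is fine here; this is not for security)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Consistent hashing: map each host to a crawler node with virtual nodes."""
    def __init__(self, nodes, vnodes=100):
        self._points = sorted(
            (_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._keys = [p for p, _ in self._points]

    def owner(self, host):
        i = bisect(self._keys, _h(host)) % len(self._keys)  # next point clockwise
        return self._points[i][1]

class Bloom:
    """Bloom filter: space-efficient 'probably seen this URL?' set."""
    def __init__(self, m=1 << 20, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, url):
        h = _h(url)
        for i in range(self.k):            # derive k indices from one 128-bit hash
            yield (h >> (i * 32)) % self.m

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))
```

Hashing by host (not full URL) keeps all of a domain's URLs on one crawler, so the per-domain politeness state stays local. The Bloom filter can report false positives (skipping a never-seen URL) but never false negatives, which is an acceptable trade for a crawler.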
Freshness: Recrawl frequently changing pages more often. News sites daily. Static sites monthly.
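One common way to get this behavior is an adaptive recrawl interval per page: shrink it when a fetch finds the page changed, grow it when the page is stable. The bounds and the halve/grow-by-1.5x factors below are assumed tuning knobs:

```python
MIN_INTERVAL = 1   # days; floor ~ news sites crawled daily
MAX_INTERVAL = 30  # days; ceiling ~ static sites crawled monthly

def next_interval(current, changed):
    """Return the next recrawl interval in days for one page."""
    if changed:
        new = current / 2    # volatile page: check twice as often
    else:
        new = current * 1.5  # stable page: back off
    return min(MAX_INTERVAL, max(MIN_INTERVAL, new))
```

Each fetch compares a content hash against the stored copy to decide `changed`; over time, news pages converge to the 1-day floor and static pages drift to the 30-day ceiling.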