PySpark distributes processing across a cluster of workers. Same DataFrame logic, massive scale.
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # avoids shadowing Python's built-in sum

spark = SparkSession.builder.appName('example').getOrCreate()
df = spark.read.parquet('s3://bucket/data/')

# Filter, group, and aggregate; nothing runs until an action is called
result = (df.filter(F.col('status') == 'active')
            .groupBy('region')
            .agg(F.sum('revenue').alias('total_revenue')))

# Writing the output is the action that triggers the whole job
result.write.parquet('s3://bucket/output/')
```
Key concepts:
- Lazy evaluation: Transformations like `filter` and `groupBy` only build an execution plan; nothing executes until an action such as `count` or `write` is called (first sketch after this list)
- Partitioning: Data is split into partitions that workers process in parallel, and you can inspect or tune the split (second sketch below)
- Caching: Keep frequently reused DataFrames in memory so repeated actions don't recompute them (third sketch below)
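A minimal sketch of lazy evaluation, reusing the `df` and `F` from the example above (same assumed schema):

```python
filtered = df.filter(F.col('status') == 'active')  # transformation: plan only, no work yet
projected = filtered.select('region', 'revenue')   # still just extending the plan
projected.explain()                                # inspect the plan without executing it
n = projected.count()                              # action: the whole pipeline runs now
```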
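To see partitioning, you can check and change a DataFrame's partition count; the partition numbers here are illustrative, not tuned values:

```python
print(df.rdd.getNumPartitions())           # how many partitions the data currently has
by_region = df.repartition(200, 'region')  # shuffle into 200 partitions keyed by region
smaller = by_region.coalesce(50)           # merge down to 50 partitions without a full shuffle
```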
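And a caching sketch, again assuming the `df` above. Note that `cache()` is itself lazy, so the first action is what materializes the cached data:

```python
active = df.filter(F.col('status') == 'active').cache()  # mark for in-memory reuse
active.count()                            # first action materializes the cache
active.groupBy('region').count().show()   # reuses the cached rows, no re-read from S3
active.unpersist()                        # release the memory when done
```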