Pandas works great until it doesn't. Loading a file typically needs a multiple of the file size in RAM, so a CSV's in-memory footprint can be several times larger than the file on disk, depending on the data types.
Solutions:
- Chunking: process in pieces with `pd.read_csv(..., chunksize=10000)`
- Dtypes: use efficient types: `category` for strings, `int32` instead of `int64`
- Dask: parallel pandas; same API, distributed execution
- Polars: Rust-based; faster than pandas for many operations
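A minimal sketch of the first two ideas combined, chunked reading plus narrower dtypes (the data here is a small in-memory stand-in for a large file on disk):

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a large file on disk (hypothetical data).
csv_data = io.StringIO(
    "category,value\n"
    "a,1\n"
    "b,2\n"
    "a,3\n"
    "b,4\n"
)

# Read in chunks and aggregate incrementally instead of loading everything at once.
# dtype= keeps each chunk's numeric columns narrow; converting strings to
# 'category' shrinks repeated labels.
totals = {}
for chunk in pd.read_csv(csv_data, chunksize=2, dtype={"value": "int32"}):
    chunk["category"] = chunk["category"].astype("category")
    for cat, s in chunk.groupby("category", observed=True)["value"].sum().items():
        totals[cat] = totals.get(cat, 0) + int(s)

print(totals)  # {'a': 4, 'b': 6}
```

Only one chunk is ever resident in memory, so peak usage is bounded by the chunk size rather than the file size.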
For example, with Dask:

```python
import dask.dataframe as dd

# Builds a lazy task graph; nothing is read yet.
df = dd.read_parquet('large_file.parquet')

# .compute() triggers the actual (parallel) read and aggregation.
result = df.groupby('category').sum().compute()
```
For data that outgrows a single machine, move to Spark.