Parquet layout primer

Choose partition sizes and formats that fit parallel reads.

Cloud analytics gets easier when data is stored in self-describing, splittable formats. Parquet is a natural fit for many Dask DataFrame workloads.

Layout checks

- Avoid millions of tiny files.
- Partition by columns used for common filters.
- Keep partition sizes large enough for useful work.
- Store data close to compute.

Quick sizing thought

If each worker has 16 GB of memory, a 256 MB to 1 GB partition is often a more useful starting point than 5 MB fragments.

Example read

    import dask.dataframe as dd

    # Read only the columns the analysis needs.
    ddf = dd.read_parquet(
        "s3://company-lake/events/",
        columns=["account_id", "event_time", "amount"],
    )
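The sizing rule of thumb above can be turned into a quick back-of-the-envelope calculation. The sketch below is only an illustration: the helper names `suggest_npartitions` and `in_suggested_range`, the 512 MiB default target, and the 200 GiB dataset size are all assumptions made for the example, not values from a real system. It also does not say whether sizes are measured on disk or in memory; Parquet data usually takes more space once loaded, so treat the result as a starting point.

```python
import math

MiB = 1024 ** 2
GiB = 1024 ** 3


def suggest_npartitions(total_bytes: int, target_partition_bytes: int = 512 * MiB) -> int:
    """Return how many partitions give chunks of roughly the target size.

    Hypothetical helper: the 512 MiB default is an assumed midpoint of the
    256 MB to 1 GB range, not a recommendation from any library.
    """
    if total_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative, target must be positive")
    # Always return at least one partition, even for an empty dataset.
    return max(1, math.ceil(total_bytes / target_partition_bytes))


def in_suggested_range(partition_bytes: int, low: int = 256 * MiB, high: int = 1 * GiB) -> bool:
    """Check whether a partition size falls inside the suggested starting range."""
    return low <= partition_bytes <= high


# Assumed example: a 200 GiB dataset.
total = 200 * GiB
print(suggest_npartitions(total))            # partitions at ~512 MiB each
print(suggest_npartitions(total, 5 * MiB))   # partitions if split into 5 MiB fragments
print(in_suggested_range(5 * MiB))           # 5 MiB fragments fall below the range
```

A count computed this way could be passed to `ddf.repartition(npartitions=...)` after reading, although writing the files at a sensible size in the first place avoids the extra shuffle of data.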

Storage layout