Parquet, Partitioning, and Cloud Data Layout
Parquet layout primer
Choose partition sizes and formats that fit parallel reads.
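Cloud analytics gets easier when data is stored in self-describing, splittable formats. Parquet is a natural fit for many Dask DataFrame workloads: it is columnar, so a read can fetch only the columns it needs, and each file carries its own schema and is divided into row groups that can be read in parallel.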
Layout checks
- Avoid millions of tiny files.
- Partition by columns used for common filters.
- Keep partition sizes large enough for useful work.
- Store data close to compute.
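As a rough way to act on the first and third checks, the sketch below walks a directory tree and reports how many Parquet files fall under a size threshold. It assumes the dataset is on a local or mounted filesystem; the function name `summarize_parquet_files` and the 16 MiB default threshold are hypothetical choices for illustration, not values from this course.

```python
from pathlib import Path

# Assumed threshold for "tiny"; tune it for your own workload.
SMALL_FILE_BYTES = 16 * 1024 * 1024


def summarize_parquet_files(root, small_bytes=SMALL_FILE_BYTES):
    """Count *.parquet files under root and how many are below small_bytes."""
    sizes = [
        path.stat().st_size
        for path in Path(root).rglob("*.parquet")
        if path.is_file()
    ]
    small = sum(1 for size in sizes if size < small_bytes)
    return {
        "files": len(sizes),
        "small_files": small,
        "total_bytes": sum(sizes),
        "mean_bytes": sum(sizes) / len(sizes) if sizes else 0.0,
    }
```

A high share of `small_files` relative to `files` suggests the layout may benefit from compaction or a coarser partition column. For object storage, the same summary could be built from a bucket listing instead of a directory walk.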
Quick sizing thought
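If each worker has 16 GB of memory, a 256 MB to 1 GB partition is often a more useful starting point than 5 MB fragments. Each partition becomes at least one task, so with tiny partitions the scheduling and file-open overhead can outweigh the actual work.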
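To make the sizing thought concrete, here is a minimal arithmetic sketch. The 500 GiB dataset size and the helper `plan_partitions` are hypothetical, and the sizes are treated as in-memory sizes in binary units (MiB and GiB), which is an assumption; compressed Parquet on disk is usually smaller than the data it expands to.

```python
import math

MiB = 1024 ** 2
GiB = 1024 ** 3


def plan_partitions(dataset_bytes, target_partition_bytes):
    """Return how many partitions a dataset splits into at the target size."""
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    return max(1, math.ceil(dataset_bytes / target_partition_bytes))


dataset = 500 * GiB  # hypothetical dataset size, for illustration only

for target in (5 * MiB, 256 * MiB, 1 * GiB):
    count = plan_partitions(dataset, target)
    print(f"{target // MiB:>5} MiB partitions -> {count:>7,} partitions")
```

Under these assumptions, 5 MiB fragments give 102,400 partitions, while 256 MiB and 1 GiB targets give 2,000 and 500. The larger targets leave far fewer tasks to schedule while each partition still fits comfortably in a 16 GB worker.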
Example read
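import dask.dataframe as dd

# Read only the columns the analysis needs; other columns are never fetched.
# Reading s3:// paths requires the s3fs package to be installed.
ddf = dd.read_parquet(
    "s3://company-lake/events/",
    columns=["account_id", "event_time", "amount"],
)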