Parquet, Partitioning, and Cloud Data Layout

Parquet layout primer

Choose partition sizes and formats that fit parallel reads.

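Cloud analytics gets easier when data is stored in self-describing, splittable formats. Parquet is a natural fit for many Dask DataFrame workloads: it is columnar, each file carries its own schema, and files are divided into row groups that can be read independently, so Dask can fetch only the columns it needs and read pieces in parallel.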

Layout checks

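  • Avoid millions of tiny files; every file adds listing and request overhead on object storage.
  • Partition by columns used for common filters, so reads can skip whole directories.
  • Keep partition sizes large enough that each task does useful work.
  • Store data close to compute, ideally in the same cloud region.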
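One way to check the first and third items is to summarize file sizes under a dataset directory. The sketch below is a minimal illustration using only the standard library: the helper name audit_file_sizes, the 64 MiB threshold, and the demo directory layout are all assumptions made for this example, not part of Dask or Parquet. It walks a local path, so for object storage you would need an equivalent listing from your storage client.

```python
import tempfile
from pathlib import Path


def audit_file_sizes(root, min_bytes=64 * 1024 * 1024, pattern="*.parquet"):
    """Summarize file sizes under root (hypothetical helper, local paths only).

    min_bytes is an assumed threshold for calling a file "small".
    """
    sizes = [p.stat().st_size for p in Path(root).rglob(pattern)]
    small = [s for s in sizes if s < min_bytes]
    return {
        "files": len(sizes),
        "small_files": len(small),
        "total_bytes": sum(sizes),
        "mean_bytes": sum(sizes) / len(sizes) if sizes else 0.0,
    }


# Demo on a throwaway directory with stand-in files. They are not real
# Parquet files; only their sizes matter for this check.
demo_root = Path(tempfile.mkdtemp())
(demo_root / "event_date=2024-01-01").mkdir()
(demo_root / "event_date=2024-01-01" / "part.0.parquet").write_bytes(b"x" * 10)
(demo_root / "part.1.parquet").write_bytes(b"x" * 100)

# A tiny threshold is used here so the demo files straddle it.
report = audit_file_sizes(demo_root, min_bytes=50)
print(report)
```

A high share of small files in the report is a hint that the dataset may benefit from being rewritten into fewer, larger files.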

Quick sizing thought

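If each worker has 16 GB of memory, partitions of 256 MB to 1 GB are often a more useful starting point than 5 MB fragments. Each partition becomes at least one task with a fixed scheduling cost, so very small partitions spend more time on overhead than on work, while partitions in this range still leave room for several to sit in a worker's memory at once.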
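To make the sizing thought concrete, the arithmetic can be sketched in a few lines. The 200 GiB dataset size and the helper name plan_partitions are assumptions chosen for illustration, and the sizes are treated as binary units (MiB, GiB) for simplicity.

```python
import math

MiB = 1024 ** 2
GiB = 1024 ** 3


def plan_partitions(dataset_bytes, target_partition_bytes):
    """Return how many partitions a dataset needs at a target partition size."""
    return max(1, math.ceil(dataset_bytes / target_partition_bytes))


# Hypothetical dataset size, used only to compare the partition counts.
dataset_bytes = 200 * GiB

counts = {
    "5 MiB": plan_partitions(dataset_bytes, 5 * MiB),
    "256 MiB": plan_partitions(dataset_bytes, 256 * MiB),
    "1 GiB": plan_partitions(dataset_bytes, 1 * GiB),
}

for label, n in counts.items():
    print(f"{label:>8} partitions -> {n} tasks per pass over the data")
```

Under these assumptions, 5 MiB fragments produce 40,960 partitions, while 256 MiB and 1 GiB targets produce 800 and 200. The smaller counts mean far fewer tasks for the scheduler to track on every pass over the data.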

Example read

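import dask.dataframe as dd

# Select only the columns the analysis needs; Parquet is columnar,
# so the other columns are not read from storage.
ddf = dd.read_parquet(
    "s3://company-lake/events/",
    columns=["account_id", "event_time", "amount"],
)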
