Parquet, Partitioning, and Cloud Data Layout

Xarray cloud chunking

Align chunks to access patterns for large geospatial and scientific datasets.

Array workloads use chunks instead of DataFrame partitions, but the questions are the same: how much data should one task process, and does the chunk shape match the access pattern?

Chunking for geospatial data

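  • Chunk along the dimensions that users slice most often, so a typical query reads few chunks.
  • Avoid chunks too small to amortize per-task scheduling and per-request overhead.
  • Avoid chunks so large that workers run out of memory and spill to disk.
  • Use a chunked format such as Zarr when random access matters, because chunks can be read independently.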
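
To make "too tiny" and "too large" concrete, it can help to estimate the uncompressed size of one chunk before opening the dataset. The sketch below is illustrative only: the helper name `chunk_nbytes` is hypothetical, and the 4-byte item size assumes the variable is stored as float32, which the source does not state.

```python
import math


def chunk_nbytes(chunk_shape, itemsize_bytes):
    """Return the uncompressed in-memory size of one chunk, in bytes.

    chunk_shape maps dimension name -> chunk length along that dimension.
    itemsize_bytes is the size of one array element (4 for float32).
    """
    return math.prod(chunk_shape.values()) * itemsize_bytes


# Chunk shape taken from the example in this article.
chunks = {"time": 24, "lat": 512, "lon": 512}

# Assumption: float32 values, so 4 bytes per element.
size_mib = chunk_nbytes(chunks, itemsize_bytes=4) / 2**20
print(f"{size_mib:.0f} MiB per chunk")  # prints "24 MiB per chunk"
```

Whether 24 MiB is a good size depends on worker memory and on how many chunks a single task holds at once, so treat the number as an input to the decision rather than a verdict.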

Small example

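import xarray as xr

# Open lazily as Dask arrays; each chunk covers 24 time steps and a 512 x 512 tile.
# Reading from an s3:// URL requires the s3fs package.
ds = xr.open_zarr("s3://weather/surface.zarr", chunks={"time": 24, "lat": 512, "lon": 512})
# Reduce over both spatial dimensions, leaving one mean value per time step.
spatial_mean = ds.temperature.mean(dim=["lat", "lon"])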

Review prompt

Which dimension does the user filter first, and which dimension does the algorithm reduce?
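
One way to answer the prompt is to count how many chunks each access pattern would touch under a candidate chunking. The following is a minimal sketch, not part of xarray: `chunks_touched` is a hypothetical helper, and the array shape (8760 hourly steps on a 2048 x 4096 grid) and the second, time-heavy chunking are made-up values for illustration.

```python
def chunks_touched(selection, chunks):
    """Count the chunks overlapped by a selection of contiguous index ranges.

    selection maps dimension name -> (start, stop), with stop exclusive.
    chunks maps dimension name -> chunk length along that dimension.
    """
    total = 1
    for dim, size in chunks.items():
        start, stop = selection[dim]
        first_chunk = start // size
        last_chunk = (stop - 1) // size
        total *= last_chunk - first_chunk + 1
    return total


# Hypothetical access patterns on an 8760 x 2048 x 4096 (time, lat, lon) array.
point_series = {"time": (0, 8760), "lat": (100, 101), "lon": (200, 201)}
single_map = {"time": (0, 1), "lat": (0, 2048), "lon": (0, 4096)}

# Chunking from the example above versus a hypothetical time-heavy alternative.
spatial_chunks = {"time": 24, "lat": 512, "lon": 512}
time_chunks = {"time": 8760, "lat": 32, "lon": 32}

for name, chunking in [("spatial", spatial_chunks), ("time-heavy", time_chunks)]:
    print(
        name,
        "point series:", chunks_touched(point_series, chunking),
        "single map:", chunks_touched(single_map, chunking),
    )
```

Under these assumptions, the spatial chunking reads 365 chunks for a point time series and 32 for a single-time map, while the time-heavy chunking reads 1 and 8192. Neither layout is better in general; the choice follows from which query dominates.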
