Parquet, Partitioning, and Cloud Data Layout
Xarray cloud chunking
Align chunks to access patterns for large geospatial and scientific datasets.
Array workloads are split into chunks rather than DataFrame partitions, but the questions are the same: how much data should one task process, and does the chunk shape match the access pattern?
Chunking for geospatial data
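- Chunk along the dimensions users slice most often, so a typical selection touches only a few chunks.
- Avoid chunks so small that per-task scheduling and per-request overhead dominate the actual work.
- Avoid chunks so large that workers spill to disk or run out of memory.
- Use a chunked format such as Zarr when random access matters, so a reader can fetch only the chunks it needs.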
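To make "too tiny" and "too large" concrete, it can help to estimate the uncompressed size of one chunk before opening anything. The sketch below is plain arithmetic, not an xarray API: `chunk_nbytes` is a hypothetical helper, and the 4-byte `float32` element size is an assumption about how the temperature variable is stored.

```python
from math import prod


def chunk_nbytes(chunks: dict[str, int], itemsize: int) -> int:
    """Uncompressed bytes in one chunk: product of chunk lengths times bytes per element."""
    return prod(chunks.values()) * itemsize


chunks = {"time": 24, "lat": 512, "lon": 512}

# Assumption: values are float32, so 4 bytes per element.
nbytes = chunk_nbytes(chunks, itemsize=4)
chunk_mib = nbytes / 2**20
print(f"{chunk_mib:.1f} MiB per chunk")  # 24.0 MiB per chunk
```

Whether 24 MiB is a good size depends on worker memory and on how many chunks each task holds at once. Commonly cited guidance for Dask-backed arrays puts comfortable chunk sizes in the tens to low hundreds of megabytes, but treat that as a starting point rather than a rule.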
Small example
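import xarray as xr

# Open lazily; `chunks` sets the Dask chunk shape used for each task.
ds = xr.open_zarr("s3://weather/surface.zarr", chunks={"time": 24, "lat": 512, "lon": 512})
# Reduce over space only: the result has one mean temperature per time step.
spatial_mean = ds.temperature.mean(dim=["lat", "lon"])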
Review prompt
Which dimension does the user filter first, and which dimension does the algorithm reduce?
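One way to answer the prompt is to count how many chunks each access pattern would touch under a candidate layout. The sketch below uses only index arithmetic. The grid shape, the two layouts, and the `chunks_touched` helper are all hypothetical and chosen for illustration, not taken from a real dataset.

```python
from math import prod


def chunks_touched(selection: dict[str, tuple[int, int]], chunks: dict[str, int]) -> int:
    """Count chunks overlapped by a half-open index range [start, stop) per dimension."""
    return prod(
        (stop - 1) // chunks[dim] - start // chunks[dim] + 1
        for dim, (start, stop) in selection.items()
    )


# Hypothetical grid: one year of hourly steps on a 2048 x 4096 lat/lon grid.
shape = {"time": 8760, "lat": 2048, "lon": 4096}

# Two access patterns, expressed as index ranges.
point_series = {"time": (0, shape["time"]), "lat": (100, 101), "lon": (200, 201)}
single_map = {"time": (0, 1), "lat": (0, shape["lat"]), "lon": (0, shape["lon"])}

# Two candidate layouts: one favors maps, the other favors time series.
map_friendly = {"time": 24, "lat": 512, "lon": 512}
series_friendly = {"time": 8760, "lat": 32, "lon": 32}

for name, layout in [("map_friendly", map_friendly), ("series_friendly", series_friendly)]:
    print(name, chunks_touched(point_series, layout), chunks_touched(single_map, layout))
```

Under these assumptions the map-friendly layout reads 365 chunks for a single-point time series and 32 for one full map, while the series-friendly layout reads 1 chunk for the time series and 8192 for the map. Neither layout is better in general; the one to choose is the one that matches the dimension users filter on first.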