STAC + stackstac + xarray

python

cloud

Author

Henry Rodman

Published

February 18, 2023

Modified

October 6, 2024

Keywords

STAC, python, cloud-native geospatial, xarray, stackstac

stackstac is a python package for making it dead simple to process data from a STAC using xarray.

When you use stackstac, you do not need to write any code to handle merging operations for many raster files. The data come out the other side in a neatly packaged xarray object with x, y, time, and band dimensions! This is very convenient when you are interested in an arbitrary area that may require combining data from many STAC items. The stackstac documentation is very good, but I have put this demo together to highlight a few things that I really like about the STAC + stackstac + xarray workflow.

Define search parameters

I am interested in getting a mosaic of daily observations from Sentinel 2 for September 2022.

import pyproj
import pystac_client
import stackstac
import xarray as xr
from shapely.geometry import box
from shapely.ops import transform

# STAC connection information for Sentinel 2 COGs
STAC_URL = "https://earth-search.aws.element84.com/v1"
STAC_COLLECTION = "sentinel-2-l2a"

# spatial projection information
CRS_STRING = "epsg:5070"
EPSG = pyproj.CRS.from_string(CRS_STRING).to_epsg()

# area of interest along the North Shore of Lake Superior
AOI = box(373926, 2744693, 406338, 2765304)

# a few more parameters
RESOLUTION = 100  # meters
BANDS = ["red", "green", "blue"]
START_DATE = "2022-09-01"
END_DATE = "2022-09-30"

Query the STAC for matching items

To query the STAC, we need to provide a bounding box in epsg:4326 coordinates:

# STAC items store bounding box info in epsg:4326
transformer_4326 = pyproj.Transformer.from_crs(
    crs_from=CRS_STRING,
    crs_to="epsg:4326",
    always_xy=True,
)

bbox_4326 = transform(transformer_4326.transform, AOI).bounds

This will return all of the STAC items that intersect the provided bounding box and time window:

catalog = pystac_client.Client.open(STAC_URL)

stac_items = catalog.search(
    collections=[STAC_COLLECTION],
    bbox=bbox_4326,
    datetime=[START_DATE, END_DATE],
).item_collection()

The query yields many STAC items, each of which describes multiple COGs (one per band). Using other tools, we would need to write a bunch of code to make sure we combine the data correctly but with stackstac we can forget about that and just get on with our analysis!

len(stac_items)

Stack it

Lazily load the raster data into an xarray.DataArray using stackstack.stack. This function uses the STAC item metadata to construct a multidimensional array with human-readable coordinates that can be manipulated with the magnificently powerful suite of xarray functions and methods!

sentinel_stack = stackstac.stack(
    items=stac_items,
    assets=BANDS,
    epsg=EPSG,
    resolution=RESOLUTION,
    bounds=AOI.bounds,
    xy_coords="center",
)
sentinel_stack

<xarray.DataArray 'stackstac-5efc1f91909c10ae7d8453451c8ac811' (time: 20,
                                                                band: 3,
                                                                y: 208, x: 325)> Size: 32MB
dask.array<fetch_raster_window, shape=(20, 3, 208, 325), dtype=float64, chunksize=(1, 1, 208, 325), chunktype=numpy.ndarray>
Coordinates: (12/54)
  * time                                     (time) datetime64[ns] 160B 2022-...
    id                                       (time) <U24 2kB 'S2B_15UXP_20220...
  * band                                     (band) <U5 60B 'red' 'green' 'blue'
  * x                                        (x) float64 3kB 3.74e+05 ... 4.0...
  * y                                        (y) float64 2kB 2.765e+06 ... 2....
    s2:nodata_pixel_percentage               (time) object 160B 51.473683 ......
    ...                                       ...
    gsd                                      int64 8B 10
    raster:bands                             object 8B {'nodata': 0, 'data_ty...
    common_name                              (band) <U5 60B 'red' 'green' 'blue'
    center_wavelength                        (band) float64 24B 0.665 0.56 0.49
    full_width_half_max                      (band) float64 24B 0.038 ... 0.098
    epsg                                     int64 8B 5070
Attributes:
    spec:        RasterSpec(epsg=5070, bounds=(373900, 2744600, 406400, 27654...
    crs:         epsg:5070
    transform:   | 100.00, 0.00, 373900.00|\n| 0.00,-100.00, 2765400.00|\n| 0...
    resolution:  100

The resolution argument makes it possible to resample the input data on-the-fly. In this case, I am downsampling from the original 20 meter resolution to 100 meters.

Wrangle the time dimension

One thing to watch out for with stackstac.stack is that you will wind up with a distinct time coordinate for each STAC item that you pass in. To achieve the intuitive representation of the data, you need to flatten the DataArray with respect to day.

Note: if you are only reading a single STAC item, stackstac.mosaic will inadvertently reduce your data along the band dimension (which is definitely not what you want!), hence the conditional statement checking for more than one time coordinate value.

def flatten(x, dim="time"):
    assert isinstance(x, xr.DataArray)
    if len(x[dim].values) > len(set(x[dim].values)):
        x = x.groupby(dim).map(stackstac.mosaic)

    return x


# round time coordinates so all observations from the same day so they have
# equivalent timestamps
sentinel_stack = sentinel_stack.assign_coords(
    time=sentinel_stack.time.astype("datetime64[D]")
)

# mosaic along time dimension
flat_stack = flatten(sentinel_stack, dim="time")
flat_stack

/tmp/ipykernel_4612/1844268677.py:12: UserWarning: Converting non-nanosecond precision datetime values to nanosecond precision. This behavior can eventually be relaxed in xarray, as it is an artifact from pandas which is now beginning to support non-nanosecond precision values. This warning is caused by passing non-nanosecond np.datetime64 or np.timedelta64 values to the DataArray or Variable constructor; it can be silenced by converting the values to nanosecond precision ahead of time.
  time=sentinel_stack.time.astype("datetime64[D]")

<xarray.DataArray 'stackstac-5efc1f91909c10ae7d8453451c8ac811' (time: 10,
                                                                band: 3,
                                                                y: 208, x: 325)> Size: 16MB
dask.array<concatenate, shape=(10, 3, 208, 325), dtype=float64, chunksize=(1, 1, 208, 325), chunktype=numpy.ndarray>
Coordinates: (12/21)
  * band                                     (band) <U5 60B 'red' 'green' 'blue'
  * x                                        (x) float64 3kB 3.74e+05 ... 4.0...
  * y                                        (y) float64 2kB 2.765e+06 ... 2....
    mgrs:utm_zone                            int64 8B 15
    s2:product_type                          <U7 28B 'S2MSI2A'
    s2:processing_baseline                   <U5 20B '04.00'
    ...                                       ...
    raster:bands                             object 8B {'nodata': 0, 'data_ty...
    common_name                              (band) <U5 60B 'red' 'green' 'blue'
    center_wavelength                        (band) float64 24B 0.665 0.56 0.49
    full_width_half_max                      (band) float64 24B 0.038 ... 0.098
    epsg                                     int64 8B 5070
  * time                                     (time) datetime64[ns] 80B 2022-0...
Attributes:
    spec:        RasterSpec(epsg=5070, bounds=(373900, 2744600, 406400, 27654...
    crs:         epsg:5070
    transform:   | 100.00, 0.00, 373900.00|\n| 0.00,-100.00, 2765400.00|\n| 0...
    resolution:  100

Up until now, we have not processed any actual raster data! All of the operations have been carried out using the STAC item information and associated raster metadata from the source files. By working in this way, you can iterate very rapidly making sure that the dimensions of the output matches your expectation before you process any actual data.

Load the data into memory and take a look

You can keep going with an analysis 100% lazily until the very end, but this time I am just making a plot so we have reached the end of the lazy road. Calling the compute method will execute the lazily-evaluated operations that we queued up for flat_stack.

flat_stack = flat_stack.compute()

We can view the images easily with the plot method, giving us a glimpse into autumn on the North Shore of Lake Superior!

flat_stack.sel(band=BANDS).plot.imshow(
    col="time",
    col_wrap=4,
    rgb="band",
    robust=True,
    size=4,
    vmin=-0.1,
    vmax=-0.01,
    add_labels=False,
)

I remember it being cloudy last fall but damn we had a couple of nice days! Apparently September 28 was a really good one.

flat_stack.sel(band=BANDS, time="2022-09-28").plot.imshow(
    rgb="band",
    robust=True,
    size=5,
    vmin=-0.1,
    vmax=-0.01,
    add_labels=False,
)

Whats next?

This is really just scratching the surface of what you can do with xarray, I will try to cover some more advanced topics in later posts.