HLS: Your new favorite remote sensing dataset
NASA, remote sensing, Harmonized Landsat Sentinel, STAC, python, cloud-native geospatial
The Harmonized Landsat Sentinel 2 (HLS) dataset is an awesome public source for 30 meter spectral satellite imagery.
Produced by NASA, the HLS dataset combines the independently powerful Landsat 8/9 and Sentinel 2 satellite observation systems into one analysis-ready dataset.
Some of the best features include:
- A richer time series than either Sentinel 2 or Landsat individually
  - one observation every two to three days from HLS, compared to one every five days for Sentinel 2 and one every eight days for Landsat 8/9
- BRDF correction, which makes mosaics generated across orbits from different days nearly seamless
  - neither of the original Landsat or Sentinel 2 data sources has this feature
- The data are accessible via NASA's LPCLOUD STAC!
- Improved cloud masking for Sentinel 2!
  - the original Sentinel 2 cloud mask leaves something to be desired, but in the HLS collection you get a more robust mask that is consistent with the Landsat products
In this post, I am going to demonstrate some of the awesome capabilities of the HLS data and show why you should consider using it in place of either Landsat or Sentinel 2! Check out the HLS 2.0 User Guide for more details on the process for generating the HLS data.
Get organized
authentication with Earthdata and the .netrc file
To access the HLS data from NASA, you will need a (free) Earthdata Login profile. You can create an account here. Once your account is active, you will need to add your login credentials to a .netrc file in your home directory. You can do this with a single shell command on a Unix system:

```bash
echo "machine urs.earthdata.nasa.gov login <USERNAME> password <PASSWORD>" > ~/.netrc
```

You may also want to restrict the file's permissions with `chmod 600 ~/.netrc`. When GDAL reads the raster data using curl, it will find your credentials in this file.
environment variables
There are a handful of environment variables that must be set in order for GDAL to locate and read the raster files correctly. You can choose to include these in a startup script (e.g. ~/.bashrc) or set them within your Python script:

```bash
CPL_VSIL_CURL_USE_HEAD=FALSE
GDAL_DISABLE_READDIR_ON_OPEN=YES
GDAL_HTTP_COOKIEJAR=/tmp/cookies.txt
GDAL_HTTP_COOKIEFILE=/tmp/cookies.txt
```
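If you prefer the Python route, here is a minimal sketch using os.environ (set these before the first raster is opened so GDAL picks them up):

```python
import os

# set GDAL configuration before any rasters are opened
os.environ["CPL_VSIL_CURL_USE_HEAD"] = "FALSE"
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "YES"
os.environ["GDAL_HTTP_COOKIEJAR"] = "/tmp/cookies.txt"
os.environ["GDAL_HTTP_COOKIEFILE"] = "/tmp/cookies.txt"
```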
imports
Here are the python packages that we need:
```python
import datetime
import itertools
from collections import defaultdict

import pandas as pd
import pyproj
import pystac
import pystac_client
import stackstac
import xarray as xr
from rasterio.enums import Resampling
from shapely.geometry import box
from shapely.ops import transform
```
connect to the NASA STAC
The HLS data can be queried from the LPCLOUD STAC. We can use pystac_client to open a connection to the STAC:

```python
CMR_STAC_URL = "https://cmr.earthdata.nasa.gov/stac/LPCLOUD"

catalog = pystac_client.Client.open(CMR_STAC_URL)
```
The HLS data are stored in two separate collections:

- Landsat: HLSL30_2.0
- Sentinel: HLSS30_2.0

```python
HLS_COLLECTION_IDS = ["HLSL30_2.0", "HLSS30_2.0"]
```

There may be good reasons to keep them separate, but this split (and one other design choice that we will discuss later) makes it more tedious than it should be to get up and running with the HLS data!
area of interest
For this analysis we are going to focus on a section of the Kawishiwi River system on the edge of the Boundary Waters Canoe Area Wilderness in Northern Minnesota that I know as “the Little Triangle”.
```python
# my CRS of choice for CONUS these days
CRS_STRING = "epsg:5070"
EPSG = pyproj.CRS.from_string(CRS_STRING).to_epsg()

# bounding box that surrounds the Little Triangle
AOI = box(326000, 2771000, 337000, 2778000)

# STAC items store bounding box info in epsg:4326
transformer_4326 = pyproj.Transformer.from_crs(
    crs_from=CRS_STRING,
    crs_to="epsg:4326",
    always_xy=True,
)
bbox_4326 = transform(transformer_4326.transform, AOI).bounds
```
Explore the data
check out this time series!
We can query the STAC catalog for the entire time series by searching the catalog without a datetime specification.

```python
all_items = []
for collection_id in HLS_COLLECTION_IDS:
    hls_history_search = catalog.search(
        collections=collection_id,
        bbox=bbox_4326,
    )
    for page in hls_history_search.pages():
        all_items.extend(page.items)

collection_history = {
    collection: defaultdict(list) for collection in HLS_COLLECTION_IDS
}

for item in all_items:
    year_month = pd.Timestamp(item.datetime.date()) + pd.offsets.MonthEnd()
    entry = collection_history[item.collection_id][year_month]
    if (date := item.datetime.date()) not in entry:
        entry.append(date)
```
I am looping through pages() because the NASA STAC connection can be flaky and will sometimes return 502 errors when you iterate through a long list with items(). I don't know why the pages() method is more stable, but it seems to work better for large queries like this.
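If the 502s persist even with pages(), a small retry wrapper can smooth things over. Here is a minimal sketch (the pages_with_retries helper and its parameters are my own, not part of pystac_client):

```python
import time

def pages_with_retries(search, max_attempts=3, wait_seconds=5):
    # re-run the whole paginated request if the server errors out
    for attempt in range(1, max_attempts + 1):
        try:
            return [item for page in search.pages() for item in page.items]
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(wait_seconds)

all_items = []
for collection_id in HLS_COLLECTION_IDS:
    search = catalog.search(collections=collection_id, bbox=bbox_4326)
    all_items.extend(pages_with_retries(search))
```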
```python
# get count of images by year/month/sensor
collection_counts = pd.DataFrame(
    {
        collection: {
            year_month: len(dates) for year_month, dates in year_months.items()
        }
        for collection, year_months in collection_history.items()
    }
).fillna(0)

_ = collection_counts.plot.line(
    title="monthly count of days with observations by sensor"
)
```
You can see that the Landsat time series stretches back to Landsat 8's launch in 2013 and that the Sentinel 2 observations kick into gear in late 2020. The density of observations really ramped up in 2022 when Landsat 9 came online!

NASA is currently working on adding the full Sentinel 2 archive (back to 2015) to the HLS dataset. That project is scheduled to be complete in the Fall of 2023, according to this EarthData forum post.
band labels
One thing that makes it harder to get up and running with the HLS data is that the Sentinel and Landsat STAC items have different band IDs for many of the spectral bands! This matters to us here because stackstac uses the STAC items' asset labels to name the band dimension. Without any modifications, some of the bands would get scrambled when combining data from the two collections.
| Band name | OLI band number | MSI band number | HLS band code name L8 | HLS band code name S2 | Wavelength (micrometers) |
|---|---|---|---|---|---|
| Coastal Aerosol | 1 | 1 | B01 | B01 | 0.43 – 0.45* |
| Blue | 2 | 2 | B02 | B02 | 0.45 – 0.51* |
| Green | 3 | 3 | B03 | B03 | 0.53 – 0.59* |
| Red | 4 | 4 | B04 | B04 | 0.64 – 0.67* |
| Red-Edge 1 | – | 5 | – | B05 | 0.69 – 0.71** |
| Red-Edge 2 | – | 6 | – | B06 | 0.73 – 0.75** |
| Red-Edge 3 | – | 7 | – | B07 | 0.77 – 0.79** |
| NIR Broad | – | 8 | – | B08 | 0.78 – 0.88** |
| NIR Narrow | 5 | 8A | B05 | B8A | 0.85 – 0.88* |
| SWIR 1 | 6 | 11 | B06 | B11 | 1.57 – 1.65* |
| SWIR 2 | 7 | 12 | B07 | B12 | 2.11 – 2.29* |
| Water vapor | – | 9 | – | B09 | 0.93 – 0.95** |
| Cirrus | 9 | 10 | B09 | B10 | 1.36 – 1.38* |
| Thermal Infrared 1 | 10 | – | B10 | – | 10.60 – 11.19* |
| Thermal Infrared 2 | 11 | – | B11 | – | 11.50 – 12.51* |
To avoid this headache, we can define a crosswalk dictionary that makes it possible to modify the STAC item metadata so that we can safely load all of the data without scrambling the bands.

```python
BAND_CROSSWALK = {
    "HLSL30_2.0": {
        "B01": "coastal aerosol",
        "B02": "blue",
        "B03": "green",
        "B04": "red",
        "B05": "nir narrow",
        "B06": "swir 1",
        "B07": "swir 2",
        "B09": "cirrus",
        "B10": "thermal infrared 1",
        "B11": "thermal infrared 2",
    },
    "HLSS30_2.0": {
        "B01": "coastal aerosol",
        "B02": "blue",
        "B03": "green",
        "B04": "red",
        "B05": "red-edge 1",
        "B06": "red-edge 2",
        "B07": "red-edge 3",
        "B08": "nir broad",
        "B8A": "nir narrow",
        "B09": "water vapor",
        "B10": "cirrus",
        "B11": "swir 1",
        "B12": "swir 2",
    },
}

# these are the ones that we are going to use
BANDS = ["red", "green", "blue", "Fmask"]
```
Make a cloud-free mosaic
search the STAC
Now we are going to search the STAC again, but this time with a datetime specification to limit our search to July 2022.

```python
START_DATE = "2022-07-01"
END_DATE = "2022-07-31"

stac_items = catalog.search(
    collections=HLS_COLLECTION_IDS,
    bbox=bbox_4326,
    datetime=[START_DATE, END_DATE],
).item_collection()
```
modify the asset labels
We can use the crosswalk to rename the STAC item asset labels, paving the way for stackstac to load the data in the most convenient and straightforward way. There are ways to do this after you load the data using xarray.DataArray.assign_coords, but I find the STAC item metadata alteration approach more intuitive.

```python
for item in stac_items:
    for original_band, new_band in BAND_CROSSWALK.get(item.collection_id).items():
        item.assets[new_band] = item.assets.pop(original_band)
```
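For comparison, the assign_coords route would look roughly like this sketch (my own illustration; it only works cleanly one collection at a time, since the band codes collide across collections, which is part of why I prefer editing the metadata up front):

```python
# hypothetical alternative: rename the band coordinate after loading,
# assuming `landsat_only_stack` was stacked from HLSL30_2.0 items only
landsat_only_stack = landsat_only_stack.assign_coords(
    band=[
        BAND_CROSSWALK["HLSL30_2.0"].get(code, code)
        for code in landsat_only_stack.band.values
    ]
)
```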
load into a DataArray
Use stackstac.stack to load the STAC items into a neatly packaged xarray.DataArray:

```python
hls_stack_raw = stackstac.stack(
    stac_items,
    assets=BANDS,
    bounds=AOI.bounds,
    epsg=EPSG,
    resolution=30,
    xy_coords="center",
)
```
cloud/shadow masking
The Fmask band contains all of the pixel quality information that we need to mask out invalid pixels from the images, but it takes a little bit of work. The quality information is stored in integer form, but we can work with it using some binary tricks!
| Bit number | Mask name | Bit value | Mask description |
|---|---|---|---|
| 6-7 | aerosol level | 11 | High aerosol |
| | | 10 | Moderate aerosol |
| | | 01 | Low aerosol |
| | | 00 | Climatology aerosol |
| 5 | Water | 1 | Yes |
| | | 0 | No |
| 4 | Snow/ice | 1 | Yes |
| | | 0 | No |
| 3 | Cloud shadow | 1 | Yes |
| | | 0 | No |
| 2 | Adjacent to cloud/shadow | 1 | Yes |
| | | 0 | No |
| 1 | Cloud | 1 | Yes |
| | | 0 | No |
| 0 | Cirrus | – | Reserved, but not used |
We can use the 'bitwise OR' operator | and the 'left shift' operator << to construct an integer representation of invalid pixels. For the cloud-free mosaic, we want to exclude any pixel that is marked (bit value = 1) as a cloud (bit 1), adjacent to a cloud or shadow (bit 2), or a cloud shadow (bit 3).

```python
hls_mask_bitfields = [1, 2, 3]  # cloud, adjacent to cloud/shadow, cloud shadow
hls_bitmask = 0
for field in hls_mask_bitfields:
    hls_bitmask |= 1 << field

print(hls_bitmask)
```

```
14
```
If we cast the result to a binary string you can see how that works:

```python
format(hls_bitmask, "08b")
```

```
'00001110'
```

When you translate the integer 14 into binary, you get the ultimate invalid pixel!
Next, we use 'bitwise AND' to identify pixels that have an invalid value in any of the bits that we care about. Any integer value that has the invalid value (1) in any of the specified bits (1, 2, and 3) will return a non-zero value when compared to 14 with the & operator.

For example, a pixel where all bit values are zero will return zero because none of its bits are shared with our invalid mask value (14, a.k.a. 00001110):

```python
int("00000000", 2) & 14
```

```
0
```

And a pixel that is marked as adjacent to a cloud or shadow (bit 2 = 1) will return 4 (2^2) because bit 2 is set:

```python
int("00000100", 2) & 14
```

```
4
```
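To make the bit arithmetic concrete, here is a tiny helper (a hypothetical describe_fmask of my own, not part of any HLS tooling) that names every single-bit flag set in an Fmask value:

```python
# single-bit Fmask flags from the table above (aerosol bits 6-7 omitted)
FMASK_FLAGS = {
    1: "cloud",
    2: "adjacent to cloud/shadow",
    3: "cloud shadow",
    4: "snow/ice",
    5: "water",
}

def describe_fmask(value):
    # return the names of all single-bit flags set in an Fmask value
    return [name for bit, name in FMASK_FLAGS.items() if value & (1 << bit)]

print(describe_fmask(int("00000100", 2)))  # ['adjacent to cloud/shadow']
```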
Fortunately for us, we can apply the bitwise operators to integer arrays! We can classify all pixels as either good or bad like this:

```python
fmask = hls_stack_raw.sel(band="Fmask").astype("uint16")
hls_bad = fmask & hls_bitmask
```

Then we can set all invalid pixels (hls_bad != 0) to NaN like this:

```python
# mask pixels where any one of those bits are set
hls_masked = hls_stack_raw.where(hls_bad == 0)
```
get the cloud-free mosaic
After we have eliminated the invalid pixels, it is very easy to generate a cloud-free mosaic for a specific time range (e.g. monthly). For example, we can take the median value from all non-cloud/non-cloud-shadow pixels within a given month with a single line of code.

```python
hls_cloud_free = hls_masked.resample(time="1M").median(skipna=True).compute()
```
We are only looking at a single month here, but you could pass many months' worth of STAC items to stackstac.stack and get a rich time series of cloud-free imagery using the exact same method.
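For instance, a full-year monthly series would only require a wider search window. A rough sketch, with the intermediate masking steps elided (dates are illustrative):

```python
# hypothetical: search a full year instead of a single month
yearly_items = catalog.search(
    collections=HLS_COLLECTION_IDS,
    bbox=bbox_4326,
    datetime=["2022-01-01", "2022-12-31"],
).item_collection()

# apply the same band crosswalk, stackstac.stack, and Fmask masking as above,
# then take one median mosaic per month:
# monthly_mosaics = yearly_masked.resample(time="1M").median(skipna=True)
```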
Plot the image and bask in the cloud-free beauty:
```python
_ = (
    hls_cloud_free.squeeze(dim="time")
    .sel(band=["red", "green", "blue"])
    .plot.imshow(
        rgb="band",
        robust=True,
        size=6,
        add_labels=False,
        xticks=[],
        yticks=[],
    )
)
```
Compare to Sentinel 2 L2A and Landsat Collection 2 Level-2
The Landsat Collection 2 Level-2 and Sentinel 2 Level-2A datasets are available in Microsoft's Planetary Computer STAC. They are available in other STACs, too (e.g. Element84's Earth Search catalog), but the Planetary Computer catalog is really nice: the data are well-documented and many of the available collections have examples attached.
```python
import planetary_computer

planetary_computer_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

sentinel_stac_items = planetary_computer_catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=bbox_4326,
    datetime=[START_DATE, END_DATE],
).get_all_items()

sentinel_band_crosswalk = {
    "B04": "red",
    "B03": "green",
    "B02": "blue",
}

for item in sentinel_stac_items:
    for original_band, new_band in sentinel_band_crosswalk.items():
        item.assets[new_band] = item.assets.pop(original_band)

sentinel_stack = stackstac.stack(
    items=sentinel_stac_items,
    assets=["red", "green", "blue", "SCL"],
    epsg=EPSG,
    resolution=30,
    bounds=AOI.bounds,
    xy_coords="center",
    resampling=Resampling.bilinear,
)

# valid SCL values: 4: vegetation, 5: bare soils, 6: water, 11: snow or ice
# https://docs.digitalearthafrica.org/en/latest/data_specs/Sentinel-2_Level-2A_specs.html
valid_scl = [4, 5, 6, 11]

sentinel_cloud_free = (
    sentinel_stack.where(sentinel_stack.sel(band="SCL").isin(valid_scl))
    .sel(band=["red", "green", "blue"])
    .resample(time="1M")
    .median(skipna=True)
    .compute()
)

landsat_stac_items = planetary_computer_catalog.search(
    collections=["landsat-c2-l2"],
    bbox=bbox_4326,
    datetime=[START_DATE, END_DATE],
    query={"platform": {"neq": "landsat-7"}},  # skip Landsat 7 because stripes
).get_all_items()

landsat_stack = stackstac.stack(
    items=landsat_stac_items,
    assets=["red", "green", "blue", "qa_pixel"],
    epsg=EPSG,
    resolution=30,
    bounds=AOI.bounds,
    xy_coords="center",
    band_coords=False,
)

# following https://stackstac.readthedocs.io/en/latest/examples/gif.html#Mask-cloudy-pixels-using-the-QA-band:
# make a bitmask---when we bitwise-and it with the data, it leaves just the 4 bits we care about
mask_bitfields = [1, 2, 3, 4]  # dilated cloud, cirrus, cloud, cloud shadow
bitmask = 0
for field in mask_bitfields:
    bitmask |= 1 << field

landsat_qa = landsat_stack.sel(band="qa_pixel").astype("uint16")
landsat_bad = landsat_qa & bitmask  # just look at those 4 bits

landsat_good = landsat_stack.where(
    landsat_bad == 0  # mask pixels where any one of those bits are set
)

landsat_cloud_free = (
    landsat_good.sel(band=["red", "green", "blue"])
    .resample(time="1M")
    .median(skipna=True)
    .compute()
)
```
Both of the original datasets are atmospherically corrected and represent surface reflectance, but the processed images are definitely not free of cloud or orbit artifacts! One of the greatest advantages of using the HLS dataset instead of the original sources is improved cloud masking. The HLS data undergo an entirely different cloud masking process than the Landsat Collection 2 Level-2 and Sentinel 2 Level-2A datasets; see the HLS 2.0 User Guide for an outline of the processing workflow.
These images show 30 meter resolution, cloud-free mosaics for July 2022 from each source:
In the Sentinel 2 image you can clearly see some cloud haze scattered around most of the image. The Landsat image shows patchy artifacts that are probably attributable to hazy clouds that were not masked from the original images. Meanwhile, the HLS mosaic comes out crystal clear!
Analyze the HLS Landsat and Sentinel collections separately
To illustrate the power of the full HLS dataset and the advantage of the high density of collections in a short time window, let's load the imagery from each sensor separately and create a cloud-free mosaic for July 2022 for each one.
get an array for each sensor
```python
items_by_collection = {
    collection: pystac.ItemCollection(
        [item for item in stac_items if item.collection_id == collection]
    )
    for collection in HLS_COLLECTION_IDS
}
```
```python
stacks_by_collection = {}

# a function to flatten the time dimension
def flatten(x, dim="time"):
    assert isinstance(x, xr.DataArray)
    if len(x[dim].values) > len(set(x[dim].values)):
        x = x.groupby(dim).map(stackstac.mosaic)

    return x

for collection, collection_items in items_by_collection.items():
    raw_stack = stackstac.stack(
        collection_items,
        assets=BANDS,
        bounds=AOI.bounds,
        epsg=EPSG,
        resolution=30,
        xy_coords="center",
    )

    stacks_by_collection[collection] = flatten(raw_stack, dim="time")
```
We can combine them along a new dimension called sensor to make it easy to look at the images with respect to sensor and time.

```python
sensor_stack = xr.concat(
    list(stacks_by_collection.values()),
    dim=pd.Index(["Landsat", "Sentinel"], name="sensor"),
)
sensor_stack
```
```
<xarray.DataArray 'stackstac-197185f7fbca466629e6ec490ea8b1b1' (sensor: 2, time: 16, band: 4, y: 234, x: 368)> Size: 88MB
dask.array<concatenate, shape=(2, 16, 4, 234, 368), dtype=float64, chunksize=(1, 1, 1, 234, 368), chunktype=numpy.ndarray>
Coordinates:
  * band            (band) <U5 80B 'red' 'green' 'blue' 'Fmask'
  * x               (x) float64 3kB 3.26e+05 3.26e+05 ... 3.37e+05 3.37e+05
  * y               (y) float64 2kB 2.778e+06 2.778e+06 ... 2.771e+06 2.771e+06
  * time            (time) datetime64[ns] 128B 2022-07-03T17:20:28.306000 ...
    epsg            int64 8B 5070
    id              (time) object 128B 'HLS.S30.T15UXP.2022184T170901.v2.0' ...
    end_datetime    (time) object 128B '2022-07-03T17:20:28.306Z' ...
    eo:cloud_cover  (time) float64 128B 83.0 94.0 98.0 nan ... 1.0 nan 65.0 99.0
    start_datetime  (time) object 128B '2022-07-03T17:20:28.306Z' ...
  * sensor          (sensor) object 16B 'Landsat' 'Sentinel'
Attributes:
    spec:        RasterSpec(epsg=5070, bounds=(325980, 2770980, 337020, 27780...
    crs:         epsg:5070
    transform:   | 30.00, 0.00, 325980.00|\n| 0.00,-30.00, 2778000.00|\n| 0.0...
    resolution:  30
```
Since we are going to make several plots using this array, load the data into memory with a call to the .compute() method.

```python
sensor_stack = sensor_stack.compute()
```
The HLS data are organized by 100 km MGRS tiles, and there can be redundant coverage by separate STAC items. We can clean this up by flattening the data out to have one observation per sensor per day.
```python
# a function to prepare the arrays for better plotting
def flatten_by_day(x):
    return (
        x.assign_coords(time=x.time.astype("datetime64[D]"))
        .groupby("time")
        .map(stackstac.mosaic)
    )

flattened_by_day = xr.concat(
    [
        flatten_by_day(sensor_stack.sel(sensor=sensor))
        for sensor in ["Landsat", "Sentinel"]
    ],
    dim="sensor",
)
```
When we plot the images we only have one observation per day per sensor.
```python
_ = flattened_by_day.sel(band=["red", "green", "blue"]).plot.imshow(
    col="sensor",
    row="time",
    rgb="band",
    robust=True,
    size=5,
    vmin=0,
    vmax=800,
    add_labels=False,
    xticks=[],
    yticks=[],
)
```
Check out the valid/invalid binary values for each sensor/date:
= flattened_by_day.sel(band="red").isnull() == False
has_data = flattened_by_day.sel(band="Fmask").astype("uint16")
fmask = (fmask & hls_bitmask).where(has_data > 0)
hls_bad
= (
_ > 0)
(hls_bad > 0)
.where(has_data
.plot.imshow(="sensor",
col="time",
row=4,
size=False,
add_labels=[],
xticks=[],
yticks
) )
For illustration purposes, make a "combined" sensor array to add to sensor_stack:

```python
landsat = (
    sensor_stack.sel(sensor="Landsat")
    .assign_coords(sensor="combined")
    .expand_dims("sensor", 1)
)
sentinel = (
    sensor_stack.sel(sensor="Sentinel")
    .assign_coords(sensor="combined")
    .expand_dims("sensor", 1)
)
combined_stack = landsat.combine_first(sentinel)

full_stack = xr.concat([sensor_stack, combined_stack], dim="sensor")
```
Would we have at least one valid observation for every pixel?
```python
# is there a non-na spectral observation?
has_data = full_stack.sel(band="red") > 0

# is the pixel invalid by our Fmask criteria?
fmask = full_stack.sel(band="Fmask").astype("uint16")
is_bad = fmask & hls_bitmask

# if a pixel is not invalid and has a non-na value, it is valid
valid = (is_bad == 0).astype("float64").where(has_data > 0)

# are there any valid pixels in the time series?
has_valid = (valid > 0).any(dim="time")
_ = has_valid.plot.imshow(
    col="sensor",
    size=4,
    add_labels=False,
    vmin=0,
    vmax=1,
    xticks=[],
    yticks=[],
)
```
Not if we only use Landsat images, and not if we only use Sentinel images, but we have nearly full coverage if we use the combined set!

```python
(has_valid < 1).sum(dim=["x", "y"])
```

```
<xarray.DataArray 'stackstac-197185f7fbca466629e6ec490ea8b1b1' (sensor: 3)> Size: 24B
array([1129, 1619,   94])
Coordinates:
    epsg     int64 8B 5070
  * sensor   (sensor) object 24B 'Landsat' 'Sentinel' 'combined'
```
By combining the sensors we reduce the number of pixels with no valid observations from well over a thousand down to fewer than one hundred!
Generate cloud-free mosaics for each sensor and the combined set side-by-side.

```python
sensor_cloud_free = full_stack.where(is_bad == 0).median(dim="time", skipna=True)

_ = sensor_cloud_free.sel(band=["red", "green", "blue"]).plot.imshow(
    col="sensor",
    rgb="band",
    robust=True,
    size=4,
    vmin=0,
    vmax=800,
    add_labels=False,
    xticks=[],
    yticks=[],
)
```
This demonstrates that combining the time series of observations from both sensors gives you better ingredients for cloud-free mosaic products than either sensor provides on its own.