NASA S3 compute location test
spoiler: do everything in us-west-2!
NASA, AWS, S3, STAC, python, cloud-native geospatial
One of the main principles of the cloud-native geospatial movement is bringing compute operations as “close” to the raw data as possible. Since the storage location of most data is out of your control, it is worth your time to pick a compute environment that will minimize the time spent reading data from cloud storage!
I tested the run time performance of a basic raster read operation from a handful of compute environments so you can see what difference it makes:
Key Takeaways
Spend some time finding out which region the data are stored in, and do your compute on machines located in the same region! Read operations on NASA’s Earthdata catalog are 2x faster if you run them in the same region as the storage bucket (us-west-2) than if you run them in a different region (e.g. us-east-1)!
You may get a slight performance boost by accessing the data directly from S3 URIs (e.g. s3://lp-prod-protected/path/to/B04.tif) instead of the https links (e.g. https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/path/to/B04.tif), but it doesn’t make a huge difference.
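Side note: if you are not sure which region a bucket lives in, you can usually check without any credentials, because S3 reports the bucket’s region in a response header even when the request itself is denied. A minimal sketch, using the lp-prod-protected bucket from the example above:

import requests

# S3 includes an x-amz-bucket-region header on responses from the bucket
# endpoint, even on a 403 for a bucket you can't read
response = requests.head("https://lp-prod-protected.s3.amazonaws.com")
print(response.headers.get("x-amz-bucket-region"))  # us-west-2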
Test results
nasa_s3_test.py (full script below) queries the Harmonized Landsat and Sentinel-2 (HLS) STAC collection and reads a year’s worth of data for a small area.
local
Running this on my laptop takes about one minute, which doesn’t seem too bad for reading a whole year’s worth of data!
$ python nasa_s3_test.py --method="default"
> average run time: 56.97
us-east-1
The read operation takes only 16 seconds if we run it on an EC2 instance in us-east-1 - that’s fast, right?
$ python nasa_s3_test.py --method="default"
> average run time: 16.06
us-west-2
Running it in the native region for the raster data brings it down to 8 seconds!
$ python nasa_s3_test.py --method="default"
> average run time: 8.08
us-west-2 with S3 URIs
NASA suggests that reading data using the S3 URIs might give you a significant performance boost, but in this test the speed advantage is relatively minor (about a third of a second: 8.08 vs. 7.74). It takes a little more work to get the S3 credentials and then modify the hrefs for the assets, but it’s not very hard. Check out this tutorial from NASA for more context.
$ python nasa_s3_test.py --method="direct_from_s3"
> direct_from_s3 average run time: 7.74
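One assumption the script makes: your Earthdata Login credentials live in ~/.netrc with one key/value pair per line, since the parser in get_nasa_s3_creds splits each line on a single space. Something like this, with placeholder values:

machine urs.earthdata.nasa.gov
login your_earthdata_username
password your_earthdata_password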
Test script:
import os
import timeit
import click
import pystac_client
import rasterio
import requests
import stackstac
EPSG = 5070
HLS_URL_PREFIX = "https://data.lpdaac.earthdatacloud.nasa.gov/"
BBOX = (-121.8238, 38.4921, -121.6018, 38.6671)
COLLECTION = "HLSL30.v2.0"
START_DATE = "2022-04-01"
END_DATE = "2023-03-31"
ASSETS = ["B04", "B03", "B02"]
RESOLUTION = 30
N_ITERATIONS = 10
def default(stac_items):
"""Read the rasters using the links in the STAC item metadata"""
    stackstac.stack(
        stac_items,
        assets=ASSETS,
        bounds_latlon=BBOX,
        epsg=EPSG,
        resolution=RESOLUTION,
        xy_coords="center",
    ).compute()
def get_nasa_s3_creds():
# get username/password from netrc file
    netrc_creds = {}
    with open(os.path.expanduser("~/.netrc")) as f:
        for line in f:
            key, value = line.strip().split(" ")
            netrc_creds[key] = value

    # request AWS credentials for direct read access
    url = requests.get(
        "https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials",
        allow_redirects=False,
    ).headers["Location"]
    raw_creds = requests.get(
        url, auth=(netrc_creds["login"], netrc_creds["password"])
    ).json()

    return dict(
        aws_access_key_id=raw_creds["accessKeyId"],
        aws_secret_access_key=raw_creds["secretAccessKey"],
        aws_session_token=raw_creds["sessionToken"],
        region_name="us-west-2",
    )
def direct_from_s3(stac_items, nasa_creds):
"""Read the rasters using S3 URIs rather than the links in the STAC item
metadata
"""
# replace https:// prefixes with s3:// so rasterio will read directly from S3
    for item in stac_items:
        for asset in item.assets.values():
            if asset.href.startswith(HLS_URL_PREFIX):
                asset.href = asset.href.replace(HLS_URL_PREFIX, "s3://")

    with rasterio.Env(session=rasterio.session.AWSSession(**nasa_creds)) as env:
        stackstac.stack(
            stac_items,
            assets=["B04", "B03", "B02"],
            bounds_latlon=BBOX,
            epsg=EPSG,
            resolution=30,
            xy_coords="center",
            gdal_env=stackstac.DEFAULT_GDAL_ENV.updated(
                always=dict(session=env.session)
            ),
        ).compute()
@click.command()
@click.option("--method")
def run(method):
# find STAC items
    catalog = pystac_client.Client.open("https://cmr.earthdata.nasa.gov/stac/LPCLOUD")
    stac_items = catalog.search(
        collections=[COLLECTION],
        bbox=BBOX,
        datetime=[START_DATE, END_DATE],
    ).item_collection()

    func = None
    if method == "default":
        func = default
        kwargs = {}
    elif method == "direct_from_s3":
        func = direct_from_s3
        kwargs = {"nasa_creds": get_nasa_s3_creds()}
    assert func

    run_time = timeit.timeit(lambda: func(stac_items, **kwargs), number=N_ITERATIONS)
    print(f"average run time: {run_time / N_ITERATIONS}")
if __name__ == "__main__":
run()