Overview
Teaching: 45 min Exercises: 0 minQuestions
How can I extract pixel values from rasters and perform computations?
How might I write pixel values out to a new raster file?
Objectives
Understand the basic components of a raster dataset and how to access them from a python program.
Perform numerical operations on pixel values.
Read from and write to raster datasets.
GDAL is a powerful and mature library for reading, writing and warping raster datasets, written in C++ with bindings to other languages. There are a variety of geospatial libraries available on the python package index, and almost all of them depend on GDAL. One such python library developed and supported by Mapbox, rasterio, builds on top of GDAL’s many features, but provides a more pythonic interface and supports many of the features and formats that GDAL supports. Both GDAL and rasterio are constantly being updated and improved: As of writing this tutorial (July 2018), GDAL is at version 2.3.1 and rasterio is at version 1.0.2.
When should you use GDAL directly?
When should you use rasterio
instead of GDAL?
When might these not be the best tools?
The Landsat program is the longest-running civilian satellite imagery program, with the first satellite launched in 1972 by the US Geological Survey. Landsat 8 is the latest satellite in this program, and was launched in 2013. Landsat observations are processed into “scenes”, each of which is approximately 183 km x 170 km, with a spatial resolution of 30 meters and a temporal resolution of 16 days. The duration of the landsat program makes it an attractive source of medium-scale imagery for land surface change analyses. Landsat is multiband imagery, you can read more about it here.
One reason we’ll use Landsat 8 for this demo is that the entire Landsat 8 archive is hosted by various commercial Cloud providers with free public access (AWS and Google Cloud)!
We start by importing all the python libraries we need in this tutorial:
A single Landsat8 scene is about 1 Gb in size since it contains a large array of data for each imagery band. Images are often grouped into a single .tar.gz file, which requires scientists to download the entire 1 Gb, even if you are just interested in a single band or subset of the image. To circumvent downloading, a new format has been proposed for Cloud storage called Cloud-optimized Geotiffs (COGs), to allow easy access to individual bands or even subsets of single images.
In a strict sense, the Landsat images on AWS and Google Cloud are not COGs (you can validate the format with online tools such as this - http://cog-validate.radiant.earth/html). Nevertheless, the files are stored as tiled Geotiffs, so many COG features still work.
You can use NASA’s Earthdata Search website to discover data. Landsat images are organized by ‘path’ and ‘row’. We’ve chosen a scenes from path 42, row 34, that doesn’t have many clouds present (LC08_L1TP_042034_20170616_20170629_01_T1). Note that ‘T1’ stands for ‘Tier 1’ (for analytic use), and ‘RT’ stands for ‘Real-time’, for which quality control is not as rigorous. Read more about the various Landsat formats and collections here. At first glance it seems that you can find this data on both AWS or Google Cloud, for example look at the band 4 image for the date we selected:
Or, the same file is available on Google Cloud storage:
Landsat on Google:
{'driver': 'GTiff', 'dtype': 'uint16', 'nodata': None, 'width': 7821, 'height': 7951, 'count': 1, 'crs': CRS({'init': 'epsg:32611'}), 'transform': Affine(30.0, 0.0, 204285.0, 0.0, -30.0, 4268115.0), 'blockxsize': 256, 'blockysize': 256, 'tiled': True, 'compress': 'lzw', 'interleave': 'band'}
Landsat on AWS:
{'driver': 'GTiff', 'dtype': 'uint16', 'nodata': None, 'width': 7821, 'height': 7951, 'count': 1, 'crs': CRS({'init': 'epsg:32611'}), 'transform': Affine(30.0, 0.0, 204285.0, 0.0, -30.0, 4268115.0), 'blockxsize': 512, 'blockysize': 512, 'tiled': True, 'compress': 'deflate', 'interleave': 'band'}
If you’re familiar with programming in python, you’ve probably seen context managers before. This context manager, rasterio.open() functions like the python standard library function open for opening files. The block of code within the with … as statement is executed once the file is opened, and the file is closed when the context manager exits. This means that we don’t have to manually close the raster file, as the context manager handles that for us.
Instead of a local file path, rasterio knows how to read URLs too, so we just passed the link to the file on AWS
src.profile is a collection of metadata for the file. We see that it is a Geotiff (Gtiff), the image values are unsigned integer format, nodata values are not assigned, the image has a dimensions of 7711x7531, is a single band, is in UTM coordinates, has a simple affine transformation, is chunked into smaller 512x512 arrays, tiled and compressed on the AWS hard drive where it is stored.
Note that we have not actually downloaded the image! We just read the metadata.
Did you catch the subtle differences between the data on AWS versus the data on Google? The image block sizes, or “tiling” is different, as is the compression scheme. It also turns out that AWS stores pre-made image overviews, but Google does not. It is important to realize that not all archives are the same! In fact the Google Cloud storage archive is a complete mirror of the full USGS archive, whereas as of writing this, AWS only has collection 1 scenes after 2017. In general, performance will be best if you are accessing data locally (so if you are running this notebook on AWS, use the landsat public data on AWS)! We are going to use the AWS archive for the rest of this tutorial because of the convenience of overviews:
Decimation factor= 81
array type: <class 'numpy.ndarray'>
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
Earlier we saw in the metadata that a no-data value wasn’t set, but pixels outside the imaged area are clearly set to “0”. Images commonly look like this because of satellite orbits and the fact that the Earth is rotating as imagery is acquired! The colormap is often improved if we change the out of bounds area to NaN. To do this we have to convert the datatype from uint16 to float32 (so be aware the array with NaNs will take 2x the storage space). The serrated edge is due to the coarse sampling of the full resolution image that we are doing here.
Another nice feature of COGs is that you can request a subset of the image and only that subset will be downloaded to your computer:
The Normalized Difference Vegetation Index is a simple indicator that can be used to assess whether the target includes healthy vegetation. This calculation uses two bands of a multispectral image dataset, the Red and Near-Infrared (NIR) bands:
\[NDVI = \frac{(NIR - Red)}{(NIR + Red)}\]For this tutorial, we’ll use the NIR and Red bands from a Landsat-8 scene above part of the central valley and the Sierra Nevada in California. We’ll be using Level 1TP datasets, orthorectified, map-projected images containing radiometrically calibrated data.
Because of the longevity of the landsat mission and because different sensors on the satellite record data at different resolutions, these bands are individually stored as single-band raster files. Some other rasters may store multiple bands in the same file.
So far, we have read in a cloud-optimized geotiff from the Cloud into our computer memory (RAM), and done a simple computation. What if we want to save this result locally for future use?
Since we have used a subsampled overview, we have to modify the orginal metadata before saving! In particular, the Affine matrix describing the coordinates is different (new resolution and extents, and we’ve changed the datatype). We’ll stick with Geotiff, but note that it is no longer cloud-optimized.
Be sure to check that the saved file looks the same:
Note that rasterio also has a ‘convenience method’ for plotting with georeferenced coordinates
Raster images really have two sets of coordinates. First, ‘image coordinates’ correspond to the row and column for a specific pixel. Second, the ‘spatial coordinates’ correspond to the location of each pixel on the surface of the Earth. Rasterio makes it convenient to use both coordinate systems.
Lets say you want the value of NDVI at a specific point in this scene. For example Fresno, CA (-119.770163586, 36.741997032). But the image is in UTM coordinates, so you have to first convert these points to UTM (Many websites will do this for you https://www.geoplaner.com), or you can use the pyproj library. Be warned that this only works if you read the dataset at it’s full resolution (if you do decimated or windowed reads of the file, you need to adjust the affine transform).
Fresno NDVI
-------
lon,lat= (-119.77,36.74)
easting,northing= (252664,4.06983e+06)
row,col= (734,179)
ndvi= 0.24
Interesting Feature
-------
row,col= (200,450)
easting,northing= (325920,4.21398e+06)
lon,lat= (-118.98,38.06)
ndvi= -0.10
Let’s take a look at the difference in NDVI between a scene in June 2013 and June 2017. If you went to the AWS Landsat Archive page, you probably noticed that it isn’t obvious how to search and discover images (most of the time you probably won’t know the row, path, or full URL of images over your area of interest!) There are many options for searching, graphical web applications like NASA’s Earthdata Search, or convenient Python tools like DevSeed’s sat-search, among many others! We won’t go into these tools here, but we encourage you to experiment with your own image scenes. Here is a file from June 2018 that was easy to find with Landsat for AWS using the Path and Row from the earlier URL:
https://landsatonaws.com/L8/042/034/LC08_L1TP_042034_20180619_20180703_01_T1
With the band 4 imagery URL:
And plot the results with matplotlib:
We just loaded 4 decimated Landsat 8 band images into memory and computed the difference in NDVI between two dates. That was relatively easy because these two rasters happen to use the same coordinate system and grid. This is know as ‘Analysis Ready Data’. In the geosciences, we commonly have data in WGS84 Lat, Lon. Or what if you want to compare to NDVI estimated from Sentinel-2 satellite acquisitions, which are on a different grid?
Rasterio can also be used for masking, reprojecting, and regridding distinct datasets. Here is one simple example to reproject our local NDVI onto a WGS84 Lat/Lon Grid. The example creates a ‘VRT’ file, which is merely a ASCII text file describing the transformation and other image metadata. The array values do no need to be duplicated, the VRT file just contains a reference to the local file (LC08_L1TP_042034_20170616_20170629_01_T1_NDVI_OVIEW.tif). This is extremely useful to avoid duplicating lots of data and filling up your computer!
The issue with VRTs is now you have two files and let’s say you want to share this image with a colleague. Also if you move things around on your computer, the paths might get mixed up, so in some cases it’s nice to save a complete reprojected file. The second code block does this, saving our NDVI as a Geotiff in WGS84:
This just scratches the surface of what you can do with rasterio. Check out the excellent official documentation as you continue exploring:
https://rasterio.readthedocs.io/en/latest
Key Points
Rasterio is built around the GDAL library (recall section 3), to facilitate raster operations in Python.
Pixel values of rasters can be extracted to a numpy array.
Computation is done in local memory on numpy arrays, then saved to the raster format of choice.