Common raster operations#

When working with raster data, there are various operations and techniques that might be useful when preprocessing the data for further analysis. Some typical raster operations include for example selecting, clipping or masking raster to include only pixels that are located in a given area or at given indices, merging multiple raster tiles into a single raster mosaic, converting raster data into vector format (and vice versa), or resampling the raster pixels into higher or lower spatial resolution (upscaling, downscaling). In this chapter, we will learn how to conduct these kind of raster operations using xarray, rioxarray, geocube and rasterio Python libraries.

Selecting data#

Being able to select data based on indices or certain criteria (e.g. all values above specific threshold) is one of the most common operations that you need to do when working with any kind of data (raster, vector, as well as non-geographical data). When working with raster data, we can take advantage of xarray’s flexible indexing routines that combine the best features of numpy and pandas for data selection.

There are various ways how you can select data from DataArray and Dataset, but some of the most commonly used methods include .isel(), .sel() and .where() which we will introduce in the following. Let’s start by reading the elevation dataset that we used earlier in Chapter 7.2:

import xarray as xr
import matplotlib.pyplot as plt

fp = "data/temp/kilimanjaro_dataset.nc"

data = xr.open_dataset(fp, decode_coords="all")
data
<xarray.Dataset> Size: 156MB
Dimensions:          (y: 3601, x: 3601)
Coordinates:
  * x                (x) float64 29kB 36.0 36.0 36.0 36.0 ... 37.0 37.0 37.0
  * y                (y) float64 29kB -2.0 -2.0 -2.001 ... -2.999 -3.0 -3.0
    spatial_ref      int32 4B ...
Data variables:
    elevation        (y, x) float64 104MB ...
    relative_height  (y, x) float32 52MB ...

Let’s also plot the data to see how our values and coordinates (x and y) distribute over a map:

data["elevation"].plot()
plt.title("Elevation in meters");
../../../_images/1815e154059a451d9a4a83ff09cc1ed29b9b3a951633c68e2be6cd6fa917c2c6.png

Figure 7.7. Elevation of the landscape in North-Eastern Tanzania.

Selecting data by index values#

The logic for selecting data from a DataArray values works quite similarly as we have learned earlier in this book using pandas (for Series and DataFrame objects), except that the returned object in xarray is always another DataArray. To select data based on specific index values, we can take advantage of the .isel() method that can be used similarly with Dataset and DataArray objects. For instance, we might be interested to select a cell at index [0, 0] in the array which corresponds to the cell that is located at the very top-left corner of the raster. To do this, we specify that the index value for both x and y dimension is 0 when calling the .isel() method:

selected_cell = data.isel(x=0, y=0)
selected_cell
<xarray.Dataset> Size: 32B
Dimensions:          ()
Coordinates:
    x                float64 8B 36.0
    y                float64 8B -2.0
    spatial_ref      int32 4B ...
Data variables:
    elevation        float64 8B ...
    relative_height  float32 4B ...

As we can see, the output is always returned as an xarray object, which in our case is a Dataset. To access the elevation or relative_height value at this given position, we can use the .item() which will return the value as a regular number (as learned in Chapter 7.2):

selected_cell["elevation"].item()
826.0
selected_cell["relative_height"].item()
258.0

Quite often when working with raster data, you are interested to select a range of values from a given raster (called slicing). To slice the data based on index ranges, we can use the same .isel() method and provide the information about the index range using the Python’s built-in slice() function that can be used to specify a range of indices to be included in the selection by specifying start and end index positions. In the following, we will select the first 1000 cells in the x and y dimensions:

selection = data.isel(x=slice(0, 1000), y=slice(0, 1000))
selection
<xarray.Dataset> Size: 12MB
Dimensions:          (y: 1000, x: 1000)
Coordinates:
  * x                (x) float64 8kB 36.0 36.0 36.0 36.0 ... 36.28 36.28 36.28
  * y                (y) float64 8kB -2.0 -2.0 -2.001 ... -2.277 -2.277 -2.277
    spatial_ref      int32 4B ...
Data variables:
    elevation        (y, x) float64 8MB ...
    relative_height  (y, x) float32 4MB ...
selection["elevation"].plot()
plt.title("Elevation in meters for the selected cells");
../../../_images/1b6f638e2e687dd6597180a06167c7199670983012c9e630d0e101ca32840763.png

Figure 7.8. Elevation values in the cells that were selected based on indices.

As we can see, now the shape of our output Dataset is 1000x1000 cells and the selection covers to top-left corner of our input Dataset.

Selecting data based on coordinates#

While selecting data based on index positions might be useful in certain situations, quite often you want to select data based on coordinates. To do this, we can take advantage of the coordinate labels using .sel() method, which allows for more intuitive selection by providing the actual coordinates for you area of interest. Next, we will do the selection for an area that covers the following area:

  • x-coordinates: 36.4 - 36.6

  • y-coordinates: -2.4 - -2.6

area_of_interest = data.sel(x=slice(36.4, 36.6), y=slice(-2.4, -2.6))
area_of_interest["elevation"].plot()
plt.title("Elevation in meters for the selected area of interest");
../../../_images/85df5ddb4dee494e5723eb56a5af8b7b3a7db8ef0598175dc27d43c0976fbcf3.png

Figure 7.9. Elevation values in the cells that were selected based on coordinates.

Nice! Now we have selected the data for a given area of interest that we specified using the coordinates. In a similar manner, you can select data for any specific area of interest. Couple of important things to remember when selecting data based on coordinates: 1) The coordinate values provided for the selection need to be within the extent of a given input raster dataset. 2) The coordinates provided for the selection need to be in the same coordinate reference system (CRS) as the input raster. For example, here we provided the coordinates as Decimal Degrees because our input data contains coordinates as latitudes and longitudes. However, if your coordinates are represented e.g. in a metric coordinate reference system (e.g. UTM), then your selection should be based on coordinates in a metric system.

Selecting data based on logical conditions#

A third way to select data from a given raster is based on a specific criteria. To select data based on specific conditions, we can use the .where() method which is a handy tool to make conditional selections. For instance, we might be interested to select only such pixels from the raster where the elevation is above 2000 meters:

moderate_altitudes = data.where(data["elevation"] > 2000)
moderate_altitudes["elevation"].plot()
plt.title("Moderate altitude areas in the region");
../../../_images/90671af4bd69ec9fe8ee9cb0cc4231e470c34efa51e41c21cd6f8cd47c58789e.png

Figure 7.10. Elevation values in the cells that were above a specific criteria (2000 meters).

As a result, we now have a map that highlights the areas where the elevation is above 2 kilometers. In a similar manner, we can also combine multiple conditions to a single selection. In the following, we will select all pixels where the elevation is above 1000 meters and below or equal to 1500 meters:

condition = (data["elevation"] > 1000) & (data["elevation"] <= 1500)
low_altitudes = data.where(condition)
low_altitudes["elevation"].plot()
plt.title("Low altitude areas in the region");
../../../_images/d14d4f40705c13e5322ace065edb4faa54e81d7edfe65e9611a68578b74d9f86.png

Figure 7.11. Elevation values in the cells that are between 1000-1500 meters.

Excellent, now we have a map that highlights the elevations between specific elevation range. In a similar manner, you can conduct selections by combining selection criteria over a single data variable or even combine conditions over multiple different variables.

Clipping raster#

In the previous section we learned how to select data based on given criteria or by specifying an area of interest based on coordinates. However, quite often you actually have another geographic dataset that should be used to crop or select the data for your analysis. Clipping is one of the common raster operations in which you clip a given raster Dataset by another layer. This process allows you to crop the data in a way that only the cells that are e.g. within a given Polygon are selected for further analysis. In the following, we will learn how to do this by using the .rio.clip() method that comes with the rioxarray library.

To be able to clip this xarray.Dataset, we first need to create a geopandas.GeoDataFrame that contains the geometry that we want to use as our clipping features. In our case, we want to create a simple bounding box that defines the area which we want to keep from the raster. To create the GeoDataFrame, we can specify the corner coordinates of our bounding box and utilize the box function of shapely library which can conveniently create us the geometry (as introduced in Chapter 6.1):

import geopandas as gpd
from shapely.geometry import box

# Bounding box coordinates
minx = 36.1
miny = -3
maxx = 36.3
maxy = -2.77

# Create a GeoDataFrame that will be used to clip the raster
bbox_geometry = box(minx, miny, maxx, maxy)
clipping_gdf = gpd.GeoDataFrame(geometry=[bbox_geometry], crs="epsg:4326")

# Explore the extent on a map
clipping_gdf.explore()
Make this Notebook Trusted to load map: File -> Trust Notebook

Figure 7.11. Our area of interest around the Mt Kitumbene in Tanzania which will be used to clip the raster dataset.

Now after we have created the GeoDataFrame we can use it to clip the xarray.Dataset. To do this, we use the .rio.clip() method which wants as input the geometries that will be used for clipping the raster data. We can pass the gpd.GeoSeries as an input for this (i.e. the geometry column of our GeoDataFrame) and we also specify the crs to be the same as in out input raster data. It is important that the coordinate reference system of both layers are the same whenever doing GIS operations between multiple layers. Thus, we use a simple assert to check the match before doing the clipping:

# Check that the CRS matches between layers (only continues if True)
assert clipping_gdf.crs == data.elevation.rio.crs

# Clip the raster
kitumbene = data.rio.clip(geometries=clipping_gdf.geometry, crs=data.elevation.rio.crs)

Perfect! Now we have succesfully clipped the raster with our bounding box. Let’s make a map out of our results to see how our data looks like:

kitumbene["elevation"].plot()
plt.title("Elevations around Mt Kitumbene");
../../../_images/0b57dd16c8674d85fbc2d798222ff4cf44c16173c68fe542208338d7b66bd56c.png

Figure 7.12. Elevations in the area that was clipped based on a polygon.

After clipping, it is possible to continue working with the clipped Dataset and e.g. find the mean elevation for this area around the Mt Kitumbene. Notice that the operations clipped all the variables in our Dataset simultaneously. We can extract basic statistics from elevation and inspect the relative height in this area as follows:

# Mean elevation
kitumbene["elevation"].mean().item()
1467.7473111245304
# Update the relative height considering the minimum elevation on this area
kitumbene["relative_height"] = kitumbene["elevation"] - kitumbene["elevation"].min()

# Mean of the relative height
kitumbene["relative_height"].max().item()
2109.0

We can see that the mean elevation in this area is approximately 1470 meters while the maximum relative height is ~2100 meters. We needed to recalculate the relative height because the baseline minimum elevation in the landscape changed significantly after the clipping operation.

Masking a raster#

Another commonly used approach when working with raster data is to mask the data based on certain criteria or using another geographical layer as a mask. One common reasons for doing this is for example to exclude water bodies (e.g. lakes or oceans) from the terrain when doing specific analyses. In the following, we will continue working with the same elevation data and mask out the lakes from our raster dataset that exist in our study area.

Let’s start by downloading data from OpenStreetMap (OSM) about existing lakes in our study area. To do this, we first extract the bounds of our raster Dataset and then use osmnx library to fetch all OSM elements that have been tagged with key "water" and "lake" (read more about osmnx from Chapter 9.1):

import osmnx as ox

# Extract the bounding box based on the extent of the raster
data_bounds_geom = box(*data["elevation"].rio.bounds())

# Retrieve lakes from the given area
lakes = ox.features_from_polygon(data_bounds_geom, tags={"water": ["lake"]})

# Plot the raster and lakes on top of each other
fig, ax = plt.subplots(figsize=(12, 8))
data["elevation"].plot(ax=ax)
lakes.plot(ax=ax, facecolor="lightblue", edgecolor="red", alpha=0.4);
/Users/tenkanh2/micromamba/envs/python-gis-book/lib/python3.12/site-packages/osmnx/features.py:678: PerformanceWarning: indexing past lexsort depth may impact performance.
  gdf.loc[:, "geometry"] = gdf["geometry"].make_valid()
../../../_images/bcb9b71529f9ab0c44b11e4a8c3019a8df6ec05582f8ed27185c550a055306b3.png

Figure 7.13. Existing lakes that are present in our study area.

As we can see from the Figure 7.10, there is one large lake and multiple smaller ones in our study that we might not want to be taken into account when analyzing the terrain. Luckily, we can easily mask these areas out of our Dataset by using rioxarray. To do this, we can use the same .rio.clip() method which we used in the previous example. However, in this case, we do not want to totally remove those cells from our Dataset but only mask them out, so that the values on those areas are replaced with NaN values. By using parameters drop=False and invert=True, the cells that are intersecting with the lake geometries will be masked with NaNs:

masked_data = data.rio.clip(
    geometries=lakes.geometry,
    drop=False,
    invert=True,
    crs=data.elevation.rio.crs,
)
masked_data["elevation"].plot()
plt.title("Elevation data with a mask");
../../../_images/b457d13d6cd37c851221470f9891168b782c82ef481214f31e4845207fd75e70.png

Figure 7.14. A Dataset where the lakes have been masked out (shown with white color).

As a result, we now have a new Dataset where the elevation values overlapping with the lakes have been converted to NaNs. We can now compare whether e.g. the mean land surface elevation differs from the original one where the lakes were still included:

print("Mean elevation with lakes:", data["elevation"].mean().round().item())
print("Mean elevation without lakes:", masked_data["elevation"].mean().round().item())
Mean elevation with lakes: 1258.0
Mean elevation without lakes: 1287.0

Based on this comparison, we can see that masking out the lakes increases the mean elevation in the area by approximately 30 meters. In a similar manner, you can mask any rioxarray.Dataset with given mask features that you want to remove from the analysis.

Creating a raster mosaic by merging datasets#

One very common operation when working with raster data is to combine multiple individual raster layers (also called as tiles) into a single larger raster dataset, often called as raster mosaic. This can be done easily with the merge_datasets() -function in rioxarray. Here, we will create a mosaic based on DEM files (altogether 4 files) covering Kilimanjaro region in Tanzania. First we will read elevation data from an S3 bucket. Let’s start by creating a list of URL paths to given GeoTiff files that we have

import xarray as xr
from pathlib import Path
import rioxarray

# S3 bucket containing the data
bucket = "https://a3s.fi/swift/v1/AUTH_0914d8aff9684df589041a759b549fc2/PythonGIS"
path = Path(bucket)

# Generate urls for the elevation files
urls = [
    path / "elevation/kilimanjaro/ASTGTMV003_S03E036_dem.tif",
    path / "elevation/kilimanjaro/ASTGTMV003_S03E037_dem.tif",
    path / "elevation/kilimanjaro/ASTGTMV003_S04E036_dem.tif",
    path / "elevation/kilimanjaro/ASTGTMV003_S04E037_dem.tif",
]

# Show the first path
urls[0]
PosixPath('https:/a3s.fi/swift/v1/AUTH_0914d8aff9684df589041a759b549fc2/PythonGIS/elevation/kilimanjaro/ASTGTMV003_S03E036_dem.tif')

Now we have a list of URL paths to the files that we want to read into xarray. To do this, we create a nested loop where we iterate over the urls list one url at a time and read the GeoTiff file using the .open_dataset() function as we introduced in Chapter 7.2:

datasets = [
    xr.open_dataset(url, engine="rasterio", masked=True, band_as_variable=True)
    for url in urls
]

Now we have stored all the xarray.Dataset layers inside the list datasets. We can investigate the contents of the first Dataset in our list as follows:

datasets[0]
<xarray.Dataset> Size: 52MB
Dimensions:      (x: 3601, y: 3601)
Coordinates:
  * x            (x) float64 29kB 36.0 36.0 36.0 36.0 ... 37.0 37.0 37.0 37.0
  * y            (y) float64 29kB -2.0 -2.0 -2.001 -2.001 ... -2.999 -3.0 -3.0
    spatial_ref  int64 8B ...
Data variables:
    band_1       (y, x) float32 52MB ...
Attributes:
    Band_1:         Band 1
    AREA_OR_POINT:  Area
datasets[0].rio.shape
(3601, 3601)

As we can see an individual Dataset has a shape with 3601 cells on each dimension (x and y) and the name of the data variable band_1 which represents the elevation values. Let’s visualize these four raster tiles in separate maps to see how they look like. To do this, we use plt.subplots() to initialize a figure with 2x2 subplots and then visualize the rasters one by one. We use vmax parameter to specify a same value scale for each layer that makes the colors in the map comparable:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(16, 16))

# Plot the tiles to see how they look separately
datasets[0]["band_1"].plot(ax=axes[0][0], vmax=5900, add_colorbar=False)
datasets[1]["band_1"].plot(ax=axes[0][1], vmax=5900, add_colorbar=False)
datasets[2]["band_1"].plot(ax=axes[1][0], vmax=5900, add_colorbar=False)
datasets[3]["band_1"].plot(ax=axes[1][1], vmax=5900, add_colorbar=False);
../../../_images/91ce3aa6be70bb89b7d70ba127b46e8b1d7d2267ea4afff5341998eca7f18ec1.png

Figure 7.15. Four elevation raster layers plotted next to each other.

From the figure we can see that these four raster tiles seem to belong together naturally as the elevation values as well as the coordinates along the x- and y-axis continue smoothly. Hence, we can stitch them together into a single larger raster Dataset. To merge multiple xarray.Datasets together, we can use the .merge_datasets() function from rioxarray:

from rioxarray.merge import merge_datasets

mosaic = merge_datasets(datasets)
mosaic
/Users/tenkanh2/micromamba/envs/python-gis-book/lib/python3.12/site-packages/IPython/core/interactiveshell.py:3577: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
  exec(code_obj, self.user_global_ns, self.user_ns)
<xarray.Dataset> Size: 208MB
Dimensions:      (x: 7201, y: 7201)
Coordinates:
  * x            (x) float64 58kB 36.0 36.0 36.0 36.0 ... 38.0 38.0 38.0 38.0
  * y            (y) float64 58kB -2.0 -2.0 -2.001 -2.001 ... -3.999 -4.0 -4.0
    spatial_ref  int64 8B 0
Data variables:
    band_1       (y, x) float32 207MB 826.0 826.0 823.0 ... 832.0 838.0 844.0
Attributes:
    Band_1:         Band 1
    AREA_OR_POINT:  Area

Excellent! Now we have successfully merged the individual datasets together which is evident from the shape of the new Dataset that has 7201 cells on x and y axis. Let’s now rename our data variable to a more intuitive one and plot the result to see how the end result looks like:

mosaic = mosaic.rename({"band_1": "elevation"})
mosaic["elevation"].plot(figsize=(12, 12))
plt.title("Elevation values covering larger area in the region close to Kilimanjaro");
../../../_images/a3a181c4934134ef0e3c839396c35c5423b4687b9bfac12c9af4d944e5213e29.png

Figure 7.16. A raster mosaic where four raster tiles were merged together.

The end result looks good and we can see clearly the Mount Kilimanjaro which is the highest mountain in Africa 5895 meters above sea level and the highest volcano in the Eastern Hemisphere.

Raster to vector conversion (vectorize)#

Another commonly used technique commonly needed when working with geographic data is to convert the data from raster to vector format and vice versa. These conversion techniques are commonly called as vectorize or rasterize operations. When we convert a raster Dataset to vector format the raster cells are converted into shapely.Polygon objects and the values of the cells are stored as an attribute (column) in the resulting GeoDataFrame. To convert xarray.DataArray into vector format, you can use the geocube library that helps doing these kind of data conversions. In the following, we continue work with the kitumbene elevation data that we created earlier by clipping the data and convert this layer into vector format. Let’s have a quick look on our input DataArray before continuing:

kitumbene["elevation"]
<xarray.DataArray 'elevation' (y: 828, x: 720)> Size: 5MB
array([[ 979.,  985.,  988., ..., 1056., 1054., 1054.],
       [ 978.,  982.,  983., ..., 1054., 1055., 1057.],
       [ 975.,  975.,  977., ..., 1058., 1058., 1059.],
       ...,
       [ 749.,  753.,  752., ..., 1003., 1003., 1003.],
       [ 753.,  753.,  752., ...,  999., 1001., 1002.],
       [ 755.,  754.,  751., ...,  999., 1000., 1000.]])
Coordinates:
  * x            (x) float64 6kB 36.1 36.1 36.1 36.1 ... 36.3 36.3 36.3 36.3
  * y            (y) float64 7kB -2.77 -2.771 -2.771 -2.771 ... -2.999 -3.0 -3.0
    spatial_ref  int64 8B 0
Attributes:
    Band_1:         Band 1
    long_name:      Band 1
    AREA_OR_POINT:  Area

To vectorize a given variable in your xarray.Dataset, you can use the vectorize function from the geocube library as follows:

from geocube.vector import vectorize

gdf = vectorize(kitumbene["elevation"].astype("float32"))
gdf.shape
(500279, 2)
gdf.head()
elevation geometry
0 979.0 POLYGON ((36.10014 -2.77014, 36.10014 -2.77042...
1 985.0 POLYGON ((36.10042 -2.77014, 36.10042 -2.77042...
2 988.0 POLYGON ((36.10069 -2.77014, 36.10069 -2.77042...
3 987.0 POLYGON ((36.10125 -2.77014, 36.10125 -2.77042...
4 984.0 POLYGON ((36.10153 -2.77014, 36.10153 -2.77042...

Great! Now we have converted the DataArray into vector format and as a result we got a GeoDataFrame that contains the geometries of individual cells as shapely.Polygon objects in the geometry column, and the cell values in the column elevation. The name of the column will be automatically added based on the name of the DataArray. As we can see from the gdf.shape all the raster cells were added as individual rows into the GeoDataFrame which means that our 2D array with 828 rows and 728 columns result in approximately 500 thousand rows in the GeoDataFrame.

When working with surface data (e.g. elevation), it is quite common that there are similar values close to each other in the raster cells and often there are specific regions where the elevation does not change. Therefore, after the vectorization operation it is a good idea to dissolve the geometries based on the data attribute which merges geometries with identical values into single geometries instead of representing all the values as separate polygons. To do this, we can use the .dissolve() function which we introduced in more detail in Chapter 6.3:

gdf = gdf.dissolve(by="elevation", as_index=False)
gdf.shape
(2110, 2)

As we can see, the number of rows in our GeoDataFrame was reduced dramatically from more than 500 thousand rows into a bit over 2000 rows. Let’s finally plot our GeoDataFrame to see how the data looks like:

gdf.plot(column="elevation");
../../../_images/fbd105180dafd9fa93cd99f6f23b55d042009e8056f1e67c4f7b0474d036b829.png

Figure 7.17. The elevation map made from the vectorized data.

As we can see from the figure, the map looks identical to our original DataArray (Figure 7.9) which means that the conversion works as it should.

It is good to keep in mind that the raster data structure (i.e. arrays) is much more efficient way to process continuous surfaces. When every cell in the 2D array is converted into polygon geometries, the processing and visualization of the data typically becomes more resource intensive for the computer (making things slower). There are approaches to deal with this issue e.g. by categorizing the data values of the surface into specific elevation classes (e.g. with 5 meter intervals) and then dissolving the geometries into larger Polygon shapes (as we did earlier without categorization). Another technique to consider is downscaling your data into lower resolution, meaning that the size of an individual cell will be larger. Naturally, both of these techniques has an impact on the quality of the data as the data is generalized and aggregated. It is a good idea to do the preprocessing steps for the raster data before vectorizing it, especially if you have large raster arrays because the array operations in xarray are very efficient.

Vector to raster conversion (rasterize)#

Now as we have seen how to convert the data from raster to vector, let’s continue and see how to do the conversion from vector to raster data format, i.e. how to rasterize a vector dataset. In this example, we aim to rasterize the lakes that we downloaded earlier from OpenStreetMap. Let’s have a look how the data looks like:

lakes = ox.features_from_polygon(data_bounds_geom, tags={"water": ["lake"]})

lakes.plot();
/Users/tenkanh2/micromamba/envs/python-gis-book/lib/python3.12/site-packages/osmnx/features.py:678: PerformanceWarning: indexing past lexsort depth may impact performance.
  gdf.loc[:, "geometry"] = gdf["geometry"].make_valid()
../../../_images/1e61ff4dec4b15639152b5c46fa9f3bfe9a8517ba0d87ba148dee5fbfff544f4.png

Figure 7.18. Lakes represented in vector format.

lakes.shape
(16, 15)
lakes.tail(2)
geometry natural water name name:de name:es name:fr name:it name:pt name:sk wikidata wikipedia intermittent salt type
element id
way 1355142947 POLYGON ((36.7008 -2.86243, 36.70067 -2.86243,... water lake NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
relation 2315592 POLYGON ((36.22203 -1.85475, 36.22273 -1.85303... water lake Lake Magadi NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN multipolygon

As we can see the lakes GeoDataFrame contains polygons and have various attributes associated with them, although majority of these attributes do not contain any relevant data. Thus, let’s just pick a few columns that are most interesting to us:

lakes = lakes[["geometry", "water", "name"]].copy().reset_index()
lakes.shape
(16, 5)

Our GeoDataFrame does not currently really include any useful numerical data (except id) that we would perhaps want to store as an attribute into our raster DataArray. Thus, let’s create one numerical attribute into our data and calculate the area of the lakes as something we will use as values in our raster. We can use some of the vector data processing tricks to do this which were introduced in Chapter 6:

# Reproject to metric system
lakes_utm = lakes.to_crs(epsg=21037)

# Calculate the area in km2
lakes_utm["area_km2"] = lakes_utm.area / 1000000

# Join the area information with the original gdf
lakes = lakes.join(lakes_utm[["area_km2"]])
lakes.head(5)
element id geometry water name area_km2
0 way 698412673 POLYGON ((36.22572 -2.66849, 36.22584 -2.6685,... lake NaN 0.001602
1 way 724099461 POLYGON ((36.1394 -2.76761, 36.13989 -2.76752,... lake NaN 1.435812
2 way 724099746 POLYGON ((36.20467 -2.7295, 36.20459 -2.72856,... lake NaN 3.120696
3 way 23786069 POLYGON ((35.91633 -2.5593, 35.9193 -2.55463, ... lake Lake Natron 819.954659
4 way 23786069 POLYGON ((35.91633 -2.5593, 35.9193 -2.55463, ... lake Lake Natron 819.954659

In the previous, we first reprojected the data into EPSG:21037 (Arc 1960 / UTM zone 37S) which is a UTM (Universal Transverse Mercator) projection that covers most of Tanzania and provides accurate distance and area calculations. Then we calculated the area of the lakes into square kilometers and finally joined the information about the area into our original GeoDataFramethat is in WGS84 coordinate reference system.

By looking at the resulting table we can see something interesting. It appears that the lakes data contain some duplicate rows as the Lake Natron is present twice in our table (something to do with fetching data from OSM). This can cause issues when rasterizing the data because there should not be geometries that overlap with each other. Thus, we want to remove all duplicate rows from the data before continuing:

lakes = lakes.drop_duplicates()
lakes.shape
(14, 6)

As a result, two duplicates were dropped from the GeoDataFrame. Now we are ready to rasterize our GeoDataFrame into xarray. To do this, we can use the make_geocube() function from the geocube library:

from geocube.api.core import make_geocube

lakes_ds = make_geocube(
    vector_data=lakes,
    measurements=["id", "area_km2"],
    resolution=(-0.01, 0.01),
    output_crs="epsg:4326",
)
lakes_ds
<xarray.Dataset> Size: 235kB
Dimensions:      (y: 107, x: 136)
Coordinates:
  * y            (y) float64 856B -1.805 -1.815 -1.825 ... -2.845 -2.855 -2.865
  * x            (x) float64 1kB 35.89 35.9 35.91 35.92 ... 37.22 37.23 37.24
    spatial_ref  int64 8B 0
Data variables:
    id           (y, x) float64 116kB nan nan nan nan nan ... nan nan nan nan
    area_km2     (y, x) float64 116kB nan nan nan nan nan ... nan nan nan nan

As a result, we now have an xarray.Dataset with two data variables. In the code above, we specified that the lakes GeoDataFrame is used as the input vector_data and the columns id and area_km2 should be included as the measurements, i.e. the data that are stored into separate xarray variables. The resolution parameter defines the spatial resolution of the output grid, i.e. size of a single pixel in the raster. In our case, we specified a tuple with values (-0.01, 0.01) that are presented as decimal degrees because the coordinate reference system of our input data is WGS84. Thus, resolution 0.01 indicates that the size of a single pixel in our raster is approximately 1x1 kilometers (1.11km). The negative sign for x-resolution is used because raster transformations often use negative values for the y-axis in many CRS systems, where the y-coordinates decrease as you move down the raster. Finally, the output_crs defines the CRS for the resulting Dataset. Let’s make a map out of our data to see how the result looks like:

lakes_ds["area_km2"].plot()
plt.title("Rasterized lakes");
../../../_images/f9d9285a16c6c3982e02451a9314b64dcf7f2db217d91f6cf366ca8b99bfb2ee.png

Figure 7.19. Lakes that have been rasterized into DataArray at approximately 1 km resolution.

Quite often when rasterizing vector data, you actually want to fit the output to have identical resolution to an already existing xarray.Dataset and align it with the other raster layer. For example in our case, we can use the data raster (with elevation values) as a target so that the resolution, dimensions and alignment would fit with the existing Dataset. We can achieve this by using the like parameter in make_geocube() function. This will ensure that the output aligns with the existing raster having same resolution and dimensions:

aligned_ds = make_geocube(vector_data=lakes, measurements=["id", "area_km2"], like=data)
aligned_ds
<xarray.Dataset> Size: 208MB
Dimensions:      (y: 3601, x: 3601)
Coordinates:
  * y            (y) float64 29kB -2.0 -2.0 -2.001 -2.001 ... -2.999 -3.0 -3.0
  * x            (x) float64 29kB 36.0 36.0 36.0 36.0 ... 37.0 37.0 37.0 37.0
    spatial_ref  int64 8B 0
Data variables:
    id           (y, x) float64 104MB nan nan nan nan nan ... nan nan nan nan
    area_km2     (y, x) float64 104MB nan nan nan nan nan ... nan nan nan nan
aligned_ds["area_km2"].plot();
../../../_images/27cf065ed13edd53e8eca7aa415027d962a2098fe74125b0ddf612899350478c.png

Figure 7.20. Lakes that have been rasterized and aligned with an existing Dataset.

As a result, we have now rasterized the lakes in a way that the aligned_ds aligns with the existing Dataset containing the elevation values. This technique can be very useful especially when you want to do calculations (map algebra) between multiple raster layers (more about map algebra in Chapter 7.5).

Resampling raster data#

In the following section, we will introduce a technique that allows you to resample your raster data. Resampling refers to changing the cell values due to changes in the raster grid for example due to changing the effective cell size of an existing dataset. There are two ways to resample your raster data. Upscaling (or upsampling) refers to cases in which convert the raster to higher resolution, i.e. smaller cells. Downscaling (or downsampling) is resampling to lower resolution, i.e. having larger cell sizes. In the following, we will see how we can resample an xarray.Dataset by downscaling and upscaling the data.

We can resample xarray data by using the rioxarray library that can be used to downscale and upscale raster data. Whenever downscaling data, you are ultimately aggregating the information because multiple individual cells are merged into one larger cell that is then stored in the output grid. Thus, it is important to decide the resampling method which determines how the data values are aggregated. Depending on the input data, you might for example calculate the average of the input cells which will then be stored in the output grid cell. In our case, taking the average makes sense, because our input data represents elevation. However, in some cases you might be interested to sum all the cell values for example if your input data would represent population counts in a given region. There are also various other ways to resample the data, such as extracting the minimum (min), maximum (max), median (med) or the mode from the input cells. The mode means that the value which appears most often in the input raster cells is selected to the output raster cell.

Downscaling#

In the following, we will downscale our elevation data significantly by using a downscale factor of 50. This means that the output raster will be 50 times smaller in terms of its dimensions compared to the input raster. To downscale the data, you need to define the new shape for our target raster Dataset. Here, we use a specific downscale_factor that is used to calculate the new width and height for the output Dataset. The width and height for the target Dataset need to be provided as integer values. Thus we ensure that the dimensions are integers by rounding (round()) and converting the number withint(). The .rio.reproject() method is then used to downscale the data using the new shape. The resampling parameter defines the resampling method which in our case will be Resampling.average. The Resampling class from rasterio library provides the methods for resampling:

from rasterio.enums import Resampling

# Define the new shape
downscale_factor = 50
new_width = int(round(data.rio.width / downscale_factor))
new_height = int(round(data.rio.height / downscale_factor))

# Downscale the data
data_downscaled = data.rio.reproject(
    dst_crs=data.rio.crs,
    shape=(new_height, new_width),
    resampling=Resampling.average,
)
data_downscaled
<xarray.Dataset> Size: 63kB
Dimensions:          (x: 72, y: 72)
Coordinates:
  * x                (x) float64 576B 36.01 36.02 36.03 ... 36.97 36.98 36.99
  * y                (y) float64 576B -2.007 -2.021 -2.035 ... -2.979 -2.993
    spatial_ref      int64 8B 0
Data variables:
    elevation        (y, x) float64 41kB 810.4 801.3 ... 1.251e+03 1.286e+03
    relative_height  (y, x) float32 21kB 242.4 233.3 162.9 ... 663.7 683.2 718.4

As we can see, now the dimensions of the new downscaled Dataset is 72x72 cells on x and y axis which is 50 times smaller compared to the original data that we used as input:

data.rio.shape
(3601, 3601)
print("Original resolution:", data.rio.resolution())
print("Downscaled resolution:", data_downscaled.rio.resolution())
Original resolution: (0.0002777777777777778, -0.0002777777777777781)
Downscaled resolution: (0.013892746913580308, -0.013892746913580263)

By comparing the spatial resolution between the datasets, we can see that the new resolution of the downscaled Dataset is slightly over 1x1 km (~0.014 decimal degrees). Let’s finally visualize the downscaled Dataset to investigate how the result looks on a map:

data_downscaled["elevation"].plot()
plt.title("Downscaled elevation data");
../../../_images/ddbb70885118375715143cbefafc2d49194484402a1b006ebd59fa24fa987b66.png

Figure 7.21. Downscaled data using a downscale factor of 50.

The downscaling operation seem to have worked well as the patterns are still clearly similar compared to the input data (Figure 7.7), although the spatial resolution is much lower. The data is downscaled so much that it is actually possible to identify individual pixels of the grid.

Upscaling#

The process of upscaling works very similarly to downscaling and we can use the same rioxarray method to increase the resolution of the input raster. In the following, we will specify that the new shape of the output Dataset will be two times larger than the input data. When upscaling, you are ultimately estimating values to new pixel cells based on the neighboring raster values of a given cell in the input raster. There are various ways to interpolate data which are provided in the Table 7.2.

: Table 7.2. Different resampling methods and their descriptions.

Resampling Method

Description

nearest

Selects the value of the closest pixel without interpolation. Fast but can produce blocky artifacts.

bilinear

Performs linear interpolation using the four nearest pixels, creating a smoother result than nearest-neighbor.

cubic

Uses bicubic interpolation, considering 16 surrounding pixels. Smoother output than bilinear. Higher computational cost.

cubic_spline

Applies cubic spline interpolation for even smoother results, but requires more processing power.

lanczos

Uses a sinc-based interpolation over a larger pixel neighborhood, producing resampling with significant smoothing.

average

Average resampling, computes the weighted average of all non-NODATA contributing pixels.

In the following, we will take advantage of a Resampling.bilinear resampling method which determines the value of a new pixel by taking a weighted average of the four nearest input pixels:

from rasterio.enums import Resampling

# Define the new shape
upscale_factor = 2
new_width = data.rio.width * upscale_factor
new_height = data.rio.height * upscale_factor

# Upscale the data
data_upscaled = data.rio.reproject(
    data.rio.crs,
    shape=(new_height, new_width),
    resampling=Resampling.bilinear,
)

Now we have successfully upscaled our data which we can confirm by comparing the shapes of the original and upscaled datasets:

print("Shape - Original:", data.rio.shape)
print("Shape - Upscaled:", data_upscaled.rio.shape)
Shape - Original: (3601, 3601)
Shape - Upscaled: (7202, 7202)

As expected, the new data_upscaled raster contains twice as many pixels compared to the input which is exactly what we were after. In a similar manner, you can upscale the data into different resolutions by providing a given target shape to your new raster layer. Notice that there is no “correct” way to do upscaling as all resampling methods involve interpolation that introduces uncertainty to the results. Let’s finally confirm that our upscaled data looks correct also on a map:

data_upscaled["elevation"].plot()
plt.title("Upscaled elevation data");
../../../_images/7beb8f817f974320d27e88cd4d15945b97054ea53c8a572f43169103c7fd7f90.png

Figure 7.22. Upscaled data using a upscale factor of 2.

Handling missing data#

To be added.