Introduction to data structures in xarray

Introduction to data structures in xarray#

Now that you have learned a bit of basics about raster data and how to create a simple 2-dimensional raster array using numpy, we continue to explore in a more comprehensive manner how to work with real-world raster data using xarray and rioxarray libraries (+ other relevant libraries linked to them). The xarray library is a highly useful tool for storing, representing and manipulating raster data, while rioxarray provides various raster processing (GIS) capabilities on top of the xarray data structures, such as reading and writing several different raster formats and conducting different geocomputational tasks. Under the hood, rioxarray uses another Python library called rasterio (that works with N-dimensional numpy arrays) but the benefit of xarray and rioxarrayis that they provide easier and more intuitive way to work with raster data layers, in a bit similar manner as working with vector data using geopandas.

When working with raster data, you typically have various layers that represent different geographical features of the world (e.g. elevation, temperature, precipitation etc.) and this data is possibly captured at different times of the year/day/hour, meaning that you may have longitudinal (repetitive) observations from the same area, constituting time series data. More often than not, you need to combine information from these layers to be able to conduct meaningful analysis based on the data, such as do a weather forecast. One of the greatest benefits of xarray is that you can easily store, combine and analyze all these different layers via a single object, i.e. a Dataset.

Key xarray datastructures: Dataset and DataArray#

The two fundamental data structures provided by the xarray library are DataArray and Dataset (Figure 7.2). Both of them build upon and extend the strengths of numpy and pandas libraries. The DataArray is a labeled N-dimensional array that is similar to pandas.Series but works with raster data (stored as numpy arrays). The Dataset then again is a multi-dimensional in-memory array database that contains multiple DataArray objects. In addition to the variables containing the observations of a given phenomena, you also have the x and y coordinates of the observations stored in separate layers, as well as metadata providing relevant information about your data, such as Coordinate Reference System and/or time. Thus, a Dataset containing raster data is very similar to geopandas.GeoDataFrame and actually various xarray operations can feel very familiar if you have learned the basics of pandas and geopandas covered in Chapters 3 and 6.

Figure 7.2. Key data structures. Image source: Xarray Community (2024), licensed under Apache 2.0.

Figure 7.2. Key xarray data structures. Image source: Xarray Community (2024), licensed under Apache 2.0.

Some of the benefits of xarray include:

A more intuitive and user-friendly interface to work with multidimensional arrays (compared e.g. to numpy)
The possibility to select and combine data along a dimension across all arrays in a Dataset simultaneously
Compatibility with a large ecosystem of Python libraries that work with arrays / raster data
Tight integration of functionalities from well-known Python data analysis libraries, such as pandas, numpy, matplotlib, and dask

In the following, we will continue by introducing some of the basic functionalities of the xarray data structures.

Plotting a data variable#

Thus far, we have investigated how the xarray data structures look like. However, we have not yet plotted anything on a map, which is also one of the typical things that you want to do as one of the first steps whenever working with new data. The xarray library provides very similar plotting functionality as geopandas, i.e. you can easily create a visualization out of your DataArray objects by using the .plot() method that uses matplotlib library in the background. In the following, we create a simple map out of the "elevation" data variable:

import matplotlib.pyplot as plt

# Plot the values
mesh = data["elevation"].plot(cmap="terrain")

# Add a title
plt.title("Elevation in meters");

../../../_images/60ae0f804231b7bb741e140347d9f87399e0cdeaa5073d2be550f706edf1080a.png

Figure 7.3. A map representing the elevation values in the Kilimanjaro area.

Great! Now we have a nice simple map that shows the relative height of the landscape where the highest peaks of the mountains are clearly visible on the bottom left corner. Notice that we used the "terrain" as a colormap for our visualization which provides a decent starting point for our visualization. However, as you can see it does not make sense that the part of the elevations are colored with blue because the land surface in this area of the world should not have any values going below the sea surface (0 meters). It is possible to deal with this issue by adjusting the colormap which you can learn from Chapter 8.

We can also easily plot our data in a couple of different ways and e.g. produce a contour map representing the elevation using contour lines. A contour map is a type of map that represents the three-dimensional features of a landscape in two dimensions by using contour lines highlighting the hills and valleys in our data. Each contour line connects points of equal elevation above a reference level, such as sea level. These lines provide a way to visualize the shape, slope, and elevation of the terrain. Contour maps are widely applied to different data in various domains, such as in navigation and orienteering to visualize elevation characteristics to help activities such as hiking; in meteorology to visualize weather related phenomena (atmospheric pressure, temperature); and in hydrology to help identifying drainage patterns, watersheds, and potential flood zones. We can create a contour map based on the input data by calling the .plot.contour() method in xarray:

contours = data["elevation"].plot.contour()
plt.title("Contour map based on the elevation");

../../../_images/4e462389f99e5168f37acdf35d3ed5c4c68e4b6f4700147507ba7a841d2c30bd.png

Figure 7.4. A contour map representing the elevation values in the Kilimanjaro area.

It is also possible to create a surface map that shows the elevation values in 3D. A 3D surface map is a three-dimensional representation of a terrain or surface, created by plotting elevation or depth data as a continuous surface. The map visually depicts the topography by assigning different colors, shading, and heights to represent variations in elevation, providing a realistic and intuitive view of the terrain. Creating a 3D surface map can be done by calling the .plot.surface() method in xarray:

surface = data["elevation"].plot.surface(cmap="Greens")
plt.title("A surface map representing the elevation in 3D");

../../../_images/3f177d81f120f727d4535c06caf7940e6dc312608f9675e448d737749759a151.png

Figure 7.5. A 3D surface map representing the elevation values in the Kilimanjaro area.

Now we have a nice three dimensional map that clearly shows the hills and valleys in our study region that gives an intuitive view to the landscape in the given region.

Extracting basic raster dataset properties#

Now as we have explored our data visually, it is good to continue by examining the basic properties of the data, as well as by calculating summary statistics out of the data, such as the minimum, maximum, or mean values of your data. To extract this information from your Dataset, xarray provides very similar functionalities as pandas to compute some basic statistics out of your data. For instance, we can easily extract the maximum elevation of our data by calling .max() method:

data["elevation"].max()

<xarray.DataArray 'elevation' ()> Size: 4B
array(2943., dtype=float32)
Coordinates:
    spatial_ref  int64 8B 0

As we see, this produces quite a lot of information because xarray returns an DataArray as a result by default. When exploring the output, we can see that the single value in the array is 2943 which is the highest point in our data. However, typically it would be more useful to get a numerical value as a result when doing operations like these. Luckily, it is easy to extract the actual number from the data by adding .item() after the command. The .item() method returns the xarray.Dataset element as a regular Python scalar value which is more similar to what e.g. pandas returns when you call the .max() or .min(), as demonstrated below:

data["elevation"].min().item()

568.0

data["elevation"].max().item()

2943.0

Dimensions of the data#

In addition to the summary statistics, we can explore some of the basic properties of our raster data. Majority of the geographic data related properties of a raster Dataset can be accessed via .rio accessor. An accessor is a method or attribute added to an existing data structure, such as a DataArray or Dataset in xarray, to provide specialized functionality. Accessors extend the capabilities of the base object without modifying its core structure. Via the .rio accessor we can explore various attributes of our data, such as the .shape, .width or .height:

print(data.rio.shape)
print(data.rio.width)
print(data.rio.height)

(3601, 3601)
3601
3601

Spatial resolution#

From the outputs we can see that the shape of our raster data seems to be rectangular as we have 3601 X 3601 cells to each direction (x and y). To better understand the data in geographic terms, we can retrieve the spatial resolution of the data by calling the .rio.resolution() method:

data.rio.resolution()

(0.0002777777777777778, -0.0002777777777777781)

As we see, the spatial resolution, i.e. the size of a single cell in our raster layer is ~0.0028. The resolution is always reported in the units of the input dataset’s coordinate reference system (CRS). Thus, in our case, this number is reported in decimal degrees meaning that the resolution is 30 meters.

Coordinate reference system#

We can easily access the coordinate reference system (CRS) information of our raster dataset via the .rio.crs attribute:

data.rio.crs

CRS.from_epsg(4326)

The .rio.crs returns the coordinate reference system information as an EPSG code and the code 4326 stands for the WGS84 coordinate reference system in which the units are represented as latitudes and longitudes (i.e. decimal degrees). Another CRS-related attribute that is useful to know is transform. The transform refers to the affine transformation matrix that maps pixel coordinates to geographic coordinates. It defines how the raster data is aligned in space, including information on scaling, rotation, and translation relative to a coordinate reference system. You can access the transform attribute as follows:

data.rio.transform()

Affine(0.0002777777777777778, 0.0, 35.9998611111111,
       0.0, -0.0002777777777777781, -1.99986111111111)

Spatial extent#

To extract information about the spatial extent of the dataset, we can use the .rio.bounds() method:

data.rio.bounds()

(35.9998611111111, -3.000138888888889, 37.000138888888884, -1.99986111111111)

This returns the minimum and maximum coordinates (here, in latitude and longitude) that bound our dataset, forming a minimum bounding rectangle around the data. The first two numbers represent the left-bottom (x,y) corner of the dataset, while the last two number represent the right-top corner (x,y) of the area, respectively.

NoData value#

Did you notice a small white rectangle in the bottom-right corner in Figure 7.3? This white area exists because our dataset includes some cells that does not have any data, i.e. they are represented with NaN values. To investigate whether your data contains NaN values, you can access the NoData value of the DataArray as follows:

data["elevation"].rio.nodata

nan

The .rio.nodata returns the value how the NaN value is represented in our xarray.DataArray. As we can see, the NaN value is represented with numpy library’s nan. However, in many cases, the NoData value in a given raster file is actually coded with a specific numerical value that is a clear outlier compared to the rest of the data, such as -9999 or 9999. To find out what value has been used in the raster file to code the NoData value, we can use rio.encoded_nodata which returns the NoData value as it is represented in the actual source data:

data["elevation"].rio.encoded_nodata

9999.0

It is important to deal with possible NaN values in the data, because otherwise the NoData values would be presented as actual values in the dataset (i.e. elevation). In our case, we dealt with the NaN values already when reading the data into xarray with xr.open_dataset() by specifying masked=True. This ensured that all values with number 9999 in the raster were transformed into numpy.nan values. This ensures that they do not influence any of the calculations when using the data but they are properly treated as data that does not exist. For example, the maximum value that we extracted earlier from the data was 2943, not 9999 which would be the case unless we mask out the NaN values when reading the data.

If a given dataset contains NoData values (it does not necessarily contain any NoData), the metadata of the dataset should always contain information about how the NoData values are encoded. With some raster files, you can also search for the NoData value by checking the .attrs attribute that might include information about NoData and is returned as a Python dictionary. The information about NoData can be can be stored with different keys, and you should look for values stored in keys, such as _FillValue, missing_value, fill_value or nodata. In our raster file, this metadata has not been stored as part of the file, and thus, we cannot access the .attrs attribute:

data["elevation"].rio.attrs

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[19], line 1
----> 1 data["elevation"].rio.attrs

AttributeError: 'RasterArray' object has no attribute 'attrs'

Radiometric resolution (bit depth)#

Lastly, we can extract information about the radiometric resolution (i.e. bit depth) of our Dataset by calling .dtypes:

data.dtypes

Frozen({'elevation': dtype('float32')})

This returns a Python dictionary like object that provides information about the bit depth of each DataArray stored in our Dataset. In our case, we only have one data attribute (elevation) and from the result we can see that the bit depth of this data is 32 bits. The radiometric resolution is determined by the number of bits used to represent the data for each pixel, which defines the range of possible intensity values. For example, an 8-bit sensor can record 256 levels of intensity (values 0-255), while a 16-bit sensor can record 65536 levels. Thus, the more bits you use to store the numbers on computer, the larger numbers you can store with a given bit depth. Notice that the more bits you use to store the numbers, the higher the memory consumption is as well. In terms of our example here (32-bit float), the data can be stored with high precision as it allows to store numbers with many decimals (approximately 7 decimal digits).

Another useful thing to understand about computer systems is that the numbers can be represented as either signed or unsigned, which determines whether negative values can be stored. Signed numbers include a “+” or “-” sign to indicate whether the value is positive or negative, while unsigned numbers are limited to non-negative values. The sign and bit depth (8-bit, 16-bit, etc.) directly affect the range of values that can be represented in the raster. For example, an 8-bit signed integer can represent values from -128 to 127, while an 8-bit unsigned integer can represent values from 0 to 255. Thus, if your data only contains positive numbers, it makes sense to store the data as unsigned because then you can store larger numbers with the same number of bits. For instance, Landsat 8 satellite sensor data are stored as unsigned 16-bit values meaning that the possible values are from 0 to 65536. The data type (signed, unsigned) can also significantly influence the memory footprint of your data which we will cover next.

Size of the data: Memory footprint in bytes#

We can extract information about the memory footprint of our DataArray or Dataset via the .nbytes attribute which returns the total bytes consumed by the stored data. For convenience, we convert the bytes into Megabytes (MB) with formula <bytes> / (1000*1000) where the value 1000 converts the bytes into kilobytes and the second one converts the kilobytes into megabytes, respectively:

# Memory consumption of a single DataArray
bytes_to_MB = 1000 * 1000
footprint_MB = data["elevation"].nbytes / bytes_to_MB
print(f"DaraArray memory consumption: {footprint_MB:.2f} MB.")

DaraArray memory consumption: 51.87 MB.

# Memory consumption of the whole Dataset
footprint_MB = data.nbytes / bytes_to_MB
print(f"Dataset memory consumption: {footprint_MB:.2f} MB.")

Dataset memory consumption: 51.93 MB.

As we can see from the above, the memory consumption of the elevation data variable is 51.87 megabytes while the memory consumption of the whole Dataset is slightly larger (51.93 MB). Majority of the memory footprint of a Dataset goes to storing the data variables (here, elevation values) and the rest of the memory is mostly consumed by other metadata related to the data (coordinates, CRS, indices etc.). Thus, the more data variables you store in your Dataset the larger the overall memory footprint of the data will be.

Converting data type (bit depth)#

Now we know that our data consumes quite a bit of memory from our computer. However, in certain cases there might be ways to optimize the memory usage by changing the data type (i.e. bit depth) into a type that is not as memory-hungry as the 32-bit float. For example our elevation data is presented with whole numbers which is evident e.g. from the minimum and maximum values of our data, which were 568.0 and 2943.0 respectively. Notice that neither of these values have any decimals (other than 0), which is due to the fact the precision of our elevation data is in full meters. Thus, the type of our elevation data could be changed to integers. In fact, we could use unsigned integer as a data type because the elevation values present in the data fall under the range of 0-65535. Notice that if our data values would exceed these limits, we could use 32-bit or 64-bit integer data types which allow to store much higher numbers in the array.

But does the bit-depth matter really? It does. For example in our case it would make a lot of sense to store the data with lower bit depth because that requires less disk space and memory from the computer as we demonstrate in the following. To change a data type of our elevation data variable, we can use the .astype() method that converts the input values into the target data type which is provided as an input argument. In the following, we will convert the elevation values (32-bit floats) into 16-bit unsigned integer numbers by calling .astype("uint16").

# Convert the data into integers
data16 = data.copy()
data16["elevation"] = data16["elevation"].astype("uint16")

/Users/tenkanh2/micromamba/envs/python-gis-book/lib/python3.12/site-packages/xarray/core/duck_array_ops.py:239: RuntimeWarning: invalid value encountered in cast
  return data.astype(dtype, **kwargs)

Oops, the xarray is producing a warning for us here. The reason for the warning is because our data contains NaN values which are incompatible with the integer data type. NaN values are only supported by float data types, thus to get rid off this warning, you should fill the NaN values with some meaningful number before doing the conversion to integers. By default, the data type conversion will replace the NaN values with value 0. There are different approaches to deal with this issue (e.g. interpolation), but for now, we will just ignore this warning for the sake of simplicity.

We can check the memory consumption of our updated DataArray as follows:

# Memory consumption of the updated DataArray
footprint_MB = data16["elevation"].nbytes / bytes_to_MB
print(f"DaraArray memory consumption: {footprint_MB:.2f} MB.")

DaraArray memory consumption: 25.93 MB.

data16["elevation"].max().item()

As we can see from the above the maximum elevation was now changed from the decimal number (2943.0) into integer (2943). Also the memory consumption improved significantly as the size of our data was cut into half when we converted the data into 16-bit unsigned integers. Thus, by choosing the data type in a smart way, you can significantly lower the memory consumption of the data on your computer which might make a big difference in terms of performance of your analysis. This is especially true in case you are analyzing very large raster datasets. In the following, we can see how changing the data type influences on the memory footprint of the data. Now we use float data types in the conversions so that we don’t receive the “invalid value” warning as before:

print(
    "Memory consumption 16-bit:",
    data["elevation"].astype("float16").nbytes / bytes_to_MB,
    "MB.",
)
print(
    "Memory consumption 32-bit:",
    data["elevation"].astype("float32").nbytes / bytes_to_MB,
    "MB.",
)
print(
    "Memory consumption 64-bit:",
    data["elevation"].astype("float64").nbytes / bytes_to_MB,
    "MB.",
)

Memory consumption 16-bit: 25.934402 MB.
Memory consumption 32-bit: 51.868804 MB.
Memory consumption 64-bit: 103.737608 MB.

As we can see, the memory consumption of the same exact data varies significantly depending on the bit-depth that we choose to use for our data. It is important to be careful when doing bit-dept conversions that you do not sabotage your data with the data conversion. For example, in our data the value range is between 568-2943. Thus, we need to use at least 16-bits to store these values. However, nothing stops you from changing the data type into 8-bit integers which will significantly alter our data:

low_bit_depth_data = data["elevation"].astype("uint8")
print("Min value: ", low_bit_depth_data.min().item())
print("Max value: ", low_bit_depth_data.max().item())

Min value:  0
Max value:  255

/Users/tenkanh2/micromamba/envs/python-gis-book/lib/python3.12/site-packages/xarray/core/duck_array_ops.py:239: RuntimeWarning: invalid value encountered in cast
  return data.astype(dtype, **kwargs)

low_bit_depth_data.plot(cmap="terrain")
plt.title("Data presented with 8-bit integers");

../../../_images/04c24dbdf92db5b77f0864efe0991c1f17a755d5963e4d7d497f42fe8de16358.png

Figure 7.6. Elevation values represented as 8-bit integers with distorted values.

As we can see from the map, the result distorts the data significantly and ultimately makes it unusable because all the cells having a value larger than 255 (i.e. the maximum supported value with 8-bits) are incorrectly presented in the data. Thus, when doing data type conversions, it is good to be extra careful that you do not accidentally break your data and produce incorrect results when further analyzing it.

Writing to a file#

At this stage, we have learned how to read raster data and explored some of the basic properties of a raster Dataset. As a last thing, we will learn how to write the data from xarray into specific raster file formats. Similarly as with vector data, raster data can be stored in various formats.

Raster file formats#

Table 7.1 introduces some of the popular file formats used for storing raster data. The GeoTIFF format is one of the most widely used formats to store individual raster layers (i.e. a single DataArray) to disk which is widely used format in various GIS software. One flavor of GeoTIFF that is useful to know is a Cloud Optimized GeoTIFF (COG) which is a regular GeoTIFF file with an internal organization that enables more efficient workflows on the cloud (useful with large datasets). With COG files you can stream just the portion of data that is needed that can significantly reduce processing times and allow working with a single large GeoTIFF file instead of multiple tiles.

: Table 7.1. Common raster file formats.

Name	Extension	Description
GeoTIFF	.tif, .tiff	Widely used to store individual raster layers to disk (i.e. a single `xarray.DataArray`).
netCDF	.nc	Widely used to store a whole `xarray.Dataset` that can contain multiple variables.
Zarr	.zarr	Newer format for storing `xarray.Dataset`. Supports storing large data in compressed chunks.

The netCDF format is also widely used in geosciences for storing multiple variables into a single file which is based on a general-purpose file format and a data model called HDF5 (.h5). The Zarr file format is a newer format designed for cloud-native, chunked, and compressed array storage. The Zarr file format and the associated zarr library also allows you to write multiple variables (i.e. a whole xarray.Dataset) into a single .zarr directory/file in a similar manner as with netCDF. In addition, zarr has the ability to store arrays in various ways, including in memory, in files, and in cloud-based object storage (such as Amazon S3 buckets or Google Cloud Storage) that are often used to store very large datasets. When using .zarr file format, you can take advantage of the nice capabilities of Zarr, including the ability to store and analyze datasets far too large fit onto disk, particularly in combination with dask library which provides capabilities for parallel and distributed computing.

Raster compression methods#

When working with large raster datasets, managing storage space and improving performance are essential. Raster compression methods can help you to reduce the file size of raster data which can often take up significant disk space. Compression can also speed up your analysis process because compressed files are faster to read, download and transfer (online). Thus, when working with rioxarray and writing data to different file formats, it is useful to compress the data. Thus, knowing which compression method to choose when saving raster data in formats like GeoTIFF is essential.

The data compression can be lossless or lossy. Lossless compression means that the file size is reduced without losing or changing any data, while lossy compression achieves higher compression rate (i.e. smaller file size) with some loss in data quality / precision. Table 7.2 introduces some of the most widely used compression methods that can be used when writing raster data. Choosing the “right” compression method depends on different factors. If you need lossless compression, LZW or ZSTD is a good balance between size and speed. If storage space is a top priority, you can consider using JPEG or JPEG2000 but it is good to be aware of the potential data loss. For cloud-based workflows, ZSTD or DEFLATE are good choices for compression.

: Table 7.2. Common raster compression methods.

Name	Type	Compress Ratio	Encoding / Decoding Speed	Common Raster Formats
`LZW`	Lossless	Moderate	Medium / Fast	GeoTIFF, PNG
`DEFLATE`	Lossless	High	Slow / Medium	GeoTIFF, PNG
`ZSTD`	Lossless	High	Fast / Fast	Cloud-optimized GeoTIFF
`JPEG`	Lossy	Very High	Fast / Fast	JPEG, GeoTIFF (YCbCr)
`JPEG2000`	Both avail.	Very High	Slow / Medium	JP2, GeoTIFF (JP2)

Introduction to data structures in xarray

Contents

Introduction to data structures in xarray#

Key xarray datastructures: Dataset and DataArray#

Reading a file into Dataset#

Renaming data variables#

Plotting a data variable#

Extracting basic raster dataset properties#

Dimensions of the data#

Spatial resolution#

Coordinate reference system#

Spatial extent#

NoData value#

Radiometric resolution (bit depth)#

Size of the data: Memory footprint in bytes#

Converting data type (bit depth)#

Creating a new data variable#

Writing to a file#

Raster file formats#

Raster compression methods#

GeoTIFF#

Cloud Optimized GeoTIFF (COG)#

netCDF#

Zarr#

Footnotes#