Introduction to data structures in xarray

Introduction to data structures in xarray#

Now that you have learned a bit of basics about raster data and how to create a simple 2-dimensional raster array using numpy, we continue to explore in a more comprehensive manner how to work with real-world raster data using xarray and rioxarray libraries (+ other relevant libraries linked to them). The xarray library is a highly useful tool for storing, representing and manipulating raster data, while rioxarray provides various raster processing (GIS) capabilities on top of the xarray data structures, such as reading and writing several different raster formats and conducting different geocomputational tasks. Under the hood, rioxarray uses another Python library called rasterio (that works with N-dimensional numpy arrays) but the benefit of xarray and rioxarrayis that they provide easier and more intuitive way to work with raster data layers, in a bit similar manner as working with vector data using geopandas.

When working with raster data, you typically have various layers that represent different geographical features of the world (e.g. elevation, temperature, precipitation etc.) and this data is possibly captured at different times of the year/day/hour, meaning that you have longitudinal observations from the same area, constituting time series data. More often than not, you need to combine information from these layers to be able to conduct meaningful analysis based on the data, such as do a weather forecast. One of the greatest benefits of xarray is that you can easily store, combine and analyze all these different layers via a single object, i.e. a Dataset, as demonstrated in Figure 7.2.

The two fundamental data structures provided by the xarray library are DataArray and Dataset (Figure 7.2). Both of them build upon and extend the strengths of numpy and pandas libraries. The DataArray is a labeled N-dimensional array that is similar to pandas.Series but works with raster data (stored as numpy arrays). The Dataset then again is a multi-dimensional in-memory array database that contains multiple DataArray objects. In addition to the variables containing the observations of a given phenomena, you also have the x and y coordinates of the observations stored in separate layers, as well as metadata providing relevant information about your data, such as Coordinate Reference System and/or time. Thus, a Dataset containing raster data is very similar to geopandas.GeoDataFrame and actually various xarray operations can feel very familiar if you have learned the basics of pandas and geopandas covered in Chapters 3 and 6.

Figure 7.2. Key  data structures. Image source: Xarray Community (2024), licensed under Apache 2.0.

Figure 7.2 Key xarray data structures. Image source: Xarray Community (2024), licensed under Apache 2.0.

Some of the benefits of xarray include:

  • A more intuitive and user-friendly interface to work with multidimensional arrays (compared e.g. to numpy)

  • The possibility to select and combine data along a dimension across all arrays in a Dataset simultaneously

  • Compatibility with a large ecosystem of Python libraries that work with arrays / raster data

  • Tight integration of functionalities from well-known Python data analysis libraries, such as pandas, numpy, matplotlib, and dask

Reading a file#

In the following, we start by investigating a simple elevation dataset using xarray that represents a Digital Elevation Model (DEM) of Kilimanjaro area in Tanzania. To read a raster data file (such as GeoTIFF) into xarray, we can use the .open_dataset() function. Here, we read a .tif file directly from a cloud storage space that we have created for this book:

import xarray as xr

# import rioxarray

url = "https://a3s.fi/swift/v1/AUTH_0914d8aff9684df589041a759b549fc2/PythonGIS/elevation/kilimanjaro/ASTGTMV003_S03E036_dem.tif"
data = xr.open_dataset(url, engine="rasterio")
data
<xarray.Dataset> Size: 52MB
Dimensions:      (band: 1, x: 3601, y: 3601)
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 29kB 36.0 36.0 36.0 36.0 ... 37.0 37.0 37.0 37.0
  * y            (y) float64 29kB -2.0 -2.0 -2.001 -2.001 ... -2.999 -3.0 -3.0
    spatial_ref  int64 8B ...
Data variables:
    band_data    (band, y, x) float32 52MB ...
type(data)
xarray.core.dataset.Dataset

Now we have read the GeoTIFF file into an xarray.Dataset data structure which we stored into a variable data. The Dataset contains the actual data values for the raster cells, as well as other relevant attribute information related to the data:

  • Dimensions show the number of cells of the given band, i.e. in our case there are 3061 cells both on the x and y axis

  • Coordinates is a container that contains the actual x and y coordinates of the cells, the Coordinate Reference System information stored in the spatial_ref attribute, and the band attribute that shows the number of bands in our data.

  • Data variables contains the actual data values of the cells (e.g. elevations as in our data)

data = data.squeeze("band", drop=True)
data
<xarray.Dataset> Size: 52MB
Dimensions:      (x: 3601, y: 3601)
Coordinates:
  * x            (x) float64 29kB 36.0 36.0 36.0 36.0 ... 37.0 37.0 37.0 37.0
  * y            (y) float64 29kB -2.0 -2.0 -2.001 -2.001 ... -2.999 -3.0 -3.0
    spatial_ref  int64 8B ...
Data variables:
    band_data    (y, x) float32 52MB ...
data = data.rename({"band_data": "elevation"})
data
<xarray.Dataset> Size: 52MB
Dimensions:      (x: 3601, y: 3601)
Coordinates:
  * x            (x) float64 29kB 36.0 36.0 36.0 36.0 ... 37.0 37.0 37.0 37.0
  * y            (y) float64 29kB -2.0 -2.0 -2.001 -2.001 ... -2.999 -3.0 -3.0
    spatial_ref  int64 8B ...
Data variables:
    elevation    (y, x) float32 52MB ...
data.data_vars
Data variables:
    elevation  (y, x) float32 52MB ...
data["elevation"].plot()
<matplotlib.collections.QuadMesh at 0x16ac8f680>
../../../_images/82ad5901595843ba742517eee627b203c59da309f04be02e0f24d1060c3d60a5.png

Dataset properties#

Let’s have a closer look at the properties of the file:

data.spatial_ref.attrs
{'crs_wkt': 'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]',
 'semi_major_axis': 6378137.0,
 'semi_minor_axis': 6356752.314245179,
 'inverse_flattening': 298.257223563,
 'reference_ellipsoid_name': 'WGS 84',
 'longitude_of_prime_meridian': 0.0,
 'prime_meridian_name': 'Greenwich',
 'geographic_crs_name': 'WGS 84',
 'horizontal_datum_name': 'World Geodetic System 1984',
 'grid_mapping_name': 'latitude_longitude',
 'spatial_ref': 'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]',
 'GeoTransform': '35.9998611111111 0.000277777777777778 0.0 -1.99986111111111 0.0 -0.000277777777777778'}
data.rio.crs
CRS.from_epsg(4326)
data.rio.crs.to_epsg()
4326
data.rio.crs.to_wkt()
'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]'
# Resolution
data.rio.resolution()
(0.000277777777777778, -0.000277777777777778)
# Affine transform (how raster is scaled, rotated, skewed, and/or translated)
data.rio.transform()
Affine(0.000277777777777778, 0.0, 35.9998611111111,
       0.0, -0.000277777777777778, -1.99986111111111)
# Dimensions
print(data.rio.shape)
print(data.rio.width)
print(data.rio.height)
(3601, 3601)
3601
3601
# Number of bands
# data.rio.count
# Bounds of the file
data.rio.bounds()
(35.9998611111111, -3.000138888888889, 37.000138888888884, -1.99986111111111)
# No data values for all channels
data.rio.vars
['band_data']

Writing a file#

Add material about writing to most common raster file formats.