Introduction to data structures in xarray#
Now that you have learned a bit of basics about raster data and how to create a simple 2-dimensional raster array using numpy
, we continue to explore in a more comprehensive manner how to work with real-world raster data using xarray
and rioxarray
libraries (+ other relevant libraries linked to them). The xarray
library is a highly useful tool for storing, representing and manipulating raster data, while rioxarray
provides various raster processing (GIS) capabilities on top of the xarray
data structures, such as reading and writing several different raster formats and conducting different geocomputational tasks. Under the hood, rioxarray
uses another Python library called rasterio
(that works with N-dimensional numpy
arrays) but the benefit of xarray
and rioxarray
is that they provide easier and more intuitive way to work with raster data layers, in a bit similar manner as working with vector data using geopandas
.
When working with raster data, you typically have various layers that represent different geographical features of the world (e.g. elevation, temperature, precipitation etc.) and this data is possibly captured at different times of the year/day/hour, meaning that you have longitudinal observations from the same area, constituting time series data. More often than not, you need to combine information from these layers to be able to conduct meaningful analysis based on the data, such as do a weather forecast. One of the greatest benefits of xarray
is that you can easily store, combine and analyze all these different layers via a single object, i.e. a Dataset
, as demonstrated in Figure 7.2.
The two fundamental data structures provided by the xarray
library are DataArray
and Dataset
(Figure 7.2). Both of them build upon and extend the strengths of numpy
and pandas
libraries. The DataArray
is a labeled N-dimensional array that is similar to pandas.Series
but works with raster data (stored as numpy
arrays). The Dataset
then again is a multi-dimensional in-memory array database that contains multiple DataArray
objects. In addition to the variables containing the observations of a given phenomena, you also have the x
and y
coordinates of the observations stored in separate layers, as well as metadata providing relevant information about your data, such as Coordinate Reference System and/or time. Thus, a Dataset
containing raster data is very similar to geopandas.GeoDataFrame
and actually various xarray
operations can feel very familiar if you have learned the basics of pandas
and geopandas
covered in Chapters 3 and 6.
Figure 7.2 Key xarray
data structures. Image source: Xarray Community (2024), licensed under Apache 2.0.
Some of the benefits of xarray
include:
A more intuitive and user-friendly interface to work with multidimensional arrays (compared e.g. to
numpy
)The possibility to select and combine data along a dimension across all arrays in a
Dataset
simultaneouslyCompatibility with a large ecosystem of Python libraries that work with arrays / raster data
Tight integration of functionalities from well-known Python data analysis libraries, such as
pandas
,numpy
,matplotlib
, anddask
Reading a file#
In the following, we start by investigating a simple elevation dataset using xarray
that represents a Digital Elevation Model (DEM) of Kilimanjaro area in Tanzania. To read a raster data file (such as GeoTIFF) into xarray
, we can use the .open_dataset()
function. Here, we read a .tif
file directly from a cloud storage space that we have created for this book:
import xarray as xr
# import rioxarray
url = "https://a3s.fi/swift/v1/AUTH_0914d8aff9684df589041a759b549fc2/PythonGIS/elevation/kilimanjaro/ASTGTMV003_S03E036_dem.tif"
data = xr.open_dataset(url, engine="rasterio")
data
<xarray.Dataset> Size: 52MB Dimensions: (band: 1, x: 3601, y: 3601) Coordinates: * band (band) int64 8B 1 * x (x) float64 29kB 36.0 36.0 36.0 36.0 ... 37.0 37.0 37.0 37.0 * y (y) float64 29kB -2.0 -2.0 -2.001 -2.001 ... -2.999 -3.0 -3.0 spatial_ref int64 8B ... Data variables: band_data (band, y, x) float32 52MB ...
type(data)
xarray.core.dataset.Dataset
Now we have read the GeoTIFF file into an xarray.Dataset
data structure which we stored into a variable data
. The Dataset
contains the actual data values for the raster cells, as well as other relevant attribute information related to the data:
Dimensions
show the number of cells of the givenband
, i.e. in our case there are 3061 cells both on thex
andy
axisCoordinates
is a container that contains the actualx
andy
coordinates of the cells, the Coordinate Reference System information stored in thespatial_ref
attribute, and theband
attribute that shows the number of bands in our data.Data variables
contains the actual data values of the cells (e.g. elevations as in our data)
data = data.squeeze("band", drop=True)
data
<xarray.Dataset> Size: 52MB Dimensions: (x: 3601, y: 3601) Coordinates: * x (x) float64 29kB 36.0 36.0 36.0 36.0 ... 37.0 37.0 37.0 37.0 * y (y) float64 29kB -2.0 -2.0 -2.001 -2.001 ... -2.999 -3.0 -3.0 spatial_ref int64 8B ... Data variables: band_data (y, x) float32 52MB ...
data = data.rename({"band_data": "elevation"})
data
<xarray.Dataset> Size: 52MB Dimensions: (x: 3601, y: 3601) Coordinates: * x (x) float64 29kB 36.0 36.0 36.0 36.0 ... 37.0 37.0 37.0 37.0 * y (y) float64 29kB -2.0 -2.0 -2.001 -2.001 ... -2.999 -3.0 -3.0 spatial_ref int64 8B ... Data variables: elevation (y, x) float32 52MB ...
data.data_vars
Data variables:
elevation (y, x) float32 52MB ...
data["elevation"].plot()
<matplotlib.collections.QuadMesh at 0x16ac8f680>
Dataset properties#
Let’s have a closer look at the properties of the file:
data.spatial_ref.attrs
{'crs_wkt': 'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]',
'semi_major_axis': 6378137.0,
'semi_minor_axis': 6356752.314245179,
'inverse_flattening': 298.257223563,
'reference_ellipsoid_name': 'WGS 84',
'longitude_of_prime_meridian': 0.0,
'prime_meridian_name': 'Greenwich',
'geographic_crs_name': 'WGS 84',
'horizontal_datum_name': 'World Geodetic System 1984',
'grid_mapping_name': 'latitude_longitude',
'spatial_ref': 'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]',
'GeoTransform': '35.9998611111111 0.000277777777777778 0.0 -1.99986111111111 0.0 -0.000277777777777778'}
data.rio.crs
CRS.from_epsg(4326)
data.rio.crs.to_epsg()
4326
data.rio.crs.to_wkt()
'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]'
# Resolution
data.rio.resolution()
(0.000277777777777778, -0.000277777777777778)
# Affine transform (how raster is scaled, rotated, skewed, and/or translated)
data.rio.transform()
Affine(0.000277777777777778, 0.0, 35.9998611111111,
0.0, -0.000277777777777778, -1.99986111111111)
# Dimensions
print(data.rio.shape)
print(data.rio.width)
print(data.rio.height)
(3601, 3601)
3601
3601
# Number of bands
# data.rio.count
# Bounds of the file
data.rio.bounds()
(35.9998611111111, -3.000138888888889, 37.000138888888884, -1.99986111111111)
# No data values for all channels
data.rio.vars
['band_data']
Writing a file#
Add material about writing to most common raster file formats.