Vector overlay operations#

Map overlay is probably the oldest analysis technique in GIS to understand relations between layers of geographic information. Already in the mid 19th century, John Snow famously identified the source of cholera outbreak in London to a specific public water pump by overlaying two point maps on top of each other (Unwin, 1996). Map overlay has been an important analytical tool for very long time, as being able to combine and analyse multiple layers of spatial data together is crucial for various spatial planning processes, including e.g. suitability analyses, environment impact assessment, or when doing zoning and landuse plans (e.g. Keeble, 1952; Masser and Ottens, 1999; Steinitz et al., 1976).

Overlay operations can be done both with vector and raster data. Vector overlay operations are commonly used technique to produce new geometries and associated attribute data based on two or more vector layers. When working with multiple spatial datasets (especially polygon or line layers), you might need to create new shapes based on places where the layers overlap (or do not overlap) with each other. Typical overlay operations include union, intersection, and difference (Figure 6.52). These are named after the result of the combination of two or more input layers which produce at least one (or more) output layer. Being able to combine spatial data layers like this is an important feature in most GIS tools. These manipulations are also often called as set operations.

The basic idea of vector overlay operations is demonstrated in Figure 6.52 where the green areas represent the areas which constitute the result after the overlay operation. It is good to keep in mind that overlays operate at the GeoDataFrame level, not on individual geometries, and the properties from both are retained (often, not always). In effect, for every shape in the left GeoDataFrame, this operation is executed against every other shape in the right GeoDataFrame

_Figure 6.52. Typical vector overlay operations between two geographic layers (circle and rectangles). _

Figure 6.52. Typical vector overlay operations between two geographic layers (circle and rectangles).

To demonstrate how these overlay operations work in practice, we will carry out vector overlay operations between two Polygon datasets that represent i) the postal code areas in the Helsinki city center and ii) a 3000 meter buffer around the Helsinki railway center. Let’s start by reading the datasets and prepare them for analysis:

import geopandas as gpd

postal_areas = gpd.read_file("data/Helsinki/Helsinki_centre_postal_areas.gpkg")
railway_station = gpd.read_file("data/Helsinki/Helsinki_railway_station.gpkg")

# Check the data
postal_areas.head()
posti_alue he_vakiy index_right density geometry
0 00100 18284.0 0 7769.586237 MULTIPOLYGON (((385653.893 6671591.048, 385573...
1 00120 7108.0 0 17168.685515 MULTIPOLYGON (((385316.092 6671076.984, 385279...
2 00130 1508.0 0 3515.476578 MULTIPOLYGON (((386212.111 6671061.262, 386176...
3 00140 7865.0 0 8440.281536 MULTIPOLYGON (((386577.05 6670280.544, 386552....
4 00150 9496.0 0 6944.933965 MULTIPOLYGON (((384846.102 6669565.816, 384823...
postal_areas.shape
(30, 5)
railway_station.head()
name id geometry
0 Helsinki Railway station 0 POINT (385738.777 6672297.759)

From here we can see that the postal_areas include MultiPolygon geometries representing altogether thirty postal code areas, whereas the railway_station represents a single Point for the Helsinki Railway station. As vector overlay operation happens between two geographic datasets, it is necessary to ensure that they both share the same Coordinate Reference System. Hence, let’s first check that the .crs are matching using the Python’s assert statement that tests whether the condition is True. If the test fails, the assert will throw an AssertionError with the message that we provide as text like in the following:

assert postal_areas.crs == railway_station.crs, "The CRS does not match!"

Great, the CRS matches between the layers. Hence, let’s continue and create a 3 kilometer buffer around the Helsinki railway station which we will use in our vector overlay operations:

station_buffer = railway_station.copy()
station_buffer["geometry"] = station_buffer.buffer(3000)

Here, we first created a copy of the original GeoDataFrame and then used the .buffer() method to create a Polygon circle with 3000 meter radius. Let’s visualize the data on a map so that we can get a better understanding of the two layers and how they overlap with each other:

m = postal_areas.explore(tiles="CartoDB Positron")
m = station_buffer.explore(m=m, color="red")
m
Make this Notebook Trusted to load map: File -> Trust Notebook

Figure 6.53. A sample of postal code areas in the Helsinki city centre and a 3km buffer around Helsinki railway station.

Intersection#

We are now ready to conduct an overlay analysis between these layers. We will create a new layer based postal code polygons that intersect with our Helsinki layer. We can use a method called .overlay() to conduct the overlay analysis between the given GeoDataFrame (postal_areas) and a second GeoDataFrame (station_buffer). With parameter how we can control how the overlay analysis is conducted. Possible values are 'intersection', 'union', 'symmetric_difference', 'difference', and 'identity'. Let’s start by doing an overlay using "intersection" as the overlay operation:

# Intersection
intersection = postal_areas.overlay(station_buffer, how="intersection")
intersection.head()
posti_alue he_vakiy index_right density name id geometry
0 00100 18284.0 0 7769.586237 Helsinki Railway station 0 POLYGON ((385573.62 6671699.801, 385612.659 66...
1 00120 7108.0 0 17168.685515 Helsinki Railway station 0 POLYGON ((385279.612 6671123.567, 385255.478 6...
2 00130 1508.0 0 3515.476578 Helsinki Railway station 0 POLYGON ((386176.826 6671111.888, 386039.154 6...
3 00140 7865.0 0 8440.281536 Helsinki Railway station 0 MULTIPOLYGON (((386552.849 6670283.51, 386528....
4 00150 9496.0 0 6944.933965 Helsinki Railway station 0 MULTIPOLYGON (((384823.503 6669566.95, 384799....
intersection.shape
(23, 7)

As a result we got a new GeoDataFrame that includes 23 postal code areas that intersected with the station_buffer. As we can see, due to the overlay operation, the dataset contains the attributes from both input layers, i.e. it works in a bit similar manner as sjoin() demonstrated in Chapter 6.7. To make it easier to understand how different vector overlay operations work, let’s create an easy helper function called plot_vector_overlay() that creates a comparison map based on the results before and after the overlay operation. To do this, we use matplotlib library and create a subplot with 2 separate plots:

import matplotlib.pyplot as plt


def plot_vector_overlay(gdf1, gdf2, result, title):
    """
    Creates two maps next to each other based on `gdf1`, `gdf2` and the
    `result` GeoDataFrames.
    """

    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(8, 5))

    ax1 = gdf1.plot(ax=ax1)
    ax1 = gdf2.plot(ax=ax1, color="red", alpha=0.3)

    result.plot(ax=ax2)

    # Fetch bounds and apply to axis 2
    xmin, ymin, xmax, ymax = gdf1.total_bounds

    ax2.set_xlim(xmin, xmax)
    ax2.set_ylim(ymin, ymax)

    fig.suptitle(title, fontsize=16)
    # Add an arrow between the plots
    fig.text(0.49, 0.5, "⇨", fontsize=30, color="red")
    ax1.axis("off")
    ax2.axis("off")
    plt.tight_layout()
    return fig, ax1, ax2

Now we can call this function to create a visualizaton that demonstrates how the intersection overlay operations behaves:

fig, ax1, ax2 = plot_vector_overlay(
    gdf1=postal_areas, gdf2=station_buffer, result=intersection, title="Intersection"
)
../../../_images/ab9ce126588f859853125ad41359b48845ebc9b7a49367e72095f0add86cf90c.png

Figure 6.54. Result after conducting vector overlay operation by intersecting the two layers.

As we can see, the "intersection" overlay operation keeps the postal code areas that intersect with the circle and keeps all those geometries in the result. Important thing to notice is that with overlay() the intersecting GeoDataFrame (station_buffer) will also modify the input geometries by cutting them in the border areas where they cross. This is one of the key differences between .overlay() and sjoin() methods as sjoin() will not modify the input geometries. As mentioned earlier, attribute data from both GeoDataFrames are kept from the features that are part of the result. In the following, we will show one-by-one, how different overlay operations (i.e. union, difference, symmetric difference, identity) influence the results.

Union#

In a similar manner as with intersection, we can conduct overlay operation using "union" as below:

# Union
union = postal_areas.overlay(station_buffer, how="union")

fig, ax1, ax2 = plot_vector_overlay(
    gdf1=postal_areas, gdf2=station_buffer, result=union, title="Union"
)
../../../_images/7fafbb0c3a6b69c651097715cae187977eef1824d645591f9728d3cf4d31043e.png

Figure 6.55. Result after conducting vector overlay operation based on union.

union.shape
(42, 7)

When using "union" overlay operation, the geometries from both GeoDataFrames are kept in the result. As you can see, the number of rows has increased quite significantly from 30 to 42 rows. This happens because the postal code geometries are again modified by the station_buffer in the areas where the geometries cross each other: the postal code geometry is splitted in two in areas where the buffer geometry crosses the postal code geometry. Hence, this will increase the number of rows in the final output.

Question 6.13#

Did you fully understand why the number of rows increase after doing the union overlay? If not, use the .explore() and investigate the geometries in areas where the postal code areas and the ring (border) of the buffer geometry cross each other. When you hover over the border, what happens with the attribute values?

Hide code cell content
# Answer
print(
    """
When hovering over the buffer geometry border, the attribute values in the table change. 
Inside the ring, the attributes of the Helsinki railway station are kept in the results, whereas outside of the ring 
the table does not include any data for the columns associated with the railway station.
"""
)

# Solution
union.explore()
When hovering over the buffer geometry border, the attribute values in the table change. 
Inside the ring, the attributes of the Helsinki railway station are kept in the results, whereas outside of the ring 
the table does not include any data for the columns associated with the railway station.
Make this Notebook Trusted to load map: File -> Trust Notebook

Difference and symmetric difference#

Sometimes it might be useful to focus extract geometries that area outside a given layer. This can be achieved by using the .overlay() with "difference" operator:

# Difference
difference = postal_areas.overlay(station_buffer, how="difference")

fig, ax1, ax2 = plot_vector_overlay(
    gdf1=postal_areas, gdf2=station_buffer, result=difference, title="Difference"
)
../../../_images/05f86fdc147f176e5bedf380c18d0776aaa32a8a970eb1d0652860ec43417b4f.png

Figure 6.56. Result after conducting vector overlay operation based on difference.

difference.columns.values
array(['posti_alue', 'he_vakiy', 'index_right', 'density', 'geometry'],
      dtype=object)
difference.shape
(18, 5)

As can be seen from the results above, the "difference" will keep the postal code geometries that are outside of the buffer geometry (note we output difference.columns.values here to output only the column names). In terms of attributes, only the columns that are present in the source GeoDataFrame (i.e. postal_areas) are kept.

The symmetric difference overlay operation is an interesting one. It will keep the geometries and attributes outside of the station_buffer layer, as well as create a geometry within the station_buffer that includes areas that are within the station_buffer ring but outside the postal_areas GeoDataFrame. I.e. in our case, it mostly contains water areas that surround the Helsinki city centre as shown below:

# Symmetric Difference
symmetric_difference = postal_areas.overlay(station_buffer, how="symmetric_difference")

fig, ax1, ax2 = plot_vector_overlay(
    gdf1=postal_areas,
    gdf2=station_buffer,
    result=symmetric_difference,
    title="Symmetric Difference",
)
../../../_images/15c8bc4a5a645cd1965f57f8a42ce57654f8d5819ed30a26c317a0d24b4b48f0.png

Figure 6.57. Result after conducting vector overlay operation based on symmetric difference.

symmetric_difference.columns
Index(['posti_alue', 'he_vakiy', 'index_right', 'density', 'name', 'id',
       'geometry'],
      dtype='object')
symmetric_difference.shape
(19, 7)
symmetric_difference.tail()
posti_alue he_vakiy index_right density name id geometry
14 00520 7306.0 0.0 5609.742526 NaN NaN POLYGON ((385333.465 6675315.885, 385335.052 6...
15 00550 9464.0 0.0 9829.492877 NaN NaN POLYGON ((386285.241 6675258.417, 386164.283 6...
16 00570 3962.0 0.0 2205.349757 NaN NaN MULTIPOLYGON (((389839.918 6673030.501, 389808...
17 00580 2727.0 0.0 2109.697366 NaN NaN POLYGON ((387366.285 6674940.194, 387550.714 6...
18 NaN NaN NaN NaN Helsinki Railway station 0.0 MULTIPOLYGON (((383835.597 6669978.727, 383617...

As can be seen from above, the table includes now the attributes from the postal_areas GeoDataFrame as well as the attributes from station_buffer in the last row of the resulting GeoDataFrame.

Identity#

As a last overlay operation, we have the "identity" which computes a geometric intersection of the input features and identity features. The input features or portions thereof that overlap identity features will get the attributes of those identity features. The basic idea is very similar to "union" but in this case, the areas outside of postal_areas GeoDataFrame will not be filled with the station_buffer geometry as demonstrated below:

# Identity
identity = postal_areas.overlay(station_buffer, how="identity")

fig, ax1, ax2 = plot_vector_overlay(
    gdf1=postal_areas, gdf2=station_buffer, result=identity, title="Identity"
)
../../../_images/d27719a6dc1c1672bf87852c96dcadd4fc466292936fea9f9fcc3394f09165a8.png

Figure 6.58. Result after conducting vector overlay operation based on identity.

identity.columns
Index(['posti_alue', 'he_vakiy', 'index_right', 'density', 'name', 'id',
       'geometry'],
      dtype='object')
identity.shape
(41, 7)
identity.loc[20:25]
posti_alue he_vakiy index_right density name id geometry
20 00550 9464.0 0.0 9829.492877 Helsinki Railway station 0.0 POLYGON ((387057.865 6674284.968, 387045.245 6...
21 00570 3962.0 0.0 2205.349757 Helsinki Railway station 0.0 MULTIPOLYGON (((388536.474 6672959.818, 388512...
22 00580 2727.0 0.0 2109.697366 Helsinki Railway station 0.0 POLYGON ((387749.472 6673979.696, 387491.215 6...
23 00140 7865.0 0.0 8440.281536 NaN NaN POLYGON ((387352.067 6664714.928, 387339.366 6...
24 00150 9496.0 0.0 6944.933965 NaN NaN POLYGON ((384180.692 6668373.742, 384154.302 6...
25 00190 743.0 0.0 971.226183 NaN NaN MULTIPOLYGON (((388259.682 6668460.022, 388233...

As can be seen from the results above, the output now includes attribute information from both GeoDataFrames and the geometries in the postal_areas are split into multiple parts in places where the station_buffer cuts them. However, the geometry of the station_buffer itself is not included at all in the results.