Data classification

Data classification#

Data classification is a common task in geospatial data analysis that determines the assignment of values to distinct classes. Classifying original values into categories may help simplify the data for further analysis or communicating the results. Data classification is central when visualizing geographic information to correctly represent the distribution of the data.

Here, we will get familiar with classification schemes from the PySAL [1] mapclassify library [2] that is intended to be used when visualizing thematic maps. Further details of geographic data visualization will be covered in chapter 8. We will also learn how to classify data values based on pre-defined threshold values and conditional statements directly in geopandas.

Our sample data is an extract from the Helsinki Region Travel Time Matrix (Tenkanen and Toivonen, 2020) that represents travel times to the central railway station across 250 m x 250 m statistical grid squares covering the Helsinki region. Let’s read in the data and check the first rows of data:

from pathlib import Path
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt

data_dir = Path("data")
grid_fp = data_dir / "Helsinki" / "TravelTimes_to_5975375_RailwayStation.shp"

# Read files
grid = gpd.read_file(grid_fp)
grid.head()

	car_m_d	car_m_t	car_r_d	car_r_t	from_id	pt_m_d	pt_m_t	pt_m_tt	pt_r_d	pt_r_t	pt_r_tt	to_id	walk_d	walk_t	geometry
0	32297	43	32260	48	5785640	32616	116	147	32616	108	139	5975375	32164	459	POLYGON ((382000 6697750, 381750 6697750, 3817...
1	32508	43	32471	49	5785641	32822	119	145	32822	111	133	5975375	29547	422	POLYGON ((382250 6697750, 382000 6697750, 3820...
2	30133	50	31872	56	5785642	32940	121	146	32940	113	133	5975375	29626	423	POLYGON ((382500 6697750, 382250 6697750, 3822...
3	32690	54	34429	60	5785643	33233	125	150	33233	117	144	5975375	29919	427	POLYGON ((382750 6697750, 382500 6697750, 3825...
4	31872	42	31834	48	5787544	32127	109	126	32127	101	121	5975375	31674	452	POLYGON ((381250 6697500, 381000 6697500, 3810...

Detailed column descriptions are available in Table 3, Tenkanen and Toivonen, 2020. We will use column 'pt_r_t' which contains information about travel time in minutes to the central railway station by public transportation in rush hour traffic. Missing data are presented with value -1. Let’s set the missing values as NaN to exclude no data from further analysis:

grid = grid.replace(-1, np.nan)

Classification schemes#

We will now learn how to use mapclassifyto assing the data vaules into distinct classes. Mapclassify allows applying various classification schemes on our data that partition the attribute values into mutually exclusive groups. Choosing an adequate classification scheme and number of classes depends on the message we want to convey with our map and the underlying distribution of the data. Available classification schemes include:

box_plot
equal_interval
fisher_jenks
fisher_jenks_sampled
headtail_breaks
jenks_caspall
jenks_caspall_forced
jenks_caspall_sampled
max_p_classifier
maximum_breaks
natural_breaks
quantiles
percentiles
std_mean
user_defined

See Rey et al., 2023 for a thorough introduction on the mathematics behind each classification scheme. These classification schemes can be used directly when plotting data in geopandas as long as mapclassify package is also installed.

Choosing a classification scheme#

Let’s have a look at the distribution of the public transport travel times through checking the histogram and descriptive statistics. A histogram is a graphic representation of the distribution of the data. Descriptive statistics summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values. While looking at the histogram, remember that each observation is one 250 meter x 250 meter grid square in the Helsinki region and the histogram shows the distribution of travel times to the central railway station across the whole region.

For exploring the different classification schemes, let’s create a pandas Series without NaN values.

# Creating a data Series withouth NaN values
travel_times = grid.loc[grid["pt_r_t"].notnull(), "pt_r_t"]

# Plot a histogram
travel_times.plot.hist(bins=50, color="lightgray")

<Axes: ylabel='Frequency'>

../../../_images/26c76ecf7649156e90a78607af8834678faf35521fc13a554c23558b59ef6f1a.png

Figure 6.60. Histogram of the travel time values. Data source: Tenkanen & Toivonen 2020.

travel_times.describe()

count    13020.000000
mean        53.124654
std         21.295944
min          0.000000
25%         38.000000
50%         49.000000
75%         65.000000
max        181.000000
Name: pt_r_t, dtype: float64

The maximum travel time to the central railway station by public transport (including time for walking) is 181 minutes, i.e. over three hours. Most of the travel times range between 38 and 65 minutes with an average travel time of 53 minutes. Looking at the histogram (Figure 8.6), we can tell than only a handful of grid squares have more than two hour travel times to the central railway station. These grid squares are most likely located in rather inaccessible places in terms of public transport accessibility.

Let’s have a closer look at how these mapclassify classifiers work and try out different classification schemes for visualizing the public transport traveltimes. In the interactive version of this book, you can try out different numbers of classes and different classification schemes.

Natural breaks#

First, let’s try out natural breaks classifier that tries to split the values into natural clusters. The number of observations per bin may vary according to the distribution of the data.

import mapclassify

mapclassify.NaturalBreaks(y=travel_times, k=10)

NaturalBreaks

    Interval       Count
------------------------
[  0.00,  24.00] |   604
( 24.00,  34.00] |  1689
( 34.00,  42.00] |  2360
( 42.00,  49.00] |  1885
( 49.00,  58.00] |  1978
( 58.00,  68.00] |  1718
( 68.00,  80.00] |  1412
( 80.00,  94.00] |   756
( 94.00, 113.00] |   456
(113.00, 181.00] |   162

It’s possible to extract the threshold values into an array:

mapclassify.NaturalBreaks(y=travel_times, k=10).bins

array([ 22.,  33.,  40.,  48.,  57.,  66.,  76.,  90., 109., 181.])

We can further explore the classification on top of the histogram:

import matplotlib.pyplot as plt

# Define classifier
classifier = mapclassify.NaturalBreaks(y=travel_times, k=10)

# Plot histogram for public transport rush hour travel time
grid["pt_r_t"].plot.hist(bins=50, color="lightgray", title="Natural Breaks")

# Add vertical lines for class breaks
for break_point in classifier.bins:
    plt.axvline(break_point, linestyle="dashed", linewidth=1)

../../../_images/90572554bb7dd213849ccd64352d16d529ace47301d73b0a2fdbb2c428fcbaab.png

Figure 6.61. Histogram of the travel time values with natural breaks classification into 10 groups. Data source: Tenkanen & Toivonen 2020.

We can apply the classifier on our data and store the result in a new column.

# Classify the data
grid["pt_r_t_nb"] = grid[["pt_r_t"]].apply(classifier)
grid["pt_r_t_nb"].head()

  8
  8
  8
  9
  8
Name: pt_r_t_nb, dtype: int64

Finally, we can visualize our data using the classification scheme when plotting the data in geopandas through adding the scheme option, while the parameter k defines the number of classess to use.

# Plot the data using natural breaks
ax = grid.plot(
    figsize=(6, 4),
    column="pt_r_t",
    linewidth=0,
    scheme="natural_breaks",
    k=10,
)

# Set the x and y axis off and adjust padding around the subplot
plt.axis("off")
plt.tight_layout()

../../../_images/217680b32e40b4783f42f405063910fe9adec2faa45e86fb98568d901841b55f.png

Figure 6.62. Travel times visualized using the natural breaks classification scheme. Data source: Tenkanen & Toivonen 2020.

Quantiles#

Next, let’s explore the quantiles classification that splits the data so that each class has an equal number of observations.

mapclassify.Quantiles(y=travel_times, k=10)

Quantiles

    Interval       Count
------------------------
[  0.00,  30.00] |  1406
( 30.00,  36.00] |  1436
( 36.00,  40.00] |  1242
( 40.00,  44.00] |  1135
( 44.00,  49.00] |  1319
( 49.00,  55.00] |  1392
( 55.00,  62.00] |  1342
( 62.00,  70.00] |  1265
( 70.00,  81.00] |  1190
( 81.00, 181.00] |  1293

Notice that the numerical range of the groups created using the quantiles classification scheme may vary greatly depending on the distribution of the data. In our example, some classes have more than 30 min interval, while others less than 10 minutes. The default number of classes is five (quintiles), but you can set the desired number of classes using the k parameter. In the interactive version of the book, you can try changing the number of classes and see what happens to the class intervals; more classes get added around the central peak of the histogram if increasing the number of classes.

import matplotlib.pyplot as plt

# Define classifier
classifier = mapclassify.Quantiles(y=travel_times, k=10)

# Plot histogram for public transport rush hour travel time
grid["pt_r_t"].plot.hist(bins=50, color="lightgray", title="Quantiles")

# Add vertical lines for class breaks
for break_point in classifier.bins:
    plt.axvline(break_point, linestyle="dashed", linewidth=1)

../../../_images/3c75c259c442ab7a0b716c8ffadd6035c0b43ae442f8f11688a3907c63c0059e.png

Figure 6.63. Histogram of the travel time values with Quantile classification into 10 groups. Data source: Tenkanen & Toivonen 2020.

If comparing the histograms of natural breaks and quantile classifications, we can observe that natural breaks might work better to display differences in the data values across the whole data range, while quantiles would help distinguishing differences around the central peak of the data distribution. However, neither of the classification schemes display differences in short, less than 25 minute travel times which might be important for making an informative map. Also, we might want to have round numbers for our class values to facilitate quick and intuitive interpretation.

# Plot the data using quantiles
ax = grid.plot(
    figsize=(6, 4),
    column="pt_r_t",
    linewidth=0,
    scheme="quantiles",
    k=10,
    legend=True,
    legend_kwds={"title": "Travel times (min)", "bbox_to_anchor": (1.4, 1)},
)

# Set the x and y axis off and adjust padding around the subplot
plt.axis("off")
plt.tight_layout()

../../../_images/1fc43314bc3017192300c379ea083ec92ba7ef42235514c3e899a3a684c2d178.png

Figure 6.64. Static map of travel times visualized using the quantiles classification scheme. Data source: Tenkanen & Toivonen 2020.

# Store the class index numbers
grid["pt_r_t_q10"] = grid[["pt_r_t"]].apply(classifier)
grid["pt_r_t_q10"].head()

  9
  9
  9
  9
  9
Name: pt_r_t_q10, dtype: int64

The quantile classification allows us to extract, for example the best 10 % of all grid squares in terms of travel times to the central railway station. Now that we divided the data into quintiles, we can get the top 10 % of the data through extracting the first category of our classified values.

grid[grid["pt_r_t_q10"] == 0].explore()

Make this Notebook Trusted to load map: File -> Trust Notebook

Figure 6.65. Top 10 % out of all statistical grid squares in the Helsinki Region in terms of public transport travel times to the Helsinki. Data source: Tenkanen & Toivonen 2020.

Pretty breaks#

The pretty breaks classification shceme rounds the class break values and divides the range equally to create intervals that look nice and that are easy to read. This classification scheme might be tempting to use as it creates intuitive and visually appealing intervals. However, depending on the distribution of the data, the group sizes might vary greatly which might lead to misleading visualizations.

mapclassify.PrettyBreaks(y=travel_times, k=10)

Pretty

    Interval       Count
------------------------
[  0.00,  20.00] |   335
( 20.00,  40.00] |  3749
( 40.00,  60.00] |  4822
( 60.00,  80.00] |  2740
( 80.00, 100.00] |   933
(100.00, 120.00] |   351
(120.00, 140.00] |    84
(140.00, 160.00] |     0
(160.00, 180.00] |     5
(180.00, 200.00] |     1

import matplotlib.pyplot as plt

# Define classifier
classifier = mapclassify.PrettyBreaks(y=travel_times, k=10)

# Plot histogram for public transport rush hour travel time
grid["pt_r_t"].plot.hist(bins=50, color="lightgray", title="Pretty breaks")

# Add vertical lines for class breaks
for break_point in classifier.bins:
    plt.axvline(break_point, linestyle="dashed", linewidth=1)

../../../_images/d4a6fc64710a0fb1ef3edbc7c288e9626c57e52365e59774bcd4decc7988a6e6.png

Figure 6.66. Histogram of the travel time values with 10 pretty breaks. Data source: Tenkanen & Toivonen 2020.

# Plot the data using pretty breaks
ax = grid.plot(
    figsize=(6, 4),
    column="pt_r_t",
    linewidth=0,
    scheme="prettybreaks",
    k=10,
    legend=True,
    legend_kwds={"title": "Travel times (min)", "bbox_to_anchor": (1.4, 1)},
)

# Set the x and y axis off and adjust padding around the subplot
plt.axis("off")
plt.tight_layout()

../../../_images/9b9ed87e56012a2e1c40b34bcc1c5c91dd13947d677fa7783c4214164b838395.png

Figure 6.67. Static map of travel times visualized using the pretty breaks classification scheme. Data source: Tenkanen & Toivonen 2020.

Regardless of the number of classes, pretty breaks is not ideal for our data as it fails to capture the variation in the data. Compared to this map, the previous two versions using natural breaks and quantiles provide a more informative view of the travel times.

Custom map classification#

In case none of the existing classification schemes produce a desired output, we can also create a custom classification scheme using mapclassify and select which class interval values to use. Fixed intervals with gradually increasing travel times provide an intuitive way to display travel time data. While the pretty breaks classification scheme follows this approach, it didn’t work perfectly for our data. With our own classification scheme, we can show differences among the typical travel times, but avoid having classes distinguishing between long travel times. We’ll create a custom classifier with fixed 10-minute intervals up to 90 minutes to achieve this.

break_values = [10, 20, 30, 40, 50, 60, 70, 80, 90]
classifier = mapclassify.UserDefined(y=travel_times, bins=break_values)
classifier

UserDefined

    Interval       Count
------------------------
[  0.00,  10.00] |    49
( 10.00,  20.00] |   286
( 20.00,  30.00] |  1071
( 30.00,  40.00] |  2678
( 40.00,  50.00] |  2697
( 50.00,  60.00] |  2125
( 60.00,  70.00] |  1631
( 70.00,  80.00] |  1109
( 80.00,  90.00] |   603
( 90.00, 181.00] |   771

import matplotlib.pyplot as plt

# Define classifier
classifier = mapclassify.UserDefined(y=travel_times, bins=break_values)

# Plot histogram for public transport rush hour travel time
grid["pt_r_t"].plot.hist(bins=50, title="User defined classes", color="lightgray")

# Add vertical lines for class breaks
for break_point in classifier.bins:
    plt.axvline(break_point, linestyle="dashed", linewidth=1)

../../../_images/392c2782fdb1f2223f138bbc433eea32f839beccadf08c621e58b6d44e583c57.png

Figure 6.68. Histogram of the travel time values with user defined class breaks. Data source: Tenkanen & Toivonen 2020.

When plotting the map, we can pass the break values using classification_kwds.

# Create one subplot. Control figure size in here.
fig, ax = plt.subplots(figsize=(6, 4))

grid.plot(
    ax=ax,
    column="pt_r_t",
    linewidth=0,
    scheme="UserDefined",
    classification_kwds={"bins": break_values},
    legend=True,
    legend_kwds={"title": "Travel times (min)", "bbox_to_anchor": (1.4, 1)},
)

# Set the x and y axis off and adjust padding around the subplot
plt.axis("off")
plt.tight_layout()

../../../_images/1931a6112e2e29361d25e0a3e89c7fa6bdbebfce949cc4aa5a40d4233a18c15c.png

Figure 6.69. Static map of travel times by car and public transport using a custom classification scheme. Data source: Tenkanen & Toivonen 2020.

Question 6.14#

Select another column from the data (for example, travel times by car: car_r_t) and visualize a thematic map using our custom classification scheme.

# Solution

# Create one subplot. Control figure size in here.
fig, ax = plt.subplots(figsize=(6, 4))

# Visualize the travel times using a classification scheme and add a legend
grid.plot(
    ax=ax,
    column="car_r_t",
    linewidth=0,
    scheme="UserDefined",
    classification_kwds={"bins": break_values},
    legend=True,
    legend_kwds={"title": "Travel times (min)", "bbox_to_anchor": (1.4, 1)},
)

# Set the x and y axis off and adjust padding around the subplot
plt.axis("off")
plt.tight_layout()

../../../_images/f3966a0ccc3d3682c16d2b8f0bf908efdf29e3f4271af4012caffeab1274675c.png

Rule-based classification#

Sometimes our analysis task benefits from combining multiple criteria for classifying data. For example, we might want to find out locations that are outside the city center within a reasonable public transport travel time. Such a selection could help us classify the statistical grid squares based on the potential for finding apartments with good public transport connections while avoiding the most expensive areas in the city center.

To implement this, we can use conditional statements to find grid squares where public transport travel time (column pt_r_tt) is less than a selected threshold value in minutes, and where walking distance (walk_d) is more than a selected threshold value in meters. Each rule will give a binary result (True/False) and we can further combine these rules to find those locations that meet both requirements.

# Threhsold values
pt_maximum = 30
walk_minimum = 2500

grid["pt_r_tt"] < pt_maximum

      False
      False
      False
      False
      False
         ...  
  False
  False
  False
  False
  False
Name: pt_r_tt, Length: 13231, dtype: bool

grid["walk_d"] > walk_minimum

       True
       True
       True
       True
       True
         ...  
   True
   True
  False
   True
   True
Name: walk_d, Length: 13231, dtype: bool

We can then use our pandas skills to combine these rules. Notice that you need parentheses around each set of condition.

grid["rule1"] = (grid["pt_r_tt"] < pt_maximum) & (grid["walk_d"] > walk_minimum)

Finally, now that we have our rule-based classification stored in one of our GeoDataFramecolumns, we can use this information to visualize the areas that meet our criteria:

grid.loc[grid["rule1"] == True].explore()

Make this Notebook Trusted to load map: File -> Trust Notebook

Figure 6.70. Grid squares that meet the selection criteria.

Question 6.15#

Change the threshold values above and see how the map changes!