Common tabular operations in pandas#
We have now learned the basics of pandas data structures (i.e., Series and DataFrame) and should be familiar with some methods for loading and exploring pandas data. Next, we will continue exploring the pandas data analysis functionalities and see how they can be used for manipulating data, conducting simple calculations, and making selections based on specific criteria.
Basic calculations#
One of the most common things to do in pandas
is to create new columns based on calculations between different variables (columns). Next, we will learn how to do this using the same input data (data/kumpula-summer-2024.txt
) as in the previous section. We will first load it using the pd.read_csv() function. Remember that the first eight lines contain metadata, which we will skip. This time, let's store the file path in a separate variable to make the code more readable and easier to change later (a good practice).
import pandas as pd
# Define file path:
fp = "data/kumpula-summer-2024.txt"
# Read in the data from the file (starting at row 9):
data = pd.read_csv(fp, skiprows=8)
As a first step, it is always good to remember to check the data after reading it. This way we can be sure that everything looks as it should.
data.head()
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 |
---|---|---|---|---|---|
0 | 20240601 | 26.2 | 18.4 | 24.2 | 23.2 |
1 | 20240602 | 24.4 | 17.1 | 20.7 | 24.2 |
2 | 20240603 | 25.7 | 15.2 | 22.5 | 23.1 |
3 | 20240604 | 24.6 | 15.7 | 19.3 | 24.6 |
4 | 20240605 | 23.4 | 12.8 | 20.3 | 23.2 |
Everything seems to be OK. Now we can start by creating a new column DIFF
in our DataFrame
. This can be done by specifying the name of the column and giving it some default value (in this case the decimal number 0.0
).
data["DIFF"] = 0.0
data.head()
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF |
---|---|---|---|---|---|---|
0 | 20240601 | 26.2 | 18.4 | 24.2 | 23.2 | 0.0 |
1 | 20240602 | 24.4 | 17.1 | 20.7 | 24.2 | 0.0 |
2 | 20240603 | 25.7 | 15.2 | 22.5 | 23.1 | 0.0 |
3 | 20240604 | 24.6 | 15.7 | 19.3 | 24.6 | 0.0 |
4 | 20240605 | 23.4 | 12.8 | 20.3 | 23.2 | 0.0 |
As we can see, now we have a new column DIFF
in our DataFrame
that has value 0.0
for all rows. When creating a new column, you can initialize it with any value you want. Typically, the value could be a number (0.0
as we use here), but it could also be None
(i.e., nothing), some text (e.g., "test text"
), or more or less any other value or object that can be represented as a single item. You could even initialize the column by storing a function inside the cells if you like.
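For example, here is a quick illustration of initializing columns with other default values (the NOTE and EMPTY column names are hypothetical, and we drop them again right away so they do not affect the rest of the examples):

# Initialize hypothetical columns with other default values
data["NOTE"] = "test text"  # the same string on every row
data["EMPTY"] = None  # a missing value on every row
# Drop the illustration columns so they do not affect later steps
data = data.drop(columns=["NOTE", "EMPTY"])

Let's continue by checking the data type of our new column.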
data["DIFF"].dtypes
dtype('float64')
As we can see, pandas
created a new column and automatically recognized that the data type is float64
as we passed a 0.0
value to it. Great, but what about making those calculations with pandas
as promised at the start of this section? Next, we will calculate the difference between the MAX
and MIN
columns to get an idea of how much the temperatures have varied from day to day. The result will be stored in the DIFF column that we created earlier. A typical way of performing calculations such as this is to access the Series (columns) of interest from the DataFrame and perform the mathematical calculation using the selected columns. Typically, you store the result directly in a column of the DataFrame, as shown below.
data["DIFF"] = data["MAX"] - data["MIN"]
data.head()
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF |
---|---|---|---|---|---|---|
0 | 20240601 | 26.2 | 18.4 | 24.2 | 23.2 | 7.8 |
1 | 20240602 | 24.4 | 17.1 | 20.7 | 24.2 | 7.3 |
2 | 20240603 | 25.7 | 15.2 | 22.5 | 23.1 | 10.5 |
3 | 20240604 | 24.6 | 15.7 | 19.3 | 24.6 | 8.9 |
4 | 20240605 | 23.4 | 12.8 | 20.3 | 23.2 | 10.6 |
As expected, the output from the calculations was stored into the DIFF
column. Conducting calculations like this is extremely fast in pandas
because the math operations happen in a vectorized manner. This means that instead of looping over individual values of the DataFrame
and comparing them to each other individually, calculating the difference happens simultaneously at all rows. You can also create new columns on the fly when doing the calculation (i.e., the column does not have to exist beforehand). Furthermore, it is possible to use any kind of math algebra (e.g., subtraction, addition, multiplication, division, exponentiation, etc.) when creating new columns. We can, for example, calculate another estimate of the average daily temperature by averaging the values in the TEMP1
and TEMP2
columns and storing the result as TEMP
.
data["TEMP"] = (data["TEMP1"] + data["TEMP2"]) / 2
data.head()
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP |
---|---|---|---|---|---|---|---|
0 | 20240601 | 26.2 | 18.4 | 24.2 | 23.2 | 7.8 | 23.70 |
1 | 20240602 | 24.4 | 17.1 | 20.7 | 24.2 | 7.3 | 22.45 |
2 | 20240603 | 25.7 | 15.2 | 22.5 | 23.1 | 10.5 | 22.80 |
3 | 20240604 | 24.6 | 15.7 | 19.3 | 24.6 | 8.9 | 21.95 |
4 | 20240605 | 23.4 | 12.8 | 20.3 | 23.2 | 10.6 | 21.75 |
Question 3.4#
Calculate the temperatures in Kelvins using the Celsius values and store the result in a new column called TEMP_KELVIN
in our DataFrame
. Zero Kelvins is -273.15 degrees Celsius, as we learned in Chapter 2, and the formula for converting degrees Celsius (C) to Kelvins (K) is thus \(T_{K} = T_{C} + 273.15\).
# Solution
data["TEMP_KELVIN"] = data["TEMP"] + 273.15
Selecting and updating data#
We often want to make selections from our data and only use specific rows from a DataFrame
in the analysis. There are multiple ways of selecting subsets from a pandas
DataFrame
that can be based on specific index values, for example, or on some predefined criteria, such as selecting all rows where the values in column X
are larger than 0.0
. Next, we will go through the most useful ways of selecting specific rows, columns, and individual values.
Selecting rows and columns#
One common way of selecting only specific rows from a DataFrame
is slicing. Getting a slice of data in pandas
can be done in a similar manner as with normal Python lists, by specifying an index range inside square brackets: DataFrame[start_index:stop_index]
. Let’s select the first five rows and assign them to a variable called selection
. Here, we will first see how selecting the data based on index values works just like with “normal” Python lists.
selection = data[0:5]
selection
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP | TEMP_KELVIN |
---|---|---|---|---|---|---|---|---|
0 | 20240601 | 26.2 | 18.4 | 24.2 | 23.2 | 7.8 | 23.70 | 296.85 |
1 | 20240602 | 24.4 | 17.1 | 20.7 | 24.2 | 7.3 | 22.45 | 295.60 |
2 | 20240603 | 25.7 | 15.2 | 22.5 | 23.1 | 10.5 | 22.80 | 295.95 |
3 | 20240604 | 24.6 | 15.7 | 19.3 | 24.6 | 8.9 | 21.95 | 295.10 |
4 | 20240605 | 23.4 | 12.8 | 20.3 | 23.2 | 10.6 | 21.75 | 294.90 |
Here we have selected the first five rows (index 0
-4
) using the integer index. Notice that the logic here is similar to how Python’s list
slicing (or range()
function) works; the value on the right side of the colon (here number 5
) is the stop value, which is excluded from the range selection. Hence, the syntax is [start_index:stop_index]
). In addition, the :step parameter could be added at the end to indicate the step size between values (1 by default); a quick illustration follows below.
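For example, a step size of 2 would select every other row among the first ten (an illustrative selection that we will not use later):

# Select every second row among the first ten rows (step size 2)
every_other = data[0:10:2]
every_other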
Selections like the previous example are possible, but there is a better and more flexible way of selecting data using the pandas
.loc[]
label-based indexer [1]. As the name suggests, .loc[]
selects data based on axis labels (row and column labels). This alone may not mean much to you at this point, but .loc[] makes it possible to conduct more specific selections, such as choosing which columns are included when selecting a subset of rows. It also makes it possible to take advantage of row labels that are not necessarily sequential numbers (as has been the case with all our examples thus far). For instance, you can make selections for rows indexed by dates or timestamps. Hence, .loc[] can become very handy when working with time series data (we will learn more about this later). Let's now test out the .loc[]
label indexer by selecting values from column TEMP
from rows 0
-5
.
# Select temp column values on rows 0-5
selection = data.loc[0:5, "TEMP"]
selection
0 23.70
1 22.45
2 22.80
3 21.95
4 21.75
5 17.05
Name: TEMP, dtype: float64
Notice that in this case, we get six rows of data (index 0
-5
)! This happens because we are now doing the selection based on axis labels instead of normal Python collection indices. It is important to notice the difference between these two approaches, as mixing the two may cause confusion, incorrect analysis results, or bugs in your code. We recommend to use .loc[]
whenever possible (there are specific cases when you may want to use other approaches). The basic syntax for using .loc[]
is:
.loc[first_included_label:last_included_label, columns]
By looking at the syntax, you might imagine that it is possible to select multiple columns when using .loc[]
. We can test this by selecting the TEMP
and TEMP_KELVIN
columns from a set of rows by passing the selected column names inside a list.
selection = data.loc[0:5, ["TEMP", "TEMP_KELVIN"]]
selection
| | TEMP | TEMP_KELVIN |
---|---|---|
0 | 23.70 | 296.85 |
1 | 22.45 | 295.60 |
2 | 22.80 | 295.95 |
3 | 21.95 | 295.10 |
4 | 21.75 | 294.90 |
5 | 17.05 | 290.20 |
As a result, we now have a new DataFrame
with two columns and 6 rows (i.e., index labels ranging from 0
to 5
).
Question 3.5#
Calculate the mean temperature for the last seven days of August 2024. Do the selection using the row index values.
# Solution
data.loc[85:91, "TEMP"].mean()
np.float64(19.671428571428574)
Selecting a single row or value#
You can also select an individual row using its index label with the .loc[] indexing. Here we select all the data values at index 4 (the 5th row).
row = data.loc[4]
row
YEARMODA 20240605.00
MAX 23.40
MIN 12.80
TEMP1 20.30
TEMP2 23.20
DIFF 10.60
TEMP 21.75
TEMP_KELVIN 294.90
Name: 4, dtype: float64
.loc[]
indexing returns the values from that row as a Series
where the indices are actually the column names of the row variables. Hence, you can access the value of an individual column by referring to its index using the following format:
# Print one attribute from the selected row
row["TEMP"]
np.float64(21.75)
Sometimes it is enough to access a single value in a DataFrame
directly. In this case, we can use the pandas
.at[]
indexer [2] instead of .loc[]
.
Let’s select the temperature (column TEMP
) on the first row (index 0) of our selection DataFrame.
selection.at[0, "TEMP"]
np.float64(23.7)
As an output, we get an individual value 23.7
. .at[]
works only when accessing a single value, whereas .loc[] can be used to access either single or multiple values at the same time. The end result when fetching a single value with .loc[]
is exactly the same and the difference between the two approaches is minimal. Hence, we recommend using .loc[]
in all cases because it is more flexible (.at[]
is slightly faster but in most cases this does not make a difference).
selection.loc[0, "TEMP"]
np.float64(23.7)
Selections based on index positions#
As we have learned thus far, .loc[]
and .at[]
are based on the axis labels, i.e., the names of columns and rows. For position-based indexing, pandas
has an .iloc[]
indexer [3], which is based on integer value indices. With .iloc[]
it is also possible to refer to the columns based on their index value (i.e., to a positional number of a column in the DataFrame
). For example, data.iloc[0, 0]
would return 20240601
in our example DataFrame
, which is the value on the first row and first column in the data.
# Check the first rows
print(data.head())
print()
print(f"The value at position (0, 0) is {data.iloc[0, 0]}.")
YEARMODA MAX MIN TEMP1 TEMP2 DIFF TEMP TEMP_KELVIN
0 20240601 26.2 18.4 24.2 23.2 7.8 23.70 296.85
1 20240602 24.4 17.1 20.7 24.2 7.3 22.45 295.60
2 20240603 25.7 15.2 22.5 23.1 10.5 22.80 295.95
3 20240604 24.6 15.7 19.3 24.6 8.9 21.95 295.10
4 20240605 23.4 12.8 20.3 23.2 10.6 21.75 294.90
The value at position (0, 0) is 20240601.
The syntax for using .iloc[]
is:
.iloc[start_row_position:stop_row_position,
start_column_position:stop_column_position]
Using this syntax, we can access the value on the first row and second column (MAX) as follows:
data.iloc[0, 1]
np.float64(26.2)
It is also possible to get ranges of rows and columns with .iloc[]
. For example, we could select the YEARMODA and MAX columns from the first five rows based on their positions in the data set. Here, we will select rows using the slice 0:5 and columns using the slice 0:2.
selection = data.iloc[0:5, 0:2]
selection
| | YEARMODA | MAX |
---|---|---|
0 | 20240601 | 26.2 |
1 | 20240602 | 24.4 |
2 | 20240603 | 25.7 |
3 | 20240604 | 24.6 |
4 | 20240605 | 23.4 |
As a result, we indeed get only two columns and the first five rows. It is important to notice that the behavior in terms of how many rows are returned differs between .iloc[]
and .loc[]
. Here, the .iloc[0:5]
returns 5 rows (following the Python list slicing behavior), whereas using .loc[0:5]
would return 6 rows (i.e., also including the row at index 5
).
One handy functionality with .iloc[]
is the ability to fetch data starting from the end of the DataFrame
, similar to how values can be selected from the end of a Python list
. It is possible to retrieve the last row in the DataFrame
by passing a negative number to .iloc[]
, where value -1
corresponds to the last row (or column), -2 corresponds to the second to last, and so on. Following this, it is easy to find the value in the last row and column (the final value in the TEMP_KELVIN
column in our case).
data.iloc[-1, -1]
np.float64(289.45)
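Similarly, passing a single negative position returns an entire row; for example, the last row of the DataFrame as a Series (a quick sketch):

# Select the entire last row as a Series
data.iloc[-1]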
Selections using listed criteria#
Another common way of selecting rows from a DataFrame
is to provide a list of values that are used for finding matching rows in a specific DataFrame
column. For example, selecting rows that match specific dates can be done by passing a list of values used as criteria to the .isin()
function of pandas
[4]. This will go through each value in the selected column (in this case YEARMODA
) and check whether there is a match or not. As an output, the .isin()
function returns a Series
of Boolean values (True or False) that can be combined with .loc[]
to do the final selection that returns only rows that meet the selection criteria.
# List of values that will be used as basis for selecting the rows
selection_criteria = [20240601, 20240708, 20240809]
# Do the selection based on criteria applied to YEARMODA column
data.loc[data["YEARMODA"].isin(selection_criteria)]
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP | TEMP_KELVIN |
---|---|---|---|---|---|---|---|---|
0 | 20240601 | 26.2 | 18.4 | 24.2 | 23.2 | 7.8 | 23.70 | 296.85 |
37 | 20240708 | 21.3 | 13.8 | 18.6 | 21.3 | 7.5 | 19.95 | 293.10 |
69 | 20240809 | 24.1 | 17.1 | 22.2 | 24.1 | 7.0 | 23.15 | 296.30 |
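The Boolean mask returned by .isin() can also be inverted with the ~ operator to select all rows that do not match the criteria (an illustrative sketch, not used later):

# Select all rows whose date is NOT in the criteria list
data.loc[~data["YEARMODA"].isin(selection_criteria)].head()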
Conditional selections#
One really useful feature in pandas
is the ability to easily filter and select rows based on a conditional statement. The following example shows how we can check whether the temperature at each row of the MAX
column is greater than 25 degrees.
data["MAX"] > 25
0 True
1 False
2 True
3 False
...
88 False
89 False
90 False
91 False
Name: MAX, Length: 92, dtype: bool
As a result, we get a Series
of Boolean values, where the value True
or False
at each row determines whether or not our condition was met. This kind of Series
or numpy.array
of Boolean values based on some predefined criteria is typically called a mask. We can take advantage of this mask when doing selections with .loc[]
based on specific criteria. In the following example, we use the same criterion as above and store all rows meeting the criterion into the variable hot_temps
(“hot” temperatures). We can specify the criterion directly inside the square brackets of the .loc[]
indexer. Let’s select rows that have a temperature greater than or equal to 25 degrees.
hot_temps = data.loc[data["MAX"] >= 25]
hot_temps
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP | TEMP_KELVIN |
---|---|---|---|---|---|---|---|---|
0 | 20240601 | 26.2 | 18.4 | 24.2 | 23.2 | 7.8 | 23.70 | 296.85 |
2 | 20240603 | 25.7 | 15.2 | 22.5 | 23.1 | 10.5 | 22.80 | 295.95 |
24 | 20240625 | 25.7 | 12.3 | 20.6 | 25.0 | 13.4 | 22.80 | 295.95 |
25 | 20240626 | 25.7 | 15.9 | 22.2 | 25.5 | 9.8 | 23.85 | 297.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
54 | 20240725 | 26.1 | 18.5 | 22.3 | 26.1 | 7.6 | 24.20 | 297.35 |
55 | 20240726 | 26.7 | 17.9 | 24.3 | 26.4 | 8.8 | 25.35 | 298.50 |
56 | 20240727 | 25.1 | NaN | 22.7 | 24.8 | NaN | 23.75 | 296.90 |
85 | 20240825 | 25.9 | 14.4 | 19.0 | 25.9 | 11.5 | 22.45 | 295.60 |
10 rows × 8 columns
It is also possible to combine multiple criteria at the same time. Combining multiple criteria can be done using the &
(and) or the |
(or) operators. Notice that it is usually necessary to wrap the individual conditions in parentheses (). Let's select rows having average daily temperatures above 20 degrees from the second half of the summer of 2024 (July 15th onwards).
warm_temps = data.loc[(data["TEMP"] > 20) & (data["YEARMODA"] >= 20240715)]
warm_temps
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP | TEMP_KELVIN |
---|---|---|---|---|---|---|---|---|
44 | 20240715 | 22.3 | 14.9 | 19.5 | 22.0 | 7.4 | 20.75 | 293.90 |
45 | 20240716 | 24.3 | 15.4 | 20.6 | 24.3 | 8.9 | 22.45 | 295.60 |
46 | 20240717 | 24.3 | 17.9 | 21.6 | 23.5 | 6.4 | 22.55 | 295.70 |
47 | 20240718 | 22.5 | 16.9 | 19.1 | 22.5 | 5.6 | 20.80 | 293.95 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
81 | 20240821 | 22.3 | 16.0 | 20.8 | 20.6 | 6.3 | 20.70 | 293.85 |
82 | 20240822 | 22.1 | NaN | 19.7 | 20.8 | NaN | 20.25 | 293.40 |
85 | 20240825 | 25.9 | 14.4 | 19.0 | 25.9 | 11.5 | 22.45 | 295.60 |
90 | 20240830 | 24.6 | 16.0 | 20.7 | 23.5 | 8.6 | 22.10 | 295.25 |
27 rows × 8 columns
Now we have a subset of our DataFrame
with only rows where the TEMP
is above 20 and the dates in the YEARMODA column start from the 15th of July. Notice that the index values (the numbers on the left) still show the index labels from the original DataFrame. This indicates that our result is a slice of the original data.
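The | (or) operator works in the same way. For example, here is a sketch selecting days that were either clearly warm or clearly cool (an illustrative selection that we will not use further):

# Select rows where the average temperature was above 24 or below 15 degrees
extremes = data.loc[(data["TEMP"] > 24) | (data["TEMP"] < 15)]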
Of course, it is possible to reset the index using the .reset_index()
function, which makes the index numbering start from 0 and increase sequentially. This is often a useful operation because it makes it easier to slice the data with .loc[]
or .iloc[]
. By default .reset_index()
would make a new column called index
to store the previous index, which might be useful in some cases. That is not the case here, so we can omit storing the old index by including the parameter drop=True
.
warm_temps = warm_temps.reset_index(drop=True)
warm_temps
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP | TEMP_KELVIN |
---|---|---|---|---|---|---|---|---|
0 | 20240715 | 22.3 | 14.9 | 19.5 | 22.0 | 7.4 | 20.75 | 293.90 |
1 | 20240716 | 24.3 | 15.4 | 20.6 | 24.3 | 8.9 | 22.45 | 295.60 |
2 | 20240717 | 24.3 | 17.9 | 21.6 | 23.5 | 6.4 | 22.55 | 295.70 |
3 | 20240718 | 22.5 | 16.9 | 19.1 | 22.5 | 5.6 | 20.80 | 293.95 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
23 | 20240821 | 22.3 | 16.0 | 20.8 | 20.6 | 6.3 | 20.70 | 293.85 |
24 | 20240822 | 22.1 | NaN | 19.7 | 20.8 | NaN | 20.25 | 293.40 |
25 | 20240825 | 25.9 | 14.4 | 19.0 | 25.9 | 11.5 | 22.45 | 295.60 |
26 | 20240830 | 24.6 | 16.0 | 20.7 | 23.5 | 8.6 | 22.10 | 295.25 |
27 rows × 8 columns
As can be seen, the index values now go from 0 to 26. Resetting the index has also unlinked the warm_temps
DataFrame
from data
, meaning that it is not a view anymore but an independent pandas
object. When making selections, it is quite typical that pandas
might give you warnings if you modify the selected data without first resetting the index or making a copy of the selected data. To demonstrate this, we will make the selection again and create a new column indicating days in which the temperature was “hot” (maximum temperature greater than 25 degrees Celsius).
warm_temps = data.loc[(data["TEMP"] > 20) & (data["YEARMODA"] >= 20240715)]
warm_temps["HOT_TEMP"] = warm_temps["MAX"] > 25
/var/folders/lp/cjwc88bd3w10sg327y_4ghg0fsk7jj/T/ipykernel_86921/3602583181.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
warm_temps["HOT_TEMP"] = warm_temps["MAX"] > 25
In this case we have created a new column using the selection, which is a slice from the original data. As a result, pandas
raises a warning about a possible invalid value assignment. In most cases this warning can be ignored, but it is good practice to always make a copy when doing selections, especially if you continue working with the selected data and intend to modify it further.
# Make the selection and make a copy
warm_temps = data.loc[(data["TEMP"] > 20) & (data["YEARMODA"] >= 20240715)].copy()
# Now create the new column based on the selection
warm_temps["HOT_TEMP"] = warm_temps["MAX"] > 25
As we can see, we did not receive any warnings this time, and it is safe to continue working with this selection without needing to worry about “hidden linkages” between the selection and another DataFrame
that could cause issues (we discuss this more in the next section).
Question 3.6#
Calculate the mean temperature for the last seven days of August 2024 again. This time you should select the rows based on a condition for the YEARMODA
column.
# Solution
data["TEMP"].loc[data["YEARMODA"] >= 20240825].mean()
np.float64(19.671428571428574)
View versus a copy#
As we have seen above, making a selection can sometimes result in something called a view. In such cases, the selection and the original data may still be linked to each other. This happens, for example, if you make a selection like the one above but return only a single column from the original source data. When you have a view, a change in the original data for that specific column can also change the value in the selection. This behavior can be confusing and yield unexpected consequences, so a good practice is to always make a copy when doing selections to unlink the source DataFrame
from the selection. You can make a copy easily while doing the selection by adding .copy()
at the end of the selection command.
selection = data.loc[0:5, ["TEMP", "TEMP_KELVIN"]].copy()
selection
| | TEMP | TEMP_KELVIN |
---|---|---|
0 | 23.70 | 296.85 |
1 | 22.45 | 295.60 |
2 | 22.80 | 295.95 |
3 | 21.95 | 295.10 |
4 | 21.75 | 294.90 |
5 | 17.05 | 290.20 |
Now we have the exact same data in our end result, but we have ensured that the selection is no longer linked to the original data. To demonstrate what can happen with a view, let's select a single column from the selection data (which will be a view) and then modify the data a bit to show the consequences of not being careful.
temp = selection["TEMP"]
temp
0 23.70
1 22.45
2 22.80
3 21.95
4 21.75
5 17.05
Name: TEMP, dtype: float64
Now if we make a change to our selection DataFrame, it will also influence the values in temp.
selection.iloc[0, 0] = 30.0
selection.head()
| | TEMP | TEMP_KELVIN |
---|---|---|
0 | 30.00 | 296.85 |
1 | 22.45 | 295.60 |
2 | 22.80 | 295.95 |
3 | 21.95 | 295.10 |
4 | 21.75 | 294.90 |
# Check the values in temp (which we did not modify)
temp
0 30.00
1 22.45
2 22.80
3 21.95
4 21.75
5 17.05
Name: TEMP, dtype: float64
As we can see, the value in our temp
Series
has changed from 23.70
to 30.00
although we did not make any change to it directly. The change happened because the data objects were still linked to each other.
Dealing with missing data#
As you may have noticed by now, we have several missing values in the minimum temperature, maximum temperature, and temperature difference columns (MIN, MAX, and DIFF). These missing values appear as NaN
(not a number). Having missing data in your data file is quite common and typically you want to deal with these values somehow. Common procedures to deal with NaN
values are to either remove them from the DataFrame
or replace (fill) them with some other value. In pandas
, both of these options are easy to do. Let's start by checking whether the data we are using has any NaN
values.
warm_temps
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP | TEMP_KELVIN | HOT_TEMP |
---|---|---|---|---|---|---|---|---|---|
44 | 20240715 | 22.3 | 14.9 | 19.5 | 22.0 | 7.4 | 20.75 | 293.90 | False |
45 | 20240716 | 24.3 | 15.4 | 20.6 | 24.3 | 8.9 | 22.45 | 295.60 | False |
46 | 20240717 | 24.3 | 17.9 | 21.6 | 23.5 | 6.4 | 22.55 | 295.70 | False |
47 | 20240718 | 22.5 | 16.9 | 19.1 | 22.5 | 5.6 | 20.80 | 293.95 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
81 | 20240821 | 22.3 | 16.0 | 20.8 | 20.6 | 6.3 | 20.70 | 293.85 | False |
82 | 20240822 | 22.1 | NaN | 19.7 | 20.8 | NaN | 20.25 | 293.40 | False |
85 | 20240825 | 25.9 | 14.4 | 19.0 | 25.9 | 11.5 | 22.45 | 295.60 | True |
90 | 20240830 | 24.6 | 16.0 | 20.7 | 23.5 | 8.6 | 22.10 | 295.25 | False |
27 rows × 9 columns
As we can see, the MIN and DIFF columns are clearly missing values at index 82
. It is also possible to confirm this with pandas
by accessing a specific Series
attribute called .hasnans
, which can be a handy tool when automating a data analysis pipeline. Each Series
(or column) in the DataFrame
has this attribute. Let’s check whether the MIN
column contains any NaN
values.
warm_temps["MIN"].hasnans
True
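If you need a per-column count of missing values across the whole DataFrame, one common approach (a quick sketch) is to combine .isna() with .sum():

# Count the number of missing values in each column
warm_temps.isna().sum()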
The .hasnans
attribute will be either True
or False
depending on whether or not the Series
contains any NaN
values. Let’s now see how we can remove the missing data values (i.e., clean the data) using the .dropna()
function. Inside the .dropna()
function, you can use the subset parameter to pass a list of the column(s) from which rows containing NaN values should be removed.
cols_to_check = ["MIN"]
warm_temps_clean = warm_temps.dropna(subset=cols_to_check)
warm_temps_clean
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP | TEMP_KELVIN | HOT_TEMP |
---|---|---|---|---|---|---|---|---|---|
44 | 20240715 | 22.3 | 14.9 | 19.5 | 22.0 | 7.4 | 20.75 | 293.90 | False |
45 | 20240716 | 24.3 | 15.4 | 20.6 | 24.3 | 8.9 | 22.45 | 295.60 | False |
46 | 20240717 | 24.3 | 17.9 | 21.6 | 23.5 | 6.4 | 22.55 | 295.70 | False |
47 | 20240718 | 22.5 | 16.9 | 19.1 | 22.5 | 5.6 | 20.80 | 293.95 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
79 | 20240819 | 22.3 | 15.1 | 18.8 | 22.3 | 7.2 | 20.55 | 293.70 | False |
81 | 20240821 | 22.3 | 16.0 | 20.8 | 20.6 | 6.3 | 20.70 | 293.85 | False |
85 | 20240825 | 25.9 | 14.4 | 19.0 | 25.9 | 11.5 | 22.45 | 295.60 | True |
90 | 20240830 | 24.6 | 16.0 | 20.7 | 23.5 | 8.6 | 22.10 | 295.25 | False |
23 rows × 9 columns
As you can see by looking at the output above, we now have a DataFrame
without NaN
values in the MIN
column (we have only 23 rows compared to 27 before cleaning). However, you might recall that there may also have been NaN
values in the MAX
column. In order to drop all rows containing NaN
values, we could instead do the following:
warm_temps_clean = warm_temps.dropna()
warm_temps_clean
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP | TEMP_KELVIN | HOT_TEMP |
---|---|---|---|---|---|---|---|---|---|
44 | 20240715 | 22.3 | 14.9 | 19.5 | 22.0 | 7.4 | 20.75 | 293.90 | False |
45 | 20240716 | 24.3 | 15.4 | 20.6 | 24.3 | 8.9 | 22.45 | 295.60 | False |
46 | 20240717 | 24.3 | 17.9 | 21.6 | 23.5 | 6.4 | 22.55 | 295.70 | False |
47 | 20240718 | 22.5 | 16.9 | 19.1 | 22.5 | 5.6 | 20.80 | 293.95 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
79 | 20240819 | 22.3 | 15.1 | 18.8 | 22.3 | 7.2 | 20.55 | 293.70 | False |
81 | 20240821 | 22.3 | 16.0 | 20.8 | 20.6 | 6.3 | 20.70 | 293.85 | False |
85 | 20240825 | 25.9 | 14.4 | 19.0 | 25.9 | 11.5 | 22.45 | 295.60 | True |
90 | 20240830 | 24.6 | 16.0 | 20.7 | 23.5 | 8.6 | 22.10 | 295.25 | False |
21 rows × 9 columns
In this case we can see an additional 2 rows have been dropped due to the NaN
values that were in the MAX
column. However, using .dropna() without the subset parameter is not recommended, as it can sometimes result in dropping data you did not want to remove. Thus, we generally recommend always using the subset
parameter with the list of columns from which to remove NaN
values.
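For example, to drop rows with missing values in either the MIN or the MAX column (a quick sketch):

# Drop rows where MIN or MAX is missing, leaving other NaN values untouched
warm_temps_clean = warm_temps.dropna(subset=["MIN", "MAX"])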
Of course, by dropping rows using .dropna()
you naturally lose data (rows), which might not be an optimal solution for all cases. Because of this, pandas
also provides an option to fill the NaN
values with some other value using the .fillna()
function. Let’s instead fill the missing values in our data set the with value -9999
. Note that we are not giving the subset
parameter this time.
warm_temps.fillna(-9999)
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP | TEMP_KELVIN | HOT_TEMP |
---|---|---|---|---|---|---|---|---|---|
44 | 20240715 | 22.3 | 14.9 | 19.5 | 22.0 | 7.4 | 20.75 | 293.90 | False |
45 | 20240716 | 24.3 | 15.4 | 20.6 | 24.3 | 8.9 | 22.45 | 295.60 | False |
46 | 20240717 | 24.3 | 17.9 | 21.6 | 23.5 | 6.4 | 22.55 | 295.70 | False |
47 | 20240718 | 22.5 | 16.9 | 19.1 | 22.5 | 5.6 | 20.80 | 293.95 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
81 | 20240821 | 22.3 | 16.0 | 20.8 | 20.6 | 6.3 | 20.70 | 293.85 | False |
82 | 20240822 | 22.1 | -9999.0 | 19.7 | 20.8 | -9999.0 | 20.25 | 293.40 | False |
85 | 20240825 | 25.9 | 14.4 | 19.0 | 25.9 | 11.5 | 22.45 | 295.60 | True |
90 | 20240830 | 24.6 | 16.0 | 20.7 | 23.5 | 8.6 | 22.10 | 295.25 | False |
27 rows × 9 columns
As a result, we now have a DataFrame where all NaN values are filled with the value -9999
. As was the case for the .dropna()
function, you can fill values in select columns by directing the .fillna()
function to a specific column (or set of columns).
warm_temps["MIN"] = warm_temps["MIN"].fillna(-9999)
warm_temps
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP | TEMP_KELVIN | HOT_TEMP |
---|---|---|---|---|---|---|---|---|---|
44 | 20240715 | 22.3 | 14.9 | 19.5 | 22.0 | 7.4 | 20.75 | 293.90 | False |
45 | 20240716 | 24.3 | 15.4 | 20.6 | 24.3 | 8.9 | 22.45 | 295.60 | False |
46 | 20240717 | 24.3 | 17.9 | 21.6 | 23.5 | 6.4 | 22.55 | 295.70 | False |
47 | 20240718 | 22.5 | 16.9 | 19.1 | 22.5 | 5.6 | 20.80 | 293.95 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
81 | 20240821 | 22.3 | 16.0 | 20.8 | 20.6 | 6.3 | 20.70 | 293.85 | False |
82 | 20240822 | 22.1 | -9999.0 | 19.7 | 20.8 | NaN | 20.25 | 293.40 | False |
85 | 20240825 | 25.9 | 14.4 | 19.0 | 25.9 | 11.5 | 22.45 | 295.60 | True |
90 | 20240830 | 24.6 | 16.0 | 20.7 | 23.5 | 8.6 | 22.10 | 295.25 | False |
27 rows × 9 columns
Notice that now the NaN
value at index 82
of the MIN column has been replaced, while the NaN
value in the DIFF
column remains.
It is important to be aware that in many cases filling the data with a specific value is dangerous, because you end up modifying the actual data, which might affect the results of your analysis. For example, in the case above we would have dramatically changed the minimum temperature and temperature difference columns because -9999 is not an actual temperature! Hence, use caution when filling missing values. In some cases, you might have to fill in missing data values in order to save the data file in a specific format. For example, some GIS software cannot handle missing values. Always pay attention to potential missing data values when reading in data files and performing your data analysis.
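If a sentinel value such as -9999 is too disruptive for your analysis, another common option is to fill missing values with a statistic such as the column mean. The sketch below is only illustrative, and the result is not stored back into the DataFrame:

# Fill missing MAX values with the column mean
# (pandas ignores NaN values when computing the mean)
data["MAX"].fillna(data["MAX"].mean()).head()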
Data type conversions#
When doing data analysis, another quite typical operation is converting values in a column from one data type to another, such as from floating point values to integers. Remember that we have already performed data type conversions in Chapter 2 using built-in Python functions such as int()
or str()
. For values stored in a pandas
Series
, we can use the .astype()
method for converting data types. Let’s explore this by converting the temperature values (type float64
) in the column TEMP
to be integers (type int64
). Let’s first have a look at the existing data.
data["TEMP"].head()
0 23.70
1 22.45
2 22.80
3 21.95
4 21.75
Name: TEMP, dtype: float64
Now we can easily convert those decimal values to integers using .astype(int)
.
data["TEMP"].astype(int).head()
0 23
1 22
2 22
3 21
4 21
Name: TEMP, dtype: int64
Great! As we can see, the values were converted to integers. However, this example demonstrates the importance of being careful with type conversions from floating point values to integers. This conversion simply truncates the digits after the decimal point, so positive values are effectively rounded down to the nearest whole number. For example, the first value in our Series
(23.7
) was truncated to 23
as an integer when it clearly should be rounded up to 24
. This issue can be resolved by chaining together the .round()
function with the .astype()
function. For example, .round(0).astype(int)
will first round the values to have zero decimal places and then convert those values into integers.
data["TEMP"].round(0).astype(int).head()
0 24
1 22
2 23
3 22
4 22
Name: TEMP, dtype: int64
As we can see, now the integer values are correctly rounded.
The .astype()
method supports converting between all basic Python data types (int
, float
, str
), but it also knows how to convert between more specific numpy
data types such as int64
, int32
, float64
, float32
(a full list can be found in the NumPy documentation [5]). Using a numpy data type can be useful if you need to be more specific about how many bits should be reserved for storing the values. For instance, int64
(i.e., a 64-bit integer) can store integer values ranging between -9223372036854775808 and 9223372036854775807, whereas int16
can only store values from -32768 to 32767. If passing the “normal” int
or float
to the astype()
function, pandas
will automatically store 64-bit numeric values. However, the higher the precision (i.e., the larger the number of bits you use), the more physical memory is required to store the data. Hence, in some specific cases dealing with extremely large datasets, it may be useful to specify that the values in certain columns should be stored with lower precision to save memory, as in the sketch below.
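For example, here is a minimal sketch storing the rounded temperatures as 16-bit integers (the temp_int16 variable name is hypothetical):

# Store the rounded temperatures with lower precision to save memory
temp_int16 = data["TEMP"].round(0).astype("int16")
temp_int16.dtypes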
Sorting data#
Quite often it is useful to be able to sort your data (descending/ascending) based on values in some column. This can be done easily with pandas
using the .sort_values(by='YourColumnName')
function. Let’s first sort the values in ascending order using the TEMP
column.
# Sort DataFrame, ascending
data.sort_values(by="TEMP")
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP | TEMP_KELVIN |
---|---|---|---|---|---|---|---|---|
8 | 20240609 | 15.5 | 11.3 | 12.8 | 15.4 | 4.2 | 14.10 | 287.25 |
35 | 20240706 | 19.0 | 11.3 | 11.4 | 18.7 | 7.7 | 15.05 | 288.20 |
72 | 20240812 | 17.4 | 12.7 | 14.7 | 16.0 | 4.7 | 15.35 | 288.50 |
32 | 20240703 | 17.4 | 13.1 | 14.2 | 16.6 | 4.3 | 15.40 | 288.55 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
54 | 20240725 | 26.1 | 18.5 | 22.3 | 26.1 | 7.6 | 24.20 | 297.35 |
26 | 20240627 | NaN | 15.3 | 22.9 | 26.9 | NaN | 24.90 | 298.05 |
55 | 20240726 | 26.7 | 17.9 | 24.3 | 26.4 | 8.8 | 25.35 | 298.50 |
27 | 20240628 | 27.2 | 17.0 | 25.4 | 27.2 | 10.2 | 26.30 | 299.45 |
92 rows × 8 columns
Of course, it is also possible to sort them in descending order by including the ascending=False
parameter.
# Sort DataFrame, descending
data.sort_values(by="TEMP", ascending=False)
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP | TEMP_KELVIN |
---|---|---|---|---|---|---|---|---|
27 | 20240628 | 27.2 | 17.0 | 25.4 | 27.2 | 10.2 | 26.30 | 299.45 |
55 | 20240726 | 26.7 | 17.9 | 24.3 | 26.4 | 8.8 | 25.35 | 298.50 |
26 | 20240627 | NaN | 15.3 | 22.9 | 26.9 | NaN | 24.90 | 298.05 |
54 | 20240725 | 26.1 | 18.5 | 22.3 | 26.1 | 7.6 | 24.20 | 297.35 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
32 | 20240703 | 17.4 | 13.1 | 14.2 | 16.6 | 4.3 | 15.40 | 288.55 |
72 | 20240812 | 17.4 | 12.7 | 14.7 | 16.0 | 4.7 | 15.35 | 288.50 |
35 | 20240706 | 19.0 | 11.3 | 11.4 | 18.7 | 7.7 | 15.05 | 288.20 |
8 | 20240609 | 15.5 | 11.3 | 12.8 | 15.4 | 4.2 | 14.10 | 287.25 |
92 rows × 8 columns
In some situations, you might need to sort values based on multiple columns simultaneously, which is sometimes referred to as multi-level sorting. This can be done by passing a list of column names to the by
parameter. When you sort the data based on multiple columns, sometimes you also might want to sort your data in a way that the first-level sorting happens in ascending order and the second-level sorting happens in descending order. An example situation for this kind of sorting could be when sorting the temperatures first by weekday (Monday, Tuesday, etc.) and then ordering the values for each weekday in descending order. This would always show the warmest temperature for a given day of the week first. Let’s modify our data a bit to demonstrate this. We will add a new column that has information about the weekday. The 1st of June, 2024 was a Saturday, so we start from that.
# Create a list of weekdays that matches with our data
# The data covers 13 weeks + 1 day (altogether 92 days)
week_days = ["Sat", "Sun", "Mon", "Tue", "Wed", "Thu", "Fri"]
day_list = 13 * week_days + week_days[:1]
# Add the weekdays to our DataFrame
data["WEEKDAY"] = day_list
data
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP | TEMP_KELVIN | WEEKDAY |
---|---|---|---|---|---|---|---|---|---|
0 | 20240601 | 26.2 | 18.4 | 24.2 | 23.2 | 7.8 | 23.70 | 296.85 | Sat |
1 | 20240602 | 24.4 | 17.1 | 20.7 | 24.2 | 7.3 | 22.45 | 295.60 | Sun |
2 | 20240603 | 25.7 | 15.2 | 22.5 | 23.1 | 10.5 | 22.80 | 295.95 | Mon |
3 | 20240604 | 24.6 | 15.7 | 19.3 | 24.6 | 8.9 | 21.95 | 295.10 | Tue |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
88 | 20240828 | 21.6 | 14.2 | 18.7 | 21.3 | 7.4 | 20.00 | 293.15 | Wed |
89 | 20240829 | 21.9 | NaN | 16.6 | 21.8 | NaN | 19.20 | 292.35 | Thu |
90 | 20240830 | 24.6 | 16.0 | 20.7 | 23.5 | 8.6 | 22.10 | 295.25 | Fri |
91 | 20240831 | NaN | 13.7 | 13.9 | 18.7 | NaN | 16.30 | 289.45 | Sat |
92 rows × 9 columns
Now we have a new column with information about the weekday of each row. Next, we can test how to sort the values, so that we order the data by weekday and within each weekday the temperatures are in descending order. You can adjust how the ordering works by passing a list of Boolean values to the ascending
parameter.
data.sort_values(by=["WEEKDAY", "TEMP"], ascending=[True, False])
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP | TEMP_KELVIN | WEEKDAY |
---|---|---|---|---|---|---|---|---|---|
27 | 20240628 | 27.2 | 17.0 | 25.4 | 27.2 | 10.2 | 26.30 | 299.45 | Fri |
55 | 20240726 | 26.7 | 17.9 | 24.3 | 26.4 | 8.8 | 25.35 | 298.50 | Fri |
69 | 20240809 | 24.1 | 17.1 | 22.2 | 24.1 | 7.0 | 23.15 | 296.30 | Fri |
90 | 20240830 | 24.6 | 16.0 | 20.7 | 23.5 | 8.6 | 22.10 | 295.25 | Fri |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
88 | 20240828 | 21.6 | 14.2 | 18.7 | 21.3 | 7.4 | 20.00 | 293.15 | Wed |
18 | 20240619 | 17.8 | 12.8 | 16.1 | 16.8 | 5.0 | 16.45 | 289.60 | Wed |
11 | 20240612 | 17.9 | 11.1 | 14.9 | 17.5 | 6.8 | 16.20 | 289.35 | Wed |
32 | 20240703 | 17.4 | 13.1 | 14.2 | 16.6 | 4.3 | 15.40 | 288.55 | Wed |
92 rows × 9 columns
As a result, the data are now ordered first by weekday (i.e., rows with the same weekday are grouped, in alphabetical order) and within each weekday the temperature values are sorted in descending order, showing the warmest day first. Ordering data in this manner based on multiple criteria can sometimes be very useful when analyzing your data.
Table join: Combining DataFrames using a common key#
Basic logic of a table join#
Joining data between two or more DataFrame
objects is a common task when doing data analysis. The minimum requirement for being able to combine data between two (or more) DataFrame
objects (or tables) is to have at least one common attribute (called a key) that is identical in both DataFrame
objects. Figure 3.4 illustrates the logic of a table join. In this case, we want to merge the precipitation data from the Kumpula weather station with the temperature data that we worked with earlier in this chapter. The common key in this case is the time information in the YEARMODA
column in the left DataFrame
and the Date
column in the right DataFrame
. The column names of the keys can be different (as in our case), but the values stored in these columns should correspond to each other so that it is possible to match the records between tables. The attribute values of the key can contain data in any format (dates, text, numbers, etc.). Hence, the data is not limited to dates or integers as demonstrated in this example.
Figure 3.4. Joining precipitation data from the right DataFrame to the left based on a common key.
Table join using pandas .merge()#
In the following example, we first read some precipitation data for the Kumpula weather station and then join this data with the DataFrame
containing the average temperature data we have worked with. Merging two DataFrame
objects together based on a common key (or multiple keys) can be done easily with pandas
using the .merge()
function. The column that represents the key can be specified using the on
parameter if the key column is identical in both DataFrame
objects. In our case, the DataFrame
columns containing common values have different names. Thus, we need to separately specify the key for the left DataFrame
using the parameter left_on
and use the parameter right_on
for the right DataFrame
.
# Read precipitation data and show first rows
precip = pd.read_csv("data/kumpula-temps-precip-summer-2024.txt")
precip.head()
| | Date | Precipitation amount [mm] | Average temperature [deg. C] | Maximum temperature [deg. C] | Minimum temperature [deg. C] |
---|---|---|---|---|---|
0 | 20240601 | 0.7 | 22.3 | 27.2 | 17.6 |
1 | 20240602 | 0.1 | 20.8 | 24.9 | 16.9 |
2 | 20240603 | 0.0 | 20.8 | 26.7 | 15.0 |
3 | 20240604 | -1.0 | 20.2 | 25.1 | 15.4 |
4 | 20240605 | -1.0 | 18.6 | 24.0 | 12.4 |
# Make a table join
join = data.merge(precip, left_on="YEARMODA", right_on="Date")
join.head()
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP | TEMP_KELVIN | WEEKDAY | Date | Precipitation amount [mm] | Average temperature [deg. C] | Maximum temperature [deg. C] | Minimum temperature [deg. C] |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20240601 | 26.2 | 18.4 | 24.2 | 23.2 | 7.8 | 23.70 | 296.85 | Sat | 20240601 | 0.7 | 22.3 | 27.2 | 17.6 |
1 | 20240602 | 24.4 | 17.1 | 20.7 | 24.2 | 7.3 | 22.45 | 295.60 | Sun | 20240602 | 0.1 | 20.8 | 24.9 | 16.9 |
2 | 20240603 | 25.7 | 15.2 | 22.5 | 23.1 | 10.5 | 22.80 | 295.95 | Mon | 20240603 | 0.0 | 20.8 | 26.7 | 15.0 |
3 | 20240604 | 24.6 | 15.7 | 19.3 | 24.6 | 8.9 | 21.95 | 295.10 | Tue | 20240604 | -1.0 | 20.2 | 25.1 | 15.4 |
4 | 20240605 | 23.4 | 12.8 | 20.3 | 23.2 | 10.6 | 21.75 | 294.90 | Wed | 20240605 | -1.0 | 18.6 | 24.0 | 12.4 |
Now we have merged all the information from the right DataFrame
into the left one and stored the result in the variable join. By doing this, we can analyze the relationship between the daily average temperature and precipitation, for example, to understand whether the temperature is lower when it rains. Currently, however, the join DataFrame also contains some variables that are not necessarily useful for us. A useful trick when joining is to limit the number of columns that will be kept from the right DataFrame. This can be done by chaining a simple column selection while doing the merge, as shown next. When doing this, it is important to remember that the key
column on the right DataFrame
must be part of the selection for the table join to work.
# Make another join but only keep the attribute of interest
join2 = data.merge(
precip[["Date", "Precipitation amount [mm]"]], left_on="YEARMODA", right_on="Date"
)
join2.head()
| | YEARMODA | MAX | MIN | TEMP1 | TEMP2 | DIFF | TEMP | TEMP_KELVIN | WEEKDAY | Date | Precipitation amount [mm] |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20240601 | 26.2 | 18.4 | 24.2 | 23.2 | 7.8 | 23.70 | 296.85 | Sat | 20240601 | 0.7 |
1 | 20240602 | 24.4 | 17.1 | 20.7 | 24.2 | 7.3 | 22.45 | 295.60 | Sun | 20240602 | 0.1 |
2 | 20240603 | 25.7 | 15.2 | 22.5 | 23.1 | 10.5 | 22.80 | 295.95 | Mon | 20240603 | 0.0 |
3 | 20240604 | 24.6 | 15.7 | 19.3 | 24.6 | 8.9 | 21.95 | 295.10 | Tue | 20240604 | -1.0 |
4 | 20240605 | 23.4 | 12.8 | 20.3 | 23.2 | 10.6 | 21.75 | 294.90 | Wed | 20240605 | -1.0 |
As can be seen in the DataFrame
above, now only the column Date
and the attribute of interest Precipitation amount [mm]
were joined and kept in the output from the right DataFrame
. A similar trick can also be applied to the left DataFrame
by adding the selection before the .merge()
call, as shown in the sketch below.
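For example, here is a sketch that keeps only the date and average temperature columns from the left DataFrame (join3 is a hypothetical variable name):

# Limit the columns kept from the left DataFrame as well
join3 = data[["YEARMODA", "TEMP"]].merge(
    precip[["Date", "Precipitation amount [mm]"]], left_on="YEARMODA", right_on="Date"
)
join3.head()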
In our case, doing the table join was fairly straightforward because we had only one unique record per day in both DataFrame
objects. However, in some cases you might have multiple records in one or the other of the DataFrame
objects (e.g., hourly observations vs. daily observations). If not taken into account properly, this can cause issues such as duplicated rows, incorrect analysis results, and other undesired consequences. Such a mismatch in the number of records per table can be handled by first aggregating the hourly data to a daily average, for example. You can learn more about these aggregation techniques in the following sections.