Common tabular operations in pandas#

Now that you have learned the basics of pandas data structures (i.e. Series and DataFrame) and are familiar with basic methods for loading and exploring data, we will continue exploring the functionality of pandas and see how it can be used for manipulating data, conducting simple calculations, and making selections based on specific criteria.

Basic calculations#

One of the most common things to do in pandas is to create new columns based on calculations between different variables (columns). Next, we will learn how to do that using the same input data ('Kumpula-June-2016-w-metadata.txt') as in the previous section. We will first load it using the pd.read_csv() function. Remember that the first 8 lines contain the metadata, which we will skip. This time, let's store the filepath in a separate variable to make the code more readable and easier to change afterwards (a good practice).

import pandas as pd

# Define file path:
fp = "data/Kumpula-June-2016-w-metadata.txt"

# Read in the data from the file (starting at row 9):
data = pd.read_csv(fp, skiprows=8)

As a first step, it is always good to remember to check the data after reading it. This way we can be sure that everything looks as it should.

data.head()
YEARMODA TEMP MAX MIN
0 20160601 65.5 73.6 54.7
1 20160602 65.8 80.8 55.0
2 20160603 68.4 NaN 55.6
3 20160604 57.5 70.9 47.3
4 20160605 51.4 58.3 43.2

All good. Now we can start by creating a new column DIFF in our DataFrame. This can be done by specifying the name of the column and giving it some default value (in this case the decimal number 0.0):

data["DIFF"] = 0.0
data.head()
YEARMODA TEMP MAX MIN DIFF
0 20160601 65.5 73.6 54.7 0.0
1 20160602 65.8 80.8 55.0 0.0
2 20160603 68.4 NaN 55.6 0.0
3 20160604 57.5 70.9 47.3 0.0
4 20160605 51.4 58.3 43.2 0.0

As we can see, we now have a new column (DIFF) in our DataFrame with the value 0.0 on every row. When creating a new column, you can initialize it with any value you want. Typically, the value could be the number 0 as here, but it could also be None (i.e. nothing), some text (e.g. "test text"), or more or less any other value or object that can be represented as a single item. You could even initialize the column by storing a function inside the cells if you like. Let's continue by checking the data type of our new column:

data["DIFF"].dtypes
dtype('float64')
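
As a quick illustration of initializing columns with other values, here is a small sketch (a hypothetical example that works on a copy, so our working data stays untouched; the column names NOTE and EMPTY are made up for this demonstration):

# Demonstrate on a copy so that the working DataFrame is not modified
demo = data.head().copy()

# Initialize a column with a constant string
demo["NOTE"] = "test text"

# Initialize a column with None (no value)
demo["EMPTY"] = None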

As the dtype check shows, pandas created the new column and automatically recognized that the data type is float, because we passed the value 0.0 to it. Great, but what about making those calculations with pandas, as promised in the beginning? Next, we will calculate the difference between the MAX and MIN columns to get an idea of how much the temperatures varied from day to day. The result will be stored in the DIFF column that we created earlier. A typical way of conducting calculations such as this is to access the Series (columns) of interest in the DataFrame and conduct the mathematical calculation using the selected columns. Typically you store the result directly into a column of the DataFrame, as follows:

data["DIFF"] = data["MAX"] - data["MIN"]
data.head()
YEARMODA TEMP MAX MIN DIFF
0 20160601 65.5 73.6 54.7 18.9
1 20160602 65.8 80.8 55.0 25.8
2 20160603 68.4 NaN 55.6 NaN
3 20160604 57.5 70.9 47.3 23.6
4 20160605 51.4 58.3 43.2 15.1

The calculations were stored in the DIFF column as planned. Conducting calculations like this is extremely fast in pandas because the math operations happen in a vectorized manner. This means that instead of looping over individual values of the DataFrame and comparing them to each other, the difference is calculated on all rows simultaneously. You can also create new columns on-the-fly when doing the calculation (i.e. the column does not have to exist beforehand). Furthermore, any kind of arithmetic (e.g. subtraction, addition, multiplication, division, exponentiation, etc.) can be used when creating new columns. For example, we can convert the Fahrenheit temperatures in the TEMP column into Celsius using the formula that we have already seen many times. Let's do that and store the result in a new column called TEMP_CELSIUS:

data["TEMP_CELSIUS"] = (data["TEMP"] - 32) / (9 / 5)
data.head()
YEARMODA TEMP MAX MIN DIFF TEMP_CELSIUS
0 20160601 65.5 73.6 54.7 18.9 18.611111
1 20160602 65.8 80.8 55.0 25.8 18.777778
2 20160603 68.4 NaN 55.6 NaN 20.222222
3 20160604 57.5 70.9 47.3 23.6 14.166667
4 20160605 51.4 58.3 43.2 15.1 10.777778
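
To see what "vectorized" means in practice, the short sketch below (for illustration only, not part of our workflow) computes the same Celsius values by handling each value separately in a loop. The result is identical, but processing the values one by one is much slower than the one-line vectorized calculation above:

# Slow, non-idiomatic alternative: convert each value separately
celsius_loop = [(temp - 32) / (9 / 5) for temp in data["TEMP"]]

# The first values match the TEMP_CELSIUS column computed above
celsius_loop[:3]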

Question 3.2#

Calculate the temperatures in Kelvin using the Celsius values and store the result in a new column called TEMP_KELVIN in our dataframe. 0 Kelvin is -273.15 degrees Celsius as we learned in Chapter 2, and the formula for converting degrees Celsius (C) to Kelvin (K) is: K = C + 273.15

# Solution

data["TEMP_KELVIN"] = data["TEMP_CELSIUS"] + 273.15

Selecting and updating data#

We often want to make selections from our data and use only specific rows of a DataFrame in an analysis. There are multiple ways of selecting subsets of a pandas DataFrame, based on e.g. specific index values or on some predefined criteria, such as "give me all rows where the values in column X are larger than zero". Next, we will go through the most useful approaches for selecting specific rows, columns, and individual values.

Selecting rows and columns#

One common way of selecting only specific rows from a DataFrame is slicing. Getting a slice of data in pandas can be done in a similar manner as with normal Python lists, by specifying an index range inside square brackets: dataframe[start_index:stop_index]. Let's select the first five rows and assign them to a variable called selection. Here, we first see how selecting the data works with index values, as it would with "normal" Python lists:

selection = data[0:5]
selection
YEARMODA TEMP MAX MIN DIFF TEMP_CELSIUS TEMP_KELVIN
0 20160601 65.5 73.6 54.7 18.9 18.611111 291.761111
1 20160602 65.8 80.8 55.0 25.8 18.777778 291.927778
2 20160603 68.4 NaN 55.6 NaN 20.222222 293.372222
3 20160604 57.5 70.9 47.3 23.6 14.166667 287.316667
4 20160605 51.4 58.3 43.2 15.1 10.777778 283.927778

Here we have selected the first five rows (index 0-4) using the integer index. Notice that the logic follows how Python's list slicing (or the range function) works: the value on the right side of the colon (here the number 5) tells where to stop, but that value is not included in the final selection. Hence, the syntax is start_index:stop_index (an optional :step parameter can also be added at the end).
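
For example, a step of 2 picks every second row (a small illustration, not needed for the rest of this section):

# Select every second row among the first ten (positions 0, 2, 4, 6, 8)
data[0:10:2]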

Selections like the one in the previous example certainly work, but there is also a better and more flexible way of selecting data using so-called .loc label-indexing. As the name implies, .loc selects data based on axis labels (row and column labels). This might not tell you much at this point, but .loc makes it possible to conduct more specific selections, such as choosing which columns are returned when selecting a subset of rows. It also makes it possible to benefit from row labels that are not necessarily sequential numbers (as in all of our examples thus far) but can represent other objects as well, such as dates or timestamps. Hence, .loc can become very handy when working with timeseries data (we will learn more about this later). Let's test .loc label-indexing by selecting temperature values from the column TEMP using rows 0-5:

# Select temp column values on rows 0-5
selection = data.loc[0:5, "TEMP"]
selection
0    65.5
1    65.8
2    68.4
3    57.5
4    51.4
5    52.2
Name: TEMP, dtype: float64

Notice that in this case we get six rows of data (index 0-5)! This happens because we are now doing the selection based on axis labels instead of normal Python-style positional indexing. It is important to notice the difference between these two approaches, as mixing them up may cause confusion, incorrect analysis results, or bugs in your code if you do not pay attention. We recommend using .loc whenever possible (there are specific cases when you want to use other approaches; more about this soon). Hence, the basic syntax for .loc is:

.loc[first_included_label:last_included_label, columns]

By looking at the syntax, you might guess that it is also possible to select multiple columns when using .loc. Next, we will test this by selecting the TEMP and TEMP_CELSIUS columns from a set of rows by passing them inside a list:

selection = data.loc[0:5, ["TEMP", "TEMP_CELSIUS"]]
selection
TEMP TEMP_CELSIUS
0 65.5 18.611111
1 65.8 18.777778
2 68.4 20.222222
3 57.5 14.166667
4 51.4 10.777778
5 52.2 11.222222

As a result, we now have a new DataFrame with two columns and 6 rows (i.e. index labels ranging from 0 to 5).

Question 3.3#

Calculate the mean temperature (in Celsius) for the last seven days of June. Do the selection using the row index values.

# Solution

data.loc[23:29, "TEMP_CELSIUS"].mean()
18.253968253968257

Selecting a single row or value#

You can also select an individual row from a specific position using .loc indexing. Here we select all the data values on the row with index 4 (the 5th row):

row = data.loc[4]
row
YEARMODA        2.016060e+07
TEMP            5.140000e+01
MAX             5.830000e+01
MIN             4.320000e+01
DIFF            1.510000e+01
TEMP_CELSIUS    1.077778e+01
TEMP_KELVIN     2.839278e+02
Name: 4, dtype: float64

.loc indexing returns the values from that position as a Series, where the indices are actually the column names of the variables. Hence, you can access the value of an individual column by referring to its name in the following format:

# Print one attribute from the selected row
row["TEMP"]
51.4
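
Alternatively, because the column names here are valid Python identifiers, the same value can be accessed with attribute-style access (note that this only works for names without spaces that do not clash with existing Series attributes or methods):

# Attribute-style access to the same value
row.TEMP
51.4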

Sometimes it is enough to access a single value in a DataFrame directly. In this case, we can use .at instead of .loc. Let's select the temperature (column TEMP) on the first row (index 0) of our selection DataFrame:

selection.at[0, "TEMP"]
65.5

As an output, we got the individual value 65.5. .at works only when accessing a single value, whereas .loc can be used to access both single and multiple values at the same time. The end result when fetching a single value with .loc is exactly the same, and the difference between the two approaches is minimal. Hence, we recommend using .loc in all cases because it is more flexible (.at is slightly faster, but in most cases this does not make a difference):

selection.loc[0, "TEMP"]
65.5

Selections based on index positions#

As we have learned thus far, .loc and .at are based on the axis labels, i.e. the names of columns and rows. For position-based indexing, pandas has .iloc, which is based on integer indices. With .iloc, it is also possible to refer to columns based on their index value (i.e. the positional number of a column in the DataFrame). For example, data.iloc[0, 0] returns 20160601 in our example DataFrame, which is the value on the first row and in the first column of the data:

# Check the first rows
print(data.head())
print()
print("Value at position (0,0) is", data.iloc[0, 0])
   YEARMODA  TEMP   MAX   MIN  DIFF  TEMP_CELSIUS  TEMP_KELVIN
0  20160601  65.5  73.6  54.7  18.9     18.611111   291.761111
1  20160602  65.8  80.8  55.0  25.8     18.777778   291.927778
2  20160603  68.4   NaN  55.6   NaN     20.222222   293.372222
3  20160604  57.5  70.9  47.3  23.6     14.166667   287.316667
4  20160605  51.4  58.3  43.2  15.1     10.777778   283.927778

Value at position (0,0) is 20160601

Hence, the syntax for .iloc is:

.iloc[start_row_position:stop_row_position,
      start_column_position:stop_column_position]

By following this syntax, we can access the value on the first row and second column (TEMP) by calling:

data.iloc[0, 1]
65.5

It is also possible to get ranges of rows and columns with .iloc. For example, we could select the YEARMODA and TEMP columns from the first five rows based on their positions in the data. Here, we select the rows with the slice 0:5 and the columns with the slice 0:2:

selection = data.iloc[0:5, 0:2]
selection
YEARMODA TEMP
0 20160601 65.5
1 20160602 65.8
2 20160603 68.4
3 20160604 57.5
4 20160605 51.4

As a result, we indeed get only two columns and the first five rows. A good thing to notice is that the number of rows returned by .iloc differs from .loc: here 0:5 returns 5 rows (following Python's list-slicing behavior), whereas .loc would return 6 rows (i.e. also including the row with index label 5).

One handy functionality of .iloc is the ability to fetch data starting from the end of the DataFrame. It is possible to retrieve the last row of the DataFrame by passing a negative number to .iloc, where the value -1 corresponds to the last row (or column), -2 corresponds to the second to last, and so on. Following this, it is easy to check e.g. the TEMP_KELVIN value (the last column) on the last row of data:

data.iloc[-1, -1]
291.8722222222222
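
In the same way, the whole last row can be retrieved by leaving out the column position:

# Select the entire last row as a Series
data.iloc[-1]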

Selections using listed criteria#

One common way of selecting rows from a DataFrame is to provide a list of values that is used for finding matching rows in a specific DataFrame column. For example, selecting rows that match specific dates can be done by passing a list of values used as criteria to the .isin() method of pandas. This goes through each value in the selected column (in this case YEARMODA) and checks whether there is a match or not. The .isin() method returns a Series of boolean values (True or False), which can be combined with .loc to do the final selection that returns only the rows meeting the selection criteria.

# List of values that will be used as basis for selecting the rows
selection_criteria = [20160601, 20160608, 20160609]

# Do the selection based on criteria applied to YEARMODA column
data.loc[data["YEARMODA"].isin(selection_criteria)]
YEARMODA TEMP MAX MIN DIFF TEMP_CELSIUS TEMP_KELVIN
0 20160601 65.5 73.6 54.7 18.9 18.611111 291.761111
7 20160608 54.2 NaN 47.5 NaN 12.333333 285.483333
8 20160609 49.4 54.1 45.7 8.4 9.666667 282.816667

Conditional selections#

One really useful feature in pandas is the ability to easily filter and select rows based on a conditional statement. The following example shows how to check whether the Celsius temperature on each row of the TEMP_CELSIUS column is higher than 15 degrees:

data["TEMP_CELSIUS"] > 15
0      True
1      True
2      True
3     False
4     False
... (output truncated)

As a result, we get a Series of booleans, where the value True or False on each row determines whether our condition was met or not. This kind of Series (or numpy.array) of boolean values based on some predefined criteria is typically called a mask. We can take advantage of this mask when doing selections with .loc based on specific criteria, and we can specify the criteria directly inside the square brackets of .loc. In the following, we will use the same criteria and store all rows meeting it in the variable warm_temps (warm temperatures), i.e. we select the rows with a temperature (in Celsius) higher than 15 degrees:

warm_temps = data.loc[data["TEMP_CELSIUS"] > 15]
warm_temps
YEARMODA TEMP MAX MIN DIFF TEMP_CELSIUS TEMP_KELVIN
0 20160601 65.5 73.6 54.7 18.9 18.611111 291.761111
1 20160602 65.8 80.8 55.0 25.8 18.777778 291.927778
2 20160603 68.4 NaN 55.6 NaN 20.222222 293.372222
13 20160614 59.7 67.8 47.8 20.0 15.388889 288.538889
14 20160615 63.4 70.3 49.3 21.0 17.444444 290.594444
16 20160617 60.4 70.7 55.9 14.8 15.777778 288.927778
19 20160620 59.3 69.1 52.2 16.9 15.166667 288.316667
20 20160621 62.6 71.4 50.4 21.0 17.000000 290.150000
21 20160622 61.7 70.2 55.4 14.8 16.500000 289.650000
22 20160623 60.9 67.1 54.9 12.2 16.055556 289.205556
23 20160624 61.1 68.9 56.7 12.2 16.166667 289.316667
24 20160625 65.7 75.4 57.9 17.5 18.722222 291.872222
25 20160626 69.6 77.7 60.3 17.4 20.888889 294.038889
26 20160627 60.7 70.0 NaN NaN 15.944444 289.094444
27 20160628 65.4 73.0 55.8 17.2 18.555556 291.705556
28 20160629 65.8 73.2 NaN NaN 18.777778 291.927778
29 20160630 65.7 72.7 59.2 13.5 18.722222 291.872222
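
The mask can also be stored in a variable first, which is equivalent to the selection above but can make more complex criteria easier to read (a stylistic sketch):

# Store the boolean mask in a variable and use it for the selection
mask = data["TEMP_CELSIUS"] > 15
warm_temps = data.loc[mask]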

It is also possible to combine multiple criteria at the same time. Combining multiple criteria can be done with the & operator (AND) or the | operator (OR). Notice that the individual conditions need to be wrapped in parentheses () so that the comparisons are evaluated before they are combined. Here, we select temperatures above 15 degrees that were recorded during the second half of June 2016 (i.e. YEARMODA >= 20160615):

warm_temps = data.loc[(data["TEMP_CELSIUS"] > 15) & (data["YEARMODA"] >= 20160615)]
warm_temps
YEARMODA TEMP MAX MIN DIFF TEMP_CELSIUS TEMP_KELVIN
14 20160615 63.4 70.3 49.3 21.0 17.444444 290.594444
16 20160617 60.4 70.7 55.9 14.8 15.777778 288.927778
19 20160620 59.3 69.1 52.2 16.9 15.166667 288.316667
20 20160621 62.6 71.4 50.4 21.0 17.000000 290.150000
21 20160622 61.7 70.2 55.4 14.8 16.500000 289.650000
22 20160623 60.9 67.1 54.9 12.2 16.055556 289.205556
23 20160624 61.1 68.9 56.7 12.2 16.166667 289.316667
24 20160625 65.7 75.4 57.9 17.5 18.722222 291.872222
25 20160626 69.6 77.7 60.3 17.4 20.888889 294.038889
26 20160627 60.7 70.0 NaN NaN 15.944444 289.094444
27 20160628 65.4 73.0 55.8 17.2 18.555556 291.705556
28 20160629 65.8 73.2 NaN NaN 18.777778 291.927778
29 20160630 65.7 72.7 59.2 13.5 18.722222 291.872222

Now we have a subset of our DataFrame with only the rows where TEMP_CELSIUS is above 15 and the dates in the YEARMODA column are from the 15th of June onwards. Notice that the index values (the numbers on the left) are still the index labels from the original DataFrame. This indicates that our result is really a slice of the original data.
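
The | (OR) operator is used in the same way. For instance, the following sketch (with arbitrary thresholds chosen purely for illustration) selects the days that were either cool or warm; note that the original index labels would again be preserved in the result:

# Select rows where the temperature was below 12 OR above 18 degrees Celsius
cool_or_warm = data.loc[(data["TEMP_CELSIUS"] < 12) | (data["TEMP_CELSIUS"] > 18)]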

It is possible to reset the index using the reset_index() function, which makes the index numbering start from 0 and increase sequentially. This is often a useful operation, because it then makes it easier to slice the data with .loc or .iloc. By default, reset_index() would keep track of the previous index in a new column called index, which might be useful in some cases but is not here, so we can omit it by passing the parameter drop=True:

warm_temps = warm_temps.reset_index(drop=True)
warm_temps
YEARMODA TEMP MAX MIN DIFF TEMP_CELSIUS TEMP_KELVIN
0 20160615 63.4 70.3 49.3 21.0 17.444444 290.594444
1 20160617 60.4 70.7 55.9 14.8 15.777778 288.927778
2 20160620 59.3 69.1 52.2 16.9 15.166667 288.316667
3 20160621 62.6 71.4 50.4 21.0 17.000000 290.150000
4 20160622 61.7 70.2 55.4 14.8 16.500000 289.650000
5 20160623 60.9 67.1 54.9 12.2 16.055556 289.205556
6 20160624 61.1 68.9 56.7 12.2 16.166667 289.316667
7 20160625 65.7 75.4 57.9 17.5 18.722222 291.872222
8 20160626 69.6 77.7 60.3 17.4 20.888889 294.038889
9 20160627 60.7 70.0 NaN NaN 15.944444 289.094444
10 20160628 65.4 73.0 55.8 17.2 18.555556 291.705556
11 20160629 65.8 73.2 NaN NaN 18.777778 291.927778
12 20160630 65.7 72.7 59.2 13.5 18.722222 291.872222

As can be seen, the index values now go from 0 to 12. Resetting the index has also unlinked the warm_temps DataFrame from data, meaning that it is not a view anymore but an independent pandas object. When making selections, it is quite typical that pandas gives you warnings if you modify the selected data without first resetting the index or taking a copy of the selected data. To demonstrate this, we will make the selection again and make a small change to the first value of the last column (TEMP_KELVIN) by updating it to 17.5. Here, we can take advantage of .iloc, which makes it easy to access the value based on its position (first row, last column):

warm_temps = data.loc[(data["TEMP_CELSIUS"] > 15) & (data["YEARMODA"] >= 20160615)]
warm_temps.iloc[0, -1] = 17.5

Because we modified the selection, which is a slice of the original data, pandas gives us a warning (a SettingWithCopyWarning) about a possibly invalid value assignment. In most cases the warning can be ignored, but it is a good practice to always take a copy when doing selections, especially if you continue working with the selected data and possibly modify it further:

# Make the selection and take a copy
warm_temps = data.loc[
    (data["TEMP_CELSIUS"] > 15) & (data["YEARMODA"] >= 20160615)
].copy()

# Now update the first value of the last column
warm_temps.iloc[0, -1] = 17.5

As we can see, this time we did not receive any warnings, and it is safe to continue working with the selection without worrying about "hidden linkages" between the DataFrames that could cause issues (we discuss this more in the next section).

Question 3.4#

Find the mean temperature (in Celsius) for the last seven days of June again. This time, select the rows based on a condition on the YEARMODA column.

# Solution
data["TEMP_CELSIUS"].loc[data["YEARMODA"] >= 20160624].mean()
18.253968253968257

View vs a copy#

A good thing to know when doing selections is that the end result can sometimes be a so-called view. In such a case, the selection and the original data may still be linked to each other. This happens e.g. if you make a selection like the ones above but return only a single column from the original data. When you have a view, a change in the original data for that specific column can also change the value in the selection. Without going into the details of why and when this happens, this behavior can nevertheless be confusing and have unexpected consequences. To avoid it, a good practice is to always make a copy whenever doing selections (i.e. unlink the two DataFrames). You can make a copy easily while doing the selection by adding .copy() to the end of the command:

selection = data.loc[0:5, ["TEMP", "TEMP_CELSIUS"]].copy()
selection
TEMP TEMP_CELSIUS
0 65.5 18.611111
1 65.8 18.777778
2 68.4 20.222222
3 57.5 14.166667
4 51.4 10.777778
5 52.2 11.222222

Now we have the exact same data in our end result, but we have ensured that the selection is not linked to the original data anymore. To demonstrate what can happen with a view, let's select a single column from our selection DataFrame (which will be a view) and modify the data a bit to see the consequences of not being careful:

temp = selection["TEMP"]
temp
0    65.5
1    65.8
2    68.4
3    57.5
4    51.4
5    52.2
Name: TEMP, dtype: float64

Now if we make a change to the DataFrame we selected from, i.e. selection, it will also influence the values in temp:

selection.iloc[0, 0] = 30.0
selection.head()
TEMP TEMP_CELSIUS
0 30.0 18.611111
1 65.8 18.777778
2 68.4 20.222222
3 57.5 14.166667
4 51.4 10.777778
# Check the values in temp (which we did not touch)
temp
0    30.0
1    65.8
2    68.4
3    57.5
4    51.4
5    52.2
Name: TEMP, dtype: float64

As we can see, the value in our temp Series changed from 65.5 to 30.0, although we did not modify temp directly. The change happened nevertheless because the two data objects were still linked to each other.

Dealing with missing data#

As you may have noticed by now, we have several missing values in the temperature minimum, maximum, and difference columns (MIN, MAX, and DIFF). These missing values are indicated as NaN (not-a-number). Having missing data in your datafile is a really common situation, and typically you want to deal with it somehow. Common procedures for dealing with NaN values are to either remove them from the DataFrame or fill them with some value. In pandas, both of these options are really easy to do. Let's first check whether our data has NaN values:

warm_temps
YEARMODA TEMP MAX MIN DIFF TEMP_CELSIUS TEMP_KELVIN
14 20160615 63.4 70.3 49.3 21.0 17.444444 17.500000
16 20160617 60.4 70.7 55.9 14.8 15.777778 288.927778
19 20160620 59.3 69.1 52.2 16.9 15.166667 288.316667
20 20160621 62.6 71.4 50.4 21.0 17.000000 290.150000
21 20160622 61.7 70.2 55.4 14.8 16.500000 289.650000
22 20160623 60.9 67.1 54.9 12.2 16.055556 289.205556
23 20160624 61.1 68.9 56.7 12.2 16.166667 289.316667
24 20160625 65.7 75.4 57.9 17.5 18.722222 291.872222
25 20160626 69.6 77.7 60.3 17.4 20.888889 294.038889
26 20160627 60.7 70.0 NaN NaN 15.944444 289.094444
27 20160628 65.4 73.0 55.8 17.2 18.555556 291.705556
28 20160629 65.8 73.2 NaN NaN 18.777778 291.927778
29 20160630 65.7 72.7 59.2 13.5 18.722222 291.872222

As we can see, the MIN and DIFF columns are missing values at index labels 26 and 28. It is also possible to confirm this with pandas by accessing the Series attribute .hasnans, which can be a handy tool especially when automating data analysis pipelines. Each Series (i.e. column) in the DataFrame has this attribute. Let's check if the MIN column contains any NaNs:

warm_temps["MIN"].hasnans
True
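
If you also want to know how many values are missing and in which columns, a handy companion check is to chain isna() with sum(), which counts the missing values per column (the counts below can be verified against the table above):

# Count the number of NaN values in each column
warm_temps.isna().sum()
YEARMODA        0
TEMP            0
MAX             0
MIN             2
DIFF            2
TEMP_CELSIUS    0
TEMP_KELVIN     0
dtype: int64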

The .hasnans attribute returns either True or False (here True), depending on whether the Series contains any NaN values. Let's now see how we can remove the NoData values (i.e. clean the data) using the .dropna() function. You can pass a list of the column(s) in which the NaN values should be searched for via the subset parameter:

cols_to_check = ["MIN"]
warm_temps_clean = warm_temps.dropna(subset=cols_to_check)
warm_temps_clean
YEARMODA TEMP MAX MIN DIFF TEMP_CELSIUS TEMP_KELVIN
14 20160615 63.4 70.3 49.3 21.0 17.444444 17.500000
16 20160617 60.4 70.7 55.9 14.8 15.777778 288.927778
19 20160620 59.3 69.1 52.2 16.9 15.166667 288.316667
20 20160621 62.6 71.4 50.4 21.0 17.000000 290.150000
21 20160622 61.7 70.2 55.4 14.8 16.500000 289.650000
22 20160623 60.9 67.1 54.9 12.2 16.055556 289.205556
23 20160624 61.1 68.9 56.7 12.2 16.166667 289.316667
24 20160625 65.7 75.4 57.9 17.5 18.722222 291.872222
25 20160626 69.6 77.7 60.3 17.4 20.888889 294.038889
27 20160628 65.4 73.0 55.8 17.2 18.555556 291.705556
29 20160630 65.7 72.7 59.2 13.5 18.722222 291.872222

As you can see from the table above (and from the change in the index values), we now have a DataFrame without the NoData values. However, doing this naturally means losing data (rows), which might not be an optimal solution in some cases. Because of this, pandas also provides the option to fill the NoData cells with some value using the fillna() function. Here, we fill the missing values with the value -9999. Note that we are not passing the subset parameter this time:

warm_temps.fillna(-9999)
YEARMODA TEMP MAX MIN DIFF TEMP_CELSIUS TEMP_KELVIN
14 20160615 63.4 70.3 49.3 21.0 17.444444 17.500000
16 20160617 60.4 70.7 55.9 14.8 15.777778 288.927778
19 20160620 59.3 69.1 52.2 16.9 15.166667 288.316667
20 20160621 62.6 71.4 50.4 21.0 17.000000 290.150000
21 20160622 61.7 70.2 55.4 14.8 16.500000 289.650000
22 20160623 60.9 67.1 54.9 12.2 16.055556 289.205556
23 20160624 61.1 68.9 56.7 12.2 16.166667 289.316667
24 20160625 65.7 75.4 57.9 17.5 18.722222 291.872222
25 20160626 69.6 77.7 60.3 17.4 20.888889 294.038889
26 20160627 60.7 70.0 -9999.0 -9999.0 15.944444 289.094444
27 20160628 65.4 73.0 55.8 17.2 18.555556 291.705556
28 20160629 65.8 73.2 -9999.0 -9999.0 18.777778 291.927778
29 20160630 65.7 72.7 59.2 13.5 18.722222 291.872222

As a result, we now have a DataFrame where all the NoData values in the whole DataFrame are filled with the value -9999. If you want to fill the NaN values of only a specific column, you can do that by applying fillna() to that specific column (or columns) only:

warm_temps["MIN"] = warm_temps["MIN"].fillna(-9999)

Notice that fillna() returns a new DataFrame and does not modify the data in place: because we did not store the output of the previous command, warm_temps itself still contained the NaN values, and the command above fills them in the MIN column only.

It is important to be aware that in many cases filling the data with a specific value is dangerous, because you end up modifying the actual data, which might affect the results of your analysis. For example, in the case above we would have dramatically changed the temperature difference column, because -9999 is not an actual temperature difference! Hence, use caution when filling missing values. In some cases, you might have to fill in NoData values for the purpose of saving the data to a file in a specific format. For example, some GIS software packages do not accept missing values. Always pay attention to potential NoData values when reading in data files and doing further analysis.

Data type conversions#

When doing data analysis, one quite typical operation is to convert the values in a column from one data type to another, such as from floating point to integer. Remember that we already did data type conversions in Chapter 1 using built-in Python functions such as int() or str(). For values stored in a pandas Series, we can use the astype() method. Next, we will continue using the same data as previously and convert all the temperature values (floats) in the column TEMP to integers. Let's start by looking at our original data:

data["TEMP"].head()
0    65.5
1    65.8
2    68.4
3    57.5
4    51.4
Name: TEMP, dtype: float64

Now we can easily convert those decimal numbers to integers by calling astype(int):

data["TEMP"].astype(int).head()
0    65
1    65
2    68
3    57
4    51
Name: TEMP, dtype: int64

As we can see, the values were converted to integers. However, it is important to be careful with type conversions from floating point values to integers. The conversion simply drops everything to the right of the decimal point, so all values are rounded down to the nearest whole number. For example, the second value in our Series (65.8) was truncated to 65 as an integer, when it clearly should be rounded up to 66. Chaining the rounding and the type conversion solves this issue, as the .round(0).astype(int) command first rounds the values to zero decimals and then converts them into integers:

data["TEMP"].round(0).astype(int).head()
0    66
1    66
2    68
3    58
4    51
Name: TEMP, dtype: int64

As we can see, now the integer values are correctly rounded.

The astype() method supports converting between all basic Python data types (int, float, str), and it also knows how to convert between more specific numpy data types, such as int64, int32, float64, and float32, to mention a few (see [1] for a full list). Using a numpy data type can be useful if you need to control e.g. how many bits are reserved for storing the values. For instance, int64 (a 64-bit integer) can store integer values ranging from -9223372036854775808 to 9223372036854775807, whereas int16 can only store values from -32768 to 32767. If you pass the "normal" int or float to the astype() function, pandas automatically stores the values with 64 bits. The more bits you use, the more physical memory is required to store the data. Hence, in some specific cases, e.g. when dealing with extremely large datasets, it can be useful to store the values of specific columns with fewer bits (to save memory).
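
To make the memory effect concrete, the following small sketch compares the memory consumed by 64-bit and 16-bit integer versions of the TEMP column (our temperature values easily fit within the int16 range):

# Convert TEMP to 64-bit and 16-bit integers and compare their sizes in memory
temp_int64 = data["TEMP"].round(0).astype("int64")
temp_int16 = data["TEMP"].round(0).astype("int16")

# 30 values x 8 bytes = 240 bytes vs 30 values x 2 bytes = 60 bytes
print(temp_int64.nbytes, "vs", temp_int16.nbytes, "bytes")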

Sorting data#

Quite often it is useful to be able to sort your data (in ascending or descending order) based on the values in some column. This can be done easily in pandas using the sort_values(by="YourColumnName") function. Let's first sort the values in ascending order based on the TEMP column:

# Sort dataframe, ascending
data.sort_values(by="TEMP").head()
YEARMODA TEMP MAX MIN DIFF TEMP_CELSIUS TEMP_KELVIN
8 20160609 49.4 54.1 45.7 8.4 9.666667 282.816667
9 20160610 49.5 55.9 43.0 12.9 9.722222 282.872222
4 20160605 51.4 58.3 43.2 15.1 10.777778 283.927778
5 20160606 52.2 59.7 42.8 16.9 11.222222 284.372222
10 20160611 54.0 62.1 41.7 20.4 12.222222 285.372222

Of course, it is also possible to sort the values in descending order with the ascending=False parameter:

# Sort dataframe, descending
data.sort_values(by="TEMP", ascending=False).head()
YEARMODA TEMP MAX MIN DIFF TEMP_CELSIUS TEMP_KELVIN
25 20160626 69.6 77.7 60.3 17.4 20.888889 294.038889
2 20160603 68.4 NaN 55.6 NaN 20.222222 293.372222
1 20160602 65.8 80.8 55.0 25.8 18.777778 291.927778
28 20160629 65.8 73.2 NaN NaN 18.777778 291.927778
29 20160630 65.7 72.7 59.2 13.5 18.722222 291.872222

In some situations, you might need to sort values based on multiple columns simultaneously, which is sometimes called multi-level sorting. This can be done by passing a list of column names to the by parameter. When sorting the data based on multiple columns, you sometimes might also want the first-level sorting to happen in ascending order and the second-level sorting in descending order. A typical situation for this kind of sorting would be e.g. sorting the temperatures first by weekday (Monday, Tuesday, etc.) and then, within each weekday, ordering the values in descending order, which would always show the warmest temperature of a given weekday first. Let's modify our data a bit to demonstrate this. We will add a new column containing information about the weekday. The 1st of June 2016 was a Wednesday, so we start from that:

# Create a list of weekdays that matches with our data
# The data covers 4 weeks + 2 days (i.e. altogether 30 days)
week_days = ["Wed", "Thu", "Fri", "Sat", "Sun", "Mon", "Tue"]
day_list = week_days * 4 + week_days[:2]

# Add the weekdays to our DataFrame
data["WEEKDAY"] = day_list
data.head(10)
YEARMODA TEMP MAX MIN DIFF TEMP_CELSIUS TEMP_KELVIN WEEKDAY
0 20160601 65.5 73.6 54.7 18.9 18.611111 291.761111 Wed
1 20160602 65.8 80.8 55.0 25.8 18.777778 291.927778 Thu
2 20160603 68.4 NaN 55.6 NaN 20.222222 293.372222 Fri
3 20160604 57.5 70.9 47.3 23.6 14.166667 287.316667 Sat
4 20160605 51.4 58.3 43.2 15.1 10.777778 283.927778 Sun
5 20160606 52.2 59.7 42.8 16.9 11.222222 284.372222 Mon
6 20160607 56.9 65.1 45.9 19.2 13.833333 286.983333 Tue
7 20160608 54.2 NaN 47.5 NaN 12.333333 285.483333 Wed
8 20160609 49.4 54.1 45.7 8.4 9.666667 282.816667 Thu
9 20160610 49.5 55.9 43.0 12.9 9.722222 282.872222 Fri

Now we have a new column with information about the weekday of each row. Next, we can test how to sort the values so that the data is ordered by weekday, and within each weekday the temperatures are in descending order. You can adjust how the ordering works by passing a list of boolean values to the ascending parameter:

data.sort_values(by=["WEEKDAY", "TEMP_CELSIUS"], ascending=[True, False]).head(10)
YEARMODA TEMP MAX MIN DIFF TEMP_CELSIUS TEMP_KELVIN WEEKDAY
2 20160603 68.4 NaN 55.6 NaN 20.222222 293.372222 Fri
23 20160624 61.1 68.9 56.7 12.2 16.166667 289.316667 Fri
16 20160617 60.4 70.7 55.9 14.8 15.777778 288.927778 Fri
9 20160610 49.5 55.9 43.0 12.9 9.722222 282.872222 Fri
26 20160627 60.7 70.0 NaN NaN 15.944444 289.094444 Mon
19 20160620 59.3 69.1 52.2 16.9 15.166667 288.316667 Mon
12 20160613 58.3 68.2 47.3 20.9 14.611111 287.761111 Mon
5 20160606 52.2 59.7 42.8 16.9 11.222222 284.372222 Mon
24 20160625 65.7 75.4 57.9 17.5 18.722222 291.872222 Sat
3 20160604 57.5 70.9 47.3 23.6 14.166667 287.316667 Sat

As a result, the data is now ordered first by weekday (i.e. rows with the same weekday value are grouped together), and within each weekday the temperature values are in descending order, always showing the warmest day first. Ordering data in this manner based on multiple criteria can sometimes be very useful when analyzing your data.

Table joins: Combining DataFrames based on a common key#

The basic logic of a table join#

Joining data between two or more DataFrames is a common task when doing data analysis. The minimum requirement for being able to combine data between two (or more) DataFrames (i.e. tables) is to have at least one common attribute (called a key) that has identical values in both DataFrames. Figure 3.4 illustrates this logic: we want to merge the precipitation data from the Kumpula weather station into the temperature data that we worked with earlier. The common key in this case is the time information, which is stored in the column YEARMODA in the left DataFrame and in the column DATE in the right DataFrame. The column names of the keys can differ (as in our case), but the actual values stored in these columns should correspond to each other, so that the records can be matched between the tables. The key values can contain data in any format (dates, text, numbers, etc.); hence, the data is not limited to dates or integers as in this example.

Figure 3.4. Joining precipitation data from the right DataFrame to the left based on common key.


Table join using pandas .merge()#

In the following, we first read the precipitation data from Kumpula and then join it with the DataFrame containing the average temperature data. Merging two DataFrames based on a common key (or multiple keys) can be done easily in pandas using the .merge() function. If the key column is named identically in both DataFrames, it can be specified with the parameter on. In our case, the columns containing the common values are named differently, so we need to specify the key for the left DataFrame with the parameter left_on and the key for the right DataFrame with the parameter right_on.

# Read precipitation data and show first rows
rainfall = pd.read_csv("data/2902781.csv")
rainfall.head()
STATION NAME DATE PRCP TMAX TMIN
0 FIE00142226 HELSINKI KUMPULA, FI 20160601 0.00 74 55
1 FIE00142226 HELSINKI KUMPULA, FI 20160602 0.56 81 55
2 FIE00142226 HELSINKI KUMPULA, FI 20160603 0.00 78 56
3 FIE00142226 HELSINKI KUMPULA, FI 20160604 0.00 71 51
4 FIE00142226 HELSINKI KUMPULA, FI 20160605 0.01 58 43
# Make a table join
join = data.merge(rainfall, left_on="YEARMODA", right_on="DATE")
join.head()
YEARMODA TEMP MAX MIN DIFF TEMP_CELSIUS TEMP_KELVIN WEEKDAY STATION NAME DATE PRCP TMAX TMIN
0 20160601 65.5 73.6 54.7 18.9 18.611111 291.761111 Wed FIE00142226 HELSINKI KUMPULA, FI 20160601 0.00 74 55
1 20160602 65.8 80.8 55.0 25.8 18.777778 291.927778 Thu FIE00142226 HELSINKI KUMPULA, FI 20160602 0.56 81 55
2 20160603 68.4 NaN 55.6 NaN 20.222222 293.372222 Fri FIE00142226 HELSINKI KUMPULA, FI 20160603 0.00 78 56
3 20160604 57.5 70.9 47.3 23.6 14.166667 287.316667 Sat FIE00142226 HELSINKI KUMPULA, FI 20160604 0.00 71 51
4 20160605 51.4 58.3 43.2 15.1 10.777778 283.927778 Sun FIE00142226 HELSINKI KUMPULA, FI 20160605 0.01 58 43

Now we have merged all the information from the right DataFrame into the left one and stored the result in the variable join. With this, we can e.g. analyze the relationship between the daily average temperature and precipitation, to understand whether the temperature is lower when it rains. Currently, the join DataFrame contains many variables that are not necessarily useful for us. To make the output more concise, a useful trick when joining is to limit the number of columns that are kept from the right DataFrame. This can be done by chaining a simple column selection with the merge, as shown next. When doing this, it is important to remember that the key column of the right DataFrame needs to be part of the selection for the table join to work.

# Make another join but only keep the attribute of interest
join2 = data.merge(rainfall[["DATE", "PRCP"]], left_on="YEARMODA", right_on="DATE")
join2.head()
YEARMODA TEMP MAX MIN DIFF TEMP_CELSIUS TEMP_KELVIN WEEKDAY DATE PRCP
0 20160601 65.5 73.6 54.7 18.9 18.611111 291.761111 Wed 20160601 0.00
1 20160602 65.8 80.8 55.0 25.8 18.777778 291.927778 Thu 20160602 0.56
2 20160603 68.4 NaN 55.6 NaN 20.222222 293.372222 Fri 20160603 0.00
3 20160604 57.5 70.9 47.3 23.6 14.166667 287.316667 Sat 20160604 0.00
4 20160605 51.4 58.3 43.2 15.1 10.777778 283.927778 Sun 20160605 0.01

As can be seen, now only the column DATE and the attribute of interest, PRCP, were joined and kept in the output from the right DataFrame. A similar trick can be applied to the left DataFrame by adding a column selection before the .merge() call, if you want to reduce the number of columns in the result, as sketched below.
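
For example, a sketch limiting the left DataFrame to the date and the Celsius temperature only (join3 is just an illustrative variable name):

# Limit the columns kept from the left DataFrame before merging
join3 = data[["YEARMODA", "TEMP_CELSIUS"]].merge(
    rainfall[["DATE", "PRCP"]], left_on="YEARMODA", right_on="DATE"
)
join3.head()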

In our case, doing the table join was fairly straightforward because we had only one unique record per day in both DataFrames. However, in some cases you might have multiple records per key in one of the DataFrames (e.g. hourly observations vs. daily observations). In specific cases (not always!), this can cause issues such as incorrect analysis results and other undesired consequences if not taken into account properly. Such a mismatch in the number of records per table can be handled e.g. by aggregating the hourly data to daily averages. You can learn more about these aggregation techniques in the following sections, but a minimal preview is sketched below.
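
A minimal sketch of such an aggregation, using a small, made-up hourly dataset (the hourly DataFrame and its values are hypothetical):

# Hypothetical hourly temperature observations for two days
hourly = pd.DataFrame(
    {
        "DATE": [20160601, 20160601, 20160602, 20160602],
        "TEMP": [64.0, 67.0, 63.5, 68.1],
    }
)

# Aggregate to one (average) record per day before joining
daily_mean = hourly.groupby("DATE", as_index=False)["TEMP"].mean()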

Footnotes#