Getting started with data analysis#

What is pandas?#

The pandas Python library [1] is an easy-to-use but powerful data analysis toolkit. It is a mature data analytics framework that is widely used in many fields of science. The development of pandas started in 2008 and it is now maintained by an active developer community (McKinney, 2017).

Pandas is a “high-level” package, which means that it makes use of several other packages in the background. It combines the performance of powerful Python libraries such as NumPy [2], matplotlib [3] and SciPy [4].

One of the most useful features of pandas is its ability to interact with numerous data formats. It supports reading and writing data e.g. from/to:

  • CSV

  • JSON

  • HTML

  • MS Excel

  • HDF5

  • Stata

  • SAS

  • Python Pickle format

  • SQL (Postgresql, MySQL, Oracle, MariaDB, etc.)

For a full list of supported file formats and other features, see the official pandas documentation and reference guide [5].
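Each of these formats has its own reader function in pandas (and, for DataFrames, typically a matching writer method), following the naming pattern read_*() and to_*(). As a minimal sketch, assuming pandas has been imported as pd (something we will do in the next section) and using hypothetical filenames:

# Hypothetical filenames, just to illustrate the read_*/to_* naming pattern
df = pd.read_json("observations.json")  # read a JSON file into a DataFrame
df.to_excel("observations.xlsx")  # write the same data to an Excel file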

Pandas data structures#

In pandas, table-like data are stored in two-dimensional DataFrame objects with labeled rows and columns. The pandas DataFrame was originally inspired by data frames that are in-built in the R programming language. You can think of the pandas DataFrame as a programmable spreadsheet. One-dimensional sequences of values are stored in pandas Series. One row or one column in a pandas DataFrame is actually a pandas Series. You can think of a pandas Series as a clever list. These pandas structures incorporate a number of things we’ve already encountered earlier in this book, such as indices, data stored in a collection, and data types.

Figure 3.1. Illustration of the pandas DataFrame and pandas Series data structures. A pandas DataFrame is a 2-dimensional data structure used for storing and manipulating table-like data (data with rows and columns). A pandas Series is a 1-dimensional data structure used for storing and manipulating a sequence of values.


As you can see from Figure 3.1, both the DataFrame and Series in pandas have an index that can be used to select values, but DataFrames also have column labels to identify the columns. In the next sections, we will use many of these features to explore real-world data and learn some useful data analysis procedures. For a comprehensive overview of pandas data structures, we recommend having a look at the pandas online documentation about data structures [6] as well as Chapter 5 in McKinney (2017).

Reading tabular data#

In the following sections, we will learn how to read data from a text file (“Kumpula-June-2016-w-metadata.txt”) containing weather observations from Kumpula, Helsinki (Finland). The data were retrieved from the NOAA climate database [7] and contain observed daily mean, minimum, and maximum temperatures from June 2016, recorded by a weather observation station in Kumpula. The file includes altogether 30 rows of observations (one per day). The first fifteen rows of the file look like the following:

# Data file contents: Daily temperatures (mean, min, max) for Kumpula, Helsinki
#                     for June 1-30, 2016
# Data source: https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND
# Data processing: Extracted temperatures from raw data file, converted to
#                  comma-separated format
#
# David Whipp - 02.10.2017

YEARMODA,TEMP,MAX,MIN
20160601,65.5,73.6,54.7
20160602,65.8,80.8,55.0
20160603,68.4,,55.6
20160604,57.5,70.9,47.3
20160605,51.4,58.3,43.2
20160606,52.2,59.7,42.8

Now that we have familiarized ourselves with the data, we can continue and see how to read the file using pandas. Let’s start by importing the pandas module. It is customary to import pandas as pd:

import pandas as pd

Next, we will read the input data file and store its contents in a variable called data using the read_csv() function:

data = pd.read_csv("data/Kumpula-June-2016-w-metadata.txt")

Our input file is a comma-delimited file, meaning that the columns in the data are separated by commas (,) on each row. The read_csv() function uses the comma as its default delimiter, hence we did not need to specify it separately in the command above. If all went as planned, you should now have a new variable data in memory that contains the input data.

It is quite common to have some character other than a comma separating the columns from each other. The read_csv() function in pandas is a generic function for reading text-based data files, supporting files separated by commas, spaces, or other common separators. The sep parameter can be used to specify which character is used as the delimiter. For instance, sep=';' would assume that the columns in the data file are delimited by semicolons (;). For a full list of available parameters, please refer to the pandas read_csv() documentation [8], or run help(pd.read_csv).
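For example, a minimal sketch of reading a semicolon-delimited file (the filename here is hypothetical):

# Hypothetical semicolon-delimited file; sep=";" specifies the delimiter
data_semicolon = pd.read_csv("data/some-file.txt", sep=";")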

Pandas also has several functions for parsing input data in different formats. For example, reading Excel files can easily be done using the read_excel() function. Another useful function is read_pickle(), which reads data stored in the Python pickle format. Check out the pandas documentation about input and output tools [9] for more details about reading data.
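As a brief sketch with hypothetical filenames (note that reading Excel files requires an Excel reader package, such as openpyxl, in the background):

# Read the first sheet of an Excel file into a DataFrame (hypothetical file)
excel_data = pd.read_excel("data/observations.xlsx")

# Read a DataFrame previously stored in Python pickle format (hypothetical file)
pickle_data = pd.read_pickle("data/observations.pkl")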

Let’s check the contents of our data. You can print the contents of a DataFrame (or a snippet of it) simply by calling data. However, quite often you only want to check the first (or last) n rows in your data. For that, we can use the head() method of the pandas DataFrame object, which will, by default, return the first 5 rows of the DataFrame. You can return any number of rows by adding a number inside the parentheses, such as 10, which will return the first ten rows of data.

data.head(10)
# Data file contents: Daily temperatures (mean min max) for Kumpula Helsinki
0 # for June 1-30 2016 NaN NaN
1 # Data source: https://www.ncdc.noaa.gov/cdo-w... NaN NaN NaN
2 # Data processing: Extracted temperatures from... converted to NaN NaN
3 # comma-separated format NaN NaN NaN
4 # NaN NaN NaN
5 # David Whipp - 02.10.2017 NaN NaN NaN
6 YEARMODA TEMP MAX MIN
7 20160601 65.5 73.6 54.7
8 20160602 65.8 80.8 55.0
9 20160603 68.4 NaN 55.6

Similarly, tail() will return the last 5 rows of the DataFrame.

data.tail()
# Data file contents: Daily temperatures (mean min max) for Kumpula Helsinki
32 20160626 69.6 77.7 60.3
33 20160627 60.7 70.0 NaN
34 20160628 65.4 73.0 55.8
35 20160629 65.8 73.2 NaN
36 20160630 65.7 72.7 59.2

The first lines of the DataFrame look a bit weird. NaN stands for “not a number”, and might indicate some problem with reading in the contents of the file. Plus, we expected about 30 lines of data, but the index values go up to 36 when we print the last rows of the data variable. Looks like we need to investigate this further.

As we can observe, there are some metadata at the top of the file giving basic information about its contents and source. This isn’t data we want to process, so we need to skip over that part of the file when we load it. Here are the first 8 rows of the text file (note that the 8th row is blank):

# Data file contents: Daily temperatures (mean, min, max) for Kumpula, Helsinki
#                     for June 1-30, 2016
# Data source: https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND
# Data processing: Extracted temperatures from raw data file, converted to
#                  comma-separated format
#
# David Whipp - 02.10.2017

Fortunately, skipping over rows is easy to do when reading data with pandas. We just need to add the skiprows parameter when we read the file, giving the number of rows to skip (8 in this case). Let’s try reading the data file again, this time defining the skiprows parameter.

data = pd.read_csv("data/Kumpula-June-2016-w-metadata.txt", skiprows=8)

Let’s check what the data looks like now:

data.head()
YEARMODA TEMP MAX MIN
0 20160601 65.5 73.6 54.7
1 20160602 65.8 80.8 55.0
2 20160603 68.4 NaN 55.6
3 20160604 57.5 70.9 47.3
4 20160605 51.4 58.3 43.2
data.tail()
YEARMODA TEMP MAX MIN
25 20160626 69.6 77.7 60.3
26 20160627 60.7 70.0 NaN
27 20160628 65.4 73.0 55.8
28 20160629 65.8 73.2 NaN
29 20160630 65.7 72.7 59.2

Looks good! We seem to have 30 rows of data (index running from 0 to 29), and meaningful column names and values.

Note that pandas DataFrames have labeled axes (rows and columns). In our sample data, the rows are labeled with an index value (0 to 29), and the columns are labeled YEARMODA, TEMP, MAX, and MIN. Later on, we will learn how to use these labels for selecting and updating subsets of the data.

It is also possible to read only specific columns from the data when using the read_csv() function. You can achieve this using the usecols parameter when reading the file, which accepts a list of column names to include in the resulting DataFrame. Positional indices (e.g. [0, 1]) corresponding to the positions of the columns in the file can also be used to specify which columns should be read (a brief example follows at the end of this section). For additional details, see the pandas read_csv() documentation. Next, we will read the file Kumpula-June-2016-w-metadata.txt again and store its contents in a new variable called temp_data. In this case, we will only read the columns YEARMODA and TEMP, meaning that the new variable temp_data should have 30 rows and 2 columns:

temp_data = pd.read_csv(
    "data/Kumpula-June-2016-w-metadata.txt", skiprows=8, usecols=["YEARMODA", "TEMP"]
)

temp_data.head()
YEARMODA TEMP
0 20160601 65.5
1 20160602 65.8
2 20160603 68.4
3 20160604 57.5
4 20160605 51.4

As a result, we now have only two columns instead of the original four. Using the usecols parameter to limit the number of columns can be useful when working with data files that have tens or even hundreds of columns. Typically you are not interested in all of them, and you can focus on the few important ones by selecting them already when reading the data.
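As mentioned above, positional indices work as well. A minimal sketch that reads the same two columns by their positions in the file (YEARMODA is column 0 and TEMP is column 1):

# Select the first two columns by position instead of by name
temp_data2 = pd.read_csv(
    "data/Kumpula-June-2016-w-metadata.txt", skiprows=8, usecols=[0, 1]
)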

Writing data to a file#

Naturally, it is also possible to write data from pandas to a file. Pandas supports many common output formats such as CSV, MS Excel, XML and others. For full details about supported file formats and writer functions, see the pandas I/O documentation [9]. We will store our data in a CSV file called Kumpula_temps_June_2016.csv.

# define output filename
output_fp = "Kumpula_temps_June_2016.csv"

# Save dataframe to csv
data.to_csv(output_fp, sep=",")

Figure 3.2. The output CSV file opened in JupyterLab.


Figure 3.2 shows the output file. As we can see, the first column contains the index values without any column name. Also, the temperature values are represented as floating point numbers with a precision of 1 decimal.

Let’s see how we can modify these aspects of the output file using pandas and save the data again without the index and with zero decimals. Omitting the index can be achieved using the index=False parameter. Decimal precision can be specified using the float_format parameter. The format string "%.0f" sets the decimal precision to zero (while "%.2f" would set it to 2 decimals, and so on).

# define output filename
output_fp2 = "Kumpula_temperatures_integers_June_2016.csv"

# Save dataframe to csv
data.to_csv(output_fp2, sep=",", index=False, float_format="%.0f")

Figure 3.3. The formatted output CSV file opened in JupyterLab.


Figure 3.3 shows the “cleaner” output file, which lacks the index column and has the temperature values rounded to integers.
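The same data could also be written to other formats using the corresponding to_*() methods of the DataFrame. As a minimal sketch, writing the data to an Excel file (the filename here is for illustration only, and writing .xlsx files requires an Excel writer package such as openpyxl in the background):

# Save the DataFrame to an Excel file (hypothetical filename)
data.to_excel("Kumpula_temps_June_2016.xlsx", index=False)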

DataFrame properties#

Now we will continue exploring the full dataset that we have stored in the variable data. A normal first step when you load new data is to explore the dataset a bit to understand how it is structured and what kind of values are stored in it. Let’s start by checking the size of our DataFrame. We can use the len() function, just as we would with a list, to check how many rows we have:

len(data)
30

We can also get a quick sense of the size of the dataset using the shape attribute. It returns a tuple with the number of rows as the first element and the number of columns as the second element:

data.shape
(30, 4)

Here we see that our dataset has 30 rows and 4 columns, just as we saw above when printing out the DataFrame. Note that we do not use parentheses after the word shape when accessing it, because shape is an attribute of the pandas DataFrame object, not a method. Attributes are typically used to store useful information (or metadata) about the object at hand. We will see many more examples of these throughout the book.
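Because shape is a regular Python tuple, its elements can also be accessed or unpacked individually:

# Unpack the shape tuple into the number of rows and columns
n_rows, n_cols = data.shape
print(n_rows, n_cols)
30 4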

Let’s also check the column names in our DataFrame (yes, they are stored in an attribute as well!). We already saw the column names when we checked the first 5 rows using data.head(), but often it is useful to access the column names directly. You can check the column names by calling data.columns (which returns an Index object containing the column labels) or data.columns.values:

data.columns.values
array(['YEARMODA', 'TEMP', 'MAX', 'MIN'], dtype=object)

We can also find information about the row identifiers using the index attribute:

data.index
RangeIndex(start=0, stop=30, step=1)

Here we see how the data is indexed, starting at 0, stopping before 30 (so the last row label is 29), with an increment of 1 between each value. This is basically the same way in which Python lists are indexed; however, pandas also allows other ways of identifying the rows. DataFrame indices could, for example, be character strings or date objects. We will learn more about resetting the index later. What about the data types of each column in our DataFrame? We can check the data types of all columns at once using the dtypes attribute:

# Print data types
data.dtypes
YEARMODA      int64
TEMP        float64
MAX         float64
MIN         float64
dtype: object

Here we see that YEARMODA is an integer value (with 64-bit precision; int64), while the other values are all decimal values with 64-bit precision (float64).
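The data type of a single column can be checked in a similar way using the dtype attribute of the Series:

# Check the data type of one column only
data["TEMP"].dtype
dtype('float64')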

Question 3.1#

How would you print out the number of columns in our DataFrame?

# Solution

len(data.columns)
4
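Alternatively, the number of columns is available as the second element of the shape tuple:

data.shape[1]
4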

Selecting columns#

When conducting data analysis, it is very common to choose certain columns from the data for further processing. We can select specific columns based on their names. The basic syntax is dataframe[value], where value can be a single column name or a list of column names. Let’s start by selecting two columns, 'YEARMODA' and 'TEMP':

selection = data[["YEARMODA", "TEMP"]]
selection.head()
YEARMODA TEMP
0 20160601 65.5
1 20160602 65.8
2 20160603 68.4
3 20160604 57.5
4 20160605 51.4

Let’s also check the data type of this selection:

type(selection)
pandas.core.frame.DataFrame

As we can see, the subset is still a pandas DataFrame, and we can use all the methods and attributes of a pandas DataFrame with this subset as well. For example, we can check the shape:

selection.shape
(30, 2)

We can also access a single column of the data based on the column name:

data["TEMP"].head()
0    65.5
1    65.8
2    68.4
3    57.5
4    51.4
Name: TEMP, dtype: float64

What about the type of the column itself?

type(data["TEMP"])
pandas.core.series.Series

Each column (and each row) in a pandas DataFrame is indeed a pandas Series. Notice that you can also retrieve a column using the alternative syntax data.TEMP. This syntax works only if the column name is a valid Python variable name (e.g. it must not contain whitespace). The syntax data["column"] works for all kinds of column names, so we recommend using this approach.
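For example, for our data the following two lines of code are equivalent:

# Attribute-style access works because "TEMP" is a valid Python variable name
data.TEMP.head()
data["TEMP"].head()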

Unique values#

Sometimes it is useful to extract the unique values in a column. We can do that using the unique() method:

data["TEMP"].unique()
array([65.5, 65.8, 68.4, 57.5, 51.4, 52.2, 56.9, 54.2, 49.4, 49.5, 54. ,
       55.4, 58.3, 59.7, 63.4, 57.8, 60.4, 57.3, 56.3, 59.3, 62.6, 61.7,
       60.9, 61.1, 65.7, 69.6, 60.7, 65.4])

As a result, we get an array of the unique values in that column. We can also directly get the number of unique values using the nunique() method:

print(
    "There were",
    data["TEMP"].nunique(),
    "days with unique mean temperatures in June 2016.",
)
There were 28 days with unique mean temperatures in June 2016.

Descriptive statistics#

Pandas DataFrames and Series contain useful methods for computing summary statistics. Available methods include mean(), median(), min(), max(), and std() (the standard deviation). The statistics can be calculated at the Series level (for a single column) or for all columns of the DataFrame at once. To check e.g. the mean temperature in our input data, i.e. focusing on a single column (Series), we can do the following:

data["TEMP"].mean()
59.730000000000004

The output in this case is a single floating point number representing the mean temperature, 59.73 degrees Fahrenheit. To get the mean for all columns in the DataFrame, we can call mean() in a similar manner, but without specifying the column name:

data.mean()
YEARMODA    2.016062e+07
TEMP        5.973000e+01
MAX         6.804074e+01
MIN         5.125714e+01
dtype: float64

Notice that in this case the result is a Series showing the mean values for each column. For an overview of the basic statistics for all attributes in the data, we can use the describe() method:

data.describe()
YEARMODA TEMP MAX MIN
count 3.000000e+01 30.000000 27.000000 28.000000
mean 2.016062e+07 59.730000 68.040741 51.257143
std 8.803408e+00 5.475472 6.505575 5.498985
min 2.016060e+07 49.400000 54.100000 41.700000
25% 2.016061e+07 56.450000 64.650000 46.975000
50% 2.016062e+07 60.050000 69.100000 53.100000
75% 2.016062e+07 64.900000 72.050000 55.600000
max 2.016063e+07 69.600000 80.800000 60.300000

As a result, we get the number of non-null values for each column (count) as well as the basic statistics and quartiles (min, 25%, 50%, 75% and max). It is also possible to get other DataFrame-specific information, such as the index dtype and columns, non-null values, and memory usage, by calling info():

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   YEARMODA  30 non-null     int64  
 1   TEMP      30 non-null     float64
 2   MAX       27 non-null     float64
 3   MIN       28 non-null     float64
dtypes: float64(3), int64(1)
memory usage: 1.1 KB
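Note that describe() also works at the Series level, in case you want the summary statistics for a single column only:

# Summary statistics for the TEMP column only
data["TEMP"].describe()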

Different ways of constructing pandas objects#

Most often we create pandas objects by reading in data from an external source, such as a text file. Here, we will briefly show different ways to create pandas objects from Python objects (lists and dictionaries). If you have a very long list of numbers, storing it in a pandas Series (or a numpy.array) will allow you to interact with the values more efficiently, in terms of computing time, than working with a pure Python list. This is how you can create a pandas Series from a list:

number_series = pd.Series([4, 5, 6, 7.0])
print(number_series)
0    4.0
1    5.0
2    6.0
3    7.0
dtype: float64

Note that pandas is smart about the conversion: it detects a single floating point value (7.0) and assigns all values in the Series the data type float64. If needed, you can also set a custom index when creating the object:

number_series = pd.Series([4, 5, 6, 7.0], index=["a", "b", "c", "d"])
print(number_series)
a    4.0
b    5.0
c    6.0
d    7.0
dtype: float64
type(number_series)
pandas.core.series.Series
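Note that if all the input values had been integers, pandas would have inferred an integer data type (int64) instead. You can also set the data type explicitly using the dtype parameter. A minimal sketch:

# All-integer input produces an integer Series (int64)
int_series = pd.Series([4, 5, 6, 7])

# The dtype parameter forces a specific data type
float_series = pd.Series([4, 5, 6, 7], dtype="float64")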

How about combining several lists and constructing a DataFrame from them? Certainly, this is possible. Let’s see how we can convert lists into a pandas DataFrame. Let’s first create a few lists with the same number of items each:

stations = ["Hanko", "Heinola", "Kaisaniemi", "Malmi"]
latitudes = [59.77, 61.2, 60.18, 60.25]
longitudes = [22.95, 26.05, 24.94, 25.05]

Now we can create a pandas DataFrame by using the pandas.DataFrame constructor and passing a Python dictionary {"column_1": list_1, "column_2": list_2, ...} to it, indicating the structure of our data:

new_data = pd.DataFrame(data={"station": stations, "lat": latitudes, "lon": longitudes})
new_data
station lat lon
0 Hanko 59.77 22.95
1 Heinola 61.20 26.05
2 Kaisaniemi 60.18 24.94
3 Malmi 60.25 25.05

It is also possible to create a pandas DataFrame from a list of dictionaries. This approach is quite similar to the previous example, but we organize the data a bit differently. Having a list of dictionaries is common when working with JSON files, which are often used when reading data from the web (e.g. via an Application Programming Interface):

dictionaries = [
    {"station": "Hanko", "lat": 59.77, "lon": 22.95},
    {"station": "Heinola", "lat": 61.2, "lon": 26.05},
    {"station": "Kaisaniemi", "lat": 60.18, "lon": 24.94},
    {"station": "Malmi", "lat": 60.25, "lon": 25.05},
]

# Pass the list into the DataFrame constructor
new_data_2 = pd.DataFrame(dictionaries)
new_data_2
station lat lon
0 Hanko 59.77 22.95
1 Heinola 61.20 26.05
2 Kaisaniemi 60.18 24.94
3 Malmi 60.25 25.05

As a result, we get a DataFrame identical to the one in our first example above. Notice that sometimes you might start working with an empty DataFrame and only later “populate” it with new columns:

df = pd.DataFrame()
print(df)
Empty DataFrame
Columns: []
Index: []

Now we have an empty DataFrame, and we can add new columns to it based on the lists we created earlier. Notice that this approach requires that all the lists you want to add have equal lengths:

df["lon"] = longitudes
df["lat"] = latitudes
df
lon lat
0 22.95 59.77
1 26.05 61.20
2 24.94 60.18
3 25.05 60.25

Footnotes#