{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data in/out: Preparing GeoDataFrames from spatial data \n", "\n", "Reading data into Python is usually the first step of an analysis workflow. There are various different GIS data formats available such as [Shapefile](https://en.wikipedia.org/wiki/Shapefile), [GeoJSON](https://en.wikipedia.org/wiki/GeoJSON), [KML](https://en.wikipedia.org/wiki/Keyhole_Markup_Language), and [GPKG](https://en.wikipedia.org/wiki/GeoPackage). Geopandas is capable of reading data from all of these formats (plus many more). \n", "\n", "This tutorial will show some typical examples how to read (and write) data from different sources. The main point in this section is to demonstrate the basic syntax for reading and writing data using short code snippets. You can find the example data sets in the data-folder. However, most of the example databases do not exists, but you can use and modify the example syntax according to your own setup.\n", "\n", "## Creating new GeoDataFrame from scratch\n", "\n", "Since geopandas takes advantage of Shapely geometric objects, it is possible to create spatial data from scratch by passing Shapely's geometric objects into the GeoDataFrame. This is useful as it makes it easy to convert e.g. a text file that contains coordinates into spatial data layers. Next we will see how to create a new GeoDataFrame from scratch and save it into a file. Our goal is to define a geometry that represents the outlines of the [Senate square in Helsinki, Finland](https://fi.wikipedia.org/wiki/Senaatintori).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by creating a new empty `GeoDataFrame` object." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import geopandas as gpd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "newdata = gpd.GeoDataFrame()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "geopandas.geodataframe.GeoDataFrame" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(newdata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have an empty GeoDataFrame! A geodataframe is basically a pandas DataFrame that should have one column dedicated for geometries. By default, the geometry-column should be named `geometry` (geopandas looks for geometries from this column). \n", "\n", "Let's create the `geometry` column:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "# Create a new column called 'geometry' to the GeoDataFrame\n", "newdata[\"geometry\"] = None" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Empty GeoDataFrame\n", "Columns: [geometry]\n", "Index: []\n" ] } ], "source": [ "print(newdata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have a `geometry` column in our GeoDataFrame but we still don't have any data.\n", "\n", "Let's create a Shapely `Polygon` repsenting the Helsinki Senate square that we can later insert to our GeoDataFrame:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from shapely.geometry import Polygon" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "# Coordinates of the Helsinki Senate square in decimal degrees\n", "coordinates = [\n", " (24.950899, 60.169158),\n", " (24.953492, 60.169158),\n", " (24.953510, 60.170104),\n", " (24.950958, 60.169990),\n", "]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Create a Shapely polygon from the coordinate-tuple list\n", "poly = Polygon(coordinates)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "POLYGON ((24.950899 60.169158, 24.953492 60.169158, 24.95351 60.170104, 24.950958 60.16999, 24.950899 60.169158))\n" ] } ], "source": [ "# Check the polyogon\n", "print(poly)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, now we have an appropriate `Polygon` -object.\n", "\n", "Let's insert the polygon into our 'geometry' column of our GeoDataFrame on the first row:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "# Insert the polygon into 'geometry' -column at row 0\n", "newdata.at[0, \"geometry\"] = poly" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " geometry\n", "0 POLYGON ((24.95090 60.16916, 24.95349 60.16916...\n" ] } ], "source": [ "# Let's see what we have now\n", "print(newdata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great, now we have a GeoDataFrame with a Polygon that we could already now export to a Shapefile. However, typically you might want to include some attribute information with the geometry. \n", "\n", "Let's add another column to our GeoDataFrame called `location` with text `Senaatintori` that describes the location of the feature." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " geometry location\n", "0 POLYGON ((24.95090 60.16916, 24.95349 60.16916... Senaatintori\n" ] } ], "source": [ "# Add a new column and insert data\n", "newdata.at[0, \"location\"] = \"Senaatintori\"\n", "\n", "# Let's check the data\n", "print(newdata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, now we have additional information that is useful for recognicing what the feature represents. \n", "\n", "The next step would be to **determine the coordinate reference system (CRS) for the GeoDataFrame.** GeoDataFrame has an attribute called `.crs` that shows the coordinate system of the data (we will discuss more about CRS in next chapter). In our case, the layer doesn't yet have any crs definition:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n" ] } ], "source": [ "print(newdata.crs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We passed the coordinates as latitude and longitude decimal degrees, so the correct CRS is WGS84 (epsg code: 4326). In this case, we can simply re-build the geodataframe and pass the correct crs information to the GeoDataFrame constructor. You will learn more about how to handle coordinate reference systems using pyproj CRS objects later in this chapter. \n", "\n", "Re-create the GeoDataFrame with correct crs definition: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "newdata = gpd.GeoDataFrame(newdata, crs=4326)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "newdata.crs.name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, now we have added coordinate reference system information into our `GeoDataFrame`. The CRS information is necessary for creating a valid projection information for the output file. \n", "\n", "Finally, we can export the GeoDataFrame using `.to_file()` -function. The function works quite similarly as the export functions in pandas, but here we only need to provide the output path for the Shapefile. Easy isn't it!:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Determine the output path for the Shapefile\n", "outfp = \"L2_data/Senaatintori.shp\"\n", "\n", "# Write the data into that Shapefile\n", "newdata.to_file(outfp)" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 2 }, "source": [ "Now we have successfully created a Shapefile from scratch using geopandas. Similar approach can be used to for example to read coordinates from a text file (e.g. points) and turn that information into a spatial layer.\n", "\n", "\n", "#### Check your understanding\n", "\n", "\n", "
\n", "\n", " \n", "Check the output Shapefile by reading it with geopandas and make sure that the attribute table and geometry seems correct.\n", "\n", "
\n", "\n", "
\n", " \n", "Re-project the data to ETRS-TM35FIN (EPSG:3067) and save into a new file!\n", "\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading from different spatial data formats\n", "\n", "In geopandas, we can use a generic function [from_file()](http://geopandas.org/reference.html#geopandas.GeoDataFrame.to_file) for reading in different data formats. Esri Shapefile is the default file format. For other file formats we need to specify which driver to use for reading in the data. In the following section, we show how to read spatial data from a few of the most common vector file formats. To see all supported data formats, you can execute following: " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'ARCGEN': 'r',\n", " 'DXF': 'rw',\n", " 'CSV': 'raw',\n", " 'OpenFileGDB': 'r',\n", " 'ESRIJSON': 'r',\n", " 'ESRI Shapefile': 'raw',\n", " 'FlatGeobuf': 'rw',\n", " 'GeoJSON': 'raw',\n", " 'GeoJSONSeq': 'rw',\n", " 'GPKG': 'raw',\n", " 'GML': 'rw',\n", " 'OGR_GMT': 'rw',\n", " 'GPX': 'rw',\n", " 'GPSTrackMaker': 'rw',\n", " 'Idrisi': 'r',\n", " 'MapInfo File': 'raw',\n", " 'DGN': 'raw',\n", " 'PCIDSK': 'rw',\n", " 'OGR_PDS': 'r',\n", " 'S57': 'r',\n", " 'SQLite': 'raw',\n", " 'TopoJSON': 'r'}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import geopandas as gpd\n", "\n", "gpd.io.file.fiona.drvsupport.supported_drivers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read / write Shapefile\n", "\n", "Shapefile format originally developed by ESRI in the early 1990's is one of the most commonly used data formats (still) used today. The Shapefile is in fact comprised of several separate files that are all important for representing the spatial data. Typically a Shapefile includes (at least) four separate files with extensions `.shp`, `.dbx`, `.shx` and `.prj`. The first three of them are obligatory" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import geopandas as gpd\n", "\n", "# Read file from Shapefile\n", "fp = \"data/finland_municipalities.shp\"\n", "data = gpd.read_file(fp)\n", "\n", "# Write to Shapefile (just make a copy)\n", "outfp = \"temp/finland_municipalities.shp\"\n", "data.to_file(outfp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read / write GeoJSON" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read file from GeoJSON\n", "fp = \"data/finland_municipalities.gjson\"\n", "data = gpd.read_file(fp, driver=\"GeoJSON\")\n", "\n", "# Write to GeoJSON (just make a copy)\n", "outfp = \"temp/finland_municipalities.gjson\"\n", "data.to_file(outfp, driver=\"GeoJSON\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read / write KML" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Enable KML driver\n", "gpd.io.file.fiona.drvsupport.supported_drivers[\"KML\"] = \"rw\"\n", "\n", "# Read file from KML\n", "fp = \"data/finland_municipalities.kml\"\n", "data = gpd.read_file(fp)\n", "\n", "# Write to KML (just make a copy)\n", "outfp = \"temp/finland_municipalities.kml\"\n", "data.to_file(outfp, driver=\"KML\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read / write Geopackage" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read file from Geopackage\n", "fp = \"data/finland_municipalities.gpkg\"\n", "data = gpd.read_file(fp)\n", "\n", "# Write to Geopackage (just make a copy)\n", "outfp = \"temp/finland_municipalities.gpkg\"\n", "data.to_file(outfp, driver=\"GPKG\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read / write GeoDatabase" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read file from File Geodatabase\n", "fp = \"data/finland.gdb\"\n", "data = gpd.read_file(fp, driver=\"OpenFileGDB\", layer=\"municipalities\")\n", "\n", "# Write to same FileGDB (just add a new layer) - requires additional package installations(?)\n", "# outfp = \"data/finland.gdb\"\n", "# data.to_file(outfp, driver=\"FileGDB\", layer=\"municipalities_copy\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read / write MapInfo Tab" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read file from MapInfo Tab\n", "fp = \"data/finland_municipalities.tab\"\n", "data = gpd.read_file(fp, driver=\"MapInfo File\")\n", "\n", "# Write to same FileGDB (just add a new layer)\n", "outfp = \"temp/finland_municipalities.tab\"\n", "data.to_file(outfp, driver=\"MapInfo File\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Geocoding\n", "\n", "Geocoding is the process of transforming place names or addresses into coordinates.\n", "In this lesson we will learn how to geocode addresses using Geopandas and\n", "[geopy](https://geopy.readthedocs.io/en/stable/).\n", "\n", "Geopy and other geocoding libaries (such as [geocoder](http://geocoder.readthedocs.io/))\n", "make it easy to locate the coordinates of addresses, cities, countries, and landmarks\n", "across the globe using web services (\"geocoders\"). In practice, geocoders are often\n", "Application Programming Interfaces (APIs) where you can send requests, and receive responses in the form of place names, addresses and coordinates.\n", "\n", "Geopy offers access to several geocoding services. Check the geopy documentation for more details about how to use each service via Python.\n", "\n", "Geopandas has a function called `geocode()` that can geocode a list of addresses (strings) and return a GeoDataFrame containing the resulting point objects in ``geometry`` column. \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Geocoding addresses\n", "\n", "Let's try this out.\n", "\n", "We will geocode addresses stored in a text file called `addresses.txt`. These addresses are located in the Helsinki Region in Southern Finland.\n", "\n", "The first rows of the data look like this:\n", "\n", "```\n", "id;addr\n", "1000;Itämerenkatu 14, 00101 Helsinki, Finland\n", "1001;Kampinkuja 1, 00100 Helsinki, Finland\n", "1002;Kaivokatu 8, 00101 Helsinki, Finland\n", "1003;Hermannin rantatie 1, 00580 Helsinki, Finland\n", "```\n", "\n", "We have an `id` for each row and an address on column `addr`.\n", "\n", "Let's first read the data into a Pandas DataFrame using the `read_csv()` -function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": true, "editable": true }, "outputs": [], "source": [ "# Import necessary modules\n", "import pandas as pd\n", "import geopandas as gpd\n", "from shapely.geometry import Point\n", "\n", "# Filepath\n", "fp = r\"data/addresses.txt\"\n", "\n", "# Read the data\n", "data = pd.read_csv(fp, sep=\";\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check that we imported the file correctly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Now we have our data in a pandas DataFrame and we can geocode our addresses using the [geopandas geocoding function](http://geopandas.org/reference/geopandas.tools.geocode.html#geopandas-tools-geocode) that uses `geopy` package in the background. \n", "\n", "- Let's import the geocoding function and geocode the addresses (column `addr`) using Nominatim. \n", "- Remember to provide a custom string (name of your application) in the `user_agent` parameter.\n", "- If needed, you can add the `timeout`-parameter which specifies how many seconds we will wait for a response from the service." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "# Import the geocoding tool\n", "from geopandas.tools import geocode\n", "\n", "# Geocode addresses using Nominatim. Remember to provide a custom \"application name\" in the user_agent parameter!\n", "geo = geocode(data[\"addr\"], provider=\"nominatim\", user_agent=\"autogis_xx\", timeout=4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "geo.head()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "And Voilà! As a result we have a GeoDataFrame that contains our original\n", "address and a 'geometry' column containing Shapely Point -objects that\n", "we can use for exporting the addresses to a Shapefile for example.\n", "However, the ``id`` column is not there. Thus, we need to join the\n", "information from ``data`` into our new GeoDataFrame ``geo``, thus making\n", "a **Table Join**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "**Rate-limiting**\n", "\n", "When geocoding a large dataframe, you might encounter an error when geocoding. In case you get a time out error, try first using the `timeout` parameter as we did above (allow the service a bit more time to respond). In case of Too Many Requests error, you have hit the rate-limit of the service, and you should slow down your requests. To our convenience, geopy provides additional tools for taking into account rate limits in geocoding services. This script adapts the usage of [geopy RateLimiter](https://geopy.readthedocs.io/en/stable/#geopy.extra.rate_limiter.RateLimiter) to our input data:\n", "\n", "```\n", "from geopy.geocoders import Nominatim\n", "from geopy.extra.rate_limiter import RateLimiter\n", "from shapely.geometry import Point\n", "\n", "# Initiate geocoder\n", "geolocator = Nominatim(user_agent='autogis_xx')\n", "\n", "# Create a geopy rate limiter:\n", "geocode_with_delay = RateLimiter(geolocator.geocode, min_delay_seconds=1)\n", "\n", "# Apply the geocoder with delay using the rate limiter:\n", "data['temp'] = data['addr'].apply(geocode_with_delay)\n", "\n", "# Get point coordinates from the GeoPy location object on each row:\n", "data[\"coords\"] = data['temp'].apply(lambda loc: tuple(loc.point) if loc else None)\n", "\n", "# Create shapely point objects to geometry column:\n", "data[\"geometry\"] = data[\"coords\"].apply(Point)\n", "```\n", "All in all, remember that Nominatim is not meant for super heavy use. \n", "
\n", "\n", "**Joining the geocoding result with the original DataFrame** \n", "\n", "However, sometimes it is useful to join two tables together based on the **index** of those DataFrames. In such case, we assume\n", "that there is **same number of records** in our DataFrames and that the **order of the records should be the same** in both DataFrames.\n", "\n", "We can use this approach to join information from the original data to our geocoded addresses row-by-row \n", "``join()`` -function which merges the two DataFrames together\n", "based on index by default. This approach works correctly because the order of the geocoded addresses in ``geo`` DataFrame is the same as in our original ``data`` DataFrame.\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "join = geo.join(data)\n", "join.head()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Let's also check the data type of our new ``join`` table." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "type(join)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "As a result we have a new GeoDataFrame called ``join`` where we now have\n", "all original columns plus a new column for ``geometry``. **Note!** If you would do the join the other way around, i.e. `data.join(geo)`, the output would be a pandas DataFrame, not a GeoDataFrame!\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now it is easy to save our address points into a Shapefile" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "deletable": true, "editable": true }, "outputs": [], "source": [ "# Output file path\n", "outfp = r\"data/addresses.shp\"\n", "\n", "# Save to Shapefile\n", "join.to_file(outfp)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "That's it. Now we have successfully geocoded those addresses into Points\n", "and made a Shapefile out of them. Easy isn't it!" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Nominatim works relatively nicely if you have well defined and well-known addresses such as the ones that we used in this tutorial. In practice, the address needs to exist in the OpenStreetMap database. Sometimes, however, you might want to geocode a \"point-of-interest\", such as a museum, only based on it's name. If the museum name is not on OpenStreetMap, Nominatim won't provide any results for it, but you might be able to geocode the place using some other geocoder." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 4 }