{ "cells": [ { "cell_type": "markdown", "id": "060b7221", "metadata": {}, "source": [ "# Geo-Spatial processing of GIS shapefiles with Amazon Athena\n", "\n", "Processing geo-spatial data often requires to assign spatial data points to regions or zones, such as zip codes, urban neighborhoods, operational districts, etc.\n", "The Athena Engine 2.0 no supports spatial queries.\n", "\n", "We demonstrate using Athena to join regions or zones to geo-spatial datapoints. The regons tabels are created from ArcGIS shape files." ] }, { "cell_type": "markdown", "id": "401aca64", "metadata": {}, "source": [ "## Data Sources\n", "\n", "Download data from these public data sources: the first set includes geo-spatial locations with longitude and latitude, the remaining three sets are shape files that define areas on the map. We use those to map locations to areas, such as neighborhoods, designated service areas, and zip codes.\n", "\n", "|Raw data | Description | Source |\n", "|---------|-------------|--------|\n", "| dockless-vehicles-3.csv | A public dataset of dock-less vehicle rentals, provided by the Office of Civic Innovation and Technology (https://louisvilleky.gov/government/civic-innovation-and-technology/civic-innovation) of the Louisville (KY) Metro Government. The table includes trip data with longitude and latitude of the start and end locations | https://data.louisvilleky.gov/dataset/dockless-vehicles |\n", "| Dockless_Vehicle_Service_Area.* | Shape files of service areas | https://data.louisvilleky.gov/dataset/dockless-vehicles |\n", "| cb_2018_us_zcta510_500k.* | [ZIP Code Tabulation Areas](https://www.census.gov/programs-surveys/geography/guidance/geo-areas/zctas.html) (ZCTAs) are generalized areal representations of United States Postal Service (USPS) ZIP Code service areas. The USPS ZIP Codes identify the individual post office or metropolitan area delivery station associated with mailing addresses. USPS ZIP Codes are not areal features but a collection of mail delivery routes. | https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip |\n", "| Louisville_KY_Urban_Neighborhoods | The Louisville Neighborhoods layer consists of polygons representing approximate boundaries and extents of historical and cultural neighborhoods within the pre-merger limits of the City of Louisville, or post-merger Urban Service District. | https://data.lojic.org/datasets/louisville-ky-urban-neighborhoods| " ] }, { "cell_type": "markdown", "id": "893450fc", "metadata": {}, "source": [ "## Setup: Load packages, define functions\n", "\n", "In addition to [Pandas](https://pandas.pydata.org) and [NumPy](https://numpy.org) you need to install the following package:\n", "1. [s3fs](https://github.com/dask/s3fs/) is required by Pandas. It supports Amazon S3 URIs as file names in the Pandas methords `read_*` and `to_*`. The package loads indirectly with Pandas.\n", "2. [PyAthena](https://github.com/laughingman7743/PyAthena/) provides the database connection to Amazon Athena\n", "3. [Python Shapefile Library](https://github.com/GeospatialPython/pyshp) reads and writes ESRI Shapefiles.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "6c5ff1db", "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import datetime\n", "import pandas as pd\n", "import numpy as np\n", "import s3fs # required to read and write tables on Amazon S3. Not called, but importing it ensures that the package is available\n", "import pyathena # connection to Amazon Athena\n", "import shapefile # read and write ESRI Shapefiles\n", "\n", "S3_BUCKET = ''\n", "S3_TABLE_PREFIX = ''\n", "S3_ATHENA_QUERY = f's3://{S3_BUCKET}/{S3_TABLE_PREFIX}/query_output'\n", "ATHENA_SCHEMA_NAME = ''\n", "ATHENA_WORK_GROUP = ''\n", "REGION = ''\n", "\n", "def get_athena_connecttion():\n", " '''\n", " Open database connection to Amazon Athena\n", " \n", " Returns:\n", " connection object for Pandas\n", " '''\n", " return pyathena.connect(s3_staging_dir=S3_ATHENA_QUERY, schema_name=ATHENA_SCHEMA_NAME, region_name=REGION, work_group=ATHENA_WORK_GROUP)\n", " \n", " \n", "def save_as_external_athena_table(conn, df, bucket, dirkey, dbname, tblname, overwrite=False):\n", " '''\n", " Save Pandas DataFrame as external Athena table. The function supports the following column types: timestamp, int, double, and string.\n", " The column names will be converted to lower case. The names must not include any special characters, exctept '_'.\n", " \n", " Parametes:\n", " conn: PyAthena connection\n", " df: pandas.DataFrame\n", " bucket: S3 bucket\n", " dirkey: S3 prefix to data file on S3\n", " dbname: Name of GLUE database\n", " tblname: Name of GLUE table\n", " '''\n", " assert dirkey[-1] != '/', \"Don't use '/' in S3 key\"\n", " \n", " df.rename(lambda s: s.lower(), axis=1) \\\n", " .to_csv(f\"s3://{bucket}/{dirkey}/table.tsv\", index=None, quoting=None, sep='\\t', header=None, date_format='%Y-%m-%d %H:%M:%S')\n", " \n", " type_renaming = {\n", " 'datetime64[ns]' : 'timestamp',\n", " 'int64' : 'int',\n", " 'float64': 'double',\n", " 'object': 'string',\n", " }\n", "\n", " cols = [ \"`{}` {}\".format(r[1], type_renaming[str(r[2])]) for r in df.dtypes.reset_index().itertuples() ]\n", " cols_string = ',\\n'.join(cols)\n", " \n", " curs = conn.cursor()\n", " if overwrite:\n", " try:\n", " res = curs.execute(f\"\"\"DROP TABLE {dbname}.{tblname}\"\"\")\n", " except pyathena.error.OperationalError as ex:\n", " if 'Table not found' in str(ex):\n", " pass\n", " else:\n", " raise ex\n", " \n", " q = f\"\"\"\n", " CREATE EXTERNAL TABLE IF NOT EXISTS {dbname}.{tblname} (\n", " {cols_string}\n", " )\n", " ROW FORMAT DELIMITED \n", " FIELDS TERMINATED BY '\\\\t' \n", " LINES TERMINATED BY '\\\\n' \n", " LOCATION 's3://{bucket}/{dirkey}'\n", " TBLPROPERTIES ('has_encrypted_data'='false')\n", " \"\"\"\n", " \n", " res = curs.execute(q)\n", " return res\n", "\n", "\n", "def centroid(points):\n", " '''\n", " Computes centroid of a 2-dimensional area https://en.wikipedia.org/wiki/Centroid\n", " \n", " Parameters:\n", " points: numpy array of polynom coordinates\n", " \n", " Returns:\n", " x/y-coordinates of centroid, area of centroid\n", " '''\n", " pts = np.array(points)\n", " assert pts.shape[1] == 2, \"Not 2-dimensional\"\n", " n = pts.shape[0]\n", " x = pts[:,0]\n", " y = pts[:,1]\n", " xyxy = [ x[i] * y[(i+1)%n] - x[(i+1)%n] * y[i] for i in range(n) ]\n", " A = 0.5 * np.sum(xyxy)\n", " Cx = np.sum([ (x[i] + x[(i+1)%n]) * xyxy[i] for i in range(n) ])/(6*A)\n", " Cy = np.sum([ (y[i] + y[(i+1)%n]) * xyxy[i] for i in range(n) ])/(6*A)\n", " return Cx, Cy, np.abs(A)\n", "\n", "\n", "def load_shape_file(shape_file_name):\n", " '''\n", " Function to load \n", " \n", " Parameters:\n", " \n", " Returns:\n", " Pandas DataFrame: the field `shape` holds the polygon defintion of the shape that can be interpreted\n", " by the Athena function `ST_GeometryFromText()`\n", " Other fields include coordinates of the bounding boxes and centroids, and meta data from the shapefile\n", " '''\n", " sf = shapefile.Reader(shape_file_name)\n", " recs = sf.records()\n", " shps = sf.shapes()\n", " fields = sf.fields\n", " \n", " shape_df = pd.DataFrame()\n", " for i, rec in enumerate(recs):\n", " rdata = {}\n", " for fld, dt, le, ze in sf.fields[1:]:\n", " rdata[fld] = [ rec[fld] ]\n", " bbox = shps[i].bbox\n", " for j, b in enumerate(['bb_west', 'bb_south', 'bb_east', 'bb_north']):\n", " rdata[b] = [ bbox[j] ]\n", " rdata['shape'] = [ \"\"\"POLYGON(({poly}))\"\"\" \\\n", " .format(poly = ','.join([\"{lo} {la}\".format(lo=p[0], la=p[1]) \n", " for p in shps[i].points ]))\n", " ]\n", " cog_long, cog_lat, _ = centroid(shps[i].points)\n", " rdata['cog_longitude'] = cog_long\n", " rdata['cog_latitude'] = cog_lat\n", "\n", " tmp_df = pd.DataFrame(rdata)\n", " shape_df = pd.concat([shape_df, tmp_df])\n", " \n", " shape_df.index = range(shape_df.shape[0])\n", " return shape_df\n" ] }, { "cell_type": "markdown", "id": "c4e547f3", "metadata": {}, "source": [ "## Process Shape Files\n", "Download the shape files. Each data set consists of four files with the same name and different extension. Just provide the name of the `.shp` file to load. The other files must be in the same directory.\n", " \n", "- load shapefiles into Pandas DataFrames with `load_shape_file()`, then\n", "- save on S3 and create external table in Athena with `save_as_external_athena_table()`" ] }, { "cell_type": "markdown", "id": "b509faf1", "metadata": {}, "source": [ "### Zip Codes" ] }, { "cell_type": "code", "execution_count": 40, "id": "33ab824c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of records: 33,144\n", "CPU times: user 5min 22s, sys: 10.2 s, total: 5min 32s\n", "Wall time: 5min 48s\n" ] } ], "source": [ "%%time\n", "zip_code_df = load_shape_file('./data/cb_2018_us_zcta510_500k.shp')\n", "print(f\"Number of records: {zip_code_df.shape[0]:,}\")" ] }, { "cell_type": "code", "execution_count": 41, "id": "8c8a9945", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ZCTA5CE10AFFGEOID10GEOID10ALAND10AWATER10bb_westbb_southbb_eastbb_northshapecog_longitudecog_latitude
0360838600000US36083360836597506625522919-85.89778732.233444-85.44988532.513303POLYGON((-85.63224699999999 32.280982,-85.6243...-85.69039332.387282
1354418600000US35441354411728504298749105-87.87050232.756722-87.62421432.940973POLYGON((-87.83287399999999 32.844372,-87.8318...-87.73861032.853736
2350518600000US35051350512802364565427285-86.74507633.099176-86.48150433.337380POLYGON((-86.743844 33.250019,-86.738019 33.25...-86.61971333.205467
3351218600000US35121351213727360305349303-86.58526633.829742-86.31422634.071217POLYGON((-86.58526599999999 33.94743,-86.58032...-86.45527233.941246
4350588600000US35058350581780399223109259-86.88833034.171265-86.63605134.306979POLYGON((-86.87884199999999 34.211959,-86.8764...-86.73787734.231338
\n", "
" ], "text/plain": [ " ZCTA5CE10 AFFGEOID10 GEOID10 ALAND10 AWATER10 bb_west \\\n", "0 36083 8600000US36083 36083 659750662 5522919 -85.897787 \n", "1 35441 8600000US35441 35441 172850429 8749105 -87.870502 \n", "2 35051 8600000US35051 35051 280236456 5427285 -86.745076 \n", "3 35121 8600000US35121 35121 372736030 5349303 -86.585266 \n", "4 35058 8600000US35058 35058 178039922 3109259 -86.888330 \n", "\n", " bb_south bb_east bb_north \\\n", "0 32.233444 -85.449885 32.513303 \n", "1 32.756722 -87.624214 32.940973 \n", "2 33.099176 -86.481504 33.337380 \n", "3 33.829742 -86.314226 34.071217 \n", "4 34.171265 -86.636051 34.306979 \n", "\n", " shape cog_longitude \\\n", "0 POLYGON((-85.63224699999999 32.280982,-85.6243... -85.690393 \n", "1 POLYGON((-87.83287399999999 32.844372,-87.8318... -87.738610 \n", "2 POLYGON((-86.743844 33.250019,-86.738019 33.25... -86.619713 \n", "3 POLYGON((-86.58526599999999 33.94743,-86.58032... -86.455272 \n", "4 POLYGON((-86.87884199999999 34.211959,-86.8764... -86.737877 \n", "\n", " cog_latitude \n", "0 32.387282 \n", "1 32.853736 \n", "2 33.205467 \n", "3 33.941246 \n", "4 34.231338 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(zip_code_df.head())" ] }, { "cell_type": "code", "execution_count": 42, "id": "19c591de", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.93 s, sys: 95 ms, total: 4.03 s\n", "Wall time: 9.21 s\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "if not 'conn' in globals():\n", " print(\"Connecting to Athena\")\n", " conn = get_athena_connecttion()\n", "\n", "save_as_external_athena_table(conn, zip_code_df,\n", " S3_BUCKET, f\"{S3_TABLE_PREFIX}/zipcode_shapes\",\n", " ATHENA_SCHEMA_NAME, 'zipcode_shapes',\n", " overwrite=True)" ] }, { "cell_type": "code", "execution_count": 43, "id": "1513547a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
zcta5ce10affgeoid10geoid10aland10awater10bb_westbb_southbb_eastbb_northshapecog_longitudecog_latitude
0991588600000US99158991582354451580-117.38619747.027920-117.12391947.259760POLYGON((-117.386197 47.087768,-117.383378 47....-117.24735847.130323
1032158600000US0321503215124336441177915-71.57125243.892722-71.36196543.999343POLYGON((-71.569045 43.901404,-71.569515 43.90...-71.45282343.945304
2295478600000US2954729547103218389215692-79.41115434.399488-79.23560434.576940POLYGON((-79.411154 34.57694,-79.4053249999999...-79.33338434.487992
3177288600000US1772817728122976765691784-77.20915341.274431-76.94535941.400996POLYGON((-77.209153 41.290804,-77.206322 41.29...-77.08533241.327618
4160408600000US1604016040508109488983-79.89653241.046170-79.77083041.136607POLYGON((-79.896532 41.100111,-79.881548999999...-79.84301041.092546
\n", "
" ], "text/plain": [ " zcta5ce10 affgeoid10 geoid10 aland10 awater10 bb_west \\\n", "0 99158 8600000US99158 99158 235445158 0 -117.386197 \n", "1 03215 8600000US03215 03215 124336441 177915 -71.571252 \n", "2 29547 8600000US29547 29547 103218389 215692 -79.411154 \n", "3 17728 8600000US17728 17728 122976765 691784 -77.209153 \n", "4 16040 8600000US16040 16040 50810948 8983 -79.896532 \n", "\n", " bb_south bb_east bb_north \\\n", "0 47.027920 -117.123919 47.259760 \n", "1 43.892722 -71.361965 43.999343 \n", "2 34.399488 -79.235604 34.576940 \n", "3 41.274431 -76.945359 41.400996 \n", "4 41.046170 -79.770830 41.136607 \n", "\n", " shape cog_longitude \\\n", "0 POLYGON((-117.386197 47.087768,-117.383378 47.... -117.247358 \n", "1 POLYGON((-71.569045 43.901404,-71.569515 43.90... -71.452823 \n", "2 POLYGON((-79.411154 34.57694,-79.4053249999999... -79.333384 \n", "3 POLYGON((-77.209153 41.290804,-77.206322 41.29... -77.085332 \n", "4 POLYGON((-79.896532 41.100111,-79.881548999999... -79.843010 \n", "\n", " cog_latitude \n", "0 47.130323 \n", "1 43.945304 \n", "2 34.487992 \n", "3 41.327618 \n", "4 41.092546 " ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_sql(\"\"\"SELECT * FROM blog1.zipcode_shapes LIMIT 5\"\"\", conn)" ] }, { "cell_type": "code", "execution_count": null, "id": "3d9742c0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "1ab74943", "metadata": {}, "source": [ "### Loisville/KY Neighborhoods" ] }, { "cell_type": "code", "execution_count": 13, "id": "c4649116", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of records: 92\n", "CPU times: user 617 ms, sys: 6.32 ms, total: 624 ms\n", "Wall time: 626 ms\n" ] } ], "source": [ "%%time\n", "neighnorhood_df = load_shape_file('./data/Louisville_KY_Urban_Neighborhoods.shp')\n", "print(f\"Number of records: {neighnorhood_df.shape[0]:,}\")" ] }, { "cell_type": "code", "execution_count": 14, "id": "4aad9679", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OBJECTIDNH_CODENH_NAMESHAPEAREASHAPELENbb_westbb_southbb_eastbb_northshapecog_longitudecog_latitude
0199REMAINDER OF CITY2.558333e+0723718.323690-85.72344038.263231-85.69324238.281569POLYGON((-85.70281560232544 38.270463184337494...-85.70871238.272222
1253PORTLAND7.009830e+0743851.334658-85.82022238.256329-85.76375338.280011POLYGON((-85.8202215407467 38.27686470289577,-...-85.79299838.268303
2362SHAWNEE5.995385e+0734088.057055-85.83328238.249823-85.80431238.276865POLYGON((-85.8202215407467 38.27686470289577,-...-85.81867238.261047
3412BROWNSBORO ZORN2.197732e+0728060.141437-85.70281638.259572-85.66838238.275999POLYGON((-85.70281560232544 38.270463184337494...-85.68863938.266630
4523CLIFTON HEIGHTS1.785864e+0719464.933453-85.71687338.258293-85.69261338.270738POLYGON((-85.71500895019209 38.264042197247306...-85.70402538.263401
\n", "
" ], "text/plain": [ " OBJECTID NH_CODE NH_NAME SHAPEAREA SHAPELEN \\\n", "0 1 99 REMAINDER OF CITY 2.558333e+07 23718.323690 \n", "1 2 53 PORTLAND 7.009830e+07 43851.334658 \n", "2 3 62 SHAWNEE 5.995385e+07 34088.057055 \n", "3 4 12 BROWNSBORO ZORN 2.197732e+07 28060.141437 \n", "4 5 23 CLIFTON HEIGHTS 1.785864e+07 19464.933453 \n", "\n", " bb_west bb_south bb_east bb_north \\\n", "0 -85.723440 38.263231 -85.693242 38.281569 \n", "1 -85.820222 38.256329 -85.763753 38.280011 \n", "2 -85.833282 38.249823 -85.804312 38.276865 \n", "3 -85.702816 38.259572 -85.668382 38.275999 \n", "4 -85.716873 38.258293 -85.692613 38.270738 \n", "\n", " shape cog_longitude \\\n", "0 POLYGON((-85.70281560232544 38.270463184337494... -85.708712 \n", "1 POLYGON((-85.8202215407467 38.27686470289577,-... -85.792998 \n", "2 POLYGON((-85.8202215407467 38.27686470289577,-... -85.818672 \n", "3 POLYGON((-85.70281560232544 38.270463184337494... -85.688639 \n", "4 POLYGON((-85.71500895019209 38.264042197247306... -85.704025 \n", "\n", " cog_latitude \n", "0 38.272222 \n", "1 38.268303 \n", "2 38.261047 \n", "3 38.266630 \n", "4 38.263401 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(neighnorhood_df.head())" ] }, { "cell_type": "code", "execution_count": 15, "id": "91946106", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "if not 'conn' in globals():\n", " print(\"Connecting to Athena\")\n", " conn = get_athena_connecttion()\n", "\n", "save_as_external_athena_table(conn, neighnorhood_df,\n", " S3_BUCKET, f\"{S3_TABLE_PREFIX}/loisville_ky_neighborhoods\",\n", " ATHENA_SCHEMA_NAME, 'loisville_ky_neighborhoods',\n", " overwrite=True)" ] }, { "cell_type": "code", "execution_count": 16, "id": "57ec1e08", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
objectidnh_codenh_nameshapeareashapelenbb_westbb_southbb_eastbb_northshapecog_longitudecog_latitude
0199REMAINDER OF CITY2.558333e+0723718.323690-85.72344038.263231-85.69324238.281569POLYGON((-85.70281560232544 38.270463184337494...-85.70871238.272222
1253PORTLAND7.009830e+0743851.334658-85.82022238.256329-85.76375338.280011POLYGON((-85.8202215407467 38.27686470289577,-...-85.79299838.268303
2362SHAWNEE5.995385e+0734088.057055-85.83328238.249823-85.80431238.276865POLYGON((-85.8202215407467 38.27686470289577,-...-85.81867238.261047
3412BROWNSBORO ZORN2.197732e+0728060.141437-85.70281638.259572-85.66838238.275999POLYGON((-85.70281560232544 38.270463184337494...-85.68863938.266630
4523CLIFTON HEIGHTS1.785864e+0719464.933453-85.71687338.258293-85.69261338.270738POLYGON((-85.71500895019209 38.264042197247306...-85.70402538.263401
\n", "
" ], "text/plain": [ " objectid nh_code nh_name shapearea shapelen \\\n", "0 1 99 REMAINDER OF CITY 2.558333e+07 23718.323690 \n", "1 2 53 PORTLAND 7.009830e+07 43851.334658 \n", "2 3 62 SHAWNEE 5.995385e+07 34088.057055 \n", "3 4 12 BROWNSBORO ZORN 2.197732e+07 28060.141437 \n", "4 5 23 CLIFTON HEIGHTS 1.785864e+07 19464.933453 \n", "\n", " bb_west bb_south bb_east bb_north \\\n", "0 -85.723440 38.263231 -85.693242 38.281569 \n", "1 -85.820222 38.256329 -85.763753 38.280011 \n", "2 -85.833282 38.249823 -85.804312 38.276865 \n", "3 -85.702816 38.259572 -85.668382 38.275999 \n", "4 -85.716873 38.258293 -85.692613 38.270738 \n", "\n", " shape cog_longitude \\\n", "0 POLYGON((-85.70281560232544 38.270463184337494... -85.708712 \n", "1 POLYGON((-85.8202215407467 38.27686470289577,-... -85.792998 \n", "2 POLYGON((-85.8202215407467 38.27686470289577,-... -85.818672 \n", "3 POLYGON((-85.70281560232544 38.270463184337494... -85.688639 \n", "4 POLYGON((-85.71500895019209 38.264042197247306... -85.704025 \n", "\n", " cog_latitude \n", "0 38.272222 \n", "1 38.268303 \n", "2 38.261047 \n", "3 38.266630 \n", "4 38.263401 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_sql(\"\"\"SELECT * FROM blog1.loisville_ky_neighborhoods LIMIT 5\"\"\", conn)" ] }, { "cell_type": "code", "execution_count": null, "id": "5e42713c", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "5e3ece3d", "metadata": {}, "source": [ "### Dockless Service Zones" ] }, { "cell_type": "code", "execution_count": 19, "id": "c15b8d8b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of records: 9\n", "CPU times: user 63.8 ms, sys: 3.91 ms, total: 67.7 ms\n", "Wall time: 66 ms\n" ] } ], "source": [ "%%time\n", "dockless_zone_df = load_shape_file('./data/Dockless_Vehicle_Distribution_Zones.shp')\n", "print(f\"Number of records: {dockless_zone_df.shape[0]:,}\")" ] }, { "cell_type": "code", "execution_count": 20, "id": "dbe9260b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Dist_ZonePl2040Areabb_westbb_southbb_eastbb_northshapecog_longitudecog_latitude
01Northwest Core1.185772e+06275565.3537501.205373e+06287823.380152POLYGON((1193691.6109733582 287815.0763631761,...1.195295e+06281342.995387
19West Core1.184528e+06263536.7662501.205891e+06277443.735876POLYGON((1204659.266196905 274362.96903444396,...1.195880e+06271588.566397
22Downtown1.204477e+06271532.2737501.218294e+06280954.288272POLYGON((1205218.9462500215 280123.6099999994,...1.210482e+06276616.567796
36University1.203515e+06254582.9693791.215279e+06273211.377500POLYGON((1212727.448750025 270748.27874999365,...1.208898e+06263606.969458
43Northeast Core1.212013e+06274241.3000001.229596e+06287709.093407POLYGON((1223601.7970568056 287709.09340739076...1.221822e+06280371.766150
55Southeast Core1.210765e+06254840.4974091.239122e+06276544.758750POLYGON((1220489.8675000072 276140.2512500117,...1.221418e+06265314.126921
64East Core1.220823e+06266079.2912501.247259e+06292598.833856POLYGON((1231939.7309992649 292598.8338559895,...1.235454e+06278947.829163
78Southwest Core1.186221e+06253446.7976591.207267e+06268155.372131POLYGON((1198503.793750003 266260.02249999344,...1.195788e+06260001.481260
87Iroquois Park1.186316e+06239861.2811951.208779e+06254935.596250POLYGON((1206324.8737500012 244494.5175000131,...1.198511e+06247641.294413
\n", "
" ], "text/plain": [ " Dist_Zone Pl2040Area bb_west bb_south bb_east \\\n", "0 1 Northwest Core 1.185772e+06 275565.353750 1.205373e+06 \n", "1 9 West Core 1.184528e+06 263536.766250 1.205891e+06 \n", "2 2 Downtown 1.204477e+06 271532.273750 1.218294e+06 \n", "3 6 University 1.203515e+06 254582.969379 1.215279e+06 \n", "4 3 Northeast Core 1.212013e+06 274241.300000 1.229596e+06 \n", "5 5 Southeast Core 1.210765e+06 254840.497409 1.239122e+06 \n", "6 4 East Core 1.220823e+06 266079.291250 1.247259e+06 \n", "7 8 Southwest Core 1.186221e+06 253446.797659 1.207267e+06 \n", "8 7 Iroquois Park 1.186316e+06 239861.281195 1.208779e+06 \n", "\n", " bb_north shape \\\n", "0 287823.380152 POLYGON((1193691.6109733582 287815.0763631761,... \n", "1 277443.735876 POLYGON((1204659.266196905 274362.96903444396,... \n", "2 280954.288272 POLYGON((1205218.9462500215 280123.6099999994,... \n", "3 273211.377500 POLYGON((1212727.448750025 270748.27874999365,... \n", "4 287709.093407 POLYGON((1223601.7970568056 287709.09340739076... \n", "5 276544.758750 POLYGON((1220489.8675000072 276140.2512500117,... \n", "6 292598.833856 POLYGON((1231939.7309992649 292598.8338559895,... \n", "7 268155.372131 POLYGON((1198503.793750003 266260.02249999344,... \n", "8 254935.596250 POLYGON((1206324.8737500012 244494.5175000131,... \n", "\n", " cog_longitude cog_latitude \n", "0 1.195295e+06 281342.995387 \n", "1 1.195880e+06 271588.566397 \n", "2 1.210482e+06 276616.567796 \n", "3 1.208898e+06 263606.969458 \n", "4 1.221822e+06 280371.766150 \n", "5 1.221418e+06 265314.126921 \n", "6 1.235454e+06 278947.829163 \n", "7 1.195788e+06 260001.481260 \n", "8 1.198511e+06 247641.294413 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(dockless_zone_df)" ] }, { "cell_type": "code", "execution_count": 22, "id": "8d4ec435", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "if not 'conn' in globals():\n", " print(\"Connecting to Athena\")\n", " conn = get_athena_connecttion()\n", "\n", "save_as_external_athena_table(conn, dockless_zone_df,\n", " S3_BUCKET, f\"{S3_TABLE_PREFIX}/distribution_zones\",\n", " ATHENA_SCHEMA_NAME, 'distribution_zones',\n", " overwrite=False)" ] }, { "cell_type": "code", "execution_count": 23, "id": "1d930026", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dist_zonepl2040areabb_westbb_southbb_eastbb_northshapecog_longitudecog_latitude
01Northwest Core1.185772e+06275565.3537501.205373e+06287823.380152POLYGON((1193691.6109733582 287815.0763631761,...1.195295e+06281342.995387
19West Core1.184528e+06263536.7662501.205891e+06277443.735876POLYGON((1204659.266196905 274362.96903444396,...1.195880e+06271588.566397
22Downtown1.204477e+06271532.2737501.218294e+06280954.288272POLYGON((1205218.9462500215 280123.6099999994,...1.210482e+06276616.567796
36University1.203515e+06254582.9693791.215279e+06273211.377500POLYGON((1212727.448750025 270748.27874999365,...1.208898e+06263606.969458
43Northeast Core1.212013e+06274241.3000001.229596e+06287709.093407POLYGON((1223601.7970568056 287709.09340739076...1.221822e+06280371.766150
\n", "
" ], "text/plain": [ " dist_zone pl2040area bb_west bb_south bb_east \\\n", "0 1 Northwest Core 1.185772e+06 275565.353750 1.205373e+06 \n", "1 9 West Core 1.184528e+06 263536.766250 1.205891e+06 \n", "2 2 Downtown 1.204477e+06 271532.273750 1.218294e+06 \n", "3 6 University 1.203515e+06 254582.969379 1.215279e+06 \n", "4 3 Northeast Core 1.212013e+06 274241.300000 1.229596e+06 \n", "\n", " bb_north shape \\\n", "0 287823.380152 POLYGON((1193691.6109733582 287815.0763631761,... \n", "1 277443.735876 POLYGON((1204659.266196905 274362.96903444396,... \n", "2 280954.288272 POLYGON((1205218.9462500215 280123.6099999994,... \n", "3 273211.377500 POLYGON((1212727.448750025 270748.27874999365,... \n", "4 287709.093407 POLYGON((1223601.7970568056 287709.09340739076... \n", "\n", " cog_longitude cog_latitude \n", "0 1.195295e+06 281342.995387 \n", "1 1.195880e+06 271588.566397 \n", "2 1.210482e+06 276616.567796 \n", "3 1.208898e+06 263606.969458 \n", "4 1.221822e+06 280371.766150 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_sql(\"\"\"SELECT * FROM blog1.distribution_zones LIMIT 5\"\"\", conn)" ] }, { "cell_type": "code", "execution_count": null, "id": "f28844eb", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "ceee4c86", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "426e19d6", "metadata": {}, "source": [ "## Load trip data\n", "\n", "Upload the trip data" ] }, { "cell_type": "code", "execution_count": 7, "id": "9d2ed2da", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of records: 505,993\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TripIDStartDateStartTimeEndDateEndTimeTripDurationTripDistanceStartLatitudeStartLongitudeEndLatitudeEndLongitudeDayOfWeekHourNum
00000045c-2677-3a7d-4b73-cad99a572019-06-2619:302019-06-2619:303.00.00038.253-85.75638.253-85.755419
10000487b-92e6-50d6-7569-42ed38182019-09-2214:302019-09-2214:305.00.00038.203-85.75238.204-85.751114
200006088-2579-e0d0-6a30-a15bb8782019-08-2117:302019-08-2117:306.00.33038.259-85.73338.265-85.739417
300008c1a-899b-8596-970f-9f6bf4952019-07-0311:002019-07-0311:156.00.64038.217-85.75738.221-85.763411
400009301-3225-2aea-a84a-165a480a2019-11-2210:452019-11-2211:007.00.59938.215-85.75938.222-85.764610
\n", "
" ], "text/plain": [ " TripID StartDate StartTime EndDate EndTime \\\n", "0 0000045c-2677-3a7d-4b73-cad99a57 2019-06-26 19:30 2019-06-26 19:30 \n", "1 0000487b-92e6-50d6-7569-42ed3818 2019-09-22 14:30 2019-09-22 14:30 \n", "2 00006088-2579-e0d0-6a30-a15bb878 2019-08-21 17:30 2019-08-21 17:30 \n", "3 00008c1a-899b-8596-970f-9f6bf495 2019-07-03 11:00 2019-07-03 11:15 \n", "4 00009301-3225-2aea-a84a-165a480a 2019-11-22 10:45 2019-11-22 11:00 \n", "\n", " TripDuration TripDistance StartLatitude StartLongitude EndLatitude \\\n", "0 3.0 0.000 38.253 -85.756 38.253 \n", "1 5.0 0.000 38.203 -85.752 38.204 \n", "2 6.0 0.330 38.259 -85.733 38.265 \n", "3 6.0 0.640 38.217 -85.757 38.221 \n", "4 7.0 0.599 38.215 -85.759 38.222 \n", "\n", " EndLongitude DayOfWeek HourNum \n", "0 -85.755 4 19 \n", "1 -85.751 1 14 \n", "2 -85.739 4 17 \n", "3 -85.763 4 11 \n", "4 -85.764 6 10 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trips = pd.read_csv('data/dockless-vehicles-3.csv')\n", "print(f\"Number of records: {trips.shape[0]:,}\")\n", "trips.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "dae39095", "metadata": {}, "outputs": [], "source": [ "if not 'conn' in globals():\n", " print(\"Connecting to Athena\")\n", " conn = get_athena_connecttion()\n", "\n", "save_as_external_athena_table(conn, dockless_zone_df,\n", " S3_BUCKET, f\"{S3_TABLE_PREFIX}/dockless_vehicles\",\n", " ATHENA_SCHEMA_NAME, 'dockless_vehicles',\n", " overwrite=False)" ] }, { "cell_type": "markdown", "id": "09157583", "metadata": {}, "source": [ "## Athena Query\n", "\n", "Now you can try join the neighborhoods to the geo-locations. Learn more about geo-spatial functions in Athena at https://docs.aws.amazon.com/athena/latest/ug/geospatial-functions-list.html" ] }, { "cell_type": "code", "execution_count": 9, "id": "27feebc5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of records: 100\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
startdatestarttimeendtimetripdurationtripdistancestart_nbidstart_neighborhoodend_nbidend_neighborhood
02019-10-1810:4511:003.00.1037BELKNAP7BELKNAP
12019-07-2809:3009:454.00.66037HIGHLANDS DOUGLASS0
22019-07-2112:3012:4514.00.71017CENTRAL BUSINESS DISTRICT17CENTRAL BUSINESS DISTRICT
32019-06-1806:4506:455.00.04071UNIVERSITY71UNIVERSITY
42019-08-1011:0011:1510.00.00017CENTRAL BUSINESS DISTRICT17CENTRAL BUSINESS DISTRICT
\n", "
" ], "text/plain": [ " startdate starttime endtime tripduration tripdistance start_nbid \\\n", "0 2019-10-18 10:45 11:00 3.0 0.103 7 \n", "1 2019-07-28 09:30 09:45 4.0 0.660 37 \n", "2 2019-07-21 12:30 12:45 14.0 0.710 17 \n", "3 2019-06-18 06:45 06:45 5.0 0.040 71 \n", "4 2019-08-10 11:00 11:15 10.0 0.000 17 \n", "\n", " start_neighborhood end_nbid end_neighborhood \n", "0 BELKNAP 7 BELKNAP \n", "1 HIGHLANDS DOUGLASS 0 \n", "2 CENTRAL BUSINESS DISTRICT 17 CENTRAL BUSINESS DISTRICT \n", "3 UNIVERSITY 71 UNIVERSITY \n", "4 CENTRAL BUSINESS DISTRICT 17 CENTRAL BUSINESS DISTRICT " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "if not 'conn' in globals():\n", " print(\"Connecting to Athena\")\n", " conn = get_athena_connecttion()\n", "\n", "\n", "q = \"\"\"\n", "SELECT tr.startdate, tr.starttime, tr.endtime, tr.tripduration, tripdistance \n", " , nb1.nh_code AS start_nbid, nb1.nh_name AS start_neighborhood\n", " , nb2.nh_code AS end_nbid, nb2.nh_name AS end_neighborhood\n", " \n", "FROM \"{database}\".\"dockless_vehicles\" tr\n", "\n", "JOIN \"{database}\".\"loisville_ky_neighborhoods\" nb1\n", " ON ST_Within(ST_POINT(CAST(tr.startlongitude AS DOUBLE), CAST(tr.startlatitude AS DOUBLE)), ST_GeometryFromText(nb1.shape)) \n", "JOIN \"{database}\".\"loisville_ky_neighborhoods\" nb2\n", " ON ST_Within(ST_POINT(CAST(tr.endlongitude AS DOUBLE), CAST(tr.endlatitude AS DOUBLE)), ST_GeometryFromText(nb2.shape))\n", "\n", "LIMIT 100;\n", "\"\"\".format(\n", " database=ATHENA_SCHEMA_NAME\n", ")\n", "\n", "trips_with_nbhd = pd.read_sql(q, con=conn)\n", "print(f\"Number of records: {trips_with_nbhd.shape[0]:,}\")\n", "trips_with_nbhd.head()" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 5 }