{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Predicting NYC Taxi Fares with RAPIDS\n",
"\n",
"[RAPIDS](https://rapids.ai/) is a suite of GPU accelerated data science libraries with APIs that should be familiar to users of Pandas, Dask, and Scikitlearn.\n",
"\n",
"This notebook focuses on showing how to use cuDF with Dask & XGBoost to scale GPU DataFrame ETL-style operations & model training out to multiple GPUs on mutliple nodes using Amazon SageMaker.\n",
"\n",
"We will use the NYC Taxi dataset available available in us-west-2. If you are using this notebook in a different region, make sure you copy the relevant data in a bucket in your region."
]
},
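{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, copying one month of the public dataset into your own bucket might look like the cell below. This is a sketch: `YOUR_BUCKET` is a placeholder, and the exact file name assumes the `yellow_tripdata_YYYY-MM.csv` naming implied by the glob used later in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical example: copy one month of data into your own bucket\n",
"# (replace YOUR_BUCKET with a bucket in the region this notebook runs in)\n",
"!aws s3 cp s3://us-west-2.serverless-analytics/NYC-Pub/yellow/yellow_tripdata_2014-01.csv \\\n",
"    s3://YOUR_BUCKET/NYC-Pub/yellow/yellow_tripdata_2014-01.csv"
]
},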
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"Client\n",
"\n",
" | \n",
"\n",
"Cluster\n",
"\n",
" - Workers: 4
\n",
" - Cores: 4
\n",
" - Memory: 257.79 GB
\n",
" \n",
" | \n",
"
\n",
"
"
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"import numba, xgboost, socket\n",
"import dask, dask_cudf\n",
"from dask.distributed import Client, wait\n",
"from dask_cuda import LocalCUDACluster\n",
"import xgboost as xgb\n",
"\n",
"\n",
"cluster = LocalCUDACluster() # by default use all GPUs in the node.\n",
"client = Client(cluster)\n",
"client"
]
},
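{
"cell_type": "markdown",
"metadata": {},
"source": [
"`LocalCUDACluster` only uses the GPUs in this one node. To scale out to multiple nodes, each node runs a `dask-cuda-worker` process pointed at a shared `dask-scheduler`, and the `Client` connects to that scheduler instead. A minimal sketch (the scheduler address is a placeholder you would replace with your own):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical multi-node setup: start `dask-scheduler` on one node and\n",
"# `dask-cuda-worker tcp://<scheduler-ip>:8786` on every GPU node, then:\n",
"# client = Client('tcp://<scheduler-ip>:8786')"
]
},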
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Inspecting the Data\n",
"\n",
"Now that we have a cluster of GPU workers, we'll use [dask-cudf](https://github.com/rapidsai/dask-cudf/) to load and parse a bunch of CSV files into a distributed DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" vendor_id | \n",
" pickup_datetime | \n",
" dropoff_datetime | \n",
" passenger_count | \n",
" trip_distance | \n",
" pickup_longitude | \n",
" pickup_latitude | \n",
" rate_code | \n",
" store_and_fwd_flag | \n",
" dropoff_longitude | \n",
" dropoff_latitude | \n",
" payment_type | \n",
" fare_amount | \n",
" surcharge | \n",
" mta_tax | \n",
" tip_amount | \n",
" tolls_amount | \n",
" total_amount | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" CMT | \n",
" 2014-01-09 20:45:25 | \n",
" 2014-01-09 20:52:31 | \n",
" 1 | \n",
" 0.7 | \n",
" -73.994770 | \n",
" 40.736828 | \n",
" 1 | \n",
" N | \n",
" -73.982227 | \n",
" 40.731790 | \n",
" CRD | \n",
" 6.5 | \n",
" 0.5 | \n",
" 0.5 | \n",
" 1.40 | \n",
" 0.0 | \n",
" 8.90 | \n",
"
\n",
" \n",
" 1 | \n",
" CMT | \n",
" 2014-01-09 20:46:12 | \n",
" 2014-01-09 20:55:12 | \n",
" 1 | \n",
" 1.4 | \n",
" -73.982392 | \n",
" 40.773382 | \n",
" 1 | \n",
" N | \n",
" -73.960449 | \n",
" 40.763995 | \n",
" CRD | \n",
" 8.5 | \n",
" 0.5 | \n",
" 0.5 | \n",
" 1.90 | \n",
" 0.0 | \n",
" 11.40 | \n",
"
\n",
" \n",
" 2 | \n",
" CMT | \n",
" 2014-01-09 20:44:47 | \n",
" 2014-01-09 20:59:46 | \n",
" 2 | \n",
" 2.3 | \n",
" -73.988570 | \n",
" 40.739406 | \n",
" 1 | \n",
" N | \n",
" -73.986626 | \n",
" 40.765217 | \n",
" CRD | \n",
" 11.5 | \n",
" 0.5 | \n",
" 0.5 | \n",
" 1.50 | \n",
" 0.0 | \n",
" 14.00 | \n",
"
\n",
" \n",
" 3 | \n",
" CMT | \n",
" 2014-01-09 20:44:57 | \n",
" 2014-01-09 20:51:40 | \n",
" 1 | \n",
" 1.7 | \n",
" -73.960213 | \n",
" 40.770464 | \n",
" 1 | \n",
" N | \n",
" -73.979863 | \n",
" 40.777050 | \n",
" CRD | \n",
" 7.5 | \n",
" 0.5 | \n",
" 0.5 | \n",
" 1.70 | \n",
" 0.0 | \n",
" 10.20 | \n",
"
\n",
" \n",
" 4 | \n",
" CMT | \n",
" 2014-01-09 20:47:09 | \n",
" 2014-01-09 20:53:32 | \n",
" 1 | \n",
" 0.9 | \n",
" -73.995371 | \n",
" 40.717248 | \n",
" 1 | \n",
" N | \n",
" -73.984367 | \n",
" 40.720524 | \n",
" CRD | \n",
" 6.0 | \n",
" 0.5 | \n",
" 0.5 | \n",
" 1.75 | \n",
" 0.0 | \n",
" 8.75 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" vendor_id pickup_datetime dropoff_datetime passenger_count \\\n",
"0 CMT 2014-01-09 20:45:25 2014-01-09 20:52:31 1 \n",
"1 CMT 2014-01-09 20:46:12 2014-01-09 20:55:12 1 \n",
"2 CMT 2014-01-09 20:44:47 2014-01-09 20:59:46 2 \n",
"3 CMT 2014-01-09 20:44:57 2014-01-09 20:51:40 1 \n",
"4 CMT 2014-01-09 20:47:09 2014-01-09 20:53:32 1 \n",
"\n",
" trip_distance pickup_longitude pickup_latitude rate_code \\\n",
"0 0.7 -73.994770 40.736828 1 \n",
"1 1.4 -73.982392 40.773382 1 \n",
"2 2.3 -73.988570 40.739406 1 \n",
"3 1.7 -73.960213 40.770464 1 \n",
"4 0.9 -73.995371 40.717248 1 \n",
"\n",
" store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type \\\n",
"0 N -73.982227 40.731790 CRD \n",
"1 N -73.960449 40.763995 CRD \n",
"2 N -73.986626 40.765217 CRD \n",
"3 N -73.979863 40.777050 CRD \n",
"4 N -73.984367 40.720524 CRD \n",
"\n",
" fare_amount surcharge mta_tax tip_amount tolls_amount \\\n",
"0 6.5 0.5 0.5 1.40 0.0 \n",
"1 8.5 0.5 0.5 1.90 0.0 \n",
"2 11.5 0.5 0.5 1.50 0.0 \n",
"3 7.5 0.5 0.5 1.70 0.0 \n",
"4 6.0 0.5 0.5 1.75 0.0 \n",
"\n",
" total_amount \n",
"0 8.90 \n",
"1 11.40 \n",
"2 14.00 \n",
"3 10.20 \n",
"4 8.75 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"base_path = 's3://us-west-2.serverless-analytics/NYC-Pub/yellow/yellow_tripdata_'\n",
"\n",
"df_2014 = dask_cudf.read_csv(base_path+'*2014*.csv')\n",
"df_2014.head().to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have about 164 mio records for the 2014 yellow taxi data."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 5.51 s, sys: 993 ms, total: 6.5 s\n",
"Wall time: 3min 16s\n"
]
},
{
"data": {
"text/plain": [
"165114361"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"len(df_2014)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Cleanup\n",
"\n",
"As usual, the data needs to be massaged a bit before we can start adding features that are useful to an ML model.\n",
"\n",
"For example, in the 2014 taxi CSV files, there are `pickup_datetime` and `dropoff_datetime` columns. The 2015 CSVs have `tpep_pickup_datetime` and `tpep_dropoff_datetime`, which are the same columns. One year has `rate_code`, and another `RateCodeID`.\n",
"\n",
"Also, some CSV files have column names with extraneous spaces in them.\n",
"\n",
"Worst of all, starting in the July 2016 CSVs, pickup & dropoff latitude and longitude data were replaced by location IDs, making the second half of the year useless to us.\n",
"\n",
"We'll do a little string manipulation, column renaming, and concatenating of DataFrames to sidestep the problems."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# list of column names that need to be re-mapped\n",
"remap = {}\n",
"remap['tpep_pickup_datetime'] = 'pickup_datetime'\n",
"remap['tpep_dropoff_datetime'] = 'dropoff_datetime'\n",
"remap['ratecodeid'] = 'rate_code'\n",
"\n",
"#create a list of columns & dtypes the df must have\n",
"must_haves = {\n",
" 'pickup_datetime': 'datetime64[ns]',\n",
" 'dropoff_datetime': 'datetime64[ns]',\n",
" 'passenger_count': 'int32',\n",
" 'trip_distance': 'float32',\n",
" 'pickup_longitude': 'float32',\n",
" 'pickup_latitude': 'float32',\n",
" 'rate_code': 'int32',\n",
" 'dropoff_longitude': 'float32',\n",
" 'dropoff_latitude': 'float32',\n",
" 'fare_amount': 'float32'\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# helper function which takes a DataFrame partition\n",
"def clean(df_part, remap, must_haves): \n",
" # some col-names include pre-pended spaces remove & lowercase column names\n",
" tmp = {col:col.strip().lower() for col in list(df_part.columns)}\n",
" df_part = df_part.rename(tmp)\n",
" \n",
" # rename using the supplied mapping\n",
" df_part = df_part.rename(remap)\n",
" \n",
" # iterate through columns in this df partition\n",
" for col in df_part.columns:\n",
" # drop anything not in our expected list\n",
" if col not in must_haves:\n",
" df_part = df_part.drop(col)\n",
" continue\n",
" \n",
" # fixes datetime error found by Ty Mckercher and fixed by Paul Mahler\n",
" if df_part[col].dtype == 'object' and col in ['pickup_datetime', 'dropoff_datetime']:\n",
" df_part[col] = df_part[col].astype('datetime64[ns]')\n",
" continue\n",
" \n",
" # if column was read as a string, recast as float\n",
" if df_part[col].dtype == 'object':\n",
" df_part[col] = df_part[col].str.fillna('-1')\n",
" df_part[col] = df_part[col].astype('float32')\n",
" else:\n",
" # downcast from 64bit to 32bit types\n",
" # Tesla T4 are faster on 32bit ops\n",
" if 'int' in str(df_part[col].dtype):\n",
" df_part[col] = df_part[col].astype('int32')\n",
" if 'float' in str(df_part[col].dtype):\n",
" df_part[col] = df_part[col].astype('float32')\n",
" df_part[col] = df_part[col].fillna(-1)\n",
" \n",
" return df_part"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Dask DataFrame Structure:
\n",
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" vendor_id | \n",
" pickup_datetime | \n",
" dropoff_datetime | \n",
" passenger_count | \n",
" trip_distance | \n",
" pickup_longitude | \n",
" pickup_latitude | \n",
" rate_code | \n",
" store_and_fwd_flag | \n",
" dropoff_longitude | \n",
" dropoff_latitude | \n",
" payment_type | \n",
" fare_amount | \n",
" surcharge | \n",
" mta_tax | \n",
" tip_amount | \n",
" tolls_amount | \n",
" total_amount | \n",
"
\n",
" \n",
" npartitions=110 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | \n",
" object | \n",
" object | \n",
" object | \n",
" int64 | \n",
" float64 | \n",
" float64 | \n",
" float64 | \n",
" int64 | \n",
" object | \n",
" float64 | \n",
" float64 | \n",
" object | \n",
" float64 | \n",
" float64 | \n",
" float64 | \n",
" float64 | \n",
" float64 | \n",
" float64 | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
"
\n",
"
\n",
"Dask Name: read-csv, 110 tasks
"
],
"text/plain": [
""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_2014"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pickup_datetime | \n",
" dropoff_datetime | \n",
" passenger_count | \n",
" trip_distance | \n",
" pickup_longitude | \n",
" pickup_latitude | \n",
" rate_code | \n",
" dropoff_longitude | \n",
" dropoff_latitude | \n",
" fare_amount | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2014-01-09 20:45:25 | \n",
" 2014-01-09 20:52:31 | \n",
" 1 | \n",
" 0.7 | \n",
" -73.994766 | \n",
" 40.736828 | \n",
" 1 | \n",
" -73.982224 | \n",
" 40.731789 | \n",
" 6.5 | \n",
"
\n",
" \n",
" 1 | \n",
" 2014-01-09 20:46:12 | \n",
" 2014-01-09 20:55:12 | \n",
" 1 | \n",
" 1.4 | \n",
" -73.982391 | \n",
" 40.773380 | \n",
" 1 | \n",
" -73.960449 | \n",
" 40.763996 | \n",
" 8.5 | \n",
"
\n",
" \n",
" 2 | \n",
" 2014-01-09 20:44:47 | \n",
" 2014-01-09 20:59:46 | \n",
" 2 | \n",
" 2.3 | \n",
" -73.988571 | \n",
" 40.739407 | \n",
" 1 | \n",
" -73.986626 | \n",
" 40.765217 | \n",
" 11.5 | \n",
"
\n",
" \n",
" 3 | \n",
" 2014-01-09 20:44:57 | \n",
" 2014-01-09 20:51:40 | \n",
" 1 | \n",
" 1.7 | \n",
" -73.960213 | \n",
" 40.770466 | \n",
" 1 | \n",
" -73.979866 | \n",
" 40.777050 | \n",
" 7.5 | \n",
"
\n",
" \n",
" 4 | \n",
" 2014-01-09 20:47:09 | \n",
" 2014-01-09 20:53:32 | \n",
" 1 | \n",
" 0.9 | \n",
" -73.995369 | \n",
" 40.717247 | \n",
" 1 | \n",
" -73.984367 | \n",
" 40.720524 | \n",
" 6.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pickup_datetime dropoff_datetime passenger_count trip_distance \\\n",
"0 2014-01-09 20:45:25 2014-01-09 20:52:31 1 0.7 \n",
"1 2014-01-09 20:46:12 2014-01-09 20:55:12 1 1.4 \n",
"2 2014-01-09 20:44:47 2014-01-09 20:59:46 2 2.3 \n",
"3 2014-01-09 20:44:57 2014-01-09 20:51:40 1 1.7 \n",
"4 2014-01-09 20:47:09 2014-01-09 20:53:32 1 0.9 \n",
"\n",
" pickup_longitude pickup_latitude rate_code dropoff_longitude \\\n",
"0 -73.994766 40.736828 1 -73.982224 \n",
"1 -73.982391 40.773380 1 -73.960449 \n",
"2 -73.988571 40.739407 1 -73.986626 \n",
"3 -73.960213 40.770466 1 -73.979866 \n",
"4 -73.995369 40.717247 1 -73.984367 \n",
"\n",
" dropoff_latitude fare_amount \n",
"0 40.731789 6.5 \n",
"1 40.763996 8.5 \n",
"2 40.765217 11.5 \n",
"3 40.777050 7.5 \n",
"4 40.720524 6.0 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_2014 = df_2014.map_partitions(clean, remap, must_haves, meta=must_haves)\n",
"df_2014.head().to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note the schema and the columns have been properly casted and renamed."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Dask DataFrame Structure:
\n",
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pickup_datetime | \n",
" dropoff_datetime | \n",
" passenger_count | \n",
" trip_distance | \n",
" pickup_longitude | \n",
" pickup_latitude | \n",
" rate_code | \n",
" dropoff_longitude | \n",
" dropoff_latitude | \n",
" fare_amount | \n",
"
\n",
" \n",
" npartitions=110 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | \n",
" datetime64[ns] | \n",
" datetime64[ns] | \n",
" int32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" int32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
"
\n",
"
\n",
"Dask Name: clean, 220 tasks
"
],
"text/plain": [
""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_2014"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# concatenate multiple DataFrames into one bigger one\n",
"#[TODO] add more data\n",
"#taxi_df = dask.dataframe.multi.concat([df_2014, df_2015, df_2016])\n",
"taxi_df = df_2014"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pickup_datetime | \n",
" dropoff_datetime | \n",
" passenger_count | \n",
" trip_distance | \n",
" pickup_longitude | \n",
" pickup_latitude | \n",
" rate_code | \n",
" dropoff_longitude | \n",
" dropoff_latitude | \n",
" fare_amount | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2014-01-09 20:45:25 | \n",
" 2014-01-09 20:52:31 | \n",
" 1 | \n",
" 0.7 | \n",
" -73.994766 | \n",
" 40.736828 | \n",
" 1 | \n",
" -73.982224 | \n",
" 40.731789 | \n",
" 6.5 | \n",
"
\n",
" \n",
" 1 | \n",
" 2014-01-09 20:46:12 | \n",
" 2014-01-09 20:55:12 | \n",
" 1 | \n",
" 1.4 | \n",
" -73.982391 | \n",
" 40.773380 | \n",
" 1 | \n",
" -73.960449 | \n",
" 40.763996 | \n",
" 8.5 | \n",
"
\n",
" \n",
" 2 | \n",
" 2014-01-09 20:44:47 | \n",
" 2014-01-09 20:59:46 | \n",
" 2 | \n",
" 2.3 | \n",
" -73.988571 | \n",
" 40.739407 | \n",
" 1 | \n",
" -73.986626 | \n",
" 40.765217 | \n",
" 11.5 | \n",
"
\n",
" \n",
" 3 | \n",
" 2014-01-09 20:44:57 | \n",
" 2014-01-09 20:51:40 | \n",
" 1 | \n",
" 1.7 | \n",
" -73.960213 | \n",
" 40.770466 | \n",
" 1 | \n",
" -73.979866 | \n",
" 40.777050 | \n",
" 7.5 | \n",
"
\n",
" \n",
" 4 | \n",
" 2014-01-09 20:47:09 | \n",
" 2014-01-09 20:53:32 | \n",
" 1 | \n",
" 0.9 | \n",
" -73.995369 | \n",
" 40.717247 | \n",
" 1 | \n",
" -73.984367 | \n",
" 40.720524 | \n",
" 6.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pickup_datetime dropoff_datetime passenger_count trip_distance \\\n",
"0 2014-01-09 20:45:25 2014-01-09 20:52:31 1 0.7 \n",
"1 2014-01-09 20:46:12 2014-01-09 20:55:12 1 1.4 \n",
"2 2014-01-09 20:44:47 2014-01-09 20:59:46 2 2.3 \n",
"3 2014-01-09 20:44:57 2014-01-09 20:51:40 1 1.7 \n",
"4 2014-01-09 20:47:09 2014-01-09 20:53:32 1 0.9 \n",
"\n",
" pickup_longitude pickup_latitude rate_code dropoff_longitude \\\n",
"0 -73.994766 40.736828 1 -73.982224 \n",
"1 -73.982391 40.773380 1 -73.960449 \n",
"2 -73.988571 40.739407 1 -73.986626 \n",
"3 -73.960213 40.770466 1 -73.979866 \n",
"4 -73.995369 40.717247 1 -73.984367 \n",
"\n",
" dropoff_latitude fare_amount \n",
"0 40.731789 6.5 \n",
"1 40.763996 8.5 \n",
"2 40.765217 11.5 \n",
"3 40.777050 7.5 \n",
"4 40.720524 6.0 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# apply a list of filter conditions to throw out records with missing or outlier values\n",
"query_frags = [\n",
" 'fare_amount > 0 and fare_amount < 500',\n",
" 'passenger_count > 0 and passenger_count < 6',\n",
" 'pickup_longitude > -75 and pickup_longitude < -73',\n",
" 'dropoff_longitude > -75 and dropoff_longitude < -73',\n",
" 'pickup_latitude > 40 and pickup_latitude < 42',\n",
" 'dropoff_latitude > 40 and dropoff_latitude < 42'\n",
"]\n",
"taxi_df = taxi_df.query(' and '.join(query_frags))\n",
"\n",
"# inspect the results of cleaning\n",
"taxi_df.head().to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Adding Interesting Features\n",
"\n",
"Dask & cuDF provide standard DataFrame operations, but also let you run \"user defined functions\" on the underlying data.\n",
"\n",
"cuDF's [apply_rows](https://rapidsai.github.io/projects/cudf/en/0.6.0/api.html#cudf.dataframe.DataFrame.apply_rows) operation is similar to Pandas's [DataFrame.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html), except that for cuDF, custom Python code is [JIT compiled by numba](https://numba.pydata.org/numba-doc/dev/cuda/kernels.html) into GPU kernels.\n",
"\n",
"We'll use a Haversine Distance calculation to find total trip distance, and extract additional useful variables from the datetime fields."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"from math import cos, sin, asin, sqrt, pi\n",
"\n",
"def haversine_distance_kernel(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude, h_distance):\n",
" for i, (x_1, y_1, x_2, y_2) in enumerate(zip(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude)):\n",
" x_1 = pi/180 * x_1\n",
" y_1 = pi/180 * y_1\n",
" x_2 = pi/180 * x_2\n",
" y_2 = pi/180 * y_2\n",
" \n",
" dlon = y_2 - y_1\n",
" dlat = x_2 - x_1\n",
" a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2\n",
" \n",
" c = 2 * asin(sqrt(a)) \n",
" r = 6371 # Radius of earth in kilometers\n",
" \n",
" h_distance[i] = c * r\n",
"\n",
"def day_of_the_week_kernel(day, month, year, day_of_week):\n",
" for i, (d_1, m_1, y_1) in enumerate(zip(day, month, year)):\n",
" if month[i] <3:\n",
" shift = month[i]\n",
" else:\n",
" shift = 0\n",
" Y = year[i] - (month[i] < 3)\n",
" y = Y - 2000\n",
" c = 20\n",
" d = day[i]\n",
" m = month[i] + shift + 1\n",
" day_of_week[i] = (d + math.floor(m*2.6) + y + (y//4) + (c//4) -2*c)%7\n",
" \n",
"def add_features(df):\n",
" df['hour'] = df['pickup_datetime'].dt.hour\n",
" df['year'] = df['pickup_datetime'].dt.year\n",
" df['month'] = df['pickup_datetime'].dt.month\n",
" df['day'] = df['pickup_datetime'].dt.day\n",
" \n",
" df['pickup_latitude_r'] = df['pickup_latitude']//.01*.01\n",
" df['pickup_longitude_r'] = df['pickup_longitude']//.01*.01\n",
" df['dropoff_latitude_r'] = df['dropoff_latitude']//.01*.01\n",
" df['dropoff_longitude_r'] = df['dropoff_longitude']//.01*.01\n",
" \n",
" \n",
" \n",
" df = df.apply_rows(haversine_distance_kernel,\n",
" incols=['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude'],\n",
" outcols=dict(h_distance=np.float32),\n",
" kwargs=dict())\n",
" \n",
" \n",
" df = df.apply_rows(day_of_the_week_kernel,\n",
" incols=['day', 'month', 'year'],\n",
" outcols=dict(day_of_week=np.float32),\n",
" kwargs=dict())\n",
" \n",
" \n",
" df['is_weekend'] = (df['day_of_week']<2).astype(np.int32)\n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'pickup_datetime': 'datetime64[ns]', 'dropoff_datetime': 'datetime64[ns]', 'passenger_count': 'int32', 'trip_distance': 'float32', 'pickup_longitude': 'float32', 'pickup_latitude': 'float32', 'rate_code': 'int32', 'dropoff_longitude': 'float32', 'dropoff_latitude': 'float32', 'fare_amount': 'float32'}\n"
]
}
],
"source": [
"print(must_haves)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"features = {\n",
" 'hour': 'int32',\n",
" 'year': 'int32',\n",
" 'month': 'int32',\n",
" 'day': 'float32',\n",
" 'pickup_latitude_r': 'float32',\n",
" 'pickup_longitude_r': 'float32',\n",
" 'dropoff_latitude_r': 'float32',\n",
" 'dropoff_longitude_r': 'float32',\n",
" 'h_distance': 'float32',\n",
" 'day_of_week': 'float32',\n",
" 'is_weekend': 'int32'\n",
"}\n",
"\n",
"all_features = {**must_haves, **features}"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'pickup_datetime': 'datetime64[ns]', 'dropoff_datetime': 'datetime64[ns]', 'passenger_count': 'int32', 'trip_distance': 'float32', 'pickup_longitude': 'float32', 'pickup_latitude': 'float32', 'rate_code': 'int32', 'dropoff_longitude': 'float32', 'dropoff_latitude': 'float32', 'fare_amount': 'float32', 'hour': 'int32', 'year': 'int32', 'month': 'int32', 'day': 'float32', 'pickup_latitude_r': 'float32', 'pickup_longitude_r': 'float32', 'dropoff_latitude_r': 'float32', 'dropoff_longitude_r': 'float32', 'h_distance': 'float32', 'day_of_week': 'float32', 'is_weekend': 'int32'}\n"
]
}
],
"source": [
"print(all_features)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Dask DataFrame Structure:
\n",
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pickup_datetime | \n",
" dropoff_datetime | \n",
" passenger_count | \n",
" trip_distance | \n",
" pickup_longitude | \n",
" pickup_latitude | \n",
" rate_code | \n",
" dropoff_longitude | \n",
" dropoff_latitude | \n",
" fare_amount | \n",
"
\n",
" \n",
" npartitions=110 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | \n",
" datetime64[ns] | \n",
" datetime64[ns] | \n",
" int32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" int32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
"
\n",
"
\n",
"Dask Name: query, 330 tasks
"
],
"text/plain": [
""
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"taxi_df"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pickup_datetime | \n",
" dropoff_datetime | \n",
" passenger_count | \n",
" trip_distance | \n",
" pickup_longitude | \n",
" pickup_latitude | \n",
" rate_code | \n",
" dropoff_longitude | \n",
" dropoff_latitude | \n",
" fare_amount | \n",
" ... | \n",
" year | \n",
" month | \n",
" day | \n",
" pickup_latitude_r | \n",
" pickup_longitude_r | \n",
" dropoff_latitude_r | \n",
" dropoff_longitude_r | \n",
" h_distance | \n",
" day_of_week | \n",
" is_weekend | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2014-01-09 20:45:25 | \n",
" 2014-01-09 20:52:31 | \n",
" 1 | \n",
" 0.7 | \n",
" -73.994766 | \n",
" 40.736828 | \n",
" 1 | \n",
" -73.982224 | \n",
" 40.731789 | \n",
" 6.5 | \n",
" ... | \n",
" 2014 | \n",
" 1 | \n",
" 9 | \n",
" 40.730000 | \n",
" -74.000000 | \n",
" 40.730000 | \n",
" -73.989998 | \n",
" 1.196175 | \n",
" 4.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2014-01-09 20:46:12 | \n",
" 2014-01-09 20:55:12 | \n",
" 1 | \n",
" 1.4 | \n",
" -73.982391 | \n",
" 40.773380 | \n",
" 1 | \n",
" -73.960449 | \n",
" 40.763996 | \n",
" 8.5 | \n",
" ... | \n",
" 2014 | \n",
" 1 | \n",
" 9 | \n",
" 40.770000 | \n",
" -73.989998 | \n",
" 40.759998 | \n",
" -73.970001 | \n",
" 2.122098 | \n",
" 4.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 2014-01-09 20:44:47 | \n",
" 2014-01-09 20:59:46 | \n",
" 2 | \n",
" 2.3 | \n",
" -73.988571 | \n",
" 40.739407 | \n",
" 1 | \n",
" -73.986626 | \n",
" 40.765217 | \n",
" 11.5 | \n",
" ... | \n",
" 2014 | \n",
" 1 | \n",
" 9 | \n",
" 40.730000 | \n",
" -73.989998 | \n",
" 40.759998 | \n",
" -73.989998 | \n",
" 2.874643 | \n",
" 4.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 2014-01-09 20:44:57 | \n",
" 2014-01-09 20:51:40 | \n",
" 1 | \n",
" 1.7 | \n",
" -73.960213 | \n",
" 40.770466 | \n",
" 1 | \n",
" -73.979866 | \n",
" 40.777050 | \n",
" 7.5 | \n",
" ... | \n",
" 2014 | \n",
" 1 | \n",
" 9 | \n",
" 40.770000 | \n",
" -73.970001 | \n",
" 40.770000 | \n",
" -73.979996 | \n",
" 1.809662 | \n",
" 4.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 2014-01-09 20:47:09 | \n",
" 2014-01-09 20:53:32 | \n",
" 1 | \n",
" 0.9 | \n",
" -73.995369 | \n",
" 40.717247 | \n",
" 1 | \n",
" -73.984367 | \n",
" 40.720524 | \n",
" 6.0 | \n",
" ... | \n",
" 2014 | \n",
" 1 | \n",
" 9 | \n",
" 40.709999 | \n",
" -74.000000 | \n",
" 40.719997 | \n",
" -73.989998 | \n",
" 0.996204 | \n",
" 4.0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 21 columns
\n",
"
"
],
"text/plain": [
" pickup_datetime dropoff_datetime passenger_count trip_distance \\\n",
"0 2014-01-09 20:45:25 2014-01-09 20:52:31 1 0.7 \n",
"1 2014-01-09 20:46:12 2014-01-09 20:55:12 1 1.4 \n",
"2 2014-01-09 20:44:47 2014-01-09 20:59:46 2 2.3 \n",
"3 2014-01-09 20:44:57 2014-01-09 20:51:40 1 1.7 \n",
"4 2014-01-09 20:47:09 2014-01-09 20:53:32 1 0.9 \n",
"\n",
" pickup_longitude pickup_latitude rate_code dropoff_longitude \\\n",
"0 -73.994766 40.736828 1 -73.982224 \n",
"1 -73.982391 40.773380 1 -73.960449 \n",
"2 -73.988571 40.739407 1 -73.986626 \n",
"3 -73.960213 40.770466 1 -73.979866 \n",
"4 -73.995369 40.717247 1 -73.984367 \n",
"\n",
" dropoff_latitude fare_amount ... year month day pickup_latitude_r \\\n",
"0 40.731789 6.5 ... 2014 1 9 40.730000 \n",
"1 40.763996 8.5 ... 2014 1 9 40.770000 \n",
"2 40.765217 11.5 ... 2014 1 9 40.730000 \n",
"3 40.777050 7.5 ... 2014 1 9 40.770000 \n",
"4 40.720524 6.0 ... 2014 1 9 40.709999 \n",
"\n",
" pickup_longitude_r dropoff_latitude_r dropoff_longitude_r h_distance \\\n",
"0 -74.000000 40.730000 -73.989998 1.196175 \n",
"1 -73.989998 40.759998 -73.970001 2.122098 \n",
"2 -73.989998 40.759998 -73.989998 2.874643 \n",
"3 -73.970001 40.770000 -73.979996 1.809662 \n",
"4 -74.000000 40.719997 -73.989998 0.996204 \n",
"\n",
" day_of_week is_weekend \n",
"0 4.0 0 \n",
"1 4.0 0 \n",
"2 4.0 0 \n",
"3 4.0 0 \n",
"4 4.0 0 \n",
"\n",
"[5 rows x 21 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# actually add the features\n",
"taxi_df = taxi_df.map_partitions(add_features, meta=all_features)\n",
"# inspect the result\n",
"taxi_df.head().to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pick a Training Set\n",
"\n",
"Let's imagine you're making a trip to New York on the 25th and want to build a model to predict what fare prices will be like the last few days of the month based on the first part of the month. We'll use a query expression to identify the `day` of the month to use to divide the data into train and test sets.\n",
"\n",
"The wall-time below represents how long it takes your GPU cluster to load data from the Google Cloud Storage bucket and the ETL portion of the workflow."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 7.91 s, sys: 630 ms, total: 8.54 s\n",
"Wall time: 3min 33s\n"
]
}
],
"source": [
"%%time\n",
"X_train = taxi_df.query('day < 25').persist()\n",
"\n",
"# create a Y_train ddf with just the target variable\n",
"Y_train = X_train[['fare_amount']].persist()\n",
"# drop the target variable from the training ddf\n",
"X_train = X_train[X_train.columns.difference(['fare_amount'])]\n",
"\n",
"# this wont return until all data is in GPU memory\n",
"done = wait([X_train, Y_train])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"X_train = X_train.drop('pickup_datetime', axis=1)\n",
"X_train = X_train.drop('dropoff_datetime', axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Dask DataFrame Structure:
\n",
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" day | \n",
" day_of_week | \n",
" dropoff_latitude | \n",
" dropoff_latitude_r | \n",
" dropoff_longitude | \n",
" dropoff_longitude_r | \n",
" h_distance | \n",
" hour | \n",
" is_weekend | \n",
" month | \n",
" passenger_count | \n",
" pickup_latitude | \n",
" pickup_latitude_r | \n",
" pickup_longitude | \n",
" pickup_longitude_r | \n",
" rate_code | \n",
" trip_distance | \n",
" year | \n",
"
\n",
" \n",
" npartitions=110 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" int32 | \n",
" int32 | \n",
" int32 | \n",
" int32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" int32 | \n",
" float32 | \n",
" int32 | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
"
\n",
"
\n",
"Dask Name: drop_by_shallow_copy, 440 tasks
"
],
"text/plain": [
""
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train the XGBoost Regression Model\n",
"\n",
"The wall time output below indicates how long it took your GPU cluster to train an XGBoost model over the training set."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.14 s, sys: 80.8 ms, total: 1.22 s\n",
"Wall time: 19.9 s\n"
]
}
],
"source": [
"%%time\n",
"dtrain = xgb.dask.DaskDMatrix(client, X_train, Y_train)\n",
"# Use train method from xgboost.dask instead of xgboost. This\n",
"# distributed version of train returns a dictionary containing the\n",
"# resulting booster and evaluation history obtained from\n",
"# evaluation metrics.\n",
"model_output = xgb.dask.train(client,\n",
" # Use GPU training algorithm\n",
" {'tree_method': 'gpu_hist'},\n",
" dtrain,\n",
" num_boost_round=100,\n",
" evals=[(dtrain, 'train')])\n",
"booster = model_output['booster'] # booster is the trained model\n",
"history = model_output['history'] # A dictionary containing evaluation results"
]
},
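{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to reuse the booster outside this notebook (for example, behind a SageMaker endpoint), you can persist it with XGBoost's standard serialization; a small sketch (the file name is arbitrary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# persist the trained booster to local disk;\n",
"# xgb.Booster(model_file='xgboost_nyc_taxi_fare.model') loads it back\n",
"booster.save_model('xgboost_nyc_taxi_fare.model')"
]
},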
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How Good is Our Model?\n",
"\n",
"Now that we have a trained model, we need to test it with the 25% of records we held out.\n",
"\n",
"Based on the filtering conditions applied to this dataset, many of the DataFrame partitions will wind up having 0 rows.\n",
"\n",
"This is a problem for XGBoost which doesn't know what to do with 0 length arrays. We'll apply a bit of Dask logic to check for and drop partitions without any rows."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"def drop_empty_partitions(df):\n",
" lengths = df.map_partitions(len).compute()\n",
" nonempty = [length > 0 for length in lengths]\n",
" return df.partitions[nonempty]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"31739751"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_test = taxi_df.query('day >= 25').persist()\n",
"X_test = drop_empty_partitions(X_test)\n",
"\n",
"# Create Y_test with just the fare amount\n",
"Y_test = X_test[['fare_amount']]\n",
"\n",
"# Drop the fare amount from X_test\n",
"X_test = X_test[X_test.columns.difference(['fare_amount'])]\n",
"\n",
"# display test set size\n",
"len(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"X_test = X_test.drop('pickup_datetime', axis=1)\n",
"X_test = X_test.drop('dropoff_datetime', axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Evaluation history: {'train': {'rmse': [11.327804, 8.172996, 6.041537, 4.641979, 3.757672, 3.224258, 2.911278, 2.732369, 2.630654, 2.569407, 2.531624, 2.502579, 2.486299, 2.471962, 2.460887, 2.449411, 2.437919, 2.429802, 2.42383, 2.41742, 2.412308, 2.407344, 2.403826, 2.400395, 2.395525, 2.393419, 2.388041, 2.384101, 2.380613, 2.378282, 2.371596, 2.367515, 2.364666, 2.357522, 2.355485, 2.352574, 2.349749, 2.346168, 2.343192, 2.341385, 2.338813, 2.337814, 2.336341, 2.333639, 2.331098, 2.327207, 2.326241, 2.324432, 2.320762, 2.318273, 2.315912, 2.314492, 2.313371, 2.31209, 2.311155, 2.309329, 2.306317, 2.305289, 2.303312, 2.302437, 2.30142, 2.298965, 2.297313, 2.296462, 2.294187, 2.292799, 2.291584, 2.29013, 2.289528, 2.288281, 2.287606, 2.286923, 2.285734, 2.284415, 2.283031, 2.282219, 2.281136, 2.279738, 2.278096, 2.277354, 2.276891, 2.276182, 2.274509, 2.273721, 2.272978, 2.272003, 2.27018, 2.269635, 2.268117, 2.267274, 2.266372, 2.264311, 2.263582, 2.262124, 2.261794, 2.260962, 2.259462, 2.258576, 2.257909, 2.257146]}}\n"
]
}
],
"source": [
"prediction = xgb.dask.predict(client, booster, X_test)\n",
"prediction = prediction.compute()\n",
"print('Evaluation history:', history)"
]
},
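{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training history above only tells us about the fit on the training set. Below is a minimal sketch for scoring the held-out test set; it assumes `prediction` and `Y_test` from the cells above, and depending on your XGBoost/cuDF versions `prediction` may come back as a CuPy array, in which case coerce it with `cupy.asnumpy` instead of `np.asarray`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# compare predictions against the held-out fares with a simple RMSE\n",
"y_test = Y_test['fare_amount'].compute().to_pandas().values\n",
"y_pred = np.asarray(prediction)\n",
"rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))\n",
"print('Test RMSE:', rmse)"
]
},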
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_rapidsai",
"language": "python",
"name": "conda_rapidsai"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}