{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Predicting NYC Taxi Fares with RAPIDS\n",
"\n",
"[RAPIDS](https://rapids.ai/) is a suite of GPU accelerated data science libraries with APIs that should be familiar to users of Pandas, Dask, and Scikitlearn.\n",
"\n",
"This notebook focuses on showing how to use cuDF with Dask & XGBoost to scale GPU DataFrame ETL-style operations & model training out to multiple GPUs on mutliple nodes using Amazon SageMaker.\n",
"\n",
"We will use the NYC Taxi dataset available available in us-west-2. If you are using this notebook in a different region, make sure you copy the relevant data in a bucket in your region."
]
},
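{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, copying one month of the public dataset into your own bucket might look like the cell below. This is a sketch: `YOUR_BUCKET` is a placeholder, and the exact file name assumes the `yellow_tripdata_YYYY-MM.csv` naming implied by the glob used later in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical example: copy one month of data into your own bucket\n",
"# (replace YOUR_BUCKET with a bucket in the region this notebook runs in)\n",
"!aws s3 cp s3://us-west-2.serverless-analytics/NYC-Pub/yellow/yellow_tripdata_2014-01.csv \\\n",
"    s3://YOUR_BUCKET/NYC-Pub/yellow/yellow_tripdata_2014-01.csv"
]
},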
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"\n",
"Client\n",
"\n",
" | \n",
"\n",
"Cluster\n",
"\n",
" - Workers: 4
\n",
" - Cores: 4
\n",
" - Memory: 257.79 GB
\n",
" \n",
" | \n",
"
\n",
"
"
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"import numba, xgboost, socket\n",
"import dask, dask_cudf\n",
"from dask.distributed import Client, wait\n",
"from dask_cuda import LocalCUDACluster\n",
"import xgboost as xgb\n",
"\n",
"\n",
"cluster = LocalCUDACluster() # by default use all GPUs in the node.\n",
"client = Client(cluster)\n",
"client"
]
},
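{
"cell_type": "markdown",
"metadata": {},
"source": [
"`LocalCUDACluster` only uses the GPUs in this one node. To scale out to multiple nodes, each node runs a `dask-cuda-worker` process pointed at a shared `dask-scheduler`, and the `Client` connects to that scheduler instead. A minimal sketch (the scheduler address is a placeholder you would replace with your own):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical multi-node setup: start `dask-scheduler` on one node and\n",
"# `dask-cuda-worker tcp://<scheduler-ip>:8786` on every GPU node, then:\n",
"# client = Client('tcp://<scheduler-ip>:8786')"
]
},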
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Inspecting the Data\n",
"\n",
"Now that we have a cluster of GPU workers, we'll use [dask-cudf](https://github.com/rapidsai/dask-cudf/) to load and parse a bunch of CSV files into a distributed DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" vendor_id | \n",
" pickup_datetime | \n",
" dropoff_datetime | \n",
" passenger_count | \n",
" trip_distance | \n",
" pickup_longitude | \n",
" pickup_latitude | \n",
" rate_code | \n",
" store_and_fwd_flag | \n",
" dropoff_longitude | \n",
" dropoff_latitude | \n",
" payment_type | \n",
" fare_amount | \n",
" surcharge | \n",
" mta_tax | \n",
" tip_amount | \n",
" tolls_amount | \n",
" total_amount | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" CMT | \n",
" 2014-01-09 20:45:25 | \n",
" 2014-01-09 20:52:31 | \n",
" 1 | \n",
" 0.7 | \n",
" -73.994770 | \n",
" 40.736828 | \n",
" 1 | \n",
" N | \n",
" -73.982227 | \n",
" 40.731790 | \n",
" CRD | \n",
" 6.5 | \n",
" 0.5 | \n",
" 0.5 | \n",
" 1.40 | \n",
" 0.0 | \n",
" 8.90 | \n",
"
\n",
" \n",
" 1 | \n",
" CMT | \n",
" 2014-01-09 20:46:12 | \n",
" 2014-01-09 20:55:12 | \n",
" 1 | \n",
" 1.4 | \n",
" -73.982392 | \n",
" 40.773382 | \n",
" 1 | \n",
" N | \n",
" -73.960449 | \n",
" 40.763995 | \n",
" CRD | \n",
" 8.5 | \n",
" 0.5 | \n",
" 0.5 | \n",
" 1.90 | \n",
" 0.0 | \n",
" 11.40 | \n",
"
\n",
" \n",
" 2 | \n",
" CMT | \n",
" 2014-01-09 20:44:47 | \n",
" 2014-01-09 20:59:46 | \n",
" 2 | \n",
" 2.3 | \n",
" -73.988570 | \n",
" 40.739406 | \n",
" 1 | \n",
" N | \n",
" -73.986626 | \n",
" 40.765217 | \n",
" CRD | \n",
" 11.5 | \n",
" 0.5 | \n",
" 0.5 | \n",
" 1.50 | \n",
" 0.0 | \n",
" 14.00 | \n",
"
\n",
" \n",
" 3 | \n",
" CMT | \n",
" 2014-01-09 20:44:57 | \n",
" 2014-01-09 20:51:40 | \n",
" 1 | \n",
" 1.7 | \n",
" -73.960213 | \n",
" 40.770464 | \n",
" 1 | \n",
" N | \n",
" -73.979863 | \n",
" 40.777050 | \n",
" CRD | \n",
" 7.5 | \n",
" 0.5 | \n",
" 0.5 | \n",
" 1.70 | \n",
" 0.0 | \n",
" 10.20 | \n",
"
\n",
" \n",
" 4 | \n",
" CMT | \n",
" 2014-01-09 20:47:09 | \n",
" 2014-01-09 20:53:32 | \n",
" 1 | \n",
" 0.9 | \n",
" -73.995371 | \n",
" 40.717248 | \n",
" 1 | \n",
" N | \n",
" -73.984367 | \n",
" 40.720524 | \n",
" CRD | \n",
" 6.0 | \n",
" 0.5 | \n",
" 0.5 | \n",
" 1.75 | \n",
" 0.0 | \n",
" 8.75 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" vendor_id pickup_datetime dropoff_datetime passenger_count \\\n",
"0 CMT 2014-01-09 20:45:25 2014-01-09 20:52:31 1 \n",
"1 CMT 2014-01-09 20:46:12 2014-01-09 20:55:12 1 \n",
"2 CMT 2014-01-09 20:44:47 2014-01-09 20:59:46 2 \n",
"3 CMT 2014-01-09 20:44:57 2014-01-09 20:51:40 1 \n",
"4 CMT 2014-01-09 20:47:09 2014-01-09 20:53:32 1 \n",
"\n",
" trip_distance pickup_longitude pickup_latitude rate_code \\\n",
"0 0.7 -73.994770 40.736828 1 \n",
"1 1.4 -73.982392 40.773382 1 \n",
"2 2.3 -73.988570 40.739406 1 \n",
"3 1.7 -73.960213 40.770464 1 \n",
"4 0.9 -73.995371 40.717248 1 \n",
"\n",
" store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type \\\n",
"0 N -73.982227 40.731790 CRD \n",
"1 N -73.960449 40.763995 CRD \n",
"2 N -73.986626 40.765217 CRD \n",
"3 N -73.979863 40.777050 CRD \n",
"4 N -73.984367 40.720524 CRD \n",
"\n",
" fare_amount surcharge mta_tax tip_amount tolls_amount \\\n",
"0 6.5 0.5 0.5 1.40 0.0 \n",
"1 8.5 0.5 0.5 1.90 0.0 \n",
"2 11.5 0.5 0.5 1.50 0.0 \n",
"3 7.5 0.5 0.5 1.70 0.0 \n",
"4 6.0 0.5 0.5 1.75 0.0 \n",
"\n",
" total_amount \n",
"0 8.90 \n",
"1 11.40 \n",
"2 14.00 \n",
"3 10.20 \n",
"4 8.75 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"base_path = 's3://us-west-2.serverless-analytics/NYC-Pub/yellow/yellow_tripdata_'\n",
"\n",
"df_2014 = dask_cudf.read_csv(base_path+'*2014*.csv')\n",
"df_2014.head().to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have about 164 mio records for the 2014 yellow taxi data."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 5.51 s, sys: 993 ms, total: 6.5 s\n",
"Wall time: 3min 16s\n"
]
},
{
"data": {
"text/plain": [
"165114361"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"len(df_2014)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Cleanup\n",
"\n",
"As usual, the data needs to be massaged a bit before we can start adding features that are useful to an ML model.\n",
"\n",
"For example, in the 2014 taxi CSV files, there are `pickup_datetime` and `dropoff_datetime` columns. The 2015 CSVs have `tpep_pickup_datetime` and `tpep_dropoff_datetime`, which are the same columns. One year has `rate_code`, and another `RateCodeID`.\n",
"\n",
"Also, some CSV files have column names with extraneous spaces in them.\n",
"\n",
"Worst of all, starting in the July 2016 CSVs, pickup & dropoff latitude and longitude data were replaced by location IDs, making the second half of the year useless to us.\n",
"\n",
"We'll do a little string manipulation, column renaming, and concatenating of DataFrames to sidestep the problems."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# list of column names that need to be re-mapped\n",
"remap = {}\n",
"remap['tpep_pickup_datetime'] = 'pickup_datetime'\n",
"remap['tpep_dropoff_datetime'] = 'dropoff_datetime'\n",
"remap['ratecodeid'] = 'rate_code'\n",
"\n",
"#create a list of columns & dtypes the df must have\n",
"must_haves = {\n",
" 'pickup_datetime': 'datetime64[ns]',\n",
" 'dropoff_datetime': 'datetime64[ns]',\n",
" 'passenger_count': 'int32',\n",
" 'trip_distance': 'float32',\n",
" 'pickup_longitude': 'float32',\n",
" 'pickup_latitude': 'float32',\n",
" 'rate_code': 'int32',\n",
" 'dropoff_longitude': 'float32',\n",
" 'dropoff_latitude': 'float32',\n",
" 'fare_amount': 'float32'\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# helper function which takes a DataFrame partition\n",
"def clean(df_part, remap, must_haves): \n",
" # some col-names include pre-pended spaces remove & lowercase column names\n",
" tmp = {col:col.strip().lower() for col in list(df_part.columns)}\n",
" df_part = df_part.rename(tmp)\n",
" \n",
" # rename using the supplied mapping\n",
" df_part = df_part.rename(remap)\n",
" \n",
" # iterate through columns in this df partition\n",
" for col in df_part.columns:\n",
" # drop anything not in our expected list\n",
" if col not in must_haves:\n",
" df_part = df_part.drop(col)\n",
" continue\n",
" \n",
" # fixes datetime error found by Ty Mckercher and fixed by Paul Mahler\n",
" if df_part[col].dtype == 'object' and col in ['pickup_datetime', 'dropoff_datetime']:\n",
" df_part[col] = df_part[col].astype('datetime64[ns]')\n",
" continue\n",
" \n",
" # if column was read as a string, recast as float\n",
" if df_part[col].dtype == 'object':\n",
" df_part[col] = df_part[col].str.fillna('-1')\n",
" df_part[col] = df_part[col].astype('float32')\n",
" else:\n",
" # downcast from 64bit to 32bit types\n",
" # Tesla T4 are faster on 32bit ops\n",
" if 'int' in str(df_part[col].dtype):\n",
" df_part[col] = df_part[col].astype('int32')\n",
" if 'float' in str(df_part[col].dtype):\n",
" df_part[col] = df_part[col].astype('float32')\n",
" df_part[col] = df_part[col].fillna(-1)\n",
" \n",
" return df_part"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Dask DataFrame Structure:
\n",
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" vendor_id | \n",
" pickup_datetime | \n",
" dropoff_datetime | \n",
" passenger_count | \n",
" trip_distance | \n",
" pickup_longitude | \n",
" pickup_latitude | \n",
" rate_code | \n",
" store_and_fwd_flag | \n",
" dropoff_longitude | \n",
" dropoff_latitude | \n",
" payment_type | \n",
" fare_amount | \n",
" surcharge | \n",
" mta_tax | \n",
" tip_amount | \n",
" tolls_amount | \n",
" total_amount | \n",
"
\n",
" \n",
" npartitions=110 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | \n",
" object | \n",
" object | \n",
" object | \n",
" int64 | \n",
" float64 | \n",
" float64 | \n",
" float64 | \n",
" int64 | \n",
" object | \n",
" float64 | \n",
" float64 | \n",
" object | \n",
" float64 | \n",
" float64 | \n",
" float64 | \n",
" float64 | \n",
" float64 | \n",
" float64 | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
"
\n",
"
\n",
"Dask Name: read-csv, 110 tasks
"
],
"text/plain": [
""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_2014"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pickup_datetime | \n",
" dropoff_datetime | \n",
" passenger_count | \n",
" trip_distance | \n",
" pickup_longitude | \n",
" pickup_latitude | \n",
" rate_code | \n",
" dropoff_longitude | \n",
" dropoff_latitude | \n",
" fare_amount | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2014-01-09 20:45:25 | \n",
" 2014-01-09 20:52:31 | \n",
" 1 | \n",
" 0.7 | \n",
" -73.994766 | \n",
" 40.736828 | \n",
" 1 | \n",
" -73.982224 | \n",
" 40.731789 | \n",
" 6.5 | \n",
"
\n",
" \n",
" 1 | \n",
" 2014-01-09 20:46:12 | \n",
" 2014-01-09 20:55:12 | \n",
" 1 | \n",
" 1.4 | \n",
" -73.982391 | \n",
" 40.773380 | \n",
" 1 | \n",
" -73.960449 | \n",
" 40.763996 | \n",
" 8.5 | \n",
"
\n",
" \n",
" 2 | \n",
" 2014-01-09 20:44:47 | \n",
" 2014-01-09 20:59:46 | \n",
" 2 | \n",
" 2.3 | \n",
" -73.988571 | \n",
" 40.739407 | \n",
" 1 | \n",
" -73.986626 | \n",
" 40.765217 | \n",
" 11.5 | \n",
"
\n",
" \n",
" 3 | \n",
" 2014-01-09 20:44:57 | \n",
" 2014-01-09 20:51:40 | \n",
" 1 | \n",
" 1.7 | \n",
" -73.960213 | \n",
" 40.770466 | \n",
" 1 | \n",
" -73.979866 | \n",
" 40.777050 | \n",
" 7.5 | \n",
"
\n",
" \n",
" 4 | \n",
" 2014-01-09 20:47:09 | \n",
" 2014-01-09 20:53:32 | \n",
" 1 | \n",
" 0.9 | \n",
" -73.995369 | \n",
" 40.717247 | \n",
" 1 | \n",
" -73.984367 | \n",
" 40.720524 | \n",
" 6.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pickup_datetime dropoff_datetime passenger_count trip_distance \\\n",
"0 2014-01-09 20:45:25 2014-01-09 20:52:31 1 0.7 \n",
"1 2014-01-09 20:46:12 2014-01-09 20:55:12 1 1.4 \n",
"2 2014-01-09 20:44:47 2014-01-09 20:59:46 2 2.3 \n",
"3 2014-01-09 20:44:57 2014-01-09 20:51:40 1 1.7 \n",
"4 2014-01-09 20:47:09 2014-01-09 20:53:32 1 0.9 \n",
"\n",
" pickup_longitude pickup_latitude rate_code dropoff_longitude \\\n",
"0 -73.994766 40.736828 1 -73.982224 \n",
"1 -73.982391 40.773380 1 -73.960449 \n",
"2 -73.988571 40.739407 1 -73.986626 \n",
"3 -73.960213 40.770466 1 -73.979866 \n",
"4 -73.995369 40.717247 1 -73.984367 \n",
"\n",
" dropoff_latitude fare_amount \n",
"0 40.731789 6.5 \n",
"1 40.763996 8.5 \n",
"2 40.765217 11.5 \n",
"3 40.777050 7.5 \n",
"4 40.720524 6.0 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_2014 = df_2014.map_partitions(clean, remap, must_haves, meta=must_haves)\n",
"df_2014.head().to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note the schema and the columns have been properly casted and renamed."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Dask DataFrame Structure:
\n",
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pickup_datetime | \n",
" dropoff_datetime | \n",
" passenger_count | \n",
" trip_distance | \n",
" pickup_longitude | \n",
" pickup_latitude | \n",
" rate_code | \n",
" dropoff_longitude | \n",
" dropoff_latitude | \n",
" fare_amount | \n",
"
\n",
" \n",
" npartitions=110 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | \n",
" datetime64[ns] | \n",
" datetime64[ns] | \n",
" int32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" int32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
"
\n",
"
\n",
"Dask Name: clean, 220 tasks
"
],
"text/plain": [
""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_2014"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# concatenate multiple DataFrames into one bigger one\n",
"#[TODO] add more data\n",
"#taxi_df = dask.dataframe.multi.concat([df_2014, df_2015, df_2016])\n",
"taxi_df = df_2014"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pickup_datetime | \n",
" dropoff_datetime | \n",
" passenger_count | \n",
" trip_distance | \n",
" pickup_longitude | \n",
" pickup_latitude | \n",
" rate_code | \n",
" dropoff_longitude | \n",
" dropoff_latitude | \n",
" fare_amount | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2014-01-09 20:45:25 | \n",
" 2014-01-09 20:52:31 | \n",
" 1 | \n",
" 0.7 | \n",
" -73.994766 | \n",
" 40.736828 | \n",
" 1 | \n",
" -73.982224 | \n",
" 40.731789 | \n",
" 6.5 | \n",
"
\n",
" \n",
" 1 | \n",
" 2014-01-09 20:46:12 | \n",
" 2014-01-09 20:55:12 | \n",
" 1 | \n",
" 1.4 | \n",
" -73.982391 | \n",
" 40.773380 | \n",
" 1 | \n",
" -73.960449 | \n",
" 40.763996 | \n",
" 8.5 | \n",
"
\n",
" \n",
" 2 | \n",
" 2014-01-09 20:44:47 | \n",
" 2014-01-09 20:59:46 | \n",
" 2 | \n",
" 2.3 | \n",
" -73.988571 | \n",
" 40.739407 | \n",
" 1 | \n",
" -73.986626 | \n",
" 40.765217 | \n",
" 11.5 | \n",
"
\n",
" \n",
" 3 | \n",
" 2014-01-09 20:44:57 | \n",
" 2014-01-09 20:51:40 | \n",
" 1 | \n",
" 1.7 | \n",
" -73.960213 | \n",
" 40.770466 | \n",
" 1 | \n",
" -73.979866 | \n",
" 40.777050 | \n",
" 7.5 | \n",
"
\n",
" \n",
" 4 | \n",
" 2014-01-09 20:47:09 | \n",
" 2014-01-09 20:53:32 | \n",
" 1 | \n",
" 0.9 | \n",
" -73.995369 | \n",
" 40.717247 | \n",
" 1 | \n",
" -73.984367 | \n",
" 40.720524 | \n",
" 6.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pickup_datetime dropoff_datetime passenger_count trip_distance \\\n",
"0 2014-01-09 20:45:25 2014-01-09 20:52:31 1 0.7 \n",
"1 2014-01-09 20:46:12 2014-01-09 20:55:12 1 1.4 \n",
"2 2014-01-09 20:44:47 2014-01-09 20:59:46 2 2.3 \n",
"3 2014-01-09 20:44:57 2014-01-09 20:51:40 1 1.7 \n",
"4 2014-01-09 20:47:09 2014-01-09 20:53:32 1 0.9 \n",
"\n",
" pickup_longitude pickup_latitude rate_code dropoff_longitude \\\n",
"0 -73.994766 40.736828 1 -73.982224 \n",
"1 -73.982391 40.773380 1 -73.960449 \n",
"2 -73.988571 40.739407 1 -73.986626 \n",
"3 -73.960213 40.770466 1 -73.979866 \n",
"4 -73.995369 40.717247 1 -73.984367 \n",
"\n",
" dropoff_latitude fare_amount \n",
"0 40.731789 6.5 \n",
"1 40.763996 8.5 \n",
"2 40.765217 11.5 \n",
"3 40.777050 7.5 \n",
"4 40.720524 6.0 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# apply a list of filter conditions to throw out records with missing or outlier values\n",
"query_frags = [\n",
" 'fare_amount > 0 and fare_amount < 500',\n",
" 'passenger_count > 0 and passenger_count < 6',\n",
" 'pickup_longitude > -75 and pickup_longitude < -73',\n",
" 'dropoff_longitude > -75 and dropoff_longitude < -73',\n",
" 'pickup_latitude > 40 and pickup_latitude < 42',\n",
" 'dropoff_latitude > 40 and dropoff_latitude < 42'\n",
"]\n",
"taxi_df = taxi_df.query(' and '.join(query_frags))\n",
"\n",
"# inspect the results of cleaning\n",
"taxi_df.head().to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Adding Interesting Features\n",
"\n",
"Dask & cuDF provide standard DataFrame operations, but also let you run \"user defined functions\" on the underlying data.\n",
"\n",
"cuDF's [apply_rows](https://rapidsai.github.io/projects/cudf/en/0.6.0/api.html#cudf.dataframe.DataFrame.apply_rows) operation is similar to Pandas's [DataFrame.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html), except that for cuDF, custom Python code is [JIT compiled by numba](https://numba.pydata.org/numba-doc/dev/cuda/kernels.html) into GPU kernels.\n",
"\n",
"We'll use a Haversine Distance calculation to find total trip distance, and extract additional useful variables from the datetime fields."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"from math import cos, sin, asin, sqrt, pi\n",
"\n",
"def haversine_distance_kernel(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude, h_distance):\n",
" for i, (x_1, y_1, x_2, y_2) in enumerate(zip(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude)):\n",
" x_1 = pi/180 * x_1\n",
" y_1 = pi/180 * y_1\n",
" x_2 = pi/180 * x_2\n",
" y_2 = pi/180 * y_2\n",
" \n",
" dlon = y_2 - y_1\n",
" dlat = x_2 - x_1\n",
" a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2\n",
" \n",
" c = 2 * asin(sqrt(a)) \n",
" r = 6371 # Radius of earth in kilometers\n",
" \n",
" h_distance[i] = c * r\n",
"\n",
"def day_of_the_week_kernel(day, month, year, day_of_week):\n",
" for i, (d_1, m_1, y_1) in enumerate(zip(day, month, year)):\n",
" if month[i] <3:\n",
" shift = month[i]\n",
" else:\n",
" shift = 0\n",
" Y = year[i] - (month[i] < 3)\n",
" y = Y - 2000\n",
" c = 20\n",
" d = day[i]\n",
" m = month[i] + shift + 1\n",
" day_of_week[i] = (d + math.floor(m*2.6) + y + (y//4) + (c//4) -2*c)%7\n",
" \n",
"def add_features(df):\n",
" df['hour'] = df['pickup_datetime'].dt.hour\n",
" df['year'] = df['pickup_datetime'].dt.year\n",
" df['month'] = df['pickup_datetime'].dt.month\n",
" df['day'] = df['pickup_datetime'].dt.day\n",
" \n",
" df['pickup_latitude_r'] = df['pickup_latitude']//.01*.01\n",
" df['pickup_longitude_r'] = df['pickup_longitude']//.01*.01\n",
" df['dropoff_latitude_r'] = df['dropoff_latitude']//.01*.01\n",
" df['dropoff_longitude_r'] = df['dropoff_longitude']//.01*.01\n",
" \n",
" \n",
" \n",
" df = df.apply_rows(haversine_distance_kernel,\n",
" incols=['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude'],\n",
" outcols=dict(h_distance=np.float32),\n",
" kwargs=dict())\n",
" \n",
" \n",
" df = df.apply_rows(day_of_the_week_kernel,\n",
" incols=['day', 'month', 'year'],\n",
" outcols=dict(day_of_week=np.float32),\n",
" kwargs=dict())\n",
" \n",
" \n",
" df['is_weekend'] = (df['day_of_week']<2).astype(np.int32)\n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'pickup_datetime': 'datetime64[ns]', 'dropoff_datetime': 'datetime64[ns]', 'passenger_count': 'int32', 'trip_distance': 'float32', 'pickup_longitude': 'float32', 'pickup_latitude': 'float32', 'rate_code': 'int32', 'dropoff_longitude': 'float32', 'dropoff_latitude': 'float32', 'fare_amount': 'float32'}\n"
]
}
],
"source": [
"print(must_haves)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"features = {\n",
" 'hour': 'int32',\n",
" 'year': 'int32',\n",
" 'month': 'int32',\n",
" 'day': 'float32',\n",
" 'pickup_latitude_r': 'float32',\n",
" 'pickup_longitude_r': 'float32',\n",
" 'dropoff_latitude_r': 'float32',\n",
" 'dropoff_longitude_r': 'float32',\n",
" 'h_distance': 'float32',\n",
" 'day_of_week': 'float32',\n",
" 'is_weekend': 'int32'\n",
"}\n",
"\n",
"all_features = {**must_haves, **features}"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'pickup_datetime': 'datetime64[ns]', 'dropoff_datetime': 'datetime64[ns]', 'passenger_count': 'int32', 'trip_distance': 'float32', 'pickup_longitude': 'float32', 'pickup_latitude': 'float32', 'rate_code': 'int32', 'dropoff_longitude': 'float32', 'dropoff_latitude': 'float32', 'fare_amount': 'float32', 'hour': 'int32', 'year': 'int32', 'month': 'int32', 'day': 'float32', 'pickup_latitude_r': 'float32', 'pickup_longitude_r': 'float32', 'dropoff_latitude_r': 'float32', 'dropoff_longitude_r': 'float32', 'h_distance': 'float32', 'day_of_week': 'float32', 'is_weekend': 'int32'}\n"
]
}
],
"source": [
"print(all_features)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Dask DataFrame Structure:
\n",
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pickup_datetime | \n",
" dropoff_datetime | \n",
" passenger_count | \n",
" trip_distance | \n",
" pickup_longitude | \n",
" pickup_latitude | \n",
" rate_code | \n",
" dropoff_longitude | \n",
" dropoff_latitude | \n",
" fare_amount | \n",
"
\n",
" \n",
" npartitions=110 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | \n",
" datetime64[ns] | \n",
" datetime64[ns] | \n",
" int32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" int32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
"
\n",
"
\n",
"Dask Name: query, 330 tasks
"
],
"text/plain": [
""
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"taxi_df"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pickup_datetime | \n",
" dropoff_datetime | \n",
" passenger_count | \n",
" trip_distance | \n",
" pickup_longitude | \n",
" pickup_latitude | \n",
" rate_code | \n",
" dropoff_longitude | \n",
" dropoff_latitude | \n",
" fare_amount | \n",
" ... | \n",
" year | \n",
" month | \n",
" day | \n",
" pickup_latitude_r | \n",
" pickup_longitude_r | \n",
" dropoff_latitude_r | \n",
" dropoff_longitude_r | \n",
" h_distance | \n",
" day_of_week | \n",
" is_weekend | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2014-01-09 20:45:25 | \n",
" 2014-01-09 20:52:31 | \n",
" 1 | \n",
" 0.7 | \n",
" -73.994766 | \n",
" 40.736828 | \n",
" 1 | \n",
" -73.982224 | \n",
" 40.731789 | \n",
" 6.5 | \n",
" ... | \n",
" 2014 | \n",
" 1 | \n",
" 9 | \n",
" 40.730000 | \n",
" -74.000000 | \n",
" 40.730000 | \n",
" -73.989998 | \n",
" 1.196175 | \n",
" 4.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 2014-01-09 20:46:12 | \n",
" 2014-01-09 20:55:12 | \n",
" 1 | \n",
" 1.4 | \n",
" -73.982391 | \n",
" 40.773380 | \n",
" 1 | \n",
" -73.960449 | \n",
" 40.763996 | \n",
" 8.5 | \n",
" ... | \n",
" 2014 | \n",
" 1 | \n",
" 9 | \n",
" 40.770000 | \n",
" -73.989998 | \n",
" 40.759998 | \n",
" -73.970001 | \n",
" 2.122098 | \n",
" 4.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 2014-01-09 20:44:47 | \n",
" 2014-01-09 20:59:46 | \n",
" 2 | \n",
" 2.3 | \n",
" -73.988571 | \n",
" 40.739407 | \n",
" 1 | \n",
" -73.986626 | \n",
" 40.765217 | \n",
" 11.5 | \n",
" ... | \n",
" 2014 | \n",
" 1 | \n",
" 9 | \n",
" 40.730000 | \n",
" -73.989998 | \n",
" 40.759998 | \n",
" -73.989998 | \n",
" 2.874643 | \n",
" 4.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 2014-01-09 20:44:57 | \n",
" 2014-01-09 20:51:40 | \n",
" 1 | \n",
" 1.7 | \n",
" -73.960213 | \n",
" 40.770466 | \n",
" 1 | \n",
" -73.979866 | \n",
" 40.777050 | \n",
" 7.5 | \n",
" ... | \n",
" 2014 | \n",
" 1 | \n",
" 9 | \n",
" 40.770000 | \n",
" -73.970001 | \n",
" 40.770000 | \n",
" -73.979996 | \n",
" 1.809662 | \n",
" 4.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 2014-01-09 20:47:09 | \n",
" 2014-01-09 20:53:32 | \n",
" 1 | \n",
" 0.9 | \n",
" -73.995369 | \n",
" 40.717247 | \n",
" 1 | \n",
" -73.984367 | \n",
" 40.720524 | \n",
" 6.0 | \n",
" ... | \n",
" 2014 | \n",
" 1 | \n",
" 9 | \n",
" 40.709999 | \n",
" -74.000000 | \n",
" 40.719997 | \n",
" -73.989998 | \n",
" 0.996204 | \n",
" 4.0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 21 columns
\n",
"
"
],
"text/plain": [
" pickup_datetime dropoff_datetime passenger_count trip_distance \\\n",
"0 2014-01-09 20:45:25 2014-01-09 20:52:31 1 0.7 \n",
"1 2014-01-09 20:46:12 2014-01-09 20:55:12 1 1.4 \n",
"2 2014-01-09 20:44:47 2014-01-09 20:59:46 2 2.3 \n",
"3 2014-01-09 20:44:57 2014-01-09 20:51:40 1 1.7 \n",
"4 2014-01-09 20:47:09 2014-01-09 20:53:32 1 0.9 \n",
"\n",
" pickup_longitude pickup_latitude rate_code dropoff_longitude \\\n",
"0 -73.994766 40.736828 1 -73.982224 \n",
"1 -73.982391 40.773380 1 -73.960449 \n",
"2 -73.988571 40.739407 1 -73.986626 \n",
"3 -73.960213 40.770466 1 -73.979866 \n",
"4 -73.995369 40.717247 1 -73.984367 \n",
"\n",
" dropoff_latitude fare_amount ... year month day pickup_latitude_r \\\n",
"0 40.731789 6.5 ... 2014 1 9 40.730000 \n",
"1 40.763996 8.5 ... 2014 1 9 40.770000 \n",
"2 40.765217 11.5 ... 2014 1 9 40.730000 \n",
"3 40.777050 7.5 ... 2014 1 9 40.770000 \n",
"4 40.720524 6.0 ... 2014 1 9 40.709999 \n",
"\n",
" pickup_longitude_r dropoff_latitude_r dropoff_longitude_r h_distance \\\n",
"0 -74.000000 40.730000 -73.989998 1.196175 \n",
"1 -73.989998 40.759998 -73.970001 2.122098 \n",
"2 -73.989998 40.759998 -73.989998 2.874643 \n",
"3 -73.970001 40.770000 -73.979996 1.809662 \n",
"4 -74.000000 40.719997 -73.989998 0.996204 \n",
"\n",
" day_of_week is_weekend \n",
"0 4.0 0 \n",
"1 4.0 0 \n",
"2 4.0 0 \n",
"3 4.0 0 \n",
"4 4.0 0 \n",
"\n",
"[5 rows x 21 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# actually add the features\n",
"taxi_df = taxi_df.map_partitions(add_features, meta=all_features)\n",
"# inspect the result\n",
"taxi_df.head().to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pick a Training Set\n",
"\n",
"Let's imagine you're making a trip to New York on the 25th and want to build a model to predict what fare prices will be like the last few days of the month based on the first part of the month. We'll use a query expression to identify the `day` of the month to use to divide the data into train and test sets.\n",
"\n",
"The wall-time below represents how long it takes your GPU cluster to load data from the Google Cloud Storage bucket and the ETL portion of the workflow."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 7.91 s, sys: 630 ms, total: 8.54 s\n",
"Wall time: 3min 33s\n"
]
}
],
"source": [
"%%time\n",
"X_train = taxi_df.query('day < 25').persist()\n",
"\n",
"# create a Y_train ddf with just the target variable\n",
"Y_train = X_train[['fare_amount']].persist()\n",
"# drop the target variable from the training ddf\n",
"X_train = X_train[X_train.columns.difference(['fare_amount'])]\n",
"\n",
"# this wont return until all data is in GPU memory\n",
"done = wait([X_train, Y_train])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"X_train = X_train.drop('pickup_datetime', axis=1)\n",
"X_train = X_train.drop('dropoff_datetime', axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Dask DataFrame Structure:
\n",
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" day | \n",
" day_of_week | \n",
" dropoff_latitude | \n",
" dropoff_latitude_r | \n",
" dropoff_longitude | \n",
" dropoff_longitude_r | \n",
" h_distance | \n",
" hour | \n",
" is_weekend | \n",
" month | \n",
" passenger_count | \n",
" pickup_latitude | \n",
" pickup_latitude_r | \n",
" pickup_longitude | \n",
" pickup_longitude_r | \n",
" rate_code | \n",
" trip_distance | \n",
" year | \n",
"
\n",
" \n",
" npartitions=110 | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" int32 | \n",
" int32 | \n",
" int32 | \n",
" int32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" float32 | \n",
" int32 | \n",
" float32 | \n",
" int32 | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
"
\n",
"
\n",
"Dask Name: drop_by_shallow_copy, 440 tasks
"
],
"text/plain": [
""
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train the XGBoost Regression Model\n",
"\n",
"The wall time output below indicates how long it took your GPU cluster to train an XGBoost model over the training set."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.14 s, sys: 80.8 ms, total: 1.22 s\n",
"Wall time: 19.9 s\n"
]
}
],
"source": [
"%%time\n",
"dtrain = xgb.dask.DaskDMatrix(client, X_train, Y_train)\n",
"# Use train method from xgboost.dask instead of xgboost. This\n",
"# distributed version of train returns a dictionary containing the\n",
"# resulting booster and evaluation history obtained from\n",
"# evaluation metrics.\n",
"model_output = xgb.dask.train(client,\n",
" # Use GPU training algorithm\n",
" {'tree_method': 'gpu_hist'},\n",
" dtrain,\n",
" num_boost_round=100,\n",
" evals=[(dtrain, 'train')])\n",
"booster = model_output['booster'] # booster is the trained model\n",
"history = model_output['history'] # A dictionary containing evaluation results"
]
},
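{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to reuse the booster outside this notebook (for example, behind a SageMaker endpoint), you can persist it with XGBoost's standard serialization; a small sketch (the file name is arbitrary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# persist the trained booster to local disk;\n",
"# xgb.Booster(model_file='xgboost_nyc_taxi_fare.model') loads it back\n",
"booster.save_model('xgboost_nyc_taxi_fare.model')"
]
},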
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How Good is Our Model?\n",
"\n",
"Now that we have a trained model, we need to test it with the 25% of records we held out.\n",
"\n",
"Based on the filtering conditions applied to this dataset, many of the DataFrame partitions will wind up having 0 rows.\n",
"\n",
"This is a problem for XGBoost which doesn't know what to do with 0 length arrays. We'll apply a bit of Dask logic to check for and drop partitions without any rows."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"def drop_empty_partitions(df):\n",
" lengths = df.map_partitions(len).compute()\n",
" nonempty = [length > 0 for length in lengths]\n",
" return df.partitions[nonempty]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"31739751"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_test = taxi_df.query('day >= 25').persist()\n",
"X_test = drop_empty_partitions(X_test)\n",
"\n",
"# Create Y_test with just the fare amount\n",
"Y_test = X_test[['fare_amount']]\n",
"\n",
"# Drop the fare amount from X_test\n",
"X_test = X_test[X_test.columns.difference(['fare_amount'])]\n",
"\n",
"# display test set size\n",
"len(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"X_test = X_test.drop('pickup_datetime', axis=1)\n",
"X_test = X_test.drop('dropoff_datetime', axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Evaluation history: {'train': {'rmse': [11.327804, 8.172996, 6.041537, 4.641979, 3.757672, 3.224258, 2.911278, 2.732369, 2.630654, 2.569407, 2.531624, 2.502579, 2.486299, 2.471962, 2.460887, 2.449411, 2.437919, 2.429802, 2.42383, 2.41742, 2.412308, 2.407344, 2.403826, 2.400395, 2.395525, 2.393419, 2.388041, 2.384101, 2.380613, 2.378282, 2.371596, 2.367515, 2.364666, 2.357522, 2.355485, 2.352574, 2.349749, 2.346168, 2.343192, 2.341385, 2.338813, 2.337814, 2.336341, 2.333639, 2.331098, 2.327207, 2.326241, 2.324432, 2.320762, 2.318273, 2.315912, 2.314492, 2.313371, 2.31209, 2.311155, 2.309329, 2.306317, 2.305289, 2.303312, 2.302437, 2.30142, 2.298965, 2.297313, 2.296462, 2.294187, 2.292799, 2.291584, 2.29013, 2.289528, 2.288281, 2.287606, 2.286923, 2.285734, 2.284415, 2.283031, 2.282219, 2.281136, 2.279738, 2.278096, 2.277354, 2.276891, 2.276182, 2.274509, 2.273721, 2.272978, 2.272003, 2.27018, 2.269635, 2.268117, 2.267274, 2.266372, 2.264311, 2.263582, 2.262124, 2.261794, 2.260962, 2.259462, 2.258576, 2.257909, 2.257146]}}\n"
]
}
],
"source": [
"prediction = xgb.dask.predict(client, booster, X_test)\n",
"prediction = prediction.compute()\n",
"print('Evaluation history:', history)"
]
},
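{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training history above only tells us about the fit on the training set. Below is a minimal sketch for scoring the held-out test set; it assumes `prediction` and `Y_test` from the cells above, and depending on your XGBoost/cuDF versions `prediction` may come back as a CuPy array, in which case coerce it with `cupy.asnumpy` instead of `np.asarray`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# compare predictions against the held-out fares with a simple RMSE\n",
"y_test = Y_test['fare_amount'].compute().to_pandas().values\n",
"y_pred = np.asarray(prediction)\n",
"rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))\n",
"print('Test RMSE:', rmse)"
]
},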
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_rapidsai",
"language": "python",
"name": "conda_rapidsai"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}