{
"cells": [
{
"cell_type": "markdown",
"id": "32f98d3f",
"metadata": {},
"source": [
"# Demand Prediction for Dockless Vehicles using Amazon SageMaker and Amazon Athena\n",
"\n",
"We demonstrate how to build a machine learning model for short-term demand prediction of dockless vehicles in different neighborhoods of Louisville, KY. The project was inspired by several Kaggle competitions and publications.\n",
"\n",
"The goal is to develop a demand model that predicts the number of vehicles needed within the next hour for each neighborhood.\n",
"\n",
"The data set is provided by the Office of Advanced Planning of Louisville, Kentucky.\n",
"It contains trip data for individual vehicles with start and end locations. Start and end times are truncated to 15-minute intervals; we aggregate them to full hours.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a4b5b53d",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 2,
"id": "27509a1b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Role: arn:aws:iam::969171869770:role/service-role/AmazonSageMaker-ExecutionRole-20210428T175911\n",
"ml_train_test_data.csv\n"
]
}
],
"source": [
"import boto3\n",
"import re\n",
"import s3fs\n",
"import pandas as pd\n",
"\n",
"import numpy as np\n",
"import datetime\n",
"\n",
"\n",
"import sagemaker\n",
"ROLE = sagemaker.get_execution_role()\n",
"print(f\"Role: {ROLE}\")\n",
"\n",
"REGION = boto3.Session().region_name\n",
"\n",
"S3BUCKET = f'athena-blog-1'\n",
"# S3PREFIX = 'taxi_data_raw/athena_data'\n",
"S3PREFIX = 'dockless_vehicles/ml'\n",
"\n",
"DATAFILE = 'ml_train_test_data.csv'\n",
"print(DATAFILE)\n",
"\n",
"def printShape(df):\n",
" print(f\"Number of records: {df.shape[0]:,}, number of columns: {df.shape[1]}\")\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "4fe7c771",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of records: 895,413, number of columns: 17\n"
]
}
],
"source": [
"df = pd.read_csv(DATAFILE)\n",
"df.dropna(inplace=True)\n",
"printShape(df)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "8b273097",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" date nbid hour dow n_start n_end n_start_1 n_start_2 \\\n",
"4 2018-08-09 35 1 0 0.0 0.0 0.0 0.0 \n",
"5 2018-08-09 35 23 0 0.0 0.0 0.0 0.0 \n",
"6 2018-08-09 35 6 0 0.0 0.0 0.0 0.0 \n",
"7 2018-08-09 35 2 0 0.0 0.0 0.0 0.0 \n",
"8 2018-08-09 35 11 0 0.0 0.0 0.0 0.0 \n",
"\n",
" n_start_3 n_start_4 n_end_1 n_end_2 n_end_3 n_end_4 y_start_1 \\\n",
"4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"\n",
" y_start_2 y_start_3 \n",
"4 0.0 0.0 \n",
"5 0.0 0.0 \n",
"6 0.0 0.0 \n",
"7 0.0 0.0 \n",
"8 0.0 0.0 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "742627a4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['date', 'nbid', 'hour', 'dow', 'n_start', 'n_end', 'n_start_1',\n",
" 'n_start_2', 'n_start_3', 'n_start_4', 'n_end_1', 'n_end_2', 'n_end_3',\n",
" 'n_end_4', 'y_start_1', 'y_start_2', 'y_start_3'],\n",
" dtype='object')"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns"
]
},
{
"cell_type": "markdown",
"id": "5dff857c",
"metadata": {},
"source": [
"Separate the all-zero rows and keep only a random sub-sample of them to balance the data set"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "e338feae",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of records: 10,000, number of columns: 17\n",
"Number of records: 304,619, number of columns: 17\n",
"Number of records: 314,619, number of columns: 17\n"
]
}
],
"source": [
"num_cols = ['n_start', 'n_end','n_start_1', 'n_start_2', 'n_start_3', 'n_start_4', 'n_end_1', 'n_end_2', 'n_end_3', 'n_end_4', 'y_start_1', 'y_start_2', 'y_start_3']\n",
"mask = df[num_cols].values.sum(axis=1)>0\n",
"mask.shape, mask.sum()\n",
"\n",
"df_zero = df[~mask].sample(10000)\n",
"printShape(df_zero)\n",
"df_nonzero = df[mask]\n",
"printShape(df_nonzero)\n",
"df2 = pd.concat([df_zero, df_nonzero])\n",
"printShape(df2)"
]
},
{
"cell_type": "markdown",
"id": "30a3a42c",
"metadata": {},
"source": [
"# Processing Data"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "24058744",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer, KBinsDiscretizer\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"from sklearn.compose import make_column_transformer\n",
"from sklearn.exceptions import DataConversionWarning"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "83518370",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" y_start_1 nbid hour dow n_start_1 n_start_2 n_start_3 \\\n",
"895367 0.0 9 17 4 0.0 0.0 0.0 \n",
"895368 0.0 9 11 0 1.0 0.0 0.0 \n",
"895369 0.0 9 9 0 0.0 1.0 0.0 \n",
"895370 0.0 9 18 0 0.0 0.0 1.0 \n",
"895371 0.0 9 5 0 0.0 0.0 0.0 \n",
"\n",
" n_start_4 n_end_1 n_end_2 n_end_3 n_end_4 \n",
"895367 0.0 0.0 0.0 0.0 0.0 \n",
"895368 0.0 1.0 0.0 0.0 0.0 \n",
"895369 0.0 0.0 1.0 0.0 0.0 \n",
"895370 0.0 0.0 0.0 1.0 0.0 \n",
"895371 1.0 0.0 0.0 0.0 1.0 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of rows: 314,619\n"
]
}
],
"source": [
"features = ['y_start_1', 'nbid', 'hour', 'dow', 'n_start_1', 'n_start_2', 'n_start_3', 'n_start_4',\n",
" 'n_end_1', 'n_end_2', 'n_end_3', 'n_end_4' ]\n",
"display(df2[features].tail())\n",
" \n",
"### Save the data to S3 bucket\n",
"input_data = df2[features].dropna().copy()\n",
"print(f\"Number of rows: {input_data.shape[0]:,}\")\n",
"\n",
"dataInPath = f's3://{S3BUCKET}/{S3PREFIX}/data/data_in.csv'\n",
"input_data.to_csv(dataInPath, sep = ',', index = False)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "c57629f2",
"metadata": {},
"outputs": [],
"source": [
"### Import required libraries\n",
" \n",
"import boto3\n",
"import sagemaker\n",
"from sagemaker import get_execution_role\n",
"from sagemaker.sklearn.processing import SKLearnProcessor \n",
" \n",
"### Create an scikit-learn processor\n",
" \n",
"sklearn_processor = SKLearnProcessor(\n",
" framework_version = '0.20.0', # specify the scikit-learn version that we want to use - the image will be this version\n",
" role = ROLE, # insert the role\n",
" instance_type = 'ml.m5.xlarge', # this is the \"normal\" instance type\n",
" instance_count = 1, # we only use 1 instance\n",
" base_job_name = f'dockless-processing', # need to follow ADA Datalab guideline \n",
")"
]
},
{
"cell_type": "markdown",
"id": "f282fd7e",
"metadata": {},
"source": [
"## Process script"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "0b6731ec",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Overwriting preprocessing.py\n"
]
}
],
"source": [
"%%writefile preprocessing.py\n",
" \n",
"import argparse\n",
"import os\n",
"import warnings\n",
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer, KBinsDiscretizer\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"from sklearn.compose import make_column_transformer\n",
"from sklearn.exceptions import DataConversionWarning\n",
"warnings.filterwarnings(action='ignore', category=DataConversionWarning)\n",
" \n",
"def print_shape(df):\n",
" pass\n",
" \n",
"if __name__=='__main__':\n",
" \n",
" ### Input path: SageMaker Processing copies the S3 input to /opt/ml/processing/input inside the container\n",
" \n",
" input_data_path = os.path.join('/opt/ml/processing/input', 'data_in.csv')\n",
" \n",
" ### Reading + basic data handling\n",
" \n",
" print('Reading input data from {}'.format(input_data_path))\n",
" data_raw = pd.read_csv(input_data_path) #### , parse_dates=['t_hour'])\n",
"# features = ['y_start_1', 'location_id', 'hr', 'dayofweek', 'n_start_1', 'n_start_2', 'n_start_3', 'n_start_4',\n",
"# 'n_end_1', 'n_end_2', 'n_end_3', 'n_end_4' ]\n",
" features = list(data_raw.columns)\n",
" label_col = features[0]\n",
" data = data_raw[features]\n",
" # data = pd.DataFrame(data = data)\n",
" \n",
" ### Transform response variable\n",
" \n",
" \n",
" \n",
" ### Random split into three sets: training (80%), validation (10%) and test for batch inference (10%)\n",
" \n",
" rand_split = np.random.rand(len(data))\n",
" train_list = rand_split < 0.8\n",
" val_list = (rand_split >= 0.8) & (rand_split < 0.9)\n",
" test_list = rand_split >= 0.9\n",
" \n",
" ### Split data to train, validation and test\n",
" \n",
" data_train = data[train_list]\n",
" data_val = data[val_list]\n",
" data_test = data[test_list]\n",
" test_class = data_test[label_col]\n",
" data_test = data_test.drop([label_col], axis = 1)\n",
" \n",
" ### Output paths: files written under these directories are uploaded to S3 by the ProcessingOutput definitions\n",
" \n",
" train_features_output_path = os.path.join('/opt/ml/processing/train', 'train_data.csv')\n",
" validation_features_output_path = os.path.join('/opt/ml/processing/validation', 'validation_data.csv')\n",
" test_features_output_path = os.path.join('/opt/ml/processing/test', 'test_data.csv')\n",
" test_classes_output_path = os.path.join('/opt/ml/processing/test', 'test_class.csv')\n",
" \n",
" ### Save files to output destinations\n",
" \n",
" pd.DataFrame(data_train).to_csv(train_features_output_path, header = False, index = False)\n",
" pd.DataFrame(data_val).to_csv(validation_features_output_path, header = False, index = False)\n",
" pd.DataFrame(data_test).to_csv(test_features_output_path, header = False, index = False)\n",
" pd.DataFrame(test_class).to_csv(test_classes_output_path, header = False, index = False)"
]
},
{
"cell_type": "markdown",
"id": "f968c1cd",
"metadata": {},
"source": [
"## Upload and Run"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "78af9c0b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"s3://athena-blog-1/dockless_vehicles/ml/scripts/preprocessing.py\n",
"\n",
"Job Name: dockless-processing-2021-05-13-09-31-44-921\n",
"Inputs: [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://athena-blog-1/dockless_vehicles/ml/data/data_in.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://athena-blog-1/dockless_vehicles/ml/scripts/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]\n",
"Outputs: [{'OutputName': 'train_data', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://athena-blog-1/dockless_vehicles/ml/train/', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'validation_data', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://athena-blog-1/dockless_vehicles/ml/validation/', 'LocalPath': '/opt/ml/processing/validation', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'test_data', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://athena-blog-1/dockless_vehicles/ml/test/', 'LocalPath': '/opt/ml/processing/test', 'S3UploadMode': 'EndOfJob'}}]\n",
".......................\u001b[34m/miniconda3/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses\n",
" import imp\u001b[0m\n",
"\u001b[34mReading input data from /opt/ml/processing/input/data_in.csv\u001b[0m\n",
"\n"
]
}
],
"source": [
"### Copy the preprocessing code over to the s3 bucket\n",
" \n",
"sess = sagemaker.Session()\n",
"codeprefix = S3PREFIX + '/scripts'\n",
"codeupload = sess.upload_data('preprocessing.py', bucket = S3BUCKET, key_prefix = codeprefix)\n",
"print(codeupload)\n",
" \n",
"### Import ProcessingInput and ProcessingOutput function\n",
"from sagemaker.processing import ProcessingInput, ProcessingOutput\n",
" \n",
"### Set our data destination path in S3\n",
"dataDestination = 's3://' + S3BUCKET + '/' + S3PREFIX\n",
" \n",
"### Run the processing job\n",
"sklearn_processor.run(\n",
" code = codeupload,\n",
" inputs = [ProcessingInput(source = dataInPath, destination = '/opt/ml/processing/input')],\n",
" outputs = [\n",
" ProcessingOutput(\n",
" output_name = 'train_data',\n",
" source = '/opt/ml/processing/train',\n",
" destination = dataDestination + '/train/'\n",
" ),\n",
" ProcessingOutput(\n",
" output_name = 'validation_data', \n",
" source = '/opt/ml/processing/validation',\n",
" destination = dataDestination + '/validation/'\n",
" ),\n",
" ProcessingOutput(\n",
" output_name = 'test_data',\n",
" source = '/opt/ml/processing/test',\n",
" destination = dataDestination + '/test/'\n",
" )\n",
" ]\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9a2a28d",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "72a6b6a7",
"metadata": {},
"source": [
"# Training Data"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "f45bc73a",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The method get_image_uri has been renamed in sagemaker>=2.\n",
"See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Using ML image '257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-xgboost:1.2-1' in region 'us-east-2'\n",
"training artifacts will be uploaded to: s3://athena-blog-1/dockless_vehicles/ml\n",
"InProgress\n",
"Training job ended with status: Completed\n"
]
}
],
"source": [
"### Get the container image URI for the built-in XGBoost algorithm\n",
" \n",
"from sagemaker import image_uris\n",
"\n",
"image = image_uris.retrieve('xgboost', REGION, version='1.2-1')\n",
"print(f\"Using ML image '{image}' in region '{REGION}'\")\n",
" \n",
"### This is the output location of the model\n",
" \n",
"output_location = 's3://{}/{}'.format(S3BUCKET, S3PREFIX)\n",
"print('training artifacts will be uploaded to: {}'.format(output_location))\n",
" \n",
"### Training Parameters\n",
"import datetime\n",
"ts = datetime.datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
"job_name = f\"dockless-training-{ts}\"\n",
"\n",
"create_training_params = {\n",
" \"AlgorithmSpecification\": {\n",
" \"TrainingImage\": image, ### This is the image of the XGBoost that we want to use as built-in algorithm directly\n",
" \"TrainingInputMode\": \"File\"\n",
" },\n",
" \"RoleArn\": ROLE, \n",
" \"OutputDataConfig\": {\n",
" \"S3OutputPath\": output_location ### This is the output location (with some addition on prefix, etc) where model.tar.gz will be stored\n",
" },\n",
" \"ResourceConfig\": {\n",
" \"InstanceCount\": 1, \n",
" \"InstanceType\": \"ml.m5.4xlarge\",\n",
" \"VolumeSizeInGB\": 50,\n",
" },\n",
" \"TrainingJobName\": job_name, ### This is the training job name that we have provided previously\n",
" \"HyperParameters\": {\n",
" ## \"objective\": \"binary:logistic\", ### Alternative for a binary (0/1) target; not used here\n",
" \"objective\": \"reg:squarederror\", ### Regression with squared-error loss, since the target is a trip count\n",
" \"max_depth\": \"5\",\n",
" \"eta\": \"0.2\",\n",
" \"gamma\": \"4\",\n",
" \"min_child_weight\": \"6\",\n",
" \"subsample\": \"0.8\",\n",
" # \"silent\": \"0\",\n",
" \"num_round\": \"500\"\n",
" },\n",
" \"StoppingCondition\": {\n",
" \"MaxRuntimeInSeconds\": 60 * 60\n",
" },\n",
" \"InputDataConfig\": [\n",
" {\n",
" \"ChannelName\": \"train\", ### This contains the channels that we want to use for training, here we are using train and validation channel\n",
" \"DataSource\": {\n",
" \"S3DataSource\": {\n",
" \"S3DataType\": \"S3Prefix\",\n",
" \"S3Uri\": 's3://{}/{}/train'.format(S3BUCKET, S3PREFIX),\n",
" \"S3DataDistributionType\": \"FullyReplicated\"\n",
" }\n",
" },\n",
" \"CompressionType\": \"None\",\n",
" \"RecordWrapperType\": \"None\",\n",
" \"ContentType\": \"text/csv\"\n",
" },\n",
" {\n",
" \"ChannelName\": \"validation\",\n",
" \"DataSource\": {\n",
" \"S3DataSource\": {\n",
" \"S3DataType\": \"S3Prefix\",\n",
" \"S3Uri\": 's3://{}/{}/validation'.format(S3BUCKET, S3PREFIX),\n",
" \"S3DataDistributionType\": \"FullyReplicated\"\n",
" }\n",
" },\n",
" \"CompressionType\": \"None\",\n",
" \"RecordWrapperType\": \"None\",\n",
" \"ContentType\": \"text/csv\"\n",
" }\n",
" ],\n",
"}\n",
" \n",
"sm_client = boto3.client('sagemaker')  # named sm_client so the sagemaker module is not shadowed\n",
" \n",
"### Create Training Job\n",
" \n",
"sm_client.create_training_job(**create_training_params)\n",
" \n",
"### Describe the status of training job\n",
" \n",
"status = sm_client.describe_training_job(TrainingJobName = job_name)['TrainingJobStatus']\n",
"print(status)\n",
" \n",
"try:\n",
"    sm_client.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName = job_name)\n",
"finally:\n",
"    status = sm_client.describe_training_job(TrainingJobName = job_name)['TrainingJobStatus']\n",
"    print(\"Training job ended with status: \" + status)\n",
"    if status == 'Failed':\n",
"        message = sm_client.describe_training_job(TrainingJobName = job_name)['FailureReason']\n",
"        print('Training failed with the following error: {}'.format(message))\n",
"        raise Exception('Training job failed')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bd89e269",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "416277d6",
"metadata": {},
"source": [
"# Create Model"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "4ff18a1f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Model Info\n",
"----------\n",
"Location of model parameters: s3://athena-blog-1/dockless_vehicles/ml/dockless-training-20210513-093745/output/model.tar.gz\n",
"Container image: 257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-xgboost:1.2-1\n",
"Training job: dockless-training-20210513-093745\n",
"\n",
"\n",
"arn:aws:sagemaker:us-east-2:969171869770:model/dockless-regression-model\n"
]
}
],
"source": [
"sm_client = boto3.client('sagemaker')  # named sm_client so the sagemaker module is not shadowed\n",
"info = sm_client.describe_training_job(TrainingJobName = job_name) # Full description of the training job\n",
"model_data = info['ModelArtifacts']['S3ModelArtifacts'] # S3 location of the trained model.tar.gz\n",
" \n",
"print(f\"\"\"\n",
"Model Info\n",
"----------\n",
"Location of model parameters: {model_data}\n",
"Container image: {image}\n",
"Training job: {job_name}\n",
"\n",
"\"\"\")\n",
"\n",
"primary_container = {\n",
"    'Image': image, # Inference image; here it is the same as the training image\n",
"    'ModelDataUrl': model_data # Model artifacts (model.tar.gz) produced by the training job\n",
"}\n",
" \n",
"create_model_response = sm_client.create_model(\n",
"    ModelName = 'dockless-regression-model', # Name under which the model is registered\n",
"    ExecutionRoleArn = ROLE, # Role returned by get_execution_role()\n",
"    PrimaryContainer = primary_container, # Container image plus model artifacts\n",
")\n",
" \n",
"print(create_model_response['ModelArn'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed88d904",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "93020190",
"metadata": {},
"source": [
"# Eval"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "ec1ff60c",
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"sagemaker_runtime = boto3.client('sagemaker-runtime')"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "3b025761",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" y_start_1 nbid hour dow n_start_1 n_start_2 n_start_3 n_start_4 \\\n",
"470 0.0 35 16 2 0.0 0.0 0.0 0.0 \n",
"471 0.0 35 11 0 0.0 0.0 0.0 0.0 \n",
"472 0.0 35 9 0 0.0 0.0 0.0 0.0 \n",
"473 0.0 35 7 0 0.0 0.0 0.0 0.0 \n",
"474 0.0 35 20 0 0.0 0.0 0.0 0.0 \n",
"\n",
" n_end_1 n_end_2 n_end_3 n_end_4 \n",
"470 0.0 0.0 0.0 0.0 \n",
"471 1.0 0.0 0.0 0.0 \n",
"472 0.0 1.0 0.0 0.0 \n",
"473 0.0 0.0 1.0 0.0 \n",
"474 0.0 0.0 0.0 1.0 "
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"features = ['y_start_1', 'nbid', 'hour', 'dow', 'n_start_1', 'n_start_2', 'n_start_3', 'n_start_4',\n",
" 'n_end_1', 'n_end_2', 'n_end_3', 'n_end_4' ]\n",
"df_nonzero[features].dropna().head()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "42ce4d34",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"114599\ty=0.0\ty_hat=0.03657257556915283\t23.0,14.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0\n",
"577313\ty=0.0\ty_hat=0.6612421274185181\t20.0,22.0,0.0,4.0,1.0,6.0,0.0,1.0,1.0,6.0,0.0\n",
"763932\ty=0.0\ty_hat=-0.006854534149169922\t49.0,6.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0\n",
"395856\ty=10.0\ty_hat=0.4053572416305542\t47.0,6.0,1.0,5.0,1.0,5.0,0.0,3.0,0.0,9.0,0.0\n",
"147445\ty=0.0\ty_hat=0.0036890506744384766\t3.0,15.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0\n",
"196262\ty=0.0\ty_hat=0.022508323192596436\t39.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0\n",
"419153\ty=0.0\ty_hat=0.0021598339080810547\t19.0,19.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0\n",
"876123\ty=0.0\ty_hat=0.05208203196525574\t72.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0\n",
"801412\ty=0.0\ty_hat=0.05565375089645386\t70.0,5.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0\n",
"327177\ty=0.0\ty_hat=0.024671852588653564\t63.0,16.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0\n",
"42257\ty=1.0\ty_hat=1.0007874965667725\t22.0,11.0,6.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,5.0\n",
"802636\ty=1.0\ty_hat=0.6643890738487244\t70.0,15.0,5.0,1.0,7.0,1.0,0.0,1.0,2.0,2.0,0.0\n",
"882518\ty=1.0\ty_hat=-0.0007305145263671875\t9.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0\n",
"413048\ty=0.0\ty_hat=0.0002645254135131836\t30.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0\n",
"620443\ty=6.0\ty_hat=0.0058509111404418945\t60.0,22.0,0.0,2.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0\n",
"154856\ty=1.0\ty_hat=0.0002645254135131836\t26.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0\n",
"706889\ty=0.0\ty_hat=0.011612534523010254\t58.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0\n",
"764850\ty=0.0\ty_hat=-0.006854534149169922\t49.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0\n",
"121845\ty=0.0\ty_hat=0.0021598339080810547\t26.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0\n",
"21848\ty=0.0\ty_hat=0.1311495304107666\t1.0,12.0,2.0,0.0,1.0,0.0,0.0,2.0,1.0,0.0,0.0\n"
]
}
],
"source": [
"\n",
"for i, r in df_nonzero[features].dropna().sample(20).iterrows():\n",
" b = ','.join([ str(r[c]) for c in features[1:] ])\n",
" y = r['y_start_1']\n",
" res = sagemaker_runtime.invoke_endpoint(\n",
" EndpointName='dockless-regression-ep',\n",
" Body = b, ###. '79.0,2.0,3.0,727.0,721.0,1.0,3.0,602.0,421.0,1.0,1.0',\n",
" ContentType = 'text/csv'\n",
" )\n",
" y_hat = float(res['Body'].read())\n",
" print(f\"{i}\\ty={y}\\ty_hat={y_hat}\\t{b}\")"
]
},
{
"cell_type": "code",
"execution_count": 113,
"id": "30229d28",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(16928, 18)"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_nonzero.shape"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "bdbbb46e",
"metadata": {},
"outputs": [],
"source": [
"def predict_row(r):\n",
" features = ['y_start_1', 'nbid', 'hour', 'dow', 'n_start_1', 'n_start_2', 'n_start_3', 'n_start_4',\n",
" 'n_end_1', 'n_end_2', 'n_end_3', 'n_end_4' ] \n",
" \n",
" b = ','.join([ str(r[c]) for c in features[1:] ])\n",
" y = r['y_start_1']\n",
" res = sagemaker_runtime.invoke_endpoint(\n",
" EndpointName='dockless-regression-ep',\n",
" Body = b, ###. '79.0,2.0,3.0,727.0,721.0,1.0,3.0,602.0,421.0,1.0,1.0',\n",
" ContentType = 'text/csv'\n",
" )\n",
" y_hat = float(res['Body'].read())\n",
" return y_hat\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 66,
"id": "7bbe12db",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"df_hat = df_nonzero.sample(10000).copy()\n",
"df_hat['y_hat'] = df_hat.apply(predict_row, axis=1)\n",
"df_hat['y_predict'] = df_hat.y_hat.round()\n",
"df_hat.rename({'y_start_1': 'y_actual'}, axis=1, inplace=True)\n",
"printShape(df_hat)\n",
"\n",
"\n",
"confusion_matrix = pd.pivot_table(df_hat, index='y_actual', columns='y_predict', values='date', aggfunc='count', fill_value=0)\n",
"printShape(confusion_matrix)"
]
},
{
"cell_type": "code",
"execution_count": 72,
"id": "63a209c8",
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"def plot_confusion_matrix(subdf, label=''):\n",
"    # Count rows per (actual, predicted) pair\n",
"    confusion_matrix = pd.pivot_table(subdf, index='y_actual', columns='y_predict', values='date', aggfunc='count', fill_value=0)\n",
"    print(f\"Confusion Matrix: {label}\")\n",
"    printShape(confusion_matrix)\n",
"    # Compute the MAE on the subset passed in, not on the global df_hat\n",
"    mae = np.abs(subdf.y_actual - subdf.y_predict).mean()\n",
"    print(f\"MAE: {mae}\")\n",
"    plt.figure(figsize=(12,12))\n",
"    # Log scale so that rare high-count cells remain visible\n",
"    plt.imshow(np.log1p(confusion_matrix))\n",
"    plt.colorbar()\n",
"    plt.ylabel('Actual')\n",
"    plt.xlabel('Predict')\n",
"    plt.show()\n",
"    plt.close()\n",
"    return confusion_matrix"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0c6b49b3",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 74,
"id": "7d5646f1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Confusion Matrix: \n",
"Number of records: 80, number of columns: 73\n",
"MAE: 1.0903\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"''"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"plot_confusion_matrix(df_hat)\n",
";\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5041c7d2",
"metadata": {},
"outputs": [],
"source": [
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f535f91",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "a5181e5e",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 79,
"id": "1487c96a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Confusion Matrix: \n",
"Number of records: 71, number of columns: 70\n",
"MAE: 1.0903\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"''"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"plot_confusion_matrix(df_hat[df_hat.nbid==17])\n",
";\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ac433967",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "b580ca44",
"metadata": {},
"source": [
"former value: 0.19214895122203912"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5449b33c",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "95aa2d46",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "15bbfe6c",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
},
"toc-autonumbering": true
},
"nbformat": 4,
"nbformat_minor": 5
}