{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Redshift ML BYOM Remote Inference using Amazon SageMaker Random Cut Forests\n",
"\n",
"_**Run Predictions from your Amazon Redshift cluster on a model trained and deployed on Amazon SageMaker**_\n",
"\n",
"---\n",
"\n",
"---\n",
"## Contents\n",
"1. [Introduction](#Introduction)\n",
"2. [Setup Parameters](#Setup-Parameters)\n",
"3. [Training](#Training) \n",
"4. [Inference](#Inference)\n",
"5. [Redshift ML BYOM Remote Inference](#Redshift-ML-BYOM-Remote-Inference) \n",
"6. [Conclusion](#Conclusion)\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n",
"***\n",
"\n",
"Amazon SageMaker Random Cut Forest (RCF) is an algorithm designed to detect anomalous data points within a dataset. Examples of when anomalies are important to detect include when website activity uncharacteristically spikes, when temperature data diverges from periodic behavior, or when changes to public transit ridership reflect the occurrence of a special event.\n",
"\n",
"In this notebook, we use the SageMaker RCF algorithm to train an RCF model on the Numenta Anomaly Benchmark (NAB) NYC Taxi dataset, which records New York City taxi ridership over the course of six months. We then use this model to predict anomalous events by emitting an \"anomaly score\" for each data point. The main goals of this notebook are:\n",
"\n",
"* to learn how to obtain, transform, and store data for use in Amazon SageMaker;\n",
"* to create an Amazon SageMaker training job on a dataset to produce an RCF model;\n",
"* to use the RCF model to perform inference with an Amazon SageMaker endpoint.\n",
"\n",
"The following are ***not*** goals of this notebook:\n",
"\n",
"* to deeply understand the RCF model;\n",
"* to understand how the Amazon SageMaker RCF algorithm works.\n",
"\n",
"If you would like to know more, please check out the [SageMaker RCF Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setup Parameters\n",
"\n",
"***\n",
"Please set the input parameters below:\n",
"\n",
"1. REDSHIFT_IAM_ROLE: The ARN of the IAM role attached to the Redshift cluster.\n",
"2. REDSHIFT_USER: The database user that runs SQL commands.\n",
"3. REDSHIFT_ENDPOINT: The Redshift cluster endpoint.\n",
"4. SAGEMAKER_S3_BUCKET: The S3 bucket that stores training input and output."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Replace the placeholder values below with your own settings.\n",
"REDSHIFT_ENDPOINT = \"redshift-cluster.xxxxxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev\"\n",
"REDSHIFT_USER = \"awsuser\"\n",
"REDSHIFT_IAM_ROLE = \"your-amazon-redshift-sagemaker-iam-role-arn\"\n",
"SAGEMAKER_S3_BUCKET = \"your-s3-bucket\"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"isConfigCell": true,
"tags": [
"parameters"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloaded training data will be read from s3://sagemaker-sample-files/datasets/tabular/anomaly_benchmark_taxi\n"
]
}
],
"source": [
"import boto3\n",
"import botocore\n",
"import sagemaker\n",
"import sys\n",
"\n",
"\n",
"bucket = SAGEMAKER_S3_BUCKET\n",
"prefix = \"sagemaker/rcf-benchmarks\"\n",
"execution_role = sagemaker.get_execution_role()\n",
"region = boto3.Session().region_name\n",
"\n",
"# S3 bucket where the original data is downloaded and stored.\n",
"downloaded_data_bucket = \"sagemaker-sample-files\"\n",
"downloaded_data_prefix = \"datasets/tabular/anomaly_benchmark_taxi\"\n",
"\n",
"\n",
"def check_bucket_permission(bucket):\n",
"    # Return True if the bucket exists and we have permission to access it.\n",
"    permission = False\n",
"    try:\n",
"        boto3.Session().client(\"s3\").head_bucket(Bucket=bucket)\n",
"    except botocore.exceptions.ParamValidationError:\n",
"        print(\n",
"            \"Hey! You either forgot to specify your S3 bucket\"\n",
"            \" or you gave your bucket an invalid name!\"\n",
"        )\n",
"    except botocore.exceptions.ClientError as e:\n",
"        if e.response[\"Error\"][\"Code\"] == \"403\":\n",
"            print(f\"Hey! You don't have permission to access the bucket, {bucket}.\")\n",
"        elif e.response[\"Error\"][\"Code\"] == \"404\":\n",
"            print(f\"Hey! Your bucket, {bucket}, doesn't exist!\")\n",
"        else:\n",
"            raise\n",
"    else:\n",
"        permission = True\n",
"    return permission\n",
"\n",
"\n",
"\n",
"if check_bucket_permission(downloaded_data_bucket):\n",
"    print(\n",
"        f\"Downloaded training data will be read from s3://{downloaded_data_bucket}/{downloaded_data_prefix}\"\n",
"    )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Obtain and Inspect Example Data\n",
"\n",
"\n",
"Our data comes from the Numenta Anomaly Benchmark (NAB) NYC Taxi dataset [[1](https://github.com/numenta/NAB/blob/master/data/realKnownCause/nyc_taxi.csv)]. We downloaded the data from [here](https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/nyc_taxi.csv) and stored it in an S3 bucket. The data consists of the number of New York City taxi passengers over the course of six months, aggregated into 30-minute buckets. We know, a priori, that anomalous events occur during the NYC marathon, Thanksgiving, Christmas, New Year's Day, and on the day of a snowstorm.\n",
"\n",
"> [1] https://github.com/numenta/NAB/blob/master/data/realKnownCause/nyc_taxi.csv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 67.8 ms, sys: 16.3 ms, total: 84.1 ms\n",
"Wall time: 250 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"import pandas as pd\n",
"\n",
"data_filename = \"NAB_nyc_taxi.csv\"\n",
"s3 = boto3.client(\"s3\")\n",
"s3.download_file(downloaded_data_bucket, f\"{downloaded_data_prefix}/{data_filename}\", data_filename)\n",
"taxi_data = pd.read_csv(data_filename, delimiter=\",\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before training any models, it is important to inspect our data. Perhaps there are underlying patterns or structures that we could provide as \"hints\" to the model, or maybe there is noise that we could pre-process away. The raw data looks like this:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"%matplotlib inline\n",
"\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"\n",
"matplotlib.rcParams[\"figure.dpi\"] = 100\n",
"\n",
"taxi_data.plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Human beings are extraordinarily good at perceiving patterns. Note, for example, that something uncharacteristic occurs at around data point number 6000. Additionally, as we might expect with taxi ridership, the passenger count appears more or less periodic. Let's zoom in, not only to examine this anomaly but also to get a better picture of what the \"normal\" data looks like."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"taxi_data[5500:6500].plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we see that the number of taxi trips taken is mostly periodic, with one mode of length approximately 50 data points. In fact, the mode has length 48: each data point represents a 30-minute bin of ridership count, so 48 data points span one day. We therefore also expect another mode of length $336 = 48 \\times 7$, the length of a week. Shorter-period patterns within a single day occur as well.\n",
"\n",
"For example, here is the data across the day containing the above anomaly:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" timestamp value\n",
"5952 2014-11-02 00:00:00 25110\n",
"5953 2014-11-02 00:30:00 23109\n",
"5954 2014-11-02 01:00:00 39197\n",
"5955 2014-11-02 01:30:00 35212\n",
"5956 2014-11-02 02:00:00 13259\n",
"5957 2014-11-02 02:30:00 12250\n",
"5958 2014-11-02 03:00:00 10013\n",
"5959 2014-11-02 03:30:00 7898\n",
"5960 2014-11-02 04:00:00 6375\n",
"5961 2014-11-02 04:30:00 4532\n",
"5962 2014-11-02 05:00:00 5116\n",
"5963 2014-11-02 05:30:00 5232\n",
"5964 2014-11-02 06:00:00 4542\n",
"5965 2014-11-02 06:30:00 5298\n",
"5966 2014-11-02 07:00:00 5155\n",
"5967 2014-11-02 07:30:00 6029\n",
"5968 2014-11-02 08:00:00 6280\n",
"5969 2014-11-02 08:30:00 8771\n",
"5970 2014-11-02 09:00:00 10151\n",
"5971 2014-11-02 09:30:00 12501\n",
"5972 2014-11-02 10:00:00 13990\n",
"5973 2014-11-02 10:30:00 16534\n",
"5974 2014-11-02 11:00:00 17133\n",
"5975 2014-11-02 11:30:00 18775\n",
"5976 2014-11-02 12:00:00 18985\n",
"5977 2014-11-02 12:30:00 19911\n",
"5978 2014-11-02 13:00:00 19123\n",
"5979 2014-11-02 13:30:00 19524\n",
"5980 2014-11-02 14:00:00 19640\n",
"5981 2014-11-02 14:30:00 18364\n",
"5982 2014-11-02 15:00:00 17940\n",
"5983 2014-11-02 15:30:00 17949\n",
"5984 2014-11-02 16:00:00 17288\n",
"5985 2014-11-02 16:30:00 16326\n",
"5986 2014-11-02 17:00:00 17522\n",
"5987 2014-11-02 17:30:00 19243\n",
"5988 2014-11-02 18:00:00 20291\n",
"5989 2014-11-02 18:30:00 21649\n",
"5990 2014-11-02 19:00:00 22839\n",
"5991 2014-11-02 19:30:00 21772\n",
"5992 2014-11-02 20:00:00 20994\n",
"5993 2014-11-02 20:30:00 19774\n",
"5994 2014-11-02 21:00:00 18398\n",
"5995 2014-11-02 21:30:00 17764\n",
"5996 2014-11-02 22:00:00 17334\n",
"5997 2014-11-02 22:30:00 15431\n",
"5998 2014-11-02 23:00:00 12958\n",
"5999 2014-11-02 23:30:00 10224"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"taxi_data[5952:6000]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Training\n",
"\n",
"***\n",
"\n",
"Next, we configure a SageMaker training job to train the Random Cut Forest (RCF) algorithm on the taxi cab data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Hyperparameters\n",
"\n",
"Particular to a SageMaker RCF training job are the following hyperparameters:\n",
"\n",
"* **`num_samples_per_tree`** - the number of randomly sampled data points sent to each tree. As a general rule, `1/num_samples_per_tree` should approximate the estimated ratio of anomalies to normal points in the dataset.\n",
"* **`num_trees`** - the number of trees to create in the forest. Each tree learns a separate model from different samples of data. The full forest model uses the mean of the anomaly scores predicted by its constituent trees.\n",
"* **`feature_dim`** - the dimension of each data point.\n",
"\n",
"In addition to these RCF model hyperparameters, we provide additional parameters defining things like the EC2 instance type on which training will run, the S3 bucket containing the data, and the AWS access role. Note that:\n",
"\n",
"* Recommended instance type: `ml.m4`, `ml.c4`, or `ml.c5`\n",
"* Current limitations:\n",
" * The RCF algorithm does not take advantage of GPU hardware."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.\n",
"Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"2021-09-03 15:41:29 Starting - Starting the training job...ProfilerReport-1630683689: InProgress\n",
"...\n",
"2021-09-03 15:42:27 Starting - Launching requested ML instances......\n",
"2021-09-03 15:43:29 Starting - Preparing the instances for training............\n",
"2021-09-03 15:45:30 Downloading - Downloading input data...\n",
"2021-09-03 15:45:54 Training - Downloading the training image...\n",
"2021-09-03 15:46:30 Uploading - Uploading generated training model\u001b[34mDocker entrypoint called with argument(s): train\u001b[0m\n",
"\u001b[34mRunning default environment configuration script\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-conf.json: {'num_samples_per_tree': 256, 'num_trees': 100, 'force_dense': 'true', 'eval_metrics': ['accuracy', 'precision_recall_fscore'], 'epochs': 1, 'mini_batch_size': 1000, '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': '', '_ftp_port': 8999}\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Merging with provided configuration from /opt/ml/input/config/hyperparameters.json: {'num_trees': '50', 'num_samples_per_tree': '512', 'feature_dim': '1', 'mini_batch_size': '1000'}\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Final configuration: {'num_samples_per_tree': '512', 'num_trees': '50', 'force_dense': 'true', 'eval_metrics': ['accuracy', 'precision_recall_fscore'], 'epochs': 1, 'mini_batch_size': '1000', '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': '', '_ftp_port': 8999, 'feature_dim': '1'}\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 WARNING 140518723704640] Loggers have already been setup.\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Launching parameter server for role scheduler\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] {'ENVROOT': '/opt/amazon', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION': 'cpp', 'HOSTNAME': 'ip-10-2-232-234.ec2.internal', 'TRAINING_JOB_NAME': 'randomcutforest-2021-09-03-15-41-29-072', 'NVIDIA_REQUIRE_CUDA': 'cuda>=9.0', 'TRAINING_JOB_ARN': 'arn:aws:sagemaker:us-east-1:845897987212:training-job/randomcutforest-2021-09-03-15-41-29-072', 'AWS_CONTAINER_CREDENTIALS_RELATIVE_URI': '/v2/credentials/3a9de4cc-b7da-4154-b32f-625999c2fe7d', 'CANONICAL_ENVROOT': '/opt/amazon', 'PYTHONUNBUFFERED': 'TRUE', 'NVIDIA_VISIBLE_DEVICES': 'void', 'LD_LIBRARY_PATH': '/opt/amazon/lib/python3.7/site-packages/cv2/../../../../lib:/usr/local/nvidia/lib64:/opt/amazon/lib', 'MXNET_KVSTORE_BIGARRAY_BOUND': '400000000', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'AWS_EXECUTION_ENV': 'AWS_ECS_EC2', 'PATH': '/opt/amazon/bin:/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/amazon/bin:/opt/amazon/bin', 'PWD': '/', 'LANG': 'en_US.utf8', 'AWS_REGION': 'us-east-1', 'SAGEMAKER_METRICS_DIRECTORY': '/opt/ml/output/metrics/sagemaker', 'HOME': '/root', 'SHLVL': '1', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION_VERSION': '2', 'OMP_NUM_THREADS': '2', 'DMLC_INTERFACE': 'eth0', 'ECS_CONTAINER_METADATA_URI': 'http://169.254.170.2/v3/e2b93386-53a1-46ec-8f1f-419bf568a86c', 'ECS_CONTAINER_METADATA_URI_V4': 'http://169.254.170.2/v4/e2b93386-53a1-46ec-8f1f-419bf568a86c', 'SAGEMAKER_HTTP_PORT': '8080', 'SAGEMAKER_DATA_PATH': '/opt/ml', 'KMP_DUPLICATE_LIB_OK': 'True', 'KMP_INIT_AT_FORK': 'FALSE'}\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] envs={'ENVROOT': '/opt/amazon', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION': 'cpp', 'HOSTNAME': 'ip-10-2-232-234.ec2.internal', 'TRAINING_JOB_NAME': 'randomcutforest-2021-09-03-15-41-29-072', 'NVIDIA_REQUIRE_CUDA': 'cuda>=9.0', 'TRAINING_JOB_ARN': 'arn:aws:sagemaker:us-east-1:845897987212:training-job/randomcutforest-2021-09-03-15-41-29-072', 'AWS_CONTAINER_CREDENTIALS_RELATIVE_URI': '/v2/credentials/3a9de4cc-b7da-4154-b32f-625999c2fe7d', 'CANONICAL_ENVROOT': '/opt/amazon', 'PYTHONUNBUFFERED': 'TRUE', 'NVIDIA_VISIBLE_DEVICES': 'void', 'LD_LIBRARY_PATH': '/opt/amazon/lib/python3.7/site-packages/cv2/../../../../lib:/usr/local/nvidia/lib64:/opt/amazon/lib', 'MXNET_KVSTORE_BIGARRAY_BOUND': '400000000', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'AWS_EXECUTION_ENV': 'AWS_ECS_EC2', 'PATH': '/opt/amazon/bin:/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/amazon/bin:/opt/amazon/bin', 'PWD': '/', 'LANG': 'en_US.utf8', 'AWS_REGION': 'us-east-1', 'SAGEMAKER_METRICS_DIRECTORY': '/opt/ml/output/metrics/sagemaker', 'HOME': '/root', 'SHLVL': '1', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION_VERSION': '2', 'OMP_NUM_THREADS': '2', 'DMLC_INTERFACE': 'eth0', 'ECS_CONTAINER_METADATA_URI': 'http://169.254.170.2/v3/e2b93386-53a1-46ec-8f1f-419bf568a86c', 'ECS_CONTAINER_METADATA_URI_V4': 'http://169.254.170.2/v4/e2b93386-53a1-46ec-8f1f-419bf568a86c', 'SAGEMAKER_HTTP_PORT': '8080', 'SAGEMAKER_DATA_PATH': '/opt/ml', 'KMP_DUPLICATE_LIB_OK': 'True', 'KMP_INIT_AT_FORK': 'FALSE', 'DMLC_ROLE': 'scheduler', 'DMLC_PS_ROOT_URI': '10.2.232.234', 'DMLC_PS_ROOT_PORT': '9000', 'DMLC_NUM_SERVER': '1', 'DMLC_NUM_WORKER': '1'}\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Launching parameter server for role server\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] {'ENVROOT': '/opt/amazon', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION': 'cpp', 'HOSTNAME': 'ip-10-2-232-234.ec2.internal', 'TRAINING_JOB_NAME': 'randomcutforest-2021-09-03-15-41-29-072', 'NVIDIA_REQUIRE_CUDA': 'cuda>=9.0', 'TRAINING_JOB_ARN': 'arn:aws:sagemaker:us-east-1:845897987212:training-job/randomcutforest-2021-09-03-15-41-29-072', 'AWS_CONTAINER_CREDENTIALS_RELATIVE_URI': '/v2/credentials/3a9de4cc-b7da-4154-b32f-625999c2fe7d', 'CANONICAL_ENVROOT': '/opt/amazon', 'PYTHONUNBUFFERED': 'TRUE', 'NVIDIA_VISIBLE_DEVICES': 'void', 'LD_LIBRARY_PATH': '/opt/amazon/lib/python3.7/site-packages/cv2/../../../../lib:/usr/local/nvidia/lib64:/opt/amazon/lib', 'MXNET_KVSTORE_BIGARRAY_BOUND': '400000000', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'AWS_EXECUTION_ENV': 'AWS_ECS_EC2', 'PATH': '/opt/amazon/bin:/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/amazon/bin:/opt/amazon/bin', 'PWD': '/', 'LANG': 'en_US.utf8', 'AWS_REGION': 'us-east-1', 'SAGEMAKER_METRICS_DIRECTORY': '/opt/ml/output/metrics/sagemaker', 'HOME': '/root', 'SHLVL': '1', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION_VERSION': '2', 'OMP_NUM_THREADS': '2', 'DMLC_INTERFACE': 'eth0', 'ECS_CONTAINER_METADATA_URI': 'http://169.254.170.2/v3/e2b93386-53a1-46ec-8f1f-419bf568a86c', 'ECS_CONTAINER_METADATA_URI_V4': 'http://169.254.170.2/v4/e2b93386-53a1-46ec-8f1f-419bf568a86c', 'SAGEMAKER_HTTP_PORT': '8080', 'SAGEMAKER_DATA_PATH': '/opt/ml', 'KMP_DUPLICATE_LIB_OK': 'True', 'KMP_INIT_AT_FORK': 'FALSE'}\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] envs={'ENVROOT': '/opt/amazon', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION': 'cpp', 'HOSTNAME': 'ip-10-2-232-234.ec2.internal', 'TRAINING_JOB_NAME': 'randomcutforest-2021-09-03-15-41-29-072', 'NVIDIA_REQUIRE_CUDA': 'cuda>=9.0', 'TRAINING_JOB_ARN': 'arn:aws:sagemaker:us-east-1:845897987212:training-job/randomcutforest-2021-09-03-15-41-29-072', 'AWS_CONTAINER_CREDENTIALS_RELATIVE_URI': '/v2/credentials/3a9de4cc-b7da-4154-b32f-625999c2fe7d', 'CANONICAL_ENVROOT': '/opt/amazon', 'PYTHONUNBUFFERED': 'TRUE', 'NVIDIA_VISIBLE_DEVICES': 'void', 'LD_LIBRARY_PATH': '/opt/amazon/lib/python3.7/site-packages/cv2/../../../../lib:/usr/local/nvidia/lib64:/opt/amazon/lib', 'MXNET_KVSTORE_BIGARRAY_BOUND': '400000000', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'AWS_EXECUTION_ENV': 'AWS_ECS_EC2', 'PATH': '/opt/amazon/bin:/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/amazon/bin:/opt/amazon/bin', 'PWD': '/', 'LANG': 'en_US.utf8', 'AWS_REGION': 'us-east-1', 'SAGEMAKER_METRICS_DIRECTORY': '/opt/ml/output/metrics/sagemaker', 'HOME': '/root', 'SHLVL': '1', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION_VERSION': '2', 'OMP_NUM_THREADS': '2', 'DMLC_INTERFACE': 'eth0', 'ECS_CONTAINER_METADATA_URI': 'http://169.254.170.2/v3/e2b93386-53a1-46ec-8f1f-419bf568a86c', 'ECS_CONTAINER_METADATA_URI_V4': 'http://169.254.170.2/v4/e2b93386-53a1-46ec-8f1f-419bf568a86c', 'SAGEMAKER_HTTP_PORT': '8080', 'SAGEMAKER_DATA_PATH': '/opt/ml', 'KMP_DUPLICATE_LIB_OK': 'True', 'KMP_INIT_AT_FORK': 'FALSE', 'DMLC_ROLE': 'server', 'DMLC_PS_ROOT_URI': '10.2.232.234', 'DMLC_PS_ROOT_PORT': '9000', 'DMLC_NUM_SERVER': '1', 'DMLC_NUM_WORKER': '1'}\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Environment: {'ENVROOT': '/opt/amazon', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION': 'cpp', 'HOSTNAME': 'ip-10-2-232-234.ec2.internal', 'TRAINING_JOB_NAME': 'randomcutforest-2021-09-03-15-41-29-072', 'NVIDIA_REQUIRE_CUDA': 'cuda>=9.0', 'TRAINING_JOB_ARN': 'arn:aws:sagemaker:us-east-1:845897987212:training-job/randomcutforest-2021-09-03-15-41-29-072', 'AWS_CONTAINER_CREDENTIALS_RELATIVE_URI': '/v2/credentials/3a9de4cc-b7da-4154-b32f-625999c2fe7d', 'CANONICAL_ENVROOT': '/opt/amazon', 'PYTHONUNBUFFERED': 'TRUE', 'NVIDIA_VISIBLE_DEVICES': 'void', 'LD_LIBRARY_PATH': '/opt/amazon/lib/python3.7/site-packages/cv2/../../../../lib:/usr/local/nvidia/lib64:/opt/amazon/lib', 'MXNET_KVSTORE_BIGARRAY_BOUND': '400000000', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'AWS_EXECUTION_ENV': 'AWS_ECS_EC2', 'PATH': '/opt/amazon/bin:/usr/local/nvidia/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/amazon/bin:/opt/amazon/bin', 'PWD': '/', 'LANG': 'en_US.utf8', 'AWS_REGION': 'us-east-1', 'SAGEMAKER_METRICS_DIRECTORY': '/opt/ml/output/metrics/sagemaker', 'HOME': '/root', 'SHLVL': '1', 'PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION_VERSION': '2', 'OMP_NUM_THREADS': '2', 'DMLC_INTERFACE': 'eth0', 'ECS_CONTAINER_METADATA_URI': 'http://169.254.170.2/v3/e2b93386-53a1-46ec-8f1f-419bf568a86c', 'ECS_CONTAINER_METADATA_URI_V4': 'http://169.254.170.2/v4/e2b93386-53a1-46ec-8f1f-419bf568a86c', 'SAGEMAKER_HTTP_PORT': '8080', 'SAGEMAKER_DATA_PATH': '/opt/ml', 'KMP_DUPLICATE_LIB_OK': 'True', 'KMP_INIT_AT_FORK': 'FALSE', 'DMLC_ROLE': 'worker', 'DMLC_PS_ROOT_URI': '10.2.232.234', 'DMLC_PS_ROOT_PORT': '9000', 'DMLC_NUM_SERVER': '1', 'DMLC_NUM_WORKER': '1'}\u001b[0m\n",
"\u001b[34mProcess 33 is a shell:scheduler.\u001b[0m\n",
"\u001b[34mProcess 44 is a shell:server.\u001b[0m\n",
"\u001b[34mProcess 1 is a worker.\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Using default worker.\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Loaded iterator creator application/x-recordio-protobuf for content type ('application/x-recordio-protobuf', '1.0')\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Checkpoint loading and saving are disabled.\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Verifying hyperparamemters...\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Hyperparameters are correct.\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Validating that feature_dim agrees with dimensions in training data...\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] feature_dim is correct.\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Validating memory limits...\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Available memory in bytes: 15184003072\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Estimated sample size in bytes: 204800\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Estimated memory needed to build the forest in bytes: 1024000\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Memory limits validated.\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Starting cluster sharing facilities...\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140517211883264] >>> starting FTP server on 0.0.0.0:8999, pid=1 <<<\u001b[0m\n",
"\u001b[34m[I 21-09-03 15:46:21] >>> starting FTP server on 0.0.0.0:8999, pid=1 <<<\u001b[0m\n",
"\u001b[34m[I 21-09-03 15:46:21] poller: \u001b[0m\n",
"\u001b[34m[I 21-09-03 15:46:21] masquerade (NAT) address: None\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140517211883264] poller: \u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140517211883264] masquerade (NAT) address: None\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140517211883264] passive ports: None\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140517211883264] use sendfile(2): True\u001b[0m\n",
"\u001b[34m[I 21-09-03 15:46:21] passive ports: None\u001b[0m\n",
"\u001b[34m[I 21-09-03 15:46:21] use sendfile(2): True\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:21 INFO 140518723704640] Create Store: dist_async\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] Cluster sharing facilities started.\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] Verifying all workers are accessible...\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] All workers accessible.\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] Initializing Sampler...\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] Sampler correctly initialized.\u001b[0m\n",
"\u001b[34m#metrics {\"StartTime\": 1630683981.8909588, \"EndTime\": 1630683983.213549, \"Dimensions\": {\"Algorithm\": \"RandomCutForest\", \"Host\": \"algo-1\", \"Operation\": \"training\"}, \"Metrics\": {\"initialize.time\": {\"sum\": 1316.580057144165, \"count\": 1, \"min\": 1316.580057144165, \"max\": 1316.580057144165}}}\n",
"\u001b[0m\n",
"\u001b[34m#metrics {\"StartTime\": 1630683983.2137547, \"EndTime\": 1630683983.21381, \"Dimensions\": {\"Algorithm\": \"RandomCutForest\", \"Host\": \"algo-1\", \"Operation\": \"training\", \"Meta\": \"init_train_data_iter\"}, \"Metrics\": {\"Total Records Seen\": {\"sum\": 0.0, \"count\": 1, \"min\": 0, \"max\": 0}, \"Total Batches Seen\": {\"sum\": 0.0, \"count\": 1, \"min\": 0, \"max\": 0}, \"Max Records Seen Between Resets\": {\"sum\": 0.0, \"count\": 1, \"min\": 0, \"max\": 0}, \"Max Batches Seen Between Resets\": {\"sum\": 0.0, \"count\": 1, \"min\": 0, \"max\": 0}, \"Reset Count\": {\"sum\": 0.0, \"count\": 1, \"min\": 0, \"max\": 0}, \"Number of Records Since Last Reset\": {\"sum\": 0.0, \"count\": 1, \"min\": 0, \"max\": 0}, \"Number of Batches Since Last Reset\": {\"sum\": 0.0, \"count\": 1, \"min\": 0, \"max\": 0}}}\n",
"\u001b[0m\n",
"\u001b[34m[2021-09-03 15:46:23.214] [tensorio] [info] epoch_stats={\"data_pipeline\": \"/opt/ml/input/data/train\", \"epoch\": 0, \"duration\": 1322, \"num_examples\": 1, \"num_bytes\": 28000}\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] Sampling training data...\u001b[0m\n",
"\u001b[34m[2021-09-03 15:46:23.244] [tensorio] [info] epoch_stats={\"data_pipeline\": \"/opt/ml/input/data/train\", \"epoch\": 1, \"duration\": 29, \"num_examples\": 11, \"num_bytes\": 288960}\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] Sampling training data completed.\u001b[0m\n",
"\u001b[34m#metrics {\"StartTime\": 1630683983.2136898, \"EndTime\": 1630683983.2479045, \"Dimensions\": {\"Algorithm\": \"RandomCutForest\", \"Host\": \"algo-1\", \"Operation\": \"training\"}, \"Metrics\": {\"epochs\": {\"sum\": 1.0, \"count\": 1, \"min\": 1, \"max\": 1}, \"update.time\": {\"sum\": 33.71024131774902, \"count\": 1, \"min\": 33.71024131774902, \"max\": 33.71024131774902}}}\n",
"\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] Early stop condition met. Stopping training.\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] #progress_metric: host=algo-1, completed 100 % epochs\u001b[0m\n",
"\u001b[34m#metrics {\"StartTime\": 1630683983.2141645, \"EndTime\": 1630683983.248269, \"Dimensions\": {\"Algorithm\": \"RandomCutForest\", \"Host\": \"algo-1\", \"Operation\": \"training\", \"epoch\": 0, \"Meta\": \"training_data_iter\"}, \"Metrics\": {\"Total Records Seen\": {\"sum\": 10320.0, \"count\": 1, \"min\": 10320, \"max\": 10320}, \"Total Batches Seen\": {\"sum\": 11.0, \"count\": 1, \"min\": 11, \"max\": 11}, \"Max Records Seen Between Resets\": {\"sum\": 10320.0, \"count\": 1, \"min\": 10320, \"max\": 10320}, \"Max Batches Seen Between Resets\": {\"sum\": 11.0, \"count\": 1, \"min\": 11, \"max\": 11}, \"Reset Count\": {\"sum\": 1.0, \"count\": 1, \"min\": 1, \"max\": 1}, \"Number of Records Since Last Reset\": {\"sum\": 10320.0, \"count\": 1, \"min\": 10320, \"max\": 10320}, \"Number of Batches Since Last Reset\": {\"sum\": 11.0, \"count\": 1, \"min\": 11, \"max\": 11}}}\n",
"\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] #throughput_metric: host=algo-1, train throughput=301332.5625496011 records/second\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] Master node: building Random Cut Forest...\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] Gathering samples...\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] 10320 samples gathered\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] Building Random Cut Forest...\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] Random Cut Forest built: \n",
"\u001b[0m\n",
"\u001b[34mForestInfo{num_trees: 50, num_samples_in_forest: 10300, num_samples_per_tree: 206, sample_dim: 1, shingle_size: 1, trees_num_nodes: [405, 403, 409, 409, 411, 409, 407, 409, 407, 411, 409, 411, 411, 407, 405, 409, 407, 409, 409, 409, 411, 411, 401, 409, 409, 411, 407, 411, 411, 401, 411, 405, 411, 411, 409, 409, 411, 405, 409, 407, 409, 411, 411, 407, 409, 411, 407, 407, 411, 409, ], trees_depth: [16, 21, 22, 16, 14, 18, 17, 17, 19, 16, 17, 18, 18, 16, 17, 17, 20, 16, 19, 17, 21, 15, 15, 17, 15, 18, 15, 18, 18, 22, 18, 13, 17, 15, 18, 17, 18, 15, 18, 14, 18, 20, 20, 15, 20, 18, 14, 16, 15, 18, ], max_num_nodes: 411, min_num_nodes: 401, avg_num_nodes: 408, max_tree_depth: 22, min_tree_depth: 13, avg_tree_depth: 17, mem_size: 2124960}\u001b[0m\n",
"\u001b[34m#metrics {\"StartTime\": 1630683983.2479992, \"EndTime\": 1630683983.2633133, \"Dimensions\": {\"Algorithm\": \"RandomCutForest\", \"Host\": \"algo-1\", \"Operation\": \"training\"}, \"Metrics\": {\"fit_model.time\": {\"sum\": 7.804393768310547, \"count\": 1, \"min\": 7.804393768310547, \"max\": 7.804393768310547}, \"model.bytes\": {\"sum\": 2124960.0, \"count\": 1, \"min\": 2124960, \"max\": 2124960}, \"finalize.time\": {\"sum\": 14.678478240966797, \"count\": 1, \"min\": 14.678478240966797, \"max\": 14.678478240966797}}}\n",
"\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] Master node: Serializing the RandomCutForest model\u001b[0m\n",
"\u001b[34m#metrics {\"StartTime\": 1630683983.263396, \"EndTime\": 1630683983.2957816, \"Dimensions\": {\"Algorithm\": \"RandomCutForest\", \"Host\": \"algo-1\", \"Operation\": \"training\"}, \"Metrics\": {\"serialize_model.time\": {\"sum\": 32.332420349121094, \"count\": 1, \"min\": 32.332420349121094, \"max\": 32.332420349121094}}}\n",
"\u001b[0m\n",
"\u001b[34m[09/03/2021 15:46:23 INFO 140518723704640] Test data is not provided.\u001b[0m\n",
"\u001b[34m#metrics {\"StartTime\": 1630683983.295877, \"EndTime\": 1630683983.296018, \"Dimensions\": {\"Algorithm\": \"RandomCutForest\", \"Host\": \"algo-1\", \"Operation\": \"training\"}, \"Metrics\": {\"setuptime\": {\"sum\": 27.47488021850586, \"count\": 1, \"min\": 27.47488021850586, \"max\": 27.47488021850586}, \"totaltime\": {\"sum\": 1443.469762802124, \"count\": 1, \"min\": 1443.469762802124, \"max\": 1443.469762802124}}}\n",
"\u001b[0m\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"2021-09-03 15:46:50 Completed - Training job completed\n",
"Training seconds: 67\n",
"Billable seconds: 67\n"
]
}
],
"source": [
"from sagemaker import RandomCutForest\n",
"\n",
"session = sagemaker.Session()\n",
"\n",
"# specify general training job information\n",
"rcf = RandomCutForest(\n",
" role=execution_role,\n",
" instance_count=1,\n",
" instance_type=\"ml.m4.xlarge\",\n",
" data_location=f\"s3://{bucket}/{prefix}/\",\n",
" output_path=f\"s3://{bucket}/{prefix}/output\",\n",
" num_samples_per_tree=512,\n",
" num_trees=50,\n",
")\n",
"\n",
"# automatically upload the training data to S3 and run the training job\n",
"rcf.fit(rcf.record_set(taxi_data.value.to_numpy().reshape(-1, 1)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you see the message\n",
"\n",
"> `===== Job Complete =====`\n",
"\n",
"at the bottom of the output logs, training completed successfully and the output RCF model was stored in the specified output path. You can also view the details and status of a training job in the Amazon SageMaker console: click the \"Jobs\" tab and select the training job whose name matches the one printed below:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training job name: randomcutforest-2021-09-03-15-41-29-072\n"
]
}
],
"source": [
"print(f\"Training job name: {rcf.latest_training_job.job_name}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Inference\n",
"\n",
"***\n",
"\n",
"A trained Random Cut Forest model does nothing on its own. We now want to use the model we computed to perform inference on data. In this case, it means computing anomaly scores from input time series data points.\n",
"\n",
"We create an inference endpoint from the training job defined above by calling the SageMaker Python SDK `deploy()` function. We specify the instance type on which inference runs and the initial number of instances to launch. We recommend the `ml.c5` instance type for fast, low-cost inference; the cell below uses `ml.m4.xlarge`."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"---------!"
]
}
],
"source": [
"rcf_inference = rcf.deploy(initial_instance_count=1, instance_type=\"ml.m4.xlarge\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Congratulations! You now have a functioning SageMaker RCF inference endpoint. You can confirm the endpoint configuration and status by navigating to the \"Endpoints\" tab in the Amazon SageMaker console and selecting the endpoint whose name matches the one printed below:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The endpoint attribute has been renamed in sagemaker>=2.\n",
"See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Endpoint name: randomcutforest-2021-09-03-15-47-12-234\n"
]
}
],
"source": [
"print(f\"Endpoint name: {rcf_inference.endpoint_name}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Serialization/Deserialization\n",
"\n",
"We can pass data to our inference endpoint in a variety of formats. In this example we pass CSV-formatted data; the other available formats are JSON and RecordIO Protobuf. We use the SageMaker Python SDK classes `CSVSerializer` and `JSONDeserializer` when configuring the inference endpoint."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.serializers import CSVSerializer\n",
"from sagemaker.deserializers import JSONDeserializer\n",
"\n",
"rcf_inference.serializer = CSVSerializer()\n",
"rcf_inference.deserializer = JSONDeserializer()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's pass the training dataset, in CSV format, to the inference endpoint so we can automatically detect the anomalies we saw with our eyes in the plots above. The serializer converts the NumPy array to CSV, and the deserializer parses the JSON response.\n",
"\n",
"For starters, let's only pass in the first six datapoints so we can see what the output looks like."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[10844]\n",
" [ 8127]\n",
" [ 6210]\n",
" [ 4656]\n",
" [ 3820]\n",
" [ 2873]]\n"
]
}
],
"source": [
"taxi_data_numpy = taxi_data.value.to_numpy().reshape(-1, 1)\n",
"print(taxi_data_numpy[:6])\n",
"results = rcf_inference.predict(\n",
" taxi_data_numpy[:6], initial_args={\"ContentType\": \"text/csv\", \"Accept\": \"application/json\"}\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing Anomaly Scores\n",
"\n",
"Now, let's compute and plot the anomaly scores from the entire taxi dataset."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
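"[Figure: taxi ridership (left axis) and anomaly score (right axis) plotted against the dataset index]"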
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
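"# Score the full dataset and attach the scores to the DataFrame used in the plot below\n",
"results = rcf_inference.predict(taxi_data_numpy)\n",
"scores = [datum[\"score\"] for datum in results[\"scores\"]]\n",
"taxi_data[\"score\"] = scores\n",
"\n",
"fig, ax1 = plt.subplots()\n",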
"ax2 = ax1.twinx()\n",
"\n",
"#\n",
"# *Try this out* - change `start` and `end` to zoom in on the\n",
"# anomaly found earlier in this notebook\n",
"#\n",
"start, end = 0, len(taxi_data)\n",
"# start, end = 5500, 6500\n",
"taxi_data_subset = taxi_data[start:end]\n",
"\n",
"ax1.plot(taxi_data_subset[\"value\"], color=\"C0\", alpha=0.8)\n",
"ax2.plot(taxi_data_subset[\"score\"], color=\"C1\")\n",
"\n",
"ax1.grid(which=\"major\", axis=\"both\")\n",
"\n",
"ax1.set_ylabel(\"Taxi Ridership\", color=\"C0\")\n",
"ax2.set_ylabel(\"Anomaly Score\", color=\"C1\")\n",
"\n",
"ax1.tick_params(\"y\", colors=\"C0\")\n",
"ax2.tick_params(\"y\", colors=\"C1\")\n",
"\n",
"ax1.set_ylim(0, 40000)\n",
"ax2.set_ylim(min(scores), 1.4 * max(scores))\n",
"fig.set_figwidth(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the anomaly score spikes where our eyeball-norm method suggests there is an anomalous data point as well as in some places where our eyeballs are not as accurate.\n",
"\n",
"Below we print and plot any data points whose score is more than 3 standard deviations above the mean score (approximately the 99.9th percentile if the scores were normally distributed)."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>timestamp</th><th>value</th><th>score</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>37</th><td>2014-07-01 18:30:00</td><td>27598</td><td>2.111130</td></tr>\n",
"    <tr><th>38</th><td>2014-07-01 19:00:00</td><td>26827</td><td>1.740207</td></tr>\n",
"    <tr><th>87</th><td>2014-07-02 19:30:00</td><td>26872</td><td>1.750501</td></tr>\n",
"    <tr><th>134</th><td>2014-07-03 19:00:00</td><td>29985</td><td>3.113783</td></tr>\n",
"    <tr><th>527</th><td>2014-07-11 23:30:00</td><td>26873</td><td>1.755937</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>10309</th><td>2015-01-31 18:30:00</td><td>27286</td><td>1.959775</td></tr>\n",
"    <tr><th>10310</th><td>2015-01-31 19:00:00</td><td>28804</td><td>2.689794</td></tr>\n",
"    <tr><th>10311</th><td>2015-01-31 19:30:00</td><td>27773</td><td>2.226761</td></tr>\n",
"    <tr><th>10317</th><td>2015-01-31 22:30:00</td><td>27309</td><td>1.965813</td></tr>\n",
"    <tr><th>10318</th><td>2015-01-31 23:00:00</td><td>26591</td><td>1.655542</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>236 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" timestamp value score\n",
"37 2014-07-01 18:30:00 27598 2.111130\n",
"38 2014-07-01 19:00:00 26827 1.740207\n",
"87 2014-07-02 19:30:00 26872 1.750501\n",
"134 2014-07-03 19:00:00 29985 3.113783\n",
"527 2014-07-11 23:30:00 26873 1.755937\n",
"... ... ... ...\n",
"10309 2015-01-31 18:30:00 27286 1.959775\n",
"10310 2015-01-31 19:00:00 28804 2.689794\n",
"10311 2015-01-31 19:30:00 27773 2.226761\n",
"10317 2015-01-31 22:30:00 27309 1.965813\n",
"10318 2015-01-31 23:00:00 26591 1.655542\n",
"\n",
"[236 rows x 3 columns]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"score_mean = taxi_data[\"score\"].mean()\n",
"score_std = taxi_data[\"score\"].std()\n",
"score_cutoff = score_mean + 3 * score_std\n",
"\n",
"anomalies = taxi_data_subset[taxi_data_subset[\"score\"] > score_cutoff]\n",
"anomalies"
]
},
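{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cutoff above is the mean score plus three sample standard deviations. As a minimal, self-contained sketch of the same rule, the snippet below applies it to a small made-up list of scores using only the standard library. The helper name `three_sigma_cutoff` and the sample values are illustrative assumptions, not part of the notebook's data.\n",
"\n",
"```python\n",
"import statistics\n",
"\n",
"\n",
"def three_sigma_cutoff(scores):\n",
"    \"\"\"Return mean + 3 * sample standard deviation (hypothetical helper).\"\"\"\n",
"    mean = statistics.mean(scores)\n",
"    std = statistics.stdev(scores)  # sample standard deviation (n - 1 in the denominator)\n",
"    return mean + 3 * std\n",
"\n",
"\n",
"# Made-up scores: nineteen ordinary points and one clear outlier\n",
"sample_scores = [1.0] * 19 + [5.0]\n",
"cutoff = three_sigma_cutoff(sample_scores)\n",
"flagged = [s for s in sample_scores if s > cutoff]\n",
"print(cutoff, flagged)\n",
"```\n",
"\n",
"`statistics.stdev` computes the sample standard deviation, which is also the default behavior of the pandas `.std()` call used above."
]
},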
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following is a list of known anomalous events which occurred in New York City within this timeframe:\n",
"\n",
"* `2014-11-02` - NYC Marathon\n",
"* `2015-01-01` - New Year's Day\n",
"* `2015-01-27` - Snowstorm\n",
"\n",
"Note that our algorithm managed to capture these events along with quite a few others. Below we add these anomalies to the score plot."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"[Figure: anomaly score plot with the detected anomalies marked as black dots]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ax2.plot(anomalies.index, anomalies.score, \"ko\")\n",
"fig"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the current hyperparameter choices, the three-standard-deviation threshold captures the known anomalies as well as those apparent in the ridership plot, but it is rather sensitive to fine-grained perturbations and anomalous behavior. Adding trees to the SageMaker RCF model or training on a larger data set could smooth out the results."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Redshift ML BYOM Remote Inference"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set Up a `run_sql` Function Using the Redshift Data API to Load SQL Query Output into a pandas DataFrame\n",
"In this step, we create the function `run_sql`, which returns SQL query output directly as a pandas DataFrame. We also use this function to run DDL statements."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"import time\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"session = boto3.session.Session()\n",
"region = session.region_name\n",
"\n",
"\n",
"def run_sql(sql_text):\n",
"    \"\"\"Run one SQL statement through the Redshift Data API.\n",
"\n",
"    Returns a DataFrame when the statement produces rows, otherwise the statement ID.\n",
"    \"\"\"\n",
"    client = boto3.client(\"redshift-data\")\n",
"    res = client.execute_statement(\n",
"        Database=REDSHIFT_ENDPOINT.split(\"/\")[1],\n",
"        DbUser=REDSHIFT_USER,\n",
"        Sql=sql_text,\n",
"        ClusterIdentifier=REDSHIFT_ENDPOINT.split(\".\")[0],\n",
"    )\n",
"    query_id = res[\"Id\"]\n",
"    # Poll once per second until the statement reaches a terminal state\n",
"    while True:\n",
"        time.sleep(1)\n",
"        status_description = client.describe_statement(Id=query_id)\n",
"        status = status_description[\"Status\"]\n",
"        if status in (\"FAILED\", \"ABORTED\"):\n",
"            error = status_description.get(\"Error\", \"\")\n",
"            raise Exception(\"SQL query \" + status + \": \" + query_id + \": \" + error)\n",
"        elif status == \"FINISHED\":\n",
"            if status_description[\"ResultRows\"] > 0:\n",
"                # Only the first page of results is read here\n",
"                results = client.get_statement_result(Id=query_id)\n",
"                column_labels = [col[\"label\"] for col in results[\"ColumnMetadata\"]]\n",
"                records = [[list(rec.values())[0] for rec in record] for record in results[\"Records\"]]\n",
"                return pd.DataFrame(np.array(records), columns=column_labels)\n",
"            return query_id\n"
]
},
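{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Data API returns every field of a record as a small dictionary keyed by its type, for example `{\"longValue\": 10844}` or `{\"isNull\": true}`, and `run_sql` above keeps the first value of each dictionary. As a rough sketch of that flattening step, the snippet below works on a hand-written sample instead of a live cluster and maps NULL fields to `None`. The helper name `rows_from_records` and the sample data are hypothetical and are not part of this notebook or of boto3.\n",
"\n",
"```python\n",
"def rows_from_records(column_metadata, records):\n",
"    \"\"\"Flatten a Data API result page into (labels, rows); hypothetical helper.\"\"\"\n",
"    labels = [col[\"label\"] for col in column_metadata]\n",
"    rows = []\n",
"    for record in records:\n",
"        row = []\n",
"        for field in record:\n",
"            # Each field is a one-entry dict, e.g. {\"longValue\": 5} or {\"isNull\": True}\n",
"            if field.get(\"isNull\"):\n",
"                row.append(None)\n",
"            else:\n",
"                row.append(next(iter(field.values())))\n",
"        rows.append(row)\n",
"    return labels, rows\n",
"\n",
"\n",
"# Hand-written sample shaped like a get_statement_result response\n",
"sample_metadata = [{\"label\": \"ride_timestamp\"}, {\"label\": \"nbr_passengers\"}]\n",
"sample_records = [\n",
"    [{\"stringValue\": \"2014-07-01 00:00:00\"}, {\"longValue\": 10844}],\n",
"    [{\"stringValue\": \"2014-07-01 00:30:00\"}, {\"isNull\": True}],\n",
"]\n",
"labels, rows = rows_from_records(sample_metadata, sample_records)\n",
"print(labels)\n",
"print(rows)\n",
"```\n",
"\n",
"Passing the result to `pd.DataFrame(rows, columns=labels)` would then build the DataFrame while keeping each value's native Python type."
]
},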
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Preparation Script\n",
"The following script prepares the data in Redshift.\n",
"It creates the table that we will run inference on and loads the taxi dataset into it from Amazon S3."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"setup_script = \"\"\"\n",
"\n",
"DROP TABLE IF EXISTS public.rcf_taxi_data CASCADE;\n",
"\n",
"CREATE TABLE public.rcf_taxi_data\n",
"(\n",
"ride_timestamp timestamp,\n",
"nbr_passengers int\n",
");\n",
"\n",
"COPY public.rcf_taxi_data\n",
"FROM 's3://sagemaker-sample-files/datasets/tabular/anomaly_benchmark_taxi/NAB_nyc_taxi.csv'\n",
"IAM_ROLE '{}' ignoreheader 1 csv delimiter ',';\n",
"\n",
"\"\"\"\n",
" \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run data preparation script in Redshift"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Run each statement of the setup script, skipping the empty tail after the final semicolon\n",
"sql_stmt = setup_script.split(\";\")\n",
"for sql_text in sql_stmt[:-1]:\n",
"    run_sql(sql_text.format(REDSHIFT_IAM_ROLE))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
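"## Run the Redshift ML CREATE MODEL Statement Using the SageMaker Endpoint for Remote Inference\n",
"\n",
"The `CREATE MODEL` statement below registers the SQL function `remote_fn_rcf`, which passes an integer argument to the SageMaker endpoint and returns the anomaly score as `decimal(10,6)`."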
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"The endpoint attribute has been renamed in sagemaker>=2.\n",
"See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"randomcutforest-2021-09-03-15-47-12-234\n"
]
}
],
"source": [
"SAGEMAKER_ENDPOINT = rcf_inference.endpoint_name\n",
"print(SAGEMAKER_ENDPOINT) "
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"cfa8e390-605b-4c44-9c45-86914f70720e\n"
]
}
],
"source": [
"sql_text=(\"drop model if exists public.remote_random_cut_forest;\\\n",
"CREATE MODEL public.remote_random_cut_forest\\\n",
" FUNCTION remote_fn_rcf (int)\\\n",
" RETURNS decimal(10,6)\\\n",
" SAGEMAKER'{}'\\\n",
" IAM_ROLE'{}'\\\n",
"\")\n",
"df=run_sql(sql_text.format(SAGEMAKER_ENDPOINT,REDSHIFT_IAM_ROLE))\n",
"print(df)\n"
]
},
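{
"cell_type": "markdown",
"metadata": {},
"source": [
"The statement above is assembled with line continuations inside a single string, which can be hard to read. As one possible alternative, the sketch below builds the same kind of `CREATE MODEL` statement from named arguments. The helper `build_create_model_sql`, the endpoint name, and the role ARN are placeholders chosen for illustration; they assume trusted configuration values, since the text is interpolated directly into the SQL.\n",
"\n",
"```python\n",
"def build_create_model_sql(model_name, function_name, arg_types, return_type, endpoint, iam_role):\n",
"    \"\"\"Assemble a remote-inference CREATE MODEL statement (hypothetical helper).\"\"\"\n",
"    return (\n",
"        f\"CREATE MODEL {model_name}\\n\"\n",
"        f\"FUNCTION {function_name} ({', '.join(arg_types)})\\n\"\n",
"        f\"RETURNS {return_type}\\n\"\n",
"        f\"SAGEMAKER '{endpoint}'\\n\"\n",
"        f\"IAM_ROLE '{iam_role}';\"\n",
"    )\n",
"\n",
"\n",
"example_sql = build_create_model_sql(\n",
"    model_name=\"public.remote_random_cut_forest\",\n",
"    function_name=\"remote_fn_rcf\",\n",
"    arg_types=[\"int\"],\n",
"    return_type=\"decimal(10,6)\",\n",
"    endpoint=\"my-rcf-endpoint\",  # placeholder endpoint name\n",
"    iam_role=\"arn:aws:iam::123456789012:role/MyRedshiftRole\",  # placeholder role ARN\n",
")\n",
"print(example_sql)\n",
"```\n",
"\n",
"In this notebook the real values would come from `SAGEMAKER_ENDPOINT` and `REDSHIFT_IAM_ROLE`, and the resulting string would be passed to `run_sql`."
]
},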
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Show Model"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Key</th><th>Value</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Model Name</td><td>remote_random_cut_forest</td></tr>\n",
"    <tr><th>1</th><td>Schema Name</td><td>public</td></tr>\n",
"    <tr><th>2</th><td>Owner</td><td>demo</td></tr>\n",
"    <tr><th>3</th><td>Creation Time</td><td>Fri, 03.09.2021 15:51:52</td></tr>\n",
"    <tr><th>4</th><td>Model State</td><td>READY</td></tr>\n",
"    <tr><th>5</th><td></td><td></td></tr>\n",
"    <tr><th>6</th><td>PARAMETERS:</td><td></td></tr>\n",
"    <tr><th>7</th><td>Endpoint</td><td>randomcutforest-2021-09-03-15-47-12-234</td></tr>\n",
"    <tr><th>8</th><td>Function Name</td><td>remote_fn_rcf</td></tr>\n",
"    <tr><th>9</th><td>Inference Type</td><td>Remote</td></tr>\n",
"    <tr><th>10</th><td>Function Parameter Types</td><td>int4</td></tr>\n",
"    <tr><th>11</th><td>IAM Role</td><td>arn:aws:iam::845897987212:role/RedshiftDemo-Re...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Key \\\n",
"0 Model Name \n",
"1 Schema Name \n",
"2 Owner \n",
"3 Creation Time \n",
"4 Model State \n",
"5 \n",
"6 PARAMETERS: \n",
"7 Endpoint \n",
"8 Function Name \n",
"9 Inference Type \n",
"10 Function Parameter Types \n",
"11 IAM Role \n",
"\n",
" Value \n",
"0 remote_random_cut_forest \n",
"1 public \n",
"2 demo \n",
"3 Fri, 03.09.2021 15:51:52 \n",
"4 READY \n",
"5 \n",
"6 \n",
"7 randomcutforest-2021-09-03-15-47-12-234 \n",
"8 remote_fn_rcf \n",
"9 Remote \n",
"10 int4 \n",
"11 arn:aws:iam::845897987212:role/RedshiftDemo-Re... "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = run_sql(\"SHOW MODEL public.remote_random_cut_forest\")\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing Anomaly Scores\n",
"Now, let's compute the anomaly scores for the entire taxi dataset, this time by calling the remote inference function from Redshift SQL."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>ride_timestamp</th><th>nbr_passengers</th><th>score</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>2014-07-01 00:00:00</td><td>10844</td><td>0.943024</td></tr>\n",
"    <tr><th>1</th><td>2014-07-01 00:30:00</td><td>8127</td><td>0.949037</td></tr>\n",
"    <tr><th>2</th><td>2014-07-01 01:00:00</td><td>6210</td><td>0.961702</td></tr>\n",
"    <tr><th>3</th><td>2014-07-01 01:30:00</td><td>4656</td><td>0.851479</td></tr>\n",
"    <tr><th>4</th><td>2014-07-01 02:00:00</td><td>3820</td><td>0.894885</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>10315</th><td>2015-01-31 21:30:00</td><td>24670</td><td>1.100063</td></tr>\n",
"    <tr><th>10316</th><td>2015-01-31 22:00:00</td><td>25721</td><td>1.303702</td></tr>\n",
"    <tr><th>10317</th><td>2015-01-31 22:30:00</td><td>27309</td><td>1.965812</td></tr>\n",
"    <tr><th>10318</th><td>2015-01-31 23:00:00</td><td>26591</td><td>1.655541</td></tr>\n",
"    <tr><th>10319</th><td>2015-01-31 23:30:00</td><td>26288</td><td>1.509151</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>10320 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" ride_timestamp nbr_passengers score\n",
"0 2014-07-01 00:00:00 10844 0.943024\n",
"1 2014-07-01 00:30:00 8127 0.949037\n",
"2 2014-07-01 01:00:00 6210 0.961702\n",
"3 2014-07-01 01:30:00 4656 0.851479\n",
"4 2014-07-01 02:00:00 3820 0.894885\n",
"... ... ... ...\n",
"10315 2015-01-31 21:30:00 24670 1.100063\n",
"10316 2015-01-31 22:00:00 25721 1.303702\n",
"10317 2015-01-31 22:30:00 27309 1.965812\n",
"10318 2015-01-31 23:00:00 26591 1.655541\n",
"10319 2015-01-31 23:30:00 26288 1.509151\n",
"\n",
"[10320 rows x 3 columns]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = run_sql(\"\"\"\n",
"select ride_timestamp, nbr_passengers, public.remote_fn_rcf(nbr_passengers) as score\n",
"from public.rcf_taxi_data;\n",
"\n",
"\"\"\");\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Note that the anomaly score spikes where our eyeball-norm method suggests there is an anomalous data point as well as in some places where our eyeballs are not as accurate.\n",
"\n",
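"Below we print any data points whose score is more than 3 standard deviations above the mean score (approximately the 99.9th percentile if the scores were normally distributed). The query returns the same 236 rows that the pandas filter found earlier."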
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>ride_timestamp</th><th>nbr_passengers</th><th>score</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>2014-07-01 18:30:00</td><td>27598</td><td>2.111130</td></tr>\n",
"    <tr><th>1</th><td>2014-07-01 19:00:00</td><td>26827</td><td>1.740206</td></tr>\n",
"    <tr><th>2</th><td>2014-07-02 19:30:00</td><td>26872</td><td>1.750500</td></tr>\n",
"    <tr><th>3</th><td>2014-07-03 19:00:00</td><td>29985</td><td>3.113782</td></tr>\n",
"    <tr><th>4</th><td>2014-07-11 23:30:00</td><td>26873</td><td>1.755936</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>231</th><td>2015-01-31 18:30:00</td><td>27286</td><td>1.959775</td></tr>\n",
"    <tr><th>232</th><td>2015-01-31 19:00:00</td><td>28804</td><td>2.689794</td></tr>\n",
"    <tr><th>233</th><td>2015-01-31 19:30:00</td><td>27773</td><td>2.226760</td></tr>\n",
"    <tr><th>234</th><td>2015-01-31 22:30:00</td><td>27309</td><td>1.965812</td></tr>\n",
"    <tr><th>235</th><td>2015-01-31 23:00:00</td><td>26591</td><td>1.655541</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>236 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" ride_timestamp nbr_passengers score\n",
"0 2014-07-01 18:30:00 27598 2.111130\n",
"1 2014-07-01 19:00:00 26827 1.740206\n",
"2 2014-07-02 19:30:00 26872 1.750500\n",
"3 2014-07-03 19:00:00 29985 3.113782\n",
"4 2014-07-11 23:30:00 26873 1.755936\n",
".. ... ... ...\n",
"231 2015-01-31 18:30:00 27286 1.959775\n",
"232 2015-01-31 19:00:00 28804 2.689794\n",
"233 2015-01-31 19:30:00 27773 2.226760\n",
"234 2015-01-31 22:30:00 27309 1.965812\n",
"235 2015-01-31 23:00:00 26591 1.655541\n",
"\n",
"[236 rows x 3 columns]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = run_sql(\"\"\"\n",
"with score_cutoff as\n",
"(select stddev(public.remote_fn_rcf(nbr_passengers)) as std, avg(public.remote_fn_rcf(nbr_passengers)) as mean, ( mean + 3 * std ) as score_cutoff_value\n",
"from public.rcf_taxi_data)\n",
"\n",
"select ride_timestamp, nbr_passengers, public.remote_fn_rcf(nbr_passengers) as score\n",
"from public.rcf_taxi_data\n",
"where score > (select score_cutoff_value from score_cutoff)\n",
"\n",
"\n",
"\"\"\");\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusion\n",
"\n",
"---\n",
"\n",
"We used Amazon SageMaker Random Cut Forest to detect anomalous datapoints in a taxi ridership dataset. In these data the anomalies occurred when ridership was uncharacteristically high or low. However, the RCF algorithm is also capable of detecting when, for example, data breaks periodicity or uncharacteristically changes global behavior.\n",
"\n",
"We then used Redshift ML to demonstrate how you can run inference with unsupervised algorithms (such as Random Cut Forest) directly from Redshift. This helps democratize machine learning by making predictions available through Redshift SQL commands.\n"
]
}
],
"metadata": {
"celltoolbar": "Tags",
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
},
"notice": "Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
},
"nbformat": 4,
"nbformat_minor": 4
}