# Predict Training Time and SageMaker Training Instance RAM and CPU resource consumption

This notebook walks through how to use the `canary_training` library to generate projections of training time, RAM, and CPU usage (collectively referred to here as "resource consumption").

To briefly summarize, the `canary_training` library works by running many small training jobs on small percentages of the data (generally 1, 2, and 3 percent). Based on the statistics gathered (using the SageMaker Profiler), it then extrapolates the resource consumption of the complete training job.

**Note**: If you are using a SageMaker Notebook Instance, please use the `conda_python3` kernel. If you are using SageMaker Studio, please use the `Python 3 (Data Science)` kernel.

In [1]:
import sagemaker
import pandas
import logging
logger = logging.getLogger('log')
#set logs if not done already
if not logger.handlers:
    logger.setLevel(logging.INFO)
    

This notebook relies on the `canary_training` package, which will be used for generating extrapolations.

In [2]:
#In SageMaker Studio
#install from canary training library, which is in directory above.
!pip install ../canary_training/ 
#in a SageMaker Notebook Instance
#!pip install /home/ec2-user/SageMaker/canary_training/Canary_Training/canary_training #make sure this points to the canary_training directory
from canary_training import *

Processing /root/canary_training/Canary_Training/canary_training
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: canary-training
  Building wheel for canary-training (setup.py) ... done
  Created wheel for canary-training: filename=canary_training-0.1-py3-none-any.whl size=9728 sha256=84624f3f58dc1dd72d9a4f1e1a9a90e7f669a875dab0b6bf469f00615623f14d
  Stored in directory: /tmp/pip-ephem-wheel-cache-ecwlfep4/wheels/56/ba/a4/c4d62105eeb5cba3ce5d0c713d357fb5dd59a87ad3e87fb8ea
Successfully built canary-training
Installing collected packages: canary-training
  Attempting uninstall: canary-training
    Found existing installation: canary-training 0.1
    Uninstalling canary-training-0.1:
      Successfully uninstalled canary-training-0.1
Successfully installed canary-training-0.1


## Set up the Canary Job estimator and parameters
Before using canary_training to generate predictions of resource consumption, we need to define a few things.

1. A standard SageMaker estimator which defines our model.
2. The instance(s) that we want to test.
3. How many data points to base the predictions on.

In this example, we will try to predict resource consumption (i.e., CPU, RAM, and training time) when training on an `ml.m5.24xlarge` instance.

The goal is to predict the training time and other training information for the tabular NYC Taxi dataset. This dataset is 40 GB after preprocessing to remove non-numeric features. It has 7 columns; the ML problem is to predict the `total_amount` column (i.e., the total fare) based on the 6 other features. The raw data was downloaded from S3 and comes originally from the [Registry of Open Data on AWS](https://registry.opendata.aws/nyc-tlc-trip-records-pds/). The dataset is partitioned into 132 CSV files.

In this notebook, we use the SageMaker XGBoost built-in algorithm to generate an ML model.

**Note**: The dataset used for the ML model is located here: `s3://aws-hcls-ml/public_assets_support_materials/canary_training_data/taxi_yellow_trip_data_processed`.

First, we set the canary training configuration and options. We will train on 1%, 2%, and 3% of the data, in triplicate.

In [3]:
import boto3
import sagemaker
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput
from time import gmtime, strftime
import random

role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.Session()
output_bucket = sagemaker_session.default_bucket()

instance_types=["ml.m5.24xlarge"]
#instance_types=["ml.m5.4xlarge","ml.m4.16xlarge","ml.p3.2xlarge"] #you can test multiple instances if you wish for canary training.

#set canary training parameters and inputs
output_s3_location=f"s3://{output_bucket}/taxi_output_data"
#create a random name for a local temporary directory, which will be copied to S3
#if such a directory already exists, you can point to it instead
random_number=random.randint(10000000, 99999999)
the_temp_dir=f"canary-training-temp-dir-{random_number}"



training_percentages=[.01,.01,.01,.02,.02,.02,.03,.03,.03] #train jobs in triplicate in order to increase statistical confidence
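
If you want to change the percentages or the number of replicates, the same list can be generated rather than typed out. The following is a minimal sketch; the names `percentages` and `replicates` are introduced here for illustration and are not part of the `canary_training` library.

```python
# Hypothetical helper variables (not part of canary_training).
percentages = [0.01, 0.02, 0.03]  # fractions of the data to train on
replicates = 3                    # how many jobs to run per fraction

# Repeat each fraction `replicates` times, keeping the fractions grouped.
training_percentages = [p for p in percentages for _ in range(replicates)]

print(training_percentages)
```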

In [4]:
print(output_bucket)

sagemaker-us-east-1-111918798052


Now we set the standard SageMaker Estimator parameters. Because this is just a test, we use the same data for both the `train` and `validation` channels.

In [5]:
#location of the input data for training; exclude the trailing "/" (use "taxi_yellow_trip_data_processed", not "taxi_yellow_trip_data_processed/")
data_location='s3://aws-hcls-ml/public_assets_support_materials/canary_training_data/taxi_yellow_trip_data_processed' 
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"reg:squarederror",
        "num_round":"50"}

# build a unique job name from the current timestamp and a random suffix
job_name = f"canary-train-experiment-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}-{str(random.random())}".replace(".","")
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")

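# placeholder value; the instance types to test are passed to CanaryTraining separately via instance_types
instance_type="None"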
# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=role,
                                          instance_count=1, 
                                          instance_type=instance_type, 
                                          volume_size=300,  #large dataset needs lots of disk space
                                          output_path=f'{output_s3_location}/{the_temp_dir}')


## Set up canary training jobs

We will set up the canary training by:
1. Creating samples of the underlying data.
2. Creating manifest files that will be used for these smaller training jobs.
3. Copying the manifest files to S3.
4. Building SageMaker estimators that will be used for these smaller training jobs.
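
To make steps 1 and 2 more concrete, here is a minimal sketch of how a sample of the partition files could be written as a SageMaker manifest (a JSON list whose first element is a `prefix` object, followed by keys relative to that prefix). This illustrates the idea only and is not the `canary_training` implementation: the `build_manifest` helper, the bucket name, and the file names are all hypothetical.

```python
import json
import random

def build_manifest(s3_prefix, file_names, fraction, seed=0):
    """Sample a fraction of the files and return a SageMaker-style manifest list."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    # Assumption: use whole files only, and always keep at least one file.
    n_files = max(1, int(len(file_names) * fraction))
    sample = sorted(rng.sample(file_names, n_files))
    prefix = s3_prefix.rstrip("/") + "/"
    return [{"prefix": prefix}] + sample

# Hypothetical partition names; the real file names are not shown in this notebook.
file_names = [f"part-{i:03d}.csv" for i in range(132)]

manifest = build_manifest(
    "s3://example-bucket/taxi_yellow_trip_data_processed",  # hypothetical location
    file_names,
    fraction=0.02,
)
print(json.dumps(manifest, indent=2))
```

A training job can then read only the sampled files by pointing a channel at the manifest with `s3_data_type="ManifestFile"`.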

In [6]:
ct=CanaryTraining(data_location=data_location,output_s3_location=output_s3_location,
                           the_temp_dir=the_temp_dir,instance_types=instance_types,estimator=estimator,training_percentages=training_percentages)

ct.prepare_canary_training_data()

aws s3 cp --recursive canary-training-temp-dir-06277901747192343 s3://sagemaker-us-east-1-111918798052/taxi_output_data/canary-training-temp-dir-06277901747192343/


## Kick off canary training jobs
Now that we have the list of estimators, let's kick off the canary training jobs.

**Note**: By default, the canary_training library kicks off all of the jobs in parallel. In this example, that means 9 jobs will run at the same time on `ml.m5.24xlarge` instances. If your account does not support that many concurrent jobs on this instance type (and you cannot request a quota increase), you can run the jobs serially instead.

If you run the jobs in parallel, the total time taken is about 15 minutes. If you run them one after another, it takes about 1.5 hours.

In [8]:
#kick off in parallel
ct.kick_off_canary_training_jobs(training_channels_list=['train','validation'],wait=False) #set wait=True if you cannot or do not want to run all jobs in parallel

Done Submitting Jobs
aws s3 cp canary-training-temp-dir-06277901747192343/data_files/canary_training_job_list.csv s3://sagemaker-us-east-1-111918798052/taxi_output_data/canary-training-temp-dir-06277901747192343/data_files/canary_training_job_list.csv


## Wait until the jobs are finished before continuing to the next section!
Before continuing, please make sure that all the jobs kicked off for canary training have finished. You can see these jobs in the `SageMaker Training` console.
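
If you would rather check from code than from the console, the sketch below polls the SageMaker `describe_training_job` API until every job reaches a terminal state. The helper `wait_for_training_jobs` is hypothetical (it is not part of `canary_training`), and you need to supply the job names yourself, for example from the job list file that the library copies to S3.

```python
import time

# Training job states after which the status no longer changes.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}

def wait_for_training_jobs(sm_client, job_names, poll_seconds=60, sleep=time.sleep):
    """Poll until all jobs are in a terminal state; return {job_name: status}."""
    pending = set(job_names)
    final_status = {}
    while pending:
        for name in sorted(pending):
            response = sm_client.describe_training_job(TrainingJobName=name)
            status = response["TrainingJobStatus"]
            if status in TERMINAL_STATES:
                final_status[name] = status
        pending.difference_update(final_status)
        if pending:
            sleep(poll_seconds)  # wait before asking again
    return final_status

# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# sm_client = boto3.client("sagemaker")
# statuses = wait_for_training_jobs(sm_client, ["<your-canary-job-name>"])
# print(statuses)
```

Check the returned statuses before moving on: a `Failed` or `Stopped` job will not contribute useful statistics to the extrapolation.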

## Gather Statistics and Perform Extrapolations

In the next section we will gather statistics around the training jobs, and use them to **extrapolate** resource consumption for the entire training job. We will do three things:

1. Extract relevant information from the training job and the SageMaker Profiler around CPU, RAM, and training time.
2. Report the extrapolated CPU usage, RAM, training time, and cost.
3. Report the raw CPU usage, RAM, and training time for the canary training jobs themselves. This will allow the user to make an informed decision based on this detailed information.

(Note that if a statistic is not relevant, it will have a value of -1 or -1.1.)

In [9]:
#submitted_jobs_information
predicted_resource_usage_df,raw_actual_resource_usage_df=ct.get_predicted_resource_consumption()

In [10]:
predicted_resource_usage_df.head()

| | Projected_CPUUtilization | Projected_MemoryUsedPercent | Projected_TrainingTimeInSeconds | Projected_GPUUtilization | Projected_GPUMemoryUtilization | price | Projected_TotalCost |
|---|---|---|---|---|---|---|---|
| ml.m5.24xlarge | 11.949157 | 56.596349 | 23482.546109 | -1.1 | -1.1 | 0.00153611 | 36.0718 |
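
As a quick sanity check on the table above, the projected total cost is consistent with multiplying the projected training time by the `price` column. The sketch below reproduces that arithmetic; the unit of `price` is not stated in the output, so treating it as a per-second price in US dollars is an assumption.

```python
# Values copied from the projection table above.
projected_seconds = 23482.546109  # Projected_TrainingTimeInSeconds
price_per_second = 0.00153611     # "price" column; assumed to be USD per second

projected_cost = projected_seconds * price_per_second
implied_hourly_price = price_per_second * 3600

print(f"Projected training time: {projected_seconds / 3600:.1f} hours")
print(f"Implied hourly price:    {implied_hourly_price:.2f}")
print(f"Projected total cost:    {projected_cost:.4f}")
```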


Now report the raw info from the canary jobs. 

**Note** that the `PercentageDataTrainedOn` column does not exactly match 1, 2, and 3 percent, because those percentages do not divide evenly into the number of data partitions (132).
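
The reported values 0.0075757... and 0.0151515... equal 1/132 and 2/132, which is what you get if 1% and 2% of 132 files are rounded down to whole files. The sketch below works through that arithmetic. The rounding rule (round down, but keep at least one file) is inferred from those two values and is an assumption, not taken from the library's source.

```python
import math

NUM_PARTITIONS = 132  # number of CSV partitions in the dataset

def effective_fraction(requested, num_partitions=NUM_PARTITIONS):
    """Return (whole files used, fraction of the data actually covered)."""
    # Assumption: round down to whole files, but never use fewer than one.
    n_files = max(1, math.floor(requested * num_partitions))
    return n_files, n_files / num_partitions

for requested in (0.01, 0.02, 0.03):
    n_files, fraction = effective_fraction(requested)
    print(f"requested {requested:.0%}: {n_files} file(s), effective fraction {fraction:.6f}")
```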

**Note**: Due to the stochastic nature of the canary jobs, the forecasts may change from run to run. For example, `Projected_TrainingTimeInSeconds` may be around 23,000, 28,000, or 40,000.

## Inspect Canary Training Job Results
You can inspect the underlying data for the canary training results. This is the data that was used to create the forecasts. While the forecasts may be useful, we strongly encourage data scientists to inspect the raw results as well. Note that `CPUUtilization`, `MemoryUsedPercent`, `GPUUtilization`, and `GPUMemoryUtilization` are all p99 values.

In [11]:
raw_actual_resource_usage_df.head()

| | TrainingJobStatus | TrainingTimeInSeconds | InstanceType | ManifestLocation | job_name | PercentageDataTrainedOn | CPUUtilization | I/OWaitPercentage | MemoryUsedPercent | GPUUtilization | GPUMemoryUtilization |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Completed | 318 | ml.m5.24xlarge | s3://sagemaker-us-east-1-111918798052/taxi_out... | canary-training--job-2021-12-22-17-35-28-01832... | 0.0075757575757575 | 0.871563 | 0.0 | 1.34 | -1.0 | -1 |
| 1 | Completed | 273 | ml.m5.24xlarge | s3://sagemaker-us-east-1-111918798052/taxi_out... | canary-training--job-2021-12-22-17-35-28-02161... | 0.0075757575757575 | 0.837396 | 0.0 | 1.34 | -1.0 | -1 |
| 2 | Completed | 302 | ml.m5.24xlarge | s3://sagemaker-us-east-1-111918798052/taxi_out... | canary-training--job-2021-12-22-17-35-34-09762... | 0.0075757575757575 | 0.854167 | 2.0 | 1.34 | -1.0 | -1 |
| 3 | Completed | 514 | ml.m5.24xlarge | s3://sagemaker-us-east-1-111918798052/taxi_out... | canary-training--job-2021-12-22-17-35-35-08214... | 0.0151515151515151 | 0.916667 | 0.0 | 1.89 | -1.0 | -1 |
| 4 | Completed | 562 | ml.m5.24xlarge | s3://sagemaker-us-east-1-111918798052/taxi_out... | canary-training--job-2021-12-22-17-35-36-07827... | 0.0151515151515151 | 0.96 | 0.0 | 1.97 | -1.0 | -1 |
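
To build intuition for how raw results like these turn into a forecast, the sketch below fits a straight line to the five rows shown above and evaluates it at 100% of the data. Two caveats apply. The linear model is an assumption made for illustration, since the model that `canary_training` actually fits is not described here. Also, only five of the nine canary jobs are displayed, so the result is not expected to match the library's projection.

```python
# (PercentageDataTrainedOn, TrainingTimeInSeconds) for the five rows shown above.
observations = [
    (0.0075757575757575, 318),
    (0.0075757575757575, 273),
    (0.0075757575757575, 302),
    (0.0151515151515151, 514),
    (0.0151515151515151, 562),
]

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(observations)

# Extrapolate to the full dataset (fraction = 1.0).
projected_seconds = slope * 1.0 + intercept
print(f"Projected full training time: {projected_seconds:.0f} s "
      f"(about {projected_seconds / 3600:.1f} hours)")
```

Because the canary jobs cover only a few percent of the data, small differences in their run times are magnified many times over in the extrapolation, which is one reason the forecasts vary between runs.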


## (Optional) Let's now kick off the full training job
If you wish, feel free to kick off the entire training job to check the results.

**NOTE** This training job takes around 8 hours to run.

In [9]:
#estimator.instance_type="ml.m5.24xlarge"
#content_type = "csv"
#train_input = data_location
#validation_input = data_location
#train_input=sagemaker.inputs.TrainingInput(train_input,content_type='csv')
#validation_input=sagemaker.inputs.TrainingInput(validation_input,content_type='csv')
#job_name=f"full-training--job-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}-{str(random.random())}".replace(".","")
#estimator.fit(inputs={'train': train_input, 'validation': validation_input},job_name=job_name,wait=False,logs="All")