{
"cells": [
{
"cell_type": "markdown",
"id": "controversial-leonard",
"metadata": {},
"source": [
"# Predict Training Time and SageMaker Training Instance RAM and CPU resource consumption\n",
"\n",
"This notebook walks through how you can use the `canary_training` library to generate projections of training time, RAM, and CPU usage (collectivley refered to here as \"resource consumption\").\n",
"\n",
"To briefly summarize, the canary_training library works by creating many small training jobs on small percentages of the data (generally, 1,2 and 3 percent). Based on the statistics gathered (using the SageMaker Profiler) it then extrapolates the resource consumption for the complete training job.\n",
"\n",
"**Note** If you are using a SageMaker Notebook Instance, please use the `conda_python3` kernel. If you are using SageMaker Studio, please use `Python 3 (Data Science)` kernel."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "excess-laptop",
"metadata": {},
"outputs": [],
"source": [
"import sagemaker\n",
"import pandas\n",
"import logging\n",
"logger = logging.getLogger('log')\n",
"#set logs if not done already\n",
"if not logger.handlers:\n",
" logger.setLevel(logging.INFO)\n",
" "
]
},
{
"cell_type": "markdown",
"id": "limiting-humanitarian",
"metadata": {},
"source": [
"This notebook relies on the `canary_training` package, which will be used for generating extrapolations."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "irish-choir",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/opt/conda/lib/python3.7/site-packages/secretstorage/dhcrypto.py:16: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead\n",
" from cryptography.utils import int_from_bytes\n",
"/opt/conda/lib/python3.7/site-packages/secretstorage/util.py:25: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead\n",
" from cryptography.utils import int_from_bytes\n",
"Processing /root/canary_training/Canary_Training/canary_training\n",
" Preparing metadata (setup.py) ... \u001b[?25ldone\n",
"\u001b[?25hBuilding wheels for collected packages: canary-training\n",
" Building wheel for canary-training (setup.py) ... \u001b[?25ldone\n",
"\u001b[?25h Created wheel for canary-training: filename=canary_training-0.1-py3-none-any.whl size=9728 sha256=84624f3f58dc1dd72d9a4f1e1a9a90e7f669a875dab0b6bf469f00615623f14d\n",
" Stored in directory: /tmp/pip-ephem-wheel-cache-ecwlfep4/wheels/56/ba/a4/c4d62105eeb5cba3ce5d0c713d357fb5dd59a87ad3e87fb8ea\n",
"Successfully built canary-training\n",
"Installing collected packages: canary-training\n",
" Attempting uninstall: canary-training\n",
" Found existing installation: canary-training 0.1\n",
" Uninstalling canary-training-0.1:\n",
" Successfully uninstalled canary-training-0.1\n",
"Successfully installed canary-training-0.1\n",
"\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\n"
]
}
],
"source": [
"#In SageMaker Studio\n",
"#install from canary training library, which is in directory above.\n",
"!pip install ../canary_training/ \n",
"#in a SageMaker Notebook Instance\n",
"#!pip install /home/ec2-user/SageMaker/canary_training/Canary_Training/canary_training #make sure this points to the canary_training directory\n",
"from canary_training import *"
]
},
{
"cell_type": "markdown",
"id": "revised-brazilian",
"metadata": {},
"source": [
"## Setup the Canary Job estimator and parameters\n",
"Before using canary_training to generate predictions of resource consumption, we need to define a few things.\n",
"\n",
"1. A standard SageMaker estimator which defines our model.\n",
"2. The instance(s) that we want to test.\n",
"3. How many data points we want to make predictions based on.\n",
"\n",
"In this example, we will try to predict resource consumption (i.e. CPU, RAM, and training time) when training on a `ml.m5.24xlarge`.\n",
"\n",
"In this example, the goal to predict the training time and other training information for the tabular NYC Taxi dataset. This dataset is 40GB after preprocessing to remove non numeric features. It has 7 columns; the ML problem is to predict the `total_amount` column (i.e. the total fare) based on the 6 other features. The raw data was downloaded from S3 comes originally from the [Registry of Open Data on AWS](https://registry.opendata.aws/nyc-tlc-trip-records-pds/). The dataset is partitioned into 132 csvs.\n",
"\n",
"In this notebook, we use the SageMaker XGBoost built-in algorithm to generate an ML model.\n",
"\n",
"**Note**: The dataset used for the ML model is located here: `s3://aws-hcls-ml/public_assets_support_materials/taxi_yellow_trip_data_processed`."
]
},
{
"cell_type": "markdown",
"id": "functioning-pottery",
"metadata": {},
"source": [
"First we will set canary training configuration and options. We will be training on 1%,2% and 3% of the data in triplicate."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "accepted-domestic",
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"import sagemaker\n",
"from sagemaker import image_uris\n",
"from sagemaker.session import Session\n",
"from sagemaker.inputs import TrainingInput\n",
"from time import gmtime,strftime\n",
"import random\n",
"\n",
"role = sagemaker.get_execution_role()\n",
"region = boto3.Session().region_name\n",
"sagemaker_session = sagemaker.Session()\n",
"output_bucket = sagemaker_session.default_bucket()\n",
"\n",
"instance_types=[\"ml.m5.24xlarge\"]\n",
"#instance_types=[\"ml.m5.4xlarge\",\"ml.m4.16xlarge\",\"ml.p3.2xlarge\"] #you can test multiple instances if you wish for canary training.\n",
"\n",
"#set canary training parameters and inputs\n",
"output_s3_location=f\"s3://{output_bucket}/taxi_output_data\"\n",
"#create a random local temporary directory which will be copied to s3\n",
"#If this exists already, you can just point to it already\n",
"random_number=random.randint(10000000, 99999999)\n",
"the_temp_dir=f\"canary-training-temp-dir-{str(random_number)}\" \n",
"\n",
"\n",
"\n",
"training_percentages=[.01,.01,.01,.02,.02,.02,.03,.03,.03] #train jobs in triplicate in order to increase statistical confidence"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "extraordinary-century",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sagemaker-us-east-1-111918798052\n"
]
}
],
"source": [
"print(output_bucket)"
]
},
{
"cell_type": "markdown",
"id": "detected-dollar",
"metadata": {},
"source": [
"Now we set standard SageMaker Estimator parameters. Because this is just a test, we use the same data for both the `training` and `validation` channel."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "athletic-neighbor",
"metadata": {},
"outputs": [],
"source": [
"#location of input data for training make sure to exclude the final \"/\". \"taxi_yellow_trip_data_processed\" and not \"taxi_yellow_trip_data_processed/\"\n",
"data_location='s3://aws-hcls-ml/public_assets_support_materials/canary_training_data/taxi_yellow_trip_data_processed' \n",
"hyperparameters = {\n",
" \"max_depth\":\"5\",\n",
" \"eta\":\"0.2\",\n",
" \"gamma\":\"4\",\n",
" \"min_child_weight\":\"6\",\n",
" \"subsample\":\"0.7\",\n",
" \"objective\":\"reg:squarederror\",\n",
" \"num_round\":\"50\"}\n",
"\n",
"# set an output path where the trained model will be saved\n",
"job_name = f\"canary-train-experiment-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}-{str(random.random())}\".replace(\".\",\"\")\n",
"xgboost_container = sagemaker.image_uris.retrieve(\"xgboost\", region, \"1.2-1\")\n",
"\n",
"instance_type=\"None\"\n",
"# construct a SageMaker estimator that calls the xgboost-container\n",
"estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, \n",
" hyperparameters=hyperparameters,\n",
" role=role,\n",
" instance_count=1, \n",
" instance_type=instance_type, \n",
" volume_size=300, #large dataset needs lots of disk space\n",
" output_path=f'{output_s3_location}/{the_temp_dir}')\n"
]
},
{
"cell_type": "markdown",
"id": "central-technique",
"metadata": {},
"source": [
"## Set up canary training jobs\n",
"\n",
"We will set up the canary training by:\n",
"1. Creating samples of the underlying data\n",
"2. Create manifest files that will be used for these smaller training jobs\n",
"3. Copy the underlying manifest files to S3.\n",
"4. Build estimators for SageMaker that will be used for these smaller training jobs."
]
},
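{
"cell_type": "markdown",
"id": "sketch-manifest-sampling",
"metadata": {},
"source": [
"The sampling and manifest steps listed above can be sketched as follows. This is a minimal illustration and not the `canary_training` implementation, which is not shown in this notebook: `build_manifest`, the bucket name, and the `part-NNN.csv` file names are hypothetical. It assumes whole files are sampled and uses the SageMaker `ManifestFile` layout, a JSON array whose first element holds the common S3 prefix and whose remaining elements are object keys relative to that prefix.\n",
"\n",
"```python\n",
"import json\n",
"import random\n",
"\n",
"\n",
"def build_manifest(prefix, file_names, fraction, seed=0):\n",
"    # Select whole files; always keep at least one so the job has data.\n",
"    n_files = max(1, int(len(file_names) * fraction))\n",
"    rng = random.Random(seed)\n",
"    sampled = sorted(rng.sample(file_names, n_files))\n",
"    # ManifestFile layout: a {'prefix': ...} entry followed by relative keys.\n",
"    return json.dumps([{'prefix': prefix}] + sampled, indent=2)\n",
"\n",
"\n",
"# Hypothetical file names; the real partition names are not shown here.\n",
"files = [f'part-{i:03d}.csv' for i in range(132)]\n",
"manifest = build_manifest('s3://example-bucket/taxi/', files, fraction=0.02)\n",
"print(manifest)\n",
"```\n",
"\n",
"In this sketch only the small manifest file would need to be uploaded for each canary job; the sampled data files themselves stay where they are in S3."
]
},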
{
"cell_type": "code",
"execution_count": 6,
"id": "preliminary-animation",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"aws s3 cp --recursive canary-training-temp-dir-06277901747192343 s3://sagemaker-us-east-1-111918798052/taxi_output_data/canary-training-temp-dir-06277901747192343/\n"
]
}
],
"source": [
"ct=CanaryTraining(data_location=data_location,output_s3_location=output_s3_location,\n",
" the_temp_dir=the_temp_dir,instance_types=instance_types,estimator=estimator,training_percentages=training_percentages)\n",
"\n",
"ct.prepare_canary_training_data()"
]
},
{
"cell_type": "markdown",
"id": "blind-canadian",
"metadata": {},
"source": [
"## Kick of canary training jobs\n",
"Now that we have the list of estimators, let's kick off the canary training jobs.\n",
"**Note**: By default, the canary_training library kicks off all of the jobs in parallel. For this example, this will mean that there will be 9 jobs on a `ml.m5.24xlarge` running. If your account does not support this many jobs of that instance type (and you cannot request an increase), you can run each job serially.\n",
"\n",
"If you run the jobs in parallel, the total amount of time taken is about 15 minutes. If you run them one-after-another, it takes about 1.5 hours. "
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "maritime-shopper",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done Submitting Jobs\n",
"aws s3 cp canary-training-temp-dir-06277901747192343/data_files/canary_training_job_list.csv s3://sagemaker-us-east-1-111918798052/taxi_output_data/canary-training-temp-dir-06277901747192343/data_files/canary_training_job_list.csv\n"
]
}
],
"source": [
"#kick off in parallel\n",
"ct.kick_off_canary_training_jobs(training_channels_list=['train','validation'],wait=False) #set wait equal to True if you cannot/do not want to run all jobs in paralell "
]
},
{
"cell_type": "markdown",
"id": "raised-tackle",
"metadata": {},
"source": [
"## Wait until the jobs are finished before continuing in the next section!!!\n",
"Before continuing, please make sure that all the jobs kicked off for canary training are finished. You can see these jobs in the `SageMake Training` console. "
]
},
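{
"cell_type": "markdown",
"id": "sketch-wait-for-jobs",
"metadata": {},
"source": [
"If you prefer to check job status from code rather than from the console, the following is a minimal sketch. It assumes a boto3 SageMaker client, whose `describe_training_job` call returns a `TrainingJobStatus` field. The helper `wait_for_jobs` and the `job_names` list are hypothetical: this notebook does not show how `canary_training` exposes the names of the submitted jobs, so you would need to supply them yourself.\n",
"\n",
"```python\n",
"import time\n",
"\n",
"# Statuses after which a SageMaker training job no longer changes.\n",
"TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}\n",
"\n",
"\n",
"def wait_for_jobs(sm_client, job_names, poll_seconds=60):\n",
"    # Poll every job until all of them have reached a terminal status.\n",
"    statuses = {}\n",
"    pending = set(job_names)\n",
"    while pending:\n",
"        for name in sorted(pending):\n",
"            response = sm_client.describe_training_job(TrainingJobName=name)\n",
"            statuses[name] = response['TrainingJobStatus']\n",
"        pending = {n for n in pending if statuses[n] not in TERMINAL_STATUSES}\n",
"        if pending:\n",
"            time.sleep(poll_seconds)\n",
"    return statuses\n",
"\n",
"\n",
"# Usage (needs AWS credentials; job_names is a list you provide):\n",
"# import boto3\n",
"# statuses = wait_for_jobs(boto3.client('sagemaker'), job_names)\n",
"```\n",
"\n",
"The function returns the final status of every job, so failed or stopped canary jobs can be spotted before moving on to the extrapolation."
]
},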
{
"cell_type": "markdown",
"id": "chicken-adams",
"metadata": {},
"source": [
"## Gather Statistics and Perform Extrapolations\n",
"\n",
"In the next section we will gather statistics around the training jobs, and use them to **extrapolate** resource consumption for the entire training job. We will do three things:\n",
"\n",
"1. Extract relevant information from the training job and the SageMaker Profiler around CPU, RAM, and Training Time.\n",
"2. Report the extrapolated CPU usage, RAM, and Training Time and cost.\n",
"3. Report the raw CPU usage, RAM, and Training Time for the canary training jobs themselves. This will allow the user to make an informed decision based on this detailed information.\n",
"\n",
"(note that if a statistic is not relevant, it will have a value of -1 or -1.1)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "authorized-ancient",
"metadata": {},
"outputs": [],
"source": [
"#submitted_jobs_information\n",
"predicted_resource_usage_df,raw_actual_resource_usage_df=ct.get_predicted_resource_consumption()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "developmental-strain",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Projected_CPUUtilization | \n",
" Projected_MemoryUsedPercent | \n",
" Projected_TrainingTimeInSeconds | \n",
" Projected_GPUUtilization | \n",
" Projected_GPUMemoryUtilization | \n",
" price | \n",
" Projected_TotalCost | \n",
"
\n",
" \n",
" \n",
" \n",
" ml.m5.24xlarge | \n",
" 11.949157 | \n",
" 56.596349 | \n",
" 23482.546109 | \n",
" -1.1 | \n",
" -1.1 | \n",
" 0.00153611 | \n",
" 36.0718 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Projected_CPUUtilization Projected_MemoryUsedPercent \\\n",
"ml.m5.24xlarge 11.949157 56.596349 \n",
"\n",
" Projected_TrainingTimeInSeconds Projected_GPUUtilization \\\n",
"ml.m5.24xlarge 23482.546109 -1.1 \n",
"\n",
" Projected_GPUMemoryUtilization price Projected_TotalCost \n",
"ml.m5.24xlarge -1.1 0.00153611 36.0718 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predicted_resource_usage_df.head()"
]
},
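{
"cell_type": "markdown",
"id": "sketch-projected-cost",
"metadata": {},
"source": [
"As a quick check on how the columns in the projection table above relate to each other, `Projected_TotalCost` appears to be `Projected_TrainingTimeInSeconds` multiplied by `price`, which suggests that `price` is a per-second rate. This is an inference from the numbers shown, not something taken from the library documentation. The values below are copied from the table.\n",
"\n",
"```python\n",
"# Values copied from the projection table above.\n",
"projected_training_seconds = 23482.546109\n",
"price_per_second = 0.00153611  # the price column, read here as USD per second\n",
"\n",
"projected_total_cost = projected_training_seconds * price_per_second\n",
"projected_hours = projected_training_seconds / 3600\n",
"\n",
"print(f'{projected_hours:.2f} hours, ${projected_total_cost:.4f}')\n",
"```\n",
"\n",
"The product agrees with the reported `Projected_TotalCost` to the precision shown in the table."
]
},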
{
"cell_type": "markdown",
"id": "activated-cleveland",
"metadata": {},
"source": [
"Now report the raw info from the canary jobs. \n",
"\n",
"**Note** that the `PercentageDataTrainedOn` column does not exactly match the 1,2 and 3 percentages due to those numbers not evenly dividing into the number of partitions of the data (132 partitions).\n",
"\n",
"**Note** Due to the stochastic nature of the canary jobs, the forecasts that you get may change between run to run. For example, Projected_TrainingTimeInSeconds may be around 23,000, 28,000 or 40,000."
]
},
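{
"cell_type": "markdown",
"id": "sketch-effective-fraction",
"metadata": {},
"source": [
"The mismatch between requested and actual percentages can be illustrated with a short sketch. It assumes that whole partition files are selected and that the file count is truncated to an integer (with a minimum of one file). That assumption is consistent with the `PercentageDataTrainedOn` values displayed in the raw results, which equal 1/132 and 2/132, but the library's actual rounding rule is not shown in this notebook. The helper name `effective_fraction` is hypothetical.\n",
"\n",
"```python\n",
"def effective_fraction(requested_fraction, n_partitions):\n",
"    # Assumption: whole files are selected and the count is truncated,\n",
"    # but never drops below one file.\n",
"    n_files = max(1, int(requested_fraction * n_partitions))\n",
"    return n_files / n_partitions\n",
"\n",
"\n",
"for requested in (0.01, 0.02, 0.03):\n",
"    print(requested, effective_fraction(requested, 132))\n",
"```\n",
"\n",
"With 132 partitions, a request for 1% (1.32 files) becomes one file and a request for 2% (2.64 files) becomes two files under this assumption."
]
},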
{
"cell_type": "markdown",
"id": "therapeutic-plain",
"metadata": {},
"source": [
"## Inspect Canary Training Job Results\n",
"You can inspect the underlying data for the canary training results. This is the data that was used to create the forcasts. While the forecasts may be useful, we strongly encourage data scientists to inspect the raw results as well. Note that CPUUtilization,MemoryUsedPercent,GPUUtilization,and GPUMemoryUtilization are all p99 values."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "exact-drive",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" TrainingJobStatus | \n",
" TrainingTimeInSeconds | \n",
" InstanceType | \n",
" ManifestLocation | \n",
" job_name | \n",
" PercentageDataTrainedOn | \n",
" CPUUtilization | \n",
" I/OWaitPercentage | \n",
" MemoryUsedPercent | \n",
" GPUUtilization | \n",
" GPUMemoryUtilization | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Completed | \n",
" 318 | \n",
" ml.m5.24xlarge | \n",
" s3://sagemaker-us-east-1-111918798052/taxi_out... | \n",
" canary-training--job-2021-12-22-17-35-28-01832... | \n",
" 0.007575757575757576 | \n",
" 0.871563 | \n",
" 0.0 | \n",
" 1.34 | \n",
" -1.0 | \n",
" -1 | \n",
"
\n",
" \n",
" 1 | \n",
" Completed | \n",
" 273 | \n",
" ml.m5.24xlarge | \n",
" s3://sagemaker-us-east-1-111918798052/taxi_out... | \n",
" canary-training--job-2021-12-22-17-35-28-02161... | \n",
" 0.007575757575757576 | \n",
" 0.837396 | \n",
" 0.0 | \n",
" 1.34 | \n",
" -1.0 | \n",
" -1 | \n",
"
\n",
" \n",
" 2 | \n",
" Completed | \n",
" 302 | \n",
" ml.m5.24xlarge | \n",
" s3://sagemaker-us-east-1-111918798052/taxi_out... | \n",
" canary-training--job-2021-12-22-17-35-34-09762... | \n",
" 0.007575757575757576 | \n",
" 0.854167 | \n",
" 2.0 | \n",
" 1.34 | \n",
" -1.0 | \n",
" -1 | \n",
"
\n",
" \n",
" 3 | \n",
" Completed | \n",
" 514 | \n",
" ml.m5.24xlarge | \n",
" s3://sagemaker-us-east-1-111918798052/taxi_out... | \n",
" canary-training--job-2021-12-22-17-35-35-08214... | \n",
" 0.015151515151515152 | \n",
" 0.916667 | \n",
" 0.0 | \n",
" 1.89 | \n",
" -1.0 | \n",
" -1 | \n",
"
\n",
" \n",
" 4 | \n",
" Completed | \n",
" 562 | \n",
" ml.m5.24xlarge | \n",
" s3://sagemaker-us-east-1-111918798052/taxi_out... | \n",
" canary-training--job-2021-12-22-17-35-36-07827... | \n",
" 0.015151515151515152 | \n",
" 0.960000 | \n",
" 0.0 | \n",
" 1.97 | \n",
" -1.0 | \n",
" -1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" TrainingJobStatus TrainingTimeInSeconds InstanceType \\\n",
"0 Completed 318 ml.m5.24xlarge \n",
"1 Completed 273 ml.m5.24xlarge \n",
"2 Completed 302 ml.m5.24xlarge \n",
"3 Completed 514 ml.m5.24xlarge \n",
"4 Completed 562 ml.m5.24xlarge \n",
"\n",
" ManifestLocation \\\n",
"0 s3://sagemaker-us-east-1-111918798052/taxi_out... \n",
"1 s3://sagemaker-us-east-1-111918798052/taxi_out... \n",
"2 s3://sagemaker-us-east-1-111918798052/taxi_out... \n",
"3 s3://sagemaker-us-east-1-111918798052/taxi_out... \n",
"4 s3://sagemaker-us-east-1-111918798052/taxi_out... \n",
"\n",
" job_name PercentageDataTrainedOn \\\n",
"0 canary-training--job-2021-12-22-17-35-28-01832... 0.007575757575757576 \n",
"1 canary-training--job-2021-12-22-17-35-28-02161... 0.007575757575757576 \n",
"2 canary-training--job-2021-12-22-17-35-34-09762... 0.007575757575757576 \n",
"3 canary-training--job-2021-12-22-17-35-35-08214... 0.015151515151515152 \n",
"4 canary-training--job-2021-12-22-17-35-36-07827... 0.015151515151515152 \n",
"\n",
" CPUUtilization I/OWaitPercentage MemoryUsedPercent GPUUtilization \\\n",
"0 0.871563 0.0 1.34 -1.0 \n",
"1 0.837396 0.0 1.34 -1.0 \n",
"2 0.854167 2.0 1.34 -1.0 \n",
"3 0.916667 0.0 1.89 -1.0 \n",
"4 0.960000 0.0 1.97 -1.0 \n",
"\n",
" GPUMemoryUtilization \n",
"0 -1 \n",
"1 -1 \n",
"2 -1 \n",
"3 -1 \n",
"4 -1 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"raw_actual_resource_usage_df.head()"
]
},
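{
"cell_type": "markdown",
"id": "sketch-linear-extrapolation",
"metadata": {},
"source": [
"To build intuition for how raw canary results turn into a forecast, the sketch below fits a straight line to the five rows displayed above and extrapolates the training time to 100% of the data. This is an illustration only. It assumes training time grows linearly with the fraction of data, it uses just the five rows that `head()` shows rather than all nine jobs, and it is not necessarily the method `canary_training` uses, so the result is not expected to match `Projected_TrainingTimeInSeconds`. The helper `fit_line` is hypothetical.\n",
"\n",
"```python\n",
"# The five displayed rows: (fraction of data, training time in seconds).\n",
"observations = [\n",
"    (0.007575757575757576, 318),\n",
"    (0.007575757575757576, 273),\n",
"    (0.007575757575757576, 302),\n",
"    (0.015151515151515152, 514),\n",
"    (0.015151515151515152, 562),\n",
"]\n",
"\n",
"\n",
"def fit_line(points):\n",
"    # Ordinary least squares for y = slope * x + intercept.\n",
"    n = len(points)\n",
"    mean_x = sum(x for x, _ in points) / n\n",
"    mean_y = sum(y for _, y in points) / n\n",
"    sxx = sum((x - mean_x) ** 2 for x, _ in points)\n",
"    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)\n",
"    slope = sxy / sxx\n",
"    return slope, mean_y - slope * mean_x\n",
"\n",
"\n",
"slope, intercept = fit_line(observations)\n",
"projected_full_time = slope * 1.0 + intercept  # extrapolate to 100% of the data\n",
"print(f'projected full training time: {projected_full_time:.0f} s')\n",
"```\n",
"\n",
"Because the canary jobs cover only a few percent of the data, small differences between replicate runs are magnified many times over in the extrapolation, which is one reason the forecasts vary from run to run."
]
},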
{
"cell_type": "markdown",
"id": "professional-constant",
"metadata": {},
"source": [
"# (Optional) Lets now kick off the actual training full job.\n",
"If you wish, feel free to kick off the entire training job to check the results.\n",
"\n",
"**NOTE** This training job takes around 8 hours to run."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "approved-effectiveness",
"metadata": {},
"outputs": [],
"source": [
"#estimator.instance_type=\"ml.m5.24xlarge\"\n",
"#content_type = \"csv\"\n",
"#train_input = data_location\n",
"#validation_input = data_location\n",
"#train_input=sagemaker.inputs.TrainingInput(train_input,content_type='csv')\n",
"#validation_input=sagemaker.inputs.TrainingInput(validation_input,content_type='csv')\n",
"#job_name=job_name=f\"full-training--job-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}-{str(random.random())}\".replace(\".\",\"\")\n",
"#estimator.fit(inputs={'train': train_input, 'validation': validation_input},job_name=job_name,wait=False,logs=\"All\")"
]
}
],
"metadata": {
"instance_type": "ml.m5.large",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}