{ "cells": [ { "cell_type": "markdown", "id": "lesbian-designer", "metadata": {}, "source": [ "# Predict Needed RAM for a Training Instance \n", "\n", "This notebook shows how the `canary_training` library can be used to detect when a training instance is underprovisioned for the training task at hand.\n", "\n", "This notebook is identical to the `2_quick_start_canary_training_example_taxi_data.ipynb` notebook, except that it we will use the smaller `ml.m5.4xlarge` instance. We will use the `canary_training` library to detect that the requested instance is forecasted to **not have sufficient RAM for the task at hand**. This provides the data scientist with information for next steps, such as if they want to do parallel training or pick an instance with more compute capacity.\n", "\n", "\n", "**Note** If you are using a SageMaker Notebook Instance, please use the `conda_python3` kernel. If you are using SageMaker Studio, please use `Python 3 (Data Science)` kernel." ] }, { "cell_type": "code", "execution_count": 2, "id": "civil-magic", "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "import pandas\n", "import logging\n", "logger = logging.getLogger('log')\n", "#set logs if not done already\n", "if not logger.handlers:\n", " logger.setLevel(logging.INFO)\n", " " ] }, { "cell_type": "markdown", "id": "provincial-projection", "metadata": {}, "source": [ "This notebook relies on the `canary_training` package, which will be used for generating extrapolations." ] }, { "cell_type": "code", "execution_count": null, "id": "found-charles", "metadata": {}, "outputs": [], "source": [ "#In SageMaker Studio\n", "#install from canary training library, which is in directory above.\n", "!pip install ../canary_training/ \n", "#in a SageMaker Notebook Instance\n", "#!pip install /home/ec2-user/SageMaker/canary_training/Canary_Training/canary_training #make sure this points to the canary_training directory\n", "from canary_training import *" ] }, { "cell_type": "markdown", "id": "solid-bennett", "metadata": {}, "source": [ "## Setup the Canary Job estimator and parameters\n", "Before using canary_training to generate predictions of resource consumption, we need to define a few things.\n", "\n", "1. A standard SageMaker estimator which defines our model.\n", "2. The instance(s) that we want to test.\n", "3. How many data points we want to make predictions based on.\n", "\n", "In this example, we will try to predict resource consumption (i.e. CPU, RAM, and training time) when training on a `ml.m5.24xlarge`.\n", "\n", "In this example, the goal to predict the training time and other training information for the tabular NYC Taxi dataset. This dataset is 40GB after preprocessing to remove non numeric features. It has 7 columns; the ML problem is to predict the `total_amount` column (i.e. the total fare) based on the 6 other features. The raw data was downloaded from S3 comes originally from the [Registry of Open Data on AWS](https://registry.opendata.aws/nyc-tlc-trip-records-pds/). The dataset is partitioned into 132 csvs.\n", "\n", "In this notebook, we use the SageMaker XGBoost built-in algorithm to generate an ML model.\n", "\n", "**Note**: The dataset used for the ML model is located here: `s3://aws-hcls-ml/public_assets_support_materials/taxi_yellow_trip_data_processed`." ] }, { "cell_type": "markdown", "id": "vocal-given", "metadata": {}, "source": [ "First we will set canary training configuration and options. We will be training on 1%,2% and 3% of the data in triplicate." ] }, { "cell_type": "code", "execution_count": 4, "id": "informal-margin", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import sagemaker\n", "from sagemaker import image_uris\n", "from sagemaker.session import Session\n", "from sagemaker.inputs import TrainingInput\n", "from time import gmtime,strftime\n", "import random\n", "\n", "role = sagemaker.get_execution_role()\n", "region = boto3.Session().region_name\n", "sagemaker_session = sagemaker.Session()\n", "output_bucket = sagemaker_session.default_bucket()\n", "\n", "instance_types=[\"ml.m5.4xlarge\"]\n", "#instance_types=[\"ml.m5.4xlarge\",\"ml.m4.16xlarge\",\"ml.p3.2xlarge\"] #you can test multiple instances if you wish for canary training.\n", "\n", "#set canary training parameters and inputs\n", "output_s3_location=f\"s3://{output_bucket}/taxi_output_data\"\n", "#create a random local temporary directory which will be copied to s3\n", "#If this exists already, you can just point to it already\n", "random_number=random.randint(10000000, 99999999)\n", "the_temp_dir=f\"canary-training-temp-dir-{str(random_number)}\" \n", "\n", "\n", "\n", "training_percentages=[.01,.01,.01,.02,.02,.02,.03,.03,.03] #train jobs in triplicate in order to increase statistical confidence" ] }, { "cell_type": "code", "execution_count": 5, "id": "visible-illinois", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sagemaker-us-east-1-111918798052\n" ] } ], "source": [ "print(output_bucket)" ] }, { "cell_type": "markdown", "id": "improving-porter", "metadata": {}, "source": [ "Now we set standard SageMaker Estimator parameters. Because this is just a test, we use the same data for both the `training` and `validation` channel." ] }, { "cell_type": "code", "execution_count": 6, "id": "virtual-psychology", "metadata": {}, "outputs": [], "source": [ "#location of input data for training make sure to exclude the final \"/\". \"taxi_yellow_trip_data_processed\" and not \"taxi_yellow_trip_data_processed/\"\n", "data_location='s3://aws-hcls-ml/public_assets_support_materials/canary_training_data/taxi_yellow_trip_data_processed' \n", "hyperparameters = {\n", " \"max_depth\":\"5\",\n", " \"eta\":\"0.2\",\n", " \"gamma\":\"4\",\n", " \"min_child_weight\":\"6\",\n", " \"subsample\":\"0.7\",\n", " \"objective\":\"reg:squarederror\",\n", " \"num_round\":\"50\"}\n", "\n", "# set an output path where the trained model will be saved\n", "job_name = f\"canary-train-experiment-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}-{str(random.random())}\".replace(\".\",\"\")\n", "xgboost_container = sagemaker.image_uris.retrieve(\"xgboost\", region, \"1.2-1\")\n", "\n", "instance_type=\"None\"\n", "# construct a SageMaker estimator that calls the xgboost-container\n", "estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, \n", " hyperparameters=hyperparameters,\n", " role=role,\n", " instance_count=1, \n", " instance_type=instance_type, \n", " volume_size=300, #large dataset needs lots of disk space\n", " output_path=f'{output_s3_location}/{the_temp_dir}')\n" ] }, { "cell_type": "markdown", "id": "opponent-aircraft", "metadata": {}, "source": [ "## Set up canary training jobs\n", "\n", "We will set up the canary training by:\n", "1. Creating samples of the underlying data\n", "2. Create manifest files that will be used for these smaller training jobs\n", "3. Copy the underlying manifest files to S3.\n", "4. Build estimators for SageMaker that will be used for these smaller training jobs." ] }, { "cell_type": "code", "execution_count": 7, "id": "headed-credit", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "aws s3 cp --recursive canary-training-temp-dir-34087732 s3://sagemaker-us-east-1-111918798052/taxi_output_data/canary-training-temp-dir-34087732/\n" ] } ], "source": [ "ct=CanaryTraining(data_location=data_location,output_s3_location=output_s3_location,\n", " the_temp_dir=the_temp_dir,instance_types=instance_types,estimator=estimator,training_percentages=training_percentages)\n", "\n", "ct.prepare_canary_training_data()" ] }, { "cell_type": "markdown", "id": "acute-mathematics", "metadata": {}, "source": [ "## Kick of canary training jobs\n", "Now that we have the list of estimators, let's kick off the canary training jobs.\n", "**Note**: By default, the canary_training library kicks off all of the jobs in parallel. For this example, this will mean that there will be 9 jobs on a `ml.m5.24xlarge` running. If your account does not support this many jobs of that instance type (and you cannot request an increase), you can run each job serially.\n", "\n", "If you run the jobs in parallel, the total amount of time taken is about 15 minutes. If you run them one-after-another, it takes about 1.5 hours. " ] }, { "cell_type": "code", "execution_count": 8, "id": "yellow-browser", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done Submitting Jobs\n", "aws s3 cp canary-training-temp-dir-34087732/data_files/canary_training_job_list.csv s3://sagemaker-us-east-1-111918798052/taxi_output_data/canary-training-temp-dir-34087732/data_files/canary_training_job_list.csv\n" ] } ], "source": [ "#kick off in parallel\n", "ct.kick_off_canary_training_jobs(training_channels_list=['train','validation'],wait=False) #set wait equal to True if you cannot/do not want to run all jobs in parallel " ] }, { "cell_type": "markdown", "id": "given-reduction", "metadata": {}, "source": [ "## Wait until the jobs are finished before continuing in the next section!!!\n", "Before continuing, please make sure that all the jobs kicked off for canary training are finished. You can see these jobs in the `SageMake Training` console. " ] }, { "cell_type": "markdown", "id": "decreased-communication", "metadata": {}, "source": [ "## Gather Statistics and Perform Extrapolations\n", "\n", "In the next section we will gather statistics around the training jobs, and use them to **extrapolate** resource consumption for the entire training job. We will do three things:\n", "\n", "1. Extract relevant information from the training job and the SageMaker Profiler around CPU, RAM, and Training Time.\n", "2. Report the extrapolated CPU usage, RAM, and Training Time and cost.\n", "3. Report the raw CPU usage, RAM, and Training Time for the canary training jobs themselves. This will allow the user to make an informed decision based on this detailed information.\n", "\n", "(note that if a statistic is not relevant, it will have a value of -1 or -1.1)" ] }, { "cell_type": "code", "execution_count": 9, "id": "baking-climate", "metadata": {}, "outputs": [], "source": [ "#submitted_jobs_information\n", "predicted_resource_usage_df,raw_actual_resource_usage_df=ct.get_predicted_resource_consumption()" ] }, { "cell_type": "code", "execution_count": 10, "id": "polar-captain", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Projected_CPUUtilizationProjected_MemoryUsedPercentProjected_TrainingTimeInSecondsProjected_GPUUtilizationProjected_GPUMemoryUtilizationpriceProjected_TotalCost
ml.m5.4xlarge6.875549.04958425167.877657-1.1-1.10.0002561116.44577
\n", "
" ], "text/plain": [ " Projected_CPUUtilization Projected_MemoryUsedPercent \\\n", "ml.m5.4xlarge 6.875 549.049584 \n", "\n", " Projected_TrainingTimeInSeconds Projected_GPUUtilization \\\n", "ml.m5.4xlarge 25167.877657 -1.1 \n", "\n", " Projected_GPUMemoryUtilization price Projected_TotalCost \n", "ml.m5.4xlarge -1.1 0.000256111 6.44577 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predicted_resource_usage_df.head()" ] }, { "cell_type": "markdown", "id": "pending-smoke", "metadata": {}, "source": [ "As can be seen from the `Projected_MemoryUsedPercent` forecast, the forecasted memory utilization is well above the maximum allowed of 100%; this would be expected to result in an `Out of Memory` error. The data scienctist should take note of this, and **not** use this instance type for training the XGboost model on the full dataset." ] } ], "metadata": { "instance_type": "ml.m5.large", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 5 }