{
"cells": [
{
"cell_type": "markdown",
"id": "375c8b68",
"metadata": {},
"source": [
"# Train a TensorFlow 2.x model with custom training loop on the Amazon SageMaker optimized TensorFlow container and debug using Amazon SageMaker Debugger\n",
"\n",
"[Amazon SageMaker](https://aws.amazon.com/sagemaker/) is a fully managed machine learning service. With SageMaker, you have the option of using the built-in algorithms as well as bringing your own algorithms and frameworks. One such framework is TensorFlow 2.x. [Amazon SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html) debugs, monitors and profiles training jobs in real time thereby helping with detecting non-converging conditions, optimizing resource utilization by eliminating bottlenecks, improving training time and reducing costs of your machine learning models.\n",
"\n",
"This notebook demonstrates how to use a SageMaker optimized TensorFlow 2.x container to train a multi-class image classification model using the [Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist) using a custom training loop i.e. customizes what goes on in the `fit()` loop. It also demonstrates how to debug using SageMaker Debugger. Finally the debugger's output is analyzed. This will take your training script and use SageMaker in script mode.\n",
"\n",
"**Note:**\n",
"\n",
"* This notebook should only be run from within a SageMaker notebook instance as it references SageMaker native APIs.\n",
"* At the time of writing this notebook, the most relevant latest version of the Jupyter notebook kernel for this notebook was `conda_python3`.\n",
"* Although the training step in this notebook supports both CPU and GPU based instances, it is highly recommended that you use a GPU based instance as the CPU based instance will take a very long time to complete training. The training scripts used by this notebook have been coded to run on single-GPU training instances. You can use multi-GPU instances but only one GPU will be used resulting in wastage of GPU resources.\n",
"* This notebook will create resources in the same AWS account and in the same region where this notebook is running.\n",
"\n",
"**Table of Contents:**\n",
"\n",
"1. [Complete prerequisites](#Complete%20prerequisites)\n",
"\n",
" 1. [Check and configure access to the Internet](#Check%20and%20configure%20access%20to%20the%20Internet)\n",
"\n",
" 2. [Check and upgrade required software versions](#Check%20and%20upgrade%20required%20software%20versions)\n",
" \n",
" 3. [Check and configure security permissions](#Check%20and%20configure%20security%20permissions)\n",
"\n",
" 4. [Organize imports](#Organize%20imports)\n",
" \n",
" 5. [Create common objects](#Create%20common%20objects)\n",
"\n",
"2. [Prepare the dataset](#Prepare%20the%20dataset)\n",
"\n",
" 1. [Create the local directories](#Create%20the%20local%20directories)\n",
"\n",
" 2. [Load the dataset](#Load%20the%20dataset)\n",
" \n",
" 3. [View the details of the dataset](#View%20the%20details%20of%20the%20dataset)\n",
" \n",
" 4. [Visualize the dataset](#Visualize%20the%20dataset)\n",
" \n",
" 5. [Normalize the dataset](#Normalize%20the%20dataset)\n",
" \n",
" 6. [Save the prepared datasets locally](#Save%20the%20prepared%20datasets%20locally)\n",
" \n",
" 7. [Upload the prepared datasets to S3](#Upload%20the%20prepared%20datasets%20to%20S3)\n",
"\n",
"3. [View the training script](#View%20the%20training%20script)\n",
"\n",
" 1. [Zero script change](#Zero%20script%20change)\n",
" \n",
" 2. [With script change](#With%20script%20change)\n",
"\n",
"4. [Perform training, validation and testing](#Perform%20training%20validation%20and%20testing)\n",
"\n",
" 1. [Set the training parameters](#Set%20the%20training%20parameters)\n",
" \n",
" 2. [Set the debugger parameters](#Set%20the%20debugger%20parameters)\n",
" \n",
" 3. [(Optional) Delete previous checkpoints](#(Optional)%20Delete%20previous%20checkpoints)\n",
"\n",
" 4. [Run the training job](#Run%20the%20training%20job)\n",
"\n",
"5. [View the auto-generated debugger profiling report](#View%20the%20auto-generated%20debugger%20profiling%20report)\n",
"\n",
"6. [Perform interactive analysis of the debugger output](#Perform%20interactive%20analysis%20of%20the%20debugger%20output)\n",
"\n",
" 1. [Get the training job](#Get%20the%20training%20job)\n",
"\n",
" 2. [Read the metrics](#Read%20the%20metrics)\n",
"\n",
" 3. [Plot the metrics](#Plot%20the%20metrics)\n",
" \n",
" 1. [System metrics histogram](#System%20metrics%20histogram)\n",
"\n",
" 2. [Framework metrics stepline chart](#Framework%20metrics%20stepline%20chart)\n",
" \n",
" 3. [Framework metrics step histogram](#Framework%20metrics%20step%20histogram)\n",
"\n",
" 4. [System and framework metrics timeline charts](#System%20and%20framework%20metrics%20timeline%20charts)\n",
"\n",
" 5. [System and framework metrics heatmap](#System%20and%20framework%20metrics%20heatmap)\n",
"\n",
"7. [Cleanup](#Cleanup)"
]
},
{
"cell_type": "markdown",
"id": "874a0533",
"metadata": {},
"source": [
"## 1. Complete prerequisites \n",
"\n",
"Check and complete the prerequisites."
]
},
{
"cell_type": "markdown",
"id": "36efd9a1",
"metadata": {},
"source": [
"### A. Check and configure access to the Internet \n",
"This notebook requires outbound access to the Internet to download the required software updates and to download the dataset. You can either provide direct Internet access (default) or provide Internet access through a VPC. For more information on this, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/appendix-notebook-and-internet-access.html)."
]
},
{
"cell_type": "markdown",
"id": "aadc054d",
"metadata": {},
"source": [
"### B. Check and upgrade required software versions \n",
"\n",
"This notebook requires:\n",
"* [SageMaker Python SDK version 2.x](https://sagemaker.readthedocs.io/en/stable/v2.html)\n",
"* [TensorFlow version 2.x with SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/tf.html)\n",
"* [Python 3.6.x](https://www.python.org/downloads/release/python-360/)\n",
"* [SMDebug](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-analyze-data.html)\n",
"* [Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)\n",
"\n",
"Note: If you get 'module not found' errors in the following cell, then uncomment the appropriate installation commands and install the modules. Also, uncomment and run the kernel shutdown command. When the kernel comes back, comment out the installation and kernel shutdown commands and run the following cell. Now, you should not see any errors."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a9e3acb",
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"import IPython\n",
"import sagemaker\n",
"import smdebug\n",
"import sys\n",
"import tensorflow as tf\n",
"\n",
"\"\"\"\n",
"Last tested versions:\n",
"SageMaker Python SDK version : 2.32.0\n",
"TensorFlow version : 2.5.0\n",
"Python version : 3.6.13 | packaged by conda-forge | (default, Feb 19 2021, 05:36:01) \n",
"[GCC 9.3.0]\n",
"Boto3 version : 1.17.77\n",
"SMDebug version : 1.0.9\n",
"\"\"\"\n",
"\n",
"# Install/upgrade sagemaker (v2.32.0), boto3, tensorflow and smdebug\n",
"#!{sys.executable} -m pip install -U boto3\n",
"#!{sys.executable} -m pip install -U tensorflow\n",
"#!{sys.executable} -m pip install -U sagemaker==2.32.0\n",
"#!{sys.executable} -m pip install -U smdebug\n",
"#IPython.Application.instance().kernel.do_shutdown(True)\n",
"\n",
"# Get the current installed version of Sagemaker SDK, TensorFlow, Python, Boto3 and SMDebug\n",
"print('SageMaker Python SDK version : {}'.format(sagemaker.__version__))\n",
"print('TensorFlow version : {}'.format(tf.__version__))\n",
"print('Python version : {}'.format(sys.version))\n",
"print('Boto3 version : {}'.format(boto3.__version__))\n",
"print('SMDebug version : {}'.format(smdebug.__version__))"
]
},
{
"cell_type": "markdown",
"id": "f1ee312c",
"metadata": {},
"source": [
"### C. Check and configure security permissions \n",
"This notebook uses the IAM role attached to the underlying notebook instance. To view the name of this role, run the following cell.\n",
"\n",
"Note: This role should have the following permissions,\n",
"\n",
"1. Full access to the S3 bucket that will be used to store training and output data.\n",
"2. Full access to launch training instances.\n",
"3. Access to write to CloudWatch."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d341217",
"metadata": {},
"outputs": [],
"source": [
"print(sagemaker.get_execution_role())"
]
},
{
"cell_type": "markdown",
"id": "8eb30c17",
"metadata": {},
"source": [
"### D. Organize imports \n",
"\n",
"Organize all the library and module imports for later use."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a3ff2ac",
"metadata": {},
"outputs": [],
"source": [
"from IPython.core.display import display, HTML\n",
"import logging\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import os\n",
"from sagemaker.debugger import (ProfilerConfig,\n",
" FrameworkProfile,\n",
" CollectionConfig,\n",
" DebuggerHookConfig,\n",
" DetailedProfilingConfig, \n",
" DataloaderProfilingConfig, \n",
" PythonProfilingConfig,\n",
" Rule,\n",
" PythonProfiler,\n",
" cProfileTimer,\n",
" ProfilerRule,\n",
" rule_configs)\n",
"from sagemaker.tensorflow import TensorFlow\n",
"from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram\n",
"from smdebug.profiler.analysis.notebook_utils.step_timeline_chart import StepTimelineChart\n",
"from smdebug.profiler.analysis.notebook_utils.step_histogram import StepHistogram\n",
"from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts\n",
"from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap\n",
"from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob\n",
"import time"
]
},
{
"cell_type": "markdown",
"id": "345fc1d3",
"metadata": {},
"source": [
"### E. Create common objects \n",
"\n",
"Create common objects to be used in future steps in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3aff19c5",
"metadata": {},
"outputs": [],
"source": [
"# Specify the S3 bucket name\n",
"s3_bucket = ''\n",
"\n",
"# Create the S3 Boto3 resource\n",
"s3_resource = boto3.resource('s3')\n",
"s3_bucket_resource = s3_resource.Bucket(s3_bucket)\n",
"\n",
"# Get the AWS region name\n",
"region_name = sagemaker.Session().boto_region_name\n",
"\n",
"# Base name to be used to create resources\n",
"nb_name = 'tf2-fashion-mnist-custom-debugger'\n",
"\n",
"# Names of various resources\n",
"train_job_name = 'train-{}'.format(nb_name)\n",
"\n",
"# Names of local sub-directories in the notebook file system\n",
"data_dir = os.path.join(os.getcwd(), 'data/{}'.format(nb_name))\n",
"train_dir = os.path.join(os.getcwd(), 'data/{}/train'.format(nb_name))\n",
"test_dir = os.path.join(os.getcwd(), 'data/{}/test'.format(nb_name))\n",
"\n",
"# Sub-folder names in S3\n",
"train_dir_s3_prefix = '{}/data/train'.format(nb_name)\n",
"test_dir_s3_prefix = '{}/data/test'.format(nb_name)\n",
"\n",
"# Location in S3 where the training scripts will be copied\n",
"code_location = 's3://{}/{}/scripts'.format(s3_bucket, nb_name)\n",
"\n",
"# Location in S3 where the model checkpoint will be stored\n",
"model_checkpoint_s3_path = 's3://{}/{}/checkpoint/'.format(s3_bucket, nb_name)\n",
"\n",
"# Location in S3 where the trained model and debugger output will be stored\n",
"model_and_debugger_output_s3_path = 's3://{}/{}/output/'.format(s3_bucket, nb_name)"
]
},
{
"cell_type": "markdown",
"id": "3bc25eb5",
"metadata": {},
"source": [
"## 2. Prepare the dataset \n",
"\n",
"The [Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist) consists of 60,000 28x28 grayscale images of 10 fashion categories, along with a test set of 10,000 images. These categories are mapped to integers from 0 to 9 and represent the following class labels,\n",
"\n",
"* 0: T-shirt/top\n",
"* 1: Trouser\n",
"* 2: Pullover\n",
"* 3: Dress\n",
"* 4: Coat\n",
"* 5: Sandal\n",
"* 6: Shirt\n",
"* 7: Sneaker\n",
"* 8: Bag\n",
"* 9: Ankle boot\n",
"\n",
"The following steps will help with preparing the dataset for training."
]
},
{
"cell_type": "markdown",
"id": "4a1fe745",
"metadata": {},
"source": [
"### A) Create the local directories \n",
"\n",
"Create the directories in the local system where the dataset will be copied to and processed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ba8f74f3",
"metadata": {},
"outputs": [],
"source": [
"# Create the local directories\n",
"os.makedirs(data_dir, exist_ok=True)\n",
"os.makedirs(train_dir, exist_ok=True)\n",
"os.makedirs(test_dir, exist_ok=True)"
]
},
{
"cell_type": "markdown",
"id": "ee77a72b",
"metadata": {},
"source": [
"### B) Load the dataset \n",
"\n",
"Load the pre-shuffled train and test data with the keras.datasets API."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2de7e7fa",
"metadata": {},
"outputs": [],
"source": [
"# Load the dataset\n",
"(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()"
]
},
{
"cell_type": "markdown",
"id": "a3ccb2b7",
"metadata": {},
"source": [
"### C) View the details of the dataset \n",
"\n",
"Print the shape of the data and you will notice that they are 28x28 pixels. There are 60,000 images in the training data and 10,000 images in the test data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0db9bc6a",
"metadata": {},
"outputs": [],
"source": [
"# Summarize the dataset\n",
"print(\"x_train shape:\", x_train.shape)\n",
"print(\"y_train shape:\", y_train.shape)\n",
"print(\"x_test shape:\", x_test.shape)\n",
"print(\"y_test shape:\", y_test.shape)"
]
},
{
"cell_type": "markdown",
"id": "7e72a070",
"metadata": {},
"source": [
"### D) Visualize the dataset \n",
"\n",
"Randomly display the images and labels of `sample_size` number of images from the test dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fbbac80f",
"metadata": {},
"outputs": [],
"source": [
"# Randomly display the images and labels of n (sample_size) images from the test dataset\n",
"\n",
"sample_size = 50\n",
"\n",
"random_indexes = np.random.randint(0, len(x_test), sample_size)\n",
"sample_images = x_test[random_indexes]\n",
"sample_labels = y_test[random_indexes]\n",
"sample_predictions = None\n",
"num_rows = 5\n",
"num_cols = 10\n",
"plot_title = None\n",
"fig_size = None\n",
"assert sample_images.shape[0] == num_rows * num_cols\n",
"\n",
"# Labels\n",
"FASHION_LABELS = {\n",
" 0: 'T-shirt/top',\n",
" 1: 'Trouser',\n",
" 2: 'Pullover',\n",
" 3: 'Dress',\n",
" 4: 'Coat',\n",
" 5: 'Sandal',\n",
" 6: 'Shirt',\n",
" 7: 'Sneaker',\n",
" 8: 'Bag',\n",
" 9: 'Ankle boot'\n",
"}\n",
"\n",
"import seaborn as sns\n",
"\n",
"with sns.axes_style(\"whitegrid\"):\n",
" sns.set_context(\"notebook\", font_scale=1.1)\n",
" sns.set_style({\"font.sans-serif\": [\"Verdana\", \"Arial\", \"Calibri\", \"DejaVu Sans\"]})\n",
" f, ax = plt.subplots(num_rows, num_cols, figsize=((14, 9) if fig_size is None else fig_size),\n",
" gridspec_kw={\"wspace\": 0.02, \"hspace\": 0.30}, squeeze=True)\n",
" for r in range(num_rows):\n",
" for c in range(num_cols):\n",
" image_index = r * num_cols + c\n",
" ax[r, c].axis(\"off\")\n",
" ax[r, c].imshow(sample_images[image_index], cmap=\"Greys\")\n",
" if sample_predictions is None:\n",
" title = ax[r, c].set_title(\"%s\" % FASHION_LABELS[sample_labels[image_index]])\n",
" else:\n",
" true_label = sample_labels[image_index]\n",
" pred_label = sample_predictions[image_index]\n",
" prediction_matches_true = (sample_labels[image_index] == sample_predictions[image_index])\n",
" if prediction_matches_true:\n",
" title = FASHION_LABELS[true_label]\n",
" title_color = 'g'\n",
" else:\n",
" title = '%s/%s' % (FASHION_LABELS[true_label], FASHION_LABELS[pred_label])\n",
" title_color = 'r'\n",
" title = ax[r, c].set_title(title)\n",
" plt.setp(title, color=title_color)\n",
" if plot_title is not None:\n",
" f.suptitle(plot_title)\n",
" plt.show()\n",
" plt.close()"
]
},
{
"cell_type": "markdown",
"id": "13c30c86",
"metadata": {},
"source": [
"The pixel values of the images fall in the range of 0 to 255. You can verify this below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "85f7b2ce",
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"plt.imshow(x_train[0])\n",
"plt.colorbar()\n",
"plt.grid(False)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "4f81729d",
"metadata": {},
"source": [
"### E) Normalize the dataset \n",
"\n",
"As the pixel values range from 0 to 255, it is important to normalize them to a range from 0 to 1. This can be done by dividing these values by 255. This has to be done for both training and test images."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "735bd2f4",
"metadata": {},
"outputs": [],
"source": [
"# Normalize the dataset\n",
"x_train = x_train.astype('float32') / 255.0\n",
"x_test = x_test.astype('float32') / 255.0"
]
},
{
"cell_type": "markdown",
"id": "839e6798",
"metadata": {},
"source": [
"### F) Save the prepared datasets locally \n",
"\n",
"Save the prepared train, validate and test datasets to local directories. Prior to saving, concatenate x and y columns as needed. Create the directories if they don't exist."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20094a25",
"metadata": {},
"outputs": [],
"source": [
"# Save the prepared dataset (in numpy format) to the local directories\n",
"np.save(os.path.join(train_dir, 'x_train.npy'), x_train)\n",
"np.save(os.path.join(train_dir, 'y_train.npy'), y_train)\n",
"np.save(os.path.join(test_dir, 'x_test.npy'), x_test)\n",
"np.save(os.path.join(test_dir, 'y_test.npy'), y_test)"
]
},
{
"cell_type": "markdown",
"id": "68a5d135",
"metadata": {},
"source": [
"### G) Upload the prepared datasets to S3 \n",
"\n",
"Upload the datasets from the local directories to appropriate sub-directories in the specified S3 bucket."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2694ba83",
"metadata": {},
"outputs": [],
"source": [
"# Upload the data to S3\n",
"train_data_s3_full_path = sagemaker.Session().upload_data(path='./data/{}/train/'.format(nb_name),\n",
" bucket=s3_bucket,\n",
" key_prefix=train_dir_s3_prefix)\n",
"test_data_s3_full_path = sagemaker.Session().upload_data(path='./data/{}/test/'.format(nb_name),\n",
" bucket=s3_bucket,\n",
" key_prefix=test_dir_s3_prefix)"
]
},
{
"cell_type": "markdown",
"id": "902d2c7f",
"metadata": {},
"source": [
"## 3. View the training script \n",
"\n",
"View the script that will be used for training the model. This should exist in a local directory."
]
},
{
"cell_type": "markdown",
"id": "cfd1b911",
"metadata": {},
"source": [
"### A) Zero script change \n",
"\n",
"In case you don't have to modify the training script to use the SageMaker Debugger hook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f4305ad",
"metadata": {},
"outputs": [],
"source": [
"!cat scripts/train_tf2_fashion_mnist_custom.py"
]
},
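{
"cell_type": "markdown",
"id": "a1f0e2d3",
"metadata": {},
"source": [
"The cell above prints the actual script. For orientation only, the following is a minimal, hypothetical sketch of the general structure of a TensorFlow 2.x custom training loop (model, loss, optimizer, `tf.GradientTape`); it is not the script used by this notebook. In the zero-script-change scenario, no `smdebug` imports or hook calls appear in the script.\n",
"\n",
"```python\n",
"import tensorflow as tf\n",
"\n",
"# Hypothetical model, loss and optimizer; the real script defines its own\n",
"model = tf.keras.Sequential([\n",
"    tf.keras.layers.Flatten(input_shape=(28, 28)),\n",
"    tf.keras.layers.Dense(128, activation='relu'),\n",
"    tf.keras.layers.Dense(10)\n",
"])\n",
"loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)\n",
"optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)\n",
"\n",
"def train_step(images, labels):\n",
"    # One custom training step: forward pass, loss, gradients, weight update\n",
"    with tf.GradientTape() as tape:\n",
"        logits = model(images, training=True)\n",
"        loss = loss_fn(labels, logits)\n",
"    grads = tape.gradient(loss, model.trainable_variables)\n",
"    optimizer.apply_gradients(zip(grads, model.trainable_variables))\n",
"    return loss\n",
"```"
]
},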
{
"cell_type": "markdown",
"id": "3993ae27",
"metadata": {},
"source": [
"### B) With script change \n",
"\n",
"In case you have to modify the training script to use the SageMaker Debugger hook. For example, to save save scalars and tensors at specific points in the training script as you require."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0e7baf2c",
"metadata": {},
"outputs": [],
"source": [
"!cat scripts/train_tf2_fashion_mnist_custom_debugger.py"
]
},
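{
"cell_type": "markdown",
"id": "b2f1e3d4",
"metadata": {},
"source": [
"The cell above prints the actual script. As a rough, hypothetical illustration of the kind of hook-related additions such a script makes (this sketch assumes the `smdebug.tensorflow` `KerasHook` API with `create_from_json_file()`, `wrap_tape()` and `save_scalar()`; defer to the script above and the SMDebug documentation for the exact calls), a custom training loop with an explicit Debugger hook typically looks like this:\n",
"\n",
"```python\n",
"import tensorflow as tf\n",
"import smdebug.tensorflow as smd\n",
"\n",
"# Create the hook from the JSON configuration that SageMaker places in the container\n",
"hook = smd.KerasHook.create_from_json_file()\n",
"\n",
"def train_step(model, loss_fn, optimizer, images, labels):\n",
"    # Wrapping the tape lets the hook capture gradients for the 'gradients' collection\n",
"    with hook.wrap_tape(tf.GradientTape()) as tape:\n",
"        logits = model(images, training=True)\n",
"        loss = loss_fn(labels, logits)\n",
"    grads = tape.gradient(loss, model.trainable_variables)\n",
"    optimizer.apply_gradients(zip(grads, model.trainable_variables))\n",
"    # Explicitly save a scalar so it appears in the Debugger output (and as a SageMaker metric)\n",
"    hook.save_scalar('train_loss', float(loss), sm_metric=True)\n",
"    return loss\n",
"```\n",
"\n",
"The names `train_step` and `train_loss` above are placeholders; which collections actually get saved is governed by the `DebuggerHookConfig` defined later in this notebook."
]
},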
{
"cell_type": "markdown",
"id": "261fb7b0",
"metadata": {},
"source": [
"## 4. Perform training, validation and testing \n",
"\n",
"In this step, we will use a SageMaker optimized TensorFlow 2.x container to train a multi-class image classification model using the [Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist) using a custom training loop i.e. customizes what goes on in the `fit()` loop. This will take your training script and use SageMaker in script mode.\n",
"\n",
"Debugger will be enabled as part of the training process. Based on the debugger configuration, the required number of [Processing Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) will be created.\n",
"\n",
"Note:\n",
"\n",
"* The logic for the custom training loop will be in the training script.\n",
"* During the training process, checkpointing the model is a good practice in general and is strongly recommended when you are using Spot instances for training as there is a chance of your training job getting interrupted."
]
},
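{
"cell_type": "markdown",
"id": "c3a2b4d5",
"metadata": {},
"source": [
"For context, the following is a hypothetical sketch of how a script-mode training script typically receives its hyperparameters (as command-line arguments), its data channel locations (through the standard `SM_CHANNEL_*` environment variables), and the checkpoint settings, and how it might resume from the latest checkpoint. The argument names mirror the hyperparameters defined later in this notebook, but the helper function and the `.h5` checkpoint file pattern are assumptions, not taken from the actual training script.\n",
"\n",
"```python\n",
"import argparse\n",
"import glob\n",
"import os\n",
"\n",
"def parse_args():\n",
"    parser = argparse.ArgumentParser()\n",
"    # Hyperparameters arrive as command-line arguments in script mode\n",
"    parser.add_argument('--epochs', type=int, default=25)\n",
"    parser.add_argument('--batch_size', type=int, default=50)\n",
"    parser.add_argument('--learning_rate', type=float, default=0.001)\n",
"    parser.add_argument('--checkpoint_enabled', type=str, default='True')\n",
"    parser.add_argument('--checkpoint_load_previous', type=str, default='True')\n",
"    parser.add_argument('--checkpoint_local_dir', type=str, default='/opt/ml/checkpoints/')\n",
"    # Data channel locations and the model directory are exposed through environment variables\n",
"    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))\n",
"    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))\n",
"    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))\n",
"    return parser.parse_args()\n",
"\n",
"def maybe_load_latest_checkpoint(model, args):\n",
"    # Resume from the newest checkpoint file if any were downloaded from S3\n",
"    if args.checkpoint_load_previous == 'True':\n",
"        checkpoints = sorted(glob.glob(os.path.join(args.checkpoint_local_dir, '*.h5')))\n",
"        if checkpoints:\n",
"            model.load_weights(checkpoints[-1])\n",
"```"
]
},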
{
"cell_type": "markdown",
"id": "ca7d2e93",
"metadata": {},
"source": [
"### A) Set the training parameters \n",
"\n",
"1. Inputs - S3 locations for training and test data.\n",
"2. Hyperparameters and checkpoint parameters.\n",
"3. Training instance details:\n",
"\n",
" 1. Instance count (Recommended: 1. Anything more than 1 will result in wastage of resources.)\n",
" \n",
" 2. Instance type (Recommended: Single-GPU based instance. Refer [here](https://aws.amazon.com/ec2/instance-types/) for info on instance types.)\n",
" \n",
" 3. The max run time of the training job\n",
" \n",
" 4. (Optional) Use Spot instances. For more info, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html).\n",
" \n",
" 5. (Optional) The max wait for Spot instances, if using Spot. This should be larger than the max run time.\n",
" \n",
" 6. Training image parameters. For more info, refer [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).\n",
" \n",
"4. Tensorflow framework version and Python version.\n",
"5. Names of the training script and the local directory where it is located. Choose the script from either zero-script-change or with-script-change scenarios.\n",
"6. Logging level of the SageMaker optimized Tensorflow 2.x container.\n",
"7. Appropriate local and S3 directories that will be used by the training job."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "afc357ef",
"metadata": {},
"outputs": [],
"source": [
"# Set the input data paths\n",
"inputs = {'train':train_data_s3_full_path, 'test':test_data_s3_full_path}\n",
"\n",
"# Location where the model checkpoints will be stored locally in the container before being uploaded to S3\n",
"## Note: It is recommended that you use the default location of /opt/ml/checkpoints/ for saving/loading checkpoints.\n",
"model_checkpoint_local_dir = '/opt/ml/checkpoints/'\n",
"\n",
"# Set the hyperparameters\n",
"##\n",
"## Note: Parameters 'checkpoint_enabled', 'checkpoint_load_previous' and 'checkpoint_local_dir' are not\n",
"## hyperparameters. They have been specified here as a means to pass them to the training script.\n",
"## The better way of passing these would be in the Environment variables which were not supported\n",
"## at the time of writing this notebook.\n",
"##\n",
"## 'checkpoint_enabled' - when this is set to 'True', the training script will save the model as a checkpoint\n",
"## after every epoch. If set to 'False', checkpoints will not be saved.\n",
"##\n",
"## 'checkpoint_load_previous' - when this is set to 'True', prior checkpoints saved to S3 will be downloaded\n",
"## to the container and the weights from the latest checkpoint from that list will be loaded to the model.\n",
"## Training will resume from that point. If this is set to 'False', prior checkpoints saved to S3 will still\n",
"## be downloaded to the container but not loaded for training. In this case, training will start from scratch.\n",
"##\n",
"## 'checkpoint_local_dir' - the local directory in the container where the checkpoints will be saved to\n",
"## and loaded from.\n",
"hyperparameters = {'epochs': 25,\n",
" 'batch_size': 50,\n",
" 'learning_rate': 0.001,\n",
" 'decay': 1e-6,\n",
" 'checkpoint_enabled': 'True',\n",
" 'checkpoint_load_previous': 'True',\n",
" 'checkpoint_local_dir': model_checkpoint_local_dir}\n",
"\n",
"# Set the instance count, instance type, instance volume size, options to use Spot instances and other parameters\n",
"## Recommended: 1\n",
"train_instance_count = 1\n",
"## Based on whether you choose a CPU or GPU based instance,\n",
"## set the variable under training image parameters\n",
"## Recommended: Single-GPU based instance\n",
"#train_instance_type = 'ml.m5.xlarge'\n",
"train_instance_type = 'ml.p3.2xlarge'\n",
"train_instance_volume_size_in_gb = 100\n",
"#use_spot_instances = True\n",
"#spot_max_wait_time_in_seconds = 5400\n",
"use_spot_instances = False\n",
"spot_max_wait_time_in_seconds = None\n",
"## Specify a large timeout if you use CPU based instances\n",
"max_run_time_in_seconds = 3600\n",
"\n",
"# Training image parameters\n",
"## should be either 'cpu' or 'gpu'\n",
"## Recommended: 'gpu'\n",
"#image_type = 'cpu'\n",
"image_type = 'gpu'\n",
"framework_version = '2.4.1'\n",
"py_version = 'py37'\n",
"## Set the for image_type = 'gpu' \n",
"cuda_version = 'cu110'\n",
"image_uri_prefix = '763104351884.dkr.ecr.{}.amazonaws.com/tensorflow-training:{}-{}-{}'.format(region_name,\n",
" framework_version,\n",
" image_type,\n",
" py_version)\n",
"image_os_version = 'ubuntu18.04'\n",
"if image_type == 'gpu':\n",
" image_uri = '{}-{}-{}'.format(image_uri_prefix, cuda_version, image_os_version)\n",
"else:\n",
" image_uri = '{}-{}'.format(image_uri_prefix, image_os_version)\n",
"\n",
"# Set the training script related parameters\n",
"train_script_dir = 'scripts'\n",
"## Zero-script-change scenario\n",
"train_script = 'train_tf2_fashion_mnist_custom.py'\n",
"## With-script-change scenario\n",
"#train_script = 'train_tf2_fashion_mnist_custom_debugger.py'\n",
"\n",
"# Set the training container related parameters\n",
"container_log_level = logging.INFO\n",
"\n",
"# Location where the trained model will be stored locally in the container before being uploaded to S3\n",
"model_local_dir = '/opt/ml/model'"
]
},
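{
"cell_type": "markdown",
"id": "d4b3c5e6",
"metadata": {},
"source": [
"As an aside, instead of hand-building the training image URI as the cell above does, the SageMaker Python SDK can look it up for you. A minimal sketch, assuming `sagemaker.image_uris.retrieve()` supports this framework version (it reuses the variables defined in the cell above):\n",
"\n",
"```python\n",
"from sagemaker import image_uris\n",
"\n",
"# Look up the TensorFlow training image URI for the chosen region, version and instance type\n",
"alt_image_uri = image_uris.retrieve(\n",
"    framework='tensorflow',\n",
"    region=region_name,\n",
"    version=framework_version,\n",
"    py_version=py_version,\n",
"    instance_type=train_instance_type,\n",
"    image_scope='training'\n",
")\n",
"print(alt_image_uri)\n",
"```"
]
},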
{
"cell_type": "markdown",
"id": "9ec1003d",
"metadata": {},
"source": [
"### B) Set the debugger parameters \n",
"\n",
"1. **Profile config** - configure how to collect system metrics and framework metrics from your training job and save into your secured S3 bucket URI or local machine.\n",
"\n",
" 1. [Monitoring hardware system resource utilization](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configure-system-monitoring.html)\n",
" \n",
" 2. [Framework profiling](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configure-framework-profiling.html)\n",
" \n",
"2. **Debugger hook config** - configure how to collect output tensors from your training job and save into your secured S3 bucket URI or local machine. For more info, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configure-hook.html).\n",
"\n",
"3. **Rules** - configure this parameter to enable Debugger built-in rules that you want to run in parallel. The rules automatically analyze your training job and find training issues. The ProfilerReport rule saves the Debugger profiling reports in your secured S3 bucket URI. For more info, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/use-debugger-built-in-rules.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "78ca03b9",
"metadata": {},
"outputs": [],
"source": [
"# Location in S3 where the debugger output will be stored is mentioned in the previous step\n",
"\n",
"# Set the profile config for both system and framework metrics\n",
"profiler_config = ProfilerConfig(\n",
" system_monitor_interval_millis = 500,\n",
" framework_profile_params = FrameworkProfile(\n",
" detailed_profiling_config = DetailedProfilingConfig(\n",
" start_step = 5, \n",
" num_steps = 10\n",
" ),\n",
" dataloader_profiling_config = DataloaderProfilingConfig(\n",
" start_step = 7, \n",
" num_steps = 10\n",
" ),\n",
" python_profiling_config = PythonProfilingConfig(\n",
" start_step = 9, \n",
" num_steps = 10,\n",
" python_profiler = PythonProfiler.CPROFILE, \n",
" cprofile_timer = cProfileTimer.TOTAL_TIME\n",
" )\n",
" )\n",
")\n",
"\n",
"# Set the debugger hook config to save tensors\n",
"debugger_hook_config = DebuggerHookConfig(\n",
" collection_configs = [\n",
" CollectionConfig(name = 'metrics'),\n",
" CollectionConfig(name = 'sm_metrics'),\n",
" CollectionConfig(name = 'weights'),\n",
" CollectionConfig(name = 'gradients')\n",
" ]\n",
")\n",
"\n",
"# Set the rules to analyze tensors emitted during training\n",
"## These specific set of rules will inspect the overall training performance and progress of the model\n",
"rules=[\n",
" ProfilerRule.sagemaker(rule_configs.ProfilerReport()),\n",
" Rule.sagemaker(rule_configs.loss_not_decreasing()),\n",
" Rule.sagemaker(rule_configs.overfit()),\n",
" Rule.sagemaker(rule_configs.overtraining()),\n",
" Rule.sagemaker(rule_configs.stalled_training_rule())\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "a4cabbb4",
"metadata": {},
"source": [
"### C) (Optional) Delete previous checkpoints \n",
"\n",
"If model checkpoints from previous trainings are found in the S3 checkpoint location specified in the previous step, then, they will be automatically downloaded to the container running the training process. If you set the checkpoint parameter `checkpoint_load_previous` to `True`, then the weights from the latest checkpoint file will be loaded to the model and training will start from there. If you have set `checkpoint_load_previous` to `False`, you can avoid the unnecessary downloading of the checkpoint files from S3 to the container by deleting those checkpoint files from S3. For this, run the following code cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd031784",
"metadata": {},
"outputs": [],
"source": [
"# Delete the checkpoints if you want to train from the beginning; else ignore this code cell\n",
"for checkpoint_file in s3_bucket_resource.objects.filter(Prefix='{}/checkpoint/'.format(nb_name)):\n",
" checkpoint_file_key = checkpoint_file.key\n",
" print('Deleting {} ...'.format(checkpoint_file_key))\n",
" s3_resource.Object(s3_bucket_resource.name, checkpoint_file_key).delete()"
]
},
{
"cell_type": "markdown",
"id": "c37fb12b",
"metadata": {},
"source": [
"### D) Run the training job \n",
"\n",
"Prepare the `estimator` and call the `fit()` method. This will pull the container containing the specified version of the framework in the AWS region, copy the custom training script to it and run the training job in the specified type of EC2 instance(s). The training data will be pulled from the specified location in S3 and the generated model along with the checkpoints will be written to the specified locations in S3. The debugger will use its configured settings to capture the data and write them to the specified locations in S3."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6cb400c0",
"metadata": {},
"outputs": [],
"source": [
"# Create the estimator\n",
"estimator = TensorFlow(\n",
" source_dir=train_script_dir,\n",
" entry_point=train_script,\n",
" code_location=code_location,\n",
" checkpoint_local_path=model_checkpoint_local_dir,\n",
" checkpoint_s3_uri=model_checkpoint_s3_path,\n",
" model_dir=model_local_dir,\n",
" output_path=model_and_debugger_output_s3_path,\n",
" instance_type=train_instance_type,\n",
" volume_size=train_instance_volume_size_in_gb,\n",
" instance_count=train_instance_count,\n",
" use_spot_instances=use_spot_instances,\n",
" max_wait=spot_max_wait_time_in_seconds,\n",
" max_run=max_run_time_in_seconds,\n",
" hyperparameters=hyperparameters,\n",
" role=sagemaker.get_execution_role(),\n",
" base_job_name=train_job_name,\n",
" image_uri=image_uri,\n",
" container_log_level=container_log_level,\n",
" script_mode=True,\n",
" disable_profiler=False,\n",
" profiler_config=profiler_config,\n",
" debugger_hook_config=debugger_hook_config,\n",
" rules=rules)\n",
"\n",
"# Perform the training\n",
"estimator.fit(inputs, wait=True)"
]
},
{
"cell_type": "markdown",
"id": "4b97a475",
"metadata": {},
"source": [
"## 5. View the auto-generated debugger profiling report \n",
"\n",
"The debugger's auto-generated profiling report will be stored in the S3 directory specified in earlier steps. You can view it here.\n",
"\n",
"For information on how to read the report, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "166c93fc",
"metadata": {},
"outputs": [],
"source": [
"# Get the S3 path to the debugger's auto-generated profiling report\n",
"profiling_report_s3_prefix = '{}/output/{}/rule-output/ProfilerReport/profiler-output/profiler-report.html'.format(nb_name,\n",
" estimator.latest_training_job.job_name)\n",
"profiling_report = sagemaker.Session().read_s3_file(s3_bucket, profiling_report_s3_prefix)\n",
"\n",
"\n",
"# Print debugger's auto-generated profiling report location\n",
"display(HTML(profiling_report))"
]
},
{
"cell_type": "markdown",
"id": "11e29684",
"metadata": {},
"source": [
"## 6. Perform interactive analysis of the debugger output \n",
"\n",
"The debugger's output will be stored in the S3 directories specified in earlier steps. In this step, we will read that data and display them through various visualizations."
]
},
{
"cell_type": "markdown",
"id": "066f672b",
"metadata": {},
"source": [
"### A. Get the training job \n",
"\n",
"Get the training job object from the estimator object used for training in the previous step. This is required to read the metrics from the debugger's output."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dbce10ab",
"metadata": {},
"outputs": [],
"source": [
"# This assumes that the job was trained in the same AWS region as the S3 bucket where the debugger output is stored\n",
"# If not, then make appropriate changes to the following code\n",
"tj = TrainingJob(estimator.latest_training_job.job_name, sagemaker.Session().boto_region_name)"
]
},
{
"cell_type": "markdown",
"id": "6b0b6dda",
"metadata": {},
"source": [
"### B. Read the metrics \n",
"\n",
"1. Wait for the the system and framework metrics to be available.\n",
"2. Get the reader objects for both of these metrics.\n",
"3. Refresh the event file lists that contains these metrics."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "259a8eaa",
"metadata": {},
"outputs": [],
"source": [
"# Wait for the data to be available\n",
"tj.wait_for_sys_profiling_data_to_be_available()\n",
"tj.wait_for_framework_profiling_data_to_be_available()\n",
"# Get the metrics reader\n",
"system_metrics_reader = tj.get_systems_metrics_reader()\n",
"framework_metrics_reader = tj.get_framework_metrics_reader()\n",
"# Refresh the event file list\n",
"system_metrics_reader.refresh_event_file_list()\n",
"framework_metrics_reader.refresh_event_file_list()"
]
},
{
"cell_type": "markdown",
"id": "4fd391a4",
"metadata": {},
"source": [
"### C. Plot the metrics \n",
"\n",
"Plot visualizations for the metrics read in the previous step."
]
},
{
"cell_type": "markdown",
"id": "1d8572c9",
"metadata": {},
"source": [
"#### a. System metrics histogram "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0f5518ce",
"metadata": {},
"outputs": [],
"source": [
"metrics_histogram = MetricsHistogram(system_metrics_reader)\n",
"metrics_histogram.plot(\n",
" starttime=0, \n",
" endtime=system_metrics_reader.get_timestamp_of_latest_available_file(), \n",
" select_dimensions=[\"CPU\", \"GPU\", \"I/O\"],\n",
" select_events=[\"total\"]\n",
")"
]
},
{
"cell_type": "markdown",
"id": "21c57832",
"metadata": {},
"source": [
"#### b. Framework metrics stepline chart "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3663d484",
"metadata": {},
"outputs": [],
"source": [
"view_step_timeline_chart = StepTimelineChart(framework_metrics_reader)"
]
},
{
"cell_type": "markdown",
"id": "04f00c8e",
"metadata": {},
"source": [
"#### c. Framework metrics step histogram "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "383a1ce6",
"metadata": {},
"outputs": [],
"source": [
"step_histogram = StepHistogram(framework_metrics_reader)\n",
"step_histogram.plot(\n",
" starttime=step_histogram.last_timestamp - 5 * 1000 * 1000, \n",
" endtime=step_histogram.last_timestamp, \n",
" show_workers=True\n",
")"
]
},
{
"cell_type": "markdown",
"id": "77c7dcfb",
"metadata": {},
"source": [
"#### d. System and framework metrics timeline charts "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f6939df",
"metadata": {},
"outputs": [],
"source": [
"view_timeline_charts = TimelineCharts(\n",
" system_metrics_reader, \n",
" framework_metrics_reader,\n",
" select_dimensions=[\"CPU\", \"GPU\", \"I/O\"],\n",
" select_events=[\"total\"] \n",
")\n",
"\n",
"view_timeline_charts.plot_detailed_profiler_data([500,510]) "
]
},
{
"cell_type": "markdown",
"id": "56e61d97",
"metadata": {},
"source": [
"#### e. System and framework metrics heatmap "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a34374c0",
"metadata": {},
"outputs": [],
"source": [
"view_heatmap = Heatmap(\n",
" system_metrics_reader,\n",
" framework_metrics_reader,\n",
" select_dimensions=[\"CPU\", \"GPU\", \"I/O\"],\n",
" select_events=[\"total\"],\n",
" plot_height=450\n",
")"
]
},
{
"cell_type": "markdown",
"id": "3ebd9164",
"metadata": {},
"source": [
"## 7. Cleanup \n",
"\n",
"As a best practice, you should delete resources and S3 objects when no longer required. This will help you avoid incurring unncessary costs.\n",
"\n",
"This step will cleanup the resources and S3 objects created by this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1578fee3",
"metadata": {},
"outputs": [],
"source": [
"# Delete data from S3 bucket\n",
"for file in s3_bucket_resource.objects.filter(Prefix='{}/'.format(nb_name)):\n",
" file_key = file.key\n",
" print('Deleting {} ...'.format(file_key))\n",
" s3_resource.Object(s3_bucket_resource.name, file_key).delete()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a9d03f6d",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}