# Train a TensorFlow 2.x model with custom training loop on the Amazon SageMaker optimized TensorFlow container and debug using Amazon SageMaker Debugger

[Amazon SageMaker](https://aws.amazon.com/sagemaker/) is a fully managed machine learning service. With SageMaker, you can use the built-in algorithms or bring your own algorithms and frameworks. One such framework is TensorFlow 2.x. [Amazon SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html) debugs, monitors and profiles training jobs in real time, which helps you detect non-converging conditions, optimize resource utilization by eliminating bottlenecks, improve training time and reduce the cost of training your machine learning models.

This notebook demonstrates how to use a SageMaker optimized TensorFlow 2.x container to train a multi-class image classification model on the [Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist) with a custom training loop, i.e. one that customizes what goes on in the `fit()` loop. It also demonstrates how to debug the training job using SageMaker Debugger and then analyzes the debugger's output. SageMaker runs your training script in script mode.

**Note:**

* This notebook should only be run from within a SageMaker notebook instance as it references SageMaker native APIs.
* At the time of writing, the most suitable Jupyter notebook kernel for this notebook was `conda_python3`.
* Although the training step in this notebook supports both CPU and GPU based instances, it is highly recommended that you use a GPU based instance as the CPU based instance will take a very long time to complete training. The training scripts used by this notebook have been coded to run on single-GPU training instances. You can use multi-GPU instances but only one GPU will be used resulting in wastage of GPU resources.
* This notebook will create resources in the same AWS account and in the same region where this notebook is running.

**Table of Contents:**

1. [Complete prerequisites](#Complete%20prerequisites)

 1. [Check and configure access to the Internet](#Check%20and%20configure%20access%20to%20the%20Internet)

 2. [Check and upgrade required software versions](#Check%20and%20upgrade%20required%20software%20versions)
 
 3. [Check and configure security permissions](#Check%20and%20configure%20security%20permissions)

 4. [Organize imports](#Organize%20imports)
 
 5. [Create common objects](#Create%20common%20objects)

2. [Prepare the dataset](#Prepare%20the%20dataset)

 1. [Create the local directories](#Create%20the%20local%20directories)

 2. [Load the dataset](#Load%20the%20dataset)
 
 3. [View the details of the dataset](#View%20the%20details%20of%20the%20dataset)
 
 4. [Visualize the dataset](#Visualize%20the%20dataset)
 
 5. [Normalize the dataset](#Normalize%20the%20dataset)
 
 6. [Save the prepared datasets locally](#Save%20the%20prepared%20datasets%20locally)
 
 7. [Upload the prepared datasets to S3](#Upload%20the%20prepared%20datasets%20to%20S3)

3. [View the training script](#View%20the%20training%20script)

 1. [Zero script change](#Zero%20script%20change)
 
 2. [With script change](#With%20script%20change)

4. [Perform training, validation and testing](#Perform%20training%20validation%20and%20testing)

 1. [Set the training parameters](#Set%20the%20training%20parameters)
 
 2. [Set the debugger parameters](#Set%20the%20debugger%20parameters)
 
 3. [(Optional) Delete previous checkpoints](#(Optional)%20Delete%20previous%20checkpoints)

 4. [Run the training job](#Run%20the%20training%20job)

5. [View the auto-generated debugger profiling report](#View%20the%20auto-generated%20debugger%20profiling%20report)

6. [Perform interactive analysis of the debugger output](#Perform%20interactive%20analysis%20of%20the%20debugger%20output)

 1. [Get the training job](#Get%20the%20training%20job)

 2. [Read the metrics](#Read%20the%20metrics)

 3. [Plot the metrics](#Plot%20the%20metrics)
 
 1. [System metrics histogram](#System%20metrics%20histogram)

 2. [Framework metrics stepline chart](#Framework%20metrics%20stepline%20chart)
 
 3. [Framework metrics step histogram](#Framework%20metrics%20step%20histogram)

 4. [System and framework metrics timeline charts](#System%20and%20framework%20metrics%20timeline%20charts)

 5. [System and framework metrics heatmap](#System%20and%20framework%20metrics%20heatmap)

7. [Cleanup](#Cleanup)

## 1. Complete prerequisites 

Check and complete the prerequisites.

### A. Check and configure access to the Internet 
This notebook requires outbound access to the Internet to download the required software updates and to download the dataset. You can either provide direct Internet access (default) or provide Internet access through a VPC. For more information on this, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/appendix-notebook-and-internet-access.html).

### B. Check and upgrade required software versions 

This notebook requires:
* [SageMaker Python SDK version 2.x](https://sagemaker.readthedocs.io/en/stable/v2.html)
* [TensorFlow version 2.x with SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/tf.html)
* [Python 3.6.x](https://www.python.org/downloads/release/python-360/)
* [SMDebug](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-analyze-data.html)
* [Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)

Note: If you get 'module not found' errors in the following cell, then uncomment the appropriate installation commands and install the modules. Also, uncomment and run the kernel shutdown command. When the kernel comes back, comment out the installation and kernel shutdown commands and run the following cell. Now, you should not see any errors.

In [None]:
import boto3
import IPython
import sagemaker
import smdebug
import sys
import tensorflow as tf

"""
Last tested versions:
SageMaker Python SDK version : 2.32.0
TensorFlow version : 2.5.0
Python version : 3.6.13 | packaged by conda-forge | (default, Feb 19 2021, 05:36:01) 
[GCC 9.3.0]
Boto3 version : 1.17.77
SMDebug version : 1.0.9
"""

# Install/upgrade sagemaker (v2.32.0), boto3, tensorflow and smdebug
#!{sys.executable} -m pip install -U boto3
#!{sys.executable} -m pip install -U tensorflow
#!{sys.executable} -m pip install -U sagemaker==2.32.0
#!{sys.executable} -m pip install -U smdebug
#IPython.Application.instance().kernel.do_shutdown(True)

# Get the currently installed versions of the SageMaker SDK, TensorFlow, Python, Boto3 and SMDebug
print('SageMaker Python SDK version : {}'.format(sagemaker.__version__))
print('TensorFlow version : {}'.format(tf.__version__))
print('Python version : {}'.format(sys.version))
print('Boto3 version : {}'.format(boto3.__version__))
print('SMDebug version : {}'.format(smdebug.__version__))
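
The version requirements listed above can also be checked programmatically. The following is a minimal sketch and not part of the original notebook: `major_version` and `meets_minimum_major` are hypothetical helper names, and the sketch assumes that version strings start with a numeric major component, as in `2.32.0`.

```python
def major_version(version: str) -> int:
    """Return the leading numeric component of a version string such as '2.32.0'."""
    return int(version.split(".")[0])


def meets_minimum_major(version: str, minimum_major: int) -> bool:
    """Check whether a version string satisfies a minimum major version."""
    return major_version(version) >= minimum_major


# Example using the last tested versions listed in the cell above
tested_versions = {"sagemaker": "2.32.0", "tensorflow": "2.5.0"}
for name, version in tested_versions.items():
    status = "OK" if meets_minimum_major(version, 2) else "upgrade needed"
    print(name, version, status)
```

In the notebook itself, the same check could be applied to `sagemaker.__version__` and `tf.__version__`.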

### C. Check and configure security permissions 
This notebook uses the IAM role attached to the underlying notebook instance. To view the name of this role, run the following cell.

Note: This role should have the following permissions:

1. Full access to the S3 bucket that will be used to store training and output data.
2. Full access to launch training instances.
3. Access to write to CloudWatch.

In [None]:
print(sagemaker.get_execution_role())

### D. Organize imports 

Organize all the library and module imports for later use.

In [None]:
from IPython.display import display, HTML
import logging
import matplotlib.pyplot as plt
import numpy as np
import os
from sagemaker.debugger import (ProfilerConfig,
 FrameworkProfile,
 CollectionConfig,
 DebuggerHookConfig,
 DetailedProfilingConfig, 
 DataloaderProfilingConfig, 
 PythonProfilingConfig,
 Rule,
 PythonProfiler,
 cProfileTimer,
 ProfilerRule,
 rule_configs)
from sagemaker.tensorflow import TensorFlow
from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram
from smdebug.profiler.analysis.notebook_utils.step_timeline_chart import StepTimelineChart
from smdebug.profiler.analysis.notebook_utils.step_histogram import StepHistogram
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts
from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
import time

### E. Create common objects 

Create common objects to be used in future steps in this notebook.

In [None]:
# Specify the S3 bucket name
s3_bucket = ''

# Create the S3 Boto3 resource
s3_resource = boto3.resource('s3')
s3_bucket_resource = s3_resource.Bucket(s3_bucket)

# Get the AWS region name
region_name = sagemaker.Session().boto_region_name

# Base name to be used to create resources
nb_name = 'tf2-fashion-mnist-custom-debugger'

# Names of various resources
train_job_name = 'train-{}'.format(nb_name)

# Names of local sub-directories in the notebook file system
data_dir = os.path.join(os.getcwd(), 'data/{}'.format(nb_name))
train_dir = os.path.join(os.getcwd(), 'data/{}/train'.format(nb_name))
test_dir = os.path.join(os.getcwd(), 'data/{}/test'.format(nb_name))

# Sub-folder names in S3
train_dir_s3_prefix = '{}/data/train'.format(nb_name)
test_dir_s3_prefix = '{}/data/test'.format(nb_name)

# Location in S3 where the training scripts will be copied
code_location = 's3://{}/{}/scripts'.format(s3_bucket, nb_name)

# Location in S3 where the model checkpoint will be stored
model_checkpoint_s3_path = 's3://{}/{}/checkpoint/'.format(s3_bucket, nb_name)

# Location in S3 where the trained model and debugger output will be stored
model_and_debugger_output_s3_path = 's3://{}/{}/output/'.format(s3_bucket, nb_name)
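
Note that `s3_bucket` is left empty in the cell above and has to be set before the S3 locations are usable. The sketch below shows one way to build such locations while guarding against an empty bucket name. It is an illustration only: `build_s3_uri` is a hypothetical helper and `my-example-bucket` is a placeholder.

```python
def build_s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and key parts into an s3:// URI with a trailing slash."""
    if not bucket:
        raise ValueError("s3_bucket is empty; set it to the name of an existing bucket")
    # Drop empty parts and stray slashes so the key has exactly one '/' per level
    cleaned = [part.strip("/") for part in parts if part.strip("/")]
    if not cleaned:
        return "s3://{}/".format(bucket)
    return "s3://{}/{}/".format(bucket, "/".join(cleaned))


# 'my-example-bucket' is a placeholder, not a real bucket
print(build_s3_uri("my-example-bucket", "tf2-fashion-mnist-custom-debugger", "checkpoint"))
```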

## 2. Prepare the dataset 

The [Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist) consists of a training set of 60,000 28x28 grayscale images in 10 fashion categories, along with a test set of 10,000 images. These categories are mapped to integers from 0 to 9 and represent the following class labels:

* 0: T-shirt/top
* 1: Trouser
* 2: Pullover
* 3: Dress
* 4: Coat
* 5: Sandal
* 6: Shirt
* 7: Sneaker
* 8: Bag
* 9: Ankle boot

The following steps will help with preparing the dataset for training.

### A) Create the local directories 

Create the directories in the local system where the dataset will be copied to and processed.

In [None]:
# Create the local directories
os.makedirs(data_dir, exist_ok=True)
os.makedirs(train_dir, exist_ok=True)
os.makedirs(test_dir, exist_ok=True)

### B) Load the dataset 

Load the pre-shuffled train and test data with the keras.datasets API.

In [None]:
# Load the dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

### C) View the details of the dataset 

Print the shapes of the datasets. Each image is 28x28 pixels. There are 60,000 images in the training data and 10,000 images in the test data.

In [None]:
# Summarize the dataset
print("x_train shape:", x_train.shape)
print("y_train shape:", y_train.shape)
print("x_test shape:", x_test.shape)
print("y_test shape:", y_test.shape)

### D) Visualize the dataset 

Randomly display the images and labels of `sample_size` number of images from the test dataset.

In [None]:
# Randomly display the images and labels of n (sample_size) images from the test dataset

sample_size = 50

random_indexes = np.random.randint(0, len(x_test), sample_size)
sample_images = x_test[random_indexes]
sample_labels = y_test[random_indexes]
sample_predictions = None
num_rows = 5
num_cols = 10
plot_title = None
fig_size = None
assert sample_images.shape[0] == num_rows * num_cols

# Labels
FASHION_LABELS = {
 0: 'T-shirt/top',
 1: 'Trouser',
 2: 'Pullover',
 3: 'Dress',
 4: 'Coat',
 5: 'Sandal',
 6: 'Shirt',
 7: 'Sneaker',
 8: 'Bag',
 9: 'Ankle boot'
}

import seaborn as sns

with sns.axes_style("whitegrid"):
    sns.set_context("notebook", font_scale=1.1)
    sns.set_style({"font.sans-serif": ["Verdana", "Arial", "Calibri", "DejaVu Sans"]})
    f, ax = plt.subplots(num_rows, num_cols, figsize=((14, 9) if fig_size is None else fig_size),
                         gridspec_kw={"wspace": 0.02, "hspace": 0.30}, squeeze=True)
    for r in range(num_rows):
        for c in range(num_cols):
            image_index = r * num_cols + c
            ax[r, c].axis("off")
            ax[r, c].imshow(sample_images[image_index], cmap="Greys")
            if sample_predictions is None:
                title = ax[r, c].set_title("%s" % FASHION_LABELS[sample_labels[image_index]])
            else:
                true_label = sample_labels[image_index]
                pred_label = sample_predictions[image_index]
                prediction_matches_true = (sample_labels[image_index] == sample_predictions[image_index])
                if prediction_matches_true:
                    title = FASHION_LABELS[true_label]
                    title_color = 'g'
                else:
                    title = '%s/%s' % (FASHION_LABELS[true_label], FASHION_LABELS[pred_label])
                    title_color = 'r'
                title = ax[r, c].set_title(title)
                plt.setp(title, color=title_color)
    if plot_title is not None:
        f.suptitle(plot_title)
    plt.show()
    plt.close()

The pixel values of the images fall in the range of 0 to 255. You can verify this below.

In [None]:
plt.figure()
plt.imshow(x_train[0])
plt.colorbar()
plt.grid(False)
plt.show()

### E) Normalize the dataset 

As the pixel values range from 0 to 255, it is important to normalize them to a range from 0 to 1. This can be done by dividing these values by 255. This has to be done for both training and test images.

In [None]:
# Normalize the dataset
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
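
As a quick worked example of the scaling above, the sketch below applies the same division to a few individual pixel values in plain Python. `normalize_pixel` is a hypothetical helper used only for illustration; the notebook itself normalizes whole NumPy arrays at once.

```python
def normalize_pixel(value: int) -> float:
    """Scale an 8-bit pixel intensity (0 to 255) to the range 0.0 to 1.0."""
    return value / 255.0


# The darkest, a mid-range and the brightest pixel value
for pixel in (0, 128, 255):
    print(pixel, "->", round(normalize_pixel(pixel), 4))
```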

### F) Save the prepared datasets locally 

Save the prepared train and test datasets to the local directories created earlier. The images (x) and the labels (y) are saved as separate NumPy files.

In [None]:
# Save the prepared dataset (in numpy format) to the local directories
np.save(os.path.join(train_dir, 'x_train.npy'), x_train)
np.save(os.path.join(train_dir, 'y_train.npy'), y_train)
np.save(os.path.join(test_dir, 'x_test.npy'), x_test)
np.save(os.path.join(test_dir, 'y_test.npy'), y_test)

### G) Upload the prepared datasets to S3 

Upload the datasets from the local directories to appropriate sub-directories in the specified S3 bucket.

In [None]:
# Upload the data to S3
train_data_s3_full_path = sagemaker.Session().upload_data(path='./data/{}/train/'.format(nb_name),
 bucket=s3_bucket,
 key_prefix=train_dir_s3_prefix)
test_data_s3_full_path = sagemaker.Session().upload_data(path='./data/{}/test/'.format(nb_name),
 bucket=s3_bucket,
 key_prefix=test_dir_s3_prefix)

## 3. View the training script 

View the script that will be used for training the model. This should exist in a local directory.

### A) Zero script change 

Use this script when you don't have to modify the training script to use the SageMaker Debugger hook.

In [None]:
!cat scripts/train_tf2_fashion_mnist_custom.py

### B) With script change 

Use this script when you have to modify the training script to use the SageMaker Debugger hook, for example to save scalars and tensors at specific points in the training script.

In [None]:
!cat scripts/train_tf2_fashion_mnist_custom_debugger.py

## 4. Perform training, validation and testing 

In this step, we will use a SageMaker optimized TensorFlow 2.x container to train a multi-class image classification model on the [Fashion MNIST dataset](https://github.com/zalandoresearch/fashion-mnist) with a custom training loop, i.e. one that customizes what goes on in the `fit()` loop. SageMaker runs your training script in script mode.

Debugger will be enabled as part of the training process. Based on the debugger configuration, the required number of [Processing Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) will be created.

Note:

* The logic for the custom training loop will be in the training script.
* During the training process, checkpointing the model is a good practice in general and is strongly recommended when you are using Spot instances for training as there is a chance of your training job getting interrupted.
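
To make the first bullet concrete, the sketch below shows the general shape of a custom training loop without any framework: an explicit loop over epochs and mini-batches with a forward pass, gradients and a parameter update. It is an illustration only and not the notebook's training script, which uses TensorFlow. The toy linear model, the data and the function name `train_linear_model` are assumptions made for this example.

```python
def train_linear_model(xs, ys, epochs=200, batch_size=2, learning_rate=0.05):
    """Fit y = w * x + b by mini-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    for epoch in range(epochs):
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            n = len(batch_x)
            # Forward pass: prediction errors for the current mini-batch
            errors = [(w * x + b) - y for x, y in zip(batch_x, batch_y)]
            # Gradients of the mean squared error with respect to w and b
            grad_w = 2.0 * sum(e * x for e, x in zip(errors, batch_x)) / n
            grad_b = 2.0 * sum(errors) / n
            # Parameter update (plain gradient descent)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = train_linear_model(xs, ys)
print("w = {:.3f}, b = {:.3f}".format(w, b))
```

Writing the loop out like this is what makes it possible to add custom logic, such as saving a checkpoint after every epoch, at any point in training.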

### A) Set the training parameters 

1. Inputs - S3 locations for training and test data.
2. Hyperparameters and checkpoint parameters.
3. Training instance details:

 1. Instance count (Recommended: 1. Anything more than 1 will result in wastage of resources.)
 
 2. Instance type (Recommended: Single-GPU based instance. Refer [here](https://aws.amazon.com/ec2/instance-types/) for info on instance types.)
 
 3. The max run time of the training job
 
 4. (Optional) Use Spot instances. For more info, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html).
 
 5. (Optional) The max wait for Spot instances, if using Spot. This should be larger than the max run time.
 
 6. Training image parameters. For more info, refer [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).
 
4. TensorFlow framework version and Python version.
5. Names of the training script and the local directory where it is located. Choose the script from either zero-script-change or with-script-change scenarios.
6. Logging level of the SageMaker optimized TensorFlow 2.x container.
7. Appropriate local and S3 directories that will be used by the training job.
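
The Spot-related settings in the list above depend on each other: the max wait only matters when Spot instances are used, and it should not be smaller than the max run time. A minimal sketch of a sanity check follows. `validate_spot_settings` is a hypothetical helper, and the rule that a max wait equal to the max run time is acceptable is an assumption of this sketch.

```python
def validate_spot_settings(use_spot_instances, max_run_seconds, max_wait_seconds):
    """Raise ValueError if the Spot-related training settings are inconsistent."""
    if not use_spot_instances:
        # Nothing to check for on-demand training in this sketch
        return
    if max_wait_seconds is None:
        raise ValueError("max wait must be set when Spot instances are used")
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max wait must not be smaller than max run time")


# The two configurations that appear in the next cell
validate_spot_settings(False, 3600, None)   # on-demand settings
validate_spot_settings(True, 3600, 5400)    # commented-out Spot settings
print("settings are consistent")
```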

In [None]:
# Set the input data paths
inputs = {'train':train_data_s3_full_path, 'test':test_data_s3_full_path}

# Location where the model checkpoints will be stored locally in the container before being uploaded to S3
## Note: It is recommended that you use the default location of /opt/ml/checkpoints/ for saving/loading checkpoints.
model_checkpoint_local_dir = '/opt/ml/checkpoints/'

# Set the hyperparameters
##
## Note: Parameters 'checkpoint_enabled', 'checkpoint_load_previous' and 'checkpoint_local_dir' are not
## hyperparameters. They have been specified here as a means to pass them to the training script.
## The better way of passing these would be in the Environment variables which were not supported
## at the time of writing this notebook.
##
## 'checkpoint_enabled' - when this is set to 'True', the training script will save the model as a checkpoint
## after every epoch. If set to 'False', checkpoints will not be saved.
##
## 'checkpoint_load_previous' - when this is set to 'True', prior checkpoints saved to S3 will be downloaded
## to the container and the weights from the latest checkpoint from that list will be loaded to the model.
## Training will resume from that point. If this is set to 'False', prior checkpoints saved to S3 will still
## be downloaded to the container but not loaded for training. In this case, training will start from scratch.
##
## 'checkpoint_local_dir' - the local directory in the container where the checkpoints will be saved to
## and loaded from.
hyperparameters = {'epochs': 25,
 'batch_size': 50,
 'learning_rate': 0.001,
 'decay': 1e-6,
 'checkpoint_enabled': 'True',
 'checkpoint_load_previous': 'True',
 'checkpoint_local_dir': model_checkpoint_local_dir}

# Set the instance count, instance type, instance volume size, options to use Spot instances and other parameters
## Recommended: 1
train_instance_count = 1
## Based on whether you choose a CPU or GPU based instance,
## set the variable under training image parameters
## Recommended: Single-GPU based instance
#train_instance_type = 'ml.m5.xlarge'
train_instance_type = 'ml.p3.2xlarge'
train_instance_volume_size_in_gb = 100
#use_spot_instances = True
#spot_max_wait_time_in_seconds = 5400
use_spot_instances = False
spot_max_wait_time_in_seconds = None
## Specify a large timeout if you use CPU based instances
max_run_time_in_seconds = 3600

# Training image parameters
## should be either 'cpu' or 'gpu'
## Recommended: 'gpu'
#image_type = 'cpu'
image_type = 'gpu'
framework_version = '2.4.1'
py_version = 'py37'
## Set this for image_type = 'gpu'
cuda_version = 'cu110'
image_uri_prefix = '763104351884.dkr.ecr.{}.amazonaws.com/tensorflow-training:{}-{}-{}'.format(region_name,
 framework_version,
 image_type,
 py_version)
image_os_version = 'ubuntu18.04'
if image_type == 'gpu':
 image_uri = '{}-{}-{}'.format(image_uri_prefix, cuda_version, image_os_version)
else:
 image_uri = '{}-{}'.format(image_uri_prefix, image_os_version)

# Set the training script related parameters
train_script_dir = 'scripts'
## Zero-script-change scenario
train_script = 'train_tf2_fashion_mnist_custom.py'
## With-script-change scenario
#train_script = 'train_tf2_fashion_mnist_custom_debugger.py'

# Set the training container related parameters
container_log_level = logging.INFO

# Location where the trained model will be stored locally in the container before being uploaded to S3
model_local_dir = '/opt/ml/model'
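
In script mode, the entries of `hyperparameters` reach the training script as command-line arguments, so values such as `'True'` arrive as strings. The following is a minimal sketch of how a script could parse them. It is not the notebook's actual training script, and `str_to_bool` and `parse_args` are hypothetical names.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Interpret 'True'/'False' style strings passed on the command line."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got {!r}".format(value))


def parse_args(argv=None):
    """Parse the hyperparameters set in the cell above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=25)
    parser.add_argument("--batch_size", type=int, default=50)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--decay", type=float, default=1e-6)
    parser.add_argument("--checkpoint_enabled", type=str_to_bool, default=False)
    parser.add_argument("--checkpoint_load_previous", type=str_to_bool, default=False)
    parser.add_argument("--checkpoint_local_dir", type=str, default="/opt/ml/checkpoints/")
    # parse_known_args tolerates extra arguments that this sketch does not declare
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "25", "--checkpoint_enabled", "True",
                   "--checkpoint_load_previous", "False"])
print(args.epochs, args.checkpoint_enabled, args.checkpoint_load_previous)
```

Converting the flags explicitly avoids the common pitfall that `bool('False')` evaluates to `True` in Python.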

### B) Set the debugger parameters 

1. **Profile config** - configure how to collect system metrics and framework metrics from your training job and save into your secured S3 bucket URI or local machine.

 1. [Monitoring hardware system resource utilization](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configure-system-monitoring.html)
 
 2. [Framework profiling](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configure-framework-profiling.html)
 
2. **Debugger hook config** - configure how to collect output tensors from your training job and save into your secured S3 bucket URI or local machine. For more info, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configure-hook.html).

3. **Rules** - configure this parameter to enable Debugger built-in rules that you want to run in parallel. The rules automatically analyze your training job and find training issues. The ProfilerReport rule saves the Debugger profiling reports in your secured S3 bucket URI. For more info, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/use-debugger-built-in-rules.html).

In [None]:
# Location in S3 where the debugger output will be stored is mentioned in the previous step

# Set the profile config for both system and framework metrics
profiler_config = ProfilerConfig(
 system_monitor_interval_millis = 500,
 framework_profile_params = FrameworkProfile(
 detailed_profiling_config = DetailedProfilingConfig(
 start_step = 5, 
 num_steps = 10
 ),
 dataloader_profiling_config = DataloaderProfilingConfig(
 start_step = 7, 
 num_steps = 10
 ),
 python_profiling_config = PythonProfilingConfig(
 start_step = 9, 
 num_steps = 10,
 python_profiler = PythonProfiler.CPROFILE, 
 cprofile_timer = cProfileTimer.TOTAL_TIME
 )
 )
)

# Set the debugger hook config to save tensors
debugger_hook_config = DebuggerHookConfig(
 collection_configs = [
 CollectionConfig(name = 'metrics'),
 CollectionConfig(name = 'sm_metrics'),
 CollectionConfig(name = 'weights'),
 CollectionConfig(name = 'gradients')
 ]
)

# Set the rules to analyze tensors emitted during training
## This specific set of rules will inspect the overall training performance and progress of the model
rules=[
 ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
 Rule.sagemaker(rule_configs.loss_not_decreasing()),
 Rule.sagemaker(rule_configs.overfit()),
 Rule.sagemaker(rule_configs.overtraining()),
 Rule.sagemaker(rule_configs.stalled_training_rule())
]
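
The three framework profiling configurations above use staggered start steps, so their windows overlap only partially. The sketch below works out which steps each window covers, assuming that a window spans `num_steps` consecutive steps beginning at `start_step`. `profiled_steps` is a hypothetical helper used only for this illustration.

```python
def profiled_steps(start_step: int, num_steps: int) -> range:
    """Steps covered by a profiling window that begins at start_step."""
    return range(start_step, start_step + num_steps)


# start_step and num_steps values taken from the profiler config above
windows = {
    "detailed": profiled_steps(5, 10),
    "dataloader": profiled_steps(7, 10),
    "python": profiled_steps(9, 10),
}

for name, steps in windows.items():
    print("{:<10} steps {}-{}".format(name, steps[0], steps[-1]))

# Steps during which all three profilers are active at the same time
common = sorted(set.intersection(*(set(steps) for steps in windows.values())))
print("all three active on steps", common)
```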

### C) (Optional) Delete previous checkpoints 

If model checkpoints from previous trainings are found in the S3 checkpoint location specified in the previous step, then, they will be automatically downloaded to the container running the training process. If you set the checkpoint parameter `checkpoint_load_previous` to `True`, then the weights from the latest checkpoint file will be loaded to the model and training will start from there. If you have set `checkpoint_load_previous` to `False`, you can avoid the unnecessary downloading of the checkpoint files from S3 to the container by deleting those checkpoint files from S3. For this, run the following code cell.

In [None]:
# Delete the checkpoints if you want to train from the beginning; else ignore this code cell
for checkpoint_file in s3_bucket_resource.objects.filter(Prefix='{}/checkpoint/'.format(nb_name)):
 checkpoint_file_key = checkpoint_file.key
 print('Deleting {} ...'.format(checkpoint_file_key))
 s3_resource.Object(s3_bucket_resource.name, checkpoint_file_key).delete()
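
The behavior of `checkpoint_load_previous` described above can be sketched as follows. This is an illustration under stated assumptions and not the logic of the actual training script: the file naming scheme `checkpoint-NNNN.h5` and the helper `latest_checkpoint` are hypothetical.

```python
import os
import tempfile


def latest_checkpoint(checkpoint_dir: str, load_previous: bool):
    """Return the path of the newest checkpoint, or None to train from scratch."""
    if not load_previous or not os.path.isdir(checkpoint_dir):
        return None
    # Zero-padded epoch numbers make lexicographic order match training order
    candidates = sorted(
        name for name in os.listdir(checkpoint_dir) if name.endswith(".h5")
    )
    return os.path.join(checkpoint_dir, candidates[-1]) if candidates else None


# Demonstration with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    for epoch in (1, 2, 10):
        open(os.path.join(tmp_dir, "checkpoint-{:04d}.h5".format(epoch)), "w").close()
    print(os.path.basename(latest_checkpoint(tmp_dir, load_previous=True)))
    print(latest_checkpoint(tmp_dir, load_previous=False))
```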

### D) Run the training job 

Prepare the `estimator` and call the `fit()` method. This will pull the container containing the specified version of the framework in the AWS region, copy the custom training script to it and run the training job in the specified type of EC2 instance(s). The training data will be pulled from the specified location in S3 and the generated model along with the checkpoints will be written to the specified locations in S3. The debugger will use its configured settings to capture the data and write them to the specified locations in S3.

In [None]:
# Create the estimator
estimator = TensorFlow(
 source_dir=train_script_dir,
 entry_point=train_script,
 code_location=code_location,
 checkpoint_local_path=model_checkpoint_local_dir,
 checkpoint_s3_uri=model_checkpoint_s3_path,
 model_dir=model_local_dir,
 output_path=model_and_debugger_output_s3_path,
 instance_type=train_instance_type,
 volume_size=train_instance_volume_size_in_gb,
 instance_count=train_instance_count,
 use_spot_instances=use_spot_instances,
 max_wait=spot_max_wait_time_in_seconds,
 max_run=max_run_time_in_seconds,
 hyperparameters=hyperparameters,
 role=sagemaker.get_execution_role(),
 base_job_name=train_job_name,
 image_uri=image_uri,
 container_log_level=container_log_level,
 script_mode=True,
 disable_profiler=False,
 profiler_config=profiler_config,
 debugger_hook_config=debugger_hook_config,
 rules=rules)

# Perform the training
estimator.fit(inputs, wait=True)

## 5. View the auto-generated debugger profiling report 

The debugger's auto-generated profiling report will be stored in the S3 directory specified in earlier steps. You can view it here.

For information on how to read the report, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html).

In [None]:
# Get the S3 path to the debugger's auto-generated profiling report
profiling_report_s3_prefix = '{}/output/{}/rule-output/ProfilerReport/profiler-output/profiler-report.html'.format(nb_name,
 estimator.latest_training_job.job_name)
profiling_report = sagemaker.Session().read_s3_file(s3_bucket, profiling_report_s3_prefix)


# Display the debugger's auto-generated profiling report
display(HTML(profiling_report))

## 6. Perform interactive analysis of the debugger output 

The debugger's output will be stored in the S3 directories specified in earlier steps. In this step, we will read that data and display it through various visualizations.

### A. Get the training job 

Get the training job object from the estimator object used for training in the previous step. This is required to read the metrics from the debugger's output.

In [None]:
# This assumes that the job was trained in the same AWS region as the S3 bucket where the debugger output is stored
# If not, then make appropriate changes to the following code
tj = TrainingJob(estimator.latest_training_job.job_name, sagemaker.Session().boto_region_name)

### B. Read the metrics 

1. Wait for the system and framework metrics to be available.
2. Get the reader objects for both of these metrics.
3. Refresh the event file lists that contain these metrics.

In [None]:
# Wait for the data to be available
tj.wait_for_sys_profiling_data_to_be_available()
tj.wait_for_framework_profiling_data_to_be_available()
# Get the metrics reader
system_metrics_reader = tj.get_systems_metrics_reader()
framework_metrics_reader = tj.get_framework_metrics_reader()
# Refresh the event file list
system_metrics_reader.refresh_event_file_list()
framework_metrics_reader.refresh_event_file_list()

### C. Plot the metrics 

Plot visualizations for the metrics read in the previous step.

#### a. System metrics histogram 

In [None]:
metrics_histogram = MetricsHistogram(system_metrics_reader)
metrics_histogram.plot(
 starttime=0, 
 endtime=system_metrics_reader.get_timestamp_of_latest_available_file(), 
 select_dimensions=["CPU", "GPU", "I/O"],
 select_events=["total"]
)

#### b. Framework metrics stepline chart 

In [None]:
view_step_timeline_chart = StepTimelineChart(framework_metrics_reader)

#### c. Framework metrics step histogram 

In [None]:
step_histogram = StepHistogram(framework_metrics_reader)
step_histogram.plot(
 starttime=step_histogram.last_timestamp - 5 * 1000 * 1000, 
 endtime=step_histogram.last_timestamp, 
 show_workers=True
)
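
The `starttime` argument above subtracts `5 * 1000 * 1000` from the last timestamp. Assuming, as that offset suggests, that timestamps are expressed in microseconds, this selects the final 5 seconds of the run. The sketch below generalizes the arithmetic. `last_seconds_window` is a hypothetical helper and the timestamp is a placeholder, not a value from a real training job.

```python
MICROSECONDS_PER_SECOND = 1000 * 1000


def last_seconds_window(last_timestamp_us: int, seconds: float):
    """Return (starttime, endtime) covering the final `seconds` of a run, in microseconds."""
    start = last_timestamp_us - int(seconds * MICROSECONDS_PER_SECOND)
    # Clamp at zero so a short run never produces a negative start time
    return max(start, 0), last_timestamp_us


# Placeholder timestamp in microseconds since the epoch
start, end = last_seconds_window(1_622_000_000_000_000, seconds=5)
print(end - start)
```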

#### d. System and framework metrics timeline charts 

In [None]:
view_timeline_charts = TimelineCharts(
 system_metrics_reader, 
 framework_metrics_reader,
 select_dimensions=["CPU", "GPU", "I/O"],
 select_events=["total"] 
)

view_timeline_charts.plot_detailed_profiler_data([500,510]) 

#### e. System and framework metrics heatmap 

In [None]:
view_heatmap = Heatmap(
 system_metrics_reader,
 framework_metrics_reader,
 select_dimensions=["CPU", "GPU", "I/O"],
 select_events=["total"],
 plot_height=450
)

## 7. Cleanup 

As a best practice, you should delete resources and S3 objects when they are no longer required. This will help you avoid incurring unnecessary costs.

This step will clean up the S3 objects created by this notebook.

In [None]:
# Delete data from S3 bucket
for file in s3_bucket_resource.objects.filter(Prefix='{}/'.format(nb_name)):
 file_key = file.key
 print('Deleting {} ...'.format(file_key))
 s3_resource.Object(s3_bucket_resource.name, file_key).delete()
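
The loop above issues one delete request per object. For prefixes that hold many objects, one option is to group the keys into batches first. The sketch below shows only the grouping step and assumes the S3 `DeleteObjects` limit of 1,000 keys per request. `chunk_keys` is a hypothetical helper and the keys are placeholders.

```python
def chunk_keys(keys, batch_size=1000):
    """Split a list of object keys into batches of at most batch_size keys."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]


# Placeholder keys, not objects from a real bucket
example_keys = ["tf2-fashion-mnist-custom-debugger/file-{}".format(i) for i in range(2500)]
batches = chunk_keys(example_keys)
print([len(batch) for batch in batches])
```

Each batch could then be passed to the bucket resource's `delete_objects` call as `Delete={'Objects': [{'Key': key} for key in batch]}`.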