{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:44:26.461461Z",
     "iopub.status.busy": "2021-05-26T15:44:26.460634Z",
     "iopub.status.idle": "2021-05-26T15:44:57.084009Z",
     "shell.execute_reply": "2021-05-26T15:44:57.084439Z"
    },
    "papermill": {
     "duration": 30.670318,
     "end_time": "2021-05-26T15:44:57.084607",
     "exception": false,
     "start_time": "2021-05-26T15:44:26.414289",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Install dependencies\n",
    "!pip install -q smdebug==1.0.3\n",
    "!pip install -q seaborn\n",
    "!pip install -q plotly\n",
    "!pip install -q opencv-python\n",
    "!pip install -q shap\n",
    "!pip install -q bokeh\n",
    "!pip install -q imageio\n",
    "!pip install -Uq sagemaker"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.039952,
     "end_time": "2021-05-26T15:44:57.165151",
     "exception": false,
     "start_time": "2021-05-26T15:44:57.125199",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#  Profile machine learning training with Amazon SageMaker Debugger \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.039952,
     "end_time": "2021-05-26T15:44:57.165151",
     "exception": false,
     "start_time": "2021-05-26T15:44:57.125199",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "\n",
    "### Gain high precision insights of horovod based distributed machine learning training jobs\n",
    "\n",
    "This notebook demonstrates how to: \n",
    "* Execute distributed training on Amazon SageMaker using Horovod framework.  \n",
    "* Execute distributed training using script mode which allows you to use a training script similar to one you would use outside SageMaker.\n",
    "* Execute SageMaker Debugger profiling rules against training jobs in process.\n",
    "* Visualize the system and framework metrics using the SMDebug client library.\n",
    "* Analyze autogenerated profiling report and implement recommendations suggested by SageMaker Debugger.  \n",
    "\n",
    "**Table of Contents** \n",
    "\n",
    "1. [Introduction](#intro)\n",
    "2. [Section 1 - Setup](#setup)\n",
    "3. [Section 2 - Train sentiment analysis CNN model with custom Debugger profiling configuration](#train)\n",
    "3. [Section 3 - Interactive analysis using the SMDebug visualization tools](#analysis)\n",
    "5. [Section 4 - Analyze report generated by Debugger](#profiler-report)\n",
    "6. [Section 5 - Analyze recommendations from the report](#analyze-profiler-recommendations)\n",
    "7. [Section 6 - Implement recommendations from the report](#implement-profiler-recommendations)\n",
    "8. [Conclusion](#conclusion)\n",
    "\n",
    "\n",
    "## Introduction <a id='intro'></a>   \n",
    "\n",
    "Training machine learning models is a time and compute intensive process requiring multiple training runs with different hyperparameters before a model yields acceptable accuracy. CPU and GPU based distributed training with\n",
    "frameworks such as Horovord and Parameter Servers address this issue by allowing training to be easily\n",
    "scalable to a cluster of resources. However, distributed training makes it harder to identify and debug\n",
    "resource bottleneck problems. Gaining insights into the training in progress, both at the machine learning\n",
    "framework level and the underlying compute resources level, is critical step towards understanding the\n",
    "resource usage patterns and reducing resource wastage. Analyzing bottleneck issues is necessary to\n",
    "maximize the utilization of compute resources and optimize model training performance to deliver state-of-the-art machine learning models with target accuracy.\n",
    "\n",
    "Amazon SageMaker is a fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy ML models at scale. Amazon SageMaker Debugger is a feature of SageMaker training that makes it easy to train machine learning (ML) models faster by capturing real-time metrics such as learning gradients and weights, providing transparency into the training process, so you can correct anomalies such as losses, overfitting, and overtraining. With the newly introduced profiling capability, SageMaker Debugger now automatically monitors system resources such as CPU, GPU, network, IO, and memory providing a complete resource utilization view of training jobs.\n",
    "\n",
    "In this notebook, we demonstrate the Amazon SageMaker Debugger profiling capabilities using the sentiment analysis use case. \n",
    "\n",
    "####  Use case - Sentiment Analysis with TensorFlow and Keras\n",
    "\n",
    "Sentiment analysis is a very common text analytics task that involves determining whether a text sample is positive or negative about its subject.  There are several different algorithms for performing this task, including statistical algorithms and deep learning algorithms.  With respect to deep learning, a Convolutional Neural Net (CNN) is sometimes used for this purpose.  In this notebook we'll use a CNN built with TensorFlow to perform sentiment analysis in Amazon SageMaker on the IMDB dataset, which consists of movie reviews labeled as having positive or negative sentiment."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.040061,
     "end_time": "2021-05-26T15:44:57.245149",
     "exception": false,
     "start_time": "2021-05-26T15:44:57.205088",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## Step 0 - Install and check the SageMaker Python SDK version\n",
    "To use the new Debugger profiling features, ensure that you have the right versions of SageMaker and SMDebug SDKs installed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.040333,
     "end_time": "2021-05-26T15:44:59.115251",
     "exception": false,
     "start_time": "2021-05-26T15:44:59.074918",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Check the library versions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:44:59.199346Z",
     "iopub.status.busy": "2021-05-26T15:44:59.198834Z",
     "iopub.status.idle": "2021-05-26T15:44:59.280196Z",
     "shell.execute_reply": "2021-05-26T15:44:59.280592Z"
    },
    "papermill": {
     "duration": 0.125097,
     "end_time": "2021-05-26T15:44:59.280723",
     "exception": false,
     "start_time": "2021-05-26T15:44:59.155626",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "\n",
    "sagemaker.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.041279,
     "end_time": "2021-05-26T15:44:59.451974",
     "exception": false,
     "start_time": "2021-05-26T15:44:59.410695",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "**<font color=red>Important</font>**: If the SageMaker version is less than 2.19.0 and if you are using an existing SageMaker Studio or Notebook instance, you must update the environment to use the latest SageMaker Python SDK. Follow instructions at [Update Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-tasks-update.html) and [Notebook Instance Software Updates](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-software-updates.html) in the [Amazon SageMaker developer guide](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.04154,
     "end_time": "2021-05-26T15:44:59.534518",
     "exception": false,
     "start_time": "2021-05-26T15:44:59.492978",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## Section 1 - Setup <a id='setup'></a>\n",
    "\n",
    "In this section, you will import the necessary libraries, set up variables, and examine data to train the sentiment analysis model.\n",
    "\n",
    "Let's start by specifying:\n",
    "\n",
    "* The AWS region used to host your model.\n",
    "* The IAM role associated with this SageMaker notebook instance.\n",
    "* The S3 bucket used to store the data used to train your model, any additional model data, and the data captured from model invocations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.041147,
     "end_time": "2021-05-26T15:44:59.616652",
     "exception": false,
     "start_time": "2021-05-26T15:44:59.575505",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### 1.1 Import necessary libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:44:59.703396Z",
     "iopub.status.busy": "2021-05-26T15:44:59.702904Z",
     "iopub.status.idle": "2021-05-26T15:44:59.707649Z",
     "shell.execute_reply": "2021-05-26T15:44:59.708008Z"
    },
    "papermill": {
     "duration": 0.050362,
     "end_time": "2021-05-26T15:44:59.708136",
     "exception": false,
     "start_time": "2021-05-26T15:44:59.657774",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import os\n",
    "import boto3\n",
    "import time\n",
    "\n",
    "# import debugger libraries\n",
    "import sagemaker\n",
    "from sagemaker.tensorflow import TensorFlow\n",
    "from sagemaker.debugger import ProfilerConfig, FrameworkProfile\n",
    "\n",
    "from tensorflow.keras.preprocessing import sequence\n",
    "from tensorflow.python.keras.datasets import imdb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.041304,
     "end_time": "2021-05-26T15:44:59.790429",
     "exception": false,
     "start_time": "2021-05-26T15:44:59.749125",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### 1.2 AWS region and  IAM Role"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:44:59.886041Z",
     "iopub.status.busy": "2021-05-26T15:44:59.885530Z",
     "iopub.status.idle": "2021-05-26T15:45:00.387018Z",
     "shell.execute_reply": "2021-05-26T15:45:00.387445Z"
    },
    "papermill": {
     "duration": 0.556059,
     "end_time": "2021-05-26T15:45:00.387596",
     "exception": false,
     "start_time": "2021-05-26T15:44:59.831537",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "\n",
    "region = sagemaker.Session().boto_region_name\n",
    "print(\"AWS Region: {}\".format(region))\n",
    "\n",
    "role = sagemaker.get_execution_role()\n",
    "print(\"RoleArn: {}\".format(role))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.041908,
     "end_time": "2021-05-26T15:45:00.474172",
     "exception": false,
     "start_time": "2021-05-26T15:45:00.432264",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### 1.3 S3 bucket and prefixes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:45:00.570189Z",
     "iopub.status.busy": "2021-05-26T15:45:00.568925Z",
     "iopub.status.idle": "2021-05-26T15:45:00.705874Z",
     "shell.execute_reply": "2021-05-26T15:45:00.706345Z"
    },
    "papermill": {
     "duration": 0.190931,
     "end_time": "2021-05-26T15:45:00.706495",
     "exception": false,
     "start_time": "2021-05-26T15:45:00.515564",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "s3_prefix = \"tf-hvd-sentiment-silent\"\n",
    "\n",
    "traindata_s3_prefix = \"{}/data/train\".format(s3_prefix)\n",
    "testdata_s3_prefix = \"{}/data/test\".format(s3_prefix)\n",
    "\n",
    "sagemaker_session = sagemaker.Session()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.041749,
     "end_time": "2021-05-26T15:45:00.790394",
     "exception": false,
     "start_time": "2021-05-26T15:45:00.748645",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### 1.4 Process training data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.041589,
     "end_time": "2021-05-26T15:45:00.873650",
     "exception": false,
     "start_time": "2021-05-26T15:45:00.832061",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We'll begin by loading the reviews dataset, and padding the reviews, so all reviews have the same length. Each review is represented as an array of numbers, where each number represents an indexed word. Training data for both Local Mode and Hosted Training must be saved as files, so we'll also save the transformed data to files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:45:00.961616Z",
     "iopub.status.busy": "2021-05-26T15:45:00.961055Z",
     "iopub.status.idle": "2021-05-26T15:45:06.953496Z",
     "shell.execute_reply": "2021-05-26T15:45:06.952975Z"
    },
    "papermill": {
     "duration": 6.038467,
     "end_time": "2021-05-26T15:45:06.953612",
     "exception": false,
     "start_time": "2021-05-26T15:45:00.915145",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "max_features = 20000\n",
    "maxlen = 400\n",
    "\n",
    "(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)\n",
    "print(len(x_train), \"train sequences\")\n",
    "print(len(x_test), \"test sequences\")\n",
    "\n",
    "x_train = sequence.pad_sequences(x_train, maxlen=maxlen)\n",
    "x_test = sequence.pad_sequences(x_test, maxlen=maxlen)\n",
    "print(\"x_train shape:\", x_train.shape)\n",
    "print(\"x_test shape:\", x_test.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:45:07.046630Z",
     "iopub.status.busy": "2021-05-26T15:45:07.046040Z",
     "iopub.status.idle": "2021-05-26T15:45:07.048435Z",
     "shell.execute_reply": "2021-05-26T15:45:07.048853Z"
    },
    "papermill": {
     "duration": 0.050496,
     "end_time": "2021-05-26T15:45:07.048982",
     "exception": false,
     "start_time": "2021-05-26T15:45:06.998486",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Each review is an array of numbers where each number is an indexed word\n",
    "print(x_train[:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:45:07.142833Z",
     "iopub.status.busy": "2021-05-26T15:45:07.142191Z",
     "iopub.status.idle": "2021-05-26T15:45:07.144674Z",
     "shell.execute_reply": "2021-05-26T15:45:07.145038Z"
    },
    "papermill": {
     "duration": 0.051705,
     "end_time": "2021-05-26T15:45:07.145163",
     "exception": false,
     "start_time": "2021-05-26T15:45:07.093458",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "data_dir = os.path.join(os.getcwd(), \"data\")\n",
    "os.makedirs(data_dir, exist_ok=True)\n",
    "\n",
    "train_dir = os.path.join(os.getcwd(), \"data/train\")\n",
    "os.makedirs(train_dir, exist_ok=True)\n",
    "\n",
    "test_dir = os.path.join(os.getcwd(), \"data/test\")\n",
    "os.makedirs(test_dir, exist_ok=True)\n",
    "\n",
    "csv_test_dir = os.path.join(os.getcwd(), \"data/csv-test\")\n",
    "os.makedirs(csv_test_dir, exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:45:07.238626Z",
     "iopub.status.busy": "2021-05-26T15:45:07.238141Z",
     "iopub.status.idle": "2021-05-26T15:45:07.294586Z",
     "shell.execute_reply": "2021-05-26T15:45:07.294030Z"
    },
    "papermill": {
     "duration": 0.104975,
     "end_time": "2021-05-26T15:45:07.294699",
     "exception": false,
     "start_time": "2021-05-26T15:45:07.189724",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "np.save(os.path.join(train_dir, \"x_train.npy\"), x_train)\n",
    "np.save(os.path.join(train_dir, \"y_train.npy\"), y_train)\n",
    "np.save(os.path.join(test_dir, \"x_test.npy\"), x_test)\n",
    "np.save(os.path.join(test_dir, \"y_test.npy\"), y_test)\n",
    "np.savetxt(\n",
    "    os.path.join(csv_test_dir, \"csv-test.csv\"),\n",
    "    np.array(x_test[:100], dtype=np.int32),\n",
    "    fmt=\"%d\",\n",
    "    delimiter=\",\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:45:07.388775Z",
     "iopub.status.busy": "2021-05-26T15:45:07.387960Z",
     "iopub.status.idle": "2021-05-26T15:45:08.926795Z",
     "shell.execute_reply": "2021-05-26T15:45:08.927200Z"
    },
    "papermill": {
     "duration": 1.587547,
     "end_time": "2021-05-26T15:45:08.927367",
     "exception": false,
     "start_time": "2021-05-26T15:45:07.339820",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "train_s3 = sagemaker_session.upload_data(path=\"./data/train/\", key_prefix=traindata_s3_prefix)\n",
    "test_s3 = sagemaker_session.upload_data(path=\"./data/test/\", key_prefix=testdata_s3_prefix)\n",
    "\n",
    "inputs = {\"train\": train_s3, \"test\": test_s3}\n",
    "print(inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.045241,
     "end_time": "2021-05-26T15:45:09.018360",
     "exception": false,
     "start_time": "2021-05-26T15:45:08.973119",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## Section 2 - Train sentiment analysis CNN model with custom profiler configuration <a id='train'></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.044776,
     "end_time": "2021-05-26T15:45:09.107993",
     "exception": false,
     "start_time": "2021-05-26T15:45:09.063217",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "In this section we use SageMaker's hosted training using Uber's Horovod framework, which uses compute resources separate from this notebook instance.  Hosted training spins up one or more instances (cluster) for training, and then tears the cluster down when training is complete. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.044855,
     "end_time": "2021-05-26T15:45:09.197738",
     "exception": false,
     "start_time": "2021-05-26T15:45:09.152883",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The objective is to take a single-GPU training script and successfully scale it to train across many GPUs in parallel. Once a training script has been written for scale with Horovod, it can run on a single-GPU, multiple-GPUs, or even multiple hosts without any further code changes.\n",
    "\n",
    "With the SageMaker Python SDK, you can train and host TensorFlow models on Amazon SageMaker. For more information, see [Use TensorFlow with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html) in the SageMaker Python SDK documentation.\n",
    "\n",
    "For our training, we will use three p3.8xlarge instances to begin with and change our training configuration based on profiling recommendations from Amazon SageMaker Debugger. Amazon EC2 P3 instances deliver high performance compute in the cloud with up to 8 NVIDIA\u00ae V100 Tensor Core GPUs and up to 100 Gbps of networking throughput for machine learning and HPC applications. The p3.8xlarge instance comes with 4 GPUs and 32 vCPU cores with 10 Gbps networking performance. Please refer to the EC2 Instance Types page for more details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.044821,
     "end_time": "2021-05-26T15:45:09.287605",
     "exception": false,
     "start_time": "2021-05-26T15:45:09.242784",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### 2.1 Setup training job"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.044835,
     "end_time": "2021-05-26T15:45:09.377466",
     "exception": false,
     "start_time": "2021-05-26T15:45:09.332631",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We will use the standard SageMaker Estimator API for TensorFlow to create training jobs. Profiling configuration will be enabled by default to emit framework and system metrics for our analysis. Define hyperparameters such as number of epochs, batch size, and data augmentation. \n",
    "\n",
    "* You can increase batch size to increase system utilization, but it may result in CPU bottleneck problems. Data preprocessing of a large batch size with augmentation requires a heavy computation. \n",
    "\n",
    "* You can disable `data_augmentation` to see the impact on the system utilization.\n",
    "\n",
    "* We've set the number of epochs to enable training to run quicker, please adjust this accordingly for your use case."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:45:09.472010Z",
     "iopub.status.busy": "2021-05-26T15:45:09.471507Z",
     "iopub.status.idle": "2021-05-26T15:45:09.473402Z",
     "shell.execute_reply": "2021-05-26T15:45:09.473794Z"
    },
    "papermill": {
     "duration": 0.051169,
     "end_time": "2021-05-26T15:45:09.473920",
     "exception": false,
     "start_time": "2021-05-26T15:45:09.422751",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "hyperparameters = {\"epoch\": 1, \"batch_size\": 256, \"data_augmentation\": True}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.045723,
     "end_time": "2021-05-26T15:45:09.564879",
     "exception": false,
     "start_time": "2021-05-26T15:45:09.519156",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Take your AWS account limits into consideration while setting up the `instance_type` and `instance_count` of the cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:45:09.662385Z",
     "iopub.status.busy": "2021-05-26T15:45:09.661837Z",
     "iopub.status.idle": "2021-05-26T15:45:09.663674Z",
     "shell.execute_reply": "2021-05-26T15:45:09.664061Z"
    },
    "papermill": {
     "duration": 0.052063,
     "end_time": "2021-05-26T15:45:09.664189",
     "exception": false,
     "start_time": "2021-05-26T15:45:09.612126",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "distributions = {\n",
    "    \"mpi\": {\n",
    "        \"enabled\": True,\n",
    "        \"processes_per_host\": 2,\n",
    "        \"custom_mpi_options\": \"-verbose -x HOROVOD_TIMELINE=./hvd_timeline.json -x NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none\",\n",
    "    }\n",
    "}\n",
    "\n",
    "model_dir = \"/opt/ml/model\"\n",
    "train_instance_type = \"ml.p3.8xlarge\"\n",
    "instance_count = 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.045119,
     "end_time": "2021-05-26T15:45:09.755299",
     "exception": false,
     "start_time": "2021-05-26T15:45:09.710180",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### 2.2  Define profiler configuration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.045118,
     "end_time": "2021-05-26T15:45:09.845506",
     "exception": false,
     "start_time": "2021-05-26T15:45:09.800388",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "With the following **`profiler_config`** parameter configuration, Debugger calls the default settings of monitoring, collecting system metrics every 500 milliseconds. For collecting framework metrics, you can set target steps and target time intervals in detail. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:45:09.939556Z",
     "iopub.status.busy": "2021-05-26T15:45:09.939066Z",
     "iopub.status.idle": "2021-05-26T15:45:09.941038Z",
     "shell.execute_reply": "2021-05-26T15:45:09.941439Z"
    },
    "papermill": {
     "duration": 0.050838,
     "end_time": "2021-05-26T15:45:09.941562",
     "exception": false,
     "start_time": "2021-05-26T15:45:09.890724",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "profiler_config = ProfilerConfig(\n",
    "    framework_profile_params=FrameworkProfile(start_step=2, num_steps=7)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.045303,
     "end_time": "2021-05-26T15:45:10.032035",
     "exception": false,
     "start_time": "2021-05-26T15:45:09.986732",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "With this `profiler_config` settings, Debugger will collect system metrics every 500 milliseconds and framework metrics on the specified steps (from step 2 to 9). For a complete list of parameters and profiling configurations, see [Configure Debugger Using Amazon SageMaker Python SDK](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configuration.html) in the [Amazon SageMaker Debugger developer guide](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.045213,
     "end_time": "2021-05-26T15:45:10.122553",
     "exception": false,
     "start_time": "2021-05-26T15:45:10.077340",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### 2.3  Configure training job using TensorFlow estimator and pass in the profiler configuration.\n",
    "\n",
    "While constructing a SageMaker estimator, specify the TensorFlow framework version and supported python version. For a complete list of the supported framework versions and the corresponding python version to use, see [Supported Frameworks and Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html#debugger-supported-frameworks) in the [Amazon SageMaker Debugger developer guide](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html).\n",
    "\n",
    "**Note**: In the following estimator, the exact `image_uri` was pointed to use the latest AWS TensorFlow deep learning container image. For a complete list of AWS deep learning containers, see [General Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers) in the [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers/) repository. The Debugger's new profiling features are available for TensorFlow 2.3.1 and PyTorch 1.6.0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:45:10.226800Z",
     "iopub.status.busy": "2021-05-26T15:45:10.226290Z",
     "iopub.status.idle": "2021-05-26T15:45:10.691194Z",
     "shell.execute_reply": "2021-05-26T15:45:10.691636Z"
    },
    "papermill": {
     "duration": 0.52407,
     "end_time": "2021-05-26T15:45:10.691788",
     "exception": false,
     "start_time": "2021-05-26T15:45:10.167718",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "estimator = TensorFlow(\n",
    "    role=sagemaker.get_execution_role(),\n",
    "    base_job_name=\"tf-keras-silent\",\n",
    "    model_dir=model_dir,\n",
    "    instance_count=instance_count,\n",
    "    instance_type=train_instance_type,\n",
    "    entry_point=\"sentiment-distributed.py\",\n",
    "    source_dir=\"./tf-sentiment-script-mode\",\n",
    "    framework_version=\"2.3.1\",\n",
    "    py_version=\"py37\",\n",
    "    profiler_config=profiler_config,\n",
    "    script_mode=True,\n",
    "    hyperparameters=hyperparameters,\n",
    "    distribution=distributions,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.062137,
     "end_time": "2021-05-26T15:45:10.801538",
     "exception": false,
     "start_time": "2021-05-26T15:45:10.739401",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We then simply call `fit` to start the actual hosted training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:45:10.905877Z",
     "iopub.status.busy": "2021-05-26T15:45:10.905354Z",
     "iopub.status.idle": "2021-05-26T15:45:11.469469Z",
     "shell.execute_reply": "2021-05-26T15:45:11.469884Z"
    },
    "papermill": {
     "duration": 0.613381,
     "end_time": "2021-05-26T15:45:11.470060",
     "exception": false,
     "start_time": "2021-05-26T15:45:10.856679",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "estimator.fit(inputs, wait=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.045836,
     "end_time": "2021-05-26T15:45:11.561681",
     "exception": false,
     "start_time": "2021-05-26T15:45:11.515845",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## Section 3 - Interactive analysis using the SMDebug visualization tools <a id='analysis'></a>\n",
    "\n",
    "In this section, we introduce interactive analysis of the data captured by SageMaker Debugger. It is organized in order of training phases: initialization, training, and finalization. The profiling data results are categorized as System Metrics and Algorithm (Framework) Metrics.\n",
    "\n",
    "Once the training job initiates, SageMaker Debugger starts collecting system and framework metrics. The smdebug library provides profiler analysis tools that enable you to access and analyze the profiling data. The following code cells are to set up a TrainingJob object to retrieve the system and framework metrics when they become available in the default S3 bucket. Once the metrics are available, you can query, plot, and analyze the profiling metrics data throughout this notebook. \n",
    "\n",
    "Let's check the profiler artifact path where the system metrics and framework metrics are stored."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:45:11.657012Z",
     "iopub.status.busy": "2021-05-26T15:45:11.656400Z",
     "iopub.status.idle": "2021-05-26T15:45:11.659578Z",
     "shell.execute_reply": "2021-05-26T15:45:11.659132Z"
    },
    "papermill": {
     "duration": 0.052479,
     "end_time": "2021-05-26T15:45:11.659687",
     "exception": false,
     "start_time": "2021-05-26T15:45:11.607208",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "estimator.latest_job_profiler_artifacts_path()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.045731,
     "end_time": "2021-05-26T15:45:11.751238",
     "exception": false,
     "start_time": "2021-05-26T15:45:11.705507",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### 3.1 Read profiling data: system metrics\n",
    "Once the training job is running, SageMaker collects system and framework metrics. The following code cells wait for the system metrics to become available in S3. Once they are available, you can query and plot them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:45:11.847860Z",
     "iopub.status.busy": "2021-05-26T15:45:11.847030Z",
     "iopub.status.idle": "2021-05-26T15:45:12.081024Z",
     "shell.execute_reply": "2021-05-26T15:45:12.081426Z"
    },
    "papermill": {
     "duration": 0.284438,
     "end_time": "2021-05-26T15:45:12.081572",
     "exception": false,
     "start_time": "2021-05-26T15:45:11.797134",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from smdebug.profiler.system_metrics_reader import S3SystemMetricsReader\n",
    "\n",
    "path = estimator.latest_job_profiler_artifacts_path()\n",
    "system_metrics_reader = S3SystemMetricsReader(path)\n",
    "\n",
    "sagemaker_client = boto3.client(\"sagemaker\")\n",
    "training_job_name = estimator.latest_training_job.name\n",
    "print(f\"Training job name: {training_job_name}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:45:12.181584Z",
     "iopub.status.busy": "2021-05-26T15:45:12.181032Z",
     "iopub.status.idle": "2021-05-26T15:50:40.587727Z",
     "shell.execute_reply": "2021-05-26T15:50:40.588152Z"
    },
    "papermill": {
     "duration": 328.459898,
     "end_time": "2021-05-26T15:50:40.588290",
     "exception": false,
     "start_time": "2021-05-26T15:45:12.128392",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "training_job_status = \"\"\n",
    "training_job_secondary_status = \"\"\n",
    "while system_metrics_reader.get_timestamp_of_latest_available_file() == 0:\n",
    "    system_metrics_reader.refresh_event_file_list()\n",
    "    # Fetch the current job description to report progress while waiting\n",
    "    description = sagemaker_client.describe_training_job(TrainingJobName=training_job_name)\n",
    "    if \"TrainingJobStatus\" in description:\n",
    "        training_job_status = f\"TrainingJobStatus: {description['TrainingJobStatus']}\"\n",
    "    if \"SecondaryStatus\" in description:\n",
    "        training_job_secondary_status = f\"TrainingJobSecondaryStatus: {description['SecondaryStatus']}\"\n",
    "\n",
    "    print(\n",
    "        f\"Profiler data from system not available yet. {training_job_status}. {training_job_secondary_status}.\"\n",
    "    )\n",
    "    time.sleep(20)\n",
    "\n",
    "print(\"\\n\\nProfiler data from system is available\")"
   ]
  },
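  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The polling loop above has no upper bound on how long it waits for profiler data. As a minimal sketch, assuming a timeout is desirable, the same pattern can be wrapped in a bounded helper. The name `wait_until` and its parameters are hypothetical and not part of smdebug or the SageMaker SDK; the stand-in predicate below only simulates data becoming available on the third check.\n",
    "\n",
    "```python\n",
    "import time\n",
    "\n",
    "\n",
    "def wait_until(predicate, timeout_s=600.0, interval_s=20.0):\n",
    "    # Poll predicate() until it returns True or timeout_s seconds have elapsed.\n",
    "    deadline = time.monotonic() + timeout_s\n",
    "    while True:\n",
    "        if predicate():\n",
    "            return True\n",
    "        if time.monotonic() >= deadline:\n",
    "            return False\n",
    "        time.sleep(interval_s)\n",
    "\n",
    "\n",
    "# Stand-in predicate (illustration only): succeeds on the third call.\n",
    "calls = {\"n\": 0}\n",
    "\n",
    "\n",
    "def data_available():\n",
    "    calls[\"n\"] += 1\n",
    "    return calls[\"n\"] >= 3\n",
    "\n",
    "\n",
    "found = wait_until(data_available, timeout_s=5.0, interval_s=0.01)\n",
    "print(\"Profiler data available:\", found)\n",
    "```\n",
    "\n",
    "In this notebook, the predicate could refresh `system_metrics_reader` and check whether `get_timestamp_of_latest_available_file()` is non-zero."
   ]
  },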
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.052743,
     "end_time": "2021-05-26T15:50:40.691899",
     "exception": false,
     "start_time": "2021-05-26T15:50:40.639156",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "The following helper function converts a Unix timestamp (in seconds) into a formatted UTC string:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:50:40.799086Z",
     "iopub.status.busy": "2021-05-26T15:50:40.798462Z",
     "iopub.status.idle": "2021-05-26T15:50:40.800422Z",
     "shell.execute_reply": "2021-05-26T15:50:40.800809Z"
    },
    "papermill": {
     "duration": 0.056187,
     "end_time": "2021-05-26T15:50:40.800939",
     "exception": false,
     "start_time": "2021-05-26T15:50:40.744752",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from datetime import datetime, timezone\n",
    "\n",
    "\n",
    "def timestamp_to_utc(timestamp):\n",
    "    # timestamp is in seconds since the Unix epoch\n",
    "    utc_dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)\n",
    "    return utc_dt.strftime(\"%Y-%m-%d %H:%M:%S\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.050325,
     "end_time": "2021-05-26T15:50:40.901655",
     "exception": false,
     "start_time": "2021-05-26T15:50:40.851330",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Now that the data is available we can query and inspect it. We get the latest available timestamp and query all the events within the given time range:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:50:41.007677Z",
     "iopub.status.busy": "2021-05-26T15:50:41.007194Z",
     "iopub.status.idle": "2021-05-26T15:50:41.284563Z",
     "shell.execute_reply": "2021-05-26T15:50:41.284968Z"
    },
    "papermill": {
     "duration": 0.333209,
     "end_time": "2021-05-26T15:50:41.285115",
     "exception": false,
     "start_time": "2021-05-26T15:50:40.951906",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "system_metrics_reader.refresh_event_file_list()\n",
    "last_timestamp = system_metrics_reader.get_timestamp_of_latest_available_file()\n",
    "events = system_metrics_reader.get_events(0, last_timestamp * 1000000)  # UTC time in micro seconds\n",
    "\n",
    "print(\n",
    "    \"Found\",\n",
    "    len(events),\n",
    "    \"recorded system metric events. Latest recorded event:\",\n",
    "    timestamp_to_utc(last_timestamp / 1000000),\n",
    ")  # UTC time in seconds to datetime"
   ]
  },
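  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell above divides the latest file timestamp by 1000000 before formatting it, which suggests that the reader reports it in microseconds while the helper expects seconds. A small sketch of that conversion follows. The function name and the timestamp value are made up for illustration and are not real profiler output.\n",
    "\n",
    "```python\n",
    "from datetime import datetime, timezone\n",
    "\n",
    "MICROSECONDS_PER_SECOND = 1_000_000\n",
    "\n",
    "\n",
    "def microseconds_to_utc_string(timestamp_us):\n",
    "    # Convert microseconds since the Unix epoch to a formatted UTC string.\n",
    "    seconds = timestamp_us / MICROSECONDS_PER_SECOND\n",
    "    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(\"%Y-%m-%d %H:%M:%S\")\n",
    "\n",
    "\n",
    "# Illustrative value only: 2021-05-26 15:50:00 UTC expressed in microseconds.\n",
    "example_us = 1622044200 * MICROSECONDS_PER_SECOND\n",
    "print(microseconds_to_utc_string(example_us))\n",
    "```"
   ]
  },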
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.051127,
     "end_time": "2021-05-26T15:50:41.387697",
     "exception": false,
     "start_time": "2021-05-26T15:50:41.336570",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We can iterate over the list of recorded events. Let's look at the first event."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:50:41.502407Z",
     "iopub.status.busy": "2021-05-26T15:50:41.501722Z",
     "iopub.status.idle": "2021-05-26T15:50:41.504540Z",
     "shell.execute_reply": "2021-05-26T15:50:41.504956Z"
    },
    "papermill": {
     "duration": 0.066388,
     "end_time": "2021-05-26T15:50:41.505090",
     "exception": false,
     "start_time": "2021-05-26T15:50:41.438702",
     "status": "completed"
    },
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(\n",
    "    \"Event name:\",\n",
    "    events[0].name,\n",
    "    \"\\nTimestamp:\",\n",
    "    timestamp_to_utc(events[0].timestamp),\n",
    "    \"\\nValue:\",\n",
    "    events[0].value,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.051836,
     "end_time": "2021-05-26T15:50:41.608795",
     "exception": false,
     "start_time": "2021-05-26T15:50:41.556959",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### 3.2 GPU and CPU usage\n",
    "\n",
    "`MetricsHistogram` computes a histogram of GPU and CPU utilization values. Bins are between 0 and 100. Good system utilization means that the center of the distribution lies between 80 and 90. In multi-GPU training, dissimilar distributions of GPU utilization values across GPUs indicate an issue with workload distribution.\n",
    "\n",
    "The following cell plots the histograms per metric. To plot only specific metrics, define the lists `select_dimensions` and `select_events`. A dimension can be CPUUtilization, GPUUtilization, GPUMemoryUtilization, or IOPS. With the CPUUtilization dimension, the CPU utilization histogram is plotted for each core and for total CPU usage. For GPU dimensions, utilization and memory are plotted for each GPU. For IOPS, the IO wait time per CPU is plotted. If `select_events` is specified, only metrics whose names match an entry in `select_events` are shown. If neither `select_dimensions` nor `select_events` is specified, all available metrics are visualized. You can also specify a start and end time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:50:41.715943Z",
     "iopub.status.busy": "2021-05-26T15:50:41.715438Z",
     "iopub.status.idle": "2021-05-26T15:50:42.270451Z",
     "shell.execute_reply": "2021-05-26T15:50:42.270844Z"
    },
    "papermill": {
     "duration": 0.610354,
     "end_time": "2021-05-26T15:50:42.270987",
     "exception": false,
     "start_time": "2021-05-26T15:50:41.660633",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram\n",
    "\n",
    "system_metrics_reader.refresh_event_file_list()\n",
    "metrics_histogram = MetricsHistogram(system_metrics_reader)\n",
    "metrics_histogram.plot()"
   ]
  },
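  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the idea behind the utilization histogram concrete, the following sketch bins a handful of made-up utilization samples into 10-point buckets between 0 and 100 using only the standard library. It is an illustration of the concept, not how `MetricsHistogram` is implemented; the helper name and the sample values are assumptions.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "\n",
    "def utilization_histogram(samples, bin_width=10):\n",
    "    # Count samples per bin; a value of 100 falls into the last bin.\n",
    "    last_bin = 100 - bin_width\n",
    "    counts = Counter(min(int(value // bin_width) * bin_width, last_bin) for value in samples)\n",
    "    return {start: counts.get(start, 0) for start in range(0, 100, bin_width)}\n",
    "\n",
    "\n",
    "# Made-up GPU utilization samples in percent (not real profiler output).\n",
    "gpu_samples = [5, 12, 85, 88, 91, 100, 83, 40]\n",
    "histogram = utilization_histogram(gpu_samples)\n",
    "for start, count in histogram.items():\n",
    "    print(f\"{start:3d}-{start + 10:3d}%: {'#' * count}\")\n",
    "```\n",
    "\n",
    "With these sample values, most of the mass sits in the 80 to 100 range, which is the shape the text above describes as good utilization."
   ]
  },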
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.054427,
     "end_time": "2021-05-26T15:50:42.380310",
     "exception": false,
     "start_time": "2021-05-26T15:50:42.325883",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### 3.3 Read profiling data: framework annotations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:50:42.493653Z",
     "iopub.status.busy": "2021-05-26T15:50:42.493150Z",
     "iopub.status.idle": "2021-05-26T15:51:03.086870Z",
     "shell.execute_reply": "2021-05-26T15:51:03.087295Z"
    },
    "papermill": {
     "duration": 20.652828,
     "end_time": "2021-05-26T15:51:03.087447",
     "exception": false,
     "start_time": "2021-05-26T15:50:42.434619",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from smdebug.profiler.algorithm_metrics_reader import S3AlgorithmMetricsReader\n",
    "\n",
    "framework_metrics_reader = S3AlgorithmMetricsReader(path)\n",
    "\n",
    "events = []\n",
    "while framework_metrics_reader.get_timestamp_of_latest_available_file() == 0 or len(events) == 0:\n",
    "    framework_metrics_reader.refresh_event_file_list()\n",
    "    last_timestamp = framework_metrics_reader.get_timestamp_of_latest_available_file()\n",
    "    events = framework_metrics_reader.get_events(0, last_timestamp)\n",
    "\n",
    "    print(\"Profiler data from framework not available yet\")\n",
    "    time.sleep(20)\n",
    "\n",
    "print(\"\\n\\n Profiler data from framework is available\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.055564,
     "end_time": "2021-05-26T15:51:03.199132",
     "exception": false,
     "start_time": "2021-05-26T15:51:03.143568",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "The following code cell retrieves all recorded events from Amazon S3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:51:03.316348Z",
     "iopub.status.busy": "2021-05-26T15:51:03.315848Z",
     "iopub.status.idle": "2021-05-26T15:51:03.794326Z",
     "shell.execute_reply": "2021-05-26T15:51:03.794753Z"
    },
    "papermill": {
     "duration": 0.540097,
     "end_time": "2021-05-26T15:51:03.794903",
     "exception": false,
     "start_time": "2021-05-26T15:51:03.254806",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "framework_metrics_reader.refresh_event_file_list()\n",
    "last_timestamp = framework_metrics_reader.get_timestamp_of_latest_available_file()\n",
    "events = framework_metrics_reader.get_events(0, last_timestamp)\n",
    "\n",
    "print(\n",
    "    \"Found\",\n",
    "    len(events),\n",
    "    \"recorded framework annotations. Latest event recorded \",\n",
    "    timestamp_to_utc(last_timestamp / 1000000),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.056855,
     "end_time": "2021-05-26T15:51:03.909254",
     "exception": false,
     "start_time": "2021-05-26T15:51:03.852399",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "As before, we can inspect the recorded events. Because we are reading framework metrics, each event now has a start and end time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:51:04.028143Z",
     "iopub.status.busy": "2021-05-26T15:51:04.027485Z",
     "iopub.status.idle": "2021-05-26T15:51:04.030361Z",
     "shell.execute_reply": "2021-05-26T15:51:04.030758Z"
    },
    "papermill": {
     "duration": 0.064814,
     "end_time": "2021-05-26T15:51:04.030890",
     "exception": false,
     "start_time": "2021-05-26T15:51:03.966076",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(\n",
    "    \"Event name:\",\n",
    "    events[0].event_name,\n",
    "    \"\\nStart time:\",\n",
    "    timestamp_to_utc(events[0].start_time / 1000000000),\n",
    "    \"\\nEnd time:\",\n",
    "    timestamp_to_utc(events[0].end_time / 1000000000),\n",
    "    \"\\nDuration:\",\n",
    "    events[0].duration,\n",
    "    \"nanoseconds\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.057243,
     "end_time": "2021-05-26T15:51:04.145753",
     "exception": false,
     "start_time": "2021-05-26T15:51:04.088510",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### 3.4 Outliers in step duration\n",
    "\n",
    "`StepHistogram` creates histograms of step duration values. Significant outliers are an indication of system bottlenecks. In contrast to `StepTimelineChart`, it helps identify clusters of step duration values. As a simple example, time spent in the training phase (forward and backward pass) will likely differ from time spent in the validation phase (forward pass only), so we would expect at least two clusters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:51:04.265866Z",
     "iopub.status.busy": "2021-05-26T15:51:04.265310Z",
     "iopub.status.idle": "2021-05-26T15:51:04.676719Z",
     "shell.execute_reply": "2021-05-26T15:51:04.677140Z"
    },
    "papermill": {
     "duration": 0.47435,
     "end_time": "2021-05-26T15:51:04.677284",
     "exception": false,
     "start_time": "2021-05-26T15:51:04.202934",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from smdebug.profiler.analysis.notebook_utils.step_histogram import StepHistogram\n",
    "\n",
    "framework_metrics_reader.refresh_event_file_list()\n",
    "step_histogram = StepHistogram(framework_metrics_reader)\n",
    "step_histogram.plot()"
   ]
  },
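  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of what a significant outlier in step duration can mean, the sketch below flags values that lie far from the median, measured in units of the median absolute deviation (MAD). This is a swapped-in, generic technique and not what `StepHistogram` does internally; the helper name, the threshold, and the durations are assumptions chosen for the example.\n",
    "\n",
    "```python\n",
    "import statistics\n",
    "\n",
    "\n",
    "def find_step_outliers(durations, threshold=3.0):\n",
    "    # Flag values whose distance from the median exceeds threshold * MAD.\n",
    "    median = statistics.median(durations)\n",
    "    mad = statistics.median(abs(d - median) for d in durations)\n",
    "    if mad == 0:\n",
    "        return []\n",
    "    return [d for d in durations if abs(d - median) / mad > threshold]\n",
    "\n",
    "\n",
    "# Made-up step durations in milliseconds (not real profiler output).\n",
    "step_durations_ms = [101, 99, 100, 102, 98, 100, 350, 101]\n",
    "outliers = find_step_outliers(step_durations_ms)\n",
    "print(outliers)\n",
    "```"
   ]
  },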
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.059212,
     "end_time": "2021-05-26T15:51:04.796532",
     "exception": false,
     "start_time": "2021-05-26T15:51:04.737320",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### 3.5 Heatmap\n",
    "The following code cell creates a heatmap where each row corresponds to one metric (CPU core and GPU utilization) and the x-axis is the duration of the training job. It makes CPU bottlenecks easier to spot: GPU utilization is low while utilization of one or more CPU cores is high."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:51:04.919380Z",
     "iopub.status.busy": "2021-05-26T15:51:04.918604Z",
     "iopub.status.idle": "2021-05-26T15:51:06.507893Z",
     "shell.execute_reply": "2021-05-26T15:51:06.507458Z"
    },
    "papermill": {
     "duration": 1.652325,
     "end_time": "2021-05-26T15:51:06.508006",
     "exception": false,
     "start_time": "2021-05-26T15:51:04.855681",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap\n",
    "\n",
    "view_heatmap = Heatmap(\n",
    "    system_metrics_reader,\n",
    "    framework_metrics_reader,\n",
    "    select_dimensions=[\"CPU\", \"GPU\"],  # optional - comment this line out to see all dimensions.\n",
    "    # select_events=[\"total\"],                 # optional - comment this line out to see all events.\n",
    "    plot_height=900,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.066605,
     "end_time": "2021-05-26T15:51:06.642010",
     "exception": false,
     "start_time": "2021-05-26T15:51:06.575405",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### 3.6 Run loop to fetch latest profiler data and update charts\n",
    "The following code cell runs while your training job is in progress and refreshes the plots in the previous sections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2021-05-26T15:51:06.782421Z",
     "iopub.status.busy": "2021-05-26T15:51:06.781846Z",
     "iopub.status.idle": "2021-05-26T15:52:09.857763Z",
     "shell.execute_reply": "2021-05-26T15:52:09.856879Z"
    },
    "papermill": {
     "duration": 63.149009,
     "end_time": "2021-05-26T15:52:09.858020",
     "exception": true,
     "start_time": "2021-05-26T15:51:06.709011",
     "status": "failed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from bokeh.io import push_notebook\n",
    "import time\n",
    "\n",
    "last_timestamp = system_metrics_reader.get_timestamp_of_latest_available_file()\n",
    "description = sagemaker_client.describe_training_job(TrainingJobName=training_job_name)\n",
    "\n",
    "while description[\"TrainingJobStatus\"] == \"InProgress\":\n",
    "    system_metrics_reader.refresh_event_file_list()\n",
    "    framework_metrics_reader.refresh_event_file_list()\n",
    "    current_timestamp = system_metrics_reader.get_timestamp_of_latest_available_file()\n",
    "    description = sagemaker_client.describe_training_job(TrainingJobName=training_job_name)\n",
    "\n",
    "    if current_timestamp > last_timestamp:\n",
    "\n",
    "        print(\n",
    "            \"New data available, updating dashboards. Current timestamp is\",\n",
    "            timestamp_to_utc(current_timestamp / 1000000),\n",
    "        )\n",
    "\n",
    "        view_heatmap.update_data(current_timestamp)\n",
    "        push_notebook(handle=view_heatmap.target)\n",
    "\n",
    "        metrics_histogram.update_data(current_timestamp)\n",
    "        push_notebook(handle=metrics_histogram.target)\n",
    "\n",
    "        step_histogram.update_data(current_timestamp)\n",
    "        push_notebook(handle=step_histogram.target)\n",
    "\n",
    "        last_timestamp = current_timestamp\n",
    "    time.sleep(30)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "## Section 4 - Analyze report generated by Debugger <a id='profiler-report'></a>\n",
    "\n",
    "In this section, we analyze the report generated by the profiler rule processing job and showcase a few of its sections. For complete details, download the report from the S3 bucket and review it.\n",
    "\n",
    "Note that the exact details in the report generated for your training job may differ from what you see in this section."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "#### 4.1 View the location of the generated report"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + \"/rule-output\"\n",
    "print(\n",
    "    f\"You will find the profiler report under `{rule_output_path}/` after the training has finished\"\n",
    ")"
   ]
  },
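  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell above builds the path by plain string concatenation, which assumes that `estimator.output_path` ends with a slash. As a hedged sketch, a small helper can join the segments with exactly one slash regardless of how the inputs are written. The helper name `join_s3_uri`, the bucket, and the job name are hypothetical placeholders.\n",
    "\n",
    "```python\n",
    "def join_s3_uri(*parts):\n",
    "    # Join URI segments with exactly one slash between them.\n",
    "    head, *rest = parts\n",
    "    segments = [head.rstrip(\"/\")] + [p.strip(\"/\") for p in rest]\n",
    "    return \"/\".join(s for s in segments if s)\n",
    "\n",
    "\n",
    "# Placeholder bucket and job name for illustration only.\n",
    "with_slash = join_s3_uri(\"s3://example-bucket/\", \"example-job\", \"rule-output\")\n",
    "without_slash = join_s3_uri(\"s3://example-bucket\", \"example-job\", \"/rule-output\")\n",
    "print(with_slash)\n",
    "print(without_slash)\n",
    "```\n",
    "\n",
    "Both calls produce the same URI, so the result does not depend on whether the output path was configured with a trailing slash."
   ]
  },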
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "To check whether the report has been generated, list the directories and files recursively:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "! aws s3 ls {rule_output_path} --recursive"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "#### Download the report and rule output files recursively using `aws s3 cp`\n",
    "The following command saves all of the rule output files to your current working directory, in a folder named like **ProfilerReport-1234567890**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "! aws s3 cp {rule_output_path} ./ --recursive"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "The following script automatically finds the **ProfilerReport** folder name and returns a link to the downloaded report."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from IPython.display import FileLink\n",
    "\n",
    "# Pick the first rule configuration whose name contains \"Profiler\"\n",
    "profiler_report_name = [\n",
    "    rule[\"RuleConfigurationName\"]\n",
    "    for rule in estimator.latest_training_job.rule_job_summary()\n",
    "    if \"Profiler\" in rule[\"RuleConfigurationName\"]\n",
    "][0]\n",
    "display(\n",
    "    \"Click link below to view the profiler report\",\n",
    "    FileLink(profiler_report_name + \"/profiler-output/profiler-report.html\"),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "For more information about how to find, download, and browse Debugger profiling reports, see [SageMaker Debugger Profiling Report](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html) in the [Amazon SageMaker Debugger developer guide](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "#### 4.2 Profile Report - Framework metrics summary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "In this section of the report, you will see a pie chart similar to the one below, which shows how much time the training job spent in the \"training\" phase, the \"validation\" phase, or \"others\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "<IMG src=images/Framework_Metrics.png/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "#### 4.3 Profile Report - Identify most expensive CPU operator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "The table in this section of the report lists the operators that your training job ran on the CPU. In this example, the most expensive CPU operator was \"ExecutorState::Process\", at 16%."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "<IMG src=images/debugger-profiling-report-framework-cpu-operators.gif/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "#### 4.4 Profile Report - Identify most expensive GPU operator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "The table shows the percentage of the time and the absolute cumulative time spent on the most frequently called GPU operators."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "<IMG src=images/debugger-profiling-report-framework-gpu-operators.gif/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "#### 4.5 Access Debugger Insights in Amazon SageMaker Studio\n",
    "\n",
    "In addition to analyzing the Debugger output data interactively and reviewing the autogenerated profiling report, you can access the Debugger insights dashboard from Amazon SageMaker Studio. To get started with Amazon SageMaker \n",
    "Studio using Debugger, see [Debugger on Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-on-studio.html) in the [Amazon SageMaker Debugger developer guide](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html).\n",
    "\n",
    "<IMG src=images/debugger-studio-insights-sample.png/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "## Section 5 - Analyze recommendations from the report<a id='analyze-profiler-recommendations'></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "The **Rules Summary** section of the report aggregates all of the rule evaluation results, analysis, rule descriptions, and suggestions. The following table shows a summary of the executed profiler rules, sorted by how frequently each rule triggered. In this training job, the most frequently triggered rule was LowGPUUtilization, which processed 1001 datapoints and triggered 8 times.\n",
    "\n",
    "You may see a different rule summary based on the data and the training configuration you use.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "<IMG src=images/RulesSummary.png/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "From the analysis so far and the top recommendations in the table above, there is scope to improve resource utilization and make training more efficient. Based on these findings, we change the training configuration and rerun the training job."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "## Section 6 - Implement recommendations from the report<a id='implement-profiler-recommendations'></a>\n",
    "\n",
    "In this section, we rerun the training job with a changed configuration. The training instances are changed from p3.8xlarge to p3.2xlarge, the number of instances is reduced to 2, and MPI is configured with only one process per host to increase the number of data loaders. The batch size is also changed to 512. We use the same profiling configuration as in the previous job.\n",
    "\n",
    "After the second training job with the new settings completes, new system metrics, framework metrics, and a new report are generated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "hyperparameters = {\"epoch\": 5, \"batch_size\": 512, \"data_augmentation\": True}\n",
    "\n",
    "distributions = {\n",
    "    \"mpi\": {\n",
    "        \"enabled\": True,\n",
    "        \"processes_per_host\": 1,\n",
    "        \"custom_mpi_options\": \"-verbose -x HOROVOD_TIMELINE=./hvd_timeline.json -x NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none\",\n",
    "    }\n",
    "}\n",
    "\n",
    "model_dir = \"/opt/ml/model\"\n",
    "train_instance_type = \"ml.p3.2xlarge\"\n",
    "instance_count = 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "estimator_new = TensorFlow(\n",
    "    role=sagemaker.get_execution_role(),\n",
    "    base_job_name=\"tf-keras-silent\",\n",
    "    model_dir=model_dir,\n",
    "    instance_count=instance_count,\n",
    "    instance_type=train_instance_type,\n",
    "    entry_point=\"sentiment-distributed.py\",\n",
    "    source_dir=\"./tf-sentiment-script-mode\",\n",
    "    framework_version=\"2.3.1\",\n",
    "    py_version=\"py37\",\n",
    "    profiler_config=profiler_config,\n",
    "    script_mode=True,\n",
    "    hyperparameters=hyperparameters,\n",
    "    distribution=distributions,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "estimator_new.fit(inputs, wait=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "#### Call to action\n",
    "\n",
    "To understand the impact of the training configuration changes, compare the report analysis from the two training jobs.  Repeat the process of analyzing the profiler report, implementing the recommendations and comparing with the previous run, till you are satisfied.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "rule_output_path = (\n",
    "    estimator_new.output_path + estimator_new.latest_training_job.job_name + \"/rule-output\"\n",
    ")\n",
    "print(\n",
    "    f\"You will find the profiler report under {rule_output_path}/ after the training has finished\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "#### Download the new report and files recursively using `aws s3 cp`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "! aws s3 cp {rule_output_path} ./ --recursive"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "Retrieve a file link to the new profiling report."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from IPython.display import FileLink\n",
    "\n",
    "profiler_report_name = [\n",
    "    rule[\"RuleConfigurationName\"]\n",
    "    for rule in estimator_new.latest_training_job.rule_job_summary()\n",
    "    if \"Profiler\" in rule[\"RuleConfigurationName\"]\n",
    "][0]\n",
    "profiler_report_name\n",
    "display(\n",
    "    \"Click link below to view the profiler report\",\n",
    "    FileLink(profiler_report_name + \"/profiler-output/profiler-report.html\"),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": null,
     "end_time": null,
     "exception": null,
     "start_time": null,
     "status": "pending"
    },
    "tags": []
   },
   "source": [
    "## Conclusion\n",
    "\n",
    "Profiling feature of Amazon SageMaker Debugger is a powerful tool to gain visibility into machine learning training jobs. This notebook provided insight into training resource utilization to identify bottlenecks, analysis of various phases of training and identifying expensive framework functions. The notebook also demonstrated how to analyze and implement profiler recommendations. Applying profiler recommendations to a Tensorflow Horovord distributed training for a sentiment analysis model, we achieved resource utilization improvement upto 36% and up to 83% cost savings."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notebook CI Test Results\n",
    "\n",
    "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
    "\n",
    "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-debugger|tensorflow_nlp_sentiment_analysis|sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb)\n",
    "\n",
    "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-debugger|tensorflow_nlp_sentiment_analysis|sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb)\n",
    "\n",
    "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-debugger|tensorflow_nlp_sentiment_analysis|sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb)\n",
    "\n",
    "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-debugger|tensorflow_nlp_sentiment_analysis|sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb)\n",
    "\n",
    "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-debugger|tensorflow_nlp_sentiment_analysis|sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb)\n",
    "\n",
    "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-debugger|tensorflow_nlp_sentiment_analysis|sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb)\n",
    "\n",
    "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-debugger|tensorflow_nlp_sentiment_analysis|sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb)\n",
    "\n",
    "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-debugger|tensorflow_nlp_sentiment_analysis|sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb)\n",
    "\n",
    "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-debugger|tensorflow_nlp_sentiment_analysis|sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb)\n",
    "\n",
    "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-debugger|tensorflow_nlp_sentiment_analysis|sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb)\n",
    "\n",
    "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-debugger|tensorflow_nlp_sentiment_analysis|sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb)\n",
    "\n",
    "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-debugger|tensorflow_nlp_sentiment_analysis|sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb)\n",
    "\n",
    "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-debugger|tensorflow_nlp_sentiment_analysis|sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb)\n",
    "\n",
    "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-debugger|tensorflow_nlp_sentiment_analysis|sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb)\n",
    "\n",
    "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-debugger|tensorflow_nlp_sentiment_analysis|sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Environment (conda_tensorflow2_p36)",
   "language": "python",
   "name": "conda_tensorflow2_p36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  },
  "papermill": {
   "default_parameters": {},
   "duration": 466.224595,
   "end_time": "2021-05-26T15:52:11.667590",
   "environment_variables": {},
   "exception": true,
   "input_path": "sentiment-analysis-tf-distributed-training-bringyourownscript.ipynb",
   "output_path": "/opt/ml/processing/output/sentiment-analysis-tf-distributed-training-bringyourownscript-2021-05-26-15-40-35.ipynb",
   "parameters": {
    "kms_key": "arn:aws:kms:us-west-2:521695447989:key/6e9984db-50cf-4c7e-926c-877ec47a8b25"
   },
   "start_time": "2021-05-26T15:44:25.442995",
   "version": "2.3.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}