{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Amazon SageMaker Experiment Trials for Distributed Training of Mask-RCNN\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
    "\n",
    "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "This notebook is a step-by-step tutorial on Amazon SageMaker Experiment Trials for distributed training of [Mask R-CNN](https://arxiv.org/abs/1703.06870) implemented in [TensorFlow](https://www.tensorflow.org/) framework. \n",
    "\n",
    "Concretely, we will describe the steps for SagerMaker Experiment Trials for training [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) and [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) in [Amazon SageMaker](https://aws.amazon.com/sagemaker/) using [Amazon S3](https://aws.amazon.com/s3/) as data source.\n",
    "\n",
    "The outline of steps is as follows:\n",
    "\n",
    "1. Stage COCO 2017 dataset in [Amazon S3](https://aws.amazon.com/s3/)\n",
    "2. Build SageMaker training image and push it to [Amazon ECR](https://aws.amazon.com/ecr/)\n",
    "3. Configure data input channels\n",
    "4. Configure hyper-prarameters\n",
    "5. Define training metrics\n",
    "6. Define training job \n",
    "7. Define SageMaker Experiment Trials to start the training jobs\n",
    "\n",
    "Before we get started, let us initialize two python variables ```aws_region``` and ```s3_bucket``` that we will use throughout the notebook. The ```s3_bucket``` must be located in the region of this notebook instance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "\n",
    "session = boto3.session.Session()\n",
    "aws_region = session.region_name\n",
    "s3_bucket  = # your-s3-bucket-name\n",
    "\n",
    "\n",
    "try:\n",
    "    s3_client = boto3.client('s3')\n",
    "    response = s3_client.get_bucket_location(Bucket=s3_bucket)\n",
    "    print(f\"Bucket region: {response['LocationConstraint']}\")\n",
    "except:\n",
    "    print(f\"Access Error: Check if '{s3_bucket}' S3 bucket is in '{aws_region}' region\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stage COCO 2017 dataset in Amazon S3\n",
    "\n",
    "We use [COCO 2017 dataset](http://cocodataset.org/#home) for training. We download COCO 2017 training and validation dataset to this notebook instance, extract the files from the dataset archives, and upload the extracted files to your Amazon [S3 bucket](https://docs.aws.amazon.com/en_pv/AmazonS3/latest/gsg/CreatingABucket.html) with the prefix ```mask-rcnn/sagemaker/input/train```. The ```prepare-s3-bucket.sh``` script executes this step.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat ./prepare-s3-bucket.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " Using your *Amazon S3 bucket* as argument, run the cell below. If you have already uploaded COCO 2017 dataset to your Amazon S3 bucket *in this AWS region*, you may skip this step. The expected time to execute this step is 20 minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "!./prepare-s3-bucket.sh {s3_bucket}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build and push SageMaker training images\n",
    "\n",
    "For this step, the [IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) attached to this notebook instance needs full access to Amazon ECR service. If you created this notebook instance using the ```./stack-sm.sh``` script in this repository, the IAM Role attached to this notebook instance is already setup with full access to ECR service. \n",
    "\n",
    "Below, we have a choice of two different implementations:\n",
    "\n",
    "1. [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) implementation supports a maximum per-GPU batch size of 1, and does not support mixed precision. It can be used with mainstream TensorFlow releases.\n",
    "\n",
    "2. [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) is an optimized implementation that supports a maximum batch size of 4 and supports mixed precision. This implementation uses custom TensorFlow ops. The required custom TensorFlow ops are available in [AWS Deep Learning Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) images in ```tensorflow-training``` repository with image tag ```1.15.2-gpu-py36-cu100-ubuntu18.04```, or later. \n",
    "\n",
    "It is recommended that you build and push both SageMaker training images and use either image for training later.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TensorPack Faster-RCNN/Mask-RCNN\n",
    "\n",
    "Use ```./container-script-mode/build_tools/build_and_push.sh``` script to build and push the TensorPack Faster-RCNN/Mask-RCNN  training image to Amazon ECR. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat ./container-script-mode/build_tools/build_and_push.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using your *AWS region* as argument, run the cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "! ./container-script-mode/build_tools/build_and_push.sh {aws_region}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set ```tensorpack_image``` below to Amazon ECR URI of the image you pushed above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tensorpack_image = # mask-rcnn-tensorpack-sagemaker ECR URI"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### AWS Samples Mask R-CNN\n",
    "Use ```./container-optimized-script-mode/build_tools/build_and_push.sh``` script to build and push the AWS Samples Mask R-CNN training image to Amazon ECR."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat ./container-optimized-script-mode/build_tools/build_and_push.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using your *AWS region* as argument, run the cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "! ./container-optimized-script-mode/build_tools/build_and_push.sh {aws_region}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " Set ```aws_samples_image``` below to Amazon ECR URI of the image you pushed above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aws_samples_image = # mask-rcnn-tensorflow-sagemaker ECR URI"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SageMaker Initialization \n",
    "First we upgrade SageMaker to 2.3.0 API. If your notebook is already using latest Sagemaker 2.x API, you may skip the next cell.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install --upgrade pip\n",
    "! pip install sagemaker"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have staged the data and we have built and pushed the training docker image to Amazon ECR. Now we are ready to start using Amazon SageMaker."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "import os\n",
    "import time\n",
    "import boto3\n",
    "import sagemaker\n",
    "from sagemaker import get_execution_role\n",
    "from sagemaker.tensorflow.estimator import TensorFlow\n",
    "\n",
    "role = (\n",
    "    get_execution_role()\n",
    ")  # provide a pre-existing role ARN as an alternative to creating a new role\n",
    "print(f\"SageMaker Execution Role:{role}\")\n",
    "\n",
    "client = boto3.client(\"sts\")\n",
    "account = client.get_caller_identity()[\"Account\"]\n",
    "print(f\"AWS account:{account}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we set ```training_image``` to the Amazon ECR image URI you saved in a previous step. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "training_image = # set to tensorpack_image or aws_samples_image \n",
    "print(f'Training image: {training_image}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define SageMaker Data Channels\n",
    "\n",
    "Next, we define the *train* data channel using EFS file-system. To do so, we need to specify the EFS file-system id, which is shown in the output of the command below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "notebook_attached_efs=!df -kh | grep 'fs-' | sed 's/\\(fs-[0-9a-z]*\\).*/\\1/'\n",
    "print(f\"SageMaker notebook attached EFS: {notebook_attached_efs}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the cell below, we define the `train` data input channel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.inputs import FileSystemInput\n",
    "\n",
    "# Specify EFS file system id.\n",
    "file_system_id = notebook_attached_efs[0]\n",
    "print(f\"EFS file-system-id: {file_system_id}\")\n",
    "\n",
    "# Specify directory path for input data on the file system.\n",
    "# You need to provide normalized and absolute path below.\n",
    "file_system_directory_path = \"/mask-rcnn/sagemaker/input/train\"\n",
    "print(f\"EFS file-system data input path: {file_system_directory_path}\")\n",
    "\n",
    "# Specify the access mode of the mount of the directory associated with the file system.\n",
    "# Directory must be mounted  'ro'(read-only).\n",
    "file_system_access_mode = \"ro\"\n",
    "\n",
    "# Specify your file system type\n",
    "file_system_type = \"EFS\"\n",
    "\n",
    "train = FileSystemInput(\n",
    "    file_system_id=file_system_id,\n",
    "    file_system_type=file_system_type,\n",
    "    directory_path=file_system_directory_path,\n",
    "    file_system_access_mode=file_system_access_mode,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we define the model output location in S3 bucket."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prefix = \"mask-rcnn/sagemaker\"  # prefix in your bucket\n",
    "s3_output_location = f\"s3://{s3_bucket}/{prefix}/output\"\n",
    "print(f\"S3 model output location: {s3_output_location}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configure Hyper-parameters\n",
    "Next, we define the hyper-parameters. \n",
    "\n",
    "Note, some hyper-parameters are different between the two implementations. The batch size per GPU in TensorPack Faster-RCNN/Mask-RCNN is fixed at 1, but is configurable in AWS Samples Mask-RCNN. The learning rate schedule is specified in units of steps in TensorPack Faster-RCNN/Mask-RCNN, but in epochs in AWS Samples Mask-RCNN.\n",
    "\n",
    "The detault learning rate schedule values shown below correspond to training for a total of 24 epochs, at 120,000 images per epoch.\n",
    "\n",
    "<table align='left'>\n",
    "    <caption>TensorPack Faster-RCNN/Mask-RCNN  Hyper-parameters</caption>\n",
    "    <tr>\n",
    "    <th style=\"text-align:center\">Hyper-parameter</th>\n",
    "    <th style=\"text-align:center\">Description</th>\n",
    "    <th style=\"text-align:center\">Default</th>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">mode_fpn</td>\n",
    "        <td style=\"text-align:left\">Flag to indicate use of Feature Pyramid Network (FPN) in the Mask R-CNN model backbone</td>\n",
    "        <td style=\"text-align:center\">\"True\"</td>\n",
    "    </tr>\n",
    "     <tr>\n",
    "        <td style=\"text-align:center\">mode_mask</td>\n",
    "        <td style=\"text-align:left\">A value of \"False\" means Faster-RCNN model, \"True\" means Mask R-CNN moodel</td>\n",
    "        <td style=\"text-align:center\">\"True\"</td>\n",
    "    </tr>\n",
    "     <tr>\n",
    "        <td style=\"text-align:center\">eval_period</td>\n",
    "        <td style=\"text-align:left\">Number of epochs period for evaluation during training</td>\n",
    "        <td style=\"text-align:center\">1</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">lr_schedule</td>\n",
    "        <td style=\"text-align:left\">Learning rate schedule in training steps</td>\n",
    "        <td style=\"text-align:center\">'[240000, 320000, 360000]'</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">batch_norm</td>\n",
    "        <td style=\"text-align:left\">Batch normalization option ('FreezeBN', 'SyncBN', 'GN', 'None') </td>\n",
    "        <td style=\"text-align:center\">'FreezeBN'</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">images_per_epoch</td>\n",
    "        <td style=\"text-align:left\">Images per epoch </td>\n",
    "        <td style=\"text-align:center\">120000</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">data_train</td>\n",
    "        <td style=\"text-align:left\">Training data under data directory</td>\n",
    "        <td style=\"text-align:center\">'coco_train2017'</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">data_val</td>\n",
    "        <td style=\"text-align:left\">Validation data under data directory</td>\n",
    "        <td style=\"text-align:center\">'coco_val2017'</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">resnet_arch</td>\n",
    "        <td style=\"text-align:left\">Must be 'resnet50' or 'resnet101'</td>\n",
    "        <td style=\"text-align:center\">'resnet50'</td>\n",
    "    </tr>\n",
    "   <tr>\n",
    "        <td style=\"text-align:center\">backbone_weights</td>\n",
    "        <td style=\"text-align:left\">ResNet backbone weights</td>\n",
    "        <td style=\"text-align:center\">'ImageNet-R50-AlignPadding.npz'</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">load_model</td>\n",
    "        <td style=\"text-align:left\">Pre-trained model to load</td>\n",
    "        <td style=\"text-align:center\"></td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">config:</td>\n",
    "        <td style=\"text-align:left\">Any hyperparamter prefixed with <b>config:</b> is set as a model config parameter</td>\n",
    "        <td style=\"text-align:center\"></td>\n",
    "    </tr>\n",
    "</table>\n",
    "\n",
    "    \n",
    "<table align='left'>\n",
    "    <caption>AWS Samples Mask-RCNN  Hyper-parameters</caption>\n",
    "    <tr>\n",
    "    <th style=\"text-align:center\">Hyper-parameter</th>\n",
    "    <th style=\"text-align:center\">Description</th>\n",
    "    <th style=\"text-align:center\">Default</th>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">mode_fpn</td>\n",
    "        <td style=\"text-align:left\">Flag to indicate use of Feature Pyramid Network (FPN) in the Mask R-CNN model backbone</td>\n",
    "        <td style=\"text-align:center\">\"True\"</td>\n",
    "    </tr>\n",
    "     <tr>\n",
    "        <td style=\"text-align:center\">mode_mask</td>\n",
    "        <td style=\"text-align:left\">A value of \"False\" means Faster-RCNN model, \"True\" means Mask R-CNN moodel</td>\n",
    "        <td style=\"text-align:center\">\"True\"</td>\n",
    "    </tr>\n",
    "     <tr>\n",
    "        <td style=\"text-align:center\">eval_period</td>\n",
    "        <td style=\"text-align:left\">Number of epochs period for evaluation during training</td>\n",
    "        <td style=\"text-align:center\">1</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">lr_epoch_schedule</td>\n",
    "        <td style=\"text-align:left\">Learning rate schedule in epochs</td>\n",
    "        <td style=\"text-align:center\">'[(16, 0.1), (20, 0.01), (24, None)]'</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">batch_size_per_gpu</td>\n",
    "        <td style=\"text-align:left\">Batch size per gpu ( Minimum 1, Maximum 4)</td>\n",
    "        <td style=\"text-align:center\">4</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">batch_norm</td>\n",
    "        <td style=\"text-align:left\">Batch normalization option ('FreezeBN', 'SyncBN', 'GN', 'None') </td>\n",
    "        <td style=\"text-align:center\">'FreezeBN'</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">images_per_epoch</td>\n",
    "        <td style=\"text-align:left\">Images per epoch </td>\n",
    "        <td style=\"text-align:center\">120000</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">data_train</td>\n",
    "        <td style=\"text-align:left\">Training data under data directory</td>\n",
    "        <td style=\"text-align:center\">'train2017'</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">data_val</td>\n",
    "        <td style=\"text-align:left\">Validation data under data directory</td>\n",
    "        <td style=\"text-align:center\">'val2017'</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">resnet_arch</td>\n",
    "        <td style=\"text-align:left\">Must be 'resnet50' or 'resnet101'</td>\n",
    "        <td style=\"text-align:center\">'resnet50'</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">backbone_weights</td>\n",
    "        <td style=\"text-align:left\">ResNet backbone weights</td>\n",
    "        <td style=\"text-align:center\">'ImageNet-R50-AlignPadding.npz'</td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">load_model</td>\n",
    "        <td style=\"text-align:left\">Pre-trained model to load</td>\n",
    "        <td style=\"text-align:center\"></td>\n",
    "    </tr>\n",
    "    <tr>\n",
    "        <td style=\"text-align:center\">config:</td>\n",
    "        <td style=\"text-align:left\">Any hyperparamter prefixed with <b>config:</b> is set as a model config parameter</td>\n",
    "        <td style=\"text-align:center\"></td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hyperparameters = {\n",
    "    \"mode_fpn\": \"True\",\n",
    "    \"mode_mask\": \"True\",\n",
    "    \"eval_period\": 1,\n",
    "    \"batch_norm\": \"FreezeBN\",\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Training Metrics\n",
    "Next, we define the regular expressions that SageMaker uses to extract algorithm metrics from training logs and send them to [AWS CloudWatch metrics](https://docs.aws.amazon.com/en_pv/AmazonCloudWatch/latest/monitoring/working_with_metrics.html). These algorithm metrics are visualized in SageMaker console."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "metric_definitions = [\n",
    "    {\"Name\": \"fastrcnn_losses/box_loss\", \"Regex\": \".*fastrcnn_losses/box_loss:\\\\s*(\\\\S+).*\"},\n",
    "    {\"Name\": \"fastrcnn_losses/label_loss\", \"Regex\": \".*fastrcnn_losses/label_loss:\\\\s*(\\\\S+).*\"},\n",
    "    {\n",
    "        \"Name\": \"fastrcnn_losses/label_metrics/accuracy\",\n",
    "        \"Regex\": \".*fastrcnn_losses/label_metrics/accuracy:\\\\s*(\\\\S+).*\",\n",
    "    },\n",
    "    {\n",
    "        \"Name\": \"fastrcnn_losses/label_metrics/false_negative\",\n",
    "        \"Regex\": \".*fastrcnn_losses/label_metrics/false_negative:\\\\s*(\\\\S+).*\",\n",
    "    },\n",
    "    {\n",
    "        \"Name\": \"fastrcnn_losses/label_metrics/fg_accuracy\",\n",
    "        \"Regex\": \".*fastrcnn_losses/label_metrics/fg_accuracy:\\\\s*(\\\\S+).*\",\n",
    "    },\n",
    "    {\n",
    "        \"Name\": \"fastrcnn_losses/num_fg_label\",\n",
    "        \"Regex\": \".*fastrcnn_losses/num_fg_label:\\\\s*(\\\\S+).*\",\n",
    "    },\n",
    "    {\"Name\": \"maskrcnn_loss/accuracy\", \"Regex\": \".*maskrcnn_loss/accuracy:\\\\s*(\\\\S+).*\"},\n",
    "    {\n",
    "        \"Name\": \"maskrcnn_loss/fg_pixel_ratio\",\n",
    "        \"Regex\": \".*maskrcnn_loss/fg_pixel_ratio:\\\\s*(\\\\S+).*\",\n",
    "    },\n",
    "    {\"Name\": \"maskrcnn_loss/maskrcnn_loss\", \"Regex\": \".*maskrcnn_loss/maskrcnn_loss:\\\\s*(\\\\S+).*\"},\n",
    "    {\"Name\": \"maskrcnn_loss/pos_accuracy\", \"Regex\": \".*maskrcnn_loss/pos_accuracy:\\\\s*(\\\\S+).*\"},\n",
    "    {\"Name\": \"mAP(bbox)/IoU=0.5\", \"Regex\": \".*mAP\\\\(bbox\\\\)/IoU=0\\\\.5:\\\\s*(\\\\S+).*\"},\n",
    "    {\"Name\": \"mAP(bbox)/IoU=0.5:0.95\", \"Regex\": \".*mAP\\\\(bbox\\\\)/IoU=0\\\\.5:0\\\\.95:\\\\s*(\\\\S+).*\"},\n",
    "    {\"Name\": \"mAP(bbox)/IoU=0.75\", \"Regex\": \".*mAP\\\\(bbox\\\\)/IoU=0\\\\.75:\\\\s*(\\\\S+).*\"},\n",
    "    {\"Name\": \"mAP(bbox)/large\", \"Regex\": \".*mAP\\\\(bbox\\\\)/large:\\\\s*(\\\\S+).*\"},\n",
    "    {\"Name\": \"mAP(bbox)/medium\", \"Regex\": \".*mAP\\\\(bbox\\\\)/medium:\\\\s*(\\\\S+).*\"},\n",
    "    {\"Name\": \"mAP(bbox)/small\", \"Regex\": \".*mAP\\\\(bbox\\\\)/small:\\\\s*(\\\\S+).*\"},\n",
    "    {\"Name\": \"mAP(segm)/IoU=0.5\", \"Regex\": \".*mAP\\\\(segm\\\\)/IoU=0\\\\.5:\\\\s*(\\\\S+).*\"},\n",
    "    {\"Name\": \"mAP(segm)/IoU=0.5:0.95\", \"Regex\": \".*mAP\\\\(segm\\\\)/IoU=0\\\\.5:0\\\\.95:\\\\s*(\\\\S+).*\"},\n",
    "    {\"Name\": \"mAP(segm)/IoU=0.75\", \"Regex\": \".*mAP\\\\(segm\\\\)/IoU=0\\\\.75:\\\\s*(\\\\S+).*\"},\n",
    "    {\"Name\": \"mAP(segm)/large\", \"Regex\": \".*mAP\\\\(segm\\\\)/large:\\\\s*(\\\\S+).*\"},\n",
    "    {\"Name\": \"mAP(segm)/medium\", \"Regex\": \".*mAP\\\\(segm\\\\)/medium:\\\\s*(\\\\S+).*\"},\n",
    "    {\"Name\": \"mAP(segm)/small\", \"Regex\": \".*mAP\\\\(segm\\\\)/small:\\\\s*(\\\\S+).*\"},\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define SageMaker Experiment\n",
    "\n",
    "To define SageMaker Experiment, we first install `sagemaker-experiments` package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install --upgrade pip\n",
    "! pip install sagemaker-experiments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we import the SageMaker Experiment modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from smexperiments.experiment import Experiment\n",
    "from smexperiments.trial import Trial\n",
    "from smexperiments.trial_component import TrialComponent\n",
    "from smexperiments.tracker import Tracker\n",
    "import time"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we define a `Tracker` for tracking input data used in the SageMaker Trials in this Experiment. Specify the S3 URL of your dataset in the `value` below and change the name of the dataset if you are using a different dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sm = session.client(\"sagemaker\")\n",
    "with Tracker.create(display_name=\"Preprocessing\", sagemaker_boto_client=sm) as tracker:\n",
    "    # we can log the s3 uri to the dataset used for training\n",
    "    tracker.log_input(\n",
    "        name=\"coco-2017-dataset\",\n",
    "        media_type=\"s3/uri\",\n",
    "        value=f\"s3://{s3_bucket}/{prefix}/input/train\",  # specify S3 URL to your dataset\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we create a SageMaker Experiment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mrcnn_experiment = Experiment.create(\n",
    "    experiment_name=f\"mask-rcnn-experiment-{int(time.time())}\",\n",
    "    description=\"Mask R-CNN experiment\",\n",
    "    sagemaker_boto_client=sm,\n",
    ")\n",
    "print(mrcnn_experiment)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define SageMaker Experiment Trials\n",
    "\n",
    "Next, we define SageMaker experiment trials for the experiment we just defined. For each experiment trial, we use SageMaker [Tensorflow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html) API to define a SageMaker Training Job that uses SageMaker script mode. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Select script\n",
    "\n",
    "In script-mode, first we have to select an entry point script that acts as interface with SageMaker and launches the training job. For training [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) model, set ```script``` to ```\"tensorpack-mask-rcnn.py\"```. For training [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) model, set ```script``` to ```\"aws-mask-rcnn.py\"```."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "script= # \"tensorpack-mask-rcnn.py\" or \"aws-mask-rcnn.py\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Select distribution mode\n",
    "\n",
    "We use Message Passing Interface (MPI) to distribute the training job across multiple hosts. The ```custom_mpi_options``` below is only used by [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) model, and can be safely commented out for [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mpi_distribution = {\"mpi\": {\"enabled\": True, \"custom_mpi_options\": \"-x TENSORPACK_FP16=1 \"}}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define Security Group and Subnets\n",
    "We run the training job in your private VPC, so we need to set the ```subnets``` and ```security_group_ids``` prior to running the cell below. You may specify multiple subnet ids in the ```subnets``` list. The subnets included in the ```sunbets``` list must be part of the output of  ```./stack-sm.sh``` CloudFormation stack script used to create this notebook instance. Specify only one security group id in ```security_group_ids``` list. The security group id must be part of the output of  ```./stack-sm.sh``` script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "security_group_ids = # ['sg-xxxxxxxx']\n",
    "subnets =  # ['subnet-xxxxxxx', 'subnet-xxxxxxx', 'subnet-xxxxxxx']\n",
    "sagemaker_session = sagemaker.session.Session(boto_session=session)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define SageMaker Tensorflow Estimator\n",
    "\n",
    "Next, we use SageMaker TensorFlow Estimator API to define a SageMaker Training Job for each SageMaker Trial we need to run within the SageMaker Experiment.\n",
    "\n",
    "We recommned using 32 GPUs for each training job, so we set ```instance_count=4``` and ```instance_type='ml.p3.16xlarge'```, because there are 8 Tesla V100 GPUs per ```ml.p3.16xlarge``` instance. We recommend using 100 GB [Amazon EBS](https://aws.amazon.com/ebs/) storage volume with each training instance, so we set ```volume_size = 100```. We want to replicate training data to each training instance, so we set ```input_mode= 'File'```.\n",
    "\n",
    "First we download [ImageNet-R101-AlignPadding.npz](http://models.tensorpack.com/FasterRCNN/ImageNet-R101-AlignPadding.npz) to ```pretrained-models``` folder under the input train data directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! sudo wget -O ~/efs/mask-rcnn/sagemaker/input/train/pretrained-models/ImageNet-R101-AlignPadding.npz \\\n",
    "    http://models.tensorpack.com/FasterRCNN/ImageNet-R101-AlignPadding.npz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we will iterate through the Trial parameters and start two trials, one for ResNet architecture `resnet50`, and a second Trial for `resnet101`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trial_params = [ ('resnet50', 'ImageNet-R50-AlignPadding.npz'), \n",
    "                ('resnet101', 'ImageNet-R101-AlignPadding.npz')]\n",
    "\n",
    "for resnet_arch, backbone_weights in trial_params:\n",
    "    \n",
    "    hyperparameters['resnet_arch'] = resnet_arch\n",
    "    hyperparameters['backbone_weights'] = backbone_weights\n",
    "    \n",
    "    trial_name = f\"mask-rcnn-script-mode-{resnet_arch}-{int(time.time())}\"\n",
    "    mrcnn_trial = Trial.create(\n",
    "                        trial_name=trial_name, \n",
    "                        experiment_name=mrcnn_experiment.experiment_name,\n",
    "                        sagemaker_boto_client=sm,\n",
    "    )\n",
    "    \n",
    "    # associate the proprocessing trial component with the current trial\n",
    "    mrcnn_trial.add_trial_component(tracker.trial_component)\n",
    "    print(mrcnn_trial)\n",
    "\n",
    "    mask_rcnn_estimator = TensorFlow(image_uri=training_image,\n",
    "                                role=role, \n",
    "                                py_version='py3',\n",
    "                                instance_count=4, \n",
    "                                instance_type='ml.p3.16xlarge',\n",
    "                                distribution=mpi_distribution,\n",
    "                                entry_point=script,\n",
    "                                volume_size = 100,\n",
    "                                max_run = 400000,\n",
    "                                output_path=s3_output_location,\n",
    "                                sagemaker_session=sagemaker_session, \n",
    "                                hyperparameters = hyperparameters,\n",
    "                                metric_definitions = metric_definitions,\n",
    "                                subnets=subnets,\n",
    "                                security_group_ids=security_group_ids)\n",
    "    \n",
    "    # Specify directory path for log output on the EFS file system.\n",
    "    # You need to provide normalized and absolute path below.\n",
    "    # For example, '/mask-rcnn/sagemaker/output/log'\n",
    "    # Log output directory must not exist\n",
    "    file_system_directory_path = f'/mask-rcnn/sagemaker/output/{mrcnn_trial.trial_name}'\n",
    "    print(f\"EFS log directory:{file_system_directory_path}\")\n",
    "\n",
    "    # Create the log output directory. \n",
    "    # EFS file-system is mounted on '$HOME/efs' mount point for this notebook.\n",
    "    home_dir=os.environ['HOME']\n",
    "    local_efs_path = os.path.join(home_dir,'efs', file_system_directory_path[1:])\n",
    "    print(f\"Creating log directory on EFS: {local_efs_path}\")\n",
    "\n",
    "    assert not os.path.isdir(local_efs_path)\n",
    "    ! sudo mkdir -p -m a=rw {local_efs_path}\n",
    "    assert os.path.isdir(local_efs_path)\n",
    "\n",
    "    # Specify the access mode of the mount of the directory associated with the file system. \n",
    "    # Directory must be mounted 'rw'(read-write).\n",
    "    file_system_access_mode = 'rw'\n",
    "\n",
    "\n",
    "    log = FileSystemInput(file_system_id=file_system_id,\n",
    "                                    file_system_type=file_system_type,\n",
    "                                    directory_path=file_system_directory_path,\n",
    "                                    file_system_access_mode=file_system_access_mode)\n",
    "\n",
    "    data_channels = {'train': train, 'log': log}\n",
    "\n",
    "    mask_rcnn_estimator.fit(inputs=data_channels, \n",
    "                            job_name=mrcnn_trial.trial_name,\n",
    "                            logs=True,  \n",
    "                            experiment_config={\"TrialName\": mrcnn_trial.trial_name, \n",
    "                                           \"TrialComponentDisplayName\": \"Training\"},\n",
    "                            wait=False)\n",
    "\n",
    "    # sleep in between starting two trials\n",
    "    time.sleep(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "search_expression = {\n",
    "    \"Filters\": [\n",
    "        {\n",
    "            \"Name\": \"DisplayName\",\n",
    "            \"Operator\": \"Equals\",\n",
    "            \"Value\": \"Training\",\n",
    "        },\n",
    "        {\n",
    "            \"Name\": \"metrics.maskrcnn_loss/accuracy.max\",\n",
    "            \"Operator\": \"LessThan\",\n",
    "            \"Value\": \"1\",\n",
    "        },\n",
    "    ],\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.analytics import ExperimentAnalytics\n",
    "\n",
    "trial_component_analytics = ExperimentAnalytics(\n",
    "    sagemaker_session=sagemaker_session,\n",
    "    experiment_name=mrcnn_experiment.experiment_name,\n",
    "    search_expression=search_expression,\n",
    "    sort_by=\"metrics.maskrcnn_loss/accuracy.max\",\n",
    "    sort_order=\"Descending\",\n",
    "    parameter_names=[\"resnet_arch\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "analytic_table = trial_component_analytics.dataframe()\n",
    "for col in analytic_table.columns:\n",
    "    print(col)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bbox_map = analytic_table[\n",
    "    [\"resnet_arch\", \"mAP(bbox)/small - Max\", \"mAP(bbox)/medium - Max\", \"mAP(bbox)/large - Max\"]\n",
    "]\n",
    "bbox_map"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "segm_map = analytic_table[\n",
    "    [\"resnet_arch\", \"mAP(segm)/small - Max\", \"mAP(segm)/medium - Max\", \"mAP(segm)/large - Max\"]\n",
    "]\n",
    "segm_map"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notebook CI Test Results\n",
    "\n",
    "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
    "\n",
    "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n",
    "\n",
    "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n",
    "\n",
    "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n",
    "\n",
    "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n",
    "\n",
    "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n",
    "\n",
    "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n",
    "\n",
    "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n",
    "\n",
    "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n",
    "\n",
    "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n",
    "\n",
    "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n",
    "\n",
    "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n",
    "\n",
    "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n",
    "\n",
    "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n",
    "\n",
    "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n",
    "\n",
    "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_tensorflow_p36",
   "language": "python",
   "name": "conda_tensorflow_p36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}