{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Amazon SageMaker Experiment Trials for Distributed Training of Mask-RCNN\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This notebook is a step-by-step tutorial on Amazon SageMaker Experiment Trials for distributed training of [Mask R-CNN](https://arxiv.org/abs/1703.06870) implemented in [TensorFlow](https://www.tensorflow.org/) framework. \n", "\n", "Concretely, we will describe the steps for SagerMaker Experiment Trials for training [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) and [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) in [Amazon SageMaker](https://aws.amazon.com/sagemaker/) using [Amazon S3](https://aws.amazon.com/s3/) as data source.\n", "\n", "The outline of steps is as follows:\n", "\n", "1. Stage COCO 2017 dataset in [Amazon S3](https://aws.amazon.com/s3/)\n", "2. Build SageMaker training image and push it to [Amazon ECR](https://aws.amazon.com/ecr/)\n", "3. Configure data input channels\n", "4. Configure hyper-prarameters\n", "5. Define training metrics\n", "6. Define training job \n", "7. Define SageMaker Experiment Trials to start the training jobs\n", "\n", "Before we get started, let us initialize two python variables ```aws_region``` and ```s3_bucket``` that we will use throughout the notebook. The ```s3_bucket``` must be located in the region of this notebook instance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "\n", "session = boto3.session.Session()\n", "aws_region = session.region_name\n", "s3_bucket = # your-s3-bucket-name\n", "\n", "\n", "try:\n", " s3_client = boto3.client('s3')\n", " response = s3_client.get_bucket_location(Bucket=s3_bucket)\n", " print(f\"Bucket region: {response['LocationConstraint']}\")\n", "except:\n", " print(f\"Access Error: Check if '{s3_bucket}' S3 bucket is in '{aws_region}' region\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stage COCO 2017 dataset in Amazon S3\n", "\n", "We use [COCO 2017 dataset](http://cocodataset.org/#home) for training. We download COCO 2017 training and validation dataset to this notebook instance, extract the files from the dataset archives, and upload the extracted files to your Amazon [S3 bucket](https://docs.aws.amazon.com/en_pv/AmazonS3/latest/gsg/CreatingABucket.html) with the prefix ```mask-rcnn/sagemaker/input/train```. The ```prepare-s3-bucket.sh``` script executes this step.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat ./prepare-s3-bucket.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Using your *Amazon S3 bucket* as argument, run the cell below. If you have already uploaded COCO 2017 dataset to your Amazon S3 bucket *in this AWS region*, you may skip this step. The expected time to execute this step is 20 minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "!./prepare-s3-bucket.sh {s3_bucket}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build and push SageMaker training images\n", "\n", "For this step, the [IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) attached to this notebook instance needs full access to Amazon ECR service. If you created this notebook instance using the ```./stack-sm.sh``` script in this repository, the IAM Role attached to this notebook instance is already setup with full access to ECR service. \n", "\n", "Below, we have a choice of two different implementations:\n", "\n", "1. [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) implementation supports a maximum per-GPU batch size of 1, and does not support mixed precision. It can be used with mainstream TensorFlow releases.\n", "\n", "2. [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) is an optimized implementation that supports a maximum batch size of 4 and supports mixed precision. This implementation uses custom TensorFlow ops. The required custom TensorFlow ops are available in [AWS Deep Learning Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) images in ```tensorflow-training``` repository with image tag ```1.15.2-gpu-py36-cu100-ubuntu18.04```, or later. \n", "\n", "It is recommended that you build and push both SageMaker training images and use either image for training later.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TensorPack Faster-RCNN/Mask-RCNN\n", "\n", "Use ```./container-script-mode/build_tools/build_and_push.sh``` script to build and push the TensorPack Faster-RCNN/Mask-RCNN training image to Amazon ECR. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat ./container-script-mode/build_tools/build_and_push.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using your *AWS region* as argument, run the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "! ./container-script-mode/build_tools/build_and_push.sh {aws_region}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set ```tensorpack_image``` below to Amazon ECR URI of the image you pushed above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tensorpack_image = # mask-rcnn-tensorpack-sagemaker ECR URI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AWS Samples Mask R-CNN\n", "Use ```./container-optimized-script-mode/build_tools/build_and_push.sh``` script to build and push the AWS Samples Mask R-CNN training image to Amazon ECR." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat ./container-optimized-script-mode/build_tools/build_and_push.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using your *AWS region* as argument, run the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "! ./container-optimized-script-mode/build_tools/build_and_push.sh {aws_region}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Set ```aws_samples_image``` below to Amazon ECR URI of the image you pushed above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "aws_samples_image = # mask-rcnn-tensorflow-sagemaker ECR URI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker Initialization \n", "First we upgrade SageMaker to 2.3.0 API. If your notebook is already using latest Sagemaker 2.x API, you may skip the next cell.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pip install --upgrade pip\n", "! pip install sagemaker" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have staged the data and we have built and pushed the training docker image to Amazon ECR. Now we are ready to start using Amazon SageMaker." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "import os\n", "import time\n", "import boto3\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.tensorflow.estimator import TensorFlow\n", "\n", "role = (\n", " get_execution_role()\n", ") # provide a pre-existing role ARN as an alternative to creating a new role\n", "print(f\"SageMaker Execution Role:{role}\")\n", "\n", "client = boto3.client(\"sts\")\n", "account = client.get_caller_identity()[\"Account\"]\n", "print(f\"AWS account:{account}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we set ```training_image``` to the Amazon ECR image URI you saved in a previous step. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_image = # set to tensorpack_image or aws_samples_image \n", "print(f'Training image: {training_image}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define SageMaker Data Channels\n", "\n", "Next, we define the *train* data channel using EFS file-system. To do so, we need to specify the EFS file-system id, which is shown in the output of the command below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "notebook_attached_efs=!df -kh | grep 'fs-' | sed 's/\\(fs-[0-9a-z]*\\).*/\\1/'\n", "print(f\"SageMaker notebook attached EFS: {notebook_attached_efs}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the cell below, we define the `train` data input channel." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.inputs import FileSystemInput\n", "\n", "# Specify EFS file system id.\n", "file_system_id = notebook_attached_efs[0]\n", "print(f\"EFS file-system-id: {file_system_id}\")\n", "\n", "# Specify directory path for input data on the file system.\n", "# You need to provide normalized and absolute path below.\n", "file_system_directory_path = \"/mask-rcnn/sagemaker/input/train\"\n", "print(f\"EFS file-system data input path: {file_system_directory_path}\")\n", "\n", "# Specify the access mode of the mount of the directory associated with the file system.\n", "# Directory must be mounted 'ro'(read-only).\n", "file_system_access_mode = \"ro\"\n", "\n", "# Specify your file system type\n", "file_system_type = \"EFS\"\n", "\n", "train = FileSystemInput(\n", " file_system_id=file_system_id,\n", " file_system_type=file_system_type,\n", " directory_path=file_system_directory_path,\n", " file_system_access_mode=file_system_access_mode,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we define the model output location in S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prefix = \"mask-rcnn/sagemaker\" # prefix in your bucket\n", "s3_output_location = f\"s3://{s3_bucket}/{prefix}/output\"\n", "print(f\"S3 model output location: {s3_output_location}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure Hyper-parameters\n", "Next, we define the hyper-parameters. \n", "\n", "Note, some hyper-parameters are different between the two implementations. The batch size per GPU in TensorPack Faster-RCNN/Mask-RCNN is fixed at 1, but is configurable in AWS Samples Mask-RCNN. The learning rate schedule is specified in units of steps in TensorPack Faster-RCNN/Mask-RCNN, but in epochs in AWS Samples Mask-RCNN.\n", "\n", "The detault learning rate schedule values shown below correspond to training for a total of 24 epochs, at 120,000 images per epoch.\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TensorPack Faster-RCNN/Mask-RCNN Hyper-parameters
Hyper-parameterDescriptionDefault
mode_fpnFlag to indicate use of Feature Pyramid Network (FPN) in the Mask R-CNN model backbone\"True\"
mode_maskA value of \"False\" means Faster-RCNN model, \"True\" means Mask R-CNN moodel\"True\"
eval_periodNumber of epochs period for evaluation during training1
lr_scheduleLearning rate schedule in training steps'[240000, 320000, 360000]'
batch_normBatch normalization option ('FreezeBN', 'SyncBN', 'GN', 'None') 'FreezeBN'
images_per_epochImages per epoch 120000
data_trainTraining data under data directory'coco_train2017'
data_valValidation data under data directory'coco_val2017'
resnet_archMust be 'resnet50' or 'resnet101''resnet50'
backbone_weightsResNet backbone weights'ImageNet-R50-AlignPadding.npz'
load_modelPre-trained model to load
config:Any hyperparamter prefixed with config: is set as a model config parameter
\n", "\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AWS Samples Mask-RCNN Hyper-parameters
Hyper-parameterDescriptionDefault
mode_fpnFlag to indicate use of Feature Pyramid Network (FPN) in the Mask R-CNN model backbone\"True\"
mode_maskA value of \"False\" means Faster-RCNN model, \"True\" means Mask R-CNN moodel\"True\"
eval_periodNumber of epochs period for evaluation during training1
lr_epoch_scheduleLearning rate schedule in epochs'[(16, 0.1), (20, 0.01), (24, None)]'
batch_size_per_gpuBatch size per gpu ( Minimum 1, Maximum 4)4
batch_normBatch normalization option ('FreezeBN', 'SyncBN', 'GN', 'None') 'FreezeBN'
images_per_epochImages per epoch 120000
data_trainTraining data under data directory'train2017'
data_valValidation data under data directory'val2017'
resnet_archMust be 'resnet50' or 'resnet101''resnet50'
backbone_weightsResNet backbone weights'ImageNet-R50-AlignPadding.npz'
load_modelPre-trained model to load
config:Any hyperparamter prefixed with config: is set as a model config parameter
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hyperparameters = {\n", " \"mode_fpn\": \"True\",\n", " \"mode_mask\": \"True\",\n", " \"eval_period\": 1,\n", " \"batch_norm\": \"FreezeBN\",\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define Training Metrics\n", "Next, we define the regular expressions that SageMaker uses to extract algorithm metrics from training logs and send them to [AWS CloudWatch metrics](https://docs.aws.amazon.com/en_pv/AmazonCloudWatch/latest/monitoring/working_with_metrics.html). These algorithm metrics are visualized in SageMaker console." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_definitions = [\n", " {\"Name\": \"fastrcnn_losses/box_loss\", \"Regex\": \".*fastrcnn_losses/box_loss:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"fastrcnn_losses/label_loss\", \"Regex\": \".*fastrcnn_losses/label_loss:\\\\s*(\\\\S+).*\"},\n", " {\n", " \"Name\": \"fastrcnn_losses/label_metrics/accuracy\",\n", " \"Regex\": \".*fastrcnn_losses/label_metrics/accuracy:\\\\s*(\\\\S+).*\",\n", " },\n", " {\n", " \"Name\": \"fastrcnn_losses/label_metrics/false_negative\",\n", " \"Regex\": \".*fastrcnn_losses/label_metrics/false_negative:\\\\s*(\\\\S+).*\",\n", " },\n", " {\n", " \"Name\": \"fastrcnn_losses/label_metrics/fg_accuracy\",\n", " \"Regex\": \".*fastrcnn_losses/label_metrics/fg_accuracy:\\\\s*(\\\\S+).*\",\n", " },\n", " {\n", " \"Name\": \"fastrcnn_losses/num_fg_label\",\n", " \"Regex\": \".*fastrcnn_losses/num_fg_label:\\\\s*(\\\\S+).*\",\n", " },\n", " {\"Name\": \"maskrcnn_loss/accuracy\", \"Regex\": \".*maskrcnn_loss/accuracy:\\\\s*(\\\\S+).*\"},\n", " {\n", " \"Name\": \"maskrcnn_loss/fg_pixel_ratio\",\n", " \"Regex\": \".*maskrcnn_loss/fg_pixel_ratio:\\\\s*(\\\\S+).*\",\n", " },\n", " {\"Name\": \"maskrcnn_loss/maskrcnn_loss\", \"Regex\": \".*maskrcnn_loss/maskrcnn_loss:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"maskrcnn_loss/pos_accuracy\", \"Regex\": \".*maskrcnn_loss/pos_accuracy:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/IoU=0.5\", \"Regex\": \".*mAP\\\\(bbox\\\\)/IoU=0\\\\.5:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/IoU=0.5:0.95\", \"Regex\": \".*mAP\\\\(bbox\\\\)/IoU=0\\\\.5:0\\\\.95:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/IoU=0.75\", \"Regex\": \".*mAP\\\\(bbox\\\\)/IoU=0\\\\.75:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/large\", \"Regex\": \".*mAP\\\\(bbox\\\\)/large:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/medium\", \"Regex\": \".*mAP\\\\(bbox\\\\)/medium:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/small\", \"Regex\": \".*mAP\\\\(bbox\\\\)/small:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/IoU=0.5\", \"Regex\": \".*mAP\\\\(segm\\\\)/IoU=0\\\\.5:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/IoU=0.5:0.95\", \"Regex\": \".*mAP\\\\(segm\\\\)/IoU=0\\\\.5:0\\\\.95:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/IoU=0.75\", \"Regex\": \".*mAP\\\\(segm\\\\)/IoU=0\\\\.75:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/large\", \"Regex\": \".*mAP\\\\(segm\\\\)/large:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/medium\", \"Regex\": \".*mAP\\\\(segm\\\\)/medium:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/small\", \"Regex\": \".*mAP\\\\(segm\\\\)/small:\\\\s*(\\\\S+).*\"},\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define SageMaker Experiment\n", "\n", "To define SageMaker Experiment, we first install `sagemaker-experiments` package." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pip install --upgrade pip\n", "! pip install sagemaker-experiments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we import the SageMaker Experiment modules." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from smexperiments.experiment import Experiment\n", "from smexperiments.trial import Trial\n", "from smexperiments.trial_component import TrialComponent\n", "from smexperiments.tracker import Tracker\n", "import time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we define a `Tracker` for tracking input data used in the SageMaker Trials in this Experiment. Specify the S3 URL of your dataset in the `value` below and change the name of the dataset if you are using a different dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sm = session.client(\"sagemaker\")\n", "with Tracker.create(display_name=\"Preprocessing\", sagemaker_boto_client=sm) as tracker:\n", " # we can log the s3 uri to the dataset used for training\n", " tracker.log_input(\n", " name=\"coco-2017-dataset\",\n", " media_type=\"s3/uri\",\n", " value=f\"s3://{s3_bucket}/{prefix}/input/train\", # specify S3 URL to your dataset\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we create a SageMaker Experiment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mrcnn_experiment = Experiment.create(\n", " experiment_name=f\"mask-rcnn-experiment-{int(time.time())}\",\n", " description=\"Mask R-CNN experiment\",\n", " sagemaker_boto_client=sm,\n", ")\n", "print(mrcnn_experiment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define SageMaker Experiment Trials\n", "\n", "Next, we define SageMaker experiment trials for the experiment we just defined. For each experiment trial, we use SageMaker [Tensorflow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html) API to define a SageMaker Training Job that uses SageMaker script mode. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select script\n", "\n", "In script-mode, first we have to select an entry point script that acts as interface with SageMaker and launches the training job. For training [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) model, set ```script``` to ```\"tensorpack-mask-rcnn.py\"```. For training [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) model, set ```script``` to ```\"aws-mask-rcnn.py\"```." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "script= # \"tensorpack-mask-rcnn.py\" or \"aws-mask-rcnn.py\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select distribution mode\n", "\n", "We use Message Passing Interface (MPI) to distribute the training job across multiple hosts. The ```custom_mpi_options``` below is only used by [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) model, and can be safely commented out for [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mpi_distribution = {\"mpi\": {\"enabled\": True, \"custom_mpi_options\": \"-x TENSORPACK_FP16=1 \"}}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define Security Group and Subnets\n", "We run the training job in your private VPC, so we need to set the ```subnets``` and ```security_group_ids``` prior to running the cell below. You may specify multiple subnet ids in the ```subnets``` list. The subnets included in the ```sunbets``` list must be part of the output of ```./stack-sm.sh``` CloudFormation stack script used to create this notebook instance. Specify only one security group id in ```security_group_ids``` list. The security group id must be part of the output of ```./stack-sm.sh``` script." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "security_group_ids = # ['sg-xxxxxxxx']\n", "subnets = # ['subnet-xxxxxxx', 'subnet-xxxxxxx', 'subnet-xxxxxxx']\n", "sagemaker_session = sagemaker.session.Session(boto_session=session)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define SageMaker Tensorflow Estimator\n", "\n", "Next, we use SageMaker TensorFlow Estimator API to define a SageMaker Training Job for each SageMaker Trial we need to run within the SageMaker Experiment.\n", "\n", "We recommned using 32 GPUs for each training job, so we set ```instance_count=4``` and ```instance_type='ml.p3.16xlarge'```, because there are 8 Tesla V100 GPUs per ```ml.p3.16xlarge``` instance. We recommend using 100 GB [Amazon EBS](https://aws.amazon.com/ebs/) storage volume with each training instance, so we set ```volume_size = 100```. We want to replicate training data to each training instance, so we set ```input_mode= 'File'```.\n", "\n", "First we download [ImageNet-R101-AlignPadding.npz](http://models.tensorpack.com/FasterRCNN/ImageNet-R101-AlignPadding.npz) to ```pretrained-models``` folder under the input train data directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! sudo wget -O ~/efs/mask-rcnn/sagemaker/input/train/pretrained-models/ImageNet-R101-AlignPadding.npz \\\n", " http://models.tensorpack.com/FasterRCNN/ImageNet-R101-AlignPadding.npz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will iterate through the Trial parameters and start two trials, one for ResNet architecture `resnet50`, and a second Trial for `resnet101`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trial_params = [ ('resnet50', 'ImageNet-R50-AlignPadding.npz'), \n", " ('resnet101', 'ImageNet-R101-AlignPadding.npz')]\n", "\n", "for resnet_arch, backbone_weights in trial_params:\n", " \n", " hyperparameters['resnet_arch'] = resnet_arch\n", " hyperparameters['backbone_weights'] = backbone_weights\n", " \n", " trial_name = f\"mask-rcnn-script-mode-{resnet_arch}-{int(time.time())}\"\n", " mrcnn_trial = Trial.create(\n", " trial_name=trial_name, \n", " experiment_name=mrcnn_experiment.experiment_name,\n", " sagemaker_boto_client=sm,\n", " )\n", " \n", " # associate the proprocessing trial component with the current trial\n", " mrcnn_trial.add_trial_component(tracker.trial_component)\n", " print(mrcnn_trial)\n", "\n", " mask_rcnn_estimator = TensorFlow(image_uri=training_image,\n", " role=role, \n", " py_version='py3',\n", " instance_count=4, \n", " instance_type='ml.p3.16xlarge',\n", " distribution=mpi_distribution,\n", " entry_point=script,\n", " volume_size = 100,\n", " max_run = 400000,\n", " output_path=s3_output_location,\n", " sagemaker_session=sagemaker_session, \n", " hyperparameters = hyperparameters,\n", " metric_definitions = metric_definitions,\n", " subnets=subnets,\n", " security_group_ids=security_group_ids)\n", " \n", " # Specify directory path for log output on the EFS file system.\n", " # You need to provide normalized and absolute path below.\n", " # For example, '/mask-rcnn/sagemaker/output/log'\n", " # Log output directory must not exist\n", " file_system_directory_path = f'/mask-rcnn/sagemaker/output/{mrcnn_trial.trial_name}'\n", " print(f\"EFS log directory:{file_system_directory_path}\")\n", "\n", " # Create the log output directory. \n", " # EFS file-system is mounted on '$HOME/efs' mount point for this notebook.\n", " home_dir=os.environ['HOME']\n", " local_efs_path = os.path.join(home_dir,'efs', file_system_directory_path[1:])\n", " print(f\"Creating log directory on EFS: {local_efs_path}\")\n", "\n", " assert not os.path.isdir(local_efs_path)\n", " ! sudo mkdir -p -m a=rw {local_efs_path}\n", " assert os.path.isdir(local_efs_path)\n", "\n", " # Specify the access mode of the mount of the directory associated with the file system. \n", " # Directory must be mounted 'rw'(read-write).\n", " file_system_access_mode = 'rw'\n", "\n", "\n", " log = FileSystemInput(file_system_id=file_system_id,\n", " file_system_type=file_system_type,\n", " directory_path=file_system_directory_path,\n", " file_system_access_mode=file_system_access_mode)\n", "\n", " data_channels = {'train': train, 'log': log}\n", "\n", " mask_rcnn_estimator.fit(inputs=data_channels, \n", " job_name=mrcnn_trial.trial_name,\n", " logs=True, \n", " experiment_config={\"TrialName\": mrcnn_trial.trial_name, \n", " \"TrialComponentDisplayName\": \"Training\"},\n", " wait=False)\n", "\n", " # sleep in between starting two trials\n", " time.sleep(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "search_expression = {\n", " \"Filters\": [\n", " {\n", " \"Name\": \"DisplayName\",\n", " \"Operator\": \"Equals\",\n", " \"Value\": \"Training\",\n", " },\n", " {\n", " \"Name\": \"metrics.maskrcnn_loss/accuracy.max\",\n", " \"Operator\": \"LessThan\",\n", " \"Value\": \"1\",\n", " },\n", " ],\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.analytics import ExperimentAnalytics\n", "\n", "trial_component_analytics = ExperimentAnalytics(\n", " sagemaker_session=sagemaker_session,\n", " experiment_name=mrcnn_experiment.experiment_name,\n", " search_expression=search_expression,\n", " sort_by=\"metrics.maskrcnn_loss/accuracy.max\",\n", " sort_order=\"Descending\",\n", " parameter_names=[\"resnet_arch\"],\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "analytic_table = trial_component_analytics.dataframe()\n", "for col in analytic_table.columns:\n", " print(col)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bbox_map = analytic_table[\n", " [\"resnet_arch\", \"mAP(bbox)/small - Max\", \"mAP(bbox)/medium - Max\", \"mAP(bbox)/large - Max\"]\n", "]\n", "bbox_map" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "segm_map = analytic_table[\n", " [\"resnet_arch\", \"mAP(segm)/small - Max\", \"mAP(segm)/medium - Max\", \"mAP(segm)/large - Max\"]\n", "]\n", "segm_map" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-experiment-trials.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "conda_tensorflow_p36", "language": "python", "name": "conda_tensorflow_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" } }, "nbformat": 4, "nbformat_minor": 4 }