{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Distributed Training of Mask-RCNN in Amazon SageMaker using FSx\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-fsx.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This notebook is a step-by-step tutorial on distributed training of [Mask R-CNN]( https://arxiv.org/abs/1703.06870) implemented in [TensorFlow](https://www.tensorflow.org/) framework. Mask R-CNN is also referred to as heavy weight object detection model and it is part of [MLPerf](https://www.mlperf.org/training-results-0-6/).\n", "\n", "Concretely, we will describe the steps for training [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) and [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) in [Amazon SageMaker](https://aws.amazon.com/sagemaker/) using [Amazon FSx for Lustre](https://aws.amazon.com/fsx/lustre/) file-system as data source.\n", "\n", "The outline of steps is as follows:\n", "\n", "1. Stage COCO 2017 dataset in [Amazon S3](https://aws.amazon.com/s3/)\n", "2. Create Amazon FSx Lustre file-system and import data into the file-system from S3\n", "3. Build Docker training image and push it to [Amazon ECR](https://aws.amazon.com/ecr/)\n", "4. Configure data input channels\n", "5. Configure hyper-prarameters\n", "6. Define training metrics\n", "7. Define training job and start training\n", "\n", "Before we get started, let us initialize two python variables ```aws_region``` and ```s3_bucket``` that we will use throughout the notebook. The ```s3_bucket``` must be located in the region of this notebook instance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "\n", "session = boto3.session.Session()\n", "aws_region = session.region_name\n", "s3_bucket = # your-s3-bucket-name\n", "\n", "\n", "try:\n", " s3_client = boto3.client('s3')\n", " response = s3_client.get_bucket_location(Bucket=s3_bucket)\n", " print(f\"Bucket region: {response['LocationConstraint']}\")\n", "except:\n", " print(f\"Access Error: Check if '{s3_bucket}' S3 bucket is in '{aws_region}' region\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stage COCO 2017 dataset in Amazon S3\n", "\n", "We use [COCO 2017 dataset](http://cocodataset.org/#home) for training. We download COCO 2017 training and validation dataset to this notebook instance, extract the files from the dataset archives, and upload the extracted files to your Amazon [S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html). The ```prepare-s3-bucket.sh``` script executes this step. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat ./prepare-s3-bucket.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using your *Amazon S3 bucket* as argument, run the cell below. If you have already uploaded COCO 2017 dataset to your Amazon S3 bucket, you may skip this step." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "!./prepare-s3-bucket.sh {s3_bucket}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create FSx Lustre file-system and import data from S3\n", "\n", "Below, we use [AWS CloudFomration stack](https://docs.aws.amazon.com/en_pv/AWSCloudFormation/latest/UserGuide/stacks.html) to create a FSx Lustre file-system and import COCO 2017 dataset into the FSx file-system from your S3 bucket. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat stack-fsx.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this step, the [IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) attached to this notebook instance needs full access to Amazon CloudFormation and FSx services. If you created this notebook instance using the ```./stack-sm.sh``` script in this repository, the IAM Role attached to this notebook instance is already setup with requried access. \n", "\n", "```usage: ./stack-fsx.sh ```\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FSx configuraiton
ArgumentDescriptionValue
aws-regionAWS region namee.g. us-east-1
s3-import-pathS3 import path for importing data to FSx file-systems3://<s3-bucket-name>/mask-rcnn/sagemaker/input
fsx-capacityFSx Lustre file-system capacity in GiB3600 or 7200
subnet-idThis is available in the output of ./stack-sm.sh script you used to create this notebook instance. Specify only one subnet.subnet-xxxx
security-group-idSecurity group id for FSx lustre file system. This is available in the output of ./stack-sm.sh script you used to create this notebook instance. sg-xxxx
\n", "\n", "\n", "If you have already created a FSx Lustre file-system and populated it with COCO 2017 dataset, you may skip this step." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "security_group_id = # 'sg-xxxxxxxx' \n", "subnet_id = # 'subnet-xxxxxxx'\n", "!./stack-fsx.sh {aws_region} s3://{s3_bucket}/mask-rcnn/sagemaker/input 3600 {subnet_id} {security_group_id}\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build and push SageMaker training images\n", "\n", "For this step, the [IAM Role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) attached to this notebook instance needs full access to Amazon ECR service. If you created this notebook instance using the ```./stack-sm.sh``` script in this repository, the IAM Role attached to this notebook instance is already setup with full access to ECR service. \n", "\n", "Below, we have a choice of two different implementations:\n", "\n", "1. [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) implementation supports a maximum per-GPU batch size of 1, and does not support mixed precision. It can be used with mainstream TensorFlow releases.\n", "\n", "2. [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) is an optimized implementation that supports a maximum batch size of 4 and supports mixed precision. This implementation uses custom TensorFlow ops. The required custom TensorFlow ops are available in [AWS Deep Learning Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) images in ```tensorflow-training``` repository with image tag ```1.15.2-gpu-py36-cu100-ubuntu18.04```, or later.\n", "\n", "It is recommended that you build and push both SageMaker training images and use either image for training later.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TensorPack Faster-RCNN/Mask-RCNN\n", "\n", "Use ```./container-script-mode/build_tools/build_and_push.sh``` script to build and push the TensorPack Faster-RCNN/Mask-RCNN training image to Amazon ECR. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat ./container-script-mode/build_tools/build_and_push.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using your *AWS region* as argument, run the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "! ./container-script-mode/build_tools/build_and_push.sh {aws_region}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set ```tensorpack_image``` below to Amazon ECR URI of the image you pushed above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tensorpack_image = # mask-rcnn-tensorpack-sagemaker ECR URI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AWS Samples Mask R-CNN\n", "Use ```./container-optimized-script-mode/build_tools/build_and_push.sh``` script to build and push the AWS Samples Mask R-CNN training image to Amazon ECR." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat ./container-optimized-script-mode/build_tools/build_and_push.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using your *AWS region* as argument, run the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "! ./container-optimized-script-mode/build_tools/build_and_push.sh {aws_region}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set ```aws_samples_image``` below to the Amazon ECR URI of the image you pushed above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "aws_samples_image = # mask-rcnn-tensorflow-sagemaker ECR URI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker Initialization \n", "We have staged the data and we have built and pushed the training docker image to Amazon ECR. Now we are ready to start using Amazon SageMaker.\n", "\n", "First we upgrade SageMaker to 2.3.0 API. If your notebook is already using latest Sagemaker 2.x API, you may skip the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pip install --upgrade pip\n", "! pip install sagemaker" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.tensorflow.estimator import TensorFlow\n", "\n", "role = (\n", " get_execution_role()\n", ") # provide a pre-existing role ARN as an alternative to creating a new role\n", "print(f\"SageMaker Execution Role:{role}\")\n", "\n", "client = boto3.client(\"sts\")\n", "account = client.get_caller_identity()[\"Account\"]\n", "print(f\"AWS account:{account}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we set ```training_image``` to the Amazon ECR image URI you saved in a previous step. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_image = # set to tensorpack_image or aws_samples_image \n", "print(f'Training image: {training_image}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define SageMaker Data Channels\n", "\n", "Next, we define the *train* data channel using EFS file-system as data source. Set the FSx Lustre ```file_system_id``` prior to running the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.inputs import FileSystemInput\n", "\n", "# Specify FSx Lustre file system id.\n", "file_system_id = # \"fs-xxxxxxxxxxxxxx\"\n", "\n", "# Specify directory path for input data on the file system. \n", "# You need to provide normalized and absolute path below.\n", "file_system_directory_path = '/fsx/mask-rcnn/sagemaker/input/train'\n", "print(f'FSx file-system data input path: {file_system_directory_path}')\n", "\n", "# Specify the access mode of the mount of the directory associated with the file system. \n", "# Directory must be mounted 'ro'(read-only).\n", "file_system_access_mode = 'ro'\n", "\n", "# Specify your file system type.\n", "file_system_type = 'FSxLustre'\n", "\n", "train = FileSystemInput(file_system_id=file_system_id,\n", " file_system_type=file_system_type,\n", " directory_path=file_system_directory_path,\n", " file_system_access_mode=file_system_access_mode)\n", "\n", "data_channels = {'train': train}\n", "\n", "# Optionally you can create a log channel on FSx file-system\n", "# To create a log channel, follow the steps below and uncomment relevant code\n", "# Specify directory path for log output on the file system.\n", "# This directory must exist, be empty and be writable\n", "# You need to provide normalized and absolute path below.\n", "# For example, '/fsx/mask-rcnn/sagemaker/output/log', \n", "# assuming this directory exists, is empty and is writeable\n", "# file_system_directory_path = \n", "\n", "\n", "# Specify the access mode of the mount of the directory associated with the file system. \n", "# Directory must be mounted 'rw'(read-write).\n", "# file_system_access_mode = 'rw'\n", "\n", "# log = FileSystemInput(file_system_id=file_system_id,\n", "# file_system_type=file_system_type,\n", "# directory_path=file_system_directory_path,\n", "# file_system_access_mode=file_system_access_mode)\n", "\n", "#data_channels = {'train': train, 'log': log}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we define the model output location in S3. Set ```bucket``` to your S3 bucket name prior to running the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prefix = \"mask-rcnn/sagemaker\" # prefix in your bucket\n", "s3_output_location = f\"s3://{s3_bucket}/{prefix}/output\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure Hyper-parameters\n", "Next we define the hyper-parameters. \n", "\n", "Note, some hyper-parameters are different between the two implementations. The batch size per GPU in TensorPack Faster-RCNN/Mask-RCNN is fixed at 1, but is configurable in AWS Samples Mask-RCNN. The learning rate schedule is specified in units of steps in TensorPack Faster-RCNN/Mask-RCNN, but in epochs in AWS Samples Mask-RCNN.\n", "\n", "The detault learning rate schedule values shown below correspond to training for a total of 24 epochs, at 120,000 images per epoch.\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TensorPack Faster-RCNN/Mask-RCNN Hyper-parameters
Hyper-parameterDescriptionDefault
mode_fpnFlag to indicate use of Feature Pyramid Network (FPN) in the Mask R-CNN model backbone\"True\"
mode_maskA value of \"False\" means Faster-RCNN model, \"True\" means Mask R-CNN moodel\"True\"
eval_periodNumber of epochs period for evaluation during training1
lr_scheduleLearning rate schedule in training steps'[240000, 320000, 360000]'
batch_normBatch normalization option ('FreezeBN', 'SyncBN', 'GN', 'None') 'FreezeBN'
images_per_epochImages per epoch 120000
data_trainTraining data under data directory'coco_train2017'
data_valValidation data under data directory'coco_val2017'
resnet_archMust be 'resnet50' or 'resnet101''resnet50'
backbone_weightsResNet backbone weights'ImageNet-R50-AlignPadding.npz'
load_modelPre-trained model to load
config:Any hyperparamter prefixed with config: is set as a model config parameter
\n", "\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AWS Samples Mask-RCNN Hyper-parameters
Hyper-parameterDescriptionDefault
mode_fpnFlag to indicate use of Feature Pyramid Network (FPN) in the Mask R-CNN model backbone\"True\"
mode_maskA value of \"False\" means Faster-RCNN model, \"True\" means Mask R-CNN moodel\"True\"
eval_periodNumber of epochs period for evaluation during training1
lr_epoch_scheduleLearning rate schedule in epochs'[(16, 0.1), (20, 0.01), (24, None)]'
batch_size_per_gpuBatch size per gpu ( Minimum 1, Maximum 4)4
batch_normBatch normalization option ('FreezeBN', 'SyncBN', 'GN', 'None') 'FreezeBN'
images_per_epochImages per epoch 120000
data_trainTraining data under data directory'train2017'
data_valValidation data under data directory'val2017'
resnet_archMust be 'resnet50' or 'resnet101''resnet50'
backbone_weightsResNet backbone weights'ImageNet-R50-AlignPadding.npz'
load_modelPre-trained model to load
config:Any hyperparamter prefixed with config: is set as a model config parameter
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hyperparameters = {\n", " \"mode_fpn\": \"True\",\n", " \"mode_mask\": \"True\",\n", " \"eval_period\": 1,\n", " \"batch_norm\": \"FreezeBN\",\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define Training Metrics\n", "Next, we define the regular expressions that SageMaker uses to extract algorithm metrics from training logs and send them to [AWS CloudWatch metrics](https://docs.aws.amazon.com/en_pv/AmazonCloudWatch/latest/monitoring/working_with_metrics.html). These algorithm metrics are visualized in SageMaker console." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_definitions = [\n", " {\"Name\": \"fastrcnn_losses/box_loss\", \"Regex\": \".*fastrcnn_losses/box_loss:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"fastrcnn_losses/label_loss\", \"Regex\": \".*fastrcnn_losses/label_loss:\\\\s*(\\\\S+).*\"},\n", " {\n", " \"Name\": \"fastrcnn_losses/label_metrics/accuracy\",\n", " \"Regex\": \".*fastrcnn_losses/label_metrics/accuracy:\\\\s*(\\\\S+).*\",\n", " },\n", " {\n", " \"Name\": \"fastrcnn_losses/label_metrics/false_negative\",\n", " \"Regex\": \".*fastrcnn_losses/label_metrics/false_negative:\\\\s*(\\\\S+).*\",\n", " },\n", " {\n", " \"Name\": \"fastrcnn_losses/label_metrics/fg_accuracy\",\n", " \"Regex\": \".*fastrcnn_losses/label_metrics/fg_accuracy:\\\\s*(\\\\S+).*\",\n", " },\n", " {\n", " \"Name\": \"fastrcnn_losses/num_fg_label\",\n", " \"Regex\": \".*fastrcnn_losses/num_fg_label:\\\\s*(\\\\S+).*\",\n", " },\n", " {\"Name\": \"maskrcnn_loss/accuracy\", \"Regex\": \".*maskrcnn_loss/accuracy:\\\\s*(\\\\S+).*\"},\n", " {\n", " \"Name\": \"maskrcnn_loss/fg_pixel_ratio\",\n", " \"Regex\": \".*maskrcnn_loss/fg_pixel_ratio:\\\\s*(\\\\S+).*\",\n", " },\n", " {\"Name\": \"maskrcnn_loss/maskrcnn_loss\", \"Regex\": \".*maskrcnn_loss/maskrcnn_loss:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"maskrcnn_loss/pos_accuracy\", \"Regex\": \".*maskrcnn_loss/pos_accuracy:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/IoU=0.5\", \"Regex\": \".*mAP\\\\(bbox\\\\)/IoU=0\\\\.5:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/IoU=0.5:0.95\", \"Regex\": \".*mAP\\\\(bbox\\\\)/IoU=0\\\\.5:0\\\\.95:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/IoU=0.75\", \"Regex\": \".*mAP\\\\(bbox\\\\)/IoU=0\\\\.75:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/large\", \"Regex\": \".*mAP\\\\(bbox\\\\)/large:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/medium\", \"Regex\": \".*mAP\\\\(bbox\\\\)/medium:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(bbox)/small\", \"Regex\": \".*mAP\\\\(bbox\\\\)/small:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/IoU=0.5\", \"Regex\": \".*mAP\\\\(segm\\\\)/IoU=0\\\\.5:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/IoU=0.5:0.95\", \"Regex\": \".*mAP\\\\(segm\\\\)/IoU=0\\\\.5:0\\\\.95:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/IoU=0.75\", \"Regex\": \".*mAP\\\\(segm\\\\)/IoU=0\\\\.75:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/large\", \"Regex\": \".*mAP\\\\(segm\\\\)/large:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/medium\", \"Regex\": \".*mAP\\\\(segm\\\\)/medium:\\\\s*(\\\\S+).*\"},\n", " {\"Name\": \"mAP(segm)/small\", \"Regex\": \".*mAP\\\\(segm\\\\)/small:\\\\s*(\\\\S+).*\"},\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define SageMaker Training Job\n", "\n", "Next, we use SageMaker [Estimator](https://sagemaker.readthedocs.io/en/stable/estimators.html) API to define a SageMaker Training Job. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select script\n", "\n", "In script-mode, first we have to select an entry point script that acts as interface with SageMaker and launches the training job. For training [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) model, set ```script``` to ```\"tensorpack-mask-rcnn.py\"```. For training [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) model, set ```script``` to ```\"aws-mask-rcnn.py\"```." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "script= # \"tensorpack-mask-rcnn.py\" or \"aws-mask-rcnn.py\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select distribution mode\n", "\n", "We use Message Passing Interface (MPI) to distribute the training job across multiple hosts. The ```custom_mpi_options``` below is only used by [AWS Samples Mask R-CNN](https://github.com/aws-samples/mask-rcnn-tensorflow) model, and can be safely commented out for [TensorPack Faster-RCNN/Mask-RCNN](https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN) model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mpi_distribution = {\"mpi\": {\"enabled\": True, \"custom_mpi_options\": \"-x TENSORPACK_FP16=1 \"}}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We recommned using 32 GPUs, so we set ```instance_count=4``` and ```instance_type='ml.p3.16xlarge'```, because there are 8 Tesla V100 GPUs per ```ml.p3.16xlarge``` instance. We recommend using 100 GB [Amazon EBS](https://aws.amazon.com/ebs/) storage volume with each training instance, so we set ```volume_size = 100```. \n", "\n", "We run the training job in your private VPC, so we need to set the ```subnets``` and ```security_group_ids``` prior to running the cell below. You may specify multiple subnet ids in the ```subnets``` list. The subnets included in the ```sunbets``` list must be part of the output of ```./stack-sm.sh``` CloudFormation stack script used to create this notebook instance. Specify only one security group id in ```security_group_ids``` list. The security group id must be part of the output of ```./stack-sm.sh``` script.\n", "\n", "For ```instance_type``` below, you have the option to use ```ml.p3.16xlarge``` with 16 GB per-GPU memory and 25 Gbs network interconnectivity, or ```ml.p3dn.24xlarge``` with 32 GB per-GPU memory and 100 Gbs network interconnectivity. The ```ml.p3dn.24xlarge``` instance type offers significantly better performance than ```ml.p3.16xlarge``` for Mask R-CNN distributed TensorFlow training.\n", "\n", "We use MPI to distribute the training job across multiple hosts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Give Amazon SageMaker Training Jobs Access to FileSystem Resources in Your Amazon VPC.\n", "security_group_ids = # ['sg-xxxxxxxx'] \n", "subnets = # [ 'subnet-xxxxxxx']\n", "sagemaker_session = sagemaker.session.Session(boto_session=session)\n", "\n", "mask_rcnn_estimator = TensorFlow(image_uri=training_image,\n", " role=role, \n", " py_version='py3',\n", " instance_count=4, \n", " instance_type='ml.p3.16xlarge',\n", " distribution=mpi_distribution,\n", " entry_point=script,\n", " volume_size = 100,\n", " max_run = 400000,\n", " output_path=s3_output_location,\n", " sagemaker_session=sagemaker_session, \n", " hyperparameters = hyperparameters,\n", " metric_definitions = metric_definitions,\n", " subnets=subnets,\n", " security_group_ids=security_group_ids)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we launch the SageMaker training job. See ```Training Jobs``` in SageMaker console to monitor the training job." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "job_name = f\"mask-rcnn-fsx-scriptmode-{int(time.time())}\"\n", "print(f\"Launching Training Job: {job_name}\")\n", "\n", "# set wait=True below if you want to print logs in cell output\n", "mask_rcnn_estimator.fit(inputs=data_channels, job_name=job_name, logs=\"All\", wait=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-fsx.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-fsx.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-fsx.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-fsx.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-fsx.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-fsx.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-fsx.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-fsx.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-fsx.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-fsx.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-fsx.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-fsx.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-fsx.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-fsx.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/advanced_functionality|distributed_tensorflow_mask_rcnn|mask-rcnn-scriptmode-fsx.ipynb)\n" ] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5d.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": true, "memoryGiB": 0, "name": "ml.geospatial.interactive", "supportedImageNames": [ "sagemaker-geospatial-v1-0" ], "vcpuNum": 0 }, { "_defaultOrder": 21, "_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 144, "name": "ml.c5.18xlarge", "vcpuNum": 72 }, { "_defaultOrder": 28, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 29, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 32, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 34, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 35, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 38, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 42, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 46, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 47, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 53, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.g5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 54, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 }, { "_defaultOrder": 55, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 56, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4de.24xlarge", "vcpuNum": 96 } ], "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science 3.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }