{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Faster RCNN Sagemaker Tutorial\n",
"\n",
"This tutorial walks through the entire process of training a faster RCNN model on Sagemaker. There are six components that can be applied to most models trained on Sagemaker. This tutorial does not go into the logic of the faster RCNN model itself. Rather, we focus only on setting up and training the model. For a detailed explanation of faster RCNN, see [Faster R-CNN: Down the rabbit hole of modern object detection](https://tryolabs.com/blog/2018/01/18/faster-r-cnn-down-the-rabbit-hole-of-modern-object-detection/) from Tryo Labs.\n",
"\n",
"- Notebook instance creation\n",
"- Data download and prep\n",
"- Docker container\n",
"- Hyperparameter settings\n",
"- Launching a model\n",
"- Monitoring\n",
"\n",
"### Notebook instance creation\n",
"\n",
"We recommend running this tutorial on a Sagemaker notebook instance. This will ensure you have proper access S3 and the AWS Elastic Container Registry. From the AWS console page, first go to S3 and create a new bucket if you don't already have one for sagemaker. With standard settings Sagemaker will only connect to buckets with `sagemaker` in the name, so we recommend using something like `mybucket-sagemaker`.\n",
"\n",
"
\n",
"\n",
"Next, go to the AWS Sagemaker and select `Notebook instance`. Select `Create notebook instance`, give your instance any name you like, and select any instance type you want. You don't need a powerful notebook instance, since we'll be doing the training on different instances. But sometimes you might want to experiment with some training on the notebook instance itself, so more powerful instances are available. For now, an ml.m5.2xlarge[1](#fn1) should be fine. Select additional configuration and increase your volume size to 100GB. We need the extra size in order to get the model data, and build our docker containers. Under `Permissions and Encryption` select `Create a new role`. A new window will pop up. Leave this with the default settings, and click `Create role`. Everything else can be left as default. Click `Create notebook instance`. \n",
"\n",
"It will take a few minutes to create the notebook instance. While that's happening, we need to add one more permission to our IAM role we just created. On the `Notebook instances` page you will see your new instance. Click the instance name link, scroll down to `Permissions and encryption` and click the link under `IAM role ARN`. On the summary page, select `Attach policies` in the filter policies box search for `container registry`, check the box next to `AmazonEC2ContainerRegistryFullAccess`, and click `Attach policy`. This allows us to store our new container image we want to use with Sagemaker.\n",
"\n",
"If you don't already have a container repository on Elastic Container Registry, got to `Elastic Container Registry` on the AWS dashboard, click `create repository` and pick any name you like.\n",
"\n",
" 1M5 instances are general purpose EC2 instances that provide a balance of memory, storage, and CPU performance for most common use cases. For our purpose, we want an instance with enough CPU performance and memory to handle building a Docker image, and to launch training. Neither of these tasks are too resource intensive, so a small 2x instance will be fine. If you wish to do more complex work on your notebook instance, you might consider an 8 or 16x instance. If you want to train deep learning models directly in your notebook instance, you should select a GPU instance, such as a p2 or p3, but be aware that these are more expensive to run."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Clone Deep Learning Repo\n",
"\n",
"Start your notebook instance by clicking Open Jupyterlab. Scroll down to the bottom of the launch page and click to start a new terminal and run the following command.\n",
"\n",
"```\n",
"cd SageMaker\n",
"git clone https://github.com/aws-samples/deep-learning-models\n",
"cd deep-learning-models/models/vision/detection/tutorial\n",
"```\n",
"\n",
"### Data download and prep\n",
"\n",
"The script below contains everything to download the COCO 2017 dataset, as well as resnet weights pretrained on imagenet data. These are the same weights included in the Keras package. We download them manually for stability, because they are sometimes retrained between versions of Keras. The script will download everything to the SageMaker notebook instance, assemble everything in the file structure the faster R-CNN model expects to see, archives it, and saves it to your S3 bucket.\n",
"Once your instance is ready, click Open JupyterLab. When the launcher appears, select Terminal at the bottom of the page and run the following commands:\n",
"\n",
"```\n",
"S3_BUCKET=[name of your s3 bucket not including s3://]\n",
"\n",
"############################################################\n",
"# Setup directories\n",
"############################################################\n",
"\n",
"BASE_DIR=$HOME/SageMaker\n",
"mkdir -p $BASE_DIR/data/coco\n",
"mkdir -p $BASE_DIR/data/weights\n",
"\n",
"############################################################\n",
"# Download all data files\n",
"############################################################\n",
"\n",
"wget -O $BASE_DIR/data/coco/train2017.zip http://images.cocodataset.org/zips/train2017.zip\n",
"wget -O $BASE_DIR/data/coco/val2017.zip http://images.cocodataset.org/zips/val2017.zip\n",
"wget -O $BASE_DIR/data/coco/annotations_trainval2017.zip \\\n",
" http://images.cocodataset.org/annotations/annotations_trainval2017.zip\n",
"wget -O $BASE_DIR/data/weights/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5 \\\n",
" https://github.com/keras-team/keras-applications/releases/download/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5\n",
"\n",
"############################################################\n",
"# Decompress arrange and archive\n",
"############################################################\n",
"\n",
"unzip -q $BASE_DIR/data/coco/train2017.zip -d $BASE_DIR/data/coco\n",
"unzip -q $BASE_DIR/data/coco/val2017.zip -d $BASE_DIR/data/coco\n",
"unzip $BASE_DIR/data/coco/annotations_trainval2017.zip -d $BASE_DIR/data/coco\n",
"rm $BASE_DIR/data/coco/*.zip\n",
"cd $BASE_DIR/data/\n",
"tar -cf coco.tar coco\n",
"\n",
"############################################################\n",
"# Upload to S3\n",
"############################################################\n",
"\n",
"aws s3 cp $BASE_DIR/data/coco.tar s3://${S3_BUCKET}/faster-rcnn/data/coco/coco.tar\n",
"aws s3 cp $BASE_DIR/data/weights/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5 \\\n",
" s3://${S3_BUCKET}/faster-rcnn/data/weights/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5\n",
"\n",
"```\n",
"\n",
"This will take about 5 minutes.\n",
"\n",
"### Docker container\n",
"\n",
"While the data prep script is running, you can open another terminal by clicking the `+` in the upper left corner of your screen and starting a new terminal. First, a little clarification on how Sagemaker works. When you launch a Sagemaker training job, each instance on that job will download a Docker image from ECR, setup a series of environment variables so it knows your hyperparameters and where to find the data, and launches your training script. Most of the time, you can use Sagemaker's built in containers, but for more complex models you might want to create your own. \n",
"\n",
"Under `deep-learning-models/models/vision/detection/docker` you'll find the Dockerfile that create the image we use for training. A detailed description of that file, and how to create your own, is included in the readme. Below are the basic commands to build this image and upload it to ECR.\n",
"\n",
"```\n",
"cd deep-learning-models/models/vision/detection/docker\n",
"ECR_REPO=[the ECR repo created earlier]\n",
"ALGO=frcnn-tutorial\n",
"docker build -t ${ECR_REPO}/${ALGO} -f Dockerfile.frcnn .\n",
"\n",
"# login to ECR\n",
"REGION=$(aws configure get region)\n",
"ACCOUNT=$(aws sts get-caller-identity --query Account --output text)\n",
"FULLNAME=\"${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/${ECR_REPO}:${ALGO}\"\n",
"$(aws ecr get-login --region ${REGION} --no-include-email)\n",
"\n",
"# push image to ECR\n",
"docker tag ${ECR_REPO}/${ALGO} ${FULLNAME}\n",
"docker push ${FULLNAME}\n",
"echo ${FULLNAME}\n",
"```\n",
"\n",
"### Hyperparameter settings\n",
"\n",
"Now we can start setting up our model. Set the variables in the paragraph below to access your S3 bucket and ECR repo."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, enter your s3 bucket, image name, and a user name.\n",
"\n",
"After this paragraph, you don't need to modify anything to run training with the standard hyperparameters."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import subprocess\n",
"\n",
"account_call = \"aws sts get-caller-identity --query Account --output text\"\n",
"ecr_account = subprocess.check_output(account_call, shell=True).decode().strip()\n",
"ecr_repo = [the ECR repo created earlier]\n",
"algo_name = \"frcnn-tutorial\"\n",
"\n",
"s3_bucket = [mybucket-sagemaker] # name of your s3 bucket without s3://\n",
"docker_image = \"{0}.dkr.ecr.us-east-1.amazonaws.com/{1}:{2}\".format(ecr_account,\n",
" ecr_repo,\n",
" algo_name) # the output of `echo ${FULLNAME}` from the previous section something like 12345.dkr.ecr.us-east-1.amazonaws.com/name:algo\n",
"user_id = [user_name] # this can be anything you like, used for keeping track of your training jobs"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from sagemaker import get_execution_role\n",
"from sagemaker.tensorflow import TensorFlow\n",
"from datetime import datetime\n",
"import os\n",
"import pprint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below are the common hyperparameters to tune. All of these hyperparameters are passed to the `deep-learning-models/models/vision/detection/tools/train_sagemaker.py`. More customization can be done in the `train_sagemaker` script. All lines referenced below refer to this file.\n",
"\n",
"`schedule` determines the learning rate schedule. `1x` performs a 10x drop in the learning rate at the 8th and 10th epochs. The other option currently included is `cosine` which performs a cosine decay with restarts. These schedulers are lines 112-121.\n",
"\n",
"`fp16` refers to whether a model should train using mixed precision. Mixed precision training is supported on p3 and g4 instances.\n",
"\n",
"`base_learning_rate` this is the learning rate after the warmup period.\n",
"\n",
"`warmup_steps` number of steps for warmup.\n",
"\n",
"`warmup_init_lr_scale` the scale of the warmup start level. 3.0 means the learning rate starts as 1/3 of the base_learning_rate value.\n",
"\n",
"`batch_size_per_device` number of images per GPU.\n",
"\n",
"`instance_type` the type of EC2 instance for training. For highest performance use `ml.p3dn.24xlarge`. The `ml.p3.16xlarge` and `ml.p3.8xlarge` are slightly lower performance while being more cost effective. The `g4dn.12xlarge` instance type is even lower cost, but much slower than the p3 instances.\n",
"\n",
"`instance_count` number of instances for training.\n",
"\n",
"`num_workers_per_host` number of training jobs per instance. This should be the number of GPUs on the instance, so should be set to 8 for `ml.p3dn.24xlarge` or `ml.p3.16xlarge` instances, and 4 for `ml.p3.8xlarge` or `g4dn.12xlarge`.\n",
"\n",
"`use_conv` the final bounding box head of the faster RCNN model can use convolutions instead of fully connected layers. We observe slightly better performance with convolution layers, but provide an option to run either.\n",
"\n",
"`use_rcnn_bn` batch normalization of the bounding box head layers. This is only useful when using fully connected layers, and will be ignored if `use_conv` is set to True.\n",
"\n",
"`ls` label smoothing. In some cases, faster rcnn performance is hurt by a large imbalance in the number of positive versus negative regions. Label smoothing can help mitigate this."
]
},
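{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of how `base_learning_rate`, `warmup_steps`, `warmup_init_lr_scale`, and the `1x` schedule fit together, the sketch below computes a learning rate for a given step. It is not the scheduler from `train_sagemaker.py`: the function name `sketch_learning_rate` and the `steps_per_epoch` argument are hypothetical, the warmup is assumed to be linear, and the drops are assumed to apply once 8 and 10 epochs have completed.\n",
"\n",
"```python\n",
"def sketch_learning_rate(step, steps_per_epoch, base_learning_rate=15e-3,\n",
"                         warmup_steps=500, warmup_init_lr_scale=3.0):\n",
"    # Warmup: start at base / scale and ramp up to base (linear ramp is an assumption)\n",
"    if step < warmup_steps:\n",
"        start = base_learning_rate / warmup_init_lr_scale\n",
"        return start + (base_learning_rate - start) * step / warmup_steps\n",
"    # After warmup: 10x drops once 8 and 10 epochs have completed (assumed boundaries)\n",
"    epoch = step // steps_per_epoch\n",
"    if epoch >= 10:\n",
"        return base_learning_rate / 100\n",
"    if epoch >= 8:\n",
"        return base_learning_rate / 10\n",
"    return base_learning_rate\n",
"\n",
"# Print the rate at a few steps, assuming a hypothetical 1000 steps per epoch\n",
"for step in (0, 250, 500, 8000, 10000):\n",
"    print(step, sketch_learning_rate(step, steps_per_epoch=1000))\n",
"```\n",
"\n",
"With `warmup_init_lr_scale` set to 3.0, the first step uses one third of `base_learning_rate`, which matches the description above."
]
},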
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"hyperparameters = {\n",
" 'schedule': '1x',\n",
" 'fp16': True,\n",
" 'base_learning_rate': 15e-3,\n",
" 'warmup_steps': 500,\n",
" 'warmup_init_lr_scale': 3.0,\n",
" 'batch_size_per_device': 4,\n",
" 'instance_type': 'ml.p3.16xlarge',\n",
" 'instance_count': 1,\n",
" 'num_workers_per_host': 8,\n",
" 'use_conv': True,\n",
" 'use_rcnn_bn': False,\n",
" 'ls': 0.0,\n",
"}"
]
},
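{
"cell_type": "markdown",
"metadata": {},
"source": [
"Three of these settings together determine how many images are processed per training step across all GPUs. The sketch below shows the arithmetic, assuming one worker per GPU and plain data parallelism; the helper name `global_batch_size` is hypothetical and is not part of the training code.\n",
"\n",
"```python\n",
"def global_batch_size(hyperparameters):\n",
"    # images per GPU x GPUs per instance x number of instances\n",
"    return (hyperparameters['batch_size_per_device']\n",
"            * hyperparameters['num_workers_per_host']\n",
"            * hyperparameters['instance_count'])\n",
"\n",
"# Same values as the default settings in this tutorial\n",
"example = {'batch_size_per_device': 4, 'num_workers_per_host': 8, 'instance_count': 1}\n",
"print(global_batch_size(example))\n",
"```\n",
"\n",
"If you change the instance type or count, the global batch size changes with it, so you may need to revisit `base_learning_rate`."
]
},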
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model configuration file also contains additional less commonly adjusted hyperparamters, as well as data pipeline settings."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"model_cfg = \"../../configs/sagemaker_default_config.py\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can setup our training job. We first get the Sagemaker execution role, so the training instances now which account they're connected to. We add a timestamp to keep track of multiple training jobs. And finally give Sagemaker the path to our model training files."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Couldn't call 'get_role' to get Role ARN from role name jbsnyder to get Role path.\n"
]
},
{
"ename": "ValueError",
"evalue": "The current AWS identity is not a role: arn:aws:iam::578276202366:user/jbsnyder, therefore it cannot be used as a SageMaker execution role",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mrole\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mget_execution_role\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mnow\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdatetime\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnow\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0mtime_str\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnow\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstrftime\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"%d-%m-%Y-%H-%M\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mdate_str\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnow\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstrftime\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"%d-%m-%Y\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0msource_dir\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m\"../..\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/usr/local/lib/python3.7/site-packages/sagemaker/session.py\u001b[0m in \u001b[0;36mget_execution_role\u001b[0;34m(sagemaker_session)\u001b[0m\n\u001b[1;32m 3347\u001b[0m \u001b[0;34m\"SageMaker execution role\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3348\u001b[0m )\n\u001b[0;32m-> 3349\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmessage\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marn\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3350\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3351\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mValueError\u001b[0m: The current AWS identity is not a role: arn:aws:iam::578276202366:user/jbsnyder, therefore it cannot be used as a SageMaker execution role"
]
}
],
"source": [
"role = get_execution_role()\n",
"now = datetime.now()\n",
"time_str = now.strftime(\"%d-%m-%Y-%H-%M\")\n",
"date_str = now.strftime(\"%d-%m-%Y\")\n",
"source_dir = \"../..\"\n",
"main_script = \"tools/train_sagemaker.py\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fill in details about our training instances. Get the instance types, GPUs, and number of instances. We also setup any options we want for MPI. "
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"hvd_processes_per_host = hyperparameters['num_workers_per_host']\n",
"hvd_instance_type = hyperparameters['instance_type']\n",
"hvd_instance_count = hyperparameters['instance_count']\n",
"\n",
"distributions = {\n",
" \"mpi\": {\n",
" \"enabled\": True,\n",
" \"processes_per_host\": hvd_processes_per_host,\n",
" \"custom_mpi_options\": \"-x OMPI_MCA_btl_vader_single_copy_mechanism=none -x TF_CUDNN_USE_AUTOTUNE=0\"\n",
" }\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Setup the output paths on S3. This is where the final model and training logs will be stored.\n",
"\n",
"Name the training job using user name and time stamp.\n",
"\n",
"Create the training configuration."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"s3_path = os.path.join('s3://{}/faster-rcnn/outputs/{}'.format(s3_bucket, date_str))\n",
"\n",
"job_name = '{}-frcnn-{}'.format(user_id, time_str)\n",
"\n",
"output_path = os.path.join(s3_path, \"output\", job_name)\n",
"\n",
"configuration = {\n",
" 'configuration': 'configs/sagemaker_default_model_config.py', \n",
" 's3_path': s3_path,\n",
" 'instance_name': job_name\n",
"}\n",
"configuration.update(hyperparameters)"
]
},
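{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `user_id` becomes part of the training job name, it is worth checking the name before launching. To the best of our knowledge, the SageMaker `CreateTrainingJob` API accepts names of at most 63 characters made of letters, digits, and hyphens, with no leading or trailing hyphen. The helper `is_valid_job_name` below is a hypothetical sketch of that check, not part of the SageMaker SDK.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Pattern and length limit as we understand them from the CreateTrainingJob API reference\n",
"JOB_NAME_PATTERN = re.compile(r'[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}')\n",
"\n",
"def is_valid_job_name(name):\n",
"    # The whole name must match the pattern and fit in 63 characters\n",
"    return len(name) <= 63 and JOB_NAME_PATTERN.fullmatch(name) is not None\n",
"\n",
"print(is_valid_job_name('user-frcnn-27-12-2020-04-29'))\n",
"print(is_valid_job_name('user_name-frcnn-27-12-2020-04-29'))\n",
"```\n",
"\n",
"An underscore in `user_id` is a common reason for a job name to be rejected, so a hyphenated user name is the safer choice."
]
},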
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, we need to set our training channels. Channels are environmental variables in the Sagemaker training instances that tell Sagemaker where to find data. Each channel has an S3 location that will copy to the instance. For example, if we pass a dictionary with\n",
"\n",
"```\n",
"{\"coco\": \"s3://my-bucket/data/coco_data/\",\n",
" \"weights\": \"s3://my-bucket/data/resnet_weights/\"}\n",
"```\n",
"\n",
"Sagemaker will create two directories on each training instance\n",
"\n",
"```\n",
"/opt/ml/input/data/coco/\n",
"/opt/ml/input/data/weights/\n",
"```\n",
"\n",
"and copy all contents of the respective S3 locations to these directories. Addtionally, each training instance will get two environmental variables\n",
"\n",
"```\n",
"SM_CHANNEL_COCO\n",
"SM_CHANNEL_WEIGHTS\n",
"```\n",
"\n",
"that are set to these directories.\n",
"\n",
"So we want to create a dictionary that points to where we uploaded the data at the beginning of this tutorial."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"channels = {\n",
" 'coco': 's3://{}/faster-rcnn/data/coco/'.format(s3_bucket),\n",
" 'weights': 's3://{}/faster-rcnn/data/weights/'.format(s3_bucket)\n",
"}"
]
},
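{
"cell_type": "markdown",
"metadata": {},
"source": [
"Inside a training script, the channel directories can be looked up from the `SM_CHANNEL_*` environment variables described above. The sketch below shows one way to do this; the helper name `channel_dir` is hypothetical, and the fallback to `/opt/ml/input/data/<channel>` is a convenience we added for running outside SageMaker, not something the training script in this repo is known to do.\n",
"\n",
"```python\n",
"import os\n",
"\n",
"def channel_dir(channel, environ=os.environ):\n",
"    # SageMaker sets SM_CHANNEL_<NAME> to the local directory for each channel;\n",
"    # fall back to the conventional path when the variable is not set\n",
"    default = '/opt/ml/input/data/{}'.format(channel)\n",
"    return environ.get('SM_CHANNEL_{}'.format(channel.upper()), default)\n",
"\n",
"coco_dir = channel_dir('coco')\n",
"weights_dir = channel_dir('weights')\n",
"print(coco_dir, weights_dir)\n",
"```\n",
"\n",
"Passing `environ` as an argument keeps the helper easy to test without modifying the real environment."
]
},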
{
"cell_type": "markdown",
"metadata": {},
"source": [
"New we can create our Sagemaker estimator, and launch training."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"estimator = TensorFlow(\n",
" entry_point=main_script, \n",
" source_dir=source_dir, \n",
" image_name=docker_image, \n",
" role=role,\n",
" framework_version=\"2.1.0\",\n",
" py_version=\"py3\",\n",
" train_instance_count=hvd_instance_count,\n",
" train_instance_type=hvd_instance_type,\n",
" distributions=distributions,\n",
" output_path=output_path, train_volume_size=200,\n",
" hyperparameters=configuration\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And finally, launch training."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"estimator.fit(channels, wait=False, job_name=job_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we set `wait=False` it will seem like nothing is happening. Since training take about 6 hours on a p3.16xlarge, we don't want our notebook to have to wait on it.\n",
"\n",
"If you go back to the Sagemaker homepage and click `Training Jobs`, you'll now see a new training job with a name like `user-frcnn-12-27-2020-4-29`. You can click the link to see what your training job is doing. The training will take about 5-10 minutes to set up, depending on your instance type (instances with NVME storage, p3dn and g4dn, will start faster). It first downloads your data from S3, then your Docker image from ECR. At this point, you have a few options for monitoring training. If you scroll down on this page, there is a section called `Monitor`. Here you can see your CPU and GPU usage, or view the logs that each instance is outputing. \n",
"\n",
"A more fun way to monitor is using Tensorboard. Once training starts, you'll find a new set of directories in your S3 bucket in the form\n",
"\n",
"```\n",
"s3://[your bucket]/faster-rcnn/\n",
"```\n",
"\n",
"In a terminal on your notebook instance, run\n",
"\n",
"```\n",
"S3_BUCKET=[your bucket]\n",
"source activate tensorflow_p36\n",
"tensorboard --logdir s3://${S3_BUCKET}/faster-rcnn/\n",
"```\n",
"\n",
"In your browser, go to \n",
"\n",
"```\n",
"https://[you notebook instance name].notebook.us-east-1.sagemaker.aws/proxy/6006/\n",
"```\n",
"\n",
"You can also use the above commands to launch multiple training jobs, and monitor all of them at once in Tensorboard. Our model runner provides a useful set of logging hooks that display model progress, both in terms of loss values and images, in Tensorboard. This gives users one place to watch how different models perform without waiting for training to finish.\n",
"\n",
"
\n",
"\n",
"
\n"
]
},
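{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to check on a job from code rather than the console, the status can also be read with `boto3`. The sketch below wraps the `describe_training_job` call; the helper name `job_status` is hypothetical, and it assumes the notebook role is allowed to describe training jobs.\n",
"\n",
"```python\n",
"def job_status(job_name, sagemaker_client):\n",
"    # describe_training_job returns a dict that includes these two status fields\n",
"    response = sagemaker_client.describe_training_job(TrainingJobName=job_name)\n",
"    return response['TrainingJobStatus'], response['SecondaryStatus']\n",
"\n",
"# In the notebook, with credentials configured:\n",
"# import boto3\n",
"# print(job_status(job_name, boto3.client('sagemaker')))\n",
"```\n",
"\n",
"Taking the client as an argument means the same helper works with any region or session you create."
]
},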
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}