{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# AWSDet Sagemaker Tutorial\n", "\n", "This tutorial walks through the entire process of training an object detection model on Sagemaker. The tutorial does not go into the details of the model itself. Rather, we focus only on setting up and training the model. For a detailed explanation of individual models, please look at resources mentioned in the README.md for that model.\n", "\n", "We assume that you are running this notebook on a SageMaker Notebook instance. For instructions on how to setup a SageMaker Notebook instance for this tutorial, please see [README.md](./README.md)\n", "\n", "The following are the major components that are required,\n", "- Data download and preparation\n", "- Docker container\n", "- Hyperparameter settings\n", "- Launching a model\n", "- Monitoring" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data download and preparation\n", "The `prep_data_s3.sh` shell script contains everything to download the COCO 2017 dataset, as well as resnet weights pretrained on imagenet data. These are the same weights included in the Keras package. We download them manually for stability, because they are sometimes retrained between versions of Keras. The script will download everything to the Sagemaker notebook instance, assemble everything in the file structure the training script expects to see, archives it, and saves it to your S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "cd ~/SageMaker\n", "echo \"Current dir: $PWD\"\n", "# IMPORTANT if deep-learning-models directory already exists then backup (optional) and remove it so that git clone can succeed\n", "git clone https://github.com/aws-samples/deep-learning-models\n", "cd deep-learning-models/models/vision/detection/\n", "# IMPORTANT replace argument to the script with name of your s3 bucket not including s3://\n", "scripts/prep_data_s3.sh mzanur-sagemaker" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Docker container\n", "\n", "While the data preparation script mentioned above is running, you can prepare a Docker image for SageMaker. When you launch a Sagemaker training job, each instance in that job will download a Docker image from Elastic Container Registry (ECR), setup a series of environment variables so it knows your hyperparameters and where to find the data, and launches your training script. Most of the time, you can use Sagemaker's built in containers, but for customized models you could have to create your own.\n", "\n", "Under `deep-learning-models/models/vision/detection/docker` you'll find the docker file `Dockerfile.sagemaker` that creates the image that we use for training. A detailed description of that file, and how to create your own, is included in [README.md](../docker/README.md). Below are the basic commands to build this image on your SageMaker notebook instance and upload it to ECR." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "cd ~/SageMaker/deep-learning-models/models/vision/detection/\n", "# IMPORTANT replace argument to the script with name of the ECR repository that you created\n", "scripts/build_docker_sagemaker.sh mzanur-awsdet-ecr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At the end of running the script above it will emit your docker image ID. Please put in that image ID in the configuration for your desired model under `deep-learning-models/models/vision/detection/configs//SM/` as described in the following section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SageMaker and Model configuration\n", "\n", "Now we can start setting up the model training process on SageMaker.\n", "\n", "Most of the hyperparameters related to the model structure do not require tweaking. Some training related hyperparameters, e.g. learning rates, momentum, weight decay may need changing depending on how many nodes you are using to train, or your dataset, or the batch size per GPU that you decide to choose.\n", "\n", "#### Brief explanation of parameters in configuration file\n", "\n", "`user_id` can be anything you like, used for keeping track of your training jobs\n", "\n", "`s3_bucket` name of your s3 bucket without s3://, e.g. jbsnyder-sagemaker\n", "\n", "`docker_image` - ID of the docker image id that was built, e.g. 12345.dkr.ecr.us-east-1.amazonaws.com/name:awsdet\n", "\n", "`instance_count` number of instances (nodes) for training.\n", "\n", "`instance_type` the type of EC2 instances that will be used for training. For highest performance use `ml.p3dn.24xlarge`. The `ml.p3.16xlarge` and `ml.p3.8xlarge` are slightly lower performance while being more cost effective. The `g4dn.12xlarge` instance type is even lower cost, but much slower than the p3 instances.\n", "\n", "`num_workers_per_host` number of training jobs per instance. This should be the number of GPUs on an instance, so should be set to 8 for `ml.p3dn.24xlarge` or `ml.p3.16xlarge` instances, and 4 for `ml.p3.8xlarge` or `g4dn.12xlarge`.\n", "\n", "`amp_enabled` refers to whether a model should train using mixed precision. Mixed precision training is supported on p3 and g4 instances.\n", "\n", "`warmup_steps` number of steps for which the learning rate will be increased from `learning_rate * warmup_ratio` to `learning_rate`.\n", "\n", "`optimizer` this is the configuration for an `tf.keras.optimizers` optimizer. The `learning_rate` specified here is the learning rate adjusted for 8 GPUs after warmup steps. If you have more than 8 GPUs then this rate will be scaled linearly automatically.\n", "\n", "`imgs_per_gpu` number of images per GPU.\n", "\n", "Other model specific details can be found and tweaked in the relevant code under `deep-learning-models/models/vision/detection/awsdet` directory. This is suggested for advanced users." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SageMaker Job Launch" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%cd ~/SageMaker/deep-learning-models/models/vision/detection/\n", "%run tools/launch_sagemaker_job.py --job_config configs/cascade_rcnn/SM/CI/4/sagemaker_4x8.py --model_config configs/cascade_rcnn/SM/CI/4/cascade_rcnn_r50v1_d_fpn_1x_coco.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Go to the Sagemaker homepage and click `Training Jobs`, you should see a new training job with a name that was setup in the configuration file. Click the link to see what your training job is doing.\n", "\n", "The training takes about 5-10 minutes to set up, depending on the instance type that was chosen.\n", "\n", "The sagemaker job launch script downloads data required to train from S3, then the docker image (that was built) is downloaded from ECR. \n", "\n", "You can monitor CPU/GPU usage and see CloudWatch logs for each instance under the section called `Monitor`.\n", "\n", "For a more detailed view into training losses, accuracy, and to visualize predictions, e.g. bounding boxes, Tensorboard can be used. Once training starts, a new set of directories in your S3 bucket are created \n", "\n", "```\n", "[s3_path]/tensorboard/\n", "```\n", "\n", "In a terminal on your notebook instance, run\n", "\n", "```\n", "conda activate tensorflow_p36\n", "tensorboard --logdir [s3_path]/tensorboard # e.g. tensorboard --logdir s3://mzanur-sagemaker/retinanet/outputs/25-06-2020-18-40/tensorboard/\n", "```\n", "\n", "You can view the Tensorboard output at a URL (please cutomize to your notebook instance,) such as https://mzanur-sm.notebook.us-east-1.sagemaker.aws/proxy/6006/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_tensorflow2_p36", "language": "python", "name": "conda_tensorflow2_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" } }, "nbformat": 4, "nbformat_minor": 4 }