{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SageMakerCV TensorFlow Tutorial\n", "\n", "SageMakerCV is a collection of computer vision tools developed to take full advantage of Amazon SageMaker by providing state-of-the-art model accuracy, training speed, and training cost reductions. SageMakerCV is based on the lessons we learned from developing the record-breaking computer vision models we announced at Re:Invent in 2019 and 2020, along with talking to our customers and understanding the challenges they faced in training their own computer vision models.\n", "\n", "The tutorial in this notebook walks through using SageMakerCV to train Mask RCNN on the COCO dataset. The only prerequisite is to set up SageMaker Studio, the instructions for which can be found in [Onboard to Amazon SageMaker Studio Using Quick Start](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-quick-start.html). Everything else, from getting the COCO data to launching a distributed training cluster, is included here.\n", "\n", "## Setup and Roadmap\n", "\n", "Before diving into the tutorial itself, let's take a minute to discuss the various tools we'll be using.\n", "\n", "#### SageMaker Studio\n", "[SageMaker Studio](https://aws.amazon.com/sagemaker/studio/) is a machine learning focused IDE where you can interactively develop models and launch SageMaker training jobs all in one place. SageMaker Studio provides a Jupyter Lab like environment, but with a number of enhancements. We'll just scratch the surface here. See the [SageMaker Studio Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) for more details.\n", "\n", "For our purposes, the biggest difference from regular Jupyter Lab is that SageMaker Studio allows you to change your compute resources as needed, by connecting notebooks to Docker containers on different ML instances. 
This is a little confusing to just describe, so let's walk through an example.\n", "\n", "Once you've completed the setup on [Onboard to Amazon SageMaker Studio Using Quick Start](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-quick-start.html), go to the [SageMaker Console](https://us-west-2.console.aws.amazon.com/sagemaker) and click `Open SageMaker Studio` near the top right of the page.\n", "\n", "\n", "\n", "If you haven't yet created a user, do so via the link at the top left of the page. Give it any name you like. For execution role, you can either use an existing SageMaker role, or create a new one. If you're unsure, create a new role. On the `Create IAM Role` window, make sure to select `Any S3 Bucket`. \n", "\n", "\n", "\n", "Back on the SageMaker Studio page, select `Open Studio` next to the user you just created.\n", "\n", "\n", "\n", "This will take a couple of minutes to start up the first time. Once it starts, you'll have a Jupyter Lab like interface running on a small instance with an attached EBS volume. Let's start by taking a look at the `Launcher` tab.\n", "\n", "\n", "\n", "If you don't see the `Launcher`, you can bring one up by clicking the `+` on the menu bar in the upper left corner.\n", "\n", "\n", "\n", "The `Launcher` gives you access to all kinds of tools. This is where you can create new notebooks, text files, or get a terminal for your instance. Try the `System Terminal`. This gives you a new terminal tab for your Studio instance. It's useful for things like downloading data or cloning GitHub repos into Studio. For example, you can run `aws s3 ls` to browse your current S3 buckets. Go ahead and clone this repo onto Studio with \n", "\n", "`git clone https://github.com/aws-samples/amazon-sagemaker-cv`\n", "\n", "Let's look at the launcher one more time. Bring another one up with the `+`. Notice you have an option for `Select a SageMaker image` above the button to launch a notebook. 
This allows you to select a Docker image that will launch on a new instance. The notebook you create will be attached to that new instance, along with the EBS volume on your Studio instance. Let's try it out. On the `Launcher` page, click the drop down menu next to `Select a SageMaker Image` and select `TensorFlow 2.3 Python 3.7 (Optimized for GPU)`, then click the `Notebook` button below the dropdown.\n", "\n", "\n", "\n", "Take a look at the upper righthand corner of the notebook. \n", "\n", "\n", "\n", "The `Python 3 (TensorFlow 2.3 Python 3.7 GPU Optimized)` refers to the kernel associated with this notebook. The `Unknown` refers to the current instance type. Click `Unknown` and select `ml.g4dn.xlarge`.\n", "\n", "\n", "\n", "This will launch a `ml.g4dn.xlarge` instance and attach this notebook to it. This will take a couple of minutes, because Studio needs to download the TensorFlow Docker image to the new instance. Once an instance has started, launching new notebooks with the same instance type and kernel is immediate. You'll also see the `Unknown` replaced with an instance description `4 vCPU + 16 GiB + 1 GPU`. You can also change instances as needed. Say you want to run your notebook on a `ml.p3dn.24xlarge` to get 8 GPUs. To change instances, just click the instance description. To get more instances in the menu, deselect `Fast launch only`.\n", "\n", "Once your notebook is up and running, you can also get a terminal into your new instance.\n", "\n", "\n", "\n", "This can be useful for customizing your image with setup scripts, pip installing new packages, or using MPI to launch multi GPU training jobs. Click to get a terminal and run `ls`. Note that you have the same directories as your main Studio instance. Studio will attach the same EBS volume to all the instances you start, so all your files and data are shared across any notebooks you start. 
This means that you can prototype a model on a single GPU instance, then switch to a multi GPU instance while still having access to all of your data and scripts.\n", "\n", "Finally, when you want to shut down instances, click the circle with a square in it on the left hand side.\n", "\n", "\n", "\n", "This shows your current running instances, and the Docker containers attached to those instances. To shut them down, just click the power button to their right.\n", "\n", "Now that we've explored Studio a bit, let's get started with SageMakerCV. If you followed the instructions above to clone the repo, you should have `amazon-sagemaker-cv` in the file browser on the left. Navigate to `amazon-sagemaker-cv/tensorflow/tutorial.ipynb` to open this notebook on your instance. If you still have a `g4dn` running, it should automatically attach to it.\n", "\n", "The rest of this notebook is broken into 4 sections.\n", "\n", "- Installing SageMakerCV and Downloading the COCO Data\n", "\n", "Since we're using the base AWS Deep Learning Container image, we need to add the SageMakerCV tools. Then we'll download the COCO dataset and upload it to S3.\n", "\n", "- Prototyping in Studio\n", "\n", "We'll walk through how to train a model on Studio, how SageMakerCV is structured, and how you can add your own models and features.\n", "\n", "- Launching a SageMaker Training Job\n", "\n", "There are lots of bells and whistles available to train your models fast, and on large datasets. We'll put a lot of those together to launch a high performance training job. Specifically, we'll create a training job with 4 P4d.24xlarge instances connected with 400 Gbps EFA, streaming our training data from S3, so we don't have to load the dataset onto the instances before training. You could even use this same configuration to train on a dataset that wouldn't fit on the instances. 
If you'd rather launch a smaller (or larger) training cluster, we'll discuss how to modify the configuration.\n", "\n", "- Testing Our Model\n", "\n", "Finally, we'll take the output trained Mask RCNN model and visualize its performance in Studio.\n", "\n", "#### Installing SageMakerCV\n", "\n", "To install SageMakerCV on the TensorFlow Studio Docker image, just run `pip install -e .` in the `amazon-sagemaker-cv/tensorflow` directory. You can do this with either an image terminal, or by running the cell below. Note that we use the `-e` option. This will keep the SageMakerCV modules editable, so any changes you make will be launched on your training job." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -e ." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "### Setup on S3 and Download COCO data\n", "\n", "Next we need to set up an S3 bucket for all our data and results. Enter a name for your S3 bucket below. You can either create a new bucket, or use an existing bucket. If you use an existing bucket, make sure it's in the same region where you plan to run training. For new buckets, we'll specify that it needs to be in the current SageMaker region. By default we'll put everything in an S3 location on your bucket named `smcv-tensorflow-tutorial`, and locally in `/root/smcv-tensorflow-tutorial`, but you can change these locations. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "S3_BUCKET = 'sagemaker-smcv-tutorial' # Don't include s3:// in your bucket name\n", "S3_DIR = 'smcv-tensorflow-tutorial'\n", "LOCAL_DATA_DIR = '/root/smcv-tensorflow-tutorial' # For reasons detailed in Distributed Training, do not put this dir in the SageMakerCV dir" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import zipfile\n", "from pathlib import Path\n", "from s3fs import S3FileSystem\n", "from concurrent.futures import ThreadPoolExecutor\n", "import boto3\n", "from botocore.exceptions import ClientError\n", "from tqdm import tqdm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.resource('s3')\n", "boto_session = boto3.session.Session()\n", "region = boto_session.region_name\n", "\n", "# Check if bucket exists. If it doesn't, create it.\n", "\n", "try:\n", " bucket = s3.meta.client.head_bucket(Bucket=S3_BUCKET)\n", " print(f\"S3 Bucket {S3_BUCKET} Exists\")\n", "except ClientError:\n", " print(f\"Creating Bucket {S3_BUCKET}\")\n", " # us-east-1 is the default region and rejects an explicit LocationConstraint\n", " bucket_config = {} if region == 'us-east-1' else {'CreateBucketConfiguration': {'LocationConstraint': region}}\n", " bucket = s3.create_bucket(Bucket=S3_BUCKET, **bucket_config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "Next we'll download the COCO data to Studio, unzip the files, create TFRecords, and upload to S3. The reason we want the data in two places is that it's convenient to have the data locally on Studio for prototyping. We also want to unarchive the data before moving it to S3 so that we can stream it to our training instances instead of downloading it all at once.\n", "\n", "Once this is finished, you'll have copies of the COCO data on your Studio instance, and in S3. Be careful not to open the `data/coco/train2017` dir in the Studio file browser. It contains 118287 images, and can cause your web browser to crash. 
If you need to browse these files, use the terminal.\n", "\n", "This only needs to be done once, and only if you don't already have the data. The COCO 2017 dataset is about 20GB, so this step takes around 30 minutes to complete. The next cell sets up all the file directories we'll use for downloading, and later in training. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "COCO_URL=\"http://images.cocodataset.org\"\n", "ANNOTATIONS_ZIP=\"annotations_trainval2017.zip\"\n", "TRAIN_ZIP=\"train2017.zip\"\n", "VAL_ZIP=\"val2017.zip\"\n", "COCO_DIR=os.path.join(LOCAL_DATA_DIR, 'data', 'coco')\n", "TF_RECORD_DIR=os.path.join(LOCAL_DATA_DIR, 'data', 'coco', 'tfrecord')\n", "os.makedirs(COCO_DIR, exist_ok=True)\n", "os.makedirs(TF_RECORD_DIR, exist_ok=True)\n", "S3_DATA_LOCATION=os.path.join(\"s3://\", S3_BUCKET, S3_DIR, \"data\", \"coco\")\n", "S3_WEIGHTS_LOCATION=os.path.join(\"s3://\", S3_BUCKET, S3_DIR, \"data\", \"weights\", \"resnet\")\n", "WEIGHTS_DIR=os.path.join(LOCAL_DATA_DIR, 'data', 'weights')\n", "os.makedirs(WEIGHTS_DIR, exist_ok=True)\n", "R50_WEIGHTS_SRC=\"https://sagemakercv.s3.us-west-2.amazonaws.com/weights/tensorflow\"\n", "R50_WEIGHTS_TAR=\"tensorflow_resnet50.tar\"\n", "R50_WEIGHTS=\"tensorflow_resnet50\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "This cell will download everything. It takes around 30 minutes to complete." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Downloading annotations\")\n", "!wget -O $COCO_DIR/$ANNOTATIONS_ZIP $COCO_URL/annotations/$ANNOTATIONS_ZIP\n", "!unzip $COCO_DIR/$ANNOTATIONS_ZIP -d $COCO_DIR\n", "!aws s3 cp --recursive $COCO_DIR/annotations $S3_DATA_LOCATION/annotations\n", "\n", "print(\"Downloading COCO training data\")\n", "!wget -O $COCO_DIR/$TRAIN_ZIP $COCO_URL/zips/$TRAIN_ZIP\n", "\n", "# train data has ~118000 images. 
Unzip is too slow, about 1.5 hours because of disk read and write speed on the EBS volume. \n", "# This technique is much faster because it grabs all the zip metadata at once, then uses threading to unzip multiple files at once.\n", "print(\"Unzipping COCO training data\")\n", "train_zip = zipfile.ZipFile(os.path.join(COCO_DIR, TRAIN_ZIP))\n", "jpeg_files = [image.filename for image in train_zip.filelist if image.filename.endswith('.jpg')]\n", "os.makedirs(os.path.join(COCO_DIR, 'train2017'), exist_ok=True) # exist_ok so the cell can be rerun\n", "with ThreadPoolExecutor() as executor:\n", " threads = list(tqdm(executor.map(lambda x: train_zip.extract(x, COCO_DIR), jpeg_files), total=len(jpeg_files)))\n", "\n", "print(\"Downloading COCO validation data\")\n", "!wget -O $COCO_DIR/$VAL_ZIP $COCO_URL/zips/$VAL_ZIP\n", "# TODO: also switch this to threaded extraction\n", "!unzip -q $COCO_DIR/$VAL_ZIP -d $COCO_DIR\n", "val_images = [i for i in Path(os.path.join(COCO_DIR, 'val2017')).glob(\"*.jpg\")]\n", " \n", "!apt-get -y update && apt install -y protobuf-compiler\n", "!cd sagemakercv/data/coco && ./process_coco_tfrecord.sh $COCO_DIR $TF_RECORD_DIR\n", "\n", "\n", "tfrecord_train = list(Path(TF_RECORD_DIR).glob('train-*.tfrecord'))\n", "tfrecord_val = list(Path(TF_RECORD_DIR).glob('val-*.tfrecord'))\n", "s3fs = S3FileSystem()\n", "\n", "print(\"Uploading training tfrecords to S3\")\n", "with ThreadPoolExecutor() as executor:\n", " threads = list(tqdm(executor.map(lambda record: s3fs.put(record.as_posix(), \n", " os.path.join(S3_DATA_LOCATION, 'tfrecord', 'train2017', record.name)), \n", " tfrecord_train), total=len(tfrecord_train)))\n", "print(\"Uploading validation tfrecords to S3\")\n", "with ThreadPoolExecutor() as executor:\n", " threads = list(tqdm(executor.map(lambda record: s3fs.put(record.as_posix(), \n", " os.path.join(S3_DATA_LOCATION, 'tfrecord', 'val2017', record.name)), \n", " tfrecord_val), total=len(tfrecord_val)))\n", "\n", "print(\"Downloading Resnet Weights\")\n", "!wget -O $WEIGHTS_DIR/$R50_WEIGHTS_TAR 
$R50_WEIGHTS_SRC/$R50_WEIGHTS_TAR\n", "!tar -xf $WEIGHTS_DIR/$R50_WEIGHTS_TAR -C $WEIGHTS_DIR\n", "s3fs.put(os.path.join(WEIGHTS_DIR, R50_WEIGHTS), S3_WEIGHTS_LOCATION, recursive=True)\n", "\n", "print(\"Finished!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "### Training on Studio\n", "\n", "Now that we have the data, we can get to training a Mask RCNN model to detect objects in the COCO dataset images. \n", "\n", "Since training on a single GPU can take days, we'll just train for a couple thousand steps, and run a single evaluation to make sure our model is at least starting to learn something. We'll train a full model on a larger cluster of GPUs in a SageMaker training job.\n", "\n", "The reason we first want to train in Studio is that we want to dig a bit into the SageMakerCV framework, and talk about the model architecture, since we expect many users will want to modify models for their own use cases.\n", "\n", "#### Mask RCNN\n", "\n", "First, just a very brief overview of Mask RCNN. If you would like a more in depth examination, we recommend taking a look at the [original paper](https://arxiv.org/abs/1703.06870), the [feature pyramid paper](https://arxiv.org/abs/1612.03144) which describes a popular architectural change we'll use in our model, and blog posts from [viso.ai](https://viso.ai/deep-learning/mask-r-cnn/), [Tryolabs](https://tryolabs.com/blog/2018/01/18/faster-r-cnn-down-the-rabbit-hole-of-modern-object-detection/), [Jonathan Hui](https://jonathan-hui.medium.com/image-segmentation-with-mask-r-cnn-ebe6d793272), and [Lilian Weng](https://lilianweng.github.io/lil-log/2017/12/31/object-recognition-for-dummies-part-3.html).\n", "\n", "Mask RCNN is a two stage object detection model that locates objects in images by placing bounding boxes around, and segmentation masks over, any object the model is trained to find. 
It also provides classifications for each object.\n", "\n", "\n", "\n", "Mask RCNN is called a two stage model because it performs detection in two steps. The first stage identifies any objects in the image, versus background. The second stage determines the specific class of each object, and applies the segmentation mask. Below is an architectural diagram of the model. Let's walk through each step.\n", "\n", "\n", "Credit: Jonathan Hui\n", "\n", "The `Convolution Network` is often referred to as the model backbone. This is a pretrained image classification model, commonly ResNet, which has been trained on a large image classification dataset, like ImageNet. The classification layer is removed, and instead the backbone outputs a set of convolution feature maps. The idea is, the classification model learned to identify objects in the process of classifying images, and now we can use that information to build a more complex model that can find those objects in the image. We want to pretrain because training the backbone at the same time as training the object detector tends to be very unstable.\n", "\n", "One additional component that is sometimes added to the backbone is a `Feature Pyramid Network`. This takes the outputs of the backbone, and combines them together into a new set of feature maps by performing both up and down convolutions. The idea is that the different sized feature maps will help the model detect objects of different sizes. The feature pyramid also helps with this, by allowing the different feature maps to share information with each other.\n", "\n", "The outputs of the feature pyramid are then passed to the `Region Proposal Network` which is responsible for finding regions of the image that might contain an object (this is the first of the two stages). The RPN will output several hundred thousand regions, each with a probability of containing an object. We'll typically take the top few thousand most likely regions. 
Because these several thousand regions will usually have a lot of overlap, we perform [non-max suppression](https://towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c), which removes regions with large areas of overlap. This gives us a set of `regions of interest`, regions of the image that we think might contain an object.\n", "\n", "Next, we use those regions to crop out the corresponding sections of the feature maps that came from the feature pyramid network using a technique called [ROI align](https://firiuza.medium.com/roi-pooling-vs-roi-align-65293ab741db).\n", "\n", "We pass our cropped feature maps to the `box head` which classifies each region into either a specific object category, or as background. It also refines the position of the bounding box. In Mask RCNN, we also pass the feature maps to a `mask head` which produces a segmentation mask over the object.\n", "\n", "#### SageMakerCV Internals\n", "\n", "An important feature of Mask RCNN is its multiple heads. One head constructs a bounding box, while another creates a mask. These are referred to as the `ROI heads`. It's common for users to extend this and other two stage models by adding their own ROI heads. For example, a keypoint head is common. Doing so means modifying SageMakerCV's internals, so let's talk about those for a second. \n", "\n", "The high level Mask RCNN model can be found in `amazon-sagemaker-cv/tensorflow/sagemakercv/detection/detectors/two_stage_detector.py`. If you trace through the `call` function, you'll see that the model first passes an image through the backbone, neck, then the RPN. The RPN layer also contains the non-max suppression step. The regions of interest are then passed to the ROI heads, where the regions of interest are used to crop sections of the feature maps, which are then classified into object categories.\n", "\n", "Probably the most important things to be aware of are the `build` imports at the top. 
Each section of the model has an associated build function (`build_backbone`, `build_neck`, `build_dense_head`, `build_roi_head`), all of which are used in `build_two_stage_detector` at the bottom of the file. These functions simplify building the model by letting us pass in a single configuration file for building all the different pieces. \n", "\n", "For example, if you open `amazon-sagemaker-cv/tensorflow/sagemakercv/detection/roi_heads/standard_roi_head.py`, you'll find the `build_standard_roi_head` function at the bottom. To add a new head, you would write a TensorFlow module with its own build function. The decorator at the top of the build function allows it to be called from the config file. The decorator `@HEADS.register(\"StandardRoIHead\")` adds a dictionary entry so that when `StandardRoIHead` is in the config file, `build_standard_roi_head` gets called by `build_roi_head`. If, for example, you specify `CascadeRoIHead`, the associated builder for the cascade ROI head is used instead.\n", "\n", "Finally, a note about data loading. SageMakerCV uses an optimized TFRecord data format. The COCO dataloader can be found in `amazon-sagemaker-cv/tensorflow/sagemakercv/data/coco/dataloader.py`. It takes a file pattern in the form `data/coco/train2017/train*` which will include all files that start with `train` in the dataset. You can use either a local directory or an S3 location `s3://my-bucket/my-data/coco/train2017/train*`. The dataloader will automatically switch between the two. The S3 functionality is especially useful for distributed training with large datasets, since it means you can train without waiting for your data to download.\n", "\n", "#### Setting Up Training\n", "\n", "Let's actually use some of these functions to train a model.\n", "\n", "Start by importing the default configuration file."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from configs import cfg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "We use the [yacs](https://github.com/rbgirshick/yacs) format for configuration files. If you want to see the entire config, run `print(cfg.dump())`. This prints out a lot, so to avoid overwhelming you with too much information, we'll just focus on the bits we want to change for this model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "First, let's put in all the file directories for the data and weights we downloaded in the previous section, as well as an output directory for the model results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cfg.PATHS.TRAIN_FILE_PATTERN = os.path.join(TF_RECORD_DIR, \"train*\")\n", "cfg.PATHS.VAL_FILE_PATTERN = os.path.join(TF_RECORD_DIR, \"val*\")\n", "cfg.PATHS.WEIGHTS = os.path.join(WEIGHTS_DIR, R50_WEIGHTS, \"resnet50.ckpt\")\n", "cfg.PATHS.VAL_ANNOTATIONS = os.path.join(COCO_DIR, \"annotations\", \"instances_val2017.json\")\n", "cfg.PATHS.OUT_DIR = os.path.join(LOCAL_DATA_DIR, \"output\")\n", "\n", "# create output dir if it doesn't exist\n", "os.makedirs(cfg.PATHS.OUT_DIR, exist_ok=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "This section specifies model details, including the type of model, and internal hyperparameters. We won't cover the details of all of these, but more information can be found in the blog posts listed above, as well as the original paper."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cfg.LOG_INTERVAL = 50 # Number of training steps between log outputs\n", "cfg.MODEL.DENSE.PRE_NMS_TOP_N_TRAIN = 2000 # Top regions of interest to select before NMS\n", "cfg.MODEL.DENSE.POST_NMS_TOP_N_TRAIN = 1000 # Top regions of interest to select after NMS\n", "cfg.MODEL.RCNN.ROI_HEAD = \"StandardRoIHead\" # ROI head with box and mask, if mask is set to true\n", "cfg.MODEL.FRCNN.LOSS_TYPE = \"giou\"\n", "cfg.MODEL.INCLUDE_MASK = True # include mask. switching this off runs Faster RCNN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "Next we set up the configuration for training, including the optimizer, hyperparameters, batch size, and training length. Batch size is global, so if you set a batch size of 64 across 8 GPUs, it will be a batch size of 8 per GPU. SageMakerCV currently supports the following optimizers: momentum SGD (stochastic gradient descent) and NovoGrad, and the following learning rate schedulers: stepwise and cosine decay. New, custom optimizers and schedulers can be added by modifying the `sagemakercv/training/builder.py` file.\n", "\n", "For training on Studio, we'll just run for 2,500 steps. We'll be using SageMaker training instances for the full training on multiple GPUs."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cfg.INPUT.TRAIN_BATCH_SIZE = 4 # Training batch size\n", "cfg.INPUT.EVAL_BATCH_SIZE = 8 # Evaluation batch size\n", "cfg.SOLVER.SCHEDULE = \"CosineDecay\" # Learning rate schedule, either CosineDecay or PiecewiseConstantDecay\n", "cfg.SOLVER.OPTIMIZER = \"NovoGrad\" # Optimizer type NovoGrad or Momentum\n", "cfg.SOLVER.LR = .002 # Base learning rate after warmup\n", "cfg.SOLVER.BETA_1 = 0.9 # NovoGrad beta 1 value\n", "cfg.SOLVER.BETA_2 = 0.5 # NovoGrad beta 2 value\n", "cfg.SOLVER.MAX_ITERS = 2500 # Total training steps\n", "cfg.SOLVER.WARMUP_STEPS = 250 # warmup steps\n", "cfg.SOLVER.XLA = True # Train with XLA\n", "cfg.SOLVER.FP16 = True # Train with mixed precision enabled\n", "cfg.SOLVER.TF32 = False # Train with TF32 data type enabled, only available on Ampere GPUs and TF 2.4 and up" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, SageMakerCV includes a number of training hooks. These work similarly to Keras callbacks by adding some functionality to training. We use our own training hooks and runner class which improves performance beyond the standard Keras model.fit() training strategy.\n", "\n", "Here we include three hooks. The `CheckpointHook` loads the backbone weights, and saves a model checkpoint after each epoch. The `IterTimerHook` and `TextLoggerHook` print helpful training progress information out to CloudWatch during training." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cfg.HOOKS=[\"CheckpointHook\",\n", " \"IterTimerHook\",\n", " \"TextLoggerHook\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's save our new configuration file in case we want to use it in future training."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import yaml\n", "from contextlib import redirect_stdout" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "local_config_file = \"configs/local-config-studio.yaml\"\n", "with open(local_config_file, 'w') as outfile:\n", " with redirect_stdout(outfile): print(cfg.dump())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A saved model configuration can be loaded by first running `from configs import cfg` and then merging our saved file with `merge_from_file`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cfg.merge_from_file(local_config_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we can build and train our model. Import build functions so we can build pieces directly with our configuration file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemakercv.detection import build_detector\n", "from sagemakercv.training import build_optimizer, build_scheduler, build_trainer\n", "from sagemakercv.data import build_dataset\n", "from sagemakercv.utils.dist_utils import get_dist_info, MPI_size, is_sm_dist\n", "from sagemakercv.utils.runner import Runner, build_hooks\n", "import tensorflow as tf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And include some standard TensorFlow configuration setup so our model runs in mixed precision with XLA enabled."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rank, local_rank, size, local_size = get_dist_info()\n", "devices = tf.config.list_physical_devices('GPU')\n", "for device in devices:\n", " tf.config.experimental.set_memory_growth(device, True)\n", "tf.config.set_visible_devices([devices[local_rank]], 'GPU')\n", "logical_devices = tf.config.list_logical_devices('GPU')\n", "tf.config.optimizer.set_experimental_options({\"auto_mixed_precision\": cfg.SOLVER.FP16})\n", "tf.config.optimizer.set_jit(cfg.SOLVER.XLA)\n", "if int(tf.__version__.split('.')[1])>=4:\n", " tf.config.experimental.enable_tensor_float_32_execution(cfg.SOLVER.TF32)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build the dataset and create an iterable object from it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset = iter(build_dataset(cfg))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build the detector model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "detector = build_detector(cfg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pass a single observation through the model so the shapes are set. This is necessary to load the backbone weights." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features, labels = next(dataset)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = detector(features, training=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build the model optimizer. This will also build our learning rate schedule." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "optimizer = build_optimizer(cfg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The trainer contains our training and evaluation step, and sets up our distributed training based on whether we're using Horovod or SMDDP (more on this later)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trainer = build_trainer(cfg, detector, optimizer, dist='smd' if is_sm_dist() else 'hvd')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, the runner will manage our training and run our training hooks. This serves a similar role to training with Keras, but provides increased flexibility and training performance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "runner = Runner(trainer, cfg)\n", "hooks = build_hooks(cfg)\n", "for hook in hooks:\n", " runner.register_hook(hook)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run training for 2500 steps. This will take about 30 minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "runner.run(dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So now we have a partially trained model. Let's go ahead and try visualizing the results. You'll notice it picks up common categories (such as people) better at this point. The images are randomly picked from the training data, so it might take a few tries to get an image where the model picks up objects at this point in training."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemakercv.utils.visualization import build_image, restore_image\n", "from sagemakercv.data.coco.coco_labels import coco_categories\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features, labels = next(dataset)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = detector(features, training=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "image_num = 0 # image number within the batch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first restore the original image, then extract the boxes, classes, and scores from the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "image = restore_image(result['images'][image_num], features['image_info'][image_num]) # converts the image back to its original shape and color\n", "boxes = result['detection_boxes'][image_num]\n", "classes = result['detection_classes'][image_num]\n", "scores = result['detection_scores'][image_num]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate an image with the boxes and labels mapped onto it. The threshold limits the boxes to those where the model is at least this confident in the class." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "detection_image = build_image(image, boxes, scores, classes, coco_categories, threshold=0.8)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize = (15, 15))\n", "plt.imshow(detection_image)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! So far you've built a partially trained model locally on Studio. For many applications, this might be enough. 
If all you need is to train a model on a small dataset, you can likely do everything you need with what we've covered so far. \n", "\n", "On the other hand, if you need to train a model on many GBs or even TBs of data, and don't want to wait weeks for it to finish, you'll need to run a distributed training job across multiple GPUs, or even multiple nodes. With SageMaker training jobs you can train on as many as 512 [A100 GPUs](https://www.nvidia.com/en-us/data-center/a100/). We won't go quite that far, but we'll show you how.\n", "\n", "The section below is also replicated in the `SageMaker.ipynb` notebook for future training once all the above setup is complete.\n", "\n", "Before we get started, a few notes about how SageMaker training instances work. SageMaker takes care of a lot of setup for you, but it's important to understand a little of what's happening under the hood so you can customize training to your own needs. \n", "\n", "First we're going to look at a toy estimator to explain what's happening:\n", "\n", "```\n", "from sagemaker import get_execution_role\n", "from sagemaker.tensorflow import TensorFlow\n", "\n", "estimator = TensorFlow(\n", " entry_point='train.py', \n", " source_dir='.', \n", " py_version='py37',\n", " framework_version='2.4.1',\n", " role=get_execution_role(),\n", " instance_count=4,\n", " instance_type='ml.p4d.24xlarge',\n", " distribution=distribution,\n", " output_path='s3://my-bucket/my-output/',\n", " checkpoint_s3_uri='s3://my-bucket/my-checkpoints/',\n", " model_dir='s3://my-bucket/my-model/',\n", " hyperparameters={'config': 'my-config.yaml'},\n", " volume_size=500,\n", " code_location='s3://my-bucket/my-code/',\n", ")\n", "```\n", "\n", "The estimator forms the basic configuration of your training job.\n", "\n", "SageMaker will first launch `instance_count=4` `instance_type=ml.p4d.24xlarge` instances. The `role` is an IAM role that SageMaker will use to launch instances on your behalf. 
SageMaker includes a `get_execution_role` function which grabs the execution role of your current instance. Each instance will have a `volume_size=500` (GB) EBS volume attached for your model and data. On `ml.p4d.24xlarge` and `ml.p3dn.24xlarge` instance types, SageMaker will automatically set up the [Elastic Fabric Adapter](https://aws.amazon.com/hpc/efa/). EFA provides up to 400 Gbps communication between your training nodes, as well as [GPU Direct RDMA](https://aws.amazon.com/about-aws/whats-new/2020/11/efa-supports-nvidia-gpudirect-rdma/) on `ml.p4d.24xlarge`, which allows your GPUs to bypass the host and communicate directly with each other across nodes.\n", "\n", "Next, SageMaker will copy all the contents of `source_dir='.'` first to the `code_location='s3://my-bucket/my-code/'` S3 location, then to each of your instances. One common mistake is to leave large files or data in this directory or its subdirectories. This will slow down your launch times, or can even cause the launch to hang. Make sure to keep your working data and model artifacts elsewhere on your Studio instance so you don't accidentally copy them to your training instance. You should instead use `Channels` to copy data and model artifacts, which we'll cover shortly.\n", "\n", "SageMaker will then download the training Docker image to all your instances. Which container you download is determined by `py_version='py37'` and `framework_version='2.4.1'`. You can also use your own [custom Docker image](https://aws.amazon.com/blogs/machine-learning/bringing-your-own-custom-container-image-to-amazon-sagemaker-studio-notebooks/) by specifying an ECR address with the `image_uri` option. SageMakerCV currently works with TensorFlow versions 2.3-2.5.\n", "\n", "Before starting training, SageMaker will check your source directory for a `setup.py` file, and install it if one is present. Then SageMaker will launch training, via `entry_point='train.py'`. 
Anything in `hyperparameters={'config': 'my-config.yaml'}` will be passed to the training script as a command line argument (i.e. `python train.py --config my-config.yaml`). The `distribution` setting determines what form of distributed training to launch. This will be covered in more detail later.\n", "\n", "During training, anything written to `/opt/ml/checkpoints` on your training instances will be continuously synced to `checkpoint_s3_uri='s3://my-bucket/my-checkpoints/'`. This can be useful for checkpointing a model you might want to restart later, or for writing TensorBoard logs to monitor your training.\n", "\n", "When training completes, you can write your model artifacts to `/opt/ml/model` and they will be saved to `model_dir='s3://my-bucket/my-model/'`. Another option is to also write model artifacts to your checkpoints directory.\n", "\n", "Training logs and any failure messages will be written to `/opt/ml/output` and saved to `output_path='s3://my-bucket/my-output/'`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker import get_execution_role\n", "from sagemaker.tensorflow import TensorFlow\n", "from datetime import datetime" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we need to set some names. You want `AWS_DEFAULT_REGION` to be the same region as the S3 bucket you created earlier, to ensure your training jobs are reading from nearby S3 buckets.\n", "\n", "Next, set a `user_id`. This is just for naming your training job so it's easier to find later. This can be anything you like. We also get the current date and time to make organizing training jobs a little easier." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# explain region. 
Don't launch a training job in VA with S3 bucket in OR\n", "os.environ['AWS_DEFAULT_REGION'] = region # This is the region we set at the beginning, when creating the S3 bucket for our data\n", "\n", "# this is all for naming\n", "user_id=\"jbsnyder-smcv-tutorial\" # This is used for naming your training job, and organizing your results on S3. It can be anything you like.\n", "date_str=datetime.now().strftime(\"%d-%m-%Y\") # use the date and time to keep track of training jobs and organize results in S3\n", "time_str=datetime.now().strftime(\"%d-%m-%Y-%H-%M-%S\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For instance type, we'll use an `ml.p4d.24xlarge`. We recommend this instance type for large training jobs. It includes the latest A100 Nvidia GPUs, which can train several times faster than the previous generation. If you would rather train part way on smaller instances, `ml.p3.2xlarge, ml.p3.8xlarge, ml.p3.16xlarge, ml.p3dn.24xlarge, ml.g4dn.12xlarge` are all good options. In particular, if you're looking for a low cost way to try a short distributed training, but aren't worried about the model fully converging, we recommend the `ml.g4dn.12xlarge` which uses 4 Nvidia T4 GPUs per node.\n", "\n", "`s3_location` will be the base S3 storage location we used earlier for the COCO data. For `role` we get the execution role from our Studio instance. For `source_dir` we use the current directory. Again, make sure you haven't accidentally written any large files to this directory."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# specify training type, s3 src and nodes\n", "instance_type=\"ml.p4d.24xlarge\" # This can be any of 'ml.p3dn.24xlarge', 'ml.p4d.24xlarge', 'ml.p3.16xlarge', 'ml.p3.8xlarge', 'ml.p3.2xlarge', 'ml.g4dn.12xlarge'\n", "nodes=4 # number of training nodes\n", "s3_location=os.path.join(\"s3://\", S3_BUCKET, S3_DIR)\n", "role=get_execution_role() # give SageMaker permission to launch nodes on our behalf\n", "source_dir='.'\n", "entry_point='train.py'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "Let's modify our previous training configuration for multinode. We don't need to change much. We'll increase the batch size since we have more and larger GPUs. For A100 GPUs a batch size of 12 per GPU works well. For V100 and T4 GPUs, a batch size of 6 per GPU is recommended. Make sure to lower the learning rate and increase your number of training steps if you decrease the batch size. For example, if you want to train on 2 `ml.g4dn.12xlarge` instances, you'll have 8 T4 GPUs. A batch size of `cfg.INPUT.TRAIN_BATCH_SIZE = 32`, with inference batch size of `cfg.INPUT.EVAL_BATCH_SIZE = 16`, learning rate of `cfg.SOLVER.LR = .008`, and training steps of `cfg.SOLVER.MAX_ITERS = 25000` is probably about right. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from configs import cfg" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cfg.LOG_INTERVAL = 50 # Number of training steps between logging interval\n", "cfg.MODEL.DENSE.PRE_NMS_TOP_N_TRAIN = 2000 # Top regions of interest to select before NMS\n", "cfg.MODEL.DENSE.POST_NMS_TOP_N_TRAIN = 1000 # Top regions of interest to select after NMS\n", "cfg.MODEL.RCNN.ROI_HEAD = \"StandardRoIHead\"\n", "cfg.MODEL.FRCNN.LOSS_TYPE = \"giou\"\n", "cfg.MODEL.FRCNN.LABEL_SMOOTHING = 0.1 # label smoothing for box head" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cfg.INPUT.TRAIN_BATCH_SIZE = 256 # Training batch size\n", "cfg.INPUT.EVAL_BATCH_SIZE = 128 # Evaluation batch size\n", "cfg.SOLVER.SCHEDULE = \"CosineDecay\" # Learning rate schedule, either CosineDecay or PiecewiseConstantDecay\n", "cfg.SOLVER.OPTIMIZER = \"NovoGrad\" # Optimizer type NovoGrad or Momentum\n", "cfg.SOLVER.LR = .042 # Base learning rate after warmup\n", "cfg.SOLVER.BETA_1 = 0.9 # NovoGrad beta 1 value\n", "cfg.SOLVER.BETA_2 = 0.3 # NovoGrad beta 2 value\n", "cfg.SOLVER.ALPHA = 0.001 # scheduler final alpha\n", "cfg.SOLVER.WEIGHT_DECAY = 0.001 # weight decay\n", "cfg.SOLVER.MAX_ITERS = 5500 # Total training steps\n", "cfg.SOLVER.WARMUP_STEPS = 500 # warmup steps\n", "cfg.SOLVER.XLA = True # Train with XLA\n", "cfg.SOLVER.FP16 = True # Train with mixed precision enabled\n", "cfg.SOLVER.TF32 = True # Train with TF32 data type enabled, only available on Ampere GPUs and TF 2.4 and up\n", "cfg.SOLVER.EVAL_EPOCH_EVAL = False # Only run eval at end" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cfg.HOOKS=[\"CheckpointHook\",\n", " \"IterTimerHook\",\n", " \"TextLoggerHook\",\n", " \"CocoEvaluator\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", 
"Earlier we mentioned the `distribution` strategy in SageMaker. Distributed training can be either multi GPU single node (i.e. training on 8 GPUs in a single ml.p4d.24xlarge) or multi GPU multi node (i.e. training on 32 GPUs across 4 ml.p4d.24xlarges). For TensorFlow, SageMakerCV uses either Horovod or [SageMaker Distributed Data Parallel](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html) (SMDDP). For single node multi GPU, or multi node on small instances, we recommend Horovod. For multinode on large instance types, SMDDP is built to fully utilize AWS network topology and EFA, providing improved scaling efficiency.\n", "\n", "To enable SMDDP, set `distribution = { \"smdistributed\": { \"dataparallel\": { \"enabled\": True } } }`. SageMakerCV already has SMDDP integrated. To implement SMDDP for your own models, follow [these instructions](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-intro.html). SMDDP will launch training from the first node in your cluster using [MPI](https://www.open-mpi.org/).\n", "\n", "For Horovod based training, we can call MPI directly by setting `distribution = {\"mpi\": {\"enabled\": True,}}`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if nodes>1 and instance_type in ['ml.p3dn.24xlarge', 'ml.p4d.24xlarge', 'ml.p3.16xlarge']:\n", " distribution = { \"smdistributed\": { \"dataparallel\": { \"enabled\": True } } } \n", "else:\n", " distribution = {\"mpi\": {\"enabled\": True,}}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "We'll set a job name based on the user name and time. We'll then set output directories on S3 using the date and job name.\n", "\n", "For this training, we'll use the same S3 location for all 3 SageMaker model outputs `/opt/ml/checkpoints`, `/opt/ml/model`, and `/opt/ml/output`."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "job_name = f'{user_id}-{time_str}' # Set the job name to user id and the current time\n", "output_path = os.path.join(s3_location, \"sagemaker-output\", date_str, job_name) # Organizes results on S3 by date and job name\n", "code_location = os.path.join(s3_location, \"sagemaker-code\", date_str, job_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "Next we need to add our data sources to our configuration file, but first let's talk a little more about how SageMaker gets data to your instance.\n", "\n", "The most straightforward way to get your data is using \"Channels.\" These are S3 locations you specify in a dictionary when you launch a training job. For example, let's say you launch a training job with:\n", "\n", "```\n", "channels = {'train': 's3://my-bucket/data/train/',\n", " 'test': 's3://my-bucket/data/test/',\n", " 'weights': 's3://my-bucket/data/weights/',\n", " 'dave': 's3://my-bucket/data/daves_weird_data/'}\n", "\n", "estimator.fit(channels)\n", "```\n", "\n", "At the start of training, SageMaker will create a set of corresponding directories on each training node:\n", "\n", "```\n", "/opt/ml/input/data/train/\n", "/opt/ml/input/data/test/\n", "/opt/ml/input/data/weights/\n", "/opt/ml/input/data/dave/\n", "```\n", "\n", "SageMaker will then copy all the contents of the corresponding S3 locations to these directories, which you can then access in training.\n", "\n", "One downside of setting up channels like this is that it requires all the data to be downloaded to your instance at the start of training, which can delay the training launch if you're dealing with a large dataset.\n", "\n", "We have two ways to speed up launch. 
The first is [Fast File Mode](https://aws.amazon.com/about-aws/whats-new/2021/10/amazon-sagemaker-fast-file-mode/) which downloads data from S3 as it's requested by the training model, speeding up your launch time. You can use fast file mode by specifying `input_mode='FastFile'` in your SageMaker estimator configuration (this sets `TrainingInputMode` in the underlying training job request). \n", "\n", "If you're dealing with really large datasets, you might prefer to instead continuously stream data from S3. Luckily, this feature is already supported in TensorFlow and SageMakerCV. If you provide the dataset builder with an S3 file pattern, it will stream TFRecords from S3 instead of reading them locally.\n", "\n", "In our case, we'll use a mix of channels and streaming from S3. We'll download the smaller pieces at the start of training (the validation data, pretrained weights, and image annotations), and we'll stream our training data directly from S3 during training.\n", "\n", "First, we set up our training channels. These are the locations where we earlier uploaded our COCO data, annotations, and weights." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "channels = {'val2017': os.path.join(s3_location, 'data', 'coco', 'tfrecord', 'val2017'),\n", " 'annotations': os.path.join(s3_location, 'data', 'coco', 'annotations'),\n", " 'weights': os.path.join(s3_location, 'data', 'weights', 'resnet')}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we set up the data sources in our configuration. The train file pattern will take an S3 location. The others are all set to the corresponding directory for each channel. We also set the output directory to be the SageMaker checkpoint directory, which will sync to our S3 output location."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "CHANNELS_DIR='/opt/ml/input/data/' # on node\n", "cfg.PATHS.TRAIN_FILE_PATTERN = os.path.join(s3_location, 'data', 'coco', 'tfrecord', 'train2017', 'train*')\n", "cfg.PATHS.VAL_FILE_PATTERN = os.path.join(CHANNELS_DIR, \"val2017\", \"val*\")\n", "cfg.PATHS.WEIGHTS = os.path.join(CHANNELS_DIR, \"weights\", \"resnet50.ckpt\")\n", "cfg.PATHS.VAL_ANNOTATIONS = os.path.join(CHANNELS_DIR, \"annotations\", \"instances_val2017.json\")\n", "cfg.PATHS.OUT_DIR = '/opt/ml/checkpoints'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the configuration file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dist_config_file = \"configs/dist-training-config.yaml\"\n", "with open(dist_config_file, 'w') as outfile:\n", " with redirect_stdout(outfile): print(cfg.dump())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the config file as a hyperparameter so it will be passed as a command line argument when training launches." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hyperparameters = {\"config\": dist_config_file}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we can launch training. With 4 P4d instances, this takes about an hour. This section will also print a lot of output logs. By setting `wait=False` you can avoid printing logs in the notebook. This setting will just launch the job then return, and is useful for when you want to launch several jobs at the same time. You can then monitor each job from the [SageMaker Training Console](https://us-west-2.console.aws.amazon.com/sagemaker)."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimator = TensorFlow(\n", " entry_point=entry_point, \n", " source_dir=source_dir, \n", " py_version='py37',\n", " framework_version='2.4.1',\n", " role=role,\n", " instance_count=nodes,\n", " instance_type=instance_type,\n", " distribution=distribution,\n", " output_path=output_path,\n", " checkpoint_s3_uri=output_path,\n", " model_dir=output_path,\n", " hyperparameters=hyperparameters,\n", " volume_size=500,\n", " disable_profiler=True,\n", " debugger_hook_config=False,\n", " code_location=code_location,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimator.fit(channels, wait=True, job_name=job_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "### Visualizing Results\n", "\n", "And there you have it, a fully trained Mask RCNN model in about an hour. Now let's see how our model does on prediction by actually visualizing the output.\n", "\n", "Our model is stored at the S3 location we gave to the training job in `output_path`. The checkpointer hook creates a `trained_model` directory and stores the final checkpoint there. We'll need to grab the results and store them on our studio instance so we can check performance, and visualize the output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3fs = S3FileSystem()\n", "model_loc = os.path.join(estimator.output_path, 'trained_model', 'model.h5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copy the model from S3 to our Studio instance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3fs.get(model_loc, model_loc.split('/')[-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can load the trained model weights into the detector model we created earlier for the local training." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "detector.load_weights(model_loc.split('/')[-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like we did for the local model, let's grab a random image from the dataset and visualize the model's predictions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features, labels = next(dataset)\n", "result = detector(features, training=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "image_num = 3 # image number within the batch\n", "image = restore_image(result['images'][image_num], features['image_info'][image_num]) # converts the image back to its original shape and color\n", "boxes = result['detection_boxes'][image_num]\n", "classes = result['detection_classes'][image_num]\n", "scores = result['detection_scores'][image_num]\n", "detection_image = build_image(image, boxes, scores, classes, coco_categories, threshold=0.8)\n", "plt.figure(figsize = (15, 15))\n", "plt.imshow(detection_image)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Conclusion\n", "\n", "In this notebook, we've walked through the entire process of training Mask RCNN on SageMaker. We've implemented several of SageMaker's more advanced features, such as distributed training, EFA, and streaming data directly from S3. From here you can use the provided template datasets to train on your own data, or modify the framework with your own object detection model.\n", "\n", "When you're done, make sure to check that all of your SageMaker training jobs have stopped by checking the [SageMaker Training Console](https://us-west-2.console.aws.amazon.com/sagemaker). Also check that you've stopped any Studio instance you have running by selecting the session monitor on the left (the circle with a square in it), and clicking the power button next to any running instances. 
Your files will still be saved on the Studio EBS volume.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.g4dn.xlarge", "kernelspec": { "display_name": "Python 3 (TensorFlow 2.3 Python 3.7 GPU Optimized)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/tensorflow-2.3-gpu-py37-cu110-ubuntu18.04-v3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }