{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# FairMOT Training in Amazon SageMaker\n", "\n", "This notebook demonstrates how to train a [FairMOT](https://arxiv.org/abs/2004.01888) model with SageMaker and tune hyper-parameters with [SageMaker Hyperparameter tuning job](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. SageMaker Initialization \n", "First we upgrade SageMaker to the latest version. If your notebook is already using the latest SageMaker 2.x API, you may skip the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "! pip install --upgrade pip\n", "! python3 -m pip install --upgrade sagemaker" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "import boto3\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.estimator import Estimator\n", "\n", "role = get_execution_role() # provide a pre-existing role ARN as an alternative to creating a new role\n", "print(f'SageMaker Execution Role:{role}')\n", "\n", "client = boto3.client('sts')\n", "account = client.get_caller_identity()['Account']\n", "print(f'AWS account:{account}')\n", "\n", "session = boto3.session.Session()\n", "region = session.region_name\n", "print(f'AWS region:{region}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_bucket = sagemaker.Session().default_bucket() \n", "\n", "# we use data parallel to train a model on a single instance as https://github.com/ifzhang/FairMOT\n", "version_name = \"dp\"\n", "\n", "# Currently we support MOT17 and MOT20\n", "dataset_name= \"MOT17\" # Options: MOT17, MOT20\n", "\n", "# 0: set all data to train data, 1: set second half part to validation data\n", "# set 1 when executing hyperparameter tuning job\n", "half_val = 1\n", "\n", "training_image = f\"{account}.dkr.ecr.{region}.amazonaws.com/fairmot-sagemaker:pytorch1.8-{version_name}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Stage dataset in Amazon S3\n", "\n", "We use the dataset from [MOT Challenge](https://motchallenge.net) for training. First, we download the dataset to this notebook instance. By referencing [DATA ZOO](https://github.com/Zhongdao/Towards-Realtime-MOT/blob/master/DATASET_ZOO.md), we prepare the dataset which can be trained by `FairMOT`, and upload the processed dataset to the Amazon [S3 bucket](https://docs.aws.amazon.com/en_pv/AmazonS3/latest/gsg/CreatingABucket.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat ./prepare-s3-bucket.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using your *Amazon S3 bucket*, *dataset name* and *validation flag* as arguments, run the script [`prepare-s3-bucket.sh`](prepare-s3-bucket.sh). You can skip this step if you have already uploaded the dataset to S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "!./prepare-s3-bucket.sh {s3_bucket} {dataset_name} {half_val}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Build and push SageMaker training image\n", "We use the [FairMOT](https://github.com/ifzhang/FairMOT) implementation to build our own training container and push the image to [Amazon ECR](https://aws.amazon.com/ecr/).\n", "\n", "### Docker Environment Preparation\n", "Because the training image may be larger than the free space on the notebook instance's root volume, we move the Docker data directory to ```/home/ec2-user/SageMaker/docker```.\n", "\n", "By default, the Docker data root is ```/var/lib/docker/```; the script [`prepare-docker.sh`](prepare-docker.sh) changes it to ```/home/ec2-user/SageMaker/docker```." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat /etc/docker/daemon.json" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!bash ./prepare-docker.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build training image for FairMOT\n", "Use the script [`./container-dp/build_tools/build_and_push.sh`](./container-dp/build_tools/build_and_push.sh) to build and push the FairMOT training image to [Amazon ECR](https://aws.amazon.com/ecr/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!cat ./container-{version_name}/build_tools/build_and_push.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using your *AWS region* as the argument, run the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "!bash ./container-{version_name}/build_tools/build_and_push.sh {region}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Define SageMaker Data Channels\n", "In this step, we define the SageMaker `train` data channel. With `File` input mode, SageMaker makes the data under this S3 prefix available to the training container at `/opt/ml/input/data/train`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.inputs import TrainingInput\n", "prefix = \"fairmot/sagemaker\" # prefix in your S3 bucket\n", "s3train = f's3://{s3_bucket}/{prefix}/input/train'\n", "\n", "train_input = TrainingInput(s3_data=s3train,\n", "                            distribution=\"FullyReplicated\",\n", "                            s3_data_type='S3Prefix',\n", "                            input_mode='File')\n", "\n", "data_channels = {'train': train_input}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we define the model output location in the S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_output_location = f's3://{s3_bucket}/{prefix}/output'" ] },
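{ "cell_type": "markdown", "metadata": {}, "source": [ "Before configuring the hyper-parameters and launching training, you can optionally confirm that the image pushed in step 3 is available in Amazon ECR. This is a minimal sketch that assumes the repository name `fairmot-sagemaker` and the tag `pytorch1.8-{version_name}`, matching the `training_image` URI defined in step 1." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "\n", "# Optional: confirm that the training image pushed in step 3 exists in ECR.\n", "# Assumes the repository name and tag match the training_image URI defined in step 1.\n", "ecr = boto3.client('ecr')\n", "response = ecr.describe_images(repositoryName='fairmot-sagemaker',\n", "                               imageIds=[{'imageTag': f'pytorch1.8-{version_name}'}])\n", "print(response['imageDetails'][0]['imageDigest'])" ] },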
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Configure Hyper-parameters\n", "In this step, we define the hyper-parameters used in FairMOT. Jump to [8. Hyperparameter Tuning](#hyperparametertuning) if you want to run a hyperparameter tuning job.\n", "\n", "| Hyper-parameter | Description | Default |\n", "|---|---|---|\n", "| arch | Model architecture. Currently tested: resdcn_34, resdcn_50, resfpndcn_34, dla_34, hrnet_18 | 'dla_34' |\n", "| load_model | Pretrained model | fairmot_dla34.pth |\n", "| head_conv | Conv layer channels for the output head; 0 for no conv layer, -1 for the default setting (256 for resnets and 256 for dla) | -1 |\n", "| down_ratio | Output stride. Currently only supports 4. | 4 |\n", "| input_res | Input height and width; -1 for the dataset default. Overridden by input_h and input_w. | -1 |\n", "| input_h | Input height | 608 |\n", "| input_w | Input width | 1088 |\n", "| lr | Learning rate for a batch size of 12 | 1e-4 |\n", "| lr_step | Epoch(s) at which the learning rate is dropped by a factor of 10 | '20' |\n", "| num_epochs | Total training epochs | 30 |\n", "| batch_size | Batch size; 8 is recommended when using an ml.p3 instance | 8 |\n", "| num_iters | Iterations per epoch; the default (-1) uses #samples / batch_size | -1 |\n", "| val_intervals | Number of epochs between validation runs | 5 |\n", "| reg_loss | Regression loss: sl1, l1, or l2 | 'l1' |\n", "| hm_weight | Loss weight for keypoint heatmaps | 1 |\n", "| off_weight | Loss weight for keypoint local offsets | 1 |\n", "| wh_weight | Loss weight for bounding box size | 0.1 |\n", "| id_loss | Re-ID loss: ce or focal | 'ce' |\n", "| id_weight | Loss weight for re-ID | 1 |\n", "| reid_dim | Feature dimension for re-ID | 128 |\n", "