{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# FairMOT Training in Amazon SageMaker\n", "\n", "This notebook demonstrates how to train a [FairMOT](https://arxiv.org/abs/2004.01888) model with SageMaker and tune hyper-parameters with [SageMaker Hyperparameter tuning job](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. SageMaker Initialization \n", "First we upgrade SageMaker to the latest version. If your notebook is already using the latest SageMaker 2.x API, you may skip the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "! pip install --upgrade pip\n", "! python3 -m pip install --upgrade sagemaker" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "import boto3\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.estimator import Estimator\n", "\n", "role = get_execution_role() # provide a pre-existing role ARN as an alternative to creating a new role\n", "print(f'SageMaker Execution Role:{role}')\n", "\n", "client = boto3.client('sts')\n", "account = client.get_caller_identity()['Account']\n", "print(f'AWS account:{account}')\n", "\n", "session = boto3.session.Session()\n", "region = session.region_name\n", "print(f'AWS region:{region}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_bucket = sagemaker.Session().default_bucket() \n", "\n", "# we use data parallel to train a model on a single instance as https://github.com/ifzhang/FairMOT\n", "version_name = \"dp\"\n", "\n", "# Currently we support MOT17 and MOT20\n", "dataset_name= \"MOT17\" # Options: MOT17, MOT20\n", "\n", "# 0: set all data to train data, 1: set second half part to validation data\n", "# set 1 when executing hyperparameter tuning job\n", "half_val = 1\n", "\n", "training_image = f\"{account}.dkr.ecr.{region}.amazonaws.com/fairmot-sagemaker:pytorch1.8-{version_name}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Stage dataset in Amazon S3\n", "\n", "We use the dataset from [MOT Challenge](https://motchallenge.net) for training. First, we download the dataset to this notebook instance. By referencing [DATA ZOO](https://github.com/Zhongdao/Towards-Realtime-MOT/blob/master/DATASET_ZOO.md), we prepare the dataset which can be trained by `FairMOT`, and upload the processed dataset to the Amazon [S3 bucket](https://docs.aws.amazon.com/en_pv/AmazonS3/latest/gsg/CreatingABucket.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat ./prepare-s3-bucket.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using your *Amazon S3 bucket*, *dataset name* and *validation flag* as arguments, run the script [`prepare-s3-bucket.sh`](prepare-s3-bucket.sh). You can skip this step if you have already uploaded the dataset to S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "!./prepare-s3-bucket.sh {s3_bucket} {dataset_name} {half_val}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Build and push SageMaker training image\n", "We use the implementation of [FairMOT](https://github.com/ifzhang/FairMOT) to create our own container and push the image to [Amazon ECR](https://aws.amazon.com/ecr/).\n", "\n", "### Docker Environment Preparation\n", "Because the container image can be larger than the free space on the notebook instance's root volume, we move Docker's data directory to ```/home/ec2-user/SageMaker/docker```.\n", "\n", "By default, Docker stores its data under ```/var/lib/docker/```; the script below changes this location to ```/home/ec2-user/SageMaker/docker```." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat /etc/docker/daemon.json" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!bash ./prepare-docker.sh" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Build training image for FairMOT\n", "Use the script [`./container-dp/build_tools/build_and_push.sh`](./container-dp/build_tools/build_and_push.sh) to build and push the FairMOT training image to [Amazon ECR](https://aws.amazon.com/ecr/)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!cat ./container-{version_name}/build_tools/build_and_push.sh" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Using your *AWS region* as the argument, run the cell below." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "!bash ./container-{version_name}/build_tools/build_and_push.sh {region}" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Define SageMaker Data Channels\n", "In this step, we define the SageMaker `train` data channel." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.inputs import TrainingInput\n", "prefix = \"fairmot/sagemaker\" # prefix in your S3 bucket\n", "s3train = f's3://{s3_bucket}/{prefix}/input/train'\n", "\n", "train_input = TrainingInput(s3_data=s3train,\n", "                            distribution=\"FullyReplicated\",\n", "                            s3_data_type='S3Prefix',\n", "                            input_mode='File')\n", "\n", "data_channels = {'train': train_input}" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, we define the model output location in the S3 bucket." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_output_location = f's3://{s3_bucket}/{prefix}/output'" ] },
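{ "cell_type": "markdown", "metadata": {}, "source": [ "Before configuring the training job, you can optionally confirm that the image pushed in step 3 is available in ECR. The next cell is a minimal sketch using the ECR `DescribeImages` API; the repository name and tag are taken from the `training_image` URI defined in step 1." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: verify that the training image exists in ECR before launching a job.\n", "# Repository name and tag are derived from the training_image URI built in step 1;\n", "# this call raises ImageNotFoundException if the push did not succeed.\n", "ecr = boto3.client('ecr')\n", "resp = ecr.describe_images(\n", "    repositoryName='fairmot-sagemaker',\n", "    imageIds=[{'imageTag': f'pytorch1.8-{version_name}'}]\n", ")\n", "print([img.get('imageTags') for img in resp['imageDetails']])" ] },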
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Configure Hyper-parameters\n", "In this step, we define the hyper-parameters used in FairMOT. Jump to [8. Hyperparameter Tuning](#hyperparametertuning) if you want to run a hyperparameter tuning job.\n", "\n", "**FairMOT Hyper-parameters**\n", "\n", "| Hyper-parameter | Description | Default |\n", "| --- | --- | --- |\n", "| arch | Model architecture. Currently tested: resdcn_34 \\| resdcn_50 \\| resfpndcn_34 \\| dla_34 \\| hrnet_18 | 'dla_34' |\n", "| load_model | Pretrained model. | fairmot_dla34.pth |\n", "| head_conv | Conv layer channels for the output head; 0 for no conv layer, -1 for the default setting (256 for ResNets and 256 for DLA). | -1 |\n", "| down_ratio | Output stride. Currently only 4 is supported. | 4 |\n", "| input_res | Input height and width; -1 for the dataset default. Overridden by input_h and input_w. | -1 |\n", "| input_h | Input height. | 608 |\n", "| input_w | Input width. | 1088 |\n", "| lr | Learning rate for batch size 12. | 1e-4 |\n", "| lr_step | Drop the learning rate by 10 at these epochs. | '20' |\n", "| num_epochs | Total training epochs. | 30 |\n", "| batch_size | Batch size; 8 is recommended when using an ml.p3 instance. | 8 |\n", "| num_iters | Number of iterations per epoch; default: #samples / batch_size. | -1 |\n", "| val_intervals | Number of epochs between validation runs. | 5 |\n", "| reg_loss | Regression loss: sl1 \\| l1 \\| l2. | 'l1' |\n", "| hm_weight | Loss weight for keypoint heatmaps. | 1 |\n", "| off_weight | Loss weight for keypoint local offsets. | 1 |\n", "| wh_weight | Loss weight for bounding box size. | 0.1 |\n", "| id_loss | ReID loss: ce \\| focal. | 'ce' |\n", "| id_weight | Loss weight for ID. | 1 |\n", "| reid_dim | Feature dimension for ReID. | 128 |" ] },
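{ "cell_type": "markdown", "metadata": {}, "source": [ "Note that the default `lr` of 1e-4 targets a batch size of 12. If you change `batch_size`, one common heuristic (an assumption here, not part of the FairMOT recipe) is to scale the learning rate linearly with the batch size, as sketched below." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Linear learning-rate scaling heuristic (an assumption, not part of the official\n", "# FairMOT recipe): the default lr of 1e-4 targets batch size 12, so rescale\n", "# proportionally when training with a different batch size.\n", "base_lr, base_batch_size = 1e-4, 12\n", "batch_size = 8\n", "scaled_lr = base_lr * batch_size / base_batch_size\n", "print(f'Suggested lr for batch_size={batch_size}: {scaled_lr:.2e}')" ] },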
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hyperparameters = {\n", " \"batch_size\": 8,\n", " \"num_epochs\": 20,\n", " \"val_intervals\": 1,\n", " \"load_model\": 'fairmot_dla34.pth',\n", " \"data_name\": \"MOT17\"\n", " }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Define Training Metrics\n", "Next, we define the regular expressions that SageMaker uses to extract algorithm metrics from training logs and send them to [AWS CloudWatch metrics](https://docs.aws.amazon.com/en_pv/AmazonCloudWatch/latest/monitoring/working_with_metrics.html). These algorithm metrics are visualized in SageMaker console." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_definitions=[\n", " {\n", " \"Name\": \"train_loss\",\n", " \"Regex\": \"\\|train_loss\\\\s*(\\\\S+).*\"\n", " },\n", " {\n", " \"Name\": \"train_hm_loss\",\n", " \"Regex\": \"\\|train_hm_loss\\\\s*(\\\\S+).*\"\n", " },\n", " {\n", " \"Name\": \"train_wh_loss\",\n", " \"Regex\": \"\\|train_wh_loss\\\\s*(\\\\S+).*\"\n", " },\n", " {\n", " \"Name\": \"train_id_loss\",\n", " \"Regex\": \"\\|train_id_loss\\\\s*(\\\\S+).*\"\n", " },\n", " {\n", " \"Name\": \"train_off_loss\",\n", " \"Regex\": \"\\|train_off_loss\\\\s*(\\\\S+).*\"\n", " },\n", " {\n", " \"Name\": \"val_loss\",\n", " \"Regex\": \"\\|val_loss\\\\s*(\\\\S+).*\"\n", " },\n", " {\n", " \"Name\": \"val_hm_loss\",\n", " \"Regex\": \"\\|val_hm_loss\\\\s*(\\\\S+).*\"\n", " },\n", " {\n", " \"Name\": \"val_wh_loss\",\n", " \"Regex\": \"\\|val_wh_loss\\\\s*(\\\\S+).*\"\n", " },\n", " {\n", " \"Name\": \"val_id_loss\",\n", " \"Regex\": \"\\|val_id_loss\\\\s*(\\\\S+).*\"\n", " }, \n", " {\n", " \"Name\": \"val_off_loss\",\n", " \"Regex\": \"\\|val_off_loss\\\\s*(\\\\S+).*\"\n", " }\n", " ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Define SageMaker Training Job\n", "\n", "Next, we use SageMaker [Estimator](https://sagemaker.readthedocs.io/en/stable/estimators.html) API to define a SageMaker Training Job." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sagemaker_session = sagemaker.session.Session(boto_session=session)\n", "\n", "fairmot_estimator = Estimator(image_uri=training_image,\n", " role=role, \n", " instance_count=1,\n", " instance_type='ml.p3.16xlarge',\n", " volume_size = 100,\n", " max_run = 40000,\n", " output_path=s3_output_location,\n", " sagemaker_session=sagemaker_session, \n", " hyperparameters = hyperparameters,\n", " metric_definitions = metric_definitions,\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we launch the SageMaker training job." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "job_name=f'fairmot-{version_name}-{int(time.time())}'\n", "print(f\"Launching Training Job: {job_name}\")\n", "\n", "# set wait=True below if you want to print logs in cell output\n", "fairmot_estimator.fit(inputs=data_channels, job_name=job_name, logs=\"All\", wait=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the metrics of the training job in the `Training Job` console." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.core.display import display, HTML\n", "\n", "display(\n", " HTML(\n", " f'Check the status of training job'\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Once above training job completed**, we store the S3 URI of the model artifact in IPython’s database as a variable. This variable will be used to serve model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_model_uri = fairmot_estimator.model_data\n", "%store s3_model_uri" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 8.Hyperparameter Tuning\n", "In this step, we define and launch Hyperparameter tuning job. `MaxParallelTrainingJobs` should be **equal or less than the limit of training job instance**. We choose `id_loss` and `lr` for tuning and set `val_loss` to the objective metric. \n", "\n", "As [Best Practices for Hyperparameter Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-considerations.html) suggests, a tuning job improves only through successive rounds of experiments. Therefore, smaller `MaxParallelTrainingJobs` and larger `MaxNumberOfTrainingJobs` may lead to a better result. When `MaxParallelTrainingJobs` is equal to `MaxNumberOfTrainingJobs`, searching strategy will become `Random Search` even setting it as `Bayesian Search`. In this demonstration, we set `MaxParallelTrainingJobs` to 1.\n", "\n", "For `MaxNumberOfTrainingJobs`, setting a larger `MaxNumberOfTrainingJobs` cat get the better result, but it takes a longer time. We set `MaxNumberOfTrainingJobs` to the small value 3 to show how SageMaker Hyperparameter works. When you train a model on your own dataset, we recommend to set `MaxNumberOfTrainingJobs` to a larger value.\n", "\n", "For more details on Hyperparameter tuning with SageMaker, you can reference [How Hyperparameter Tuning Works](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "from time import gmtime, strftime\n", "\n", "tuning_job_name = f'fairmot-tuningjob-{version_name}-' + strftime(\"%d-%H-%M-%S\", gmtime())\n", "\n", "print(tuning_job_name)\n", "\n", "tuning_job_config = {\n", " \"ParameterRanges\": {\n", " \"CategoricalParameterRanges\": [\n", " {\n", " \"Name\": \"id_loss\",\n", " \"Values\": ['ce', 'focal']\n", " }\n", " ],\n", " \"ContinuousParameterRanges\": [\n", " {\n", " \"Name\": \"lr\",\n", " \"MaxValue\": \"1e-3\",\n", " \"MinValue\": \"1e-5\",\n", " \"ScalingType\": \"Auto\"\n", " }\n", " ]\n", " },\n", " \"ResourceLimits\": {\n", " \"MaxNumberOfTrainingJobs\": 3,\n", " \"MaxParallelTrainingJobs\": 1\n", " },\n", " \"Strategy\": \"Bayesian\",\n", " \"HyperParameterTuningJobObjective\": {\n", " \"MetricName\": \"val_loss\",\n", " \"Type\": \"Minimize\"\n", " }\n", " }" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_job_definition = {\n", " \"AlgorithmSpecification\": {\n", " \"MetricDefinitions\": [\n", " {\n", " \"Name\": \"train_loss\",\n", " \"Regex\": \"\\|train_loss\\\\s*(\\\\S+).*\"\n", " },\n", " {\n", " \"Name\": \"train_hm_loss\",\n", " \"Regex\": \"\\|train_hm_loss\\\\s*(\\\\S+).*\"\n", " },\n", " {\n", " \"Name\": \"train_wh_loss\",\n", " \"Regex\": \"\\|train_wh_loss\\\\s*(\\\\S+).*\"\n", " },\n", " {\n", " \"Name\": \"train_id_loss\",\n", " \"Regex\": \"\\|train_id_loss\\\\s*(\\\\S+).*\"\n", " },\n", " {\n", " \"Name\": \"train_off_loss\",\n", " \"Regex\": \"\\|train_off_loss\\\\s*(\\\\S+).*\"\n", " },\n", " {\n", " \"Name\": \"val_loss\",\n", " \"Regex\": \"\\|val_loss\\\\s*(\\\\S+).*\"\n", " },\n", " {\n", " \"Name\": \"val_hm_loss\",\n", " \"Regex\": \"\\|val_hm_loss\\\\s*(\\\\S+).*\"\n", " },\n", " {\n", " \"Name\": \"val_wh_loss\",\n", " \"Regex\": \"\\|val_wh_loss\\\\s*(\\\\S+).*\"\n", " },\n", " {\n", " \"Name\": \"val_id_loss\",\n", " \"Regex\": \"\\|val_id_loss\\\\s*(\\\\S+).*\"\n", " }, \n", " {\n", " \"Name\": \"val_off_loss\",\n", " \"Regex\": \"\\|val_off_loss\\\\s*(\\\\S+).*\"\n", " }\n", " ],\n", " \"TrainingImage\": training_image,\n", " \"TrainingInputMode\": \"File\"\n", " },\n", " \"InputDataConfig\": [\n", " {\n", " \"ChannelName\": \"train\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": s3train,\n", " \"S3DataDistributionType\": \"FullyReplicated\"\n", " }\n", " },\n", " \"CompressionType\": \"None\",\n", " \"RecordWrapperType\": \"None\"\n", " }\n", " ],\n", " \"OutputDataConfig\": {\n", " \"S3OutputPath\": s3_output_location\n", " },\n", " \"ResourceConfig\": {\n", " \"InstanceCount\": 1,\n", " \"InstanceType\": \"ml.p3.16xlarge\",\n", " \"VolumeSizeInGB\": 100\n", " },\n", " \"RoleArn\": role,\n", " \"StaticHyperParameters\": {\n", " \"num_epochs\":\"20\",\n", " \"val_intervals\":\"1\",\n", " \"batch_size\":\"8\",\n", " \"load_model\": 'fairmot_dla34.pth',\n", " \"data_name\": \"MOT17\"\n", " \n", " },\n", " \"StoppingCondition\": {\n", " \"MaxRuntimeInSeconds\": 72000\n", " }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we launch the defined hyperparameter tuning job." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "smclient = boto3.client('sagemaker')\n", "smclient.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name,\n", " HyperParameterTuningJobConfig = tuning_job_config,\n", " TrainingJobDefinition = training_job_definition)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "smclient.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name)['HyperParameterTuningJobStatus']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the status of the hyperparamter tuning job in the `Hyperparameter tuning jobs`console." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.core.display import display, HTML\n", "\n", "display(\n", " HTML(\n", " f'Check hyperparameter tuning job'\n", " )\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p39", "language": "python", "name": "conda_pytorch_p39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 4 }