{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Detectron2 on SKU-110K dataset\n", "\n", "**Index**\n", "\n", "1. [Background](#Background)\n", "1. [Setup](#Setup)\n", "1. [Data](#Data)\n", "1. [Training](#Training)\n", "1. [Hyperparameter Tuning Jobs](#HPO)\n", "1. [Deploy: Batch Transform](#Deploy)\n", "1. [Visualization](#Visualization)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Background\n", "\n", "Detectron2 is a Computer Vision framework which implements Object Detection algorithms. It is developed by Facebook AI Research team. While its ancestor, Detectron, was completely written in Caffe, Detecton2 was refactored in PyTorch to enable fast experiments and iterations from. Detectron2 has a rich model zoo that contains State-of-the-Art models for object detection, semantic segmentation and pose estimation, to cite a few. A modular design makes Detectron2 easily extensible, and, hence, cutting-edge research projects can be implemented on top of it. \n", "\n", "We use Detectron2 to train and evaluate models on the [SKU110k-dataset](https://github.com/eg4000/SKU110K_CVPR19). This open source dataset contains images of retail shelves. Each image contains about 150 objects, which makes it suitable to test dense scene object detection algortihms. Bounding boxes are associated with SKUs without distinguishing between categories of product.\n", "\n", "In this noteboook we use Object Detection models from Detectron2's model zoo. We then leverage Amazon SageMaker ML platform to finetune pre-trained models on SKU110k dataset and deploy trained model for inference." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "#### Precondition\n", "If you are executing this notebook using Sagemaker Notebook instance or Sagemaker Studio instance, please make sure that it has IAM role used with `AmazonSageMakerFullAccess` policy.\n", "\n", "We start by importing required Python libraries and configuring some common parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import sagemaker\n", "\n", "assert (\n", " sagemaker.__version__.split(\".\")[0] == \"2\"\n", "), \"Please upgrade SageMaker Python SDK to version 2\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bucket = \"FILL WITH UNIQUE BUCKET NAME\" # TODO: update this value\n", "prefix_data = \"detectron2/data\"\n", "prefix_model = \"detectron2/training_artefacts\"\n", "prefix_code = \"detectron2/model\"\n", "prefix_predictions = \"detectron2/predictions\"\n", "local_folder = \"cache\" # cache folder used to store downloaded data - not versioned\n", "\n", "\n", "sm_session = sagemaker.Session(default_bucket=bucket)\n", "role = sagemaker.get_execution_role()\n", "region = sm_session.boto_region_name\n", "account = sm_session.account_id()\n", "\n", "# if bucket doesn't exist, create one\n", "s3_resource = boto3.resource(\"s3\")\n", "if not s3_resource.Bucket(bucket) in s3_resource.buckets.all():\n", " s3_resource.create_bucket(\n", " Bucket=bucket, CreateBucketConfiguration={\"LocationConstraint\": region}\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset Preparation\n", "\n", "To prepare SKU110K for training, we need to do following:\n", "* download and unzip SKU-110K dataset;\n", "* split images into three channels (training, validation and test) according to the filename prefix;\n", "* remove images (and the corresponding 
annotations) that are corrupted, i.e. cannot be loaded by PIL.Image.load();\n", "* upload image channels to the S3 bucket;\n", "* reorganize annotations into augmented manifest files and upload these files to S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import os\n", "import tarfile\n", "import tempfile\n", "from datetime import datetime\n", "from pathlib import Path\n", "from typing import Mapping, Optional, Sequence\n", "from urllib import request\n", "\n", "import boto3\n", "import numpy as np\n", "import pandas as pd\n", "from tqdm import tqdm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download SKU-110K dataset\n", "\n", "The total size of the unzipped dataset is 12.2 GB. Please make sure to set the volume size of your notebook instance accordingly. We suggest a volume size equal to 30 GB.\n", "\n", "⚠️ dataset download and extraction will take ~15-20 minutes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! wget -P cache http://trax-geometry.s3.amazonaws.com/cvpr_challenge/SKU110K_fixed.tar.gz" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sku_dataset_dirname = \"SKU110K_fixed\"\n", "assert Path(\n", " local_folder\n", ").exists(), f\"Set wget directory-prefix to {local_folder} in the previous cell\"\n", "\n", "\n", "def track_progress(members):\n", " i = 0\n", " for member in members:\n", " if i % 100 == 0:\n", " print(\".\", end=\"\")\n", " i += 1\n", " yield member\n", "\n", "\n", "if not (Path(local_folder) / sku_dataset_dirname).exists():\n", " compressed_file = tarfile.open(\n", " name=os.path.join(local_folder, sku_dataset_dirname + \".tar.gz\")\n", " )\n", " compressed_file.extractall(\n", " path=local_folder, members=track_progress(compressed_file)\n", " )\n", "else:\n", " print(f\"Using the data in `{local_folder}` folder\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reorganize images" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path_images = Path(local_folder) / sku_dataset_dirname / \"images\"\n", "assert path_images.exists(), f\"{path_images} not found\"\n", "\n", "prefix_to_channel = {\n", " \"train\": \"training\",\n", " \"val\": \"validation\",\n", " \"test\": \"test\",\n", "}\n", "for channel_name in prefix_to_channel.values():\n", " if not (path_images.parent / channel_name).exists():\n", " (path_images.parent / channel_name).mkdir()\n", "\n", "for path_img in path_images.iterdir():\n", " for prefix in prefix_to_channel:\n", " if path_img.name.startswith(prefix):\n", " path_img.replace(\n", " path_images.parent / prefix_to_channel[prefix] / path_img.name\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Detectron2 uses Pillow library to read images. We found out that some images in the SKU dataset are corrupted, which causes the dataloader to raise an IOError exception. Therefore, we remove them from the dataset. 
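" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The optional cell below sketches one way such unreadable images can be detected with Pillow (it is not required for the rest of the notebook, and scanning every channel can take a while). It can be used to double-check the hard-coded `CORRUPTED_IMAGES` list defined in the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch: detect images that Pillow cannot fully decode\n", "from PIL import Image\n", "\n", "\n", "def find_unreadable_images(folder):\n", "    unreadable = []\n", "    for img_path in folder.glob(\"*.jpg\"):\n", "        try:\n", "            with Image.open(img_path) as img:\n", "                img.load()  # force a full decode, as the dataloader would\n", "        except OSError:\n", "            unreadable.append(img_path.name)\n", "    return unreadable\n", "\n", "\n", "# Uncomment to scan every channel (slow):\n", "# for channel_name in prefix_to_channel.values():\n", "#     print(channel_name, find_unreadable_images(path_images.parent / channel_name))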
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "CORRUPTED_IMAGES = {\n", " \"training\": (\"train_4222.jpg\", \"train_5822.jpg\", \"train_882.jpg\", \"train_924.jpg\"),\n", " \"validation\": tuple(),\n", " \"test\": (\"test_274.jpg\", \"test_2924.jpg\"),\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for channel_name in prefix_to_channel.values():\n", " for img_name in CORRUPTED_IMAGES[channel_name]:\n", " try:\n", " (path_images.parent / channel_name / img_name).unlink()\n", " print(f\"{img_name} removed from channel {channel_name} \")\n", " except FileNotFoundError:\n", " print(f\"{img_name} not in channel {channel_name}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for channel_name in prefix_to_channel.values():\n", " print(\n", " f\"Number of {channel_name} images = {sum(1 for x in (path_images.parent / channel_name).glob('*.jpg'))}\"\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Upload dataset to S3. ⚠️ this operation will take some time (~10-15 minutes)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "channel_to_s3_imgs = {}\n", "\n", "for channel_name in prefix_to_channel.values():\n", " inputs = sm_session.upload_data(\n", " path=str(path_images.parent / channel_name),\n", " bucket=bucket,\n", " key_prefix=f\"{prefix_data}/{channel_name}\",\n", " )\n", " print(f\"{channel_name} images uploaded to {inputs}\")\n", " channel_to_s3_imgs[channel_name] = inputs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reorganise annotations\n", "\n", "The annotations in SKU-110K dataset are stored in csv files. They are here reorganised into [augmented manifest files](https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html). See SageMaker documentation for specification on [bounding box annotations](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-data-output.html#sms-output-box)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_annotation_channel(\n", " channel_id: str,\n", " path_to_annotation: Path,\n", " bucket_name: str,\n", " data_prefix: str,\n", " img_annotation_to_ignore: Optional[Sequence[str]] = None,\n", ") -> Sequence[Mapping]:\n", " r\"\"\"Change format from original to augmented manifest files\n", "\n", " Parameters\n", " ----------\n", " channel_id : str\n", " name of the channel, i.e. training, validation or test\n", " path_to_annotation : Path\n", " path to annotation file\n", " bucket_name : str\n", " bucket where the data are uploaded\n", " data_prefix : str\n", " bucket prefix\n", " img_annotation_to_ignore : Optional[Sequence[str]]\n", " annotation from these images are ignore because the corresponding images are corrupted, default to None\n", "\n", " Returns\n", " -------\n", " Sequence[Mapping]\n", " List of json lines, each lines contains the annotations for a single. This recreates the\n", " format of augmented manifest files that are generated by Amazon SageMaker GroundTruth\n", " labeling jobs\n", " \"\"\"\n", " if channel_id not in (\"training\", \"validation\", \"test\"):\n", " raise ValueError(\n", " f\"Channel identifier must be training, validation or test. 
The passed value is {channel_id}\"\n", "        )\n", "    if not path_to_annotation.exists():\n", "        raise FileNotFoundError(f\"Annotation file {path_to_annotation} not found\")\n", "\n", "    df_annotation = pd.read_csv(\n", "        path_to_annotation,\n", "        header=0,\n", "        names=(\n", "            \"image_name\",\n", "            \"x1\",\n", "            \"y1\",\n", "            \"x2\",\n", "            \"y2\",\n", "            \"class\",\n", "            \"image_width\",\n", "            \"image_height\",\n", "        ),\n", "    )\n", "\n", "    df_annotation[\"left\"] = df_annotation[\"x1\"]\n", "    df_annotation[\"top\"] = df_annotation[\"y1\"]\n", "    df_annotation[\"width\"] = df_annotation[\"x2\"] - df_annotation[\"x1\"]\n", "    df_annotation[\"height\"] = df_annotation[\"y2\"] - df_annotation[\"y1\"]\n", "    df_annotation.drop(columns=[\"x1\", \"x2\", \"y1\", \"y2\"], inplace=True)\n", "\n", "    jsonlines = []\n", "    for img_id in df_annotation[\"image_name\"].unique():\n", "        if img_annotation_to_ignore and img_id in img_annotation_to_ignore:\n", "            print(\n", "                f\"Annotations for image {img_id} are neglected as the image is corrupted\"\n", "            )\n", "            continue\n", "        img_annotations = df_annotation.loc[df_annotation[\"image_name\"] == img_id, :]\n", "        annotations = []\n", "        for (\n", "            _,\n", "            _,\n", "            img_width,\n", "            img_height,\n", "            bbox_l,\n", "            bbox_t,\n", "            bbox_w,\n", "            bbox_h,\n", "        ) in img_annotations.itertuples(index=False):\n", "            annotations.append(\n", "                {\n", "                    \"class_id\": 0,\n", "                    \"width\": bbox_w,\n", "                    \"top\": bbox_t,\n", "                    \"left\": bbox_l,\n", "                    \"height\": bbox_h,\n", "                }\n", "            )\n", "        jsonline = {\n", "            \"sku\": {\n", "                \"annotations\": annotations,\n", "                \"image_size\": [{\"width\": img_width, \"depth\": 3, \"height\": img_height}],\n", "            },\n", "            \"sku-metadata\": {\n", "                \"job_name\": f\"labeling-job/sku-110k-{channel_id}\",\n", "                \"class-map\": {\"0\": \"SKU\"},\n", "                \"human-annotated\": \"yes\",\n", "                \"objects\": len(annotations) * [{\"confidence\": 0.0}],\n", "                \"type\": \"groundtruth/object-detection\",\n", "                \"creation-date\": datetime.now()\n", "                .replace(second=0, microsecond=0)\n", "                .isoformat(),\n", "            },\n", "            \"source-ref\": f\"s3://{bucket_name}/{data_prefix}/{channel_id}/{img_id}\",\n", "        }\n", "        jsonlines.append(jsonline)\n", "    return jsonlines" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "annotation_folder = Path(local_folder) / sku_dataset_dirname / \"annotations\"\n", "channel_to_annotation_path = {\n", "    \"training\": annotation_folder / \"annotations_train.csv\",\n", "    \"validation\": annotation_folder / \"annotations_val.csv\",\n", "    \"test\": annotation_folder / \"annotations_test.csv\",\n", "}\n", "channel_to_annotation = {}\n", "\n", "for channel in channel_to_annotation_path:\n", "    annotations = create_annotation_channel(\n", "        channel,\n", "        channel_to_annotation_path[channel],\n", "        bucket,\n", "        prefix_data,\n", "        CORRUPTED_IMAGES[channel],\n", "    )\n", "    print(f\"Number of {channel} annotations: {len(annotations)}\")\n", "    channel_to_annotation[channel] = annotations" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def upload_annotations(p_annotations, p_channel: str):\n", "    rsc_bucket = boto3.resource(\"s3\").Bucket(bucket)\n", "\n", "    json_lines = [json.dumps(elem) for elem in p_annotations]\n", "    to_write = \"\\n\".join(json_lines)\n", "\n", "    with tempfile.NamedTemporaryFile(mode=\"w\") as fid:\n", "        fid.write(to_write)\n", "        fid.flush()  # make sure the manifest is fully written to disk before uploading\n", "        rsc_bucket.upload_file(\n", "            fid.name, f\"{prefix_data}/annotations/{p_channel}.manifest\"\n", "        )" ] }, { "cell_type": 
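"markdown", "metadata": {}, "source": [ "Before uploading, it is worth inspecting a single line of the augmented manifest built above (here from the validation channel) to see the structure the annotation channel will contain:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print one augmented-manifest line to inspect its structure\n", "print(json.dumps(channel_to_annotation[\"validation\"][0], indent=2))" ] }, { "cell_type": 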
"code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for channel_id, annotations in channel_to_annotation.items():\n", " upload_annotations(annotations, channel_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check on expected number of images in training, validation and test sets, so that any failures on upload or preprocessing are caught before user starts training" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "channel_to_expected_size = {\n", " \"training\": 8215,\n", " \"validation\": 588,\n", " \"test\": 2934,\n", "}\n", "\n", "prefix_data = \"detectron2/data\"\n", "bucket_rsr = boto3.resource(\"s3\").Bucket(bucket)\n", "for channel_name, exp_nb in channel_to_expected_size.items():\n", " nb_objs = len(\n", " list(bucket_rsr.objects.filter(Prefix=f\"{prefix_data}/{channel_name}\"))\n", " )\n", " assert (\n", " nb_objs == exp_nb\n", " ), f\"The {channel_name} set should have {exp_nb} images but it contains {nb_objs} images\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training using Amazon SageMaker \n", "\n", "To run training job on SageMaker we will:\n", "* build training container and push it to Amazon Elastic Container Registry (\"ECR\"), container includes all runtime dependencies and training script;\n", "* define training job configuration which includes training cluster configuration and model hyperparameters;\n", "* schedule training job, observe its progress.\n", "\n", "\n", "### Building training container\n", "Before we can build training container, we need to authethicate in shared ECR repo to retrieve Pytorch base image and in private ECR repository. Enter your region and account id below, and then execute the following cell to do it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "\n", "REGION=YOUR_REGION\n", "ACCOUNT=YOUR_ACCOUNT_ID\n", "\n", "aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin 763104351884.dkr.ecr.$REGION.amazonaws.com\n", "# loging to your private ECR\n", "aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT.dkr.ecr.$REGION.amazonaws.com" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our build container uses AWS-authored Pytorch container as a base image. We extend base image with Detecton2 dependencies and copy training script. Execute cell below to review Dockerfile content." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "\n", "# execute this cell to review Docker container\n", "pygmentize -l docker Dockerfile.sku110ktraining" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we build the Docker container locally and then push it to ECR repository, so SageMaker can deploy this container on compute nodes at training time. Run command bellow to build and push container. The size of the resulting Docker image is approximately 5GB." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "./build_and_push.sh sagemaker-d2-train-sku110k latest Dockerfile.sku110ktraining" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure SageMaker training job\n", "\n", "Configuration includes following components:\n", "* data configuration defines where train/test/val datasets are stored;\n", "* container configuration;\n", "* model hyperparameters;\n", "* training job parameters such as size of cluster and instance type, metrics to monitor, etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "import boto3\n", "from sagemaker.estimator import Estimator" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Data configuration\n", "\n", "training_channel = f\"s3://{bucket}/{prefix_data}/training/\"\n", "validation_channel = f\"s3://{bucket}/{prefix_data}/validation/\"\n", "test_channel = f\"s3://{bucket}/{prefix_data}/test/\"\n", "\n", "annotation_channel = f\"s3://{bucket}/{prefix_data}/annotations/\"\n", "\n", "classes = [\n", " \"SKU\",\n", "]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Container configuration\n", "\n", "container_name = \"sagemaker-d2-train-sku110k\"\n", "container_version = \"latest\"\n", "training_image_uri = (\n", " f\"{account}.dkr.ecr.{region}.amazonaws.com/{container_name}:{container_version}\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Metrics to monitor during training, each metric is scraped from container Stdout\n", "\n", "metrics = [\n", " {\"Name\": \"training:loss\", \"Regex\": \"total_loss: ([0-9\\\\.]+)\",},\n", " {\"Name\": \"training:loss_cls\", \"Regex\": \"loss_cls: ([0-9\\\\.]+)\",},\n", " {\"Name\": \"training:loss_box_reg\", \"Regex\": \"loss_box_reg: ([0-9\\\\.]+)\",},\n", " {\"Name\": \"training:loss_rpn_cls\", \"Regex\": \"loss_rpn_cls: ([0-9\\\\.]+)\",},\n", " {\"Name\": \"training:loss_rpn_loc\", \"Regex\": \"loss_rpn_loc: ([0-9\\\\.]+)\",},\n", " {\"Name\": \"validation:loss\", \"Regex\": \"total_val_loss: ([0-9\\\\.]+)\",},\n", " {\"Name\": \"validation:loss_cls\", \"Regex\": \"val_loss_cls: ([0-9\\\\.]+)\",},\n", " {\"Name\": \"validation:loss_box_reg\", \"Regex\": \"val_loss_box_reg: ([0-9\\\\.]+)\",},\n", " {\"Name\": \"validation:loss_rpn_cls\", \"Regex\": \"val_loss_rpn_cls: ([0-9\\\\.]+)\",},\n", " {\"Name\": \"validation:loss_rpn_loc\", \"Regex\": \"val_loss_rpn_loc: ([0-9\\\\.]+)\",},\n", "]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Training instance type\n", "\n", "training_instance = \"ml.p3.8xlarge\"\n", "if training_instance.startswith(\"local\"):\n", " training_session = sagemaker.LocalSession()\n", " training_session.config = {\"local\": {\"local_code\": True}}\n", "else:\n", " training_session = sm_session" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following hyper-parameters are used in the training job. Feel free to change them and experiment." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Model Hyperparameters\n", "\n", "od_algorithm = \"faster_rcnn\" # choose one in (\"faster_rcnn\", \"retinanet\")\n", "training_job_hp = {\n", " # Dataset\n", " \"classes\": json.dumps(classes),\n", " \"dataset-name\": json.dumps(\"sku110k\"),\n", " \"label-name\": json.dumps(\"sku\"),\n", " # Algo specs\n", " \"model-type\": json.dumps(od_algorithm),\n", " \"backbone\": json.dumps(\"R_101_FPN\"),\n", " # Data loader\n", " \"num-iter\": 900,\n", " \"log-period\": 500,\n", " \"batch-size\": 16,\n", " \"num-workers\": 8,\n", " # Optimization\n", " \"lr\": 0.005,\n", " \"lr-schedule\": 3,\n", " # Faster-RCNN specific\n", " \"num-rpn\": 517,\n", " \"bbox-head-pos-fraction\": 0.2,\n", " \"bbox-rpn-pos-fraction\": 0.4,\n", " # Prediction specific\n", " \"nms-thr\": 0.2,\n", " \"pred-thr\": 0.1,\n", " \"det-per-img\": 300,\n", " # Evaluation\n", " \"evaluation-type\": \"fast\",\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compile Sagemaker Training job object and start training\n", "\n", "d2_estimator = Estimator(\n", " image_uri=training_image_uri,\n", " role=role,\n", " sagemaker_session=training_session,\n", " instance_count=2,\n", " instance_type=training_instance,\n", " hyperparameters=training_job_hp,\n", " metric_definitions=metrics,\n", " output_path=f\"s3://{bucket}/{prefix_model}\",\n", " base_job_name=f\"detectron2-{od_algorithm.replace('_', '-')}\",\n", ")\n", "\n", "d2_estimator.fit(\n", " {\n", " \"training\": training_channel,\n", " \"validation\": validation_channel,\n", " \"test\": test_channel,\n", " \"annotation\": annotation_channel,\n", " },\n", " wait=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## HyperParameter Optimization with Amazon SageMaker\n", "\n", "SageMaker SDK comes with the `tuner` module that can be used to search for the optimal hyper-parameters (see more details [here](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)). Let's run several experiment with different model hyperparameters with aim to minize the validation loss. \n", "\n", "`hparams_range` dictionary that defines the hyper-parameters to be optimized. Feel free to modify it. ⚠️ Please note, a tuning job runs multiple training job. Therefore, be aware of the amount of computational resources that a tuner job requires." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.tuner import (\n", " CategoricalParameter,\n", " ContinuousParameter,\n", " HyperparameterTuner,\n", " IntegerParameter,\n", ")\n", "\n", "od_algorithm = \"retinanet\" # choose one in (\"faster_rcnn\", \"retinanet\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hparams_range = {\n", " \"lr\": ContinuousParameter(0.0005, 0.1),\n", "}\n", "if od_algorithm == \"faster_rcnn\":\n", " hparams_range.update(\n", " {\n", " \"bbox-rpn-pos-fraction\": ContinuousParameter(0.1, 0.5),\n", " \"bbox-head-pos-fraction\": ContinuousParameter(0.1, 0.5),\n", " }\n", " )\n", "elif od_algorithm == \"retinanet\":\n", " hparams_range.update(\n", " {\n", " \"focal-loss-gamma\": ContinuousParameter(2.5, 5.0),\n", " \"focal-loss-alpha\": ContinuousParameter(0.3, 1.0),\n", " }\n", " )\n", "else:\n", " assert False, f\"{od_algorithm} not supported\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "obj_metric_name = \"validation:loss\"\n", "obj_type = \"Minimize\"\n", "metric_definitions = [\n", " {\"Name\": \"training:loss\", \"Regex\": \"total_loss: ([0-9\\\\.]+)\",},\n", " {\"Name\": \"training:loss_cls\", \"Regex\": \"loss_cls: ([0-9\\\\.]+)\",},\n", " {\"Name\": \"training:loss_box_reg\", \"Regex\": \"loss_box_reg: ([0-9\\\\.]+)\",},\n", " {\"Name\": obj_metric_name, \"Regex\": \"total_val_loss: ([0-9\\\\.]+)\",},\n", " {\"Name\": \"validation:loss_cls\", \"Regex\": \"val_loss_cls: ([0-9\\\\.]+)\",},\n", " {\"Name\": \"validation:loss_box_reg\", \"Regex\": \"val_loss_box_reg: ([0-9\\\\.]+)\",},\n", "]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fixed_hparams = {\n", " # Dataset\n", " \"classes\": json.dumps(classes),\n", " \"dataset-name\": json.dumps(\"sku110k\"),\n", " \"label-name\": json.dumps(\"sku\"),\n", " # Algo specs\n", " \"model-type\": json.dumps(od_algorithm),\n", " \"backbone\": json.dumps(\"R_101_FPN\"),\n", " # Data loader\n", " \"num-iter\": 9000,\n", " \"log-period\": 500,\n", " \"batch-size\": 16,\n", " \"num-workers\": 8,\n", " # Optimization\n", " \"lr-schedule\": 3,\n", " # Prediction specific\n", " \"nms-thr\": 0.2,\n", " \"pred-thr\": 0.1,\n", " \"det-per-img\": 300,\n", " # Evaluation\n", " \"evaluation-type\": \"fast\",\n", "}\n", "\n", "hpo_estimator = Estimator(\n", " image_uri=training_image_uri,\n", " role=role,\n", " sagemaker_session=sm_session,\n", " instance_count=1,\n", " instance_type=\"ml.p3.8xlarge\",\n", " hyperparameters=fixed_hparams,\n", " output_path=f\"s3://{bucket}/{prefix_model}\",\n", " use_spot_instances=True, # Use spot instances to spare a\n", " max_run=2 * 60 * 60,\n", " max_wait=3 * 60 * 60,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner = HyperparameterTuner(\n", " hpo_estimator,\n", " obj_metric_name,\n", " hparams_range,\n", " metric_definitions,\n", " objective_type=obj_type,\n", " max_jobs=2,\n", " max_parallel_jobs=2,\n", " base_tuning_job_name=f\"hpo-d2-{od_algorithm.replace('_', '-')}\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner.fit(\n", " inputs={\n", " \"training\": training_channel,\n", " \"validation\": validation_channel,\n", " \"test\": test_channel,\n", " \"annotation\": annotation_channel,\n", " },\n", " wait=False,\n", ")" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's review outcomes of HyperParameter search\n", "\n", "hpo_tuning_job_name = tuner.latest_tuning_job.name\n", "bayes_metrics = sagemaker.HyperparameterTuningJobAnalytics(\n", " hpo_tuning_job_name\n", ").dataframe()\n", "bayes_metrics.sort_values([\"FinalObjectiveValue\"], ascending=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Deployment on Amazon SageMaker\n", "\n", "Just like with model training, SageMaker is using containers to run inference. Hence, we start by preparing serving container which will be then deployed with on Amazon SageMaker Hosting platform." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "\n", "# execute this cell to review Docker container\n", "pygmentize -l docker Dockerfile.sku110kserving" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run cell below to build the Docker container defined in the image `Dockerfile.sku110kserving` and push it to ECR. The size of the resulting Docker image is approximately 5GB." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "\n", "./build_and_push.sh sagemaker-d2-serve latest Dockerfile.sku110kserving" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will run batch inference, i.e. running inference against large chunk of images. We use [SageMaker Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html) to do it. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch import PyTorchModel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we assume that a HPO job was executed. We attach the tuning job and fetch the best model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.tuner import HyperparameterTuner\n", "\n", "tuning_job_id = \"Insert tuning job id\"\n", "attached_tuner = HyperparameterTuner.attach(tuning_job_id)\n", "\n", "best_estimator = attached_tuner.best_estimator()\n", "\n", "best_estimator.latest_training_job.describe()\n", "training_job_artifact = best_estimator.latest_training_job.describe()[\"ModelArtifacts\"][\n", " \"S3ModelArtifacts\"\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also specify the S3 URI of model artifact. 
Uncomment the following code if you want to use this option:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# training_job_artifact = \"Your model artifacts\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define parameters of the inference container\n", "\n", "serve_container_name = \"sagemaker-d2-serve\"\n", "serve_container_version = \"latest\"\n", "serve_image_uri = f\"{account}.dkr.ecr.{region}.amazonaws.com/{serve_container_name}:{serve_container_version}\"\n", "\n", "inference_output = f\"s3://{bucket}/{prefix_predictions}/{serve_container_name}/{Path(test_channel).name}_channel/{training_job_artifact.split('/')[-3]}\"\n", "inference_output" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compile SageMaker model object and configure Batch Transform job\n", "\n", "model = PyTorchModel(\n", "    name=\"d2-sku110k-model\",\n", "    model_data=training_job_artifact,\n", "    role=role,\n", "    sagemaker_session=sm_session,\n", "    entry_point=\"predict_sku110k.py\",\n", "    source_dir=\"container_serving\",\n", "    image_uri=serve_image_uri,\n", "    framework_version=\"1.6.0\",\n", "    code_location=f\"s3://{bucket}/{prefix_code}\",\n", ")\n", "\n", "transformer = model.transformer(\n", "    instance_count=1,\n", "    instance_type=\"ml.p3.2xlarge\",  # alternatively \"ml.p2.xlarge\"\n", "    output_path=inference_output,\n", "    max_payload=16,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Start Batch Transform job\n", "\n", "transformer.transform(\n", "    data=test_channel,\n", "    data_type=\"S3Prefix\",\n", "    content_type=\"application/x-image\",\n", "    wait=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualization\n", "\n", "Once our batch inference job is completed, let's visualize the predictions. We'll use a single random image for visualization. Feel free to re-run the final cell multiple times." 
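] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the transform job was started with `wait=False`, the optional cell below blocks until it has finished, so that the prediction files are available in S3 before we read them back:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: wait for the Batch Transform job started above to complete\n", "transformer.wait()"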
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import io\n", "\n", "import matplotlib\n", "import matplotlib.patches as patches\n", "import numpy as np\n", "from matplotlib import pyplot as plt\n", "from PIL import Image" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def key_from_uri(s3_uri: str) -> str:\n", " \"\"\"Get S3 object key from its URI\"\"\"\n", " return \"/\".join(Path(s3_uri).parts[2:])\n", "\n", "\n", "bucket_rsr = boto3.resource(\"s3\").Bucket(bucket)\n", "predict_objs = list(\n", " bucket_rsr.objects.filter(Prefix=key_from_uri(inference_output) + \"/\")\n", ")\n", "img_objs = list(bucket_rsr.objects.filter(Prefix=key_from_uri(test_channel)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "COLORS = [\n", " (0, 200, 0),\n", "]\n", "\n", "\n", "def plot_predictions_on_image(\n", " p_img: np.ndarray, p_preds: Mapping, score_thr: float = 0.5, show=True\n", ") -> plt.Figure:\n", " r\"\"\"Plot bounding boxes predicted by an inference job on the corresponding image\n", "\n", " Parameters\n", " ----------\n", " p_img : np.ndarray\n", " input image used for prediction\n", " p_preds : Mapping\n", " dictionary with bounding boxes, predicted classes and confidence scores\n", " score_thr : float, optional\n", " show bounding boxes whose confidence score is bigger than `score_thr`, by default 0.5\n", " show : bool, optional\n", " show figure if True do not otherwise, by default True\n", "\n", " Returns\n", " -------\n", " plt.Figure\n", " figure handler\n", "\n", " Raises\n", " ------\n", " IOError\n", " If the prediction dictionary `p_preds` does not contain one of the required keys:\n", " `pred_classes`, `pred_boxes` and `scores`\n", " \"\"\"\n", " for required_key in (\"pred_classes\", \"pred_boxes\", \"scores\"):\n", " if required_key not in p_preds:\n", " raise IOError(f\"Missing required key: {required_key}\")\n", "\n", " fig, fig_axis = plt.subplots(1)\n", " fig_axis.imshow(p_img)\n", " for class_id, bbox, score in zip(\n", " p_preds[\"pred_classes\"], p_preds[\"pred_boxes\"], p_preds[\"scores\"]\n", " ):\n", " if score < score_thr:\n", " break # bounding boxes are sorted by confidence score in descending order\n", " rect = patches.Rectangle(\n", " (bbox[0], bbox[1]),\n", " bbox[2] - bbox[0],\n", " bbox[3] - bbox[1],\n", " linewidth=1,\n", " edgecolor=[float(val) / 255 for val in COLORS[class_id]],\n", " facecolor=\"none\",\n", " )\n", " fig_axis.add_patch(rect)\n", " plt.axis(\"off\")\n", " if show:\n", " plt.show()\n", " return fig" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "matplotlib.rcParams[\"figure.dpi\"] = 300\n", "\n", "sample_id = np.random.randint(0, len(img_objs), 1)[0]\n", "\n", "img_obj = img_objs[sample_id]\n", "pred_obj = predict_objs[sample_id]\n", "\n", "img = np.asarray(Image.open(io.BytesIO(img_obj.get()[\"Body\"].read())))\n", "preds = json.loads(pred_obj.get()[\"Body\"].read().decode(\"utf-8\"))\n", "\n", "sample_fig = plot_predictions_on_image(img, preds, 0.40, True)" ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p36", "language": "python", "name": "conda_pytorch_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 }