{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prepare the ground truth object detection labeling output for training\n",
    "\n",
    "This notebook walks you through the steps we have taken to process the object detection label output from Ground Truth to prepare it for model training in SageMaker. \n",
    "\n",
    "1. [Join together outputs from multiple labeling jobs](#join_output)\n",
    "1. [Filter out labels that did not meet our quality bar](#filter_bad_labels)\n",
    "1. [Inject class labels (if you didn't have the Ground Truth workers pick classes)](#inject_class)\n",
    "1. [Split train/validation data](#split_train)\n",
    "1. [data augmentation](#data_aug)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "BUCKET = '<please replace with your s3 bucket name>'\n",
    "JOB_NAME = 'demo' "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import dependencies and define helper functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import random\n",
    "import os, shutil\n",
    "import json\n",
    "import boto3\n",
    "import botocore\n",
    "import sagemaker"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "sagemaker_client = boto3.client('sagemaker')\n",
    "\n",
    "def make_tmp_folder(folder_name):\n",
    "    try:\n",
    "        os.makedirs(folder_name, exist_ok=False)\n",
    "    except FileExistsError:\n",
    "        print(\"{} folder already exists\".format(folder_name))\n",
    "        \n",
    "def read_manifest_file(file_path):\n",
    "    with open(file_path, 'r') as f:\n",
    "        output = [json.loads(line.strip()) for line in f.readlines()]\n",
    "        return output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Specify the Ground Truth labeling job id(s) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "## if using your own Ground Truth labeling job, replace below with appropriate job IDs\n",
    "LABEL_JOB_IDS = [\n",
    "    'blue-box-small-job-public', \n",
    "    'yellow-box-small-job-public', \n",
    "    'blue-box-large-job-public', \n",
    "    'yellow-box-large-job-public']\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tmp folder already exists\n"
     ]
    }
   ],
   "source": [
    "TMP_FOLDER_NAME = 'tmp'\n",
    "make_tmp_folder(TMP_FOLDER_NAME)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Join outputs from multiple jobs <a id='join_output'></a>\n",
    "\n",
    "To be able to iterate on Ground Truth jobs, we created several smaller labeling jobs for our dataset instead of a single large job containing the full dataset. \n",
    "\n",
    "The below code takes one or more Ground Truth job IDs, download the output (Augmented Manifest File format) and join them together into one array for manipulation "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "download: s3://gg-sagemaker-object-detection-blog/ground-truth-output/blue-box-small-job-public.output.manifest to tmp/blue-box-small-job-public-blue-box-small-job-public.output.manifest\n",
      "loaded 21 lines from tmp/blue-box-small-job-public-blue-box-small-job-public.output.manifest\n",
      "download: s3://gg-sagemaker-object-detection-blog/ground-truth-output/yellow-box-small-job-public.output.manifest to tmp/yellow-box-small-job-public-yellow-box-small-job-public.output.manifest\n",
      "loaded 32 lines from tmp/yellow-box-small-job-public-yellow-box-small-job-public.output.manifest\n",
      "download: s3://gg-sagemaker-object-detection-blog/ground-truth-output/blue-box-large-job-public.output.manifest to tmp/blue-box-large-job-public-blue-box-large-job-public.output.manifest\n",
      "loaded 624 lines from tmp/blue-box-large-job-public-blue-box-large-job-public.output.manifest\n",
      "download: s3://gg-sagemaker-object-detection-blog/ground-truth-output/yellow-box-large-job-public.output.manifest to tmp/yellow-box-large-job-public-yellow-box-large-job-public.output.manifest\n",
      "loaded 633 lines from tmp/yellow-box-large-job-public-yellow-box-large-job-public.output.manifest\n",
      "loaded total of 1310 lines\n"
     ]
    }
   ],
   "source": [
    "joined_outputs = []\n",
    "\n",
    "def get_output_manifest_s3_uri(label_job_id):\n",
    "    # below code uses label outputs from our sample dataset\n",
    "    return f's3://gg-sagemaker-object-detection-blog/ground-truth-output/{label_job_id}.output.manifest'\n",
    "    # uncomment below if you are using your own Ground Truth labeling job \n",
    "    # return sagemaker_client.describe_labeling_job(LabelingJobName=label_job_id)['LabelingJobOutput']['OutputDatasetS3Uri']\n",
    "\n",
    "for label_job_id in LABEL_JOB_IDS: \n",
    "    output_manifest_s3_uri = get_output_manifest_s3_uri(label_job_id)\n",
    "    output_manifest_fname = \"{}-{}\".format(label_job_id, os.path.split(output_manifest_s3_uri)[1])\n",
    "    !aws s3 cp $output_manifest_s3_uri $TMP_FOLDER_NAME/$output_manifest_fname\n",
    "    output_manifest_local_path = os.path.join(TMP_FOLDER_NAME, output_manifest_fname)\n",
    "    output_manifest_lines = read_manifest_file(output_manifest_local_path)\n",
    "    print(\"loaded {} lines from {}\".format(len(output_manifest_lines), output_manifest_local_path))\n",
    "    joined_outputs += output_manifest_lines\n",
    "    \n",
    "print(\"loaded total of {} lines\".format(len(joined_outputs)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'source-ref': 's3://gg-sagemaker-object-detection-blog/frames/blue_box_1/blue_box_1_000037.jpg',\n",
       " 'color': 'blue',\n",
       " 'object': 'box',\n",
       " 'bb': {'annotations': [{'class_id': 0,\n",
       "    'width': 543,\n",
       "    'top': 570,\n",
       "    'height': 508,\n",
       "    'left': 358}],\n",
       "  'image_size': [{'width': 1920, 'depth': 3, 'height': 1080}]},\n",
       " 'bb-metadata': {'job-name': 'labeling-job/blue-box-small-job-public',\n",
       "  'class-map': {'0': 'storage box'},\n",
       "  'human-annotated': 'yes',\n",
       "  'objects': [{'confidence': 0.09}],\n",
       "  'creation-date': '2019-05-21T21:25:24.736610',\n",
       "  'type': 'groundtruth/object-detection'}}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "joined_outputs[15]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'source-ref': 's3://gg-sagemaker-object-detection-blog/frames/yellow_box_2/yellow_box_2_000312.jpg',\n",
       " 'color': 'yellow',\n",
       " 'object': 'box',\n",
       " 'bb': {'annotations': [{'class_id': 0,\n",
       "    'width': 469,\n",
       "    'top': 511,\n",
       "    'height': 569,\n",
       "    'left': 684}],\n",
       "  'image_size': [{'width': 1920, 'depth': 3, 'height': 1080}]},\n",
       " 'bb-metadata': {'job-name': 'labeling-job/yellow-box-large-job-public',\n",
       "  'class-map': {'0': 'storage box'},\n",
       "  'human-annotated': 'yes',\n",
       "  'objects': [{'confidence': 0.09}],\n",
       "  'creation-date': '2019-05-21T20:11:49.119720',\n",
       "  'type': 'groundtruth/object-detection'}}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "joined_outputs[-15]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Discard any bad labels from visual inspection <a id=\"filter_bad_labels\"></a>\n",
    "\n",
    "you may manually review the labeled bounding boxes on the Ground Truth console and mark the image IDs that didn't pass a quality bar "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "TO_DISCARD = set([\n",
    "    'blue_box_1_000023',\n",
    "    'blue_box_1_000152',\n",
    "    'blue_box_2_000292',\n",
    "    'yellow_box_2_000193',\n",
    "    'yellow_box_2_000204',\n",
    "    'yellow_box_2_000205'\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "filtered out 6 labels. 1304 labels remains\n"
     ]
    }
   ],
   "source": [
    "filtered_manifest = []\n",
    "count_filtered = 0\n",
    "for line in joined_outputs:\n",
    "    filename= os.path.split(line[\"source-ref\"])[1]\n",
    "    imageid = os.path.splitext(filename)[0]\n",
    "    if imageid not in TO_DISCARD:\n",
    "        filtered_manifest.append(line)\n",
    "    else:\n",
    "        count_filtered+=1\n",
    "        \n",
    "print(\"filtered out {} labels. {} labels remains\".format(count_filtered, len(filtered_manifest)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'source-ref': 's3://gg-sagemaker-object-detection-blog/frames/blue_box_1/blue_box_1_000025.jpg',\n",
       " 'color': 'blue',\n",
       " 'object': 'box',\n",
       " 'bb': {'annotations': [{'class_id': 0,\n",
       "    'width': 324,\n",
       "    'top': 986,\n",
       "    'height': 94,\n",
       "    'left': 229}],\n",
       "  'image_size': [{'width': 1920, 'depth': 3, 'height': 1080}]},\n",
       " 'bb-metadata': {'job-name': 'labeling-job/blue-box-small-job-public',\n",
       "  'class-map': {'0': 'storage box'},\n",
       "  'human-annotated': 'yes',\n",
       "  'objects': [{'confidence': 0.09}],\n",
       "  'creation-date': '2019-05-21T21:25:57.486929',\n",
       "  'type': 'groundtruth/object-detection'}}"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "## example entry\n",
    "filtered_manifest[2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Inject class labels from metadata <a id=\"inject_class\"></a>\n",
    "\n",
    "As you can see from the examples above, because we didn't ask the Ground Truth workers to classify the object they are labeling, all the annotations say `'class_id': 0`, regardless of what object it actually is\n",
    "\n",
    "We can use the metadata that we injected into the manifest (`color` and `object` field) to insert the correct class ID "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "NEW_CLASS_MAP = {\"blue box\": 0 , \"yellow box\": 1}\n",
    "REVERSE_CLASS_MAP =  { '0': \"blue box\" , \"1\": \"yellow box\"}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "classified_manifest = []\n",
    "for line in filtered_manifest:\n",
    "    if line[\"object\"] == \"box\":\n",
    "        transformed_line = line.copy()\n",
    "        annotations = line['bb']['annotations']\n",
    "        new_annotations = []\n",
    "        if line[\"color\"] == \"blue\":\n",
    "            for annotation in annotations:\n",
    "                annotation[\"class_id\"] = NEW_CLASS_MAP[\"blue box\"]\n",
    "                new_annotations.append(annotation)\n",
    "        elif line[\"color\"] == \"yellow\":\n",
    "            for annotation in annotations:\n",
    "                annotation[\"class_id\"] = NEW_CLASS_MAP[\"yellow box\"]\n",
    "                new_annotations.append(annotation)\n",
    "        transformed_line['bb']['annotations'] = new_annotations\n",
    "        transformed_line['bb-metadata']['class-map'] = REVERSE_CLASS_MAP\n",
    "\n",
    "        classified_manifest.append(transformed_line)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'source-ref': 's3://gg-sagemaker-object-detection-blog/frames/blue_box_1/blue_box_1_000038.jpg',\n",
       " 'color': 'blue',\n",
       " 'object': 'box',\n",
       " 'bb': {'annotations': [{'class_id': 0,\n",
       "    'width': 599,\n",
       "    'top': 579,\n",
       "    'height': 501,\n",
       "    'left': 331}],\n",
       "  'image_size': [{'width': 1920, 'depth': 3, 'height': 1080}]},\n",
       " 'bb-metadata': {'job-name': 'labeling-job/blue-box-small-job-public',\n",
       "  'class-map': {'0': 'blue box', '1': 'yellow box'},\n",
       "  'human-annotated': 'yes',\n",
       "  'objects': [{'confidence': 0.29}],\n",
       "  'creation-date': '2019-05-21T21:28:05.367484',\n",
       "  'type': 'groundtruth/object-detection'}}"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "classified_manifest[15]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'source-ref': 's3://gg-sagemaker-object-detection-blog/frames/yellow_box_2/yellow_box_2_000312.jpg',\n",
       " 'color': 'yellow',\n",
       " 'object': 'box',\n",
       " 'bb': {'annotations': [{'class_id': 1,\n",
       "    'width': 469,\n",
       "    'top': 511,\n",
       "    'height': 569,\n",
       "    'left': 684}],\n",
       "  'image_size': [{'width': 1920, 'depth': 3, 'height': 1080}]},\n",
       " 'bb-metadata': {'job-name': 'labeling-job/yellow-box-large-job-public',\n",
       "  'class-map': {'0': 'blue box', '1': 'yellow box'},\n",
       "  'human-annotated': 'yes',\n",
       "  'objects': [{'confidence': 0.09}],\n",
       "  'creation-date': '2019-05-21T20:11:49.119720',\n",
       "  'type': 'groundtruth/object-detection'}}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "classified_manifest[-15]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Split dataset between train and validation <a id='split_train'></a>\n",
    "\n",
    "SageMaker requires two datasets during training: train and validation dataset. The training set consists of the images and annotations you want to actually train the model with. The validation set is not used for training but used to “validate” that each training pass is improving the accuracy of the model and compare accuracy between different training jobs during hyper-parameter tuning. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train_validation_split(labels, split_factor=0.9):\n",
    "    np.random.shuffle(labels)\n",
    "\n",
    "    dataset_size = len(labels)\n",
    "    train_test_split_index = round(dataset_size*split_factor)\n",
    "\n",
    "    train_data = labels[:train_test_split_index]\n",
    "    validation_data = labels[train_test_split_index:]\n",
    "    return train_data, validation_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "training data size:1174\n",
      "validation data size:130\n"
     ]
    }
   ],
   "source": [
    "train_data, validation_data = train_validation_split(np.array(classified_manifest), split_factor=0.9)\n",
    "\n",
    "print(\"training data size:{}\\nvalidation data size:{}\".format(train_data.shape[0], validation_data.shape[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(os.path.join(TMP_FOLDER_NAME, 'train.manifest'), 'w') as f:\n",
    "    for line in train_data:\n",
    "        f.write(json.dumps(line))\n",
    "        f.write('\\n')\n",
    "    \n",
    "with open(os.path.join(TMP_FOLDER_NAME,'validation.manifest'), 'w') as f:\n",
    "    for line in validation_data:\n",
    "        f.write(json.dumps(line))\n",
    "        f.write('\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1174 tmp/train.manifest\n",
      "130 tmp/validation.manifest\n"
     ]
    }
   ],
   "source": [
    "!wc -l $TMP_FOLDER_NAME/train.manifest\n",
    "!wc -l $TMP_FOLDER_NAME/validation.manifest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "upload: tmp/train.manifest to s3://angelaw-test-sagemaker-blog/training-manifest/demo/train.manifest\n",
      "upload: tmp/validation.manifest to s3://angelaw-test-sagemaker-blog/training-manifest/demo/validation.manifest\n"
     ]
    }
   ],
   "source": [
    "!aws s3 cp $TMP_FOLDER_NAME/train.manifest s3://$BUCKET/training-manifest/$JOB_NAME/train.manifest\n",
    "!aws s3 cp $TMP_FOLDER_NAME/validation.manifest s3://$BUCKET/training-manifest/$JOB_NAME/validation.manifest"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Data augmentation (optional) <a id='data_aug'></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "%run ./flip_images.py -m s3://$BUCKET/training-manifest/$JOB_NAME/train.manifest -d $TMP_FOLDER_NAME -b $BUCKET"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:__main__:working directory: tmp\n",
      "INFO:__main__:wrote 1174 lines to tmp/x-flipped.json\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1174\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:common_utils:wrote to s3://angelaw-test-sagemaker-blog/demo/x-flipped.json\n",
      "INFO:__main__:uploaded tmp/x-flipped.json to s3://angelaw-test-sagemaker-blog/demo\n",
      "INFO:__main__:wrote 1174 lines to tmp/y-flipped.json\n",
      "INFO:common_utils:wrote to s3://angelaw-test-sagemaker-blog/demo/y-flipped.json\n",
      "INFO:__main__:uploaded tmp/y-flipped.json to s3://angelaw-test-sagemaker-blog/demo\n",
      "INFO:__main__:wrote 1174 lines to tmp/ccw_rotated.json\n",
      "INFO:common_utils:wrote to s3://angelaw-test-sagemaker-blog/demo/ccw_rotated.json\n",
      "INFO:__main__:uploaded tmp/ccw_rotated.json to s3://angelaw-test-sagemaker-blog/demo\n",
      "INFO:__main__:wrote 1174 lines to tmp/cw_rotated.json\n",
      "INFO:common_utils:wrote to s3://angelaw-test-sagemaker-blog/demo/cw_rotated.json\n",
      "INFO:__main__:uploaded tmp/cw_rotated.json to s3://angelaw-test-sagemaker-blog/demo\n",
      "INFO:__main__:wrote 5870 lines to tmp/all_augmented.json\n",
      "INFO:common_utils:wrote to s3://angelaw-test-sagemaker-blog/demo/all_augmented.json\n",
      "INFO:__main__:uploaded tmp/all_augmented.json to s3://angelaw-test-sagemaker-blog/demo\n"
     ]
    }
   ],
   "source": [
    "%run ./flip_annotations.py -m s3://$BUCKET/training-manifest/$JOB_NAME/train.manifest -d $TMP_FOLDER_NAME -p $JOB_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Next step\n",
    "\n",
    "Now we are ready to start training jobs! Move on to the [next notebook](./02_sagemaker_training_API.ipynb) to submit a sagemaker training job to train our custom object detection model!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}