# Prepare the Ground Truth object detection labeling output for training

This notebook walks through the steps we took to process the object detection labels produced by Ground Truth and prepare them for model training in SageMaker:

1. [Join together outputs from multiple labeling jobs](#join_output)
1. [Filter out labels that did not meet our quality bar](#filter_bad_labels)
1. [Inject class labels (if you didn't have the Ground Truth workers pick classes)](#inject_class)
1. [Split train/validation data](#split_train)
1. [Augment the data (optional)](#data_aug)

## Setup

In [None]:
BUCKET = ''
JOB_NAME = 'demo' 

### Import dependencies and define helper functions

In [1]:
import numpy as np
import random
import os, shutil
import json
import boto3
import botocore
import sagemaker

In [2]:
sagemaker_client = boto3.client('sagemaker')

def make_tmp_folder(folder_name):
    try:
        os.makedirs(folder_name, exist_ok=False)
    except FileExistsError:
        print("{} folder already exists".format(folder_name))

def read_manifest_file(file_path):
    # Each line of a manifest file is one JSON object (JSON Lines format)
    with open(file_path, 'r') as f:
        output = [json.loads(line.strip()) for line in f.readlines()]
    return output

### Specify the Ground Truth labeling job ID(s)

In [3]:
## if using your own Ground Truth labeling job, replace below with appropriate job IDs
LABEL_JOB_IDS = [
 'blue-box-small-job-public', 
 'yellow-box-small-job-public', 
 'blue-box-large-job-public', 
 'yellow-box-large-job-public']


In [4]:
TMP_FOLDER_NAME = 'tmp'
make_tmp_folder(TMP_FOLDER_NAME)


tmp folder already exists


## 1. Join outputs from multiple jobs 

To be able to iterate on Ground Truth jobs, we created several smaller labeling jobs for our dataset instead of a single large job containing the full dataset. 

The code below takes one or more Ground Truth job IDs, downloads each job's output (in augmented manifest file format), and joins the results into a single list for further processing.

In [5]:
joined_outputs = []

def get_output_manifest_s3_uri(label_job_id):
    # the line below uses the label outputs from our sample dataset
    return f's3://gg-sagemaker-object-detection-blog/ground-truth-output/{label_job_id}.output.manifest'
    # if you are using your own Ground Truth labeling job, use the line below instead
    # return sagemaker_client.describe_labeling_job(LabelingJobName=label_job_id)['LabelingJobOutput']['OutputDatasetS3Uri']

for label_job_id in LABEL_JOB_IDS:
    output_manifest_s3_uri = get_output_manifest_s3_uri(label_job_id)
    # prefix the local file name with the job ID so that files from different jobs do not collide
    output_manifest_fname = "{}-{}".format(label_job_id, os.path.split(output_manifest_s3_uri)[1])
    !aws s3 cp $output_manifest_s3_uri $TMP_FOLDER_NAME/$output_manifest_fname
    output_manifest_local_path = os.path.join(TMP_FOLDER_NAME, output_manifest_fname)
    output_manifest_lines = read_manifest_file(output_manifest_local_path)
    print("loaded {} lines from {}".format(len(output_manifest_lines), output_manifest_local_path))
    joined_outputs += output_manifest_lines

print("loaded total of {} lines".format(len(joined_outputs)))

download: s3://gg-sagemaker-object-detection-blog/ground-truth-output/blue-box-small-job-public.output.manifest to tmp/blue-box-small-job-public-blue-box-small-job-public.output.manifest
loaded 21 lines from tmp/blue-box-small-job-public-blue-box-small-job-public.output.manifest
download: s3://gg-sagemaker-object-detection-blog/ground-truth-output/yellow-box-small-job-public.output.manifest to tmp/yellow-box-small-job-public-yellow-box-small-job-public.output.manifest
loaded 32 lines from tmp/yellow-box-small-job-public-yellow-box-small-job-public.output.manifest
download: s3://gg-sagemaker-object-detection-blog/ground-truth-output/blue-box-large-job-public.output.manifest to tmp/blue-box-large-job-public-blue-box-large-job-public.output.manifest
loaded 624 lines from tmp/blue-box-large-job-public-blue-box-large-job-public.output.manifest
download: s3://gg-sagemaker-object-detection-blog/ground-truth-output/yellow-box-large-job-public.output.manifest to tmp/yellow-box-large-job-public-
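The join step can also be sketched without the notebook shell magic, working only on manifests that are already on local disk. The helper name `load_manifests` and the two scratch files are hypothetical and exist only for illustration. Unlike `read_manifest_file`, this version skips blank lines, which is an assumption about what you may find in hand-edited manifests rather than something Ground Truth is known to emit.

```python
import json
import os
import tempfile


def load_manifests(paths):
    """Read several JSON Lines manifests and return one combined list."""
    combined = []
    for path in paths:
        with open(path, "r") as f:
            for raw_line in f:
                raw_line = raw_line.strip()
                if not raw_line:  # tolerate blank or trailing empty lines
                    continue
                combined.append(json.loads(raw_line))
    return combined


# Hypothetical two-job example written to a scratch directory
scratch_dir = tempfile.mkdtemp()
job_a = os.path.join(scratch_dir, "job-a.output.manifest")
job_b = os.path.join(scratch_dir, "job-b.output.manifest")
with open(job_a, "w") as f:
    f.write(json.dumps({"source-ref": "s3://example-bucket/a_000001.jpg"}) + "\n\n")
with open(job_b, "w") as f:
    f.write(json.dumps({"source-ref": "s3://example-bucket/b_000001.jpg"}) + "\n")

joined = load_manifests([job_a, job_b])
print("loaded total of {} lines".format(len(joined)))
```

The order of the combined list follows the order of `paths`, which matches how `joined_outputs` is built above.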

## Example labels

In [6]:
joined_outputs[15]

{'source-ref': 's3://gg-sagemaker-object-detection-blog/frames/blue_box_1/blue_box_1_000037.jpg',
 'color': 'blue',
 'object': 'box',
 'bb': {'annotations': [{'class_id': 0,
 'width': 543,
 'top': 570,
 'height': 508,
 'left': 358}],
 'image_size': [{'width': 1920, 'depth': 3, 'height': 1080}]},
 'bb-metadata': {'job-name': 'labeling-job/blue-box-small-job-public',
 'class-map': {'0': 'storage box'},
 'human-annotated': 'yes',
 'objects': [{'confidence': 0.09}],
 'creation-date': '2019-05-21T21:25:24.736610',
 'type': 'groundtruth/object-detection'}}

In [7]:
joined_outputs[-15]

{'source-ref': 's3://gg-sagemaker-object-detection-blog/frames/yellow_box_2/yellow_box_2_000312.jpg',
 'color': 'yellow',
 'object': 'box',
 'bb': {'annotations': [{'class_id': 0,
 'width': 469,
 'top': 511,
 'height': 569,
 'left': 684}],
 'image_size': [{'width': 1920, 'depth': 3, 'height': 1080}]},
 'bb-metadata': {'job-name': 'labeling-job/yellow-box-large-job-public',
 'class-map': {'0': 'storage box'},
 'human-annotated': 'yes',
 'objects': [{'confidence': 0.09}],
 'creation-date': '2019-05-21T20:11:49.119720',
 'type': 'groundtruth/object-detection'}}
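Each annotation in the entries above stores a box as `left`, `top`, `width`, and `height`. As a minimal sketch, assuming these are pixel values measured from the top-left corner of the image, the box can be converted to corner coordinates and checked against `image_size`. The helper names `to_corners` and `fits_in_image` are hypothetical; the sample values are copied from the blue box entry shown earlier.

```python
def to_corners(annotation):
    """Convert a left/top/width/height box to (xmin, ymin, xmax, ymax)."""
    xmin = annotation["left"]
    ymin = annotation["top"]
    return (xmin, ymin, xmin + annotation["width"], ymin + annotation["height"])


def fits_in_image(annotation, image_size):
    """Return True when the box is non-empty and lies inside the image."""
    xmin, ymin, xmax, ymax = to_corners(annotation)
    return (0 <= xmin < xmax <= image_size["width"]
            and 0 <= ymin < ymax <= image_size["height"])


# Values copied from the example entry above
entry = {
    "bb": {
        "annotations": [{"class_id": 0, "width": 543, "top": 570, "height": 508, "left": 358}],
        "image_size": [{"width": 1920, "depth": 3, "height": 1080}],
    }
}
image_size = entry["bb"]["image_size"][0]
for annotation in entry["bb"]["annotations"]:
    print(to_corners(annotation), fits_in_image(annotation, image_size))
```

A check like this can flag boxes that extend past the image border before they reach training.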

## 2. Discard any bad labels from visual inspection 

You can manually review the labeled bounding boxes in the Ground Truth console and note the IDs of the images whose labels did not meet your quality bar.

In [8]:
TO_DISCARD = set([
 'blue_box_1_000023',
 'blue_box_1_000152',
 'blue_box_2_000292',
 'yellow_box_2_000193',
 'yellow_box_2_000204',
 'yellow_box_2_000205'
])

In [9]:
filtered_manifest = []
count_filtered = 0
for line in joined_outputs:
    # the image ID is the file name of the source image without its extension
    filename = os.path.split(line["source-ref"])[1]
    imageid = os.path.splitext(filename)[0]
    if imageid not in TO_DISCARD:
        filtered_manifest.append(line)
    else:
        count_filtered += 1

print("filtered out {} labels. {} labels remains".format(count_filtered, len(filtered_manifest)))

filtered out 6 labels. 1304 labels remains
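Because the discard list is typed by hand, a mistyped ID silently filters nothing. The sketch below reports discard IDs that match no manifest entry. The helper names `image_id` and `unmatched_discards` and the sample data are hypothetical and only illustrate the idea.

```python
import os


def image_id(entry):
    """Return the source image file name without directory or extension."""
    filename = os.path.split(entry["source-ref"])[1]
    return os.path.splitext(filename)[0]


def unmatched_discards(entries, to_discard):
    """Return the discard IDs that do not correspond to any manifest entry."""
    present = {image_id(entry) for entry in entries}
    return sorted(set(to_discard) - present)


# Hypothetical sample: the second discard ID is missing a digit
sample_entries = [
    {"source-ref": "s3://example-bucket/frames/blue_box_1/blue_box_1_000023.jpg"},
    {"source-ref": "s3://example-bucket/frames/blue_box_1/blue_box_1_000024.jpg"},
]
sample_discards = {"blue_box_1_000023", "blue_box_1_00023"}
print(unmatched_discards(sample_entries, sample_discards))
```

In the notebook, the equivalent call would be `unmatched_discards(joined_outputs, TO_DISCARD)`; an empty list means every ID matched an image.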


In [10]:
## example entry
filtered_manifest[2]

{'source-ref': 's3://gg-sagemaker-object-detection-blog/frames/blue_box_1/blue_box_1_000025.jpg',
 'color': 'blue',
 'object': 'box',
 'bb': {'annotations': [{'class_id': 0,
 'width': 324,
 'top': 986,
 'height': 94,
 'left': 229}],
 'image_size': [{'width': 1920, 'depth': 3, 'height': 1080}]},
 'bb-metadata': {'job-name': 'labeling-job/blue-box-small-job-public',
 'class-map': {'0': 'storage box'},
 'human-annotated': 'yes',
 'objects': [{'confidence': 0.09}],
 'creation-date': '2019-05-21T21:25:57.486929',
 'type': 'groundtruth/object-detection'}}

## 3. Inject class labels from metadata 

As the examples above show, every annotation has `'class_id': 0` regardless of what the object actually is, because we didn't ask the Ground Truth workers to classify the objects they were labeling.

We can use the metadata that we injected into the manifest (the `color` and `object` fields) to insert the correct class ID.

In [11]:
NEW_CLASS_MAP = {"blue box": 0, "yellow box": 1}
REVERSE_CLASS_MAP = {"0": "blue box", "1": "yellow box"}

In [12]:
import copy

classified_manifest = []
for line in filtered_manifest:
    if line["object"] == "box":
        # deep copy so that the entries in filtered_manifest are not modified in place
        transformed_line = copy.deepcopy(line)
        annotations = transformed_line['bb']['annotations']
        new_annotations = []
        if line["color"] == "blue":
            for annotation in annotations:
                annotation["class_id"] = NEW_CLASS_MAP["blue box"]
                new_annotations.append(annotation)
        elif line["color"] == "yellow":
            for annotation in annotations:
                annotation["class_id"] = NEW_CLASS_MAP["yellow box"]
                new_annotations.append(annotation)
        transformed_line['bb']['annotations'] = new_annotations
        # replace the single-class map from the labeling job with the new two-class map
        transformed_line['bb-metadata']['class-map'] = REVERSE_CLASS_MAP

        classified_manifest.append(transformed_line)

In [13]:
classified_manifest[15]

{'source-ref': 's3://gg-sagemaker-object-detection-blog/frames/blue_box_1/blue_box_1_000038.jpg',
 'color': 'blue',
 'object': 'box',
 'bb': {'annotations': [{'class_id': 0,
 'width': 599,
 'top': 579,
 'height': 501,
 'left': 331}],
 'image_size': [{'width': 1920, 'depth': 3, 'height': 1080}]},
 'bb-metadata': {'job-name': 'labeling-job/blue-box-small-job-public',
 'class-map': {'0': 'blue box', '1': 'yellow box'},
 'human-annotated': 'yes',
 'objects': [{'confidence': 0.29}],
 'creation-date': '2019-05-21T21:28:05.367484',
 'type': 'groundtruth/object-detection'}}

In [14]:
classified_manifest[-15]

{'source-ref': 's3://gg-sagemaker-object-detection-blog/frames/yellow_box_2/yellow_box_2_000312.jpg',
 'color': 'yellow',
 'object': 'box',
 'bb': {'annotations': [{'class_id': 1,
 'width': 469,
 'top': 511,
 'height': 569,
 'left': 684}],
 'image_size': [{'width': 1920, 'depth': 3, 'height': 1080}]},
 'bb-metadata': {'job-name': 'labeling-job/yellow-box-large-job-public',
 'class-map': {'0': 'blue box', '1': 'yellow box'},
 'human-annotated': 'yes',
 'objects': [{'confidence': 0.09}],
 'creation-date': '2019-05-21T20:11:49.119720',
 'type': 'groundtruth/object-detection'}}
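After rewriting the class IDs, it can help to count annotations per class and confirm that both classes are present. This is a minimal sketch, assuming the label attribute is named `bb` as in the entries above. The helper name `class_counts` and the sample entries are hypothetical.

```python
from collections import Counter


def class_counts(manifest, label_attribute="bb"):
    """Count annotations per class name using each entry's class-map."""
    counts = Counter()
    for entry in manifest:
        class_map = entry[label_attribute + "-metadata"]["class-map"]
        for annotation in entry[label_attribute]["annotations"]:
            # class-map keys are strings, class_id values are integers
            counts[class_map[str(annotation["class_id"])]] += 1
    return counts


# Hypothetical entries reduced to the fields the function reads
class_map = {"0": "blue box", "1": "yellow box"}
sample_manifest = [
    {"bb": {"annotations": [{"class_id": 0}]}, "bb-metadata": {"class-map": class_map}},
    {"bb": {"annotations": [{"class_id": 1}]}, "bb-metadata": {"class-map": class_map}},
    {"bb": {"annotations": [{"class_id": 1}]}, "bb-metadata": {"class-map": class_map}},
]
print(class_counts(sample_manifest))
```

In the notebook, `class_counts(classified_manifest)` would give the per-class totals for the real data.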

## 4. Split dataset between train and validation 

SageMaker requires two datasets during training: a training set and a validation set. The training set contains the images and annotations that the model actually learns from. The validation set is not used for training; it is used to check whether each training pass improves the model's accuracy and to compare accuracy across training jobs during hyperparameter tuning.

In [15]:
def train_validation_split(labels, split_factor=0.9):
    # shuffle in place so that the split is random
    np.random.shuffle(labels)

    dataset_size = len(labels)
    train_test_split_index = round(dataset_size*split_factor)

    train_data = labels[:train_test_split_index]
    validation_data = labels[train_test_split_index:]
    return train_data, validation_data

In [16]:
train_data, validation_data = train_validation_split(np.array(classified_manifest), split_factor=0.9)

print("training data size:{}\nvalidation data size:{}".format(train_data.shape[0], validation_data.shape[0]))

training data size:1174
validation data size:130
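The split above is different on every run because the shuffle is unseeded. If you want the same split each time, for example to compare training jobs on identical data, one option is a seeded shuffle. This is a sketch using the standard library instead of NumPy. The function name `seeded_split` and the default seed are arbitrary choices for illustration.

```python
import random


def seeded_split(labels, split_factor=0.9, seed=42):
    """Shuffle a copy of labels with a fixed seed, then split it in two."""
    shuffled = list(labels)  # copy so the caller's sequence keeps its order
    random.Random(seed).shuffle(shuffled)
    split_index = round(len(shuffled) * split_factor)
    return shuffled[:split_index], shuffled[split_index:]


# Stand-in for the 1304 filtered labels
train_part, validation_part = seeded_split(range(1304), split_factor=0.9)
print("training data size:{}\nvalidation data size:{}".format(
    len(train_part), len(validation_part)))
```

With 1304 items and a split factor of 0.9, the sizes are 1174 and 130, the same as in the output above.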


In [17]:
# write one JSON object per line, the same format as the Ground Truth output manifest
with open(os.path.join(TMP_FOLDER_NAME, 'train.manifest'), 'w') as f:
    for line in train_data:
        f.write(json.dumps(line))
        f.write('\n')

with open(os.path.join(TMP_FOLDER_NAME, 'validation.manifest'), 'w') as f:
    for line in validation_data:
        f.write(json.dumps(line))
        f.write('\n')

In [18]:
!wc -l $TMP_FOLDER_NAME/train.manifest
!wc -l $TMP_FOLDER_NAME/validation.manifest

1174 tmp/train.manifest
130 tmp/validation.manifest
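Besides counting lines with `wc -l`, the written file can be checked by reading it back and comparing it with the data in memory. The sketch below does this round trip in a scratch directory. The helper names `write_manifest_file` and `read_back` and the sample entries are hypothetical.

```python
import json
import os
import tempfile


def write_manifest_file(entries, file_path):
    """Write one JSON object per line."""
    with open(file_path, "w") as f:
        for entry in entries:
            f.write(json.dumps(entry))
            f.write("\n")


def read_back(file_path):
    """Parse a JSON Lines file into a list of dictionaries."""
    with open(file_path, "r") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical entries reduced to two fields
sample_entries = [
    {"source-ref": "s3://example-bucket/a_000001.jpg", "color": "blue"},
    {"source-ref": "s3://example-bucket/b_000001.jpg", "color": "yellow"},
]
manifest_path = os.path.join(tempfile.mkdtemp(), "train.manifest")
write_manifest_file(sample_entries, manifest_path)
restored = read_back(manifest_path)
print(len(restored), restored == sample_entries)
```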


In [21]:
!aws s3 cp $TMP_FOLDER_NAME/train.manifest s3://$BUCKET/training-manifest/$JOB_NAME/train.manifest
!aws s3 cp $TMP_FOLDER_NAME/validation.manifest s3://$BUCKET/training-manifest/$JOB_NAME/validation.manifest

upload: tmp/train.manifest to s3://angelaw-test-sagemaker-blog/training-manifest/demo/train.manifest
upload: tmp/validation.manifest to s3://angelaw-test-sagemaker-blog/training-manifest/demo/validation.manifest


## 5. Data augmentation (optional) 

In [None]:
%%time
%run ./flip_images.py -m s3://$BUCKET/training-manifest/$JOB_NAME/train.manifest -d $TMP_FOLDER_NAME -b $BUCKET

In [23]:
%run ./flip_annotations.py -m s3://$BUCKET/training-manifest/$JOB_NAME/train.manifest -d $TMP_FOLDER_NAME -p $JOB_NAME

INFO:__main__:working directory: tmp
INFO:__main__:wrote 1174 lines to tmp/x-flipped.json


1174


INFO:common_utils:wrote to s3://angelaw-test-sagemaker-blog/demo/x-flipped.json
INFO:__main__:uploaded tmp/x-flipped.json to s3://angelaw-test-sagemaker-blog/demo
INFO:__main__:wrote 1174 lines to tmp/y-flipped.json
INFO:common_utils:wrote to s3://angelaw-test-sagemaker-blog/demo/y-flipped.json
INFO:__main__:uploaded tmp/y-flipped.json to s3://angelaw-test-sagemaker-blog/demo
INFO:__main__:wrote 1174 lines to tmp/ccw_rotated.json
INFO:common_utils:wrote to s3://angelaw-test-sagemaker-blog/demo/ccw_rotated.json
INFO:__main__:uploaded tmp/ccw_rotated.json to s3://angelaw-test-sagemaker-blog/demo
INFO:__main__:wrote 1174 lines to tmp/cw_rotated.json
INFO:common_utils:wrote to s3://angelaw-test-sagemaker-blog/demo/cw_rotated.json
INFO:__main__:uploaded tmp/cw_rotated.json to s3://angelaw-test-sagemaker-blog/demo
INFO:__main__:wrote 5870 lines to tmp/all_augmented.json
INFO:common_utils:wrote to s3://angelaw-test-sagemaker-blog/demo/all_augmented.json
INFO:__main__:uploaded tmp/all_augmente
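The log above reports four augmented manifests (two flips and two rotations) and a combined file of 5870 lines, which equals 5 × 1174 and so is consistent with the original training set plus four variants. The scripts themselves are not shown in this notebook, so the following is not their implementation. It is only a sketch of how a bounding box in `left`/`top`/`width`/`height` form could be mirrored when its image is flipped. The function names are hypothetical, and which of them corresponds to the `x-flipped` or `y-flipped` output is not known from the log.

```python
def flip_horizontal(annotation, image_width):
    """Mirror a box left to right; only the left edge changes."""
    flipped = dict(annotation)
    flipped["left"] = image_width - annotation["left"] - annotation["width"]
    return flipped


def flip_vertical(annotation, image_height):
    """Mirror a box top to bottom; only the top edge changes."""
    flipped = dict(annotation)
    flipped["top"] = image_height - annotation["top"] - annotation["height"]
    return flipped


# Box and image size copied from an example entry earlier in the notebook
box = {"class_id": 0, "width": 543, "top": 570, "height": 508, "left": 358}
mirrored_lr = flip_horizontal(box, image_width=1920)
mirrored_tb = flip_vertical(box, image_height=1080)
print(mirrored_lr["left"], mirrored_tb["top"])
```

Width and height are unchanged by a flip, and applying the same flip twice returns the original box, which makes a convenient sanity check.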

# Next step

Now we are ready to start training. Move on to the [next notebook](./02_sagemaker_training_API.ipynb) to submit a SageMaker training job that trains our custom object detection model.