# TF2 Object Detection on Amazon SageMaker - Data preparation

## Setup environment

In [None]:
#If you're using an Amazon SageMaker Notebook instance, please select the "conda_tensorflow2_p36" kernel.
#If you're using anything else feel free to make use of the script in 0_set_up_env folder and select the custom kernel.

In [None]:
!pip install -q --upgrade pip
!pip install -q sagemaker==2.96.0
!pip install -q jsonlines
# Get dataset_util file from TF2 Object Detection GitHub repository
!wget -P ./docker/code/utils https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/utils/dataset_util.py

## Import libraries

In [None]:
import os
import boto3
import jsonlines
import sagemaker
import numpy as np
from PIL import Image
from itertools import cycle, islice
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

sagemaker_session = sagemaker.Session()

# we are using the notebook instance role for training in this example
role = sagemaker.get_execution_role() 

# you can specify a bucket name here, we're using the default bucket of SageMaker
bucket = sagemaker_session.default_bucket() 

<a name="introduction"></a>
## Get data
In this workshop we will use a dataset from [inaturalist.org](https://www.inaturalist.org). This dataset contains 500 images of bees that iNaturalist users uploaded to record their observations and identifications. We only used images that their users have licensed under the [CC0](https://creativecommons.org/share-your-work/public-domain/cc0/) license. For your convenience, we have placed the dataset in S3 in a single zip archive here: https://tf2-object-detection.s3-eu-west-1.amazonaws.com/data/bees/dataset.zip

First, download and unzip the archive.

In [None]:
!wget -O /tmp/dataset.zip https://tf2-object-detection.s3-eu-west-1.amazonaws.com/data/bees/input/dataset.zip
!mkdir -p data
!unzip -qo /tmp/dataset.zip -d data

The archive contains the following structure: 
- 500 `.jpg` image files 
- An "output.manifest" file (to be explained later) 
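
As a quick sanity check on the unzipped folder, the sketch below counts the `.jpg` files and the records in `output.manifest`. The helper name `summarize_dataset` is hypothetical and not part of the original notebook; it only assumes the layout described above (images and manifest side by side in one folder).

```python
import json
from pathlib import Path


def summarize_dataset(data_dir):
    """Return (number of .jpg files, number of manifest records) for a folder."""
    data_path = Path(data_dir)

    # Count image files by extension, ignoring case (.jpg and .JPG both count).
    n_images = sum(1 for p in data_path.iterdir() if p.suffix.lower() == ".jpg")

    # A JSON Lines manifest holds one JSON object per non-empty line.
    n_records = 0
    manifest_path = data_path / "output.manifest"
    if manifest_path.exists():
        with manifest_path.open() as f:
            for line in f:
                if line.strip():
                    json.loads(line)  # raises ValueError on a malformed record
                    n_records += 1
    return n_images, n_records
```

Calling `summarize_dataset('data')` after the unzip step returns both counts, which makes a truncated download easy to spot.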

Now let's upload this dataset to your own S3 bucket in preparation for labeling and training using Amazon SageMaker.

In [None]:
prefix = 'data/bees/raw'
s3_input = sagemaker_session.upload_data('data', bucket, prefix)
print(s3_input)

## Visualize labelled images

Now let's plot all the annotated images. First, let's define a function that displays the local image file and draws over it the bounding boxes obtained via labeling.

In [None]:
def show_annotated_image(img_path, bboxes):
    """Display an image and overlay its bounding boxes (pixel left/top/width/height)."""
    im = np.array(Image.open(img_path), dtype=np.uint8)

    # Create figure and axes
    fig, ax = plt.subplots(1)

    # Display the image
    ax.imshow(im)

    # Cycle through edge colors so neighboring boxes are easy to tell apart
    colors = cycle(['w', 'g', 'b', 'y', 'c', 'm', 'k', 'r'])

    for bbox in bboxes:
        # Create a Rectangle patch
        rect = patches.Rectangle(
            (bbox['left'], bbox['top']), bbox['width'], bbox['height'],
            linewidth=2, edgecolor=next(colors), facecolor='none')

        # Add the patch to the Axes
        ax.add_patch(rect)

    plt.show()

Next, read the augmented manifest (JSON lines format) line by line and display the first 10 images.

In [None]:
labeling_job_name = 'bees-500'
augmented_manifest_file = 'data/output.manifest'

with jsonlines.open(augmented_manifest_file, 'r') as reader:
    for desc in islice(reader, 10):
        img_url = desc['source-ref']
        img_file = f'data/{os.path.basename(img_url)}'
        bboxes = desc[labeling_job_name]['annotations']
        show_annotated_image(img_file, bboxes)

We will then process the object detection labels output by Ground Truth (i.e., the output manifest file) and generate TFRecord files to be used during model training.
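
The processing script that runs inside the container is not shown in this notebook, so the following is only a sketch of one step such a conversion typically involves: turning the pixel boxes from the manifest into corner coordinates normalized by the image size. The sample record is made up, and the `class_id` and `image_size` keys are an assumption based on the Ground Truth bounding box output format; the cells above only rely on `source-ref` and `annotations`.

```python
import json

# Illustrative record only: the bucket, file name, and values are made up.
sample_line = json.dumps({
    "source-ref": "s3://example-bucket/data/bees/raw/example.jpg",
    "bees-500": {
        "annotations": [
            {"class_id": 0, "left": 100, "top": 50, "width": 200, "height": 100}
        ],
        "image_size": [{"width": 500, "height": 500, "depth": 3}],
    },
})


def to_normalized_boxes(record, job_name):
    """Convert pixel boxes (left, top, width, height) to normalized corners."""
    size = record[job_name]["image_size"][0]
    boxes = []
    for ann in record[job_name]["annotations"]:
        boxes.append({
            "class_id": ann["class_id"],
            "xmin": ann["left"] / size["width"],
            "ymin": ann["top"] / size["height"],
            "xmax": (ann["left"] + ann["width"]) / size["width"],
            "ymax": (ann["top"] + ann["height"]) / size["height"],
        })
    return boxes


boxes = to_normalized_boxes(json.loads(sample_line), "bees-500")
print(boxes)
```

Normalized coordinates keep the labels valid if the images are resized later in the pipeline.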

## Build and push data processing container

Because we're following SageMaker's bring-your-own-container paradigm, the first step is to build a Docker image (which lets you choose the base image and the libraries to install) and push it to Amazon ECR so that it's accessible by Amazon SageMaker.

In [None]:
image_name = 'tfrecord-processing'

In [None]:
#The following command takes a few minutes to complete
!sh ./docker/build_and_push.sh $image_name

In [None]:
# Get the ECR image uri
with open(os.path.join('docker', 'ecr_image_fullname.txt'), 'r') as f:
    # strip() removes the trailing newline without dropping a character when there is none
    container = f.readline().strip()
 
print(container)
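
Before launching a job, it can help to confirm that the string read from the file looks like an ECR image URI. The sketch below assumes that `build_and_push.sh` writes the usual `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` form; the script itself is not shown here, and `parse_ecr_uri` is a hypothetical helper.

```python
import re

# Usual ECR image URI layout: 12-digit account ID, region, repository, optional tag.
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com(\.cn)?"
    r"/(?P<repository>[a-z0-9._/-]+)(:(?P<tag>[\w.-]+))?$"
)


def parse_ecr_uri(uri):
    """Split an ECR image URI into its parts, or raise ValueError if it does not match."""
    match = ECR_URI_PATTERN.match(uri.strip())
    if match is None:
        raise ValueError(f"Not a recognizable ECR image URI: {uri!r}")
    return match.groupdict()
```

For example, `parse_ecr_uri(container)['repository']` should equal `image_name` if the build script tagged the image under that name.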

## Launch SageMaker processing job to generate TFRecords

In [None]:
data_processor = Processor(
    role=role,
    image_uri=container,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=30,
    max_runtime_in_seconds=1200,
    base_job_name='tf2-object-detection'
)

In [None]:
input_folder = '/opt/ml/processing/input'
ground_truth_manifest = '/opt/ml/processing/input/output.manifest'
label_map = '{"0": "bee"}'  # class mapping: each class ID maps to its human-readable name
output_folder = '/opt/ml/processing/output'

data_processor.run(
    arguments=[
        f'--input={input_folder}',
        f'--ground_truth_manifest={ground_truth_manifest}',
        f'--label_map={label_map}',
        f'--output={output_folder}'
    ],
    inputs=[
        ProcessingInput(
            input_name='input',
            source=s3_input,
            destination=input_folder
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name='tfrecords',
            source=output_folder,
            destination=f's3://{bucket}/data/bees/tfrecords'
        )
    ]
)
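
The four `arguments` above reach the container's entry point as command-line flags. The entry point itself lives in the Docker image and is not shown here, so the following is only a sketch of how such a script could parse them with `argparse`; the function name `parse_processing_args` is hypothetical. Note that the label map arrives as a JSON string, and JSON object keys are always strings, so the sketch converts them back to integer class IDs.

```python
import argparse
import json


def parse_processing_args(argv):
    """Parse the flags passed through Processor.run(arguments=[...])."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--ground_truth_manifest", required=True)
    # json.loads turns the '{"0": "bee"}' string into a dict.
    parser.add_argument("--label_map", type=json.loads, required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)

    # Convert the string keys produced by JSON into integer class IDs.
    args.label_map = {int(k): v for k, v in args.label_map.items()}
    return args


args = parse_processing_args([
    "--input=/opt/ml/processing/input",
    "--ground_truth_manifest=/opt/ml/processing/input/output.manifest",
    '--label_map={"0": "bee"}',
    "--output=/opt/ml/processing/output",
])
print(args.label_map)
```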