# Label your dataset with Amazon SageMaker Ground Truth

In [12]:
import boto3
import json
import numpy
import os
import sagemaker

from sklearn.model_selection import train_test_split

sm_client = boto3.client('sagemaker')
s3_resource = boto3.resource('s3')
sm_session = sagemaker.Session()

### Create a labeling job in Amazon SageMaker Ground Truth

To train a custom YOLOv5 model, you first need to label your custom dataset. You can label an object detection dataset with Amazon SageMaker Ground Truth.

| ⚠️ WARNING: If you have already labeled an object detection dataset with Amazon SageMaker Ground Truth, you can skip to the "**Get Job Details**" section. |
| -- |

#### Create a Labeling Workforce

Follow the steps in the SageMaker Ground Truth documentation here: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-create-private-console.html#create-workforce-labeling-job


#### Create your bounding box labeling job

Follow the steps in the SageMaker Ground Truth documentation here: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-create-labeling-job-console.html

If you use the AWS Console, create the labeling job with the following options:

1. Job name: Enter any unique name, for example "Object-Detection-Example".
2. Leave the "I want to specify a label attribute..." option unchecked, so that the label attribute name defaults to the job name. The code later in this notebook reads the annotations under that name.
3. Input data setup: Pick "Automated data setup".
4. Input dataset location: Copy and paste the location of the single folder with your images in S3. Example: "s3://mybucket/raw_images".
5. Output dataset location: Choose "Same location as input dataset".
6. Data type: Choose "Image".
7. IAM Role: Create a new role and give access to the S3 bucket where your images are located, or any S3 bucket.
8. Choose "Complete data setup" and wait for the setup to finish.
9. Task category: Choose "Image" and select "Bounding box", then choose "Next".
10. Worker types: Select "Private" and choose your team for the "Private teams" option.
11. For the Bounding box labeling tool: Enter a description and instructions, and in the "Labels" section add the relevant labels for your job.
12. Finally, choose "Create".

### Get Job Details

Once you have finished labeling your images, retrieve the information needed to build the dataset in the format YOLOv5 expects.

In [10]:
groundtruth_job_name = "Object-Detection-Example" ### <-- Replace with the name you used for your labeling job

In [11]:
response = sm_client.describe_labeling_job(
    LabelingJobName=groundtruth_job_name
)

labelingJobStatus = response["LabelingJobStatus"]
manifestUri = response["LabelingJobOutput"]["OutputDatasetS3Uri"]
labelsListUri = response["LabelCategoryConfigS3Uri"]

print("Job Status: ",labelingJobStatus)
print("Manifest Uri: ", manifestUri)
print("Labels Uri: ", labelsListUri)

Job Status: Completed
Manifest Uri: s3://buzecd-aiml-demos/ground-truth-tests/image-bounding-box/Output/Object-Detection-Example/manifests/output/output.manifest
Labels Uri: s3://buzecd-aiml-demos/ground-truth-tests/image-bounding-box/Output/Object-Detection-Example/annotation-tool/data.json
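The rest of the notebook only makes sense if the job status is `Completed`. As a minimal sketch, the check could be wrapped in a small helper. The name `extract_job_outputs` and the example URIs are hypothetical; the dictionary keys are the ones read from the `describe_labeling_job` response above.

```python
def extract_job_outputs(response):
    """Return (manifest_uri, labels_uri) from a describe_labeling_job response.

    Hypothetical helper: raises if the job has not completed yet.
    """
    status = response.get("LabelingJobStatus")
    if status != "Completed":
        raise RuntimeError("Labeling job is not completed (status: {})".format(status))
    manifest_uri = response["LabelingJobOutput"]["OutputDatasetS3Uri"]
    labels_uri = response["LabelCategoryConfigS3Uri"]
    return manifest_uri, labels_uri


# Hand-written dictionary standing in for the real API response (made-up URIs)
fake_response = {
    "LabelingJobStatus": "Completed",
    "LabelingJobOutput": {"OutputDatasetS3Uri": "s3://example-bucket/output/output.manifest"},
    "LabelCategoryConfigS3Uri": "s3://example-bucket/output/data.json",
}
print(extract_job_outputs(fake_response))
```

In the notebook, the same helper would be called with the real `response` object instead of `fake_response`.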


### Get labels

We need to retrieve the labels from the labeling job, which are stored in S3.

In [6]:
def split_s3_path(s3_path):
    # Split "s3://bucket/some/key" into ("bucket", "some/key")
    path_parts = s3_path.replace("s3://", "").split("/")
    bucket = path_parts.pop(0)
    key = "/".join(path_parts)
    return bucket, key

def get_labels_list(labels_uri):
    # Download the label category file and return the label names in order
    labels = []
    bucket, key = split_s3_path(labels_uri)
    s3_resource.meta.client.download_file(bucket, key, 'labels.json')
    with open('labels.json') as f:
        data = json.load(f)
    for label in data["labels"]:
        labels.append(label["label"])
    return labels

In [7]:
labels = get_labels_list(labelsListUri)
print("Labels: ",labels)

Labels: ['Dog', 'Cat']
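The `split_s3_path` helper above assumes it always receives a well-formed `s3://bucket/key` URI. A slightly stricter variant, shown here as a sketch with the hypothetical name `parse_s3_uri`, rejects anything else instead of silently returning a wrong bucket:

```python
def parse_s3_uri(s3_uri):
    """Split an S3 URI into (bucket, key), raising ValueError on malformed input.

    Hypothetical stricter alternative to split_s3_path.
    """
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError("Not an S3 URI: {!r}".format(s3_uri))
    # Everything up to the first "/" is the bucket, the remainder is the key
    bucket, _, key = s3_uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError("Missing bucket name in {!r}".format(s3_uri))
    return bucket, key


print(parse_s3_uri("s3://mybucket/raw_images/cat.jpg"))
# -> ('mybucket', 'raw_images/cat.jpg')
```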


### Get manifest

We need to retrieve the labeled manifest file from the labeling job, which is stored in S3.

In [8]:
def get_manifest_file(manifest_uri):
    bucket, key = split_s3_path(manifest_uri)
    s3_resource.meta.client.download_file(bucket, key, 'output.manifest')
    return "output.manifest"

In [9]:
manifest = get_manifest_file(manifestUri)

### Split manifest into training and validation

Now that we have our manifest, let's split our data into training and validation sets.

In [None]:
with open(manifest) as file:
    lines = file.readlines()

data = numpy.array(lines)
train_data, validation_data = train_test_split(data, test_size=0.2)

print("The manifest contains {} annotations.".format(len(data)))
print("{} will be used for training.".format(len(train_data)))
print("{} will be used for validation.".format(len(validation_data)))
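The split above is random, so each run produces a different partition. `train_test_split` accepts a `random_state` argument to make it repeatable. The idea can also be sketched with only the standard library, as below. The helper name `split_lines`, the seed value, and the rounding rule are assumptions for illustration, and the partition it produces is not the one scikit-learn would return.

```python
import random


def split_lines(lines, validation_fraction=0.2, seed=42):
    """Shuffle lines with a fixed seed and return (train, validation) lists.

    Hypothetical helper: the same seed always yields the same partition.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


example_lines = ["line-{}".format(i) for i in range(10)]
train_example, validation_example = split_lines(example_lines)
print(len(train_example), len(validation_example))
# -> 8 2
```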

### Create YOLOv5 Training and Validation datasets

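Let's download the images and create the annotation files in the format YOLOv5 expects: one `.txt` file per image, with one line per bounding box in the form `class_id center_x center_y width height`, where the coordinates are normalized by the image width and height.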

In [None]:
dirs = ["dataset/images/train",
        "dataset/labels/train",
        "dataset/images/validation",
        "dataset/labels/validation"]

for directory in dirs:
    !mkdir -p {directory}

In [None]:
def ground_truth_to_yolo(dataset, dataset_category):
    print("Downloading images and creating labels for the {} dataset".format(dataset_category))
    for line in dataset:
        line = json.loads(line)

        # Variables
        object_s3_uri = line["source-ref"]
        bucket, key = split_s3_path(object_s3_uri)
        image_filename = object_s3_uri.split("/")[-1]
        txt_filename = '.'.join(image_filename.split(".")[:-1]) + ".txt"
        txt_path = "dataset/labels/{}/{}".format(dataset_category, txt_filename)

        # Download image
        s3_resource.meta.client.download_file(bucket, key, "dataset/images/{}/{}".format(dataset_category, image_filename))

        # The annotations are stored under the label attribute name (the job name)
        job_output = line[groundtruth_job_name]
        image_width = job_output["image_size"][0]["width"]
        image_height = job_output["image_size"][0]["height"]

        # Create txt with annotations, normalizing pixel values by the image size
        with open(txt_path, 'w') as target:
            for annotation in job_output["annotations"]:
                class_id = annotation["class_id"]
                center_x = (annotation["left"] + (annotation["width"] / 2)) / image_width
                center_y = (annotation["top"] + (annotation["height"] / 2)) / image_height
                w = annotation["width"] / image_width
                h = annotation["height"] / image_height
                data = "{} {} {} {} {}\n".format(class_id, center_x, center_y, w, h)
                target.write(data)
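To make the coordinate conversion concrete, here is the same arithmetic applied to a single made-up annotation. The helper name `to_yolo_box` and the example values are hypothetical; the field names match those used in the function above.

```python
def to_yolo_box(annotation, image_width, image_height):
    """Convert one Ground Truth box (pixels, top-left origin) to YOLO format.

    Returns (class_id, center_x, center_y, width, height), with all
    coordinates normalized to the image size.
    """
    center_x = (annotation["left"] + annotation["width"] / 2) / image_width
    center_y = (annotation["top"] + annotation["height"] / 2) / image_height
    width = annotation["width"] / image_width
    height = annotation["height"] / image_height
    return annotation["class_id"], center_x, center_y, width, height


# A 200x100 pixel box whose top-left corner is at (100, 50) in a 400x200 image
example_annotation = {"class_id": 1, "left": 100, "top": 50, "width": 200, "height": 100}
print(to_yolo_box(example_annotation, image_width=400, image_height=200))
# -> (1, 0.5, 0.5, 0.5, 0.5)
```

The box is centered in the image and covers half of each dimension, so every normalized value is 0.5.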

In [None]:
ground_truth_to_yolo(train_data, "train")
ground_truth_to_yolo(validation_data, "validation")

### Validate the number of downloaded files

In [None]:
def count_files(dirs):
    for directory in dirs:
        number = sum(1 for entry in os.scandir(directory) if entry.is_file())
        print("There are {} elements in {}".format(number, directory))

count_files(dirs)
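Matching counts do not guarantee that every image has its own label file. A minimal sketch of a pairing check follows; the helper name `find_unpaired_images` is hypothetical, and the demonstration uses temporary directories rather than the real `dataset` folders.

```python
import os
import tempfile


def find_unpaired_images(images_dir, labels_dir):
    """Return the image file names that have no label file with the same stem."""
    label_stems = {os.path.splitext(name)[0] for name in os.listdir(labels_dir)}
    return sorted(
        name for name in os.listdir(images_dir)
        if os.path.splitext(name)[0] not in label_stems
    )


# Demonstration on throwaway directories: dog.jpg is missing its label file
with tempfile.TemporaryDirectory() as images_dir, tempfile.TemporaryDirectory() as labels_dir:
    for name in ("cat.jpg", "dog.jpg"):
        open(os.path.join(images_dir, name), "w").close()
    open(os.path.join(labels_dir, "cat.txt"), "w").close()
    unpaired = find_unpaired_images(images_dir, labels_dir)

print(unpaired)
# -> ['dog.jpg']
```

In the notebook it could be called as `find_unpaired_images("dataset/images/train", "dataset/labels/train")`, and likewise for the validation folders.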

In [None]:
# TODO: Show images with bounding boxes
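As a first step toward that visualization, each line of a YOLO label file has to be converted back to pixel corner coordinates. The sketch below shows only that conversion. The helper name `yolo_line_to_pixel_box` is hypothetical, and the image size is passed in explicitly rather than read from the image.

```python
def yolo_line_to_pixel_box(line, image_width, image_height):
    """Convert 'class_id cx cy w h' (normalized) to (class_id, left, top, right, bottom) in pixels."""
    class_id, center_x, center_y, width, height = line.split()
    box_width = float(width) * image_width
    box_height = float(height) * image_height
    left = float(center_x) * image_width - box_width / 2
    top = float(center_y) * image_height - box_height / 2
    return int(class_id), left, top, left + box_width, top + box_height


print(yolo_line_to_pixel_box("1 0.5 0.5 0.5 0.5", image_width=400, image_height=200))
# -> (1, 100.0, 50.0, 300.0, 150.0)
```

The resulting corners could then be handed to a drawing library, for example Pillow's `ImageDraw.rectangle`, to overlay the boxes on the downloaded images.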

### Upload the labeled dataset to S3

Let's upload our dataset to S3. The training job will read it from there.

In [None]:
bucket = sm_session.default_bucket()
#bucket = "" #Use this option if you want to use a specific S3 bucket
dataset_s3_uri = sm_session.upload_data("dataset", bucket, "yolov5dataset")
print("Dataset located in: ",dataset_s3_uri)

You have labeled your own custom dataset with Amazon SageMaker Ground Truth and split it into training and validation datasets in the format YOLOv5 expects. In the next modules, you will use this dataset to train and deploy a custom YOLOv5 model.

| ⚠️ WARNING: These are the details you will need to train your models based on the labeling job you completed. |
| -- |

In [None]:
print("Dataset S3 location: ", dataset_s3_uri)
print("Labels: ", labels)