## Amazon SageMaker Ground Truth
### Lab: Data Labeling Using Private Workforce

In this lab, you will use Amazon SageMaker Ground Truth to label the images in a training
dataset of cat and dog images. You will start with an unlabeled image
training dataset, acquire labels for all the images using a SageMaker Ground Truth
private workforce, and finally analyze the results of the labeling job.

High-level steps:

1. Upload training data into an S3 bucket.
2. Create a private Ground Truth labeling workforce.
3. Create a Ground Truth labeling job.
4. Label the images using the Ground Truth Labeling portal.
5. Analyze the results.

This notebook is based on the lab in the repository [https://github.com/mahendrabairagi/GroundTruth_lab](https://github.com/mahendrabairagi/GroundTruth_lab)

In [None]:
import boto3
import botocore
import json
import time
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown

s3 = boto3.client('s3')
s3_resource = boto3.resource('s3')

session = boto3.session.Session()
region = session.region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')

bucket = 'escience-workshop-{{FIXME}}'

# None disables column-width truncation (-1 is deprecated since pandas 1.0)
pd.set_option('display.max_colwidth', None)
%matplotlib inline

### [Create S3 Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html)

We will create an S3 bucket that will be used throughout the workshop for storing data.

[s3.create_bucket](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.create_bucket) boto3 documentation

In [None]:
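def create_bucket(bucket):
    """Create the bucket in the session's region and return a status message."""
    import logging

    try:
        if region == 'us-east-1':
            # us-east-1 is the default region and rejects an explicit LocationConstraint
            s3.create_bucket(Bucket=bucket)
        else:
            s3.create_bucket(Bucket=bucket, CreateBucketConfiguration={'LocationConstraint': region})
    except botocore.exceptions.ClientError as e:
        logging.error(e)
        return 'Bucket {0} could not be created.'.format(bucket)
    return 'Created {0} bucket.'.format(bucket)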

In [None]:
create_bucket(bucket)
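
Bucket creation fails if the `{{FIXME}}` placeholder in `bucket` has not been replaced with your own suffix. The sketch below is a hypothetical helper, `check_bucket_name`, that is not part of the original lab. It checks only a subset of the S3 naming rules (3 to 63 characters, lowercase letters, digits, dots, and hyphens, starting and ending with a letter or digit) and does not check that the name is globally unique.

```python
import re

# Subset of the S3 bucket naming rules (assumption: the full rules add further
# restrictions, such as disallowing names formatted like IP addresses).
_BUCKET_NAME_RE = re.compile(r'[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]')


def check_bucket_name(name):
    """Return a list of problems found in a candidate bucket name (empty if none)."""
    problems = []
    if '{{' in name or '}}' in name:
        problems.append('placeholder has not been replaced')
    if not _BUCKET_NAME_RE.fullmatch(name):
        problems.append('use 3-63 lowercase letters, digits, dots, or hyphens, '
                        'starting and ending with a letter or digit')
    return problems


print(check_bucket_name('escience-workshop-{{FIXME}}'))  # reports two problems
print(check_bucket_name('escience-workshop-rr'))         # reports none
```

An empty list means the name passed these basic checks. The `create_bucket` call can still fail if another account already owns the name.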

### Download the training data.

In this step you will download the training data to your local machine.

* Download the training data (cat & dog images) from this link:
https://s3.amazonaws.com/groundtruth-ml-roadshow-workshop/traindata_cat_dog_images_20.zip
* Extract traindata_cat_dog_images_20.zip, if necessary. You should
see a “traindata_cat_dog_images_20” folder with about 20 files.

In [None]:
!wget https://s3.amazonaws.com/groundtruth-ml-roadshow-workshop/traindata_cat_dog_images_20.zip

In [None]:
!unzip traindata_cat_dog_images_20.zip

### Verify files

You should have 20 images of cats and dogs downloaded locally. 

In [None]:
!ls -la traindata_cat_dog_images_20

In [None]:
from IPython.display import Image
Image(filename='traindata_cat_dog_images_20/160.jpg') 
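
The count can also be checked programmatically rather than by reading the `ls` output. This is a minimal sketch using a hypothetical helper, `count_images`. It assumes the dataset uses `.jpg`, `.jpeg`, or `.png` extensions (the sample file `160.jpg` suggests JPEG) and ignores hidden files.

```python
from pathlib import Path

IMAGE_SUFFIXES = {'.jpg', '.jpeg', '.png'}  # assumption: extensions used by the dataset


def count_images(directory):
    """Count non-hidden image files anywhere below `directory`."""
    return sum(
        1
        for path in Path(directory).rglob('*')
        if path.is_file()
        and not path.name.startswith('.')
        and path.suffix.lower() in IMAGE_SUFFIXES
    )


# Only runs the check if the archive has already been extracted here.
if Path('traindata_cat_dog_images_20').is_dir():
    print(count_images('traindata_cat_dog_images_20'))
```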

### [Upload to S3](https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html)

Next, we will upload the files you just downloaded to S3 to be used with [SageMaker Ground Truth](https://aws.amazon.com/sagemaker/groundtruth/) for labeling.

[s3.upload_file](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_file) boto3 documentation

In [None]:
# enumerate local files recursively
local_directory = 'traindata_cat_dog_images_20'

for root, dirs, files in os.walk(local_directory):
    for filename in files:
        # construct the full local path
        local_path = os.path.join(root, filename)
        relative_path = os.path.relpath(local_path, local_directory)
        # skip hidden files such as .DS_Store
        if not relative_path.startswith('.'):
            # reuse the local folder name as the S3 key prefix
            s3_path = os.path.join(local_directory, relative_path)
            print(s3_path)
            s3.upload_file(local_path, bucket, s3_path)
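
To confirm the upload, the objects under the prefix can be listed back from S3. The sketch below uses a hypothetical helper, `list_keys`, that takes the S3 client as an argument and relies on the boto3 `list_objects_v2` paginator. It assumes the notebook's `s3` client and `bucket` variable from the cells above.

```python
def list_keys(s3_client, bucket_name, prefix):
    """Return every object key under `prefix`, following pagination."""
    paginator = s3_client.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # 'Contents' is absent from a page when nothing matches the prefix
        keys.extend(obj['Key'] for obj in page.get('Contents', []))
    return keys


# Usage in the notebook (requires AWS credentials):
# uploaded = list_keys(s3, bucket, 'traindata_cat_dog_images_20/')
# print(len(uploaded))
```

Passing the client in, rather than reading a global, keeps the helper easy to exercise with a stub.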

### Create a private Ground Truth Labeling Workforce.

In this step, you will create a “private workteam” and add only one user (you) to it.
To create a private team:

* Go to AWS Console > Amazon SageMaker > Labeling workforces
    - Click the "Private" tab and then "Create private team".
    - Enter the desired name for your private workteam.
    - Enter your own email address in the "Email addresses" section.
    - Enter the name of your organization.
    - Enter a contact email in the "Contact email" field for the private workteam.
    - Click "Create Private Team".
* The AWS Console should now return to AWS Console > Amazon SageMaker > Labeling workforces. Your newly created team should be visible under "Private teams".
* You should get an email from `no-reply@verificationemail.com` that contains your workforce username and password.
    - Use the link and login credentials from the email to access the Labeling portal.
    - You will be asked to create a new, non-default password.
That's it! This is your private worker's interface.
Once the Ground Truth labeling job is submitted in the next step, you will see the
annotation job in this portal.
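
The team can also be looked up from the notebook instead of the console. This is a minimal sketch using the boto3 SageMaker `describe_workteam` call. The helper name `describe_private_team` and the team name in the usage comment are hypothetical, and the call raises a `ClientError` if no team with that name exists.

```python
def describe_private_team(sagemaker_client, team_name):
    """Return the ARN and labeling-portal sub-domain of a private workteam."""
    team = sagemaker_client.describe_workteam(WorkteamName=team_name)['Workteam']
    return {'arn': team['WorkteamArn'], 'portal': team.get('SubDomain')}


# Usage in the notebook (requires AWS credentials; 'my-private-team' is a placeholder):
# sm = boto3.client('sagemaker')
# print(describe_private_team(sm, 'my-private-team'))
```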

### Create a private Ground Truth Labeling Job.

In this step, you will create a Ground Truth Labeling job and assign it to the private
workforce created in Step 2.

* Go to AWS Console > Amazon SageMaker > Labeling jobs
* Click ‘Create labeling job’
    - In the ‘Specify job details’ step
        - Job name: groundtruth-labeling-job-cat-dog (Note: any unique name will do)
        - Input dataset location
            - Create manifest
                - Enter the full S3 path where the images are located. (Note: the path should end with /, e.g. s3://escience-workshop-{{FIXME}}/traindata_cat_dog_images_20/)
                - Select 'Images' as the data type
                - Wait until the manifest creation is complete.
                - Click "Use this manifest"
        - Output dataset location: enter an S3 bucket path (e.g. s3://escience-workshop-{{FIXME}}/cat_dog_images_labeled/)
        - IAM Role
            - Select 'Create a new role' from the dropdown.
            - In the “Specific S3 buckets” section, enter the S3 bucket created in Step 1
            - Click Create
        - Task Type
            - Select 'Image classification'
        - Click Next
    - In the 'Workers' step
        - Select ‘Private’
        - Select the team created in the previous step from the Private teams dropdown.
        - Examine the ‘Additional configuration’ options
            - Leave ‘Automated data labeling’ → ‘Enable’ unchecked.
            - Leave ‘Number of workers per dataset object’ at 1
    - In the 'Image classification labeling tool' step
        ![IMGLABEL](../../docs/assets/images/img-class-label.png)
        - Enter "Please classify the images as 'cat' or 'dog'" in the textbox as an instruction to the workforce.
        - Add two options: 'cat' and 'dog'
        - Submit
* Go to AWS Console > Amazon SageMaker > Labeling jobs to verify that a labeling job has been created.
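
The manifest that the console generates lists one image per line. A roughly equivalent file could be assembled by hand, as sketched below. The `source-ref` line format is taken from the input manifest sample in the Analyze Results section, and `build_manifest` is a hypothetical helper rather than part of the lab.

```python
import json


def build_manifest(bucket_name, keys):
    """Return manifest text: one {"source-ref": ...} JSON object per line."""
    lines = [
        json.dumps({'source-ref': 's3://{0}/{1}'.format(bucket_name, key)})
        for key in sorted(keys)
    ]
    return '\n'.join(lines) + '\n'


example_keys = [
    'traindata_cat_dog_images_20/10.jpg',
    'traindata_cat_dog_images_20/0.jpg',
]
print(build_manifest('escience-workshop-rr', example_keys), end='')
```

A hand-built manifest would still need to be uploaded to S3 and selected as the input dataset location in place of the console's "Create manifest" flow.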

### Label the images using the Ground Truth Labeling portal

In this step, you will complete a labeling/annotation job assigned to you from the
Ground Truth Labeling portal.

* Log in to the Ground Truth Labeling portal using the link provided to you in the email from `no-reply@verificationemail.com`. (Note: This is the same portal you used in Step 2.)

Once the annotation job is assigned, you can view the job (similar to the picture below).

![IMGPRVW](../../docs/assets/images/gt-preview.png)

**Note**: After you label a subset of the images, the annotation job will be complete. If the first annotation
job did not include all 20 images, you will see a new job in the portal after a few minutes. Repeat
the process of labeling images in the jobs as they appear in the portal until all images are labeled.

You can check the status of the labeling job from Ground Truth → Labeling Jobs, which
shows the number of images labeled out of the total.

![LBLJOB](../../docs/assets/images/gt-label-job.png)
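
The same progress information is available from the notebook through the boto3 SageMaker `describe_labeling_job` call. The sketch below uses a hypothetical helper, `labeling_progress`, and assumes the job name from the earlier step (`groundtruth-labeling-job-cat-dog`, or whichever name you chose).

```python
def labeling_progress(sagemaker_client, job_name):
    """Return (status, labeled count, unlabeled count) for a labeling job."""
    job = sagemaker_client.describe_labeling_job(LabelingJobName=job_name)
    counters = job['LabelCounters']
    return job['LabelingJobStatus'], counters['TotalLabeled'], counters['Unlabeled']


# Usage in the notebook (requires AWS credentials):
# sm = boto3.client('sagemaker')
# print(labeling_progress(sm, 'groundtruth-labeling-job-cat-dog'))
```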

### Analyze Results

In this step, you will review the manifest files created during the Ground Truth
Labeling process. The manifest files are in the S3 bucket you created in Step 1.

**Input Manifest File**

Located in the S3 bucket under the prefix: traindata_cat_dog_images_20/datasetxxxxxx.manifest.
The manifest is a JSON Lines file (one JSON object per line) that captures information about the training data.
Sample:

```json
{"source-ref":"s3://escience-workshop-rr/traindata_cat_dog_images_20/0.jpg"}
{"source-ref":"s3://escience-workshop-rr/traindata_cat_dog_images_20/10.jpg"}
{"source-ref":"s3://escience-workshop-rr/traindata_cat_dog_images_20/100.jpg"}
{"source-ref":"s3://escience-workshop-rr/traindata_cat_dog_images_20/110.jpg"}
{"source-ref":"s3://escience-workshop-rr/traindata_cat_dog_images_20/120.jpg"}
{"source-ref":"s3://escience-workshop-rr/traindata_cat_dog_images_20/130.jpg"}
{"source-ref":"s3://escience-workshop-rr/traindata_cat_dog_images_20/140.jpg"}
{"source-ref":"s3://escience-workshop-rr/traindata_cat_dog_images_20/150.jpg"}
{"source-ref":"s3://escience-workshop-rr/traindata_cat_dog_images_20/160.jpg"}
{"source-ref":"s3://escience-workshop-rr/traindata_cat_dog_images_20/170.jpg"}

…
```
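
Because each line is a standalone JSON object, the manifest can be read line by line. This is a minimal sketch with a hypothetical helper, `manifest_objects`, that splits each `source-ref` URI into its bucket and key. The two sample lines are copied from the listing above.

```python
import json
from urllib.parse import urlparse


def manifest_objects(text):
    """Yield (bucket, key) for each source-ref line in manifest text."""
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        uri = urlparse(json.loads(line)['source-ref'])
        yield uri.netloc, uri.path.lstrip('/')


sample = '''{"source-ref":"s3://escience-workshop-rr/traindata_cat_dog_images_20/0.jpg"}
{"source-ref":"s3://escience-workshop-rr/traindata_cat_dog_images_20/10.jpg"}
'''
for bucket_name, key in manifest_objects(sample):
    print(bucket_name, key)
```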

**Output Manifest File**

Located in the S3 bucket under the prefix: /manifests/output.manifest
The manifest is a JSON Lines file (one JSON object per line) that captures metadata about each labeled image.
Sample:
 
```json
{"source-ref":"s3://escience-workshop-rr/traindata_cat_dog_images_20/0.jpg","groundtruth-labeling-job-cat-dog":0,"groundtruth-labeling-job-cat-dog-metadata":{"confidence":0.74,"job-name":"labeling-job/groundtruth-labeling-job-cat-dog","class-name":"cat","human-annotated":"yes","creation-date":"2019-09-13T17:15:38.005564","type":"groundtruth/image-classification"}}
{"source-ref":"s3://escience-workshop-rr/traindata_cat_dog_images_20/10.jpg","groundtruth-labeling-job-cat-dog":0,"groundtruth-labeling-job-cat-dog-metadata":{"confidence":0.56,"job-name":"labeling-job/groundtruth-labeling-job-cat-dog","class-name":"cat","human-annotated":"yes","creation-date":"2019-09-13T17:16:50.941356","type":"groundtruth/image-classification"}}
{"source-ref":"s3://escience-workshop-rr/traindata_cat_dog_images_20/100.jpg","groundtruth-labeling-job-cat-dog":0,"groundtruth-labeling-job-cat-dog-metadata":{"confidence":0.74,"job-name":"labeling-job/groundtruth-labeling-job-cat-dog","class-name":"cat","human-annotated":"yes","creation-date":"2019-09-13T17:15:38.005587","type":"groundtruth/image-classification"}}

….
```

Along with other metadata, the output manifest records the class assigned to each
image and a confidence score for that label.
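
One way to summarize the results is to count the class names and collect the confidence values. The sketch below uses a hypothetical helper, `summarize_labels`. It assumes the metadata key is the job name followed by `-metadata`, as in the sample above, and it runs on trimmed copies of the first two sample lines rather than on a real output file.

```python
import json
from collections import Counter


def summarize_labels(text, job_name):
    """Count class names and collect confidences from output-manifest text."""
    counts = Counter()
    confidences = []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        metadata = json.loads(line)[job_name + '-metadata']
        counts[metadata['class-name']] += 1
        confidences.append(metadata['confidence'])
    return counts, confidences


job = 'groundtruth-labeling-job-cat-dog'
# Trimmed copies of the sample records: only the fields used here are kept.
sample_output = '\n'.join([
    json.dumps({'source-ref': 's3://escience-workshop-rr/traindata_cat_dog_images_20/0.jpg',
                job: 0, job + '-metadata': {'confidence': 0.74, 'class-name': 'cat'}}),
    json.dumps({'source-ref': 's3://escience-workshop-rr/traindata_cat_dog_images_20/10.jpg',
                job: 0, job + '-metadata': {'confidence': 0.56, 'class-name': 'cat'}}),
])
counts, confidences = summarize_labels(sample_output, job)
print(dict(counts), confidences)
```

On a real output manifest, the same counts can be compared against the number of cat and dog images you expect.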
 

### Ready to train

Open the Jupyter [CatAndDog Notebook](catanddog.ipynb) to walk through training the model with [Amazon SageMaker](https://aws.amazon.com/sagemaker/).