# Amazon SageMaker Ground Truth Demonstration for Video Classification Labeling Job

1. [Introduction](#Introduction)
    1. [Cost and runtime](#cost-runtime)
    2. [Prerequisites](#prereq)
2. [Run a Ground Truth labeling job](#run-labeling-job)
    1. [Prepare the data](#Prepare-the-data)
    2. [Create a Video Frame Input Manifest File](#create-manifest)
    3. [Create the instruction template](#create-template)
    4. [Use a private team to test your task](#Use-a-private-team-to-test-your-task)
    5. [Define pre-built lambda functions for use in the labeling job](#lambda)
    6. [Submit the Ground Truth job request](#submit-req)
    7. [Monitor job progress](#monitor)
    8. [View Task Results](#view-task)
3. [Clean Up - Optional](#cleanup)

## 1. Introduction <a class="anchor" id="Introduction"></a>

This sample notebook takes you through an end-to-end workflow to demonstrate the functionality of SageMaker Ground Truth Video Classification. You use an Amazon SageMaker Ground Truth video classification labeling task when you need workers to classify videos using predefined labels that you specify. Workers are shown videos and are asked to choose one label for each video.

You create a video classification labeling job using the [CreateLabelingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateLabelingJob.html) operation. Your video files must be encoded in a format that is supported by the browser used by the work team that labels your data. It is recommended that you verify that all video file formats in your input manifest file display correctly using the worker UI preview. You can communicate supported browsers to your workers using worker instructions. To see supported file formats, see [Supported Data Formats](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-supported-data-formats.html).

#### Cost and runtime <a class="anchor" id="cost-runtime"></a>

For pricing, refer to [Ground Truth pricing](https://aws.amazon.com/sagemaker/groundtruth/pricing/). For certain built-in task types, Amazon SageMaker Ground Truth can use active learning, a machine learning technique that identifies which data should be labeled by your workers, to automate the labeling of part of your input data. In Ground Truth, this functionality is called automated data labeling, and it helps reduce the cost and time it takes to label a dataset compared to using only humans. This notebook exposes it through the `USE_AUTO_LABELING` flag, which is set to `False` by default.

#### Prerequisites <a class="anchor" id="prereq"></a>
To run this notebook, you can simply execute each cell one by one. To understand what's happening, you'll need:
* An S3 bucket you can write to. The next cell uses the default bucket of your SageMaker session; to use a different one, set `BUCKET` to its name. The bucket must be in the same region as this SageMaker notebook instance. You can also change `EXP_NAME` to any valid S3 prefix. All the files related to this experiment will be stored under that prefix in your bucket.
* Basic familiarity with [Amazon S3](https://docs.aws.amazon.com/s3/index.html).
* Basic understanding of [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
* Basic familiarity with the [AWS Command Line Interface (CLI)](https://aws.amazon.com/cli/). Set it up with credentials to access the AWS account you're running this notebook from. This should work out of the box on SageMaker Jupyter notebook instances.

This notebook has only been tested on SageMaker Studio notebooks and SageMaker notebook instances. The runtimes given are approximate; we used an `ml.t3.medium` instance with the `Data Science` image. However, you can likely run it on a local instance by first executing the setup cells below on SageMaker and then copying the `role` string to your local copy of the notebook.

NOTES:
- This notebook will create and remove subdirectories in its working directory. We recommend placing this notebook in its own directory before running it.

- Ground Truth requires that all S3 buckets containing labeling job input image data have a CORS policy attached. To learn more about this requirement, see [CORS Permission Requirement](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-cors-update.html).
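
One way to attach such a policy is through the S3 API. The sketch below is not part of the original notebook: the helper names `build_cors_configuration` and `apply_cors` are hypothetical, and the rule itself (read-only `GET` access from any origin) is an assumption modeled on the policy described in the linked CORS documentation, so check it against that page before relying on it. Note that `put_bucket_cors` replaces any CORS configuration already on the bucket.

```python
def build_cors_configuration():
    """Return a read-only CORS configuration (assumed policy, verify before use)."""
    return {
        "CORSRules": [
            {
                "AllowedHeaders": [],
                "AllowedMethods": ["GET"],
                "AllowedOrigins": ["*"],
                "ExposeHeaders": ["Access-Control-Allow-Origin"],
            }
        ]
    }


def apply_cors(bucket_name):
    """Attach the configuration above to `bucket_name` (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    s3_client = boto3.client("s3")
    s3_client.put_bucket_cors(
        Bucket=bucket_name, CORSConfiguration=build_cors_configuration()
    )


# Example (run only after BUCKET is defined in the next cell):
# apply_cors(BUCKET)
```

Keeping the configuration in a separate function makes it easy to inspect or adjust the rule before anything is sent to S3.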

In [2]:
# cell 01

%load_ext autoreload
%autoreload 2

import os
import json
import time
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import boto3
import sagemaker
from urllib.parse import urlparse
import warnings

sess = sagemaker.Session()
BUCKET = sess.default_bucket() 

EXP_NAME = "label-video/video-classification"  # Any valid S3 prefix.

# VERIFY_USING_PRIVATE_WORKFORCE = True # private team leveraged for labelling job

In [3]:
# cell 02

# Make sure the bucket is in the same region as this notebook.

role = sagemaker.get_execution_role()
region = boto3.session.Session().region_name

s3 = boto3.client("s3")
bucket_region = s3.head_bucket(Bucket=BUCKET)["ResponseMetadata"]["HTTPHeaders"][
    "x-amz-bucket-region"
]

assert (
    bucket_region == region
), f"Your S3 bucket {BUCKET} and this notebook need to be in the same region."

## 2. Run a Ground Truth labeling job <a class="anchor" id="run-labeling-job"></a>


**This section should take about 30 min to complete.**

We will first run a labeling job. This involves several steps: collecting the video for labeling, specifying the possible label categories, creating instructions, and writing a labeling job specification.

### Prepare the data

For this demo, we use four videos, two indoor and two outdoor, with a frame rate of 25 FPS and a resolution of 512x512.


We will copy these videos from the local `video-classification-data` directory to our S3 `BUCKET` and create the corresponding *input manifest*. The input manifest is a formatted list of the S3 locations of the videos we want Ground Truth to annotate. We will upload this manifest to our S3 `BUCKET`.


### Create a Video Frame Input Manifest File <a class="anchor" id="create-manifest"></a>
Ground Truth uses the input manifest file to identify the location of your input dataset when creating labeling tasks. For video classification labeling jobs, each line in the input manifest file is a JSON object whose `source-ref` key identifies the S3 location of one video file. Sequence files, which list the images in a single sequence of video frames, apply to video frame labeling jobs and are not needed here. For more information, see [Create a Video Frame Input Manifest File](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-video-manual-data-setup.html#sms-video-create-manifest).

In [4]:
# cell 03

# Upload the videos to S3 and create the input manifest.
# Manifest File: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-input-data-input-manifest.html

manifest_name = "input.manifest"
data_dir = "./video-classification-data/"

# Upload each .mp4 file and collect one JSON line per uploaded video.
manifest_lines = []
for filename in sorted(os.listdir(data_dir)):
    if not filename.endswith(".mp4"):
        continue  # skip non-video files so the manifest gets no blank lines
    s3.upload_file(os.path.join(data_dir, filename), BUCKET, f"{EXP_NAME}/{filename}")
    manifest_lines.append(json.dumps({"source-ref": f"s3://{BUCKET}/{EXP_NAME}/{filename}"}))

# Write the manifest: one JSON object per line.
with open(manifest_name, "w") as outfile:
    outfile.write("\n".join(manifest_lines))
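
Before uploading the manifest, it can be worth checking that it is well formed. The following is a minimal sketch and not part of the original notebook: `validate_manifest_lines` is a hypothetical helper that checks that every line is a JSON object with an `s3://` `source-ref`, and the sample URIs are made up for illustration.

```python
import json


def validate_manifest_lines(lines):
    """Return the S3 URIs found in `lines`; raise ValueError on the first bad line."""
    uris = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            raise ValueError(f"line {number} is blank")
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {number} is not valid JSON: {err}") from err
        uri = record.get("source-ref") if isinstance(record, dict) else None
        if not isinstance(uri, str) or not uri.startswith("s3://"):
            raise ValueError(f"line {number} has no s3:// source-ref")
        uris.append(uri)
    return uris


# Example with two hypothetical manifest lines.
sample = [
    '{"source-ref": "s3://my-bucket/label-video/video-classification/a.mp4"}',
    '{"source-ref": "s3://my-bucket/label-video/video-classification/b.mp4"}',
]
print(validate_manifest_lines(sample))
```

To check the file written above, you could read `input.manifest`, split its contents on `"\n"`, and pass the resulting list to the helper.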

In [5]:
# cell 04

# Upload the input manifest and the worker UI template to S3.
# Note: template.html must already exist in the working directory.

manifest_prefix = EXP_NAME.split("/")[0]
s3.upload_file("input.manifest", BUCKET, f"{manifest_prefix}/input.manifest")
s3.upload_file("template.html", BUCKET, f"{manifest_prefix}/template.html")

### Create the Instruction Template <a class="anchor" id="create-template"></a>
Specify the labels and provide instructions for the workers.

In [6]:
# cell 05

# define the classes
json_body = {
    "labels": [
        {"label": "indoor"},
        {"label": "outdoor"},
    ],
    "instructions": {
        "shortInstruction": "<p>Please label each video.</p>"
    },
}

# upload the json to s3
with open("class_labels.json", "w") as f:
    json.dump(json_body, f)

s3.upload_file("class_labels.json", BUCKET, EXP_NAME + "/class_labels.json")

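### Use a private team to test your task <a class="anchor" id="Use-a-private-team-to-test-your-task"></a>


This step requires a private work team. If you do not have one yet, set it up first; you can then find its ARN in the Amazon SageMaker console under Ground Truth > Labeling workforces > Private Teams.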

In [None]:
# cell 06

# private workforce team

private_workteam_arn = "<Your-Work-Team-ARN>"

assert (
    private_workteam_arn != "<Your-Work-Team-ARN>"
), "Please enter your private workforce team, private_workteam_arn. You can find it on Amazon SageMaker console > Ground Truth > Labeling workforces > Private Teams"


WORKTEAM_ARN = private_workteam_arn
print("WORKTEAM_ARN : {}".format(WORKTEAM_ARN))

### Define pre-built Lambda functions for use in the labeling job <a class="anchor" id="lambda"></a>

In [None]:
# cell 07

# Specify ARNs for resources needed to run a video classification job.
ac_arn_map = {
    "us-west-2"     : "081040173940",
    "us-east-1"     : "432418664414",
    "us-east-2"     : "266458841044",
    "eu-west-1"     : "568282634449",
    "ap-northeast-1": "477331159723",
}

# PreHumanTaskLambdaArn for VideoClassification
prehuman_arn = f"arn:aws:lambda:{region}:{ac_arn_map[region]}:function:PRE-VideoMultiClass"

# AnnotationConsolidationConfig for VideoClassification
acs_arn = f"arn:aws:lambda:{region}:{ac_arn_map[region]}:function:ACS-VideoMultiClass" 

# auto-labelling job
# https://docs.aws.amazon.com/sagemaker/latest/dg/sms-automated-labeling.html
labeling_algorithm_specification_arn = f"arn:aws:sagemaker:{region}:027400017018:labeling-job-algorithm-specification/video-classification"
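
The cell above indexes `ac_arn_map` directly, so running the notebook in a region that is missing from the map fails with a bare `KeyError`. As a hedged alternative, the sketch below wraps the lookup in a hypothetical helper, `lambda_arn`, that reports the known regions instead. The account IDs are copied from the cell above; the helper itself is not part of the original notebook.

```python
# Account IDs copied from the notebook cell above.
AC_ARN_MAP = {
    "us-west-2": "081040173940",
    "us-east-1": "432418664414",
    "us-east-2": "266458841044",
    "eu-west-1": "568282634449",
    "ap-northeast-1": "477331159723",
}


def lambda_arn(region_name, function_name, account_map=AC_ARN_MAP):
    """Build a pre-built Lambda ARN, with a readable error for unknown regions."""
    if region_name not in account_map:
        known = ", ".join(sorted(account_map))
        raise ValueError(
            f"Region {region_name!r} is not in the account map; known regions: {known}"
        )
    account_id = account_map[region_name]
    return f"arn:aws:lambda:{region_name}:{account_id}:function:{function_name}"


print(lambda_arn("us-east-1", "PRE-VideoMultiClass"))
# -> arn:aws:lambda:us-east-1:432418664414:function:PRE-VideoMultiClass
```

With this helper, `prehuman_arn` and `acs_arn` could be built as `lambda_arn(region, "PRE-VideoMultiClass")` and `lambda_arn(region, "ACS-VideoMultiClass")`.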


### Submit the Ground Truth job request <a class="anchor" id="submit-req"></a>
You start a Ground Truth job by submitting a request through the API. The request contains the full configuration of the annotation task and lets you modify fine details of the job that are fixed to default values when you use the AWS Console. The parameters that make up the request are described in more detail in the [SageMaker Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateLabelingJob.html).

#### Verify your task using a private team <a class="anchor" id="verify-team"></a>
Run the next two cells. This will define the task and submit it to the private workforce (that is, to you). After a few minutes, you should be able to see your task in your private workforce interface. Please verify that the task appears as you want it to appear.

In [None]:
# cell 08

# task definitions
task_description = "Classify each video and label it. Thank you!"
task_keywords = ["Video Classification"]
task_title = "Video Classification"
job_name = "video-classification-" + str(int(time.time()))
no_human_per_object = 1  # number of workers required to label each video
task_time_limit = 28800  # a worker has 8 hours to complete a task
task_availability_lifetime = 21600  # a task stays available to human workers for 6 hours
max_concurrent_task_count = 100  # maximum number of data objects that can be labeled by human workers at the same time
USE_AUTO_LABELING = False  # set to True to request automated data labeling

In [None]:
# cell 09
template_path = f"s3://{BUCKET}/{EXP_NAME.split('/')[0]}/template.html"

human_task_config={
    'WorkteamArn': private_workteam_arn,
    'UiConfig': {'UiTemplateS3Uri' : template_path},
    'PreHumanTaskLambdaArn': prehuman_arn,
    'TaskKeywords': [
        'Video Classification',
    ],
    'TaskTitle': 'Video classification task',
    'TaskDescription': 'Select a label to classify this video',
    'NumberOfHumanWorkersPerDataObject': no_human_per_object,
    'TaskTimeLimitInSeconds': task_time_limit,
    'TaskAvailabilityLifetimeInSeconds': task_availability_lifetime,
    'MaxConcurrentTaskCount': max_concurrent_task_count,
    'AnnotationConsolidationConfig': {
        'AnnotationConsolidationLambdaArn': acs_arn
    },
}
    
ground_truth_request = {
    "InputConfig":{
        'DataSource': {
            'S3DataSource': {
                'ManifestS3Uri': f"s3://{BUCKET}/{EXP_NAME.split('/')[0]}/{manifest_name}",
            }
        },
        'DataAttributes': {
            'ContentClassifiers': [
                'FreeOfPersonallyIdentifiableInformation','FreeOfAdultContent',
            ]
        }
    },
    "OutputConfig":{
        'S3OutputPath': f"s3://{BUCKET}/{EXP_NAME}/output/",
    },
    
    "HumanTaskConfig": human_task_config,
    "LabelingJobName": job_name,
    "RoleArn": role,
    "LabelAttributeName": "category",
    "LabelCategoryConfigS3Uri": f"s3://{BUCKET}/{EXP_NAME}/class_labels.json",

}

if USE_AUTO_LABELING:
    ground_truth_request["LabelingJobAlgorithmsConfig"] = {
        "LabelingJobAlgorithmSpecificationArn": labeling_algorithm_specification_arn
    }
sagemaker_client = boto3.client("sagemaker")
sagemaker_client.create_labeling_job(**ground_truth_request)

### Monitor job progress <a class="anchor" id="monitor"></a>
You can monitor the job's progress through the AWS Console. In this notebook, we poll the job status through the API instead. You can re-run the next cell repeatedly. It sends a `describe_labeling_job` request, which tells you whether the job has completed. If it has, `LabelingJobStatus` will be `Completed`.

In [None]:
# cell 10

# Poll `describe_labeling_job` every 45 seconds until the job is no longer 'InProgress'.
# When the job has finished successfully, 'LabelingJobStatus' is 'Completed'.
while True:
    job_status = sagemaker_client.describe_labeling_job(LabelingJobName=job_name)['LabelingJobStatus']
    print('Labeling job : {}, status : {}'.format(job_name, job_status))
    if job_status != 'InProgress':
        break
    time.sleep(45)
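
The loop above runs for as long as the job stays in progress. A minimal sketch of the same polling pattern with an upper bound on the waiting time is shown below. `wait_for_job` and its parameters are hypothetical names rather than part of the notebook or of boto3, and the status source is passed in as a callable so the logic can be tried without an AWS account.

```python
import time


def wait_for_job(get_status, poll_seconds=45, timeout_seconds=3600, sleep=time.sleep):
    """Poll `get_status()` until it returns something other than 'InProgress'.

    Raises TimeoutError if the job is still in progress after `timeout_seconds`.
    """
    waited = 0
    while True:
        status = get_status()
        if status != "InProgress":
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still InProgress after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Example with a fake status source standing in for describe_labeling_job.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
print(wait_for_job(lambda: next(fake_statuses), sleep=lambda seconds: None))
```

In the notebook, the callable could be `lambda: sagemaker_client.describe_labeling_job(LabelingJobName=job_name)["LabelingJobStatus"]`.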

### View Task Results <a class="anchor" id="view-task"></a>
Once work is completed, Amazon SageMaker Ground Truth stores the results in your S3 bucket and sends a CloudWatch event. Your results should be available under the S3 output path printed by the next cell when all work is completed.

In [None]:
# cell 11

# output path
S3_OUTPUT = sagemaker_client.describe_labeling_job(LabelingJobName=job_name)['OutputConfig']['S3OutputPath'] + job_name
print('S3 OUTPUT_PATH : {}'.format(S3_OUTPUT))

# Download human annotation data.
!aws s3 cp {S3_OUTPUT + '/manifests/output/output.manifest'} "./output/" #--recursive --quiet

In [None]:
# cell 12

data=[]
with open('./output/output.manifest') as f:
    for line in f:
        json_data = json.loads(line)
        data.append(json_data)
        
print(data)
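
The raw records can be hard to read. The sketch below shows one way to reduce them to a mapping from video to label. It assumes, without having verified it for this job type, that the output manifest follows the layout Ground Truth uses for classification jobs, where the class name is stored under `class-name` in `<LabelAttributeName>-metadata` (here `category-metadata`). `extract_labels` is a hypothetical helper and the sample line is made up for illustration, so inspect the printed `data` above to confirm the actual keys.

```python
import json


def extract_labels(manifest_lines, label_attribute="category"):
    """Map each video URI to its class name, or None if no class name is present."""
    labels = {}
    for line in manifest_lines:
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        metadata = record.get(f"{label_attribute}-metadata", {})
        labels[record["source-ref"]] = metadata.get("class-name")
    return labels


# Hypothetical output line for illustration only; this is not real job output.
sample_line = json.dumps(
    {
        "source-ref": "s3://my-bucket/label-video/video-classification/a.mp4",
        "category": 0,
        "category-metadata": {"class-name": "indoor", "human-annotated": "yes"},
    }
)
print(extract_labels([sample_line]))
```

To apply it to the downloaded file, you could open `./output/output.manifest` and pass the file object directly, since iterating over it yields one line at a time.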

## 3. Clean Up [OPTIONAL] <a class="anchor" id="cleanup"></a>
Finally, let's clean up by stopping the labeling job if it is still in progress.

In [None]:
# cell 13

if sagemaker_client.describe_labeling_job(LabelingJobName=job_name)['LabelingJobStatus'] == 'InProgress':
    sagemaker_client.stop_labeling_job(LabelingJobName=job_name)

## The End!