Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

SPDX-License-Identifier: MIT-0

# Video metadata extraction and knowledge graph workshop

# Objectives:
This repository contains a series of 4 Jupyter notebooks demonstrating how AWS AI services such as Amazon Rekognition, Amazon Transcribe and Amazon Comprehend can help you extract valuable metadata from your video assets and store that information in a graph database such as Amazon Neptune for maximum query performance and flexibility.
At the end of the workshop, you'll be able to search for a specific label or entity and return a list of 1-minute video segments related to your search across your videos.

To extract metadata from a video, we'll use the following AWS AI services:
- Amazon Rekognition to cut the video into scenes and detect labels from the video itself
- Amazon Transcribe to convert audio into text
- Amazon Comprehend to extract entities and topics from the transcribed text via Topic Modelling and Named Entity recognition.

The metadata related to the video, segments, scenes, entities and labels will be stored in Amazon Neptune.
Amazon Neptune is a fully managed, low-latency graph database service that will allow us to store metadata as nodes (aka vertices) and branches (aka edges) that represent relationships between the nodes.
https://aws.amazon.com/neptune/
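
To make the node/edge idea concrete, here is a minimal in-memory sketch in plain Python. It does not use Neptune or Gremlin, and the vertex labels, edge names and sample values (`contains_segment`, `contains_label`, `Car`) are hypothetical placeholders rather than the schema built in part 3.

```python
# Hypothetical stand-in for a property graph: vertices keyed by id, edges as triples.
vertices = {
    "video-1": {"label": "video", "name": "trailer"},
    "segment-1": {"label": "segment", "start_s": 0, "end_s": 60},
    "segment-2": {"label": "segment", "start_s": 60, "end_s": 120},
    "label-car": {"label": "label", "name": "Car"},
}
edges = [
    ("video-1", "contains_segment", "segment-1"),
    ("video-1", "contains_segment", "segment-2"),
    ("segment-2", "contains_label", "label-car"),
]

def segments_with_label(name):
    """Return the ids of segment vertices linked to a label vertex with this name."""
    label_ids = {
        vertex_id
        for vertex_id, props in vertices.items()
        if props["label"] == "label" and props.get("name") == name
    }
    return sorted(
        source
        for source, _, target in edges
        if target in label_ids and vertices[source]["label"] == "segment"
    )

print(segments_with_label("Car"))
```

Searching for a label then amounts to following edges back from the label vertex to the segments that point at it, which is the kind of traversal a graph database is designed to answer quickly.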

The diagram below summarises the workflow:

![Overall workflow](../static/overview.png "Overall workflow")

Topics addressed within the different notebooks:

Part 0:<br>
Create the environment (S3 bucket, IAM roles/policies, SNS topic, etc) and upload your sample video

Part 1:<br>
Use Amazon Rekognition to detect scenes and labels from your video

Part 2:<br>
Use Amazon Transcribe to transcribe audio to text, and Amazon Comprehend to extract metadata (topics, named entities) from the transcripts.

Part 3:<br>
Store all the previously extracted metadata in Amazon Neptune and query the graph.

Part 4:<br>
Resources clean-up

## Costs
Please note that you might incur costs by running these notebooks. Most of these AI services have a free tier, but depending on how much you've already used or on the size of your video assets, you might exceed its limits.

Finally, if you're not planning to use these resources after the workshop, don't forget to shut down/delete your Amazon Neptune instance and your SageMaker Studio notebook instances, and run the part4-cleanup notebook to delete all the other resources created throughout the notebooks (S3 buckets, IAM roles, SNS topics, etc).

Before proceeding, please check the related services pricing pages:

https://aws.amazon.com/transcribe/pricing/

https://aws.amazon.com/comprehend/pricing/

https://aws.amazon.com/rekognition/pricing/

https://aws.amazon.com/neptune/pricing/

# Part 0 - Environment setup - S3 Bucket creation, SNS topic and IAM role

In the steps below, we're going to create the S3 bucket where we'll upload our video, the SNS topic that some AWS services will use to publish the outcome of their jobs, and the policies/roles that the various AWS services need to access those objects.<br>

<b>Please note that you will need to provide a valid .mp4 video stored in an S3 bucket as input for this workshop. It is NOT included in the GitHub repo assets.</b> 

This video will be used for the different metadata extraction steps. We suggest you use a ~5-minute editorial video or video trailer for which you hold the required rights.

The example we used to run the various jobs and generate the graphs is a video trailer from an Amazon Studios production.

In [None]:
!pip install boto3
!pip install sagemaker

In [None]:
import boto3
import sagemaker
import random
import json
import time
import os
import shutil
import logging
import sys
logging.basicConfig(format='%(asctime)s | %(levelname)s : %(message)s',
                     level=logging.INFO, stream=sys.stdout)
log = logging.getLogger('knowledge-graph-logger')

s3 = boto3.client('s3')

IMPORTANT:<br>

Before you start executing this notebook, make sure that the execution role you've configured for your notebook or Studio instance has the following permissions:
- read/write permission to your S3 buckets
- IAM permission to create the policy/role
- SNS permission to create an SNS topic
- permissions to invoke Amazon Rekognition, Amazon Comprehend, Amazon Transcribe APIs (e.g. AmazonRekognitionFullAccess, ComprehendFullAccess, AmazonTranscribeFullAccess)

You'll get "AuthorizationErrorException" messages otherwise.

In [None]:
iam = boto3.client("iam")

#get sagemaker execution role Arn
sagemaker_role = sagemaker.get_execution_role()

#get the role's name
sagemaker_role_name = sagemaker_role.split('/')[-1]

print(f'sagemaker role name:{sagemaker_role_name} \n')

The cell below lists all managed IAM policies attached to your SageMaker execution role. Check that it has the required permissions before proceeding. Note that this cell will not run if your SageMaker execution role doesn't have the required IAM rights.

In [None]:
#retrieve associated managed iam policies (paginated, in case the role has many)
paginator = iam.get_paginator('list_attached_role_policies')

#listing
for page in paginator.paginate(RoleName=sagemaker_role_name):
    for policy in page['AttachedPolicies']:
        print(policy)
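
The listing can also be compared against the managed policies named in the prerequisites. The sketch below assumes the `AttachedPolicies` entries have the `PolicyName`/`PolicyArn` shape returned by IAM. The helper name `missing_managed_policies` is hypothetical, and the sample entries are hand-written, so no AWS call is made. An empty result is not required: equivalent permissions may come from custom or inline policies instead.

```python
# Managed policy names taken from the prerequisites listed earlier in this notebook.
REQUIRED_MANAGED_POLICIES = {
    "AmazonRekognitionFullAccess",
    "ComprehendFullAccess",
    "AmazonTranscribeFullAccess",
}

def missing_managed_policies(attached_policies, required=REQUIRED_MANAGED_POLICIES):
    """Return the required policy names absent from a list of AttachedPolicies entries."""
    attached_names = {policy["PolicyName"] for policy in attached_policies}
    return sorted(set(required) - attached_names)

# Hand-written response fragment used for illustration only.
sample_attached = [
    {"PolicyName": "AmazonRekognitionFullAccess",
     "PolicyArn": "arn:aws:iam::aws:policy/AmazonRekognitionFullAccess"},
    {"PolicyName": "AmazonS3FullAccess",
     "PolicyArn": "arn:aws:iam::aws:policy/AmazonS3FullAccess"},
]
print(missing_managed_policies(sample_attached))
```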

### SNS topic creation
We're creating a simple topic that will later be used, notably by Amazon Rekognition, to publish the outcome/status of the video analysis jobs.

In [None]:
sns = boto3.client('sns')

def create_sns_topic(name):
    try:
        topic = sns.create_topic(Name=name)
    except Exception:
        log.exception("Couldn't create topic %s.", name)
        raise
    else:
        return topic['TopicArn']
    
sns_topic_arn = create_sns_topic('knowledge-graph-lab-rek-sns-topic')

print(sns_topic_arn)

### S3 bucket creation

Amazon S3 bucket names are globally unique. To create a unique bucket name, we're appending your account ID and a random int at the end of the bucket name.
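
Bucket names must also follow S3 naming rules. The sketch below checks only a subset of them (3 to 63 characters, lowercase letters, digits, dots and hyphens, starting and ending with a letter or digit, no consecutive dots). The helper name `is_valid_bucket_name` and the sample account ID are hypothetical.

```python
import re

# Subset of the S3 general purpose bucket naming rules; not an exhaustive validator.
_BUCKET_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")

def is_valid_bucket_name(name):
    """Return True if the name passes the basic length and character checks."""
    return bool(_BUCKET_NAME_PATTERN.fullmatch(name)) and ".." not in name

# Same naming pattern as this notebook, with a made-up 12-digit account ID.
candidate = "sagemaker-knowledge-graph-" + "ap-southeast-2" + "-" + "123456789012" + "-" + "100000"
print(len(candidate), is_valid_bucket_name(candidate))
```

With a 14-character region name and the largest possible random suffix, the generated name stays within the 63-character limit.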

In [None]:
region = 'ap-southeast-2' #specify the region of your choice

#retrieving your account ID
account_id = boto3.client('sts').get_caller_identity().get('Account')

#bucket name
bucket = 'sagemaker-knowledge-graph-' + region + '-' + account_id + '-' + str(random.randint(0,100000))

log.info(f'bucket name: {bucket}')

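#create the bucket
#note: us-east-1 is the default location and must not be passed as a LocationConstraint
if region == 'us-east-1':
    s3.create_bucket(Bucket=bucket)
else:
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={'LocationConstraint': region}
    )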

Creating the bucket

### Create S3 bucket policy
Amazon Rekognition, Transcribe and Comprehend will need to be able to read the contents of your S3 bucket, so we add a bucket policy that allows it.

In [None]:
s3_bucket_policy = {
    "Version": "2012-10-17",
    "Id": "KnowledgeGraphS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "KnowledgeGraphS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "rekognition.amazonaws.com",
                    "transcribe.amazonaws.com",
                    "comprehend.amazonaws.com"
                ]
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket),
                "arn:aws:s3:::{}/*".format(bucket)
            ]
        }
    ]
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(s3_bucket_policy));
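
When a policy statement names several service principals, note that a Python dict literal keeps only the last value for a repeated key. Several services are therefore normally written as a list under a single `"Service"` key. A small illustration; the variable names are hypothetical.

```python
import json

# A dict literal silently keeps only the last value of a repeated key.
repeated_keys = {
    "Service": "rekognition.amazonaws.com",
    "Service": "transcribe.amazonaws.com",
    "Service": "comprehend.amazonaws.com",
}

# A list under a single key keeps every service principal.
listed = {
    "Service": [
        "rekognition.amazonaws.com",
        "transcribe.amazonaws.com",
        "comprehend.amazonaws.com",
    ]
}

print(json.dumps(repeated_keys))
print(json.dumps(listed))
```

Printing the serialized policy before submitting it is a quick way to confirm that every intended principal is present.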

### IAM Role creation

We create the role that Amazon Rekognition, Comprehend and Transcribe will need to run jobs.

In [None]:
role_name = account_id+"-knowledgeGraphLab"

assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": [
              "rekognition.amazonaws.com",
              "transcribe.amazonaws.com",
              "comprehend.amazonaws.com"
            ]
          },
          "Action": "sts:AssumeRole"
        }
    ]
}

try:
    create_role_response = iam.create_role(
        RoleName = role_name,
        AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
    );
    
except iam.exceptions.EntityAlreadyExistsException as e:
    print('Warning: role already exists:', e)
    create_role_response = iam.get_role(
        RoleName = role_name
    );

role_arn = create_role_response["Role"]["Arn"]

# Pause to allow role to be fully consistent
time.sleep(10)

print('IAM Role: {}'.format(role_arn))
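
A fixed sleep is the simplest way to wait for a newly created role to propagate. One possible alternative, sketched below, is to retry the first operation that depends on the role until it succeeds. The `retry` helper and its parameters are hypothetical, and the example uses a stand-in function rather than a real AWS call.

```python
import time

def retry(operation, attempts=5, delay_s=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or attempts run out, then re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay_s)

# Stand-in for a call that fails until the role is usable (fails twice, then succeeds).
state = {"calls": 0}

def flaky_operation():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("role not usable yet")
    return "ok"

# The sleep is replaced by a no-op so that the example returns immediately.
result = retry(flaky_operation, sleep=lambda seconds: None)
print(result, state["calls"])
```

In practice it would be preferable to catch only the specific error raised when the role is not yet usable, rather than every exception.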

<br>
We create 2 policies, for S3 and SNS, that we attach to the role we created above.


In [None]:
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket),
                "arn:aws:s3:::{}/*".format(bucket)
            ]
        }
    ]
}

#creating the s3 policy
s3_policy_response = iam.create_policy(
    PolicyName='s3AccessForRekCompTrans',
    PolicyDocument=json.dumps(s3_policy),
)

s3_policy_arn = s3_policy_response['Policy']['Arn']

print(s3_policy_arn)

In [None]:
#attaching the above policy to the role
attach_s3_policy_response = iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = s3_policy_response['Policy']['Arn'])

print('Response:{}'.format(attach_s3_policy_response['ResponseMetadata']['HTTPStatusCode']))

In [None]:
sns_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "sns:*"
            ],
            "Effect": "Allow",
            "Resource": sns_topic_arn
        }
    ]
}
#creating the sns policy
sns_policy_response = iam.create_policy(
    PolicyName='snsAccessForRekognition-' + str(random.randint(0,1000)),
    PolicyDocument=json.dumps(sns_policy),
)

sns_policy_arn = sns_policy_response['Policy']['Arn']

print(sns_policy_arn)

In [None]:
#attaching the SNS policy created above to the role
attach_sns_policy_response = iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = sns_policy_arn)

print('Response:{}'.format(attach_sns_policy_response['ResponseMetadata']['HTTPStatusCode']))

### Uploading the video to the newly created S3 bucket

Please specify below the S3 URL of the video file you'll use to run the notebooks. Please keep in mind that it needs to be a valid .mp4 and that your SageMaker execution role must have access to the S3 bucket where it is stored. You'll get an access denied exception otherwise.

In [None]:
#S3 URL where you have uploaded your video
your_s3_original_video = 's3://< your s3 bucket>/< path to the .mp4 file>'

#extracting video names and prefix
your_s3_bucket = your_s3_original_video.split('/')[2]
your_s3_prefix = '/'.join(your_s3_original_video.split('/')[3:])
video_file = your_s3_original_video.split('/')[-1]
video_name = video_file.split('.')[0]
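
The string splitting above assumes a well-formed `s3://bucket/key` URL. A slightly more defensive sketch using only the standard library is shown below. The helper name `parse_s3_url` and the sample URL are hypothetical, and keys containing `?` or `#` are not handled.

```python
import os
from urllib.parse import urlparse

def parse_s3_url(url):
    """Split an s3:// URL into (bucket, key, file name, file name without extension)."""
    parsed = urlparse(url)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a valid S3 object URL: {url}")
    file_name = os.path.basename(key)
    # splitext removes only the last extension, so dots inside the name are kept.
    stem, _extension = os.path.splitext(file_name)
    return parsed.netloc, key, file_name, stem

print(parse_s3_url("s3://my-bucket/videos/trailer.v2.mp4"))
```

Unlike `split('.')[0]`, this keeps the full name when the file name itself contains dots.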

Downloading the file locally from your source S3 bucket to your notebook instance and uploading it to the target S3 bucket for processing.

In [None]:
#creating a temporary folder on your instance to store the video locally.
tmp_local_folder = './tmp'
#start from an empty folder: remove any previous copy (rmtree returns once deletion is done)
if os.path.exists(tmp_local_folder):
    shutil.rmtree(tmp_local_folder)
#create folder
os.makedirs(tmp_local_folder)

In [None]:
#download the file locally
s3.download_file(your_s3_bucket, your_s3_prefix, os.path.join(tmp_local_folder, video_file))

#upload the video file to the target S3 bucket
s3_video_input_path = 'input'
s3.upload_file(os.path.join(tmp_local_folder, video_file), bucket, os.path.join(s3_video_input_path, video_file))

## Amazon Neptune

For part 3 of the workshop, you will need to create a Neptune DB cluster.

<b>IMPORTANT: please make sure you create a brand new Neptune instance for this workshop as we'll be cleaning it of its content</b>

The easiest way is to create your database via the console.

Make sure you are in the same region as the one where you previously created your Jupyter notebook instance.

Engine options: at the time when I developed this workshop, the 1.0.5.1.R2 version was the latest.

DB cluster identifier: specify a relevant name

Templates: "Development and Testing"

![Amazon Neptune DB creation](../static/neptune-creation-part1.png "Amazon Neptune DB creation")

DB instance size: db.t3.medium

Multi-AZ deployment: No

Connectivity: make sure you choose the same VPC as the one you're using for your notebook instance. In my case I am using the default one.


![Amazon Neptune DB creation](../static/neptune-creation-part2.png "Amazon Neptune DB creation")

Notebook configuration: uncheck the "Create notebook" option. We are going to create a separate notebook in SageMaker.

Leave the rest as default and click "Create Database".

![Amazon Neptune DB creation](../static/neptune-creation-part3.png "Amazon Neptune DB creation")

Once your cluster's status is "Available", retrieve the endpoint url and port and update the endpoint variable below.

![Amazon Neptune endpoint](../static/neptune-ui.png "Amazon Neptune endpoint")

In [None]:
your_neptune_endpoint_url = 'wss://<your neptune endpoint>:<port>/gremlin'
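
The endpoint string follows the `wss://<host>:<port>/gremlin` pattern shown above. A minimal sketch for assembling it from the values displayed in the console is given below. The helper name `neptune_gremlin_url` and the sample host are hypothetical, and 8182 is assumed as the port because it is Neptune's default; use the port shown for your own cluster.

```python
def neptune_gremlin_url(host, port=8182):
    """Build the websocket Gremlin URL for a Neptune endpoint host and port."""
    host = host.strip()
    if not host or "/" in host or ":" in host:
        raise ValueError("host must be a bare hostname without scheme, port or path")
    return f"wss://{host}:{int(port)}/gremlin"

# Placeholder hostname, not a real cluster.
print(neptune_gremlin_url("my-cluster.cluster-example.ap-southeast-2.neptune.amazonaws.com"))
```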

Storing the variables we'll reuse in the later notebooks for the different metadata extraction jobs

In [None]:
%store tmp_local_folder
%store bucket
%store s3_video_input_path
%store video_file
%store video_name
%store role_arn
%store role_name
%store sns_topic_arn
%store s3_policy_arn
%store sns_policy_arn
%store your_neptune_endpoint_url