{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "a73bd45f-9f55-4c7b-93ae-9db1135f2f0f", "metadata": { "tags": [] }, "source": [ "# Amazon SageMaker Ground Truth Demonstration for Video Classification Labeling Job\n", "\n", "1. [Introduction](#Introduction)\n", "    1. [Cost and runtime](#cost-runtime)\n", "    2. [Prerequisites](#prereq)\n", "2. [Run a Ground Truth labeling job](#run-labeling-job)\n", "    1. [Prepare the data](#Prepare-the-data)\n", "    2. [Create a Video Frame Input Manifest File](#create-manifest)\n", "    3. [Create the instruction template](#create-template)\n", "    4. [Use a private team to test your task](#Create-a-private-team-to-test-your-task)\n", "    5. [Define pre-built lambda functions for use in the labeling job](#lambda)\n", "    6. [Submit the Ground Truth job request](#submit-req)\n", "    7. [Monitor job progress](#monitor)\n", "    8. [View Task Results](#view-task)\n", "3. [Clean Up - Optional](#cleanup)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "0abbbc7c-a515-4475-934f-c48cf2c66b48", "metadata": {}, "source": [ "## 1. Introduction \n", "\n", "This sample notebook walks you through an end-to-end workflow that demonstrates SageMaker Ground Truth video classification. Use an Amazon SageMaker Ground Truth video classification labeling task when you need workers to classify videos using predefined labels that you specify. Workers are shown videos and are asked to choose one label for each video.\n", "\n", "You create a video classification labeling job using the [CreateLabelingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateLabelingJob.html) operation. Your video files must be encoded in a format that is supported by the browser used by the work team that labels your data. We recommend using the worker UI preview to verify that every video file format in your input manifest file displays correctly. 
You can communicate supported browsers to your workers using worker instructions. For the supported file formats, see [Supported Data Formats](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-supported-data-formats.html).\n", "\n", "#### Cost and runtime \n", "\n", "1. For pricing, refer to [Ground Truth pricing](https://aws.amazon.com/sagemaker/groundtruth/pricing/). To reduce cost, we will use Ground Truth's auto-labeling feature. Amazon SageMaker Ground Truth can use active learning to automate the labeling of your input data for certain built-in task types. Active learning is a machine learning technique that identifies data that should be labeled by your workers. In Ground Truth, this functionality is called automated data labeling. Automated data labeling helps reduce the cost and time it takes to label your dataset compared to using only human workers.\n", "\n", "#### Prerequisites \n", "To run this notebook, execute each cell in order. To follow what's happening, you'll need:\n", "* An S3 bucket you can write to. The following cell uses your SageMaker session's default bucket; change `BUCKET` there to use a different one. The bucket must be in the same region as this SageMaker Notebook instance. You can also change `EXP_NAME` to any valid S3 prefix. All the files related to this experiment will be stored under that prefix in your bucket.\n", "* Basic familiarity with [Amazon S3](https://docs.aws.amazon.com/s3/index.html).\n", "* Basic understanding of [Amazon SageMaker](https://aws.amazon.com/sagemaker/).\n", "* Basic familiarity with the [AWS Command Line Interface (CLI)](https://aws.amazon.com/cli/) -- set it up with credentials to access the AWS account you're running this notebook from. This should work out-of-the-box on SageMaker Jupyter Notebook instances.\n", "\n", "This notebook has only been tested on SageMaker Studio notebooks and SageMaker notebook instances. The runtimes given are approximate; we used an `ml.t3.medium` instance with the `Data Science` image. 
However, you can likely run it on a local instance by first executing the cell below on SageMaker, and then copying the `role` string to your local copy of the notebook.\n", "\n", "NOTES: \n", "- This notebook will create/remove subdirectories in its working directory. We recommend placing this notebook in its own directory before running it. \n", "\n", "- Ground Truth requires that all S3 buckets containing labeling job input image data have a CORS policy attached. To learn more about this change, see [CORS Permission Requirement](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-cors-update.html)." ] }, { "cell_type": "code", "execution_count": 2, "id": "7973cd90-c64a-4df1-88e8-b82be01a2edc", "metadata": { "tags": [] }, "outputs": [], "source": [ "# cell 01\n", "\n", "%load_ext autoreload\n", "%autoreload 2\n", "\n", "import os\n", "import json\n", "import time\n", "import pandas as pd\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "from sklearn.metrics import confusion_matrix\n", "import boto3\n", "import sagemaker\n", "from urllib.parse import urlparse\n", "import warnings\n", "\n", "sess = sagemaker.Session()\n", "BUCKET = sess.default_bucket() \n", "\n", "EXP_NAME = \"label-video/video-classification\"  # Any valid S3 prefix.\n", "\n", "# VERIFY_USING_PRIVATE_WORKFORCE = True # private team leveraged for labelling job" ] }, { "cell_type": "code", "execution_count": 3, "id": "7210a79d-ca05-419d-b2a3-09cb12c62b03", "metadata": { "tags": [] }, "outputs": [], "source": [ "# cell 02\n", "\n", "# Make sure the bucket is in the same region as this notebook.\n", "\n", "role = sagemaker.get_execution_role()\n", "region = boto3.session.Session().region_name\n", "\n", "s3 = boto3.client(\"s3\")\n", "bucket_region = s3.head_bucket(Bucket=BUCKET)[\"ResponseMetadata\"][\"HTTPHeaders\"][\n", "    \"x-amz-bucket-region\"\n", "]\n", "\n", "assert (\n", "    bucket_region == region\n", "), f\"Your S3 bucket {BUCKET} and this notebook need to be in the same region.\"" ] }, 
{ "attachments": {}, "cell_type": "markdown", "id": "8e627465-266f-4c6d-bd2c-a64e57391731", "metadata": { "tags": [] }, "source": [ "## 2. Run a Ground Truth labeling job \n", "\n", "\n", "**This section should take about 30 min to complete.**\n", "\n", "We will first run a labeling job. This involves several steps: collecting the videos for labeling, specifying the possible label categories, creating instructions, and writing a labeling job specification.\n", "\n", "### Prepare the data\n", "\n", "For this demo, we use four videos, two indoor and two outdoor, each with a frame rate of 25 FPS and a resolution of 512x512.\n", "\n", "\n", "We will copy these videos from the local data directory to our S3 `BUCKET` and create the corresponding *input manifest*. The input manifest is a formatted list of the S3 locations of the videos we want Ground Truth to annotate. We will upload this manifest to our S3 `BUCKET`.\n", "\n", "\n", "### Create a Video Frame Input Manifest File \n", "Ground Truth uses the input manifest file to identify the location of your input dataset when creating labeling tasks. For video classification labeling jobs, each line in the input manifest file identifies the location of a video file. Each sequence file identifies the images included in a single sequence of video frames. 
For more information, see the [Ground Truth documentation on creating a video input manifest file](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-video-manual-data-setup.html#sms-video-create-manifest)." ] }, { "cell_type": "code", "execution_count": 4, "id": "01068453-ff95-466d-b820-ea09ba7382eb", "metadata": { "tags": [] }, "outputs": [], "source": [ "# cell 03\n", "\n", "# Upload the videos to S3 and write one manifest line per video.\n", "# Manifest File: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-input-data-input-manifest.html\n", "\n", "manifest_name = 'input.manifest'\n", "outfile = open(manifest_name, 'w')\n", "line = 0\n", "\n", "for filename in sorted(os.listdir('./video-classification-data/')):\n", "    if filename.endswith('.mp4'):\n", "        # Separate entries with a newline; skipped files must not add blank lines.\n", "        if line > 0:\n", "            outfile.write(\"\\n\")\n", "\n", "        s3.upload_file(f\"./video-classification-data/{filename}\", BUCKET, EXP_NAME + f\"/{filename}\")\n", "        ss = f'\"s3://{BUCKET}/{EXP_NAME}/{filename}\"'\n", "        videos_list = '{' + f'\"source-ref\":{ss}' + '}'\n", "        outfile.write(f'{videos_list}')\n", "        line += 1\n", "\n", "outfile.close()" ] }, { "cell_type": "code", "execution_count": 5, "id": "20031da2-a4e6-4c49-8ac1-47511049dc11", "metadata": { "tags": [] }, "outputs": [], "source": [ "# cell 04\n", "\n", "# Upload the input manifest and the worker task template (template.html) to S3\n", "\n", "s3.upload_file(\"input.manifest\", BUCKET, f\"{EXP_NAME.split('/')[0]}\" + \"/input.manifest\")\n", "s3.upload_file(\"template.html\", BUCKET, f\"{EXP_NAME.split('/')[0]}\" + \"/template.html\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b39fd076-d195-423c-a165-61eb6fe75342", "metadata": {}, "source": [ "### Create the Instruction Template \n", "Specify the labels and provide instructions for the workers." ] }, { "cell_type": "code", "execution_count": 6, "id": "ae761eb9-00cf-475a-99fc-78e8855e1fb0", "metadata": { "tags": [] }, "outputs": [], "source": [ "# cell 05\n", "\n", "# define the classes\n", "json_body = {\n", "    \"labels\": [\n", "        {\"label\": \"indoor\"},\n", "        {\"label\": 
\"outdoor\"},\n", "    ],\n", "    \"instructions\": {\n", "        \"shortInstruction\": \"Please label each video.\"\n", "    }\n", "}\n", "\n", "# upload the json to s3\n", "with open(\"class_labels.json\", \"w\") as f:\n", "    json.dump(json_body, f)\n", "\n", "s3.upload_file(\"class_labels.json\", BUCKET, EXP_NAME + \"/class_labels.json\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f929d95e-6784-46f9-90f2-24972a07b557", "metadata": { "tags": [] }, "source": [ "### Use a private team to test your task \n", "\n", "\n", "Refer to the Prerequisites section to set up a private workforce team. " ] }, { "cell_type": "code", "execution_count": null, "id": "f3354c92-7f87-454e-a726-a314dc622058", "metadata": { "tags": [] }, "outputs": [], "source": [ "# cell 06\n", "\n", "# private workforce team\n", "\n", "private_workteam_arn = \"