{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Amazon SageMaker Ground Truth\n", "### Lab : Data Labeling Using Private Workforce \n", "\n", "In this lab, you will use Amazon SageMaker Ground Truth to label images in a training\n", "dataset consisting of cat and dog images. You will start with an unlabeled image\n", "training data set, acquire labels for all the images using SageMaker Ground Truth\n", "private workforce and finally analyze the results of the labeling job.\n", "High Level Steps:\n", "\n", "1. Upload training data into an S3 bucket.\n", "1. Create a private Ground Truth Labeling workforce.\n", "1. Create a Ground Truth Labeling job\n", "1. Label the images using the Ground Truth Labeling portal.\n", "1. Analyze results\n", "\n", "This notebook is based on the lab in the repository [https://github.com/mahendrabairagi/GroundTruth_lab](https://github.com/mahendrabairagi/GroundTruth_lab)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import botocore\n", "import json\n", "import time\n", "import os\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from IPython.display import display, Markdown\n", "\n", "s3 = boto3.client('s3')\n", "s3_resource = boto3.resource('s3')\n", "\n", "session = boto3.session.Session()\n", "region = session.region_name\n", "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", "\n", "bucket = 'escience-workshop-{{FIXME}}'\n", "\n", "pd.set_option('display.max_colwidth', -1)\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Create S3 Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html)\n", "\n", "We will create an S3 bucket that will be used throughout the workshop for storing data.\n", "\n", "[s3.create_bucket](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.create_bucket) boto3 documentation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_bucket(bucket):\n", " import logging\n", "\n", " try:\n", " s3.create_bucket(Bucket=bucket, CreateBucketConfiguration={'LocationConstraint': region})\n", " except botocore.exceptions.ClientError as e:\n", " logging.error(e)\n", " return 'Bucket {0} could not be created.'.format(bucket)\n", " return 'Created {0} bucket.'.format(bucket)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_bucket(bucket)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download the training data.\n", "\n", "In this step you will download the training data to your local machine.\n", "* Download the training data (cat & dog images) from this link\n", "https://s3.amazonaws.com/groundtruth-ml-roadshowworkshop/traindata_cat_dog_images_20.zip\n", "* Extract the traindata_cat_dog_images_20.zip, if necessary. You should\n", "see “traindata_cat_dog_images_20” folder with about 20 files." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget https://s3.amazonaws.com/groundtruth-ml-roadshow-workshop/traindata_cat_dog_images_20.zip" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!unzip traindata_cat_dog_images_20.zip" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Verify files\n", "\n", "You should have 20 images of cats and dogs downloaded locally. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls -la traindata_cat_dog_images_20" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import Image\n", "Image(filename='traindata_cat_dog_images_20/160.jpg') " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Upload to S3](https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html)\n", "\n", "Next, we will upload the files you just downloaded to S3 to be used with [SageMaker Ground Truth](https://aws.amazon.com/sagemaker/groundtruth/) for labeling.\n", "\n", "[s3.upload_file](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_file) boto3 documentation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# enumerate local files recursively\n", "local_directory = 'traindata_cat_dog_images_20'\n", "\n", "for root, dirs, files in os.walk(local_directory):\n", " for filename in files:\n", " # construct the full local path\n", " local_path = os.path.join(root, filename)\n", " relative_path = os.path.relpath(local_path, local_directory)\n", " if not relative_path.startswith('.'):\n", " s3_path = os.path.join(local_directory, relative_path)\n", " print(s3_path)\n", " s3.upload_file(local_path, bucket, s3_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a private Ground Truth Labeling Workforce.\n", "\n", "In this step, you will create a “private workteam” and add only one user (you) to it.\n", "To create a private team:\n", "\n", "* Go to AWS Console > Amazon SageMaker > Labeling workforces\n", " - Click \"Private\" tab and then \"Create private team\".\n", " - Enter the desired name for your private workteam.\n", " - Enter your own email address in the \"Email addresses\" section.\n", " - Enter the name of your organization.\n", " - Enter contact email in the \"Contact email\" for the private workteam.\n", " - Click \"Create Private Team\".\n", "* The AWS Console should now return to AWS Console > Amazon SageMaker > Labeling workforces. Your newly created team should be visible under \"Private teams\".\n", "* You should get an email from `no-reply@verificationemail.com` that contains your workforce username and password.\n", " - Use the link and login credentials from the email to access the Labeling portal.\n", " - You will be asked to create a new, non-default password\n", "\n", "That's it! This is your private worker's interface.\n", "Once the Ground Truth labeling job is submitted in the next step, you will see the\n", "annotation job in this portal." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a private Ground Truth Labeling Job.\n", "\n", "In this step, you will create a Ground Truth Labeling job and assign it to the private\n", "workforce created in Step 3.\n", "* Go to AWS Console > Amazon SageMaker > Labeling jobs\n", "* Click ‘Create labeling job’\n", " - In ‘Specify job details’ step\n", " - Job name : groundtruth-labeling-job-cat-dog (Note : Any unique name will do)\n", " - Input dataset location\n", " - Create manifest\n", " - Entire S3 path where images are located. (Note : should end with /; For eg : s3://escience-workshop-{{FIXME}}/traindata_cat_dog_images_20/)\n", " - Select 'Images' as data type\n", " - Wait till the manifest creation is complete.\n", " - Click \"Use this manifest\"\n", " - Output dataset location : Enter S3 bucket path (For eg : s3://escience-workshop-{{FIXME}}/cat_dog_images_labeled/)\n", " - IAM Role\n", " - Select 'Create a new role' from the dropdown.\n", " - In the “Specific S3 buckets” section, enter the S3 bucket created in Step 1\n", " - Click Create\n", " - Task Type\n", " - Select 'Image classification'\n", " - Click Next\n", " - In 'Workers' Step\n", " - Select ‘Private’\n", " - Select the team created in previous step from the Private teams dropdown.\n", " - Examine ‘Additional configuration’ options\n", " - Leave ‘Automated data labeling’ → ‘Enable’ unchecked.\n", " - Leave ‘Number of workers per dataset object’ at 1\n", " - In 'Image classification labeling tool' Step\n", " ![IMGLABEL](../../docs/assets/images/img-class-label.png)\n", " - Enter \"Please classify the images as 'cat' or 'dog' \" in the textbox as an instruction to the workforce.\n", " - Add two Options 'cat' or 'dog'\n", " - Submit\n", " - Go to AWS Console > Amazon SageMaker > Labeling jobs to verify that a labeling job has been created. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Label the images using the Ground Truth Labeling portal\n", "\n", "In this step, you will complete a labeling/annotation job assigned to you from the\n", "Ground Truth Labeling portal.\n", "* Login to the Ground Truth Labeling portal using the link provided to you in the email from `noreply@verificationemail.com`. (Note : This is the same portal you used in Step 2). \n", "Once the annotation job is assigned, you can view the job (similar to the picture below)\n", "![IMGPRVW](../../docs/assets/images/gt-preview.png)\n", "**Note** : After labeling a subset of images, the annotation job will be complete. If the first annotation\n", "job did not include all 20 images, you will see a new job in the portal after a few minutes. Repeat\n", "the process of labeling images in the jobs as they appear in the portal, till all images are labelled.\n", "You can check the status of the labeling job from the Ground Truth → Labeling Jobs, which will\n", "show you the number of images labeled out of the total images.\n", "![LBLJOB](../../docs/assets/images/gt-label-job.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analyze Results\n", "\n", "In this step, you will review the manifest files created during the Ground Truth\n", "Labeling process. The manifest files are in the S3 bucket you created in Step 1.\n", "\n", "**Input Manifest File**\n", "Located in S3 bucket in the prefix : traindata_cat_dog_images_20/datasetxxxxxx.manifest.\n", "The manifest is a json file that captures information about the training data.\n", "Sample :\n", "\n", "```json\n", "{\"source-ref\":\"s3://escience-workshop-rr/traindata_cat_dog_images_20/0.jpg\"}\n", "{\"source-ref\":\"s3://escience-workshop-rr/traindata_cat_dog_images_20/10.jpg\"}\n", "{\"source-ref\":\"s3://escience-workshop-rr/traindata_cat_dog_images_20/100.jpg\"}\n", "{\"source-ref\":\"s3://escience-workshop-rr/traindata_cat_dog_images_20/110.jpg\"}\n", "{\"source-ref\":\"s3://escience-workshop-rr/traindata_cat_dog_images_20/120.jpg\"}\n", "{\"source-ref\":\"s3://escience-workshop-rr/traindata_cat_dog_images_20/130.jpg\"}\n", "{\"source-ref\":\"s3://escience-workshop-rr/traindata_cat_dog_images_20/140.jpg\"}\n", "{\"source-ref\":\"s3://escience-workshop-rr/traindata_cat_dog_images_20/150.jpg\"}\n", "{\"source-ref\":\"s3://escience-workshop-rr/traindata_cat_dog_images_20/160.jpg\"}\n", "{\"source-ref\":\"s3://escience-workshop-rr/traindata_cat_dog_images_20/170.jpg\"}\n", "\n", "…\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Output Manifest File**\n", "Located in S3 bucket in the prefix : /manifests/output.manifest\n", "The manifest is a json file that captures metadata about each labeled image.\n", "Sample:\n", " \n", "```json\n", "{\"source-ref\":\"s3://escience-workshop-rr/traindata_cat_dog_images_20/0.jpg\",\"groundtruth-labeling-job-cat-dog\":0,\"groundtruth-labeling-job-cat-dog-metadata\":{\"confidence\":0.74,\"job-name\":\"labeling-job/groundtruth-labeling-job-cat-dog\",\"class-name\":\"cat\",\"human-annotated\":\"yes\",\"creation-date\":\"2019-09-13T17:15:38.005564\",\"type\":\"groundtruth/image-classification\"}}\n", "{\"source-ref\":\"s3://escience-workshop-rr/traindata_cat_dog_images_20/10.jpg\",\"groundtruth-labeling-job-cat-dog\":0,\"groundtruth-labeling-job-cat-dog-metadata\":{\"confidence\":0.56,\"job-name\":\"labeling-job/groundtruth-labeling-job-cat-dog\",\"class-name\":\"cat\",\"human-annotated\":\"yes\",\"creation-date\":\"2019-09-13T17:16:50.941356\",\"type\":\"groundtruth/image-classification\"}}\n", "{\"source-ref\":\"s3://escience-workshop-rr/traindata_cat_dog_images_20/100.jpg\",\"groundtruth-labeling-job-cat-dog\":0,\"groundtruth-labeling-job-cat-dog-metadata\":{\"confidence\":0.74,\"job-name\":\"labeling-job/groundtruth-labeling-job-cat-dog\",\"class-name\":\"cat\",\"human-annotated\":\"yes\",\"creation-date\":\"2019-09-13T17:15:38.005587\",\"type\":\"groundtruth/image-classification\"}}\n", "\n", "….\n", "```\n", "\n", "Along with the other metadata information, the output manifest shows the identified class of the\n", "image and confidence. \n", " " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Ready to train\n", "\n", "Open Jupyter [CatAndDog Notebook](catanddog.ipynb) to walk through training the model with [Amazon SageMaker](https://aws.amazon.com/sagemaker/)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }