{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Amazon SageMaker Ground Truth for high quality labeling\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n",
"\n",
"This sample notebook takes you through creation of a Ground Truth labeling job.\n",
"\n",
"\n",
"### Steps\n",
"To create a Ground Truth job we will perform the following steps:\n",
"* Create S3 bucket which will hold the images as well as job related data\n",
"* Set proper CORS permissions on the bucket\n",
"* Upload the images to be labled\n",
"* Define labeling job template\n",
"* Create labeling job using private workforce"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import os\n",
"import base64\n",
"import json\n",
"import time\n",
"import imageio\n",
"import boto3\n",
"import sagemaker\n",
"import time\n",
"from os import listdir\n",
"from os.path import isfile, join"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### USER INPUT REQUIRED "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"BUCKET = <> # Change this to your bucket\n",
"private_workteam_arn = <># Change this to your private team arn\n",
"EXP_NAME = \"aim406\" # Any valid S3 prefix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Make sure the bucket is in the same region as this notebook.\n",
"role = sagemaker.get_execution_role()\n",
"region = boto3.session.Session().region_name\n",
"s3 = boto3.client(\"s3\")\n",
"bucket_region = s3.head_bucket(Bucket=BUCKET)[\"ResponseMetadata\"][\"HTTPHeaders\"][\n",
" \"x-amz-bucket-region\"\n",
"]\n",
"assert (\n",
" bucket_region == region\n",
"), \"Your S3 bucket {} and this notebook need to be in the same region.\".format(BUCKET)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define the configuration rules\n",
"cors_configuration = {\n",
" 'CORSRules': [{\n",
" 'AllowedHeaders': [],\n",
" 'AllowedMethods': ['GET'],\n",
" 'AllowedOrigins': ['*'],\n",
" 'ExposeHeaders': []\n",
" }]\n",
"}\n",
"\n",
"\n",
"# Set the CORS configuration\n",
"s3.put_bucket_cors(Bucket=BUCKET,\n",
" CORSConfiguration=cors_configuration)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Upload pics to S3 and create manifest"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mypath=\"Logos\"\n",
"onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]\n",
"\n",
"# Create and upload the input manifest.\n",
"manifest_name = \"input.manifest\"\n",
"with open(manifest_name, \"w\") as f:\n",
" for img_id in onlyfiles:\n",
" img_path = \"s3://{}/{}/images/{}\".format(BUCKET, EXP_NAME, img_id)\n",
" f.write('{\"source-ref\": \"' + img_path + '\"}\\n')\n",
" s3.upload_file(mypath+'/'+img_id, BUCKET, EXP_NAME + \"/images/\" + img_id)\n",
" \n",
"s3.upload_file(manifest_name, BUCKET, EXP_NAME + \"/\" + manifest_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After running the cell above, you should be able to go to `s3://BUCKET/EXP_NAME/images` in the [S3 console](https://console.aws.amazon.com/s3/) and see the images which need to be labled"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Specify the categories\n",
"\n",
"To run an object detection labeling job, you must decide on a set of classes the annotators can choose from. At the moment, Ground Truth only supports annotating one object detection class at a time. In our case, the class list is simply `['DHL','McDonalds', 'RedBull']`. To work with Ground Truth, this list needs to be converted to a .json file and uploaded to the S3 `BUCKET`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"CLASS_LIST = ['DHL','McDonalds', 'RedBull']\n",
"print(\"Label space is {}\".format(CLASS_LIST))\n",
"\n",
"json_body = {\"labels\": [{\"label\": label} for label in CLASS_LIST]}\n",
"with open(\"class_labels.json\", \"w\") as f:\n",
" json.dump(json_body, f)\n",
"\n",
"s3.upload_file(\"class_labels.json\", BUCKET, EXP_NAME + \"/class_labels.json\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should now see `class_labels.json` in `s3://BUCKET/EXP_NAME/`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create the instruction template\n",
"\n",
"Part or all of your images will be annotated by human annotators. It is **essential** to provide good instructions. Good instructions are:\n",
"1. Concise. We recommend limiting verbal/textual instruction to two sentences and focusing on clear visuals.\n",
"2. Visual. In the case of object detection, we recommend providing several labeled examples with different numbers of boxes.\n",
"\n",
"When used through the AWS Console, Ground Truth helps you create the instructions using a visual wizard. When using the API, you need to create an HTML template for your instructions. Below, we prepare a very simple but effective template and upload it to your S3 bucket.\n",
"\n",
"NOTE: If you use any images in your template (as we do), they need to be publicly accessible. You can enable public access to files in your S3 bucket through the S3 Console, as described in [S3 Documentation](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/set-object-permissions.html). \n",
"\n",
"#### Testing your instructions\n",
"**It is very easy to create broken instructions.** This might cause your labeling job to fail. However, it might also cause your job to complete with meaningless results if, for example, the annotators have no idea what to do or the instructions are misleading. At the moment the only way to test the instructions is to run your job in a private workforce. This is a way to run a mock labeling job for free. We describe how in [Verify your task using a private team [OPTIONAL]](#Verify-your-task-using-a-private-team-[OPTIONAL]).\n",
"\n",
"It is helpful to show examples of correctly labeled images in the instructions. The following code block produces several such examples for our dataset and saves them in `s3://BUCKET/EXP_NAME/`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open('sample.jpg', 'rb') as instructions:\n",
" instructions_uri = base64.b64encode(instructions.read()).decode('utf-8').replace('\\n', '')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from IPython.core.display import HTML, display\n",
"\n",
"\n",
"def make_template(test_template=False, save_fname=\"instructions.template\"):\n",
" template = r\"\"\"\n",
" \n",
" \n",
" \n",
"\n",
" \n",
"
Inspect the image
\n",
"
Determine if the specified label is/are visible in the picture.
\n",
"
Outline each instance of the specified label in the image using the provided “Box” tool.
\n",
" \n",
"
\n",
"
Boxes should fit tight around each object
\n",
"
Do not include parts of the object are overlapping or that cannot be seen, even though you think you can interpolate the whole shape.
\n",
"
Avoid including shadows.
\n",
"
If the target is off screen, draw the box up to the edge of the image.
\n",
"
\n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \"\"\".format(\n",
" class_name='Company Logo',\n",
" instructions_uri=instructions_uri,\n",
" labels_str=str(CLASS_LIST)\n",
" if test_template\n",
" else \"{{ task.input.labels | to_json | escape }}\",\n",
" )\n",
" with open(save_fname, \"w\") as f:\n",
" f.write(template)\n",
"\n",
"\n",
"make_template(test_template=True, save_fname=\"instructions.html\")\n",
"make_template(test_template=False, save_fname=\"instructions.template\")\n",
"s3.upload_file(\"instructions.template\", BUCKET, EXP_NAME + \"/instructions.template\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should now be able to find your template in `s3://BUCKET/EXP_NAME/instructions.template`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Specify ARNs for resources needed to run an object detection job.\n",
"ac_arn_map = {\n",
" \"us-west-2\": \"081040173940\",\n",
" \"us-east-1\": \"432418664414\",\n",
" \"us-east-2\": \"266458841044\",\n",
" \"eu-west-1\": \"568282634449\",\n",
" \"ap-northeast-1\": \"477331159723\",\n",
"}\n",
"\n",
"prehuman_arn = \"arn:aws:lambda:{}:{}:function:PRE-BoundingBox\".format(region, ac_arn_map[region])\n",
"acs_arn = \"arn:aws:lambda:{}:{}:function:ACS-BoundingBox\".format(region, ac_arn_map[region])\n",
"labeling_algorithm_specification_arn = \"arn:aws:sagemaker:{}:027400017018:labeling-job-algorithm-specification/object-detection\".format(\n",
" region\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit the Ground Truth job request\n",
"The API starts a Ground Truth job by submitting a request."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"VERIFY_USING_PRIVATE_WORKFORCE = True\n",
"USE_AUTO_LABELING = True\n",
"CLASS_NAME='Company Logo'\n",
"\n",
"task_description = \"Dear Annotator, please draw a box around each {}. Thank you!\".format(CLASS_NAME)\n",
"task_keywords = [\"image\", \"object\", \"detection\"]\n",
"task_title = \"Please draw a box around each {}.\".format(CLASS_NAME)\n",
"job_name = \"reinv-gt-\" + str(int(time.time()))\n",
"\n",
"human_task_config = {\n",
" \"AnnotationConsolidationConfig\": {\n",
" \"AnnotationConsolidationLambdaArn\": acs_arn,\n",
" },\n",
" \"PreHumanTaskLambdaArn\": prehuman_arn,\n",
" \"MaxConcurrentTaskCount\": 10, # Number of images that will be sent at a time to the workteam.\n",
" \"NumberOfHumanWorkersPerDataObject\": 1, # We will obtain and consolidate at this many human annotations for each image.\n",
" \"TaskAvailabilityLifetimeInSeconds\": 21600, # Your workteam has 6 hours to complete all pending tasks.\n",
" \"TaskDescription\": task_description,\n",
" \"TaskKeywords\": task_keywords,\n",
" \"TaskTimeLimitInSeconds\": 300, # Each image must be labeled within 5 minutes.\n",
" \"TaskTitle\": task_title,\n",
" \"UiConfig\": {\n",
" \"UiTemplateS3Uri\": \"s3://{}/{}/instructions.template\".format(BUCKET, EXP_NAME),\n",
" },\n",
"}\n",
"\n",
"\n",
"human_task_config[\"WorkteamArn\"] = private_workteam_arn\n",
"\n",
"ground_truth_request = {\n",
" \"InputConfig\": {\n",
" \"DataSource\": {\n",
" \"S3DataSource\": {\n",
" \"ManifestS3Uri\": \"s3://{}/{}/{}\".format(BUCKET, EXP_NAME, manifest_name),\n",
" }\n",
" },\n",
" \"DataAttributes\": {\n",
" \"ContentClassifiers\": [\"FreeOfPersonallyIdentifiableInformation\", \"FreeOfAdultContent\"]\n",
" },\n",
" },\n",
" \"OutputConfig\": {\n",
" \"S3OutputPath\": \"s3://{}/{}/output/\".format(BUCKET, EXP_NAME),\n",
" },\n",
" \"HumanTaskConfig\": human_task_config,\n",
" \"LabelingJobName\": job_name,\n",
" \"RoleArn\": role,\n",
" \"LabelAttributeName\": \"category\",\n",
" \"LabelCategoryConfigS3Uri\": \"s3://{}/{}/class_labels.json\".format(BUCKET, EXP_NAME),\n",
"}\n",
"\n",
"if USE_AUTO_LABELING:\n",
" ground_truth_request[\"LabelingJobAlgorithmsConfig\"] = {\n",
" \"LabelingJobAlgorithmSpecificationArn\": labeling_algorithm_specification_arn\n",
" }\n",
"\n",
"sagemaker_client = boto3.client(\"sagemaker\")\n",
"sagemaker_client.create_labeling_job(**ground_truth_request)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "Python 3 (Data Science)",
"language": "python",
"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}