{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Amanzon SageMaker Ground Truth Demonstration for Text Classification\n",
"\n",
"1. [Introduction](#introduction)\n",
"2. [Run a Ground Truth labeling job](#run-a-ground-truth-labeling-job)\n",
" 1. [Prepare the data](#prepare-the-data)\n",
" 2. [Specify the categories](#specify-the-categories)\n",
" 3. [Create the instruction template](#create-the-instruction-template)\n",
" 4. [Assign private team to test your task](#assign-private-team-to-test-your-task)\n",
" 5. [Define pre-built lambda functions for use in the labeling job](#define-pre-built-lambda-functions-for-use-in-the-labeling-job)\n",
" 6. [Submit the Ground Truth job request](#submit-the-ground-truth-job-request)\n",
" 7. [Monitor job progress](#monitor-job-progress)\n",
" 8. [View Task Results](#view-task-results)\n",
"3. [Clean Up](#Clean-Up)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Introduction\n",
"\n",
"\n",
"This sample notebook takes you through an end-to-end workflow to demonstrate the functionality of SageMaker Ground Truth. We'll start with an unlabeled review text data set, acquire labels for the sentiment like Positive, or Negative using SageMaker Ground Truth, and analyze the results of the labeling job. Before you begin, we highly recommend you start a Ground Truth labeling job through the AWS Console first to familiarize yourself with the workflow. The AWS Console offers less flexibility than the API, but is simple to use.\n",
"\n",
"#### Cost and runtime\n",
"You can run this demo in two modes:\n",
"1. Set `RUN_FULL_AL_DEMO = True` in the next cell to label 10k reviews(text). This should cost about \\$800 given current [Ground Truth pricing scheme](https://aws.amazon.com/sagemaker/groundtruth/pricing/). In order to reduce the cost, we will use Ground Truth's auto-labeling feature. Auto-labeling uses text classification to learn from human responses and automatically create labels for the easiest review text at a cheap price. The total end-to-end runtime should be about 20h.\n",
"1. Set `RUN_FULL_AL_DEMO = False` in the next cell to label only 10 reviews. This should cost about \\$.80. **Since Ground Truth's auto-labeling feature only kicks in for datasets of 1000 images or more, this cheaper version of the demo will not use it. Some of the analysis plots might look awkward, but you should still be able to see good results on the human-annotated 100 images.**\n",
"\n",
"#### Prerequisites\n",
"To run this notebook, you can simply execute each cell one-by-one. To understand what's happening, you'll need:\n",
"* An S3 bucket you can write to -- please provide its name in the following cell. The bucket must be in the same region as this SageMaker Notebook instance. You can also change the `EXP_NAME` to any valid S3 prefix. All the files related to this experiment will be stored in that prefix of your bucket.\n",
"* Familiarity with Python and [numpy](http://www.numpy.org/).\n",
"* Basic familiarity with [AWS S3](https://docs.aws.amazon.com/s3/index.html),\n",
"* Basic understanding of [AWS Sagemaker](https://aws.amazon.com/sagemaker/),\n",
"* Basic familiarity with [AWS Command Line Interface (CLI)](https://aws.amazon.com/cli/) -- set it up with credentials to access the AWS account you're running this notebook from. This should work out-of-the-box on SageMaker Jupyter Notebook instances.\n",
"\n",
"This notebook is only tested on a SageMaker notebook instance. The runtimes given are approximate, we used an `ml.m4.xlarge` instance in our tests. However, you can likely run it on a local instance by first executing the cell below on SageMaker, and then copying the `role` string to your local copy of the notebook.\n",
"\n",
"NOTE: This notebook will create/remove subdirectories in its working directory. We recommend to place this notebook in its own directory before running it."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# cell 01\n",
"%matplotlib inline\n",
"%load_ext autoreload\n",
"%autoreload 2\n",
"import os\n",
"from collections import namedtuple\n",
"from collections import defaultdict\n",
"from collections import Counter\n",
"import itertools\n",
"import json\n",
"import time\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"from matplotlib.backends.backend_pdf import PdfPages\n",
"from sklearn.metrics import confusion_matrix\n",
"import boto3\n",
"import sagemaker\n",
"from urllib.parse import urlparse\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\", category=DeprecationWarning)\n",
"\n",
"sess = sagemaker.Session()\n",
"BUCKET = sess.default_bucket()\n",
"\n",
"EXP_NAME = \"label-text/text-classification\" # Any valid S3 prefix.\n",
"RUN_FULL_AL_DEMO = False # See 'Cost and Runtime' in the Markdown cell above!\n",
"VERIFY_USING_PRIVATE_WORKFORCE = True # private team leveraged for labelling job"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# cell 02\n",
"# Make sure the bucket is in the same region as this notebook.\n",
"role = sagemaker.get_execution_role()\n",
"region = boto3.session.Session().region_name\n",
"\n",
"s3 = boto3.client(\"s3\")\n",
"bucket_region = s3.head_bucket(Bucket=BUCKET)[\"ResponseMetadata\"][\"HTTPHeaders\"][\n",
" \"x-amz-bucket-region\"\n",
"]\n",
"assert (\n",
" bucket_region == region\n",
"), \"You S3 bucket {} and this notebook need to be in the same region.\".format(BUCKET)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run a Ground Truth labeling job\n",
"**This section should take about 20h to complete.**\n",
"\n",
"We will first run a labeling job. This involves several steps: collecting the reviews text we want labeled, specifying the possible label categories, creating instructions, and writing a labeling job specification. In addition, we highly recommend to run a (free) mock job using a private workforce before you submit any job to the public workforce. This notebook will explain how to do that as an optional step. Without using a private workforce, this section until completion of your labeling job should take about 20h. However, this may vary depending on the availability of the public annotation workforce.\n",
"\n",
"## Prepare the data\n",
"We will first download IMDB reviews subset of the [IMDB Movie Review Dataset](https://huggingface.co/datasets/pietrolesci/imdb). Later, will compare Ground Truth annotations to these labels. Our dataset will include labels in the following categories:\n",
"\n",
"* Positive\n",
"* Negative\n",
"\n",
"\n",
"If you chose `RUN_FULL_AL_DEMO = False`, then we will choose a subset of 10 reviews text in this dataset. This is a diverse dataset of interesting reviews about movies, and should be fun for the human annotators to work with. You are free to ask the annotators to annotate the sentiments.\n",
"\n",
"We will copy these images to our local `BUCKET`, and will create the corresponding *input manifest*. The input manifest is a formatted list of the S3 locations of the images we want Ground Truth to annotate. We will upload this manifest to our S3 `BUCKET`.\n",
"\n",
"#### Disclosure regarding the Open IMDB Review Dataset V1.0:\n",
"Movie Review Dataset V1 is created by Stanford AI Lab. We have not modified the text or the accompanying annotations. You can obtain the text and the annotations [here](http://ai.stanford.edu/~amaas/data/sentiment/). The following paper describes the learning of word vectors for Sentiment Analysis.\n",
"\n",
"Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). \n",
"*Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).*([link to PDF](https://aclanthology.org/P11-1015/))"
]
},
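{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, each line of the input manifest is a standalone JSON object whose `source` field carries the raw review text; no S3 pointer is needed because the text is embedded directly. A hypothetical line (review text abbreviated) looks like this:\n",
"\n",
"```\n",
"{\"source\": \"I rented I AM CURIOUS-YELLOW from my video store because ...\"}\n",
"```"
]
},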
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# cell 03\n",
"# Download the data if not exists\n",
"if not os.path.isfile('tc_dataset.json'):\n",
" !curl -X GET \\\n",
" \"https://datasets-server.huggingface.co/rows?dataset=pietrolesci%2Fimdb&config=pietrolesci--imdb&split=train&offset=0&limit=100\" >> tc_dataset.json\n",
"\n",
"\n",
"read_file = pd.DataFrame(columns = ['source'])\n",
"text_list = []\n",
"# Read and store content of an json file \n",
"with open ('tc_dataset.json', \"r\") as f:\n",
" # Reading from file\n",
" data = json.loads(f.read())\n",
"\n",
" # Iterating through the json list\n",
" for i in data['rows']:\n",
" text_list.append(i['row']['text'])\n",
"# remove html tags\n",
"text_list = [i.replace(' Enter description of the labels/sentiments that workers have to choose from Positive : If you are positive about things, you are hopeful and confident, and think of the good aspects of a situation rather than the bad ones.\". Negative : A fact, situation, or experience that is negative is unpleasant, depressing, or harmful.\". Add examples to help workers understand the label Positive : “good”, “great”, “wonderful”, “fantastic”. Negative : “bad”, “terrible”, “awful”, “disgusting”. Enter description of the labels/sentiments that workers have to choose from Positive : If you are positive about things, you are hopeful and confident, and think of the good aspects of a situation rather than the bad ones.\". Negative : A fact, situation, or experience that is negative is unpleasant, depressing, or harmful.\". Add examples to help workers understand the label Positive : “good”, “great”, “wonderful”, “fantastic”. Negative : “bad”, “terrible”, “awful”, “disgusting”.
', '').replace('', '').replace('
','') for i in text_list]\n",
"\n",
"read_file['source'] = text_list\n",
"\n",
"\n",
"# If running the short version of the demo, reduce each class count 10 times.\n",
"if RUN_FULL_AL_DEMO is False:\n",
" text_data = read_file.sample(n=10)\n",
"else:\n",
" text_data = read_file\n",
"\n",
"# Create a manifest (jsonline) file\n",
"text_data.to_json(r'./input.manifest', orient='records', lines=True)\n",
"\n",
"# Upload the input manifest file to s3\n",
"manifest_name = \"input/input.manifest\"\n",
"s3.upload_file(r'./input.manifest', BUCKET, EXP_NAME + \"/\" + manifest_name)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After running the cell above, you should be able to go to `s3://BUCKET/EXP_NAME/input` in [S3 console](https://console.aws.amazon.com/s3/) and see a 10k reviews. We recommend you inspect the contents of these content! You can download them all to a local machine using the AWS CLI."
]
},
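{
"cell_type": "markdown",
"metadata": {},
"source": [
"The optional cell below is a minimal sketch for inspecting the data without leaving the notebook. It assumes cell 03 has already run, so that `input.manifest` exists locally and the `BUCKET` and `EXP_NAME` variables from cell 01 are set; the `./input_copy` directory name is arbitrary. It previews the first two manifest lines and then copies the uploaded `input` prefix to a local directory with the AWS CLI."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: each manifest line is an independent JSON object\n",
"# with a single \"source\" field containing the review text.\n",
"import json\n",
"\n",
"with open(\"input.manifest\") as f:\n",
"    for _ in range(2):\n",
"        record = json.loads(f.readline())\n",
"        print(record[\"source\"][:80] + \"...\")\n",
"\n",
"# Optional: download the uploaded input prefix for local inspection.\n",
"!aws s3 cp s3://{BUCKET}/{EXP_NAME}/input ./input_copy --recursive"
]
},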
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Specify the categories\n",
"To run the text classification labeling job, you need to decide on a set of classes the annotators can choose from. \n",
"In our case, this list is `[\"Positive\", \"Negative\"]`. In your own job you can choose any list of up to 3 classes. We recommend the classes to be as unambiguous and concrete as possible. The categories should be mutually exclusive, with only one correct label per text. \n",
"\n",
"To work with Ground Truth, this list needs to be converted to a .json file and uploaded to the S3 `BUCKET`.\n",
"\n",
"*Note: The ordering of the labels or classes in the template governs the class indices that you will see downstream in the output manifest (this numbering is zero-indexed). In other words, the class that appears second in the template will correspond to class \"1\" in the output. At the end of this demonstration, we will train a model and make predictions, and this class ordering is instrumental to interpreting the results.*"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# cell 04\n",
"# deine the classes\n",
"import json\n",
"json_body = {\"document-version\":\"2021-05-13\",\n",
" \"labels\":[{\"label\":\"Positive\"},\n",
" {\"label\":\"Negative\"}]\n",
" }\n",
"\n",
"with open(\"class_labels.json\", \"w\") as f:\n",
" json.dump(json_body, f)\n",
" \n",
"# upload the json file to s3\n",
"s3.upload_file(\"class_labels.json\", BUCKET, EXP_NAME + \"/input/class_labels.json\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should now see `class_labels.json` in `s3://BUCKET/EXP_NAME/input`."
]
},
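{
"cell_type": "markdown",
"metadata": {},
"source": [
"The optional cell below is a small sketch, not part of the original workflow; it assumes the `s3` client, `BUCKET`, and `EXP_NAME` from the earlier cells. It confirms the upload and recovers the zero-indexed class mapping described in the note above: the position of each label in `class_labels.json` determines its class index in the output manifest."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: confirm the files under the experiment's input prefix.\n",
"response = s3.list_objects_v2(Bucket=BUCKET, Prefix=EXP_NAME + \"/input/\")\n",
"for obj in response.get(\"Contents\", []):\n",
"    print(obj[\"Key\"])\n",
"\n",
"# Recover the class index -> class name mapping implied by the label order.\n",
"import json\n",
"\n",
"with open(\"class_labels.json\") as f:\n",
"    label_names = [entry[\"label\"] for entry in json.load(f)[\"labels\"]]\n",
"print(dict(enumerate(label_names)))  # expected: {0: 'Positive', 1: 'Negative'}"
]
},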
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create the instruction template\n",
"Part or all of your revviews will be annotated by human annotators. It is **essential** to provide good instructions that help the annotators give you the annotations you want. Good instructions are:\n",
"1. Concise. We recommend limiting verbal/textual instruction to two sentences, and focusing on clear visuals.\n",
"2. Visual. In the case of text classification, we recommend providing one example in each of the classes as part of the instruction.\n",
"\n",
"When used through the AWS Console, Ground Truth helps you create the instructions using a visual wizard. When using the API, you need to create an HTML template for your instructions. Below, we prepare a very simple but effective template and upload it to your S3 bucket.\n",
"\n",
"#### Testing your instructions\n",
"It is very easy to create broken instructions. This might cause your labeling job to fail. However, it might also cause your job to complete with meaningless results (when the annotators have no idea what to do, or the instructions are plain wrong). We *highly recommend* that you verify that your task is correct in two ways:\n",
"1. The following cell creates and uploads a file called `template.liquid` to S3. It also creates `instructions.html` that you can open in a local browser window. Please do so and inspect the resulting web page; it should correspond to what you want your annotators to see (except the actual image to annotate will not be visible).\n",
"2. Run your job in a private workforce, which is a way to run a mock labeling job. We describe how to do it in [Verify your task using a private team [OPTIONAL]](#Verify-your-task-using-a-private-team-[OPTIONAL])."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"
\n",
"
\n",
"
| \n", " | source | \n", "label | \n", "label-metadata | \n", "
|---|---|---|---|
| 0 | \n", "I rented I AM CURIOUS-YELLOW from my video sto... | \n", "1 | \n", "{'class-name': 'Negative', 'job-name': 'labeli... | \n", "
| 1 | \n", "\"I Am Curious: Yellow\" is a risible and preten... | \n", "0 | \n", "{'class-name': 'Positive', 'job-name': 'labeli... | \n", "
| 2 | \n", "If only to avoid making this type of film in t... | \n", "0 | \n", "{'class-name': 'Positive', 'job-name': 'labeli... | \n", "
| 3 | \n", "This film was probably inspired by Godard's Ma... | \n", "1 | \n", "{'class-name': 'Negative', 'job-name': 'labeli... | \n", "
| 4 | \n", "Oh, brother...after hearing about this ridicul... | \n", "0 | \n", "{'class-name': 'Positive', 'job-name': 'labeli... | \n", "