{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Amazon SageMaker Ground Truth Demonstration for Named Entity Recognition\n", "\n", "1. [Introduction](#introduction)\n", "2. [Run a Ground Truth labeling job](#run-a-ground-truth-labeling-job)\n", " 1. [Prepare the data](#prepare-the-data)\n", " 2. [Specify the categories](#specify-the-categories)\n", " 3. [Assign private team to test your task](#assign-private-team-to-test-your-task)\n", " 4. [Define pre-built lambda functions for use in the labeling job](#define-pre-built-lambda-functions-for-use-in-the-labeling-job)\n", " 5. [Submit the Ground Truth job request](#submit-the-ground-truth-job-request)\n", " 6. [Monitor job progress](#monitor-job-progress)\n", " 7. [View Task Results](#view-task-results)\n", "3. [Clean Up](#clean-up)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "This sample notebook takes you through an end-to-end workflow to demonstrate the functionality of SageMaker Ground Truth. We'll start with an unlabeled text data set, acquire labels for all the entities like Person, Organization, GeoLocation and Date using SageMaker Ground Truth. Before you begin, we highly recommend you start a Ground Truth labeling job through the AWS Console first to familiarize yourself with the workflow. The AWS Console offers less flexibility than the API, but is simple to use.\n", "\n", "#### Cost and runtime\n", "You can run this demo in two modes:\n", "1. Set `RUN_FULL_AL_DEMO = True` in the next cell to label 200 unstructured text. This should cost about \\$800 given current [Ground Truth pricing scheme](https://aws.amazon.com/sagemaker/groundtruth/pricing/).The total end-to-end runtime should be about 2-4 days.\n", "1. Set `RUN_FULL_AL_DEMO = False` in the next cell to label only 10 text. This should cost about \\$0.80. **this is cheaper version of the demo. Some of the analysis plots might look awkward, but you should still be able to see good results on the human-annotated 10 text sentences.**\n", "\n", "#### Prerequisites\n", "To run this notebook, you can simply execute each cell one-by-one. To understand what's happening, you'll need:\n", "* An S3 bucket you can write to -- please provide its name in the following cell. The bucket must be in the same region as this SageMaker Notebook instance. You can also change the `EXP_NAME` to any valid S3 prefix. All the files related to this experiment will be stored in that prefix of your bucket.\n", "* Familiarity with Python and [numpy](http://www.numpy.org/).\n", "* Basic familiarity with [AWS S3](https://docs.aws.amazon.com/s3/index.html),\n", "* Basic understanding of [AWS Sagemaker](https://aws.amazon.com/sagemaker/),\n", "* Basic familiarity with [AWS Command Line Interface (CLI)](https://aws.amazon.com/cli/) -- set it up with credentials to access the AWS account you're running this notebook from. This should work out-of-the-box on SageMaker Jupyter Notebook instances.\n", "\n", "This notebook is only tested on a SageMaker notebook instance. The runtimes given are approximate, we used an `ml.m4.xlarge` instance in our tests. However, you can likely run it on a local instance by first executing the cell below on SageMaker, and then copying the `role` string to your local copy of the notebook.\n", "\n", "NOTE: This notebook will create/remove subdirectories in its working directory. We recommend to place this notebook in its own directory before running it." 
] }, { "cell_type": "code", "execution_count": 49, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The autoreload extension is already loaded. To reload it, use:\n", " %reload_ext autoreload\n" ] } ], "source": [ "# cell 01\n", "%load_ext autoreload\n", "%autoreload 2\n", "import os\n", "import json\n", "import time\n", "from datetime import datetime\n", "import pandas as pd\n", "import boto3\n", "import sagemaker\n", "from urllib.parse import urlparse\n", "\n", "sess = sagemaker.Session()\n", "BUCKET = sess.default_bucket()\n", "\n", "EXP_NAME = \"label-text/ner-annotation\" # Any valid S3 prefix.\n", "RUN_FULL_AL_DEMO = False # See 'Cost and Runtime' in the Markdown cell above!\n", "VERIFY_USING_PRIVATE_WORKFORCE = True # private team leveraged for labelling job" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "tags": [] }, "outputs": [], "source": [ "# cell 02\n", "# Make sure the bucket is in the same region as this notebook.\n", "role = sagemaker.get_execution_role()\n", "region = boto3.session.Session().region_name\n", "\n", "s3 = boto3.client(\"s3\")\n", "bucket_region = s3.head_bucket(Bucket=BUCKET)[\"ResponseMetadata\"][\"HTTPHeaders\"][\n", " \"x-amz-bucket-region\"\n", "]\n", "assert (\n", " bucket_region == region\n", "), \"You S3 bucket {} and this notebook need to be in the same region.\".format(BUCKET)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Run a Ground Truth labeling job\n", "**This section should take about 30min to complete.**\n", "\n", "We will first run a labeling job. This involves several steps: collecting the text for NER we want labeled, specifying the possible label categories, creating instructions, and writing a labeling job specification. In addition, we highly recommend to run a (free) mock job using a private workforce before you submit any job to the public workforce. This notebook will explain how to do that as an optional step. Without using a private workforce, this section until completion of your labeling job should take about 3h. However, this may vary depending on the availability of the public annotation workforce.\n", "\n", "## Prepare the data\n", "We will first download text and labels of a subset of the [CoNLL-2003](https://huggingface.co/datasets/conll2003). Our dataset will include entities in the following categories:\n", "\n", "* Person\n", "* Organization\n", "* GeoLocation\n", "* Date\n", "\n", "\n", "If you chose `RUN_FULL_AL_DEMO = False`, then we will choose a subset of 10 unstructured texts in this dataset. This is a diverse dataset of different entities, and should be fun for the human annotators to work with. You are free to ask the annotators to annotate any text you wish (as long as the text do not contain inappropriate content; in which case, you must adjust the labeling job request this job produces, please check the Ground Truth documentation).\n", "\n", "We will copy these text to our local `BUCKET`, and will create the corresponding *input manifest*. The input manifest is a formatted list of the S3 locations of the text we want Ground Truth to annotate. We will upload this manifest to our S3 `BUCKET`.\n", "\n", "#### Disclosure regarding the Open NER text Dataset:\n", "CoNLL 2003 Dataset V4 is created by Wang et al. We describe the CoNLL-2003 shared task: language-independent named entity recognition. 
\n", "\n", "#### Disclosure regarding the CoNLL-2003 dataset:\n", "The CoNLL-2003 dataset was created by Erik F. Tjong Kim Sang and Fien De Meulder for the CoNLL-2003 shared task on language-independent named entity recognition. The paper below describes the dataset in depth, from the data collection and annotation to detailed statistics about the data and the evaluation of models trained on it. It gives background information on the data sets (English and German) and the evaluation method, presents a general overview of the systems that took part in the task, and discusses their performance.\n", "\n", "Erik F. Tjong Kim Sang, Fien De Meulder\n", "*Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition.* arXiv:cs/0306050 [cs.CL], 2003. ([link to PDF](https://arxiv.org/abs/cs/0306050v1))" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 37263 100 37263 0 0 155k 0 --:--:-- --:--:-- --:--:-- 156k\n" ] } ], "source": [ "# cell 03\n", "# Download the data if it does not already exist\n", "if not os.path.isfile('ner_dataset.json'):\n", " !curl -X GET \\\n", " \"https://datasets-server.huggingface.co/rows?dataset=conll2003&config=conll2003&split=train&offset=0&limit=100\" >> ner_dataset.json\n", " \n", "read_file = pd.DataFrame(columns=['source'])\n", "text_list = []\n", "# Read and store the content of the JSON file\n", "with open('ner_dataset.json', \"r\") as f:\n", " # Reading from file\n", " data = json.loads(f.read())\n", "\n", " # Iterating through the json list\n", " for i in data['rows']:\n", " text_list.append(i['row']['tokens'])\n", "# Convert the token lists into text strings\n", "text_list = [\" \".join(sub) for sub in text_list]\n", "\n", "read_file['source'] = text_list\n", "\n", "# If running the short version of the demo, keep only the first 10 texts.\n", "if RUN_FULL_AL_DEMO is False:\n", " text_data = read_file[:10]\n", "else:\n", " text_data = read_file\n", " \n", "# Create a manifest (JSON Lines) file\n", "text_data.to_json(r'./input.manifest', orient='records', lines=True)\n", "\n", "# Upload the manifest to S3\n", "s3 = boto3.client(\"s3\")\n", "manifest_name = \"input/input.manifest\"\n", "s3.upload_file(r'./input.manifest', BUCKET, EXP_NAME + \"/\" + manifest_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After running the cell above, you should be able to go to `s3://BUCKET/EXP_NAME/input` in the [S3 console](https://console.aws.amazon.com/s3/) and see the `input.manifest` file. We recommend you inspect its contents! You can download it to a local machine using the AWS CLI." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Specify the categories\n", "To run the text Named Entity Recognition labeling job, you need to decide on a set of classes the annotators can choose from. \n", "In our case, this list is `[\"Person\", \"Organization\", \"GeoLocation\", \"Date\"]`. In your own job you can choose any list of up to 10 classes. We recommend that the classes be as unambiguous and concrete as possible. The categories should be mutually exclusive, although a single text may contain entities from several categories. \n", "\n", "To work with Ground Truth, this list needs to be converted to a .json file and uploaded to the S3 `BUCKET`.\n", "\n", "*Note: The ordering of the labels or classes in the template governs the class indices that you will see downstream in the output manifest (this numbering is zero-indexed). In other words, the class that appears second in the template will correspond to class \"1\" in the output. If you later train a model on the labels this job produces, this class ordering is instrumental to interpreting the results.*
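\n", "\n", "To make the note above concrete, here is a small, purely illustrative sketch of the mapping implied by the label ordering used in the next cell (it is not part of the labeling job itself):\n", "\n", "```python\n", "# Purely illustrative: the position of each label in the category file\n", "# determines its zero-based class index downstream.\n", "labels = [\"Organization\", \"Person\", \"GeoLocation\", \"Date\"]  # same order as in the next cell\n", "label_to_index = {label: index for index, label in enumerate(labels)}\n", "print(label_to_index)\n", "# {'Organization': 0, 'Person': 1, 'GeoLocation': 2, 'Date': 3}\n", "```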
" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "tags": [] }, "outputs": [], "source": [ "# cell 04\n", "# Define the label categories\n", "json_body = {\n", " \"document-version\": \"2018-11-28\",\n", " \"labels\": [\n", " {\n", " \"label\": \"Organization\",\n", " \"shortDisplayName\": \"Org\"\n", " },\n", " {\n", " \"label\": \"Person\",\n", " \"shortDisplayName\": \"Per\"\n", " },\n", " {\n", " \"label\": \"GeoLocation\",\n", " \"shortDisplayName\": \"Gpe\"\n", " },\n", " {\n", " \"label\": \"Date\",\n", " \"shortDisplayName\": \"Dat\"\n", " }\n", " ],\n", " \"instructions\": {\n", " \"shortInstruction\": \"
<p>Enter description of the labels that workers have to choose from</p><p>Examples to help workers understand the label</p>\"\n", " \"<p>PER - People, including fictional.</p>\"\n", " \"<p>ORG - Companies, agencies, institutions.</p>\"\n", " \"<p>GPE - Countries, cities, states.</p>\"\n", " \"<p>DAT - Absolute or relative dates or periods.</p>\",\n", " \n", " \"fullInstruction\": \"