# Amazon SageMaker Ground Truth Demonstration for Named Entity Recognition

1. [Introduction](#introduction)
2. [Run a Ground Truth labeling job](#run-a-ground-truth-labeling-job)
    1. [Prepare the data](#prepare-the-data)
    2. [Specify the categories](#specify-the-categories)
    3. [Assign private team to test your task](#assign-private-team-to-test-your-task)
    4. [Define pre-built lambda functions for use in the labeling job](#define-pre-built-lambda-functions-for-use-in-the-labeling-job)
    5. [Submit the Ground Truth job request](#submit-the-ground-truth-job-request)
    6. [Monitor job progress](#monitor-job-progress)
    7. [View Task Results](#view-task-results)
3. [Clean Up](#clean-up)

## Introduction

This sample notebook takes you through an end-to-end workflow that demonstrates the functionality of SageMaker Ground Truth. We'll start with an unlabeled text data set and use SageMaker Ground Truth to acquire labels for entities such as Person, Organization, GeoLocation, and Date. Before you begin, we highly recommend you start a Ground Truth labeling job through the AWS Console first to familiarize yourself with the workflow. The AWS Console offers less flexibility than the API, but is simple to use.

#### Cost and runtime
You can run this demo in two modes:
1. Set `RUN_FULL_AL_DEMO = True` in the next cell to label 200 unstructured texts. This should cost about \$800 given the current [Ground Truth pricing scheme](https://aws.amazon.com/sagemaker/groundtruth/pricing/). The total end-to-end runtime should be about 2-4 days.
1. Set `RUN_FULL_AL_DEMO = False` in the next cell to label only 10 texts. This should cost about \$0.80. **This is the cheaper version of the demo. Some of the analysis plots might look awkward, but you should still be able to see good results on the 10 human-annotated texts.**

#### Prerequisites
To run this notebook, you can simply execute each cell one-by-one. To understand what's happening, you'll need:
* An S3 bucket you can write to -- please provide its name in the following cell. The bucket must be in the same region as this SageMaker Notebook instance. You can also change the `EXP_NAME` to any valid S3 prefix. All the files related to this experiment will be stored in that prefix of your bucket.
* Familiarity with Python and [numpy](http://www.numpy.org/).
* Basic familiarity with [AWS S3](https://docs.aws.amazon.com/s3/index.html).
* Basic understanding of [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
* Basic familiarity with [AWS Command Line Interface (CLI)](https://aws.amazon.com/cli/) -- set it up with credentials to access the AWS account you're running this notebook from. This should work out-of-the-box on SageMaker Jupyter Notebook instances.

This notebook has only been tested on a SageMaker notebook instance. The runtimes given are approximate; we used an `ml.m4.xlarge` instance in our tests. However, you can likely run it on a local instance by first executing the cell below on SageMaker, and then copying the `role` string to your local copy of the notebook.

NOTE: This notebook will create/remove subdirectories in its working directory. We recommend placing this notebook in its own directory before running it.

In [49]:
# cell 01
%load_ext autoreload
%autoreload 2
import os
import json
import time
from datetime import datetime
import pandas as pd
import boto3
import sagemaker
from urllib.parse import urlparse

sess = sagemaker.Session()
BUCKET = sess.default_bucket()

EXP_NAME = "label-text/ner-annotation"  # Any valid S3 prefix.
RUN_FULL_AL_DEMO = False  # See 'Cost and Runtime' in the Markdown cell above!
VERIFY_USING_PRIVATE_WORKFORCE = True # private team leveraged for labelling job

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [50]:
# cell 02
# Make sure the bucket is in the same region as this notebook.
role = sagemaker.get_execution_role()
region = boto3.session.Session().region_name

s3 = boto3.client("s3")
bucket_region = s3.head_bucket(Bucket=BUCKET)["ResponseMetadata"]["HTTPHeaders"][
    "x-amz-bucket-region"
]
assert (
    bucket_region == region
), "Your S3 bucket {} and this notebook need to be in the same region.".format(BUCKET)

## Run a Ground Truth labeling job
**This section should take about 30min to complete.**

We will first run a labeling job. This involves several steps: collecting the text we want labeled for NER, specifying the possible label categories, creating instructions, and writing a labeling job specification. In addition, we highly recommend running a (free) mock job using a private workforce before you submit any job to the public workforce. This notebook will explain how to do that as an optional step. With the public workforce, this section should take about 3 hours until your labeling job completes. However, this may vary depending on the availability of the public annotation workforce.

## Prepare the data
We will first download the text and labels of a subset of the [CoNLL-2003](https://huggingface.co/datasets/conll2003) dataset. Our dataset will include entities in the following categories:

* Person
* Organization
* GeoLocation
* Date


If you chose `RUN_FULL_AL_DEMO = False`, then we will choose a subset of 10 unstructured texts from this dataset. This is a diverse dataset of different entities, and should be fun for the human annotators to work with. You are free to ask the annotators to annotate any text you wish, as long as the text does not contain inappropriate content. If it does, you must adjust the labeling job request accordingly; please check the Ground Truth documentation.

We will write these texts to an *input manifest* and upload it to our S3 `BUCKET`. The input manifest is a JSON Lines file in which each line holds one text we want Ground Truth to annotate, stored under the `source` key.
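As a minimal sketch of the manifest format, assuming each text is stored inline under the `source` key (as the next cell does with pandas), the same lines could be produced with only the standard library. The sample sentences and the `build_manifest_lines` helper are hypothetical and are not part of this notebook.

```python
import json


def build_manifest_lines(texts):
    """Return one JSON object per text, each with a single `source` key."""
    return [json.dumps({"source": text}) for text in texts]


# Hypothetical sample sentences, for illustration only.
sample_texts = ["Alice visited Paris in May .", "Acme Corp hired Bob ."]
manifest_lines = build_manifest_lines(sample_texts)

# A manifest is a JSON Lines file: one JSON object per line.
manifest_body = "\n".join(manifest_lines) + "\n"
print(manifest_body)
```

Each printed line is an independent JSON object, which is why the file can be read back one line at a time.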

#### Disclosure regarding the Open NER text Dataset:
CoNLL 2003 Dataset V4 was created by Wang et al. The CoNLL-2003 shared task addresses language-independent named entity recognition. The paper cited below gives background information on the data sets (English and German) and the evaluation method, presents a general overview of the systems that took part in the task, and discusses their performance. It describes the CoNLL-2003 NER dataset in depth: from the data collection and annotation to detailed statistics about the data and evaluation of models trained on it.

Erik F. Tjong Kim Sang, Fien De Meulder
*Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition.* arXiv:cs/0306050 [cs.CL], 2003. ([link](https://arxiv.org/abs/cs/0306050v1))

In [60]:
# cell 03
# Download the data if it does not already exist
if not os.path.isfile('ner_dataset.json'):
    !curl -X GET \
     "https://datasets-server.huggingface.co/rows?dataset=conll2003&config=conll2003&split=train&offset=0&limit=100" >> ner_dataset.json

text_list = []
# Read the downloaded JSON file and collect the token list of each row
with open('ner_dataset.json', "r") as f:
    data = json.load(f)

    for i in data['rows']:
        text_list.append(i['row']['tokens'])

# Join each token list into a single text string
text_list = [" ".join(sub) for sub in text_list]

read_file = pd.DataFrame({'source': text_list})

# If running the short version of the demo, keep only the first 10 texts.
if not RUN_FULL_AL_DEMO:
    text_data = read_file[:10]
else:
    text_data = read_file

# Create a manifest (JSON Lines) file
text_data.to_json(r'./input.manifest', orient='records', lines=True)

# Upload the manifest to S3
s3 = boto3.client("s3")
manifest_name = "input/input.manifest"
s3.upload_file(r'./input.manifest', BUCKET, EXP_NAME + "/" + manifest_name)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 37263  100 37263    0     0   155k      0 --:--:-- --:--:-- --:--:--  156k


After running the cell above, you should be able to go to `s3://BUCKET/EXP_NAME/input` in the [S3 console](https://console.aws.amazon.com/s3/) and see the input manifest. We recommend you inspect its contents! You can download it to a local machine using the AWS CLI.

## Specify the categories
To run the text Named Entity Recognition labeling job, you need to decide on a set of classes the annotators can choose from. 
In our case, this list is `["Organization", "Person", "GeoLocation", "Date"]`. In your own job you can choose any list of up to 10 classes. We recommend making the classes as unambiguous and concrete as possible. The categories should be mutually exclusive; a single text can contain multiple entities.

To work with Ground Truth, this list needs to be converted to a .json file and uploaded to the S3 `BUCKET`.

*Note: The ordering of the labels or classes in the template governs the class indices that you will see downstream in the output manifest (this numbering is zero-indexed). In other words, the class that appears second in the template will correspond to class "1" in the output. This class ordering is instrumental to interpreting the results.*
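To make the ordering rule concrete, here is a small sketch that derives the zero-indexed position of each label from the list order. It assumes the label order used in the next cell; the `label_to_index` and `label_config` names are introduced here for illustration only.

```python
# Label order as it appears in the label category file defined in the next cell.
labels = ["Organization", "Person", "GeoLocation", "Date"]

# Zero-indexed position of each label, following the ordering rule in the note.
label_to_index = {label: index for index, label in enumerate(labels)}

# Minimal label category structure using the same top-level keys as the next cell.
label_config = {
    "document-version": "2018-11-28",
    "labels": [{"label": label} for label in labels],
}
print(label_to_index)
```

Keeping a mapping like this next to the job configuration is one way to avoid misreading indices later; reordering `labels` changes every index after the moved entry.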

In [53]:
# cell 04
# define the classes
json_body = {
  "document-version": "2018-11-28",
  "labels": [
    {
      "label": "Organization",
      "shortDisplayName": "Org"
    },
    {
      "label": "Person",
      "shortDisplayName": "Per"
    },
    {
      "label": "GeoLocation",
      "shortDisplayName": "Gpe"
    },
    {
      "label": "Date",
      "shortDisplayName": "Dat"
    }
  ],
  "instructions": {
    "shortInstruction": "<p>Enter description of the labels that workers have to choose from</p><br><p>Examples to help workers understand the label</p>"
      "<p>PER - People, including fictional.</p>"
      "<p>ORG - Companies, agencies, institutions.</p>"
      "<p>GPE - Countries, cities, states.</p>"
      "<p>DAT - Absolute or relative dates or periods.</p>",
      
    "fullInstruction": "<ol><li><strong>Read</strong> the text carefully.</li><li><strong>Highlight</strong> words, phrases, or sections of the text.</li><li><strong>Choose</strong> the label that best matches what you have highlighted.</li><li>To <strong>change</strong> a label, choose highlighted text and select a new label.</li><li>To <strong>remove</strong> a label from highlighted text, choose the X next to the abbreviated label name on the highlighted text.</li><li>You can select all of a previously highlighted text, but not a portion of it.</li></ol>"
      }
    }

# upload the json to s3
with open("class_labels.json", "w") as f:
    json.dump(json_body, f)

s3.upload_file("class_labels.json", BUCKET, EXP_NAME + "/input/class_labels.json")

You should now see `class_labels.json` in `s3://BUCKET/EXP_NAME/input`.

## Assign private team to test your task

Refer to Prerequisites to set up a private workforce team.

Copy your private workteam ARN from Amazon SageMaker console > Ground Truth > Labeling workforces > Private Teams, and assign it to `private_workteam_arn` in the next cell.

The [SageMaker Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-management-private.html) has more details on the management of private workteams. 
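As an alternative to copying the ARN from the console, the following sketch looks it up by team name with the boto3 SageMaker client's `list_workteams` call. The `find_workteam_arn` helper and the `"my-private-team"` name are hypothetical, and the sketch only inspects the first page of results.

```python
def find_workteam_arn(sagemaker_client, team_name):
    """Return the ARN of the workteam called `team_name`, or None if not found.

    Only the first page of `list_workteams` results is checked; accounts with
    many workteams would need to follow `NextToken` as well.
    """
    response = sagemaker_client.list_workteams()
    for team in response.get("Workteams", []):
        if team["WorkteamName"] == team_name:
            return team["WorkteamArn"]
    return None


# Usage (requires AWS credentials); "my-private-team" is a placeholder name:
# import boto3
# print(find_workteam_arn(boto3.client("sagemaker"), "my-private-team"))
```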

In [54]:
# cell 05
# Define the workteam ARN for the labeling job
if VERIFY_USING_PRIVATE_WORKFORCE:
    private_workteam_arn = "<< your private workteam ARN here >>"
    WORKTEAM_ARN = private_workteam_arn
else:
    workteam_arn = "arn:aws:sagemaker:{}:394669845002:workteam/public-crowd/default".format(region)
    WORKTEAM_ARN = workteam_arn

print("WORKTEAM_ARN : {}".format(WORKTEAM_ARN))

WORKTEAM_ARN : arn:aws:sagemaker:us-east-1:370501389570:workteam/private-crowd/smgt-immersionday


## Define pre-built lambda functions for use in the labeling job 

Before we submit the request, we need to define the ARNs for four key components of the labeling job: 1) the workteam, 2) the annotation consolidation Lambda function, 3) the pre-labeling task Lambda function, and 4) the human task UI. These ARNs are strings that contain the region name and an AWS service account number, so we define a mapping below that lets you run this notebook in any of the regions it lists.

See the official documentation for the available ARNs:
* [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-management-public.html) for a discussion of the workteam ARN definition. There is only one valid selection if you choose to use the public workforce; if you elect to use a private workteam, you should check the corresponding ARN for the workteam.
* [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/API_HumanTaskConfig.html#SageMaker-Type-HumanTaskConfig-PreHumanTaskLambdaArn) for available pre-human ARNs for other workflows.
* [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/API_AnnotationConsolidationConfig.html#SageMaker-Type-AnnotationConsolidationConfig-AnnotationConsolidationLambdaArn) for available annotation consolidation ARNs for other workflows.

In [55]:
# cell 06
# Specify ARNs for resources needed to run a text Named Entity Recognition.
ac_arn_map = {
    "us-west-2": "081040173940",
    "us-east-1": "432418664414",
    "us-east-2": "266458841044",
    "eu-west-1": "568282634449",
    "ap-northeast-1": "477331159723",
}

# PreHumanTaskLambdaArn for text NER
prehuman_arn = "arn:aws:lambda:{}:{}:function:PRE-NamedEntityRecognition".format(
    region, ac_arn_map[region]
)

# AnnotationConsolidationConfig for text NER
acs_arn = "arn:aws:lambda:{}:{}:function:ACS-NamedEntityRecognition".format(region, ac_arn_map[region])

#Built-in instruction template for text NER
ui_arn = "arn:aws:sagemaker:{}:394669845002:human-task-ui/NamedEntityRecognition".format(region)
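The mapping above covers only a handful of regions, so the cell fails with a bare `KeyError` anywhere else. A minimal sketch of a more explicit check follows; the `build_ner_lambda_arns` helper is hypothetical, and the two account numbers are copied from the mapping above for illustration.

```python
def build_ner_lambda_arns(region, account_by_region):
    """Build the pre-human and consolidation Lambda ARNs for text NER."""
    if region not in account_by_region:
        raise ValueError(
            "Region {} is not in the ARN mapping; known regions: {}".format(
                region, sorted(account_by_region)
            )
        )
    account = account_by_region[region]
    prefix = "arn:aws:lambda:{}:{}:function:".format(region, account)
    return {
        "prehuman": prefix + "PRE-NamedEntityRecognition",
        "consolidation": prefix + "ACS-NamedEntityRecognition",
    }


# Two entries copied from the mapping above, for illustration.
accounts = {"us-east-1": "432418664414", "us-west-2": "081040173940"}
arns = build_ner_lambda_arns("us-east-1", accounts)
print(arns["prehuman"])
```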

## Submit the Ground Truth job request
The API starts a Ground Truth job by submitting a request. The request contains the 
full configuration of the annotation task, and allows you to modify the fine details of
the job that are fixed to default values when you use the AWS Console. The parameters that make up the request are described in more detail in the [SageMaker Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateLabelingJob.html).

After you submit the request, you should be able to see the job in your AWS Console, at `Amazon SageMaker > Labeling Jobs`.
You can track the progress of the job there. This job will take several hours to complete. 

### Verify your task using a private team [OPTIONAL]
If you chose to follow the steps in [Assign private team to test your task](#assign-private-team-to-test-your-task), then you can first verify that your task runs as expected. To do this:
1. Set `VERIFY_USING_PRIVATE_WORKFORCE` to `True` in cell 01.
2. Run the next two cells. This will define the task and submit it to the private workforce (to you).
3. After a few minutes, you should be able to see your task in your private workforce interface (see [Assign private team to test your task](#assign-private-team-to-test-your-task)).
Please verify that the task appears as you want it to appear.
4. If everything is in order, change `VERIFY_USING_PRIVATE_WORKFORCE` to `False`, rerun cell 05 to select the public workteam, and rerun the cell below to start the real annotation task!

In [56]:
# cell 07
# task description
task_description = 'Apply the labels provided to specific words or phrases within the larger text block.'
task_keywords = ["text", "ner", "humans"]
task_title = 'Named Entity Recognition'
job_name = "ground-truth-text-ner-" + str(int(time.time()))
output_manifest = "output_manifest"

human_task_config = {
    "AnnotationConsolidationConfig": {
        "AnnotationConsolidationLambdaArn": acs_arn,
    },
    "PreHumanTaskLambdaArn": prehuman_arn,
    "MaxConcurrentTaskCount": 100,  # Up to 100 texts will be sent to the workteam at a time.
    "NumberOfHumanWorkersPerDataObject": 1,  # Number of workers required to label each text.
    "TaskAvailabilityLifetimeInSeconds": 21600,  # Your workteam has 6 hours to complete all pending tasks.
    "TaskDescription": task_description,
    "TaskKeywords": task_keywords,
    "TaskTimeLimitInSeconds": 300,  # Each text must be labeled within 5 minutes.
    "TaskTitle": task_title,
    "UiConfig": {
        'HumanTaskUiArn': ui_arn,
    },
}

if not VERIFY_USING_PRIVATE_WORKFORCE:
    human_task_config["PublicWorkforceTaskPrice"] = {
        "AmountInUsd": {
            "Dollars": 0,
            "Cents": 1,
            "TenthFractionsOfACent": 2,
        }
    }
    human_task_config["WorkteamArn"] = workteam_arn
else:
    human_task_config["WorkteamArn"] = private_workteam_arn

ground_truth_request = {
    "InputConfig": {
        "DataSource": {
            "S3DataSource": {
                "ManifestS3Uri": "s3://{}/{}/{}".format(BUCKET, EXP_NAME, manifest_name),
            }
        },
        "DataAttributes": {
            "ContentClassifiers": ["FreeOfPersonallyIdentifiableInformation", "FreeOfAdultContent"]
        },
    },
    "OutputConfig": {
        "S3OutputPath": "s3://{}/{}/output/".format(BUCKET, EXP_NAME),
    },
    "HumanTaskConfig": human_task_config,
    "LabelingJobName": job_name,
    "RoleArn": role,
    "LabelAttributeName": "category",
    "LabelCategoryConfigS3Uri": "s3://{}/{}/input/class_labels.json".format(BUCKET, EXP_NAME),
    "Tags":[
        {
            'Key': 'text',
            'Value': 'NER'
        },
    ]

}

sagemaker_client = boto3.client("sagemaker")
sagemaker_client.create_labeling_job(**ground_truth_request)

{'LabelingJobArn': 'arn:aws:sagemaker:us-east-1:370501389570:labeling-job/ground-truth-text-ner-1686344604',
 'ResponseMetadata': {'RequestId': '89f51d94-780d-4764-bb5f-99863ca4b1dd',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '89f51d94-780d-4764-bb5f-99863ca4b1dd',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '107',
   'date': 'Fri, 09 Jun 2023 21:03:23 GMT'},
  'RetryAttempts': 0}}

In [57]:
#sagemaker_client.describe_labeling_job(LabelingJobName=job_name)

In [58]:
# cell 08
workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker_client.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])

Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!
https://so3zqudq9a.labeling.us-east-1.sagemaker.aws


## Monitor job progress
A Ground Truth job can take a few hours to complete (if your dataset is larger than 100 texts, it can take much longer than that!). One way to monitor the job's progress is through the AWS Console. In this notebook, we will instead poll the job status through the API.

You can re-evaluate the next cell repeatedly. It sends a `describe_labeling_job` request, which tells you whether the job is completed or not. If it is, then `LabelingJobStatus` will be `Completed`.

In [2]:
# cell 09
# Re-evaluate repeatedly. Each iteration sends a `describe_labeling_job` request and prints the job status.
# The loop exits once 'LabelingJobStatus' is no longer 'InProgress' (for example, 'Completed').
job_status = sagemaker_client.describe_labeling_job(LabelingJobName=job_name)['LabelingJobStatus']
while job_status == 'InProgress':
    print('Labelling job : {}, status : {}'.format(job_name, job_status))
    time.sleep(30)
    job_status = sagemaker_client.describe_labeling_job(LabelingJobName=job_name)['LabelingJobStatus']
print('Labelling job : {}, status : {}'.format(job_name, job_status))
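The loop above waits indefinitely while the job is in progress. As a hedged variant, the sketch below adds a timeout, treats `Completed`, `Failed`, and `Stopped` as terminal states, and prints the `LabelCounters` returned by `describe_labeling_job`. The `wait_for_labeling_job` helper, its default values, and the injectable `sleep` argument are assumptions made for illustration.

```python
import time

# Statuses after which the job will not change any further.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_labeling_job(client, job_name, poll_seconds=30,
                          timeout_seconds=6 * 3600, sleep=time.sleep):
    """Poll `describe_labeling_job` until a terminal status or a timeout."""
    waited = 0
    while True:
        response = client.describe_labeling_job(LabelingJobName=job_name)
        status = response["LabelingJobStatus"]
        counters = response.get("LabelCounters", {})
        print("status: {}, labeled: {}, unlabeled: {}".format(
            status, counters.get("TotalLabeled"), counters.get("Unlabeled")))
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("Job {} still {} after {}s".format(job_name, status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Usage in this notebook would be:
# wait_for_labeling_job(sagemaker_client, job_name)
```

Passing `sleep` as an argument keeps the helper easy to exercise without real waiting.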

### Wait For Workers to Complete Task.....

## View Task Results  

Once work is completed, Amazon SageMaker Ground Truth stores the results in your S3 bucket and sends a CloudWatch event. Your results should be available at the S3 output path (`S3_OUTPUT` in the next cell) when all work is completed.

In [None]:
# cell 10
# output path
S3_OUTPUT = boto3.client('sagemaker').describe_labeling_job(LabelingJobName=job_name)['OutputConfig']['S3OutputPath'] + job_name
print('S3 OUTPUT_PATH : {}'.format(S3_OUTPUT))

# Download human annotation data.
!aws s3 cp {S3_OUTPUT + '/manifests/output/output.manifest'} "./output/" #--recursive --quiet

In [None]:
# cell 11
# Parse the output manifest (JSON Lines): one record per labeled text.
with open('./output/output.manifest') as f:
    data = [json.loads(line) for line in f]

# Flatten the records into one row per annotated entity.
rows = []
for item in data:
    text = item['source']
    for entity in item['category']['annotations']['entities']:
        rows.append({
            'source': text,
            'entities': entity['label'],
            'startOffset': entity['startOffset'],
            'endOffset': entity['endOffset'],
        })

pd.DataFrame(rows, columns=['source', 'entities', 'startOffset', 'endOffset'])
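The offsets in each entity can be used to recover the annotated text itself. The sketch below works on a hypothetical record shaped like the ones parsed above, and it assumes `endOffset` is exclusive so that a plain Python slice returns the span; verify this against your own output manifest before relying on it.

```python
import json

# Hypothetical output-manifest line, shaped like the records parsed above.
sample_line = json.dumps({
    "source": "Alice works at Acme in Paris",
    "category": {"annotations": {"entities": [
        {"label": "Person", "startOffset": 0, "endOffset": 5},
        {"label": "Organization", "startOffset": 15, "endOffset": 19},
        {"label": "GeoLocation", "startOffset": 23, "endOffset": 28},
    ]}},
})


def entity_spans(record, label_attribute="category"):
    """Return (label, text) pairs for every entity in one manifest record."""
    text = record["source"]
    entities = record[label_attribute]["annotations"]["entities"]
    # Assumption: endOffset is exclusive, so slicing recovers the span directly.
    return [(e["label"], text[e["startOffset"]:e["endOffset"]]) for e in entities]


spans = entity_spans(json.loads(sample_line))
print(spans)
```

The `label_attribute` default matches the `LabelAttributeName` value (`"category"`) used in the job request.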

## Clean Up [OPTIONAL]

Finally, let's clean up by stopping the labeling job if it is still in progress.

In [None]:
# cell 12
if sagemaker_client.describe_labeling_job(LabelingJobName=job_name)['LabelingJobStatus'] == 'InProgress':
    sagemaker_client.stop_labeling_job(LabelingJobName=job_name)

## The End!