# Amazon Comprehend Custom Entity Recognizer

This notebook is a template for the end-to-end process of taking a text dataset, training an [Amazon Comprehend Custom Entity Recognizer](https://docs.aws.amazon.com/comprehend/latest/dg/custom-entity-recognition.html) on it, and using natural language processing (NLP) to detect custom entities in your text.

## Overview

1. [Introduction to Amazon Comprehend Custom NER](#Introduction)
1. [Obtaining Your Data](#data)
1. [Pre-processing data](#preprocess)
1. [Training a custom recognizer](#train)
1. [Real time inference](#inference)
1. [Cleanup](#cleanup)

## Introduction to Amazon Comprehend Custom Entity Recognition 

Amazon Comprehend detects nine entity types out of the box, such as person, date, and place. Custom entity recognition extends this capability by letting you identify new entity types that are not among the preset generic types. This notebook trains Amazon Comprehend to detect three additional entity types: Robot Ethics, Positronic Brain, and Kinematics.

Building a custom entity recognizer helps you identify key words and phrases that are relevant to your business needs. Amazon Comprehend reduces the complexity by providing automatic annotation and model training to create a custom entity model. For more information, see [Comprehend Custom Entity Recognition](https://docs.aws.amazon.com/comprehend/latest/dg/custom-entity-recognition.html).

## Obtaining Your Data 

To train a custom entity recognizer, Amazon Comprehend needs training data in one of two formats:
1. **Entity Lists (plain text only)**
You specify a list of documents that contain your entities, along with a list of the specific entities to search for in those documents. This is preferred when you have a finite list of entities to work with (for example, the EasyTron model names).
2. **Annotations**
This format is more comprehensive: it provides the location of your entities in a large number of documents using entity offsets. This lets Comprehend train on both the entity and its context.
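
To make the difference between the two formats concrete, here is a minimal sketch that derives offset-based annotation rows from a small entity list. The sample sentences, the phrase-to-type mapping, and the helper names `build_annotations` and `to_csv` are hypothetical and exist only for illustration. The column header follows the Comprehend CSV annotation layout; treating the end offset as exclusive (as in a Python slice) is an assumption that you should verify against the Comprehend documentation.

```python
import csv
import io


def build_annotations(documents, entity_phrases, file_name="train.csv"):
    """Return (File, Line, Begin Offset, End Offset, Type) rows.

    documents: list of strings, one document per line of the training file.
    entity_phrases: dict mapping a phrase to its entity type (hypothetical data).
    """
    rows = []
    for line_no, text in enumerate(documents):
        lowered = text.lower()  # case-insensitive match; fine for ASCII text
        for phrase, entity_type in entity_phrases.items():
            needle = phrase.lower()
            start = lowered.find(needle)
            while start != -1:
                # Assumption: the end offset is exclusive, like a Python slice.
                rows.append((file_name, line_no, start, start + len(needle), entity_type))
                start = lowered.find(needle, start + 1)
    return rows


def to_csv(rows):
    """Serialize annotation rows with the annotation header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
    writer.writerows(rows)
    return buf.getvalue()


docs = ["The positronic brain obeys robot ethics.", "Kinematics describes motion."]
phrases = {"positronic brain": "BRAIN", "robot ethics": "ETHICS", "kinematics": "MOVEMENT"}
annotation_rows = build_annotations(docs, phrases)
print(to_csv(annotation_rows))
```

Simple string matching like this only covers the entity-list case. Annotations produced by human labelers can also capture phrases that no fixed list would contain.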

For this use case, we generate custom annotations with [Amazon SageMaker Ground Truth](https://aws.amazon.com/sagemaker/groundtruth/). We use Ground Truth with a private workforce to annotate the entities in hundreds of documents, and generate annotation files from the results. To learn more about how to use Ground Truth to annotate data, see [Named Entity Recognition](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-named-entity-recg.html).

For the lab, we have already labeled the data and the annotation files are provided. 

## Pre-processing data 

In [None]:
# Firstly, we import necessary libraries and initialize clients
import re
import time
import json
import uuid
import boto3
import random
import secrets
import datetime
import sagemaker
import pandas as pd
from sagemaker import get_execution_role


# Provide the name of your S3 bucket here. It was already created in your account for this workshop
bucket = ''

region = boto3.session.Session().region_name

# Amazon S3 and Amazon Comprehend clients
s3 = boto3.client('s3', region_name=region)
s3_resource = boto3.resource('s3')
comprehend = boto3.client('comprehend', region_name=region)

# Verify that the bucket exists and is accessible
try:
    s3.head_bucket(Bucket=bucket)
except Exception:
    print("The S3 bucket name {} you entered seems to be incorrect, please try again".format(bucket))

In [None]:
# This is the execution role that Amazon Comprehend will use to access your data in S3
role = get_execution_role()
display(role)

#### We have already provided a training dataset and an annotations file in the repository. Let's take a look at them now.

In [None]:
pd.read_csv('train.csv',header=None).head(10)

In [None]:
pd.read_csv('annotations.csv').head(10)
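
Before uploading the files, it can help to confirm that every annotation points inside its document. The sketch below is one way to do that, assuming the annotation rows use the column names `Line`, `Begin Offset`, and `End Offset`. The helper name `find_bad_annotations` and the sample rows are hypothetical.

```python
def find_bad_annotations(documents, annotations):
    """Return annotations whose line number or offsets fall outside the documents."""
    bad = []
    for ann in annotations:
        line, begin, end = ann["Line"], ann["Begin Offset"], ann["End Offset"]
        line_ok = 0 <= line < len(documents)
        # Offsets must be ordered and must not run past the end of the line.
        if not line_ok or not (0 <= begin < end <= len(documents[line])):
            bad.append(ann)
    return bad


# Hypothetical sample data: one valid row and two invalid rows.
documents = ["Kinematics describes motion."]
annotations = [
    {"File": "train.csv", "Line": 0, "Begin Offset": 0, "End Offset": 10, "Type": "MOVEMENT"},
    {"File": "train.csv", "Line": 0, "Begin Offset": 25, "End Offset": 40, "Type": "MOVEMENT"},
    {"File": "train.csv", "Line": 3, "Begin Offset": 0, "End Offset": 5, "Type": "BRAIN"},
]
bad_rows = find_bad_annotations(documents, annotations)
print(f"{len(bad_rows)} annotation(s) out of range")
```

In this notebook, the inputs could come from `pd.read_csv('train.csv', header=None)[0].tolist()` and `pd.read_csv('annotations.csv').to_dict('records')`, provided the column names in your file match the ones assumed here.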

In [None]:
# let's upload our train and annotation files to S3
s3.upload_file('train.csv', bucket, 'comprehend/train/train.csv')
s3.upload_file('annotations.csv', bucket, 'comprehend/train/annotations.csv')
s3_train_channel = "s3://" + bucket + "/comprehend/train/train.csv"
s3_annot_channel = "s3://" + bucket + "/comprehend/train/annotations.csv"

### Create Comprehend Custom Entity Recognizer

In [None]:
custom_entity_request = {
    "DataFormat": "COMPREHEND_CSV",
    "Documents": {
        "S3Uri": s3_train_channel,
        "InputFormat": "ONE_DOC_PER_LINE"
    },
    "Annotations": {
        "S3Uri": s3_annot_channel
    },
    "EntityTypes": [
        {"Type": "MOVEMENT"},
        {"Type": "BRAIN"},
        {"Type": "ETHICS"}
    ]
}

In [None]:
# create unique ID for recognizer
uid = str(uuid.uuid4())

response = comprehend.create_entity_recognizer(
    RecognizerName=f"aim317-ner-{uid}",
    DataAccessRoleArn=role,
    InputDataConfig=custom_entity_request,
    LanguageCode="en",
    VersionName='v001'
)

print(response['EntityRecognizerArn'])

### Check training status in Amazon Comprehend console

[Go to Amazon Comprehend Console](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#entity-recognition)

This takes approximately 20 minutes. **Run the Entity Recognizer Metrics step below only after** the entity recognizer model has been created and is ready for use; otherwise you will get an error message. If that happens, rerun the cell after the entity recognizer has finished training.
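
As an alternative to watching the console, the status can be polled from the notebook. The sketch below is a generic polling helper, not part of the workshop code: `wait_for_status` and its parameters are hypothetical names, and a stand-in status function replaces the real API call so the example runs without AWS access.

```python
import time

# Statuses after which training will not progress any further.
TERMINAL_STATUSES = {"TRAINED", "TRAINED_WITH_WARNING", "IN_ERROR", "STOPPED"}


def wait_for_status(get_status, terminal_statuses, poll_seconds=60,
                    timeout_seconds=3600, sleep=time.sleep):
    """Call get_status() until it returns a terminal status or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        if status in terminal_statuses:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for the real API call, so the sketch runs offline.
fake_statuses = iter(["SUBMITTED", "TRAINING", "TRAINING", "TRAINED"])
final_status = wait_for_status(
    lambda: next(fake_statuses), TERMINAL_STATUSES, sleep=lambda seconds: None
)
print(final_status)
```

In this notebook, `get_status` could be `lambda: comprehend.describe_entity_recognizer(EntityRecognizerArn=response['EntityRecognizerArn'])['EntityRecognizerProperties']['Status']`. The same helper should also work for the endpoint created later, using `describe_endpoint` and the terminal statuses `IN_SERVICE` and `FAILED`.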

In [None]:
describe_response = comprehend.describe_entity_recognizer(
 EntityRecognizerArn=response['EntityRecognizerArn']
)

print(describe_response['EntityRecognizerProperties']['Status'])

### Entity Recognizer Metrics

In [None]:
# Print recognizer metrics
print("Entity recognizer metrics:")
for ent in describe_response["EntityRecognizerProperties"]["RecognizerMetadata"]["EntityTypes"]:
    print(ent['Type'])
    metrics = ent['EvaluationMetrics']
    for k, v in metrics.items():
        metrics[k] = round(v, 2)
    print(metrics)

In [None]:
describe_response['EntityRecognizerProperties']['EntityRecognizerArn']

## Create endpoint

Now that the model is trained, we'll deploy the model to an Amazon Comprehend endpoint for synchronous, real-time inference. 

In [None]:
# NOTE - We use real-time endpoints and chunked text for demo purposes in this workshop.
# If your use case does not need real-time insights from Comprehend, we suggest using
# start_entities_detection_job or batch_detect_entities to send the full corpus for entity detection.
# If you need real-time inference, use a Comprehend real-time endpoint as shown in this notebook.
# We use 4 Inference Units (IUs) in this workshop; each IU has a throughput of 100 characters per second.
endpoint_response = comprehend.create_endpoint(
    EndpointName="aim317-ner-endpoint",
    ModelArn=describe_response['EntityRecognizerProperties']['EntityRecognizerArn'],
    DesiredInferenceUnits=4,  # you are charged based on Inference Units; this workshop uses 4 IUs
    DataAccessRoleArn=role
)

In [None]:
print(endpoint_response['EndpointArn'])
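
To get a feel for what 4 inference units mean in practice, here is a back-of-the-envelope sketch based on the figure of 100 characters per second per IU stated above. It assumes throughput scales linearly with the number of IUs, which is a simplification. The function names and the 950 characters-per-second workload are hypothetical.

```python
import math

CHARS_PER_SECOND_PER_IU = 100  # throughput per inference unit, as stated in this notebook


def seconds_to_process(num_chars, inference_units):
    """Estimate how many seconds of endpoint throughput a text of num_chars consumes."""
    return num_chars / (inference_units * CHARS_PER_SECOND_PER_IU)


def inference_units_needed(chars_per_second):
    """Estimate the number of IUs needed for a sustained character rate (at least 1)."""
    return max(1, math.ceil(chars_per_second / CHARS_PER_SECOND_PER_IU))


# The inference step later sends characters 400 to 1800 of each transcript.
chunk_chars = 1800 - 400
chunk_seconds = seconds_to_process(chunk_chars, inference_units=4)
units_for_950 = inference_units_needed(950)
print(chunk_seconds, units_for_950)
```

Under these assumptions, each 1,400-character chunk uses about 3.5 seconds of the endpoint's capacity, which is why the workshop sends only a slice of each transcript.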

### Check endpoint status in Amazon Comprehend console

[Go to Amazon Comprehend Console](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#endpoints)

This takes approximately 10 minutes. Go to the **Run inference** step below after the endpoint has been created and is ready for use. Running the cells before the endpoint is ready results in an error; you can rerun the cell once the endpoint becomes available.

## Run inference

In [None]:
# Input files ready for entity recognition
!aws s3 ls s3://{bucket}/comprehend/input/

In [None]:
# Prepare to page through our transcripts in S3

# Define the S3 handles
s3 = boto3.client('s3')
s3_resource = boto3.resource('s3')

# Specify an S3 output prefix
t_prefix = 'quicksight/data/entity'

# Page through the transcripts under the input prefix of the workshop bucket
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket, Prefix='comprehend/input')

# DataFrame that stores the most prominent entity type detected in each transcript
cols = ['transcript_name', 'entity_type']
df_ent = pd.DataFrame(columns=cols)

# Now let's page through the transcripts
for page in pages:
    # 'Contents' is absent when a page has no objects
    for obj in page.get('Contents', []):
        # Get the transcript file name; skip the folder placeholder key, if any
        transcript_file_name = obj['Key'].split('/')[-1]
        if not transcript_file_name:
            continue
        # Now let's get the transcript file contents
        temp = s3_resource.Object(bucket, obj['Key'])
        transcript_content = temp.get()['Body'].read().decode('utf-8')
        # Send a chunk of the transcript for entity recognition
        # NOTE - We use real-time endpoints and chunked text for demo purposes in this workshop.
        # If your use case does not need real-time insights from Comprehend, we suggest using
        # start_entities_detection_job or batch_detect_entities to send the full corpus for entity detection.
        # We use 4 Inference Units (IUs) in this workshop; each IU has a throughput of 100 characters per second.
        transcript_truncated = transcript_content[400:1800]
        # Call Comprehend to detect the custom entities in the chunk
        response = comprehend.detect_entities(
            Text=transcript_truncated,
            LanguageCode='en',
            EndpointArn=endpoint_response['EndpointArn']
        )
        # Extract the most prominent entity, that is, the one with the highest score
        entities = response['Entities']
        if entities:
            entity = max(entities, key=lambda e: e['Score'])['Type']
        else:
            entity = 'No entities'

        # Remove the 'en-' prefix and '.txt' suffix from the file name.
        # str.strip() removes character sets rather than substrings, so use a regex instead.
        transcript_name = re.sub(r'^en-|\.txt$', '', transcript_file_name)

        # Update the results DataFrame with the detected entity type
        df_ent.loc[len(df_ent.index)] = [transcript_name, entity]

# Create a CSV file with the detected entity types from this DataFrame
df_ent.to_csv('s3://' + bucket + '/' + t_prefix + '/' + 'entities.csv', index=False)
df_ent
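
The notes in the code above suggest `start_entities_detection_job` when real-time results are not needed. The sketch below shows how the request for such an asynchronous job could be assembled. The helper name, the job name, the `comprehend/output/` prefix, and the placeholder ARNs are all hypothetical; the actual call is left as a comment so the sketch runs without AWS access.

```python
def build_entities_job_request(bucket, role_arn, recognizer_arn, job_name):
    """Assemble keyword arguments for comprehend.start_entities_detection_job."""
    return {
        "JobName": job_name,
        "EntityRecognizerArn": recognizer_arn,
        "LanguageCode": "en",
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "S3Uri": f"s3://{bucket}/comprehend/input/",
            "InputFormat": "ONE_DOC_PER_FILE",  # each transcript is its own file
        },
        # Hypothetical output prefix; choose any prefix the role can write to.
        "OutputDataConfig": {"S3Uri": f"s3://{bucket}/comprehend/output/"},
    }


# Placeholder values for illustration only.
request = build_entities_job_request(
    bucket="example-bucket",
    role_arn="arn:aws:iam::123456789012:role/ExampleRole",
    recognizer_arn="arn:aws:comprehend:us-east-1:123456789012:entity-recognizer/example",
    job_name="aim317-ner-batch",
)
print(request["InputDataConfig"]["S3Uri"])

# In the notebook, with real values for bucket, role, and recognizer ARN:
# job = comprehend.start_entities_detection_job(**request)
```

A batch job processes each transcript in full, so the character slicing used for the real-time endpoint would not be needed.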

### We are done here. You can return to the workshop instructions for next steps