# How to prepare a dataset and submit a custom entity recognizer for Amazon Comprehend

This notebook walks through how to prepare a training dataset for custom entities in Amazon Comprehend.

More information on how to create a custom entity recognizer model can be found here.

https://docs.aws.amazon.com/comprehend/latest/dg/training-recognizers.html





In [1]:
# library imports
import boto3

import botocore
import re
import numpy as np
import pandas as pd
import matplotlib
import csv
import json
import time
import os
import datetime

from sagemaker import get_execution_role
from sagemaker.session import Session

comprehend = boto3.client('comprehend')

# Specify S3 bucket and prefix that you want to use for model data
# Feel free to specify a different bucket here if you wish.
bucket = '<your-bucket>'
prefix = 'comprehend-custom-entity'

role = get_execution_role()
print(role)

arn:aws:iam::202860692096:role/service-role/AmazonSageMaker-ExecutionRole-20191203T163283


In this example we will use the following Twitter dataset: https://www.kaggle.com/thoughtvector/customer-support-on-twitter
Download the dataset and save it in the ./data folder.

If you don't have a Kaggle account, you can run the following commands from the notebook terminal.

aws s3 cp s3://phi-demo-london/twcs/twcs.zip /home/ec2-user/SageMaker/amazon-comprehend-custom-entity/data/twcs.zip

cd /home/ec2-user/SageMaker/amazon-comprehend-custom-entity/data

unzip twcs.zip

In [2]:
tweets = pd.read_csv('./data/twcs.csv',encoding='utf-8')
print(tweets.shape)
tweets.head()

(2811774, 7)


Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2.0,3.0
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1.0,4.0
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3.0,5.0
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0


<a id='data-wrangling'></a>

## Data Wrangling

This data set contains about 2.8 million tweets (2,811,774 rows), along with the author of each tweet and whether the tweet was a query or a response (the "inbound" column). If the tweet was a query, response_tweet_id gives the ID of the response made by the support team.

It is useful to reshape this dataframe so that every row holds a query-response pair.
The following code, which does exactly that, was pulled from [this kernel](https://www.kaggle.com/soaxelbrooke/first-inbound-and-response-tweets)

In [3]:
first_inbound = tweets[pd.isnull(tweets.in_response_to_tweet_id) & tweets.inbound]

QnR = pd.merge(first_inbound, tweets, left_on='tweet_id', 
                                  right_on='in_response_to_tweet_id')

# Filter to only outbound replies (from companies)
QnR = QnR[~QnR.inbound_y]
print(f'Data shape: {QnR.shape}')
QnR.head()

Data shape: (794299, 14)


Unnamed: 0,tweet_id_x,author_id_x,inbound_x,created_at_x,text_x,response_tweet_id_x,in_response_to_tweet_id_x,tweet_id_y,author_id_y,inbound_y,created_at_y,text_y,response_tweet_id_y,in_response_to_tweet_id_y
0,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610,,6,sprintcare,False,Tue Oct 31 21:46:24 +0000 2017,@115712 Can you please send us a private messa...,57.0,8.0
1,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610,,9,sprintcare,False,Tue Oct 31 21:46:14 +0000 2017,@115712 I would love the chance to review the ...,,8.0
2,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610,,10,sprintcare,False,Tue Oct 31 21:45:59 +0000 2017,@115712 Hello! We never like our customers to ...,,8.0
3,18,115713,True,Tue Oct 31 19:56:01 +0000 2017,@115714 y’all lie about your “great” connectio...,17,,17,sprintcare,False,Tue Oct 31 19:59:13 +0000 2017,@115713 H there! We'd definitely like to work ...,16.0,18.0
4,20,115715,True,Tue Oct 31 22:03:34 +0000 2017,"@115714 whenever I contact customer support, t...",19,,19,sprintcare,False,Tue Oct 31 22:10:10 +0000 2017,@115715 Please send me a private message so th...,,20.0
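
To make the self-merge above easier to follow, here is a minimal sketch on a made-up four-tweet dataframe. The column names match the dataset, but the rows and the `cust_a` / `cust_b` authors are invented for illustration.

```python
import pandas as pd

# Toy stand-in for twcs.csv: tweet 1 opens a thread, tweets 2 and 3 reply to it,
# and tweet 4 replies to tweet 2.
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "author_id": ["cust_a", "sprintcare", "cust_b", "cust_a"],
    "inbound": [True, False, True, True],
    "text": ["my phone is broken", "please DM us", "same here", "thanks"],
    "in_response_to_tweet_id": [float("nan"), 1.0, 1.0, 2.0],
})

# Thread openers: inbound tweets that do not reply to anything
first_inbound = toy[toy.in_response_to_tweet_id.isnull() & toy.inbound]

# Pair each opener (_x columns) with every tweet that replies to it (_y columns)
pairs = pd.merge(first_inbound, toy,
                 left_on="tweet_id", right_on="in_response_to_tweet_id")

# Keep only the replies that came from the company side
replies = pairs[~pairs.inbound_y]
print(replies[["text_x", "author_id_y", "text_y"]])
```

Tweet 3 is dropped because it is another customer's reply, and tweet 4 never appears because it answers tweet 2 rather than a thread opener.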


In [4]:
# Keep only the columns we need
QnR = QnR[["author_id_x","created_at_x","text_x","author_id_y","created_at_y","text_y"]]
QnR.head(5)

Unnamed: 0,author_id_x,created_at_x,text_x,author_id_y,created_at_y,text_y
0,115712,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,sprintcare,Tue Oct 31 21:46:24 +0000 2017,@115712 Can you please send us a private messa...
1,115712,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,sprintcare,Tue Oct 31 21:46:14 +0000 2017,@115712 I would love the chance to review the ...
2,115712,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,sprintcare,Tue Oct 31 21:45:59 +0000 2017,@115712 Hello! We never like our customers to ...
3,115713,Tue Oct 31 19:56:01 +0000 2017,@115714 y’all lie about your “great” connectio...,sprintcare,Tue Oct 31 19:59:13 +0000 2017,@115713 H there! We'd definitely like to work ...
4,115715,Tue Oct 31 22:03:34 +0000 2017,"@115714 whenever I contact customer support, t...",sprintcare,Tue Oct 31 22:10:10 +0000 2017,@115715 Please send me a private message so th...


## Filter to only telco tweets
In our example, we want to create a custom entity to recognize smartphone devices. Let's filter our dataframe to include only the T-Mobile and Sprint tweets.

In [5]:
# .copy() gives us an independent dataframe, so adding a column below
# does not trigger pandas' SettingWithCopyWarning
tweet_telco = QnR[QnR["author_id_y"].isin(["TMobileHelp", "sprintcare"])].copy()

Let's concatenate the question and response into one column.

In [6]:
tweet_telco['text'] = tweet_telco['text_x'] + ' | ' + tweet_telco['text_y']


Let's save our telco tweets as a csv file.

In [7]:

# header=False keeps the file to one tweet per row, with no column-name line
tweet_telco['text'].to_csv('./data/tweet_telco.csv', encoding='utf-8', index=False, header=False)


## Entity list
In order to create our dataset we need to provide an entity list for our new class named DEVICE.

In the second notebook, we load a corpus into a word2vec model and generate a list of keywords that are contextually similar. This technique will be used in the custom classifier in the third notebook. The same technique could be applied here.

For now, in order to create our entity list, we will generate keywords of different smartphones manually.

In [8]:
sphones = ['iPhone X', 'iPhoneX', 'iphoneX', 'Samsung Galaxy', 'Samsung Note', 'iphone', 'iPhone', 'android', 'Android']

df_entity_list = pd.DataFrame(sphones, columns=['Text'])
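
The hand-written list above mostly varies casing and spacing. As a rough sketch, such variants could also be generated from a shorter seed list. `expand_variants` is a hypothetical helper, not part of the notebook or of Amazon Comprehend, and the rules it applies (lowercasing, removing spaces) are only assumptions about which variants are useful.

```python
def expand_variants(names):
    """Return the names plus simple casing/spacing variants, de-duplicated in order."""
    seen = []
    for name in names:
        candidates = [
            name,                           # as written
            name.lower(),                   # all lowercase
            name.replace(" ", ""),          # spaces removed
            name.lower().replace(" ", ""),  # lowercase, spaces removed
        ]
        for candidate in candidates:
            if candidate not in seen:
                seen.append(candidate)
    return seen


variants = expand_variants(["iPhone X", "Samsung Galaxy"])
print(variants)
```

The result could be passed to `pd.DataFrame(variants, columns=['Text'])` in place of the manual list.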


Let's add another column with our class label. This is a required part of the Amazon Comprehend training dataset.

More information can be found here.

https://docs.aws.amazon.com/comprehend/latest/dg/cer-entity-list.html


In [9]:
df_entity_list['Type'] = 'DEVICE'


In [10]:
df_entity_list.head()

Unnamed: 0,Text,Type
0,iPhone X,DEVICE
1,iPhoneX,DEVICE
2,iphoneX,DEVICE
3,Samsung Galaxy,DEVICE
4,Samsung Note,DEVICE
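
Before uploading, it can be worth sanity-checking the entity list. The sketch below assumes the file is a CSV whose first row is the header `Text,Type`, matching the dataframe above. `check_entity_list` is a hypothetical helper and the sample rows are made up. It is not an official validator, so the linked documentation remains the reference for the actual requirements.

```python
import csv
import io


def check_entity_list(csv_text, allowed_types):
    """Return a list of problems found in an entity-list CSV (empty means none found)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or rows[0] != ["Text", "Type"]:
        return ["first row should be the header Text,Type"]
    problems = []
    seen = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != 2 or not row[0].strip():
            problems.append("line {}: expected a non-empty Text and a Type".format(line_no))
            continue
        text, entity_type = row
        if entity_type not in allowed_types:
            problems.append("line {}: unexpected type {!r}".format(line_no, entity_type))
        if (text, entity_type) in seen:
            problems.append("line {}: duplicate entry {!r}".format(line_no, text))
        seen.add((text, entity_type))
    return problems


# Made-up sample with one duplicate row and one mislabeled row
sample = "Text,Type\niPhone X,DEVICE\niPhone X,DEVICE\nandroid,PHONE\n"
issues = check_entity_list(sample, {"DEVICE"})
for issue in issues:
    print(issue)
```

Once the entity list has been written to disk, the same function can be run on the file's contents with `{"DEVICE"}` as the allowed types.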


Let's create our training, entity list, and test files and upload them to S3.

In [11]:
training_file = './data/telco_train.csv'
# header=False: recent pandas versions would otherwise write a "text" header line
tweet_telco['text'].to_csv(training_file, encoding='utf-8', index=False, header=False)

entity_file = './data/entity_list.csv'
# The entity list keeps its header row (Text,Type)
df_entity_list.to_csv(entity_file, encoding='utf-8', index=False)

test_file = './data/telco_device_test.csv'
tweet_telco['text'].tail(10000).to_csv(test_file, encoding='utf-8', index=False, header=False)
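
The test job later in this notebook reads its input as `ONE_DOC_PER_LINE`, so a tweet containing a line break would be split across several lines of the file. A minimal sketch of one way to guard against that is to collapse all whitespace before writing the files. `one_line` is a hypothetical helper and the sample tweets are made up.

```python
def one_line(text):
    """Collapse every run of whitespace, including newlines, into a single space."""
    return " ".join(text.split())


sample = [
    "my iPhone X\nkeeps dropping calls",
    "  no signal\r\n\r\nagain ",
]
flat = [one_line(t) for t in sample]
print(flat)
```

On a pandas Series the same function can be applied with `.map(one_line)` before calling `to_csv`.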




In [12]:
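def upload_to_s3(s3path, file):
    """Upload a local file to `bucket` under the key `s3path`."""
    s3 = boto3.resource('s3')
    # The context manager closes the file handle once the upload finishes
    with open(file, "rb") as data:
        s3.Bucket(bucket).put_object(Key=s3path, Body=data)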

s3_train_key = prefix + "/train/telco_train.csv" 
s3_test_key = prefix + "/test/telco_device_test.csv"
s3_entity_key = prefix + "/entity/telco_entity.csv"

upload_to_s3(s3_train_key, training_file)
upload_to_s3(s3_test_key, test_file)
upload_to_s3(s3_entity_key, entity_file)

In [13]:
# Build the S3 URIs for the uploaded files and for the test output
s3_train_data = 's3://{}/{}'.format(bucket, s3_train_key)
s3_train_entity = 's3://{}/{}'.format(bucket, s3_entity_key)
s3_test_data = 's3://{}/{}'.format(bucket, s3_test_key)
s3_output_test_data = 's3://{}/{}/test/{}'.format(bucket, prefix, "telco_test_output.json")
print('uploaded training data location: {}'.format(s3_train_data))


uploaded training data location: s3://data-phi/comprehend-custom-entity/train/telco_train.csv


## Training
Let's prepare the Custom Entity training job request file.

In [14]:
custom_entity_request = {

      "Documents": { 
         "S3Uri": s3_train_data
      },
      "EntityList": { 
         "S3Uri": s3_train_entity
      },
      "EntityTypes": [ 
         { 
            "Type": "DEVICE"
         }
      ]
   
}

In [53]:

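# A Unix-timestamp suffix keeps the recognizer name unique across runs.
# time.time() is portable; strftime("%s") is platform-specific, and `id` shadows a builtin.
run_id = str(int(time.time()))
create_custom_entity_response = comprehend.create_entity_recognizer(
        RecognizerName = "custom-device-recognizer" + run_id, 
        DataAccessRoleArn = role,
        InputDataConfig = custom_entity_request,
        LanguageCode = "en"
)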

In [None]:
jobArn = create_custom_entity_response['EntityRecognizerArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_custom_recognizer = comprehend.describe_entity_recognizer(
        EntityRecognizerArn = jobArn
    )
    status = describe_custom_recognizer["EntityRecognizerProperties"]["Status"]
    print("Custom entity recognizer: {}".format(status))
    
    if status == "TRAINED" or status == "IN_ERROR":
        break
        
    time.sleep(60)

Custom entity recognizer: SUBMITTED
Custom entity recognizer: TRAINING
Custom entity recognizer: TRAINING
Custom entity recognizer: TRAINING
Custom entity recognizer: TRAINING
Custom entity recognizer: TRAINING
Custom entity recognizer: TRAINING
Custom entity recognizer: TRAINING
Custom entity recognizer: TRAINING
Custom entity recognizer: TRAINING
Custom entity recognizer: TRAINING
Custom entity recognizer: TRAINING
Custom entity recognizer: TRAINING
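
The same poll-until-done pattern is used again for the test job, so it could be factored into a small helper. This is only a sketch: `wait_for_status` is a hypothetical function, and unlike the loop above it raises on timeout instead of exiting silently. The demo uses a fake status source so it runs without AWS access.

```python
import time


def wait_for_status(get_status, terminal_states, poll_seconds=60,
                    timeout_seconds=3 * 60 * 60):
    """Call get_status() until it returns a terminal state or the timeout passes."""
    deadline = time.time() + timeout_seconds
    while True:
        status = get_status()
        print("Status: {}".format(status))
        if status in terminal_states:
            return status
        if time.time() >= deadline:
            raise TimeoutError("still {} after {} seconds".format(status, timeout_seconds))
        time.sleep(poll_seconds)


# Fake status source standing in for the Comprehend describe call
fake_statuses = iter(["SUBMITTED", "TRAINING", "TRAINING", "TRAINED"])
final_status = wait_for_status(lambda: next(fake_statuses),
                               {"TRAINED", "IN_ERROR"}, poll_seconds=0)
```

For the recognizer, `get_status` would be a lambda that returns `comprehend.describe_entity_recognizer(EntityRecognizerArn=jobArn)["EntityRecognizerProperties"]["Status"]`.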


## Evaluation

You can see the different metrics for our custom entity recognizer. More information can be found here.
https://docs.aws.amazon.com/comprehend/latest/dg/cer-metrics.html

In [19]:
print(json.dumps(describe_custom_recognizer["EntityRecognizerProperties"]["RecognizerMetadata"]["EntityTypes"], indent=2, default=str))

[
  {
    "Type": "DEVICE",
    "EvaluationMetrics": {
      "Precision": 99.12023460410558,
      "Recall": 100.0,
      "F1Score": 99.55817378497791
    },
    "NumberOfTrainMentions": 2764
  }
]
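
As a quick consistency check, the reported F1 score can be reproduced from precision and recall with the standard harmonic-mean formula. The numbers below are copied from the output above. That Comprehend computes F1 exactly this way is an assumption, although the values agree.

```python
precision = 99.12023460410558
recall = 100.0

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 99.56
```

Note that recall is 100 while precision is slightly lower, meaning the evaluation set had a few false positives but no missed mentions.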


## Testing our custom entity model

Let's invoke the Comprehend API to run our test job from the test file we prepared earlier.

In [20]:
test_response = comprehend.start_entities_detection_job(
    InputDataConfig={
        'S3Uri': s3_test_data,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': s3_output_test_data
    },
    DataAccessRoleArn=role,
    JobName='Custom_Device_Test',
    EntityRecognizerArn=jobArn,
    LanguageCode='en'
)

Let's monitor the job.

In [None]:
jobId = test_response['JobId']
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_job = comprehend.describe_entities_detection_job(
        JobId = jobId
    )
    status = describe_job["EntitiesDetectionJobProperties"]["JobStatus"]
    print("Job Status: {}".format(status))
    
    if status == "COMPLETED" or status == "FAILED":
        break
        
    time.sleep(60)

Job Status: IN_PROGRESS
Job Status: IN_PROGRESS
Job Status: IN_PROGRESS
Job Status: IN_PROGRESS
Job Status: IN_PROGRESS
Job Status: IN_PROGRESS
Job Status: IN_PROGRESS
Job Status: IN_PROGRESS
Job Status: COMPLETED


In [22]:
# Download the test output to the local machine
job_output = describe_job["EntitiesDetectionJobProperties"]["OutputDataConfig"]["S3Uri"]
path_prefix = 's3://{}/'.format(bucket)
# Strip the "s3://<bucket>/" prefix to get the object key
job_key = job_output[len(path_prefix):]

s3 = boto3.resource('s3')
s3.Bucket(bucket).download_file(job_key, 'output.tar.gz')


In [23]:
!tar xvzf output.tar.gz

output
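
As an alternative to the shell command, the archive can be read directly from Python with the standard `tarfile` module. The sketch below assumes the archive holds a single member named `output` with one JSON document per line, as the listing above suggests. `read_json_lines_from_tar` is a hypothetical helper, and the demo builds a tiny made-up archive so it runs on its own.

```python
import io
import json
import os
import tarfile
import tempfile


def read_json_lines_from_tar(archive_path, member_name="output"):
    """Parse each non-empty line of one archive member as JSON."""
    with tarfile.open(archive_path, "r:gz") as tar:
        member = tar.extractfile(member_name)
        return [json.loads(line) for line in member if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "output.tar.gz")
    # Made-up payload, used only to exercise the helper
    payload = b'{"Entities": [{"Text": "iPhone X", "Type": "DEVICE"}]}\n'
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo(name="output")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    records = read_json_lines_from_tar(archive)

print(records[0]["Entities"][0]["Text"])
```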


In [24]:
# Collect the first detected entity from each output document
import json

data = []
with open('output', 'r') as f:
    for line in f:
        entities = json.loads(line)['Entities']
        if entities:
            data.append(entities[0]['Text'])


def unique(values):
    """Print each distinct value once (set order is arbitrary)."""
    for x in set(values):
        print(x)

unique(data)


iPhone X
iPhone
Samsung Note
android
iphone
Samsung Glxy
Android
Pure X
iphonex
iPhoneX
iphoneX
Samsung galaxy
Samsung Galaxy
newiPhone X
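
To see which detections go beyond the entity list used for training, the two sets can be compared. Both lists below are copied from this notebook: the `sphones` list and the unique values printed above.

```python
entity_list = ['iPhone X', 'iPhoneX', 'iphoneX', 'Samsung Galaxy', 'Samsung Note',
               'iphone', 'iPhone', 'android', 'Android']

detected = ['iPhone X', 'iPhone', 'Samsung Note', 'android', 'iphone', 'Samsung Glxy',
            'Android', 'Pure X', 'iphonex', 'iPhoneX', 'iphoneX', 'Samsung galaxy',
            'Samsung Galaxy', 'newiPhone X']

# Detections that never appeared verbatim in the training entity list
novel = sorted(set(detected) - set(entity_list))
print(novel)
# ['Pure X', 'Samsung Glxy', 'Samsung galaxy', 'iphonex', 'newiPhone X']
```

Five of the fourteen distinct detections are strings that were not in the entity list, including misspellings and casing variants such as "Samsung Glxy" and "iphonex".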
