# Amazon Comprehend Custom Classification

This notebook serves as a template for the overall process of taking a text dataset, integrating it into [Amazon Comprehend Custom Classification](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html), and performing NLP for custom classification.

## Overview

1. [Introduction to Amazon Comprehend Custom Classification](#Introduction)
1. [Obtaining Your Data](#data)
1. [Pre-processing data](#preprocess)
1. [Building Custom Classification model](#build)
1. [Real time inference](#inference)
1. [Cleanup](#cleanup)


## Introduction to Amazon Comprehend Custom Classification 

If you are not familiar with Amazon Comprehend Custom Classification, you can learn more about it on these pages:

* [Product Page](https://aws.amazon.com/comprehend/)
* [Product Docs](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html)

## Training a custom classifier

Custom classification is a two-step process. First, you train a custom classifier to recognize the classes that are of interest to you. Then you send unlabeled documents to be classified.

To train the classifier, specify the options you want, and send Amazon Comprehend documents to be used as training material. Based on the options you indicated, Amazon Comprehend creates a custom ML model that it trains based on the documents you provided. This custom model (the classifier) examines each document you submit. It then returns either the specific class that best represents the content (if you're using multi-class mode) or the set of classes that apply to it (if you're using multi-label mode).

We are going to use a pre-built Hugging Face dataset of customer reviews in multi-class mode. The training dataset must be a .csv file with one class and one document per line. For example:
```
CLASS,Text of document 1
CLASS,Text of document 2
CLASS,Text of document 3
```
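
As a minimal sketch of producing this layout, the snippet below uses Python's standard `csv` module so that commas inside a review are quoted rather than read as extra columns. The helper name `to_training_csv` and the sample rows are hypothetical and only illustrate the format; they are not part of the dataset used in this notebook.

```python
import csv
import io


def to_training_csv(rows):
    """Serialize (label, text) pairs as header-less CSV, one document per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, text in rows:
        # Collapse embedded line breaks so each document stays on a single line
        writer.writerow([label, " ".join(text.split())])
    return buffer.getvalue()


# Hypothetical sample rows, for illustration only
sample = to_training_csv([
    ("CTA", "Great sound, works as described"),
    ("No CTA", "Stopped working\nafter two days"),
])
print(sample)
```

The first row is quoted because it contains a comma, while the second has its line break replaced by a space.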


In [None]:
# Install Hugging Face datasets package
!pip --disable-pip-version-check install datasets --quiet

With the `datasets` package installed, we will now import the pandas library and a few other data science tools to inspect the data.

In [None]:
import os
import json
import time
import uuid
import boto3
import pprint
import string
import random
import datetime 
import subprocess
import numpy as np
import pandas as pd
from time import sleep

Let's load the data into a DataFrame and look at it. Examine the number of columns that are present, and look at a few samples to see the content of the data. **This will take 5 to 8 minutes**.

**Note:** CTA means call to action. No CTA means no call to action. This is a metric to determine whether the customer's concern was addressed by the agent during the call. A CTA indicates that the customer is satisfied that their concerns have been or will be addressed by the company.

In [None]:
from datasets import load_dataset
dataset = load_dataset('amazon_us_reviews', 'Electronics_v1_00', split='train[:10%]')

In [None]:
dataset.set_format(type='pandas')
df = dataset[:1000]

Amazon Comprehend Custom Classifier requires the data in the following format:

```
CLASS,Text of document 1
CLASS,Text of document 2
CLASS,Text of document 3
```
We will identify which column holds the class and which holds the text content we would like to train on, then create a new DataFrame with only those columns.

In [None]:
df1 = df[['star_rating','review_body']]
df1 = df1.rename(columns={"review_body": "text", "star_rating": "class"})

We will translate the customer product ratings to CTA (call-to-action) and No CTA (no call-to-action). Ratings of 3 and above are considered CTA (customer is satisfied); ratings of 1 and 2 are considered No CTA (customer is not satisfied).

In [None]:
# Map the numeric rating to a string label in one step, so the integer column
# is never partially overwritten with strings
df1['class'] = np.where(df1['class'] >= 3, 'CTA', 'No CTA')
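
The labeling rule can also be sketched as a plain function, which makes the threshold explicit and easy to test outside pandas. The names `rating_to_label` and `threshold` are illustrative and do not appear in the notebook.

```python
def rating_to_label(star_rating, threshold=3):
    """Return 'CTA' for ratings at or above the threshold, otherwise 'No CTA'."""
    return "CTA" if star_rating >= threshold else "No CTA"


# Apply the rule to every possible star rating
labels = [rating_to_label(rating) for rating in (1, 2, 3, 4, 5)]
print(labels)
```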

Remove all punctuation from the text

In [None]:
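import string

# str.strip() only trims punctuation at the two ends of a string, so use a
# translation table to delete every punctuation character instead
punct_table = str.maketrans('', '', string.punctuation)
df1['text'] = df1['text'].astype(str).str.translate(punct_table)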
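
When removing punctuation, note that `str.strip` and `str.translate` behave differently: `strip` trims characters only from the two ends of a string, while a translation table deletes them everywhere. The short sketch below shows the difference on a made-up review string.

```python
import string

# Made-up review text, for illustration only
review = "!!Great product, works well... mostly!!"

# strip() removes punctuation only at the start and end
trimmed = review.strip(string.punctuation)

# A translation table built by str.maketrans deletes punctuation throughout
table = str.maketrans("", "", string.punctuation)
cleaned = review.translate(table)

print(trimmed)
print(cleaned)
```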

In [None]:
df1.head()

In [None]:
df1['class'].value_counts()

## Pre-processing data 


For training, the file format must conform with the [following](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification-training.html):

- File must contain one label and one text per line – 2 columns
- No header
- Format UTF-8, with each line terminated by a line feed “\n”.

Labels “must be uppercase, can be multitoken, have whitespace, consist of multiple words connect by underscores or hyphens or may even contain a comma in it, as long as it is correctly escaped.”
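
The training-file rules above can be checked before upload. The following is a rough sketch, assuming the file has already been read into a string; the function name `find_format_problems` is hypothetical, and the check covers only the column count and empty fields, not every constraint in the Comprehend documentation.

```python
import csv
import io


def find_format_problems(csv_text):
    """Return (line_number, message) pairs for rows that break the 2-column rule."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=1):
        if len(row) != 2:
            problems.append((line_no, "expected 2 columns, found {}".format(len(row))))
        elif not row[0].strip() or not row[1].strip():
            problems.append((line_no, "empty label or text"))
    return problems


# Illustrative inputs: one well-formed file and one with two bad rows
good = 'CTA,"Nice, works"\nNo CTA,Broke quickly\n'
bad = "CTA,one,two\n,missing label\n"
print(find_format_problems(good))
print(find_format_problems(bad))
```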

For inference, when you want your custom model to determine which label corresponds to a given text, the file format must conform to the following:

- File must contain text per line
- No header
- Format UTF-8, with each line terminated by a line feed “\n”.
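
A file in this inference layout could be assembled roughly as follows. This is only a sketch: `to_inference_lines` is a hypothetical helper, and the sample texts are invented. It flattens each document onto one line, skips empty documents, and encodes the result as UTF-8.

```python
def to_inference_lines(texts):
    """Join documents into a header-less payload with one text per line."""
    lines = []
    for text in texts:
        # Replace embedded line breaks and repeated whitespace with single spaces
        single_line = " ".join(text.split())
        if single_line:
            lines.append(single_line)
    return "\n".join(lines) + "\n"


payload = to_inference_lines([
    "First review.\nStill the first.",
    "   ",
    "Second review.",
])
encoded = payload.encode("utf-8")  # bytes ready to be written or uploaded
print(payload)
```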

At this point we have all the data needed to build the training file.

### Building The Target Train and Test Files

With all of the above spelled out, the next step is to build the training file:

1. `comprehend-train.csv` - A CSV file containing 2 columns without header, first column class, second column text.

In [None]:
DSTTRAINFILE='comprehend-train.csv'

df1.to_csv(path_or_buf=DSTTRAINFILE,
           header=False,
           index=False)

## Train an Amazon Comprehend custom classifier
Now that all of the required data exists, we can start working on the Comprehend Custom Classifier.

The custom classifier workload is built in two steps:

1. Training the custom model – no particular machine learning or deep learning knowledge is necessary
1. Classifying new data

Let's follow the steps below to train the custom model:

1. Specify the bucket name that was pre-created for you that will host training data artifacts and production results. 
1. Configure an IAM role allowing Comprehend to [access newly created buckets](https://docs.aws.amazon.com/comprehend/latest/dg/access-control-managing-permissions.html#auth-role-permissions)
1. Prepare data for training
1. Upload training data in the S3 bucket
1. Launch a “Train Classifier” job from the console: “Amazon Comprehend” > “Custom Classification” > “Train Classifier”
1. Prepare data for classification (one text per line, no header, same format as training data). Some more details [here](https://docs.aws.amazon.com/comprehend/latest/dg/how-class-run.html)
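
For the IAM step, the role passed to Comprehend must be assumable by the Comprehend service. A trust policy for that typically looks like the sketch below, built here as a Python dictionary and printed as JSON. This notebook assumes the role was pre-created, so the snippet is illustrative only; the separate permissions policy that grants access to the S3 bucket is not shown.

```python
import json

# Trust policy allowing the Amazon Comprehend service to assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "comprehend.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

trust_policy_json = json.dumps(trust_policy, indent=2)
print(trust_policy_json)
```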


In [None]:
# Get notebook's region
region = boto3.Session().region_name
print(region)

Configure your AWS APIs

In [None]:
import sagemaker

s3 = boto3.client('s3')
comprehend = boto3.client('comprehend')
role = sagemaker.get_execution_role()

Specify an Amazon S3 bucket that will host training data and test data. **Note:** This bucket should already have been created for you. Please go to the Amazon S3 console to verify the bucket is present. Its name should start with `aim317...`. **Specify your bucket name in the cell below**.

In [None]:
bucket = '' # Provide your bucket name here
prefix = 'comprehend-custom-classifier' # you can leave this as it is

try:
    s3.head_bucket(Bucket=bucket)
except Exception:
    # Covers both an empty or invalid name and a missing or inaccessible bucket
    print("The S3 bucket name {} you entered seems to be incorrect, please try again".format(bucket))

### Uploading the data

In [None]:
s3.upload_file(DSTTRAINFILE, bucket, prefix+'/' + DSTTRAINFILE)

## Building Custom Classification model 

Launch the classifier training:

In [None]:
s3_train_data = 's3://{}/{}/{}'.format(bucket, prefix, DSTTRAINFILE)
s3_output_job = 's3://{}/{}/{}'.format(bucket, prefix, 'output/train_job')
print('training data location: ',s3_train_data, "output location:", s3_output_job)

In [None]:
uid = uuid.uuid4()

training_job = comprehend.create_document_classifier(
    DocumentClassifierName='aim317-cc-' + str(uid),
    DataAccessRoleArn=role,
    InputDataConfig={
        'S3Uri': s3_train_data
    },
    OutputDataConfig={
        'S3Uri': s3_output_job
    },
    LanguageCode='en',
    VersionName='v001'
)

### Check training status in Amazon Comprehend console

[Go to Amazon Comprehend Console](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#classification)

This will take approximately 30 minutes. Go to the **Classifier Metrics** step below after the classifier has been created and is ready for use. Running the cells before the classifier is ready will throw an error; simply re-execute the cell once the classifier is ready.
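
As an alternative to watching the console, the training status can be polled from the notebook. The sketch below is one possible approach, not part of the original workshop: `wait_for_classifier` is a hypothetical helper that takes the describe call as a parameter, so it can be exercised without AWS access, and its polling interval and limit are arbitrary defaults.

```python
import time

# Statuses after which the classifier will not change state on its own
TERMINAL_STATUSES = {"TRAINED", "TRAINED_WITH_WARNING", "IN_ERROR", "STOPPED"}


def wait_for_classifier(describe, classifier_arn, poll_seconds=60,
                        max_polls=60, sleep=time.sleep):
    """Poll the describe call until training reaches a terminal status.

    Returns the last status seen, which may be non-terminal if max_polls
    is exhausted first.
    """
    status = None
    for _ in range(max_polls):
        props = describe(DocumentClassifierArn=classifier_arn)["DocumentClassifierProperties"]
        status = props["Status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    return status
```

In this notebook the call would be `wait_for_classifier(comprehend.describe_document_classifier, training_job['DocumentClassifierArn'])`, with the metrics cell run only if the result is `TRAINED` or `TRAINED_WITH_WARNING`.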

### Classifier Metrics

In [None]:
response = comprehend.describe_document_classifier(
    DocumentClassifierArn=training_job['DocumentClassifierArn']
)
print(response['DocumentClassifierProperties']['ClassifierMetadata']['EvaluationMetrics'])

## Real time inference 
We will now use a custom classifier real-time endpoint to detect whether the audio transcripts and translated text contain a clear CTA.

### Create endpoint

In [None]:
model_arn = response["DocumentClassifierProperties"]["DocumentClassifierArn"]
print('Model used for real time endpoint ' + model_arn)

In [None]:
# Let's create an endpoint with 4 Inference Units to account for us sending approximately 400 characters per second to the endpoint

create_endpoint_response = comprehend.create_endpoint(
    EndpointName='aim317-cc-ep',
    ModelArn=model_arn,
    DesiredInferenceUnits=4
)

print(create_endpoint_response['EndpointArn'])
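
The choice of 4 inference units follows from endpoint throughput: each inference unit provides 100 characters per second, so roughly 400 characters per second needs 4 units. A small sizing sketch, with the hypothetical helper `inference_units_needed`, makes that arithmetic explicit.

```python
import math

# Throughput of a single Amazon Comprehend inference unit
CHARS_PER_SECOND_PER_UNIT = 100


def inference_units_needed(chars_per_second):
    """Round the required throughput up to whole inference units (minimum 1)."""
    return max(1, math.ceil(chars_per_second / CHARS_PER_SECOND_PER_UNIT))


print(inference_units_needed(400))
```

Rounding up matters because requests beyond the provisioned throughput are throttled, so 401 characters per second would already call for a fifth unit.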

### Check endpoint status in Amazon Comprehend console

[Go to Amazon Comprehend Console](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#endpoints)

This will take approximately 10 minutes. Go to the **Run Inference** step below after the endpoint has been created and is ready for use. Running the cells before the endpoint is ready will lock the cell, and execution will resume only after the endpoint is ready.
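
The endpoint status can also be read from the notebook through `describe_endpoint`, whose response carries the status under `EndpointProperties`. The sketch below shows one way to turn that status into a decision; the function `endpoint_action` and its return values are illustrative names, not part of the workshop code.

```python
def endpoint_action(status):
    """Map an endpoint status string to what the notebook user should do next."""
    if status == "IN_SERVICE":
        return "run"   # safe to call classify_document
    if status in ("CREATING", "UPDATING"):
        return "wait"  # check again in a minute or so
    return "stop"      # FAILED or DELETING: inspect the endpoint in the console


for status in ("CREATING", "IN_SERVICE", "FAILED"):
    print(status, "->", endpoint_action(status))
```

Here the status would come from `comprehend.describe_endpoint(EndpointArn=create_endpoint_response['EndpointArn'])['EndpointProperties']['Status']`.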

### Run Inference

Let's review the list of files ready for inference in the `comprehend/input` folder of our S3 bucket. These files were created by the notebook `1-Transcribe-Translate-Calls`.

In [None]:
# Input files ready for classification
!aws s3 ls s3://{bucket}/comprehend/input/

In [None]:
# Prepare to page through our transcripts in S3

# Define the S3 handles
s3 = boto3.client('s3')
s3_resource = boto3.resource('s3')

# We will be merging the classifier predictions with the transcript segments we created for QuickSight in 1-Transcribe-Translate
t_prefix = 'quicksight/data/cta'

# The transcripts are read from the same session bucket we used for training and testing the classifier
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket, Prefix='comprehend/input')

# We will define a DataFrame to store the results of the classifier
cols = ['transcript_name', 'cta_status']
df_class = pd.DataFrame(columns=cols)

# Now let's page through the transcripts
for page in pages:
    # 'Contents' is absent when a page holds no objects
    for obj in page.get('Contents', []):
        # get the transcript file name
        transcript_file_name = obj['Key'].split('/')[2]
        # now let's get the transcript file contents
        temp = s3_resource.Object(bucket, obj['Key'])
        transcript_content = temp.get()['Body'].read().decode('utf-8')
        # Send the 400-character slice at positions 1500-1900, intended to capture the last few sentences
        transcript_truncated = transcript_content[1500:1900]
        # Call Comprehend to classify input text
        response = comprehend.classify_document(Text=transcript_truncated, EndpointArn=create_endpoint_response['EndpointArn'])
        # Use the name of the class with the highest confidence score as our predicted label
        cta = max(response['Classes'], key=lambda c: c['Score'])['Name']

        # Drop the 'en-' prefix and '.txt' suffix; str.strip() would remove any of
        # those characters from both ends rather than the exact prefix and suffix
        transcript_name = transcript_file_name
        if transcript_name.startswith('en-'):
            transcript_name = transcript_name[len('en-'):]
        if transcript_name.endswith('.txt'):
            transcript_name = transcript_name[:-len('.txt')]

        # Update the results DataFrame with the cta predicted label
        df_class.loc[len(df_class.index)] = [transcript_name, cta]

# Create a CSV file with the cta labels from this DataFrame
df_class.to_csv('s3://' + bucket + '/' + t_prefix + '/' + 'cta_status.csv', index=False)
df_class
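
Two small steps inside the loop, picking the highest-scoring class and deriving the transcript name from the file name, can be tried in isolation without an endpoint. The sketch below uses a made-up `Classes` list in the shape returned by `classify_document`; the helper names `top_class` and `strip_affixes` are hypothetical.

```python
def top_class(classes):
    """Return the Name of the entry with the highest Score."""
    return max(classes, key=lambda c: c["Score"])["Name"]


def strip_affixes(file_name, prefix="en-", suffix=".txt"):
    """Remove an exact leading prefix and trailing suffix, if present."""
    if file_name.startswith(prefix):
        file_name = file_name[len(prefix):]
    if file_name.endswith(suffix):
        file_name = file_name[:-len(suffix)]
    return file_name


# Made-up response fragment, for illustration only
sample_classes = [
    {"Name": "CTA", "Score": 0.83},
    {"Name": "No CTA", "Score": 0.17},
]

print(top_class(sample_classes))
print(strip_affixes("en-next-text.txt"))
```

Exact prefix and suffix removal matters here because `str.strip('en-')` treats its argument as a set of characters, so a name such as `en-next-text.txt` would lose more than the intended `en-` and `.txt`.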

### End of notebook
Please go back to the workshop instructions to continue to the next step