# Leveraging pre-trained APIs in Amazon Comprehend

## Table of contents
- [Introduction](#introduction)
- [Setup](#setup)
- [Identifying Named Entities](#identifying-named-entities)
- [Detecting Key Phrases](#detecting-key-phrases)
- [Identifying the Dominant Language](#identifying-the-dominant-language)
- [Determining Emotional Sentiment](#determining-emotional-sentiment)
- [Determining Syntax](#determining-syntax)
- [Detecting Personally Identifiable Information (PII)](#detecting-personally-identifiable-information-pii)
- [Conclusion](#conclusion)
- [Clean Up](#clean-up)

## Introduction

This notebook provides step-by-step instructions for using [Amazon Comprehend](https://aws.amazon.com/comprehend/)'s pre-trained APIs to uncover information in unstructured text. Amazon Comprehend uses pre-trained models to analyze a document or set of documents and gather insights about them. These models are continuously trained on a large body of text, so you do not need to provide training data.

We will explore six pre-trained APIs: identifying named entities, detecting key phrases, identifying the dominant language, determining emotional sentiment, determining syntax, and detecting personally identifiable information (PII).

This notebook uses AWS resources, and you may incur costs when running these cells.

## Tips

If you are new to Python Notebooks: `SHIFT` + `ENTER` will execute a code cell and go to the next one.

## Setup
We import the relevant packages to interact with Amazon Comprehend. [Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) is the AWS Python SDK.

In [None]:
import boto3

We specify the SageMaker execution role, which is the role this notebook runs under, and the region the notebook is in.

In [None]:
import sagemaker
from sagemaker import get_execution_role
role = get_execution_role()
region = boto3.Session().region_name

We import other packages we will use.

In [None]:
import json
import pandas as pd
import numpy as np

## Starting the Amazon Comprehend client

In [None]:
comprehend = boto3.client(service_name='comprehend', region_name=region)

## Data

In this lab, we will use the same sample input text used in the [Amazon Comprehend console](https://console.aws.amazon.com/comprehend). We have copied it here for convenience. If you wish to experiment with a different text, simply modify the cell below.

In [None]:
sample_text = '''
Hello Zhang Wei. Your AnyCompany Financial Services, LLC credit card account 1111-0000-1111-0000 has a minimum payment of $24.53 that is due by July 31st. Based on your autopay settings, we will withdraw your payment on the due date from your bank account XXXXXX1111 with the routing number XXXXX0000. 
Your latest statement was mailed to 100 Main Street, Anytown, WA 98121. 
After your payment is received, you will receive a confirmation text message at 206-555-0100. 
If you have questions about your bill, AnyCompany Customer Service is available by phone at 206-555-0199 or email at support@anycompany.com.
'''

## Identifying Named Entities

A named entity is a real-world object (such as a person, place, or organization) that can be denoted with a proper name.

Amazon Comprehend can extract named entities from a document or text. This can be useful, for example, for indexing, document labeling, or search. For more information, see [Detect Entities](https://docs.aws.amazon.com/comprehend/latest/dg/API_DetectEntities.html).

The API used to extract these entities is the [DetectEntities API](https://docs.aws.amazon.com/comprehend/latest/dg/API_DetectEntities).

For each detected entity, Amazon Comprehend returns the type, for instance "PERSON" or "DATE", and a confidence score that indicates how confident the model is in this detection. In your implementation, you can use this confidence score to set threshold values.

In [None]:
print('Calling DetectEntities')
detected_entities = comprehend.detect_entities(Text=sample_text, LanguageCode='en')
print(json.dumps(detected_entities, sort_keys=True, indent=4))
print('End of DetectEntities\n')

For each entity, the response includes the text, type, confidence score, and offsets.

Now let's make it a bit more human-readable:

In [None]:
detected_entities_df = pd.DataFrame([ [entity['Text'], entity['Type'], entity['Score']] for entity in detected_entities['Entities']],
 columns=['Text', 'Type', 'Score'])

print('This was the text analyzed:')
print(sample_text)
print()
display (detected_entities_df)
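
The confidence-score threshold mentioned above can be sketched as a small filter over the response. This is a minimal illustration, not part of the Comprehend API: `filter_entities`, the `0.9` cutoff, and the hand-written `example_entities` dictionary (trimmed to the fields used here, with made-up scores) are all assumptions for the sake of the example.

```python
def filter_entities(response, min_score=0.9):
    """Return the entities whose confidence score is at least min_score."""
    return [e for e in response.get("Entities", []) if e["Score"] >= min_score]


# Hand-written stand-in for a DetectEntities response (illustrative values only).
example_entities = {
    "Entities": [
        {"Text": "Zhang Wei", "Type": "PERSON", "Score": 0.99},
        {"Text": "July 31st", "Type": "DATE", "Score": 0.97},
        {"Text": "autopay", "Type": "OTHER", "Score": 0.42},
    ]
}

confident = filter_entities(example_entities, min_score=0.9)
for entity in confident:
    print(f"{entity['Type']:<8} {entity['Text']} ({entity['Score']:.2f})")
```

In this notebook, you could pass the real `detected_entities` response instead of the stand-in. The right cutoff depends on your use case, so treat `0.9` as a starting point rather than a recommendation.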

## Detecting Key Phrases

Amazon Comprehend can extract key noun phrases that appear in a document. For example, a document about a basketball game might return the names of the teams, the name of the venue, and the final score. This can be used, for example, for indexing or summarization. For more information, see [Detect Key Phrases](https://docs.aws.amazon.com/comprehend/latest/dg/get-started-api-key-phrases.html).

The API used to extract these key phrases is the [DetectKeyPhrases API](https://docs.aws.amazon.com/comprehend/latest/dg/API_DetectKeyPhrases).

Amazon Comprehend returns the key phrases, as well as a confidence score which indicates how confident the model is in this detection. In your implementation you can use this confidence score to set threshold values.

In [None]:
print('Calling DetectKeyPhrases')
detected_key_phrases = comprehend.detect_key_phrases(Text=sample_text, LanguageCode='en')
print(json.dumps(detected_key_phrases, sort_keys=True, indent=4))
print('End of DetectKeyPhrases\n')

For each key phrase, the response includes the text, confidence score, and offsets.

Now let's make it a bit more human-readable:

In [None]:
detected_key_phrases_df = pd.DataFrame([ [entity['Text'], entity['Score']] for entity in detected_key_phrases['KeyPhrases']],
 columns=['Text', 'Score'])

print('This was the text analyzed:')
print(sample_text)
print()
display (detected_key_phrases_df)
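
For indexing or summarization, you often want only a handful of distinct phrases. The sketch below ranks key phrases by score and drops case-insensitive duplicates. The helper name `top_key_phrases`, the `limit` parameter, and the `example_key_phrases` data are hypothetical and only illustrate one possible post-processing step.

```python
def top_key_phrases(response, limit=5):
    """Return up to `limit` distinct key phrases, highest score first."""
    best = {}
    for phrase in response.get("KeyPhrases", []):
        key = phrase["Text"].strip().lower()
        # Keep only the highest-scoring occurrence of each phrase.
        if key not in best or phrase["Score"] > best[key]["Score"]:
            best[key] = phrase
    ranked = sorted(best.values(), key=lambda p: p["Score"], reverse=True)
    return [p["Text"] for p in ranked[:limit]]


# Hand-written stand-in for a DetectKeyPhrases response (illustrative values only).
example_key_phrases = {
    "KeyPhrases": [
        {"Text": "your bill", "Score": 0.95},
        {"Text": "a minimum payment", "Score": 0.99},
        {"Text": "Your bill", "Score": 0.90},
        {"Text": "the due date", "Score": 0.97},
    ]
}

print(top_key_phrases(example_key_phrases, limit=2))
```

With the notebook's own data, the call would be `top_key_phrases(detected_key_phrases)`.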

## Identifying the Dominant Language

Amazon Comprehend identifies the dominant language in a document. Amazon Comprehend can currently identify many languages. This can be useful as a first step before further processing, for example when phone call transcripts can be in different languages. For more information, including which languages can be identified, see [Detect the Dominant Language](https://docs.aws.amazon.com/comprehend/latest/dg/how-languages.html).

The API used to identify the dominant language is the [DetectDominantLanguage API](https://docs.aws.amazon.com/comprehend/latest/dg/API_DetectDominantLanguage).

Amazon Comprehend returns the dominant language, as well as a confidence score which indicates how confident the model is in this detection. In your implementation you can use this confidence score to set threshold values. If more than one language is detected, it will return each detected language and its corresponding confidence score.

In [None]:
print('Calling DetectDominantLanguage')
detected_language = comprehend.detect_dominant_language(Text=sample_text)
print(json.dumps(detected_language, sort_keys=True, indent=4))
print('End of DetectDominantLanguage\n')

The response includes each detected language code and its confidence score.

Now let's make it a bit more human-readable:

In [None]:
detected_language_df = pd.DataFrame([ [code['LanguageCode'], code['Score']] for code in detected_language['Languages']],
 columns=['Language Code', 'Score'])

print('This was the text analyzed:')
print(sample_text)
print()
display (detected_language_df)
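
When language detection is used as a first step before further processing, a common pattern is to pick the highest-scoring language and fall back to a default when the model is not confident. The following is a minimal sketch under that assumption; `pick_language`, its `min_score` and `default` parameters, and the `example_languages` data are hypothetical.

```python
def pick_language(response, min_score=0.8, default=None):
    """Return the highest-scoring language code, or `default` if none is confident enough."""
    languages = response.get("Languages", [])
    if not languages:
        return default
    best = max(languages, key=lambda lang: lang["Score"])
    return best["LanguageCode"] if best["Score"] >= min_score else default


# Hand-written stand-in for a DetectDominantLanguage response (illustrative values only).
example_languages = {
    "Languages": [
        {"LanguageCode": "en", "Score": 0.98},
        {"LanguageCode": "fr", "Score": 0.01},
    ]
}

language_code = pick_language(example_languages)
print(language_code)
```

The resulting code could then be passed as `LanguageCode` to the other calls in this notebook instead of the hard-coded `'en'`, provided the detected language is supported by the API you call next.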

## Determining Emotional Sentiment

Amazon Comprehend determines the emotional sentiment of a document. Sentiment can be positive, neutral, negative, or mixed. This can be useful, for example, to analyze the content of reviews or transcripts from call centres. For more information, see [Detecting Sentiment](https://docs.aws.amazon.com/comprehend/latest/dg/get-started-api-sentiment.html).

The API used to extract the emotional sentiment is the [DetectSentiment API](https://docs.aws.amazon.com/comprehend/latest/dg/API_DetectSentiment).

Amazon Comprehend returns the different sentiments and the related confidence score for each of them, which indicates how confident the model is in this detection. The sentiment with the highest confidence score can be seen as the predominant sentiment in the text.

In [None]:
print('Calling DetectSentiment')
detected_sentiment = comprehend.detect_sentiment(Text=sample_text, LanguageCode='en')
print(json.dumps(detected_sentiment, sort_keys=True, indent=4))
print('End of DetectSentiment\n')

The response includes the predominant sentiment and a confidence score for each of the four sentiments.

Now let's make it a bit more human-readable:

In [None]:
predominant_sentiment = detected_sentiment['Sentiment']
detected_sentiments_df = pd.DataFrame([ [sentiment, detected_sentiment['SentimentScore'][sentiment]] for sentiment in detected_sentiment['SentimentScore']],
 columns=['Sentiment', 'Score'])

print('This was the text analyzed:')
print(sample_text)
print()
print('The predominant sentiment is {}.'.format(predominant_sentiment))
print()
display (detected_sentiments_df)
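
To see how the predominant sentiment relates to the individual scores, the sketch below ranks the entries of `SentimentScore` and compares the top one with the `Sentiment` field. The helper `ranked_sentiments` and the `example_sentiment` values are hypothetical. Note that `Sentiment` is upper case (for example `NEUTRAL`) while the `SentimentScore` keys are capitalized (`Neutral`), so the comparison ignores case.

```python
def ranked_sentiments(response):
    """Return (label, score) pairs from SentimentScore, highest score first."""
    scores = response["SentimentScore"]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hand-written stand-in for a DetectSentiment response (illustrative values only).
example_sentiment = {
    "Sentiment": "NEUTRAL",
    "SentimentScore": {
        "Positive": 0.05,
        "Negative": 0.02,
        "Neutral": 0.92,
        "Mixed": 0.01,
    },
}

ranking = ranked_sentiments(example_sentiment)
top_label, top_score = ranking[0]
agrees = top_label.upper() == example_sentiment["Sentiment"]
print(f"Top score: {top_label} ({top_score:.2f}); matches 'Sentiment' field: {agrees}")
```

A small gap between the first and second entries of `ranking` can be a hint that the text is ambiguous and may deserve manual review.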

## Determining Syntax

Amazon Comprehend parses each word in your document and determines its syntactic function, that is, its part of speech. For example, in the sentence "It is raining today in Seattle," "it" is identified as a pronoun, "raining" is identified as a verb, and "Seattle" is identified as a proper noun. For more information, see [Analyze Syntax](https://docs.aws.amazon.com/comprehend/latest/dg/how-syntax.html).

The API used to extract the syntax information is the [DetectSyntax API](https://docs.aws.amazon.com/comprehend/latest/dg/API_DetectSyntax).

For each token, Amazon Comprehend returns the part of speech, a confidence score that indicates how confident the model is in that tag, the token ID, and offsets.

In [None]:
print('Calling DetectSyntax')
detected_syntax = comprehend.detect_syntax(Text=sample_text, LanguageCode='en')
print(json.dumps(detected_syntax, sort_keys=True, indent=4))
print('End of DetectSyntax\n')

For each token, the response includes the text, the part-of-speech tag and its confidence score, the token ID, and offsets.

Now let's make it a bit more human-readable:

In [None]:
detected_syntax_df = pd.DataFrame([ [part['Text'], part['PartOfSpeech']['Tag'], part['PartOfSpeech']['Score']] for part in detected_syntax['SyntaxTokens']],
 columns=['Text', 'Part Of Speech', 'Score'])

print('This was the text analyzed:')
print(sample_text)
print()
print('First twenty tokens:')
display (detected_syntax_df.head(20))
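
A quick way to summarize a syntax response is to count how often each part-of-speech tag occurs. The sketch below does this with `collections.Counter`. The helper `count_tags`, its optional `min_score` filter, and the `example_syntax` tokens and scores are hypothetical and trimmed to the fields used here.

```python
from collections import Counter


def count_tags(response, min_score=0.0):
    """Count part-of-speech tags, ignoring tokens tagged with a score below min_score."""
    parts = (token["PartOfSpeech"] for token in response.get("SyntaxTokens", []))
    return Counter(part["Tag"] for part in parts if part["Score"] >= min_score)


# Hand-written stand-in for a DetectSyntax response (illustrative values only).
example_syntax = {
    "SyntaxTokens": [
        {"TokenId": 1, "Text": "It", "PartOfSpeech": {"Tag": "PRON", "Score": 0.99}},
        {"TokenId": 2, "Text": "raining", "PartOfSpeech": {"Tag": "VERB", "Score": 0.99}},
        {"TokenId": 3, "Text": "Seattle", "PartOfSpeech": {"Tag": "PROPN", "Score": 0.98}},
        {"TokenId": 4, "Text": "Anytown", "PartOfSpeech": {"Tag": "PROPN", "Score": 0.40}},
    ]
}

tag_counts = count_tags(example_syntax)
for tag, count in tag_counts.most_common():
    print(tag, count)
```

With the notebook's own data, `count_tags(detected_syntax)` gives the same kind of summary for the sample text.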

## Detecting Personally Identifiable Information (PII)

Amazon Comprehend analyzes documents to detect personal data that could be used to identify an individual, such as an address, bank account number, or phone number. This can be useful, for example, for information extraction and indexing, and to comply with legal requirements around data protection. For more information, see [Detect Personally Identifiable Information (PII)](https://docs.aws.amazon.com/comprehend/latest/dg/how-pii.html).

Amazon Comprehend can help you identify the location of individual PII in your document or help you label documents that contain PII. 

### Identify the location of PII in your text documents

Amazon Comprehend can help you identify the location of individual PII in your document. In the Amazon Comprehend console, this corresponds to selecting "Offsets" as the Personally identifiable information (PII) analysis mode.

The API used to identify the location of individual PII is the [DetectPiiEntities API](https://docs.aws.amazon.com/comprehend/latest/dg/API_DetectPiiEntities.html).

Amazon Comprehend returns the different PII and the related confidence score for each of them, which indicates how confident the model is in this detection. 

In [None]:
print('Calling DetectPiiEntities')
detected_pii_entities = comprehend.detect_pii_entities(Text=sample_text, LanguageCode='en')
print(json.dumps(detected_pii_entities, sort_keys=True, indent=4))
print('End of DetectPiiEntities\n')

For each PII entity, the response includes its type, a confidence score, and offsets. The matched text itself is not returned; use the offsets to locate it in the input.

Now let's make it a bit more human-readable:

In [None]:
detected_pii_entities_df = pd.DataFrame([ [entity['Type'], entity['Score']] for entity in detected_pii_entities['Entities']],
 columns=['Type', 'Score'])

print('This was the text analyzed:')
print(sample_text)
print()
display (detected_pii_entities_df)
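
Because the response carries offsets rather than the matched text, one common use is to redact the input. The sketch below replaces each detected span with its entity type in square brackets. It is a minimal illustration: `redact_pii`, the `min_score` cutoff, and the `example_text` / `example_pii` data are hypothetical, and it assumes the offsets index characters of the Python string and that spans do not overlap.

```python
def redact_pii(text, response, min_score=0.5):
    """Replace each detected PII span with its entity type in square brackets."""
    entities = [e for e in response.get("Entities", []) if e["Score"] >= min_score]
    # Work from the end of the text so earlier offsets stay valid after each replacement.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        start, end = entity["BeginOffset"], entity["EndOffset"]
        text = text[:start] + f"[{entity['Type']}]" + text[end:]
    return text


# Hand-written stand-in for a DetectPiiEntities response (illustrative values only).
example_text = "Hello Zhang Wei, call 206-555-0100."
example_pii = {
    "Entities": [
        {"Type": "NAME", "Score": 0.99, "BeginOffset": 6, "EndOffset": 15},
        {"Type": "PHONE", "Score": 0.99, "BeginOffset": 22, "EndOffset": 34},
    ]
}

redacted = redact_pii(example_text, example_pii)
print(redacted)
```

In this notebook, the equivalent call would be `redact_pii(sample_text, detected_pii_entities)`.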

### Label text documents with PII

Amazon Comprehend can help you label documents that contain PII. In the Amazon Comprehend console, this corresponds to selecting "Labels" as the Personally identifiable information (PII) analysis mode.

The API used to label a document with the types of PII it contains is the [ContainsPiiEntities API](https://docs.aws.amazon.com/comprehend/latest/dg/API_ContainsPiiEntities.html).

Amazon Comprehend returns the different PII labels and the related confidence score for each of them, which indicates how confident the model is in this detection. These labels indicate the presence of these types of PII in the document. 

In [None]:
print('Calling ContainsPiiEntities')
detected_pii_labels = comprehend.contains_pii_entities(Text=sample_text, LanguageCode='en')
print(json.dumps(detected_pii_labels, sort_keys=True, indent=4))
print('End of ContainsPiiEntities\n')

For each label, the response includes the PII entity type name and a confidence score.

Now let's make it a bit more human-readable:

In [None]:
detected_pii_labels_df = pd.DataFrame([ [entity['Name'], entity['Score']] for entity in detected_pii_labels['Labels']],
 columns=['Name', 'Score'])

print('This was the text analyzed:')
print(sample_text)
print()
display (detected_pii_labels_df)
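
Labels lend themselves to a simple routing decision, for example flagging a document for review when any PII type is detected with enough confidence. The sketch below shows one way to do that. The helper `pii_labels_present`, the `0.5` cutoff, and the `example_labels` data are hypothetical.

```python
def pii_labels_present(response, min_score=0.5):
    """Return the sorted names of PII labels whose score is at least min_score."""
    return sorted(
        label["Name"]
        for label in response.get("Labels", [])
        if label["Score"] >= min_score
    )


# Hand-written stand-in for a ContainsPiiEntities response (illustrative values only).
example_labels = {
    "Labels": [
        {"Name": "EMAIL", "Score": 0.99},
        {"Name": "ADDRESS", "Score": 0.97},
        {"Name": "SSN", "Score": 0.12},
    ]
}

labels = pii_labels_present(example_labels)
needs_review = bool(labels)
print(f"PII labels: {labels}; needs review: {needs_review}")
```

With the notebook's own data, the call would be `pii_labels_present(detected_pii_labels)`.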

## Conclusion

You have now learned how to call Amazon Comprehend's pre-trained APIs with the Python SDK.

For examples of how to use these APIs from the AWS Management Console, follow the steps in "Using the AWS Management Console" in the workshop website.

## Clean Up

Once you have finished using this notebook, make sure to stop and delete this Amazon SageMaker Notebook instance in the [Amazon SageMaker Console](https://console.aws.amazon.com/sagemaker/) to avoid incurring additional costs.