# Text Moderation - PII Detection/Redaction and Sentiment Analysis

PII (Personally Identifiable Information) detection and sentiment analysis are popular text moderation use cases. With Amazon Comprehend, you can analyze customer interactions, including social media posts, reviews, and customer interaction transcripts, to improve your products and services. Sentiment analysis determines whether the sentiment of a text is positive, negative, neutral, or mixed. For example, you can run sentiment analysis on the comments of a blog post to find out whether your readers liked it.

- [Setup notebook](#step1)
- [Detect and redact PII entities in real-time](#step2)
- [Detect PII entities as a batch job with redaction](#step3)
- [Analyze text sentiment in real-time](#step4)

# Set up Notebook 
Run the cell below to install or update the Python dependencies if you run the lab in a local IDE. This step is optional in a SageMaker Studio Jupyter Notebook, whose kernel already includes the dependencies.

In [None]:
# First, let's get the latest installations of our dependencies
%pip install -qU pip
%pip install boto3 -qU

In [None]:
import boto3
import sagemaker as sm
import json
from IPython.display import clear_output

# variables
data_bucket = sm.Session().default_bucket()
region = boto3.session.Session().region_name
role = sm.get_execution_role()

s3=boto3.client('s3')
comprehend=boto3.client('comprehend', region_name=region)

# Detect/redact PII entities in real-time

You can use the Amazon Comprehend console or APIs to detect personally identifiable information (PII) in English text documents. PII is a textual reference to personal data that could be used to identify an individual. PII examples include addresses, bank account numbers, and phone numbers.

In the cell below, we call the Amazon Comprehend [DetectPiiEntities](https://docs.aws.amazon.com/comprehend/latest/dg/realtime-pii-api.html) API, which analyzes a single document in real time.

In [None]:
text = "Here are my contact details: Call me 751-01-1111 or email me at someone@domain.com"

print('Calling DetectPii')
response = comprehend.detect_pii_entities(Text=text, LanguageCode='en')
display(response)

The PII API response has an "Entities" node, a list of the PII entities detected by the Comprehend API. Each detected entity contains:
- Score: the confidence score of this entity.
- Type: the PII type. For a full list of the PII entity types supported by Comprehend, refer to this [doc](https://docs.aws.amazon.com/comprehend/latest/dg/how-pii.html).
- BeginOffset: A character offset in the input text that shows where the PII entity begins.
- EndOffset: A character offset in the input text that shows where the PII entity ends. 
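
As a minimal sketch of how these fields fit together, the snippet below pulls the matched substrings out of the input text. The `sample_response` dictionary is hand-written to mirror the shape described above (its scores and types are made up, not real API output), and `extract_pii_spans` and `min_score` are hypothetical names. It assumes, as the redaction helper in this notebook does, that `EndOffset` is exclusive.

```python
# Hypothetical response, hand-written to mirror the shape of a
# detect_pii_entities result. Scores and types are illustrative only.
sample_text = "Call me 751-01-1111 or email me at someone@domain.com"
sample_response = {
    "Entities": [
        {"Score": 0.99, "Type": "SSN", "BeginOffset": 8, "EndOffset": 19},
        {"Score": 0.99, "Type": "EMAIL", "BeginOffset": 35, "EndOffset": 53},
    ]
}


def extract_pii_spans(text, entities, min_score=0.5):
    """Return (type, matched substring) pairs for entities above min_score."""
    spans = []
    for entity in entities:
        if entity["Score"] >= min_score:
            # Slice the text with the offsets reported for this entity.
            matched = text[entity["BeginOffset"]:entity["EndOffset"]]
            spans.append((entity["Type"], matched))
    return spans


spans = extract_pii_spans(sample_text, sample_response["Entities"])
for pii_type, matched in spans:
    print(f"{pii_type}: {matched}")
```

Filtering on `Score` lets you ignore low-confidence detections before you act on them; the right threshold depends on your use case.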

The real-time PII detection endpoint doesn't support masking out of the box, but you can add it with a few lines of logic that use the offsets in the response.

In [None]:
def pii_redaction(text, entities, mask_character='*'):
    # Replace each detected entity with a same-length run of mask characters,
    # so the offsets of the remaining entities stay valid.
    for entity in entities:
        mask = mask_character * (entity["EndOffset"] - entity["BeginOffset"])
        text = f'{text[:entity["BeginOffset"]]}{mask}{text[entity["EndOffset"]:]}'
    return text

In [None]:
result = pii_redaction(text, response["Entities"])
result
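
Same-length masking keeps every offset valid, but it hides what kind of data was removed. The sketch below is one possible variation that replaces each entity with its type label instead. Because the labels change the text length, it processes entities from the end of the text backwards. The function name `redact_with_type` and the sample entities are hypothetical and hand-written for illustration.

```python
def redact_with_type(text, entities):
    """Replace each PII entity with a label such as [EMAIL]."""
    # Work from the last entity to the first so that the offsets of the
    # entities still to be processed are not shifted by earlier replacements.
    ordered = sorted(entities, key=lambda e: e["BeginOffset"], reverse=True)
    for entity in ordered:
        label = f'[{entity["Type"]}]'
        text = text[:entity["BeginOffset"]] + label + text[entity["EndOffset"]:]
    return text


# Hand-written sample entities (not real API output).
demo_text = "Call me 751-01-1111 or email me at someone@domain.com"
demo_entities = [
    {"Score": 0.99, "Type": "SSN", "BeginOffset": 8, "EndOffset": 19},
    {"Score": 0.99, "Type": "EMAIL", "BeginOffset": 35, "EndOffset": 53},
]
labeled = redact_with_type(demo_text, demo_entities)
print(labeled)
```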

# Detect PII entities as a batch job with redaction 
Next, let's start an asynchronous [PII entity detection job](https://docs.aws.amazon.com/comprehend/latest/dg/get-started-api-pii.html) to process a collection of documents.

We first generate a CSV file with test messages that contain PII.

In [None]:
msgs = [
    "Hello, My name is John Doe and my address is 1 main st, Pleasanton,CA,94000",
    "Hello, I m John Doe and my phone number is 785-000-0000",
    "My credit card number is 5370-0000-1234-0000,111",
    "For verification purposes, my SSN is 172-00-0000",
    "Please send me the receipt on my email address someone@domain.com",
    "I do not wish to subscribe to your marketing messages, please unsubscribe me",
    "Good morning, everybody. My name is Jan Doe, and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address jandoe@domain.com. My address is 2 park blvd, Los Angeles,CA,92000. My phone number is 818-111-1111. My Social security number is 548-00-0001. My Bank account number is 94000000001 and routing number 195111111. My credit card number is 5534816011111110,my pin 123456. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this document. Let's check.",
]
# Write one quoted message per line
with open('pii-test-data.csv','w') as f:
    for m in msgs:
        f.write(f'"{m}"\n')

Upload the CSV file to S3 so that the Comprehend batch job can access it.

In [None]:
s3_key = 'content-moderation-im/text-moderation/pii-test-data.csv'
s3.upload_file('pii-test-data.csv', data_bucket, s3_key)

In [None]:
import uuid

InputS3URI= f"s3://{data_bucket}/{s3_key}"
OutputS3URI=f"s3://{data_bucket}/content-moderation-im/text-moderation/pii-detection-redaction"
job_name = f"pii-job-{uuid.uuid1()}"

One of the parameters we pass to the Comprehend start_pii_entities_detection_job API is RedactionConfig. In the example below, we ask the API to detect the PII entities and then redact them by replacing their characters with the mask character *.

In [None]:
response = comprehend.start_pii_entities_detection_job(
    InputDataConfig={
        'S3Uri': InputS3URI,
        'InputFormat': 'ONE_DOC_PER_FILE'
    },
    OutputDataConfig={
        'S3Uri': OutputS3URI
    },
    Mode='ONLY_REDACTION',
    RedactionConfig={
        'PiiEntityTypes': [
            'ALL',
        ],
        'MaskMode': 'MASK',
        'MaskCharacter': '*'
    },
    DataAccessRoleArn=role,
    JobName=job_name,
    LanguageCode='en',
)
events_job_id = response['JobId']
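
RedactionConfig can also be narrowed to specific entity types, or set to replace each entity with its type name rather than mask characters. The helper below is a small sketch of how such a configuration might be assembled. `build_redaction_config` and its parameters are hypothetical names introduced here; the dictionary keys and the `MASK` and `REPLACE_WITH_PII_ENTITY_TYPE` values are the ones the Comprehend API accepts. The snippet only builds the dictionary and does not call AWS.

```python
def build_redaction_config(entity_types=("ALL",), replace_with_type=False,
                           mask_character="*"):
    """Build a RedactionConfig dict for start_pii_entities_detection_job."""
    config = {"PiiEntityTypes": list(entity_types)}
    if replace_with_type:
        # Each entity is replaced by its type name; no mask character applies.
        config["MaskMode"] = "REPLACE_WITH_PII_ENTITY_TYPE"
    else:
        # Each character of an entity is replaced by the mask character.
        config["MaskMode"] = "MASK"
        config["MaskCharacter"] = mask_character
    return config


# Mask everything with '*', as in the job started above.
mask_all = build_redaction_config()

# Only redact SSNs and email addresses, replacing them with their type name.
label_some = build_redaction_config(entity_types=("SSN", "EMAIL"),
                                    replace_with_type=True)
print(mask_all)
print(label_some)
```

Either dictionary could be passed as the `RedactionConfig` argument of the job request.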

The job takes roughly 10 minutes. The code below polls the job status; the cell finishes executing once the job has completed or failed.

In [None]:
%%time
# Poll until the PII detection job finishes. This takes up to about 10 minutes.
import time
from datetime import datetime

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")

    response = comprehend.describe_pii_entities_detection_job(JobId=events_job_id)

    status = response["PiiEntitiesDetectionJobProperties"]["JobStatus"]
    clear_output(wait=True)
    print(f"{current_time} : PII detection batch job: {status}")

    if status == "COMPLETED" or status == "FAILED":
        break

    time.sleep(10)
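
The polling loop can also be wrapped in a reusable function. The sketch below is one way to do that, not part of the Comprehend SDK: `wait_for_job`, `TERMINAL_STATES`, and the injectable `describe` and `sleep` arguments are names introduced here. It additionally treats `STOPPED` as a terminal state and raises an error on timeout. The demonstration uses a fake describe function, so it runs without AWS access.

```python
import time

# Job states after which polling should stop.
TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_job(describe, job_id, poll_seconds=10,
                 timeout_seconds=3 * 60 * 60, sleep=time.sleep):
    """Poll describe(JobId=...) until the job reaches a terminal state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = describe(JobId=job_id)
        status = response["PiiEntitiesDetectionJobProperties"]["JobStatus"]
        if status in TERMINAL_STATES:
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} still {status} after timeout")
        sleep(poll_seconds)


# Fake describe function that reports three successive states.
fake_statuses = iter(["SUBMITTED", "IN_PROGRESS", "COMPLETED"])


def fake_describe(JobId):
    return {"PiiEntitiesDetectionJobProperties": {"JobStatus": next(fake_statuses)}}


final = wait_for_job(fake_describe, "job-123", sleep=lambda seconds: None)
print(final["PiiEntitiesDetectionJobProperties"]["JobStatus"])
```

In the notebook, the equivalent call would pass `comprehend.describe_pii_entities_detection_job` and `events_job_id` as the first two arguments.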

In [None]:
display(response)

Once the batch job is complete, Comprehend stores the PII detection/redaction result in the S3 bucket.

In [None]:
output_data_s3_file = response['PiiEntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri'] + 'pii-test-data.csv.out'
output_data_s3_file

Now let's download the redacted output for our sample dataset and check the result.

In [None]:
from io import BytesIO

# The output filename is the input filename + ".out"

filename="pii-test-data.csv"
output_data_s3_file = response['PiiEntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri'] + filename + '.out'
output_data_s3_filepath=output_data_s3_file.replace(f's3://{data_bucket}/','')
print(output_data_s3_filepath)

# Download the redacted file into memory and print it line by line
f = BytesIO()
s3.download_fileobj(data_bucket, output_data_s3_filepath, f)
f.seek(0)
# splitlines() handles both '\n' and '\r\n' line endings
arr = f.read().decode('UTF-8').splitlines()
for a in arr:
    print(a)
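
The cell above derives the object key by stripping the bucket prefix from the S3 URI with a string replace. As an alternative sketch, the URI can be split into bucket and key with the standard library's `urlparse`. The helper name `split_s3_uri` and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    # The path starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI used only to demonstrate the helper.
bucket, key = split_s3_uri("s3://example-bucket/some/prefix/pii-test-data.csv.out")
print(bucket, key)
```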

# Analyze text sentiment in real-time 
Amazon Comprehend can also perform sentiment analysis on text such as social media posts, reviews, and customer interaction transcripts, which can help you improve your products and services.

This lab covers only the real-time sentiment analysis use case. For more information about Comprehend sentiment analysis, refer to this [doc](https://docs.aws.amazon.com/comprehend/latest/dg/how-sentiment.html).

In [None]:
text = "It is raining today in Seattle"
response = comprehend.detect_sentiment(Text=text, LanguageCode='en')
print(json.dumps(response, sort_keys=True, indent=4))

Sentiment determination returns one of the following values, along with a confidence score for each:
- Positive – The text expresses an overall positive sentiment.
- Negative – The text expresses an overall negative sentiment.
- Mixed – The text expresses both positive and negative sentiments.
- Neutral – The text does not express either positive or negative sentiments.

The above response shows that the sentiment of the input text is 'NEUTRAL' with a confidence score associated with each value.
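
To use the result programmatically, you typically read the overall label from `Sentiment` and the matching confidence from `SentimentScore`. The sketch below shows one way to do that. `sample_sentiment_response` is hand-written with made-up scores to mirror the response shape, and `summarize_sentiment` is a hypothetical helper name.

```python
# Hand-written sample mirroring a detect_sentiment response.
# The scores are made up for illustration.
sample_sentiment_response = {
    "Sentiment": "NEUTRAL",
    "SentimentScore": {
        "Positive": 0.02,
        "Negative": 0.01,
        "Neutral": 0.96,
        "Mixed": 0.01,
    },
}


def summarize_sentiment(response):
    """Return the overall sentiment label and its confidence score."""
    label = response["Sentiment"]
    # The label is upper-case ("NEUTRAL") while the score keys are
    # capitalized ("Neutral"), so convert before looking up the score.
    confidence = response["SentimentScore"][label.capitalize()]
    return label, confidence


label, confidence = summarize_sentiment(sample_sentiment_response)
print(f"{label} ({confidence:.2f})")
```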

# Conclusion

In this lab, we used the Comprehend PII API to detect PII entities and apply redaction logic. We also used the Comprehend Sentiment Analysis API to get the sentiment of an input text.