# Document Enrichment

<div class="alert alert-block alert-warning">
    <b>⚠️ PRE-REQUISITE:</b> In order to execute this notebook, make sure you have completed the first notebook 01-idp-document-classification.ipynb
</div>

So far, in the previous two notebooks, we have categorized documents and identified the bank statements and receipt documents. We have also extracted the text from these documents and viewed the entities extracted by Amazon Comprehend. However, we want specific entities to be recognized for our use case so that we can perform certain enrichments on our documents.

In this notebook we will train an Amazon Comprehend custom entity recognizer and deploy it to a real-time endpoint. We will then use it to identify the specific entities and generate custom metadata about our document in CSV format, to be analyzed later by the business use case. We will also redact the customer's checking and savings account numbers in bank statements, since they are private customer information.

![IDP Entity](./images/idp-entity.png)



In this notebook we will - 

- [Step 1: Setup notebook](#step1)
- [Step 2: Perform entity recognition with Amazon Comprehend](#step2)
- [Step 3: Train a custom Amazon Comprehend entity recognizer](#step3)
- [Step 4: Create custom entity recognizer real-time endpoint](#step4)

---

# Step 1: Setup notebook <a id="step1"></a>

In this step, we will import some necessary libraries that will be used throughout this notebook. 

In [None]:
import boto3
import botocore
import sagemaker
import time
import os
import json
import datetime
import io
import uuid
import pandas as pd
import numpy as np
from pytz import timezone
from PIL import Image, ImageDraw, ImageFont
import multiprocessing as mp
from pathlib import Path
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string
from trp import Document

# Document
from IPython.display import Image, display, HTML, JSON
from PIL import Image as PImage, ImageDraw


# variables
data_bucket = sagemaker.Session().default_bucket()
region = boto3.session.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')

os.environ["BUCKET"] = data_bucket
os.environ["REGION"] = region
role = sagemaker.get_execution_role()

print(f"SageMaker role is: {role}\nDefault SageMaker Bucket: s3://{data_bucket}")

s3=boto3.client('s3')
textract = boto3.client('textract', region_name=region)
comprehend=boto3.client('comprehend', region_name=region)


---
# Step 2: Perform Named Entity Recognition using Amazon Comprehend <a id="step2"></a>

We have categorized our documents by document type and stored them in S3. Next, we will perform named entity recognition on a sample bank statement using [Amazon Comprehend NER](https://docs.aws.amazon.com/comprehend/latest/dg/how-entities.html). In this case, Comprehend will extract its prebuilt, generic entity types from the document.

We will start the process by loading the extracted document text from S3 into a dataframe and subsequently using Amazon Comprehend [DetectEntities](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/comprehend.html#Comprehend.Client.detect_entities) API. 

In [None]:
extracted_df = pd.read_csv('extracted_doc.csv')

In [None]:
from trp import Document

def get_entities(text):
    try:
        #detect entities
        entities = comprehend.detect_entities(LanguageCode="en", Text=text)  
        df = pd.DataFrame(entities["Entities"], columns = ['Text', 'Type'])
        display(HTML(df.to_html(index=False)))
    except Exception as e:
        print(e)
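
Each item in the `Entities` list returned by DetectEntities carries a `Score`, `Type`, `Text`, `BeginOffset`, and `EndOffset`. The helper above keeps only `Text` and `Type`. As a minimal sketch, the confidence score could also be used to drop weak matches before display. The sample response, the `group_entities` helper, and the 0.8 threshold below are made-up values for illustration, not output from this notebook.

```python
from collections import defaultdict

# Hand-written stand-in for a DetectEntities response (illustrative values only).
sample_response = {
    "Entities": [
        {"Text": "John Doe", "Type": "PERSON", "Score": 0.99, "BeginOffset": 0, "EndOffset": 8},
        {"Text": "Seattle", "Type": "LOCATION", "Score": 0.95, "BeginOffset": 20, "EndOffset": 27},
        {"Text": "maybe", "Type": "OTHER", "Score": 0.41, "BeginOffset": 30, "EndOffset": 35},
    ]
}


def group_entities(response, min_score=0.8):
    """Group entity texts by type, keeping entities scored at or above min_score."""
    grouped = defaultdict(list)
    for entity in response.get("Entities", []):
        if entity["Score"] >= min_score:
            grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)


print(group_entities(sample_response))
```

A suitable threshold depends on the documents, so it is worth inspecting the scores on a few real responses before fixing a value.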

We will grab one bank statement from the extracted documents.

In [None]:
bank_df = extracted_df[extracted_df['DocType'] == 'bank-statements']

bank_document = bank_df.iloc[0]['DocText']
print(bank_document)

Execute entity extraction on Bank Statement

In [None]:
get_entities(bank_document)

Although entity extraction worked fairly well in identifying the _generic_ entity types in the document, we want _specific_ entities to be recognized for our use case. More specifically, we need to identify the customer's savings and checking bank account numbers, so we want the entity types to be "CHECKING_AC" and "SAVINGS_AC".

Amazon Comprehend's default prebuilt entity recognizer isn't aware of these entity types, so we will need to train and use a custom entity recognizer in this notebook. We will also perform some document enrichments; for example, in the bank statement we want to redact the customer's account numbers. We will do all of this in the remaining steps of this notebook.

---
# Step 3: Train a custom Amazon Comprehend entity recognizer <a id="step3"></a>

In [None]:
# If you have imported a custom entity recognizer model, uncomment the following code lines, replace <model_arn> with the model's ARN, and execute this code cell.
# If you execute this code cell then you can skip the model training and move to Step 4.

# entity_recognizer_arn = "<model_arn>"
# %store entity_recognizer_arn

We will be training a custom [Amazon Comprehend entity recognizer](https://docs.aws.amazon.com/comprehend/latest/dg/custom-entity-recognition.html). There are two ways a custom recognizer can be trained -

- [Using Annotations](https://docs.aws.amazon.com/comprehend/latest/dg/cer-annotation.html)
- [Using Entity Lists](https://docs.aws.amazon.com/comprehend/latest/dg/cer-entity-list.html)

The annotations method uses a large set of PDF files that have been annotated. These annotations can be created with a service such as Amazon SageMaker Ground Truth, where human workers review your files and annotate them. This method is quite involved; if you are interested in learning more, refer to [this blog](https://aws.amazon.com/blogs/machine-learning/custom-document-annotation-for-extracting-named-entities-in-documents-using-amazon-comprehend/) and [this blog](https://aws.amazon.com/blogs/machine-learning/extract-custom-entities-from-documents-in-their-native-format-with-amazon-comprehend/).

In our case, we will use an entity list, which is a CSV file that contains each text and its corresponding entity type. The entities in this file are specific to our business needs. For the purposes of this exercise, we have provided an entity list in CSV format in the `/entity-training/` directory called `entitylist.csv`. This file contains custom entity _Type_ values for customer account numbers. We have used _CHECKING_AC_ and _SAVINGS_AC_ as the custom entity types. With this, we ultimately need the custom entity recognizer to recognize the savings and checking bank account numbers.

Let's take a look at our entity list.

In [None]:
entities_df = pd.read_csv('./entity-training/entitylist.csv', dtype={'Text': object})
entities = entities_df["Type"].unique().tolist()
print(f'Custom entities : {entities}')
print(f'\nTotal Custom entities: {entities_df["Type"].nunique()}')
display(HTML(entities_df.to_html(index=False)))
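
An entity list is a CSV file whose first line is the header `Text,Type`, followed by one entity per line. As a rough sketch of that layout, the snippet below parses a small made-up list with the standard `csv` module and checks the header. The account numbers and the `load_entity_list` helper are hypothetical and are not taken from `entitylist.csv`.

```python
import csv
import io

# Illustrative entity list content; the account numbers are made up.
sample_entity_list = """Text,Type
0123456789,CHECKING_AC
9876543210,SAVINGS_AC
0123456789,CHECKING_AC
"""


def load_entity_list(csv_text):
    """Parse an entity list, check its header, and return de-duplicated (text, type) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["Text", "Type"]:
        raise ValueError(f"Expected header Text,Type but found {reader.fieldnames}")
    pairs = []
    for row in reader:
        pair = (row["Text"].strip(), row["Type"].strip())
        if pair not in pairs:
            pairs.append(pair)
    return pairs


pairs = load_entity_list(sample_entity_list)
print(pairs)
```

The `csv` module reads every field as a string, so leading zeros in account numbers survive. The pandas call in the cell above gets the same effect by passing `dtype={'Text': object}`.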

Let's train a custom entity recognizer with Amazon Comprehend. To train a custom entity recognizer, we need the entity list and a set of documents to train the model on. For this purpose, we will use the same set of documents that we used earlier to train the custom classifier.

Each custom entity needs at least 100 samples in the data corpus (documents) for training purposes, meaning you should have at least 100 documents containing examples of each of the custom entities in your training dataset. Also, a minimum of 250 entity matches is needed per entity in the entity list to train a model for custom entity recognition. We have provided a training corpus named `entity_training_corpus.csv`, which can be used to train the entity recognizer along with the entity list. Note that this corpus was generated the same way we generated the training data for the custom classifier in the first notebook. With these two data sets we will use Amazon Comprehend's [`CreateEntityRecognizer` API](https://docs.aws.amazon.com/comprehend/latest/dg/API_CreateEntityRecognizer.html).
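
Before submitting a training job, it can help to check a corpus against these minimums. The sketch below assumes a corpus with one document per line and uses plain substring counting, which only approximates how Amazon Comprehend matches entity list entries. The corpus lines, account numbers, and the `count_matches` helper are made up for illustration.

```python
from collections import Counter

# Made-up corpus (one document per line) and entity list for illustration.
corpus_lines = [
    "Checking account 0123456789 balance 100.00",
    "Savings account 9876543210 and checking 0123456789",
    "No account numbers on this line",
]
entity_list = [("0123456789", "CHECKING_AC"), ("9876543210", "SAVINGS_AC")]


def count_matches(lines, entities):
    """Return (documents containing each type, total matches per type)."""
    docs_per_type = Counter()
    matches_per_type = Counter()
    for line in lines:
        types_in_line = set()
        for text, entity_type in entities:
            hits = line.count(text)  # simple substring count, an approximation
            if hits:
                matches_per_type[entity_type] += hits
                types_in_line.add(entity_type)
        docs_per_type.update(types_in_line)
    return docs_per_type, matches_per_type


docs, matches = count_matches(corpus_lines, entity_list)
print(dict(docs), dict(matches))
```

Comparing the two counters with the thresholds quoted above gives a quick indication of whether an entity type is under-represented.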

In [None]:
#Upload entity list CSV to S3
entities_key='idp/comprehend/entities/entitylist.csv'
training_data_key='idp/comprehend/entities/entity_training_corpus.csv'

s3.upload_file(Filename='./entity-training/entitylist.csv', 
               Bucket=data_bucket, 
               Key=entities_key)

s3.upload_file(Filename='./entity-training/entity_training_corpus.csv', 
               Bucket=data_bucket, 
               Key=training_data_key)

entities_uri = f's3://{data_bucket}/{entities_key}'
training_data_uri = f's3://{data_bucket}/{training_data_key}'

print(f'Entity List CSV File: {entities_uri}')
print(f'Training Data File: {training_data_uri}')

### Let's now train a custom entity recognizer with this data and the entity list of savings and checking account numbers

In [None]:
# Create a custom entity recognizer
account_id = boto3.client('sts').get_caller_identity().get('Account')
run_id = str(int(time.time()))  # epoch seconds; avoids shadowing the built-in id() and the non-portable "%s" format

entity_recognizer_name = 'Sample-Entity-Recognizer-IDP'
entity_recognizer_version = 'Sample-Entity-Recognizer-IDP-v1'
entity_recognizer_arn = ''
create_response = None
EntityTypes = [ {'Type': entity} for entity in entities]

try:
    create_response = comprehend.create_entity_recognizer(
        InputDataConfig={
            'DataFormat': 'COMPREHEND_CSV',
            'EntityTypes': EntityTypes,
            'Documents': {
                'S3Uri': training_data_uri
            },
            'EntityList': {
                'S3Uri': entities_uri
            }
        },
        DataAccessRoleArn=role,
        RecognizerName=entity_recognizer_name,
        VersionName=entity_recognizer_version,
        LanguageCode='en'
    )
    
    entity_recognizer_arn = create_response['EntityRecognizerArn']
    
    print(f"Comprehend Custom entity recognizer created with ARN: {entity_recognizer_arn}")
except botocore.exceptions.ClientError as error:  # only ClientError carries a .response dict
    if error.response['Error']['Code'] == 'ResourceInUseException':
        print(f'An entity recognizer with the name "{entity_recognizer_name}" already exists.')
        entity_recognizer_arn = f'arn:aws:comprehend:{region}:{account_id}:entity-recognizer/{entity_recognizer_name}/version/{entity_recognizer_version}'
        print(f'The entity recognizer ARN is: "{entity_recognizer_arn}"')
    else:
        print(error)

In [None]:
%store entity_recognizer_arn

Check status of the Comprehend custom entity recognizer job

In [None]:
%%time
# Loop through and wait for the training to complete . Takes up to 10 mins 
from IPython.display import clear_output
import time
from datetime import datetime

jobArn = entity_recognizer_arn  # also set when the recognizer already existed and create_response is None

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    
    describe_custom_recognizer = comprehend.describe_entity_recognizer(
        EntityRecognizerArn = jobArn
    )
    status = describe_custom_recognizer["EntityRecognizerProperties"]["Status"]
    clear_output(wait=True)
    print(f"{current_time} : Custom document entity recognizer: {status}")
    
    if status == "TRAINED" or status == "IN_ERROR":
        break
    time.sleep(10)
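
The polling pattern in the cell above can be factored into a small reusable helper. This is a sketch rather than part of the workshop code: `wait_for_status` is a hypothetical name, and the scripted status sequence stands in for real `describe_entity_recognizer` calls so that the example runs without AWS access.

```python
import time


def wait_for_status(get_status, done_states, timeout_s=3 * 60 * 60, poll_s=10, sleep=time.sleep):
    """Poll get_status() until it returns one of done_states or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in done_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout_s} seconds")
        sleep(poll_s)


# Stand-in for describe_entity_recognizer: a scripted sequence of statuses.
fake_statuses = iter(["SUBMITTED", "TRAINING", "TRAINING", "TRAINED"])
final = wait_for_status(
    lambda: next(fake_statuses),
    done_states={"TRAINED", "IN_ERROR"},
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
print(final)
```

With the notebook's client, `get_status` could be a lambda that returns `comprehend.describe_entity_recognizer(EntityRecognizerArn=jobArn)["EntityRecognizerProperties"]["Status"]`.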

---

# Step 4: Create custom entity recognizer real-time endpoint <a id="step4"></a>

We will create a real-time endpoint for the trained entity recognizer.

In [None]:
#create comprehend endpoint
model_arn = entity_recognizer_arn
ep_name = 'idp-er-endpoint'

try:
    endpoint_response = comprehend.create_endpoint(
        EndpointName=ep_name,
        ModelArn=model_arn,
        DesiredInferenceUnits=1,    
        DataAccessRoleArn=role
    )
    ER_ENDPOINT_ARN=endpoint_response['EndpointArn']
    print(f'Endpoint created with ARN: {ER_ENDPOINT_ARN}')
    %store ER_ENDPOINT_ARN
except botocore.exceptions.ClientError as error:  # only ClientError carries a .response dict
    if error.response['Error']['Code'] == 'ResourceInUseException':
        print(f'An endpoint with the name "{ep_name}" already exists.')
        ER_ENDPOINT_ARN = f'arn:aws:comprehend:{region}:{account_id}:entity-recognizer-endpoint/{ep_name}'
        print(f'The entity recognizer endpoint ARN is: "{ER_ENDPOINT_ARN}"')
        %store ER_ENDPOINT_ARN
    else:
        print(error)

Check creation status of the entity recognizer endpoint.

In [None]:
%%time
# Loop through and wait for the endpoint creation to complete. Takes up to 10 mins
from IPython.display import clear_output
import time
from datetime import datetime

ep_arn = ER_ENDPOINT_ARN  # also set when the endpoint already existed and endpoint_response is undefined

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    
    describe_endpoint_resp = comprehend.describe_endpoint(
        EndpointArn=ep_arn
    )
    status = describe_endpoint_resp["EndpointProperties"]["Status"]
    clear_output(wait=True)
    print(f"{current_time} : Custom entity recognizer endpoint: {status}")
    
    if status == "IN_SERVICE" or status == "FAILED":
        break
        
    time.sleep(10)

---

Now that we have trained a custom entity recognizer, we will use it to extract enriched information from the document and then perform document redaction with the help of the custom entities recognized by Amazon Comprehend and the bounding box information from Amazon Textract.

- [Enrichment 1: Extract custom entity out of the documents](#er1)
- [Enrichment 2: Perform redaction document enrichment ](#er2)

# Enrichment 1: Detect Custom Entities  <a id="er1"></a>

<p align="center">
  <img src="./images/idp-custom-entity.png" alt="cfn1" style="width:500px;"/>
</p>

In [None]:
extracted_df = pd.read_csv('extracted_doc.csv')
extracted_df

In [None]:
from trp import Document

def get_entities(text):
    try:
        #detect entities
        entities_custom = comprehend.detect_entities(LanguageCode="en", Text=text, EndpointArn=ER_ENDPOINT_ARN)    
        df_custom = pd.DataFrame(entities_custom["Entities"], columns = ['Text', 'Type'])
        df_custom = df_custom.drop_duplicates(subset=['Text']).reset_index(drop=True)  # drop=True keeps the old index out of the CSV
        return df_custom
    except Exception as e:
        print(e)

We will grab a bank statement from our S3 location.

In [None]:
bank_df = extracted_df[extracted_df['DocType'] == 'bank-statements']

bank_document = bank_df.iloc[3]['DocText']

Execute entity extraction on Bank Statement

In [None]:
entities_df = get_entities(bank_document)
entities_df

Much better! Now, instead of returning generic entities, Amazon Comprehend is returning the entities from our scanned document that we are interested in, i.e. the checking and savings account numbers. This output can be consumed by a database or data warehouse for additional analysis. For our exercise, we will save it to a CSV file and upload it to S3.

In [None]:
entities_df.to_csv('final_output.csv', index=False)

#Upload dataframe as csv to S3
s3.upload_file(Filename='final_output.csv', 
               Bucket=data_bucket, 
               Key=f'idp/comprehend/output-entities/final_output.csv')
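
The CSV saved above has one row per detected entity. If a downstream system prefers one metadata row per document, the entities can be pivoted by type. The following is a minimal sketch of that idea using only the standard library. The `to_metadata_row` helper, the file name, and the account numbers are hypothetical.

```python
import csv
import io

# Made-up entities in the shape returned by get_entities (Text and Type columns).
detected = [
    {"Text": "0123456789", "Type": "CHECKING_AC"},
    {"Text": "9876543210", "Type": "SAVINGS_AC"},
]


def to_metadata_row(document_name, entities):
    """Collapse detected entities into one row keyed by entity type."""
    row = {"Document": document_name}
    for entity in entities:
        existing = row.get(entity["Type"])
        # Join repeated types with ';' so no value is silently dropped.
        row[entity["Type"]] = f"{existing};{entity['Text']}" if existing else entity["Text"]
    return row


row = to_metadata_row("statement_01.png", detected)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row), lineterminator="\n")
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Keeping the document name in the row makes it possible to join the metadata back to the source file later.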

---
# Enrichment 2: Perform redaction document enrichment  <a id="er2"></a>
We still need to perform some enrichments on the document. Since the document contains the customer's savings and checking account numbers, we would like to redact those. Our custom entity recognizer already tells us which text is the checking account number and which is the savings account number, so we can use Amazon Textract's geometry data to redact that information in the document.

<p align="center">
  <img src="./images/idp-redaction.png" alt="cfn1" style="width:500px;"/>
</p>

Let's grab a bank statement from our classified list of documents. We will pick the S3 location of the document and then perform the actions below:

- Use Amazon Textract to get the geometry information i.e. the bounding boxes, of all the lines in the document 
- Use the extracted text above to identify the entities CHECKING_AC and SAVINGS_AC, using Comprehend custom entity recognizer
- Find the bounding box for the CHECKING_AC and SAVINGS_AC words from the Textract response
- Use the bounding box geometry to annotate the document and redact the checking and savings account numbers.
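
Amazon Textract reports each bounding box as `Left`, `Top`, `Width`, and `Height` values expressed as ratios of the page dimensions, so they must be scaled by the image size before drawing. The sketch below shows that conversion. `to_pixel_box` is a hypothetical helper and the sample box is made up. In this notebook the overlayer library introduced next performs the conversion.

```python
def to_pixel_box(bounding_box, image_width, image_height):
    """Convert a Textract-style ratio bounding box to (xmin, ymin, xmax, ymax) in pixels."""
    xmin = bounding_box["Left"] * image_width
    ymin = bounding_box["Top"] * image_height
    xmax = xmin + bounding_box["Width"] * image_width
    ymax = ymin + bounding_box["Height"] * image_height
    return tuple(round(value) for value in (xmin, ymin, xmax, ymax))


# Made-up box covering the middle half of the page width.
sample_box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}
print(to_pixel_box(sample_box, image_width=1000, image_height=800))
```

The resulting tuple has the same corner order that `ImageDraw.rectangle` accepts for its `xy` argument.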

To obtain the bounding box geometry of all words in the document, we will use a tool called `amazon-textract-overlayer`. See the [documentation](https://github.com/aws-samples/amazon-textract-textractor/tree/master/overlayer) to learn more about `amazon-textract-overlayer`.

In [None]:
!python -m pip install -q amazon-textract-overlayer

In [None]:
from textractoverlayer.t_overlay import DocumentDimensions, get_bounding_boxes
from textractcaller.t_call import Textract_Types

def redact_doc(s3document, localpath):
    try:
        img = PImage.open(localpath)
        draw = ImageDraw.Draw(img)

        # Use call_textract to get bounding boxes
        resp = call_textract(input_document = f's3://{data_bucket}/{s3document}', features = [Textract_Features.TABLES, Textract_Features.FORMS])
        document_dimension:DocumentDimensions = DocumentDimensions(doc_width=img.size[0], doc_height=img.size[1])
        overlay=[Textract_Types.WORD, Textract_Types.FORM, Textract_Types.CELL]
        bounding_box_list = get_bounding_boxes(textract_json=resp, document_dimensions=[document_dimension], overlay_features=overlay)

        print('Detecting entities...')
                
        text = get_string(textract_json=resp, output_type=[Textract_Pretty_Print.LINES])
                
        #entity recognizer
        entity_resp = comprehend.detect_entities(LanguageCode="en", Text=text, EndpointArn=ER_ENDPOINT_ARN)         
        entities = [{'Type': entity['Type'], 'Text': entity['Text']} for entity in entity_resp['Entities']]
        redactions = []

        #collect the bounding boxes for the custom entities
        for entity in entities:
            entity_text = entity['Text']
            print(f'Found Entity: {entity_text}')
            for bbox in bounding_box_list:                
                if bbox.text == entity_text:
                    redactions.append(bbox)

        for box in redactions:
            draw.rectangle(xy=[box.xmin, box.ymin, box.xmax, box.ymax], fill="Black")

        opfile = Path(localpath).stem
        opfile = f'{opfile}_redacted.png'        
        img.save(opfile)           
        print(f'Done.... Redacted file saved: {opfile}')
        return opfile
    except Exception as e:
        print(e)

The function above finds the custom entities in the document, looks up the corresponding geometry information for each custom entity's text, and performs redaction on the document. Let's call it for a sample bank statement.

In [None]:
# Select a bank document

bank_df = extracted_df[extracted_df['DocType'] == 'bank-statements']
bank_document_s3 = bank_df.iloc[3]['s3path']
bank_document_local = bank_df.iloc[3]['Document']
bank_document_local, bank_document_s3

#perform redaction enrichment
redacted_file = redact_doc(bank_document_s3, f'./samples/mixedbag/{bank_document_local}')
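
Note that `redact_doc` redacts an entity only when a single bounding box has exactly the same text as the entity. If an entity spans several words, for example an account number printed in groups of digits, one possible extension is to look for runs of consecutive word boxes. The sketch below illustrates the idea. `WordBox` is a hypothetical stand-in for the overlayer's bounding box objects, and it assumes the boxes are supplied in reading order.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """Hypothetical stand-in for a word-level bounding box with its text."""
    text: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int


def find_entity_boxes(entity_text, word_boxes):
    """Return the runs of consecutive word boxes whose texts equal the entity's tokens."""
    tokens = entity_text.split()
    runs = []
    if not tokens:
        return runs
    for start in range(len(word_boxes) - len(tokens) + 1):
        window = word_boxes[start:start + len(tokens)]
        if [box.text for box in window] == tokens:
            runs.append(window)
    return runs


# Made-up word boxes, assumed to be in reading order.
words = [
    WordBox("Account", 10, 10, 80, 30),
    WordBox("0123", 90, 10, 130, 30),
    WordBox("4567", 140, 10, 180, 30),
]
matches = find_entity_boxes("0123 4567", words)
print([[box.text for box in run] for run in matches])
```

Each box in a matched run could then be passed to `draw.rectangle` in the same way as the single-box matches.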

Once our redacted file has been generated, let's take a look at it.

In [None]:
print(f'\nUnredacted Document\t\t\t\t\t\t\tRedacted Document \n')

HTML(f"""
    <div class="row">
            <img src=./samples/mixedbag/{bank_document_local} style="width:45%"></img>
            <img src={redacted_file} style="width:45%">  </img>
    </div>
    """)

---
# Conclusion

In this notebook we trained an Amazon Comprehend custom entity recognizer using our own entity list so that we can extract those entities from our documents. We used two entity types, CHECKING_AC and SAVINGS_AC. We then created an endpoint with the custom entity recognizer and called `detect_entities` against the endpoint with one of the bank statements. Finally, we saved the extracted entities to a CSV file and uploaded it to S3 for further analysis.

We also performed a redaction enrichment on the document. Since the document contains the customer's checking and savings account numbers, and our custom entity recognizer identifies them, we used Amazon Textract's geometry data to redact that information in the document.

---

# Cleanup

Cleanup is optional at this point if you want to execute the next notebook (Module-4). 

Refer to the `05-idp-cleanup.ipynb` for cleanup and deletion of resources.