# Custom Entity Detection with Textract and Comprehend

## Contents
1. [Background](#Background)
1. [Setup](#Setup)
1. [Data Prep](#Data-Prep)
1. [Textract OCR++](#Textract-OCR++)
1. [Amazon GroundTruth Labeling](#Amazon-GroundTruth-Labeling)
1. [Comprehend Custom Entity Training](#Comprehend-Custom-Entity-Training)
1. [Model Performance](#Model-Performance)
1. [Inference](#Inference)
1. [Results](#Results)


## Background

In this notebook, we cover how to build a custom entity recognizer using Amazon Textract and Amazon Comprehend. We use Amazon Textract to perform OCR++ on scanned documents, GroundTruth to label the entities of interest, and then pass the extracted documents to Amazon Comprehend to train a custom entity recognition model. No prior machine learning knowledge is required.

In this example, we use a public dataset from Kaggle: [Resume Entities for NER](https://www.kaggle.com/dataturks/resume-entities-for-ner?select=Entity+Recognition+in+Resumes.json). The dataset comprises 220 candidate resumes in JSON format.


## Setup
_This notebook was created on an ml.t2.medium notebook instance._

Let's start by installing and importing all the necessary libraries:

In [None]:
# Installing tqdm Python Library
!pip install tqdm

In [None]:
import sagemaker
import logging
import boto3
import glob
import time
import os 
from tqdm import tqdm
import json


region = boto3.Session().region_name    
role = sagemaker.get_execution_role()
bucket = sagemaker.Session().default_bucket()
prefix = 'textract_comprehend_NER'

## Data Prep <a class="anchor" id="Data-Prep"></a>

PDF and PNG are the most common formats for scanned documents within enterprises. To emulate this, we have already converted the resumes into PDF format. Let's upload these PDF resumes to S3 for Textract processing. Note that the dataset contains only 220 resumes, which is very small by modern standards. The dataset also comes with a few labeled custom entities; however, we will run it through Amazon GroundTruth to obtain a fresh entity list.

In [None]:
# Uploading PDF resumes to S3
pdfResumeFileList = glob.glob("./resume_pdf/*.pdf")
prefix_resume_pdf = prefix + "/resume_pdf/"

for filePath in tqdm(pdfResumeFileList):
    file_name = os.path.basename(filePath)
    boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix_resume_pdf, file_name)).upload_file(filePath)

resume_pdf_bucket_name = 's3://'+bucket+'/'+prefix+'/'+'resume_pdf/'
print('Uploaded Resume PDFs :\t', resume_pdf_bucket_name)

## Textract OCR++ <a class="anchor" id="Textract-OCR++"></a>

Now that these PDFs are ready for Textract to perform OCR++, you can kick off the process with the asynchronous [StartDocumentTextDetection](https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentTextDetection.html) API call. Here we process only 2 resume PDFs to demonstrate the process. To save time, all 220 resumes have already been processed and are available for you; see the textract_output directory for the results.
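
The helpers `StartDocumentTextDetection` and `getDocumentTextDetection` used in the cells that follow come from the local `s3_textract_functions` module, whose source is not shown in this notebook. As a rough illustration of the asynchronous pattern, here is a minimal sketch of what such helpers might look like. The function names `start_text_detection` and `wait_for_text_detection` are hypothetical, and the client is assumed to be one created with `boto3.client('textract')`.

```python
import time


def start_text_detection(textract_client, bucket, key):
    """Submit an asynchronous text-detection job for an S3 object and return its JobId."""
    response = textract_client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["JobId"]


def wait_for_text_detection(textract_client, job_id, poll_seconds=5):
    """Poll until the job leaves IN_PROGRESS, then collect blocks from every result page."""
    response = textract_client.get_document_text_detection(JobId=job_id)
    while response["JobStatus"] == "IN_PROGRESS":
        time.sleep(poll_seconds)
        response = textract_client.get_document_text_detection(JobId=job_id)

    # This sketch treats anything other than SUCCEEDED (including partial success) as an error.
    if response["JobStatus"] != "SUCCEEDED":
        raise RuntimeError("Textract job ended with status " + response["JobStatus"])

    blocks = list(response.get("Blocks", []))
    # Long documents are returned in pages; follow NextToken until it is absent.
    while "NextToken" in response:
        response = textract_client.get_document_text_detection(
            JobId=job_id, NextToken=response["NextToken"]
        )
        blocks.extend(response.get("Blocks", []))
    return blocks
```

Passing the client in as an argument keeps the functions easy to exercise with a stub; the notebook's own helpers may handle polling and pagination differently.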

In [None]:
s3_client = boto3.client('s3')
pdf_object_list = []

# Getting a list of resume PDF files:
response = s3_client.list_objects(
    Bucket= bucket,
    Prefix= prefix+'/'+'resume_pdf/text_output'
)

for obj in response['Contents']:
    pdf_object_list.append(obj['Key'])

pdf_object_list[:5]

In [None]:
from s3_textract_functions import *
import codecs

sample_to_process = 2

# We process only a few files as an example; you do not need to process all 220 files
for file_obj in tqdm(pdf_object_list[:sample_to_process]):
    print('Textract Processing PDF: \t'+ file_obj)             
    job_id = StartDocumentTextDetection(bucket, file_obj)
    print('Textract Job Submitted: \t'+ job_id)
    response = getDocumentTextDetection(job_id)
    
    # renaming .pdf to .txt
    text_output_name = file_obj.replace('.pdf', '.txt')
    text_output_name = text_output_name[(text_output_name.rfind('/')+1):]
    print('Output Name:\t', text_output_name)
    
    output_dir = './textract_output/'
    
    # Write each detected LINE block to the text file, one line of text per row.
    # The `with` block closes the file automatically, so no explicit close() is needed.
    with codecs.open(output_dir + text_output_name, "w", "utf-8") as output_file:
        for item in response["Blocks"]:
            if item["BlockType"] == "LINE":
                print('\033[94m' + item["Text"] + '\033[0m')  # echo the line in blue
                output_file.write(item["Text"]+'\n')


In [None]:
from tqdm import tqdm
    
# Uploading Textract Output to S3
textract_output_filelist = glob.glob("./textract_output/*.txt")
prefix_textract_output = prefix + "/textract_output/"

for filePath in tqdm(textract_output_filelist):
    file_name = os.path.basename(filePath)
    boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix_textract_output, file_name)).upload_file(filePath)

comprehend_input_doucuments = 's3://' + bucket+'/'+prefix_textract_output
print('Textract Output:\t', comprehend_input_doucuments)

## Amazon GroundTruth Labeling <a class="anchor" id="Amazon-GroundTruth-Labeling"></a>

Training a custom entity recognition model with Comprehend, as with any machine learning model, requires a large amount of training data. In this example, we use Amazon GroundTruth to label our entities. By default, Amazon Comprehend can already recognize entities such as [Person, Title, Organization, etc.](https://docs.aws.amazon.com/comprehend/latest/dg/how-entities.html) To demonstrate the custom entity recognition capability, we focus on Skill entities inside these resumes. We have already labeled and cleaned the data with Amazon GroundTruth (see: entity_list.csv). If you are interested, you can follow this blog to [add a data labeling workflow for named entity recognition](https://aws.amazon.com/blogs/machine-learning/adding-a-data-labeling-workflow-for-named-entity-recognition-with-amazon-sagemaker-ground-truth/).
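
For reference, a Comprehend entity list is a CSV file whose first line holds the column headers `Text` and `Type`, followed by one entity per row. The sketch below shows how such a file could be assembled from a list of skill strings. The helper name `build_entity_list_csv` and the skill names are illustrative assumptions; they are not taken from entity_list.csv.

```python
import csv
import io


def build_entity_list_csv(skills, entity_type="SKILLS"):
    """Return CSV text with a Text,Type header followed by one row per unique skill."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Text", "Type"])
    seen = set()
    for skill in skills:
        cleaned = skill.strip()
        # Skip empty strings and exact duplicates so each entity appears once.
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            writer.writerow([cleaned, entity_type])
    return buffer.getvalue()


# Illustrative skill names only; these are not taken from entity_list.csv.
print(build_entity_list_csv(["Python", "SQL", "Python", " Machine Learning "]))
```

The value in the `Type` column has to match an entity type declared when the recognizer is created, which is `SKILLS` in this notebook.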


Before we start training, let's upload the entity list to S3.

In [None]:
# Uploading Entity List to S3
entity_list_file = './entity_list.csv'
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix+'/entity_list/', 'entity_list.csv')).upload_file(entity_list_file)

comprehend_input_entity_list = 's3://' + bucket+'/'+prefix+'/entity_list/'+'entity_list.csv'
print('Entity List:\t', comprehend_input_entity_list)

## Comprehend Custom Entity Training <a class="anchor" id="Comprehend-Custom-Entity-Training"></a>

Now that we have both the raw and the labeled data, we are ready to train our model. You can kick off the process with the create_entity_recognizer API call.

In [None]:
comprehend_client = boto3.client('comprehend')
custom_recognizer_name = 'resume-entity-recognizer-'+ str(int(time.time()))
comprehend_custom_recognizer_response = comprehend_client.create_entity_recognizer(
    RecognizerName = custom_recognizer_name,
    DataAccessRoleArn=role,
    InputDataConfig={
        'EntityTypes': [
            {
                'Type': 'SKILLS'
            },
        ],
        'Documents': {
            'S3Uri': comprehend_input_doucuments
        },
        'EntityList': {
            'S3Uri': comprehend_input_entity_list
        }
    },
    LanguageCode='en'
)

print(json.dumps(comprehend_custom_recognizer_response, indent=2))

Once the training job is submitted, you can see the recognizer being trained in the Comprehend console. 
Training takes approximately 20 minutes.
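
Rather than re-running the status check by hand, you could poll until training reaches a terminal state. The following is a minimal sketch under the assumption that a Comprehend client is passed in; the helper name `wait_for_recognizer`, the polling interval, and the polling budget are all hypothetical choices.

```python
import time

# Statuses after which the recognizer will not change state on its own.
TERMINAL_STATUSES = {"TRAINED", "IN_ERROR", "STOPPED"}


def wait_for_recognizer(comprehend_client, recognizer_arn, poll_seconds=60, max_polls=60):
    """Poll describe_entity_recognizer until training reaches a terminal status."""
    for _ in range(max_polls):
        props = comprehend_client.describe_entity_recognizer(
            EntityRecognizerArn=recognizer_arn
        )["EntityRecognizerProperties"]
        if props["Status"] in TERMINAL_STATUSES:
            return props
        time.sleep(poll_seconds)
    raise TimeoutError("Recognizer did not finish training within the polling budget")
```

In this notebook, it could be called with `comprehend_client` and `comprehend_custom_recognizer_response['EntityRecognizerArn']`; the returned dictionary carries the same `Status` field that the next cell prints.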

In [None]:
comprehend_model_response = comprehend_client.describe_entity_recognizer(
    EntityRecognizerArn= comprehend_custom_recognizer_response['EntityRecognizerArn']
)

print('ARN:\t', comprehend_model_response['EntityRecognizerProperties']['EntityRecognizerArn'])
print('Training Job Status:\t', comprehend_model_response['EntityRecognizerProperties']['Status'])

## Model Performance <a class="anchor" id="Model-Performance"></a>

During training, Comprehend divides the dataset into training documents and test documents. Once the recognizer is trained, you can see its overall performance, as well as the performance for each entity.
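
As a reminder of how the three metrics relate, the F1 score is the harmonic mean of precision and recall. The short sketch below computes it from the other two; the function name and the input values are made up for illustration and are not results from this recognizer.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up values for illustration; the formula works on a 0-1 or a 0-100 scale.
example_precision = 80.0
example_recall = 60.0
print("F1:", round(f1_score(example_precision, example_recall), 2))
```

Because the harmonic mean is pulled toward the smaller of the two inputs, a recognizer with high precision but low recall (or the reverse) still ends up with a modest F1 score.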

In [None]:
if comprehend_model_response['EntityRecognizerProperties']['Status'] == 'TRAINED':
    print('Number of Documents Trained:\t', comprehend_model_response['EntityRecognizerProperties']['RecognizerMetadata']['NumberOfTrainedDocuments'])
    print('Number of Documents Tested:\t', comprehend_model_response['EntityRecognizerProperties']['RecognizerMetadata']['NumberOfTestDocuments'])
    print('\n-------------- Evaluation Metrics: ----------------')
    print('Precision:\t', comprehend_model_response['EntityRecognizerProperties']['RecognizerMetadata']['EvaluationMetrics']['Precision'])
    print('Recall:\t\t', comprehend_model_response['EntityRecognizerProperties']['RecognizerMetadata']['EvaluationMetrics']['Recall'])
    print('F1 Score:\t', comprehend_model_response['EntityRecognizerProperties']['RecognizerMetadata']['EvaluationMetrics']['F1Score'])
else:
    print('Please wait for previous step to be completed')


## Inference

Next, we have prepared a small sample of text to test our newly trained custom entity recognizer. First, we upload the document to S3 and start a custom recognizer job. Once the job is submitted, you can see its progress in the console under Amazon Comprehend → Analysis Jobs.

### Uploading Test PDF resumes to S3 for OCR++

In [None]:
pdfResumeFileList = glob.glob("./test_document/*.pdf")
prefix_resume_pdf = prefix + "/test_document/"

for filePath in tqdm(pdfResumeFileList):
    file_name = os.path.basename(filePath)
    boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix_resume_pdf, file_name)).upload_file(filePath)

resume_pdf_bucket_name = 's3://'+bucket+'/'+prefix+'/'+'test_document/'
print('Uploaded Resume PDFs :\t', resume_pdf_bucket_name)

### Performing OCR++ Using Textract

In [None]:
pdf_object_list = []
pdf_object_list.append(prefix_resume_pdf+"test_document.pdf")

output_dir = './test_document/'

for file_obj in tqdm(pdf_object_list):
    print('Textract Processing PDF: \t'+ file_obj)             
    job_id = StartDocumentTextDetection(bucket, file_obj)
    print('Textract Job Submitted: \t'+ job_id)
    response = getDocumentTextDetection(job_id)
    
    # renaming .pdf to .txt
    text_output_name = file_obj.replace('.pdf', '.txt')
    text_output_name = text_output_name[(text_output_name.rfind('/')+1):]
    print('Output Name:\t', text_output_name)
    
    
    # Write each detected LINE block to the text file, one line of text per row.
    # The `with` block closes the file automatically, so no explicit close() is needed.
    with codecs.open(output_dir + text_output_name, "w", "utf-8") as output_file:
        for item in response["Blocks"]:
            if item["BlockType"] == "LINE":
                print('\033[94m' + item["Text"] + '\033[0m')  # echo the line in blue
                output_file.write(item["Text"]+'\n')

### Uploading the Textract Result for Inference

In [None]:
# Uploading test document onto S3:
test_document = './test_document/test_document.txt'
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix+'/test_document/', 'test_document.txt')).upload_file(test_document)

s3_test_document = 's3://' + bucket+'/'+prefix+'/test_document/'+'test_document.txt'
s3_test_document_output = 's3://' + bucket+'/'+prefix+'/test_document/'
print('Test Document Input: ', s3_test_document)
print('Test Document Output: ', s3_test_document_output)

In [None]:
# Start a recognizer Job:
custom_recognizer_job_name = 'recognizer-job-'+ str(int(time.time()))

recognizer_response = comprehend_client.start_entities_detection_job(
    InputDataConfig={
        'S3Uri': s3_test_document,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': s3_test_document_output
    },
    DataAccessRoleArn=role,
    JobName=custom_recognizer_job_name,
    EntityRecognizerArn=comprehend_model_response['EntityRecognizerProperties']['EntityRecognizerArn'],
    LanguageCode='en'
)

Use the following code to check whether the detection job has completed:

In [None]:
job_response = comprehend_client.describe_entities_detection_job(
    JobId=recognizer_response['JobId']
)
print('Detection Job Name:\t', job_response['EntitiesDetectionJobProperties']['JobName'])
print('Detection Job ID:\t', job_response['EntitiesDetectionJobProperties']['JobId'])
print('Detection Job Status:\t', job_response['EntitiesDetectionJobProperties']['JobStatus'])


In [None]:
output_url=job_response['EntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri']
print('S3 Output URL:\t', output_url)

## Results

Once the analysis job is done, you can download the output and see the results. Here we convert the JSON results into a table.
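
To make the conversion concrete, here is a small self-contained sketch of flattening a JSON-lines output into table rows. The entity field names match the ones read later in this notebook; the helper name `entity_rows`, the `min_score` threshold, the `Line` key, and the sample lines are assumptions made up for illustration and are not real output from the job.

```python
import json


def entity_rows(jsonl_text, min_score=0.0):
    """Flatten JSON-lines detection output into rows, dropping low-confidence entities."""
    rows = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        for entity in record.get("Entities", []):
            if entity["Score"] >= min_score:
                rows.append(
                    (entity["BeginOffset"], entity["EndOffset"],
                     entity["Score"], entity["Text"], entity["Type"])
                )
    return rows


# Made-up sample lines for illustration; not real output from the job above.
sample = "\n".join([
    json.dumps({"Entities": [
        {"BeginOffset": 0, "EndOffset": 6, "Score": 0.98, "Text": "Python", "Type": "SKILLS"},
        {"BeginOffset": 8, "EndOffset": 11, "Score": 0.41, "Text": "SQL", "Type": "SKILLS"},
    ], "Line": 0}),
    json.dumps({"Entities": [], "Line": 1}),
])

for row in entity_rows(sample, min_score=0.5):
    print(row)
```

Filtering on the confidence score is optional; with the default `min_score=0.0` every detected entity is kept, which mirrors the table built in the cells that follow.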

In [None]:
from urllib.parse import urlparse

# Create a directory for the output file (-p avoids an error if it already exists):
!mkdir -p test_document_output

# Downloading Output File
if job_response['EntitiesDetectionJobProperties']['JobStatus'] == 'COMPLETED':
    filename = './test_document_output/output.tar.gz'
    output_url_o = urlparse(output_url, allow_fragments=False)
    s3_client.download_file(output_url_o.netloc, output_url_o.path.lstrip('/'), filename)

    !cd test_document_output; tar -xvzf output.tar.gz
    
    print("Output downloaded ... ")
else:
    print("Please wait for the analysis job to be completed!")

In [None]:
from IPython.display import HTML, display

output_file_name = './test_document_output/output'
data = [['Start Offset', 'End Offset', 'Confidence', 'Text', 'Type']]

with open(output_file_name, 'r', encoding='utf-8') as input_file:
    for line in input_file.readlines():
        json_line = json.loads(line)  # converting line of text into JSON
        entities = json_line['Entities']
        if(len(entities)>0):
            for entry in entities:
                entry_data = [entry['BeginOffset'], entry['EndOffset'], entry['Score'], entry['Text'],entry['Type']]
                data.append(entry_data)
        
display(HTML(
   '<table><tr>{}</tr></table>'.format(
       '</tr><tr>'.join(
           '<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in data)
       )
))