# Intelligent Document Processing

Documents contain valuable information and come in various shapes and forms. In most cases, you process these documents manually, which is time-consuming, error-prone, and costly. You not only want this information extracted quickly, but also want to automate business processes that presently rely on manual input and intervention across various file types and formats.

To help you overcome these challenges, AWS Machine Learning (ML) now gives you choices for extracting information from complex content in any document format, such as insurance claims, mortgages, healthcare claims, and legal contracts.

The diagram below shows an architecture for an intelligent document processing workflow. It starts with a data capture stage that securely stores and aggregates documents of different types (PDF, PNG, JPEG, and TIFF), formats, and layouts. This is followed by accurate classification of the documents, extraction of text and key insights, and further enrichment (such as identifying entities, redaction, etc.). Finally, the verification and review stage involves manual review of the documents for quality and accuracy, followed by consumption of the documents and extracted information by downstream databases and applications.

In this workshop, we will explore the various aspects of this workflow such as the document classification, text and insights extraction, enrichments, and human review.

![Arch](./images/idp.png)



# Document Classification
In this lab we will walk you through a hands-on exercise on document classification using the Amazon Comprehend custom classifier. We will first use Amazon Textract to extract the text from our documents, label it, and then use the labeled data to train our Amazon Comprehend custom classifier. Finally, we will classify our documents with the custom classifier, both through an asynchronous job and through an Amazon Comprehend real-time endpoint.

![IDP Classify](./images/idp-classify.png)

- [Step 1: Set up notebook and upload sample documents to Amazon S3](#step1)
- [Step 2: Extract text from sample documents using Amazon Textract and label](#step2)
- [Step 3: Prepare a CSV training dataset for Amazon Comprehend custom classifier training](#step3)
- [Step 4: Create Amazon Comprehend Classification training job](#step4)
- Step 5: Classify documents with Amazon Comprehend custom classifier
- [Step 6: Create Amazon Comprehend real-time endpoint (optional)](#step5)
- [Step 7: Classify a document with the real-time endpoint (optional)](#step6)



---

# Step 1: Set up notebook and upload sample documents to Amazon S3 <a id="step1"></a>

In this step, we will import some necessary libraries that will be used throughout this notebook. We will then upload all the documents from the `/classification-training` folder to SageMaker's default bucket.

In [None]:
!python -m pip install -q amazon-textract-response-parser --upgrade
!python -m pip install -q amazon-textract-caller --upgrade
!python -m pip install -q amazon-textract-prettyprinter --upgrade

In [None]:
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string
from trp import Document

If the import statements above fail, please restart the notebook kernel by clicking the circular arrow button at the top of the notebook.

In [None]:
import boto3
import botocore
import sagemaker
import os
import io
import datetime
import pandas as pd
from PIL import Image as PILImage  # aliased so IPython.display.Image (imported below) does not shadow it
from pathlib import Path
import multiprocessing as mp
from IPython.display import Image, display, HTML, JSON

# variables
data_bucket = sagemaker.Session().default_bucket()
region = boto3.session.Session().region_name

os.environ["BUCKET"] = data_bucket
os.environ["REGION"] = region
role = sagemaker.get_execution_role()

print(f"SageMaker role is: {role}\nDefault SageMaker Bucket: s3://{data_bucket}")

s3=boto3.client('s3')
textract = boto3.client('textract', region_name=region)
comprehend=boto3.client('comprehend', region_name=region)


### Download and Unzip the sample data `classification-training.zip`

In [None]:
!curl https://idp-assets-wwso.s3.us-east-2.amazonaws.com/workshop-data/classification-training.zip --output classification-training.zip

In [None]:
import shutil

try:
    shutil.unpack_archive("./classification-training.zip", extract_dir="classification-training")
    print("Document archive extracted successfully...")
    # Remove hidden files and directories that may have been packed into the archive
    for path, subdirs, files in os.walk('./classification-training'):
        for name in files:
            if name.startswith('.'):
                hidden = os.path.join(path, name)
                print(f'Removing hidden file: {hidden}')
                os.remove(hidden)
        for dirname in list(subdirs):
            if dirname.startswith('.'):
                hidden = os.path.join(path, dirname)
                print(f'Removing hidden directory: {hidden}')
                shutil.rmtree(hidden, ignore_errors=True)
                subdirs.remove(dirname)  # do not descend into the removed directory
except Exception as e:
    print("Please upload the document zip file classification-training.zip")
    raise e

### Upload sample data to S3 bucket

The sample documents are in the `/classification-training` directory. For this workshop, we will be using sample bank statements, invoices, and receipts.

In [None]:
# Upload images to S3 bucket:
!aws s3 cp classification-training s3://{data_bucket}/idp/textract --recursive --only-show-errors

### Validate the documents in S3

We will create a small utility function to verify that our documents have been uploaded to the S3 bucket. This function will also collect the document paths (S3 keys) into an array that we will use later to extract text using Amazon Textract.

In [None]:
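def get_s3_bucket_items(bucket, prefix, start_after):
    """List keys under prefix; return (all keys, label folder names, document keys)."""
    list_items=[]
    
    paginator = s3.get_paginator('list_objects_v2')
    operation_parameters = {'Bucket': bucket,
                            'Prefix': prefix,
                            'StartAfter':start_after}
    page_iterator = paginator.paginate(**operation_parameters)
    for page in page_iterator:
        # 'Contents' is absent when a page has no objects
        for item in page.get('Contents', []):
            list_items.append(item['Key'])
    # Folder part of every key, e.g. 'idp/textract/<label>/'
    names=list(set([os.path.dirname(x)+'/' for x in list_items]))
    # Keep real documents only: drop folder placeholders and notebook checkpoints
    images=[x for x in list_items if x not in names and '.ipynb_checkpoints' not in x ]
    # Strip the prefix so that only the label (folder name) remains
    names=[x.replace(prefix,'').strip('/') for x in names if  '.ipynb_checkpoints' not in x]
    return list_items, names, images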

The code cell below will list a few documents that have been uploaded to our S3 bucket.

In [None]:
docs=[]

train_objects, names, train_images=get_s3_bucket_items(data_bucket, 'idp/textract', 'idp/textract/') 
docs.append(train_images)

if type(docs[0]) is list:
    docs=[item for sublist in docs for item in sublist]
    
names, docs[-10:], docs[:10]

---
# Step 2: Extract text from sample documents using Amazon Textract and label <a id="step2"></a>

In this section we use Amazon Textract's `detect_document_text` API to extract the raw text from all the documents in S3. We also label the data according to document type; this labeled data will be used to train a custom Amazon Comprehend classifier. We define a utility function, `textract_extract_text`, that extracts the text from a document, finds which category (that is, which directory in S3) the document belongs to, and returns a labeled row `[<label>, <document_text>]`.

To extract text from a document with Amazon Textract, we use the `DetectDocumentText` API. You can call it through Boto3 as `textract.detect_document_text`; in this notebook, however, we use the `call_textract` tool that we installed earlier ([refer to `amazon-textract-caller`](https://pypi.org/project/amazon-textract-caller/) for more info).
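The labeling step relies on the folder layout in S3: each document's label is the name of the directory it sits in. Here is a minimal sketch of that idea, assuming keys of the form `idp/textract/<label>/<file>`. The helper name `label_from_key` and the sample keys are illustrative and not part of the workshop code.

```python
from pathlib import PurePosixPath


def label_from_key(key: str, prefix: str = "idp/textract") -> str:
    """Return the first path component after `prefix` as the document label."""
    # relative_to raises ValueError if the key is not under the prefix
    relative = PurePosixPath(key).relative_to(prefix)
    if len(relative.parts) < 2:
        raise ValueError(f"{key!r} is not inside a label directory under {prefix!r}")
    return relative.parts[0]


# Hypothetical keys, for illustration only
print(label_from_key("idp/textract/invoices/invoice_1.png"))
print(label_from_key("idp/textract/receipts/2023/receipt_7.png"))
```

Reading the label from a fixed position in the key is one possible design. It avoids mislabeling a document whose file name happens to contain the name of another category.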

In [None]:
def textract_extract_text(document, bucket=data_bucket):        
    try:
        print(f'Processing document: {document}')
        lines = ""
        row = []
        
        # using amazon-textract-caller
        response = call_textract(input_document=f's3://{bucket}/{document}') 
        # using pretty printer to get all the lines
        lines = get_string(textract_json=response, output_type=[Textract_Pretty_Print.LINES])
        
        label = [name for name in names if(name in document)]  
        row.append(label[0])
        row.append(lines)        
        return row
    except Exception as e:
        print (e)

Call the Textract function defined above for all the documents to extract text.

<div class="alert alert-block alert-warning">
    <b>⚠️ Note:</b> The code below uses <a href="https://docs.python.org/3/library/multiprocessing.html" target="_blank">multiprocessing</a> and can cause Amazon Textract API throttling due to <a href="https://docs.aws.amazon.com/general/latest/gr/textract.html" target="_blank">Amazon Textract soft limits</a>. The code below should work fine for this workshop in an <b>ml.t3.medium</b> SageMaker instance with 2 vCPUs. Please exercise caution when using similar patterns in production, and make sure to implement proper <a href="https://docs.aws.amazon.com/general/latest/gr/api-retries.html" target="_blank">exponential backoff</a> in your code. The code below is for demonstration purposes only.
</div>
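The exponential backoff mentioned in the note can be sketched roughly as follows. This is a generic illustration, not the workshop's implementation: the helper name `call_with_backoff`, its parameters, and the choice of full jitter are all assumptions, and in practice you would narrow `retry_on` to the throttling errors your client actually raises.

```python
import random
import time


def call_with_backoff(func, *args, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func, retrying with exponential backoff and full jitter on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay ceiling doubles each attempt (base, 2*base, 4*base, ...), capped at max_delay
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# Demo with a stand-in function that fails twice before succeeding
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated throttling")
    return "ok"


print(call_with_backoff(flaky_call, base_delay=0.01))
```

The random jitter spreads retries from parallel workers over time, so they do not all hit the API again at the same moment.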

In [None]:
pool = mp.Pool(mp.cpu_count())
pool_results = [pool.apply_async(textract_extract_text, (document,data_bucket)) for document in docs]
labeled_collection = [res.get() for res in pool_results]
pool.close()

---
# Step 3: Prepare a CSV training dataset for Amazon Comprehend custom classifier training<a id="step3"></a>

Now that we have extracted and labeled the text from our documents, we will create the training data for an [Amazon Comprehend custom classification model](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html). Let's take a look at the labeled data. We have 100 samples of each document type, so we should have about 300 rows of labeled data.
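Because `textract_extract_text` prints the error and returns nothing when extraction fails, the collected results can contain `None` entries. A quick sanity check before building the DataFrame might look like the sketch below. The helper name `summarize_labels` and the sample rows are hypothetical.

```python
from collections import Counter


def summarize_labels(rows):
    """Drop failed extractions (None) and count the remaining rows per label."""
    valid = [row for row in rows if row is not None]
    counts = Counter(label for label, _text in valid)
    return valid, counts


# Hypothetical rows in the [label, text] shape produced above
rows = [
    ["invoices", "Invoice #1 ..."],
    None,  # a document whose extraction failed
    ["receipts", "Total 4.20"],
    ["invoices", "Invoice #2 ..."],
]
valid, counts = summarize_labels(rows)
print(len(valid), dict(counts))
```

A large gap between the per-label counts and the number of uploaded documents points to throttled or failed Textract calls.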

In [None]:
comprehend_df = pd.DataFrame(labeled_collection, columns=['label','document'])
comprehend_df

We will create a training dataset from the extracted text and upload it to Amazon S3. The training data file will be written in `CSV` format and will be named `comprehend_train_data.csv`. Note that you can have more than one `CSV` file in an S3 bucket for training a Comprehend custom classifier. If you have more than one file, you can specify only the bucket/prefix in the call to train the custom classifier. Amazon Comprehend will automatically use all the files under the bucket/prefix for training purposes.
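The training file has two columns and no header row: the label first, then the document text. As a rough illustration of that layout, here is a sketch that uses only the standard library. The helper name `to_training_csv` and the sample rows are hypothetical, and collapsing line breaks is an optional choice in this sketch, not something the notebook does.

```python
import csv
import io


def to_training_csv(rows):
    """Serialize (label, text) pairs as a two-column CSV without a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, text in rows:
        # Collapse line breaks and repeated spaces so each document is one physical line
        writer.writerow([label, " ".join(text.split())])
    return buffer.getvalue()


# Hypothetical sample rows
sample_rows = [
    ("receipts", "Coffee\nTotal: 4.20"),
    ("invoices", 'Invoice "A-1", due 30 days'),
]
print(to_training_csv(sample_rows))
```

The `csv` writer quotes any field that contains a comma or a quote character, so document text with punctuation still occupies a single column.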

The following code cells will upload the training data to the S3 bucket and create a Comprehend custom classifier. You can also create a custom classifier manually; see the subsequent sections for instructions on how to do that.

In [None]:
# Upload Comprehend training data to S3
key='idp/comprehend/comprehend_train_data.csv'

comprehend_df.to_csv("comprehend_train_data.csv", index=False, header=False)
s3.upload_file(Filename='comprehend_train_data.csv', 
               Bucket=data_bucket, 
               Key=key)


---
# Step 4: Create Amazon Comprehend Classification training job <a id="step4"></a>

Once we have a labeled dataset ready, we are going to create and train an [Amazon Comprehend custom classification model](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html) with the dataset.

In [None]:
# If you have imported a custom classifier model, uncomment the following code lines, replace <model_arn> with the model's ARN, and execute this code cell.
# If you execute this code cell then you can skip the model training and move to Step 5.

# document_classifier_arn = "<model_arn>"
# %store document_classifier_arn

### Create Amazon Comprehend custom classification Training Job

We will use Amazon Comprehend's Custom Classification to train our own model for classifying the documents. We will use Amazon Comprehend `CreateDocumentClassifier` API to create a classifier which will train a custom model using the labeled data CSV file we created above. The training data contains extracted text, that was extracted using Amazon Textract, and then labeled.

In [None]:
f's3://{data_bucket}/{key}'

In [None]:
# Create a document classifier
account_id = boto3.client('sts').get_caller_identity().get('Account')
timestamp = str(int(datetime.datetime.now().timestamp()))  # epoch seconds; strftime("%s") is not portable, and `id` shadows a builtin

document_classifier_name = 'Sample-Doc-Classifier-IDP'
document_classifier_version = 'Sample-Doc-Classifier-IDP-v1'
document_classifier_arn = ''
response = None

try:
    create_response = comprehend.create_document_classifier(
        InputDataConfig={
            'DataFormat': 'COMPREHEND_CSV',
            'S3Uri': f's3://{data_bucket}/{key}'
        },
        DataAccessRoleArn=role,
        DocumentClassifierName=document_classifier_name,
        VersionName=document_classifier_version,
        LanguageCode='en',
        Mode='MULTI_CLASS'
    )
    
    document_classifier_arn = create_response['DocumentClassifierArn']
    
    print(f"Comprehend Custom Classifier created with ARN: {document_classifier_arn}")
# ClientError carries the service error code; any other exception propagates unchanged
except botocore.exceptions.ClientError as error:
    if error.response['Error']['Code'] == 'ResourceInUseException':
        print(f'A classifier with the name "{document_classifier_name}" already exists.')
        document_classifier_arn = f'arn:aws:comprehend:{region}:{account_id}:document-classifier/{document_classifier_name}/version/{document_classifier_version}'
        print(f'The classifier ARN is: "{document_classifier_arn}"')
    else:
        print(error)

In [None]:
%store document_classifier_arn


Check status of the Comprehend Custom Classification Job

In [None]:
%%time
# Loop through and wait for the training to complete (the loop gives up after 3 hours).
from IPython.display import clear_output
import time
from datetime import datetime

jobArn = document_classifier_arn  # also set when the classifier already existed

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    describe_custom_classifier = comprehend.describe_document_classifier(
        DocumentClassifierArn = jobArn
    )
    status = describe_custom_classifier["DocumentClassifierProperties"]["Status"]
    clear_output(wait=True)
    print(f"{current_time} : Custom document classifier: {status}")
    
    if status == "TRAINED" or status == "IN_ERROR":
        break
        
    time.sleep(60)
    


Alternatively, to create a Comprehend custom classifier manually, go to the [Amazon Comprehend Console](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#classification) and follow these steps:
  
- On the left menu, click "Custom Classification"
- In the "Classifier models" section, click on "Create new model"
- In Model Setting, enter a name for Model name
- In Data Specification, select "Using Single-label" mode, and for Data format select CSV file
- For Training dataset, browse to the data bucket created above and select the file `comprehend_train_data.csv`
- For IAM role, select "Create an IAM role" and specify a prefix (this will create a new IAM role for Comprehend)
- Click create

This job can take ~30 minutes to complete. Once the training job is completed, move on to the next step.
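The same poll-and-wait pattern appears several times in this notebook (classifier training, the classification job, and endpoint creation). A small reusable helper could look like the sketch below. The name `wait_for_status` and its parameters are assumptions made for illustration, and the commented usage relies on the `comprehend` client and `document_classifier_arn` defined earlier.

```python
import time


def wait_for_status(get_status, terminal_states, interval=60, timeout=3 * 60 * 60,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a terminal state or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout} seconds")
        sleep(interval)


# Possible usage with the names from the cells above:
# status = wait_for_status(
#     lambda: comprehend.describe_document_classifier(
#         DocumentClassifierArn=document_classifier_arn
#     )["DocumentClassifierProperties"]["Status"],
#     terminal_states={"TRAINED", "IN_ERROR"},
# )
```

Passing the status lookup as a function keeps the helper independent of any particular AWS API, and raising on timeout makes a stuck job visible instead of silently ending the loop.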

---
# Step 5: Classify documents with Amazon Comprehend custom classifier

In this step we will use the Amazon Comprehend custom classification model to classify sample documents. We will use the `start_document_classification_job` API to launch an asynchronous job to classify the documents. This API supports documents in their native format (PDF/PNG/JPG/TIF) and can use Amazon Textract behind the scenes to read the text from the documents and subsequently determine the document class. Let's start by uploading our sample documents to the S3 bucket.

In [None]:
!aws s3 cp ./samples s3://{data_bucket}/idp/comprehend --recursive --only-show-errors

Amazon Comprehend asynchronous classification works with PDF, PNG, and JPEG files, as well as UTF-8 encoded plaintext files. Since the sample documents under the `samples` directory are PNG files, we will specify a `DocumentReadAction` and use Amazon Textract with the `TEXTRACT_DETECT_DOCUMENT_TEXT` option. This tells Amazon Comprehend to use the Amazon Textract `DetectDocumentText` API behind the scenes to extract the text and then perform classification. For `InputFormat`, we will use `ONE_DOC_PER_FILE` mode, which signifies that each file is a single document. The other mode, `ONE_DOC_PER_LINE`, treats every line in a plaintext file as a document and is best suited for small documents such as product reviews or customer service chat transcripts. For more information, see the [documentation](https://docs.aws.amazon.com/comprehend/latest/dg/what-is.html).
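The two input modes described above translate into two shapes of `InputDataConfig`. The sketch below only builds the dictionaries. The helper name `build_input_data_config`, its `one_doc_per_line` flag, and the bucket name are hypothetical, while the keys and values are the ones used in this notebook.

```python
def build_input_data_config(s3_uri, one_doc_per_line=False):
    """Build an InputDataConfig dict for start_document_classification_job."""
    if one_doc_per_line:
        # Plaintext input where every line is its own short document
        return {"S3Uri": s3_uri, "InputFormat": "ONE_DOC_PER_LINE"}
    # Native documents (such as PNG): one document per file, read with Textract
    return {
        "S3Uri": s3_uri,
        "InputFormat": "ONE_DOC_PER_FILE",
        "DocumentReaderConfig": {
            "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
            "DocumentReadMode": "FORCE_DOCUMENT_READ_ACTION",
        },
    }


# Hypothetical bucket name, for illustration only
print(build_input_data_config("s3://example-bucket/idp/comprehend/mixedbag/"))
```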

In [None]:
import uuid

jobname = f'doc-classification-job-{uuid.uuid1()}'
print(f'Starting Comprehend Classification job {jobname} with model {document_classifier_arn}')

response = comprehend.start_document_classification_job(
    JobName=jobname,
    DocumentClassifierArn=document_classifier_arn,
    InputDataConfig={
        'S3Uri': f's3://{data_bucket}/idp/comprehend/mixedbag/',
        'InputFormat': 'ONE_DOC_PER_FILE',
        'DocumentReaderConfig': {
            'DocumentReadAction': 'TEXTRACT_DETECT_DOCUMENT_TEXT',
            'DocumentReadMode': 'FORCE_DOCUMENT_READ_ACTION'
        }
    },
    OutputDataConfig={
        'S3Uri': f's3://{data_bucket}/idp/comprehend/doc-class-output/'
    },
    DataAccessRoleArn=role
)

response

## Check status of the classification job

The code block below checks the status of the classification job. Once the job completes, it downloads the output predictions. The output is a compressed tar archive that contains the inference result for each of the documents being classified, as well as the output of the Textract operation performed by Amazon Comprehend.

In [None]:
%%time
# Loop through and wait for the classification job to complete.
import time
import json
from datetime import datetime
import tarfile
import os

classify_response=response
max_time = time.time() + 3*60*60 # 3 hours
documents=[]

while time.time() < max_time:
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    describe_job = comprehend.describe_document_classification_job(
        JobId=classify_response['JobId']
    )
    status = describe_job["DocumentClassificationJobProperties"]["JobStatus"]

    print(f"{current_time} : Custom document classifier Job: {status}")
    
    if status == "COMPLETED" or status == "FAILED":
        if status == "COMPLETED":
            classify_output_file = describe_job["DocumentClassificationJobProperties"]["OutputDataConfig"]["S3Uri"]
            print(f'Output generated - {classify_output_file}')
            !mkdir -p classification-output
            !aws s3 cp {classify_output_file} ./classification-output
            
            opfile = os.path.basename(classify_output_file)
            # open and extract the output archive; the context manager closes it afterwards
            with tarfile.open(f'./classification-output/{opfile}') as archive:
                archive.extractall('./classification-output')
            
            for file in os.listdir('./classification-output'):
                if file.endswith('.out'):
                    with open(f'./classification-output/{file}', 'r') as f:
                        documents.append(dict(file=file, classification_output=json.load(f)['Classes']))        
        else:
            print("Classification job failed")
            print(describe_job)
        break
        
    time.sleep(10)

In [None]:
documents

Let's take a look at the Amazon Comprehend classification output. We have collected the output for all the files in the `documents` variable. The script above downloads and extracts the output archive locally, so you can navigate into the `classification-output` directory from the file browser panel on the left and inspect the files manually.
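Each entry in `documents` carries a list of classes, each with a `Name` and a `Score`. Picking the predicted document type amounts to taking the class with the highest score, as in this minimal sketch. The helper name `top_class`, the label names, and the scores are made up for illustration.

```python
def top_class(classes):
    """Return (name, score) for the highest-scoring entry in a Classes list."""
    if not classes:
        raise ValueError("empty Classes list")
    best = max(classes, key=lambda c: c["Score"])
    return best["Name"], best["Score"]


# Hypothetical output for one document; real labels and scores will differ
sample = [
    {"Name": "invoices", "Score": 0.91},
    {"Name": "receipts", "Score": 0.06},
    {"Name": "bank-statements", "Score": 0.03},
]
print(top_class(sample))
```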

In [None]:
classification = []
for doc in documents:
    document = []    
    classes_df = pd.DataFrame(doc['classification_output'])
    result = classes_df.loc[classes_df['Score'].idxmax()]  # idxmax returns an index label, so use .loc
    document.extend([doc['file'].replace(".out",""), result.Name, result.Score])    
    classification.append(document)
    
doc_class_df = pd.DataFrame(classification, columns = ['Document', 'DocType', 'Confidence'])
doc_class_df                                                      

The documents under the `samples/mixedbag` folder have now been classified. We will upload them to S3 under a prefix that matches their label.

In [None]:
print(f'Default bucket is : s3://{data_bucket}')

In [None]:
root='idp/comprehend/classified-docs'

def upload_classified_docs(filename,prefix):
    document = os.path.basename(filename)
    key = f'{root}/{prefix}/{document}'
    print(f'Uploading: {filename}...')
    res = s3.upload_file(Filename=f"./samples/mixedbag/{filename}", 
                   Bucket=data_bucket, 
                   Key=key)
    return f'{root}/{prefix}/{document}'
    
doc_class_df['s3path'] = doc_class_df.apply(lambda row : upload_classified_docs(row['Document'],row['DocType']), axis = 1)

#verify uploads
[objects['Key'] for objects in s3.list_objects(Bucket=data_bucket, Prefix=f"{root}/")['Contents']]

The Comprehend classification process has also generated the Textract output for the documents (under `classification-output/amazon-textract-output`). This directory contains a folder for each document with the Amazon Textract JSON response. Let's load the plain text of each of these documents into the data frame. We use the pretty printer tool to get the LINES out of the documents.
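To make the shape of that JSON concrete: a Textract response holds a list of `Blocks`, and the text lines are the blocks whose `BlockType` is `LINE`. The sketch below pulls them out by hand and roughly approximates what the pretty printer's LINES output gives. The helper name `lines_from_textract` and the sample fragment are hypothetical.

```python
def lines_from_textract(response):
    """Join the text of all LINE blocks in a Textract response, one per line."""
    return "\n".join(
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    )


# Tiny hand-written response fragment, for illustration only
sample = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "ACME Bank"},
        {"BlockType": "WORD", "Text": "ACME"},
        {"BlockType": "LINE", "Text": "Statement period: May"},
    ]
}
print(lines_from_textract(sample))
```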

In [None]:
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string
import json

def get_text(doc):
    with open(f'classification-output/amazon-textract-output/{doc}/1', 'r') as myfile:
        data=myfile.read()
    obj = json.loads(data)
    text = get_string(textract_json=obj, output_type=[Textract_Pretty_Print.LINES])
    return text

doc_class_df['DocText'] = doc_class_df.apply(lambda row : get_text(row['Document']), axis = 1)
doc_class_df

Finally, we keep a copy of the extracted text of the documents in the mixedbag directory for use later in Notebook 3 (Document Enrichment).

In [None]:
doc_class_df.to_csv('extracted_doc.csv')
#Upload dataframe as csv to S3
s3.upload_file(Filename='extracted_doc.csv', 
               Bucket=data_bucket, 
               Key=f'idp/comprehend/extracted/extracted_doc.csv')

---
# Step 6: Create Amazon Comprehend real time endpoint _(optional)_ <a id="step5"></a>

Once our Comprehend custom classifier is fully trained (i.e. status = `TRAINED`), we can create a real-time endpoint. We will use this endpoint to classify documents in real time. The following code cells use the `comprehend` Boto3 client to create an endpoint, but you can also create one manually via the console. Instructions on how to do that can be found in the subsequent section.

In [None]:
#create comprehend endpoint
model_arn = document_classifier_arn
ep_name = 'idp-endpoint'

try:
    endpoint_response = comprehend.create_endpoint(
        EndpointName=ep_name,
        ModelArn=model_arn,
        DesiredInferenceUnits=1,    
        DataAccessRoleArn=role
    )
    ENDPOINT_ARN=endpoint_response['EndpointArn']
    print(f'Endpoint created with ARN: {ENDPOINT_ARN}')    
# ClientError carries the service error code; any other exception propagates unchanged
except botocore.exceptions.ClientError as error:
    if error.response['Error']['Code'] == 'ResourceInUseException':
        print(f'An endpoint with the name "{ep_name}" already exists.')
        ENDPOINT_ARN = f'arn:aws:comprehend:{region}:{account_id}:document-classifier-endpoint/{ep_name}'
        print(f'The classifier endpoint ARN is: "{ENDPOINT_ARN}"')
        %store ENDPOINT_ARN
    else:
        print(error)
    

In [None]:
%store ENDPOINT_ARN

In [None]:
display(endpoint_response)


Alternatively, use the steps below to create a Comprehend endpoint using the AWS console.

- Go to [Comprehend on AWS Console](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#endpoints) and click on Endpoints in the left menu.
- Click on "Create endpoint"
- Give an Endpoint name; for Custom model type select Custom classification; for version select no version or the latest version of the model.
- For Classifier model select from the drop down menu
- For Inference Unit select 1
- Check "Acknowledge"
- Click "Create endpoint"

It may take ~15 minutes for the endpoint to be created; you can follow its progress on the [Endpoints page](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#endpoints) of the console. The code cell below checks the creation status.


In [None]:
%%time
# Loop through and wait for the endpoint to be created.
from IPython.display import clear_output
import time
from datetime import datetime

ep_arn = ENDPOINT_ARN  # also set when the endpoint already existed

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    describe_endpoint_resp = comprehend.describe_endpoint(
        EndpointArn=ep_arn
    )
    status = describe_endpoint_resp["EndpointProperties"]["Status"]
    clear_output(wait=True)
    print(f"{current_time} : Custom document classifier endpoint: {status}")
    
    if status == "IN_SERVICE" or status == "FAILED":
        break
        
    time.sleep(10)
    

---
# Step 7: Classify a document with the real-time endpoint _(optional)_ <a id="step6"></a>

Once the endpoint has been created, we will take the mix of documents under the `/samples/mixedbag/` directory and try to classify each one as a bank statement, an invoice, or a receipt.

In [None]:
import random

root = "./samples/mixedbag"
files = []

for file in os.listdir(root):
    if not file.startswith('.'):
        files.append(f'./samples/mixedbag/{file}')

files_df = pd.DataFrame(files, columns=["Document"])
files_df


Let's view one of the documents

In [None]:
file = files_df.sample().iloc[0]['Document']
display(Image(filename=file, width=400, height=500))

To classify this sample document, we first read it into a byte array and then call the Amazon Comprehend `classify_document` API. Since `classify_document` is a real-time (synchronous) API, we call it with the bytes of the sample document shown above. As before, we let Amazon Comprehend use Amazon Textract behind the scenes to read the document before classifying it, which limits us to single-page documents.

In [None]:
import time 
from datetime import datetime

def classify_doc(document):
    imageBytes = None
    # Read the document as raw bytes; the file handle gets its own name
    # so that it does not shadow the `document` path argument
    with open(document, 'rb') as doc_file:
        imageBytes = bytearray(doc_file.read())
    try:
        response = comprehend.classify_document(
            Bytes = imageBytes,
            EndpointArn=ENDPOINT_ARN,
            DocumentReaderConfig={
                'DocumentReadAction': 'TEXTRACT_DETECT_DOCUMENT_TEXT',
                'DocumentReadMode': 'FORCE_DOCUMENT_READ_ACTION'
            }
        )
               
        return response      # return the corresponding response
    except Exception as e:
        print(e)
        return 'error'

Let's now run the inference on our sample document.

In [None]:
classification_op = classify_doc(file)
JSON(classification_op)

---

# Cleanup

Cleanup is optional if you want to execute subsequent notebooks. 

Refer to the `05-idp-cleanup.ipynb` for cleanup and deletion of resources.

---
# Conclusion

In this notebook we trained an Amazon Comprehend custom classifier on our sample documents: we extracted the text from the documents using Amazon Textract and labeled it to build a training dataset in CSV format. We then used the trained classifier to classify documents with an asynchronous job, and created an Amazon Comprehend real-time endpoint to perform classification of documents in real time.

In the next notebook we will look at a few methods to perform extraction of key insights from our documents using Amazon Textract.