# Mortgage Document Extraction

---

At this point we have identified the documents, so we can start extracting information from them. What we extract from each document depends on its type, as depicted in the figure below.

<p align="center">
  <img src="./images/extraction.png" alt="exn" width="800px"/>
</p>



---

## Setup Notebook

We will be using the [Amazon Textract Parser Library](https://github.com/aws-samples/amazon-textract-response-parser/tree/master/src-python) to parse the Textract response, the data science library [Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for content analysis, the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/), and the [AWS boto3 Python SDK](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to work with Amazon Textract and Amazon A2I. Let's now install and import them.

In [None]:
!python -m pip install -q amazon-textract-response-parser --upgrade --force-reinstall
!python -m pip install -q amazon-textract-caller --upgrade --force-reinstall
!python -m pip install -q amazon-textract-prettyprinter --upgrade --force-reinstall
!python -m pip install pdfplumber
!sed -i '/disable ghostscript format types/,+6d' /etc/ImageMagick-6/policy.xml

In [None]:
#Restart the kernel
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
import boto3
import botocore
import sagemaker
import os
import io
import datetime
import json
import pandas as pd
from PIL import Image as PImage, ImageDraw
from pathlib import Path
from IPython.display import Image, display, HTML, JSON, IFrame
from textractcaller import call_textract_lending
from textractprettyprinter.t_pretty_print import convert_lending_from_trp2
import trp.trp2_lending as tl
import multiprocessing as mp
from trp import Document
from sagemaker import get_execution_role


# variables
data_bucket = sagemaker.Session().default_bucket()
region = boto3.session.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')

os.environ["BUCKET"] = data_bucket
os.environ["REGION"] = region
role = sagemaker.get_execution_role()

print(f"SageMaker role is: {role}\nDefault SageMaker Bucket: s3://{data_bucket}")

s3=boto3.client('s3')
textract = boto3.client('textract', region_name=region)

In [None]:
documentName = "docs/lending_package.pdf"
display(IFrame(documentName, 500, 600));

---
# Extracting information from documents


In module 1 (Document classification), we identified the following documents in the lending packet. We will now extract information from each of these mortgage documents.

- Page Number: 1 , Page Classification: `PAYSLIPS`
- Page Number: 2 , Page Classification: `CHECKS`
- Page Number: 3 , Page Classification: `IDENTITY_DOCUMENT`
- Page Number: 4 , Page Classification: `1099_DIV`
- Page Number: 5 , Page Classification: `BANK_STATEMENT`
- Page Number: 6 , Page Classification: `W_2`
- Page Number: 7 , Page Classification: `UNCLASSIFIED`
    
    
We will use Amazon Textract's [Analyze Lending API](https://docs.aws.amazon.com/textract/latest/dg/analyzing-document-text.html) to extract key-value pairs from the documents. Note that in some cases (for example, for the sample check in this packet) the API will also detect the presence of [signatures](https://docs.aws.amazon.com/textract/latest/dg/lending-response-objects.html).

For the `UNCLASSIFIED` document, i.e., page 7 (a sample homeowners insurance application form), we will use the Amazon Textract Queries feature to extract information.


---
## Load the Output JSON response that was generated in the previous module

In [None]:
with open('lending-doc-output.json') as textract_json:
    data = json.load(textract_json)

data

## Use the Textract Response Library to convert the response to an easier format to work with

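The JSON is converted to a "flattened", CSV-like list of rows, which is easier to work with. Each row has the following columns:

- {page_classification}_{page_number_within_document_type} (the page classification and the page number within that document type, combined into a single column)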
- page_number_in_document
- key
- key_confidence
- value
- value_confidence
- key-bounding-box.top
- key-bounding-box.height
- key-bb.width
- key-bb.left
- value-bounding-box.top
- value-bb.height
- value-bb.width
- value-bb.left


Here, we use the Textract Response Parser helper library:
[Link](https://pypi.org/project/amazon-textract-response-parser/)
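
To make the flattening step more concrete, here is a minimal, standard-library sketch of the idea. It is not the library's implementation: `flatten_lending` is a hypothetical helper, `sample_response` is a hand-made and heavily trimmed stand-in for an Analyze Lending result (made-up field names and values, geometry omitted), and using the field `Type` as the row key is an assumption made for illustration.

```python
from typing import Any, Dict, List

# Hand-made, trimmed stand-in for an Analyze Lending result (illustrative values only).
sample_response: Dict[str, Any] = {
    "Results": [
        {
            "Page": 1,
            "PageClassification": {
                "PageType": [{"Value": "PAYSLIPS", "Confidence": 99.9}],
                "PageNumber": [{"Value": "undetected", "Confidence": 100.0}],
            },
            "Extractions": [
                {
                    "LendingDocument": {
                        "LendingFields": [
                            {
                                "Type": "EMPLOYEE_NAME",
                                "KeyDetection": {"Text": "Employee", "Confidence": 98.0},
                                "ValueDetections": [{"Text": "Jane Doe", "Confidence": 97.5}],
                            }
                        ]
                    }
                }
            ],
        }
    ]
}


def flatten_lending(response: Dict[str, Any]) -> List[List[Any]]:
    """Hypothetical helper: one row per detected value (bounding boxes omitted)."""
    rows: List[List[Any]] = []
    for result in response.get("Results", []):
        classification = result["PageClassification"]
        page_type = classification["PageType"][0]["Value"]
        page_number = classification["PageNumber"][0]["Value"]
        label = f"{page_type}_{page_number}"
        for extraction in result.get("Extractions", []):
            lending_doc = extraction.get("LendingDocument") or {}
            for field in lending_doc.get("LendingFields", []):
                key_detection = field.get("KeyDetection") or {}
                for value in field.get("ValueDetections", []):
                    rows.append([
                        label,
                        result["Page"],
                        field["Type"],                    # assumption: field type used as the key
                        key_detection.get("Confidence"),
                        value.get("Text"),
                        value.get("Confidence"),
                    ])
    return rows


flat_rows = flatten_lending(sample_response)
print(flat_rows)
```

The sketch only covers the first six columns; the real helper also emits the bounding-box columns listed above.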

In [None]:
trp2_doc: tl.TFullLendingDocument = tl.TFullLendingDocumentSchema().load(data)
lending_array = convert_lending_from_trp2(trp2_doc)

In [None]:
for row in lending_array:
    print(row)

We will use [Pandas DataFrames](https://pandas.pydata.org/); Pandas is a widely used Python library in machine learning and data science. We will store the `lending_array` in a DataFrame so that we can extract and analyze the data by document label/classification.

In [None]:
import pandas as pd

df = pd.DataFrame(lending_array, columns = ['{page_classification}_{page_number_within_document_type}', 'page_number_in_document', 'key', 'key_confidence', 'value', 'value_confidence', 'key-bounding-box.top', 'key-bounding-box.height', 'key-bb.width', 'key-bb.left', 'value-bounding-box.top', 'value-bb.height', 'value-bb.width', 'value-bb.left'])
df
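
The per-document sections that follow all apply the same pattern: select the rows for one classification label and look at the `key` and `value` columns. As a rough, standard-library sketch of that pattern (with an optional confidence threshold added for illustration), the hypothetical helper `key_values_for` below works on rows whose first six columns follow the layout described above. The sample rows are made up.

```python
from typing import Any, List, Sequence, Tuple

# Made-up rows using the first six columns of the flattened layout:
# label, page number, key, key confidence, value, value confidence.
sample_rows: List[List[Any]] = [
    ["PAYSLIPS_undetected", 1, "PAY_DATE", 99.0, "7/18/2008", 98.5],
    ["PAYSLIPS_undetected", 1, "NET_PAY", 99.0, "$ 291.90", 42.0],
    ["CHECKS_undetected", 2, "AMOUNT", 99.0, "$ 100.00", 97.0],
]


def key_values_for(
    rows: Sequence[Sequence[Any]],
    label: str,
    min_value_confidence: float = 0.0,
) -> List[Tuple[str, str]]:
    """Hypothetical helper: (key, value) pairs for one classification label."""
    selected: List[Tuple[str, str]] = []
    for row in rows:
        if row[0] != label:
            continue  # row belongs to a different document type
        if float(row[5]) < min_value_confidence:
            continue  # value confidence is below the requested threshold
        selected.append((row[2], row[4]))
    return selected


payslip_pairs = key_values_for(sample_rows, "PAYSLIPS_undetected", min_value_confidence=90.0)
print(payslip_pairs)
```

A threshold like this is one possible way to route low-confidence values to manual review; the right cut-off depends on the use case.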

The following sections cover the document pages as they were classified in the previous document classification step.

## 1. Payslip <a id="step1"></a>


### Get info (key-value) pairs from the response

In this section we will get the extractions in the document in key-value pair format.

In [None]:
payslip_df = pd.DataFrame()
payslip_df = df.loc[df['{page_classification}_{page_number_within_document_type}'] == "PAYSLIPS_undetected"].set_index('{page_classification}_{page_number_within_document_type}')

payslip_df[['key','value']]

---
## 2. Check <a id="step2"></a>


### Get info (key-value) pairs from the response

In this section we will get the extractions in the document in key-value pair format.

> **_NOTE:_**  For this document, the following key-value pair included in the response: "SIGNATURE – SIGNATURE_PRESENT" indicates that a signature was found in the document.

In [None]:
checks_df = pd.DataFrame()
checks_df = df.loc[df['{page_classification}_{page_number_within_document_type}'] == "CHECKS_undetected"].set_index('{page_classification}_{page_number_within_document_type}')

checks_df[['key','value']]
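
The signature note above can also be turned into a programmatic check. The sketch below is standard-library only and assumes the flattened row layout used in this notebook (label in column 0, key in column 2, value in column 4); `has_signature` is a hypothetical helper and the sample rows are made up.

```python
from typing import Any, List, Sequence

# Made-up rows in the flattened layout: label, page, key, key confidence, value, value confidence.
sample_rows: List[List[Any]] = [
    ["CHECKS_undetected", 2, "AMOUNT", 99.0, "$ 100.00", 97.0],
    ["CHECKS_undetected", 2, "SIGNATURE", 90.0, "SIGNATURE_PRESENT", 90.0],
    ["PAYSLIPS_undetected", 1, "PAY_DATE", 99.0, "7/18/2008", 98.5],
]


def has_signature(rows: Sequence[Sequence[Any]], label: str) -> bool:
    """Hypothetical helper: True if a SIGNATURE / SIGNATURE_PRESENT row exists for the label."""
    return any(
        row[0] == label and row[2] == "SIGNATURE" and row[4] == "SIGNATURE_PRESENT"
        for row in rows
    )


print(has_signature(sample_rows, "CHECKS_undetected"))
print(has_signature(sample_rows, "PAYSLIPS_undetected"))
```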

---
## 3. Identity Document <a id="step3"></a>

### Get info (key-value) pairs from the response

In this section we will get the extractions in the document in key-value pair format.

In [None]:
ID_df = pd.DataFrame()
ID_df = df.loc[df['{page_classification}_{page_number_within_document_type}'] == "IDENTITY_DOCUMENT_undetected"].set_index('{page_classification}_{page_number_within_document_type}')

ID_df[['key','value']]

---
## 4. Homeowners Insurance Application - _Amazon Textract Queries example_ <a id="step6"></a>

During document classification, this page was labeled `UNCLASSIFIED`, so we will use the Amazon Textract Queries feature to extract key information from it.

### Upload the unclassified document to S3

We will extract the unclassified page and upload it to Amazon S3. Note that the document will be uploaded to SageMaker's default S3 bucket. If you wish to use a different bucket, update the bucket name in the `data_bucket` variable and make sure that SageMaker has permissions to that bucket.

In [None]:
import pdfplumber
input_pdf = "docs/lending_package.pdf"

with pdfplumber.open(input_pdf) as pdf:
    page7 = pdf.pages[6]
    image = page7.to_image(resolution=350).original
    image.save("docs/unclassified.pdf")

In [None]:
# Upload the extracted page to the S3 bucket
!aws s3 cp docs/unclassified.pdf s3://{data_bucket}/idp/textract/ --only-show-errors

In [None]:
file_key="idp/textract/unclassified.pdf"
display(IFrame(src=s3.generate_presigned_url('get_object', Params={'Bucket': data_bucket, 'Key': file_key}), width=600, height=700))

As we can see, the homeowners insurance application is a document containing dense text. In this case we are interested in only a few key pieces of information from the entire document. Instead of extracting all the text and then applying custom logic (for example, regular expressions) to locate that information, we use the Amazon Textract Queries feature to retrieve it directly from the document.

Specifically, we are looking for the following information:

1. Named insured.
2. Policy number.
3. Effective date.
4. Insurance company.

We will phrase questions in plain English and pass them in the Textract API call to get this information. The queries are:

1. Who is the named insured?
2. What is the policy number?
3. What is the effective date?
4. What is the insurance company name?

Also, all of this information is available in the single-page PDF document, so Textract does not need to search through multiple pages to find it. We will pass the page number when making the API call. Note: If the page number is not known, the `pages` parameter can be omitted.
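
For readers who prefer to see the underlying request, the sketch below builds the kind of keyword arguments that boto3's Textract `analyze_document` call accepts for the Queries feature. It only constructs a dictionary and makes no API call. `build_queries_request`, the bucket name, and the object key are hypothetical placeholders.

```python
from typing import Any, Dict, Optional, Sequence


def build_queries_request(
    bucket: str,
    key: str,
    questions: Dict[str, str],
    pages: Optional[Sequence[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: keyword arguments for a Queries request on an S3 document."""
    queries = []
    for alias, text in questions.items():
        query: Dict[str, Any] = {"Text": text, "Alias": alias}
        if pages:
            # Restrict the query to specific pages; leave out to search every page.
            query["Pages"] = list(pages)
        queries.append(query)
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {"Queries": queries},
    }


questions = {
    "NAMED_INSURED": "Who is the named insured?",
    "POLICY_NUMBER": "What is the policy number?",
    "EFFECTIVE_DATE": "What is the effective date?",
    "INSURANCE_COMPANY": "What is the insurance company name?",
}

# "example-bucket" and the key are placeholders, not values from this notebook's account.
request = build_queries_request("example-bucket", "idp/textract/unclassified.pdf", questions, pages=["1"])
print(request["QueriesConfig"]["Queries"][0])
# With credentials in place, the request could be sent as: textract.analyze_document(**request)
```

The `textractcaller` helpers used in the next cell wrap this kind of request, so the notebook does not need to assemble it by hand.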

In [None]:
from textractcaller import QueriesConfig, Query
from textractcaller.t_call import call_textract, Textract_Features

# Setup the queries
query1 = Query(text="Who is the named insured?" , alias="NAMED_INSURED", pages=["1"])
query2 = Query(text="What is the policy number?", alias="POLICY_NUMBER", pages=["1"])
query3 = Query(text="What is the effective date?", alias="EFFECTIVE_DATE", pages=["1"])
query4 = Query(text="What is the insurance company name?", alias="INSURANCE_COMPANY", pages=["1"])

#Setup the query config with the above queries
queries_config = QueriesConfig(queries=[query1, query2, query3, query4])

response_unclassified = call_textract(input_document=f's3://{data_bucket}/{file_key}',
                          features=[Textract_Features.QUERIES],
                          queries_config=queries_config)
doc_unclassified = Document(response_unclassified)

In [None]:
import trp.trp2 as t2

# Deserialize the raw response; load() returns a TDocument, not the schema class
doc_unclassified: t2.TDocument = t2.TDocumentSchema().load(response_unclassified)
    
entities = {}
for page in doc_unclassified.pages:
    query_answers = doc_unclassified.get_query_answers(page=page)
    if query_answers:
        for answer in query_answers:
            entities[answer[1]] = answer[2]
            
display(JSON(entities, root='Query Answers'))
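
For context on what the parser does with the response: in the raw Textract output, each question appears as a `QUERY` block that points to its `QUERY_RESULT` block through an `ANSWER` relationship. The sketch below resolves those links by hand on a made-up, trimmed list of blocks; `answers_by_alias` is a hypothetical helper and the values are not taken from the sample document.

```python
from typing import Any, Dict, List

# Made-up, trimmed blocks that imitate the shape of a Queries response.
sample_blocks: List[Dict[str, Any]] = [
    {
        "BlockType": "QUERY",
        "Id": "q1",
        "Query": {"Text": "What is the policy number?", "Alias": "POLICY_NUMBER"},
        "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}],
    },
    {"BlockType": "QUERY_RESULT", "Id": "a1", "Text": "ABC-123", "Confidence": 95.0},
    {
        # A query without an ANSWER relationship: no answer was found for it.
        "BlockType": "QUERY",
        "Id": "q2",
        "Query": {"Text": "What is the effective date?", "Alias": "EFFECTIVE_DATE"},
    },
]


def answers_by_alias(blocks: List[Dict[str, Any]]) -> Dict[str, str]:
    """Hypothetical helper: map each query alias to the text of its answer block."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    answers: Dict[str, str] = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        query = block["Query"]
        alias = query.get("Alias") or query["Text"]  # fall back to the question text
        for relationship in block.get("Relationships", []):
            if relationship.get("Type") != "ANSWER":
                continue
            for answer_id in relationship.get("Ids", []):
                result = blocks_by_id.get(answer_id)
                if result is not None:
                    answers[alias] = result.get("Text", "")
    return answers


parsed_answers = answers_by_alias(sample_blocks)
print(parsed_answers)
```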

---
# Conclusion

In this notebook, we saw how to extract key-value pairs from various documents that may be present in a mortgage packet. We also used the Amazon Textract AnalyzeDocument Queries feature to extract specific information from an unclassified, text-dense document, and the API returned accurate answers. In the next stages, we will continue with the capabilities of Amazon Comprehend Custom NER and then perform enrichment on one of the documents.