# Document Extraction

In this notebook, we demonstrate how to use Amazon Textract to extract text, tables, and forms from documents. We also showcase the Analyze ID and Analyze Expense APIs.

Additional code samples and methods for wrangling the Textract API response JSON can be found [here](https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/Textract.ipynb); pay particular attention to Sections 8, 10, and 12.

We first install the following Python libraries:

- [Amazon Textract Response Parser](https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/README.md)
- [tabulate](https://pypi.org/project/tabulate/)
- [Amazon Textract Pretty Printer](https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter)
- [Amazon Textract Caller](https://github.com/aws-samples/amazon-textract-textractor/tree/master/caller)

Throughout this notebook, we use these libraries in a few different ways. This is simply to showcase the versatility of the tools available to you when handling JSON output from the Textract API.

In [None]:
!pip install amazon-textract-response-parser tabulate amazon-textract-prettyprinter

Let's now import some additional libraries.

In [None]:
# import our libraries
import boto3
import json
import re
import csv
import sagemaker
from sagemaker import get_execution_role
from sagemaker.s3 import S3Uploader, S3Downloader
import uuid
import time
import io
from io import BytesIO
import sys
from pprint import pprint
import trp #textract response parser library
from tabulate import tabulate

from IPython.display import Image, display, IFrame
from PIL import Image as PImage, ImageDraw
# Expense helpers; get_string is taken from t_pretty_print below, which would shadow the expense variant anyway
from textractprettyprinter.t_pretty_print_expense import (
    Textract_Expense_Pretty_Print,
    Pretty_Print_Table_Format,
    get_expensesummary_string,
    get_expenselineitemgroups_string,
)
# from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string

Below is an example helper function built with the Textract response parser. It shows how to:

- iterate through the distinct **pages** in the Textract JSON response
- select the **form** construct from each page
- select the identified **keys** and **values** from each form

In [None]:
# Helper function for form extraction
from trp import Document

def getformkeyvalue(response):
    doc = Document(response)

    key_map = {}
    for page in doc.pages:
        # Collect form fields, skipping any that lack a key or a value
        for field in page.form.fields:
            if field is None or field.key is None or field.value is None:
                continue

            key_map[field.key.text] = field.value.text
    return key_map
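For comparison, the same key/value pairing can be recovered directly from the raw `Blocks` list, without the response parser. The following is a minimal sketch, not the parser's implementation: the function names `get_text` and `form_key_values_from_blocks` are hypothetical, and `sample_response` is a hand-written miniature stand-in for a real Textract response. Unlike the helper above, it keeps keys whose value is empty.

```python
def get_text(block, block_map):
    """Join the text of a block's CHILD words (selection marks become 'X')."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = block_map[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                if child.get("SelectionStatus") == "SELECTED":
                    parts.append("X")
    return " ".join(parts)


def form_key_values_from_blocks(response):
    """Pair every KEY block with the text of its linked VALUE block(s)."""
    block_map = {b["Id"]: b for b in response["Blocks"]}
    result = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        key_text = get_text(block, block_map)
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    get_text(block_map[value_id], block_map)
                    for value_id in rel["Ids"]
                )
        result[key_text] = value_text
    return result


# Hand-written miniature response for illustration (not real Textract output)
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Patient"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Name:"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Jane"},
        {"Id": "w4", "BlockType": "WORD", "Text": "Doe"},
    ]
}

print(form_key_values_from_blocks(sample_response))
```

In practice the response parser is the more convenient route; the sketch is only meant to show which relationships (`VALUE` and `CHILD`) link keys, values, and words together.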

We will now upload the sample documents that are in the `./dataset/document_samples/` folder to S3.

S3 will serve as the starting point for our Textract extractions, just as it would when Textract is employed in production workloads.

In [None]:
mySession = boto3.session.Session()
awsRegion = mySession.region_name
data_bucket = sagemaker.Session().default_bucket()

# Amazon S3 client
s3 = boto3.client('s3')

# Amazon Textract client
textract = boto3.client('textract')

In [None]:
# upload to S3
cms_key = 'idp/textract/cms1500.png'
s3.upload_file(Filename='./dataset/document_samples/CMS1500.png',
               Bucket=data_bucket,
               Key=cms_key)

In [None]:
display(Image(filename="./dataset/document_samples/CMS1500.png", width=900, height=400))

## Extract data from claim form CMS1500

---

We will now use Textract to extract both the **FORMS** and the **TABLES** from the CMS1500 document.

In [None]:
from IPython.display import HTML as html_print

form_resp = textract.analyze_document(
    Document={'S3Object': {"Bucket": data_bucket, "Name": cms_key}},
    FeatureTypes=['FORMS', 'TABLES']
)

html_print(get_string(textract_json=form_resp, output_type=[Textract_Pretty_Print.TABLES], table_format=Pretty_Print_Table_Format.html))


In [None]:
html_print(get_string(textract_json=form_resp, output_type=[Textract_Pretty_Print.FORMS], table_format=Pretty_Print_Table_Format.html))
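If you need the table contents as plain Python lists rather than rendered HTML, the `TABLE` and `CELL` blocks in the raw response can be walked directly. The sketch below is illustrative only: `table_to_rows` is a hypothetical helper, `sample_table_response` is hand-written rather than real Textract output, and merged cells and selection marks are ignored for brevity.

```python
def table_to_rows(table_block, block_map):
    """Turn one TABLE block into a list of rows, using each CELL's 1-based indices."""
    cells = {}
    for rel in table_block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = block_map[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                block_map[word_id]["Text"]
                for cell_rel in cell.get("Relationships", [])
                if cell_rel["Type"] == "CHILD"
                for word_id in cell_rel["Ids"]
                if block_map[word_id]["BlockType"] == "WORD"
            ]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)

    n_rows = max((row for row, _ in cells), default=0)
    n_cols = max((col for _, col in cells), default=0)
    # Cells that Textract did not return are filled with empty strings
    return [
        [cells.get((row, col), "") for col in range(1, n_cols + 1)]
        for row in range(1, n_rows + 1)
    ]


# Hand-written miniature response for illustration (not real Textract output)
sample_table_response = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
        {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
        {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Code"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Charge"},
        {"Id": "w3", "BlockType": "WORD", "Text": "99213"},
        {"Id": "w4", "BlockType": "WORD", "Text": "$100.00"},
    ]
}

sample_block_map = {b["Id"]: b for b in sample_table_response["Blocks"]}
sample_rows = [
    table_to_rows(block, sample_block_map)
    for block in sample_table_response["Blocks"]
    if block["BlockType"] == "TABLE"
]
print(sample_rows)
```

The resulting list of rows can be passed straight to `tabulate` or written out with the `csv` module.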

In [None]:
from IPython.display import display, JSON

# using our constructed helper function - values returned as a dictionary

display(JSON(getformkeyvalue(form_resp), root="Claim Form"))

## Extract data from ID Document

---

Let's now look at the **Analyze ID** API.

In [None]:
# upload to S3
key = 'idp/textract/drivers_license.png'
s3.upload_file(Filename='./dataset/document_samples/drivers_license.png',
               Bucket=data_bucket,
               Key=key)

In [None]:
display(Image(filename="./dataset/document_samples/drivers_license.png", width=900, height=400))

In [None]:
ID_resp = textract.analyze_id(DocumentPages=[{'S3Object':{"Bucket": data_bucket, "Name": key}}])

In [None]:
#once again using the textract response parser
from trp.trp2_analyzeid import TAnalyzeIdDocument, TAnalyzeIdDocumentSchema
t_doc = TAnalyzeIdDocumentSchema().load(ID_resp)

In the output below, you will notice that certain keys are presented that were **not** on the driver's license itself.

For example, **Veteran** is not a key found on this license; however, it is a pre-populated key that Analyze ID supports, because the fields on a license differ from state to state.

In [None]:
list_of_results = t_doc.get_values_as_list()
print(tabulate([x[1:3] for x in list_of_results]))
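Because the pre-populated keys come back with empty values when they are absent from the document, it can be useful to keep only the fields that were actually detected. The following is a minimal sketch working on the raw Analyze ID response; `non_empty_id_fields` and its `min_confidence` parameter are hypothetical, and `sample_id_response` is a hand-written miniature stand-in for a real response.

```python
def non_empty_id_fields(analyze_id_response, min_confidence=0.0):
    """Return {field type: detected text} for fields with a non-empty value."""
    fields = {}
    for document in analyze_id_response.get("IdentityDocuments", []):
        for field in document.get("IdentityDocumentFields", []):
            name = field.get("Type", {}).get("Text", "")
            value = field.get("ValueDetection", {})
            text = value.get("Text", "")
            # Skip pre-populated keys with no value, and low-confidence detections
            if text and value.get("Confidence", 0.0) >= min_confidence:
                fields[name] = text
    return fields


# Hand-written miniature response for illustration (not real Textract output)
sample_id_response = {
    "IdentityDocuments": [
        {
            "DocumentIndex": 1,
            "IdentityDocumentFields": [
                {"Type": {"Text": "FIRST_NAME"},
                 "ValueDetection": {"Text": "JANE", "Confidence": 98.5}},
                {"Type": {"Text": "LAST_NAME"},
                 "ValueDetection": {"Text": "DOE", "Confidence": 97.9}},
                {"Type": {"Text": "VETERAN"},
                 "ValueDetection": {"Text": "", "Confidence": 99.0}},
            ],
        }
    ]
}

print(non_empty_id_fields(sample_id_response))
print(non_empty_id_fields(sample_id_response, min_confidence=98.0))
```

The confidence threshold is a design choice that depends on the workload; a stricter value trades recall for fewer incorrect extractions.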

## Extract data from Medical insurance card

---

Let's now take a look at an example medical insurance card and extract the 'form' from it.

In [None]:
# upload to S3
ins_card_key = 'idp/textract/insurance_card.png'
s3.upload_file(Filename='./dataset/document_samples/insurance_card.png',
               Bucket=data_bucket,
               Key=ins_card_key)

In [None]:
display(Image(filename="./dataset/document_samples/insurance_card.png", width=900, height=400))

In [None]:
ins_form_resp = textract.analyze_document(
    Document={'S3Object': {"Bucket": data_bucket, "Name": ins_card_key}},
    FeatureTypes=['FORMS']
)

In [None]:
from IPython.display import display, JSON

# using our constructed helper function - values returned as a dictionary
display(JSON(getformkeyvalue(ins_form_resp), root="Insurance card"))

## Extract data from medical invoice

---

Let's now look at the **Analyze Expense** API to extract information from a medical insurance invoice.

In [None]:
# upload to S3
invc_key = 'idp/textract/insurance_invoice.png'
s3.upload_file(Filename='./dataset/document_samples/insurance_invoice.png',
               Bucket=data_bucket,
               Key=invc_key)

In [None]:
display(Image(filename="./dataset/document_samples/insurance_invoice.png", width=900, height=400))

In the output below, it is important to notice that Textract can extract the "line items" separately from the overall invoice summary.

In [None]:
expense_resp = textract.analyze_expense(Document={'S3Object':{"Bucket": data_bucket, "Name": invc_key}})

### Invoice Summary

In [None]:
print(get_expensesummary_string(textract_json=expense_resp, table_format=Pretty_Print_Table_Format.fancy_grid))

### Invoice line items

In [None]:
print(get_expenselineitemgroups_string(textract_json=expense_resp, table_format=Pretty_Print_Table_Format.fancy_grid))
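The split between summary fields and line items is also visible in the raw Analyze Expense response, where each expense document carries `SummaryFields` and `LineItemGroups`. Below is a minimal sketch of reading both into plain Python structures. The helper `summarize_expense` is hypothetical, and `sample_expense_response` is a hand-written miniature stand-in rather than real Textract output.

```python
def summarize_expense(expense_response):
    """Return (summary dict, list of line-item dicts) from an Analyze Expense response."""
    summary = {}
    line_items = []
    for document in expense_response.get("ExpenseDocuments", []):
        for field in document.get("SummaryFields", []):
            field_type = field.get("Type", {}).get("Text", "OTHER")
            label = field.get("LabelDetection", {}).get("Text")
            # Prefer the normalized type; fall back to the printed label for OTHER
            name = field_type if field_type != "OTHER" else (label or "OTHER")
            summary[name] = field.get("ValueDetection", {}).get("Text", "")
        for group in document.get("LineItemGroups", []):
            for item in group.get("LineItems", []):
                row = {
                    f.get("Type", {}).get("Text", "OTHER"):
                        f.get("ValueDetection", {}).get("Text", "")
                    for f in item.get("LineItemExpenseFields", [])
                }
                line_items.append(row)
    return summary, line_items


# Hand-written miniature response for illustration (not real Textract output)
sample_expense_response = {
    "ExpenseDocuments": [
        {
            "SummaryFields": [
                {"Type": {"Text": "TOTAL"},
                 "LabelDetection": {"Text": "Total Due"},
                 "ValueDetection": {"Text": "$120.00"}},
                {"Type": {"Text": "OTHER"},
                 "LabelDetection": {"Text": "Claim #"},
                 "ValueDetection": {"Text": "A-1001"}},
            ],
            "LineItemGroups": [
                {"LineItemGroupIndex": 1,
                 "LineItems": [
                     {"LineItemExpenseFields": [
                         {"Type": {"Text": "ITEM"},
                          "ValueDetection": {"Text": "Office visit"}},
                         {"Type": {"Text": "PRICE"},
                          "ValueDetection": {"Text": "$100.00"}},
                     ]},
                 ]},
            ],
        }
    ]
}

sample_summary, sample_line_items = summarize_expense(sample_expense_response)
print(sample_summary)
print(sample_line_items)
```

Note that several `OTHER` fields without a label would overwrite one another in this sketch; a production version would need a different keying scheme.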

 
## Extraction with Queries using Amazon Textract

When processing a document with Amazon Textract, you may add queries to your analysis to specify what information you need. This involves passing a question, such as "What is the customer's social security number?" to Amazon Textract. Amazon Textract will then find the information in the document for that question and return it in a response structure separate from the rest of the document's information. Queries can be processed alone, or in combination with any other FeatureType, such as Tables or Forms.

Example Query: What is the customer’s SSN?

Example Answer: 111-xx-333

To learn more about how Amazon Textract extracts query-based information from a document, see the [Amazon Textract documentation](https://docs.aws.amazon.com/textract/latest/dg/queryresponse.html).


In [None]:
display(Image(filename="./dataset/document_samples/insurance_card.png", width=900, height=400))

In [None]:
# saving the insurance card as a variable
filename = "./dataset/document_samples/insurance_card.png"

### Configuring the Queries NLP Questions

In [None]:
response = None
with open(filename, 'rb') as document:
    imageBytes = bytearray(document.read())

    # Call Textract
    response = textract.analyze_document(
        Document={'Bytes': imageBytes},
        # To add further feature types such as FORMS and/or TABLES, use:
        # FeatureTypes=["QUERIES", "FORMS", "TABLES"],
        FeatureTypes=["QUERIES"],
        QueriesConfig={
            "Queries": [
                {
                    "Text": "What is the member name?",
                    "Alias": "INSURANCE_CARD_NAME"
                },
                {
                    "Text": "What is the member id?",
                    "Alias": "INSURANCE_CARD_MEMBER_ID"
                },
                {
                    "Text": "What is the PCP name?",
                    "Alias": "INSURANCE_CARD_PCP_NAME"
                },
                {
                    "Text": "What is the PCP phone number?",
                    "Alias": "INSURANCE_CARD_PCP_PHONE"
                }
            ]
        })
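Since `QueriesConfig` is just a list of `Text`/`Alias` pairs, it can be generated from a plain mapping when the list of questions grows. This is a small sketch under that assumption; `build_queries_config` and `insurance_card_questions` are hypothetical names introduced here for illustration.

```python
def build_queries_config(questions):
    """Build a QueriesConfig dict from an {alias: question text} mapping."""
    return {
        "Queries": [
            {"Text": text, "Alias": alias}
            for alias, text in questions.items()
        ]
    }


# The same questions used in this notebook, keyed by alias
insurance_card_questions = {
    "INSURANCE_CARD_NAME": "What is the member name?",
    "INSURANCE_CARD_MEMBER_ID": "What is the member id?",
    "INSURANCE_CARD_PCP_NAME": "What is the PCP name?",
    "INSURANCE_CARD_PCP_PHONE": "What is the PCP phone number?",
}

queries_config = build_queries_config(insurance_card_questions)
print(queries_config["Queries"][0])
```

The result can then be passed as `QueriesConfig=queries_config` in the `analyze_document` call.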


The response also includes all OCR WORDS and LINES, geometry information, and confidence scores. For the purposes of this notebook, we print only the query responses. The commented-out code also demonstrates how to configure the FORMS and/or TABLES feature types along with QUERIES, making it easy to add QUERIES to any existing workflow.


In [None]:
import trp.trp2 as t2
from tabulate import tabulate

d = t2.TDocumentSchema().load(response)
page = d.pages[0]

query_answers = d.get_query_answers(page=page)

print(tabulate(query_answers, tablefmt="github"))
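The query answers can also be read from the raw response, where each `QUERY` block points to its `QUERY_RESULT` block through an `ANSWER` relationship. The following is a minimal sketch of that lookup, keeping the confidence score alongside each answer. `query_answers_from_blocks` is a hypothetical helper, and `sample_query_response` is hand-written rather than real Textract output.

```python
def query_answers_from_blocks(response):
    """Map each query alias (or text) to (answer text, confidence), or None if unanswered."""
    block_map = {b["Id"]: b for b in response["Blocks"]}
    answers = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "QUERY":
            continue
        query = block["Query"]
        name = query.get("Alias") or query["Text"]
        answers[name] = None
        for rel in block.get("Relationships", []):
            if rel["Type"] != "ANSWER":
                continue
            results = [block_map[result_id] for result_id in rel["Ids"]]
            if results:
                # Keep the highest-confidence result if several are linked
                best = max(results, key=lambda r: r.get("Confidence", 0.0))
                answers[name] = (best["Text"], best.get("Confidence"))
    return answers


# Hand-written miniature response for illustration (not real Textract output)
sample_query_response = {
    "Blocks": [
        {"Id": "q1", "BlockType": "QUERY",
         "Query": {"Text": "What is the member name?",
                   "Alias": "INSURANCE_CARD_NAME"},
         "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
        {"Id": "a1", "BlockType": "QUERY_RESULT",
         "Text": "Jane Doe", "Confidence": 98.0},
        {"Id": "q2", "BlockType": "QUERY",
         "Query": {"Text": "What is the PCP phone number?",
                   "Alias": "INSURANCE_CARD_PCP_PHONE"}},
    ]
}

sample_answers = query_answers_from_blocks(sample_query_response)
print(sample_answers)
```

Returning `None` for unanswered queries makes it straightforward to flag documents that need manual review.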

 
# Conclusion

In this notebook, we saw how to extract `FORMS`, `TABLES`, and `QUERIES` from various document types that may be present in a claims processing packet. We also used the Amazon Textract `AnalyzeID` API to detect information from a driver's license, and the `AnalyzeExpense` API to extract the summary and line items from an invoice. Finally, we used Queries to extract specific pieces of information from the insurance card. In the next notebook, we will perform enrichment on one of the documents.