# Document Extraction

In this lab, we will look at how to extract table information from documents.



- [Step 1: Setup notebook](#step1)
- [Step 2: Extract unstructured data with Amazon Textract](#step2)
- [Step 3: Extract table data using Amazon Textract](#step3)
- [Step 4: Extract forms (key/value) data using Amazon Textract](#step4)
- [Step 5: Query based extraction using Amazon Textract](#step5)
- [Step 6: Signature detection with Amazon Textract](#step6)
- [Step 7: Extracting invoices/receipts with Amazon Textract](#step7)
- [Step 8: Extracting identity documents with Amazon Textract](#step8)
- [Cleanup](#cleanup)
- [Conclusion](#conclusion)
---

# Step 1: Setup notebook 

In this step, we will import some necessary libraries that will be used throughout this notebook. 

In [None]:
!python -m pip install -q amazon-textract-response-parser --upgrade
!python -m pip install -q amazon-textract-caller --upgrade
!python -m pip install -q amazon-textract-prettyprinter==0.0.16
!python -m pip install -q amazon-textract-textractor --upgrade

In [None]:
#Restart the kernel
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
import boto3
import botocore
import sagemaker
import pandas as pd
from IPython.display import Image, display, JSON
from textractcaller.t_call import call_textract, Textract_Features, call_textract_expense
from textractprettyprinter.t_pretty_print import convert_table_to_list
from trp import Document
import os

# variables
data_bucket = sagemaker.Session().default_bucket()
region = boto3.session.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')

os.environ["BUCKET"] = data_bucket
os.environ["REGION"] = region
role = sagemaker.get_execution_role()

print(f"SageMaker role is: {role}\nDefault SageMaker Bucket: s3://{data_bucket}")

s3=boto3.client('s3')
textract = boto3.client('textract', region_name=region)
comprehend=boto3.client('comprehend', region_name=region)

Let's select a bank statement that we classified in the previous exercise.

In [None]:
import random
prefix = 'idp/comprehend/classified-docs/bank-statements'
start_after = 'idp/comprehend/classified-docs/bank-statements/'

paginator = s3.get_paginator('list_objects_v2')
operation_parameters = {'Bucket': data_bucket,
                        'Prefix': prefix,
                        'StartAfter': start_after}
list_items = []
page_iterator = paginator.paginate(**operation_parameters)

for page in page_iterator:
    if "Contents" in page:
        for item in page['Contents']:
            list_items.append(f's3://{data_bucket}/{item["Key"]}')
    else:
        list_items.append('./samples/mixedbag/document_0.png')

file = random.sample(list_items, k=1)[0] #select a random bank statement document from the list

if "s3://" in file:
    file_key = file.replace(f"s3://{data_bucket}/", "")
else:
    print(f"S3 File not found, using file from local {file}\n")
    file_key = f"idp/textract/sample/{os.path.basename(file)}"
    !aws s3 cp {file} s3://{data_bucket}/{file_key} --only-show-errors

display(Image(url=s3.generate_presigned_url('get_object', Params={'Bucket': data_bucket, 'Key': file_key}), width=600))
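
The cell above derives the object key by stripping a known bucket prefix from an `s3://` URI. As a minimal sketch of the same idea for an arbitrary bucket, the helper below splits a URI into bucket and key using plain string operations. The function name `split_s3_uri` and the example URI are hypothetical and not part of the lab code.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"Not an S3 URI: {uri}")
    # Everything up to the first '/' is the bucket, the rest is the key.
    bucket, _, key = uri[len(scheme):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI must contain a bucket and a key: {uri}")
    return bucket, key


# Hypothetical URI, used only for illustration.
example_bucket, example_key = split_s3_uri(
    "s3://example-bucket/idp/comprehend/classified-docs/bank-statements/doc_1.png"
)
print(example_bucket, example_key)
```

String partitioning is used instead of a URL parser so that characters such as `#` or `?` inside an object key are kept as part of the key.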

---
# Step 2: Extract unstructured data with Amazon Textract 

Amazon Textract is an ML-powered OCR service that detects and extracts text from documents. Text data in the form of WORDS and LINES can be extracted from documents using the Amazon Textract `DetectDocumentText` API. Let's extract the words and lines from the bank statement.

In [None]:
# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': data_bucket,
            'Name': file_key
        }
    })


# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print(item["Text"])
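
To make the relationship between LINE and WORD blocks concrete, here is a small offline sketch. It runs against a hand-written dictionary, `sample_response`, that only mimics the shape of a `DetectDocumentText` response; the text values and the helper names `child_ids` and `lines_with_words` are hypothetical. A real response carries many more attributes per block, such as geometry and confidence.

```python
from collections import Counter

# Hypothetical, hand-written stand-in for a DetectDocumentText response.
sample_response = {
    "Blocks": [
        {"Id": "p1", "BlockType": "PAGE",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1", "l2"]}]},
        {"Id": "l1", "BlockType": "LINE", "Text": "Account Summary",
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "l2", "BlockType": "LINE", "Text": "Balance 100.00",
         "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Account"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Summary"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Balance"},
        {"Id": "w4", "BlockType": "WORD", "Text": "100.00"},
    ]
}


def child_ids(block):
    """Return the Ids listed under a block's CHILD relationships."""
    ids = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            ids.extend(rel["Ids"])
    return ids


def lines_with_words(response):
    """Map each LINE's text to the list of WORD texts it contains."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    result = {}
    for block in response["Blocks"]:
        if block["BlockType"] == "LINE":
            result[block["Text"]] = [by_id[i]["Text"] for i in child_ids(block)]
    return result


block_counts = Counter(b["BlockType"] for b in sample_response["Blocks"])
line_map = lines_with_words(sample_response)
print(dict(block_counts))
print(line_map)
```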

As you can see, we were able to extract the LINES and WORDS from the document, but we also lost some of the structural formatting within the document. For example, the document contains a few tables, and we would like to extract the table information in a tabular structure. So let's do that next.

---
# Step 3: Extract table data using Amazon Textract 

In this step, we will take a brief look at how to extract table information from the bank statement. Our bank statement has two tables.

In [None]:
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': data_bucket,
            'Name': file_key
        }
    },
    FeatureTypes=["TABLES"])

response

As you can see, the response from Amazon Textract is a large JSON object that contains a lot of information. Let's parse the table data out of this response using the Textract response parser tool that we installed earlier. To learn how the Textract table response works, refer to the [documentation](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-tables.html).

In [None]:
#print(response)
doc = Document(response)
for page in doc.pages:
    # Print tables
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}".format(r, c, cell.text))
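
For intuition about what the parser does with the raw JSON, the sketch below rebuilds a table grid directly from TABLE, CELL, and WORD blocks using each cell's `RowIndex` and `ColumnIndex`. The `sample_blocks` list is a hypothetical, heavily trimmed stand-in for a real response, and the helpers `children` and `table_to_grid` are illustrative names, not part of the response parser. Merged cells (`RowSpan`, `ColumnSpan`) are deliberately ignored here.

```python
# Hypothetical, trimmed-down blocks shaped like an AnalyzeDocument TABLES response.
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    # An empty cell has no CHILD relationships.
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
    {"Id": "w1", "BlockType": "WORD", "Text": "Date"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Amount"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Jan"},
    {"Id": "w4", "BlockType": "WORD", "Text": "1"},
]


def children(block, by_id):
    """Return the child blocks referenced by a block's CHILD relationships."""
    found = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            found.extend(by_id[i] for i in rel["Ids"])
    return found


def table_to_grid(table_block, by_id):
    """Build a row-major list of lists of cell text for one TABLE block."""
    cells = [b for b in children(table_block, by_id) if b["BlockType"] == "CELL"]
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        words = [w["Text"] for w in children(cell, by_id) if w["BlockType"] == "WORD"]
        # RowIndex and ColumnIndex are 1-based in the response.
        grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
    return grid


blocks_by_id = {b["Id"]: b for b in sample_blocks}
grids = [table_to_grid(b, blocks_by_id)
         for b in sample_blocks if b["BlockType"] == "TABLE"]
print(grids)
```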

In the code cells above, we used the Textract `AnalyzeDocument` API to extract information from the document and then used the Textract response parser's `Document` class to parse the tables out of the JSON response. Additional tooling makes this easier: the `call_textract` method from the Textract Caller tool that we installed earlier wraps the API call, and the Textract pretty printer tool lets us view the tables in a more human-readable way. Together, these tools make it easy to call the Textract APIs and parse their JSON output. In the following sections, we will use them to make the API calls and to parse the JSON responses.

In [None]:
resp = call_textract(input_document=file, features=[Textract_Features.TABLES])
tdoc = Document(resp)
dfs = list()

for page in tdoc.pages:
    for table in page.tables:
        tab_list = convert_table_to_list(trp_table=table)
        print(tab_list)
        dfs.append(pd.DataFrame(tab_list))

df1 = dfs[0]
df2 = dfs[1]

In the code cell above, we extracted the tables as a Python List and then converted them to Pandas DataFrame. You can also extract tables in other formats such as CSV, TSV etc. Refer to the [PrettyPrinter](https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter) documentation for more. Now let's look at the DataFrames.

In [None]:
df1

In [None]:
df2
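
As one possible illustration of the CSV option mentioned above, the sketch below serializes a table held as a list of lists using only the Python standard library instead of PrettyPrinter. The `tab_list` values are made up for the example, and `table_to_csv` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical table in a list-of-lists shape; the values are invented.
tab_list = [
    ["Date", "Description", "Amount"],
    ["01/02", "Deposit", "1,200.00"],
    ["01/05", "ATM Withdrawal", "-60.00"],
]


def table_to_csv(rows):
    """Serialize a list of rows to CSV text, trimming stray whitespace per cell."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows([[cell.strip() for cell in row] for row in rows])
    return buffer.getvalue()


csv_text = table_to_csv(tab_list)
print(csv_text)
```

Using `csv.writer` rather than joining with commas matters for amounts such as `1,200.00`: the writer quotes any field that contains the delimiter, so the value survives a round trip through a CSV reader.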

---
# Step 4: Extract forms (key/value) data using Amazon Textract 

Let's look at how Amazon Textract can be used to extract form data from the document. In this example, we will use a sample Employment Verification form.

In [None]:
display(Image(url="./samples/textract/Employment_Verification.png", width=600))

In our previous example, the document was in S3, and we called Amazon Textract by specifying the document's S3 location. In this case the document is available locally, so we can either upload it to S3 or pass its bytes (a byte array) from our local environment directly to the API. Let's use the byte array for this example. Note that this method only applies to the Textract synchronous (real-time) APIs, since the asynchronous APIs only support documents stored in S3. In the code cell below, we first read the document into a byte array and then call the `AnalyzeDocument` API with the `FORMS` feature. We then use the Textract response parser tool to parse out the form key/value pairs and print them.

In [None]:
# Read document content
documentName="./samples/textract/Employment_Verification.png"
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Call Amazon Textract
response = call_textract(input_document=imageBytes, features=[Textract_Features.FORMS])

doc = Document(response)

for page in doc.pages:
    # Print fields
    print("Fields:")
    for field in page.form.fields:
        print("Key: {}, Value: {}".format(field.key, field.value))

    # Get field by key
    print("\nGet Field by Key (Base Pay):")
    key = "Base Pay"
    field = page.form.getFieldByKey(key)
    if field:
        print("Key: {}, Value: {}".format(field.key, field.value))

    # Search fields by key
    print("\nSearch Fields (address):")
    key = "address"
    fields = page.form.searchFieldsByKey(key)
    for field in fields:
        print("Key: {}, Value: {}".format(field.key, field.value))
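
To show what sits underneath the form parser, the following sketch pairs keys with values straight from `KEY_VALUE_SET` blocks. The `sample_form_blocks` data is a hypothetical, minimal imitation of a FORMS response with invented values, and `text_of`, `key_value_pairs`, and `search_by_key` are illustrative helpers. The case-insensitive substring match in `search_by_key` is an assumption about a useful lookup, not a description of how the response parser implements `searchFieldsByKey`.

```python
# Hypothetical, minimal blocks shaped like an AnalyzeDocument FORMS response.
sample_form_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "k2", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v2"]},
                       {"Type": "CHILD", "Ids": ["w4", "w5"]}]},
    {"Id": "v2", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w6", "w7", "w8"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Base"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Pay:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "$5,000"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Employer"},
    {"Id": "w5", "BlockType": "WORD", "Text": "Address:"},
    {"Id": "w6", "BlockType": "WORD", "Text": "123"},
    {"Id": "w7", "BlockType": "WORD", "Text": "Any"},
    {"Id": "w8", "BlockType": "WORD", "Text": "Street"},
]


def text_of(block, by_id):
    """Join the WORD children of a block into a single string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(blocks):
    """Return {key text: value text} for every KEY block in the list."""
    by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(text_of(by_id[i], by_id) for i in rel["Ids"])
        pairs[text_of(block, by_id)] = value_text
    return pairs


def search_by_key(pairs, needle):
    """Case-insensitive substring search over the extracted keys."""
    needle = needle.lower()
    return {k: v for k, v in pairs.items() if needle in k.lower()}


form_pairs = key_value_pairs(sample_form_blocks)
print(form_pairs)
print(search_by_key(form_pairs, "address"))
```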

---
# Step 5: Query based extraction using Amazon Textract 

When processing a document with Amazon Textract, you can add queries to your analysis to specify what information you need. This involves passing a question, such as "What is the customer's social security number?", to Amazon Textract. Amazon Textract then finds the information in the document that answers the question and returns it in a response structure separate from the rest of the document's information. Queries can be processed alone or in combination with any other FeatureType, such as TABLES or FORMS. Queries can be a powerful tool in situations where only a few pieces of critical information are needed from a document. There are limits to how many queries you can pass; please refer to the [Set Quotas in Amazon Textract](https://docs.aws.amazon.com/textract/latest/dg/limits-document.html) document for more information.

Let's pass a couple of Queries to extract from our Employment Verification form.

In [None]:
from textractcaller import QueriesConfig, Query
import trp.trp2 as t2 

# Setup the queries
query1 = Query(text="What is the applicant's date of employment?", alias="EMPLOYMENT_DATE", pages=["1"])
query2 = Query(text="What is the probability of continued employment?", alias="CONTINUED_EMPLYMT_PROB", pages=["1"])

#Setup the query config with the above queries
queries_config = QueriesConfig(queries=[query1, query2])

documentName="./samples/textract/Employment_Verification.png"
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

response = call_textract(input_document=imageBytes,
                         features=[Textract_Features.QUERIES],
                         queries_config=queries_config)

# Load the response into the trp2 document model, which exposes the query answers
doc_ev: t2.TDocumentSchema = t2.TDocumentSchema().load(response)

entities = {}
for page in doc_ev.pages:
    query_answers = doc_ev.get_query_answers(page=page)
    if query_answers:
        for answer in query_answers:
            # Each answer is indexed as: [1] = query alias, [2] = answer text
            entities[answer[1]] = answer[2]

display(JSON(entities, root='Query Answers'))
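
To illustrate the separate response structure that queries use, the sketch below links `QUERY` blocks to their `QUERY_RESULT` blocks through the `ANSWER` relationship. The `sample_query_blocks` data, including the answer text and confidence, is invented for illustration, and `answers_by_alias` is a hypothetical helper. A query that Textract could not answer is assumed here to come back without an `ANSWER` relationship.

```python
# Hypothetical, minimal blocks shaped like an AnalyzeDocument QUERIES response.
sample_query_blocks = [
    {"Id": "q1", "BlockType": "QUERY",
     "Query": {"Text": "What is the applicant's date of employment?",
               "Alias": "EMPLOYMENT_DATE"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
    {"Id": "a1", "BlockType": "QUERY_RESULT",
     "Text": "01/01/2020", "Confidence": 97.0},
    # No Relationships entry: this query has no answer in the sample.
    {"Id": "q2", "BlockType": "QUERY",
     "Query": {"Text": "What is the probability of continued employment?",
               "Alias": "CONTINUED_EMPLYMT_PROB"}},
]


def answers_by_alias(blocks):
    """Map each query alias (or its text, if no alias) to its best answer or None."""
    by_id = {b["Id"]: b for b in blocks}
    answers = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        label = block["Query"].get("Alias") or block["Query"]["Text"]
        best = None
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                candidates = [by_id[i] for i in rel["Ids"]]
                top = max(candidates, key=lambda b: b["Confidence"])
                best = {"text": top["Text"], "confidence": top["Confidence"]}
        answers[label] = best
    return answers


query_results = answers_by_alias(sample_query_blocks)
print(query_results)
```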

---
# Step 6: Signature detection with Amazon Textract 

Amazon Textract can detect the presence of signatures in documents. The `AnalyzeDocument` API supports several feature types, including Forms, Tables, Queries, and Signatures. The Signatures feature can be used by itself or in combination with other feature types. When used by itself, the Signatures feature type provides a JSON response that includes (a) the location and confidence scores of the detected signatures and (b) the raw text (words and lines) from the document. If the Signatures feature is used along with the Forms feature, which extracts key/value pairs from a form, a detected signature is associated as the value of the relevant key. Similarly, when used along with the Tables feature type, a detected signature is associated with a cell within the table.

Let's try to detect the signatures in our Employment Verification form.

In [None]:
from textractor.parsers import response_parser

# Read document content
documentName="./samples/textract/Employment_Verification.png"
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Call Amazon Textract
response = call_textract(input_document=imageBytes,
                         features=[Textract_Features.SIGNATURES])
tdoc = response_parser.parse(response)

for signature in tdoc.signatures:
    print(signature.bbox)
    print(f"Confidence: {signature.confidence}\n")

Textract has detected three signatures in the document, along with their bounding boxes and confidence scores.
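
Bounding boxes in the raw response are expressed as ratios of the page width and height, so they usually need scaling before you can crop or draw on the image. The sketch below works on hypothetical `SIGNATURE` blocks with invented coordinates; the helper names, the assumed 800x1000 pixel image size, and the 90.0 confidence threshold are illustrative choices, not values from the lab.

```python
# Hypothetical SIGNATURE blocks; the coordinates and confidences are invented.
sample_signature_blocks = [
    {"BlockType": "SIGNATURE", "Confidence": 99.5,
     "Geometry": {"BoundingBox": {"Left": 0.125, "Top": 0.75,
                                  "Width": 0.25, "Height": 0.125}}},
    {"BlockType": "SIGNATURE", "Confidence": 40.0,
     "Geometry": {"BoundingBox": {"Left": 0.5, "Top": 0.25,
                                  "Width": 0.25, "Height": 0.125}}},
    {"BlockType": "LINE", "Text": "Signature of Employer", "Confidence": 99.0},
]


def to_pixels(bounding_box, image_width, image_height):
    """Convert a ratio-based bounding box to (left, top, right, bottom) pixels."""
    left = round(bounding_box["Left"] * image_width)
    top = round(bounding_box["Top"] * image_height)
    right = round((bounding_box["Left"] + bounding_box["Width"]) * image_width)
    bottom = round((bounding_box["Top"] + bounding_box["Height"]) * image_height)
    return left, top, right, bottom


def signature_boxes(blocks, image_width, image_height, min_confidence=90.0):
    """Return pixel boxes for SIGNATURE blocks at or above the confidence threshold."""
    boxes = []
    for block in blocks:
        if block["BlockType"] != "SIGNATURE":
            continue
        if block["Confidence"] < min_confidence:
            continue
        box = to_pixels(block["Geometry"]["BoundingBox"], image_width, image_height)
        boxes.append({"box": box, "confidence": block["Confidence"]})
    return boxes


# Assumed image size of 800x1000 pixels, for illustration only.
detected = signature_boxes(sample_signature_blocks, 800, 1000)
print(detected)
```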

---
# Step 7: Extracting invoices/receipts with Amazon Textract 

Let's now look at the `AnalyzeExpense` API to extract information from an invoice document.

In [None]:
documentName = "./samples/textract/invoice.png"
display(Image(filename=documentName, width=600)) 

It is important to note that Amazon Textract can separately extract the "line items" in the invoice and the "Summary" of the invoice.

In [None]:
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# expense_resp = call_textract_expense(input_document=imageBytes)
expense_resp = textract.analyze_expense(Document={'Bytes': imageBytes})

In [None]:
summary_entities_values = []
summary_fields = []
expense_item = []

for expense_doc in expense_resp["ExpenseDocuments"]:
    for field in expense_doc["SummaryFields"]:
        kvs = {}
        if "LabelDetection" in field:
            if "ValueDetection" in field:
                kvs[field["LabelDetection"]["Text"]] = field["ValueDetection"]["Text"]
        else:
            # No label was detected, so fall back to the normalized field type
            kvs[field["Type"]["Text"]] = field["ValueDetection"]["Text"]
        summary_entities_values.append(kvs.copy())
        kvs = None

    for line_item_group in expense_doc["LineItemGroups"]:
        for line_items in line_item_group["LineItems"]:
            for field in line_items["LineItemExpenseFields"]:
                kvs = {}
                if "LabelDetection" in field:
                    if "ValueDetection" in field:
                        kvs[field["LabelDetection"]["Text"]] = field["ValueDetection"]["Text"]
                else:
                    # No label was detected, so fall back to the normalized field type
                    kvs[field["Type"]["Text"]] = field["ValueDetection"]["Text"]
                expense_item.append(kvs.copy())
                kvs = None

print("Invoice Summary:")
print("==========================================")
print(*summary_entities_values, sep='\n')
print("\nInvoice Line Items:")
print("==========================================")
print(*expense_item, sep='\n')
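
The summary and line-item loops above repeat the same label-or-type logic, so one option is to factor it into a helper. The sketch below is a variant, not the notebook's method: it uses `.get()` so that a missing `ValueDetection` yields an empty string instead of an empty dictionary, and it groups fields into one dictionary per line item. `sample_expense_resp` is a hypothetical, trimmed-down response with invented values, and `field_to_pair` and `parse_expense` are illustrative names.

```python
# Hypothetical, trimmed-down stand-in for an AnalyzeExpense response.
sample_expense_resp = {
    "ExpenseDocuments": [{
        "SummaryFields": [
            {"Type": {"Text": "INVOICE_RECEIPT_ID"},
             "LabelDetection": {"Text": "Invoice #"},
             "ValueDetection": {"Text": "12345"}},
            {"Type": {"Text": "VENDOR_NAME"},
             "ValueDetection": {"Text": "Example Corp"}},
        ],
        "LineItemGroups": [{
            "LineItems": [{
                "LineItemExpenseFields": [
                    {"Type": {"Text": "ITEM"},
                     "ValueDetection": {"Text": "Widget"}},
                    {"Type": {"Text": "PRICE"},
                     "LabelDetection": {"Text": "Amount"},
                     "ValueDetection": {"Text": "$10.00"}},
                ]
            }]
        }],
    }]
}


def field_to_pair(field):
    """Return (label, value), preferring the detected label over the field type."""
    value = field.get("ValueDetection", {}).get("Text", "")
    label = (field.get("LabelDetection", {}).get("Text")
             or field.get("Type", {}).get("Text", "UNKNOWN"))
    return label, value


def parse_expense(response):
    """Return (summary dict, list of per-line-item dicts) for an expense response."""
    summary, line_items = {}, []
    for doc in response.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            label, value = field_to_pair(field)
            summary[label] = value
        for group in doc.get("LineItemGroups", []):
            for item in group.get("LineItems", []):
                fields = item.get("LineItemExpenseFields", [])
                line_items.append(dict(field_to_pair(f) for f in fields))
    return summary, line_items


invoice_summary, invoice_line_items = parse_expense(sample_expense_resp)
print(invoice_summary)
print(invoice_line_items)
```

Note that collecting the summary into a single dictionary means a repeated label overwrites the earlier value. Keep a list of pairs instead if the same label can appear more than once in your invoices.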

---
# Step 8: Extracting identity documents with Amazon Textract 
 
To see how extraction of identity documents works with Amazon Textract, we will use a sample passport document. A passport is a special kind of document: an identity document. To extract information from US passports and driver's licenses, Amazon Textract's `AnalyzeID` API can be used.

In [None]:
documentName = "./samples/textract/Passport.png"

display(Image(url=documentName, width=500));

We will use the `call_textract_analyzeid` function from the amazon-textract-caller library (imported as `textractcaller`).

In [None]:
from textractcaller import call_textract_analyzeid
import trp.trp2_analyzeid as t2id

with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

response_passport = call_textract_analyzeid(document_pages=[imageBytes])
doc_passport: t2id.TAnalyzeIdDocument = t2id.TAnalyzeIdDocumentSchema().load(response_passport)

Note that in the call to `call_textract_analyzeid` you can also pass an S3 path to the parameter `document_pages` as

```
call_textract_analyzeid(document_pages=["s3://bucket/prefix/doc.png"])
```

Let's look at the information extracted from the passport document. Notice that the keys are normalized, which makes it easy to parse the required information out of the Textract response JSON.

In [None]:
for id_docs in response_passport['IdentityDocuments']:
    id_doc_kvs={}
    for field in id_docs['IdentityDocumentFields']:
        id_doc_kvs[field['Type']['Text']] = field['ValueDetection']['Text']

display(JSON(id_doc_kvs, root='ID Document Key-values', expanded=True))
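
Because the keys are normalized, the same parsing code can be reused across identity documents. The sketch below is one hedged variant of the loop above: it returns one dictionary per identity document, skips empty values, applies an optional confidence threshold, and prefers a normalized value when the field carries one. The `sample_id_response` data is invented, and `id_fields` is a hypothetical helper name.

```python
# Hypothetical, trimmed-down stand-in for an AnalyzeID response.
sample_id_response = {
    "IdentityDocuments": [{
        "DocumentIndex": 1,
        "IdentityDocumentFields": [
            {"Type": {"Text": "FIRST_NAME"},
             "ValueDetection": {"Text": "JANE", "Confidence": 99.1}},
            {"Type": {"Text": "MIDDLE_NAME"},
             "ValueDetection": {"Text": "", "Confidence": 98.0}},
            {"Type": {"Text": "DATE_OF_BIRTH"},
             "ValueDetection": {"Text": "01 JAN 1980", "Confidence": 97.5,
                                "NormalizedValue": {"Value": "1980-01-01T00:00:00",
                                                    "ValueType": "Date"}}},
        ],
    }]
}


def id_fields(response, min_confidence=0.0):
    """Return one {field type: value} dict per identity document."""
    documents = []
    for id_doc in response.get("IdentityDocuments", []):
        fields = {}
        for field in id_doc.get("IdentityDocumentFields", []):
            detection = field.get("ValueDetection", {})
            text = detection.get("Text", "")
            # Skip fields that are empty or below the confidence threshold.
            if not text or detection.get("Confidence", 0.0) < min_confidence:
                continue
            normalized = detection.get("NormalizedValue", {}).get("Value")
            fields[field["Type"]["Text"]] = normalized or text
        documents.append(fields)
    return documents


parsed_ids = id_fields(sample_id_response)
print(parsed_ids)
```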

---
# Cleanup 

Cleanup is optional if you want to execute subsequent notebooks. 

Refer to the `05-idp-cleanup.ipynb` for cleanup and deletion of resources.

---
# Conclusion 

In this notebook, we extracted tables from a bank statement and looked at several additional ways Amazon Textract can help extract specific structured and semi-structured data, such as form data, from our documents. In the next notebook, we will extract entity information from our documents using Amazon Comprehend.


You can further explore all Amazon Textract capabilities by cloning the entire code repository using the `git clone` command below.

`git clone https://github.com/aws-samples/amazon-textract-code-samples`