# Customize Business Rules for Intelligent Document Processing with Human Review and BI Visualization

Amazon Textract (https://aws.amazon.com/textract/) lets you extract text from a variety of documents, and Amazon Augmented AI (https://aws.amazon.com/augmented-ai/) (A2I) lets you add human review of machine learning predictions. The default Amazon A2I template lets you build a human review pipeline based on conditions such as an extraction confidence score below a predefined threshold or missing required keys, and it can also assign documents to human review at random. Sometimes, however, customers need the document processing pipeline to support flexible business rules, such as validating string formats, verifying data types and ranges, and validating across fields. This post shows how you can use Amazon Textract and Amazon A2I to build a generic document processing pipeline that supports flexible business rules.

### The following diagram illustrates the workflow of the pipeline that supports customized business rules:
![a2i with custom rules](./images/a2i-custom-rule.png)

- [Step 1: Setup notebook](#step1)
- [Step 2: Extract text from sample documents using Amazon Textract](#step2)
- [Step 3: Validate rules and route document to A2I](#step3)
- [Step 4: Build BI dashboard using QuickSight](#step4)

### The Sample Document
The document processed in the sample solution is Tax Form 990 (https://en.wikipedia.org/wiki/Form_990), a US IRS form that provides the public with financial information about a non-profit organization. Because this post focuses on the end-to-end pipeline, we only cover the extraction logic for some of the fields on the first page as an example.

<div class="alert alert-warning"> <h4><strong>💡 NOTE</strong>
</h4>
    The document is from the publicly available IRS (Internal Revenue Service) <a href="https://www.irs.gov/charities-non-profits/form-990-series-downloads">website</a>.
</div>

# Step 1: Setup notebook <a id="step1"></a>

In this step, we will import some necessary libraries that will be used throughout this notebook. 

In [None]:
!pip install awscli --upgrade
!pip install botocore --upgrade
!pip install boto3 --upgrade

In [None]:
import boto3
import botocore
import sagemaker as sm
import os
import io
import datetime

# variables
data_bucket = sm.Session().default_bucket()
region = boto3.session.Session().region_name

os.environ["BUCKET"] = data_bucket
os.environ["REGION"] = region
role = sm.get_execution_role()

print(f"SageMaker role is: {role}\nDefault SageMaker Bucket: s3://{data_bucket}")

s3=boto3.client('s3')
textract = boto3.client('textract', region_name=region)
comprehend=boto3.client('comprehend', region_name=region)
sagemaker=boto3.client('sagemaker', region_name=region)
a2i=boto3.client('sagemaker-a2i-runtime', region_name=region)

---
# Step 2: Extract text from sample documents using Amazon Textract<a id="step2"></a>

In this section, we use Amazon Textract's `analyze_document` API to extract the text information from page 1 of the 990 Tax Form. We will also map the data to a JSON data model. This data model will be used to validate business rules in the later steps.

First, define a JSON structure, `page1`, that maps to the fields on page 1 of the 990 Tax Form, so we can apply business rules in the later steps.

The below image shows how the text on page 1 maps to the JSON fields.

![990 page1 mapping](./images/a2i-page1-data-model-mapping.png)

The below `page1` object will hold the Textract extraction result. 
Following the same pattern, you can expand this JSON to include more fields. For this lab, the sample code only extracts and maps a partial page.

In [None]:
# JSON structure to hold the Page 1 extraction result
page1 = {
          "dln": None,
          "omb_no": None,
          "b.address_change": None,
          "b.name_change": None,
          "c.org_name": None,
          "c.street_no": None,
          "d.employer_id": None,
          "e.phone_number": None,
          "i.501_c_3": None,
          "i.501_c": None,
          "part1.1": None,
          "part1.3": None,
          "part1.8_pre_yr": None,
          "part1.8_cur_yr": None,
        }

Now, let's start to prepare extraction by uploading the sample document to the S3 bucket:

In [None]:
s3_key = 'idp/textract/990-sample-page-1.jpg'

In [None]:
# Upload images to S3 bucket:
!aws s3 cp a2idata/990-sample-page-1.jpg s3://{data_bucket}/{s3_key} --only-show-errors

In [None]:
# Get image meta data
from PIL import Image
im = Image.open('a2idata/990-sample-page-1.jpg')
image_width, image_height = im.size

We now call the Queries feature of Textract's `analyze_document` API to extract fields by asking specific questions. You do not need to know the structure of the data in the document (table, form, implied field, nested data) or worry about variations across document versions and formats. Queries uses a combination of visual, spatial, and language cues to extract the information you seek with high accuracy.

In the below code, 14 Textract Query questions map to the fields in the JSON structure defined earlier:

In [None]:
response = textract.analyze_document(
            Document={'S3Object': {'Bucket': data_bucket, 'Name': s3_key}},
            FeatureTypes=["QUERIES"],
            QueriesConfig={
                    'Queries': [
                        {
                            'Text': 'What is the DLN?',
                            'Alias': 'DLN_NO'
                        },
                        {
                            'Text': 'What is the OMB No?',
                            'Alias': 'OMB_NO'
                        },
                        {
                            'Text': 'Does the address changed?',
                            'Alias': 'B_ADDRESS_CHANGED'
                        },
                        {
                            'Text': 'Does the name changed?',
                            'Alias': 'B_NAME_CHANGED'
                        },
                        {
                            'Text': 'What is the name of organization?',
                            'Alias': 'C_ORG_NAME'
                        },
                        {
                            'Text': 'What is the Number and street?',
                            'Alias': 'C_STREET_NUMBER'
                        },
                        {
                            'Text':'What is the Employer identification number?',
                            'Alias': 'D_EMPLOYER_ID'
                        },
                        {
                            'Text':'What is Telephone Number?',
                            'Alias': 'E_PHONE'
                        },
                        {
                            'Text':'Does 501(cx3) checked?',
                            'Alias': 'I_501_CX3'
                        },
                        {
                            'Text':'Does 501(c) checked?',
                            'Alias': 'I_501_C'
                        },
                        {
                            # Double quotes keep the apostrophe; '' inside single quotes would silently drop it
                            'Text': "What is Briefly describe the organization's mission or most significant activities?",
                            'Alias': 'PART_1_1'
                        },
                        {
                            'Text':'What is Number of voting members of the governing body?',
                            'Alias': 'PART_1_3'
                        },
                        {
                            'Text':'What is 8 contributions and grants for Prior Year?',
                            'Alias': 'PART_1_8_PRIOR_YEAR'
                        },
                        {
                            'Text':'What is 8 contributions and grants for Current Year?',
                            'Alias': 'PART_1_8_CURRENT_YEAR'
                        },
                    ]
                }
        )
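
The query list above is repetitive, so it can also be generated from a mapping. The following is a minimal sketch, not part of the original notebook: `QUESTIONS` and `build_queries_config` are hypothetical names, and only two of the 14 questions are shown.

```python
# Hypothetical mapping from query alias to question text (two entries shown).
QUESTIONS = {
    "DLN_NO": "What is the DLN?",
    "OMB_NO": "What is the OMB No?",
}


def build_queries_config(questions):
    """Return a dict in the QueriesConfig shape used by analyze_document."""
    return {
        "Queries": [
            {"Text": text, "Alias": alias} for alias, text in questions.items()
        ]
    }


queries_config = build_queries_config(QUESTIONS)
print(queries_config)
```

The result could then be passed as `QueriesConfig=queries_config` in the `analyze_document` call, which keeps the questions and aliases in one place as the data model grows.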

The Textract JSON response is relatively large. 
You can use the below code snippet to save it as a JSON file under the same directory called: `textract-response.json`. 

Open the file in a new SageMaker Studio tab (or your preferred IDE) for easy reviewing and searching.

In [None]:
import json, os
with open('textract-response.json','w') as f:
    f.write(json.dumps(response))

Now let's parse the Textract response to populate the values to the `page1` object defined earlier.

In [None]:
# Utility functions to locate fields in the Textract response JSON.
# Each lookup returns a dict with the parsed value, the confidence score,
# and the raw QUERY_RESULT block (or None when no answer is found).
def get_query_ref(block_id):
    for b in response["Blocks"]:
        if b["BlockType"] == "QUERY_RESULT" and b["Id"] == block_id:
            return {
                        "value": b.get("Text"), 
                        "confidence": b.get("Confidence"), 
                        "block": b
                    }
    return None
        
def get_query_answer(q_alias):
    for b in response["Blocks"]:
        if b["BlockType"] == "QUERY" and b["Query"].get("Alias") == q_alias:
            # A QUERY block has no "Relationships" key when Textract finds no answer
            for rel in b.get("Relationships", []):
                if rel["Ids"]:
                    return get_query_ref(rel["Ids"][0])
            return None
    return None

# Populate Textract Query results to the page1 object
page1['dln'] = get_query_answer('DLN_NO')
page1['omb_no'] = get_query_answer('OMB_NO')
page1['b.address_change'] = get_query_answer('B_ADDRESS_CHANGED')
page1['b.name_change'] = get_query_answer('B_NAME_CHANGED')
page1['c.org_name'] = get_query_answer('C_ORG_NAME')
page1['c.street_no'] = get_query_answer('C_STREET_NUMBER')
page1['d.employer_id'] = get_query_answer('D_EMPLOYER_ID')
page1['e.phone_number'] = get_query_answer('E_PHONE')
page1['i.501_c_3'] = get_query_answer('I_501_CX3')
page1['i.501_c'] = get_query_answer('I_501_C')
page1['part1.1'] = get_query_answer('PART_1_1')
page1['part1.3'] = get_query_answer('PART_1_3')
page1['part1.8_pre_yr'] = get_query_answer('PART_1_8_PRIOR_YEAR')
page1['part1.8_cur_yr'] = get_query_answer('PART_1_8_CURRENT_YEAR')
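
The per-field assignments above can also be driven by an alias-to-field mapping. The sketch below is illustrative only: `sample_response` is a hand-written stand-in for a Textract response (not real output), and `ALIAS_TO_FIELD` and `populate` are hypothetical names. It assumes that answered queries link to their result through an `ANSWER` relationship and that unanswered queries carry no `Relationships` entry.

```python
# Hand-written stand-in for a Textract response; values are made up.
sample_response = {
    "Blocks": [
        {"BlockType": "QUERY", "Id": "q1",
         "Query": {"Text": "What is the DLN?", "Alias": "DLN_NO"},
         "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
        {"BlockType": "QUERY_RESULT", "Id": "a1",
         "Text": "123456789012345", "Confidence": 97.5},
        # A query without an answer has no Relationships entry.
        {"BlockType": "QUERY", "Id": "q2",
         "Query": {"Text": "What is the OMB No?", "Alias": "OMB_NO"}},
    ]
}

# Hypothetical mapping from query alias to data model field.
ALIAS_TO_FIELD = {"DLN_NO": "dln", "OMB_NO": "omb_no"}


def populate(response, alias_to_field):
    """Build the data model in one pass over the response blocks."""
    results = {b["Id"]: b for b in response["Blocks"]
               if b["BlockType"] == "QUERY_RESULT"}
    model = {field: None for field in alias_to_field.values()}
    for b in response["Blocks"]:
        if b["BlockType"] != "QUERY":
            continue
        field = alias_to_field.get(b["Query"].get("Alias"))
        if field is None:
            continue
        for rel in b.get("Relationships", []):
            if rel["Type"] == "ANSWER" and rel["Ids"]:
                answer = results.get(rel["Ids"][0])
                if answer is not None:
                    model[field] = {"value": answer.get("Text"),
                                    "confidence": answer.get("Confidence"),
                                    "block": answer}
    return model


model = populate(sample_response, ALIAS_TO_FIELD)
print(model["dln"]["value"], model["omb_no"])
```

Fields whose query has no answer stay `None`, which is the case a "Required" rule is meant to catch.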

Each field in the `page1` object contains 3 sub-fields:
* *value*: The text value extracted by Textract.
* *confidence*: The Textract confidence score. You can define custom business rules based on it.
* *block*: The original Textract block, which keeps the Geometry metadata. The custom A2I UI needs it to plot the bounding box on top of the original text block.
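
To illustrate why the Geometry metadata and the image size are both kept, here is a small sketch of converting a bounding box to pixel coordinates. It assumes that Textract reports `Left`, `Top`, `Width`, and `Height` as ratios of the page size. The helper name `to_pixel_box` and the sample numbers are hypothetical, and the custom UI template may do this conversion differently.

```python
def to_pixel_box(block, image_width, image_height):
    """Convert a ratio-based bounding box to integer pixel coordinates."""
    box = block["Geometry"]["BoundingBox"]
    return {
        "left": round(box["Left"] * image_width),
        "top": round(box["Top"] * image_height),
        "width": round(box["Width"] * image_width),
        "height": round(box["Height"] * image_height),
    }


# Made-up block and image size, for illustration only.
sample_block = {
    "Geometry": {
        "BoundingBox": {"Left": 0.5, "Top": 0.25, "Width": 0.1, "Height": 0.05}
    }
}
pixel_box = to_pixel_box(sample_block, image_width=1000, image_height=2000)
print(pixel_box)
```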

In [None]:
# Print out page1
page1

# Step 3: Define generic business rules <a id="step3"></a>

In this lab, we defined 3 business rules for demo purposes:
* The 1st rule is for the employer ID field. The rule fails if the Textract confidence score is lower than 99%. For this demo, we set the confidence threshold high so that the rule fails by design. In a real-world environment, you could adjust the threshold to a more reasonable value to reduce unnecessary human effort.

* The 2nd rule is for the DLN field, the unique identifier of the tax form, which is a must-have for the downstream processing logic. This rule fails if the DLN field is missing or has an empty value.

* The 3rd rule is also for the DLN field but with a different condition category, “LengthCheck”. The rule fails if the DLN is not exactly 16 characters long.

In [None]:
rules = [
    {
        "description": "Employer Id confidence score should be greater than 99",
        "field_name": "d.employer_id",
        "field_name_regex": None, # support Regex: "_confidence$",
        "condition_category": "Confidence",
        "condition_type": "ConfidenceThreshold",
        "condition_setting": "99",
    },
    {
        "description": "dln is required",
        "field_name": "dln",
        "condition_category": "Required",
        "condition_type": "Required",
        "condition_setting": None,
    },
    {
        "description": "dln length should be 16",
        "field_name": "dln",
        "condition_category": "LengthCheck",
        "condition_type": "ValueRegex",
        "condition_setting": "^[0-9a-zA-Z]{16}$",
    }
]

More information about the rule definition:
* *description*: The description of the rule.
* *field_name*: The field name defined in the data model JSON (the `page1` object defined in Step 2).
* *field_name_regex*: A regular expression matched against field names when you want to apply the same rule to multiple fields. For example, applying the rule to all fields with the prefix “part1” needs the field_name_regex value “^part1” in standard regular expression format. Note that field_name is ignored when “field_name_regex” is specified.
* *condition_category*: The category of the condition, for display and tracking purposes. One of: "Required", "Confidence", "LengthCheck", "ValueCheck".
* *condition_type*: The type of the condition. One of: "Required", "ValueRegex", "ConfidenceThreshold".
* *condition_setting*: A regular expression string when the condition_type is “ValueRegex”, the threshold value when it is “ConfidenceThreshold”, and None when it is “Required”.
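
As an illustration of how `field_name` and `field_name_regex` could select the fields a rule applies to, here is a minimal sketch. The helper `fields_for_rule` is hypothetical and is not taken from `condition.py`. It assumes the regular expression is searched against each field name.

```python
import re


def fields_for_rule(rule, data):
    """Return the data model field names that a rule applies to."""
    pattern = rule.get("field_name_regex")
    if pattern:
        # The regex takes precedence; field_name is ignored in this case.
        return [name for name in data if re.search(pattern, name)]
    name = rule.get("field_name")
    return [name] if name in data else []


# Made-up data model with three fields.
data = {"dln": None, "part1.1": None, "part1.3": None}

by_regex = fields_for_rule({"field_name": "dln", "field_name_regex": "^part1"}, data)
by_name = fields_for_rule({"field_name": "dln"}, data)
print(by_regex, by_name)
```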

## We have the data extracted from the document and the rules defined. Now let's evaluate the data against these rules.
The `Condition` class is a generic rules engine that takes 2 parameters: the data (the `page1` object) and the rules we defined above. It returns 2 lists: the conditions that failed and the conditions that were satisfied. We can then send the document to A2I for human review if any conditions fail.

The source code of the `Condition` class is in the `condition.py` file in the a2idata folder. It supports basic validation logic, such as validating a string's length, value range, and confidence score threshold. You can modify the code to support more condition types for complex validation logic.
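
To make the evaluation logic concrete, here is a minimal stand-in for such a rules engine. It is a sketch under stated assumptions and not the code in `condition.py`: the function names `check_rule` and `evaluate` are hypothetical, the sample data is made up, and a confidence equal to the threshold is treated as passing.

```python
import re


def check_rule(rule, field):
    """Return True when the extracted field satisfies the rule."""
    field = field or {}
    value = field.get("value")
    condition_type = rule["condition_type"]
    setting = rule.get("condition_setting")
    if condition_type == "Required":
        return value is not None and str(value).strip() != ""
    if condition_type == "ConfidenceThreshold":
        confidence = field.get("confidence")
        return confidence is not None and confidence >= float(setting)
    if condition_type == "ValueRegex":
        return value is not None and re.search(setting, str(value)) is not None
    raise ValueError(f"Unsupported condition_type: {condition_type}")


def evaluate(data, rules):
    """Split the rules into (missed, satisfied) lists."""
    missed, satisfied = [], []
    for rule in rules:
        target = satisfied if check_rule(rule, data.get(rule["field_name"])) else missed
        target.append(rule)
    return missed, satisfied


# Made-up extraction result: a 15-character DLN and a 98% employer ID confidence.
data = {
    "dln": {"value": "123456789012345", "confidence": 99.5},
    "d.employer_id": {"value": "12-3456789", "confidence": 98.0},
}
demo_rules = [
    {"description": "confidence", "field_name": "d.employer_id",
     "condition_type": "ConfidenceThreshold", "condition_setting": "99"},
    {"description": "required", "field_name": "dln",
     "condition_type": "Required", "condition_setting": None},
    {"description": "length", "field_name": "dln",
     "condition_type": "ValueRegex", "condition_setting": "^[0-9a-zA-Z]{16}$"},
]

missed, satisfied = evaluate(data, demo_rules)
print([r["description"] for r in missed])
```

Supporting a new condition type in this sketch means adding one more branch to `check_rule`, which mirrors the kind of extension described above.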

In [None]:
from a2idata.condition import Condition

# Validate business rules:
con = Condition(page1, rules)
rule_missed, rule_satisfied = con.check_all()

In [None]:
# print out the list of failed business rules
rule_missed

You should see 2 conditions that failed by design:
- The 1st condition expects the employer ID confidence score to be at least 99%, but the actual Textract confidence is around 98%.
- The 3rd condition checks the length of the DLN and expects exactly 16 characters, but the actual length is 15.

In the next step, we will send this list of failed conditions to A2I for human review.
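
The routing decision itself is simple: a document goes to human review only when at least one rule failed. The sketch below illustrates this with a hypothetical `route_document` helper. The `start_review` callable stands in for whatever starts the A2I human loop, and the labels "auto" and "human_review" are assumptions made for illustration.

```python
def route_document(rule_missed, rule_satisfied, start_review):
    """Send the document to human review only when at least one rule failed."""
    if not rule_missed:
        return "auto"
    start_review({
        "ConditionMissed": rule_missed,
        "ConditionSatisfied": rule_satisfied,
    })
    return "human_review"


# A list's append method stands in for the call that starts a human loop.
sent = []
first = route_document([], [{"description": "dln is required"}], sent.append)
second = route_document([{"description": "dln length should be 16"}], [], sent.append)
print(first, second, len(sent))
```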

## Step 3.2: Setup customized A2I UI template and workforce

Amazon A2I allows you to customize the reviewer’s web view by defining a worker task template (https://docs.aws.amazon.com/sagemaker/latest/dg/a2i-custom-templates.html). The template is a static web page written in HTML and JavaScript. You can pass data to the customized reviewer page using the Liquid (https://shopify.github.io/liquid/) syntax.
In the sample below, the custom template shows the document page on the left and the unsatisfied conditions on the right. Reviewers can correct the extracted values and add their comments.

### In the previous lab (notebook 04-idp-document-a2i.ipynb), we set up the resources below:
* A **Work Team (WHO)** authenticates the group of workers you have selected to review the tasks.
* A **Human Task UI (WHAT)** defines what the reviewer will see in the A2I console when reviewing the task, using the default Textract/A2I template.
* A **Workflow Definition (WHEN)** wrapping the above information and defining the conditions when the human review should trigger, using the default Textract A2I condition template.

In this lab, we will check whether the account already has a work team. The code below gets the work team ARN and stores it in a local variable. If the cell below returns an error, follow these instructions to set up the workforce, then run the cell again:
https://catalog.workshops.aws/intelligent-document-processing/en-US/02-getting-started/module-4-human-review#setup-an-a2i-human-review-workflow

In [None]:
# get the existing workforce arn
work_team_arn = sagemaker.list_workteams()["Workteams"][0]["WorkteamArn"]
work_team_arn

Create a new A2I worker task template. This is the Liquid HTML page you use to customize the reviewer UI. (The HTML template is stored at a2idata/a2i-custom-ui.html.)

In [None]:
# Read the UI template from the a2idata directory
template = ""
with open('a2idata/a2i-custom-ui.html','r') as f:
    template = f.read()

resp = sagemaker.create_human_task_ui(
        HumanTaskUiName="a2i-custom-ui-demo",
        UiTemplate={'Content': template})

In [None]:
# Keep the new UI template ARN in a variable
ui_template_arn = resp["HumanTaskUiArn"]
ui_template_arn

Create a new human review workflow that wraps up all the information A2I needs.

In [None]:
resp = sagemaker.create_flow_definition(
        FlowDefinitionName= "a2i-custom-ui-demo-workflow",
        RoleArn= role,
        HumanLoopConfig= {
            "WorkteamArn": work_team_arn,
            "HumanTaskUiArn": ui_template_arn,
            "TaskCount": 1,
            "TaskDescription": "A2I custom business rule and UI demo workflow",
            "TaskTitle": "Custom rule sample task"
        },
        OutputConfig={
            "S3OutputPath" : f's3://{data_bucket}/a2i/output/'
        }
    )

workflow_definition_arn = resp['FlowDefinitionArn']

The new A2I UI template and the Workflow definition are in place. Let's send the missed conditions to the Workflow, so a reviewer can verify the result using A2I.

In [None]:
import uuid
human_loop_name = 'custom-loop-' + str(uuid.uuid4())

# Construct the data sent to the custom A2I human review task
a2i_payload = {
                "InputContent": json.dumps({
                    "Results": {
                        "ConditionMissed": rule_missed,
                        "ConditionSatisfied": rule_satisfied
                    },
                    "s3":{
                        "bucket":data_bucket,
                        "path":s3_key,
                        "url": f's3://{data_bucket}/{s3_key}',
                        "image_width": image_width,
                        "image_height": image_height
                    },
                    "text": "990 Tax Form Page 1",
                })
            }

# Start the human loop task
start_loop_response = a2i.start_human_loop(
            HumanLoopName=human_loop_name,
            FlowDefinitionArn=workflow_definition_arn,
            HumanLoopInput=a2i_payload)


In [None]:
human_loop_arn = start_loop_response["HumanLoopArn"]

Check status of the Human Loop

In [None]:
a2i.describe_human_loop(HumanLoopName=human_loop_name)["HumanLoopStatus"]

The below cell will print out the A2I console URL, which you can use to log in using the credential received when setting up the Workforce to review the task.

In [None]:
work_team_name = work_team_arn[work_team_arn.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker.describe_workteam(WorkteamName=work_team_name)['Workteam']['SubDomain'])

In the A2I console, you should see a task in the list. Click on the "Start Working" button, and A2I will bring you to the customized UI page that looks like the below. 

Below is a screenshot of the customized A2I UI. It shows the original image document on the left and the 2 failed conditions on the right (We defined the conditions to fail on purpose):

* The DLN should be 16 characters long. The actual DLN has 15 characters.
* The confidence score of the employer_id field is lower than 99%. The actual confidence score is around 98%.


![A2I custom UI](./images/a2i-custom-ui.png)

Once you review the task and click the Submit button, the human review task status changes to "Completed".

A2I will store a JSON file in the S3 bucket once the review is submitted. The JSON will include the original data sent to A2I and the reviewer's input.
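
In an automated pipeline, you would typically poll for the status instead of re-running the status cell by hand. The following is a minimal sketch with a hypothetical helper, `wait_for_human_loop`. It takes the describe call as a parameter so that it can be demonstrated here with a fake. In the notebook, you could pass `a2i.describe_human_loop` and `human_loop_name` instead. The poll interval and the poll limit are arbitrary assumptions.

```python
import time

# Statuses after which the human loop will not change any further.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_human_loop(describe, name, poll_seconds=30, max_polls=20,
                        sleep=time.sleep):
    """Poll the human loop until it reaches a terminal status or polls run out."""
    status = None
    for _ in range(max_polls):
        status = describe(HumanLoopName=name)["HumanLoopStatus"]
        if status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)
    return status


# Fake describe call that reports completion on the third poll.
_statuses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(HumanLoopName):
    return {"HumanLoopStatus": next(_statuses)}


final_status = wait_for_human_loop(fake_describe, "custom-loop-demo",
                                   poll_seconds=0, sleep=lambda seconds: None)
print(final_status)
```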

In [None]:
a2i_resp = a2i.describe_human_loop(HumanLoopName=human_loop_name)
print("Human Loop task status: ", a2i_resp["HumanLoopStatus"])
print("Human Loop output: ", a2i_resp["HumanLoopOutput"]["OutputS3Uri"])

## Check A2I generated JSON
Now, let's download the A2I output file and print it out:

In [None]:
s3.download_file(data_bucket, a2i_resp["HumanLoopOutput"]["OutputS3Uri"].replace(f's3://{data_bucket}/',''), 'a2i-output.json')

The `humanAnswers` field in the JSON file contains the reviewer's input. The `inputContent` field contains the original data sent to A2I.

In [None]:
import json
with open('a2i-output.json','r') as f:
    print(json.dumps(json.loads(f.read()), indent=2))
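
To pull only the reviewer's input out of the output file, you can walk the `humanAnswers` list. The sketch below uses a hand-written sample instead of a real A2I output file. It assumes each entry carries `workerId`, `submissionTime`, and `answerContent`, and the keys inside `answerContent` are hypothetical because they depend on the custom UI template.

```python
# Hand-written sample in the general shape of an A2I output file (not real output).
sample_output = {
    "humanLoopName": "custom-loop-demo",
    "inputContent": {"text": "990 Tax Form Page 1"},
    "humanAnswers": [
        {
            "workerId": "worker-1",
            "submissionTime": "2023-01-01T12:00:00.000Z",
            # Keys in answerContent depend on the UI template; these are made up.
            "answerContent": {"dln": "1234567890123456",
                              "comment": "Added a missing digit"},
        }
    ],
}


def reviewer_answers(output):
    """Return one summary dict per submitted human answer."""
    return [
        {
            "worker": answer.get("workerId"),
            "submitted": answer.get("submissionTime"),
            "answer": answer.get("answerContent", {}),
        }
        for answer in output.get("humanAnswers", [])
    ]


answers = reviewer_answers(sample_output)
print(answers[0]["answer"])
```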

---
## Expand the solution to support more documents and business rules

To expand the solution to support more document pages with corresponding business rules, you will need to make changes in the below 3 places:

* Create a data model for the new page: in JSON structure representing all the values you want to extract out of the pages. 
* The extraction logic: leverage Amazon Textract to extract text out of the document and populate value to the data model.
* Add business rules corresponding to the page in JSON format. 

The custom A2I UI in the solution is generic, which doesn’t require a change to support new business rules.
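
One possible way to keep the three pieces for each page together is a small per-page configuration object. This is a design sketch only and is not part of the sample solution: `PageConfig`, `PAGES`, and the page key "990-page-1" are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class PageConfig:
    # Query alias -> (data model field name, Textract query text)
    queries: dict
    # Business rules in the same JSON format used for page 1
    rules: list = field(default_factory=list)

    def empty_model(self):
        """Data model with every field initialised to None."""
        return {field_name: None for field_name, _ in self.queries.values()}

    def queries_config(self):
        """QueriesConfig dict for the analyze_document call."""
        return {"Queries": [{"Text": text, "Alias": alias}
                            for alias, (_, text) in self.queries.items()]}


# Hypothetical registry keyed by page identifier.
PAGES = {
    "990-page-1": PageConfig(
        queries={"DLN_NO": ("dln", "What is the DLN?")},
        rules=[{"description": "dln is required", "field_name": "dln",
                "condition_category": "Required", "condition_type": "Required",
                "condition_setting": None}],
    )
}

page_config = PAGES["990-page-1"]
print(page_config.empty_model(), page_config.queries_config())
```

With this layout, supporting a new page means adding one more entry to the registry while the extraction and validation code stays unchanged.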

---

# Step 4: Build BI dashboard using QuickSight  <a id="step4"></a>

In the below section, we will build a BI dashboard using Amazon QuickSight to get insights into the IDP pipeline.  
Below is a screenshot of the Amazon QuickSight dashboard. It includes widgets showing the number of documents processed automatically versus those requiring human review, the fields that most often caused a document to require human review, and a histogram of the number of documents processed daily.

You can expand the dashboard by including more data and visuals to get insights and support business decisions.

![A2I custom UI](./images/a2i-quicksight-dashboard.png)

## QuickSight Initial Setup 
You will need author access to a QuickSight Enterprise Account for this workshop.

If you don't have a QuickSight account already, steps to create one are given below.

***Setup QuickSight***

1. Launch the AWS Console (https://console.aws.amazon.com) in a new browser tab, search for QuickSight, and launch it.
2. On QuickSight page, click Sign up for QuickSight button.
3. Keep the default Enterprise edition, scroll down and click Continue button.
4. Enter QuickSight account name & Notification email address. Be sure to choose a name that is relevant and applicable to your entire user pool. Enter your official email as the notification email.
5. Scroll down and click the Finish button. (It can take 15-30 seconds to set up the account.)
6. Click Go to Amazon QuickSight button. You will now be taken to QuickSight console.

![A2I custom UI](./images/a2i-quicksight-init.gif)

In this lab, we have a CSV file under the a2idata folder called a2i-bi-sample-data.csv, which you can use to build the QuickSight dashboard. It is a sample dataset to start with. You can develop your ETL process to transform the A2I JSON data to your preferred format.
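
As a starting point for such an ETL step, the sketch below flattens rule evaluation results into CSV rows. The column names (`doc_id`, `timestamp`, `process_method`, `field_name`) are taken from the fields used in the QuickSight steps, but the exact layout of a2i-bi-sample-data.csv is an assumption, as is the "manual" label for documents that went to human review. The helper `to_rows` and the sample values are hypothetical.

```python
import csv
import io

COLUMNS = ["doc_id", "timestamp", "process_method", "field_name"]


def to_rows(doc_id, timestamp, rule_missed):
    """One row per failed rule, or a single 'auto' row when nothing failed."""
    if not rule_missed:
        return [{"doc_id": doc_id, "timestamp": timestamp,
                 "process_method": "auto", "field_name": ""}]
    return [{"doc_id": doc_id, "timestamp": timestamp,
             "process_method": "manual", "field_name": rule["field_name"]}
            for rule in rule_missed]


# Made-up documents: one passed all rules, one failed two rules.
rows = to_rows("doc-001", "2023-01-01T10:00:00", [])
rows += to_rows("doc-002", "2023-01-01T11:30:00",
                [{"field_name": "dln"}, {"field_name": "d.employer_id"}])

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Because a document with several failed rules produces several rows, counting documents in QuickSight needs a distinct count on `doc_id`.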

**Add Dataset**

1. Download the CSV file to your local drive from a2idata/a2i-bi-sample-data.csv
2. On the QuickSight page, click on "Datasets" then click on "Add new dataset" button on the top-right side
3. Click on "Upload a file" then choose the csv file downloaded in step 1
4. Click "Next" on the "Confirm file upload settings" page.
5. Click the "Visualize" button. QuickSight then navigates to the Analyses page.

![A2I custom UI](./images/a2i-quicksight-dataset.gif)

After the dataset is added to QuickSight, you are taken to the Analyses page, where you manage visuals. Let's create a pie chart to show the total number of documents processed automatically vs. with human review.

**Create a Pie Chart**
1. Select the Pie chart in the "Visual Type" section
2. Drag the "process_method" field to the first drop-down list in Field wells
3. Drag the "doc_id" field to the 2nd drop-down list in Field wells, then change the aggregation type from the default "count" to "Count distinct"
4. Change the Visual display summary by double-clicking the visual to "Numbers of documents processed automatically vs. human review" then click "Save"

You now get the first visual ready.

![A2I custom UI](./images/a2i-quicksight-visual-pie.gif)

Let's create another visual that shows the field(s) that caused most of the human reviews, so you can optimize the workflow based on the insights.

**Create a Word Cloud Visual**
1. Click "Add" - > "Add Visual" on the top-left menu
2. Select the Word Cloud in the "Visual Type" section
3. Drag the "field_name" field to the first drop-down list in Field wells

We need to filter the dataset for this visual so that it only shows fields from tasks that went to human review:

4. Click "Filter" on the left menu, then click the "Create one..." link
5. Choose the "process_method" field, to which you will apply the filter
6. Click on "include all" to change the default filter setting
7. Uncheck "auto" from the list, then click "Apply"

Now the visual will ignore the "auto" task in the dataset

8. Change the Visual display summary by double-clicking the visual to "Fields caused most of the human review" then click "Save"

![A2I custom UI](./images/a2i-quicksight-visual-wordcloud.gif)

Let's add a 3rd visual, a histogram showing the numbers of documents processed daily by humans vs. automation. 

**Create a Histogram Visual**
1. Click "Add" - > "Add Visual" on the top-left menu
2. Select the Line Chart in the "Visual Type" section
3. Drag the "timestamp" field to the first drop-down list in Field wells
4. Drag the "doc_id" field to the second drop-down list in Field wells and change the aggregation type to "Count distinct"
5. Drag the "process_method" field to the third drop-down list in Field wells
6. Change the Visual display summary by double-clicking the visual to "Numbers documents processed daily" then click "Save"

![A2I custom UI](./images/a2i-quicksight-visual-histogram.gif)

We now have 3 visuals that show insights into the IDP A2I workflow. You can publish them to a dashboard by following the steps below:
1. Click "Share" on the top-right menu and choose "Publish a dashboard"
2. When publishing a dashboard for the first time, type a name in the text box under "Publish new dashboard as", then click "Publish Dashboard"

You can set up access control in QuickSight to share the dashboard with other users.

![A2I custom UI](./images/a2i-quicksight-publish.gif)

---

# Cleanup

Cleanup is optional if you want to execute subsequent notebooks. 

Refer to the `05-idp-cleanup.ipynb` for cleanup and deletion of resources.

---
# Conclusion

In this notebook, we built a pipeline that extracts data from the first page of the 990 Tax Form using Textract, applies customized business rules to the extracted data, and then uses the customized A2I UI to review the result. Finally, we put together a QuickSight dashboard to get insights into the overall workflow.

Intelligent Document Processing is in high demand, and companies need a customized pipeline to support their unique business logic. Amazon A2I offers a built-in template integrated with Amazon Textract support for common human review use cases. It also allows you to customize the reviewer page to serve flexible requirements. 