Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

SPDX-License-Identifier: MIT-0

# Handling Multi Page Tables in Textract

## Background

In this notebook, we will cover how to detect and merge single tables that span multiple pages.
All document samples used are available in the *document_input* folder.


## Setup
_This notebook was created on an ml.t2.medium notebook instance._

Let's start by installing and importing all necessary libraries:

In [None]:
!pip install amazon-textract-response-parser
!pip install amazon-textract-prettyprinter
!pip install amazon-textract-helper

In [None]:
import os
import json
from trp.t_pipeline import pipeline_merge_tables
import trp.trp2 as t2
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string, get_tables_string, Pretty_Print_Table_Format
from trp.trp2 import TDocument, TDocumentSchema
from trp.t_tables import MergeOptions, HeaderFooterType
import boto3
textract_client = boto3.client('textract', region_name='us-east-2')

## Call Textract Command-line Tool
amazon-textract-helper provides a collection of ready-to-use functions and sample implementations to speed up evaluation and development for any project using Amazon Textract. It also installs a command-line tool called amazon-textract; in this notebook we call Textract from Python through the call_textract function.
Make sure your environment is set up with AWS credentials through configuration files, environment variables, or an attached role (see https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html). You can replace the S3 URI of the PDF document with your own.

In [None]:
s3_uri_of_documents = "s3://amazon-textract-public-content/multi-page-table/MPT_sample01-multi_page_table.pdf"
textract_json = call_textract(input_document=s3_uri_of_documents, features=[Textract_Features.FORMS, Textract_Features.TABLES], boto3_textract_client = textract_client)

## Pretty print the output (pre-table merge)
Pretty printing outputs nicely formatted information for words, lines, forms, or tables. Here, the helper function below converts each detected table into a pandas DataFrame and displays it. As you can see, the function prints two separate tables.

In [None]:
import pandas as pd
from trp import Document
from textractprettyprinter.t_pretty_print import convert_table_to_list
from IPython.display import display

def PrettyPrintTables(textract_json):
    """Display every table in a Textract response as a pandas DataFrame."""
    tdoc = Document(textract_json)
    for page in tdoc.pages:
        for table in page.tables:
            # Convert the table's cells into a list of rows, then into a DataFrame.
            df = pd.DataFrame(convert_table_to_list(trp_table=table))
            print('Table id:', table.id, 'Row count:', len(df.index))
            display(df)

In [None]:
PrettyPrintTables(textract_json)

## Merge tables across pages
Sometimes a table starts on one page and continues on the following page or pages. This component detects that case based on the number of columns and on whether a header is present on the subsequent table, and it can modify the output Textract JSON schema for downstream processing. Custom logic can also be developed for specific use cases.
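
As a rough illustration of the kind of check described above, the sketch below compares two tables given as plain lists of rows. It is not the library's implementation: the names `ContinuationCheck` and `check_continuation` are hypothetical, and the sketch only reports the two signals (column count and repeated header) without deciding how they should be combined.

```python
from dataclasses import dataclass
from typing import List

Table = List[List[str]]  # a table as rows of cell text


@dataclass
class ContinuationCheck:
    """Hypothetical result type: the two signals mentioned in the text."""
    same_column_count: bool
    repeats_header: bool


def check_continuation(first: Table, second: Table) -> ContinuationCheck:
    """Compare a table with the table that follows it on the next page."""
    if not first or not second:
        return ContinuationCheck(False, False)
    # Column count is taken from the widest row of each table.
    cols_first = max(len(row) for row in first)
    cols_second = max(len(row) for row in second)
    # Normalise whitespace and case before comparing the first rows.
    header_first = [cell.strip().lower() for cell in first[0]]
    header_second = [cell.strip().lower() for cell in second[0]]
    return ContinuationCheck(
        same_column_count=cols_first == cols_second,
        repeats_header=header_first == header_second,
    )


page1 = [["Date", "Item", "Amount"], ["2021-01-01", "Coffee", "3.50"]]
page2 = [["2021-01-02", "Tea", "2.75"], ["2021-01-03", "Cake", "4.00"]]
result = check_continuation(page1, page2)
print(result)
```

Keeping the two signals separate leaves room for use-case-specific rules, which is the kind of custom logic the component also allows.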

MergeOptions.MERGE combines the tables so that they appear as one for post-processing, with the drawback that the geometry information is no longer accurate. Overlaying bounding boxes on the merged table will therefore not be accurate either.

MergeOptions.LINK maintains the geometric structure and enriches the table information with links between the table elements: custom['previus_table'] and custom['next_table'] attributes are added to the TABLE blocks in the Textract JSON schema.

In [None]:
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)    
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, None, HeaderFooterType.NONE)
json_data = t2.TDocumentSchema().dump(t_document)   

#### Pretty print the output (post-table merge)
As you can see, both tables are merged into one table.

In [None]:
PrettyPrintTables(json_data)

## Link tables across pages
As described above, MergeOptions.LINK keeps the tables separate and preserves their geometry, adding custom['previus_table'] and custom['next_table'] attributes to the TABLE blocks in the Textract JSON schema. The cells below link the tables and then print the custom attribute of every TABLE block.
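
Once tables are linked, a downstream step can follow the links to recover the order of a multi-page table. The sketch below works on plain dictionaries rather than on TDocument blocks. It assumes that each custom entry holds the ID of the neighbouring TABLE block; the function name `table_chains` and the sample IDs are hypothetical, and the key spelling is copied from the text above.

```python
from typing import Dict, List, Optional


def table_chains(custom_by_id: Dict[str, Optional[dict]]) -> List[List[str]]:
    """Group table IDs into chains by following custom['next_table'].

    custom_by_id maps a TABLE block ID to its custom dict (or None).
    Assumption: the custom values are the IDs of neighbouring tables.
    """
    chains = []
    for table_id, custom in custom_by_id.items():
        # A chain starts at a table that has no previous table.
        if custom and "previus_table" in custom:
            continue
        chain = [table_id]
        seen = {table_id}
        next_id = (custom or {}).get("next_table")
        # Follow the links, guarding against unknown IDs and cycles.
        while next_id and next_id in custom_by_id and next_id not in seen:
            chain.append(next_id)
            seen.add(next_id)
            next_id = (custom_by_id[next_id] or {}).get("next_table")
        chains.append(chain)
    return chains


sample = {
    "t1": {"next_table": "t2"},
    "t2": {"previus_table": "t1", "next_table": "t3"},
    "t3": {"previus_table": "t2"},
    "t4": None,  # a table that is not linked to any other
}
chains = table_chains(sample)
print(chains)
```

With real data, the dictionary could be built from the TABLE blocks by mapping each block's id to its custom attribute.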

In [None]:
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)    
t_document = pipeline_merge_tables(t_document, MergeOptions.LINK, None, HeaderFooterType.NONE)  

In [None]:
for b in t_document.blocks:
    if b.block_type == t2.TextractBlockTypes.TABLE.name:
        print('---------------')
        print('Table id: ' + b.id)
        print(b.custom)
        

## Additional Examples: The tool identifies and merges tables across the document
In this example, the document contains multiple tables. Two pairs of tables need to be merged.

In [None]:
textract_json = call_textract(input_document="s3://amazon-textract-public-content/multi-page-table/MPT_sample02-multi_tables.pdf",features=[Textract_Features.FORMS, Textract_Features.TABLES], boto3_textract_client = textract_client)
PrettyPrintTables(textract_json)

#### Merge tables with 95% dimension tolerance
We use a custom accuracy of 95% when calculating table similarity. By default, the component uses 99%.
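
To make the idea of a percentage tolerance concrete, here is a minimal sketch that compares the horizontal extent of two tables. The formula (100 minus the relative difference) and the names `similarity_percent` and `similar_enough` are assumptions for illustration; the library's exact calculation may differ.

```python
from typing import Tuple


def similarity_percent(a: float, b: float) -> float:
    """Return how close two measurements are, as a percentage (100 = identical)."""
    if a == b:
        return 100.0
    largest = max(abs(a), abs(b))
    return 100.0 * (1.0 - abs(a - b) / largest)


def similar_enough(box_a: Tuple[float, float],
                   box_b: Tuple[float, float],
                   accuracy_percentage: float = 99.0) -> bool:
    """Each box is (left, width) in normalised page coordinates (0..1)."""
    return all(
        similarity_percent(x, y) >= accuracy_percentage
        for x, y in zip(box_a, box_b)
    )


first = (0.100, 0.800)   # table on page 1 (hypothetical values)
second = (0.103, 0.790)  # table on page 2 (hypothetical values)

print(similar_enough(first, second, 99.0))
print(similar_enough(first, second, 95.0))
```

In this sketch, a lower threshold accepts tables whose position and width drift slightly between pages, for example because of scanning skew.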

In [None]:
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)    
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, None, HeaderFooterType.NONE, 95)
json_data = t2.TDocumentSchema().dump(t_document)
PrettyPrintTables(json_data)

## Additional Examples: Merging a table that extends across pages
This example has a table that extends across pages 1, 2, and 3 and needs to be merged.

In [None]:
textract_json = call_textract(input_document="s3://amazon-textract-public-content/multi-page-table/MPT_sample03-long_multi_page_table.pdf",features=[Textract_Features.FORMS, Textract_Features.TABLES], boto3_textract_client = textract_client)
PrettyPrintTables(textract_json)

In [None]:
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)    
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, None, HeaderFooterType.NONE)
json_data = t2.TDocumentSchema().dump(t_document)
PrettyPrintTables(json_data)

## Additional Examples: Merging tables when the pages have headers and footers
The document contains header and footer values that can be ignored while assessing tables to be merged. This example has both a header and a footer.
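
One way to picture this: content inside a top or bottom band of the page is skipped when deciding whether a table is the last element on a page. The sketch below illustrates that general idea only and is not the library's code; the margin fraction and the names `in_body` and `is_last_body_element` are hypothetical.

```python
from typing import List, Tuple

# (label, top, height) in normalised page coordinates, where 0 is the top of the page.
Element = Tuple[str, float, float]

# Hypothetical margin: fraction of the page height treated as header or footer.
MARGIN = 0.10


def in_body(element: Element, margin: float = MARGIN) -> bool:
    """True if the element lies entirely outside the header and footer bands."""
    _, top, height = element
    return top >= margin and top + height <= 1.0 - margin


def is_last_body_element(label: str, page: List[Element], margin: float = MARGIN) -> bool:
    """True if `label` is the lowest element once header and footer are ignored."""
    body = [element for element in page if in_body(element, margin)]
    if not body:
        return False
    lowest = max(body, key=lambda element: element[1] + element[2])
    return lowest[0] == label


page = [
    ("page header", 0.02, 0.04),
    ("table", 0.30, 0.55),
    ("page footer", 0.94, 0.03),
]

print(is_last_body_element("table", page))              # footer ignored
print(is_last_body_element("table", page, margin=0.0))  # footer counted
```

Without a margin, the footer counts as the last element on the page, so the table would not look like one that continues onto the next page.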

In [None]:
textract_json = call_textract(input_document="s3://amazon-textract-public-content/multi-page-table/MPT_sample04-header_footer_table.pdf",features=[Textract_Features.FORMS, Textract_Features.TABLES], boto3_textract_client = textract_client)
PrettyPrintTables(textract_json)

In [None]:
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)    
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, None, HeaderFooterType.NORMAL)
json_data = t2.TDocumentSchema().dump(t_document)
PrettyPrintTables(json_data)

## Creating a Custom Table Detection Function
The component allows you to use your own table detection logic by passing a function to pipeline_merge_tables.
In the example below, we use a sample custom function that merges successive tables.

In [None]:
textract_json = call_textract(input_document="s3://amazon-textract-public-content/multi-page-table/MPT_sample02-multi_tables.pdf",features=[Textract_Features.FORMS, Textract_Features.TABLES], boto3_textract_client = textract_client)
PrettyPrintTables(textract_json)

In [None]:
from trp.t_pipeline import order_blocks_by_geo

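def CustomTableDetectionFunction(t_document):
    """Return a list of table-ID pairs, pairing successive tables."""
    table_ids_merge_list = []
    table_id_pairs = []
    # Order the blocks by geometry before walking the pages.
    ordered_doc = order_blocks_by_geo(t_document)
    trp_doc = Document(TDocumentSchema().dump(ordered_doc))
    for current_page in trp_doc.pages:
        # Stop at the first page that contains no tables.
        if len(current_page.tables) == 0:
            break
        for table in current_page.tables:
            table_id_pairs.append(table.id)
            # Once two table IDs are collected, record the pair and start over.
            if len(table_id_pairs) > 1:
                table_ids_merge_list.append(table_id_pairs.copy())
                table_id_pairs.clear()
    return table_ids_merge_list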


t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)    
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, CustomTableDetectionFunction, HeaderFooterType.NORMAL)
json_data = t2.TDocumentSchema().dump(t_document)
PrettyPrintTables(json_data)
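
The pairing logic of CustomTableDetectionFunction can be tried out without calling Textract. The sketch below mirrors its loop on plain lists of table IDs, one list per page; the function name `pair_successive_tables` and the sample IDs are hypothetical. It shows the shape of the return value: a list of lists, each inner list holding the IDs of two successive tables.

```python
from typing import List


def pair_successive_tables(pages: List[List[str]]) -> List[List[str]]:
    """Pair up successive table IDs, page by page."""
    merge_list: List[List[str]] = []
    pending: List[str] = []
    for table_ids in pages:
        # Same early exit as the function above: stop at a page without tables.
        if not table_ids:
            break
        for table_id in table_ids:
            pending.append(table_id)
            # Two IDs collected: record the pair and start a new one.
            if len(pending) > 1:
                merge_list.append(pending.copy())
                pending.clear()
    return merge_list


pages = [["a"], ["b", "c"], ["d"], [], ["e", "f"]]
pairs = pair_successive_tables(pages)
print(pairs)
```

Note that the tables on the last page are never paired, because the loop stops at the empty fourth page. A different rule (for example, skipping empty pages instead of stopping) would be a one-line change in a custom function.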