{ "cells": [ { "cell_type": "markdown", "id": "4c2f249b", "metadata": {}, "source": [ "## Detecting Merged Cells And Headers on fictitious bank statement" ] }, { "cell_type": "markdown", "id": "5b16cdb3", "metadata": {}, "source": [ "We will be using the modules below:\n", "* amazon-textract-caller (https://pypi.org/project/amazon-textract-caller/) to invoke Amazon Textract API on our behalf\n", "* amazon-textract-response-parser (http://%28https//pypi.org/project/amazon-textract-response-parser/) to parse the response payload\n", "* amazoon-textract-prettyprinter (https://pypi.org/project/amazon-textract-prettyprinter/) to \"pretty-print\" tables" ] }, { "cell_type": "code", "execution_count": null, "id": "a6e6c072", "metadata": {}, "outputs": [], "source": [ "!pip install boto3\n", "!pip install amazon-textract-caller\n", "!pip install amazon-textract-prettyprinter\n", "!pip install trp" ] }, { "cell_type": "code", "execution_count": null, "id": "2bd23c9d", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import json\n", "import pandas as pd\n", "from textractcaller import call_textract, Textract_Features\n", "from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_string, get_tables_string\n", "from trp import Document\n", "from trp.trp2 import TDocument, TDocumentSchema\n", "from trp.t_pipeline import order_blocks_by_geo\n", "from IPython.display import display" ] }, { "cell_type": "markdown", "id": "f40c07e6", "metadata": {}, "source": [ "Let's initialize the boto3 session and then invoke textract_caller to perform the document processing API call and collect the response back on our behalf." ] }, { "cell_type": "code", "execution_count": null, "id": "34479535", "metadata": {}, "outputs": [], "source": [ "session = boto3.Session(profile_name='')\n", "documentName = \"s3://textract-table-merged-cells-data-sample/Textract-MergeCell-Statement.pdf\"\n", "textract_json = call_textract(input_document=documentName, features = [Textract_Features.TABLES])" ] }, { "cell_type": "markdown", "id": "4dd698d3", "metadata": {}, "source": [ "Let's pretty-print the response payload. As you can see, by default the date is not populated across all rows." ] }, { "cell_type": "code", "execution_count": null, "id": "3e0fb7fb", "metadata": {}, "outputs": [], "source": [ "print(get_string(textract_json=textract_json, output_type=[Textract_Pretty_Print.TABLES]))" ] }, { "cell_type": "markdown", "id": "36cedb05", "metadata": {}, "source": [ "Now let's load the response into an ordered document and scan the statement's table." ] }, { "cell_type": "code", "execution_count": null, "id": "50846117", "metadata": {}, "outputs": [], "source": [ "t_doc = TDocumentSchema().load(textract_json)\n", "ordered_doc = order_blocks_by_geo(t_doc)\n", "trp_doc = Document(TDocumentSchema().dump(ordered_doc))\n", "\n", "table_index = 1\n", "dataframes = []\n", "\n", "def combine_headers(top_h, bottom_h):\n", " bottom_h[3] = top_h[2] + \" \" + bottom_h[3]\n", " bottom_h[4] = top_h[2] + \" \" + bottom_h[4]\n", "\n", "for page in trp_doc.pages:\n", " for table in page.tables:\n", " table_data = []\n", " headers = table.get_header_field_names()\n", " if(len(headers)>0): #Let's retain the only table with headers\n", " print(\"Statememt headers: \"+ repr(headers))\n", " top_header= headers[0]\n", " bottom_header = headers[1]\n", " combine_headers(top_header, bottom_header) #The statement has two headers. let's combine them\n", " for r, row in enumerate(table.rows_without_header): #New Table attribute returning rows without headers\n", " table_data.append([])\n", " for c, cell in enumerate(row.cells):\n", " table_data[r].append(cell.mergedText) #New Cell attribute returning merged cells common values\n", " \n", " if len(table_data)>0:\n", " df = pd.DataFrame(table_data, columns=bottom_header)\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "77d05bcd", "metadata": {}, "source": [ "Now we can even use multi level indexing and reproduce the table's initial structure." ] }, { "cell_type": "code", "execution_count": null, "id": "e21b6df1", "metadata": {}, "outputs": [], "source": [ "multi = df.set_index(['Date', 'Details'])\n", "display(multi)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }