{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4c2f249b",
   "metadata": {},
   "source": [
    "## Detecting Merged Cells And Headers on fictitious bank statement"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b16cdb3",
   "metadata": {},
   "source": [
    "We will be using the modules below:\n",
    "* amazon-textract-caller (https://pypi.org/project/amazon-textract-caller/) to invoke Amazon Textract API on our behalf\n",
    "* amazon-textract-response-parser (http://%28https//pypi.org/project/amazon-textract-response-parser/) to parse the response payload\n",
    "* amazoon-textract-prettyprinter (https://pypi.org/project/amazon-textract-prettyprinter/) to \"pretty-print\" tables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a6e6c072",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install boto3\n",
    "!pip install amazon-textract-caller\n",
    "!pip install amazon-textract-prettyprinter\n",
    "!pip install trp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2bd23c9d",
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import json\n",
    "import pandas as pd\n",
    "from textractcaller import call_textract, Textract_Features\n",
    "from textractprettyprinter.t_pretty_print import Pretty_Print_Table_Format, Textract_Pretty_Print, get_string, get_tables_string\n",
    "from trp import Document\n",
    "from trp.trp2 import TDocument, TDocumentSchema\n",
    "from trp.t_pipeline import order_blocks_by_geo\n",
    "from IPython.display import display"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f40c07e6",
   "metadata": {},
   "source": [
    "Let's initialize the boto3 session and then invoke textract_caller to perform the document processing API call and collect the response back on our behalf."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34479535",
   "metadata": {},
   "outputs": [],
   "source": [
    "session = boto3.Session(profile_name='<your_profile_name>')\n",
    "documentName = \"s3://textract-table-merged-cells-data-sample/Textract-MergeCell-Statement.pdf\"\n",
    "textract_json = call_textract(input_document=documentName, features = [Textract_Features.TABLES])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4dd698d3",
   "metadata": {},
   "source": [
    "Let's pretty-print the response payload. As you can see, by default the date is not populated across all rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e0fb7fb",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(get_string(textract_json=textract_json, output_type=[Textract_Pretty_Print.TABLES]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36cedb05",
   "metadata": {},
   "source": [
    "Now let's load the response into an ordered document and scan the statement's table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50846117",
   "metadata": {},
   "outputs": [],
   "source": [
    "t_doc = TDocumentSchema().load(textract_json)\n",
    "ordered_doc = order_blocks_by_geo(t_doc)\n",
    "trp_doc = Document(TDocumentSchema().dump(ordered_doc))\n",
    "\n",
    "table_index = 1\n",
    "dataframes = []\n",
    "\n",
    "def combine_headers(top_h, bottom_h):\n",
    "    bottom_h[3] = top_h[2] + \" \" + bottom_h[3]\n",
    "    bottom_h[4] = top_h[2] + \" \" + bottom_h[4]\n",
    "\n",
    "for page in trp_doc.pages:\n",
    "    for table in page.tables:\n",
    "        table_data = []\n",
    "        headers = table.get_header_field_names()\n",
    "        if(len(headers)>0):                                      #Let's retain the only table with headers\n",
    "            print(\"Statememt headers: \"+ repr(headers))\n",
    "            top_header= headers[0]\n",
    "            bottom_header = headers[1]\n",
    "            combine_headers(top_header, bottom_header)           #The statement has two headers. let's combine them\n",
    "            for r, row in enumerate(table.rows_without_header):  #New Table attribute returning rows without headers\n",
    "                table_data.append([])\n",
    "                for c, cell in enumerate(row.cells):\n",
    "                    table_data[r].append(cell.mergedText)        #New Cell attribute returning merged cells common values\n",
    "            \n",
    "            if len(table_data)>0:\n",
    "                df = pd.DataFrame(table_data, columns=bottom_header)\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77d05bcd",
   "metadata": {},
   "source": [
    "Now we can even use multi level indexing and reproduce the table's initial structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e21b6df1",
   "metadata": {},
   "outputs": [],
   "source": [
    "multi = df.set_index(['Date', 'Details'])\n",
    "display(multi)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}