{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mortgage Document Extraction\n", "\n", "---\n", "\n", "At this point we have identified the documents, and we can now start extracting information from them. What we want to extract information from each document will depend on the type of document as depicted in the figure below.\n", "\n", "

\n", " \"exn\"\n", "

\n", "\n", "\n", "\n", "---\n", "\n", "## Setup Notebook\n", "\n", "We will be using the [Amazon Textract Parser Library](https://github.com/aws-samples/amazon-textract-response-parser/tree/master/src-python) for parsing through the Textract response, data science library [Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for content analysis, the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/), and [AWS boto3 python sdk](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to work with Amazon Textract and Amazon A2I. Let's now install and import them." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "!python -m pip install -q amazon-textract-response-parser --upgrade --force-reinstall\n", "!python -m pip install -q amazon-textract-caller --upgrade --force-reinstall\n", "!python -m pip install -q amazon-textract-prettyprinter --upgrade --force-reinstall\n", "!python -m pip install pdfplumber\n", "!sed -i '/disable ghostscript format types/,+6d' /etc/ImageMagick-6/policy.xml" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "#Restart the kernel\n", "import IPython\n", "IPython.Application.instance().kernel.do_shutdown(True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "import boto3\n", "import botocore\n", "import sagemaker\n", "import os\n", "import io\n", "import datetime\n", "import json\n", "import pandas as pd\n", "from PIL import Image as PImage, ImageDraw\n", "from pathlib import Path\n", "from IPython.display import Image, display, HTML, JSON, IFrame\n", "from textractcaller import call_textract_lending\n", "from textractprettyprinter.t_pretty_print import convert_lending_from_trp2\n", "import trp.trp2_lending as tl\n", "import multiprocessing as mp\n", "from trp import Document\n", "from sagemaker import get_execution_role\n", "\n", "\n", "# variables\n", "data_bucket = sagemaker.Session().default_bucket()\n", "region = boto3.session.Session().region_name\n", "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", "\n", "os.environ[\"BUCKET\"] = data_bucket\n", "os.environ[\"REGION\"] = region\n", "role = sagemaker.get_execution_role()\n", "\n", "print(f\"SageMaker role is: {role}\\nDefault SageMaker Bucket: s3://{data_bucket}\")\n", "\n", "s3=boto3.client('s3')\n", "textract = boto3.client('textract', region_name=region)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "documentName = \"docs/lending_package.pdf\"\n", "display(IFrame(documentName, 500, 600));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Extracting information from documents\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From module 1 - Document classification, we have the following as documents identified from the lending packet. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Load the Output JSON response that was generated in the previous module" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "with open('lending-doc-output.json') as textract_json:\n", "    data = json.load(textract_json)\n", "\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use the Textract Response Parser library to convert the response into an easier format to work with\n", "\n", "The JSON is converted to a \"flattened\", CSV-like structure that is easier to work with. Each row has the following columns (the first column combines the page classification with the page number within the document type):\n", "\n", "- {page_classification}_{page_number_within_document_type}\n", "- page_number_in_document\n", "- key\n", "- key_confidence\n", "- value\n", "- value_confidence\n", "- key-bounding-box.top\n", "- key-bounding-box.height\n", "- key-bb.width\n", "- key-bb.left\n", "- value-bounding-box.top\n", "- value-bb.height\n", "- value-bb.width\n", "- value-bb.left\n", "\n", "Here, we use the [Textract Response Parser helper library](https://pypi.org/project/amazon-textract-response-parser/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "trp2_doc: tl.TFullLendingDocument = tl.TFullLendingDocumentSchema().load(data)\n", "lending_array = convert_lending_from_trp2(trp2_doc)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "for row in lending_array:\n", "    print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use a [Pandas DataFrame](https://pandas.pydata.org/), from a Python library widely used in machine learning and data science, to store `lending_array` so we can further extract and analyze data according to the respective document labels/classifications." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "df = pd.DataFrame(lending_array, columns = ['{page_classification}_{page_number_within_document_type}', 'page_number_in_document', 'key', 'key_confidence', 'value', 'value_confidence', 'key-bounding-box.top', 'key-bounding-box.height', 'key-bb.width', 'key-bb.left', 'value-bounding-box.top', 'value-bb.height', 'value-bb.width', 'value-bb.left'])\n", "df"
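] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each row carries confidence scores for both the key and the value. Before trusting the extractions downstream, it can be useful to flag low-confidence pairs for human review. The cell below is a minimal sketch; the threshold of 90 is an arbitrary assumption (Textract typically reports confidence on a 0-100 scale), not a recommendation." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Minimal sketch: flag key-value pairs whose key or value confidence falls\n", "# below a threshold. The value 90 is an arbitrary assumption.\n", "CONFIDENCE_THRESHOLD = 90\n", "\n", "# Coerce the confidence columns to numeric in case they arrive as strings\n", "key_conf = pd.to_numeric(df['key_confidence'], errors='coerce')\n", "value_conf = pd.to_numeric(df['value_confidence'], errors='coerce')\n", "\n", "low_confidence_df = df[(key_conf < CONFIDENCE_THRESHOLD) | (value_conf < CONFIDENCE_THRESHOLD)]\n", "low_confidence_df[['key', 'key_confidence', 'value', 'value_confidence']]"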
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame(lending_array, columns = ['{page_classification}_{page_number_within_document_type}', 'page_number_in_document', 'key', 'key_confidence', 'value', 'value_confidence', 'key-bounding-box.top', 'key-bounding-box.height', 'key-bb.width', 'key-bb.left', 'value-bounding-box.top', 'value-bb.height', 'value-bb.width', 'value-bb.left'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Following are the document pages as they were classified in the previous document classification step: \n", "\n", "## 1. Payslip \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get info (key-value) pairs from the response\n", "\n", "In this section we will get the extractions in the document in key-value pair format." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "payslip_df = pd.DataFrame()\n", "payslip_df = df.loc[df['{page_classification}_{page_number_within_document_type}'] == \"PAYSLIPS_undetected\"].set_index('{page_classification}_{page_number_within_document_type}')\n", "\n", "payslip_df[['key','value']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 2. Check \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get info (key-value) pairs from the response\n", "\n", "In this section we will get the extractions in the document in key-value pair format.\n", "\n", "> **_NOTE:_** For this document, the following key-value pair included in the response: \"SIGNATURE – SIGNATURE_PRESENT\" indicates that a signature was found in the document." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "checks_df = pd.DataFrame()\n", "checks_df = df.loc[df['{page_classification}_{page_number_within_document_type}'] == \"CHECKS_undetected\"].set_index('{page_classification}_{page_number_within_document_type}')\n", "\n", "checks_df[['key','value']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 3. Identity Document " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get info (key-value) pairs from the response\n", "\n", "In this section we will get the extractions in the document in key-value pair format." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "ID_df = pd.DataFrame()\n", "ID_df = df.loc[df['{page_classification}_{page_number_within_document_type}'] == \"IDENTITY_DOCUMENT_undetected\"].set_index('{page_classification}_{page_number_within_document_type}')\n", "\n", "ID_df[['key','value']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## 4. HomeOwners Insurance Application - _Amazon Textract Queries example_ \n", "\n", "In document classification, this document was detected as `unclassified`, hence we will use Amazon Textract Queries feature to extract key information from such types of documents." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upload the unclassified document to S3\n", "\n", "We will extract and upload the unclassified document to Amazon S3 Bucket. Note that documents will be uploaded into SageMaker's default S3 bucket. If you wish to use a different bucket please make sure you update the bucket name in `data_bucket` variable and also ensure that SageMaker has permissions to the S3 bucket." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import pdfplumber\n", "input_pdf = \"docs/lending_package.pdf\"\n", "\n", "with pdfplumber.open(input_pdf) as pdf:\n", " page7 = pdf.pages[6]\n", " image = page7.to_image(resolution=350).original\n", " image.save(\"docs/unclassified.pdf\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Upload images to S3 bucket:\n", "!aws s3 cp docs/unclassified.pdf s3://{data_bucket}/idp/textract/ --only-show-errors" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "file_key=\"idp/textract/unclassified.pdf\"\n", "display(IFrame(src=s3.generate_presigned_url('get_object', Params={'Bucket': data_bucket, 'Key': file_key}), width=600, height=700))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A we can see the homeowners insurance application is a document containing dense text. In this case we are interested in finding out a few key information from the entire document. Instead of extracting all the text from the document, and then apply logic (for example: regular expression) to find out that information, hence, we use the Amazon Textract Queries feature to grab the infromation from the document. \n", "\n", "Specifically, the information we are looking for are-\n", "\n", "1. Named insured.\n", "2. Policy number.\n", "3. Effective date.\n", "4. Insurance company.\n", "\n", "We will craft questions in plain english language for the Textract API and pass it to the API call to get the information. Queries are-\n", "\n", "1. Who is the named insured?\n", "2. What is the policy number?\n", "3. What is the effective date?\n", "4. What is the insurance company name?\n", "\n", "Also, we can see that all of this information is available in the single-page pdf document so we don't need the AI to look through all the pages to find this info. We will pass the page number when making the API call. Note: If the page number is not known then the `pages` parameter can be omitted." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from textractcaller import QueriesConfig, Query\n", "from textractcaller.t_call import call_textract, Textract_Features\n", "\n", "# Setup the queries\n", "query1 = Query(text=\"Who is the named insured?\" , alias=\"NAMED_INSURED\", pages=[\"1\"])\n", "query2 = Query(text=\"What is the policy number?\", alias=\"POLICY_NUMBER\", pages=[\"1\"])\n", "query3 = Query(text=\"What is the effective date?\", alias=\"EFFECTIVE_DATE\", pages=[\"1\"])\n", "query4 = Query(text=\"What is the insurance company name?\", alias=\"INSURANCE_COMPANY\", pages=[\"1\"])\n", "\n", "#Setup the query config with the above queries\n", "queries_config = QueriesConfig(queries=[query1, query2, query3, query4])\n", "\n", "response_unclassified = call_textract(input_document=f's3://{data_bucket}/{file_key}',\n", " features=[Textract_Features.QUERIES],\n", " queries_config=queries_config)\n", "doc_unclassified = Document(response_unclassified)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import trp.trp2 as t2\n", "doc_unclassified : t2.TDocumentSchema = t2.TDocumentSchema().load(response_unclassified) \n", " \n", "entities = {}\n", "for page in doc_unclassified.pages:\n", " query_answers = doc_unclassified.get_query_answers(page=page)\n", " if query_answers:\n", " for answer in query_answers:\n", " entities[answer[1]] = answer[2]\n", " \n", "display(JSON(entities, root='Query Answers'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Conclusion\n", "\n", "In this notebook, we saw how to extract KEY-VALUE pairs from various documents that may be present in a mortgage packet. We also used Amazon Textract AnalyzeDocument Queries feature to extract specific information from an unclassified document, which is dense text and got accurate responses back from the API. In the next stages, we will continue with capabilities of Amazon Comprehend Custom NER and then perform enrichment on one of the documents." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }