{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Document Extraction\n", "\n", "In this notebook, we will demonstrate how to utilize textract to extract text, table's, forms from documents. We will also showcase how to use the Analyze ID and Analyze Expense API's\n", "\n", "Additional code samples and methods of wrangling textract api response JSON's can be found [here](https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/Textract.ipynb) - pay particular attention to Section 8, 10 and 12. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are first going to install the following python libraries: \n", "\n", "- [Amazon Textract Response Parser](https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/README.md)\n", "- [tabulate](https://pypi.org/project/tabulate/)\n", "- [Amazon Textract Pretty Printer](https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter)\n", "- [Amazon Textract Caller](https://github.com/aws-samples/amazon-textract-textractor/tree/master/caller)\n", "\n", "Throughout this notebook, we will use utilise the libraries in a few different ways - this is simply to showcase the versatility of the tools available to you when handle JSON output from the textract API." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install amazon-textract-response-parser tabulate amazon-textract-prettyprinter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now import some additional libraries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#import our libararies\n", "import boto3\n", "import json\n", "import re\n", "import csv\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.s3 import S3Uploader, S3Downloader\n", "import uuid\n", "import time\n", "import io\n", "from io import BytesIO\n", "import sys\n", "from pprint import pprint\n", "import trp #textract response parser library\n", "from tabulate import tabulate\n", "\n", "from IPython.display import Image, display, IFrame\n", "from PIL import Image as PImage, ImageDraw\n", "from textractprettyprinter.t_pretty_print_expense import get_string, Textract_Expense_Pretty_Print, Pretty_Print_Table_Format, get_expensesummary_string, get_expenselineitemgroups_string\n", "# from textractcaller.t_call import call_textract, Textract_Features\n", "from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is an example function that we are constructing using the Textract response parser - this functions shows how to:\n", "\n", "- iterate through distinct **pages** in the textract JSON response\n", "- how to select the **form** construct from the pages\n", "- how to select the identified **keys** and **values** from the forms" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# #Helper Function for Form Extraction\n", "from trp import Document\n", "def getformkeyvalue(response):\n", " doc = Document(response)\n", " \n", " key_map = {}\n", " for page in doc.pages:\n", " # Print fields\n", " for field in page.form.fields:\n", " if field is None or field.key is None or field.value is None:\n", " continue\n", " \n", " key_map[field.key.text] = field.value.text\n", " return key_map" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now upload the sample documents that are in the `./dataset/document_samples/` folder to S3.\n", "\n", "S3 will serve as the starting point for our txtract extractions - identical to how textract would be employed in production workloads." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mySession = boto3.session.Session()\n", "awsRegion = mySession.region_name\n", "data_bucket = sagemaker.Session().default_bucket()\n", "\n", "# Amazon S3 client\n", "s3 = boto3.client('s3')\n", "\n", "# Amazon Textract client\n", "textract = boto3.client('textract')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#upload to S3\n", "cms_key='idp/textract/cms1500.png'\n", "s3.upload_file(Filename='./dataset/document_samples/CMS1500.png', \n", " Bucket=data_bucket, \n", " Key=cms_key)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(Image(filename=\"./dataset/document_samples/CMS1500.png\", width=900, height=400))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extract data from claim form CMS1500\n", "\n", "---\n", "\n", "We will now use textract to extract both the **FORMS** and the **TABLES** from the CMS1500 document." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import HTML as html_print\n", "\n", "form_resp = textract.analyze_document(Document={'S3Object':{\"Bucket\": data_bucket, \"Name\": cms_key}\n", " }, FeatureTypes=['FORMS', 'TABLES']\n", ")\n", "\n", "html_print(get_string(textract_json=form_resp, output_type=[Textract_Pretty_Print.TABLES], table_format=Pretty_Print_Table_Format.html))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "html_print(get_string(textract_json=form_resp, output_type=[Textract_Pretty_Print.FORMS], table_format=Pretty_Print_Table_Format.html))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import display, JSON\n", "\n", "# using our constructed helper function - values returned as a dictionary\n", "\n", "display(JSON(getformkeyvalue(form_resp), root=\"Claim Form\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extract data from ID Document\n", "\n", "---\n", "\n", "Let's now look at the **Analyze ID** API." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#upload to S3\n", "key='idp/textract/drivers_license.png'\n", "s3.upload_file(Filename='./dataset/document_samples/drivers_license.png', \n", " Bucket=data_bucket, \n", " Key=key)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(Image(filename=\"./dataset/document_samples/drivers_license.png\", width=900, height=400))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ID_resp = textract.analyze_id(DocumentPages=[{'S3Object':{\"Bucket\": data_bucket, \"Name\": key}}])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#once again using the textract response parser\n", "from trp.trp2_analyzeid import TAnalyzeIdDocument, TAnalyzeIdDocumentSchema\n", "t_doc = TAnalyzeIdDocumentSchema().load(ID_resp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From below, you will notice certain key's are presented that were **not** in the driver's license itself.\n", "\n", "For example, **Veteran** is not a key found in the license, however, it is a pre-populated key value that Analyze ID supports, due to the differences found amongst licenses between states." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "list_of_results = t_doc.get_values_as_list()\n", "print(tabulate([x[1:3] for x in list_of_results]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extract data from Medical insurance card\n", "\n", "---\n", "\n", "Let's now take a look at an example medical insurance card and extract the 'form' from it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#upload to S3\n", "ins_card_key='idp/textract/insurance_card.png'\n", "s3.upload_file(Filename='./dataset/document_samples/insurance_card.png', \n", " Bucket=data_bucket, \n", " Key=ins_card_key)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(Image(filename=\"./dataset/document_samples/insurance_card.png\", width=900, height=400))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ins_form_resp = textract.analyze_document(Document={'S3Object':{\"Bucket\": data_bucket, \"Name\": ins_card_key}\n", " }, FeatureTypes=['FORMS']\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import display, JSON\n", "\n", "# using our constructed helper function - values returned as a dictionary\n", "display(JSON(getformkeyvalue(ins_form_resp), root=\"Insurance card\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extract data from medical invoice\n", "\n", "---\n", "\n", "Let's now look at the **Analyze Expense** API to extract information from a medical insurance invoice." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#upload to S3\n", "invc_key='idp/textract/insurance_invoice.png'\n", "s3.upload_file(Filename='./dataset/document_samples/insurance_invoice.png', \n", " Bucket=data_bucket, \n", " Key=invc_key)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(Image(filename=\"./dataset/document_samples/insurance_invoice.png\", width=900, height=400))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the below, it is important to notice that textract provides the ability to seperately extract the \"line items\" from the overall invoice." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "expense_resp = textract.analyze_expense(Document={'S3Object':{\"Bucket\": data_bucket, \"Name\": invc_key}})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Invoice Summary" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(get_expensesummary_string(textract_json=expense_resp, table_format=Pretty_Print_Table_Format.fancy_grid))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Invoice line items" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(get_expenselineitemgroups_string(textract_json=expense_resp, table_format=Pretty_Print_Table_Format.fancy_grid))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "### Extraction with Queries using Amazon Textract " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When processing a document with Amazon Textract, you may add queries to your analysis to specify what information you need. This involves passing a question, such as \"What is the customer's social security number?\" to Amazon Textract. Amazon Textract will then find the information in the document for that question and return it in a response structure separate from the rest of the document's information. Queries can be processed alone, or in combination with any other FeatureType, such as Tables or Forms.\n", "\n", "Example Query: What is the customer’s SSN?\n", "\n", "Example Answer: 111-xx-333\n", "\n", "To understand more about how Amazon Textract extracts query based information in a document please follow this [link.](https://docs.aws.amazon.com/textract/latest/dg/queryresponse.html)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(Image(filename=\"./dataset/document_samples/insurance_card.png\", width=900, height=400))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# saving the insurance card as a variable\n", "filename = \"./dataset/document_samples/insurance_card.png\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configuring the Queries NLP Questions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = None\n", "with open(filename, 'rb') as document:\n", " imageBytes = bytearray(document.read())\n", "\n", " # Call Textract\n", " response = textract.analyze_document(\n", " Document={'Bytes': imageBytes},\n", " # adding additional Feature Types like FORMS and/or TABLES\n", " # FeatureTypes=[\"QUERIES\", \"FORMS\", \"TABLS\"],\n", " FeatureTypes=[\"QUERIES\"],\n", " QueriesConfig={\n", " \"Queries\": [{\n", " \"Text\": \"What is the member name?\",\n", " \"Alias\": \"INSURANCE_CARD_NAME\"\n", " },\n", " {\n", " \"Text\": \"What is the member id?\",\n", " \"Alias\": \"INSURANCE_CARD_MEMBER_ID\"\n", " },\n", " {\n", " \"Text\": \"What is the PCP name?\",\n", " \"Alias\": \"INSURANCE_CARD_PCP_NAME\"\n", " },\n", " {\n", " \"Text\": \"What is the PCP phone number?\",\n", " \"Alias\": \"INSURANCE_CARD_PCP_PHONE\"\n", " }\n", " ]\n", " })\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The response does include all OCR WORDS and LINES, geometry information, confidence scores. For the purpose of this blog, we just print the query responses. The code (commented out part) also demonstrates how to configure FORMS and/or TABLES Feature Types along with Queries, making it easy to add QUERIES to any existing workflow.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import trp.trp2 as t2\n", "from tabulate import tabulate\n", "\n", "d = t2.TDocumentSchema().load(response)\n", "page = d.pages[0]\n", "\n", "query_answers = d.get_query_answers(page=page)\n", "\n", "print(tabulate(query_answers, tablefmt=\"github\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "# Conclusion\n", "\n", "In this notebook, we saw how to extract `FORMS, TABLES` and `QUERIES` from various document types that may be present in a claims processing packet. We also used Amazon Textract `AnalyzeID` API to detect information from passport document. We also used Textract to extract specific information out of a document which is dense text and got accurate responses back from the API. In the next notebook, we will perform enrichment on one of the documents." ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 5 }