{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mortgage Document Enrichment\n", "\n", "---\n", "\n", "## Setup Notebook\n", "\n", "We will be using the [Amazon Textract Parser Library](https://github.com/aws-samples/amazon-textract-response-parser/tree/master/src-python) for parsing through the Textract response, data science library [Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for content analysis, the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/), and [AWS boto3 python sdk](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to work with Amazon Textract and Amazon A2I. Let's now install and import them." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!python -m pip install -q amazon-textract-overlayer --force-reinstall" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from textractcaller.t_call import call_textract, Textract_Features, Textract_Types\n", "from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string, Pretty_Print_Table_Format\n", "from trp.trp2 import TDocument" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "import boto3\n", "import botocore\n", "import sagemaker\n", "import os\n", "import io\n", "import datetime\n", "import json\n", "import pandas as pd\n", "from PIL import Image as PImage, ImageDraw\n", "from pathlib import Path\n", "import multiprocessing as mp\n", "from IPython.display import Image, display, HTML, JSON, IFrame\n", "from trp import Document\n", "\n", "# variables\n", "data_bucket = sagemaker.Session().default_bucket()\n", "region = boto3.session.Session().region_name\n", "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", "\n", "os.environ[\"BUCKET\"] = data_bucket\n", "os.environ[\"REGION\"] = region\n", "role = sagemaker.get_execution_role()\n", "\n", "print(f\"SageMaker role is: {role}\\nDefault SageMaker Bucket: s3://{data_bucket}\")\n", "\n", "s3=boto3.client('s3')\n", "textract = boto3.client('textract', region_name=region)\n", "comprehend=boto3.client('comprehend', region_name=region)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Redact PII in Paystub\n", "\n", "In this step we will first detect PII (Personally Identifiable Information) in the Paystub document. We will then obtain the bounding box information for the detected PII entities and create redaction bounding boxes on the document.\n", "\n", "In order to obtain the bounding box geometry of all words from the document we will use a tool called `amazon-textract-overlayer`. See [documentation](https://github.com/aws-samples/amazon-textract-textractor/tree/master/overlayer) for learn more about `amazon-textract-overlayer`.\n", "\n", "The un-redacted document looks like below-" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "documentName = \"docs/Paystub.png\"\n", "display(Image(filename=documentName, width=500))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Detecting PII Entities\n", "\n", "Let's take a look at how to detect PII entities using Amazon Comprehend Detect PII Entities API. Amazon Comprehend's PII detection API uses a pre-trained NLP model that can detect most common PII entities such as NAME, ADDRESS, SSN, BANK A/C NUMBERS, DATES and so on (for a full list see [documentation](https://docs.aws.amazon.com/comprehend/latest/dg/how-pii.html#how-pii-types))." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "with open(documentName, 'rb') as document:\n", " imageBytes = bytearray(document.read())\n", "\n", "resp = call_textract(input_document = imageBytes)\n", "text = get_string(textract_json=resp, output_type=[Textract_Pretty_Print.LINES])\n", "\n", "#Call Amazon Comprehend Detect PII Entities API\n", "entity_resp = comprehend.detect_pii_entities(Text=text, LanguageCode=\"en\") \n", "\n", "pii = []\n", "for entity in entity_resp['Entities']:\n", " pii_entity={}\n", " pii_entity['Type'] = entity['Type']\n", " pii_entity['Text'] = text[entity['BeginOffset']:entity['EndOffset']]\n", " pii.append(pii_entity)\n", "pii " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "### Perform Document Enrichment\n", "\n", "Using a combination of Amazon Comprehend and Amazon Textract we can now perform some PII redaction on a Paystub document. Next we will define a helper function that will\n", "\n", "1. Call Amazon Textract to get the plain text information from the Paystub and the corresponding bounding box information since it is an image file\n", "2. Use the extracted text to call Amazon Comprehend's [Detect PII](https://docs.aws.amazon.com/comprehend/latest/dg/how-pii.html) API\n", "3. Use Python Pillow library to draw bounding box redactions on the original document\n", "4. Save the new enriched document with redactions to the file system" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "from textractoverlayer.t_overlay import DocumentDimensions, get_bounding_boxes\n", "\n", "def redact_doc(localpath, redact_entities):\n", " try:\n", " img = PImage.open(localpath)\n", " draw = ImageDraw.Draw(img)\n", "\n", " # Use call_textract to get bounding boxes\n", " # call_textract without the features parameter uses Textract Detect text\n", " with open(localpath, 'rb') as document:\n", " imageBytes = bytearray(document.read())\n", " \n", " resp = call_textract(input_document = imageBytes)\n", " document_dimension:DocumentDimensions = DocumentDimensions(doc_width=img.size[0], doc_height=img.size[1])\n", " overlay=[Textract_Types.LINE, Textract_Types.WORD, Textract_Types.FORM, Textract_Types.CELL, Textract_Types.KEY, Textract_Types.VALUE]\n", " bounding_box_list = get_bounding_boxes(textract_json=resp, document_dimensions=[document_dimension], overlay_features=overlay)\n", "\n", " print('Detecting entities...')\n", " \n", " text = get_string(textract_json=resp, output_type=[Textract_Pretty_Print.LINES])\n", " \n", " #detect PII Entities\n", " entity_resp = comprehend.detect_pii_entities(Text=text, LanguageCode=\"en\") \n", "\n", " entities = []\n", " for entity in entity_resp['Entities']:\n", " pii_entity={}\n", " pii_entity['Type'] = entity['Type']\n", " pii_entity['Text'] = text[entity['BeginOffset']:entity['EndOffset']]\n", " entities.append(pii_entity) \n", " redactions = []\n", "\n", " #collect the bounding boxes for the custom entities\n", " for entity in entities:\n", " entity_text = entity['Text']\n", " entity_type = entity['Type']\n", " for bbox in bounding_box_list: \n", " if bbox.text == entity_text and entity_type in redact_entities:\n", " print(f'Found Entity: {entity_text}')\n", " redactions.append(bbox)\n", " \n", " #Perform redaction\n", " for box in redactions:\n", " draw.rectangle(xy=[box.xmin, box.ymin, box.xmax, box.ymax], fill=\"Black\")\n", " \n", " #Generate the redacted/enriched document file and save to file system\n", " opfile = Path(localpath).stem\n", " opfile = f'{opfile}_redacted.png' \n", " img.save(opfile) \n", " print(f'Done.... Redacted file saved: {opfile}')\n", " return opfile\n", " except Exception as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can call this method and pass it the PII Entities we wish to perform redaction on. In this case, we will pass `NAME`, `SSN`, and `DATE_TIME` for redaction." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "redact_doc('docs/Paystub.png',['NAME','SSN','DATE_TIME'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "#### Side-by-side comparison of un-redacted vs. redacted document." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "bank_document_local='docs/Paystub.png'\n", "redacted_file='Paystub_redacted.png'\n", "\n", "print(f'\\nUnredacted Document\\t\\t\\t\\t\\t\\t\\tRedacted Document \\n')\n", "\n", "HTML(f\"\"\"\n", "