{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Document Enrichment\n", "\n", "---\n", "\n", "## Setup Notebook\n", "\n", "We will be using the [Amazon Textract Parser Library](https://github.com/aws-samples/amazon-textract-response-parser/tree/master/src-python) for parsing through the Textract response, data science library [Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for content analysis, the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/), and [AWS boto3 python sdk](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to work with Amazon Textract and Amazon A2I. Let's now install and import them." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!python -m pip install -q amazon-textract-overlayer --force-reinstall" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from textractcaller.t_call import call_textract, Textract_Features, Textract_Types\n", "from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string, Pretty_Print_Table_Format\n", "from trp.trp2 import TDocument" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import boto3\n", "import botocore\n", "import sagemaker\n", "import os\n", "import io\n", "import datetime\n", "import json\n", "import pandas as pd\n", "from PIL import Image as PImage, ImageDraw\n", "from pathlib import Path\n", "import multiprocessing as mp\n", "from IPython.display import Image, display, HTML, JSON, IFrame\n", "from trp import Document\n", "\n", "# variables\n", "data_bucket = sagemaker.Session().default_bucket()\n", "region = boto3.session.Session().region_name\n", "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", "\n", "os.environ[\"BUCKET\"] = data_bucket\n", "os.environ[\"REGION\"] = region\n", "role = sagemaker.get_execution_role()\n", "\n", "print(f\"SageMaker role is: {role}\\nDefault SageMaker Bucket: s3://{data_bucket}\")\n", "\n", "s3=boto3.client('s3')\n", "textract = boto3.client('textract', region_name=region)\n", "comprehend=boto3.client('comprehend', region_name=region)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Redact PII in Medical Transcription letter\n", "\n", "In this step we will first detect PII (Personally Identifiable Information) in our sample Medical transcription letter. We will then extract the bounding box information for the detected PII entities and create redaction bounding boxes on the document.\n", "In order to obtain the bounding box geometry of all words from the document we will use a helper tool called `amazon-textract-overlayer`. See [documentation](https://github.com/aws-samples/amazon-textract-textractor/tree/master/overlayer) to learn more about `amazon-textract-overlayer`.\n", "\n", "Let's look at the un-redacted document -" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "documentName = \"./dataset/document_samples/dr-note-sample.png\"\n", "display(Image(filename=documentName, width=900))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Detecting PII Entities\n", "\n", "Let's take a look at how to detect PII entities using Amazon Comprehend Detect PII Entities API. 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# upload the sample document to S3\n", "key = 'idp/textract/dr-note-sample.png'\n", "s3.upload_file(Filename='./dataset/document_samples/dr-note-sample.png', \n", "               Bucket=data_bucket, \n", "               Key=key)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# extract the document text with Amazon Textract (text detection)\n", "resp = call_textract(input_document=f's3://{data_bucket}/idp/textract/dr-note-sample.png')\n", "text = get_string(textract_json=resp, output_type=[Textract_Pretty_Print.LINES])\n", "\n", "# call the Amazon Comprehend Detect PII Entities API\n", "entity_resp = comprehend.detect_pii_entities(Text=text, LanguageCode=\"en\")\n", "\n", "pii = []\n", "for entity in entity_resp['Entities']:\n", "    pii_entity = {}\n", "    pii_entity['Type'] = entity['Type']\n", "    pii_entity['Text'] = text[entity['BeginOffset']:entity['EndOffset']]\n", "    pii.append(pii_entity)\n", "pii" ] },
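{ "cell_type": "markdown", "metadata": {}, "source": [ "Since Pandas is already imported for content analysis, we can optionally load the detected entities into a DataFrame for easier inspection. The cell below is a minimal sketch: it only reuses the `pii` list built above, and the `pii_df` variable name is ours for illustration, not part of any API." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: inspect the detected PII entities with pandas.\n", "# Minimal sketch -- it only reuses the `pii` list built in the previous cell.\n", "pii_df = pd.DataFrame(pii)\n", "print(pii_df['Type'].value_counts())\n", "pii_df" ] },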
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from textractoverlayer.t_overlay import DocumentDimensions, get_bounding_boxes\n", "\n", "def redact_doc(s3document, localpath, redact_entities):\n", " print(s3document)\n", " try:\n", " img = PImage.open(localpath)\n", " draw = ImageDraw.Draw(img)\n", "\n", " # Use call_textract to get bounding boxes\n", " # call_textract without the features parameter uses Textract Detect text\n", " resp = call_textract(input_document = s3document)\n", " document_dimension:DocumentDimensions = DocumentDimensions(doc_width=img.size[0], doc_height=img.size[1])\n", " overlay=[Textract_Types.LINE, Textract_Types.WORD, Textract_Types.FORM, Textract_Types.CELL, Textract_Types.KEY, Textract_Types.VALUE]\n", " bounding_box_list = get_bounding_boxes(textract_json=resp, document_dimensions=[document_dimension], overlay_features=overlay)\n", "\n", " print('Detecting entities...')\n", " \n", " text = get_string(textract_json=resp, output_type=[Textract_Pretty_Print.LINES])\n", " \n", " #detect PII Entities\n", " entity_resp = comprehend.detect_pii_entities(Text=text, LanguageCode=\"en\") \n", "\n", " entities = []\n", " for entity in entity_resp['Entities']:\n", " pii_entity={}\n", " pii_entity['Type'] = entity['Type']\n", " pii_entity['Text'] = text[entity['BeginOffset']:entity['EndOffset']]\n", " entities.append(pii_entity) \n", " redactions = []\n", "\n", " #collect the bounding boxes for the custom entities\n", " for entity in entities:\n", " entity_text = entity['Text']\n", " entity_type = entity['Type']\n", " for bbox in bounding_box_list:\n", " if bbox.text in entity_text and entity_type in redact_entities:\n", " print(f'Found Entity: {entity_text}')\n", " redactions.append(bbox)\n", " \n", " #Perform redaction\n", " for box in redactions:\n", " draw.rectangle(xy=[box.xmin, box.ymin, box.xmax, box.ymax], fill=\"Black\")\n", " \n", " #Generate the redacted/enriched document file and save to file system\n", " opfile = Path(localpath).stem\n", " opfile = f'{opfile}_redacted.png' \n", " img.save(opfile) \n", " print(f'Done.... Redacted file saved: {opfile}')\n", " return opfile\n", " except Exception as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can call this method and pass it the PII Entities we wish to perform redaction on. In this case, we will pass `NAME` and `AGE` for redaction." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "redact_doc(f's3://{data_bucket}/idp/textract/dr-note-sample.png','./dataset/document_samples/dr-note-sample.png',['NAME','NAME','AGE'])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "#### Side-by-side comparison of un-redacted vs. redacted document." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bank_document_local='./dataset/document_samples/dr-note-sample.png'\n", "redacted_file='dr-note-sample_redacted.png'\n", "\n", "print(f'\\nUnredacted Document\\t\\t\\t\\t\\t\\t\\tRedacted Document \\n')\n", "\n", "HTML(f\"\"\"\n", "
\n", " \n", " \n", "
\n", " \"\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Conclusion\n", "\n", "In this notebook, we saw how we can perform document enrichment such as redacting PII information. We first detected PII entities in the document using Amazon comprehend's detect PII API. Once we were able to detect PII infromation in the document, we used Amazon Textract to obtain the bounding box information for the information that needs to be treated as a PII info and then we finally drew bounding box redactions on the document using an image library." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Cleanup\n", "\n", "In order to clean up the files uploaded into the S3 bucket, execute the following command. If you created this SageMaker Domain Studio environment manually then follow the SageMaker documentation to delete the Studio domain. If you created, the Studio Domain using a CloudFormation stack, delete the stack. If you are performing this lab as part of an instructor led workshop please follow instructions shared by the instructor." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 rm s3://{data_bucket}/idp/textract/ --recursive --exclude \"*\" --include \"textract/*\"" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }