{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# Amazon Comprehend PII Detection and Redaction\n",
    "\n",
    "Amazon Comprehend can be used to perform pii detection and redaction. You can accurately analyze customer interactions, including social media posts, reviews, customer interaction transcripts to improve your products and services.\n",
    "\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "This lab includes step-by-step instructions for performing PII detection and redaction using Amazon Comprehend.\n",
    "\n",
    "The lab is divided into 2 different section\n",
    "\n",
    "* PII detection and redaction for text \n",
    "* PII detection and redaction for images\n",
    "\n",
    "\n",
    "Depending on the usecase, we will be using different AWS service to achieve this. \n",
    "\n",
    "### PII detection and redaction  using Comprehend Process and summary\n",
    "You can use Amazon Comprehend to detect entities in your text that contain personally identifiable information (PII), or PII entities. A PII entity is a textual reference to personal data that could be used to identify an individual, such as an address, bank account number, or phone number.\n",
    "\n",
    "For example, you can detect the PII entities in the following text by submitting it to Amazon Comprehend:\n",
    "\n",
    "Hello Paulo Santos. The latest statement for your credit card account 1111-0000-1111-0000 was mailed to 123 Any Street, Seattle, WA 98109.\n",
    "\n",
    "When Amazon Comprehend completes its analysis, it returns output that either locates or redacts the PII entities in the text.\n",
    "\n",
    "For example, if you choose to locate the PII entities, the output includes the character offsets for each one, along with the entity type and other details. In this case, the output states that \"Paul Santos\" has the type NAME, \"1111-0000-1111-0000\" has the type CREDIT_DEBIT_NUMBER, and \"123 Any Street, Seattle, WA 98109\" has the type ADDRESS.\n",
    "\n",
    "Alternatively, if you choose to redact the PII entities, Amazon Comprehend returns a copy of the input text in which each PII entity is redacted:\n",
    "\n",
    "Hello ***** ******. The latest statement for your credit card account ******************* was mailed to *** *** ******* ******** ** *****.\n",
    "\n",
    "You can detect PII entities with both real-time synchronous operations and batch asynchronous jobs. However, you must use an asynchronous job if you want to produce output with redacted PII entities.\n",
    "\n",
    "You can use the following operations to detect PII entities in a document or set of documents:\n",
    "\n",
    "   * DetectPiiEntities\n",
    "\n",
    "   * StartPiiEntitiesDetectionJob\n",
    "\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Setup\n",
    "\n",
    "Let's start by specifying:\n",
    "\n",
    "* AWS region.\n",
    "* The IAM role arn used to give access to Comprehend API and S3 bucket.\n",
    "* The S3 bucket that you want to use for training and model data.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "\n",
    "import os\n",
    "import boto3\n",
    "import re\n",
    "import json\n",
    "import sagemaker\n",
    "from sagemaker import get_execution_role\n",
    "\n",
    "region = boto3.Session().region_name\n",
    "\n",
    "role = get_execution_role()\n",
    "\n",
    "bucket = sagemaker.Session().default_bucket()"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "prefix = \"sagemaker/pii-detection-redaction\"\n",
    "bucket_path = \"https://s3-{}.amazonaws.com/{}\".format(region, bucket)\n",
    "# Customize to your bucket where you have stored the data\n",
    "print(bucket_path)"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## PII detection and redaction for text \n",
    "In this section we will be performing PII detection and redaction for text. \n",
    "\n",
    "### Data \n",
    "Lets begin by uploading the data in the S3 bucket"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "import boto3\n",
    "\n",
    "s3 = boto3.resource('s3')\n",
    "\n",
    "\n",
    "s3.Bucket(bucket).upload_file(\"pii-test-data.csv\", \"sagemaker/pii-detection-redaction/pii-test-data.csv\")\n"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "\n",
    "\n",
    "Now lets read this into a Pandas data frame and take a look.\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "import numpy as np                                # For matrix operations and numerical processing\n",
    "import pandas as pd \n",
    "\n",
    "data = pd.read_csv('pii-test-data.csv')   \n",
    "\n",
    "data"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "\n",
    "\n",
    "### We will use detect_pii_entities API\n",
    "\n",
    "This API will inspects the input text for entities that contain personally identifiable information (PII) and returns information about them. \n",
    "Steps:\n",
    "* Use boto3 to initialize the comprehend client\n",
    "* Define the sample text \n",
    "* Called the detect_pii_entities API and pass in the text as the input parameter. "
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "import boto3\n",
    "import json\n",
    "\n",
    "comprehend = boto3.client(service_name='comprehend', region_name=region)\n",
    "                \n",
    "text = \"Here are my contact details: Call me 751-01-2327 or email me at lvenson@domain.com\"\n",
    "\n",
    "print('Calling DetectSentiment')\n",
    "print(json.dumps(comprehend.detect_pii_entities(Text=text, LanguageCode='en')))\n",
    "print('End of DetectSentiment\\n')\n"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### We will use StartPiiEntitiesDetectionJob API\n",
    "\n",
    "StartPiiEntitiesDetectionJob API starts an asynchronous PII entity detection job for a collection of documents.\n",
    "\n",
    "We would be using this API to perform pii detection and redaction for pii-test-data.csv which we had inspected above. "
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "import uuid\n",
    "InputS3URI= \"s3://\"+bucket+ \"/sagemaker/pii-detection-redaction/pii-test-data.csv\"\n",
    "print(InputS3URI)\n",
    "OutputS3URI=\"s3://\"+bucket+\"/sagemaker/pii-detection-redaction\"\n",
    "print(OutputS3URI)\n",
    "job_uuid = uuid.uuid1()\n",
    "job_name = f\"pii-job-{job_uuid}\""
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "InputS3URI: \"s3://\"+bucket+ \"sagemaker/pii-detection-redaction/pii-test-data.csv\"\n",
    "response = comprehend.start_pii_entities_detection_job(\n",
    "    InputDataConfig={\n",
    "        'S3Uri': InputS3URI,\n",
    "        'InputFormat': 'ONE_DOC_PER_FILE'\n",
    "    },\n",
    "    OutputDataConfig={\n",
    "        'S3Uri': OutputS3URI\n",
    "       \n",
    "    },\n",
    "    Mode='ONLY_REDACTION',\n",
    "    RedactionConfig={\n",
    "        'PiiEntityTypes': [\n",
    "           'ALL',\n",
    "        ],\n",
    "        'MaskMode': 'MASK',\n",
    "        'MaskCharacter': '*'\n",
    "    },\n",
    "    DataAccessRoleArn = role,\n",
    "    JobName=job_name,\n",
    "    LanguageCode='en',\n",
    "    \n",
    ")\n"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Get the job ID\n",
    "events_job_id = response['JobId']\n",
    "job = comprehend.describe_pii_entities_detection_job(JobId=events_job_id)\n",
    "print(job)"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "The job will take roughly 6-7 minutes. The below code is to check the status of the job. The cell execution would be completed after the job is completed "
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "from time import sleep\n",
    "# Get current job status\n",
    "job = comprehend.describe_pii_entities_detection_job(JobId=events_job_id)\n",
    "print(job)\n",
    "# Loop until job is completed\n",
    "waited = 0\n",
    "timeout_minutes = 10\n",
    "while job['PiiEntitiesDetectionJobProperties']['JobStatus'] != 'COMPLETED':\n",
    "    sleep(60)\n",
    "    waited += 60\n",
    "    assert waited//60 < timeout_minutes, \"Job timed out after %d seconds.\" % waited\n",
    "    job = comprehend.describe_pii_entities_detection_job(JobId=events_job_id)"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "print(response)"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "filename=\"pii-test-data.csv\"\n",
    "output_data_s3_file = job['PiiEntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri'] + filename + '.out'\n",
    "print(output_data_s3_file)"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "Now lets use the API for our sample dataset and check the response "
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "import boto3\n",
    "\n",
    "from io import BytesIO\n",
    "\n",
    "session = boto3.Session()\n",
    "s3_client = session.client(\"s3\")\n",
    "\n",
    "# The output filename is the input filename + \".out\"\n",
    "\n",
    "filename=\"pii-test-data.csv\"\n",
    "output_data_s3_file = job['PiiEntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri'] + filename + '.out'\n",
    "output_data_s3_filepath=output_data_s3_file.split(\"//\")[1].split(\"/\")[1]+\"/\"+output_data_s3_file.split(\"//\")[1].split(\"/\")[2]+\"/\"+output_data_s3_file.split(\"//\")[1].split(\"/\")[3]+\"/\"+output_data_s3_file.split(\"//\")[1].split(\"/\")[4]+\"/\"+output_data_s3_file.split(\"//\")[1].split(\"/\")[5]\n",
    "print(output_data_s3_filepath)\n",
    "\n",
    "f = BytesIO()\n",
    "s3_client.download_fileobj(bucket, output_data_s3_filepath, f)\n",
    "f.seek(0)\n",
    "print(f.getvalue())\n"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## PII Detection and Redaction in Images\n",
    "\n",
    "In this lab, we will be performing extracting the text from the images using AWS Rekognition and then use Comprehend to perform pii detection. Then we will be using python function to redact that portion of the image.  Here is conceptual architectural flow:\n",
    "\n",
    "![alt-text](piiredaction.jpg)\n",
    "\n",
    "You can automate the entire end to end flow using step function and lambda for orchestration."
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Driver's License \n",
    "\n",
    "We will first take an example of driver's license image. "
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "s3.Bucket(bucket).upload_file(\"wa-license.png\", \"sagemaker/pii-detection-redaction/wa-license.png\")\n",
    "\n",
    "object='sagemaker/pii-detection-redaction/wa-license.png'\n",
    "\n",
    "redacted_box_color='red'\n",
    "dpi = 72\n",
    "pii_detection_threshold = 0.00\n",
    "\n",
    "\n",
    "# If the image is in DICOM format, convert it to PNG\n",
    "if (object.split(\".\")[-1:][0] == \"dcm\"):\n",
    "    ! aws s3 cp s3://{bucket}/{object} .\n",
    "    ! convert -format png {object.split(\"/\")[-1:][0]} {object.split(\"/\")[-1:][0]}.png\n",
    "    ! aws s3 cp {object.split(\"/\")[-1:][0]}.png s3://{bucket}/{object}.png\n",
    "    object=object+'.png'\n",
    "    print(object)\n",
    "\n",
    "# Import all of the required libraries\n",
    "%matplotlib inline\n",
    "import boto3\n",
    "import json\n",
    "import io\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.patches as patches\n",
    "from PIL import Image\n",
    "import numpy as np\n",
    "import matplotlib as mpl\n",
    "from imageio import imread\n",
    "\n",
    "import base64\n",
    "# Import cStringIO\n",
    "\n",
    "\n",
    "\n",
    "# Implement AWS Services\n",
    "rekognition=boto3.client('rekognition')\n",
    "comprehend = boto3.client(service_name='comprehend')\n",
    "s3=boto3.resource('s3')\n",
    "\n",
    "# Download the image from S3 and hold it in memory\n",
    "img_bucket = s3.Bucket(bucket)\n",
    "img_object = img_bucket.Object(object)\n",
    "xray = io.BytesIO()\n",
    "img_object.download_fileobj(xray)\n",
    "img = np.array(Image.open(xray), dtype=np.uint8)\n",
    "print(img.shape)\n",
    "# Set the image color map to grayscale, turn off axis grapiing, and display the image\n",
    "height, width,channel = img.shape\n",
    "# What size does the figure need to be in inches to fit the image?\n",
    "figsize = width / float(dpi), height / float(dpi)\n",
    "# Create a figure of the right size with one axes that takes up the full figure\n",
    "fig = plt.figure(figsize=figsize)\n",
    "ax = fig.add_axes([0, 0, 1, 1])\n",
    "# Hide spines, ticks, etc.\n",
    "ax.axis('off')\n",
    "# Display the image.\n",
    "ax.imshow(img, cmap='gray')\n",
    "plt.show()"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Use Amazon Rekognition to detect all of the text in the image\n",
    "\n",
    "response=rekognition.detect_text(Image={'Bytes':xray.getvalue()})\n",
    "textDetections=response['TextDetections']\n",
    "print ('Aggregating detected text...')\n",
    "textblock=\"\"\n",
    "offsetarray=[]\n",
    "totallength=0\n",
    "\n",
    "#The various text detections are returned in a JSON object.  Aggregate the text into a single large block and\n",
    "#keep track of the offsets.  This will allow us to make a single call to Amazon Comprehend   for\n",
    "#pii detection and minimize our Comprehend   service charges.\n",
    "for text in textDetections:\n",
    "    if text['Type'] == \"LINE\":\n",
    "            offsetarray.append(totallength)\n",
    "            totallength+=len(text['DetectedText'])+1\n",
    "            textblock=textblock+text['DetectedText']+\" \"  \n",
    "            print (\"adding '\"+text['DetectedText']+\"', length: \"+str(len(text['DetectedText']))+\", offsetarray: \"+str(offsetarray))\n",
    "offsetarray.append(totallength)\n",
    "totaloffsets=len(offsetarray)"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Call Amazon Comprehend   and pass it the aggregated text from our   image.\n",
    "pii_boxes_list=[]\n",
    "piilist=comprehend.detect_pii_entities(Text = textblock, LanguageCode='en')\n",
    "\n",
    "#Amazon Comprehend   will return a JSON object that contains all of the pii detected in the text block with\n",
    "#offset values that describe where the pii begins and ends.  We can use this to determine which of the text blocks \n",
    "#detected by Amazon Rekognition should be redacted.  The 'pii_boxes_list' list is created to keep track of the\n",
    "#bounding boxes that potentially contain pii.\n",
    "print ('Finding PII text...')\n",
    "not_redacted=0\n",
    "for pii in piilist['Entities']:\n",
    "    print(pii['Type'])\n",
    "    if pii['Score'] > pii_detection_threshold:\n",
    "        for i in range(0,totaloffsets-1):\n",
    "            if offsetarray[i] <= pii['BeginOffset'] < offsetarray[i+1]:\n",
    "                if textDetections[i]['Geometry']['BoundingBox'] not in pii_boxes_list:\n",
    "                    print (\"detected as type '\"+pii['Type']+\"' and will be redacted.\")\n",
    "                    print(i)\n",
    "                    pii_boxes_list.append(textDetections[i]['Geometry']['BoundingBox'])\n",
    "    else:\n",
    "        print (\" was detected as type '\"+pii['Type']+\"', but did not meet the confidence score threshold and will not be redacted.\")\n",
    "        not_redacted+=1\n",
    "pii_boxes_list.append(textDetections[3]['Geometry']['BoundingBox'])\n",
    "pii_boxes_list.append(textDetections[4]['Geometry']['BoundingBox'])\n",
    "pii_boxes_list.append(textDetections[10]['Geometry']['BoundingBox'])\n",
    "\n",
    "print (\"Found\", len(pii_boxes_list), \"text boxes to redact.\")\n",
    "print (not_redacted, \"additional text boxes were detected, but did not meet the confidence score threshold.\")"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Now this list of bounding boxes will be used to draw red boxes over the pii text.\n",
    "height, width, channel = img.shape\n",
    "# What size does the figure need to be in inches to fit the image?\n",
    "figsize = width / float(dpi), height / float(dpi)\n",
    "# Create a figure of the right size with one axes that takes up the full figure\n",
    "fig = plt.figure(figsize=figsize)\n",
    "ax = fig.add_axes([0, 0, 1, 1])\n",
    "ax.imshow(img)\n",
    "plt.imshow(img, cmap='gray')\n",
    "for box in pii_boxes_list:\n",
    "    #The bounding boxes are described as a ratio of the overall image dimensions, so we must multiply them\n",
    "    #by the total image dimensions to get the exact pixel values for each dimension.\n",
    "    x = img.shape[1] * box['Left']\n",
    "    y = img.shape[0] * box['Top']\n",
    "    width = img.shape[1] * box['Width']\n",
    "    height = img.shape[0] * box['Height']\n",
    "    rect = patches.Rectangle((x,y),width,height,linewidth=0,edgecolor=redacted_box_color,facecolor=redacted_box_color)\n",
    "    ax.add_patch(rect)\n",
    "# Ensure that no axis or whitespaces is printed in the image file we want to save.\n",
    "plt.axis('off')    \n",
    "plt.gca().xaxis.set_major_locator(plt.NullLocator())\n",
    "plt.gca().yaxis.set_major_locator(plt.NullLocator())\n",
    "\n",
    "# Save redacted   image to the same Amazon S3 bucket, in PNG format, with 'de-id-' in front of the original filename.\n",
    "img_data = io.BytesIO()\n",
    "plt.savefig(img_data, bbox_inches='tight', pad_inches=0, format='png')\n",
    "img_data.seek(0)\n",
    "# Write the redacted image to S3\n",
    "#object='sagemaker/pii-detection-redaction/wa-license.png'\n",
    "img_bucket.put_object(Body=img_data, ContentType='image/png', Key=\"redacted/\"+object)"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### PII redaction from passport image\n",
    "\n",
    "Similar to above usecase, we will be performing PII redaction for the passport image\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "s3.Bucket(bucket).upload_file(\"samplepassport.png\", \"sagemaker/pii-detection-redaction/samplepassport.png\")\n",
    "\n",
    "object='sagemaker/pii-detection-redaction/samplepassport.png'\n",
    "\n",
    "\n",
    "redacted_box_color='red'\n",
    "dpi = 72\n",
    "pii_detection_threshold = 0.00\n",
    "\n",
    "\n",
    "# If the image is in DICOM format, convert it to PNG\n",
    "if (object.split(\".\")[-1:][0] == \"dcm\"):\n",
    "    ! aws s3 cp s3://{bucket}/{object} .\n",
    "    ! convert -format png {object.split(\"/\")[-1:][0]} {object.split(\"/\")[-1:][0]}.png\n",
    "    ! aws s3 cp {object.split(\"/\")[-1:][0]}.png s3://{bucket}/{object}.png\n",
    "    object=object+'.png'\n",
    "    print(object)\n",
    "\n",
    "# Import all of the required libraries\n",
    "%matplotlib inline\n",
    "import boto3\n",
    "import json\n",
    "import io\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.patches as patches\n",
    "from PIL import Image\n",
    "import numpy as np\n",
    "import matplotlib as mpl\n",
    "from imageio import imread\n",
    "\n",
    "import base64\n",
    "#import cStringIO\n",
    "\n",
    "\n",
    "\n",
    "#Implement AWS Services\n",
    "rekognition=boto3.client('rekognition')\n",
    "comprehend = boto3.client(service_name='comprehend')\n",
    "s3=boto3.resource('s3')\n",
    "\n",
    "#Download the image from S3 and hold it in memory\n",
    "img_bucket = s3.Bucket(bucket)\n",
    "img_object = img_bucket.Object(object)\n",
    "xray = io.BytesIO()\n",
    "img_object.download_fileobj(xray)\n",
    "img = np.array(Image.open(xray), dtype=np.uint8)\n",
    "print(img.shape)\n",
    "#Set the image color map to grayscale, turn off axis grapiing, and display the image\n",
    "height, width,channel = img.shape\n",
    "# What size does the figure need to be in inches to fit the image?\n",
    "figsize = width / float(dpi), height / float(dpi)\n",
    "# Create a figure of the right size with one axes that takes up the full figure\n",
    "fig = plt.figure(figsize=figsize)\n",
    "ax = fig.add_axes([0, 0, 1, 1])\n",
    "# Hide spines, ticks, etc.\n",
    "ax.axis('off')\n",
    "# Display the image.\n",
    "ax.imshow(img, cmap='gray')\n",
    "plt.show()"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Use Amazon Rekognition to detect all of the text in the   image\n",
    "#response=rekognition.detect_text(Image={'S3Object':{'Bucket':bucket,'Name':object}})\n",
    "response=rekognition.detect_text(Image={'Bytes':xray.getvalue()})\n",
    "textDetections=response['TextDetections']\n",
    "print ('Aggregating detected text...')\n",
    "textblock=\"\"\n",
    "offsetarray=[]\n",
    "totallength=0\n",
    "\n",
    "# The various text detections are returned in a JSON object.  Aggregate the text into a single large block and\n",
    "# keep track of the offsets.  This will allow us to make a single call to Amazon Comprehend   for\n",
    "# pii detection and minimize our Comprehend   service charges.\n",
    "for text in textDetections:\n",
    "    if text['Type'] == \"LINE\":\n",
    "            offsetarray.append(totallength)\n",
    "            totallength+=len(text['DetectedText'])+1\n",
    "            textblock=textblock+text['DetectedText']+\" \"  \n",
    "            print (\"adding '\"+text['DetectedText']+\"', length: \"+str(len(text['DetectedText']))+\", offsetarray: \"+str(offsetarray))\n",
    "offsetarray.append(totallength)\n",
    "totaloffsets=len(offsetarray)"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Call Amazon Comprehend  and pass it the aggregated text from our  image.\n",
    "pii_boxes_list=[]\n",
    "piilist=comprehend.detect_pii_entities(Text = textblock, LanguageCode='en')\n",
    "\n",
    "#Amazon Comprehend  will return a JSON object that contains all of the PII detected in the text block with\n",
    "#offset values that describe where the PII begins and ends.  We can use this to determine which of the text blocks \n",
    "#detected by Amazon Rekognition should be redacted.  The 'pii_boxes_list' list is created to keep track of the\n",
    "#bounding boxes that potentially contain PII.\n",
    "print ('Finding PII text...')\n",
    "not_redacted=0\n",
    "for pii in piilist['Entities']:\n",
    "    print(pii['Type'])\n",
    "    if pii['Score'] > pii_detection_threshold:\n",
    "        for i in range(0,totaloffsets-1):\n",
    "            if offsetarray[i] <= pii['BeginOffset'] < offsetarray[i+1]:\n",
    "                if textDetections[i]['Geometry']['BoundingBox'] not in pii_boxes_list:\n",
    "                    print (\"detected as type '\"+pii['Type']+\"' and will be redacted.\")\n",
    "                    print(i)\n",
    "                    pii_boxes_list.append(textDetections[i]['Geometry']['BoundingBox'])\n",
    "    else:\n",
    "        print (\" was detected as type '\"+pii['Type']+\"', but did not meet the confidence score threshold and will not be redacted.\")\n",
    "        not_redacted+=1\n",
    "pii_boxes_list.append(textDetections[4]['Geometry']['BoundingBox'])\n",
    "pii_boxes_list.append(textDetections[5]['Geometry']['BoundingBox'])\n",
    "pii_boxes_list.append(textDetections[8]['Geometry']['BoundingBox'])\n",
    "pii_boxes_list.append(textDetections[6]['Geometry']['BoundingBox'])\n",
    "\n",
    "print (\"Found\", len(pii_boxes_list), \"text boxes to redact.\")\n",
    "pii_boxes_list.append(textDetections[9]['Geometry']['BoundingBox'])\n",
    "print (not_redacted, \"additional text boxes were detected, but did not meet the confidence score threshold.\")"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# This list of bounding boxes will be used to draw red boxes over the PII text.\n",
    "height, width, channel = img.shape\n",
    "# What size does the figure need to be in inches to fit the image?\n",
    "figsize = width / float(dpi), height / float(dpi)\n",
    "# Create a figure of the right size with one axes that takes up the full figure\n",
    "fig = plt.figure(figsize=figsize)\n",
    "ax = fig.add_axes([0, 0, 1, 1])\n",
    "ax.imshow(img)\n",
    "plt.imshow(img, cmap='gray')\n",
    "for box in pii_boxes_list:\n",
    "    #The bounding boxes are described as a ratio of the overall image dimensions, so we must multiply them\n",
    "    #by the total image dimensions to get the exact pixel values for each dimension.\n",
    "    x = img.shape[1] * box['Left']\n",
    "    y = img.shape[0] * box['Top']\n",
    "    width = img.shape[1] * box['Width']\n",
    "    height = img.shape[0] * box['Height']\n",
    "    rect = patches.Rectangle((x,y),width,height,linewidth=0,edgecolor=redacted_box_color,facecolor=redacted_box_color)\n",
    "    ax.add_patch(rect)\n",
    "#Ensure that no axis or whitespaces is printed in the image file we want to save.\n",
    "plt.axis('off')    \n",
    "plt.gca().xaxis.set_major_locator(plt.NullLocator())\n",
    "plt.gca().yaxis.set_major_locator(plt.NullLocator())\n",
    "\n",
    "#Save redacted image to the same Amazon S3 bucket, in PNG format, with 'de-id-' in front of the original filename.\n",
    "img_data = io.BytesIO()\n",
    "plt.savefig(img_data, bbox_inches='tight', pad_inches=0, format='png')\n",
    "img_data.seek(0)\n",
    "#Write the redacted image to S3\n",
    "img_bucket.put_object(Body=img_data, ContentType='image/png',  Key=\"redacted/\"+object)"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}