{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Amazon Textract Pre-processing: Detecting Visuals\n",
"\n",
"## Introduction\n",
"\n",
"Many customers encounter one challenge: removing or redacting visuals during pre-processing and feeding only relevant contents into Textract for extraction. These visuals including logos, product sample pictures, charts, etc. In many use cases, these visuals should be redacted or extracted out for further processing. \n",
"\n",
"In this Notebook, we will go through different options for handling these visuals and point out each option's pros and cons. As well as additional parameters engineers/developers should watch out for.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents:\n",
"* [Option 1: Searchable PDF](#option-1)\n",
"* [Option 2: OpenCV Edge Detector](#option-2)\n",
"* Option 3: Pixel Analysis Notebook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Option 1: Searchable PDF\n",
"Searchable PDFs are native documents, and they usually are generated by other applications. These types of PDFs retain metadata, text, and image information inside the document. Therefore, you can easily use python or other libraries to extract images and text. In these two notebooks, we will focus on un-searchable scanned PDFs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Option 2: OpenCV Edge Detector\n",
"This option will be using [OpenCV](https://docs.opencv.org/master/index.html) Edge Detector to detect edges of visuals within the document. In this example, we are using a single page of the PDF. You can loop through to process multiple pages of PDF documents. Also, we are creating few variables to help us adjust and optimize accuracy depending on the use case."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Environment Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"#Uncomment to install the dependencies\n",
"!pip install --upgrade matplotlib\n",
"!pip install --upgrade opencv-python\n",
"%matplotlib inline\n",
"!pip install --upgrade boto3\n",
"!pip install --upgrade numpy\n",
"!pip install --upgrade pdf2image\n",
"\n",
"#For Mac users \n",
"!brew install poppler\n",
"\n",
"#For Linux Users\n",
"#!sudo yum install poppler-utils -y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Initialization"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import cv2\n",
"from matplotlib import pyplot as plt\n",
"from pdf2image import convert_from_path\n",
"from IPython.display import IFrame\n",
"from pathlib import Path\n",
"import boto3\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optimization Parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Minimum Width and Height of a visual. These are expressed in percentage of page width or page height. \n",
"# This can be as small as a width of a character, then you will treat all characters as an visual. \n",
"# The optimum threshold depends on your use case. You can play around these these parameters to find the optimum point.\n",
"MINIMUM_WIDTH = 0.05 \n",
"MINIMUM_HEIGTH = 0.05\n",
"\n",
"# Image padding parameters (in pixels): Set padding values to bring the text associated to your image,\n",
"# This is particular useful with visuals/charts without boarders, and you like to consider the axis values, \n",
"# image title, or image descriptions as part of the visual.\n",
"LEFT_PADDING = 15\n",
"RIGHT_PADDING = 5\n",
"TOP_PADDING = 5\n",
"BOTTOM_PADDING = 65"
]
},
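{
"cell_type": "markdown",
"metadata": {},
"source": [
"The percentage-based thresholds and pixel paddings above can be turned into concrete pixel values, as in the minimal sketch below. The helper names `min_visual_size` and `padded_box` are hypothetical (they are not part of this notebook or of OpenCV), and clamping the padded box to the page bounds is an assumption about the desired behavior near page edges.\n",
"\n",
"```python\n",
"def min_visual_size(page_width, page_height, min_width_frac=0.05, min_height_frac=0.05):\n",
"    # Convert the fractional thresholds into pixel thresholds for this page.\n",
"    return int(page_width * min_width_frac), int(page_height * min_height_frac)\n",
"\n",
"\n",
"def padded_box(x, y, w, h, page_width, page_height, left=15, right=5, top=5, bottom=65):\n",
"    # Grow the bounding box by the paddings, clamped to the page so the\n",
"    # coordinates never become negative or exceed the page size.\n",
"    x0 = max(x - left, 0)\n",
"    y0 = max(y - top, 0)\n",
"    x1 = min(x + w + right, page_width)\n",
"    y1 = min(y + h + bottom, page_height)\n",
"    return x0, y0, x1, y1\n",
"\n",
"\n",
"# A 1700 x 2200 pixel page (US Letter rendered at 200 DPI).\n",
"print(min_visual_size(1700, 2200))\n",
"print(padded_box(10, 2, 300, 200, 1700, 2200))\n",
"```\n",
"\n",
"With the defaults, a visual on this page must be at least 85 x 110 pixels to be kept, and padding never pushes a box outside the page."
]
},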
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Input and Output Directories"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Input and Output Files/Directories:\n",
"input_file_location = \"./doc_input/\"\n",
"output_file_location = \"./doc_output/\"\n",
"input_file_name = \"sample_input.pdf\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### - Read Input Document"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"doc = convert_from_path(input_file_location + input_file_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### - Get the first document page"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"page = doc[0]\n",
"page_number = 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### - Convert the page image to a bytearray readable by OpenCV"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"original_img = cv2.cvtColor(np.asarray(page), code=cv2.COLOR_RGB2BGR)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### - Convert the original image to grayscale"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert Page to gray scale\n",
"gray_scale_image = cv2.cvtColor(original_img, cv2.COLOR_BGR2GRAY)\n",
"\n",
"#Show the Original Page\n",
"fig = plt.figure(figsize=(15,10))\n",
"show_img = fig.add_subplot(1,2,1)\n",
"show_img.set_title(r'$\\bf Original$'+ '\\n\\n', fontsize= 16)\n",
"show_img.xaxis.tick_top()\n",
"img_plot = plt.imshow(cv2.cvtColor(original_img, cv2.COLOR_BGR2RGB))\n",
"\n",
"#Show the redacted page\n",
"show_img = fig.add_subplot(1,2,2)\n",
"show_img.set_title(r'$\\bf Gray Scale$' + '\\n\\n', fontsize= 16)\n",
"show_img.xaxis.tick_top()\n",
"img_plot = plt.imshow(cv2.cvtColor(gray_scale_image, cv2.COLOR_BGR2RGB))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### - Detect edges using OpenCV Canny Edge detector on grayscale image"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Apply a Canny Edge detector to the gray scale page and detect edges in the gray scale page\n",
"canny_img = cv2.Canny(gray_scale_image, 0, 255, apertureSize=3, L2gradient=True) \n",
"\n",
"fig = plt.figure(figsize=(15,10))\n",
"show_img = fig.add_subplot(1,2,1)\n",
"show_img.set_title(r'$\\bf Grayscale$'+ '\\n\\n', fontsize= 16)\n",
"show_img.xaxis.tick_top()\n",
"img_plot = plt.imshow(cv2.cvtColor(gray_scale_image, cv2.COLOR_BGR2RGB))\n",
"\n",
"show_img = fig.add_subplot(1,2,2)\n",
"show_img.set_title(r'$\\bf Canny-Edged$' + '\\n\\n', fontsize= 16)\n",
"show_img.xaxis.tick_top()\n",
"img_plot = plt.imshow(cv2.cvtColor(canny_img, cv2.COLOR_BGR2RGB))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### - Detect the image contours in the Cany Edged image"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Detect the image contours in the edges detected\n",
"contours, hierarchy = cv2.findContours(canny_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### - Set the minimum width and height for filtering the contours size"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Set the minimun dimensions for the images inside the page\n",
"minimum_width = int(original_img.shape[1] * MINIMUM_WIDTH)\n",
"minimum_height = int(original_img.shape[0] * MINIMUM_HEIGTH)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### - Iterate over the contours list to crop and redact from the original document the visuals found in the canny edged image"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Iterate over the contours detected\n",
"image_index = 0\n",
"save_images = True\n",
"\n",
"for c in contours: \n",
" #Get the contour corner (x,y) width and height\n",
" x,y,w,h = cv2.boundingRect(c)\n",
"\n",
" #Verify if the contour dimensions match the minimun dimensions set with minimum_width and minimum_height\n",
" if (w >= minimum_width and h >= minimum_height):\n",
" if (save_images):\n",
" image_index += 1\n",
" #Crop the image inside the contour and save it to a file\n",
" cropped_image = original_img[y - TOP_PADDING:(y + h + BOTTOM_PADDING), x - LEFT_PADDING:(x + w + RIGHT_PADDING)]\n",
" cv2.imwrite(fr\"{output_file_location}{Path(input_file_name).stem}_page_{page_number + 1}_image_{image_index}.png\",cropped_image)\n",
" cropped_image = None\n",
"\n",
" #Redact the image inside the page document\n",
" cv2.rectangle(original_img, (x - LEFT_PADDING, y - TOP_PADDING), (x + w + RIGHT_PADDING, y + h + BOTTOM_PADDING), (255, 255, 255), -1)\n",
" \n",
"fig = plt.figure(figsize=(15,10))\n",
"show_img = fig.add_subplot(1,2,1)\n",
"show_img.set_title(r'$\\bfCanny-Edged$'+ '\\n\\n', fontsize= 16)\n",
"show_img.xaxis.tick_top()\n",
"img_plot = plt.imshow(cv2.cvtColor(canny_img, cv2.COLOR_BGR2RGB))\n",
"\n",
"show_img = fig.add_subplot(1,2,2)\n",
"show_img.set_title(r'$\\bfNo-Visuals$' + '\\n\\n', fontsize= 16)\n",
"show_img.xaxis.tick_top()\n",
"img_plot = plt.imshow(cv2.cvtColor(original_img, cv2.COLOR_BGR2RGB)) \n",
"\n",
"img_encode = cv2.imencode('.png', original_img)[1]\n",
"data_encode = np.array(img_encode)\n",
"img_bytes = data_encode.tobytes()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### OCR Textract the No-Visuals images"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#create a Textract Client\n",
"textract = boto3.client('textract') \n",
"# Analyze Document Page using Textract (OCR)\n",
"response = textract.detect_document_text(Document={'Bytes': img_bytes})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### - Show the Textract Results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for item in response[\"Blocks\"]:\n",
" if item[\"BlockType\"] == \"LINE\":\n",
" print(item[\"Text\"])"
]
},
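{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each `LINE` block in the response also carries a `Confidence` score between 0 and 100, which can be used to drop low-confidence text. The following is a minimal sketch, assuming the response shape returned by `detect_document_text`; the helper name `extract_lines`, the threshold of 90, and the sample dictionary are hypothetical choices for illustration, not real Textract output.\n",
"\n",
"```python\n",
"def extract_lines(response, min_confidence=90.0):\n",
"    # Keep the text of LINE blocks whose confidence meets the threshold.\n",
"    lines = []\n",
"    for block in response.get('Blocks', []):\n",
"        if block.get('BlockType') != 'LINE':\n",
"            continue\n",
"        if block.get('Confidence', 0.0) >= min_confidence:\n",
"            lines.append(block['Text'])\n",
"    return lines\n",
"\n",
"\n",
"# Hand-written stand-in for a Textract response (not real output).\n",
"sample_response = {\n",
"    'Blocks': [\n",
"        {'BlockType': 'PAGE'},\n",
"        {'BlockType': 'LINE', 'Text': 'Quarterly report', 'Confidence': 99.1},\n",
"        {'BlockType': 'LINE', 'Text': 'smudged', 'Confidence': 42.0},\n",
"        {'BlockType': 'WORD', 'Text': 'Quarterly', 'Confidence': 99.3},\n",
"    ]\n",
"}\n",
"print(extract_lines(sample_response))\n",
"```\n",
"\n",
"Lowering `min_confidence` keeps more lines at the cost of more OCR noise; the right value depends on the document quality and the use case."
]
},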
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}