# Amazon Textract Pre-processing: Detecting Visuals

## Introduction

Many customers face a common challenge: removing or redacting visuals during pre-processing so that only relevant content is fed into Textract for extraction. These visuals include logos, product sample pictures, charts, etc. In many use cases, these visuals should be redacted or extracted for further processing.

In this notebook, we go through different options for handling these visuals and point out each option's pros and cons, as well as additional parameters that engineers and developers should watch out for.


# Table of Contents:
* [Option 1: Searchable PDF](#option-1)
* [Option 2: OpenCV Edge Detector](#option-2)
* Option 3: Pixel Analysis Notebook

## Option 1: Searchable PDF
Searchable PDFs are native documents, usually generated by other applications. These PDFs retain metadata, text, and image information inside the document, so you can easily use Python or other libraries to extract images and text. In these two notebooks, we focus on non-searchable scanned PDFs.

## Option 2: OpenCV Edge Detector
This option uses the [OpenCV](https://docs.opencv.org/master/index.html) Canny edge detector to detect the edges of visuals within the document. In this example, we process a single page of the PDF; you can loop over the pages to process multi-page PDF documents. We also define a few variables that let you adjust and optimize accuracy for your use case.

### Environment Setup

In [None]:
#Install the dependencies (comment these lines out if they are already installed)
!pip install --upgrade matplotlib
!pip install --upgrade opencv-python
%matplotlib inline
!pip install --upgrade boto3
!pip install --upgrade numpy
!pip install --upgrade pdf2image

#poppler is required by pdf2image; uncomment the line that matches your OS
#For Mac users
#!brew install poppler

#For Linux users
#!sudo yum install poppler-utils -y

### Initialization

In [None]:
import cv2
from matplotlib import pyplot as plt
from pdf2image import convert_from_path
from IPython.display import IFrame
from pathlib import Path
import boto3
import numpy as np

### Optimization Parameters

In [None]:
# Minimum width and height of a visual, expressed as a fraction of the page width or page height (0.05 = 5%).
# If these are set as small as the width of a character, every character will be treated as a visual.
# The optimum threshold depends on your use case. You can experiment with these parameters to find the optimum point.
MINIMUM_WIDTH = 0.05 
MINIMUM_HEIGTH = 0.05

# Image padding parameters (in pixels): set the padding values to include the text associated with your image.
# This is particularly useful for visuals/charts without borders, when you would like to treat the axis values,
# image title, or image descriptions as part of the visual.
LEFT_PADDING = 15
RIGHT_PADDING = 5
TOP_PADDING = 5
BOTTOM_PADDING = 65
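
To make the effect of the padding values concrete, the sketch below expands a detected bounding box by the four paddings and clamps the result to the page. The helper name `padded_box`, the sample boxes, and the page size are hypothetical and chosen only for illustration; the padding values mirror the defaults above.

```python
# Padding values copied from the parameters above (in pixels).
LEFT_PADDING = 15
RIGHT_PADDING = 5
TOP_PADDING = 5
BOTTOM_PADDING = 65


def padded_box(x, y, w, h, page_width, page_height):
    """Hypothetical helper: expand (x, y, w, h) by the paddings, clamped to the page.

    Returns (x1, y1, x2, y2), where (x1, y1) is the top-left corner and
    (x2, y2) is the bottom-right corner of the padded region.
    """
    x1 = max(x - LEFT_PADDING, 0)
    y1 = max(y - TOP_PADDING, 0)
    x2 = min(x + w + RIGHT_PADDING, page_width)
    y2 = min(y + h + BOTTOM_PADDING, page_height)
    return x1, y1, x2, y2


# A chart detected at (100, 200), 400 px wide and 300 px tall, on a 1700 x 2200 px page.
print(padded_box(100, 200, 400, 300, 1700, 2200))  # (85, 195, 505, 565)

# A logo touching the top-left corner: the padding is clamped at 0.
print(padded_box(5, 2, 120, 60, 1700, 2200))  # (0, 0, 130, 127)
```

The comparatively large bottom padding is what pulls in a caption or axis labels printed below a chart.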

### Input and Output Directories

In [None]:
# Input and Output Files/Directories:
input_file_location = "./doc_input/"
output_file_location = "./doc_output/"
input_file_name = "sample_input.pdf"

### Preprocessing

#### - Read Input Document

In [None]:
doc = convert_from_path(input_file_location + input_file_name)

#### - Get the first document page

In [None]:
page = doc[0]
page_number = 1
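
To process every page rather than only the first, the remaining steps can be wrapped in a loop over `doc`. The sketch below shows only the page numbering and output file naming. The function name `output_image_path` is hypothetical, and a list of placeholder strings stands in for the page images returned by `convert_from_path` so that the snippet runs without a PDF.

```python
from pathlib import Path


def output_image_path(output_dir, input_file_name, page_number, image_index):
    """Hypothetical helper: build the output path for one cropped visual."""
    stem = Path(input_file_name).stem
    return str(Path(output_dir) / f"{stem}_page_{page_number}_image_{image_index}.png")


# Placeholder for the list of page images returned by convert_from_path.
doc = ["page-1-image", "page-2-image", "page-3-image"]

# enumerate(..., start=1) gives 1-based page numbers.
for page_number, page in enumerate(doc, start=1):
    # The pre-processing steps in the following cells would run here on `page`.
    print(output_image_path("./doc_output/", "sample_input.pdf", page_number, 1))
```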

#### - Convert the page image to a NumPy array readable by OpenCV

In [None]:
original_img = cv2.cvtColor(np.asarray(page), code=cv2.COLOR_RGB2BGR)

#### - Convert the original image to grayscale

In [None]:
# Convert Page to gray scale
gray_scale_image = cv2.cvtColor(original_img, cv2.COLOR_BGR2GRAY)

#Show the Original Page
fig = plt.figure(figsize=(15,10))
show_img = fig.add_subplot(1,2,1)
show_img.set_title(r'$\bf Original$'+ '\n\n', fontsize= 16)
show_img.xaxis.tick_top()
img_plot = plt.imshow(cv2.cvtColor(original_img, cv2.COLOR_BGR2RGB))

#Show the grayscale page
show_img = fig.add_subplot(1,2,2)
show_img.set_title(r'$\bf Grayscale$' + '\n\n', fontsize= 16)
show_img.xaxis.tick_top()
#The grayscale image has a single channel, so convert it with COLOR_GRAY2RGB for display
img_plot = plt.imshow(cv2.cvtColor(gray_scale_image, cv2.COLOR_GRAY2RGB))

#### - Detect edges using OpenCV Canny Edge detector on grayscale image

In [None]:
# Apply a Canny Edge detector to the gray scale page and detect edges in the gray scale page
canny_img = cv2.Canny(gray_scale_image, 0, 255, apertureSize=3, L2gradient=True) 

fig = plt.figure(figsize=(15,10))
show_img = fig.add_subplot(1,2,1)
show_img.set_title(r'$\bf Grayscale$'+ '\n\n', fontsize= 16)
show_img.xaxis.tick_top()
img_plot = plt.imshow(cv2.cvtColor(gray_scale_image, cv2.COLOR_GRAY2RGB))

show_img = fig.add_subplot(1,2,2)
show_img.set_title(r'$\bf Canny-Edged$' + '\n\n', fontsize= 16)
show_img.xaxis.tick_top()
img_plot = plt.imshow(cv2.cvtColor(canny_img, cv2.COLOR_GRAY2RGB))

#### - Detect the image contours in the Canny-edged image

In [None]:
#Detect the image contours in the edges detected
contours, hierarchy = cv2.findContours(canny_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

#### - Set the minimum width and height for filtering the contours size

In [None]:
#Set the minimum dimensions (in pixels) for the images inside the page
minimum_width = int(original_img.shape[1] * MINIMUM_WIDTH)
minimum_height = int(original_img.shape[0] * MINIMUM_HEIGTH)
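
As a worked example of these thresholds: `pdf2image` renders pages at 200 DPI by default, so a US Letter page (8.5 x 11 in) becomes roughly 1700 x 2200 pixels, and the 5% thresholds translate to 85 and 110 pixels. The sketch below applies the same filter to a few made-up bounding boxes. The page size, the box values, and the helper name `is_visual` are hypothetical.

```python
MINIMUM_WIDTH = 0.05
MINIMUM_HEIGTH = 0.05

# Assumed page size: US Letter rendered at 200 DPI, as (height, width, channels),
# in the same order as original_img.shape.
page_shape = (2200, 1700, 3)

minimum_width = int(page_shape[1] * MINIMUM_WIDTH)
minimum_height = int(page_shape[0] * MINIMUM_HEIGTH)
print(minimum_width, minimum_height)  # 85 110


def is_visual(w, h):
    """Hypothetical helper: True if a box is large enough to count as a visual."""
    return w >= minimum_width and h >= minimum_height


# Made-up (x, y, w, h) boxes: a character, a line of text, a logo, a chart.
boxes = [(100, 100, 12, 20), (100, 300, 1400, 24), (1400, 50, 200, 120), (200, 800, 900, 600)]
visuals = [b for b in boxes if is_visual(b[2], b[3])]
print(visuals)  # [(1400, 50, 200, 120), (200, 800, 900, 600)]
```

Because both dimensions must pass, a long line of text is rejected on height alone. The same rule also means that a visual shorter than 5% of the page height, such as a thin banner logo, would be missed unless the threshold is lowered.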

#### - Iterate over the contour list to crop the visuals found in the Canny-edged image and redact them from the original document

In [None]:
#Iterate over the contours detected
image_index = 0
save_images = True

#Make sure the output directory exists; cv2.imwrite returns False without raising if it does not
Path(output_file_location).mkdir(parents=True, exist_ok=True)

for c in contours:
    #Get the contour corner (x,y) width and height
    x,y,w,h = cv2.boundingRect(c)

    #Verify if the contour dimensions match the minimum dimensions set with minimum_width and minimum_height
    if (w >= minimum_width and h >= minimum_height):
        #Padded bounding box, clamped at 0 so that negative indices do not wrap around the image
        x1, y1 = max(x - LEFT_PADDING, 0), max(y - TOP_PADDING, 0)
        x2, y2 = x + w + RIGHT_PADDING, y + h + BOTTOM_PADDING

        if (save_images):
            image_index += 1
            #Crop the image inside the contour and save it to a file
            cropped_image = original_img[y1:y2, x1:x2]
            cv2.imwrite(fr"{output_file_location}{Path(input_file_name).stem}_page_{page_number}_image_{image_index}.png",cropped_image)
            cropped_image = None

        #Redact the image inside the page document
        cv2.rectangle(original_img, (x1, y1), (x2, y2), (255, 255, 255), -1)
 
fig = plt.figure(figsize=(15,10))
show_img = fig.add_subplot(1,2,1)
show_img.set_title(r'$\bf Canny-Edged$'+ '\n\n', fontsize= 16)
show_img.xaxis.tick_top()
img_plot = plt.imshow(cv2.cvtColor(canny_img, cv2.COLOR_GRAY2RGB))

show_img = fig.add_subplot(1,2,2)
show_img.set_title(r'$\bf No-Visuals$' + '\n\n', fontsize= 16)
show_img.xaxis.tick_top()
img_plot = plt.imshow(cv2.cvtColor(original_img, cv2.COLOR_BGR2RGB)) 

#Encode the redacted page as PNG bytes to send to Textract
img_encode = cv2.imencode('.png', original_img)[1]
data_encode = np.array(img_encode)
img_bytes = data_encode.tobytes()
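
One detail worth keeping in mind when cropping with padding: NumPy treats a negative slice start as an offset from the end of the array, so a padded box that extends past the top or left edge of the page needs to be clamped to 0. A small sketch on a made-up 10 x 10 array (the values of `y`, `h`, and the padding are chosen only to trigger the case):

```python
import numpy as np

img = np.arange(100).reshape(10, 10)
y, h, TOP_PADDING = 2, 3, 5

# y - TOP_PADDING is -3: the slice starts 3 rows from the end and comes back empty.
wrapped = img[y - TOP_PADDING:y + h]
print(wrapped.shape)  # (0, 10)

# Clamping the start at 0 keeps the crop anchored to the top edge.
clamped = img[max(y - TOP_PADDING, 0):y + h]
print(clamped.shape)  # (5, 10)
```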

### Run Textract OCR on the No-Visuals Image

In [None]:
#create a Textract Client
textract = boto3.client('textract') 
# Analyze Document Page using Textract (OCR)
response = textract.detect_document_text(Document={'Bytes': img_bytes})

#### - Show the Textract Results

In [None]:
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print(item["Text"])
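
Each `LINE` block also carries a `Confidence` score (0 to 100) that can be used to drop low-quality detections. The sketch below runs on a small hand-written dictionary that mimics the shape of a `detect_document_text` response. The sample text, the scores, the helper name `extract_lines`, and the threshold of 90 are all made up for illustration.

```python
def extract_lines(response, min_confidence=0.0):
    """Hypothetical helper: return the text of LINE blocks at or above min_confidence."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE" and block.get("Confidence", 0.0) >= min_confidence
    ]


# Hand-written sample that mimics a Textract response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Quarterly Report", "Confidence": 99.2},
        {"BlockType": "WORD", "Text": "Quarterly", "Confidence": 99.5},
        {"BlockType": "WORD", "Text": "Report", "Confidence": 98.9},
        {"BlockType": "LINE", "Text": "l1l|", "Confidence": 41.7},
    ]
}

print(extract_lines(sample_response))  # ['Quarterly Report', 'l1l|']
print(extract_lines(sample_response, min_confidence=90))  # ['Quarterly Report']
```

A suitable threshold depends on the documents being processed, so it is worth checking a few pages by hand before fixing a value.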