## Option 3 - Pixel Concentration Analysis Method for Removing Visuals

In this notebook we describe an empirical method to infer the position of images in a PDF. The method is highly sensitive to the choice of hyperparameters, so we recommend that end users fine-tune them manually on their own data. At a high level, the process follows the steps illustrated in the diagram below.

![process diagram](./images/pixel_analysis_flow.png)


### Environment Setup

In [None]:
!sudo yum install poppler-utils -y
!pip install pdf2image
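
Because `pdf2image` relies on the poppler command-line tools, it can help to confirm they are on the `PATH` before running the rest of the notebook. The check below is a minimal sketch: it assumes the installed poppler package provides the `pdftoppm` and `pdfinfo` binaries, and the helper name `poppler_available` is made up for this example.

```python
import shutil

def poppler_available(tools=("pdftoppm", "pdfinfo")):
    """Return True if every listed poppler binary is found on PATH."""
    return all(shutil.which(tool) is not None for tool in tools)

if not poppler_available():
    print("poppler tools not found on PATH; pdf2image will not be able to render pages")
```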

### Initialization

In [None]:
import boto3
from pdf2image import convert_from_bytes
import numpy as np
from pprint import pprint
from utils import (rgb2gray, 
 max_sliding_window, 
 get_candidate_segments, 
 filter_segments,
 show_page, 
 plot_aggregation,
 get_segments
 )
from matplotlib import pyplot as plt
import matplotlib.patches as patches
from PIL import Image
from copy import deepcopy

### Optimization Parameters

In [None]:
INPUT_PDF = './doc_input/sample_input.pdf'

# parameters to detect non full text vertical segments
Y_MIN_THR = .15
Y_MAX_THR = .32
Y_WINDOW = 10

# parameters to detect non full text horizontal segments
X_MIN_THR = .05
X_MAX_THR = .22
X_WINDOW = 2

# minimum size of an image region to be removed, e.g. LOWER_AREA_THR=30 means an area of 30*30 pixels
LOWER_AREA_THR = 30

# gray range thresholds
LOWER_GRAY_THR = .2
UPPER_GRAY_THR = .95
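
The helpers in `utils` are not shown in this notebook, so the following is only a sketch of one plausible reading of the threshold and window parameters: compute the fraction of dark pixels per row, smooth it with a sliding maximum, and keep the runs of rows whose value falls between the minimum and maximum thresholds. The function names `dark_fraction`, `sliding_max`, and `segments_in_band` are hypothetical and may differ from what `max_sliding_window` and `get_candidate_segments` actually do.

```python
import numpy as np

Y_MIN_THR, Y_MAX_THR, Y_WINDOW = 0.15, 0.32, 10

def dark_fraction(gray, axis):
    """Fraction of non-white pixels along `axis`; `gray` is in [0, 1] with 1 = white."""
    return (gray < 1.0).mean(axis=axis)

def sliding_max(values, window):
    """Maximum over a window of `window` samples around each index."""
    padded = np.pad(values, window // 2, mode="edge")
    return np.array([padded[i:i + window].max() for i in range(len(values))])

def segments_in_band(profile, lo, hi):
    """Return (start, end) index pairs, end exclusive, where lo < profile < hi."""
    inside = (profile > lo) & (profile < hi)
    # Pad with False so every run has a detectable start and end.
    edges = np.diff(np.concatenate(([False], inside, [False])).astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return list(zip(starts.tolist(), ends.tolist()))

# Synthetic 100x100 white page with a block covering 25 of 100 columns in rows 20-39.
page = np.ones((100, 100))
page[20:40, 10:35] = 0.0

row_profile = sliding_max(dark_fraction(page, axis=1), Y_WINDOW)
segments = segments_in_band(row_profile, Y_MIN_THR, Y_MAX_THR)
print(segments)
```

Under this reading, a larger window widens each detected segment, and rows that are almost entirely dark or almost entirely white fall outside the band and are ignored.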

### Step by step walk through

Without loss of generality, assume that the PDF has only one page.

#### Load the PDF page and convert it to a NumPy array

In [None]:
# load pdf (the context manager closes the file once it has been read)
with open(INPUT_PDF, 'rb') as f:
    images = convert_from_bytes(f.read())

# select page 0 
page = images[0]

# convert to numpy array
page_array = np.array(page)

show_page(page_array, color=True)

#### Convert the image to grayscale

In [None]:
# convert to gray
gray_page = rgb2gray(page_array)
show_page(gray_page)
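
The implementation of `rgb2gray` lives in `utils` and is not reproduced here. A common way to do this conversion is a weighted sum of the color channels. The sketch below uses the ITU-R BT.601 luma weights, which is an assumption about the approach rather than a copy of the helper, and the name `rgb_to_gray` is chosen for this example.

```python
import numpy as np

def rgb_to_gray(rgb):
    """Weighted sum of the R, G, B channels (ITU-R BT.601 luma weights)."""
    weights = np.array([0.299, 0.587, 0.114])
    # Use only the first three channels so an alpha channel, if present, is ignored.
    return rgb[..., :3] @ weights

# Tiny 1x3 image: one black, one white, and one pure red pixel.
sample = np.array([[[0, 0, 0], [255, 255, 255], [255, 0, 0]]], dtype=np.uint8)
gray_sample = rgb_to_gray(sample)
print(gray_sample.shape)
```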

#### Converting gray areas to white
Note: This removes intermediate shades of gray, i.e. any color that is neither black nor white once converted to grayscale. Most images will no longer be visible after this step. Please also note that this step can remove parts of the text that are not completely black.
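
To make the effect of the two gray thresholds concrete, here is a small worked example on a single row of pixel values. It is a sketch under the assumption that white corresponds to 1 after normalization. The function name `whiten_mid_grays` is made up for this example.

```python
import numpy as np

LOWER_GRAY_THR = 0.2
UPPER_GRAY_THR = 0.95

def whiten_mid_grays(gray, lower, upper):
    """Scale `gray` as the notebook does, then set values strictly
    between `lower` and `upper` to 1 (white)."""
    out = gray.astype(float)
    out = (out - out.min()) / out.max()
    out[(out > lower) & (out < upper)] = 1.0
    return out

# One row of pixels: black, dark gray, mid gray, near white, white.
row = np.array([0, 40, 128, 250, 255])
whitened = whiten_mid_grays(row, LOWER_GRAY_THR, UPPER_GRAY_THR)
print(whitened)
```

Only the mid-gray value (128) is changed here. The dark-gray pixel stays below the lower threshold and the near-white pixel stays above the upper one, which is why lightly colored text can disappear while dark text survives.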

In [None]:
# Normalize, then convert intermediate gray pixels to white
gray_page = (gray_page - gray_page.min())/gray_page.max()
gray_page[(gray_page>LOWER_GRAY_THR) & (gray_page<UPPER_GRAY_THR)] = 1
# ... >LOWER_AREA_THR**2])
 
pprint(coordinates[:3])

The values above are the edges of the rectangles that contain image regions, given as `((ymin, ymax), (xmin, xmax))` in pixels.
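
The `LOWER_AREA_THR` parameter suggests that small rectangles are discarded before removal. A minimal sketch of such a filter is shown below, assuming the `((ymin, ymax), (xmin, xmax))` layout used in this notebook. The helper `keep_large_regions` and the sample rectangles are hypothetical and are not taken from `utils`.

```python
LOWER_AREA_THR = 30

def keep_large_regions(coordinates, min_side):
    """Keep rectangles whose pixel area exceeds min_side * min_side."""
    return [
        ((ymin, ymax), (xmin, xmax))
        for (ymin, ymax), (xmin, xmax) in coordinates
        if (ymax - ymin) * (xmax - xmin) > min_side ** 2
    ]

# Made-up rectangles: 10x10, 40x40 and 31x31 pixels.
candidates = [((0, 10), (0, 10)), ((0, 40), (0, 40)), ((5, 36), (2, 33))]
large = keep_large_regions(candidates, LOWER_AREA_THR)
print(large)
```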

In [None]:
px = 1/plt.rcParams['figure.dpi'] # pixel in inches
fig,ax = plt.subplots(1, figsize=(page_array.shape[1]*px,page_array.shape[0]*px))		# Create figure and axes
ax.imshow(page_array)					# Display the image

# Draw a red rectangle around each detected region
for ((ymin, ymax),(xmin, xmax)) in coordinates:
    # matplotlib.patches.Rectangle(xy, width, height, angle=0.0, **kwargs)
    rect = patches.Rectangle((xmin,ymin),xmax-xmin,ymax-ymin, linewidth=1,edgecolor='r',facecolor='none')
    ax.add_patch(rect)		# Add the patch to the Axes
plt.show()

In [None]:
show_page(page_array, color=True)

In [None]:
# Paint every detected region white on a copy of the page
page_array_wo_images = deepcopy(page_array)
for (ymin,ymax),(xmin,xmax) in coordinates:
    page_array_wo_images[ymin:ymax,xmin:xmax, :] = 255

# show_page(page_array_wo_images, color=True)
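
The removal step can also be wrapped in a small function, which makes it easy to try on synthetic data before running it on a real page. This is a sketch: the name `blank_regions` is ours, and it uses NumPy's `ndarray.copy()` in place of `deepcopy`, which is equivalent for a plain numeric array.

```python
import numpy as np

def blank_regions(page_array, coordinates, fill=255):
    """Return a copy of `page_array` with each ((ymin, ymax), (xmin, xmax))
    rectangle overwritten by `fill`; the input array is left untouched."""
    out = page_array.copy()
    for (ymin, ymax), (xmin, xmax) in coordinates:
        out[ymin:ymax, xmin:xmax] = fill
    return out

# Synthetic all-black RGB page with one 10x20 region to blank out.
demo_page = np.zeros((50, 60, 3), dtype=np.uint8)
demo_coordinates = [((5, 15), (10, 30))]
cleaned = blank_regions(demo_page, demo_coordinates)
print(cleaned[5:15, 10:30].min(), cleaned[0, 0].max())
```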


### Final Results

In [None]:
px = 1/plt.rcParams['figure.dpi'] # pixel in inches
fig,ax = plt.subplots(1, figsize=(page_array.shape[1]*px,page_array.shape[0]*px))		# Create figure and axes
ax.imshow(page_array_wo_images)	
plt.show()