## Option 3 - Pixel Concentration Analysis Method for Removing Visuals

In this notebook we describe an empirical method to infer the position of images in a PDF. This method is highly sensitive to hyperparameter selection, so we recommend that end users manually fine-tune the hyperparameters on their own data. At a high level, the process follows the steps illustrated in the diagram below.

![process diagram](./images/pixel_analysis_flow.png)


### Environment Setup

In [None]:
!sudo yum install poppler-utils -y
!pip install pdf2image

### Initialization

In [None]:
import boto3
from pdf2image import convert_from_bytes
import numpy as np
from pprint import pprint
from utils import (rgb2gray, 
                   max_sliding_window, 
                   get_candidate_segments, 
                   filter_segments,
                   show_page, 
                   plot_aggregation,
                   get_segments
                  )
from matplotlib import pyplot as plt
import matplotlib.patches as patches
from PIL import Image
from copy import deepcopy

### Optimization Parameters

In [None]:
INPUT_PDF = './doc_input/sample_input.pdf'

# parameters to detect non full text vertical segments
Y_MIN_THR = .15
Y_MAX_THR = .32
Y_WINDOW = 10

# parameters to detect non full text horizontal segments
X_MIN_THR = .05
X_MAX_THR = .22
X_WINDOW = 2

# minimum area of a region to be removed: a region is removed only if its area
# exceeds LOWER_AREA_THR**2 pixels (e.g. LOWER_AREA_THR=30 -> 30*30 = 900)
LOWER_AREA_THR = 30

# gray range thresholds
LOWER_GRAY_THR = .2
UPPER_GRAY_THR = .95

### Step-by-step walkthrough

Without loss of generality, assume that the PDF has only one page.

#### Load the PDF page and convert it to a NumPy array

In [None]:
# load pdf (the context manager closes the file handle after reading)
with open(INPUT_PDF, 'rb') as f:
    images = convert_from_bytes(f.read())

# select page 0 
page = images[0]

# convert to numpy array
page_array = np.array(page)

show_page(page_array, color=True)

#### Convert the image to gray scale.

In [None]:
# convert to gray
gray_page = rgb2gray(page_array)
show_page(gray_page)
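
The `rgb2gray` helper comes from the notebook's `utils` module, which is not shown here. As a rough illustration only, a grayscale conversion is often a weighted sum of the color channels. The function name `rgb2gray_sketch` and the ITU-R BT.601 luma weights below are assumptions, not necessarily what `utils.rgb2gray` does.

```python
import numpy as np

def rgb2gray_sketch(rgb):
    """Hypothetical stand-in for utils.rgb2gray: weighted sum of R, G, B."""
    # ITU-R BT.601 luma weights (an assumption; the real helper may differ)
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3] @ weights

# 2x2 toy image: white, black, pure red, mid gray
toy = np.array([[[255, 255, 255], [0, 0, 0]],
                [[255, 0, 0], [128, 128, 128]]], dtype=np.uint8)

gray = rgb2gray_sketch(toy)
print(gray.shape)  # one value per pixel, channel axis removed
print(gray)
```

The output keeps the page's height and width and drops the channel axis, which is the shape the later row-wise and column-wise means rely on.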

#### Convert gray areas to white
Note: This step removes intermediate shades of gray, i.e. colors other than black or white once converted to grayscale. Most images will no longer be visible after this step. Please also note that this step can remove parts of the text that are not completely black.

In [None]:
# Min-max normalize to [0, 1], then set gray pixels to white
gray_page = (gray_page - gray_page.min()) / (gray_page.max() - gray_page.min())
gray_page[(gray_page>LOWER_GRAY_THR) & (gray_page<UPPER_GRAY_THR)] = 1

show_page(gray_page)
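
To make the effect of the gray band concrete, here is a small sketch on a toy row of pixel values. The input values are made up for illustration; only the two thresholds are taken from the notebook.

```python
import numpy as np

LOWER_GRAY_THR, UPPER_GRAY_THR = 0.2, 0.95  # same values as in the notebook

# toy grayscale row: black, very dark gray, mid gray, near white, white
gray = np.array([0.0, 26.0, 128.0, 250.0, 255.0])

# min-max normalize to [0, 1]; guard against a uniform (e.g. blank) page
span = gray.max() - gray.min()
norm = (gray - gray.min()) / span if span > 0 else np.zeros_like(gray)

# pixels strictly inside the gray band are forced to white (1.0)
norm[(norm > LOWER_GRAY_THR) & (norm < UPPER_GRAY_THR)] = 1.0
print(norm)
```

Only the mid-gray pixel is rewritten. The very dark pixel stays below `LOWER_GRAY_THR` and the near-white pixel stays above `UPPER_GRAY_THR`, so both keep their original values.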

Collapse pixels horizontally by taking the mean across the "x" dimension. The result `Y` (`y1` in the code below) is a vector whose length equals the vertical length (height) of the page. Each value `Y[i]` is the average black concentration at vertical level `i`. We call this vector the vertical distribution of the pixel values, since each value reflects the average pixel value at vertical level `y=i`.


In [None]:
y1 = 1 - gray_page.mean(axis=1)
y_smooth = max_sliding_window(y1, Y_WINDOW)
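
The `max_sliding_window` helper is imported from `utils` and its implementation is not shown. The sketch below is one plausible reading of it, a centered running maximum, applied to a toy page. The name `max_sliding_window_sketch` and the edge padding are assumptions.

```python
import numpy as np

def max_sliding_window_sketch(values, window):
    """Hypothetical stand-in for utils.max_sliding_window.

    Each output is the maximum over a centered window. The effective span is
    2 * (window // 2) + 1 samples, so an even window is rounded up by one.
    """
    values = np.asarray(values, dtype=float)
    half = window // 2
    padded = np.pad(values, half, mode="edge")  # repeat border values
    return np.array([padded[i:i + 2 * half + 1].max()
                     for i in range(len(values))])

# toy page: 6 rows x 4 columns, 1.0 = white, 0.0 = black
page = np.ones((6, 4))
page[1, :] = 0.0    # a fully black row
page[4, :2] = 0.0   # a half-black row

profile = 1 - page.mean(axis=1)  # black concentration per row
smooth = max_sliding_window_sketch(profile, window=3)
print(profile)
print(smooth)
```

A running maximum widens each peak, so isolated white gaps between lines of text are bridged and a paragraph shows up as one continuous band in the smoothed profile.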

In [None]:
fig = plt.figure(figsize=(15,10))
show_img = fig.add_subplot(1,2,1)
show_img.set_title('Pixel Density', fontweight='bold', fontsize= 16)
plt.plot(y1, range(len(y1)))
plt.xlabel('Average Black Pixel')
plt.ylabel('Vertical Position')
plt.gca().invert_yaxis()


show_img = fig.add_subplot(1,2,2)
show_img.set_title('Smoothed', fontweight='bold', fontsize= 16)
plt.plot(y_smooth, range(len(y_smooth)))
plt.xlabel('Average Black Pixel')
plt.ylabel('Vertical Position')
plt.axvline(Y_MIN_THR, color='red')
plt.axvline(Y_MAX_THR, color='red')
plt.gca().invert_yaxis()
plt.show()


The graphs above show a distinct separation between paragraphs and visuals.

Using the distribution above, we identify horizontal strips that are unlikely to be full-width paragraphs, i.e. paragraphs running from the left border to the right border. We flag these "non full-text" horizontal strips by thresholding with the `Y_MIN_THR` and `Y_MAX_THR` hyperparameters, shown above as vertical red lines.

In [None]:
y_segments = get_candidate_segments(y_smooth)
y_segments = filter_segments(y_segments, y_smooth, Y_MIN_THR, Y_MAX_THR)

pprint(y_segments[:3])

The logic above is wrapped in the `get_segments` function:

```python
y_segments = get_segments(img=gray_page, orientation="vertical", 
                          black_min_thr=Y_MIN_THR, black_max_thr=Y_MAX_THR, window_size=Y_WINDOW)
```
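
The bodies of `get_candidate_segments`, `filter_segments`, and `get_segments` live in `utils` and are not reproduced in this notebook. As a simplified illustration of the thresholding idea only, the sketch below extracts contiguous runs of a profile whose values fall inside a band. The helper name `segments_in_band` and the end-exclusive convention are assumptions, and the real helpers may work differently.

```python
import numpy as np

def segments_in_band(profile, min_thr, max_thr):
    """Return (start, end) pairs of contiguous runs with values in [min_thr, max_thr].

    Hypothetical helper. `end` is exclusive, so a pair can be used directly
    as a slice: profile[start:end].
    """
    profile = np.asarray(profile, dtype=float)
    inside = (profile >= min_thr) & (profile <= max_thr)
    # pad with False so runs touching either border still have both edges
    padded = np.concatenate(([False], inside, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)   # False -> True transitions
    ends = np.flatnonzero(edges == -1)    # True -> False transitions
    return list(zip(starts.tolist(), ends.tolist()))

profile = [0.0, 0.2, 0.25, 0.5, 0.5, 0.3, 0.0, 0.18]
print(segments_in_band(profile, min_thr=0.15, max_thr=0.32))
# prints [(1, 3), (5, 6), (7, 8)]
```

Values above the upper threshold (dense text) and below the lower threshold (blank space) are both excluded, which leaves the in-between rows as candidates for visuals.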

For each "non full-text" horizontal strip, we collapse pixels vertically by taking the mean across the "y" dimension within the strip. The result `X` is a vector whose length equals the horizontal length (width) of the page. Each value `X[j]` is the average pixel value along the vertical line `x=j` of the strip. We call this vector the horizontal distribution of the pixel values in strip `i`, since each value reflects the average pixel value at horizontal position `x=j`.

Using this distribution for each strip, we identify regions within the strip that are unlikely to be text. We flag the "non-text" vertical strips within the horizontal strip by thresholding with the `X_MIN_THR` and `X_MAX_THR` hyperparameters. The intersection of the "non-text" horizontal and vertical strips gives the coordinates of the images.

In [None]:
coordinates = []
for ymin, ymax in y_segments:
    vertical_image_region = gray_page[ymin:ymax, :]
    xs_segments = get_segments(img=vertical_image_region, orientation="horizontal", 
                               black_min_thr=X_MIN_THR, black_max_thr=X_MAX_THR, window_size=X_WINDOW)
    coordinates.extend([((ymin, ymax),(xmin, xmax)) for xmin, xmax in xs_segments if (ymax-ymin)*(xmax-xmin)>LOWER_AREA_THR**2])
    
pprint(coordinates[:3])

The coordinates above are the edges of the rectangles containing image regions, each stored as `((ymin, ymax), (xmin, xmax))`.
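
The size filter in the loop compares the area of each rectangle with `LOWER_AREA_THR**2`. It does not check each side separately. The sketch below shows the consequence on made-up detections (the `detections` dictionary is hypothetical data, not output from the notebook).

```python
LOWER_AREA_THR = 30  # same value as in the notebook

# hypothetical detections: (ymin, ymax) strip -> (xmin, xmax) spans inside it
detections = {
    (100, 400): [(50, 550), (560, 570)],  # a large figure and a thin sliver
    (420, 440): [(50, 80)],               # a small speck
}

coordinates = [
    ((ymin, ymax), (xmin, xmax))
    for (ymin, ymax), x_spans in detections.items()
    for xmin, xmax in x_spans
    if (ymax - ymin) * (xmax - xmin) > LOWER_AREA_THR ** 2
]
print(coordinates)
# prints [((100, 400), (50, 550)), ((100, 400), (560, 570))]
```

The 300 x 10 sliver passes because its area (3000) exceeds 900, even though one side is shorter than 30 pixels, while the 20 x 30 speck (area 600) is dropped. If thin slivers are a problem on your data, a per-side check may be worth adding.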

In [None]:
px = 1/plt.rcParams['figure.dpi']  # pixel in inches
fig,ax = plt.subplots(1, figsize=(page_array.shape[1]*px,page_array.shape[0]*px))		# Create figure and axes
ax.imshow(page_array)					# Display the image

# Draw a Rectangle patch for each detected region
for ((ymin, ymax),(xmin, xmax)) in coordinates:
    # matplotlib.patches.Rectangle(xy, width, height, angle=0.0, **kwargs)
    rect = patches.Rectangle((xmin,ymin),xmax-xmin,ymax-ymin, linewidth=1,edgecolor='r',facecolor='none')
    ax.add_patch(rect)		# Add the patch to the Axes
plt.show()

In [None]:
show_page(page_array, color=True)

In [None]:
page_array_wo_images = deepcopy(page_array)
for (ymin,ymax),(xmin,xmax) in coordinates:
    page_array_wo_images[ymin:ymax,xmin:xmax, :] = 255

# show_page(page_array_wo_images, color=True)
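
The whitening step can be checked in isolation on a toy array. This is a minimal sketch with made-up dimensions and coordinates. It also uses `ndarray.copy()` in place of `deepcopy`, which for a plain numeric NumPy array gives an equally independent copy.

```python
import numpy as np

# toy RGB page: 4 x 6 pixels, all mid gray
page = np.full((4, 6, 3), 128, dtype=np.uint8)

# ((ymin, ymax), (xmin, xmax)), end-exclusive like NumPy slices
coordinates = [((1, 3), (2, 5))]

cleaned = page.copy()  # leave the original page untouched
for (ymin, ymax), (xmin, xmax) in coordinates:
    cleaned[ymin:ymax, xmin:xmax, :] = 255  # white in 8-bit RGB

whitened = int((cleaned == 255).all(axis=2).sum())
print(whitened)  # number of pixels painted white
```

Here a 2 x 3 block is painted white and the source array keeps its original values, so the detection can be rerun with different thresholds without reloading the PDF.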


### Final Results

In [None]:
px = 1/plt.rcParams['figure.dpi']  # pixel in inches
fig,ax = plt.subplots(1, figsize=(page_array.shape[1]*px,page_array.shape[0]*px))		# Create figure and axes
ax.imshow(page_array_wo_images)	
plt.show()