{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Option 3 - Pixel Concentration Analysis Method for Removing Visuals\n", "\n", "In this notebook we describe an empirical method to infer the position of images in a PDF. This method is highly sensible to the hyperparameter selection. Therefore, we recommend the end user to manually fine-tune them in their data. At a high level this process follows the steps illustrated in the diagram below.\n", "\n", "![process diagram](./images/pixel_analysis_flow.png)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Environmnet Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!sudo yum install poppler-utils -y\n", "!pip install pdf2image" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initialization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "from pdf2image import convert_from_bytes\n", "import numpy as np\n", "from pprint import pprint\n", "from utils import (rgb2gray, \n", " max_sliding_window, \n", " get_candidate_segments, \n", " filter_segments,\n", " show_page, \n", " plot_aggregation,\n", " get_segments\n", " )\n", "from matplotlib import pyplot as plt\n", "import matplotlib.patches as patches\n", "from PIL import Image\n", "from copy import deepcopy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Optimization Parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "INPUT_PDF = './doc_input/sample_input.pdf'\n", "\n", "# parameters to detect non full text vertical segments\n", "Y_MIN_THR = .15\n", "Y_MAX_THR = .32\n", "Y_WINDOW = 10\n", "\n", "# parameters to detect non full text horizontal segments\n", "X_MIN_THR = .05\n", "X_MAX_THR = .22\n", "X_WINDOW = 2\n", "\n", "# minimum dimension of image to be removed. e.g. if LOWER_AREA_THR=30 then 30*30\n", "LOWER_AREA_THR = 30\n", "\n", "# gray range thresholeds\n", "LOWER_GRAY_THR = .2\n", "UPPER_GRAY_THR = .95" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step by step walk through\n", "\n", "Without loss of generality assume that the pdf has only one page. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Load the pdf page, convert it to numpy array" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load pdf\n", "images = convert_from_bytes(open(INPUT_PDF, 'rb').read())\n", "\n", "# select page 0 \n", "page = images[0]\n", "\n", "# convert to numpy array\n", "page_array = np.array(page)\n", "\n", "show_page(page_array, color=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Convert the image to gray scale." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# convert to gray\n", "gray_page = rgb2gray(page_array)\n", "show_page(gray_page)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Converting gray areas to white. \n", "Note: This will remove intermediate shades of gray, i.e. non-black or white colors converted to gray scale. Most image will not be visible after this step. Also please noted, this step has the potential remove parts of the text that is not completly black." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Normalize and remove gray convert to white gray pixels\n", "gray_page = (gray_page - gray_page.min())/gray_page.max()\n", "gray_page[(gray_page>LOWER_GRAY_THR) & (gray_pageLOWER_AREA_THR**2])\n", " \n", "pprint(coordinates[:3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The aboves are the edges of the rectangles containing image regions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "px = 1/plt.rcParams['figure.dpi'] # pixel in inches\n", "fig,ax = plt.subplots(1, figsize=(page_array.shape[1]*px,page_array.shape[0]*px))\t\t# Create figure and axes\n", "ax.imshow(page_array)\t\t\t\t\t# Display the image\n", "\n", "# Create a Rectangle patch; \n", "for ((ymin, ymax),(xmin, xmax)) in coordinates:\n", " # matplotlib.patches.Rectangle(xy, width, height, angle=0.0, **kwargs)\n", " rect = patches.Rectangle((xmin,ymin),xmax-xmin,ymax-ymin, linewidth=1,edgecolor='r',facecolor='none')\n", " ax.add_patch(rect)\t\t# Add the patch to the Axes\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "show_page(page_array, color=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "page_array_wo_images = deepcopy(page_array)\n", "for (ymin,ymax),(xmin,xmax) in coordinates:\n", " page_array_wo_images[ymin:ymax,xmin:xmax, :] = 255\n", "\n", "# show_page(page_array_wo_images, color=True)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final Results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "px = 1/plt.rcParams['figure.dpi'] # pixel in inches\n", "fig,ax = plt.subplots(1, figsize=(page_array.shape[1]*px,page_array.shape[0]*px))\t\t# Create figure and axes\n", "ax.imshow(page_array_wo_images)\t\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }