{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Option 3 - Pixel Concentration Analysis Method for Removing Visuals\n", "\n", "In this notebook we describe an empirical method to infer the position of images in a PDF. This method is highly sensible to the hyperparameter selection. Therefore, we recommend the end user to manually fine-tune them in their data. At a high level this process follows the steps illustrated in the diagram below.\n", "\n", "![process diagram](./images/pixel_analysis_flow.png)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Environmnet Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!sudo yum install poppler-utils -y\n", "!pip install pdf2image" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initialization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "from pdf2image import convert_from_bytes\n", "import numpy as np\n", "from pprint import pprint\n", "from utils import (rgb2gray, \n", " max_sliding_window, \n", " get_candidate_segments, \n", " filter_segments,\n", " show_page, \n", " plot_aggregation,\n", " get_segments\n", " )\n", "from matplotlib import pyplot as plt\n", "import matplotlib.patches as patches\n", "from PIL import Image\n", "from copy import deepcopy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Optimization Parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "INPUT_PDF = './doc_input/sample_input.pdf'\n", "\n", "# parameters to detect non full text vertical segments\n", "Y_MIN_THR = .15\n", "Y_MAX_THR = .32\n", "Y_WINDOW = 10\n", "\n", "# parameters to detect non full text horizontal segments\n", "X_MIN_THR = .05\n", "X_MAX_THR = .22\n", "X_WINDOW = 2\n", "\n", "# minimum dimension of image to be removed. e.g. if LOWER_AREA_THR=30 then 30*30\n", "LOWER_AREA_THR = 30\n", "\n", "# gray range thresholeds\n", "LOWER_GRAY_THR = .2\n", "UPPER_GRAY_THR = .95" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step by step walk through\n", "\n", "Without loss of generality assume that the pdf has only one page. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Load the pdf page, convert it to numpy array" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load pdf\n", "images = convert_from_bytes(open(INPUT_PDF, 'rb').read())\n", "\n", "# select page 0 \n", "page = images[0]\n", "\n", "# convert to numpy array\n", "page_array = np.array(page)\n", "\n", "show_page(page_array, color=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Convert the image to gray scale." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# convert to gray\n", "gray_page = rgb2gray(page_array)\n", "show_page(gray_page)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Converting gray areas to white. \n", "Note: This will remove intermediate shades of gray, i.e. non-black or white colors converted to gray scale. Most image will not be visible after this step. Also please noted, this step has the potential remove parts of the text that is not completly black." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Normalize and remove gray convert to white gray pixels\n", "gray_page = (gray_page - gray_page.min())/gray_page.max()\n", "gray_page[(gray_page>LOWER_GRAY_THR) & (gray_pageLOWER_AREA_THR**2])\n", " \n", "pprint(coordinates[:3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The aboves are the edges of the rectangles containing image regions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "px = 1/plt.rcParams['figure.dpi'] # pixel in inches\n", "fig,ax = plt.subplots(1, figsize=(page_array.shape[1]*px,page_array.shape[0]*px))\t\t# Create figure and axes\n", "ax.imshow(page_array)\t\t\t\t\t# Display the image\n", "\n", "# Create a Rectangle patch; \n", "for ((ymin, ymax),(xmin, xmax)) in coordinates:\n", " # matplotlib.patches.Rectangle(xy, width, height, angle=0.0, **kwargs)\n", " rect = patches.Rectangle((xmin,ymin),xmax-xmin,ymax-ymin, linewidth=1,edgecolor='r',facecolor='none')\n", " ax.add_patch(rect)\t\t# Add the patch to the Axes\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "show_page(page_array, color=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "page_array_wo_images = deepcopy(page_array)\n", "for (ymin,ymax),(xmin,xmax) in coordinates:\n", " page_array_wo_images[ymin:ymax,xmin:xmax, :] = 255\n", "\n", "# show_page(page_array_wo_images, color=True)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final Results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "px = 1/plt.rcParams['figure.dpi'] # pixel in inches\n", "fig,ax = plt.subplots(1, figsize=(page_array.shape[1]*px,page_array.shape[0]*px))\t\t# Create figure and axes\n", "ax.imshow(page_array_wo_images)\t\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }