{ "cells": [ { "cell_type": "markdown", "id": "5e46ca35", "metadata": { "tags": [] }, "source": [ "# Public sector use case: Benefit application\n", "\n", "## Lab 1 - Textract" ] }, { "cell_type": "markdown", "id": "9e85f463-7c16-46a7-a9dc-bd750c1d9692", "metadata": {}, "source": [ "---\n", "\n", "## Introduction\n", "Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. In this session, we demonstrate how to use Amazon Textract programmatically to solve common intelligent document processing challenges, such as detecting text within a document, analyzing a document for relationships between detected items, using asynchronous operations for batch processing, analyzing a document for financially related relationships between text, and more.\n", "\n", "
\n", "\n", "- 1. [Prerequisites](#section_1_0)\n", " - 1.1 [Install packages](#section_1_1)\n", " - 1.2 [Import packages and modules](#section_1_2)\n", " - 1.3 [Setup the notebook role and session](#section_1_3)\n", " - 1.4 [Setup the AWS service clients](#section_1_4)\n", " - 1.5 [Upload resource files](#section_1_5) \n", "- 2. [An Introduction to Document Processing with Amazon Textract](#section_2_0)\n", " - 2.1 [Sample document for text detection](#section_2_1)\n", " - 2.2 [Detecting document text](#section_2_2)\n", " - 2.3 [Amazon Textract Response Objects](#section_2_3)\n", " - 2.4 [Visualizing the Response Objects Hierarchy](#section_2_4)\n", "- 3. [Asynchronous processing](#section_3_0)\n", " - 3.1 [Sample document for asynchronous processing](#section_3_1)\n", " - 3.2 [Asynchronous Document Analysis](#section_3_2)\n", " - 3.3 [Asynchronous process workflow](#section_3_3)\n", "- 4. [Expense Document Analysis](#section_4_0)\n", " - 4.1 [Sample document for expense document analysis processing](#section_4_1)\n", " - 4.2 [Execute a synchronous analyze expense job](#section_4_2)\n", "- 5. [Processing Identity Documents](#section_5_0)\n", " - 5.1 [Sample document for identity document analysis processing](#section_5_1)\n", " - 5.2 [Execute a synchronous analyze identity process](#section_5_2)\n", " - 5.3 [Sample identity document for document analysis queries](#section_5_3)\n", " - 5.4 [Processing a U.S. Social Security Card with Queries](#section_5_4) \n", "- 6. [Document Enrichment using Redaction (Optional)](#section_6_0)\n", "- 7. [Conclusion](#section_7_0)\n", "- 8. [Additional Resources](#section_8_0)\n", "\n", "##### **Let's get started!**" ] }, { "cell_type": "markdown", "id": "a8c68a4c-3971-4026-8618-6721a967f1c8", "metadata": { "tags": [] }, "source": [ "---\n", "\n", "## 1. Prerequisites\n", "\n", "\n", "In this section, we'll install and import packages, establish the notebook execution role and session, and set up the AWS service clients." 
] }, { "cell_type": "markdown", "id": "67501915-0bfe-42d0-858a-481d84aa4b98", "metadata": { "tags": [] }, "source": [ "### 1.1 Install packages\n", "\n", "\n", "We use *pip* to install packages from the Python Package Index and other indexes. A package contains all the files you need for a module.\n", "Modules are Python code libraries you can include in your project. You can think of Python packages as the directories on a file system and modules as files within directories. \n", "\n", "**Note:** Executing the code in this cell produces a lot of debug output. This is normal and expected." ] }, { "cell_type": "code", "execution_count": null, "id": "2287b8f4", "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install amazon-textract-caller\n", "!pip install amazon-textract-prettyprinter\n", "!pip install amazon-textract-response-parser \n", "!pip install boto3 \n", "!pip install botocore \n", "!pip install s3fs\n", "!pip install textract-trp " ] }, { "cell_type": "markdown", "id": "cac344b6-547f-4c60-97eb-cfdedf3fc883", "metadata": { "tags": [] }, "source": [ "### 1.2 Import packages and modules\n", "\n", "\n", "Python code in one module gains access to the code in another module by the process of importing it. In this section, we import packages and modules needed to execute code cells in this notebook." 
] }, { "cell_type": "code", "execution_count": null, "id": "15fc6eb4-f9fb-4437-abf2-ca589237ef2c", "metadata": { "tags": [] }, "outputs": [], "source": [ "import sagemaker\n", "import boto3\n", "import s3fs\n", "import time\n", "import json\n", "import io\n", "import os\n", "from io import BytesIO\n", "\n", "from PIL import Image, ImageDraw, ImageOps\n", "\n", "from trp import Document\n", "import urllib.request\n", "\n", "from textractcaller import call_textract_analyzeid\n", "import trp.trp2_analyzeid as t2id\n", "\n", "from tabulate import tabulate\n", "import trp.trp2 as t2" ] }, { "cell_type": "markdown", "id": "128559ed-325b-4114-b0a9-d3fca33f140e", "metadata": {}, "source": [ "### 1.3 Setup the notebook role and session\n", "\n", "\n", "As a managed service, Amazon SageMaker performs operations on your behalf on AWS hardware that SageMaker manages. SageMaker can perform only the operations that the user permits. A SageMaker user grants these permissions with an IAM role (referred to as an execution role).\n", "\n", "To retrieve the notebook's execution role and create a SageMaker session, execute the code in the following cell." ] }, { "cell_type": "code", "execution_count": null, "id": "3b01dc3c-7088-4ea4-ad52-0b564c07eba7", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Get the IAM execution role attached to this notebook\n", "role = sagemaker.get_execution_role()\n", "\n", "# Get the SageMaker session\n", "session = sagemaker.Session()\n", "\n", "# Get the region name\n", "region = session.boto_region_name\n", "\n", "print('Using IAM role arn: {}'.format(role))\n", "print('Using region: {}'.format(region))" ] }, { "cell_type": "markdown", "id": "66962a6f-7dc9-4b2f-88f7-0aa6ba7b3a79", "metadata": {}, "source": [ "### 1.4 Setup the AWS service clients\n", "\n", "\n", "The AWS SDK for Python (Boto3) is commonly used to integrate Python applications with various AWS services. 
Clients provide a low-level interface to the AWS service. In this section, we create two Boto3 clients, one for Amazon S3 and one for Amazon Textract, to use in the code cells of this notebook." ] }, { "cell_type": "code", "execution_count": null, "id": "921e08e6-dd4e-4877-90c0-95db1aed42fa", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Setup the S3 client\n", "s3_client = boto3.client('s3')\n", "\n", "# Setup the Textract client\n", "textract_client = boto3.client('textract', region_name=region)\n", "\n", "# Get bucket settings\n", "s3_bucket = session.default_bucket()\n", "s3_bucket_prefix = 'sample-files'\n", "s3_file_path = 's3://{}/{}'.format(s3_bucket, s3_bucket_prefix)\n", "\n", "# The path to our files in the S3 bucket\n", "print('S3 FILE PATH: {}'.format(s3_file_path))" ] }, { "cell_type": "markdown", "id": "680f4d5c-8886-4eb4-923f-08887b1ec4da", "metadata": {}, "source": [ "### 1.5 Upload resource files\n", "\n", "\n", "In the following cell, we use the high-level `aws s3` commands of the AWS CLI to upload our resource files to the default S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "id": "63d91c44-b4b9-42dd-9bac-f3fb706c6410", "metadata": { "tags": [] }, "outputs": [], "source": [ "cmd = 'aws s3 cp {} {} --recursive'.format('./sample-files', s3_file_path)\n", "os.system(cmd)\n", "\n", "#!aws s3 cp ./aim204/sample-files