{ "cells": [ { "cell_type": "markdown", "id": "5e46ca35", "metadata": { "tags": [] }, "source": [ "# Public sector use case: Benefit application\n", "\n", "## Lab 1 - Textract" ] }, { "cell_type": "markdown", "id": "9e85f463-7c16-46a7-a9dc-bd750c1d9692", "metadata": {}, "source": [ "---\n", "\n", "## Introduction\n", "Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. In this session we will demonstrate how to programmatically use Amazon Textract to solve common intelligent document processing challenges, such as detecting text within a document, analyzing a document for relationships between detected items, using asynchronous operations for batch processing, analyzing a document for financially-related relationships between text, and more.\n", "\n", "\n", "- 1. [Prerequisites](#section_1_0)\n", " - 1.1 [Install packages](#section_1_1)\n", " - 1.2 [Import packages and modules](#section_1_2)\n", " - 1.3 [Setup the notebook role and session](#section_1_3)\n", " - 1.4 [Setup the AWS service clients](#section_1_4)\n", " - 1.5 [Upload resource files](#section_1_5)\n", "- 2. [An Introduction to Document Processing with Amazon Textract](#section_2_0)\n", " - 2.1 [Sample document for text detection](#section_2_1)\n", " - 2.2 [Detecting document text](#section_2_2)\n", " - 2.3 [Amazon Textract Response Objects](#section_2_3)\n", " - 2.4 [Visualizing the Response Objects Hierarchy](#section_2_4)\n", "- 3. [Asynchronous processing](#section_3_0)\n", " - 3.1 [Sample document for asynchronous processing](#section_3_1)\n", " - 3.2 [Asynchronous Document Analysis](#section_3_2)\n", " - 3.3 [Asynchronous process workflow](#section_3_3)\n", "- 4. [Expense Document Analysis](#section_4_0)\n", " - 4.1 [Sample document for expense document analysis processing](#section_4_1)\n", " - 4.2 [Execute a synchronous analyze expense job](#section_4_2)\n", "- 5. [Processing Identity Documents](#section_5_0)\n", " - 5.1 [Sample document for identity document analysis processing](#section_5_1)\n", " - 5.2 [Execute a synchronous analyze identity process](#section_5_2)\n", " - 5.3 [Sample identity document for document analysis queries](#section_5_3)\n", " - 5.4 [Processing a U.S. Social Security Card with Queries](#section_5_4)\n", "- 6. [Document Enrichment using Redaction (Optional)](#section_6_0)\n", "- 7. [Conclusion](#section_7_0)\n", "- 8. [Additional Resources](#section_8_0)\n", "\n", "##### **Let's get started!**" ] }, { "cell_type": "markdown", "id": "a8c68a4c-3971-4026-8618-6721a967f1c8", "metadata": { "tags": [] }, "source": [ "---\n", "\n", "## 1. Prerequisites\n", "\n", "\n", "In this section, we'll install and import packages, establish the notebook execution role and session, and set up the AWS service clients."
] }, { "cell_type": "markdown", "id": "67501915-0bfe-42d0-858a-481d84aa4b98", "metadata": { "tags": [] }, "source": [ "### 1.1 Install packages\n", "\n", "\n", "We use *pip* to install packages from the Python Package Index and other indexes. A package contains all the files you need for a module.\n", "Modules are Python code libraries you can include in your project. You can think of Python packages as directories on a file system and modules as files within those directories.\n", "\n", "**Note:** executing the code in this cell produces a lot of installation output; this is normal and expected." ] }, { "cell_type": "code", "execution_count": null, "id": "2287b8f4", "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install amazon-textract-caller\n", "!pip install amazon-textract-prettyprinter\n", "!pip install amazon-textract-response-parser\n", "!pip install boto3\n", "!pip install botocore\n", "!pip install s3fs\n", "!pip install textract-trp" ] }, { "cell_type": "markdown", "id": "cac344b6-547f-4c60-97eb-cfdedf3fc883", "metadata": { "tags": [] }, "source": [ "### 1.2 Import packages and modules\n", "\n", "\n", "Python code in one module gains access to the code in another module by importing it. In this section, we import the packages and modules needed to execute the code cells in this notebook."
] }, { "cell_type": "code", "execution_count": null, "id": "15fc6eb4-f9fb-4437-abf2-ca589237ef2c", "metadata": { "tags": [] }, "outputs": [], "source": [ "import sagemaker\n", "import boto3\n", "import s3fs\n", "import time\n", "import json\n", "import io\n", "import os\n", "from io import BytesIO\n", "\n", "from PIL import Image, ImageDraw, ImageOps\n", "\n", "from trp import Document\n", "import urllib.request\n", "\n", "from textractcaller import call_textract_analyzeid\n", "import trp.trp2_analyzeid as t2id\n", "\n", "from tabulate import tabulate\n", "import trp.trp2 as t2" ] }, { "cell_type": "markdown", "id": "128559ed-325b-4114-b0a9-d3fca33f140e", "metadata": {}, "source": [ "### 1.3 Setup the notebook role and session\n", "\n", "\n", "As a managed service, Amazon SageMaker performs operations on your behalf on the AWS hardware that is managed by SageMaker. SageMaker can perform only operations that the user permits. A SageMaker user can grant these permissions with an IAM role (referred to as an execution role).\n", "\n", "To create and use a locally available execution role, execute the code in the following cell." ] }, { "cell_type": "code", "execution_count": null, "id": "3b01dc3c-7088-4ea4-ad52-0b564c07eba7", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Get the IAM execution role for this notebook\n", "role = sagemaker.get_execution_role()\n", "\n", "# Get the SageMaker session\n", "session = sagemaker.Session()\n", "\n", "# Get the region name\n", "region = session.boto_region_name\n", "\n", "print('Using IAM role arn: {}'.format(role))\n", "print('Using region: {}'.format(region))" ] }, { "cell_type": "markdown", "id": "66962a6f-7dc9-4b2f-88f7-0aa6ba7b3a79", "metadata": {}, "source": [ "### 1.4 Setup the AWS service clients\n", "\n", "\n", "AWS's Boto3 library is commonly used to integrate Python applications with AWS services. Clients provide a low-level interface to each AWS service. In this section, we create two Boto3 clients, s3 and textract, to help execute the code cells in this notebook." ] }, { "cell_type": "code", "execution_count": null, "id": "921e08e6-dd4e-4877-90c0-95db1aed42fa", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Setup the S3 client\n", "s3_client = boto3.client('s3')\n", "\n", "# Setup the Textract client\n", "textract_client = boto3.client('textract', region_name=region)\n", "\n", "# Get bucket settings\n", "s3_bucket = session.default_bucket()\n", "s3_bucket_prefix = 'sample-files'\n", "s3_file_path = 's3://{}/{}'.format(s3_bucket, s3_bucket_prefix)\n", "\n", "# The path to our files in the S3 bucket\n", "print('S3 FILE PATH: {}'.format(s3_file_path))" ] }, { "cell_type": "markdown", "id": "680f4d5c-8886-4eb4-923f-08887b1ec4da", "metadata": {}, "source": [ "### 1.5 Upload resource files\n", "\n", "\n", "In this cell, we'll use the high-level S3 AWS CLI to upload our resource files to the default S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "id": "63d91c44-b4b9-42dd-9bac-f3fb706c6410", "metadata": { "tags": [] }, "outputs": [], "source": [ "cmd = 'aws s3 cp {} {} --recursive'.format('./sample-files', s3_file_path)\n", "os.system(cmd)" ] }, { "cell_type": "markdown", "id": "9b77ef2f-02eb-41c1-8aa8-896f1fdb58f9", "metadata": { "tags": [] }, "source": [ "## 2. An Introduction to Document Processing with Amazon Textract\n", "\n", "\n", "In this section, we'll demonstrate how to programmatically use the Amazon Textract SDK to process common document types."
] }, { "cell_type": "markdown", "id": "45fee620-10a3-4450-8356-639eb81ce0c8", "metadata": { "tags": [] }, "source": [ "### 2.1 Sample document for text detection\n", "\n", "\n", "Let's have a look at the sample document we'll use as input for Amazon Textract to detect text." ] }, { "cell_type": "code", "execution_count": null, "id": "191c7ada-310b-4dc1-8bbe-5e594997314f", "metadata": { "tags": [] }, "outputs": [], "source": [ "# A sample PNG image file that contains a single line of text\n", "file_name = \"hello-world-image.png\"\n", "\n", "# The location of the sample PNG image file in your Amazon S3 bucket.\n", "object_key = '{}/{}'.format(s3_bucket_prefix, file_name)\n", "\n", "# Display the PNG image\n", "fs = s3fs.S3FileSystem()\n", "with fs.open('{}/{}'.format(s3_bucket, object_key)) as f:\n", " display(ImageOps.expand(Image.open(f),border=1,fill='black'))" ] }, { "cell_type": "markdown", "id": "768a70ea-3575-44c0-9525-3b5ee25d2ccb", "metadata": { "tags": [] }, "source": [ "### 2.2 Detecting document text\n", "\n", "\n", "In this code cell, we'll call Textract to synchronously process a PNG image document containing a single line of text. This is a simple example to help us become familiar with the Amazon Textract request and response syntax. You can consider this example the 'Hello World' of programmatically leveraging Amazon Textract."
] }, { "cell_type": "code", "execution_count": null, "id": "be7bf4c0-ed97-4961-8126-fcfce8fa7033", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Call Amazon Textract\n", "detect_text_response = textract_client.detect_document_text(\n", " Document={\n", " 'S3Object': {\n", " 'Bucket': s3_bucket,\n", " 'Name': object_key\n", " }\n", " }\n", ")\n", "\n", "# Display the single line of text in the PNG file detected by Textract\n", "for item in detect_text_response[\"Blocks\"]:\n", " if item[\"BlockType\"] == \"LINE\":\n", " print('The line of text detected by Amazon Textract from our sample PNG image file is, you guessed it: {}\\n'.format(item[\"Text\"]))" ] }, { "cell_type": "markdown", "id": "d031b291-f0cb-48c8-9f24-c6b868a4fc2c", "metadata": { "tags": [] }, "source": [ "### 2.3 Amazon Textract Response Objects\n", "\n", "\n", "Before we dive deeper into the lab, let's take a minute or two to better understand Amazon Textract Response Objects.\n", "\n", "Amazon Textract **operations return different types of objects depending on the operations run**. \n", "\n", "- For **detecting text and analyzing a generic document**, the operation **returns a Block object**. \n", "- For **analyzing an invoice or receipt**, the operation **returns an ExpenseDocuments object**. \n", "- For **analyzing identity documentation**, the operation **returns an IdentityDocumentFields object**." 
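, "\n", "To make the Block structure concrete, here is a minimal sketch that tallies the BlockTypes in a detect-text response. It runs against a trimmed-down mock response (a hypothetical stand-in, not real Textract output); you can point it at the real detect_text_response object from section 2.2 instead:\n", "\n", "```python\n", "from collections import Counter\n", "\n", "# A trimmed-down mock of a detect_document_text response; substitute\n", "# detect_text_response from the previous cell to inspect real output\n", "mock_response = {'Blocks': [\n", "    {'BlockType': 'PAGE'},\n", "    {'BlockType': 'LINE', 'Text': 'Amazon Textract'},\n", "    {'BlockType': 'WORD', 'Text': 'Amazon'},\n", "    {'BlockType': 'WORD', 'Text': 'Textract'},\n", "]}\n", "\n", "# Count each BlockType (PAGE, LINE, WORD) in the Blocks list\n", "block_counts = Counter(block['BlockType'] for block in mock_response['Blocks'])\n", "print(block_counts)\n", "```"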
] }, { "cell_type": "code", "execution_count": null, "id": "9698df7c-f05d-4635-a75e-539e7c900659", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Let's inspect the Amazon Textract Response Objects for detecting text (returned from the previous cell)\n", "\n", "print('The Amazon Textract Response Objects for invoking detect_document_text:\\n')\n", "print(json.dumps(detect_text_response, indent=4))" ] }, { "cell_type": "markdown", "id": "81042924-4540-49d2-8250-67cb033e41fe", "metadata": { "tags": [] }, "source": [ "### 2.4 Visualizing the Response Objects Hierarchy\n", "\n", "\n", "To help us visualize the hierarchy of the Amazon Textract response objects for detecting text, here's an illustration of a portion of the JSON response object hierarchy." ] }, { "cell_type": "code", "execution_count": null, "id": "dcb16cee-093c-498f-a502-992452776ebc", "metadata": { "tags": [] }, "outputs": [], "source": [ "urllib.request.urlretrieve(\n", " 'https://docs.aws.amazon.com/images/textract/latest/dg/images/hieroglyph-text-detection.png', 'textract-pages-lines-words.png')\n", "img = Image.open(\"textract-pages-lines-words.png\")\n", "display(img)" ] }, { "cell_type": "markdown", "id": "b16d7c03-ffdb-44b6-a8b6-8760bcba7d15", "metadata": {}, "source": [ "In our example, the Amazon Textract response contains one **PAGE** BlockType, which contains one child **LINE** BlockType, which in turn contains two child **WORD** BlockTypes.\n", "\n", "Additionally, the **LINE** BlockType contains the value **'Amazon Textract'** and has two child **WORD** BlockTypes: **Amazon** and **Textract**.\n", "\n", "Referencing the **Ids** list object within the **Relationships** list object is helpful for understanding the hierarchical relationships between Amazon Textract response objects. We've included some additional resources at the end of this notebook for further reading on this topic.\n", "\n", "Lastly, **Confidence** scores are numeric values ranging between 0 and 100 that indicate the probability that a given prediction is correct; highly confident predictions score closer to 100." ] }, { "cell_type": "markdown", "id": "2c3896c0", "metadata": {}, "source": [ "## 3. Asynchronous processing\n", "\n", "\n", "In this section, we'll introduce the concept of Textract asynchronous operations, which queue incoming requests and process them asynchronously. By using asynchronous jobs, documents can be processed without blocking computing resources.\n", "\n", "Amazon Textract provides an asynchronous API that you can use to process **multipage** documents in **PDF or TIFF** format. You can also use asynchronous operations to process **single-page** documents in **JPEG, PNG, TIFF, or PDF** format." ] }, { "cell_type": "markdown", "id": "b9d4c8f1-88a1-4649-b0a5-a4813c294da1", "metadata": { "tags": [] }, "source": [ "### 3.1 Sample document for asynchronous processing\n", "\n", "\n", "Let's have a look at the sample document we'll use as input for Amazon Textract to detect the document's text." ] }, { "cell_type": "code", "execution_count": null, "id": "13143eaf-b284-4416-8905-aba98cd2b9bc", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Sample document\n", "file_name = \"Benefit_Application_Form_John_Doe.png\"\n", "\n", "# The location of the sample PNG image file in your Amazon S3 bucket.\n", "object_key = '{}/{}'.format(s3_bucket_prefix, file_name)\n", "\n", "# Display the PNG image\n", "fs = s3fs.S3FileSystem()\n", "with fs.open('{}/{}'.format(s3_bucket, object_key)) as f:\n", " display(ImageOps.scale(Image.open(f),0.75))" ] }, { "cell_type": "markdown", "id": "10228717-098c-47f4-9da4-14ff30664da4", "metadata": {}, "source": [ "### 3.2 Asynchronous Document Analysis\n", "\n", "\n", "In this section we'll start a document 
analysis job that will analyze a sample benefit application form to find relationships between detected items. The document analysis job returns a job identifier (JobId) as a reference to the job.\n", "\n", "We'll also create three helper functions for asynchronously processing a Textract document analysis job:\n", "\n", "1. **start_analyze_job**: initiates the asynchronous document analysis job for the benefit application form document\n", "2. **is_job_complete**: periodically checks the job status until completion (i.e. no longer IN_PROGRESS)\n", "3. **get_job_results**: gets the results of the Amazon Textract asynchronous job we executed\n" ] }, { "cell_type": "code", "execution_count": null, "id": "110672ac", "metadata": { "tags": [] }, "outputs": [], "source": [ "def start_analyze_job(s3_bucket, key_name):\n", " response = textract_client.start_document_analysis(\n", " DocumentLocation={\n", " 'S3Object': {\n", " 'Bucket': s3_bucket,\n", " 'Name': key_name\n", " }\n", " },\n", " FeatureTypes=['FORMS'],\n", " OutputConfig={\n", " 'S3Bucket': s3_bucket,\n", " 'S3Prefix': 'textract-output/benefit-app'\n", " }\n", " )\n", " \n", " return response[\"JobId\"]\n", "\n", "def is_job_complete(jobId):\n", " response = textract_client.get_document_analysis(JobId=jobId)\n", " status = response[\"JobStatus\"]\n", " print(\"Job status: {}\".format(status))\n", "\n", " while(status == \"IN_PROGRESS\"):\n", " time.sleep(5)\n", " response = textract_client.get_document_analysis(JobId=jobId)\n", " status = response[\"JobStatus\"]\n", " print(\"Job status: {}\".format(status))\n", "\n", " return status\n", "\n", "def get_job_results(jobId):\n", " pages = []\n", " response = textract_client.get_document_analysis(JobId=jobId)\n", " pages.append(response)\n", " print(\"Result page received: {}\".format(len(pages)))\n", " nextToken = None\n", " if('NextToken' in response):\n", " nextToken = response['NextToken']\n", "\n", " while(nextToken):\n", " response = textract_client.get_document_analysis(JobId=jobId, NextToken=nextToken)\n", "\n", " pages.append(response)\n", " print(\"Result page received: {}\".format(len(pages)))\n", " nextToken = None\n", " if('NextToken' in response):\n", " nextToken = response['NextToken']\n", " \n", " return pages" ] }, { "cell_type": "code", "execution_count": null, "id": "604d1f88-319b-4562-9c91-34ad5a4064f4", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Sample document\n", "file_name = \"Benefit_Application_Form_John_Doe.pdf\"\n", "\n", "# The location of the sample PDF file in your Amazon S3 bucket.\n", "object_key = '{}/{}'.format(s3_bucket_prefix, file_name)\n", "\n", "# Start the asynchronous processing job\n", "job_id = start_analyze_job(s3_bucket, object_key)\n", "\n", "# Print the JobId and the input document location\n", "print(\"Started an asynchronous Analyze Document Job with job_id: {}\".format(job_id))\n", "print(\"Input document URI: {}\\n\".format(object_key))\n", "\n", "# Invoke our helper functions to check the job status and get the results\n", "if(is_job_complete(job_id)):\n", " benefit_application_response = get_job_results(job_id)\n", "\n", "# Use the Textract Response Parser library to easily parse the JSON returned by Amazon Textract\n", "doc = Document(benefit_application_response)\n", "\n", "# Display the key/value pairs detected by the Analyze Document Job\n", "print(\"Result Fields:\")\n", "\n", "for field in doc.pages[0].form.fields:\n", " print(\"Key: {}, Value: {}\".format(field.key, field.value))" ] }, { "cell_type": "markdown", "id": "c94f00fd-753e-4e9b-9f13-aa76f6ad1ef3", "metadata": {}, "source": [ "### 3.3 Asynchronous process workflow\n", "\n", "\n", "Now that we've completed an example job, let's take a brief look at the asynchronous process flow diagram to understand how the job was executed.\n", "\n", "1. 
First, we start the asynchronous document analysis job by calling the StartDocumentAnalysis method (via our start_analyze_job() helper function), which returns a JobId for reference.\n", "2. Next, we attempt to get the job results - using the JobId - by calling the get_document_analysis function (via our is_job_complete() helper function). We repeat this call every 5 seconds until the job status is no longer IN_PROGRESS.\n", "3. Once the job status is no longer IN_PROGRESS, we again attempt to get the job results - using the JobId - by calling the get_document_analysis function (this time via our get_job_results() helper function).\n", "4. Finally, we use the Textract Response Parser library to parse the JSON returned by Amazon Textract and display the results.\n", "\n", "Note: We don't recommend getting the request completion status by repeatedly calling the Amazon Textract Get operation, because Amazon Textract throttles the Get operation if too many requests are made. If you're processing multiple documents at the same time, it's simpler and more efficient to monitor one SQS queue for the completion notification than to poll Amazon Textract for the status of each job individually. [More details here](https://docs.aws.amazon.com/textract/latest/dg/api-async.html)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9795fa69-7419-4262-9d6b-cafd06b8471a", "metadata": { "tags": [] }, "outputs": [], "source": [ "urllib.request.urlretrieve(\n", " 'https://docs.aws.amazon.com/images/textract/latest/dg/images/asynchronous.png', 'textract-asynchronous-process-flow.png')\n", "img = Image.open('textract-asynchronous-process-flow.png')\n", "display(ImageOps.scale(img,0.75))" ] }, { "cell_type": "markdown", "id": "79b81a25-96d4-49af-a4f2-8261e41a2c05", "metadata": {}, "source": [ "## 4. Expense Document Analysis\n", "\n", "\n", "In this section, we will initiate a synchronous expense analysis job, which detects data in invoices and receipts such as contact information, items purchased, and vendor names." ] }, { "cell_type": "markdown", "id": "ad290cb1-199d-46a0-8a0a-f1b1b671bd0e", "metadata": {}, "source": [ "### 4.1 Sample document for expense document analysis processing\n", "\n", "\n", "Let's have a look at the sample document we'll use as input for Amazon Textract to perform expense document analysis." ] }, { "cell_type": "code", "execution_count": null, "id": "851ef93f-5162-4e48-ba82-7e13d11ce502", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Sample document\n", "file_name = \"Utility_Bill_John_Doe.png\"\n", "\n", "# The location of the sample PNG image file in your Amazon S3 bucket.\n", "object_key = '{}/{}'.format(s3_bucket_prefix, file_name)\n", "\n", "# Display the PNG image\n", "fs = s3fs.S3FileSystem()\n", "with fs.open('{}/{}'.format(s3_bucket, object_key)) as f:\n", " display(ImageOps.scale(Image.open(f),0.30))" ] }, { "cell_type": "markdown", "id": "c0647540-3055-463d-9c6c-10f72a792cde", "metadata": {}, "source": [ "### 4.2 Execute a synchronous analyze expense job\n", "\n", "\n", "Running the cells below will execute the analyze expense job and display detected labels and values for 
LineItemExpenseFields and SummaryFields\n", "\n", "We've created two helper functions to help with parsing and printing:\n", "1. **print_labels_and_values**: displays values if labels are detected and returned\n", "2. **process_expense_analysis**: displays LineItemExpenseFields and SummaryFields, along with their corresponding confidence score. \n", "\n", "Observe that some fields with lower confidence scores may be affected by smaller font size." ] }, { "cell_type": "code", "execution_count": null, "id": "6063c723", "metadata": { "tags": [] }, "outputs": [], "source": [ "def print_labels_and_values(field):\n", " \n", " # Only if labels are detected and returned\n", " if \"LabelDetection\" in field and \"ValueDetection\" in field:\n", " print(\n", " \"Label Detection - Confidence: {}\".format(\n", " str(field.get(\"LabelDetection\")[\"Confidence\"])\n", " )\n", " + \", \"\n", " + \"Summary Values: {}\".format(str(field.get(\"LabelDetection\")[\"Text\"]))\n", " + \"\\n\"\n", " + \"Value Detection - Confidence: {}\".format(\n", " str(field.get(\"ValueDetection\")[\"Confidence\"])\n", " )\n", " + \", \"\n", " + \"Summary Values: {}\".format(str(field.get(\"ValueDetection\")[\"Text\"]))\n", " + \"\\n\"\n", " )\n", "\n", "def process_expense_analysis(bucket, document, region):\n", "\n", " # process using S3 object\n", " response = textract_client.analyze_expense(\n", " Document={'S3Object': {'Bucket': bucket, 'Name': document}}\n", " )\n", "\n", " for expense_doc in response[\"ExpenseDocuments\"]:\n", " for line_item_group in expense_doc[\"LineItemGroups\"]:\n", " for line_items in line_item_group[\"LineItems\"]:\n", " for expense_fields in line_items[\"LineItemExpenseFields\"]:\n", " print_labels_and_values(expense_fields)\n", "\n", " print(\"Summary:\")\n", " for summary_field in expense_doc[\"SummaryFields\"]:\n", " print_labels_and_values(summary_field)" ] }, { "cell_type": "code", "execution_count": null, "id": "639aaadb-fa29-45f3-ae3f-465936dde383", "metadata": { 
"tags": [] }, "outputs": [], "source": [ "# Execute the job\n", "process_expense_analysis(s3_bucket, object_key, region)" ] }, { "cell_type": "markdown", "id": "d2a4e67b-418c-469c-929c-03ec62e983b2", "metadata": {}, "source": [ "## 5. Processing Identity Documents\n", "\n", "\n", "In the next two sections, we'll demonstrate how to process two common types of identity documents issued by the U.S. Government: a U.S. Driver's License and a U.S. Social Security Card.\n", "We'll first process a U.S. Driver's License using the Amazon Textract AnalyzeID API. Next, we'll use a feature of the Amazon Textract AnalyzeDocument API called Queries, which allows customers to ask for information within the document using natural language, for example: \"What is SSN number?\"" ] }, { "cell_type": "markdown", "id": "3e40cdb3-2bf8-4fea-ada1-b53913b25ef9", "metadata": {}, "source": [ "### 5.1 Sample document for identity document analysis processing\n", "\n", "\n", "Let's have a look at the sample document we'll use as input for Amazon Textract to perform identity document analysis." ] }, { "cell_type": "code", "execution_count": null, "id": "68f62a8e-b80a-4301-a225-1ca06fcb7ae9", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Sample document\n", "file_name = \"drivers-license.png\"\n", "\n", "# The location of the sample PNG image file in your Amazon S3 bucket.\n", "object_key = '{}/{}'.format(s3_bucket_prefix, file_name)\n", "\n", "# Display the PNG image\n", "fs = s3fs.S3FileSystem()\n", "with fs.open('{}/{}'.format(s3_bucket, object_key)) as f:\n", " display(ImageOps.scale(Image.open(f),0.30))" ] }, { "cell_type": "markdown", "id": "de9349a7-7bee-4ad7-be1e-3283b1fe238c", "metadata": {}, "source": [ "### 5.2 Execute a synchronous analyze identity process\n", "\n", "\n", "In this code cell, we also use [Amazon Textract Caller tools](https://pypi.org/project/amazon-textract-caller/), which provide a collection of ready-to-use functions and sample implementations to speed 
up the evaluation and development for any project using Amazon Textract.\n", "\n", "The AnalyzeID API returns information in a structure called IdentityDocumentFields, which contains normalized keys and their corresponding values.\n", "For example, the FIRST_NAME key corresponds to the value John." ] }, { "cell_type": "code", "execution_count": null, "id": "799cab57", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Use the Amazon Textract Caller tools to call Textract AnalyzeID\n", "response_drivers_license = call_textract_analyzeid(document_pages=[f's3://{s3_bucket}/{object_key}'])\n", "\n", "# Iterate over the IdentityDocuments and IdentityDocumentFields collections\n", "for id_docs in response_drivers_license['IdentityDocuments']:\n", " id_doc_kvs={}\n", " for field in id_docs['IdentityDocumentFields']:\n", " if field['ValueDetection']['Text']:\n", " id_doc_kvs[field['Type']['Text']] = field['ValueDetection']['Text']\n", "\n", "# Display the key/value pairs detected by Textract\n", "print(json.dumps(id_doc_kvs, indent=4))" ] }, { "cell_type": "markdown", "id": "ea478f0f-246c-4374-b015-ea9d25bbe7d3", "metadata": {}, "source": [ "### 5.3 Sample identity document for document analysis queries\n", "\n", "\n", "Let's have a look at the sample document we'll use as input for Amazon Textract to perform a document analysis using queries."
] }, { "cell_type": "code", "execution_count": null, "id": "85cd80b1-be3d-4697-98ef-3be08d3b95b6", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Sample document\n", "file_name = \"SSN_John_Doe.jpg\"\n", "\n", "# The location of the sample PNG image file in your Amazon S3 bucket.\n", "object_key = '{}/{}'.format(s3_bucket_prefix, file_name)\n", "\n", "# Display the PNG image \n", "fs = s3fs.S3FileSystem()\n", "with fs.open('{}/{}'.format(s3_bucket, object_key)) as f:\n", " display(ImageOps.scale(Image.open(f),0.75))" ] }, { "cell_type": "markdown", "id": "73fe468a-5a6c-4e91-b594-6249776854cf", "metadata": {}, "source": [ "### 5.4 Processing a U.S. Social Security Card with Queries\n", "\n", "\n", "In the following cell observe that, in the asynchronous start_analyze_job() function, there are two queries in the JSON request: **\"What is SSN number?\"** and **\"What is the name on SSN?\"**. Each query contains the question you want to ask in the **Text** attribute and the **Alias** attribute you want to associate with the question, returned in the JSON response. The aliases to the aforementioned questions are **\"SSN_NUMBER\"** and **\"SSN_OWNER_NAME\"**, respectively. Additionally, there are two helper functions to help with monitoring the analyze document job status and displaying the query results." 
] }, { "cell_type": "code", "execution_count": null, "id": "1bde0b9d", "metadata": { "tags": [] }, "outputs": [], "source": [ "def start_analyze_job(s3_bucket, key_name):\n", " response = None\n", " response = textract_client.start_document_analysis(\n", " DocumentLocation={\n", " 'S3Object': {\n", " 'Bucket': s3_bucket,\n", " 'Name': key_name\n", " }\n", " },\n", " FeatureTypes=['QUERIES'],\n", " QueriesConfig={\n", " \"Queries\": [\n", " {\n", " \"Text\": \"What is SSN number?\",\n", " \"Alias\": \"SSN_NUMBER\"\n", " },\n", " {\n", " \"Text\": \"What is the name on SSN?\",\n", " \"Alias\": \"SSN_OWNER_NAME\"\n", " }\n", " ]},\n", " OutputConfig={\n", " 'S3Bucket': s3_bucket,\n", " 'S3Prefix': 'textract-output/ssn'\n", " }\n", " )\n", "\n", " return response[\"JobId\"]\n", "\n", "def is_job_complete(jobId):\n", " response = textract_client.get_document_analysis(JobId=jobId)\n", " status = response[\"JobStatus\"]\n", " print(\"Job status: {}\".format(status))\n", "\n", " while(status == \"IN_PROGRESS\"):\n", " time.sleep(5)\n", " response = textract_client.get_document_analysis(JobId=jobId)\n", " status = response[\"JobStatus\"]\n", " print(\"Job status: {}\".format(status))\n", "\n", " return status\n", "\n", "def print_result_in_document(response):\n", " doc = t2.TDocumentSchema().load(response[0])\n", " page = doc.pages[0]\n", " doc_query_answers = doc.get_query_answers(page=page)\n", " print(tabulate(doc_query_answers, tablefmt=\"github\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "9676652c-c9d9-42d4-8f48-9fa7f946bcaa", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Main code for execution\n", "# -----\n", "job_id = start_analyze_job(s3_bucket, object_key)\n", "\n", "# Monitor the job status\n", "print(\"Started analyze job with id: {}, document is: {}\".format(job_id, object_key))\n", "if(is_job_complete(job_id)):\n", " ssn_response = get_job_results(job_id)\n", "\n", "# Print the result\n", 
"print_result_in_document(ssn_response)" ] }, { "cell_type": "markdown", "id": "10d69559", "metadata": {}, "source": [ "## 6. Document Enrichment using Redaction (Optional)\n", "\n", "\n", "Redacting personally identifiable information within documents is important to help prevent identity theft. Redaction is a form of document enrichment used to conceal or remove confidential information.\n", "In this code cell, we demonstrate how to use the Geometry data type within the JSON response object to draw a solid black rectangle over the social security number on the SSN card.\n", "\n", "We'll reuse the job results from the previous section, since we just need to parse the JSON response and check whether any of the query results (i.e. the answers to our questions) contain a digit -- if so, we get the geometric coordinates of the query result and draw a solid black rectangle." ] }, { "cell_type": "code", "execution_count": null, "id": "5ac31643", "metadata": { "tags": [] }, "outputs": [], "source": [ "def ssn_redaction(response, s3_bucket, object_key):\n", "\n", " fs = s3fs.S3FileSystem()\n", " with fs.open('{}/{}'.format(s3_bucket, object_key)) as f:\n", " img = Image.open(f)\n", " width, height = img.size\n", "\n", " # Add redaction to the SSN\n", " for item in response[0][\"Blocks\"]:\n", " if item[\"BlockType\"] == \"QUERY_RESULT\":\n", " result = any([char.isdigit() for char in item[\"Text\"]])\n", " if(result):\n", " x1 = item[\"Geometry\"][\"BoundingBox\"][\"Left\"]*width\n", " y1 = item[\"Geometry\"][\"BoundingBox\"][\"Top\"]*height-2\n", " x2 = x1 + (item[\"Geometry\"][\"BoundingBox\"][\"Width\"]*width)+5\n", " y2 = y1 + (item[\"Geometry\"][\"BoundingBox\"][\"Height\"]*height)+2\n", "\n", " draw = ImageDraw.Draw(img)\n", " draw.rectangle([x1, y1, x2, y2], fill=\"Black\")\n", " display(img)" ] }, { "cell_type": "code", "execution_count": null, "id": "096e6234-90b7-4b4f-805d-f99c5a6a1039", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Main code for execution\n", "# 
-----\n", "\n", "# Let's reuse the job_id from the previous section\n", "response = get_job_results(job_id)\n", "\n", "# Redact the SSN\n", "ssn_redaction(response, s3_bucket, object_key)" ] }, { "cell_type": "markdown", "id": "f490935f-85f4-48ae-b755-1088b6cb4409", "metadata": { "tags": [] }, "source": [ "## 7. Conclusion\n", "\n", "\n", "In this lab we demonstrated how to use Amazon Textract to programmatically perform common document processing tasks such as detecting text within a document, analyzing a document for relationships between detected items, using asynchronous operations, analyzing a document for financially-related relationships between text, and performing document redaction.\n", "\n", "For customers looking for a graphical interface, Amazon Textract also offers many of the same capabilities via the AWS Management Console.\n", "\n", "For customers looking for an Amazon Textract Partner from the AWS Partner Network, see the [Amazon Textract Partners page](https://aws.amazon.com/textract/partners/).\n", "\n" ] }, { "cell_type": "markdown", "id": "f02152f3-b22f-49f3-a08a-9d64b225847b", "metadata": { "tags": [] }, "source": [ "## 8. 
Additional Resources\n", "\n", "\n", "- https://docs.aws.amazon.com/textract/latest/dg/API_BoundingBox.html\n", "- https://docs.aws.amazon.com/textract/latest/dg/text-location.html\n", "- https://docs.aws.amazon.com/textract/latest/dg/how-it-works-lines-words.html" ] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": 
false, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5d.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": true, "memoryGiB": 0, 
"name": "ml.geospatial.interactive", "supportedImageNames": [ "sagemaker-geospatial-v1-0" ], "vcpuNum": 0 }, { "_defaultOrder": 21, "_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 144, "name": "ml.c5.18xlarge", "vcpuNum": 72 }, { "_defaultOrder": 28, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 29, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": "Accelerated computing", 
"gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 32, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 34, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 35, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 38, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 42, "_isFastLaunch": false, "category": 
"Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 46, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 47, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 53, "_isFastLaunch": false, 
"category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.g5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 54, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 } ], "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 5 }