{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Intelligent Document Processing\n", "\n", "Documents contain valuable information and come in various shapes and forms. In most cases, you are manually processing these documents which is time consuming, prone to error, and costly. Not only do you want this information extracted quickly but can also automate business processes that presently relies on manual inputs and intervention across various file types and formats.\n", "\n", "To help you overcome these challenges, AWS Machine Learning (ML) now provides you choices when it comes to extracting information from complex content in any document format such as insurance claims, mortgages, healthcare claims, contracts, and legal contracts.\n", "\n", "The diagram below shows an architecture for an Intelligent document processing workflow. It starts with data capture stage to securely store and aggregate different types (PDF, PNG, JPEG, and TIFF), formats, and layouts of documents. Followed by accurate classification of documents and extracting text and key insights from documents and perform further enrichments of the documents (such as identity entities, redaction etc.). Finally, the verification and review stage involves manual review of the documents for quality and accuracy, followed by consumption of the documents and extracted information into downstream databases/applications.\n", "\n", "In this workshop, we will explore the various aspects of this workflow such as the document classification, text and insights extraction, enrichments, and human review.\n", "\n", "![Arch](./images/idp.png)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Document Classification\n", "In this lab we will walk you through an hands-on lab on document classification using Amazon Comprehend\n", "Custom Classifier. We will use Amazon Textract to first extract the text out of our documents and then label them and then use the data for training our Amazon comprehend custom classifier. We will create an Amazon Comprehend real time endpoint with the custom classifier to classify our documents.\n", "\n", "\n", "
*(Image: IDP document classification workflow)*
\n", "\n", "\n", "- [Step 1: Create Amazon Comprehend Classification Training Job](#step1)\n", "- [Step 2: Create Amazon Comprehend real time endpoint](#step2)\n", "- [Step 3: Classify Documents using the real-time endpoint](#step3)\n", "\n", "\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 1: Create Amazon Comprehend Classification Training Job \n", "\n", "In this step, we will import some necessary libraries that will be used throughout this notebook.\n", "\n", "We will then use a prepared dataset, of the appropriate filetype (.csv) and structure - one column containing the raw text of a document, and the other column containing the label of that document.\n", "\n", "Please see this [notebook](https://github.com/aws-samples/aws-ai-intelligent-document-processing/blob/main/industry/mortgage/01-document-classification.ipynb) from our mortgage blog series for detailed steps on data preparation for ingestion into Amazon Comprehend for [Amazon Comprehend custom classification model](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html). Note: While the documents may be different, the process, from a code perspective, is identical." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!python -m pip install -q amazon-textract-response-parser --upgrade\n", "!python -m pip install -q amazon-textract-caller --upgrade\n", "!python -m pip install -q amazon-textract-prettyprinter --upgrade" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from textractcaller.t_call import call_textract, Textract_Features\n", "from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string\n", "from trp import Document" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the import statements above fails then please restart the notebook kernel by clicking the circular arrow button at the top of the notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import botocore\n", "import sagemaker\n", "import os\n", "import io\n", "import datetime\n", "import pandas as pd\n", "from PIL import Image\n", "from pathlib import Path\n", "import multiprocessing as mp\n", "from IPython.display import Image, display, HTML, JSON\n", "\n", "# variables\n", "data_bucket = sagemaker.Session().default_bucket()\n", "region = boto3.session.Session().region_name\n", "\n", "os.environ[\"BUCKET\"] = data_bucket\n", "os.environ[\"REGION\"] = region\n", "role = sagemaker.get_execution_role()\n", "\n", "print(f\"SageMaker role is: {role}\\nDefault SageMaker Bucket: s3://{data_bucket}\")\n", "\n", "s3=boto3.client('s3')\n", "textract = boto3.client('textract', region_name=region)\n", "comprehend=boto3.client('comprehend', region_name=region)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the pre-prepared dataset and upload it to Amazon S3. The dataset is in `CSV` format and will be named `comprehend_train_data.csv`. Note that you can have more than one `CSV` file in an S3 bucket for training a Comprehend custom classifier. If you have more than one file, you can specify only the bucket/prefix in call to train the custom classifier. 
 { "cell_type": "markdown", "metadata": {}, "source": [ "The following code cells upload the training data to the S3 bucket and create the Comprehend custom classifier. You can also create a custom classifier manually; see the console instructions later in this step." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Upload Comprehend training data to S3\n", "key = 'idp/comprehend/comprehend_train_data.csv'\n", "s3.upload_file(Filename='./dataset/comprehend_train_data.csv',\n", "               Bucket=data_bucket,\n", "               Key=key)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "Once we have a labeled dataset ready, we are going to create and train an [Amazon Comprehend custom classification model](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html) with the dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Amazon Comprehend custom classification training job\n", "\n", "We will use Amazon Comprehend custom classification to train our own model for classifying documents. The `CreateDocumentClassifier` API creates a classifier that trains a custom model on the labeled CSV file we uploaded above. The training data contains text that was extracted with Amazon Textract and then labeled." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Create a document classifier\n", "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", "id = str(datetime.datetime.now().strftime(\"%s\"))\n", "\n", "document_classifier_name = 'Sample-Doc-Classifier-IDP'\n", "document_classifier_version = 'Sample-Doc-Classifier-IDP-v1'\n", "document_classifier_arn = ''\n", "response = None\n", "\n", "try:\n", "    create_response = comprehend.create_document_classifier(\n", "        InputDataConfig={\n", "            'DataFormat': 'COMPREHEND_CSV',\n", "            'S3Uri': f's3://{data_bucket}/{key}'\n", "        },\n", "        DataAccessRoleArn=role,\n", "        DocumentClassifierName=document_classifier_name,\n", "        VersionName=document_classifier_version,\n", "        LanguageCode='en',\n", "        Mode='MULTI_CLASS'\n", "    )\n", "\n", "    document_classifier_arn = create_response['DocumentClassifierArn']\n", "\n", "    print(f\"Comprehend Custom Classifier created with ARN: {document_classifier_arn}\")\n", "except Exception as error:\n", "    if error.response['Error']['Code'] == 'ResourceInUseException':\n", "        print(f'A classifier with the name \"{document_classifier_name}\" already exists.')\n", "        document_classifier_arn = f'arn:aws:comprehend:{region}:{account_id}:document-classifier/{document_classifier_name}/version/{document_classifier_version}'\n", "        print(f'The classifier ARN is: \"{document_classifier_arn}\"')\n", "    else:\n", "        print(error)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store document_classifier_arn\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the status of the Comprehend custom classification training job." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "# Loop through and wait for the classifier training to complete.
 This typically takes ~30 minutes.\n", "from IPython.display import clear_output\n", "import time\n", "from datetime import datetime\n", "\n", "jobArn = document_classifier_arn  # use the ARN set above (works even if the classifier already existed)\n", "\n", "max_time = time.time() + 3*60*60  # 3 hours\n", "while time.time() < max_time:\n", "    now = datetime.now()\n", "    current_time = now.strftime(\"%H:%M:%S\")\n", "    describe_custom_classifier = comprehend.describe_document_classifier(\n", "        DocumentClassifierArn=jobArn\n", "    )\n", "    status = describe_custom_classifier[\"DocumentClassifierProperties\"][\"Status\"]\n", "    clear_output(wait=True)\n", "    print(f\"{current_time} : Custom document classifier: {status}\")\n", "\n", "    if status == \"TRAINED\" or status == \"IN_ERROR\":\n", "        break\n", "\n", "    time.sleep(60)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Alternatively, to create the Comprehend custom classifier manually using the console, go to the [Amazon Comprehend console](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#classification):\n", "\n", "- On the left menu, click \"Custom classification\"\n", "- In the \"Classifier models\" section, click \"Create new model\"\n", "- In Model settings, enter a model name\n", "- In Data specifications, select \"Using single-label\" mode and for Data format select CSV file\n", "- For Training dataset, browse to the data bucket created above and select the file `comprehend_train_data.csv`\n", "- For IAM role, select \"Create an IAM role\" and specify a prefix (this will create a new IAM role for Comprehend)\n", "- Click \"Create\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This job can take ~30 minutes to complete. Once the training job has completed, move on to the next step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Step 2: Create an Amazon Comprehend real-time endpoint <a id=\"step2\"></a>\n", "\n", "Once our Comprehend custom classifier is fully trained (i.e., its status is `TRAINED`), we can create a real-time endpoint and use it to classify documents in real time. The following code cells use the `comprehend` Boto3 client to create the endpoint, but you can also create one manually via the console; instructions are provided in the subsequent section."
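, "\n", "The endpoint below is provisioned with one inference unit (IU); each IU provides a throughput of roughly 100 characters per second. If you need more throughput later, the endpoint can be scaled in place. A minimal sketch, using the `ENDPOINT_ARN` created by the next cell:\n", "\n", "```python\n", "# Optional: scale the endpoint to 2 inference units once it is IN_SERVICE\n", "comprehend.update_endpoint(EndpointArn=ENDPOINT_ARN, DesiredInferenceUnits=2)\n", "```"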
 ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the Comprehend real-time endpoint\n", "model_arn = document_classifier_arn\n", "ep_name = 'idp-endpoint'\n", "\n", "try:\n", "    endpoint_response = comprehend.create_endpoint(\n", "        EndpointName=ep_name,\n", "        ModelArn=model_arn,\n", "        DesiredInferenceUnits=1,\n", "        DataAccessRoleArn=role\n", "    )\n", "    ENDPOINT_ARN = endpoint_response['EndpointArn']\n", "    print(f'Endpoint created with ARN: {ENDPOINT_ARN}')\n", "except Exception as error:\n", "    if error.response['Error']['Code'] == 'ResourceInUseException':\n", "        print(f'An endpoint with the name \"{ep_name}\" already exists.')\n", "        ENDPOINT_ARN = f'arn:aws:comprehend:{region}:{account_id}:document-classifier-endpoint/{ep_name}'\n", "        print(f'The classifier endpoint ARN is: \"{ENDPOINT_ARN}\"')\n", "        %store ENDPOINT_ARN\n", "    else:\n", "        print(error)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store ENDPOINT_ARN" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(endpoint_response)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, use the steps below to create a Comprehend endpoint from the AWS console.\n", "\n", "- Go to [Comprehend on the AWS Console](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#endpoints) and click \"Endpoints\" in the left menu\n", "- Click \"Create endpoint\"\n", "- Give the endpoint a name; for Custom model type select \"Custom classification\"; for version, select no version or the latest version of the model\n", "- For Classifier model, select your model from the drop-down menu\n", "- For Inference units, select 1\n", "- Check \"Acknowledge\"\n", "- Click \"Create endpoint\"\n", "\n", "It may take ~15 minutes for the endpoint to be created; you can monitor its status on the [Endpoints page](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#endpoints) of the console. The code cell below also checks the creation status.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "# Loop through and wait for the endpoint creation to complete. This can take ~15 minutes.\n", "from IPython.display import clear_output\n", "import time\n", "from datetime import datetime\n", "\n", "ep_arn = ENDPOINT_ARN  # use the ARN set above (works even if the endpoint already existed)\n", "\n", "max_time = time.time() + 3*60*60  # 3 hours\n", "while time.time() < max_time:\n", "    now = datetime.now()\n", "    current_time = now.strftime(\"%H:%M:%S\")\n", "    describe_endpoint_resp = comprehend.describe_endpoint(\n", "        EndpointArn=ep_arn\n", "    )\n", "    status = describe_endpoint_resp[\"EndpointProperties\"][\"Status\"]\n", "    clear_output(wait=True)\n", "    print(f\"{current_time} : Custom classifier endpoint: {status}\")\n", "\n", "    if status == \"IN_SERVICE\" or status == \"FAILED\":\n", "        break\n", "\n", "    time.sleep(10)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Step 3: Classify Documents using the real-time endpoint <a id=\"step3\"></a>\n", "\n", "Once the endpoint has been created, we will take a mixed set of documents from the `./dataset/document_samples/` directory and classify each one as a bank statement, invoice, or receipt."
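, "\n", "When we call the endpoint's `ClassifyDocument` API later in this step, the response contains a `Classes` list with a confidence score for every label, and we will keep only the highest-scoring one. A minimal sketch of that selection logic, using made-up labels and scores:\n", "\n", "```python\n", "# Illustrative response shape (values are made up); we keep the top-scoring class\n", "sample = {'Classes': [{'Name': 'invoices', 'Score': 0.97},\n", "                      {'Name': 'receipts', 'Score': 0.02},\n", "                      {'Name': 'bank-statements', 'Score': 0.01}]}\n", "best = max(sample['Classes'], key=lambda c: c['Score'])['Name']\n", "print(best)  # -> 'invoices'\n", "```"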
 ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "root = \"./dataset/document_samples/\"\n", "files = []\n", "\n", "for file in os.listdir(root):\n", "    if not file.startswith('.'):\n", "        files.append(f'./dataset/document_samples/{file}')\n", "\n", "files_df = pd.DataFrame(files, columns=[\"Document\"])\n", "files_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's view one of the documents." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file = files_df.sample().iloc[0]['Document']\n", "display(Image(filename=file, width=400, height=500))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will extract the text from the sample test documents using Amazon Textract. We will create a utility function that reads each document, calls the Textract `DetectDocumentText` API (via the `call_textract` helper, which converts the file to bytes for us), and returns the extracted text. Once we have the extracted text, we will call Amazon Comprehend on each document to classify it.\n", "\n", "
\n", " ⚠️ Note: Extracting text using Amazon Textract in advance is only required if you intend to use Amazon Comprehend real-time endpoint for document classification. In case you intend to use Amazon Comprehend Asynchronous classification job to classify documents in bulk, you can skip this since the Amazon Comprehend asynchronous job can internally use Textract to extract text out of your document and then classify it. Refer to StartDocumentClassificationJob API documentation for more info.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Extract the text from all the sample documents in the list\n", "\n", "We will create yet another small utility function that receives the document bytearray, extracts text from the document with Textract and returns the extracted text." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def extract_text(doc):\n", " response = call_textract(input_document=doc)\n", " page_string = get_string(textract_json=response, output_type=[Textract_Pretty_Print.LINES])\n", " return page_string" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "files_df['DocText'] = files_df.apply(lambda row : extract_text(row['Document']), axis = 1)\n", "files_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have the extracted text in the dataframe for each of our document, the next step is to use the Amazon Comprehend real-time endpoint to classify them. We will create a small utility function that does the classification using the endpoint and returns the document type. Note that Comprehend will return all the classes of documents with a confidence score linked to each class in an array of key-value pairs (Name-Score), we will pick only the document class with the highest confidence score from the endpoint's response." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time \n", "from datetime import datetime\n", "\n", "def classify_doc(document):\n", " now = datetime.now()\n", " current_time = now.strftime(\"%H:%M:%S\")\n", " print(f'{current_time} : Processing')\n", " time.sleep(10) #to avoid Comprehend API throttling\n", " try:\n", " response = comprehend.classify_document(\n", " Text= document,\n", " EndpointArn=ENDPOINT_ARN\n", " )\n", " print(response)\n", " response_df = pd.DataFrame(response['Classes'])\n", " result = response_df.iloc[response_df['Score'].idxmax()] # choose the class with highest score \n", " return result.Name # return the corresponding class name\n", " except Exception as e:\n", " print(e)\n", " return 'error'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets now run the inference on our data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# import time\n", "files_df['DocType'] = files_df.apply(lambda row : classify_doc(row['DocText']), axis = 1)\n", "files_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have now classified all these documents into their respective classes. Let's review to check if the classifier did a correct job of classifying." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "document = files_df.iloc[0]['Document']\n", "document_type = files_df.iloc[0]['DocType']\n", "\n", "display(HTML(f'
<h3>Document Category: \"{document_type}\"</h3>
'))\n", "display(Image(filename=document, width=400, height=500))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion\n", "\n", "As we can see above, our classifier was able to correctly classify the test document!\n", "\n", "In this notebook we trained an Amazon Comprehend custom classifier using a pre-prepared dataset that was constructed from sample documents by extracting their text with Amazon Textract and labeling it in CSV format. We then created an Amazon Comprehend real-time endpoint with that classifier to perform document classification.\n", "\n", "In the next notebook we will look at a few methods to perform extraction of key insights from our documents using Amazon Textract." ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 5 }