# Public sector use case: Benefit application

## Lab 2 - Comprehend

---

## Introduction
Amazon Comprehend is a natural-language processing (NLP) service that uses machine learning to uncover valuable insights and connections in text. 

In this lab, we demonstrate how to use the AWS SDK for Python (Boto3) to deploy the Amazon Comprehend Custom Classifier model we trained in the AWS Management Console to a real-time endpoint. We then perform inference against the endpoint and review the results.

**Note:** This notebook requires that you have completed the lab for training an Amazon Comprehend Custom Classifier model.


- 1. [Prerequisites](#section_1_0)
    - 1.1 [Install packages](#section_1_1)
    - 1.2 [Import packages and modules](#section_1_2)
    - 1.3 [Set up the notebook role and session](#section_1_3)
    - 1.4 [Set up the AWS service clients](#section_1_4)
- 2. [Deploying an Amazon Comprehend Custom Classifier](#section_2_0)
    - 2.1 [Check the model's training and deployment status](#section_2_1)
    - 2.2 [Deploy the Custom Classifier model](#section_2_2)
    - 2.3 [Custom Classifier model inference](#section_2_3)
- 3. [Cleanup Resources](#section_3_0)
- 4. [Conclusion](#section_4_0)
- 5. [Additional Resources](#section_5_0)

##### **Let's get started!**

---

## 1. Prerequisites
<a id=section_1_0></a>

In this section, we'll install and import packages, establish the notebook execution role and session, and set up the AWS service clients.

### 1.1 Install packages
<a id=section_1_1></a>

We use *pip* to install packages from the Python Package Index and other indexes. A package contains all the files you need for a module.
Modules are Python code libraries you can include in your project. You can think of Python packages as the directories on a file system and modules as files within directories. 

**Note:** Executing the code in this cell produces a lot of debug output. This is normal and expected.

In [None]:
!pip install boto3
!pip install botocore
!pip install s3fs
!pip install tabulate
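
After installation, it can be useful to confirm that the modules this notebook relies on can actually be found. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name introduced here for illustration, not part of the lab.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names in module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Modules used by the code cells in this notebook
required = ["boto3", "sagemaker", "s3fs", "tabulate"]
print("Missing modules:", missing_modules(required))
```

An empty list means every module was found. Any name that is printed needs to be installed before continuing.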

### 1.2 Import packages and modules
<a id=section_1_2></a>

Python code in one module gains access to the code in another module by the process of importing it. In this section, we import packages and modules needed to execute code cells in this notebook.

In [None]:
import boto3
import sagemaker
import s3fs
import time
import json

from tabulate import tabulate

### 1.3 Set up the notebook role and session
<a id=section_1_3></a>

As a managed service, Amazon SageMaker performs operations on your behalf on the AWS hardware that is managed by SageMaker. SageMaker can perform only operations that the user permits. A SageMaker user can grant these permissions with an IAM role (referred to as an execution role).

To retrieve the notebook's execution role and create a SageMaker session, execute the code in the following cell.

In [None]:
# Get the IAM execution role attached to this notebook
role = sagemaker.get_execution_role()

# Get the SageMaker session
session = sagemaker.Session()

# Get the region name
region = session.boto_region_name

print('Using IAM role arn: {}'.format(role))
print('Using region: {}'.format(region))

### 1.4 Set up the AWS service clients
<a id=section_1_4></a>

The AWS SDK for Python (Boto3) is commonly used to integrate Python applications with AWS services. Clients provide a low-level interface to an AWS service. In this section, we create the Amazon Comprehend Boto3 client used by the code cells in this notebook.

In [None]:
# Set up the Comprehend client
comprehend_client = boto3.client('comprehend', region_name=region)

## 2. Deploying an Amazon Comprehend Custom Classifier 
<a id=section_2_0></a>

In this section, we'll demonstrate how to programmatically use the AWS SDK for Python (Boto3) to deploy the custom classifier we previously set up in the AWS Management Console for real-time inference. **Important**: Make sure you have run all the code cells in Section 1 of this notebook and have completed the instructor-guided Custom Classifier lab.

In [None]:
# ENTER THE NAME OF YOUR CUSTOM CLASSIFIER HERE
documentClassifierName = "ENTER THE NAME OF YOUR CUSTOM CLASSIFIER MODEL HERE"

# We'll append '-endpoint' to maintain consistency in our naming convention
documentClassifierEndpointName = documentClassifierName + "-endpoint"
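
Because the endpoint name is derived from the classifier name, a long classifier name can produce an endpoint name the service rejects. The sketch below checks a candidate name before any API call is made. The 40-character limit and the character pattern are assumptions about the `CreateEndpoint` constraints and should be verified against the Amazon Comprehend API reference. `is_valid_endpoint_name` and the example name are hypothetical.

```python
import re

# Assumed constraints for Comprehend endpoint names; verify against the
# CreateEndpoint API reference before relying on them.
MAX_ENDPOINT_NAME_LENGTH = 40
ENDPOINT_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")


def is_valid_endpoint_name(name):
    """Return True if name fits the assumed length and character rules."""
    return (len(name) <= MAX_ENDPOINT_NAME_LENGTH
            and ENDPOINT_NAME_PATTERN.match(name) is not None)


# Hypothetical classifier name used only for illustration
example_endpoint_name = "benefit-docs-classifier" + "-endpoint"
print(example_endpoint_name, is_valid_endpoint_name(example_endpoint_name))
```

If the check fails, shortening the classifier name or choosing a shorter suffix avoids a validation error later in the lab.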

### 2.1 Check the model's training and deployment status
<a id=section_2_1></a>

Some Amazon Comprehend Custom Classifier models can take more than 30 minutes to finish training, so let's create a function that periodically checks the model's training status. The model must report a 'TRAINED' status before we deploy it to an inference endpoint. Deploying the inference endpoint for this lab takes around 8-10 minutes, so we'll also periodically check its deployment status. The endpoint must report an 'IN_SERVICE' status before we perform inference against it.

In [None]:
def get_custom_classifier_status(documentClassifierName, documentClassifierStatus):
    """Poll every 30 seconds until the classifier reaches the requested status."""
    while True:
        response = comprehend_client.list_document_classifiers(
            Filter={
                'DocumentClassifierName': documentClassifierName
            })
        classifiers = response.get('DocumentClassifierPropertiesList')
        if not classifiers:
            raise ValueError("No custom classifier named '{}' was found".format(documentClassifierName))

        # We only trained one classifier with this name, so we check the first entry
        status = classifiers[0].get('Status')
        print("Classifier status: {}".format(status))

        # Stop on the requested status, or on 'IN_ERROR' because a failed
        # training job will never reach any other status
        if status in (documentClassifierStatus, 'IN_ERROR'):
            return status
        time.sleep(30)

def get_custom_classifier_endpoint_status(custom_classifier_endpoint_arn, documentClassifierEndpointStatus):
    """Poll every 15 seconds until the endpoint reaches the requested status."""
    while True:
        response = comprehend_client.describe_endpoint(EndpointArn=custom_classifier_endpoint_arn)
        status = response.get('EndpointProperties').get('Status')
        print("Endpoint status: {}".format(status))

        # Stop on the requested status, or on 'FAILED' because a failed
        # deployment will never reach any other status
        if status in (documentClassifierEndpointStatus, 'FAILED'):
            return status
        time.sleep(15)
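
Both status checks follow the same polling pattern, so they could share one helper that also gives up after a time limit. The following is a minimal sketch of that idea, not the lab's own implementation. `wait_for_status` and its parameters are hypothetical, and the example uses a fake status source so it runs without AWS access.

```python
import time


def wait_for_status(get_status, target, failure_statuses=(),
                    poll_seconds=30, timeout_seconds=3600):
    """Poll get_status() until it returns target, a failure status, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        print("Status: {}".format(status))
        if status == target or status in failure_statuses:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("Still {} after {} seconds".format(status, timeout_seconds))
        time.sleep(poll_seconds)


# Illustration with a fake status source instead of a real AWS call
fake_statuses = iter(["SUBMITTED", "TRAINING", "TRAINED"])
result = wait_for_status(lambda: next(fake_statuses), "TRAINED",
                         failure_statuses=("IN_ERROR",), poll_seconds=0)
print("Final status:", result)
```

In the notebook, `get_status` could be a small function that calls `describe_endpoint` and returns the `Status` field, with `'FAILED'` passed as a failure status.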

### 2.2 Deploy the Custom Classifier model
<a id=section_2_2></a>

Once the model has finished training, we can deploy it to a real-time inference endpoint. A status message will be periodically displayed to indicate the model training and model endpoint status.

**Note:** This process may take 8-10 minutes to complete.

In [None]:
documentClassifierArn = ""
custom_classifier_endpoint_arn = ""

# Check if the model has finished training
if (get_custom_classifier_status(documentClassifierName, 'TRAINED') == 'TRAINED'):
    
    # Gets a list of the document classifiers that you have created
    document_classifiers = comprehend_client.list_document_classifiers(
        Filter={
            'DocumentClassifierName': documentClassifierName
        })
    
    # Since we only trained one, we'll get the first in the list
    documentClassifierArn = document_classifiers.get('DocumentClassifierPropertiesList')[0].get('DocumentClassifierArn')
    print(documentClassifierArn)

    # Deploy the custom classifier to a real-time endpoint
    response = comprehend_client.create_endpoint(
        EndpointName=documentClassifierEndpointName,
        ModelArn=documentClassifierArn,
        DesiredInferenceUnits=1 # Each inference unit provides a throughput of 100 characters per second
    )
    
    # Get the ARN of the custom classifier real-time endpoint
    custom_classifier_endpoint_arn = response.get('EndpointArn')
    
    if(get_custom_classifier_endpoint_status(custom_classifier_endpoint_arn, 'IN_SERVICE') == 'IN_SERVICE'):
        print("The custom classification model has been deployed to a real-time inference endpoint")
        print("The EndpointArn is: {}".format(custom_classifier_endpoint_arn))
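
If the deployment cell is run a second time, an endpoint with the same name may already exist. One way to make the step repeatable is to look for an existing endpoint for the model before creating a new one. The sketch below assumes a client that exposes `list_endpoints` and `create_endpoint` with the same parameters used in this notebook. `get_or_create_endpoint` and `FakeComprehendClient` are hypothetical names, and the fake client exists only so the example runs without AWS access.

```python
def get_or_create_endpoint(client, endpoint_name, model_arn, inference_units=1):
    """Return the ARN of an endpoint for model_arn, creating one if none exists."""
    existing = client.list_endpoints(Filter={"ModelArn": model_arn})
    endpoints = existing.get("EndpointPropertiesList", [])
    if endpoints:
        return endpoints[0]["EndpointArn"]
    created = client.create_endpoint(
        EndpointName=endpoint_name,
        ModelArn=model_arn,
        DesiredInferenceUnits=inference_units,
    )
    return created["EndpointArn"]


class FakeComprehendClient:
    """Hypothetical in-memory stand-in so the sketch runs without AWS access."""

    def __init__(self):
        self.endpoints = []

    def list_endpoints(self, Filter):
        matches = [e for e in self.endpoints if e["ModelArn"] == Filter["ModelArn"]]
        return {"EndpointPropertiesList": matches}

    def create_endpoint(self, EndpointName, ModelArn, DesiredInferenceUnits):
        arn = "arn:fake:endpoint/" + EndpointName
        self.endpoints.append({"EndpointArn": arn, "ModelArn": ModelArn})
        return {"EndpointArn": arn}


fake_client = FakeComprehendClient()
first_arn = get_or_create_endpoint(fake_client, "demo-endpoint", "arn:fake:model/demo")
second_arn = get_or_create_endpoint(fake_client, "demo-endpoint", "arn:fake:model/demo")
print(first_arn == second_arn)
```

The second call returns the existing ARN instead of creating a duplicate. With the real client, you would pass `comprehend_client` in place of the fake.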

### 2.3 Custom Classifier model inference
<a id=section_2_3></a>

Now that the custom model is deployed, let's see how it classifies some text from a sample utility bill.

**Note:** For real-time analysis, the input for every document type is limited to one page and no more than 10,000 characters.
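
A simple guard can check the character limit before text is sent to the endpoint. This is a minimal sketch that takes the 10,000-character figure from the note above. `fits_realtime_limit` and `truncate_for_realtime` are hypothetical helper names, and plain slicing is only one possible way to shorten the input.

```python
MAX_REALTIME_CHARACTERS = 10000  # limit stated in this lab for real-time analysis


def fits_realtime_limit(text, max_characters=MAX_REALTIME_CHARACTERS):
    """Return True if text is within the character limit for real-time analysis."""
    return len(text) <= max_characters


def truncate_for_realtime(text, max_characters=MAX_REALTIME_CHARACTERS):
    """Return text cut to at most max_characters (simple slicing, may split a word)."""
    return text[:max_characters]


# Made-up input that is deliberately too long
long_text = "utility bill " * 1000
print(fits_realtime_limit(long_text), len(truncate_for_realtime(long_text)))
```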

For convenience, we print the results in a tabular format. The left column shows the predicted class name and the right column shows the confidence score for that class.


In [None]:
%%time
sample_text_for_inference = "JOHN SMITH NYSEG Account Number: 1234-5478-123 Statement Date: January 1, 2021 Amount Due: $123.45 Service Address: 123 MAIN ST, ANYTOWN CA 90210 Page 1 of 4 Next Scheduled Read Date: On or about January 24, 2022 EBPP Account Summary Previous invoice $101.03 Payments received as of 12/27/21 -101.03 Residential Balance forward 0.00 Energy charges 384.71 Residential consumer Miscellaneous charges 0.92 discount $ 2.44 See details beginning Payment due upon receipt. $385.63 on page 3 To avoid a 1.5% late payment charge, please ensure payment is See messages on page 2 received by the date displayed below. Think of the minutes, money and natural resources you'll save by doing business online or by phone 24/7. Visit nyseg.com to: View and pay your bill online Submit and view meter readings Enroll and manage budget billing Enroll in Autopay Call our self-service line at 1.800.600.2275 for billing information, provide a meter reading and to pay by phone. Add $1, $2, or $5 to your payment to make a tax-deductible donation to NYSEG and RG&E Project SHARE Heating Fund. Learn more at nyseg.com. Please return bottom portion with your payment. Make checks payable to NYSEG. Account Number NYSEG 10041076133 Late Fee After 01/20/22 NYSEG P.O. BOX 847812 Due Upon Receipt BOSTON, MA 02284-7812 $385.63 Amount Paid $ JESSE ROBERTS 193 EDGEWOOD DR AVERILL PARK NY 12018-2510 Please do not write below this line. 601004107613300000385630000038563"

# Let's test our custom classifier 
classify_document_response = comprehend_client.classify_document(
    EndpointArn=custom_classifier_endpoint_arn,    
    Text=sample_text_for_inference
)

# Here are the results of the custom classifier's predictions
print(tabulate(classify_document_response.get('Classes')))
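
Downstream code usually needs just the most likely class, not the full table. The following sketch picks the highest-scoring entry from a response shaped like the `Classes` list printed above, and optionally rejects low-confidence predictions. `top_class` is a hypothetical helper, and the class names and scores in `example_response` are made up for illustration.

```python
def top_class(classify_response, min_score=0.0):
    """Return (name, score) for the highest-scoring class, or None if below min_score."""
    classes = classify_response.get("Classes", [])
    if not classes:
        return None
    best = max(classes, key=lambda c: c["Score"])
    if best["Score"] < min_score:
        return None
    return best["Name"], best["Score"]


# Hypothetical response; the class names and scores are made up for illustration
example_response = {
    "Classes": [
        {"Name": "utility_bill", "Score": 0.97},
        {"Name": "bank_statement", "Score": 0.02},
        {"Name": "pay_stub", "Score": 0.01},
    ]
}
print(top_class(example_response, min_score=0.5))
```

Returning `None` below the threshold lets the caller route uncertain documents to manual review instead of trusting a weak prediction.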

## 3. Cleanup Resources
<a id=section_3_0></a>

Run the following cells to delete the custom classifier endpoint and model resources.

In [None]:
%%time
try:
    # Check if the real-time endpoint still exists
    response = comprehend_client.list_endpoints(
        Filter={
            'ModelArn': documentClassifierArn,
        }
    )
    endpoints = response.get('EndpointPropertiesList')

    # If it exists, get the endpoint ARN and delete the real-time endpoint
    if endpoints:
        custom_classifier_endpoint_arn = endpoints[0].get('EndpointArn')

        response = comprehend_client.delete_endpoint(
            EndpointArn=custom_classifier_endpoint_arn
        )
        print("Deleting the custom document classifier model endpoint.")
        time.sleep(350)
    else:
        print("The custom document classifier model endpoint was not found")

except Exception as error:
    print("Unable to delete the custom document classifier model endpoint: {}".format(error))

We need to wait 5-7 minutes for the model endpoint to be deleted before deleting the custom classifier model itself.
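
Instead of waiting for a fixed time, the deletion can be confirmed by polling until no endpoint is listed for the model. The sketch below assumes a client with the same `list_endpoints` call used in this notebook. `wait_for_endpoint_deletion` and `FakeDeletingClient` are hypothetical names, and the fake client is there only so the example runs without AWS access.

```python
import time


def wait_for_endpoint_deletion(client, model_arn, poll_seconds=30, timeout_seconds=900):
    """Poll list_endpoints until no endpoint references model_arn; return True if gone."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.list_endpoints(Filter={"ModelArn": model_arn})
        if not response.get("EndpointPropertiesList"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


class FakeDeletingClient:
    """Hypothetical stand-in that reports the endpoint for a fixed number of polls."""

    def __init__(self, polls_until_gone):
        self.polls_until_gone = polls_until_gone

    def list_endpoints(self, Filter):
        if self.polls_until_gone > 0:
            self.polls_until_gone -= 1
            return {"EndpointPropertiesList": [{"EndpointArn": "arn:fake:endpoint/demo"}]}
        return {"EndpointPropertiesList": []}


deleted = wait_for_endpoint_deletion(FakeDeletingClient(2), "arn:fake:model/demo", poll_seconds=0)
print("Endpoint deleted:", deleted)
```

With the real client, a `True` result means it is safe to delete the classifier model. A `False` result means the endpoint was still listed when the time limit was reached.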

In [None]:
%%time
try:
    
    # Check if real-time endpoint still exists
    response = comprehend_client.list_endpoints(
        Filter={
            'ModelArn': documentClassifierArn,
        }
    )
    
    # Delete the custom classifier model if the endpoint no longer exists
    if(not response.get('EndpointPropertiesList')):
        response = comprehend_client.delete_document_classifier(
            DocumentClassifierArn=documentClassifierArn
        )
        print("Deleting the custom document classifier model.")        
    else:
        print("An endpoint for the custom document classifier model {} still exists".format(documentClassifierArn))

except Exception as error:
    print("Unable to delete the custom document classifier model: {}".format(error))

## 4. Conclusion
<a id=section_4_0></a>

In this lab, we demonstrated how to programmatically deploy a trained Amazon Comprehend Custom Classifier to a real-time endpoint using the AWS SDK for Python (Boto3), run inference against it, and clean up the resources afterward.


## 5. Additional Resources
<a id=section_5_0></a>

- [Amazon Comprehend Custom classification Developer Guide](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html)
- [Amazon Comprehend Boto3 documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/comprehend.html)