[Workshop](../../README.md) | [Lab 0](../../Lab0/README.md)

# LAB 1 - Asynchronous - Extract text from documents with Textract

[Amazon Textract](https://docs.aws.amazon.com/textract/latest/dg/what-is.html) is a service that automatically extracts text and data from scanned documents. With Textract, you can quickly automate document workflows and process millions of document pages in hours.

## Architecture

1. Upload a document to S3.
2. S3 triggers the execution of a Lambda function (already done in [Lab 0](../Lab0/README.md)).
3. The function uses the asynchronous Textract API (``StartDocumentTextDetection``). Textract returns a JobId to the Lambda function.
4. Textract reads the document in S3 and performs the text extraction.
5. Textract publishes the completion status of the extraction to an SNS topic.
6. Another Lambda function, subscribed to the topic, is triggered.
7. The second Lambda function calls the Textract API (``GetDocumentTextDetection``) with the JobId provided in the SNS message to get the result of the extraction.

## Lambda (step 3-4)

In the [Lambda console](https://console.aws.amazon.com/lambda/home#/functions), click on your *documentTextract-xyz* function and scroll down to edit the code in the browser.
Replace the code with the following one and click **Save**:

```python
import os
import urllib.parse

import boto3

textract = boto3.client('textract')

# ARNs of the SNS topic and of the role Textract assumes to publish to it
sns_topic_arn = os.environ["SNS_TOPIC_ARN"]
sns_role_arn = os.environ["SNS_ROLE_ARN"]


def handler(event, context):
    # Bucket and key of the uploaded document, taken from the S3 event
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    object_key = urllib.parse.unquote_plus(
        event['Records'][0]['s3']['object']['key'])

    # Start the asynchronous text detection job
    textract_result = textract.start_document_text_detection(
        DocumentLocation={
            "S3Object": {
                "Bucket": source_bucket,
                "Name": object_key
            }
        },
        NotificationChannel={
            "SNSTopicArn": sns_topic_arn,
            "RoleArn": sns_role_arn
        }
    )
    print(textract_result)
```

We use the [``StartDocumentTextDetection``](https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentTextDetection.html) API to start asynchronous detection of text in a document (JPG, PNG, PDF). When the text detection is finished, Textract publishes a completion status to the SNS topic specified in ``NotificationChannel``. As you can see, we need to provide the ARN of the SNS topic and the ARN of a role. Textract assumes this role, which allows it to publish to SNS. Let's set this up.

## Setup SNS (step 5)

### Create the SNS Topic

In the Amazon [SNS console](https://console.aws.amazon.com/sns/v3/home#/topics) (Simple Notification Service), click **Create Topic**. Choose a name for your topic and leave the details as is, then click **Create topic** at the bottom of the page.

[More details on the creation of an Amazon SNS Topic](https://docs.aws.amazon.com/sns/latest/dg/sns-tutorial-create-topic.html).

Copy the Topic ARN into a text document for later use.

### Give Textract access to the SNS Topic

For Textract to publish messages to the topic, we need to give it permission to do so. In the [IAM console](https://console.aws.amazon.com/iam/home), choose Roles in the navigation pane and create a new role.
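The role built in this section has two parts: a trust policy that lets the Textract service assume the role, and a permission policy that allows ``sns:Publish`` on the topic. For readers who prefer scripting over the console, a minimal sketch with boto3 might look like the following. The function names, the role name and the inline policy name `textract-sns-publish` are illustrative assumptions and not part of the workshop. The sketch also uses an inline role policy, whereas the console walkthrough creates a separate managed policy. The IAM client is passed in as a parameter so the policy documents can be inspected without calling AWS.

```python
import json


def trust_policy():
    """Trust policy that lets the Textract service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "textract.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def publish_policy(topic_arn):
    """Permission policy that allows publishing to a single SNS topic."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["sns:Publish"],
            "Resource": topic_arn,
        }],
    }


def create_textract_sns_role(iam, role_name, topic_arn):
    """Create the role and return its ARN.

    `iam` is expected to behave like boto3.client('iam').
    """
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy()),
    )
    # Hypothetical inline policy name, chosen for this sketch only
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="textract-sns-publish",
        PolicyDocument=json.dumps(publish_policy(topic_arn)),
    )
    return role["Role"]["Arn"]
```

Calling `create_textract_sns_role(boto3.client('iam'), "my-textract-role", topic_arn)` with the Topic ARN copied earlier would return the role ARN that the Lambda function expects in ``SNS_ROLE_ARN``.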
In the role creation process (step 1), select **AWS service** as the type of trusted entity and **EC2** as the service that will use this role (we'll change that in a minute). Then click **Next: Permissions**.

In step 2, choose **Create Policy**. In the newly opened window, select the JSON tab and paste the following policy, replacing the ARN with the Topic ARN you copied earlier. This policy allows publishing messages to the topic created previously:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sns:Publish"
      ],
      "Resource": "arn:aws:sns:<REGION>:<ACCOUNTID>:<TOPIC_NAME>"
    }
  ]
}
```

Click **Review Policy**. In the "Review Policy" screen, add a name and a description and click **Create Policy**.

Back in the role creation screen, hit the refresh button and type the beginning of the policy name in the filter, select the policy and click **Next: Tags**.

You don't need to add tags. Go to step 4 (click **Next: Review**), add a name and a description for the role and click **Create Role**.

This role can be assumed by an EC2 instance, but not yet by Textract. To change this, we need to update the trust relationship. Select your role to see its details, select the **Trust relationships** tab, then click **Edit trust relationship**.

In the trust relationship screen, replace **ec2** with **textract**. You should get the following policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "textract.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

Once this is done, click the **Update trust policy** button. With this policy, Textract is now able to assume the role. Copy the role ARN for later use.

### Update the lambda function

In [Lambda](https://console.aws.amazon.com/lambda/home#/functions), click on your *documentTextract-xyz* function, scroll down to **Environment variables**, add the ``SNS_TOPIC_ARN`` and ``SNS_ROLE_ARN`` variables read by the function code (use the SNS topic ARN and role ARN created previously) and click **Save**.

### Test

To test the process, you need to upload a document to the *workshop-textract-xyz* S3 bucket. You can take any PDF from this [folder](../../documents) and upload it to the bucket. In the [S3 console](https://s3.console.aws.amazon.com/s3/buckets/), click on your *workshop-textract-xyz* bucket, and click on **Upload**.

If you go to [CloudWatch logs](https://console.aws.amazon.com/cloudwatch/home#logs:prefix=/aws/lambda/documentTextract), you will see the output of your Lambda execution. You should get a JSON document containing a ``JobId`` (see [documentation](https://docs.aws.amazon.com/textract/latest/dg/async-notification-payload.html) for details on the result). Copy that JobId; we will use it later for another test.

For now, we don't have any subscription to the SNS topic. Let's fix that.

## Setup the 2nd Lambda function (step 6-7)

### Create the function that will process the result of Textract

In [Lambda](https://console.aws.amazon.com/lambda/home#/functions), click on **Create function**, select **Author from scratch** and fill in the basic information, leave the permissions as is and click on **Create function** when you are done.

#### Configure the timeout and memory

Within the Lambda function screen, if you scroll down, you should be able to see the **Basic Settings** of the function.
Adjust the memory slider to 256 MB and the timeout to 5 minutes. Don't forget to hit **Save**.

#### Add permissions to the function

The function needs permissions to invoke Textract. Let's update the role that was automatically created with the function. Click on the *documentAnalysis* function.

Then scroll down to the **Execution Role** and click **View the documentAnalysis-role-xyz**.

In the new window, click on **Attach policies** and search for "Textract". Select the ``AmazonTextractFullAccess`` policy and click **Attach policy**.

Back in the Lambda function screen, refresh the page. You should now see Amazon Textract in the *Permissions* tab, which means our Lambda function is able to call Textract APIs.

### Subscribe the Lambda function to the SNS Topic

Now we need to trigger the function when a message is published to the SNS topic created previously. To do that, click **Add a trigger**.

Choose **SNS** and select the SNS topic to subscribe to, then click **Add**.

The Lambda function is now ready to receive the notification events from SNS.

### Update the lambda code to get Textract result

We will now edit the Lambda code to call Textract and retrieve the content of the document.
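
When retrieving the result, keep in mind that a single ``GetDocumentTextDetection`` call may not return everything: for longer documents the response carries a ``NextToken`` that has to be passed back to fetch the next batch of blocks. As a minimal sketch, assuming a Textract client is passed in (for example `boto3.client('textract')`), the helpers below follow ``NextToken`` and keep only the text of ``LINE`` blocks. The names `fetch_all_blocks` and `extract_lines` are illustrative and not part of the workshop code.

```python
def fetch_all_blocks(textract, job_id):
    """Collect the blocks of a finished job, following NextToken pagination.

    `textract` is expected to behave like boto3.client('textract').
    """
    blocks = []
    kwargs = {"JobId": job_id}
    while True:
        response = textract.get_document_text_detection(**kwargs)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
        if not next_token:
            return blocks
        # Ask for the next batch of results on the following iteration
        kwargs["NextToken"] = next_token


def extract_lines(blocks):
    """Keep only the text of LINE blocks (PAGE and WORD blocks are skipped)."""
    return [b["Text"] for b in blocks if b.get("BlockType") == "LINE"]
```

Inside a handler, `extract_lines(fetch_all_blocks(textract, jobId))` would then give the detected lines as a list of strings.
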
In Lambda, click on your documentAnalysis-xyz function, scroll down to edit the code and replace it with the following:

```python
import json

import boto3

textract = boto3.client('textract')


def lambda_handler(event, context):
    # The SNS message body is a JSON string describing the finished job
    message = json.loads(event['Records'][0]['Sns']['Message'])
    jobId = message['JobId']
    print("JobId="+jobId)

    status = message['Status']
    print("Status="+status)

    if status != "SUCCEEDED":
        # TODO (not in this workshop): handle error with Dead letter queue
        # https://docs.aws.amazon.com/lambda/latest/dg/dlq.html
        return {
            "status": status
        }

    # Fetch the extraction result for the finished job
    result = textract.get_document_text_detection(
        JobId=jobId
    )
    print(result)
```

We retrieve the SNS message from the Lambda event and then use the [``GetDocumentTextDetection``](https://docs.aws.amazon.com/textract/latest/dg/API_GetDocumentTextDetection.html) API (``get_document_text_detection`` in Python), passing the ``JobId`` as a parameter, to get the Textract result.

### Test

To test the process, we will create a test event that simulates an SNS notification. At the top right of the screen, click **Configure test events**.

Give it a name and paste the following JSON, replace the JobId placeholder with the one you copied before and hit **Create** when you are done:

```json
{
  "Records": [
    {
      "EventSource": "aws:sns",
      "EventVersion": "1.0",
      "Sns": {
        "Type": "Notification",
        "MessageId": "b4755b6e-f9df-53c5-8b16-0e57fe4af33a",
        "Subject": "",
        "Message": "{\"JobId\":\"