# Analysis Document State Machine

The Analysis Document State Machine runs the Amazon Textract service to extract tabular data from a document (PDF format) and indexes the metadata into an Amazon OpenSearch cluster.

![Analysis Document state machine](../../../../deployment/tutorials/images/analysis-document-state-machine.png)

__

## Execution input
The state machine execution input is similar to that of the [Analysis Main State Machine](../main/README.md#execution-input), with additional fields generated by the [Prepare analysis](../main/README.md#state-prepare-analysis) state.

```json
{
    "input": {
        ...,
        "document": {
            "enabled": true,
            "prefix": "IMAGE_PROXIES_PREFIX",
            "numPages": 68
        },
        "request": {
            "timestamp": 1637743896177
        }
    }
}
```

| Field | Description | Comments |
| :-----| :-----------| :---------|
| input.document.enabled | indicates document analysis is required | Must be true |
| input.document.prefix | location of the image proxies (PNG files) generated from the [Ingest Document State Machine](../../ingest/document/README.md) | Must exist |
| input.document.numPages | total number of pages extracted from the Ingest Document State Machine | Must exist |
| input.request.timestamp | request timestamp | If present, the timestamp (_DATETIME_) is appended to the path where the raw analysis results are stored |
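
The constraints in the table above can be expressed as a small input check. The following is a minimal sketch only, not the code used by the solution; the function name `validate_document_input` and the error messages are hypothetical.

```python
def validate_document_input(event: dict) -> dict:
    """Return the input.document block after checking the fields listed in the table.

    Hypothetical helper for illustration; not part of the actual Lambda function.
    """
    document = event.get("input", {}).get("document")
    if not isinstance(document, dict):
        raise ValueError("input.document is missing")
    # input.document.enabled must be true
    if document.get("enabled") is not True:
        raise ValueError("input.document.enabled must be true")
    # input.document.prefix and input.document.numPages must exist
    if not document.get("prefix"):
        raise ValueError("input.document.prefix must exist")
    if "numPages" not in document:
        raise ValueError("input.document.numPages must exist")
    return document
```

Calling the function with the execution input shown earlier would return the `document` block unchanged, while an input with `"enabled": false` would raise a `ValueError`.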

__

## State: Analyze document
A state where a Lambda function uses [Amazon Textract AnalyzeDocument](https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeDocument.html) to extract tabular metadata from all pages within a PDF document. The raw JSON results are stored at _s3://PROXY_BUCKET/UUID/FILE_BASENAME/raw/DATETIME/textract/XXX.json_.
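
As an illustration of the AnalyzeDocument call, the sketch below builds a request for a single page stored in Amazon S3 and asks for table extraction only. It assumes the boto3 SDK and uses hypothetical helper names (`build_analyze_request`, `analyze_page`); the actual Lambda function may be written in a different language and may structure the call differently.

```python
def build_analyze_request(bucket: str, key: str) -> dict:
    """Build AnalyzeDocument parameters for one page stored in S3 (hypothetical helper)."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        # TABLES requests table and cell blocks in the response
        "FeatureTypes": ["TABLES"],
    }


def analyze_page(bucket: str, key: str) -> dict:
    """Call Amazon Textract for one page; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without the SDK

    client = boto3.client("textract")
    return client.analyze_document(**build_analyze_request(bucket, key))
```

In this sketch the caller supplies the bucket and the object key of each page image; how the state iterates over the pages and names the output files is not shown.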

__

## State: More pages?
A Choice state that checks the _$.status_ field. If it is set to _COMPLETED_, indicating that all pages have been processed, the state machine transitions to the ```Index analysis results``` state. Otherwise, it returns to the ```Analyze document``` state to process the remaining pages.
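
The branching rule can be summarized in a few lines. This is a sketch of the decision only, written in Python for readability; the real logic lives in the state machine definition, and the function name `next_state` is hypothetical.

```python
ANALYZE_DOCUMENT = "Analyze document"
INDEX_ANALYSIS_RESULTS = "Index analysis results"


def next_state(state_output: dict) -> str:
    """Mirror the Choice state: move on only when $.status is COMPLETED."""
    if state_output.get("status") == "COMPLETED":
        return INDEX_ANALYSIS_RESULTS
    # Any other status (or a missing status) loops back for the remaining pages
    return ANALYZE_DOCUMENT
```

For example, `next_state({"status": "COMPLETED"})` selects the `Index analysis results` state, and any other value sends the execution back to `Analyze document`.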

__

## State: Index analysis results
A state where a Lambda function downloads and parses the tabular data and indexes it into the Amazon OpenSearch cluster under the ```textract``` index.
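
To show what parsing the tabular data can involve, the sketch below rebuilds rows and columns from the `Blocks` array of an AnalyzeDocument response, following the TABLE, CELL, and WORD relationships that Textract returns. It is an assumption about one possible approach, not the parser used by the solution; the function name `tables_from_blocks` is hypothetical, and the indexing step is not shown.

```python
def tables_from_blocks(blocks: list) -> list:
    """Return each TABLE block as a list of rows, each row a list of cell strings."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block: dict) -> list:
        # Collect the Ids of all CHILD relationships of a block
        ids = []
        for relationship in block.get("Relationships") or []:
            if relationship.get("Type") == "CHILD":
                ids.extend(relationship.get("Ids", []))
        return ids

    tables = []
    for block in blocks:
        if block.get("BlockType") != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = by_id.get(cell_id)
            if cell is None or cell.get("BlockType") != "CELL":
                continue
            words = [
                by_id[word_id]["Text"]
                for word_id in child_ids(cell)
                if by_id.get(word_id, {}).get("BlockType") == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based in Textract responses
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        if not cells:
            tables.append([])
            continue
        row_count = max(row for row, _ in cells)
        column_count = max(column for _, column in cells)
        tables.append([
            [cells.get((row, column), "") for column in range(1, column_count + 1)]
            for row in range(1, row_count + 1)
        ])
    return tables
```

Each returned table is a plain list of rows, which could then be serialized into whatever document shape the ```textract``` index expects.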

__

## AWS Lambda function (analysis-document)
The analysis-document Lambda function implements the different states of the Analysis Document state machine. The following AWS X-Ray trace diagram illustrates the AWS resources this Lambda function communicates with.

![Analysis Document Lambda function](../../../../deployment/tutorials/images/analysis-document-lambda.png)

__

## IAM Role Permission

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "s3:ListBucket",
            "Resource": "PROXY_BUCKET",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "PROXY_BUCKET/*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "dynamodb:Scan",
                "dynamodb:Query",
                "dynamodb:UpdateItem",
                "dynamodb:DeleteItem"
            ],
            "Resource": [
                "SERVICE_TOKEN_TABLE"
            ],
            "Effect": "Allow"
        },
        {
            "Action": "textract:AnalyzeDocument",
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "es:ESHttpGet",
                "es:ESHttpHead",
                "es:ESHttpPost",
                "es:ESHttpPut",
                "es:ESHttpDelete"
            ],
            "Resource": "OPENSEARCH_CLUSTER",
            "Effect": "Allow"
        }
    ]
}
```
__

Back to [Analysis Main State Machine](../main/README.md) | Back to [Table of contents](../../../../README.md#table-of-contents)