# Amazon Textract PDF Text Extractor Improve data extraction and document processing with Amazon Textract This project provides a mechanism to use Amazon Textract to extract meaningful actionable data from a wide range of complex multi-format PDF files. PDF files are challenging, they can have a variety of data elements like headers, footers, tables with data in multiple columns, images, graphs, sentences and paragraphs in different formats. We explore the data extraction phase of IDP as shown in the following figure, and how they connect to the steps involved in a document process, such as ingestion, extraction and post processing. # Solution Architecture ![Solution Architecture](images/solution_architecture.png) ## Prerequisites You can either use AWS Cloud9 or your local to deploy this solution. ## Prerequisites for local setup 1. Download and install the latest version of Python for your OS from [here](https://www.python.org/downloads/). We will be using Python 3.8+. 2. You will need to install version 2 of the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) as well. If you already have AWS CLI, please upgrade to a minimum version of 2.0.5 following the instructions on the link above. 3. AWS CDK 4. Docker ## Deployment Instructions 1. Clone this repo to your local or Cloud9. 2. Run the following commands: pip install -r requirements.txt cdk bootstrap cdk deploy SimpleAsyncWorkflow ## Execution Instructions Follow the instructions in blog [post](https://aws.amazon.com/blogs/machine-learning/improve-data-extraction-and-document-processing-with-amazon-textract/). ## Further Reading: [IDP constructs](https://constructs.dev/packages/amazon-textract-idp-cdk-constructs/v/0.0.7/api/TextractGenericAsyncSfnTask?lang=typescript) ## License This library is licensed under the [MIT-0 License](https://github.com/aws/mit-0).