/*! Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0 */
import React from 'react';
import { Panel, Container, Header, IconButton } from 'rsuite';
import FileDownloadIcon from '@rsuite/icons/FileDownload';
import './styles.less';
import data from './data.json';
import arch from 'src/assets/architecture.png';

const Home = () => {
  return (

Bulk document de-identification and OCR

with Intelligent Document Processing (IDP)

An application to help evaluate de-identification (redaction) of protected health information (PHI) in bulk documents with Amazon Textract and Amazon Comprehend Medical.

Introduction

You can use this application to try out a document processing workflow that performs bulk OCR on documents and de-identifies any PHI they contain. The bulk OCR flow uses Amazon Textract to extract lines, forms (if any), and table data (if any) from your documents, and generates Excel reports that you can use to evaluate the accuracy scores. The de-identification flow takes the text output generated by Amazon Textract OCR and runs PHI entity detection on it with Amazon Comprehend Medical. Documents are then de-identified by redacting the detected PHI entities, using the bounding box geometry information generated by Amazon Textract.
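The last step of that flow, going from detected PHI entities back to regions on the page, can be sketched as follows. This is a minimal illustration and not the application's actual implementation. It assumes the text sent to Amazon Comprehend Medical was built by joining Amazon Textract WORD blocks with single spaces, and that each entity's `EndOffset` is exclusive. The helper names `buildTextWithOffsets` and `redactionBoxes` and the sample data are hypothetical; the field names (`Text`, `Geometry.BoundingBox`, `BeginOffset`, `EndOffset`) follow the shapes of the Textract and Comprehend Medical responses.

```javascript
// Sketch: map PHI entity character offsets back to Textract WORD bounding boxes.
// Assumption: the OCR text is the WORD block texts joined with single spaces,
// so every word's character range in that text is known.
function buildTextWithOffsets(wordBlocks) {
  let text = '';
  const spans = [];
  for (const block of wordBlocks) {
    if (text.length > 0) text += ' ';
    const begin = text.length;
    text += block.Text;
    // Remember which character range each word occupies and where it sits on the page.
    spans.push({ begin, end: text.length, box: block.Geometry.BoundingBox });
  }
  return { text, spans };
}

// Collect the bounding box of every word that overlaps a PHI entity.
// Assumption: EndOffset is exclusive, i.e. text.slice(BeginOffset, EndOffset) is the entity text.
function redactionBoxes(spans, phiEntities) {
  const boxes = [];
  for (const entity of phiEntities) {
    for (const span of spans) {
      const overlaps = span.begin < entity.EndOffset && span.end > entity.BeginOffset;
      if (overlaps) boxes.push(span.box);
    }
  }
  return boxes;
}

// Hypothetical sample data shaped like Textract WORD blocks.
const words = [
  { BlockType: 'WORD', Text: 'Patient:', Geometry: { BoundingBox: { Left: 0.10, Top: 0.10, Width: 0.12, Height: 0.02 } } },
  { BlockType: 'WORD', Text: 'John', Geometry: { BoundingBox: { Left: 0.23, Top: 0.10, Width: 0.06, Height: 0.02 } } },
  { BlockType: 'WORD', Text: 'Doe', Geometry: { BoundingBox: { Left: 0.30, Top: 0.10, Width: 0.05, Height: 0.02 } } },
];

const { text, spans } = buildTextWithOffsets(words);

// Hypothetical entity shaped like a Comprehend Medical PHI detection result.
const entities = [{ Text: 'John Doe', Type: 'NAME', BeginOffset: 9, EndOffset: 17 }];

const boxes = redactionBoxes(spans, entities);
console.log(text, boxes);
```

Bounding box values are ratios of the page width and height, so a renderer would multiply them by the page's pixel dimensions before drawing the redaction rectangles.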

Quick start

Click the button below to download a set of sample documents to try out. Once you have downloaded the zip file, unzip it, navigate to the "Process Documents" screen from the left panel, select some or all of the documents in the folder, and drag and drop them into the file upload area. If you want to test document de-identification, make sure to check the "De-identify documents" checkbox in the De-identification section.

Download samples

What is de-identification?

According to the U.S. Department of Health and Human Services (HHS):

The increasing adoption of health information technologies in the United States accelerates their potential to facilitate beneficial studies that combine large, complex data sets from multiple sources. The process of de-identification, by which identifiers are removed from the health information, mitigates privacy risks to individuals and thereby supports the secondary use of data for comparative effectiveness studies, policy assessment, life sciences research, and other endeavors.

For more information on the rationale behind de-identifying documents that contain PHI, see the HHS guidance on Methods for De-identification of PHI.

How it works

The diagram below shows the high-level architecture of the application and the AWS services it uses to perform bulk OCR and de-identification. The most notable services used in the architecture are:


Using the application

If you are reading this, you have successfully deployed the application and all its relevant components into your AWS account. Using the application is straightforward: navigate to the "Process Documents" page from the left menu and upload your documents to kick off a document processing workflow.

Documents uploaded together are processed as a single batch, within a single IDP workflow execution called an "analysis job". You can upload up to 200 documents per analysis job. This limit is arbitrary and is imposed only by this demo implementation; the architecture itself can process a virtually unlimited number of documents. Feel free to customize this application to suit your evaluation needs.
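If you have more documents than one analysis job accepts, one option is to split them into several uploads. The sketch below is only an illustration of that idea, not code from this application. The 200-document limit comes from the text above; the function name `splitIntoJobs` and the sample file names are hypothetical.

```javascript
// Sketch: split a list of files into groups that each fit in one analysis job.
const MAX_DOCS_PER_JOB = 200; // demo limit stated above

function splitIntoJobs(files, maxPerJob = MAX_DOCS_PER_JOB) {
  if (!Number.isInteger(maxPerJob) || maxPerJob < 1) {
    throw new RangeError('maxPerJob must be a positive integer');
  }
  const jobs = [];
  for (let i = 0; i < files.length; i += maxPerJob) {
    // Each slice holds at most maxPerJob files; the last one may be smaller.
    jobs.push(files.slice(i, i + maxPerJob));
  }
  return jobs;
}

// Hypothetical example: 450 documents become jobs of 200, 200, and 50.
const fileNames = Array.from({ length: 450 }, (_, i) => `doc-${i + 1}.pdf`);
const jobs = splitIntoJobs(fileNames);
console.log(jobs.map((job) => job.length));
```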

For Amazon Textract limits, refer to the Amazon Textract Hard limits. For Amazon Comprehend Medical limits, refer to the Amazon Comprehend Medical guidelines and quotas.

Usage options

Bulk OCR with Amazon Textract: Simply upload your documents to the application. Once the upload is complete, the IDP workflow deployed in the back end extracts WORDS, LINES, and other structural information such as FORMS and TABLE data from all your documents by default. You can then review the output for each document and view its confidence scores, geometry information, and other metadata.
Document de-identification: You can also enable document de-identification, which uses Amazon Comprehend Medical to identify PHI entities and then uses the bounding box geometry that Amazon Textract generates for the OCR'd text to perform redactions.
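When reviewing OCR output, it can help to condense per-block confidence scores into a per-type summary. The sketch below assumes you have the list of Amazon Textract blocks for a document, each with a `BlockType` and (for most block types) a numeric `Confidence` between 0 and 100. The function name `summarizeConfidence` and the sample blocks are hypothetical, and this is not how the application builds its own reports.

```javascript
// Sketch: summarize Textract confidence scores per block type.
function summarizeConfidence(blocks) {
  const totals = new Map();
  for (const block of blocks) {
    // Skip blocks that carry no numeric confidence score.
    if (typeof block.Confidence !== 'number') continue;
    const entry = totals.get(block.BlockType) || { count: 0, sum: 0, min: Infinity };
    entry.count += 1;
    entry.sum += block.Confidence;
    entry.min = Math.min(entry.min, block.Confidence);
    totals.set(block.BlockType, entry);
  }
  const summary = {};
  for (const [type, { count, sum, min }] of totals) {
    summary[type] = { count, average: sum / count, min };
  }
  return summary;
}

// Hypothetical sample shaped like Textract blocks.
const sampleBlocks = [
  { BlockType: 'PAGE' },
  { BlockType: 'LINE', Confidence: 99 },
  { BlockType: 'WORD', Confidence: 98 },
  { BlockType: 'WORD', Confidence: 90 },
];

const confidenceSummary = summarizeConfidence(sampleBlocks);
console.log(confidenceSummary);
```

The minimum is reported alongside the average because a single low-confidence word can matter more for redaction quality than the overall mean.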

  );
};

export default Home;