# Medical Document Processing with Amazon Textract, Amazon Comprehend, and Amazon Comprehend Medical

In this notebook, we will walk through how to build a data processing pipeline that extracts relevant medical information from electronic medical reports (EMR) in PDF format, using the following AWS services:

- [Textract](https://aws.amazon.com/textract/): To extract text from the PDF medical report
- [Comprehend](https://aws.amazon.com/comprehend/): To process general language data from the output of Textract.
- [Comprehend Medical](https://aws.amazon.com/comprehend/medical/): To process medical-domain information from the output of Textract.

NOTE: This notebook requires that the SageMaker Execution Role has additional permissions to call the Textract, Comprehend, and Comprehend Medical services. Please reach out to your system administrator if you are running this outside of an AWS-hosted workshop.

## Contents

1. [Objective](#1.-Objective)
1. [Setup Environment](#2.-Setup-Environment)
1. [Extract text from a medical PDF document with Amazon Textract](#3.-Extract-text-from-a-medical-PDF-document-with-Amazon-Textract)
1. [Process general text information with Amazon Comprehend](#4.-Process-general-text-information-with-Amazon-Comprehend)
1. [Process medical domain information using Amazon Comprehend Medical](#5.-Process-medical-domain-information-using-Amazon-Comprehend-Medical)
1. [Clean up](#6.-Clean-up)

---

# 1. Objective

The objective of this section of the workshop is to learn how to use Amazon Textract and Comprehend Medical to extract the medical information from an electronic medical report in PDF format.

---

# 2. Setup environment

Before we begin, let us set up our environment. We will need the following:

* Amazon Textract Results Parser `textract-trp` to process our Textract results.
* Python libraries (`boto3`, `sagemaker`, and `pandas`)
* Pre-processing functions that will help with processing and visualization of our results. For the purpose of this workshop, we have provided a pre-processing function library that can be found in [util/preprocess.py](./util/preprocess.py)

Note: `textract-trp` will require Python 3.6 or newer.

In [None]:
!pip install textract-trp

In [None]:
import boto3
import time
import sagemaker
import os
import trp
import pandas as pd

bucket = sagemaker.Session().default_bucket()
prefix = "sagemaker/medical_notes"

---

# 3. Extract text from a medical PDF document with Amazon Textract

In this section, we will extract the text from a medical report in PDF format using Textract. To facilitate this workshop, we have generated a [sample PDF medical report](./data/sample_report_1.pdf) using the [MTSample dataset](https://www.kaggle.com/tboyle10/medicaltranscriptions) from Kaggle.

## About Textract
Amazon Textract can detect lines of text and the words that make up each line. Textract can process documents either synchronously or asynchronously:
+ [synchronous API](https://docs.aws.amazon.com/textract/latest/dg/sync.html): *The input document must be an image in `JPEG` or `PNG` format.* Single-page document analysis can be performed using a Textract synchronous operation.
    1. *`detect_document_text`*: detects text in the input document.
    2. *`analyze_document`*: analyzes an input document for relationships between detected items.
+ [asynchronous API](https://docs.aws.amazon.com/textract/latest/dg/async.html): *Can analyze text in documents that are in `JPEG`, `PNG`, and `PDF` format. Multi-page processing is an asynchronous operation. The documents are stored in an Amazon S3 bucket. Use `DocumentLocation` to specify the bucket name and file name of the document.*
    1. For text detection:
        1. *`start_document_text_detection`*: starts the asynchronous detection of text in a document.
        2. *`get_document_text_detection`*: gets the results for an Amazon Textract asynchronous operation that detects text in a document.
    2. For relationships between detected items:
        1. *`start_document_analysis`*: starts the asynchronous analysis of relationships in a document.
        2. *`get_document_analysis`*: gets the results for an Amazon Textract asynchronous operation that analyzes text in a document.
 
For the detailed API reference, see the [boto3 Textract documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.analyze_document).
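
The split between the two APIs described above can be written down as a small decision helper. This is a minimal sketch that only encodes the format rules quoted in this section; `choose_textract_mode` is a hypothetical name and is not part of boto3 or Textract.

```python
import os

# Formats per mode, taken from the description above. Assumption: this list
# may be incomplete for the Textract release you are using.
SYNC_FORMATS = {".jpg", ".jpeg", ".png"}
ASYNC_FORMATS = SYNC_FORMATS | {".pdf"}


def choose_textract_mode(file_name, page_count=1):
    """Return "sync" or "async" for a document (hypothetical helper)."""
    extension = os.path.splitext(file_name)[1].lower()
    if extension not in ASYNC_FORMATS:
        raise ValueError(f"Unsupported document format: {extension or file_name}")
    # Single-page images can use the synchronous operations; PDFs and
    # multi-page documents go through the asynchronous ones.
    if extension in SYNC_FORMATS and page_count == 1:
        return "sync"
    return "async"


print(choose_textract_mode("sample_report_1.pdf", page_count=3))
```

For the sample report used in this workshop, the helper selects the asynchronous path, which is why the following cells work with an S3 location and a job ID.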

In this demo, the input is a multi-page PDF, so we will use the asynchronous (multi-page) Textract operation. Because asynchronous operations read documents from Amazon S3, we first need to upload our sample medical record to an S3 bucket. Run the next cell to upload our sample medical report.

In [None]:
fileName = "sample_report_1.pdf"
fileUploadPath = os.path.join("./data", fileName)
textractObjectName = os.path.join(prefix, "data", fileName)

# Upload medical report file
boto3.Session().resource("s3").Bucket(bucket).Object(textractObjectName).upload_file(
 fileUploadPath
)
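
S3 object keys use forward slashes as separators. `os.path.join` produces them on Linux, including SageMaker notebook instances, but would insert backslashes on Windows. The following is a minimal sketch of a platform-independent alternative; `build_s3_key` is a hypothetical helper introduced here for illustration.

```python
import posixpath


def build_s3_key(*parts):
    """Join key components with "/" regardless of the local operating system."""
    return posixpath.join(*parts)


key = build_s3_key("sagemaker/medical_notes", "data", "sample_report_1.pdf")
print(key)
```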

## Start text detection asynchronously in the PDF
In the next step, we will start the asynchronous Textract operation by calling the `start_document_analysis()` function. The function will kick off an asynchronous job that processes our medical report file in the specified S3 bucket.

In [None]:
textract = boto3.client("textract")
response = textract.start_document_analysis(
 DocumentLocation={"S3Object": {"Bucket": bucket, "Name": textractObjectName}},
 FeatureTypes=[
 "TABLES",
 ],
)

textractJobId = response["JobId"]
print("job id is: ", textractJobId)

## Monitor the job status

As the job runs in the background, we can monitor its progress by calling the `get_document_analysis()` function with the job ID of the job that we created.

Run the next cell and wait for the Textract job to return a `SUCCEEDED` status. The response is returned in JSON format.

In [None]:
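%%time
# Poll every 5 seconds until the job leaves the IN_PROGRESS state
time.sleep(5)
response = textract.get_document_analysis(JobId=textractJobId)
status = response["JobStatus"]

while status == "IN_PROGRESS":
    time.sleep(5)
    response = textract.get_document_analysis(JobId=textractJobId)
    status = response["JobStatus"]
    print("Textract Job status: {}".format(status))

# Report the final status even if the job finished before the first poll
print("Final Textract Job status: {}".format(status))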
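
The polling cell above loops until the job leaves `IN_PROGRESS`, but it does not stop on a failed job or give up after a fixed time. The following is a minimal sketch of a reusable variant, assuming the response carries `JobStatus` and, on failure, an optional `StatusMessage`. `wait_for_textract_job` and its parameters are hypothetical names; the `fetch` argument is expected to behave like `textract.get_document_analysis`.

```python
import time


def wait_for_textract_job(fetch, job_id, poll_seconds=5, timeout_seconds=600,
                          sleep=time.sleep):
    """Poll fetch(JobId=job_id) until the job is no longer IN_PROGRESS."""
    waited = 0
    while True:
        response = fetch(JobId=job_id)
        status = response["JobStatus"]
        if status == "FAILED":
            message = response.get("StatusMessage", "no status message")
            raise RuntimeError(f"Textract job {job_id} failed: {message}")
        if status != "IN_PROGRESS":
            # SUCCEEDED or PARTIAL_SUCCESS: hand the response back to the caller
            return response
        if waited >= timeout_seconds:
            raise TimeoutError(f"Textract job {job_id} still running after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

In this notebook it could be called as `response = wait_for_textract_job(textract.get_document_analysis, textractJobId)`. Passing the fetch function in as an argument keeps the helper easy to exercise without calling AWS.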

## View Textract results
Now that we've successfully extracted the text from the medical report, let us retrieve the Textract results and consolidate the text so that we can pass it to Comprehend Medical to start extracting medical information from the report.

In [None]:
%%time
pages = []

time.sleep(5)

# Fetch the first page of results
response = textract.get_document_analysis(JobId=textractJobId)
pages.append(response)

# A NextToken in the response means more result pages are available
nextToken = response.get("NextToken")

while nextToken:
    # Pause briefly between requests
    time.sleep(5)

    response = textract.get_document_analysis(JobId=textractJobId, NextToken=nextToken)

    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = response.get("NextToken")
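
The `NextToken` handling above can also be wrapped in a function so that it is reusable for other jobs. The following is a minimal sketch; `fetch_all_result_pages` is a hypothetical helper, and `fetch` is assumed to accept `JobId` and an optional `NextToken` in the same way as `textract.get_document_analysis`.

```python
def fetch_all_result_pages(fetch, job_id):
    """Collect every page of an asynchronous Textract result set."""
    result_pages = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = fetch(**kwargs)
        result_pages.append(response)
        # The last page of results has no NextToken
        next_token = response.get("NextToken")
        if not next_token:
            return result_pages
```

A call such as `pages = fetch_all_result_pages(textract.get_document_analysis, textractJobId)` would produce the same list that the cell above builds by hand.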

Let's take a look at the output from Textract by using the `trp` library to extract and format the results.

In [None]:
doc = trp.Document(pages)
print("Total number of pages in the document:", len(doc.pages))
full_text = ""
# Print each page and accumulate the text of the whole document
for idx, page in enumerate(doc.pages, start=1):
    print(f"Results from page {idx}: \n", page.text)
    full_text += page.text
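
The `trp` library hides the structure of the raw Textract response. To show what it works with, the following minimal sketch collects the same line text directly from the raw results, assuming each response contains a `Blocks` list whose entries carry `BlockType`, `Text`, and `Page`. `lines_by_page` is a hypothetical helper, and the sample data in the usage line is made up for illustration.

```python
def lines_by_page(result_pages):
    """Group the text of LINE blocks by page number."""
    text_by_page = {}
    for response in result_pages:
        for block in response.get("Blocks", []):
            # PAGE, WORD, TABLE and CELL blocks are skipped; only lines are kept
            if block.get("BlockType") == "LINE":
                page_number = block.get("Page", 1)
                text_by_page.setdefault(page_number, []).append(block["Text"])
    return {page: "\n".join(lines) for page, lines in sorted(text_by_page.items())}


# Illustrative input only, not real Textract output
sample = [{"Blocks": [{"BlockType": "LINE", "Page": 1, "Text": "CHIEF COMPLAINT"}]}]
print(lines_by_page(sample))
```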

---

# 4. Process general text information with Amazon Comprehend

### What is the dominant language?

In [None]:
import pprint

comprehend_client = boto3.client(service_name="comprehend", region_name="us-east-1")

response = comprehend_client.detect_dominant_language(Text=full_text).get(
 "Languages", []
)

for language in response:
 print(
 f"Detected language is {language.get('LanguageCode', [])}, with a confidence score of {language.get('Score', [])}"
 )

### What are the named entities?

In [None]:
response = comprehend_client.detect_entities(Text=full_text, LanguageCode="en")

entities_df = pd.DataFrame(
 [
 [
 entity["Text"],
 entity["Type"],
 entity["Score"],
 entity["BeginOffset"],
 entity["EndOffset"],
 ]
 for entity in response["Entities"]
 ],
 columns=["Text", "Type", "Score", "BeginOffset", "EndOffset"],
).sort_values(by="Score", ascending=False)

display(entities_df)
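
A sorted table is useful for inspection, but downstream steps often need the entities grouped by type. The following is a minimal sketch, assuming each entity is a dictionary with `Text`, `Type`, and `Score` keys as in the cell above. `group_entities_by_type` and the 0.9 threshold are hypothetical choices.

```python
from collections import defaultdict


def group_entities_by_type(entities, min_score=0.9):
    """Map each entity type to the texts detected with at least min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["Score"] >= min_score:
            grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)
```

With the response from the previous cell, `group_entities_by_type(response["Entities"])` would return a dictionary such as one keyed by `PERSON` or `DATE`, depending on what was detected.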

### What are the key phrases?

In [None]:
response = comprehend_client.detect_key_phrases(Text=full_text, LanguageCode="en")

entities_df = pd.DataFrame(
 [
 [entity["Text"], entity["Score"], entity["BeginOffset"], entity["EndOffset"]]
 for entity in response["KeyPhrases"]
 ],
 columns=["Text", "Score", "BeginOffset", "EndOffset"],
).sort_values(by="Score", ascending=False)

display(entities_df)

### Is there any personally-identifiable information?

In [None]:
response = comprehend_client.detect_pii_entities(Text=full_text, LanguageCode="en")

entities_df = pd.DataFrame(
 [
 [entity["Type"], entity["Score"], entity["BeginOffset"], entity["EndOffset"]]
 for entity in response["Entities"]
 ],
 columns=["Type", "Score", "BeginOffset", "EndOffset"],
).sort_values(by="Score", ascending=False)

display(entities_df)
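
The PII response only reports types and offsets, so the offsets can be used to mask the matching spans in the original text. The following is a minimal sketch, assuming `BeginOffset` and `EndOffset` index directly into the Python string. `redact_pii`, the 0.5 threshold, and the sample sentence are hypothetical.

```python
def redact_pii(text, pii_entities, min_score=0.5):
    """Replace each detected PII span with its type in square brackets."""
    redacted = text
    # Work from the end of the text so that earlier offsets stay valid
    for entity in sorted(pii_entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity["Score"] < min_score:
            continue
        redacted = (
            redacted[: entity["BeginOffset"]]
            + f"[{entity['Type']}]"
            + redacted[entity["EndOffset"]:]
        )
    return redacted


# Illustrative input only, not taken from the sample report
example_text = "Patient John Smith, phone 555-0100."
example_pii = [
    {"Type": "NAME", "Score": 0.99, "BeginOffset": 8, "EndOffset": 18},
    {"Type": "PHONE", "Score": 0.98, "BeginOffset": 26, "EndOffset": 34},
]
print(redact_pii(example_text, example_pii))
```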

### What is the overall sentiment?

In [None]:
response = comprehend_client.detect_sentiment(Text=full_text[:5000], LanguageCode="en")
print(response.get("Sentiment", []))
pprint.pprint(response.get("SentimentScore", []))
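
Slicing with `full_text[:5000]` counts characters, while service-side size limits are typically expressed in UTF-8 bytes, so text with accented or other multi-byte characters could still exceed a byte limit. The following is a minimal sketch of a byte-aware truncation. `truncate_utf8` is a hypothetical helper, and the 5,000-byte default simply mirrors the slice used above rather than a verified service quota.

```python
def truncate_utf8(text, max_bytes=5000):
    """Cut text so that its UTF-8 encoding fits within max_bytes."""
    encoded = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a multi-byte character that was cut in half
    return encoded.decode("utf-8", errors="ignore")
```

It could replace the slice in the call above, for example `Text=truncate_utf8(full_text)`.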

### What are the parts of speech?

In [None]:
response = comprehend_client.detect_syntax(Text=full_text[:5000], LanguageCode="en")
# Each syntax token carries the word itself and its part-of-speech tag
syntax_df = pd.DataFrame(
    [
        [
            token["Text"],
            token["PartOfSpeech"]["Tag"],
            token["PartOfSpeech"]["Score"],
            token["BeginOffset"],
            token["EndOffset"],
        ]
        for token in response["SyntaxTokens"]
    ],
    columns=["Text", "PartOfSpeech", "Score", "BeginOffset", "EndOffset"],
).sort_values(by="Score", ascending=False)

display(syntax_df)

---

# 5. Process medical domain information using Amazon Comprehend Medical

## About Amazon Comprehend Medical

Comprehend Medical detects useful information in unstructured clinical text. As much as 75% of all health record data is found in unstructured text such as physician's notes, discharge summaries, test results, and case notes. Amazon Comprehend Medical uses Natural Language Processing (NLP) models to sort through text for valuable information. 

Using Amazon Comprehend Medical, you can quickly and accurately gather information, such as medical condition, medication, dosage, strength, and frequency, from a variety of sources like doctors’ notes. Amazon Comprehend Medical uses advanced machine learning models to identify medical information, such as medical conditions and medications, and determines their relationship to each other, for instance, medicine dosage and strength. Amazon Comprehend Medical can also link the detected information to medical ontologies such as ICD-10-CM or RxNorm.

Currently, Amazon Comprehend Medical only detects medical entities in English language texts.

![Image of Comprehend Medical](https://d1.awsstatic.com/diagrams/product-page-diagram-Ontology-Linking_How-It-Works@2x.f2dde99f71240451d64b24bdd202573ff9a26d35.png)

In this workshop, we will use the detect entities function ([detect_entities_v2](https://docs.aws.amazon.com/comprehend/latest/dg/extracted-med-info-V2.html)) to extract medical conditions. Then, we'll use the ICD-10-CM Linking function ([infer_icd10_cm](https://docs.aws.amazon.com/comprehend-medical/latest/dev/ontology-icd10.html)) to code the conditions.

## Detect medical entities
The *detect_entities_v2* operation detects several categories of entities. In this workshop, we focus on:

- `MEDICAL_CONDITION`: The signs, symptoms, and diagnoses of medical conditions.

Each detected entity also includes:
- `Score`: The level of confidence that Amazon Comprehend Medical has in the accuracy of the detection.
- `Trait`: Contextual information for the entity.

Other information extracted by Comprehend Medical:
- `MEDICATION`: Medication and dosage information for the patient.
- `PROTECTED_HEALTH_INFORMATION`: The patient's personal information, e.g. name, age, gender.
- `TEST_TREATMENT_PROCEDURE`: The procedures that are used to determine a medical condition.
- `TIME_EXPRESSION`: Entities related to time when they are associated with a detected entity.

In [None]:
maxLength = 20000  # upper bound on the number of characters sent in one request
pd.options.display.max_rows = 999
comprehend_medical_client = boto3.client(
    service_name="comprehendmedical", region_name="us-east-1"
)

# Truncate the text so that the request stays within maxLength characters
response = comprehend_medical_client.detect_entities_v2(Text=full_text[:maxLength])

entities_df = pd.DataFrame(
 [
 [
 entity["Id"],
 entity["Text"],
 entity["Category"],
 entity["Type"],
 entity["Score"],
 entity["BeginOffset"],
 entity["EndOffset"],
 entity["Attributes"][0].get("RelationshipType", "")
 if entity.get("Attributes")
 else "",
 entity["Attributes"][0]["Text"] if entity.get("Attributes") else "",
 entity["Attributes"][0]["Category"] if entity.get("Attributes") else "",
 ]
 for entity in response["Entities"]
 ],
 columns=[
 "Id",
 "Text",
 "Category",
 "Type",
 "Score",
 "BeginOffset",
 "EndOffset",
 "RelationshipType",
 "Text2",
 "Category2",
 ],
).sort_values(by="Score", ascending=False)

display(entities_df)
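
The table above keeps only the first attribute of each entity, so a medication with both a dosage and a frequency shows just one of them. The following is a minimal sketch that emits one row per entity and attribute pair instead, assuming the entity dictionaries have the keys used in the cell above. `flatten_medical_entities` and the `AttributeText` and `AttributeType` column names are hypothetical.

```python
def flatten_medical_entities(entities):
    """Return one row per (entity, attribute); entities without attributes get one row."""
    rows = []
    for entity in entities:
        base = {
            "Id": entity["Id"],
            "Text": entity["Text"],
            "Category": entity["Category"],
            "Type": entity["Type"],
        }
        # Use a single placeholder when the entity has no attributes
        for attribute in entity.get("Attributes") or [None]:
            row = dict(base)
            row["RelationshipType"] = attribute.get("RelationshipType", "") if attribute else ""
            row["AttributeText"] = attribute.get("Text", "") if attribute else ""
            row["AttributeType"] = attribute.get("Type", "") if attribute else ""
            rows.append(row)
    return rows
```

The result can be passed straight to pandas, for example `pd.DataFrame(flatten_medical_entities(response["Entities"]))`.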

## Link ICD-10 concepts

In [None]:
response = comprehend_medical_client.infer_icd10_cm(Text=full_text[:5000])

In [None]:
# Skip the first 10 entities, then show only entities scored 0.8 or higher
for entity in response["Entities"][10:]:
    if entity.get("Score", 0) < 0.8:
        continue

    print(f"Text: {entity.get('Text', [])}")
    print(f"Type: {entity.get('Type', [])}")
    print(f"Category: {entity.get('Category', [])}")
    print(f"Score: {entity.get('Score', [])}")

    # List the candidate ICD-10-CM codes for this entity, best match first
    icd10_df = pd.DataFrame(
        [
            [concept["Code"], concept["Description"], concept["Score"]]
            for concept in entity["ICD10CMConcepts"]
        ],
        columns=["Code", "Description", "Score"],
    ).sort_values(by="Score", ascending=False)

    display(icd10_df)
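
When only the most likely code per condition is needed, the candidate list can be reduced to its highest-scoring concept. The following is a minimal sketch, assuming the entity structure used in the cell above. `top_icd10_codes` is a hypothetical helper, and the 0.8 threshold mirrors the filter in that cell.

```python
def top_icd10_codes(entities, min_entity_score=0.8):
    """Return (entity text, best code, description) for confident entities."""
    results = []
    for entity in entities:
        concepts = entity.get("ICD10CMConcepts", [])
        if entity.get("Score", 0) < min_entity_score or not concepts:
            continue
        # Keep the candidate concept with the highest linking score
        best = max(concepts, key=lambda concept: concept["Score"])
        results.append((entity["Text"], best["Code"], best["Description"]))
    return results
```

Calling `top_icd10_codes(response["Entities"])` on the response above yields a compact list that can be stored alongside the report.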

# 6. Clean up

In [None]:
boto3.Session().resource("s3").Bucket(bucket).Object(textractObjectName).delete()