{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Getting Started\n",
    "\n",
    "We will be using [boto3](https://aws.amazon.com/sdk-for-python/) the Python SDK for AWS. It allows you to create, update, and delete AWS resources from your Python scripts. This library is preinstalled in this SageMaker Jupyter notebook. If yoou want to install this locally you can follow the instruction [here](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html). \n",
    "\n",
    "We will be using Amazon Comprehend, this service provides natural language processing, topic modeling, and Custom Classification capabilities, enabling a broad range of applications that can analyze text.\n",
    "\n",
    "* **Natural Language Processing**: Amazon Comprehend requests for Entity Recognition, Sentiment Analysis, Syntax Analysis, Key Phrase Extraction, and Language Detection.\n",
    "* **Topic Modeling**: Topic Modeling identifies relevant terms or topics from a collection of documents stored in Amazon S3. It will identify the most common topics in the collection and organize them in groups and then map which documents belong to which topic. \n",
    "* **Custom Comprehend**: The Custom Classification and Entities APIs can train a custom NLP model to categorize text and extract custom entities.\n",
    "\n",
    "In this workshop we are going to be get visualize sentiment of Yelp reviews from 2015 using a data set from the AWS Open Data Registry [Yelp Reviews NLP Fast.ai](https://registry.opendata.aws/fast-ai-nlp/). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install boto3 --upgrade"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### [Download Yelp Reviews](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-download-file.html) \n",
    "\n",
    "We will download the reviews from the Fast.ai NLP dataset available on the [AWS Open Data Registry](https://registry.opendata.aws/fast-ai-nlp/). We will be using the boto3 library to instatntiate an S3 bucket reference to download the Yelp reviews"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "\n",
    "open_data_bucket = 'fast-ai-nlp'\n",
    "s3 = boto3.resource('s3')\n",
    "\n",
    "s3.Bucket(open_data_bucket).download_file('yelp_review_full_csv.tgz', 'yelp_review_full_csv.tgz')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Untar Yelp Reviews\n",
    "\n",
    "There are two `csv` files in the tarball. One is called `train.csv`, the other is `test.csv`. If we were to leverage this data set to build an new AI/ML model we would have cleaned and split data sets for both train and test. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!tar -xvzf yelp_review_full_csv.tgz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For those interested, The `readme.txt` file describes in more details the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat yelp_review_full_csv/readme.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### View raw csv file\n",
    "\n",
    "We will use [Pandas](https://pandas.pydata.org/) to read the csv and view the data set. You will notice the data contains 2 unnamed columns for rating and review. The rating is between 1-5 and the review is a free form text field."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "pd.set_option('display.max_colwidth', -1)\n",
    "\n",
    "df = pd.read_csv('yelp_review_full_csv/train.csv', header=None)\n",
    "df.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have a reference to the Yelp reviews with a pandas DataFrame we can pass the first review to the Amazon Comprehend APIs. We will create a reference to the Amazon Comprhend service and call each API in the cells below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "comprehend = boto3.client(service_name='comprehend')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is all that is needed to start using Amazon Comprehend in Python. The below cells will execute on the NLP features available in Amazon Comprehend."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sentiment Analysis\n",
    "\n",
    "The Sentiment Analysis API returns the overall sentiment of a text (Positive, Negative, Neutral, or Mixed).\n",
    "\n",
    "We will use `pandas` to take a section of the json response and convert into a DataFrame to better see the result in this notebook. The raw json will look like below:\n",
    "\n",
    "```json\n",
    "{\n",
    "    'Sentiment': 'NEUTRAL', \n",
    "    'SentimentScore': {\n",
    "        'Positive': 0.00525312777608633, \n",
    "        'Negative': 0.0001172995544038713, \n",
    "        'Neutral': 0.9946269392967224, \n",
    "        'Mixed': 2.5949941573344404e-06}, \n",
    "    'ResponseMetadata': {\n",
    "        'RequestId': '7bebc984-9c38-4ddc-b041-4a0e5a061b6e', \n",
    "        'HTTPStatusCode': 200, \n",
    "        'HTTPHeaders': {\n",
    "            'x-amzn-requestid': '7bebc984-9c38-4ddc-b041-4a0e5a061b6e', \n",
    "            'content-type': 'application/x-amz-json-1.1', \n",
    "            'content-length': '164', \n",
    "            'date': 'Wed, 22 Jan 2020 19:27:13 GMT'\n",
    "        }, \n",
    "        'RetryAttempts': 0\n",
    "    }\n",
    "}\n",
    "```\n",
    "\n",
    "Feel free to experiment with your own text and see the results that are generated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.columns = ['rating', 'review']\n",
    "review = df.at[0, 'review']\n",
    "review"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas.io.json import json_normalize\n",
    "\n",
    "sentiment = comprehend.detect_sentiment(Text = review, LanguageCode= 'en')\n",
    "\n",
    "json_normalize(sentiment['SentimentScore'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Keyphrase Extraction\n",
    "\n",
    "The Keyphrase Extraction API returns the key phrases or talking points and a confidence score to support that this is a key phrase. In the resulting `KeyPhrases` object you will see the `BeginOffset` and `EndOffset` of the key phrase, the `Text` itself and `Score`. The `Score` is the level of confidence that Amazon Comprehend has in the accuracy of the detection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "keyPhrases = comprehend.detect_key_phrases(Text = review, LanguageCode = 'en')\n",
    "\n",
    "json_normalize(keyPhrases['KeyPhrases'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Entity Recognition\n",
    "\n",
    "The Entity Recognition API returns the named entities (\"People,\" \"Places,\" \"Locations,\" etc.) that are automatically categorized based on the provided text. We will use the same `review` from the other API calls for continuity but feel free to experiment with your own text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "namedEntities = comprehend.detect_entities(Text = review, LanguageCode = 'en')\n",
    "\n",
    "json_normalize(namedEntities['Entities'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Language Detection\n",
    "\n",
    "The Language Detection API automatically identifies text written in over 100 languages and returns the dominant language with a confidence score to support that a language is dominant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "language = comprehend.detect_dominant_language(Text = review)\n",
    "\n",
    "json_normalize(language['Languages'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Amazon Translate\n",
    "\n",
    "We will introduce you to another AI service to show how you can translate the text of the review into another language and detect a language other than English. To do so, much like Amazon Comprehend you provide a reference to the API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "translate = boto3.client('translate')\n",
    "\n",
    "result_de = translate.translate_text(Text=review, SourceLanguageCode=\"en\", TargetLanguageCode=\"de\")\n",
    "\n",
    "translated = result_de['TranslatedText']\n",
    "print(translated)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "translated_language = comprehend.detect_dominant_language(Text = translated)\n",
    "\n",
    "json_normalize(translated_language['Languages'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Comprehend Medical\n",
    "\n",
    "Medical Named Entity and Relationship Extraction (NERe)\n",
    "\n",
    "The Amazon Comprehend APIs are generalized models to be used in many areas. Amazon Comprehend Medical provides a more specific NLP model for medical information.\n",
    "\n",
    "The Medical NERe API returns the medical information such as medication, medical condition, test, treatment and procedures (TTP), anatomy, and Protected Health Information (PHI). It also identifies relationships between extracted sub-types associated to Medications and TTP. There is also contextual information provided as entity “traits” (negation, or if a diagnosis is a sign or symptom). The table below shows the extracted information with relevant sub-types and entity traits.\n",
    "\n",
    "To only extract PHI, you can use the Protected Health Information Data Identification (PHId) API.\n",
    "\n",
    "We will use the sample medical note below for this demo."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "medical_note = '''\n",
    "Sample Type / Medical Specialty: Lab Medicine - Pathology\n",
    "Sample Name: Lung Biopsy Pathology Report\n",
    "Description: Lung, wedge biopsy right lower lobe and resection right upper lobe. Lymph node, biopsy level 2 and 4 and biopsy level 7 subcarinal. PET scan demonstrated a mass in the right upper lobe and also a mass in the right lower lobe, which were also identified by CT scan.\n",
    "(Medical Transcription Sample Report)\n",
    "CLINICAL HISTORY: A 48-year-old smoker found to have a right upper lobe mass on chest x-ray and is being evaluated for chest pain. PET scan demonstrated a mass in the right upper lobe and also a mass in the right lower lobe, which were also identified by CT scan. The lower lobe mass was approximately 1 cm in diameter and the upper lobe mass was 4 cm to 5 cm in diameter. The patient was referred for surgical treatment.\n",
    "\n",
    "SPECIMEN:\n",
    "A. Lung, wedge biopsy right lower lobe\n",
    "B. Lung, resection right upper lobe\n",
    "C. Lymph node, biopsy level 2 and 4\n",
    "D. Lymph node, biopsy level 7 subcarinal\n",
    "\n",
    "FINAL DIAGNOSIS:\n",
    "A. Wedge biopsy of right lower lobe showing: Adenocarcinoma, Grade 2, Measuring 1 cm in diameter with invasion of the overlying pleura and with free resection margin.\n",
    "B. Right upper lobe lung resection showing: Adenocarcinoma, grade 2, measuring 4 cm in diameter with invasion of the overlying pleura and with free bronchial margin. Two (2) hilar lymph nodes with no metastatic tumor.\n",
    "C. Lymph node biopsy at level 2 and 4 showing seven (7) lymph nodes with anthracosis and no metastatic tumor.\n",
    "D. Lymph node biopsy, level 7 subcarinal showing (5) lymph nodes with anthracosis and no metastatic tumor.\n",
    "\n",
    "COMMENT: The morphology of the tumor seen in both lobes is similar and we feel that the smaller tumor involving the right lower lobe is most likely secondary to transbronchial spread from the main tumor involving the right upper lobe. This suggestion is supported by the fact that no obvious vascular or lymphatic invasion is demonstrated and adjacent to the smaller tumor, there is isolated nests of tumor cells within the air spaces. Furthermore, immunoperoxidase stain for Ck-7, CK-20 and TTF are performed on both the right lower and right upper lobe nodule. The immunohistochemical results confirm the lung origin of both tumors and we feel that the tumor involving the right lower lobe is due to transbronchial spread from the larger tumor nodule involving the right upper lobe.\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Detect PHI\n",
    "\n",
    "Inspects the clinical text for protected health information (PHI) entities and entity category, location, and confidence score on that information.\n",
    "\n",
    "If you are looking just PHI data you can use this call and find where the PHI is detected in the text with the `BeginOffset` and `EndOffset`, the `Score` on how accurate it has been detected, and the `Type` of PHI detected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "comprehend_medical = boto3.client('comprehendmedical')\n",
    "\n",
    "phi_response = comprehend_medical.detect_phi(Text = medical_note)\n",
    "\n",
    "json_normalize(phi_response['Entities'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Detect Entities\n",
    "\n",
    "The DetectEntities operation will detect medical entities in your text. It detects entities in the following categories:\n",
    "\n",
    "* ANATOMY\n",
    "* MEDICAL_CONDITION\n",
    "* MEDICATION\n",
    "* PROTECTED_HEALTH_INFORMATION\n",
    "* TEST_TREATMENT_PROCEDURE\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "entities_response = comprehend_medical.detect_entities(Text = medical_note)\n",
    "\n",
    "json_normalize(entities_response['Entities'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This workshop has given you more information about how you can leverage our NLP services in your applications without having to have a deep understanding of ML and provide great insight in your domains. If you would like to dive deeper into the other features of our AI services check out this [link](https://aws.amazon.com/machine-learning/ai-services/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}