{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Amazon Comprehend Custom Classification\n",
"\n",
"This notebook will serve as a template for the overall process of taking a text dataset and integrating it into [Amazon Comprehend Custom Classification](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html) and perform NLP for custom classification.\n",
"\n",
"## Overview\n",
"\n",
"1. [Introduction to Amazon Comprehend Custom Classification](#Introduction)\n",
"1. [Obtaining Your Data](#data)\n",
"1. [Pre-processing data](#preprocess)\n",
"1. [Building Custom Classification model](#build)\n",
"1. [Real time inference](#inference)\n",
"1. [Cleanup](#cleanup)\n",
"\n",
"\n",
"## Introduction to Amazon Comprehend Custom Classification \n",
"\n",
"If you are not familiar with Amazon Comprehend Custom Classification you can learn more about this tool on these pages:\n",
"\n",
"* [Product Page](https://aws.amazon.com/comprehend/)\n",
"* [Product Docs](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html)\n",
"\n",
"## Training a custom classifier\n",
"\n",
"Custom classification is a two-step process. First, you train a custom classifier to recognize the classes that are of interest to you. Then you send unlabeled documents to be classified.\n",
"\n",
"To train the classifier, specify the options you want, and send Amazon Comprehend documents to be used as training material. Based on the options you indicated, Amazon Comprehend creates a custom ML model that it trains based on the documents you provided. This custom model (the classifier) examines each document you submit. It then returns either the specific class that best represents the content (if you're using multi-class mode) or the set of classes that apply to it (if you're using multi-label mode).\n",
"\n",
"We are going to use a Hugging Face pre-canned dataset of customer reviews and use the multi-class mode. We ensure that dataset is a .csv and the format of the file must be one class and document per line. For example:\n",
"```\n",
"CLASS,Text of document 1\n",
"CLASS,Text of document 2\n",
"CLASS,Text of document 3\n",
"```\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install Hugging Face datasets package\n",
"!pip --disable-pip-version-check install datasets --quiet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the datasets installed, now we will import the Pandas library as well as a few other data science tools in order to inspect the information."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"import time\n",
"import uuid\n",
"import boto3\n",
"import pprint\n",
"import string\n",
"import random\n",
"import datetime \n",
"import subprocess\n",
"import numpy as np\n",
"import pandas as pd\n",
"from time import sleep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets load the data in to dataframe and look at the data we uploaded. Examine the number of columns that are present. Look at few samples to see the content of the data. **This will take 5 to 8 minutes**.\n",
"\n",
"**Note:** CTA means call to action. No CTA means no call to action. This is a metric to determine if the customer's concern was addressed by the agent during the call. A CTA indicates that the customer is satisfied that their concerns has been or will be addressed by the company."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datasets import load_dataset\n",
"dataset = load_dataset('amazon_us_reviews', 'Electronics_v1_00', split='train[:10%]')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataset.set_format(type='pandas')\n",
"df = dataset[:1000]"
]
},
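{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before selecting columns, a quick look at the shape, the column names, and a few sample rows shows what the dataset contains."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect the dataframe: number of rows and columns, column names, and a few sample reviews\n",
"print(df.shape)\n",
"print(df.columns.tolist())\n",
"df.head(3)"
]
},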
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To convert data to the format that is required by Amazon Comprehend Custom Classifier,\n",
"\n",
"```\n",
"CLASS,Text of document 1\n",
"CLASS,Text of document 2\n",
"CLASS,Text of document 3\n",
"```\n",
"We will identify the column which are class and which have the text content we would like to train on, we can create a new dataframe with selected columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df1 = df[['star_rating','review_body']]\n",
"df1 = df1.rename(columns={\"review_body\": \"text\", \"star_rating\": \"class\"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will translate the customer product ratings to CTA (call-to-action) and No CTA (no call-to-action). All ratings from 3 and above are considerd as CTA (customer is satisfied) with 1 and 2 considered as No CTA (customer is not satisfied)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df1.loc[df1['class'] >= 3, 'class'] = 'CTA'\n",
"df1.loc[df1['class'] != 'CTA', 'class'] = 'No CTA'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remove all punctuation from the text"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import string\n",
"for i,row in df1.iterrows():\n",
" a = row['text'].strip(string.punctuation)\n",
" df1.loc[i,'text'] = a"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df1.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df1['class'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pre-processing data \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For training, the file format must conform with the [following](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification-training.html):\n",
"\n",
"- File must contain one label and one text per line – 2 columns\n",
"- No header\n",
"- Format UTF-8, carriage return “\\n”.\n",
"\n",
"Labels “must be uppercase, can be multitoken, have whitespace, consist of multiple words connect by underscores or hyphens or may even contain a comma in it, as long as it is correctly escaped.”\n",
"\n",
"For the inference part of it - when you want your custom model to determine which label corresponds to a given text -, the file format must conform with the following:\n",
"\n",
"- File must contain text per line\n",
"- No header\n",
"- Format UTF-8, carriage return “\\n”."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this point we have all the data the 2 needed files. \n",
"\n",
"### Building The Target Train and Test Files\n",
"\n",
"With all of the above spelled out the next thing to do is to build training file:\n",
"\n",
"1. `comprehend-train.csv` - A CSV file containing 2 columns without header, first column class, second column text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"DSTTRAINFILE='comprehend-train.csv'\n",
"\n",
"df1.to_csv(path_or_buf=DSTTRAINFILE,\n",
" header=False,\n",
" index=False)"
]
},
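{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (a minimal sketch using pandas), we can read the file back without a header and confirm it has exactly the two columns Comprehend expects."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: re-read the training file with no header and verify the 2-column layout\n",
"check_df = pd.read_csv(DSTTRAINFILE, header=None)\n",
"print('Rows, columns:', check_df.shape)\n",
"check_df.head(3)"
]
},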
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train an Amazon Comprehend custom classifier\n",
"Now that all of the required data to get started exists, we can start working on Comprehend Custom Classfier. \n",
"\n",
"The custom classifier workload is built in two steps:\n",
"\n",
"1. Training the custom model – no particular machine learning or deep learning knowledge is necessary\n",
"1. Classifying new data\n",
"\n",
"Lets follow below steps for Training the custom model:\n",
"\n",
"1. Specify the bucket name that was pre-created for you that will host training data artifacts and production results. \n",
"1. Configure an IAM role allowing Comprehend to [access newly created buckets](https://docs.aws.amazon.com/comprehend/latest/dg/access-control-managing-permissions.html#auth-role-permissions)\n",
"1. Prepare data for training\n",
"1. Upload training data in the S3 bucket\n",
"1. Launch a “Train Classifier” job from the console: “Amazon Comprehend” > “Custom Classification” > “Train Classifier”\n",
"1. Prepare data for classification (one text per line, no header, same format as training data). Some more details [here](https://docs.aws.amazon.com/comprehend/latest/dg/how-class-run.html)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get notebook's region\n",
"region = boto3.Session().region_name\n",
"print(region)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Configure your AWS APIs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sagemaker\n",
"\n",
"s3 = boto3.client('s3')\n",
"comprehend = boto3.client('comprehend')\n",
"role = sagemaker.get_execution_role()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Specify an Amazon s3 bucket that will host training data and test data. **Note:** This bucket should have been created already for you. Please go the Amazon S3 console to verify the bucket is present. It should start with `aim317...`. **Specify your bucket name in the cell below**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bucket = '' # Provide your bucket name here\n",
"prefix = 'comprehend-custom-classifier' # you can leave this as it is\n",
"\n",
"try:\n",
" s3.head_bucket(Bucket=bucket)\n",
"except:\n",
" print(\"The S3 bucket name {} you entered seems to be incorrect, please try again\".format(bucket))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Uploading the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s3.upload_file(DSTTRAINFILE, bucket, prefix+'/' + DSTTRAINFILE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Building Custom Classification model \n",
"\n",
"Launch the classifier training:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s3_train_data = 's3://{}/{}/{}'.format(bucket, prefix, DSTTRAINFILE)\n",
"s3_output_job = 's3://{}/{}/{}'.format(bucket, prefix, 'output/train_job')\n",
"print('training data location: ',s3_train_data, \"output location:\", s3_output_job)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"uid = uuid.uuid4()\n",
"\n",
"training_job = comprehend.create_document_classifier(\n",
" DocumentClassifierName='aim317-cc-' + str(uid),\n",
" DataAccessRoleArn=role,\n",
" InputDataConfig={\n",
" 'S3Uri': s3_train_data\n",
" },\n",
" OutputDataConfig={\n",
" 'S3Uri': s3_output_job\n",
" },\n",
" LanguageCode='en',\n",
" VersionName='v001'\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check training status in Amazon Comprehend console\n",
"\n",
"[Go to Amazon Comprehend Console](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#classification)\n",
"\n",
"This will take approximately 30 minutes. Go to the **Classifier Metrics** step below after the classifier has been created and is ready for use. Running the cells prior to classifier being ready, will throw an error. Simply re-execute the cell again after the classifier is ready."
]
},
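{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to wait inside the notebook rather than the console, the optional sketch below polls the training job until it reaches a terminal status. It assumes the `training_job` response from the cell above and checks roughly once a minute."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: poll the classifier status from the notebook until training completes (roughly 30 minutes)\n",
"while True:\n",
"    status = comprehend.describe_document_classifier(\n",
"        DocumentClassifierArn=training_job['DocumentClassifierArn']\n",
"    )['DocumentClassifierProperties']['Status']\n",
"    print(datetime.datetime.now(), status)\n",
"    if status in ('TRAINED', 'IN_ERROR'):\n",
"        break\n",
"    time.sleep(60)"
]
},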
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classifier Metrics"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = comprehend.describe_document_classifier(\n",
" DocumentClassifierArn=training_job['DocumentClassifierArn']\n",
")\n",
"print(response['DocumentClassifierProperties']['ClassifierMetadata']['EvaluationMetrics'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Real time inference \n",
"We will now use a custom classifier real time endpoint to detect if the audio transcripts and translated text contain indication of there is a clear CTA or not. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create endpoint"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model_arn = response[\"DocumentClassifierProperties\"][\"DocumentClassifierArn\"]\n",
"print('Model used for real time endpoint ' + model_arn)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let's create an endpoint with 4 Inference Units to account for us sending approximately 400 characters per second to the endpoint\n",
"\n",
"create_endpoint_response = comprehend.create_endpoint(\n",
" EndpointName='aim317-cc-ep',\n",
" ModelArn=model_arn,\n",
" DesiredInferenceUnits=4,\n",
" \n",
")\n",
"\n",
"print(create_endpoint_response['EndpointArn'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check endpoint status in Amazon Comprehend console\n",
"\n",
"[Go to Amazon Comprehend Console](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#endpoints)\n",
"\n",
"This will take approximately 10 minutes. Go to the **Run Inference** step below after the classifier has been created and is ready for use. Running the cells prior to classifier being ready, will lock the cell. This will presume only after classifier has been trained."
]
},
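{
"cell_type": "markdown",
"metadata": {},
"source": [
"As with training, you can optionally poll the endpoint from the notebook; the sketch below waits until the endpoint reports `IN_SERVICE` before moving on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: wait for the real time endpoint to become IN_SERVICE (roughly 10 minutes)\n",
"while True:\n",
"    ep_status = comprehend.describe_endpoint(\n",
"        EndpointArn=create_endpoint_response['EndpointArn']\n",
"    )['EndpointProperties']['Status']\n",
"    print(datetime.datetime.now(), ep_status)\n",
"    if ep_status in ('IN_SERVICE', 'FAILED'):\n",
"        break\n",
"    time.sleep(30)"
]
},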
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Run Inference\n",
"\n",
"Lets review the list of files ready for inference in the `comprehend/input` folder of our S3 bucket. These files were created by the notebook available in `1-Transcribe-Translate-Calls`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Input files ready for classification\n",
"!aws s3 ls s3://{bucket}/comprehend/input/"
]
},
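{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before processing every transcript, a quick smoke test with a single made-up sentence (a hypothetical example, not taken from the dataset) shows the shape of the `classify_document` response that the loop below relies on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Smoke test with a hypothetical sentence: inspect the Classes returned by the endpoint\n",
"sample_text = \"Thank you for resolving my issue so quickly, I am very happy with the support.\"\n",
"sample_response = comprehend.classify_document(\n",
"    Text=sample_text,\n",
"    EndpointArn=create_endpoint_response['EndpointArn']\n",
")\n",
"print(json.dumps(sample_response['Classes'], indent=2))"
]
},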
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Prepare to page through our transcripts in S3\n",
"\n",
"# Define the S3 handles\n",
"s3 = boto3.client('s3')\n",
"s3_resource = boto3.resource('s3')\n",
"\n",
"\n",
"# We will be merging the classifier predictions with the transcript segments we created for quicksight in 1-Transcribe-Translate\n",
"t_prefix = 'quicksight/data/cta'\n",
"\n",
"\n",
"# Lets define the bucket name that contains the transcripts first\n",
"# So far we used a session bucket we created for training and testing the classifier\n",
"\n",
"paginator = s3.get_paginator('list_objects_v2')\n",
"pages = paginator.paginate(Bucket=bucket, Prefix='comprehend/input')\n",
"a = []\n",
"\n",
"\n",
"# We will define a DataFrame to store the results of the classifier\n",
"cols = ['transcript_name', 'cta_status']\n",
"df_class = pd.DataFrame(columns=cols)\n",
"\n",
"# Now lets page through the transcripts\n",
"for page in pages:\n",
" for obj in page['Contents']:\n",
" cta = ''\n",
" # get the transcript file name\n",
" transcript_file_name = obj['Key'].split('/')[2]\n",
" # now lets get the transcript file contents\n",
" temp = s3_resource.Object(bucket, obj['Key'])\n",
" transcript_content = temp.get()['Body'].read().decode('utf-8')\n",
" # Send the last few sentence(s) for classification\n",
" transcript_truncated = transcript_content[1500:1900]\n",
" # Call Comprehend to classify input text\n",
" response = comprehend.classify_document(Text=transcript_truncated, EndpointArn=create_endpoint_response['EndpointArn'])\n",
" # Now we need to determine which of the two classes has the higher confidence score\n",
" # Use the name for that score as our predicted label\n",
" a = response['Classes']\n",
" # We will use this temp DataFrame to extract the class with maximum confidence level for CTA\n",
" tempcols = ['Name', 'Score']\n",
" df_temp = pd.DataFrame(columns=tempcols)\n",
" for i in range(0, 2):\n",
" df_temp.loc[len(df_temp.index)] = [a[i]['Name'], a[i]['Score']]\n",
" cta = df_temp.iloc[df_temp.Score.argmax(), 0:2]['Name']\n",
" \n",
" # Update the results DataFrame with the cta predicted label\n",
" # Create a CSV file with cta label from this DataFrame\n",
" df_class.loc[len(df_class.index)] = [transcript_file_name.strip('en-').strip('.txt'), cta] \n",
"\n",
"df_class.to_csv('s3://' + bucket + '/' + t_prefix + '/' + 'cta_status.csv', index=False)\n",
"df_class"
]
},
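{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cleanup\n",
"\n",
"When you have finished with this part of the workshop, you can delete the real time endpoint and the custom classifier to avoid ongoing charges. The sketch below assumes the resources created earlier in this notebook; the endpoint must finish deleting before the classifier can be removed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Delete the real time endpoint first, then the custom classifier once the endpoint is gone\n",
"comprehend.delete_endpoint(EndpointArn=create_endpoint_response['EndpointArn'])\n",
"\n",
"while True:\n",
"    try:\n",
"        status = comprehend.describe_endpoint(\n",
"            EndpointArn=create_endpoint_response['EndpointArn']\n",
"        )['EndpointProperties']['Status']\n",
"        print('Endpoint status:', status)\n",
"        time.sleep(30)\n",
"    except comprehend.exceptions.ResourceNotFoundException:\n",
"        print('Endpoint deleted')\n",
"        break\n",
"\n",
"comprehend.delete_document_classifier(\n",
"    DocumentClassifierArn=training_job['DocumentClassifierArn']\n",
")"
]
},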
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### End of notebook\n",
"Please go back to the workshop instructions to continue to the next step"
]
}
],
"metadata": {
"instance_type": "ml.t3.medium",
"interpreter": {
"hash": "9ddb102edfbd95000dbbd260d8bbcf82701cc06b4dcf114fa04ba84aab75adcb"
},
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}