{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Evaluate Model with Amazon SageMaker Processing Jobs and Scikit-Learn\n", "\n", "Often, distributed data processing frameworks such as Scikit-Learn are used to pre-process data sets in order to prepare them for training. \n", "\n", "In this notebook we'll use Amazon SageMaker Processing, and leverage the power of Scikit-Learn in a managed SageMaker environment to run our processing workload." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# NOTE: THIS NOTEBOOK WILL TAKE A 5-10 MINUTES TO COMPLETE.\n", "\n", "# PLEASE BE PATIENT." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](img/prepare_dataset_bert.png)\n", "\n", "![](img/processing.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "1. Setup Environment\n", "1. Setup Input Data\n", "1. Setup Output Data\n", "1. Build a Spark container for running the processing job\n", "1. Run the Processing Job using Amazon SageMaker\n", "1. Inspect the Processed Output Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Setup Environment\n", "\n", "Let's start by specifying:\n", "* The S3 bucket and prefixes that you use for training and model data. Use the default bucket specified by the Amazon SageMaker session.\n", "* The IAM role ARN used to give processing and training access to the dataset." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import sagemaker\n", "import boto3\n", "\n", "sess = sagemaker.Session()\n", "role = sagemaker.get_execution_role()\n", "bucket = sess.default_bucket()\n", "region = boto3.Session().region_name\n", "\n", "sm = boto3.Session().client(service_name=\"sagemaker\", region_name=region)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store -r training_job_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " training_job_name\n", " print(\"[OK]\")\n", "except NameError:\n", " print(\"+++++++++++++++++++++++++++++++\")\n", " print(\"[ERROR] Please run the notebooks in the previous TRAIN section before you continue.\")\n", " print(\"+++++++++++++++++++++++++++++++\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(training_job_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store -r raw_input_data_s3_uri" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " raw_input_data_s3_uri\n", "except NameError:\n", " print(\"++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\")\n", " print(\"[ERROR] Please run the notebooks in the PREPARE section before you continue.\")\n", " print(\"++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(raw_input_data_s3_uri)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store -r max_seq_length" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " max_seq_length\n", " print(\"[OK]\")\n", "except 
NameError:\n", " print(\"+++++++++++++++++++++++++++++++\")\n", " print(\"[ERROR] Please run the notebooks in the previous TRAIN section before you continue.\")\n", " print(\"+++++++++++++++++++++++++++++++\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(max_seq_length)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store -r experiment_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " experiment_name\n", " print(\"[OK]\")\n", "except NameError:\n", " print(\"+++++++++++++++++++++++++++++++\")\n", " print(\"[ERROR] Please run the notebooks in the previous TRAIN section before you continue.\")\n", " print(\"+++++++++++++++++++++++++++++++\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(experiment_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store -r trial_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " trial_name\n", " print(\"[OK]\")\n", "except NameError:\n", " print(\"+++++++++++++++++++++++++++++++\")\n", " print(\"[ERROR] Please run the notebooks in the previous TRAIN section before you continue.\")\n", " print(\"+++++++++++++++++++++++++++++++\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(trial_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(training_job_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sagemaker.tensorflow.estimator import TensorFlow\n", "\n", "describe_training_job_response = sm.describe_training_job(TrainingJobName=training_job_name)\n", "print(describe_training_job_response)" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_dir_s3_uri = describe_training_job_response[\"ModelArtifacts\"][\"S3ModelArtifacts\"].replace(\"model.tar.gz\", \"\")\n", "model_dir_s3_uri" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Run the Processing Job using Amazon SageMaker\n", "\n", "Next, use the Amazon SageMaker Python SDK to submit a processing job using our custom python script." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Create the `Experiment Config`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "experiment_config = {\n", " \"ExperimentName\": experiment_name,\n", " \"TrialName\": trial_name,\n", " \"TrialComponentDisplayName\": \"evaluate\",\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Set the Processing Job Hyper-Parameters " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "processing_instance_type = \"ml.m5.xlarge\"\n", "processing_instance_count = 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Choosing a `max_seq_length` for BERT\n", "Since a smaller `max_seq_length` leads to faster training and lower resource utilization, we want to find the smallest review length that captures `80%` of our reviews.\n", "\n", "Remember our distribution of review lengths from a previous section?\n", "\n", "```\n", "mean 51.683405\n", "std 107.030844\n", "min 1.000000\n", "10% 2.000000\n", "20% 7.000000\n", "30% 19.000000\n", "40% 22.000000\n", "50% 26.000000\n", "60% 32.000000\n", "70% 43.000000\n", "80% 63.000000\n", "90% 110.000000\n", "100% 5347.000000\n", "max 5347.000000\n", "```\n", "\n", "![](img/review_word_count_distribution.png)\n", "\n", "Review length `63` represents the `80th` percentile for this dataset. However, it's best to stick with powers-of-2 when using BERT. 
So let's choose `64` as this is the smallest power-of-2 greater than `63`. Reviews with length > `64` will be truncated to `64`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sagemaker.sklearn.processing import SKLearnProcessor\n", "\n", "processor = SKLearnProcessor(\n", " framework_version=\"0.23-1\",\n", " role=role,\n", " instance_type=processing_instance_type,\n", " instance_count=processing_instance_count,\n", " max_runtime_in_seconds=7200,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", "\n", "processor.run(\n", " code=\"evaluate_model_metrics.py\",\n", " inputs=[\n", " ProcessingInput(\n", " input_name=\"model-tar-s3-uri\", source=model_dir_s3_uri, destination=\"/opt/ml/processing/input/model/\"\n", " ),\n", " ProcessingInput(\n", " input_name=\"evaluation-data-s3-uri\",\n", " source=raw_input_data_s3_uri,\n", " destination=\"/opt/ml/processing/input/data/\",\n", " ),\n", " ],\n", " outputs=[\n", " ProcessingOutput(s3_upload_mode=\"EndOfJob\", output_name=\"metrics\", source=\"/opt/ml/processing/output/metrics\"),\n", " ],\n", " arguments=[\"--max-seq-length\", str(max_seq_length)],\n", " experiment_config=experiment_config,\n", " logs=True,\n", " wait=False,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "scikit_processing_job_name = processor.jobs[-1].describe()[\"ProcessingJobName\"]\n", "print(scikit_processing_job_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from IPython.core.display import display, HTML\n", "\n", "display(\n", " HTML(\n", " '<b>Review <a target=\"blank\" href=\"https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/{}\">Processing Job</a></b>'.format(\n", " region, scikit_processing_job_name\n", " )\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { 
"scrolled": true }, "outputs": [], "source": [ "from IPython.core.display import display, HTML\n", "\n", "display(\n", " HTML(\n", " 'Review CloudWatch Logs After About 5 Minutes'.format(\n", " region, scikit_processing_job_name\n", " )\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from IPython.core.display import display, HTML\n", "\n", "display(\n", " HTML(\n", " 'Review S3 Output Data After The Processing Job Has Completed'.format(\n", " bucket, scikit_processing_job_name, region\n", " )\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Monitor the Processing Job" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "running_processor = sagemaker.processing.ProcessingJob.from_processing_name(\n", " processing_job_name=scikit_processing_job_name, sagemaker_session=sess\n", ")\n", "\n", "processing_job_description = running_processor.describe()\n", "\n", "print(processing_job_description)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "processing_evaluation_metrics_job_name = processing_job_description[\"ProcessingJobName\"]\n", "print(processing_evaluation_metrics_job_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "\n", "running_processor.wait(logs=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# _Please Wait Until the ^^ Processing Job ^^ Completes Above._" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Inspect the Processed Output Data\n", "\n", "Take a look at a few rows of the transformed dataset to make sure the processing was successful." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "processing_job_description = running_processor.describe()\n", "\n", "output_config = processing_job_description[\"ProcessingOutputConfig\"]\n", "for output in output_config[\"Outputs\"]:\n", " if output[\"OutputName\"] == \"metrics\":\n", " processed_metrics_s3_uri = output[\"S3Output\"][\"S3Uri\"]\n", "\n", "print(processed_metrics_s3_uri)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!aws s3 ls $processed_metrics_s3_uri/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Show the test accuracy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "from pprint import pprint\n", "\n", "evaluation_json = sagemaker.s3.S3Downloader.read_file(\"{}/evaluation.json\".format(processed_metrics_s3_uri))\n", "\n", "pprint(json.loads(evaluation_json))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# !aws s3 cp $processed_metrics_s3_uri/confusion_matrix.png ./model_evaluation/\n", "\n", "# import time\n", "\n", "# time.sleep(10) # Slight delay for our notebook to recognize the newly-downloaded file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %%html\n", "\n", "# " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pass Variables to the Next Notebook(s)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store processing_evaluation_metrics_job_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%store processed_metrics_s3_uri" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Show the Experiment 
Tracking Lineage" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.analytics import ExperimentAnalytics\n", "\n", "import pandas as pd\n", "\n", "pd.set_option(\"max_colwidth\", 500)\n", "\n", "experiment_analytics = ExperimentAnalytics(\n", " sagemaker_session=sess, experiment_name=experiment_name, sort_by=\"CreationTime\", sort_order=\"Descending\"\n", ")\n", "\n", "experiment_analytics_df = experiment_analytics.dataframe()\n", "experiment_analytics_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trial_component_name = experiment_analytics_df.TrialComponentName[0]\n", "print(trial_component_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trial_component_description = sm.describe_trial_component(TrialComponentName=trial_component_name)\n", "trial_component_description" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.lineage.visualizer import LineageTableVisualizer\n", "\n", "lineage_table_viz = LineageTableVisualizer(sess)\n", "lineage_table_viz_df = lineage_table_viz.show(processing_job_name=processing_evaluation_metrics_job_name)\n", "lineage_table_viz_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Release Resources" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%html\n", "\n", "

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>\n", "\n", " \n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%javascript\n", "\n", "try {\n", " Jupyter.notebook.save_checkpoint();\n", " Jupyter.notebook.session.delete();\n", "}\n", "catch(err) {\n", " // NoOp\n", "}" ] } ], "metadata": { "instance_type": "ml.m5.2xlarge", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }