{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Evaluate Model with Amazon SageMaker Processing Jobs and Scikit-Learn\n", "\n", "Often, distributed data processing frameworks such as Scikit-Learn are used to pre-process data sets in order to prepare them for training. \n", "\n", "In this notebook we'll use Amazon SageMaker Processing, and leverage the power of Scikit-Learn in a managed SageMaker environment to run our processing workload." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# NOTE: THIS NOTEBOOK WILL TAKE A 5-10 MINUTES TO COMPLETE.\n", "\n", "# PLEASE BE PATIENT." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](img/prepare_dataset_bert.png)\n", "\n", "![](img/processing.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "1. Setup Environment\n", "1. Setup Input Data\n", "1. Setup Output Data\n", "1. Build a Spark container for running the processing job\n", "1. Run the Processing Job using Amazon SageMaker\n", "1. Inspect the Processed Output Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Setup Environment\n", "\n", "Let's start by specifying:\n", "* The S3 bucket and prefixes that you use for training and model data. Use the default bucket specified by the Amazon SageMaker session.\n", "* The IAM role ARN used to give processing and training access to the dataset." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import sagemaker\n", "import boto3\n", "\n", "sess = sagemaker.Session()\n", "role = sagemaker.get_execution_role()\n", "bucket = sess.default_bucket()\n", "region = boto3.Session().region_name\n", "\n", "sm = boto3.Session().client(service_name=\"sagemaker\", region_name=region)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store -r training_job_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " training_job_name\n", " print(\"[OK]\")\n", "except NameError:\n", " print(\"+++++++++++++++++++++++++++++++\")\n", " print(\"[ERROR] Please run the notebooks in the previous TRAIN section before you continue.\")\n", " print(\"+++++++++++++++++++++++++++++++\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(training_job_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store -r raw_input_data_s3_uri" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " raw_input_data_s3_uri\n", "except NameError:\n", " print(\"++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\")\n", " print(\"[ERROR] Please run the notebooks in the PREPARE section before you continue.\")\n", " print(\"++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(raw_input_data_s3_uri)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store -r max_seq_length" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " max_seq_length\n", " print(\"[OK]\")\n", "except 
NameError:\n", " print(\"+++++++++++++++++++++++++++++++\")\n", " print(\"[ERROR] Please run the notebooks in the previous TRAIN section before you continue.\")\n", " print(\"+++++++++++++++++++++++++++++++\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(max_seq_length)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store -r experiment_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " experiment_name\n", " print(\"[OK]\")\n", "except NameError:\n", " print(\"+++++++++++++++++++++++++++++++\")\n", " print(\"[ERROR] Please run the notebooks in the previous TRAIN section before you continue.\")\n", " print(\"+++++++++++++++++++++++++++++++\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(experiment_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store -r trial_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " trial_name\n", " print(\"[OK]\")\n", "except NameError:\n", " print(\"+++++++++++++++++++++++++++++++\")\n", " print(\"[ERROR] Please run the notebooks in the previous TRAIN section before you continue.\")\n", " print(\"+++++++++++++++++++++++++++++++\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(trial_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(training_job_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sagemaker.tensorflow.estimator import TensorFlow\n", "\n", "describe_training_job_response = sm.describe_training_job(TrainingJobName=training_job_name)\n", "print(describe_training_job_response)" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_dir_s3_uri = describe_training_job_response[\"ModelArtifacts\"][\"S3ModelArtifacts\"].replace(\"model.tar.gz\", \"\")\n", "model_dir_s3_uri" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Run the Processing Job using Amazon SageMaker\n", "\n", "Next, use the Amazon SageMaker Python SDK to submit a processing job using our custom python script." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Create the `Experiment Config`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "experiment_config = {\n", " \"ExperimentName\": experiment_name,\n", " \"TrialName\": trial_name,\n", " \"TrialComponentDisplayName\": \"evaluate\",\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Set the Processing Job Hyper-Parameters " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "processing_instance_type = \"ml.m5.xlarge\"\n", "processing_instance_count = 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Choosing a `max_seq_length` for BERT\n", "Since a smaller `max_seq_length` leads to faster training and lower resource utilization, we want to find the smallest review length that captures `80%` of our reviews.\n", "\n", "Remember our distribution of review lengths from a previous section?\n", "\n", "```\n", "mean 51.683405\n", "std 107.030844\n", "min 1.000000\n", "10% 2.000000\n", "20% 7.000000\n", "30% 19.000000\n", "40% 22.000000\n", "50% 26.000000\n", "60% 32.000000\n", "70% 43.000000\n", "80% 63.000000\n", "90% 110.000000\n", "100% 5347.000000\n", "max 5347.000000\n", "```\n", "\n", "![](img/review_word_count_distribution.png)\n", "\n", "Review length `63` represents the `80th` percentile for this dataset. However, it's best to stick with powers-of-2 when using BERT. 
So let's choose `64` as this is the smallest power-of-2 greater than `63`. Reviews with length > `64` will be truncated to `64`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sagemaker.sklearn.processing import SKLearnProcessor\n", "\n", "processor = SKLearnProcessor(\n", " framework_version=\"0.23-1\",\n", " role=role,\n", " instance_type=processing_instance_type,\n", " instance_count=processing_instance_count,\n", " max_runtime_in_seconds=7200,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", "\n", "processor.run(\n", " code=\"evaluate_model_metrics.py\",\n", " inputs=[\n", " ProcessingInput(\n", " input_name=\"model-tar-s3-uri\", source=model_dir_s3_uri, destination=\"/opt/ml/processing/input/model/\"\n", " ),\n", " ProcessingInput(\n", " input_name=\"evaluation-data-s3-uri\",\n", " source=raw_input_data_s3_uri,\n", " destination=\"/opt/ml/processing/input/data/\",\n", " ),\n", " ],\n", " outputs=[\n", " ProcessingOutput(s3_upload_mode=\"EndOfJob\", output_name=\"metrics\", source=\"/opt/ml/processing/output/metrics\"),\n", " ],\n", " arguments=[\"--max-seq-length\", str(max_seq_length)],\n", " experiment_config=experiment_config,\n", " logs=True,\n", " wait=False,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "scikit_processing_job_name = processor.jobs[-1].describe()[\"ProcessingJobName\"]\n", "print(scikit_processing_job_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from IPython.core.display import display, HTML\n", "\n", "display(\n", " HTML(\n", " '<b>Review <a target=\"blank\" href=\"https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/{}\">Processing Job</a></b>'.format(\n", " region, scikit_processing_job_name\n", " )\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { 
"scrolled": true }, "outputs": [], "source": [ "from IPython.core.display import display, HTML\n", "\n", "display(\n", " HTML(\n", " 'Review CloudWatch Logs After About 5 Minutes'.format(\n", " region, scikit_processing_job_name\n", " )\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from IPython.core.display import display, HTML\n", "\n", "display(\n", " HTML(\n", " 'Review S3 Output Data After The Processing Job Has Completed'.format(\n", " bucket, scikit_processing_job_name, region\n", " )\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Monitor the Processing Job" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "running_processor = sagemaker.processing.ProcessingJob.from_processing_name(\n", " processing_job_name=scikit_processing_job_name, sagemaker_session=sess\n", ")\n", "\n", "processing_job_description = running_processor.describe()\n", "\n", "print(processing_job_description)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "processing_evaluation_metrics_job_name = processing_job_description[\"ProcessingJobName\"]\n", "print(processing_evaluation_metrics_job_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "\n", "running_processor.wait(logs=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# _Please Wait Until the ^^ Processing Job ^^ Completes Above._" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Inspect the Processed Output Data\n", "\n", "Take a look at a few rows of the transformed dataset to make sure the processing was successful." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "processing_job_description = running_processor.describe()\n", "\n", "output_config = processing_job_description[\"ProcessingOutputConfig\"]\n", "for output in output_config[\"Outputs\"]:\n", " if output[\"OutputName\"] == \"metrics\":\n", " processed_metrics_s3_uri = output[\"S3Output\"][\"S3Uri\"]\n", "\n", "print(processed_metrics_s3_uri)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!aws s3 ls $processed_metrics_s3_uri/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Show the test accuracy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "from pprint import pprint\n", "\n", "evaluation_json = sagemaker.s3.S3Downloader.read_file(\"{}/evaluation.json\".format(processed_metrics_s3_uri))\n", "\n", "pprint(json.loads(evaluation_json))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# !aws s3 cp $processed_metrics_s3_uri/confusion_matrix.png ./model_evaluation/\n", "\n", "# import time\n", "\n", "# time.sleep(10) # Slight delay for our notebook to recognize the newly-downloaded file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %%html\n", "\n", "# " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pass Variables to the Next Notebook(s)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store processing_evaluation_metrics_job_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%store processed_metrics_s3_uri" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Show the Experiment 
Tracking Lineage" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.analytics import ExperimentAnalytics\n", "\n", "import pandas as pd\n", "\n", "pd.set_option(\"max_colwidth\", 500)\n", "\n", "experiment_analytics = ExperimentAnalytics(\n", " sagemaker_session=sess, experiment_name=experiment_name, sort_by=\"CreationTime\", sort_order=\"Descending\"\n", ")\n", "\n", "experiment_analytics_df = experiment_analytics.dataframe()\n", "experiment_analytics_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trial_component_name = experiment_analytics_df.TrialComponentName[0]\n", "print(trial_component_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trial_component_description = sm.describe_trial_component(TrialComponentName=trial_component_name)\n", "trial_component_description" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.lineage.visualizer import LineageTableVisualizer\n", "\n", "lineage_table_viz = LineageTableVisualizer(sess)\n", "lineage_table_viz_df = lineage_table_viz.show(processing_job_name=processing_evaluation_metrics_job_name)\n", "lineage_table_viz_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Release Resources" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%html\n", "\n", "

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>\n", "\n", " \n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%javascript\n", "\n", "try {\n", " Jupyter.notebook.save_checkpoint();\n", " Jupyter.notebook.session.delete();\n", "}\n", "catch(err) {\n", " // NoOp\n", "}" ] } ], "metadata": { "instance_type": "ml.m5.2xlarge", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }