{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using SageMaker debugger to monitor autoencoder model training\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-debugger|model_specific_realtime_analysis|autoencoder_mnist|autoencoder_mnist.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This notebook will train a convolutional autoencoder model on MNIST dataset and use SageMaker debugger to monitor key metrics in realtime. An autoencoder consists of an encoder that downsamples input data and a decoder that tries to reconstruct the original input. In this notebook we will use an autoencoder with the following architecture:\n", "```\n", "--------------------------------------------------------------------------------\n", " Layer (type) Output Shape Param #\n", "================================================================================\n", " Input (1, 1, 28, 28) 0\n", " Activation-1 0\n", " Activation-2 (1, 32, 24, 24) 0\n", " Conv2D-3 (1, 32, 24, 24) 832\n", " MaxPool2D-4 (1, 32, 12, 12) 0\n", " Activation-5 0\n", " Activation-6 (1, 32, 8, 8) 0\n", " Conv2D-7 (1, 32, 8, 8) 25632\n", " MaxPool2D-8 (1, 32, 4, 4) 0\n", " Dense-9 (1, 20) 10260\n", " Activation-10 0\n", " Activation-11 (1, 512) 0\n", " Dense-12 (1, 512) 10752\n", " HybridLambda-13 (1, 32, 8, 8) 0\n", " Activation-14 0\n", " Activation-15 (1, 32, 12, 12) 0\n", " Conv2DTranspose-16 (1, 32, 12, 12) 25632\n", " HybridLambda-17 (1, 32, 24, 24) 0\n", " Activation-18 0\n", " Activation-19 (1, 1, 28, 28) 0\n", " Conv2DTranspose-20 (1, 1, 28, 28) 801\n", "ConvolutionalAutoencoder-21 (1, 1, 28, 28) 0\n", "================================================================================\n", "Parameters in forward computation graph, duplicate included\n", " Total params: 73909\n", " Trainable params: 73909\n", " Non-trainable params: 0\n", "Shared params in forward computation graph: 0\n", "Unique parameters in model: 73909\n", "--------------------------------------------------------------------------------\n", "```\n", "\n", "The bottleneck layer forces the autoencoder to learn a compressed representation (latent variables) of the dataset. Visualizing the latent space helps to understand what the autoencoder is learning. We can check if the model is training well by checking \n", "- reconstructed images (autoencoder output)\n", "- t-Distributed Stochastic Neighbor Embedding (t-SNE) of the latent variables \n", "\n", "t-SNE maps high dimensional data into a 2- or 3-dimensional space. Following animation shows those emebeddings of latent variables while the training progresses. Each cluster represents a class (0-9) of the MNIST training dataset. Over time the autoencoder becomes better in separating those classes. \n", "\"drawing\" " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training MXNet autoencoder model in Amazon SageMaker with debugger \n", "Before starting the SageMaker training job, we need to install some libraries. We will use `smdebug` library to read, filter and analyze raw tensors that are stored in Amazon S3. We install `seaborn` library that will be used later on to plot t-Distributed Stochastic Neighbor Embedding (t-SNE) of the latent variables." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pip\n", "\n", "\n", "def import_or_install(package):\n", " try:\n", " __import__(package)\n", " except ImportError:\n", " ! pip install $package" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import_or_install(\"smdebug\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import_or_install(\"seaborn\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we define the MXNet estimator and the debugger hook configuration. The model training is implemented in the entry point script `autoencoder_mnist.py`. We will obtain tensors every 10th iteration and store them in the SageMaker default bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "from sagemaker.mxnet import MXNet\n", "from sagemaker.debugger import DebuggerHookConfig, CollectionConfig\n", "\n", "sagemaker_session = sagemaker.Session()\n", "BUCKET_NAME = sagemaker_session.default_bucket()\n", "LOCATION_IN_BUCKET = \"smdebug-autoencoder-example\"\n", "\n", "s3_bucket_for_tensors = \"s3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}\".format(\n", " BUCKET_NAME=BUCKET_NAME, LOCATION_IN_BUCKET=LOCATION_IN_BUCKET\n", ")\n", "estimator = MXNet(\n", " role=sagemaker.get_execution_role(),\n", " base_job_name=\"mxnet\",\n", " train_instance_count=1,\n", " train_instance_type=\"ml.m5.xlarge\",\n", " train_volume_size=400,\n", " source_dir=\"src\",\n", " entry_point=\"autoencoder_mnist.py\",\n", " framework_version=\"1.6.0\",\n", " py_version=\"py3\",\n", " debugger_hook_config=DebuggerHookConfig(\n", " s3_output_path=s3_bucket_for_tensors,\n", " collection_configs=[\n", " CollectionConfig(\n", " name=\"all\",\n", " parameters={\n", " \"include_regex\": \".*convolutionalautoencoder0_hybridsequential0_dense0_output_0|.*convolutionalautoencoder0_input_1|.*loss\",\n", " \"save_interval\": \"10\",\n", " },\n", " )\n", " ],\n", " ),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start the training job:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimator.fit(wait=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check the S3 location of tensors:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path = estimator.latest_job_debugger_artifacts_path()\n", "print(\"Tensors are stored in: {}\".format(path))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the training job name:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "job_name = estimator.latest_training_job.name\n", "print(\"Training job name: {}\".format(job_name))\n", "\n", "client = estimator.sagemaker_session.sagemaker_client\n", "\n", "description = client.describe_training_job(TrainingJobName=job_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can access the tensors from S3 once the training job is in status `Training` or `Completed`. In the following code cell we check the job status." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "if description[\"TrainingJobStatus\"] != \"Completed\":\n", " while description[\"SecondaryStatus\"] not in {\"Training\", \"Completed\"}:\n", " description = client.describe_training_job(TrainingJobName=job_name)\n", " primary_status = description[\"TrainingJobStatus\"]\n", " secondary_status = description[\"SecondaryStatus\"]\n", " print(\n", " \"Current job status: [PrimaryStatus: {}, SecondaryStatus: {}]\".format(\n", " primary_status, secondary_status\n", " )\n", " )\n", " time.sleep(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get tensors and visualize model training in real-time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we will retrieve the tensors from the bottlneck layer and input/output tensors while the model is still training. Once we have the tensors, we will compute t-SNE and plot the results.\n", "\n", "Helper function to compute stochastic neighbor embeddings:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.manifold import TSNE\n", "\n", "\n", "def compute_tsne(tensors, labels):\n", " # compute TSNE\n", " tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)\n", " tsne_results = tsne.fit_transform(tensors)\n", "\n", " # add results to dictionary\n", " data = {}\n", " data[\"x\"] = tsne_results[:, 0]\n", " data[\"y\"] = tsne_results[:, 1]\n", " data[\"z\"] = labels\n", "\n", " return data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Helper function to plot t-SNE results and autoencoder input/output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "\n", "def plot_autoencoder_data(tsne_results, input_tensor, output_tensor):\n", " fig, (ax0, ax1, ax2) = plt.subplots(\n", " ncols=3, figsize=(30, 15), gridspec_kw={\"width_ratios\": [1, 1, 3]}\n", " )\n", " plt.rcParams.update({\"font.size\": 20})\n", " ax0.imshow(input_tensor, cmap=plt.cm.gray)\n", " ax1.imshow(output_tensor, cmap=plt.cm.gray)\n", " ax0.set_axis_off()\n", " ax1.set_axis_off()\n", " ax2.set_axis_off()\n", " ax0.set_title(\"autoencoder input\")\n", " ax1.set_title(\"autoencoder output\")\n", " plt.title(\"Step \" + str(step))\n", " sns.scatterplot(\n", " x=\"x\", y=\"y\", hue=\"z\", data=tsne_results, palette=\"viridis\", legend=\"full\", s=100\n", " )\n", " plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)\n", " plt.axis(\"off\")\n", " plt.show()\n", " plt.clf()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create trial: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from smdebug.trials import create_trial\n", "\n", "trial = create_trial(estimator.latest_job_debugger_artifacts_path())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get available steps" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "steps = 0\n", "while steps == 0:\n", " steps = trial.steps()\n", " print(\"Waiting for tensors to become available...\")\n", " time.sleep(3)\n", "print(\"\\nDone\")\n", "\n", "print(\"Getting tensors...\")\n", "rendered_steps = []" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To determine how well the autoencoder is training, we will get the following tensors:\n", "- **Dense layer:** we will compute the t-distributed stochastic neighbor embeddings (t-SNE) of the tensors retrieved from the bottleneck layer.\n", "- **Input label:** will be used to mark the embeddigns. Emebeddings with the same label should be in the same clsuter. \n", "- **Autoencoder input and output:** to determine the reconstruction performance of the autoencoder." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "label_input = \"convolutionalautoencoder0_input_1\"\n", "autoencoder_bottleneck = \"convolutionalautoencoder0_hybridsequential0_dense0_output_0\"\n", "autoencoder_input = \"l2loss0_input_1\"\n", "autoencoder_output = \"l2loss0_input_0\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Following code cell iterates over available steps, retrieves the tensors and computes t-SNE. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from smdebug.exceptions import TensorUnavailableForStep\n", "from smdebug.mxnet import modes\n", "\n", "loaded_all_steps = False\n", "while not loaded_all_steps:\n", " # get available steps\n", " loaded_all_steps = trial.loaded_all_steps\n", " steps = trial.steps(mode=modes.EVAL)\n", "\n", " # quick way to get diff between two lists\n", " steps_to_render = list(set(steps).symmetric_difference(set(rendered_steps)))\n", "\n", " tensors = []\n", " labels = []\n", "\n", " # iterate over available steps\n", " for step in sorted(steps_to_render):\n", " try:\n", " if len(tensors) > 1000:\n", " tensors = []\n", " labels = []\n", "\n", " # get tensor from bottleneck layer and label\n", " tensor = trial.tensor(autoencoder_bottleneck).value(step_num=step, mode=modes.EVAL)\n", " label = trial.tensor(label_input).value(step_num=step, mode=modes.EVAL)\n", " for batch in range(tensor.shape[0]):\n", " tensors.append(tensor[batch, :])\n", " labels.append(label[batch])\n", "\n", " # compute tsne\n", " tsne_results = compute_tsne(tensors, labels)\n", "\n", " # get autoencoder input and output\n", " input_tensor = trial.tensor(autoencoder_input).value(step_num=step, mode=modes.EVAL)[\n", " 0, 0, :, :\n", " ]\n", " output_tensor = trial.tensor(autoencoder_output).value(step_num=step, mode=modes.EVAL)[\n", " 0, 0, :, :\n", " ]\n", "\n", " # plot results\n", " plot_autoencoder_data(tsne_results, input_tensor, output_tensor)\n", "\n", " except TensorUnavailableForStep:\n", " print(\"Tensor unavilable for step {}\".format(step))\n", "\n", " rendered_steps.extend(steps_to_render)\n", "\n", " time.sleep(5)\n", "\n", "print(\"\\nDone\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-debugger|model_specific_realtime_analysis|autoencoder_mnist|autoencoder_mnist.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-debugger|model_specific_realtime_analysis|autoencoder_mnist|autoencoder_mnist.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-debugger|model_specific_realtime_analysis|autoencoder_mnist|autoencoder_mnist.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-debugger|model_specific_realtime_analysis|autoencoder_mnist|autoencoder_mnist.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-debugger|model_specific_realtime_analysis|autoencoder_mnist|autoencoder_mnist.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-debugger|model_specific_realtime_analysis|autoencoder_mnist|autoencoder_mnist.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-debugger|model_specific_realtime_analysis|autoencoder_mnist|autoencoder_mnist.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-debugger|model_specific_realtime_analysis|autoencoder_mnist|autoencoder_mnist.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-debugger|model_specific_realtime_analysis|autoencoder_mnist|autoencoder_mnist.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-debugger|model_specific_realtime_analysis|autoencoder_mnist|autoencoder_mnist.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-debugger|model_specific_realtime_analysis|autoencoder_mnist|autoencoder_mnist.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-debugger|model_specific_realtime_analysis|autoencoder_mnist|autoencoder_mnist.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-debugger|model_specific_realtime_analysis|autoencoder_mnist|autoencoder_mnist.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-debugger|model_specific_realtime_analysis|autoencoder_mnist|autoencoder_mnist.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-debugger|model_specific_realtime_analysis|autoencoder_mnist|autoencoder_mnist.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (MXNet 1.9 Python 3.8 CPU Optimized)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/mxnet-1.9-cpu-py38-ubuntu20.04-sagemaker-v1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }