{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Hyperparameter Tuning with the SageMaker TensorFlow Container\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", "\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This tutorial shows how to train a convolutional neural network on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) using the SageMaker TensorFlow container. It uses hyperparameter tuning to run multiple training jobs with different hyperparameter combinations and find the one that yields the best training result.\n", "\n", "## Runtime\n", "\n", "This notebook takes approximately 10 minutes to run.\n", "\n", "## Contents\n", "\n", "1. [Set Up the Environment](#Set-Up-the-Environment)\n", "1. [Data](#Data)\n", "1. [Run a TensorFlow Training Job](#Run-a-TensorFlow-Training-Job)\n", "1. [Set Up Channels for Training and Testing Data](#Set-Up-Channels-for-Training-and-Testing-Data)\n", "1. [Run a Hyperparameter Tuning Job](#Run-a-Hyperparameter-Tuning-Job)\n", "1. [Deploy the Best Model](#Deploy-the-Best-Model)\n", "1. [Evaluate](#Evaluate)\n", "1. [Cleanup](#Cleanup)\n", "\n", "## Set Up the Environment \n", "Set up a few things before starting the workflow:\n", "\n", "1. A SageMaker session object to manage interactions with the Amazon SageMaker APIs. \n", "2. An execution role that is passed to SageMaker to access your AWS resources." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import json\n", "\n", "import sagemaker\n", "from sagemaker.tensorflow import TensorFlow\n", "from sagemaker import get_execution_role\n", "\n", "sess = sagemaker.Session()\n", "region = sess.boto_region_name\n", "role = get_execution_role()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "Download the MNIST data from a public S3 bucket and save it in a temporary directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import logging\n", "import boto3\n", "from botocore.exceptions import ClientError\n", "\n", "public_bucket = f\"sagemaker-example-files-prod-{region}\"\n", "local_data_dir = \"/tmp/data\"\n", "\n", "\n", "# Download training and testing data from a public S3 bucket\n", "def download_from_s3(data_dir=\"/tmp/data\", train=True):\n", "    \"\"\"Download the gzipped MNIST image and label files from a public S3 bucket\n", "\n", "    Args:\n", "        data_dir (str): directory to save the data\n", "        train (bool): download the training set if True, otherwise the test set\n", "\n", "    Returns:\n", "        None\n", "    \"\"\"\n", "    # create the target directory if it does not exist\n", "    if not os.path.exists(data_dir):\n", "        os.makedirs(data_dir)\n", "\n", "    if train:\n", "        images_file = \"train-images-idx3-ubyte.gz\"\n", "        labels_file = \"train-labels-idx1-ubyte.gz\"\n", "    else:\n", "        images_file = \"t10k-images-idx3-ubyte.gz\"\n", "        labels_file = \"t10k-labels-idx1-ubyte.gz\"\n", "\n", "    # download objects, skipping files that already exist locally\n", "    s3 = boto3.client(\"s3\")\n", "    bucket = public_bucket\n", "    for obj in [images_file, labels_file]:\n", "        key = os.path.join(\"datasets/image/MNIST\", obj)\n", "        dest = os.path.join(data_dir, obj)\n", "        if not os.path.exists(dest):\n", "            s3.download_file(bucket, key, dest)\n", "    return\n", "\n", "\n", "download_from_s3(local_data_dir, True)\n", "download_from_s3(local_data_dir, False)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, 
"source": [ "## Run a TensorFlow Training Job\n", "A TensorFlow training job is defined by using the `TensorFlow` estimator class. It lets you run your training script on SageMaker infrastructure in a containerized environment. For more information on how to instantiate it, see the example [Train an MNIST model with TensorFlow](https://sagemaker-examples.readthedocs.io/en/latest/frameworks/tensorflow/get_started_mnist_train.html#TensorFlow-Estimator)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "est = TensorFlow(\n", " entry_point=\"train.py\",\n", " source_dir=\"code\", # directory of your training script\n", " role=role,\n", " framework_version=\"2.3.1\",\n", " model_dir=\"/opt/ml/model\",\n", " py_version=\"py37\",\n", " instance_type=\"ml.m5.4xlarge\",\n", " instance_count=1,\n", " volume_size=250,\n", " hyperparameters={\n", " \"batch-size\": 512,\n", " \"epochs\": 4,\n", " },\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Set Up Channels for Training and Testing Data\n", "Upload the MNIST data to the default bucket of your AWS account and pass the S3 URI as the channels of training and testing data for the `TensorFlow` estimator class. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prefix = \"mnist\"\n", "bucket = sess.default_bucket()\n", "loc = sess.upload_data(path=local_data_dir, bucket=bucket, key_prefix=prefix)\n", "\n", "channels = {\"training\": loc, \"testing\": loc}" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Run a Hyperparameter Tuning Job\n", "Now that you have set up the training job and the input data channels, you are ready to train the model with hyperparameter search.\n", "\n", "Set up the hyperparameter tuning job with the following steps:\n", "* Define the ranges of hyperparameters we plan to tune. 
In this example, we tune the learning rate.\n", "* Define the objective metric for the tuning job to optimize.\n", "* Create a hyperparameter tuner with the above settings, as well as tuning resource configurations.\n", "\n", "For a typical ML model, there are three kinds of hyperparameters:\n", "\n", "- Categorical parameters need to take one value from a discrete set. We define this by passing the list of possible values to `CategoricalParameter(list)`\n", "- Continuous parameters can take any real number value between the minimum and maximum value, defined by `ContinuousParameter(min, max)`\n", "- Integer parameters can take any integer value between the minimum and maximum value, defined by `IntegerParameter(min, max)`\n", "\n", "Learning rate is a continuous variable, so we define its range by `ContinuousParameter`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.tuner import ContinuousParameter, HyperparameterTuner\n", "\n", "hyperparamter_range = {\"learning-rate\": ContinuousParameter(1e-4, 1e-3)}" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Next we specify the objective metric that we'd like to tune and its definition, which includes the regular expression (regex) needed to extract that metric from the CloudWatch logs of the training job. In this particular case, our script emits the average test loss and we use it as the objective metric. We set the `objective_type` to `Minimize`, so that hyperparameter tuning seeks to minimize the objective metric when searching for the best hyperparameter value." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "objective_metric_name = \"average test loss\"\n", "objective_type = \"Minimize\"\n", "metric_definitions = [\n", "    {\n", "        \"Name\": \"average test loss\",\n", "        \"Regex\": \"Test Loss: ([0-9\\\\.]+)\",\n", "    }\n", "]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now, you'll create a `HyperparameterTuner` object. It takes the following parameters:\n", "- The `TensorFlow` estimator you previously created.\n", "- Your hyperparameter ranges.\n", "- Objective metric name and definition.\n", "- Tuning resource configurations such as the number of training jobs to run in total, and how many training jobs to run in parallel." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner = HyperparameterTuner(\n", "    est,\n", "    objective_metric_name,\n", "    hyperparamter_range,\n", "    metric_definitions,\n", "    max_jobs=3,\n", "    max_parallel_jobs=3,\n", "    objective_type=objective_type,\n", ")\n", "\n", "tuner.fit(inputs=channels)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Deploy the Best Model\n", "After training with hyperparameter optimization, you can deploy the best-performing model (by the objective metric you defined) to a SageMaker endpoint. For more information about deploying a model to a SageMaker endpoint, see the example [Deploy a Trained TensorFlow V2 Model](https://sagemaker-examples.readthedocs.io/en/latest/frameworks/tensorflow/get_started_mnist_deploy.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor = tuner.deploy(initial_instance_count=1, instance_type=\"ml.m5.xlarge\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate\n", "Now, you can evaluate the best-performing model by invoking the endpoint with the MNIST test set. 
The test data must be in a form the model can consume, so we reshape it into the shape accepted by the TensorFlow model. We also normalize each image so that its pixel values have mean 0 and standard deviation 1, since this is the convention used to train the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import random\n", "import gzip\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "%matplotlib inline\n", "\n", "\n", "images_file = \"t10k-images-idx3-ubyte.gz\"\n", "\n", "\n", "def read_mnist(data_dir, images_file):\n", "    \"\"\"Read a gzipped MNIST image file into a numpy array of shape (N, 28, 28)\"\"\"\n", "    with gzip.open(os.path.join(data_dir, images_file), \"rb\") as f:\n", "        # skip the 16-byte header, then view the raw bytes as 28x28 images\n", "        images = np.frombuffer(f.read(), np.uint8, offset=16).reshape(-1, 28, 28)\n", "    return images\n", "\n", "\n", "X = read_mnist(local_data_dir, images_file)\n", "\n", "# randomly sample 16 images to inspect\n", "mask = random.sample(range(X.shape[0]), 16)\n", "samples = X[mask]\n", "\n", "# plot the images\n", "fig, axs = plt.subplots(nrows=1, ncols=16, figsize=(16, 1))\n", "\n", "for i, splt in enumerate(axs):\n", "    splt.imshow(samples[i])\n", "\n", "# preprocess the data to be consumed by the model\n", "\n", "\n", "def normalize(x, axis):\n", "    eps = np.finfo(float).eps\n", "\n", "    mean = np.mean(x, axis=axis, keepdims=True)\n", "    # avoid division by zero\n", "    std = np.std(x, axis=axis, keepdims=True) + eps\n", "    return (x - mean) / std\n", "\n", "\n", "samples = normalize(samples, axis=(1, 2))\n", "samples = np.expand_dims(samples, axis=3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictions = predictor.predict(samples)[\"predictions\"]\n", "\n", "# convert the per-class scores to predicted class labels\n", "predictions = np.array(predictions, dtype=np.float32)\n", "predictions = np.argmax(predictions, axis=1)\n", "\n", "print(\"Predictions: \", *predictions)" ] }, { "attachments": {}, "cell_type": "markdown", 
"metadata": {}, "source": [ "## Cleanup\n", "If you do not plan to continue using the endpoint, delete it to free up resources." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor.delete_endpoint()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (TensorFlow 2.10.0 Python 3.9 CPU Optimized)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/tensorflow-2.10.1-cpu-py39-ubuntu20.04-sagemaker-v1.2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 4 }