{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Automatic Model Tuning using SageMaker Tensorflow Container\n", "\n", "This tutorial focuses on how to create a convolutional neural network model to train the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) using **SageMaker TensorFlow container**. It leverages hyperparameter tuning to kick off multiple training jobs with different hyperparameter combinations, to find the one with best model training result.\n", "\n", "### Lab Time\n", "This module takes around 15 to 20 minutes to complete." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up the environment\n", "We will set up a few things before starting the workflow. \n", "\n", "1. specify the s3 bucket and prefix where training data set and model artifacts will be stored\n", "2. get the execution role which will be passed to sagemaker for accessing your resources such as s3 bucket" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import timeit\n", "start_time = timeit.default_timer()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import sagemaker\n", "\n", "bucket = sagemaker.Session().default_bucket() # we are using a default bucket here but you can change it to any bucket in your account\n", "prefix = 'sagemaker/DEMO-hpo-tensorflow-high' # you can customize the prefix (subfolder) here\n", "\n", "role = sagemaker.get_execution_role() # we are using the notebook instance role for training in this example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll import the Python libraries we'll need." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "from time import gmtime, strftime\n", "from sagemaker.tensorflow import TensorFlow\n", "from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download the MNIST dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "import utils\n", "from tensorflow.contrib.learn.python.learn.datasets import mnist\n", "import tensorflow as tf\n", "\n", "# train-images-idx3-ubyte.gz: 학습 셋 이미지 - 55000개의 트레이닝 이미지, 5000개의 검증 이미지\n", "# train-labels-idx1-ubyte.gz: 이미지와 매칭되는 학습 셋 레이블\n", "# t10k-images-idx3-ubyte.gz: 테스트 셋 이미지 - 10000개의 이미지\n", "# t10k-labels-idx1-ubyte.gz: 이미지와 매칭되는 테스트 셋 레이블\n", "data_sets = mnist.read_data_sets('data', dtype=tf.uint8, reshape=False, validation_size=5000)\n", "\n", "utils.convert_to(data_sets.train, 'train', 'data')\n", "utils.convert_to(data_sets.validation, 'validation', 'data')\n", "utils.convert_to(data_sets.test, 'test', 'data')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Upload the data\n", "We use the ```sagemaker.Session.upload_data``` function to upload our datasets to an S3 location. The return value identifies the location -- we will use this later when we start the training job." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inputs = sagemaker.Session().upload_data(path='data', bucket=bucket, key_prefix=prefix+'/data/mnist')\n", "print (inputs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Construct a script for distributed training \n", "Here is the full code for the network model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "!cat 'mnist_hpo.py'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The script here is and adaptation of the [TensorFlow MNIST example](https://github.com/tensorflow/models/tree/master/official/mnist). It provides a ```model_fn(features, labels, mode)```, which is used for training, evaluation and inference. \n", "\n", "### A regular ```model_fn```\n", "\n", "A regular **```model_fn```** follows the pattern:\n", "1. [defines a neural network](https://github.com/tensorflow/models/blob/master/official/mnist/mnist.py#L96)\n", "- [applies the ```features``` in the neural network](https://github.com/tensorflow/models/blob/master/official/mnist/mnist.py#L178)\n", "- [if the ```mode``` is ```PREDICT```, returns the output from the neural network](https://github.com/tensorflow/models/blob/master/official/mnist/mnist.py#L186)\n", "- [calculates the loss function comparing the output with the ```labels```](https://github.com/tensorflow/models/blob/master/official/mnist/mnist.py#L188)\n", "- [creates an optimizer and minimizes the loss function to improve the neural network](https://github.com/tensorflow/models/blob/master/official/mnist/mnist.py#L193)\n", "- [returns the output, optimizer and loss function](https://github.com/tensorflow/models/blob/master/official/mnist/mnist.py#L205)\n", "\n", "### Writing a ```model_fn``` for distributed training\n", "When distributed training happens, the same neural network will be sent to the multiple training instances. Each instance will predict a batch of the dataset, calculate loss and minimize the optimizer. One entire loop of this process is called **training step**.\n", "\n", "#### Syncronizing training steps\n", "A [global step](https://www.tensorflow.org/api_docs/python/tf/train/global_step) is a global variable shared between the instances. It necessary for distributed training, so the optimizer will keep track of the number of **training steps** between runs: \n", "\n", "```python\n", "train_op = optimizer.minimize(loss, tf.train.get_or_create_global_step())\n", "```\n", "\n", "That is the only required change for distributed training!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up hyperparameter tuning job\n", "*Note, with the default setting below, the hyperparameter tuning job can take about 30 minutes to complete.*\n", "\n", "Now we will set up the hyperparameter tuning job using SageMaker Python SDK, following below steps:\n", "* Create an estimator to set up the TensorFlow training job\n", "* Define the ranges of hyperparameters we plan to tune, in this example, we are tuning \"learning_rate\"\n", "* Define the objective metric for the tuning job to optimize\n", "* Create a hyperparameter tuner with above setting, as well as tuning resource configurations " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to training a single TensorFlow job in SageMaker, we define our TensorFlow estimator passing in the TensorFlow script, IAM role, and (per job) hardware configuration." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimator = TensorFlow(entry_point='mnist_hpo.py',\n", " role=role,\n", " framework_version='1.10.0',\n", " training_steps=1000, \n", " evaluation_steps=100,\n", " train_instance_count=1,\n", " train_instance_type='ml.c4.xlarge', # try: ml.m4.xlarge, ml.m5.xlarge, ml.c5.xlarge\n", " base_job_name='DEMO-hpo-tensorflow')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we've defined our estimator we can specify the hyperparameters we'd like to tune and their possible values. We have three different types of hyperparameters.\n", "- Categorical parameters need to take one value from a discrete set. We define this by passing the list of possible values to `CategoricalParameter(list)`\n", "- Continuous parameters can take any real number value between the minimum and maximum value, defined by `ContinuousParameter(min, max)`\n", "- Integer parameters can take any integer value between the minimum and maximum value, defined by `IntegerParameter(min, max)`\n", "\n", "*Note, if possible, it's almost always best to specify a value as the least restrictive type. For example, tuning learning rate as a continuous value between 0.01 and 0.2 is likely to yield a better result than tuning as a categorical parameter with values 0.01, 0.1, 0.15, or 0.2.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hyperparameter_ranges = {'learning_rate': ContinuousParameter(0.01, 0.2)}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we'll specify the objective metric that we'd like to tune and its definition, which includes the regular expression (Regex) needed to extract that metric from the CloudWatch logs of the training job. In this particular case, our script emits loss value and we will use it as the objective metric, we also set the objective_type to be 'minimize', so that hyperparameter tuning seeks to minize the objective metric when searching for the best hyperparameter setting. By default, objective_type is set to 'maximize'." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "objective_metric_name = 'loss'\n", "objective_type = 'Minimize'\n", "metric_definitions = [{'Name': 'loss',\n", " 'Regex': 'loss = ([0-9\\\\.]+)'}]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we'll create a `HyperparameterTuner` object, to which we pass:\n", "- The TensorFlow estimator we created above\n", "- Our hyperparameter ranges\n", "- Objective metric name and definition\n", "- Tuning resource configurations such as Number of training jobs to run in total and how many training jobs can be run in parallel." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner = HyperparameterTuner(estimator,\n", " objective_metric_name,\n", " hyperparameter_ranges,\n", " metric_definitions,\n", " max_jobs=9,\n", " max_parallel_jobs=3,\n", " objective_type=objective_type)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Launch hyperparameter tuning job\n", "And finally, we can start our hyperprameter tuning job by calling `.fit()` and passing in the S3 path to our train and test dataset.\n", "\n", "After the hyperprameter tuning job is created, you should be able to describe the tuning job to see its progress in the next step, and you can go to SageMaker console->Jobs to check out the progress of the progress of the hyperparameter tuning job." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner.fit(inputs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's just run a quick check of the hyperparameter tuning jobs status to make sure it started successfully." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boto3.client('sagemaker').describe_hyper_parameter_tuning_job(\n", " HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)['HyperParameterTuningJobStatus']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyze tuning job results - after tuning job is completed\n", "Please refer to \"Module6-HPO Job Results Analyzer.ipynb\" to see example code to analyze the tuning job results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# code you want to evaluate\n", "elapsed = timeit.default_timer() - start_time\n", "print(elapsed/60)" ] } ], "metadata": { "kernelspec": { "display_name": "conda_tensorflow_p27", "language": "python", "name": "conda_tensorflow_p27" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.15" }, "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." }, "nbformat": 4, "nbformat_minor": 2 }