{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TensorFlow Regression Model Workflow with Script Mode and Automatic Model Tuning\n", "\n", "Starting with TensorFlow version 1.11, you can use SageMaker's prebuilt TensorFlow containers with TensorFlow training scripts similar to those you would use outside SageMaker. This feature is named Script Mode.\n", "\n", "In this notebook, we will use Script Mode in conjunction with TensorFlow's Eager Execution mode, which is the default execution mode of TensorFlow 2 onwards. Eager execution is an imperative interface where operations are executed immediately, rather than building a static computational graph. Advantages of Eager Execution include a more intuitive interface with natural Python control flow and less boilerplate, easier debugging, and support for dynamic models and almost all of the available TensorFlow operations. It also features close integration with tf.keras to make rapid prototyping even easier. \n", "\n", "To demonstrate Script Mode, this notebook focuses on presenting a relatively complete workflow. The workflow includes local training and hosted training in SageMaker, as well as local inference and SageMaker hosted inference with a real time endpoint. Additionally, Automatic Model Tuning in SageMaker will be used to tune the model's hyperparameters. This workflow will be applied to a straightforward regression task, predicting house prices based on the well-known Boston Housing dataset. More specifically, this public dataset contains 13 features regarding housing stock of towns in the Boston area, including features such as average number of rooms, accessibility to radial highways, adjacency to the Charles River, etc. \n", "\n", "To begin, we'll import some necessary packages and set up directories for training and test data. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import tensorflow as tf\n", "\n", "tf.enable_eager_execution()\n", "tf.set_random_seed(0)\n", "tf.logging.set_verbosity(tf.logging.ERROR)\n", "\n", "data_dir = os.path.join(os.getcwd(), 'data')\n", "os.makedirs(data_dir, exist_ok=True)\n", "\n", "train_dir = os.path.join(os.getcwd(), 'data/train')\n", "os.makedirs(train_dir, exist_ok=True)\n", "\n", "test_dir = os.path.join(os.getcwd(), 'data/test')\n", "os.makedirs(test_dir, exist_ok=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Prepare dataset\n", "\n", "Next, we'll import the dataset. The dataset itself is small and relatively issue-free. For example, there are no missing values, a common problem for many other datasets. Accordingly, preprocessing just involves normalizing the data. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tensorflow.python.keras.datasets import boston_housing\n", "\n", "(x_train, y_train), (x_test, y_test) = boston_housing.load_data()\n", "mean = x_train.mean(axis=0)\n", "std = x_train.std(axis=0)\n", "\n", "x_train = (x_train - mean) / (std + 1e-8)\n", "x_test = (x_test - mean) / (std + 1e-8)\n", "\n", "print('x train', x_train.shape, x_train.mean(), x_train.std())\n", "print('y train', y_train.shape, y_train.mean(), y_train.std())\n", "print('x test', x_test.shape, x_test.mean(), x_test.std())\n", "print('y test', y_test.shape, y_test.mean(), y_test.std())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data is saved as Numpy files prior to both local mode training and hosted training in SageMaker." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "np.save(os.path.join(train_dir, 'x_train.npy'), x_train)\n", "np.save(os.path.join(train_dir, 'y_train.npy'), y_train)\n", "np.save(os.path.join(test_dir, 'x_test.npy'), x_test)\n", "np.save(os.path.join(test_dir, 'y_test.npy'), y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Local Mode training\n", "\n", "Amazon SageMaker’s Local Mode training feature is a convenient way to make sure your code is working as expected before moving on to full-scale hosted training. To train in Local Mode, it is necessary to have docker-compose or nvidia-docker-compose (for GPU) installed on the notebook instance. Running the following script will install docker-compose or nvidia-docker-compose and configure the notebook environment for you." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/local_mode_setup.sh\n", "!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/daemon.json \n", "!/bin/bash ./local_mode_setup.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we'll set up a TensorFlow Estimator for Local Mode training. One of the key parameters for an Estimator is the `train_instance_type`, which is the kind of hardware on which training will run. In the case of Local Mode, we simply set this parameter to `local` to invoke Local Mode training on the CPU, or to `local_gpu` if the instance has a GPU. Other parameters of note are the algorithm’s hyperparameters, which are passed in as a dictionary, and a Boolean parameter indicating that we are using Script Mode. \n", "\n", "Recall that we are using Local Mode here mainly to make sure our code is working. 
Accordingly, instead of performing a full cycle of training with many epochs (passes over the full dataset), we'll train only for a small number of epochs to confirm the code is working properly and avoid wasting training time unnecessarily." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "from sagemaker.tensorflow import TensorFlow\n", "\n", "git_config = {'repo': 'https://github.com/aws-samples/amazon-sagemaker-script-mode', \n", " 'branch': 'master'}\n", "\n", "model_dir = '/opt/ml/model'\n", "train_instance_type = 'local'\n", "hyperparameters = {'epochs': 5, 'batch_size': 128, 'learning_rate': 0.01}\n", "local_estimator = TensorFlow(\n", " git_config=git_config,\n", " source_dir='tf-eager-script-mode/train_model',\n", " entry_point='train.py',\n", " model_dir=model_dir,\n", " train_instance_type=train_instance_type,\n", " train_instance_count=1,\n", " hyperparameters=hyperparameters,\n", " role=sagemaker.get_execution_role(),\n", " base_job_name='tf-scriptmode-bostonhousing',\n", " framework_version='1.13',\n", " py_version='py3',\n", " script_mode=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inputs = {'train': f'file://{train_dir}',\n", " 'test': f'file://{test_dir}'}\n", "\n", "local_estimator.fit(inputs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Local Mode endpoint\n", "\n", "While Amazon SageMaker’s Local Mode training is very useful to make sure your training code is working before moving on to full scale training, it also would be useful to have a convenient way to test your model locally before incurring the time and expense of deploying it to production. One possibility is to fetch the SavedModel artifact or a model checkpoint saved in Amazon S3, and load it in your notebook for testing. We'll explore doing that in another section of this notebook below. 
However, an even easier way is to let the Amazon SageMaker SDK do this work for you.\n", "\n", "The Estimator object from the Local Mode training job can be used to deploy a model locally with a single line of code. With one exception, this code is the same as the code you would use to deploy to production. In particular, all you need to do is invoke the local Estimator's `deploy` method and, as with Local Mode training, specify the instance type as either `local_gpu` or `local` depending on whether your notebook instance is a GPU instance or CPU instance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "local_predictor = local_estimator.deploy(initial_instance_count=1, instance_type='local')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get predictions from the local endpoint, simply invoke the Predictor's `predict` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "local_results = local_predictor.predict(x_test[:10])['predictions']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a sanity check, the predictions can be compared against the actual target values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "local_preds_flat_list = [float('%.1f'%(item)) for sublist in local_results for item in sublist]\n", "print('predictions: \\t{}'.format(np.array(local_preds_flat_list)))\n", "print('target values: \\t{}'.format(y_test[:10].round(decimals=1)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We only trained the model for a few epochs and there is room for improvement, but the predictions so far should at least be in the right ballpark. \n", "\n", "To avoid having the TensorFlow Serving container running indefinitely on this notebook instance, shut it down gracefully by calling the `delete_endpoint` method of the Predictor object." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "local_predictor.delete_endpoint()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker hosted training\n", "\n", "Now that we've confirmed our code is working locally, we can move on to use SageMaker's hosted training functionality. Hosted training is preferred for actual training, especially large-scale, distributed training. Unlike Local Mode training, hosted training occurs not on the notebook instance, but on a separate cluster of machines managed by SageMaker. Before starting hosted training, the data must be uploaded to S3. We'll do that now, and confirm the upload was successful." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_prefix = 'tf-eager-scriptmode-bostonhousing'\n", "\n", "traindata_s3_prefix = '{}/data/train'.format(s3_prefix)\n", "testdata_s3_prefix = '{}/data/test'.format(s3_prefix)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_s3 = sagemaker.Session().upload_data(path='./data/train/', key_prefix=traindata_s3_prefix)\n", "test_s3 = sagemaker.Session().upload_data(path='./data/test/', key_prefix=testdata_s3_prefix)\n", "\n", "inputs = {'train':train_s3, 'test': test_s3}\n", "\n", "print(inputs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're now ready to set up an Estimator object for hosted training. It is similar to the Local Mode Estimator, except the `train_instance_type` has been set to an ML instance type instead of `local` for Local Mode. Also, since we know our code is working now, we train for a larger number of epochs.\n", "\n", "With these two changes, we simply call `fit` to start the actual hosted training." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_instance_type = 'ml.c5.xlarge'\n", "hyperparameters = {'epochs': 30, 'batch_size': 128, 'learning_rate': 0.01}\n", "\n", "estimator = TensorFlow(\n", " git_config=git_config,\n", " source_dir='tf-eager-script-mode/train_model',\n", " entry_point='train.py',\n", " model_dir=model_dir,\n", " train_instance_type=train_instance_type,\n", " train_instance_count=1,\n", " hyperparameters=hyperparameters,\n", " role=sagemaker.get_execution_role(),\n", " base_job_name='tf-scriptmode-bostonhousing',\n", " framework_version='1.13',\n", " py_version='py3',\n", " script_mode=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimator.fit(inputs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As with the Local Mode training, hosted training produces a model checkpoint saved in S3 that we can retrieve and load. We can then make predictions and compare them with the test set. This also demonstrates the modularity of SageMaker: having trained the model in SageMaker, you can now take the model out of SageMaker and run it anywhere else. Alternatively, you can deploy the model using SageMaker's hosted endpoints functionality.\n", "\n", "Now, instead of using a Local Mode endpoint, we'll go through the steps of downloading the model from Amazon S3 and loading a checkpoint. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 cp {estimator.model_data} ./model/model.tar.gz" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!tar -xvzf ./model/model.tar.gz -C ./model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir -p train_model\n", "!wget -q -P ./train_model https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/tf-eager-script-mode/train_model/model_def.py\n", "from tensorflow.contrib.eager.python import tfe\n", "from train_model import model_def\n", "\n", "tf.keras.backend.clear_session()\n", "device = '/cpu:0' \n", "\n", "with tf.device(device): \n", " model = model_def.get_model()\n", " saver = tfe.Saver(model.variables)\n", " saver.restore('model/weights.ckpt')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with tf.device(device): \n", " predictions = model.predict(x_test)\n", " \n", "print('predictions: \\t{}'.format(predictions[:10].flatten().round(decimals=1)))\n", "print('target values: \\t{}'.format(y_test[:10].round(decimals=1)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker hosted endpoint\n", "\n", "After multiple sanity checks, we're confident that our model is performing as expected. If we wish to deploy the model to production, a convenient option is to use a SageMaker hosted endpoint. The endpoint will retrieve the TensorFlow SavedModel created during training and deploy it within a TensorFlow Serving container. This all can be accomplished with one line of code, an invocation of the Estimator's deploy method." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.xlarge')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As one last sanity check, we can compare the predictions generated by the endpoint with those generated locally by the model checkpoint we retrieved from hosted training in SageMaker. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results = predictor.predict(x_test[:10])['predictions'] \n", "flat_list = [float('%.1f'%(item)) for sublist in results for item in sublist]\n", "print('predictions: \\t{}'.format(np.array(flat_list)))\n", "print('target values: \\t{}'.format(y_test[:10].round(decimals=1)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before proceeding with the rest of this notebook, you can delete the prediction endpoint to release the instance(s) associated with it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sagemaker.Session().delete_endpoint(predictor.endpoint)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Automatic Model Tuning\n", "\n", "Selecting the right hyperparameter values to train your model can be difficult. The right answer is dependent on your data; some algorithms have many different hyperparameters that can be tweaked; some are very sensitive to the hyperparameter values selected; and most have a non-linear relationship between model fit and hyperparameter values. SageMaker Automatic Model Tuning helps automate the hyperparameter tuning process: it runs multiple training jobs with different hyperparameter combinations to find the set with the best model performance.\n", "\n", "We begin by specifying the hyperparameters we wish to tune, and the range of values over which to tune each one. 
We also must specify an objective metric to be optimized: in this use case, we'd like to minimize the validation loss." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner\n", "from time import gmtime, strftime \n", "\n", "hyperparameter_ranges = {\n", " 'learning_rate': ContinuousParameter(0.001, 0.2, scaling_type=\"Logarithmic\"),\n", " 'epochs': IntegerParameter(10, 50),\n", " 'batch_size': IntegerParameter(64, 256),\n", " }\n", "\n", "metric_definitions = [{'Name': 'loss',\n", " 'Regex': ' loss: ([0-9\\\\.]+)'},\n", " {'Name': 'val_loss',\n", " 'Regex': ' val_loss: ([0-9\\\\.]+)'}]\n", "\n", "objective_metric_name = 'val_loss'\n", "objective_type = 'Minimize'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we specify a HyperparameterTuner object that takes the above definitions as parameters. Each tuning job must be given a budget - a maximum number of training jobs - and the tuning job will complete once that many training jobs have been executed. \n", "\n", "We also can specify how much parallelism to employ, in this case five jobs, meaning that the tuning job will complete after three series of five jobs in parallel have completed. For the default Bayesian Optimization tuning strategy used here, the search is informed by the results of previous groups of training jobs, so we don't run all of the jobs in parallel, but rather divide the jobs into groups of parallel jobs. In other words, running more jobs in parallel finishes tuning sooner, but may yield a worse final result because each group has fewer earlier results to learn from. \n", "\n", "Now we can launch a hyperparameter tuning job by calling the `fit` method of the HyperparameterTuner object. We will wait until the tuning job has finished, which may take around 10 minutes." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner = HyperparameterTuner(estimator,\n", " objective_metric_name,\n", " hyperparameter_ranges,\n", " metric_definitions,\n", " max_jobs=15,\n", " max_parallel_jobs=5,\n", " objective_type=objective_type)\n", "\n", "tuning_job_name = \"tf-bostonhousing-{}\".format(strftime(\"%d-%H-%M-%S\", gmtime()))\n", "tuner.fit(inputs, job_name=tuning_job_name)\n", "tuner.wait()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the tuning job is finished, we can use the `HyperparameterTuningJobAnalytics` class to list the five training jobs with the best performance. Although the results typically vary from tuning job to tuning job, the best validation loss from the tuning job (under the FinalObjectiveValue column) likely will be lower than the validation loss from the hosted training job above. For an example of a more in-depth analysis of a tuning job, see the `HPO_Analyze_TuningJob_Results.ipynb` notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner_metrics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)\n", "tuner_metrics.dataframe().sort_values(['FinalObjectiveValue'], ascending=True).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The total training time and the status of the training jobs can be checked with the following script. Because automatic early stopping is off by default, all the training jobs should complete normally." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "total_time = tuner_metrics.dataframe()['TrainingElapsedTimeSeconds'].sum() / 3600\n", "print(\"The total training time is {:.2f} hours\".format(total_time))\n", "tuner_metrics.dataframe()['TrainingJobStatus'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assuming the best model from the tuning job is better than the model produced by the hosted training job above, we could now easily deploy that model. By calling the `deploy` method of the HyperparameterTuner object we instantiated above, we can directly deploy the best model from the tuning job to a SageMaker hosted endpoint:\n", "\n", "`tuning_predictor = tuner.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')`\n", "\n", "Since we already looked at how to use a SageMaker hosted endpoint above, we won't repeat that here. We've covered a lot of content in this notebook: local and hosted training with Script Mode, local and hosted inference in SageMaker, and Automatic Model Tuning. These are likely to be central elements for most deep learning workflows in SageMaker. " ] } ], "metadata": { "kernelspec": { "display_name": "conda_tensorflow_p36", "language": "python", "name": "conda_tensorflow_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }