{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MNIST distributed training \n", "\n", "The **SageMaker Python SDK** helps you deploy your models for training and hosting in optimized, production-ready containers in SageMaker. The SageMaker Python SDK is easy to use, modular, extensible and compatible with TensorFlow and MXNet. This tutorial focuses on how to create a convolutional neural network model and train it on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) using **TensorFlow distributed training**.\n", "\n", "### Lab Time\n", "This module takes around 13 to 15 minutes to complete.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up the environment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import timeit\n", "start_time = timeit.default_timer()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import sagemaker\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from tensorflow.examples.tutorials.mnist import input_data\n", "import tensorflow as tf\n", "import boto3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "os.system(\"aws s3 cp s3://sagemaker-workshop-pdx/mnist/utils.py utils.py\")\n", "os.system(\"aws s3 cp s3://sagemaker-workshop-pdx/mnist/mnist.py mnist.py\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker import get_execution_role\n", "\n", "sagemaker_session = sagemaker.Session()\n", "\n", "role = get_execution_role()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download the MNIST dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "import utils\n", "from tensorflow.contrib.learn.python.learn.datasets import mnist\n", "import tensorflow as tf\n", "\n", "# 
train-images-idx3-ubyte.gz: training set images - 55000 training images, 5000 validation images\n", "# train-labels-idx1-ubyte.gz: training set labels matching the images\n", "# t10k-images-idx3-ubyte.gz: test set images - 10000 images\n", "# t10k-labels-idx1-ubyte.gz: test set labels matching the images\n", "data_sets = mnist.read_data_sets('data', dtype=tf.uint8, reshape=False, validation_size=5000)\n", "\n", "utils.convert_to(data_sets.train, 'train', 'data')\n", "utils.convert_to(data_sets.validation, 'validation', 'data')\n", "utils.convert_to(data_sets.test, 'test', 'data')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upload the data\n", "We use the ```sagemaker.Session.upload_data``` function to upload our datasets to an S3 location. The return value, ```inputs```, identifies the location; we will use it later when we start the training job." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inputs = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-mnist')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Construct a script for distributed training \n", "Here is the full code for the network model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "!cat 'mnist.py'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The script here is an adaptation of the [TensorFlow MNIST example](https://github.com/tensorflow/models/tree/master/official/mnist). It provides a ```model_fn(features, labels, mode)```, which is used for training, evaluation and inference. \n", "\n", "## A regular ```model_fn```\n", "\n", "A regular **```model_fn```** follows the pattern:\n", "1. 
[defines a neural network](https://github.com/tensorflow/models/blob/master/official/mnist/mnist.py#L96)\n", "1. [applies the ```features``` in the neural network](https://github.com/tensorflow/models/blob/master/official/mnist/mnist.py#L178)\n", "1. [if the ```mode``` is ```PREDICT```, returns the output from the neural network](https://github.com/tensorflow/models/blob/master/official/mnist/mnist.py#L186)\n", "1. [calculates the loss function comparing the output with the ```labels```](https://github.com/tensorflow/models/blob/master/official/mnist/mnist.py#L188)\n", "1. [creates an optimizer and minimizes the loss function to improve the neural network](https://github.com/tensorflow/models/blob/master/official/mnist/mnist.py#L193)\n", "1. [returns the output, optimizer and loss function](https://github.com/tensorflow/models/blob/master/official/mnist/mnist.py#L205)\n", "\n", "## Writing a ```model_fn``` for distributed training\n", "In distributed training, the same neural network is sent to multiple training instances. Each instance runs predictions on a batch of the dataset, calculates the loss, and runs the optimizer to minimize it. One full pass of this process is called a **training step**.\n", "\n", "### Synchronizing training steps\n", "A [global step](https://www.tensorflow.org/api_docs/python/tf/train/global_step) is a global variable shared between the instances. It is necessary for distributed training, so that the optimizer can keep track of the number of **training steps** between runs: \n", "\n", "```python\n", "train_op = optimizer.minimize(loss, tf.train.get_or_create_global_step())\n", "```\n", "\n", "That is the only required change for distributed training!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a training job using the sagemaker.TensorFlow estimator" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "from sagemaker.tensorflow import TensorFlow\n", "\n", "mnist_estimator = TensorFlow(entry_point='mnist.py',\n", "                             role=role,\n", "                             framework_version='1.10.0',\n", "                             training_steps=1000,\n", "                             evaluation_steps=100,\n", "                             train_instance_count=2,\n", "                             train_instance_type='ml.c4.xlarge')\n", "\n", "mnist_estimator.fit(inputs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **```fit```** method will create a training job on two **ml.c4.xlarge** instances. The logs above will show the instances doing training, evaluation, and incrementing the number of **training steps**. \n", "\n", "At the end of training, the training job will generate a saved model for TF Serving." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Deploy the trained model to prepare for predictions\n", "\n", "The ```deploy()``` method creates an endpoint that serves prediction requests in real time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mnist_predictor = mnist_estimator.deploy(initial_instance_count=1,\n", "                                         instance_type='ml.m4.xlarge')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Invoking the endpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from tensorflow.examples.tutorials.mnist import input_data\n", "\n", "# The read_data_sets() function returns a dictionary holding a DataSet instance for each of the three datasets. 
\n", "# data_sets.train: 55000 images and labels for initial training\n", "# data_sets.validation: 5000 images and labels for iterative validation of training accuracy\n", "# data_sets.test: 10000 images and labels for final testing of training accuracy\n", "mnist = input_data.read_data_sets(\"/tmp/data/\", one_hot=True)\n", "\n", "for i in range(10):\n", "    data = mnist.test.images[i].tolist()\n", "    tensor_proto = tf.make_tensor_proto(values=np.asarray(data), shape=[1, len(data)], dtype=tf.float32)\n", "    predict_response = mnist_predictor.predict(tensor_proto)\n", "\n", "    print(\"========================================\")\n", "    label = np.argmax(mnist.test.labels[i])\n", "    print(\"label is {}\".format(label))\n", "    prediction = predict_response['outputs']['classes']['int64_val'][0]\n", "    print(\"prediction is {}\".format(prediction))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Deleting the endpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sagemaker.Session().delete_endpoint(mnist_predictor.endpoint)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Elapsed time since the start of the notebook, in minutes\n", "elapsed = timeit.default_timer() - start_time\n", "print(elapsed/60)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_tensorflow_p36", "language": "python", "name": "conda_tensorflow_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" }, "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. 
A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." }, "nbformat": 4, "nbformat_minor": 2 }