{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Training and hosting SageMaker Models using the Apache MXNet Module API\n", "\n", "The **SageMaker Python SDK** makes it easy to train and deploy MXNet models. In this example, we train a simple neural network using the Apache MXNet [Module API](https://mxnet.apache.org/api/python/module/module.html) and the MNIST dataset. The MNIST dataset is widely used for handwritten digit classification, and consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). The task at hand is to train a model using the 60,000 training images and subsequently test its classification accuracy on the 10,000 test images.\n", "\n", "### Setup\n", "\n", "First we need to define a few variables that will be needed later in the example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_bucket(bucket):\n", " import logging\n", "\n", " try:\n", " s3.create_bucket(Bucket=bucket, CreateBucketConfiguration={'LocationConstraint': region})\n", " except botocore.exceptions.ClientError as e:\n", " logging.error(e)\n", " return 'Bucket {0} could not be created.'.format(bucket)\n", " return 'Created {0} bucket.'.format(bucket)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Replace **{{FIXME}}** with your initials before creating the bucket" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "isConfigCell": true }, "outputs": [], "source": [ "import boto3\n", "import botocore\n", "\n", "from sagemaker import get_execution_role\n", "from sagemaker.session import Session\n", "\n", "s3 = boto3.client('s3')\n", "\n", "# General variables for the region and account id for the location of the resources being created\n", "session = boto3.session.Session()\n", "region = session.region_name\n", "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", "\n", "# S3 bucket for saving code and model artifacts.\n", "# Feel free to specify a different bucket here if you wish.\n", "bucket = 'mnist-sagemaker-{{FIXME}}'\n", "\n", "# Location to save your custom code in tar.gz format.\n", "custom_code_upload_location = 's3://{}/customcode/mxnet'.format(bucket)\n", "\n", "# Location where results of model training are saved.\n", "model_artifacts_location = 's3://{}/artifacts'.format(bucket)\n", "\n", "# IAM execution role that gives SageMaker access to resources in your AWS account.\n", "# We can use the SageMaker Python SDK to get the role from our notebook environment. \n", "role = get_execution_role()\n", "\n", "create_bucket(bucket)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The training script\n", "\n", "The ``mnist.py`` script provides all the code we need for training and hosting a SageMaker model. The script also checkpoints the model at the end of every epoch and saves the model graph, params and optimizer state in the folder `/opt/ml/checkpoints`. If the folder path does not exist then it will skip checkpointing. 
"The script we will use is adapted from the Apache MXNet [MNIST tutorial](https://mxnet.incubator.apache.org/tutorials/python/mnist.html).\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pygmentize mnist.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SageMaker's MXNet estimator class" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The SageMaker ```MXNet``` estimator allows us to run single-machine or distributed training in SageMaker, using CPU or GPU-based instances.\n", "\n", "When we create the estimator, we pass in the filename of our training script, the name of our IAM execution role, and the S3 locations we defined in the setup section. We also provide a few other parameters. ``train_instance_count`` and ``train_instance_type`` determine the number and type of SageMaker instances that will be used for the training job. The ``hyperparameters`` parameter is a ``dict`` of values that will be passed to your training script -- you can see how to access these values in the ``mnist.py`` script above.\n", "\n", "For this example, we will choose one ``ml.m4.xlarge`` instance.\n", "\n", "### Managed Spot Training with MXNet\n", "\n", "For Managed Spot Training using MXNet we need to configure three things:\n", "1. Enable the `train_use_spot_instances` constructor arg - a simple self-explanatory boolean.\n", "2. Set the `train_max_wait` constructor arg - this is an int arg representing the amount of time you are willing to wait for Spot infrastructure to become available. Some instance types are harder to get at Spot prices and you may have to wait longer. You are not charged for time spent waiting for Spot infrastructure to become available; you are only charged for the actual compute time spent once Spot instances have been successfully procured.\n", "3. Set up a `checkpoint_s3_uri` constructor arg. This arg tells SageMaker the S3 location where checkpoints should be saved (assuming your algorithm has been modified to save checkpoints periodically; a minimal sketch of this checkpointing pattern is shown below, just before the estimator cell). While not strictly necessary, checkpointing is highly recommended for Managed Spot Training jobs: Spot instances can be interrupted on short notice, and resuming from the last checkpoint ensures you don't lose the progress made before the interruption.\n", "\n", "Feel free to toggle the `train_use_spot_instances` variable to see the effect of running the same job using regular (a.k.a. \"On Demand\") infrastructure.\n", "\n", "Note that `train_max_wait` can be set if and only if `train_use_spot_instances` is enabled and **must** be greater than or equal to `train_max_run`."
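, "\n", "\n", "As a point of reference for item 3 above, the next cell is a minimal, illustrative sketch of the epoch-end checkpointing pattern the training script relies on. The real checkpoint logic lives in ``mnist.py`` (displayed by the ``pygmentize`` cell above); the helper name and checkpoint prefix here are placeholders, not part of the SageMaker API." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch only -- the actual checkpoint logic is in mnist.py (see above).\n", "import os\n", "\n", "import mxnet as mx\n", "\n", "# SageMaker syncs this local folder with the checkpoint_s3_uri passed to the estimator.\n", "CHECKPOINTS_DIR = '/opt/ml/checkpoints'\n", "\n", "\n", "def epoch_end_checkpoint_callback(prefix='mnist'):\n", "    # If the checkpoint folder does not exist, skip checkpointing (as noted above).\n", "    if not os.path.isdir(CHECKPOINTS_DIR):\n", "        return None\n", "    # mx.callback.do_checkpoint saves <prefix>-symbol.json and <prefix>-<epoch>.params\n", "    # at the end of every epoch and can be passed to Module.fit as epoch_end_callback.\n", "    return mx.callback.do_checkpoint(os.path.join(CHECKPOINTS_DIR, prefix))\n", "\n", "\n", "# Inside the training script this would be wired up roughly as:\n", "# mod.fit(train_iter, eval_data=val_iter, num_epoch=num_epochs,\n", "#         epoch_end_callback=epoch_end_checkpoint_callback())"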
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import uuid\n", "\n", "from sagemaker.mxnet import MXNet\n", "\n", "train_use_spot_instances = True\n", "train_max_run=3600\n", "train_max_wait = 7200 if train_use_spot_instances else None\n", "checkpoint_suffix = str(uuid.uuid4())[:8]\n", "checkpoint_s3_uri = 's3://{}/artifacts/mxnet-checkpoint-{}/'.format(bucket, checkpoint_suffix) if train_use_spot_instances else None\n", "\n", "mnist_estimator = MXNet(entry_point='mnist.py',\n", " role=role,\n", " output_path=model_artifacts_location,\n", " code_location=custom_code_upload_location,\n", " train_instance_count=1,\n", " train_instance_type='ml.m4.xlarge',\n", " framework_version='1.4.1',\n", " py_version='py3',\n", " distributions={'parameter_server': {'enabled': True}},\n", " hyperparameters={'learning-rate': 0.1},\n", " train_use_spot_instances=train_use_spot_instances,\n", " train_max_run=train_max_run,\n", " train_max_wait=train_max_wait,\n", " checkpoint_s3_uri=checkpoint_s3_uri)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Running the Training Job" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After we've constructed our MXNet object, we can fit it using data stored in S3. Below we run SageMaker training on two input channels: **train** and **test**.\n", "\n", "During training, SageMaker makes this data stored in S3 available in the local filesystem where the mnist script is running. The ```mnist.py``` script simply loads the train and test data from disk." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "import boto3\n", "\n", "region = boto3.Session().region_name\n", "train_data_location = 's3://sagemaker-sample-data-{}/mxnet/mnist/train'.format(region)\n", "test_data_location = 's3://sagemaker-sample-data-{}/mxnet/mnist/test'.format(region)\n", "\n", "mnist_estimator.fit({'train': train_data_location, 'test': test_data_location})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Savings\n", "Towards the end of the job you should see two lines of output printed:\n", "\n", "- `Training seconds: X` : This is the actual compute-time your training job spent\n", "- `Billable seconds: Y` : This is the time you will be billed for after Spot discounting is applied.\n", "\n", "If you enabled the `train_use_spot_instances` var then you should see a notable difference between `X` and `Y` signifying the cost savings you will get for having chosen Managed Spot Training. This should be reflected in an additional line:\n", "- `Managed Spot Training savings: (1-Y/X)*100 %`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Opimtize your model with Neo API\n", "Neo API allows to optimize our model for a specific hardware type. When calling compile_model() function, we specify the target instance family (M4) as well as the S3 bucket to which the compiled model would be stored.\n", "\n", "#### Important. If the following command result in a permission error, scroll up and locate the value of execution role returned by get_execution_role(). The role must have access to the S3 bucket specified in output_path." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "neo_optimize = False\n", "compiled_model = mnist_estimator\n", "if mnist_estimator.create_model().check_neo_region(boto3.Session().region_name) is False:\n", " print('Neo is not currently supported in', boto3.Session().region_name)\n", "else:\n", " output_path = '/'.join(mnist_estimator.output_path.split('/')[:-1])\n", " neo_optimize = True\n", " compiled_model = mnist_estimator.compile_model(target_instance_family='ml_m4', \n", " input_shape={'data':[1, 784]},\n", " role=role,\n", " output_path=output_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating an inference Endpoint\n", "\n", "After training, we use the ``MXNet estimator`` object to build and deploy an ``MXNetPredictor``. This creates a Sagemaker **Endpoint** -- a hosted prediction service that we can use to perform inference. \n", "\n", "The arguments to the ``deploy`` function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances, and then deploy the Endpoint to a fleet of CPU-based instances. Here we will deploy the model to a single ``ml.m4.xlarge`` instance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "predictor = compiled_model.deploy(initial_instance_count=1,\n", " instance_type='ml.m4.xlarge')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import io\n", "import numpy as np\n", "def numpy_bytes_serializer(data):\n", " f = io.BytesIO()\n", " np.save(f, data)\n", " f.seek(0)\n", " return f.read()\n", "\n", "if neo_optimize is True:\n", " predictor.content_type = 'application/vnd+python.numpy+binary'\n", " predictor.serializer = numpy_bytes_serializer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The request handling behavior of the Endpoint is determined by the ``mnist.py`` script. In this case, the script doesn't include any request handling functions, so the Endpoint will use the default handlers provided by SageMaker. These default handlers allow us to perform inference on input data encoded as a multi-dimensional JSON array.\n", "\n", "### Making an inference request\n", "\n", "Now that our Endpoint is deployed and we have a ``predictor`` object, we can use it to classify handwritten digits.\n", "\n", "To see inference in action, draw a digit in the image box below. The pixel data from your drawing will be loaded into a ``data`` variable in this notebook. 
\n", "\n", "*Note: after drawing the image, you'll need to move to the next notebook cell.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import HTML\n", "HTML(open(\"input.html\").read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use the ``predictor`` object to classify the handwritten digit:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "response = predictor.predict(data)\n", "print('Raw prediction result:')\n", "if neo_optimize is False:\n", " response = response[0]\n", "print(response)\n", "\n", "labeled_predictions = list(zip(range(10), response))\n", "print('Labeled predictions: ')\n", "print(labeled_predictions)\n", "\n", "labeled_predictions.sort(key=lambda label_and_prob: 1.0 - label_and_prob[1])\n", "print('Most likely answer: {}'.format(labeled_predictions[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# (Optional) Delete the Endpoint\n", "\n", "After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Endpoint name: \" + predictor.endpoint)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "\n", "predictor.delete_endpoint()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 rb s3://$bucket --force" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_mxnet_p36", "language": "python", "name": "conda_mxnet_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" }, "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." }, "nbformat": 4, "nbformat_minor": 4 }