{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# MNIST Training using PyTorch and deploy it with Elastic Inference" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-python-sdk|pytorch_mnist|pytorch_mnist_elastic_inference.ipynb)\n", "\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "1. [Background](#Background)\n", "1. [Setup](#Setup)\n", "1. [Data](#Data)\n", "1. [Train](#Train)\n", "1. [Host](#Host)\n", "\n", "---\n", "\n", "## Background\n", "\n", "MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). This tutorial will show how to train and test an MNIST model on SageMaker using PyTorch.\n", "\n", "For more information about the PyTorch in SageMaker, please visit [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers) and [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) github repositories.\n", "\n", "---\n", "\n", "## Setup\n", "\n", "_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n", "\n", "Let's start by creating a SageMaker session and specifying:\n", "\n", "- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.\n", "- The IAM role arn used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `sagemaker.get_execution_role()` with a the appropriate full IAM role arn string(s).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "from sagemaker.local import LocalSession\n", "\n", "sagemaker_session = sagemaker.Session()\n", "region = sagemaker_session.boto_region_name\n", "\n", "bucket = sagemaker_session.default_bucket()\n", "prefix = \"sagemaker/DEMO-pytorch-mnist\"\n", "\n", "role = sagemaker.get_execution_role()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "### Getting the data\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "pip install torchvision==0.5.0 --no-cache-dir" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import torchvision\n", "from torchvision import datasets, transforms\n", "from packaging.version import Version\n", "\n", "TORCHVISION_VERSION = \"0.9.1\"\n", "if Version(torchvision.__version__) < Version(TORCHVISION_VERSION):\n", " # Set path to data source and include checksum key to make sure data isn't corrupted\n", " datasets.MNIST.resources = [\n", " (\n", " f\"https://sagemaker-example-files-prod-{region}.s3.amazonaws.com/datasets/image/MNIST/train-images-idx3-ubyte.gz\",\n", " \"f68b3c2dcbeaaa9fbdd348bbdeb94873\",\n", " ),\n", " (\n", " f\"https://sagemaker-example-files-prod-{region}.s3.amazonaws.com/datasets/image/MNIST/train-labels-idx1-ubyte.gz\",\n", " \"d53e105ee54ea40749a09fcbcd1e9432\",\n", " ),\n", " (\n", " f\"https://sagemaker-example-files-prod-{region}.s3.amazonaws.com/datasets/image/MNIST/t10k-images-idx3-ubyte.gz\",\n", " \"9fb629c4189551a2d022fa330f9573f3\",\n", " ),\n", " (\n", " f\"https://sagemaker-example-files-prod-{region}.s3.amazonaws.com/datasets/image/MNIST/t10k-labels-idx1-ubyte.gz\",\n", " \"ec29112dd5afa0611ce80d1b7f02629c\",\n", " ),\n", " ]\n", "else:\n", " # Set path to data source\n", " datasets.MNIST.mirrors = [\n", " f\"https://sagemaker-example-files-prod-{region}.s3.amazonaws.com/datasets/image/MNIST/\"\n", " ]\n", "\n", "datasets.MNIST(\n", " \"data\",\n", " download=True,\n", " transform=transforms.Compose(\n", " [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]\n", " ),\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Uploading the data to S3\n", "We are going to use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value inputs identifies the location -- we will use later when we start the training job.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inputs = sagemaker_session.upload_data(path=\"data\", bucket=bucket, key_prefix=prefix)\n", "print(\"input spec (in this case, just an S3 path): {}\".format(inputs))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Train\n", "### Training script\n", "The `mnist.py` script provides all the code we need for training and hosting a SageMaker model (`model_fn` function to load a model).\n", "The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:\n", "\n", "* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to.\n", " These artifacts are uploaded to S3 for model hosting.\n", "* `SM_NUM_GPUS`: The number of gpus available in the current container.\n", "* `SM_CURRENT_HOST`: The name of the current container on the container network.\n", "* `SM_HOSTS`: JSON encoded list containing all the hosts .\n", "\n", "Supposing one input channel, 'training', was used in the call to the PyTorch estimator's `fit()` method, the following will be set, following the format `SM_CHANNEL_[channel_name]`:\n", "\n", "* `SM_CHANNEL_TRAINING`: A string representing the path to the directory containing data in the 'training' channel.\n", "\n", "For more information about training environment variables, please visit [SageMaker Containers](https://github.com/aws/sagemaker-containers).\n", "\n", "A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance.\n", "\n", "Because the SageMaker imports the training script, you should put your training code in a main guard (``if __name__=='__main__':``) if you are using the same script to host your model as we do in this example, so that SageMaker does not inadvertently run your training code at the wrong point in execution.\n", "\n", "For example, the script run by this notebook:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!pygmentize mnist.py" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Run training in SageMaker\n", "\n", "The `PyTorch` class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script, an IAM role, the number of training instances, the training instance type, and hyperparameters. In this case we are going to run our training job on 2 ```ml.c4.xlarge``` instances. But this example can be ran on one or multiple, cpu or gpu instances ([full list of available instances](https://aws.amazon.com/sagemaker/pricing/instance-types/)). The hyperparameters parameter is a dict of values that will be passed to your training script -- you can see how to access these values in the `mnist.py` script above.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch import PyTorch\n", "\n", "estimator = PyTorch(\n", " entry_point=\"mnist.py\",\n", " role=role,\n", " framework_version=\"1.4.0\",\n", " py_version=\"py3\",\n", " instance_count=2,\n", " instance_type=\"ml.c4.xlarge\",\n", " hyperparameters={\"epochs\": 6, \"backend\": \"gloo\"},\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "After we've constructed our `PyTorch` object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "estimator.fit({\"training\": inputs})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimator.model_data" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Host\n", "### Create endpoint\n", "After training, we use the `PyTorch` estimator object to build and deploy a `PyTorchPredictor`. This creates a Sagemaker Endpoint -- a hosted prediction service that we can use to perform inference.\n", "\n", "As mentioned above we have implementation of `model_fn` in the `mnist.py` script that is required. We are going to use default implementations of `input_fn`, `predict_fn`, `output_fn` and `transform_fm` defined in [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers).\n", "\n", "The arguments to the deploy function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances, and then deploy the Endpoint to a fleet of CPU-based instances, but you need to make sure that you return or save your model as a cpu model similar to what we did in `mnist.py`. Here we will deploy the model to a single ```ml.m4.xlarge``` instance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor = estimator.deploy(initial_instance_count=1, instance_type=\"ml.m4.xlarge\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate\n", "We can now use this predictor to classify hand-written digits. Drawing into the image box loads the pixel data into a `data` variable in this notebook, which we can then pass to the `predictor`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import numpy as np\n", "\n", "im_ = np.random.rand(1, 1, 28, 28)\n", "\n", "image = np.array(im_, dtype=np.float32)\n", "response = predictor.predict(image)\n", "prediction = response.argmax(axis=1)[0]\n", "print(prediction)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Cleanup\n", "\n", "After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor.delete_endpoint()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Elastic Inference" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Selecting the right instance type for inference requires deciding between different amounts of GPU, CPU, and memory resources, and optimizing for one of these resources on a standalone GPU instance usually leads to under-utilization of other resources. [Amazon Elastic Inference](https://aws.amazon.com/machine-learning/elastic-inference/) solves this problem by enabling us to attach the right amount of GPU-powered inference acceleration to our endpoint. In March 2020, [Elastic Inference support for PyTorch became available](https://aws.amazon.com/blogs/machine-learning/reduce-ml-inference-costs-on-amazon-sagemaker-for-pytorch-models-using-amazon-elastic-inference/) for both Amazon SageMaker and Amazon EC2. For supported images, check [available images in Elastic Inference Containers section](https://github.com/aws/deep-learning-containers/blob/master/available_images.md)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To use Elastic Inference, we must convert our trained model to TorchScript. The location of the model artifacts is estimator.model_data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimator.model_data" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "First we create a folder to save model trained model, and download the model.tar.gz file to local directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sh -s $estimator.model_data\n", "mkdir model\n", "aws s3 cp $1 model/ \n", "tar xvzf model/model.tar.gz --directory ./model" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Convert your model into the TorchScript format using torch.jit.trace or torch.jit.script." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch.nn as nn\n", "import torch.nn.functional as F\n", "\n", "\n", "class Net(nn.Module):\n", " def __init__(self):\n", " super(Net, self).__init__()\n", " self.conv1 = nn.Conv2d(1, 10, kernel_size=5)\n", " self.conv2 = nn.Conv2d(10, 20, kernel_size=5)\n", " self.conv2_drop = nn.Dropout2d()\n", " self.fc1 = nn.Linear(320, 50)\n", " self.fc2 = nn.Linear(50, 10)\n", "\n", " def forward(self, x):\n", " x = F.relu(F.max_pool2d(self.conv1(x), 2))\n", " x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))\n", " x = x.view(-1, 320)\n", " x = F.relu(self.fc1(x))\n", " x = F.dropout(x, training=self.training)\n", " x = self.fc2(x)\n", " return F.log_softmax(x, dim=1)\n", "\n", "\n", "model_loaded = torch.load(\"model/model.pth\")\n", "model = Net().to(\"cpu\")\n", "model = torch.nn.DataParallel(model)\n", "model.load_state_dict(model_loaded)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "\n", "trace_input = torch.rand(1, 1, 28, 28)\n", "traced_model = torch.jit.trace(model.eval(), trace_input)\n", "\n", "torch.jit.save(traced_model, \"model.pth\")\n", "subprocess.call([\"tar\", \"-czvf\", \"traced_mnist_model.tar.gz\", \"model.pth\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Loading the TorchScript model and using it for prediction require small changes in our model loading and prediction functions. We create a new script deploy_ei.py that is slightly different from train_deploy.py script." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "**Notes:** Elastic Inference images are using different apis for PyTorch 1.3.1 and PyTorch 1.5.1. If you are using PyTorch 1.3.1, you can leave deploy_ei.py empty since we provide default handler for you. If you are using PyTorch 1.5.1, create a script like deploy_ei.py. Please use `attach_eia` in `model_fn`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!pygmentize code/deploy_ei.py" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Next we upload TorchScript model to S3 and deploy using Elastic Inference. The accelerator_type=ml.eia2.xlarge parameter is how we attach the Elastic Inference accelerator to our endpoint." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch import PyTorchModel\n", "from datetime import datetime\n", "\n", "instance_type = \"ml.m5.large\"\n", "accelerator_type = \"ml.eia2.xlarge\"\n", "\n", "# TorchScript model\n", "tar_filename = \"traced_mnist_model.tar.gz\"\n", "\n", "# You can also upload model artifacts to S3\n", "# print('Upload tarball to S3')\n", "# model_data = sagemaker_session.upload_data(path=tar_filename, bucket=bucket, key_prefix=prefix)\n", "model_data = tar_filename\n", "\n", "endpoint_name = (\n", " \"mnist-ei-traced-{}-{}-{}\".format(\n", " instance_type, accelerator_type, datetime.now().strftime(\"%d-%m-%Y-%H-%M-%S\")\n", " )\n", " .replace(\".\", \"\")\n", " .replace(\"_\", \"\")\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pytorch = PyTorchModel(\n", " model_data=model_data,\n", " role=role,\n", " entry_point=\"deploy_ei.py\",\n", " source_dir=\"code\",\n", " framework_version=\"1.3.1\",\n", " py_version=\"py3\",\n", " sagemaker_session=sagemaker_session,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You can attach EI remotely or locally" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Attach EI remotely\n", "\n", "# Function will exit before endpoint is finished creating\n", "predictor = pytorch.deploy(\n", " initial_instance_count=1,\n", " instance_type=instance_type,\n", " accelerator_type=accelerator_type,\n", " endpoint_name=endpoint_name,\n", " wait=True,\n", ")\n", "\n", "\n", "# # Attach EI locally\n", "\n", "# # Deploys the model to a local endpoint\n", "# pytorch_predictor = pytorch.deploy(\n", "# initial_instance_count=1,\n", "# instance_type='local',\n", "# accelerator_type='local_sagemaker_notebook')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor.delete_endpoint()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-python-sdk|pytorch_mnist|pytorch_mnist_elastic_inference.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-python-sdk|pytorch_mnist|pytorch_mnist_elastic_inference.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-python-sdk|pytorch_mnist|pytorch_mnist_elastic_inference.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-python-sdk|pytorch_mnist|pytorch_mnist_elastic_inference.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-python-sdk|pytorch_mnist|pytorch_mnist_elastic_inference.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-python-sdk|pytorch_mnist|pytorch_mnist_elastic_inference.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-python-sdk|pytorch_mnist|pytorch_mnist_elastic_inference.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-python-sdk|pytorch_mnist|pytorch_mnist_elastic_inference.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-python-sdk|pytorch_mnist|pytorch_mnist_elastic_inference.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-python-sdk|pytorch_mnist|pytorch_mnist_elastic_inference.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-python-sdk|pytorch_mnist|pytorch_mnist_elastic_inference.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-python-sdk|pytorch_mnist|pytorch_mnist_elastic_inference.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-python-sdk|pytorch_mnist|pytorch_mnist_elastic_inference.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-python-sdk|pytorch_mnist|pytorch_mnist_elastic_inference.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-python-sdk|pytorch_mnist|pytorch_mnist_elastic_inference.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (Data Science 2.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/sagemaker-data-science-38" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "notice": "Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." }, "nbformat": 4, "nbformat_minor": 1 }