{ "cells": [ { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: Skipping torchvison as it is not installed.\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0myes: standard output: Broken pipe\n", "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m" ] } ], "source": [ "!yes | pip uninstall torchvison\n", "!pip install -qU torchvision" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# MNIST Training using PyTorch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "1. [Background](#Background)\n", "1. [Setup](#Setup)\n", "1. [Data](#Data)\n", "1. [Train](#Train)\n", "1. [Host](#Host)\n", "\n", "---\n", "\n", "## Background\n", "\n", "MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). This tutorial will show how to train and test an MNIST model on SageMaker using PyTorch.\n", "\n", "For more information about the PyTorch in SageMaker, please visit [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers) and [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) github repositories.\n", "\n", "---\n", "\n", "## Setup\n", "\n", "_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n", "\n", "Let's start by creating a SageMaker session and specifying:\n", "\n", "- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.\n", "- The IAM role arn used to give training and hosting access to your data. See the documentation for how to create these. 
Note: if more than one role is required for notebook instances, training, and/or hosting, please replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s).\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [], "source": [ "import sagemaker\n", "\n", "sagemaker_session = sagemaker.Session()\n", "\n", "bucket = sagemaker_session.default_bucket()\n", "prefix = \"sagemaker/pytorch-mnist\"\n", "\n", "role = sagemaker.get_execution_role()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "### Getting the data\n", "\n", "We download the MNIST dataset from a public SageMaker-hosted S3 mirror and apply the standard MNIST normalization transform.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Dataset MNIST\n", " Number of datapoints: 60000\n", " Root location: lab03-pytorch-data\n", " Split: Train\n", " StandardTransform\n", "Transform: Compose(\n", " ToTensor()\n", " Normalize(mean=(0.1307,), std=(0.3081,))\n", " )" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from torchvision.datasets import MNIST\n", "from torchvision import transforms\n", "\n", "# Download MNIST from SageMaker's public sample-files bucket\n", "MNIST.mirrors = [\"https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/\"]\n", "\n", "MNIST(\n", " \"lab03-pytorch-data\",\n", " download=True,\n", " transform=transforms.Compose(\n", " [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]\n", " ),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Uploading the data to S3\n", "We are going to use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value `inputs` identifies the location -- we will use it later when we start the training job.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "input spec (in this case, just an S3 path): s3://sagemaker-us-east-1-989682844397/sagemaker/pytorch-mnist\n" ] } ], "source": [ "inputs = sagemaker_session.upload_data(path=\"lab03-pytorch-data\", bucket=bucket, key_prefix=prefix)\n", "print(\"input spec (in this case, just an S3 path): {}\".format(inputs))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train\n", "### Training script\n", "The `mnist-pytorch.py` script provides all the code we need for training and hosting a SageMaker model (including the `model_fn` function required to load the model).\n", "The training script is very similar to one you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:\n", "\n", "* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to.\n", " These artifacts are uploaded to S3 for model hosting.\n", "* `SM_NUM_GPUS`: The number of GPUs available in the current container.\n", "* `SM_CURRENT_HOST`: The name of the current container on the container network.\n", "* `SM_HOSTS`: A JSON-encoded list containing all the hosts.\n", "\n", "Supposing one input channel, 'training', was used in the call to the PyTorch estimator's `fit()` method, the following will be set, following the format `SM_CHANNEL_[channel_name]`:\n", "\n", "* `SM_CHANNEL_TRAINING`: A string representing the path to the directory containing data in the 'training' channel.\n", "\n", "For more information about training environment variables, please visit [SageMaker Containers](https://github.com/aws/sagemaker-containers).\n", "\n", "A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves the model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance, as the sketch below shows.\n",
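"\n", "A minimal sketch of that pattern (an illustrative snippet, not the actual script -- the environment variables are the ones listed above, and the argument names mirror the full script shown below):\n", "\n", "```python\n", "import argparse\n", "import json\n", "import os\n", "\n", "parser = argparse.ArgumentParser()\n", "# Hyperparameters passed to the estimator arrive as command-line arguments.\n", "parser.add_argument(\"--epochs\", type=int, default=10)\n", "# SageMaker-provided paths and cluster topology arrive as environment variables.\n", "parser.add_argument(\"--model-dir\", type=str, default=os.environ[\"SM_MODEL_DIR\"])\n", "parser.add_argument(\"--data-dir\", type=str, default=os.environ[\"SM_CHANNEL_TRAINING\"])\n", "parser.add_argument(\"--hosts\", type=list, default=json.loads(os.environ[\"SM_HOSTS\"]))\n", "args = parser.parse_args()\n", "```\n",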
"\n", "Because SageMaker imports the training script, you should put your training code in a main guard (``if __name__=='__main__':``) if you are using the same script to host your model, as we do in this example, so that SageMaker does not inadvertently run your training code at the wrong point in execution.\n", "\n", "For example, the script run by this notebook:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36margparse\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mjson\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mlogging\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mos\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36msys\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[37m#import sagemaker_containers\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mtorch\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mtorch\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mdistributed\u001b[39;49;00m \u001b[34mas\u001b[39;49;00m \u001b[04m\u001b[36mdist\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mtorch\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mnn\u001b[39;49;00m \u001b[34mas\u001b[39;49;00m \u001b[04m\u001b[36mnn\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mtorch\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mnn\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mfunctional\u001b[39;49;00m \u001b[34mas\u001b[39;49;00m \u001b[04m\u001b[36mF\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mtorch\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36moptim\u001b[39;49;00m \u001b[34mas\u001b[39;49;00m \u001b[04m\u001b[36moptim\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mtorch\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mutils\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mdata\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mtorch\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mutils\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mdata\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mdistributed\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", "\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36mtorchvision\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m datasets, transforms\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "logger = logging.getLogger(\u001b[31m__name__\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n",
"logger.setLevel(logging.DEBUG)\u001b[37m\u001b[39;49;00m\n", "logger.addHandler(logging.StreamHandler(sys.stdout))\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[37m# Based on https://github.com/pytorch/examples/blob/master/mnist/main.py\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", "\u001b[34mclass\u001b[39;49;00m \u001b[04m\u001b[32mNet\u001b[39;49;00m(nn.Module):\u001b[37m\u001b[39;49;00m\n", " \u001b[34mdef\u001b[39;49;00m \u001b[32m__init__\u001b[39;49;00m(\u001b[36mself\u001b[39;49;00m):\u001b[37m\u001b[39;49;00m\n", " \u001b[36msuper\u001b[39;49;00m(Net, \u001b[36mself\u001b[39;49;00m).\u001b[32m__init__\u001b[39;49;00m()\u001b[37m\u001b[39;49;00m\n", " \u001b[36mself\u001b[39;49;00m.conv1 = nn.Conv2d(\u001b[34m1\u001b[39;49;00m, \u001b[34m10\u001b[39;49;00m, kernel_size=\u001b[34m5\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", " \u001b[36mself\u001b[39;49;00m.conv2 = nn.Conv2d(\u001b[34m10\u001b[39;49;00m, \u001b[34m20\u001b[39;49;00m, kernel_size=\u001b[34m5\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", " \u001b[36mself\u001b[39;49;00m.conv2_drop = nn.Dropout2d()\u001b[37m\u001b[39;49;00m\n", " \u001b[36mself\u001b[39;49;00m.fc1 = nn.Linear(\u001b[34m320\u001b[39;49;00m, \u001b[34m50\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", " \u001b[36mself\u001b[39;49;00m.fc2 = nn.Linear(\u001b[34m50\u001b[39;49;00m, \u001b[34m10\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", " \u001b[34mdef\u001b[39;49;00m \u001b[32mforward\u001b[39;49;00m(\u001b[36mself\u001b[39;49;00m, x):\u001b[37m\u001b[39;49;00m\n", " x = F.relu(F.max_pool2d(\u001b[36mself\u001b[39;49;00m.conv1(x), \u001b[34m2\u001b[39;49;00m))\u001b[37m\u001b[39;49;00m\n", " x = F.relu(F.max_pool2d(\u001b[36mself\u001b[39;49;00m.conv2_drop(\u001b[36mself\u001b[39;49;00m.conv2(x)), \u001b[34m2\u001b[39;49;00m))\u001b[37m\u001b[39;49;00m\n", " x = x.view(-\u001b[34m1\u001b[39;49;00m, \u001b[34m320\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", " x = F.relu(\u001b[36mself\u001b[39;49;00m.fc1(x))\u001b[37m\u001b[39;49;00m\n", " x = F.dropout(x, training=\u001b[36mself\u001b[39;49;00m.training)\u001b[37m\u001b[39;49;00m\n", " x = \u001b[36mself\u001b[39;49;00m.fc2(x)\u001b[37m\u001b[39;49;00m\n", " \u001b[34mreturn\u001b[39;49;00m F.log_softmax(x, dim=\u001b[34m1\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[34mdef\u001b[39;49;00m \u001b[32m_get_train_data_loader\u001b[39;49;00m(batch_size, training_dir, is_distributed, **kwargs):\u001b[37m\u001b[39;49;00m\n", " logger.info(\u001b[33m\"\u001b[39;49;00m\u001b[33mGet train data loader\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", " dataset = datasets.MNIST(\u001b[37m\u001b[39;49;00m\n", " training_dir,\u001b[37m\u001b[39;49;00m\n", " train=\u001b[34mTrue\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " transform=transforms.Compose(\u001b[37m\u001b[39;49;00m\n", " [transforms.ToTensor(), transforms.Normalize((\u001b[34m0.1307\u001b[39;49;00m,), (\u001b[34m0.3081\u001b[39;49;00m,))]\u001b[37m\u001b[39;49;00m\n", " ),\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", " train_sampler = (\u001b[37m\u001b[39;49;00m\n", " torch.utils.data.distributed.DistributedSampler(dataset) \u001b[34mif\u001b[39;49;00m is_distributed \u001b[34melse\u001b[39;49;00m \u001b[34mNone\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", " \u001b[34mreturn\u001b[39;49;00m 
torch.utils.data.DataLoader(\u001b[37m\u001b[39;49;00m\n", " dataset,\u001b[37m\u001b[39;49;00m\n", " batch_size=batch_size,\u001b[37m\u001b[39;49;00m\n", " shuffle=train_sampler \u001b[35mis\u001b[39;49;00m \u001b[34mNone\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " sampler=train_sampler,\u001b[37m\u001b[39;49;00m\n", " **kwargs\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[34mdef\u001b[39;49;00m \u001b[32m_get_test_data_loader\u001b[39;49;00m(test_batch_size, training_dir, **kwargs):\u001b[37m\u001b[39;49;00m\n", " logger.info(\u001b[33m\"\u001b[39;49;00m\u001b[33mGet test data loader\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", " \u001b[34mreturn\u001b[39;49;00m torch.utils.data.DataLoader(\u001b[37m\u001b[39;49;00m\n", " datasets.MNIST(\u001b[37m\u001b[39;49;00m\n", " training_dir,\u001b[37m\u001b[39;49;00m\n", " train=\u001b[34mFalse\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " transform=transforms.Compose(\u001b[37m\u001b[39;49;00m\n", " [transforms.ToTensor(), transforms.Normalize((\u001b[34m0.1307\u001b[39;49;00m,), (\u001b[34m0.3081\u001b[39;49;00m,))]\u001b[37m\u001b[39;49;00m\n", " ),\u001b[37m\u001b[39;49;00m\n", " ),\u001b[37m\u001b[39;49;00m\n", " batch_size=test_batch_size,\u001b[37m\u001b[39;49;00m\n", " shuffle=\u001b[34mTrue\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " **kwargs\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[34mdef\u001b[39;49;00m \u001b[32m_average_gradients\u001b[39;49;00m(model):\u001b[37m\u001b[39;49;00m\n", " \u001b[37m# Gradient averaging.\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " size = \u001b[36mfloat\u001b[39;49;00m(dist.get_world_size())\u001b[37m\u001b[39;49;00m\n", " \u001b[34mfor\u001b[39;49;00m param \u001b[35min\u001b[39;49;00m model.parameters():\u001b[37m\u001b[39;49;00m\n", " dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM)\u001b[37m\u001b[39;49;00m\n", " param.grad.data /= size\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[34mdef\u001b[39;49;00m \u001b[32mtrain\u001b[39;49;00m(args):\u001b[37m\u001b[39;49;00m\n", " is_distributed = \u001b[36mlen\u001b[39;49;00m(args.hosts) > \u001b[34m1\u001b[39;49;00m \u001b[35mand\u001b[39;49;00m args.backend \u001b[35mis\u001b[39;49;00m \u001b[35mnot\u001b[39;49;00m \u001b[34mNone\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " logger.debug(\u001b[33m\"\u001b[39;49;00m\u001b[33mDistributed training - \u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m.format(is_distributed))\u001b[37m\u001b[39;49;00m\n", " use_cuda = args.num_gpus > \u001b[34m0\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " logger.debug(\u001b[33m\"\u001b[39;49;00m\u001b[33mNumber of gpus available - \u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m.format(args.num_gpus))\u001b[37m\u001b[39;49;00m\n", " kwargs = {\u001b[33m\"\u001b[39;49;00m\u001b[33mnum_workers\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m: \u001b[34m1\u001b[39;49;00m, \u001b[33m\"\u001b[39;49;00m\u001b[33mpin_memory\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m: \u001b[34mTrue\u001b[39;49;00m} \u001b[34mif\u001b[39;49;00m use_cuda \u001b[34melse\u001b[39;49;00m {}\u001b[37m\u001b[39;49;00m\n", " device = torch.device(\u001b[33m\"\u001b[39;49;00m\u001b[33mcuda\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m \u001b[34mif\u001b[39;49;00m use_cuda 
\u001b[34melse\u001b[39;49;00m \u001b[33m\"\u001b[39;49;00m\u001b[33mcpu\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", " \u001b[34mif\u001b[39;49;00m is_distributed:\u001b[37m\u001b[39;49;00m\n", " \u001b[37m# Initialize the distributed environment.\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " world_size = \u001b[36mlen\u001b[39;49;00m(args.hosts)\u001b[37m\u001b[39;49;00m\n", " os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mWORLD_SIZE\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m] = \u001b[36mstr\u001b[39;49;00m(world_size)\u001b[37m\u001b[39;49;00m\n", " host_rank = args.hosts.index(args.current_host)\u001b[37m\u001b[39;49;00m\n", " os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mRANK\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m] = \u001b[36mstr\u001b[39;49;00m(host_rank)\u001b[37m\u001b[39;49;00m\n", " dist.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)\u001b[37m\u001b[39;49;00m\n", " logger.info(\u001b[37m\u001b[39;49;00m\n", " \u001b[33m\"\u001b[39;49;00m\u001b[33mInitialized the distributed environment: \u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\u001b[33m backend on \u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m nodes. \u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m.format(\u001b[37m\u001b[39;49;00m\n", " args.backend, dist.get_world_size()\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", " + \u001b[33m\"\u001b[39;49;00m\u001b[33mCurrent host rank is \u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m. Number of gpus: \u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m.format(dist.get_rank(), args.num_gpus)\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", " \u001b[37m# set the seed for generating random numbers\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " torch.manual_seed(args.seed)\u001b[37m\u001b[39;49;00m\n", " \u001b[34mif\u001b[39;49;00m use_cuda:\u001b[37m\u001b[39;49;00m\n", " torch.cuda.manual_seed(args.seed)\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", " train_loader = _get_train_data_loader(args.batch_size, args.data_dir, is_distributed, **kwargs)\u001b[37m\u001b[39;49;00m\n", " test_loader = _get_test_data_loader(args.test_batch_size, args.data_dir, **kwargs)\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", " logger.debug(\u001b[37m\u001b[39;49;00m\n", " \u001b[33m\"\u001b[39;49;00m\u001b[33mProcesses \u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m/\u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m (\u001b[39;49;00m\u001b[33m{:.0f}\u001b[39;49;00m\u001b[33m%\u001b[39;49;00m\u001b[33m) of train data\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m.format(\u001b[37m\u001b[39;49;00m\n", " \u001b[36mlen\u001b[39;49;00m(train_loader.sampler),\u001b[37m\u001b[39;49;00m\n", " \u001b[36mlen\u001b[39;49;00m(train_loader.dataset),\u001b[37m\u001b[39;49;00m\n", " \u001b[34m100.0\u001b[39;49;00m * \u001b[36mlen\u001b[39;49;00m(train_loader.sampler) / \u001b[36mlen\u001b[39;49;00m(train_loader.dataset),\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", " logger.debug(\u001b[37m\u001b[39;49;00m\n", " \u001b[33m\"\u001b[39;49;00m\u001b[33mProcesses \u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m/\u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m 
(\u001b[39;49;00m\u001b[33m{:.0f}\u001b[39;49;00m\u001b[33m%\u001b[39;49;00m\u001b[33m) of test data\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m.format(\u001b[37m\u001b[39;49;00m\n", " \u001b[36mlen\u001b[39;49;00m(test_loader.sampler),\u001b[37m\u001b[39;49;00m\n", " \u001b[36mlen\u001b[39;49;00m(test_loader.dataset),\u001b[37m\u001b[39;49;00m\n", " \u001b[34m100.0\u001b[39;49;00m * \u001b[36mlen\u001b[39;49;00m(test_loader.sampler) / \u001b[36mlen\u001b[39;49;00m(test_loader.dataset),\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", " model = Net().to(device)\u001b[37m\u001b[39;49;00m\n", " \u001b[34mif\u001b[39;49;00m is_distributed \u001b[35mand\u001b[39;49;00m use_cuda:\u001b[37m\u001b[39;49;00m\n", " \u001b[37m# multi-machine multi-gpu case\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " model = torch.nn.parallel.DistributedDataParallel(model)\u001b[37m\u001b[39;49;00m\n", " \u001b[34melse\u001b[39;49;00m:\u001b[37m\u001b[39;49;00m\n", " \u001b[37m# single-machine multi-gpu case or single-machine or multi-machine cpu case\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " model = torch.nn.DataParallel(model)\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", " optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", " \u001b[34mfor\u001b[39;49;00m epoch \u001b[35min\u001b[39;49;00m \u001b[36mrange\u001b[39;49;00m(\u001b[34m1\u001b[39;49;00m, args.epochs + \u001b[34m1\u001b[39;49;00m):\u001b[37m\u001b[39;49;00m\n", " model.train()\u001b[37m\u001b[39;49;00m\n", " \u001b[34mfor\u001b[39;49;00m batch_idx, (data, target) \u001b[35min\u001b[39;49;00m \u001b[36menumerate\u001b[39;49;00m(train_loader, \u001b[34m1\u001b[39;49;00m):\u001b[37m\u001b[39;49;00m\n", " data, target = data.to(device), target.to(device)\u001b[37m\u001b[39;49;00m\n", " optimizer.zero_grad()\u001b[37m\u001b[39;49;00m\n", " output = model(data)\u001b[37m\u001b[39;49;00m\n", " loss = F.nll_loss(output, target)\u001b[37m\u001b[39;49;00m\n", " loss.backward()\u001b[37m\u001b[39;49;00m\n", " \u001b[34mif\u001b[39;49;00m is_distributed \u001b[35mand\u001b[39;49;00m \u001b[35mnot\u001b[39;49;00m use_cuda:\u001b[37m\u001b[39;49;00m\n", " \u001b[37m# average gradients manually for multi-machine cpu case only\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " _average_gradients(model)\u001b[37m\u001b[39;49;00m\n", " optimizer.step()\u001b[37m\u001b[39;49;00m\n", " \u001b[34mif\u001b[39;49;00m batch_idx % args.log_interval == \u001b[34m0\u001b[39;49;00m:\u001b[37m\u001b[39;49;00m\n", " logger.info(\u001b[37m\u001b[39;49;00m\n", " \u001b[33m\"\u001b[39;49;00m\u001b[33mTrain Epoch: \u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m [\u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m/\u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m (\u001b[39;49;00m\u001b[33m{:.0f}\u001b[39;49;00m\u001b[33m%\u001b[39;49;00m\u001b[33m)] Loss: \u001b[39;49;00m\u001b[33m{:.6f}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m.format(\u001b[37m\u001b[39;49;00m\n", " epoch,\u001b[37m\u001b[39;49;00m\n", " batch_idx * \u001b[36mlen\u001b[39;49;00m(data),\u001b[37m\u001b[39;49;00m\n", " \u001b[36mlen\u001b[39;49;00m(train_loader.sampler),\u001b[37m\u001b[39;49;00m\n", " \u001b[34m100.0\u001b[39;49;00m * batch_idx / \u001b[36mlen\u001b[39;49;00m(train_loader),\u001b[37m\u001b[39;49;00m\n", " loss.item(),\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", " 
)\u001b[37m\u001b[39;49;00m\n", " test(model, test_loader, device)\u001b[37m\u001b[39;49;00m\n", " save_model(model, args.model_dir)\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[34mdef\u001b[39;49;00m \u001b[32mtest\u001b[39;49;00m(model, test_loader, device):\u001b[37m\u001b[39;49;00m\n", " model.eval()\u001b[37m\u001b[39;49;00m\n", " test_loss = \u001b[34m0\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " correct = \u001b[34m0\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " \u001b[34mwith\u001b[39;49;00m torch.no_grad():\u001b[37m\u001b[39;49;00m\n", " \u001b[34mfor\u001b[39;49;00m data, target \u001b[35min\u001b[39;49;00m test_loader:\u001b[37m\u001b[39;49;00m\n", " data, target = data.to(device), target.to(device)\u001b[37m\u001b[39;49;00m\n", " output = model(data)\u001b[37m\u001b[39;49;00m\n", " test_loss += F.nll_loss(output, target, size_average=\u001b[34mFalse\u001b[39;49;00m).item() \u001b[37m# sum up batch loss\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " pred = output.max(\u001b[34m1\u001b[39;49;00m, keepdim=\u001b[34mTrue\u001b[39;49;00m)[\u001b[34m1\u001b[39;49;00m] \u001b[37m# get the index of the max log-probability\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " correct += pred.eq(target.view_as(pred)).sum().item()\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", " test_loss /= \u001b[36mlen\u001b[39;49;00m(test_loader.dataset)\u001b[37m\u001b[39;49;00m\n", " logger.info(\u001b[37m\u001b[39;49;00m\n", " \u001b[33m\"\u001b[39;49;00m\u001b[33mTest set: Average loss: \u001b[39;49;00m\u001b[33m{:.4f}\u001b[39;49;00m\u001b[33m, Accuracy: \u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m/\u001b[39;49;00m\u001b[33m{}\u001b[39;49;00m\u001b[33m (\u001b[39;49;00m\u001b[33m{:.0f}\u001b[39;49;00m\u001b[33m%\u001b[39;49;00m\u001b[33m)\u001b[39;49;00m\u001b[33m\\n\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m.format(\u001b[37m\u001b[39;49;00m\n", " test_loss, correct, \u001b[36mlen\u001b[39;49;00m(test_loader.dataset), \u001b[34m100.0\u001b[39;49;00m * correct / \u001b[36mlen\u001b[39;49;00m(test_loader.dataset)\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[34mdef\u001b[39;49;00m \u001b[32mmodel_fn\u001b[39;49;00m(model_dir):\u001b[37m\u001b[39;49;00m\n", " device = torch.device(\u001b[33m\"\u001b[39;49;00m\u001b[33mcuda\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m \u001b[34mif\u001b[39;49;00m torch.cuda.is_available() \u001b[34melse\u001b[39;49;00m \u001b[33m\"\u001b[39;49;00m\u001b[33mcpu\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", " model = torch.nn.DataParallel(Net())\u001b[37m\u001b[39;49;00m\n", " \u001b[34mwith\u001b[39;49;00m \u001b[36mopen\u001b[39;49;00m(os.path.join(model_dir, \u001b[33m\"\u001b[39;49;00m\u001b[33mmodel.pth\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m), \u001b[33m\"\u001b[39;49;00m\u001b[33mrb\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m) \u001b[34mas\u001b[39;49;00m f:\u001b[37m\u001b[39;49;00m\n", " model.load_state_dict(torch.load(f))\u001b[37m\u001b[39;49;00m\n", " \u001b[34mreturn\u001b[39;49;00m model.to(device)\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[34mdef\u001b[39;49;00m \u001b[32msave_model\u001b[39;49;00m(model, model_dir):\u001b[37m\u001b[39;49;00m\n", " logger.info(\u001b[33m\"\u001b[39;49;00m\u001b[33mSaving the 
model.\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", " path = os.path.join(model_dir, \u001b[33m\"\u001b[39;49;00m\u001b[33mmodel.pth\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", " \u001b[37m# recommended way from http://pytorch.org/docs/master/notes/serialization.html\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " torch.save(model.cpu().state_dict(), path)\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", "\u001b[34mif\u001b[39;49;00m \u001b[31m__name__\u001b[39;49;00m == \u001b[33m\"\u001b[39;49;00m\u001b[33m__main__\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m:\u001b[37m\u001b[39;49;00m\n", " parser = argparse.ArgumentParser()\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", " \u001b[37m# Data and model checkpoints directories\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " parser.add_argument(\u001b[37m\u001b[39;49;00m\n", " \u001b[33m\"\u001b[39;49;00m\u001b[33m--batch-size\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " \u001b[36mtype\u001b[39;49;00m=\u001b[36mint\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " default=\u001b[34m64\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " metavar=\u001b[33m\"\u001b[39;49;00m\u001b[33mN\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " help=\u001b[33m\"\u001b[39;49;00m\u001b[33minput batch size for training (default: 64)\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", " parser.add_argument(\u001b[37m\u001b[39;49;00m\n", " \u001b[33m\"\u001b[39;49;00m\u001b[33m--test-batch-size\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " \u001b[36mtype\u001b[39;49;00m=\u001b[36mint\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " default=\u001b[34m1000\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " metavar=\u001b[33m\"\u001b[39;49;00m\u001b[33mN\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " help=\u001b[33m\"\u001b[39;49;00m\u001b[33minput batch size for testing (default: 1000)\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", " parser.add_argument(\u001b[37m\u001b[39;49;00m\n", " \u001b[33m\"\u001b[39;49;00m\u001b[33m--epochs\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " \u001b[36mtype\u001b[39;49;00m=\u001b[36mint\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " default=\u001b[34m10\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " metavar=\u001b[33m\"\u001b[39;49;00m\u001b[33mN\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " help=\u001b[33m\"\u001b[39;49;00m\u001b[33mnumber of epochs to train (default: 10)\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", " parser.add_argument(\u001b[37m\u001b[39;49;00m\n", " \u001b[33m\"\u001b[39;49;00m\u001b[33m--lr\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mfloat\u001b[39;49;00m, default=\u001b[34m0.01\u001b[39;49;00m, metavar=\u001b[33m\"\u001b[39;49;00m\u001b[33mLR\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, help=\u001b[33m\"\u001b[39;49;00m\u001b[33mlearning rate (default: 0.01)\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", " parser.add_argument(\u001b[37m\u001b[39;49;00m\n", " 
\u001b[33m\"\u001b[39;49;00m\u001b[33m--momentum\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mfloat\u001b[39;49;00m, default=\u001b[34m0.5\u001b[39;49;00m, metavar=\u001b[33m\"\u001b[39;49;00m\u001b[33mM\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, help=\u001b[33m\"\u001b[39;49;00m\u001b[33mSGD momentum (default: 0.5)\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", " parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--seed\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mint\u001b[39;49;00m, default=\u001b[34m1\u001b[39;49;00m, metavar=\u001b[33m\"\u001b[39;49;00m\u001b[33mS\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, help=\u001b[33m\"\u001b[39;49;00m\u001b[33mrandom seed (default: 1)\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\u001b[37m\u001b[39;49;00m\n", " parser.add_argument(\u001b[37m\u001b[39;49;00m\n", " \u001b[33m\"\u001b[39;49;00m\u001b[33m--log-interval\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " \u001b[36mtype\u001b[39;49;00m=\u001b[36mint\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " default=\u001b[34m100\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " metavar=\u001b[33m\"\u001b[39;49;00m\u001b[33mN\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " help=\u001b[33m\"\u001b[39;49;00m\u001b[33mhow many batches to wait before logging training status\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", " parser.add_argument(\u001b[37m\u001b[39;49;00m\n", " \u001b[33m\"\u001b[39;49;00m\u001b[33m--backend\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " \u001b[36mtype\u001b[39;49;00m=\u001b[36mstr\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " default=\u001b[34mNone\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " help=\u001b[33m\"\u001b[39;49;00m\u001b[33mbackend for distributed training (tcp, gloo on cpu and gloo, nccl on gpu)\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\u001b[37m\u001b[39;49;00m\n", " )\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", " \u001b[37m# Container environment\u001b[39;49;00m\u001b[37m\u001b[39;49;00m\n", " parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--hosts\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mlist\u001b[39;49;00m, default=json.loads(os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_HOSTS\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m]))\u001b[37m\u001b[39;49;00m\n", " parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--current-host\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mstr\u001b[39;49;00m, default=os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_CURRENT_HOST\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m])\u001b[37m\u001b[39;49;00m\n", " parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--model-dir\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mstr\u001b[39;49;00m, default=os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_MODEL_DIR\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m])\u001b[37m\u001b[39;49;00m\n", " parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--data-dir\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mstr\u001b[39;49;00m, 
default=os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_CHANNEL_TRAINING\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m])\u001b[37m\u001b[39;49;00m\n", " parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--num-gpus\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mint\u001b[39;49;00m, default=os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_NUM_GPUS\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m])\u001b[37m\u001b[39;49;00m\n", "\u001b[37m\u001b[39;49;00m\n", " train(parser.parse_args())\u001b[37m\u001b[39;49;00m\n" ] } ], "source": [ "!pygmentize mnist-pytorch.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run training in SageMaker\n", "\n", "The `PyTorch` class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script, an IAM role, the number of training instances, the training instance type, and hyperparameters. In this case we are going to run our training job on two ```ml.c5.2xlarge``` instances. But this example can be run on one or more CPU or GPU instances ([full list of available instances](https://aws.amazon.com/sagemaker/pricing/instance-types/)). The `hyperparameters` parameter is a dict of values that will be passed to your training script -- you can see how to access these values in the `mnist-pytorch.py` script above.\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch import PyTorch\n", "\n", "estimator = PyTorch(\n", " entry_point=\"mnist-pytorch.py\",\n", " role=role,\n", " py_version=\"py38\",\n", " framework_version=\"1.11.0\",\n", " instance_count=2,\n", " instance_type=\"ml.c5.2xlarge\",\n", " hyperparameters={\"epochs\": 1, \"backend\": \"gloo\"},\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After we've constructed our `PyTorch` object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.\n", "INFO:sagemaker:Creating training-job with name: pytorch-training-2023-03-20-00-02-32-142\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "2023-03-20 00:02:37 Starting - Starting the training job...\n", "2023-03-20 00:02:52 Starting - Preparing the instances for training...\n", "2023-03-20 00:03:38 Downloading - Downloading input data...\n", "2023-03-20 00:04:09 Training - Downloading the training image......\n", "2023-03-20 00:04:50 Training - Training image download completed. 
Training in progress.\u001b[34mbash: cannot set terminal process group (-1): Inappropriate ioctl for device\u001b[0m\n", "\u001b[34mbash: no job control in this shell\u001b[0m\n", "\u001b[34m2023-03-19 20:05:01,092 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training\u001b[0m\n", "\u001b[34m2023-03-19 20:05:01,094 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)\u001b[0m\n", "\u001b[34m2023-03-19 20:05:01,096 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)\u001b[0m\n", "\u001b[34m2023-03-19 20:05:01,104 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.\u001b[0m\n", "\u001b[34m2023-03-19 20:05:01,106 sagemaker_pytorch_container.training INFO Invoking user training script.\u001b[0m\n", "\u001b[34m2023-03-19 20:05:01,310 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)\u001b[0m\n", "\u001b[34m2023-03-19 20:05:01,312 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)\u001b[0m\n", "\u001b[34m2023-03-19 20:05:01,324 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)\u001b[0m\n", "\u001b[34m2023-03-19 20:05:01,326 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)\u001b[0m\n", "\u001b[34m2023-03-19 20:05:01,338 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)\u001b[0m\n", "\u001b[34m2023-03-19 20:05:01,340 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)\u001b[0m\n", "\u001b[34m2023-03-19 20:05:01,350 sagemaker-training-toolkit INFO Invoking user script\u001b[0m\n", "\u001b[34mTraining Env:\u001b[0m\n", "\u001b[34m{\n", " \"additional_framework_parameters\": {},\n", " \"channel_input_dirs\": {\n", " \"training\": \"/opt/ml/input/data/training\"\n", " },\n", " \"current_host\": \"algo-1\",\n", " \"current_instance_group\": \"homogeneousCluster\",\n", " \"current_instance_group_hosts\": [\n", " \"algo-1\",\n", " \"algo-2\"\n", " ],\n", " \"current_instance_type\": \"ml.c5.2xlarge\",\n", " \"distribution_hosts\": [],\n", " \"distribution_instance_groups\": [],\n", " \"framework_module\": \"sagemaker_pytorch_container.training:main\",\n", " \"hosts\": [\n", " \"algo-1\",\n", " \"algo-2\"\n", " ],\n", " \"hyperparameters\": {\n", " \"backend\": \"gloo\",\n", " \"epochs\": 1\n", " },\n", " \"input_config_dir\": \"/opt/ml/input/config\",\n", " \"input_data_config\": {\n", " \"training\": {\n", " \"TrainingInputMode\": \"File\",\n", " \"S3DistributionType\": \"FullyReplicated\",\n", " \"RecordWrapperType\": \"None\"\n", " }\n", " },\n", " \"input_dir\": \"/opt/ml/input\",\n", " \"instance_groups\": [\n", " \"homogeneousCluster\"\n", " ],\n", " \"instance_groups_dict\": {\n", " \"homogeneousCluster\": {\n", " \"instance_group_name\": \"homogeneousCluster\",\n", " \"instance_type\": \"ml.c5.2xlarge\",\n", " \"hosts\": [\n", " \"algo-1\",\n", " \"algo-2\"\n", " ]\n", " }\n", " },\n", " \"is_hetero\": false,\n", " \"is_master\": true,\n", " \"is_modelparallel_enabled\": null,\n", " \"is_smddpmprun_installed\": false,\n", " \"job_name\": \"pytorch-training-2023-03-20-00-02-32-142\",\n", " \"log_level\": 20,\n", " \"master_hostname\": \"algo-1\",\n", " \"model_dir\": \"/opt/ml/model\",\n", " \"module_dir\": \"s3://sagemaker-us-east-1-989682844397/pytorch-training-2023-03-20-00-02-32-142/source/sourcedir.tar.gz\",\n", " \"module_name\": \"mnist-pytorch\",\n", " \"network_interface_name\": 
\"eth0\",\n", " \"num_cpus\": 8,\n", " \"num_gpus\": 0,\n", " \"num_neurons\": 0,\n", " \"output_data_dir\": \"/opt/ml/output/data\",\n", " \"output_dir\": \"/opt/ml/output\",\n", " \"output_intermediate_dir\": \"/opt/ml/output/intermediate\",\n", " \"resource_config\": {\n", " \"current_host\": \"algo-1\",\n", " \"current_instance_type\": \"ml.c5.2xlarge\",\n", " \"current_group_name\": \"homogeneousCluster\",\n", " \"hosts\": [\n", " \"algo-1\",\n", " \"algo-2\"\n", " ],\n", " \"instance_groups\": [\n", " {\n", " \"instance_group_name\": \"homogeneousCluster\",\n", " \"instance_type\": \"ml.c5.2xlarge\",\n", " \"hosts\": [\n", " \"algo-1\",\n", " \"algo-2\"\n", " ]\n", " }\n", " ],\n", " \"network_interface_name\": \"eth0\"\n", " },\n", " \"user_entry_point\": \"mnist-pytorch.py\"\u001b[0m\n", "\u001b[34m}\u001b[0m\n", "\u001b[34mEnvironment variables:\u001b[0m\n", "\u001b[34mSM_HOSTS=[\"algo-1\",\"algo-2\"]\u001b[0m\n", "\u001b[34mSM_NETWORK_INTERFACE_NAME=eth0\u001b[0m\n", "\u001b[34mSM_HPS={\"backend\":\"gloo\",\"epochs\":1}\u001b[0m\n", "\u001b[34mSM_USER_ENTRY_POINT=mnist-pytorch.py\u001b[0m\n", "\u001b[34mSM_FRAMEWORK_PARAMS={}\u001b[0m\n", "\u001b[34mSM_RESOURCE_CONFIG={\"current_group_name\":\"homogeneousCluster\",\"current_host\":\"algo-1\",\"current_instance_type\":\"ml.c5.2xlarge\",\"hosts\":[\"algo-1\",\"algo-2\"],\"instance_groups\":[{\"hosts\":[\"algo-1\",\"algo-2\"],\"instance_group_name\":\"homogeneousCluster\",\"instance_type\":\"ml.c5.2xlarge\"}],\"network_interface_name\":\"eth0\"}\u001b[0m\n", "\u001b[34mSM_INPUT_DATA_CONFIG={\"training\":{\"RecordWrapperType\":\"None\",\"S3DistributionType\":\"FullyReplicated\",\"TrainingInputMode\":\"File\"}}\u001b[0m\n", "\u001b[34mSM_OUTPUT_DATA_DIR=/opt/ml/output/data\u001b[0m\n", "\u001b[34mSM_CHANNELS=[\"training\"]\u001b[0m\n", "\u001b[34mSM_CURRENT_HOST=algo-1\u001b[0m\n", "\u001b[34mSM_CURRENT_INSTANCE_TYPE=ml.c5.2xlarge\u001b[0m\n", "\u001b[34mSM_CURRENT_INSTANCE_GROUP=homogeneousCluster\u001b[0m\n", "\u001b[34mSM_CURRENT_INSTANCE_GROUP_HOSTS=[\"algo-1\",\"algo-2\"]\u001b[0m\n", "\u001b[34mSM_INSTANCE_GROUPS=[\"homogeneousCluster\"]\u001b[0m\n", "\u001b[34mSM_INSTANCE_GROUPS_DICT={\"homogeneousCluster\":{\"hosts\":[\"algo-1\",\"algo-2\"],\"instance_group_name\":\"homogeneousCluster\",\"instance_type\":\"ml.c5.2xlarge\"}}\u001b[0m\n", "\u001b[34mSM_DISTRIBUTION_INSTANCE_GROUPS=[]\u001b[0m\n", "\u001b[34mSM_IS_HETERO=false\u001b[0m\n", "\u001b[34mSM_MODULE_NAME=mnist-pytorch\u001b[0m\n", "\u001b[34mSM_LOG_LEVEL=20\u001b[0m\n", "\u001b[34mSM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main\u001b[0m\n", "\u001b[34mSM_INPUT_DIR=/opt/ml/input\u001b[0m\n", "\u001b[34mSM_INPUT_CONFIG_DIR=/opt/ml/input/config\u001b[0m\n", "\u001b[34mSM_OUTPUT_DIR=/opt/ml/output\u001b[0m\n", "\u001b[34mSM_NUM_CPUS=8\u001b[0m\n", "\u001b[34mSM_NUM_GPUS=0\u001b[0m\n", "\u001b[34mSM_NUM_NEURONS=0\u001b[0m\n", "\u001b[34mSM_MODEL_DIR=/opt/ml/model\u001b[0m\n", "\u001b[34mSM_MODULE_DIR=s3://sagemaker-us-east-1-989682844397/pytorch-training-2023-03-20-00-02-32-142/source/sourcedir.tar.gz\u001b[0m\n", 
"\u001b[34mSM_TRAINING_ENV={\"additional_framework_parameters\":{},\"channel_input_dirs\":{\"training\":\"/opt/ml/input/data/training\"},\"current_host\":\"algo-1\",\"current_instance_group\":\"homogeneousCluster\",\"current_instance_group_hosts\":[\"algo-1\",\"algo-2\"],\"current_instance_type\":\"ml.c5.2xlarge\",\"distribution_hosts\":[],\"distribution_instance_groups\":[],\"framework_module\":\"sagemaker_pytorch_container.training:main\",\"hosts\":[\"algo-1\",\"algo-2\"],\"hyperparameters\":{\"backend\":\"gloo\",\"epochs\":1},\"input_config_dir\":\"/opt/ml/input/config\",\"input_data_config\":{\"training\":{\"RecordWrapperType\":\"None\",\"S3DistributionType\":\"FullyReplicated\",\"TrainingInputMode\":\"File\"}},\"input_dir\":\"/opt/ml/input\",\"instance_groups\":[\"homogeneousCluster\"],\"instance_groups_dict\":{\"homogeneousCluster\":{\"hosts\":[\"algo-1\",\"algo-2\"],\"instance_group_name\":\"homogeneousCluster\",\"instance_type\":\"ml.c5.2xlarge\"}},\"is_hetero\":false,\"is_master\":true,\"is_modelparallel_enabled\":null,\"is_smddpmprun_installed\":false,\"job_name\":\"pytorch-training-2023-03-20-00-02-32-142\",\"log_level\":20,\"master_hostname\":\"algo-1\",\"model_dir\":\"/opt/ml/model\",\"module_dir\":\"s3://sagemaker-us-east-1-989682844397/pytorch-training-2023-03-20-00-02-32-142/source/sourcedir.tar.gz\",\"module_name\":\"mnist-pytorch\",\"network_interface_name\":\"eth0\",\"num_cpus\":8,\"num_gpus\":0,\"num_neurons\":0,\"output_data_dir\":\"/opt/ml/output/data\",\"output_dir\":\"/opt/ml/output\",\"output_intermediate_dir\":\"/opt/ml/output/intermediate\",\"resource_config\":{\"current_group_name\":\"homogeneousCluster\",\"current_host\":\"algo-1\",\"current_instance_type\":\"ml.c5.2xlarge\",\"hosts\":[\"algo-1\",\"algo-2\"],\"instance_groups\":[{\"hosts\":[\"algo-1\",\"algo-2\"],\"instance_group_name\":\"homogeneousCluster\",\"instance_type\":\"ml.c5.2xlarge\"}],\"network_interface_name\":\"eth0\"},\"user_entry_point\":\"mnist-pytorch.py\"}\u001b[0m\n", "\u001b[34mSM_USER_ARGS=[\"--backend\",\"gloo\",\"--epochs\",\"1\"]\u001b[0m\n", "\u001b[34mSM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate\u001b[0m\n", "\u001b[34mSM_CHANNEL_TRAINING=/opt/ml/input/data/training\u001b[0m\n", "\u001b[34mSM_HP_BACKEND=gloo\u001b[0m\n", "\u001b[34mSM_HP_EPOCHS=1\u001b[0m\n", "\u001b[34mPYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python38.zip:/opt/conda/lib/python3.8:/opt/conda/lib/python3.8/lib-dynload:/opt/conda/lib/python3.8/site-packages:/opt/conda/lib/python3.8/site-packages/smdebug-1.0.26b20230214-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument-3.4.2-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument_cext-0.2.4-py3.8-linux-x86_64.egg\u001b[0m\n", "\u001b[34mInvoking script with the following command:\u001b[0m\n", "\u001b[34m/opt/conda/bin/python3.8 mnist-pytorch.py --backend gloo --epochs 1\u001b[0m\n", "\u001b[34m2023-03-19 20:05:01,795 sagemaker-training-toolkit INFO Exceptions not imported for SageMaker TF as Tensorflow is not installed.\u001b[0m\n", "\u001b[35mbash: cannot set terminal process group (-1): Inappropriate ioctl for device\u001b[0m\n", "\u001b[35mbash: no job control in this shell\u001b[0m\n", "\u001b[35m2023-03-19 20:04:59,755 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training\u001b[0m\n", "\u001b[35m2023-03-19 20:04:59,756 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)\u001b[0m\n", "\u001b[35m2023-03-19 20:04:59,758 sagemaker-training-toolkit INFO No Neurons 
detected (normal if no neurons installed)\u001b[0m\n", "\u001b[35m2023-03-19 20:04:59,767 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.\u001b[0m\n", "\u001b[35m2023-03-19 20:04:59,769 sagemaker_pytorch_container.training INFO Invoking user training script.\u001b[0m\n", "\u001b[35m2023-03-19 20:04:59,982 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)\u001b[0m\n", "\u001b[35m2023-03-19 20:04:59,985 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)\u001b[0m\n", "\u001b[35m2023-03-19 20:04:59,996 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)\u001b[0m\n", "\u001b[35m2023-03-19 20:04:59,998 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)\u001b[0m\n", "\u001b[35m2023-03-19 20:05:00,011 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)\u001b[0m\n", "\u001b[35m2023-03-19 20:05:00,013 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)\u001b[0m\n", "\u001b[35m2023-03-19 20:05:00,023 sagemaker-training-toolkit INFO Invoking user script\u001b[0m\n", "\u001b[35mTraining Env:\u001b[0m\n", "\u001b[35m{\n", " \"additional_framework_parameters\": {},\n", " \"channel_input_dirs\": {\n", " \"training\": \"/opt/ml/input/data/training\"\n", " },\n", " \"current_host\": \"algo-2\",\n", " \"current_instance_group\": \"homogeneousCluster\",\n", " \"current_instance_group_hosts\": [\n", " \"algo-1\",\n", " \"algo-2\"\n", " ],\n", " \"current_instance_type\": \"ml.c5.2xlarge\",\n", " \"distribution_hosts\": [],\n", " \"distribution_instance_groups\": [],\n", " \"framework_module\": \"sagemaker_pytorch_container.training:main\",\n", " \"hosts\": [\n", " \"algo-1\",\n", " \"algo-2\"\n", " ],\n", " \"hyperparameters\": {\n", " \"backend\": \"gloo\",\n", " \"epochs\": 1\n", " },\n", " \"input_config_dir\": \"/opt/ml/input/config\",\n", " \"input_data_config\": {\n", " \"training\": {\n", " \"TrainingInputMode\": \"File\",\n", " \"S3DistributionType\": \"FullyReplicated\",\n", " \"RecordWrapperType\": \"None\"\n", " }\n", " },\n", " \"input_dir\": \"/opt/ml/input\",\n", " \"instance_groups\": [\n", " \"homogeneousCluster\"\n", " ],\n", " \"instance_groups_dict\": {\n", " \"homogeneousCluster\": {\n", " \"instance_group_name\": \"homogeneousCluster\",\n", " \"instance_type\": \"ml.c5.2xlarge\",\n", " \"hosts\": [\n", " \"algo-1\",\n", " \"algo-2\"\n", " ]\n", " }\n", " },\n", " \"is_hetero\": false,\n", " \"is_master\": false,\n", " \"is_modelparallel_enabled\": null,\n", " \"is_smddpmprun_installed\": false,\n", " \"job_name\": \"pytorch-training-2023-03-20-00-02-32-142\",\n", " \"log_level\": 20,\n", " \"master_hostname\": \"algo-1\",\n", " \"model_dir\": \"/opt/ml/model\",\n", " \"module_dir\": \"s3://sagemaker-us-east-1-989682844397/pytorch-training-2023-03-20-00-02-32-142/source/sourcedir.tar.gz\",\n", " \"module_name\": \"mnist-pytorch\",\n", " \"network_interface_name\": \"eth0\",\n", " \"num_cpus\": 8,\n", " \"num_gpus\": 0,\n", " \"num_neurons\": 0,\n", " \"output_data_dir\": \"/opt/ml/output/data\",\n", " \"output_dir\": \"/opt/ml/output\",\n", " \"output_intermediate_dir\": \"/opt/ml/output/intermediate\",\n", " \"resource_config\": {\n", " \"current_host\": \"algo-2\",\n", " \"current_instance_type\": \"ml.c5.2xlarge\",\n", " \"current_group_name\": \"homogeneousCluster\",\n", " \"hosts\": [\n", " \"algo-1\",\n", " \"algo-2\"\n", " ],\n", " \"instance_groups\": [\n", " {\n", " 
\"instance_group_name\": \"homogeneousCluster\",\n", " \"instance_type\": \"ml.c5.2xlarge\",\n", " \"hosts\": [\n", " \"algo-1\",\n", " \"algo-2\"\n", " ]\n", " }\n", " ],\n", " \"network_interface_name\": \"eth0\"\n", " },\n", " \"user_entry_point\": \"mnist-pytorch.py\"\u001b[0m\n", "\u001b[35m}\u001b[0m\n", "\u001b[35mEnvironment variables:\u001b[0m\n", "\u001b[35mSM_HOSTS=[\"algo-1\",\"algo-2\"]\u001b[0m\n", "\u001b[35mSM_NETWORK_INTERFACE_NAME=eth0\u001b[0m\n", "\u001b[35mSM_HPS={\"backend\":\"gloo\",\"epochs\":1}\u001b[0m\n", "\u001b[35mSM_USER_ENTRY_POINT=mnist-pytorch.py\u001b[0m\n", "\u001b[35mSM_FRAMEWORK_PARAMS={}\u001b[0m\n", "\u001b[35mSM_RESOURCE_CONFIG={\"current_group_name\":\"homogeneousCluster\",\"current_host\":\"algo-2\",\"current_instance_type\":\"ml.c5.2xlarge\",\"hosts\":[\"algo-1\",\"algo-2\"],\"instance_groups\":[{\"hosts\":[\"algo-1\",\"algo-2\"],\"instance_group_name\":\"homogeneousCluster\",\"instance_type\":\"ml.c5.2xlarge\"}],\"network_interface_name\":\"eth0\"}\u001b[0m\n", "\u001b[35mSM_INPUT_DATA_CONFIG={\"training\":{\"RecordWrapperType\":\"None\",\"S3DistributionType\":\"FullyReplicated\",\"TrainingInputMode\":\"File\"}}\u001b[0m\n", "\u001b[35mSM_OUTPUT_DATA_DIR=/opt/ml/output/data\u001b[0m\n", "\u001b[35mSM_CHANNELS=[\"training\"]\u001b[0m\n", "\u001b[35mSM_CURRENT_HOST=algo-2\u001b[0m\n", "\u001b[35mSM_CURRENT_INSTANCE_TYPE=ml.c5.2xlarge\u001b[0m\n", "\u001b[35mSM_CURRENT_INSTANCE_GROUP=homogeneousCluster\u001b[0m\n", "\u001b[35mSM_CURRENT_INSTANCE_GROUP_HOSTS=[\"algo-1\",\"algo-2\"]\u001b[0m\n", "\u001b[35mSM_INSTANCE_GROUPS=[\"homogeneousCluster\"]\u001b[0m\n", "\u001b[35mSM_INSTANCE_GROUPS_DICT={\"homogeneousCluster\":{\"hosts\":[\"algo-1\",\"algo-2\"],\"instance_group_name\":\"homogeneousCluster\",\"instance_type\":\"ml.c5.2xlarge\"}}\u001b[0m\n", "\u001b[35mSM_DISTRIBUTION_INSTANCE_GROUPS=[]\u001b[0m\n", "\u001b[35mSM_IS_HETERO=false\u001b[0m\n", "\u001b[35mSM_MODULE_NAME=mnist-pytorch\u001b[0m\n", "\u001b[35mSM_LOG_LEVEL=20\u001b[0m\n", "\u001b[35mSM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main\u001b[0m\n", "\u001b[35mSM_INPUT_DIR=/opt/ml/input\u001b[0m\n", "\u001b[35mSM_INPUT_CONFIG_DIR=/opt/ml/input/config\u001b[0m\n", "\u001b[35mSM_OUTPUT_DIR=/opt/ml/output\u001b[0m\n", "\u001b[35mSM_NUM_CPUS=8\u001b[0m\n", "\u001b[35mSM_NUM_GPUS=0\u001b[0m\n", "\u001b[35mSM_NUM_NEURONS=0\u001b[0m\n", "\u001b[35mSM_MODEL_DIR=/opt/ml/model\u001b[0m\n", "\u001b[35mSM_MODULE_DIR=s3://sagemaker-us-east-1-989682844397/pytorch-training-2023-03-20-00-02-32-142/source/sourcedir.tar.gz\u001b[0m\n", 
"\u001b[35mSM_TRAINING_ENV={\"additional_framework_parameters\":{},\"channel_input_dirs\":{\"training\":\"/opt/ml/input/data/training\"},\"current_host\":\"algo-2\",\"current_instance_group\":\"homogeneousCluster\",\"current_instance_group_hosts\":[\"algo-1\",\"algo-2\"],\"current_instance_type\":\"ml.c5.2xlarge\",\"distribution_hosts\":[],\"distribution_instance_groups\":[],\"framework_module\":\"sagemaker_pytorch_container.training:main\",\"hosts\":[\"algo-1\",\"algo-2\"],\"hyperparameters\":{\"backend\":\"gloo\",\"epochs\":1},\"input_config_dir\":\"/opt/ml/input/config\",\"input_data_config\":{\"training\":{\"RecordWrapperType\":\"None\",\"S3DistributionType\":\"FullyReplicated\",\"TrainingInputMode\":\"File\"}},\"input_dir\":\"/opt/ml/input\",\"instance_groups\":[\"homogeneousCluster\"],\"instance_groups_dict\":{\"homogeneousCluster\":{\"hosts\":[\"algo-1\",\"algo-2\"],\"instance_group_name\":\"homogeneousCluster\",\"instance_type\":\"ml.c5.2xlarge\"}},\"is_hetero\":false,\"is_master\":false,\"is_modelparallel_enabled\":null,\"is_smddpmprun_installed\":false,\"job_name\":\"pytorch-training-2023-03-20-00-02-32-142\",\"log_level\":20,\"master_hostname\":\"algo-1\",\"model_dir\":\"/opt/ml/model\",\"module_dir\":\"s3://sagemaker-us-east-1-989682844397/pytorch-training-2023-03-20-00-02-32-142/source/sourcedir.tar.gz\",\"module_name\":\"mnist-pytorch\",\"network_interface_name\":\"eth0\",\"num_cpus\":8,\"num_gpus\":0,\"num_neurons\":0,\"output_data_dir\":\"/opt/ml/output/data\",\"output_dir\":\"/opt/ml/output\",\"output_intermediate_dir\":\"/opt/ml/output/intermediate\",\"resource_config\":{\"current_group_name\":\"homogeneousCluster\",\"current_host\":\"algo-2\",\"current_instance_type\":\"ml.c5.2xlarge\",\"hosts\":[\"algo-1\",\"algo-2\"],\"instance_groups\":[{\"hosts\":[\"algo-1\",\"algo-2\"],\"instance_group_name\":\"homogeneousCluster\",\"instance_type\":\"ml.c5.2xlarge\"}],\"network_interface_name\":\"eth0\"},\"user_entry_point\":\"mnist-pytorch.py\"}\u001b[0m\n", "\u001b[35mSM_USER_ARGS=[\"--backend\",\"gloo\",\"--epochs\",\"1\"]\u001b[0m\n", "\u001b[35mSM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate\u001b[0m\n", "\u001b[35mSM_CHANNEL_TRAINING=/opt/ml/input/data/training\u001b[0m\n", "\u001b[35mSM_HP_BACKEND=gloo\u001b[0m\n", "\u001b[35mSM_HP_EPOCHS=1\u001b[0m\n", "\u001b[35mPYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python38.zip:/opt/conda/lib/python3.8:/opt/conda/lib/python3.8/lib-dynload:/opt/conda/lib/python3.8/site-packages:/opt/conda/lib/python3.8/site-packages/smdebug-1.0.26b20230214-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument-3.4.2-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument_cext-0.2.4-py3.8-linux-x86_64.egg\u001b[0m\n", "\u001b[35mInvoking script with the following command:\u001b[0m\n", "\u001b[35m/opt/conda/bin/python3.8 mnist-pytorch.py --backend gloo --epochs 1\u001b[0m\n", "\u001b[35m2023-03-19 20:05:00,458 sagemaker-training-toolkit INFO Exceptions not imported for SageMaker TF as Tensorflow is not installed.\u001b[0m\n", "\u001b[35mDistributed training - True\u001b[0m\n", "\u001b[35mNumber of gpus available - 0\u001b[0m\n", "\u001b[34mDistributed training - True\u001b[0m\n", "\u001b[34mNumber of gpus available - 0\u001b[0m\n", "\u001b[35mInitialized the distributed environment: 'gloo' backend on 2 nodes. Current host rank is 1. 
Number of gpus: 0\u001b[0m\n", "\u001b[35mGet train data loader\u001b[0m\n", "\u001b[35mGet test data loader\u001b[0m\n", "\u001b[35mProcesses 30000/60000 (50%) of train data\u001b[0m\n", "\u001b[35mProcesses 10000/10000 (100%) of test data\u001b[0m\n", "\u001b[35m[2023-03-19 20:05:03.243 algo-2:47 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None\u001b[0m\n", "\u001b[35m/opt/conda/lib/python3.8/site-packages/smdebug-1.0.26b20230214-py3.8.egg/smdebug/profiler/system_metrics_reader.py:78: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\u001b[0m\n", "\u001b[35m/opt/conda/lib/python3.8/site-packages/smdebug-1.0.26b20230214-py3.8.egg/smdebug/profiler/system_metrics_reader.py:78: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\u001b[0m\n", "\u001b[34mInitialized the distributed environment: 'gloo' backend on 2 nodes. Current host rank is 0. Number of gpus: 0\u001b[0m\n", "\u001b[34mGet train data loader\u001b[0m\n", "\u001b[34mGet test data loader\u001b[0m\n", "\u001b[34mProcesses 30000/60000 (50%) of train data\u001b[0m\n", "\u001b[34mProcesses 10000/10000 (100%) of test data\u001b[0m\n", "\u001b[34m[2023-03-19 20:05:03.238 algo-1:47 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None\u001b[0m\n", "\u001b[34m/opt/conda/lib/python3.8/site-packages/smdebug-1.0.26b20230214-py3.8.egg/smdebug/profiler/system_metrics_reader.py:78: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\u001b[0m\n", "\u001b[34m/opt/conda/lib/python3.8/site-packages/smdebug-1.0.26b20230214-py3.8.egg/smdebug/profiler/system_metrics_reader.py:78: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\u001b[0m\n", "\u001b[34m[2023-03-19 20:05:03.808 algo-1:47 INFO profiler_config_parser.py:111] User has disabled profiler.\u001b[0m\n", "\u001b[34m[2023-03-19 20:05:03.809 algo-1:47 INFO json_config.py:92] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.\u001b[0m\n", "\u001b[34m[2023-03-19 20:05:03.810 algo-1:47 INFO hook.py:206] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.\u001b[0m\n", "\u001b[34m[2023-03-19 20:05:03.810 algo-1:47 INFO hook.py:259] Saving to /opt/ml/output/tensors\u001b[0m\n", "\u001b[34m[2023-03-19 20:05:03.810 algo-1:47 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.\u001b[0m\n", "\u001b[34m/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:181: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead\n", " warnings.warn(\u001b[0m\n", "\u001b[35m[2023-03-19 20:05:03.802 algo-2:47 INFO profiler_config_parser.py:111] User has disabled profiler.\u001b[0m\n", "\u001b[35m[2023-03-19 20:05:03.803 algo-2:47 INFO json_config.py:92] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.\u001b[0m\n", "\u001b[35m[2023-03-19 20:05:03.804 algo-2:47 INFO hook.py:206] tensorboard_dir has not been set for the hook. 
SMDebug will not be exporting tensorboard summaries.\u001b[0m\n", "\u001b[35m[2023-03-19 20:05:03.804 algo-2:47 INFO hook.py:259] Saving to /opt/ml/output/tensors\u001b[0m\n", "\u001b[35m[2023-03-19 20:05:03.804 algo-2:47 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.\u001b[0m\n", "\u001b[35m/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:181: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead\n", " warnings.warn(\u001b[0m\n", "\u001b[34mTrain Epoch: 1 [6400/30000 (21%)] Loss: 2.076445\u001b[0m\n", "\u001b[34mINFO:__main__:Train Epoch: 1 [6400/30000 (21%)] Loss: 2.076445\u001b[0m\n", "\u001b[35mTrain Epoch: 1 [6400/30000 (21%)] Loss: 2.075126\u001b[0m\n", "\u001b[35mINFO:__main__:Train Epoch: 1 [6400/30000 (21%)] Loss: 2.075126\u001b[0m\n", "\u001b[34mTrain Epoch: 1 [12800/30000 (43%)] Loss: 1.056658\u001b[0m\n", "\u001b[34mINFO:__main__:Train Epoch: 1 [12800/30000 (43%)] Loss: 1.056658\u001b[0m\n", "\u001b[35mTrain Epoch: 1 [12800/30000 (43%)] Loss: 1.213364\u001b[0m\n", "\u001b[35mINFO:__main__:Train Epoch: 1 [12800/30000 (43%)] Loss: 1.213364\u001b[0m\n", "\u001b[35mTrain Epoch: 1 [19200/30000 (64%)] Loss: 0.908578\u001b[0m\n", "\u001b[35mINFO:__main__:Train Epoch: 1 [19200/30000 (64%)] Loss: 0.908578\u001b[0m\n", "\u001b[34mTrain Epoch: 1 [19200/30000 (64%)] Loss: 0.944071\u001b[0m\n", "\u001b[34mINFO:__main__:Train Epoch: 1 [19200/30000 (64%)] Loss: 0.944071\u001b[0m\n", "\u001b[35mTrain Epoch: 1 [25600/30000 (85%)] Loss: 0.671730\u001b[0m\n", "\u001b[35mINFO:__main__:Train Epoch: 1 [25600/30000 (85%)] Loss: 0.671730\u001b[0m\n", "\u001b[34mTrain Epoch: 1 [25600/30000 (85%)] Loss: 0.846429\u001b[0m\n", "\u001b[34mINFO:__main__:Train Epoch: 1 [25600/30000 (85%)] Loss: 0.846429\u001b[0m\n", "\u001b[35m/opt/conda/lib/python3.8/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.\n", " warnings.warn(warning.format(ret))\u001b[0m\n", "\u001b[34m/opt/conda/lib/python3.8/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.\n", " warnings.warn(warning.format(ret))\u001b[0m\n", "\u001b[34mTest set: Average loss: 0.3238, Accuracy: 9094/10000 (91%)\u001b[0m\n", "\u001b[34mSaving the model.\u001b[0m\n", "\u001b[34mINFO:__main__:Test set: Average loss: 0.3238, Accuracy: 9094/10000 (91%)\u001b[0m\n", "\u001b[34mINFO:__main__:Saving the model.\u001b[0m\n", "\u001b[35mTest set: Average loss: 0.3238, Accuracy: 9094/10000 (91%)\u001b[0m\n", "\u001b[35mINFO:__main__:Test set: Average loss: 0.3238, Accuracy: 9094/10000 (91%)\u001b[0m\n", "\u001b[35mSaving the model.\u001b[0m\n", "\u001b[35mINFO:__main__:Saving the model.\u001b[0m\n", "\u001b[35m2023-03-19 20:05:13,973 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.\u001b[0m\n", "\u001b[35m2023-03-19 20:05:13,973 sagemaker-training-toolkit INFO Done waiting for a return code. Received 0 from exiting process.\u001b[0m\n", "\u001b[35m2023-03-19 20:05:13,973 sagemaker-training-toolkit INFO Reporting training SUCCESS\u001b[0m\n", "\u001b[34m2023-03-19 20:05:13,977 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.\u001b[0m\n", "\u001b[34m2023-03-19 20:05:13,977 sagemaker-training-toolkit INFO Done waiting for a return code. 
Received 0 from exiting process.\u001b[0m\n", "\u001b[34m2023-03-19 20:05:13,977 sagemaker-training-toolkit INFO Reporting training SUCCESS\u001b[0m\n", "\n", "2023-03-20 00:05:32 Uploading - Uploading generated training model\n", "2023-03-20 00:05:32 Completed - Training job completed\n", "Training seconds: 226\n", "Billable seconds: 226\n" ] } ], "source": [ "estimator.fit({\"training\": inputs})" ] }, { "cell_type": "markdown", "metadata": { "jupyter": { "source_hidden": true }, "tags": [] }, "source": [ "## Host\n", "### Create endpoint\n", "After training, we use the `PyTorch` estimator object to build and deploy a `PyTorchPredictor`. This creates a SageMaker Endpoint -- a hosted prediction service that we can use to perform inference.\n", "\n", "As mentioned above, the `mnist-pytorch.py` script provides the required implementation of `model_fn`. We are going to use the default implementations of `input_fn`, `predict_fn`, `output_fn` and `transform_fn` defined in [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers).\n", "\n", "The arguments to the `deploy` function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances and then deploy the Endpoint to a fleet of CPU-based instances, but you need to make sure that the model is returned or saved as a CPU model, as we did in `mnist-pytorch.py`. Here we will deploy the model to a single `ml.m4.xlarge` instance." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:sagemaker:Creating model with name: pytorch-training-2023-03-20-00-38-25-414\n", "INFO:sagemaker:Creating endpoint-config with name pytorch-training-2023-03-20-00-38-25-414\n", "INFO:sagemaker:Creating endpoint with name pytorch-training-2023-03-20-00-38-25-414\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "----------!" ] } ], "source": [ "predictor = estimator.deploy(initial_instance_count=1, instance_type=\"ml.m4.xlarge\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate\n", "\n", "You can use the test images to evaluate the endpoint. The accuracy of the model depends on how many epochs it was trained for.
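The raw prediction result is one row of 10 scores per image, and the largest score marks the predicted digit (in the output below, the winning class shows up as 0.0 and every other class is negative). As a minimal sketch -- assuming the prediction cell further below has already been run, so that `response` holds the returned scores -- you can read off all of the predicted digits at once:\n", "\n", "```python\n", "import numpy as np\n", "\n", "# 'response' is the 16x10 array of class scores returned by\n", "# predictor.predict(...) in the cell below; the argmax of each\n", "# row is the predicted digit for that image.\n", "predicted_digits = np.array(response).argmax(axis=1)\n", "print(predicted_digits)\n", "```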
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "t10k-images-idx3-ubyte\t train-images-idx3-ubyte\n", "t10k-images-idx3-ubyte.gz train-images-idx3-ubyte.gz\n", "t10k-labels-idx1-ubyte\t train-labels-idx1-ubyte\n", "t10k-labels-idx1-ubyte.gz train-labels-idx1-ubyte.gz\n" ] } ], "source": [ "!ls lab03-pytorch-data/MNIST/raw" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "import gzip\n", "import numpy as np\n", "import random\n", "import os\n", "\n", "data_dir = \"lab03-pytorch-data/MNIST/raw\"\n", "with gzip.open(os.path.join(data_dir, \"t10k-images-idx3-ubyte.gz\"), \"rb\") as f:\n", " images = np.frombuffer(f.read(), np.uint8, offset=16).reshape(-1, 28, 28).astype(np.float32)\n", "\n", "mask = random.sample(range(len(images)), 16) # randomly select some of the test images\n", "mask = np.array(mask, dtype=int)\n", "data = images[mask]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Raw prediction result:\n", "[[ -871.36743164 -971.89575195 -913.92596436 -654.67346191\n", " -97.44143677 -498.67459106 -707.69451904 -301.13406372\n", " -463.44345093 0. ]\n", " [-1198.8449707 -1175.25219727 -936.64807129 -718.03765869\n", " -766.48907471 -1127.92248535 -1482.80273438 0.\n", " -904.03717041 -280.87255859]\n", " [ -915.80487061 0. -370.35418701 -575.47436523\n", " -504.13796997 -668.82928467 -604.71899414 -464.13146973\n", " -392.08587646 -507.47665405]\n", " [ -425.44076538 -467.09265137 -282.55163574 -510.90609741\n", " -115.01202393 -289.24203491 0. -542.6862793\n", " -380.04800415 -338.26568604]\n", " [ 0. -671.24707031 -254.16288757 -130.0802002\n", " -928.06286621 -488.23287964 -750.70385742 -346.58804321\n", " -464.49978638 -632.40795898]\n", " [ -529.78643799 -376.62268066 -423.89981079 0.\n", " -481.95996094 -253.3106842 -584.8782959 -422.38235474\n", " -312.53594971 -359.24209595]\n", " [ 0. -1099.6451416 -592.86077881 -635.79467773\n", " -922.0513916 -532.85437012 -657.04998779 -723.68963623\n", " -604.88562012 -789.17773438]\n", " [ -720.66485596 -632.3706665 0. -422.42520142\n", " -1019.03820801 -886.56982422 -742.4307251 -935.56298828\n", " -387.4772644 -1025.28649902]\n", " [ 0. -1285.52966309 -876.42047119 -741.64910889\n", " -971.92895508 -467.35687256 -624.23175049 -908.52587891\n", " -834.10925293 -886.8447876 ]\n", " [ -748.95330811 -822.61834717 -701.46142578 -555.51751709\n", " -168.34185791 -471.10089111 -707.58148193 -168.32168579\n", " -307.62628174 0. ]\n", " [ -559.73217773 -474.19140625 -206.41279602 0.\n", " -421.03112793 -351.05273438 -638.84527588 -223.26074219\n", " -172.14324951 -162.9990387 ]\n", " [ 0. -1601.81567383 -981.31481934 -969.63879395\n", " -1199.11694336 -654.262146 -784.73376465 -1118.02600098\n", " -931.71972656 -1115.87390137]\n", " [ -932.12322998 0. -437.6164856 -589.36120605\n", " -617.91271973 -729.02789307 -669.26696777 -591.15795898\n", " -419.69714355 -658.73071289]\n", " [ -571.80773926 -624.76086426 -517.0725708 -637.00036621\n", " 0. -373.50500488 -259.54901123 -495.45053101\n", " -440.54824829 -204.71725464]\n", " [-1023.34185791 -774.77142334 -268.0958252 -434.04608154\n", " -917.65264893 -1111.8729248 -1385.52514648 0.\n", " -608.21112061 -387.69573975]\n", " [ -681.6385498 -458.6541748 0. 
-157.78533936\n", " -1009.4597168 -624.41400146 -771.04199219 -605.02685547\n", " -222.2220459 -798.47497559]]\n", "\n", "Labeled predictions: \n", "[(0, -871.367431640625), (1, -971.895751953125), (2, -913.9259643554688), (3, -654.6734619140625), (4, -97.44143676757812), (5, -498.6745910644531), (6, -707.6945190429688), (7, -301.1340637207031), (8, -463.4434509277344), (9, 0.0)]\n", "\n", "Most likely answer: (9, 0.0)\n" ] } ], "source": [ "response = predictor.predict(np.expand_dims(data, axis=1))\n", "print(\"Raw prediction result:\")\n", "print(response)\n", "print()\n", "\n", "labeled_predictions = list(zip(range(10), response[0]))\n", "print(\"Labeled predictions: \")\n", "print(labeled_predictions)\n", "print()\n", "\n", "# sort by score, highest first, so the most likely digit comes out on top\n", "labeled_predictions.sort(key=lambda label_and_score: label_and_score[1], reverse=True)\n", "print(\"Most likely answer: {}\".format(labeled_predictions[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleanup\n", "\n", "After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "sagemaker_session.delete_endpoint(endpoint_name=predictor.endpoint_name)" ] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 32, "name": "ml.m5d.2xlarge", "vcpuNum": 8 }, {
"_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 21, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 144, "name": "ml.c5.18xlarge", "vcpuNum": 72 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 28, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 29, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 32, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "memoryGiB": 192, "name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 34, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 35, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { 
"_defaultOrder": 38, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 42, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 46, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 47, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "memoryGiB": 384, "name": "ml.g5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 53, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 } ], "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" }, "notice": "Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." }, "nbformat": 4, "nbformat_minor": 4 }