{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Training a TensorFlow Model on MNIST\n", "\n", "MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of handwritten digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). This tutorial shows how to train a TensorFlow V2 model on MNIST on SageMaker.\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "import os\n", "import json\n", "\n", "import sagemaker\n", "from sagemaker.tensorflow import TensorFlow\n", "from sagemaker import get_execution_role\n", "\n", "sess = sagemaker.Session()\n", "\n", "role = get_execution_role()\n", "\n", "output_path = \"s3://\" + sess.default_bucket() + \"/tensorflow/mnist\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TensorFlow Estimator\n", "\n", "The `TensorFlow` class allows you to run your training script on SageMaker\n", "infrastructure in a containerized environment. In this notebook, we\n", "refer to this container as the *training container*. \n", "\n", "You need to configure\n", "it with the following parameters to set up the environment:\n", "\n", "- entry_point: A user-defined Python file to be used by the training container as the \n", "instructions for training. We discuss this file further in the next section\n", "\n", "- role: An IAM role to make AWS service requests\n", "\n", "- instance_type: The type of SageMaker instance to run your training script. \n", "Set it to `local` if you want to run the training job on \n", "the SageMaker instance you are using to run this notebook\n", "\n", "- model_dir: S3 bucket URI where the checkpoint data and models can be exported to during training (default: None). 
\n", "To disable having model_dir passed to your training script, set `model_dir=False`\n", "\n", "- instance_count: The number of instances you need to run your training job. \n", "Multiple instances are needed for distributed training\n", "\n", "- output_path: \n", "S3 bucket URI to save training output (model artifacts and output files)\n", "\n", "- framework_version: The version of TensorFlow you need to use\n", "\n", "- py_version: The Python version you need to use\n", "\n", "For more information, see [the API reference](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Implement the entry point for training\n", "\n", "The entry point for training is a Python script that provides all \n", "the code for training a TensorFlow model. It is used by the SageMaker \n", "TensorFlow Estimator (`TensorFlow` class above) as the entry point for running the training job.\n", "\n", "Under the hood, the SageMaker TensorFlow Estimator downloads a Docker image\n", "with the runtime environment \n", "specified by the parameters you used to initialize the\n", "estimator class, and it injects the training script into the \n", "Docker image to be used as the entry point to run the container.\n", "\n", "In the rest of the notebook, we use *training image* to refer to the \n", "Docker image specified by the TensorFlow Estimator and *training container*\n", "to refer to the container that runs the training image. \n", "\n", "This means your training script is very similar to a training script\n", "you might run outside Amazon SageMaker, but it can access the useful environment \n", "variables provided by the training image. 
Check out [the short list of environment variables provided by the SageMaker service](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html?highlight=entry%20point) to see some common environment \n", "variables you might use. Check out [the complete list of environment variables](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md) for a complete \n", "description of all environment variables your training script\n", "can access. \n", "\n", "In this example, we use the training script `code/train.py`\n", "as the entry point for our TensorFlow Estimator. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36m__future__\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m print_function\n", "\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36margparse\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mgzip\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mjson\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mlogging\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mos\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mtraceback\u001b[39;49;00m\n", "\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mnumpy\u001b[39;49;00m \u001b[34mas\u001b[39;49;00m \u001b[04m\u001b[36mnp\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mtensorflow\u001b[39;49;00m \u001b[34mas\u001b[39;49;00m \u001b[04m\u001b[36mtf\u001b[39;49;00m\n", "\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36mtensorflow\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mkeras\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m Model\n", "\u001b[34mfrom\u001b[39;49;00m 
\u001b[04m\u001b[36mtensorflow\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mkeras\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mlayers\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m Conv2D, Dense, Flatten\n", "\n", "logging.basicConfig(level=logging.DEBUG)\n", "\n", "\u001b[37m# Define the model object\u001b[39;49;00m\n", "\n", "\n", "\u001b[34mclass\u001b[39;49;00m \u001b[04m\u001b[32mSmallConv\u001b[39;49;00m(Model):\n", " \u001b[34mdef\u001b[39;49;00m \u001b[32m__init__\u001b[39;49;00m(\u001b[36mself\u001b[39;49;00m):\n", " \u001b[36msuper\u001b[39;49;00m(SmallConv, \u001b[36mself\u001b[39;49;00m).\u001b[32m__init__\u001b[39;49;00m()\n", " \u001b[36mself\u001b[39;49;00m.conv1 = Conv2D(\u001b[34m32\u001b[39;49;00m, \u001b[34m3\u001b[39;49;00m, activation=\u001b[33m\"\u001b[39;49;00m\u001b[33mrelu\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", " \u001b[36mself\u001b[39;49;00m.flatten = Flatten()\n", " \u001b[36mself\u001b[39;49;00m.d1 = Dense(\u001b[34m128\u001b[39;49;00m, activation=\u001b[33m\"\u001b[39;49;00m\u001b[33mrelu\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", " \u001b[36mself\u001b[39;49;00m.d2 = Dense(\u001b[34m10\u001b[39;49;00m)\n", "\n", " \u001b[34mdef\u001b[39;49;00m \u001b[32mcall\u001b[39;49;00m(\u001b[36mself\u001b[39;49;00m, x):\n", " x = \u001b[36mself\u001b[39;49;00m.conv1(x)\n", " x = \u001b[36mself\u001b[39;49;00m.flatten(x)\n", " x = \u001b[36mself\u001b[39;49;00m.d1(x)\n", " \u001b[34mreturn\u001b[39;49;00m \u001b[36mself\u001b[39;49;00m.d2(x)\n", "\n", "\n", "\u001b[37m# Decode and preprocess data\u001b[39;49;00m\n", "\u001b[34mdef\u001b[39;49;00m \u001b[32mconvert_to_numpy\u001b[39;49;00m(data_dir, images_file, labels_file):\n", " \u001b[33m\"\"\"Byte string to numpy arrays\"\"\"\u001b[39;49;00m\n", " \u001b[34mwith\u001b[39;49;00m gzip.open(os.path.join(data_dir, images_file), \u001b[33m\"\u001b[39;49;00m\u001b[33mrb\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m) 
\u001b[34mas\u001b[39;49;00m f:\n", " images = np.frombuffer(f.read(), np.uint8, offset=\u001b[34m16\u001b[39;49;00m).reshape(-\u001b[34m1\u001b[39;49;00m, \u001b[34m28\u001b[39;49;00m, \u001b[34m28\u001b[39;49;00m)\n", "\n", " \u001b[34mwith\u001b[39;49;00m gzip.open(os.path.join(data_dir, labels_file), \u001b[33m\"\u001b[39;49;00m\u001b[33mrb\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m) \u001b[34mas\u001b[39;49;00m f:\n", " labels = np.frombuffer(f.read(), np.uint8, offset=\u001b[34m8\u001b[39;49;00m)\n", "\n", " \u001b[34mreturn\u001b[39;49;00m (images, labels)\n", "\n", "\n", "\u001b[34mdef\u001b[39;49;00m \u001b[32mmnist_to_numpy\u001b[39;49;00m(data_dir, train):\n", " \u001b[33m\"\"\"Load raw MNIST data into numpy array\u001b[39;49;00m\n", "\u001b[33m\u001b[39;49;00m\n", "\u001b[33m Args:\u001b[39;49;00m\n", "\u001b[33m data_dir (str): directory of MNIST raw data.\u001b[39;49;00m\n", "\u001b[33m This argument can be accessed via SM_CHANNEL_TRAINING\u001b[39;49;00m\n", "\u001b[33m\u001b[39;49;00m\n", "\u001b[33m train (bool): use training data\u001b[39;49;00m\n", "\u001b[33m\u001b[39;49;00m\n", "\u001b[33m Returns:\u001b[39;49;00m\n", "\u001b[33m tuple of images and labels as numpy array\u001b[39;49;00m\n", "\u001b[33m \"\"\"\u001b[39;49;00m\n", "\n", " \u001b[34mif\u001b[39;49;00m train:\n", " images_file = \u001b[33m\"\u001b[39;49;00m\u001b[33mtrain-images-idx3-ubyte.gz\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", " labels_file = \u001b[33m\"\u001b[39;49;00m\u001b[33mtrain-labels-idx1-ubyte.gz\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", " \u001b[34melse\u001b[39;49;00m:\n", " images_file = \u001b[33m\"\u001b[39;49;00m\u001b[33mt10k-images-idx3-ubyte.gz\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", " labels_file = \u001b[33m\"\u001b[39;49;00m\u001b[33mt10k-labels-idx1-ubyte.gz\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", "\n", " \u001b[34mreturn\u001b[39;49;00m convert_to_numpy(data_dir, images_file, labels_file)\n", "\n", "\n", 
"\u001b[34mdef\u001b[39;49;00m \u001b[32mnormalize\u001b[39;49;00m(x, axis):\n", " eps = np.finfo(\u001b[36mfloat\u001b[39;49;00m).eps\n", "\n", " mean = np.mean(x, axis=axis, keepdims=\u001b[34mTrue\u001b[39;49;00m)\n", " \u001b[37m# avoid division by zero\u001b[39;49;00m\n", " std = np.std(x, axis=axis, keepdims=\u001b[34mTrue\u001b[39;49;00m) + eps\n", " \u001b[34mreturn\u001b[39;49;00m (x - mean) / std\n", "\n", "\n", "\u001b[37m# Training logic\u001b[39;49;00m\n", "\n", "\n", "\u001b[34mdef\u001b[39;49;00m \u001b[32mtrain\u001b[39;49;00m(args):\n", " \u001b[37m# create data loader from the train / test channels\u001b[39;49;00m\n", " x_train, y_train = mnist_to_numpy(data_dir=args.train, train=\u001b[34mTrue\u001b[39;49;00m)\n", " x_test, y_test = mnist_to_numpy(data_dir=args.test, train=\u001b[34mFalse\u001b[39;49;00m)\n", "\n", " x_train, x_test = x_train.astype(np.float32), x_test.astype(np.float32)\n", "\n", " \u001b[37m# normalize the inputs to mean 0 and std 1\u001b[39;49;00m\n", " x_train, x_test = normalize(x_train, (\u001b[34m1\u001b[39;49;00m, \u001b[34m2\u001b[39;49;00m)), normalize(x_test, (\u001b[34m1\u001b[39;49;00m, \u001b[34m2\u001b[39;49;00m))\n", "\n", " \u001b[37m# expand channel axis\u001b[39;49;00m\n", " \u001b[37m# tf uses depth minor convention\u001b[39;49;00m\n", " x_train, x_test = np.expand_dims(x_train, axis=\u001b[34m3\u001b[39;49;00m), np.expand_dims(x_test, axis=\u001b[34m3\u001b[39;49;00m)\n", "\n", " \u001b[37m# normalize the data to mean 0 and std 1\u001b[39;49;00m\n", " train_loader = (\n", " tf.data.Dataset.from_tensor_slices((x_train, y_train))\n", " .shuffle(\u001b[36mlen\u001b[39;49;00m(x_train))\n", " .batch(args.batch_size)\n", " )\n", "\n", " test_loader = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(args.batch_size)\n", "\n", " model = SmallConv()\n", " model.compile()\n", " loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=\u001b[34mTrue\u001b[39;49;00m)\n", " optimizer = 
tf.keras.optimizers.Adam(\n", " learning_rate=args.learning_rate, beta_1=args.beta_1, beta_2=args.beta_2\n", " )\n", "\n", " train_loss = tf.keras.metrics.Mean(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mtrain_loss\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", " train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mtrain_accuracy\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", "\n", " test_loss = tf.keras.metrics.Mean(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mtest_loss\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", " test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mtest_accuracy\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", "\n", " \u001b[90m@tf\u001b[39;49;00m.function\n", " \u001b[34mdef\u001b[39;49;00m \u001b[32mtrain_step\u001b[39;49;00m(images, labels):\n", " \u001b[34mwith\u001b[39;49;00m tf.GradientTape() \u001b[34mas\u001b[39;49;00m tape:\n", " predictions = model(images, training=\u001b[34mTrue\u001b[39;49;00m)\n", " loss = loss_fn(labels, predictions)\n", " grad = tape.gradient(loss, model.trainable_variables)\n", " optimizer.apply_gradients(\u001b[36mzip\u001b[39;49;00m(grad, model.trainable_variables))\n", "\n", " train_loss(loss)\n", " train_accuracy(labels, predictions)\n", " \u001b[34mreturn\u001b[39;49;00m\n", "\n", " \u001b[90m@tf\u001b[39;49;00m.function\n", " \u001b[34mdef\u001b[39;49;00m \u001b[32mtest_step\u001b[39;49;00m(images, labels):\n", " predictions = model(images, training=\u001b[34mFalse\u001b[39;49;00m)\n", " t_loss = loss_fn(labels, predictions)\n", " test_loss(t_loss)\n", " test_accuracy(labels, predictions)\n", " \u001b[34mreturn\u001b[39;49;00m\n", "\n", " \u001b[36mprint\u001b[39;49;00m(\u001b[33m\"\u001b[39;49;00m\u001b[33mTraining starts ...\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", " \u001b[34mfor\u001b[39;49;00m epoch \u001b[35min\u001b[39;49;00m \u001b[36mrange\u001b[39;49;00m(args.epochs):\n", " 
train_loss.reset_states()\n", " train_accuracy.reset_states()\n", " test_loss.reset_states()\n", " test_accuracy.reset_states()\n", "\n", " \u001b[34mfor\u001b[39;49;00m batch, (images, labels) \u001b[35min\u001b[39;49;00m \u001b[36menumerate\u001b[39;49;00m(train_loader):\n", " train_step(images, labels)\n", "\n", " \u001b[34mfor\u001b[39;49;00m images, labels \u001b[35min\u001b[39;49;00m test_loader:\n", " test_step(images, labels)\n", "\n", " \u001b[36mprint\u001b[39;49;00m(\n", " \u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33mEpoch \u001b[39;49;00m\u001b[33m{\u001b[39;49;00mepoch + \u001b[34m1\u001b[39;49;00m\u001b[33m}\u001b[39;49;00m\u001b[33m, \u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", " \u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33mLoss: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00mtrain_loss.result()\u001b[33m}\u001b[39;49;00m\u001b[33m, \u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", " \u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33mAccuracy: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00mtrain_accuracy.result() * \u001b[34m100\u001b[39;49;00m\u001b[33m}\u001b[39;49;00m\u001b[33m, \u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", " \u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33mTest Loss: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00mtest_loss.result()\u001b[33m}\u001b[39;49;00m\u001b[33m, \u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", " \u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33mTest Accuracy: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00mtest_accuracy.result() * \u001b[34m100\u001b[39;49;00m\u001b[33m}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", " )\n", "\n", " \u001b[37m# Save the model\u001b[39;49;00m\n", " \u001b[37m# A version number is needed for the serving container\u001b[39;49;00m\n", " \u001b[37m# to load the model\u001b[39;49;00m\n", " version = \u001b[33m\"\u001b[39;49;00m\u001b[33m00000000\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", " ckpt_dir = 
os.path.join(args.model_dir, version)\n", " \u001b[34mif\u001b[39;49;00m \u001b[35mnot\u001b[39;49;00m os.path.exists(ckpt_dir):\n", " os.makedirs(ckpt_dir)\n", " model.save(ckpt_dir)\n", " \u001b[34mreturn\u001b[39;49;00m\n", "\n", "\n", "\u001b[34mdef\u001b[39;49;00m \u001b[32mparse_args\u001b[39;49;00m():\n", " parser = argparse.ArgumentParser()\n", "\n", " parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--batch-size\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mint\u001b[39;49;00m, default=\u001b[34m32\u001b[39;49;00m)\n", " parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--epochs\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mint\u001b[39;49;00m, default=\u001b[34m1\u001b[39;49;00m)\n", " parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--learning-rate\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mfloat\u001b[39;49;00m, default=\u001b[34m1e-3\u001b[39;49;00m)\n", " parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--beta_1\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mfloat\u001b[39;49;00m, default=\u001b[34m0.9\u001b[39;49;00m)\n", " parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--beta_2\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mfloat\u001b[39;49;00m, default=\u001b[34m0.999\u001b[39;49;00m)\n", "\n", " \u001b[37m# Environment variables given by the training image\u001b[39;49;00m\n", " parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--model-dir\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mstr\u001b[39;49;00m, default=os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_MODEL_DIR\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m])\n", " parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--train\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, 
\u001b[36mtype\u001b[39;49;00m=\u001b[36mstr\u001b[39;49;00m, default=os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_CHANNEL_TRAINING\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m])\n", "    parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--test\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mstr\u001b[39;49;00m, default=os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_CHANNEL_TESTING\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m])\n", "\n", "    parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--current-host\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mstr\u001b[39;49;00m, default=os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_CURRENT_HOST\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m])\n", "    parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--hosts\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mlist\u001b[39;49;00m, default=json.loads(os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_HOSTS\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m]))\n", "\n", "    \u001b[34mreturn\u001b[39;49;00m parser.parse_args()\n", "\n", "\n", "\u001b[34mif\u001b[39;49;00m \u001b[31m__name__\u001b[39;49;00m == \u001b[33m\"\u001b[39;49;00m\u001b[33m__main__\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m:\n", "    args = parse_args()\n", "    train(args)\n" ] } ], "source": [ "!pygmentize 'code/train.py'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set hyperparameters\n", "\n", "In addition, the TensorFlow estimator allows you to pass command-line arguments\n", "to your training script via `hyperparameters`.\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# set local_mode to be True if you want to run the training script\n", "# on the machine that runs this notebook\n", "\n", "local_mode = False\n", "\n", "if local_mode:\n", "    instance_type = \"local\"\n", "else:\n", "    instance_type = 
\"ml.c4.xlarge\"\n", "\n", "est = TensorFlow(\n", "    entry_point=\"train.py\",\n", "    source_dir=\"code\",  # directory of your training script\n", "    role=role,\n", "    framework_version=\"2.3.1\",\n", "    model_dir=False,  # don't pass --model_dir to your training script\n", "    py_version=\"py37\",\n", "    instance_type=instance_type,\n", "    instance_count=1,\n", "    volume_size=250,\n", "    output_path=output_path,\n", "    hyperparameters={\n", "        \"batch-size\": 512,\n", "        \"epochs\": 1,\n", "        \"learning-rate\": 1e-3,\n", "        \"beta_1\": 0.9,\n", "        \"beta_2\": 0.999,\n", "    },\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The training container executes your training script with a command like\n", "\n", "```\n", "python train.py --batch-size 512 --epochs 1 --learning-rate 0.001\n", "    --beta_1 0.9 --beta_2 0.999\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up channels for training and testing data\n", "\n", "You need to tell the `TensorFlow` estimator where to find your training and \n", "testing data. The location can be an S3 URI or, if you use local mode, a path\n", "in your local file system. In this example,\n", "we download the MNIST data from a public S3 bucket and upload it \n", "to your default bucket. 
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import logging\n", "import boto3\n", "from botocore.exceptions import ClientError\n", "\n", "# Download training and testing data from a public S3 bucket\n", "\n", "\n", "def download_from_s3(data_dir=\"/tmp/data\", train=True):\n", "    \"\"\"Download the raw MNIST dataset files from a public S3 bucket\n", "\n", "    Args:\n", "        data_dir (str): directory to save the data\n", "        train (bool): download training set\n", "\n", "    Returns:\n", "        None\n", "    \"\"\"\n", "\n", "    if not os.path.exists(data_dir):\n", "        os.makedirs(data_dir)\n", "\n", "    if train:\n", "        images_file = \"train-images-idx3-ubyte.gz\"\n", "        labels_file = \"train-labels-idx1-ubyte.gz\"\n", "    else:\n", "        images_file = \"t10k-images-idx3-ubyte.gz\"\n", "        labels_file = \"t10k-labels-idx1-ubyte.gz\"\n", "\n", "    # download objects that are not already present locally\n", "    s3 = boto3.client(\"s3\")\n", "    bucket = \"sagemaker-sample-files\"\n", "    for obj in [images_file, labels_file]:\n", "        # S3 keys always use forward slashes, regardless of the local OS\n", "        key = f\"datasets/image/MNIST/{obj}\"\n", "        dest = os.path.join(data_dir, obj)\n", "        if not os.path.exists(dest):\n", "            s3.download_file(bucket, key, dest)\n", "    return\n", "\n", "\n", "download_from_s3(\"/tmp/data\", True)\n", "download_from_s3(\"/tmp/data\", False)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# upload to the default bucket\n", "\n", "prefix = \"mnist\"\n", "bucket = sess.default_bucket()\n", "loc = sess.upload_data(path=\"/tmp/data\", bucket=bucket, key_prefix=prefix)\n", "\n", "channels = {\"training\": loc, \"testing\": loc}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The keys of the dictionary `channels` are passed to the training container,\n", "which creates an environment variable `SM_CHANNEL_{channel_name}` for each key, with the key in upper case. \n", "\n", "In this example, `SM_CHANNEL_TRAINING` and `SM_CHANNEL_TESTING` are created in the training container (check out \n", "how `code/train.py` accesses these variables). 
For more information,\n", "see: [SM_CHANNEL_{channel_name}](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md#sm_channel_channel_name)\n", "\n", "If you want, you can create a channel for validation:\n", "```\n", "channels = {\n", "    'training': train_data_loc,\n", "    'validation': val_data_loc,\n", "    'testing': test_data_loc\n", "}\n", "```\n", "You can then access this channel within your training script via\n", "`SM_CHANNEL_VALIDATION`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run the training script on SageMaker\n", "Now, the training container has everything it needs to execute your training\n", "script. You can start the container by calling the `fit` method." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-06-06 03:53:29 Starting - Starting the training job...\n", "2022-06-06 03:53:44 Starting - Preparing the instances for trainingProfilerReport-1654487609: InProgress\n", ".........\n", "2022-06-06 03:55:28 Downloading - Downloading input data\n", "2022-06-06 03:55:28 Training - Downloading the training image......\n", "2022-06-06 03:56:14 Training - Training image download completed. Training in progress.\u001b[34m2022-06-06 03:56:16.997780: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\u001b[0m\n", "\u001b[34m2022-06-06 03:56:17.008508: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. 
The timeline writer thread will not be started, future recorded events will be dropped.\u001b[0m\n", "\u001b[34m2022-06-06 03:56:17.396332: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\u001b[0m\n", "\u001b[34m2022-06-06 03:56:21,653 sagemaker-training-toolkit INFO Imported framework sagemaker_tensorflow_container.training\u001b[0m\n", "\u001b[34m2022-06-06 03:56:21,659 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)\u001b[0m\n", "\u001b[34m2022-06-06 03:56:22,301 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)\u001b[0m\n", "\u001b[34m2022-06-06 03:56:22,320 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)\u001b[0m\n", "\u001b[34m2022-06-06 03:56:22,338 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)\u001b[0m\n", "\u001b[34m2022-06-06 03:56:22,360 sagemaker-training-toolkit INFO Invoking user script\u001b[0m\n", "\u001b[34mTraining Env:\u001b[0m\n", "\u001b[34m{\n", " \"additional_framework_parameters\": {},\n", " \"channel_input_dirs\": {\n", " \"testing\": \"/opt/ml/input/data/testing\",\n", " \"training\": \"/opt/ml/input/data/training\"\n", " },\n", " \"current_host\": \"algo-1\",\n", " \"framework_module\": \"sagemaker_tensorflow_container.training:main\",\n", " \"hosts\": [\n", " \"algo-1\"\n", " ],\n", " \"hyperparameters\": {\n", " \"batch-size\": 512,\n", " \"beta_1\": 0.9,\n", " \"beta_2\": 0.999,\n", " \"epochs\": 1,\n", " \"learning-rate\": 0.001\n", " },\n", " \"input_config_dir\": \"/opt/ml/input/config\",\n", " \"input_data_config\": {\n", " \"testing\": {\n", " \"TrainingInputMode\": \"File\",\n", " \"S3DistributionType\": \"FullyReplicated\",\n", " \"RecordWrapperType\": \"None\"\n", " },\n", " \"training\": {\n", " \"TrainingInputMode\": \"File\",\n", " \"S3DistributionType\": \"FullyReplicated\",\n", " \"RecordWrapperType\": \"None\"\n", " }\n", " },\n", " \"input_dir\": 
\"/opt/ml/input\",\n", " \"is_master\": true,\n", " \"job_name\": \"tensorflow-training-2022-06-06-03-53-29-103\",\n", " \"log_level\": 20,\n", " \"master_hostname\": \"algo-1\",\n", " \"model_dir\": \"/opt/ml/model\",\n", " \"module_dir\": \"s3://sagemaker-ap-southeast-2-431579215499/tensorflow-training-2022-06-06-03-53-29-103/source/sourcedir.tar.gz\",\n", " \"module_name\": \"train\",\n", " \"network_interface_name\": \"eth0\",\n", " \"num_cpus\": 4,\n", " \"num_gpus\": 0,\n", " \"output_data_dir\": \"/opt/ml/output/data\",\n", " \"output_dir\": \"/opt/ml/output\",\n", " \"output_intermediate_dir\": \"/opt/ml/output/intermediate\",\n", " \"resource_config\": {\n", " \"current_host\": \"algo-1\",\n", " \"current_instance_type\": \"ml.c4.xlarge\",\n", " \"current_group_name\": \"homogeneousCluster\",\n", " \"hosts\": [\n", " \"algo-1\"\n", " ],\n", " \"instance_groups\": [\n", " {\n", " \"instance_group_name\": \"homogeneousCluster\",\n", " \"instance_type\": \"ml.c4.xlarge\",\n", " \"hosts\": [\n", " \"algo-1\"\n", " ]\n", " }\n", " ],\n", " \"network_interface_name\": \"eth0\"\n", " },\n", " \"user_entry_point\": \"train.py\"\u001b[0m\n", "\u001b[34m}\u001b[0m\n", "\u001b[34mEnvironment variables:\u001b[0m\n", "\u001b[34mSM_HOSTS=[\"algo-1\"]\u001b[0m\n", "\u001b[34mSM_NETWORK_INTERFACE_NAME=eth0\u001b[0m\n", "\u001b[34mSM_HPS={\"batch-size\":512,\"beta_1\":0.9,\"beta_2\":0.999,\"epochs\":1,\"learning-rate\":0.001}\u001b[0m\n", "\u001b[34mSM_USER_ENTRY_POINT=train.py\u001b[0m\n", "\u001b[34mSM_FRAMEWORK_PARAMS={}\u001b[0m\n", "\u001b[34mSM_RESOURCE_CONFIG={\"current_group_name\":\"homogeneousCluster\",\"current_host\":\"algo-1\",\"current_instance_type\":\"ml.c4.xlarge\",\"hosts\":[\"algo-1\"],\"instance_groups\":[{\"hosts\":[\"algo-1\"],\"instance_group_name\":\"homogeneousCluster\",\"instance_type\":\"ml.c4.xlarge\"}],\"network_interface_name\":\"eth0\"}\u001b[0m\n", 
"\u001b[34mSM_INPUT_DATA_CONFIG={\"testing\":{\"RecordWrapperType\":\"None\",\"S3DistributionType\":\"FullyReplicated\",\"TrainingInputMode\":\"File\"},\"training\":{\"RecordWrapperType\":\"None\",\"S3DistributionType\":\"FullyReplicated\",\"TrainingInputMode\":\"File\"}}\u001b[0m\n", "\u001b[34mSM_OUTPUT_DATA_DIR=/opt/ml/output/data\u001b[0m\n", "\u001b[34mSM_CHANNELS=[\"testing\",\"training\"]\u001b[0m\n", "\u001b[34mSM_CURRENT_HOST=algo-1\u001b[0m\n", "\u001b[34mSM_MODULE_NAME=train\u001b[0m\n", "\u001b[34mSM_LOG_LEVEL=20\u001b[0m\n", "\u001b[34mSM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main\u001b[0m\n", "\u001b[34mSM_INPUT_DIR=/opt/ml/input\u001b[0m\n", "\u001b[34mSM_INPUT_CONFIG_DIR=/opt/ml/input/config\u001b[0m\n", "\u001b[34mSM_OUTPUT_DIR=/opt/ml/output\u001b[0m\n", "\u001b[34mSM_NUM_CPUS=4\u001b[0m\n", "\u001b[34mSM_NUM_GPUS=0\u001b[0m\n", "\u001b[34mSM_MODEL_DIR=/opt/ml/model\u001b[0m\n", "\u001b[34mSM_MODULE_DIR=s3://sagemaker-ap-southeast-2-431579215499/tensorflow-training-2022-06-06-03-53-29-103/source/sourcedir.tar.gz\u001b[0m\n", 
"\u001b[34mSM_TRAINING_ENV={\"additional_framework_parameters\":{},\"channel_input_dirs\":{\"testing\":\"/opt/ml/input/data/testing\",\"training\":\"/opt/ml/input/data/training\"},\"current_host\":\"algo-1\",\"framework_module\":\"sagemaker_tensorflow_container.training:main\",\"hosts\":[\"algo-1\"],\"hyperparameters\":{\"batch-size\":512,\"beta_1\":0.9,\"beta_2\":0.999,\"epochs\":1,\"learning-rate\":0.001},\"input_config_dir\":\"/opt/ml/input/config\",\"input_data_config\":{\"testing\":{\"RecordWrapperType\":\"None\",\"S3DistributionType\":\"FullyReplicated\",\"TrainingInputMode\":\"File\"},\"training\":{\"RecordWrapperType\":\"None\",\"S3DistributionType\":\"FullyReplicated\",\"TrainingInputMode\":\"File\"}},\"input_dir\":\"/opt/ml/input\",\"is_master\":true,\"job_name\":\"tensorflow-training-2022-06-06-03-53-29-103\",\"log_level\":20,\"master_hostname\":\"algo-1\",\"model_dir\":\"/opt/ml/model\",\"module_dir\":\"s3://sagemaker-ap-southeast-2-431579215499/tensorflow-training-2022-06-06-03-53-29-103/source/sourcedir.tar.gz\",\"module_name\":\"train\",\"network_interface_name\":\"eth0\",\"num_cpus\":4,\"num_gpus\":0,\"output_data_dir\":\"/opt/ml/output/data\",\"output_dir\":\"/opt/ml/output\",\"output_intermediate_dir\":\"/opt/ml/output/intermediate\",\"resource_config\":{\"current_group_name\":\"homogeneousCluster\",\"current_host\":\"algo-1\",\"current_instance_type\":\"ml.c4.xlarge\",\"hosts\":[\"algo-1\"],\"instance_groups\":[{\"hosts\":[\"algo-1\"],\"instance_group_name\":\"homogeneousCluster\",\"instance_type\":\"ml.c4.xlarge\"}],\"network_interface_name\":\"eth0\"},\"user_entry_point\":\"train.py\"}\u001b[0m\n", "\u001b[34mSM_USER_ARGS=[\"--batch-size\",\"512\",\"--beta_1\",\"0.9\",\"--beta_2\",\"0.999\",\"--epochs\",\"1\",\"--learning-rate\",\"0.001\"]\u001b[0m\n", "\u001b[34mSM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate\u001b[0m\n", "\u001b[34mSM_CHANNEL_TESTING=/opt/ml/input/data/testing\u001b[0m\n", 
"\u001b[34mSM_CHANNEL_TRAINING=/opt/ml/input/data/training\u001b[0m\n", "\u001b[34mSM_HP_BATCH-SIZE=512\u001b[0m\n", "\u001b[34mSM_HP_BETA_1=0.9\u001b[0m\n", "\u001b[34mSM_HP_BETA_2=0.999\u001b[0m\n", "\u001b[34mSM_HP_EPOCHS=1\u001b[0m\n", "\u001b[34mSM_HP_LEARNING-RATE=0.001\u001b[0m\n", "\u001b[34mPYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/local/lib/python37.zip:/usr/local/lib/python3.7:/usr/local/lib/python3.7/lib-dynload:/usr/local/lib/python3.7/site-packages\u001b[0m\n", "\u001b[34mInvoking script with the following command:\u001b[0m\n", "\u001b[34m/usr/local/bin/python3.7 train.py --batch-size 512 --beta_1 0.9 --beta_2 0.999 --epochs 1 --learning-rate 0.001\u001b[0m\n", "\u001b[34mTraining starts ...\u001b[0m\n", "\u001b[34m[2022-06-06 03:56:26.662 ip-10-0-168-116.ap-southeast-2.compute.internal:26 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None\u001b[0m\n", "\u001b[34m[2022-06-06 03:56:26.976 ip-10-0-168-116.ap-southeast-2.compute.internal:26 INFO profiler_config_parser.py:102] User has disabled profiler.\u001b[0m\n", "\u001b[34m[2022-06-06 03:56:26.978 ip-10-0-168-116.ap-southeast-2.compute.internal:26 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.\u001b[0m\n", "\u001b[34m[2022-06-06 03:56:26.978 ip-10-0-168-116.ap-southeast-2.compute.internal:26 INFO hook.py:199] tensorboard_dir has not been set for the hook. 
SMDebug will not be exporting tensorboard summaries.\u001b[0m\n", "\u001b[34m[2022-06-06 03:56:26.979 ip-10-0-168-116.ap-southeast-2.compute.internal:26 INFO hook.py:253] Saving to /opt/ml/output/tensors\u001b[0m\n", "\u001b[34m[2022-06-06 03:56:26.979 ip-10-0-168-116.ap-southeast-2.compute.internal:26 INFO state_store.py:75] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.\u001b[0m\n", "\u001b[34m[2022-06-06 03:56:26.980 ip-10-0-168-116.ap-southeast-2.compute.internal:26 INFO hook.py:413] Monitoring the collections: metrics, losses, sm_metrics\u001b[0m\n", "\n", "2022-06-06 03:56:54 Uploading - Uploading generated training model\u001b[34mEpoch 1, Loss: 0.30810317397117615, Accuracy: 90.59833526611328, Test Loss: 0.13073942065238953, Test Accuracy: 96.17000579833984\u001b[0m\n", "\u001b[34m2022-06-06 03:56:22.715657: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\u001b[0m\n", "\u001b[34m2022-06-06 03:56:22.715805: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. 
The timeline writer thread will not be started, future recorded events will be dropped.\u001b[0m\n", "\u001b[34m2022-06-06 03:56:22.746773: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\u001b[0m\n", "\u001b[34mWARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow/python/training/tracking/tracking.py:111: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.\u001b[0m\n", "\u001b[34mInstructions for updating:\u001b[0m\n", "\u001b[34mThis property should not be used in TensorFlow 2.0, as updates are applied automatically.\u001b[0m\n", "\u001b[34mWARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow/python/training/tracking/tracking.py:111: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.\u001b[0m\n", "\u001b[34mInstructions for updating:\u001b[0m\n", "\u001b[34mThis property should not be used in TensorFlow 2.0, as updates are applied automatically.\u001b[0m\n", "\u001b[34mWARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow/python/training/tracking/tracking.py:111: Layer.updates (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.\u001b[0m\n", "\u001b[34mInstructions for updating:\u001b[0m\n", "\u001b[34mThis property should not be used in TensorFlow 2.0, as updates are applied automatically.\u001b[0m\n", "\u001b[34mWARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow/python/training/tracking/tracking.py:111: Layer.updates (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.\u001b[0m\n", "\u001b[34mInstructions for updating:\u001b[0m\n", "\u001b[34mThis property should not be used in TensorFlow 2.0, as updates are applied automatically.\u001b[0m\n", "\u001b[34m2022-06-06 03:56:46.540805: W 
tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.\u001b[0m\n", "\u001b[34mINFO:tensorflow:Assets written to: /opt/ml/model/00000000/assets\u001b[0m\n", "\u001b[34mINFO:tensorflow:Assets written to: /opt/ml/model/00000000/assets\u001b[0m\n", "\u001b[34m2022-06-06 03:56:47,337 sagemaker-training-toolkit INFO Reporting training SUCCESS\u001b[0m\n", "\n", "2022-06-06 03:57:14 Completed - Training job completed\n", "Training seconds: 123\n", "Billable seconds: 123\n" ] } ], "source": [ "est.fit(inputs=channels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspect and store model data\n", "\n", "Now that training is finished, the model artifact has been saved in \n", "the `output_path`. Its S3 URI is available as `est.model_data`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model artifact saved at:\n", " s3://sagemaker-ap-southeast-2-431579215499/tensorflow/mnist/tensorflow-training-2022-06-06-03-53-29-103/output/model.tar.gz\n" ] } ], "source": [ "tf_mnist_model_data = est.model_data\n", "print(\"Model artifact saved at:\\n\", tf_mnist_model_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We store the variable `tf_mnist_model_data` with the `%store` magic so that it can be restored in another notebook. \n", "In the [next notebook](get_started_with_mnist_deploy.ipynb), you will learn how to retrieve the model artifact and deploy it to a SageMaker\n", "endpoint."
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Stored 'tf_mnist_model_data' (str)\n" ] } ], "source": [ "%store tf_mnist_model_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test and debug the entry point before executing the training container\n", "\n", "The entry point `code/train.py` provided here has been tested and can be executed in the training container. \n", "When you develop your own training script, it is good practice to simulate the container environment \n", "in a local shell and test the script there before sending it to SageMaker, because debugging in a containerized environment\n", "is cumbersome. The following script, `code/test_train.py`, shows how you can test your training script:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mjson\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mos\u001b[39;49;00m\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36msys\u001b[39;49;00m\n", "\n", "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mboto3\u001b[39;49;00m\n", "\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36mtrain\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m parse_args, train\n", "\n", "dirname = os.path.dirname(os.path.abspath(\u001b[31m__file__\u001b[39;49;00m))\n", "\n", "\u001b[34mwith\u001b[39;49;00m \u001b[36mopen\u001b[39;49;00m(os.path.join(dirname, \u001b[33m\"\u001b[39;49;00m\u001b[33mconfig.json\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m), \u001b[33m\"\u001b[39;49;00m\u001b[33mr\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m) \u001b[34mas\u001b[39;49;00m f:\n", " CONFIG = json.load(f)\n", "\n", "\n", "\u001b[34mdef\u001b[39;49;00m \u001b[32mdownload_from_s3\u001b[39;49;00m(data_dir=\u001b[33m\"\u001b[39;49;00m\u001b[33m/tmp/data\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, 
train=\u001b[34mTrue\u001b[39;49;00m):\n", " \u001b[33m\"\"\"Download MNIST dataset and convert it to numpy array\u001b[39;49;00m\n", "\u001b[33m Args:\u001b[39;49;00m\n", "\u001b[33m data_dir (str): directory to save the data\u001b[39;49;00m\n", "\u001b[33m train (bool): download training set\u001b[39;49;00m\n", "\u001b[33m Returns:\u001b[39;49;00m\n", "\u001b[33m tuple of images and labels as numpy arrays\u001b[39;49;00m\n", "\u001b[33m \"\"\"\u001b[39;49;00m\n", "\n", " \u001b[34mif\u001b[39;49;00m \u001b[35mnot\u001b[39;49;00m os.path.exists(data_dir):\n", " os.makedirs(data_dir)\n", "\n", " \u001b[34mif\u001b[39;49;00m train:\n", " images_file = \u001b[33m\"\u001b[39;49;00m\u001b[33mtrain-images-idx3-ubyte.gz\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", " labels_file = \u001b[33m\"\u001b[39;49;00m\u001b[33mtrain-labels-idx1-ubyte.gz\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", " \u001b[34melse\u001b[39;49;00m:\n", " images_file = \u001b[33m\"\u001b[39;49;00m\u001b[33mt10k-images-idx3-ubyte.gz\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", " labels_file = \u001b[33m\"\u001b[39;49;00m\u001b[33mt10k-labels-idx1-ubyte.gz\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", "\n", " \u001b[37m# download objects\u001b[39;49;00m\n", " s3 = boto3.client(\u001b[33m\"\u001b[39;49;00m\u001b[33ms3\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", " bucket = CONFIG[\u001b[33m\"\u001b[39;49;00m\u001b[33mpublic_bucket\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m]\n", " \u001b[34mfor\u001b[39;49;00m obj \u001b[35min\u001b[39;49;00m [images_file, labels_file]:\n", " key = os.path.join(\u001b[33m\"\u001b[39;49;00m\u001b[33mdatasets/image/MNIST\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, obj)\n", " dest = os.path.join(data_dir, obj)\n", " \u001b[34mif\u001b[39;49;00m \u001b[35mnot\u001b[39;49;00m os.path.exists(dest):\n", " s3.download_file(bucket, key, dest)\n", " \u001b[34mreturn\u001b[39;49;00m\n", "\n", "\n", "\u001b[34mclass\u001b[39;49;00m 
\u001b[04m\u001b[32mEnv\u001b[39;49;00m:\n", " \u001b[34mdef\u001b[39;49;00m \u001b[32m__init__\u001b[39;49;00m(\u001b[36mself\u001b[39;49;00m):\n", " \u001b[37m# simulate container env\u001b[39;49;00m\n", " os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_MODEL_DIR\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m] = \u001b[33m\"\u001b[39;49;00m\u001b[33m/tmp/tf/model\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", " os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_CHANNEL_TRAINING\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m] = \u001b[33m\"\u001b[39;49;00m\u001b[33m/tmp/data\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", " os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_CHANNEL_TESTING\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m] = \u001b[33m\"\u001b[39;49;00m\u001b[33m/tmp/data\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", " os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_HOSTS\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m] = \u001b[33m'\u001b[39;49;00m\u001b[33m[\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33malgo-1\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33m]\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\n", " os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_CURRENT_HOST\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m] = \u001b[33m\"\u001b[39;49;00m\u001b[33malgo-1\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", " os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_NUM_GPUS\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m] = \u001b[33m\"\u001b[39;49;00m\u001b[33m0\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", "\n", "\n", "\u001b[34mif\u001b[39;49;00m \u001b[31m__name__\u001b[39;49;00m == \u001b[33m\"\u001b[39;49;00m\u001b[33m__main__\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m:\n", " Env()\n", " download_from_s3()\n", " download_from_s3(train=\u001b[34mFalse\u001b[39;49;00m)\n", " args = parse_args()\n", " train(args)\n" ] } ], "source": [ "!pygmentize code/test_train.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In [the 
next notebook](get_started_mnist_deploy.ipynb) you will see how to deploy your \n", "trained model artifacts to a SageMaker endpoint. " ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." }, "nbformat": 4, "nbformat_minor": 4 }