{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Distributed Tensorflow in SageMaker with Magic " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001B[0;31mDocstring:\u001B[0m\n", "::\n", "\n", " %tfjob [--estimator_name ESTIMATOR_NAME] [--entry_point ENTRY_POINT]\n", " [--source_dir SOURCE_DIR] [--role ROLE]\n", " [--framework_version FRAMEWORK_VERSION] [--py_version PY_VERSION]\n", " [--instance_type INSTANCE_TYPE] [--instance_count INSTANCE_COUNT]\n", " [--output_path OUTPUT_PATH]\n", " [--hyperparameters FOO:1,BAR:0.555,BAZ:ABC | 'FOO : 1, BAR : 0.555, BAZ : ABC']\n", " [--channel_training CHANNEL_TRAINING]\n", " [--channel_testing CHANNEL_TESTING]\n", " [--use_spot_instances [USE_SPOT_INSTANCES]] [--max_wait MAX_WAIT]\n", " [--enable_sagemaker_metrics [ENABLE_SAGEMAKER_METRICS]]\n", " [--metric_definitions ['Name: ganloss, Regex: GAN_loss=.*?);' ['Name: ganloss, Regex: GAN_loss=(.*?;' ...]]]\n", " [--distribution {parameter_server,horovod}]\n", " [--mpi_processes_per_host MPI_PROCESSES_PER_HOST]\n", " [--mpi_custom_mpi_options MPI_CUSTOM_MPI_OPTIONS]\n", " [--name_contains NAME_CONTAINS] [--max_result MAX_RESULT]\n", " {submit,list,status,logs,delete}\n", "\n", "Tensorflow magic command.\n", "\n", "methods:\n", " {submit,list,status,logs,delete}\n", "\n", "submit:\n", " --estimator_name ESTIMATOR_NAME\n", " estimator shell variable name\n", " --entry_point ENTRY_POINT\n", " notebook local code file\n", " --source_dir SOURCE_DIR\n", " notebook local code src, may contain requirements.txt\n", " --role ROLE An AWS IAM role (either name or full ARN). The Amazon\n", " SageMaker training jobs and APIs that create Amazon\n", " SageMaker endpoints use this role to access training\n", " data and model artifacts. 
After the endpoint is\n", " created, the inference code might use the IAM role, if\n", " it needs to access an AWS resource.\n", " --framework_version FRAMEWORK_VERSION\n", " TensorFlow version\n", " --py_version PY_VERSION\n", " Python version\n", " --instance_type INSTANCE_TYPE\n", " Type of EC2 instance to use for training, for example,\n", " ‘ml.c4.xlarge’.\n", " --instance_count INSTANCE_COUNT\n", " Number of Amazon EC2 instances to use for training.\n", " --output_path OUTPUT_PATH\n", " S3 location for saving the training result (model\n", " artifacts and output files). If not specified, results\n", " are stored to a default bucket. If the bucket with the\n", " specific name does not exist, the estimator creates\n", " the bucket during the fit() method execution.\n", " --hyperparameters \n", " Hyperparameters are passed to your script as arguments\n", " and can be retrieved with an argparse.\n", " --channel_training CHANNEL_TRAINING\n", " A string that represents the path to the directory\n", " that contains the input data for the training channel.\n", " --channel_testing CHANNEL_TESTING\n", " A string that represents the path to the directory\n", " that contains the input data for the testing channel.\n", "\n", "submit-spot:\n", " --use_spot_instances <[USE_SPOT_INSTANCES]>\n", " Specifies whether to use SageMaker Managed Spot\n", " instances for training. If enabled then the max_wait\n", " arg should also be set. More information:\n", " https://docs.aws.amazon.com/sagemaker/latest/dg/model-\n", " managed-spot-training.html\n", " --max_wait MAX_WAIT Timeout in seconds waiting for spot training instances\n", " (default: None). After this amount of time Amazon\n", " SageMaker will stop waiting for Spot instances to\n", " become available (default: None).\n", "\n", "submit-metrics:\n", " --enable_sagemaker_metrics <[ENABLE_SAGEMAKER_METRICS]>\n", " Enables SageMaker Metrics Time Series. 
For more\n", " information see: https://docs.aws.amazon.com/sagemaker\n", " /latest/dg/API_AlgorithmSpecification.html# SageMaker-\n", " Type-AlgorithmSpecification-\n", " EnableSageMakerMetricsTimeSeries\n", " --metric_definitions <['Name: ganloss, Regex: GAN_loss=(.*?);' ['Name: ganloss, Regex: GAN_loss=(.*?);' ...]]>\n", " A list of dictionaries that defines the metric(s) used\n", " to evaluate the training jobs. Each dictionary\n", " contains two keys: ‘Name’ for the name of the metric,\n", " and ‘Regex’ for the regular expression used to extract\n", " the metric from the logs. This should be defined only\n", " for jobs that don’t use an Amazon algorithm.\n", "\n", "submit-distribution:\n", " --distribution <{parameter_server,horovod}>\n", " To run your training job with multiple instances in a\n", " distributed fashion, set instance_count to a number\n", " larger than 1. We support two different types of\n", " distributed training, parameter server and Horovod.\n", " The distribution parameter is used to configure which\n", " distributed training strategy to use.\n", " --mpi_processes_per_host MPI_PROCESSES_PER_HOST\n", " horovod mpi_processes_per_host\n", " --mpi_custom_mpi_options MPI_CUSTOM_MPI_OPTIONS\n", " horovod custom_mpi_options\n", "\n", "list:\n", " --name_contains NAME_CONTAINS\n", " --max_result MAX_RESULT\n", "\u001B[0;31mFile:\u001B[0m /opt/conda/lib/python3.8/site-packages/sage_maker_kernel/kernelmagics.py\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%tfjob?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up S3 bucket locations\n", "\n", "First, set up some locations in the default SageMaker bucket to store the raw input datasets and the TensorFlow job output." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Couldn't call 'get_role' to get Role ARN from role name workshop-sagemaker to get Role path.\n" ] } ], "source": [ "import sagemaker\n", "\n", "sess = sagemaker.Session()\n", "output_path='s3://' + sess.default_bucket() + '/mnist'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "import json\n", "import logging\n", "import boto3\n", "from botocore.exceptions import ClientError\n", "# Download training and testing data from a public S3 bucket\n", "\n", "def download_from_s3(data_dir='/tmp/data', train=True):\n", " \"\"\"Download MNIST dataset and convert it to numpy array\n", " \n", " Args:\n", " data_dir (str): directory to save the data\n", " train (bool): download training set\n", " \n", " Returns:\n", " None\n", " \"\"\"\n", " \n", " if not os.path.exists(data_dir):\n", " os.makedirs(data_dir)\n", " \n", " if train:\n", " images_file = \"train-images-idx3-ubyte.gz\"\n", " labels_file = \"train-labels-idx1-ubyte.gz\"\n", " else:\n", " images_file = \"t10k-images-idx3-ubyte.gz\"\n", " labels_file = \"t10k-labels-idx1-ubyte.gz\"\n", " \n", "# with open('code/config.json', 'r') as f:\n", "# config = json.load(f)\n", " config = {}\n", " config['public_bucket'] = \"sagemaker-sample-files\"\n", "\n", " # download objects\n", " s3 = boto3.client('s3')\n", " bucket = config['public_bucket']\n", " for obj in [images_file, labels_file]:\n", " key = os.path.join(\"datasets/image/MNIST\", obj)\n", " dest = os.path.join(data_dir, obj)\n", " if not os.path.exists(dest):\n", " s3.download_file(bucket, 
key, dest)\n", " return\n", "\n", "\n", "download_from_s3('/tmp/data', True)\n", "download_from_s3('/tmp/data', False)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# upload to the default bucket\n", "\n", "prefix = 'mnist'\n", "bucket = sess.default_bucket()\n", "loc = sess.upload_data(path='/tmp/data', bucket=bucket, key_prefix=prefix)\n", "\n", "channels = {\n", " \"training\": loc,\n", " \"testing\": loc\n", "}\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Write the TensorFlow script\n", "\n", "The source for a training script is in the cell below. The cell uses the `%%tfjob submit` directive to submit the Python code in the cell to a TensorFlow estimator. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Couldn't call 'get_role' to get Role ARN from role name workshop-sagemaker to get Role path.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "submit:\n", " {\n", " \"channel_testing\": \"s3://sagemaker-eu-west-1-245582572290/mnist\",\n", " \"channel_training\": \"s3://sagemaker-eu-west-1-245582572290/mnist\",\n", " \"enable_sagemaker_metrics\": false,\n", " \"entry_point\": \"/tmp/tmp-4c621cef-148d-4252-918a-52e58e750496.py\",\n", " \"estimator_name\": \"___TensorFlow_estimator\",\n", " \"framework_version\": \"2.3.0\",\n", " \"instance_count\": 1,\n", " \"instance_type\": \"ml.c4.xlarge\",\n", " \"max_result\": 10,\n", " \"mpi_custom_mpi_options\": \"--NCCL_DEBUG INFO\",\n", " \"mpi_processes_per_host\": 4,\n", " \"name_contains\": \"tensorflow\",\n", " \"output_path\": \"s3://sagemaker-eu-west-1-245582572290/mnist\",\n", " \"py_version\": \"py37\",\n", " \"role\": \"arn:aws:iam::245582572290:role/workshop-sagemaker\",\n", " \"use_spot_instances\": false\n", "}\n", "{\n", " \"___TensorFlow_latest_job_name\": \"tensorflow-training-2020-12-17-17-28-10-885\",\n", " \"estimator_variable\": 
\"___TensorFlow_estimator\"\n", "}\n" ] } ], "source": [ "%%tfjob submit --output_path s3://sagemaker-eu-west-1-245582572290/mnist --channel_training s3://sagemaker-eu-west-1-245582572290/mnist --channel_testing s3://sagemaker-eu-west-1-245582572290/mnist \n", "\n", "from __future__ import print_function\n", "\n", "import argparse\n", "import logging\n", "import os\n", "import json\n", "import gzip\n", "import numpy as np\n", "import traceback\n", "\n", "import tensorflow as tf\n", "from tensorflow.keras.layers import Dense, Flatten, Conv2D\n", "from tensorflow.keras import Model\n", "\n", "\n", "logging.basicConfig(level=logging.DEBUG)\n", "\n", "# Define the model object\n", "\n", "class SmallConv(Model):\n", " def __init__(self):\n", " super(SmallConv, self).__init__()\n", " self.conv1 = Conv2D(32, 3, activation='relu')\n", " self.flatten = Flatten()\n", " self.d1 = Dense(128, activation='relu')\n", " self.d2 = Dense(10)\n", " \n", " def call(self, x):\n", " x = self.conv1(x)\n", " x = self.flatten(x)\n", " x = self.d1(x)\n", " return self.d2(x)\n", "\n", "\n", "# Decode and preprocess data\n", "def convert_to_numpy(data_dir, images_file, labels_file):\n", " \"\"\"Byte string to numpy arrays\"\"\"\n", " with gzip.open(os.path.join(data_dir, images_file), 'rb') as f:\n", " images = np.frombuffer(f.read(), np.uint8, offset=16).reshape(-1, 28, 28)\n", " \n", " with gzip.open(os.path.join(data_dir, labels_file), 'rb') as f:\n", " labels = np.frombuffer(f.read(), np.uint8, offset=8)\n", "\n", " return (images, labels)\n", "\n", "def mnist_to_numpy(data_dir, train):\n", " \"\"\"Load raw MNIST data into numpy array\n", " \n", " Args:\n", " data_dir (str): directory of MNIST raw data. 
\n", " This argument can be accessed via SM_CHANNEL_TRAINING\n", " \n", " train (bool): use training data\n", "\n", " Returns:\n", " tuple of images and labels as numpy array\n", " \"\"\"\n", "\n", " if train:\n", " images_file = \"train-images-idx3-ubyte.gz\"\n", " labels_file = \"train-labels-idx1-ubyte.gz\"\n", " else:\n", " images_file = \"t10k-images-idx3-ubyte.gz\"\n", " labels_file = \"t10k-labels-idx1-ubyte.gz\"\n", "\n", " return convert_to_numpy(data_dir, images_file, labels_file)\n", "\n", "\n", "def normalize(x, axis):\n", " eps = np.finfo(float).eps\n", "\n", " mean = np.mean(x, axis=axis, keepdims=True)\n", " # avoid division by zero\n", " std = np.std(x, axis=axis, keepdims=True) + eps\n", " return (x - mean) / std\n", "\n", "# Training logic\n", "\n", "def train(args):\n", " # create data loader from the train / test channels\n", " x_train, y_train = mnist_to_numpy(data_dir=args.train, train=True)\n", " x_test, y_test = mnist_to_numpy(data_dir=args.test, train=False)\n", "\n", " x_train, x_test = x_train.astype(np.float32), x_test.astype(np.float32)\n", "\n", " # normalize the inputs to mean 0 and std 1\n", " x_train, x_test = normalize(x_train, (1, 2)), normalize(x_test, (1, 2))\n", "\n", " # expand channel axis\n", " # tf uses depth minor convention\n", " x_train, x_test = np.expand_dims(x_train, axis=3), np.expand_dims(x_test, axis=3)\n", " \n", " # wrap the arrays in batched tf.data datasets (training set shuffled)\n", " train_loader = tf.data.Dataset.from_tensor_slices(\n", " (x_train, y_train)).shuffle(len(x_train)).batch(args.batch_size)\n", "\n", " test_loader = tf.data.Dataset.from_tensor_slices(\n", " (x_test, y_test)).batch(args.batch_size)\n", "\n", " model = SmallConv()\n", " model.compile()\n", " loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)\n", " optimizer = tf.keras.optimizers.Adam(\n", " learning_rate=args.learning_rate, \n", " beta_1=args.beta_1,\n", " beta_2=args.beta_2\n", " )\n", "\n", "\n", " train_loss = 
tf.keras.metrics.Mean(name='train_loss')\n", " train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')\n", "\n", " test_loss = tf.keras.metrics.Mean(name='test_loss')\n", " test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='test_accuracy')\n", "\n", "\n", " @tf.function\n", " def train_step(images, labels):\n", " with tf.GradientTape() as tape:\n", " predictions = model(images, training=True)\n", " loss = loss_fn(labels, predictions)\n", " grad = tape.gradient(loss, model.trainable_variables)\n", " optimizer.apply_gradients(zip(grad, model.trainable_variables))\n", " \n", " train_loss(loss)\n", " train_accuracy(labels, predictions)\n", " return \n", " \n", " @tf.function\n", " def test_step(images, labels):\n", " predictions = model(images, training=False)\n", " t_loss = loss_fn(labels, predictions)\n", " test_loss(t_loss)\n", " test_accuracy(labels, predictions)\n", " return\n", " \n", " print(\"Training starts ...\")\n", " for epoch in range(args.epochs):\n", " train_loss.reset_states()\n", " train_accuracy.reset_states()\n", " test_loss.reset_states()\n", " test_accuracy.reset_states()\n", " \n", " for batch, (images, labels) in enumerate(train_loader):\n", " train_step(images, labels)\n", " \n", " for images, labels in test_loader:\n", " test_step(images, labels)\n", " \n", " print(\n", " f'Epoch {epoch + 1}, '\n", " f'Loss: {train_loss.result()}, '\n", " f'Accuracy: {train_accuracy.result() * 100}, '\n", " f'Test Loss: {test_loss.result()}, '\n", " f'Test Accuracy: {test_accuracy.result() * 100}'\n", " )\n", "\n", " # Save the model\n", " # A version number is needed for the serving container\n", " # to load the model\n", " version = '00000000'\n", "# ckpt_dir = os.path.join(args.model_dir, version)\n", " ckpt_dir = os.path.join(args.sm_model_dir, version)\n", " if not os.path.exists(ckpt_dir):\n", " os.makedirs(ckpt_dir)\n", " model.save(ckpt_dir)\n", " return\n", "\n", "\n", "def parse_args():\n", " parser = 
argparse.ArgumentParser()\n", "\n", " parser.add_argument('--batch-size', type=int, default=32)\n", " parser.add_argument('--epochs', type=int, default=1)\n", " parser.add_argument('--learning-rate', type=float, default=1e-3)\n", " parser.add_argument('--beta_1', type=float, default=0.9)\n", " parser.add_argument('--beta_2', type=float, default=0.999)\n", " \n", " # Environment variables given by the training image\n", " \n", "# parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])\n", " parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAINING'])\n", " parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TESTING'])\n", "\n", " parser.add_argument('--current-host', type=str, default=os.environ['SM_CURRENT_HOST'])\n", " parser.add_argument('--hosts', type=list, default=json.loads(os.environ['SM_HOSTS']))\n", " \n", " # Data, model, and output directories.\n", " # model_dir is always passed in from SageMaker.\n", " # By default this is an S3 path under the default bucket.\n", " parser.add_argument('--model_dir', type=str)\n", " parser.add_argument('--sm-model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))\n", "\n", " return parser.parse_args()\n", "\n", "\n", "\n", "if __name__ == '__main__':\n", " args = parse_args()\n", " train(args)\n", "\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'tensorflow-training-2020-12-17-17-28-10-885'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "___TensorFlow_latest_job_name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stop latest training Job" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"AlgorithmSpecification\": {\n", " \"EnableSageMakerMetricsTimeSeries\": false,\n", " \"TrainingImage\": 
\"763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-training:2.3.0-cpu-py37\",\n", " \"TrainingInputMode\": \"File\"\n", " },\n", " \"CreationTime\": \"2020-12-17 17:27:44.884000+00:00\",\n", " \"DebugHookConfig\": {\n", " \"CollectionConfigurations\": [],\n", " \"S3OutputPath\": \"s3://sagemaker-eu-west-1-245582572290/mnist\"\n", " },\n", " \"EnableInterContainerTrafficEncryption\": false,\n", " \"EnableManagedSpotTraining\": false,\n", " \"EnableNetworkIsolation\": false,\n", " \"HyperParameters\": {\n", " \"model_dir\": \"\\\"s3://sagemaker-eu-west-1-245582572290/mnist/tensorflow-training-2020-12-17-17-27-44-577/model\\\"\",\n", " \"sagemaker_container_log_level\": \"20\",\n", " \"sagemaker_job_name\": \"\\\"tensorflow-training-2020-12-17-17-27-44-577\\\"\",\n", " \"sagemaker_program\": \"\\\"tmp-223ca4e1-019c-489c-9b25-d553c2cc311d.py\\\"\",\n", " \"sagemaker_region\": \"\\\"eu-west-1\\\"\",\n", " \"sagemaker_submit_directory\": \"\\\"s3://sagemaker-eu-west-1-245582572290/tensorflow-training-2020-12-17-17-27-44-577/source/sourcedir.tar.gz\\\"\"\n", " },\n", " \"InputDataConfig\": [\n", " {\n", " \"ChannelName\": \"training\",\n", " \"CompressionType\": \"None\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataDistributionType\": \"FullyReplicated\",\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": \"s3://sagemaker-eu-west-1-245582572290/mnist\"\n", " }\n", " },\n", " \"RecordWrapperType\": \"None\"\n", " },\n", " {\n", " \"ChannelName\": \"testing\",\n", " \"CompressionType\": \"None\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataDistributionType\": \"FullyReplicated\",\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": \"s3://sagemaker-eu-west-1-245582572290/mnist\"\n", " }\n", " },\n", " \"RecordWrapperType\": \"None\"\n", " }\n", " ],\n", " \"LastModifiedTime\": \"2020-12-17 17:28:03.279000+00:00\",\n", " \"OutputDataConfig\": {\n", " \"KmsKeyId\": \"\",\n", " \"S3OutputPath\": 
\"s3://sagemaker-eu-west-1-245582572290/mnist\"\n", " },\n", " \"ProfilerConfig\": {\n", " \"ProfilingIntervalInMilliseconds\": 500,\n", " \"S3OutputPath\": \"s3://sagemaker-eu-west-1-245582572290/mnist\"\n", " },\n", " \"ProfilerRuleConfigurations\": [\n", " {\n", " \"RuleConfigurationName\": \"ProfilerReport-1608226064\",\n", " \"RuleEvaluatorImage\": \"929884845733.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-debugger-rules:latest\",\n", " \"RuleParameters\": {\n", " \"rule_to_invoke\": \"ProfilerReport\"\n", " },\n", " \"VolumeSizeInGB\": 0\n", " }\n", " ],\n", " \"ProfilerRuleEvaluationStatuses\": [\n", " {\n", " \"LastModifiedTime\": \"2020-12-17 17:27:46.164000+00:00\",\n", " \"RuleConfigurationName\": \"ProfilerReport-1608226064\",\n", " \"RuleEvaluationStatus\": \"InProgress\"\n", " }\n", " ],\n", " \"ProfilingStatus\": \"Enabled\",\n", " \"ResourceConfig\": {\n", " \"InstanceCount\": 1,\n", " \"InstanceType\": \"ml.c4.xlarge\",\n", " \"VolumeSizeInGB\": 30\n", " },\n", " \"ResponseMetadata\": {\n", " \"HTTPHeaders\": {\n", " \"content-length\": \"3129\",\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Thu, 17 Dec 2020 17:28:02 GMT\",\n", " \"x-amzn-requestid\": \"7f9fff2e-87a8-41a1-9055-9e9df5ab7bc3\"\n", " },\n", " \"HTTPStatusCode\": 200,\n", " \"RequestId\": \"7f9fff2e-87a8-41a1-9055-9e9df5ab7bc3\",\n", " \"RetryAttempts\": 0\n", " },\n", " \"RoleArn\": \"arn:aws:iam::245582572290:role/workshop-sagemaker\",\n", " \"SecondaryStatus\": \"Starting\",\n", " \"SecondaryStatusTransitions\": [\n", " {\n", " \"StartTime\": \"2020-12-17 17:27:44.884000+00:00\",\n", " \"Status\": \"Starting\",\n", " \"StatusMessage\": \"Launching requested ML instances\"\n", " }\n", " ],\n", " \"StoppingCondition\": {\n", " \"MaxRuntimeInSeconds\": 86400\n", " },\n", " \"TrainingJobArn\": \"arn:aws:sagemaker:eu-west-1:245582572290:training-job/tensorflow-training-2020-12-17-17-27-44-577\",\n", " \"TrainingJobName\": 
\"tensorflow-training-2020-12-17-17-27-44-577\",\n", " \"TrainingJobStatus\": \"Stopping\"\n", "}\n" ] } ], "source": [ "%tfjob delete" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Describe latest training Job" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"AlgorithmSpecification\": {\n", " \"EnableSageMakerMetricsTimeSeries\": false,\n", " \"TrainingImage\": \"763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-training:2.3.0-cpu-py37\",\n", " \"TrainingInputMode\": \"File\"\n", " },\n", " \"CreationTime\": \"2020-12-17 17:27:44.884000+00:00\",\n", " \"DebugHookConfig\": {\n", " \"CollectionConfigurations\": [],\n", " \"S3OutputPath\": \"s3://sagemaker-eu-west-1-245582572290/mnist\"\n", " },\n", " \"EnableInterContainerTrafficEncryption\": false,\n", " \"EnableManagedSpotTraining\": false,\n", " \"EnableNetworkIsolation\": false,\n", " \"HyperParameters\": {\n", " \"model_dir\": \"\\\"s3://sagemaker-eu-west-1-245582572290/mnist/tensorflow-training-2020-12-17-17-27-44-577/model\\\"\",\n", " \"sagemaker_container_log_level\": \"20\",\n", " \"sagemaker_job_name\": \"\\\"tensorflow-training-2020-12-17-17-27-44-577\\\"\",\n", " \"sagemaker_program\": \"\\\"tmp-223ca4e1-019c-489c-9b25-d553c2cc311d.py\\\"\",\n", " \"sagemaker_region\": \"\\\"eu-west-1\\\"\",\n", " \"sagemaker_submit_directory\": \"\\\"s3://sagemaker-eu-west-1-245582572290/tensorflow-training-2020-12-17-17-27-44-577/source/sourcedir.tar.gz\\\"\"\n", " },\n", " \"InputDataConfig\": [\n", " {\n", " \"ChannelName\": \"training\",\n", " \"CompressionType\": \"None\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataDistributionType\": \"FullyReplicated\",\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": \"s3://sagemaker-eu-west-1-245582572290/mnist\"\n", " }\n", " },\n", " \"RecordWrapperType\": \"None\"\n", " },\n", " {\n", " \"ChannelName\": \"testing\",\n", " 
\"CompressionType\": \"None\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataDistributionType\": \"FullyReplicated\",\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": \"s3://sagemaker-eu-west-1-245582572290/mnist\"\n", " }\n", " },\n", " \"RecordWrapperType\": \"None\"\n", " }\n", " ],\n", " \"LastModifiedTime\": \"2020-12-17 17:27:46.906000+00:00\",\n", " \"OutputDataConfig\": {\n", " \"KmsKeyId\": \"\",\n", " \"S3OutputPath\": \"s3://sagemaker-eu-west-1-245582572290/mnist\"\n", " },\n", " \"ProfilerConfig\": {\n", " \"ProfilingIntervalInMilliseconds\": 500,\n", " \"S3OutputPath\": \"s3://sagemaker-eu-west-1-245582572290/mnist\"\n", " },\n", " \"ProfilerRuleConfigurations\": [\n", " {\n", " \"RuleConfigurationName\": \"ProfilerReport-1608226064\",\n", " \"RuleEvaluatorImage\": \"929884845733.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-debugger-rules:latest\",\n", " \"RuleParameters\": {\n", " \"rule_to_invoke\": \"ProfilerReport\"\n", " },\n", " \"VolumeSizeInGB\": 0\n", " }\n", " ],\n", " \"ProfilerRuleEvaluationStatuses\": [\n", " {\n", " \"LastModifiedTime\": \"2020-12-17 17:27:46.164000+00:00\",\n", " \"RuleConfigurationName\": \"ProfilerReport-1608226064\",\n", " \"RuleEvaluationStatus\": \"InProgress\"\n", " }\n", " ],\n", " \"ProfilingStatus\": \"Enabled\",\n", " \"ResourceConfig\": {\n", " \"InstanceCount\": 1,\n", " \"InstanceType\": \"ml.c4.xlarge\",\n", " \"VolumeSizeInGB\": 30\n", " },\n", " \"ResponseMetadata\": {\n", " \"HTTPHeaders\": {\n", " \"content-length\": \"2944\",\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Thu, 17 Dec 2020 17:27:58 GMT\",\n", " \"x-amzn-requestid\": \"00ea4352-8229-450b-825d-db61e06559bd\"\n", " },\n", " \"HTTPStatusCode\": 200,\n", " \"RequestId\": \"00ea4352-8229-450b-825d-db61e06559bd\",\n", " \"RetryAttempts\": 0\n", " },\n", " \"RoleArn\": \"arn:aws:iam::245582572290:role/workshop-sagemaker\",\n", " \"SecondaryStatus\": \"Starting\",\n", " 
\"SecondaryStatusTransitions\": [\n", " {\n", " \"StartTime\": \"2020-12-17 17:27:44.884000+00:00\",\n", " \"Status\": \"Starting\",\n", " \"StatusMessage\": \"Launching requested ML instances\"\n", " }\n", " ],\n", " \"StoppingCondition\": {\n", " \"MaxRuntimeInSeconds\": 86400\n", " },\n", " \"TrainingJobArn\": \"arn:aws:sagemaker:eu-west-1:245582572290:training-job/tensorflow-training-2020-12-17-17-27-44-577\",\n", " \"TrainingJobName\": \"tensorflow-training-2020-12-17-17-27-44-577\",\n", " \"TrainingJobStatus\": \"InProgress\"\n", "}\n" ] } ], "source": [ "%tfjob status" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Show logs for latest training Job" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2020-12-17 17:28:33 Starting - Launching requested ML instances.null\n" ] } ], "source": [ "%tfjob logs\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## List training jobs" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"NextToken\": \"cIws2QhTXUIa8bi8X9aU7gCAR0Xdc3x9L/Ofg4vsVMTtcNqRqLcpBqE42+cDc29TFQi5WMntyYF8Dtfi7hilXAF3S3jOJ0DmOuxvXC7MuU1Q6+20eQKMbbovB90pwL5DPnINepnlLEmFhvO87tIVNZR4vTy3ef5rgF6dqbA0VVq0m92q6y2SOofBaZP49sRdVnJtTcTQaS5EqeYVIuH5KxlQ5w0j5RQq6GZMUD+Yb2yDOCsvqfy1owdkaN5KZ4UOJZ02BEXJUQQfH9slv9djHeoqy1UQeAT4Uj9JfW7GSGGhGbqBQpALEjcsu7VkI9wUADMYCWYOlP+lUtW8E2nSCujGvcsgo9AYnWN4Mg9fAZA1v19kYU5BB3xHSRprEU/tz/yJxJdxE2jqVM3qT/9GmPydPS9kETLj6F4XNZHCzJgu8qXvXFjinhEQdrLD1c3EhItDsieIUexyFvCnBZ2zc5gUDNcBOUTPIKpF17tfSgav50UtzbMbcYSaK2kGOSl3i6+ahsHMUSm91mecIbqrxUjSdqToSsJwf3n0hkGr/bDXAA+L+TosLKfp4NkhLxwo0n6xKnMMAO+WIA==\",\n", " \"ResponseMetadata\": {\n", " \"HTTPHeaders\": {\n", " \"content-length\": \"917\",\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Thu, 17 Dec 2020 17:27:48 GMT\",\n", " \"x-amzn-requestid\": 
\"229a98f7-db12-44e8-bfc8-4cf522d8542a\"\n", " },\n", " \"HTTPStatusCode\": 200,\n", " \"RequestId\": \"229a98f7-db12-44e8-bfc8-4cf522d8542a\",\n", " \"RetryAttempts\": 0\n", " },\n", " \"TrainingJobSummaries\": [\n", " {\n", " \"CreationTime\": \"2020-12-17 17:27:44.884000+00:00\",\n", " \"LastModifiedTime\": \"2020-12-17 17:27:46.906000+00:00\",\n", " \"TrainingJobArn\": \"arn:aws:sagemaker:eu-west-1:245582572290:training-job/tensorflow-training-2020-12-17-17-27-44-577\",\n", " \"TrainingJobName\": \"tensorflow-training-2020-12-17-17-27-44-577\",\n", " \"TrainingJobStatus\": \"InProgress\"\n", " }\n", " ]\n", "}\n" ] } ], "source": [ "%tfjob list" ] } ], "metadata": { "kernelspec": { "display_name": "SageMakerMagic (lblokhin/26)", "language": "python", "name": "sm__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:245582572290:image-version/lblokhin/26" }, "language_info": { "codemirror_mode": { "name": "python", "version": 3 }, "mimetype": "text/x-python", "name": "sm_kernel", "pygments_lexer": "python" } }, "nbformat": 4, "nbformat_minor": 4 }