{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Computer Vision (CV) on SageMaker - TensorFlow\n", "\n", "1. [Introduction](#Introduction)\n", "2. [Prerequisites](#Prerequisites)\n", "3. [Setup](#Setup)\n", "4. [Dataset](#Dataset)\n", "5. [Training a CV model](#Training-a-CV-model)\n", "    1. [TFRecord Data Ingestion](#TFRecord-Data-Ingestion)\n", "    2. [Create Experiment](#Create-Experiment)\n", "    3. [Configure Training](#Configure-Training)\n", "    4. [Analyzing Training Job](#Analyzing-Training-Job)\n", "6. [Hyperparameter Tuning Job](#Automatic-Model-Tuning)\n", "    1. [Configure HPO Job](#Configure-HPO-Job)\n", "    2. [Associate HPO to Experiment](#Associate-HPO-to-Experiment)\n", "7. [Clean Up](#Clean-up)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "This lab focuses on SageMaker Training for CV. We'll show examples of high-throughput Pipe Mode data ingestion, hyperparameter optimization (HPO), and experiment tracking. In future labs we'll show how experiment tracking can be automated through the native integration in SageMaker Pipelines. The model used in this notebook is a simple deep CNN based on the [Keras examples](https://www.tensorflow.org/tutorials/images/cnn).\n", "\n", "**Note: This notebook was tested on the Data Science kernel for SageMaker Studio.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "To run this notebook, you can simply execute each cell in order. To understand what's happening, you'll need:\n", "\n", "- Access to the SageMaker default S3 bucket. 
All the files related to this lab will be stored under the \"cv_keras_cifar10\" prefix of the bucket.\n", "- Familiarity with Python and NumPy.\n", "- Basic familiarity with Amazon S3.\n", "- Basic understanding of Amazon SageMaker.\n", "- Basic familiarity with the AWS Command Line Interface (CLI). Ideally, you should have it set up with credentials to access the AWS account you're running this notebook from.\n", "- SageMaker Studio is preferred for the full UI integration." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "Set up the environment, load the libraries, and define the parameters used throughout the notebook.\n", "\n", "Run the cell below if `smexperiments` or TensorFlow is missing from your kernel." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install sagemaker-experiments\n", "!pip install tensorflow" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import time\n", "import pytz\n", "import boto3\n", "import sagemaker\n", "import tensorflow as tf\n", "from tensorflow import keras\n", "from sagemaker.tensorflow import TensorFlow\n", "from sagemaker import get_execution_role\n", "\n", "from smexperiments.experiment import Experiment\n", "from smexperiments.trial import Trial\n", "\n", "sagemaker_session = sagemaker.Session()\n", "sess = boto3.Session()\n", "sm = sess.client(\"sagemaker\")\n", "\n", "role = get_execution_role()\n", "\n", "bucket = sagemaker_session.default_bucket()\n", "prefix = \"cv_keras_cifar10\"\n", "\n", "print(\"Bucket: {}\".format(bucket))\n", "print(\"SageMaker ver: \" + sagemaker.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset\n", "The [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) is one of the most popular machine learning datasets. It consists of 60,000 32x32 color images belonging to 10 different classes (6,000 images per class). 
Here are the classes in the dataset, as well as 10 random images from each:\n", "\n", "![cifar10](../statics/CIFAR-10.png)\n", "\n", "In this tutorial, we will train a deep CNN to recognize these images.\n", "\n", "Downloading the test and training data takes around 5 minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tensorflow.keras.datasets import cifar10\n", "\n", "(x_train, y_train), (x_test, y_test) = cifar10.load_data()\n", "print(\"x_train shape:\", x_train.shape)\n", "print(x_train.shape[0], \"train samples\")\n", "print(x_test.shape[0], \"test samples\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training a CV model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TFRecord Data Ingestion\n", "\n", "Convert the dataset into [TFRecord files](https://www.tensorflow.org/tutorials/load_data/tfrecord).\n", "This allows us to use SageMaker [Pipe Mode](https://aws.amazon.com/blogs/machine-learning/accelerate-model-training-using-faster-pipe-mode-on-amazon-sagemaker/).\n", "\n", "To do this with your own dataset, here are examples of how to generate TFRecords from image datasets:\n", "* Image Classification: https://www.tensorflow.org/tutorials/load_data/tfrecord#walkthrough_reading_and_writing_image_data\n", "* Object Detection: https://keras.io/examples/keras_recipes/creating_tfrecords/\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "if not os.path.exists(\"./data/validation\"):\n", "    os.makedirs(\"./data/validation\")\n", "\n", "if not os.path.exists(\"./data/train\"):\n", "    os.makedirs(\"./data/train\")\n", "\n", "\n", "def write_tfrecords(x, y, filename):\n", "    writer = tf.io.TFRecordWriter(filename)\n", "\n", "    for image, label in zip(x, y):\n", "        example = tf.train.Example(\n", "            features=tf.train.Features(\n", "                feature={\n", "                    \"image\": tf.train.Feature(\n", "                        
bytes_list=tf.train.BytesList(value=[image.tobytes()])\n", "                    ),\n", "                    \"label\": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),\n", "                }\n", "            )\n", "        )\n", "        writer.write(example.SerializeToString())\n", "\n", "    # Flush and close the file once all records are written\n", "    writer.close()\n", "\n", "\n", "write_tfrecords(x_test, y_test, \"./data/validation/validation.tfrecords\")\n", "\n", "write_tfrecords(x_train, y_train, \"./data/train/train.tfrecords\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Experiment\n", "\n", "[SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) helps you organize, track, compare, and evaluate machine learning (ML) experiments and model versions. Since ML is a highly iterative process, Experiments helps data scientists and ML engineers explore thousands of different models in an organized manner. It is especially useful with tools like [Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) and [Amazon SageMaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html), which explore a large number of combinations automatically, so you can quickly zoom in on high-performance models.\n", "\n", "We will first create an experiment for a training job, and then work through an example with Automatic Model Tuning." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cv_experiment = Experiment.create(\n", "    experiment_name=f\"manual-experiment-cv-keras-{int(time.time())}\",\n", "    description=\"CV Workshop example\",\n", "    sagemaker_boto_client=sm,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Upload the data to S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset_location = sagemaker_session.upload_data(path=\"data\", key_prefix=prefix + \"/data\")\n", "display(dataset_location)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure Training\n", "\n", "### Define Custom Metrics\n", "SageMaker can parse training metrics directly from the job logs, using the regular expressions defined below, and publish them to CloudWatch metrics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "keras_metric_definition = [\n", "    {\n", "        \"Name\": \"train:loss\",\n", "        \"Regex\": \".*loss: ([0-9\\\\.]+) - accuracy: [0-9\\\\.]+.*\"\n", "    },\n", "    {\n", "        \"Name\": \"train:accuracy\",\n", "        \"Regex\": \".*loss: [0-9\\\\.]+ - accuracy: ([0-9\\\\.]+).*\"\n", "    },\n", "    {\n", "        \"Name\": \"validation:accuracy\",\n", "        \"Regex\": \".*step - loss: [0-9\\\\.]+ - accuracy: [0-9\\\\.]+ - val_loss: [0-9\\\\.]+ - val_accuracy: ([0-9\\\\.]+).*\"\n", "    },\n", "    {\n", "        \"Name\": \"validation:loss\",\n", "        \"Regex\": \".*step - loss: [0-9\\\\.]+ - accuracy: [0-9\\\\.]+ - val_loss: ([0-9\\\\.]+) - val_accuracy: [0-9\\\\.]+.*\"\n", "    },\n", "    {\n", "        \"Name\": \"sec/steps\",\n", "        \"Regex\": \".* (\\\\d+)[mu]s/step - loss: [0-9\\\\.]+ - accuracy: [0-9\\\\.]+ - val_loss: [0-9\\\\.]+ - val_accuracy: [0-9\\\\.]+\"\n", "    }\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build A Training Estimator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to train the model with Pipe Mode input. 
SageMaker Pipe Mode is a mechanism for providing S3 data to a training job via Linux FIFOs (named pipes). Training programs can read from the FIFO and get high-throughput data transfer from S3, without managing the S3 access in the program itself.\n", "Pipe Mode is covered in more detail in the SageMaker [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html#your-algorithms-training-algo-running-container-trainingdata)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hyperparameters = {\"epochs\": 2, \"batch-size\": 256}\n", "\n", "inputs = {\"train\": dataset_location + \"/train\", \"validation\": dataset_location + \"/validation\"}\n", "\n", "trial_name = f\"cv-keras-training-job-{int(time.time())}\"\n", "cnn_trial = Trial.create(\n", "    trial_name=trial_name,\n", "    experiment_name=cv_experiment.experiment_name,\n", "    sagemaker_boto_client=sm,\n", ")\n", "\n", "experiment_config = {\n", "    \"ExperimentName\": cv_experiment.experiment_name,\n", "    \"TrialName\": cnn_trial.trial_name,\n", "    \"TrialComponentDisplayName\": \"Training\",\n", "}\n", "\n", "estimator = TensorFlow(\n", "    base_job_name=\"cv-keras-pipe\",\n", "    entry_point=\"keras_cifar10.py\",\n", "    source_dir=\"source_dir\",\n", "    role=role,\n", "    framework_version=\"2.3.0\",\n", "    py_version=\"py37\",\n", "    hyperparameters=hyperparameters,\n", "    instance_count=1,\n", "    instance_type=\"ml.c5.xlarge\",\n", "    metric_definitions=keras_metric_definition,\n", "    enable_sagemaker_metrics=True,\n", "    input_mode=\"Pipe\",\n", ")\n", "\n", "estimator.fit(inputs, wait=True, logs=False, experiment_config=experiment_config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **```fit```** method will create a training job on an **ml.c5.xlarge** instance.\n", "\n", "This instance will write checkpoints and logs to the S3 bucket we've set up earlier. 
If you don't have this bucket yet, **```sagemaker_session```** will create it for you. These checkpoints and logs can be used to restore the training job and to analyze its metrics with TensorBoard." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyzing Training Job\n", "\n", "You can set `logs=True` in the `fit` call above to see the container logs directly in the notebook. Alternatively, you can open the SageMaker console under \"Training Jobs\" for a more user-friendly report with links to CloudWatch, where the full logs are retained indefinitely by default.\n", "\n", "Since we specified an Experiment trial, you can also click the \"SageMaker resources\" icon in SageMaker Studio, select \"Experiments and trials\", open the trial, and explore the trial details to view metric charts, summary statistics, and hyperparameters associated with the experiment.\n", "\n", "![Experiment UI](../statics/Experiments.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Automatic Model Tuning\n", "\n", "[Amazon SageMaker automatic model tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html), also known as hyperparameter optimization (HPO), finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure HPO Job\n", "Next, specify the tuning job with the following configuration:\n", "- hyperparameters that SageMaker Automatic Model Tuning will tune: `learning-rate`, `batch-size` and `optimizer`\n", "- maximum number of training jobs it will run to optimize the objective metric: `6`\n", "- number of parallel training jobs that will run in the tuning job: `2`\n", "- objective metric that Automatic Model Tuning will use: `validation:accuracy`\n", "\n", "**Note: you may run into resource limits in your account. If you do, please raise a support case to increase the limit.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "shared_hyperparameters = {\"epochs\": 4}\n", "\n", "estimator = TensorFlow(\n", "    base_job_name=\"cv-keras-pipe\",\n", "    entry_point=\"keras_cifar10.py\",\n", "    source_dir=\"source_dir\",\n", "    role=role,\n", "    framework_version=\"2.3.0\",\n", "    py_version=\"py37\",\n", "    hyperparameters=shared_hyperparameters,\n", "    instance_count=1,\n", "    instance_type=\"ml.m5.2xlarge\",\n", "    metric_definitions=keras_metric_definition,\n", "    input_mode=\"Pipe\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.tuner import (\n", "    IntegerParameter,\n", "    CategoricalParameter,\n", "    ContinuousParameter,\n", "    HyperparameterTuner,\n", ")\n", "\n", "hyperparameter_ranges = {\n", "    \"learning-rate\": ContinuousParameter(0.00001, 0.001),\n", "    \"batch-size\": CategoricalParameter([64, 128, 256]),\n", "    \"optimizer\": CategoricalParameter([\"sgd\", \"adam\", \"rmsprop\"]),\n", "}\n", "\n", "objective_metric_name = \"validation:accuracy\"\n", "\n", "inputs = {\"train\": dataset_location + \"/train\", \"validation\": dataset_location + \"/validation\"}\n", "\n", "tuner = HyperparameterTuner(\n", "    estimator,\n", "    objective_metric_name,\n", "    
hyperparameter_ranges,\n", "    metric_definitions=keras_metric_definition,\n", "    objective_type=\"Maximize\",\n", "    max_jobs=6,\n", "    max_parallel_jobs=2,\n", "    base_tuning_job_name=\"cv-hpo\",\n", ")\n", "\n", "tuner.fit(inputs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Associate HPO to Experiment\n", "The cells below associate the training jobs created by the tuning job with an Experiment trial. This step is unnecessary when the tuning job is executed from a [SageMaker Pipeline Tuning Step](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.steps.TuningStep).\n", "\n", "After running the code below, you should see something like this in your Studio environment:\n", "![HPO Experiments](../statics/HPO_experiments.png)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from smexperiments.search_expression import Filter, Operator, SearchExpression\n", "from smexperiments.trial_component import TrialComponent" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the most recently created tuning job\n", "\n", "list_tuning_jobs_response = sm.list_hyper_parameter_tuning_jobs(\n", "    SortBy=\"CreationTime\", SortOrder=\"Descending\"\n", ")\n", "print(f'Found {len(list_tuning_jobs_response[\"HyperParameterTuningJobSummaries\"])} tuning jobs.')\n", "tuning_jobs = list_tuning_jobs_response[\"HyperParameterTuningJobSummaries\"]\n", "most_recently_created_tuning_job = tuning_jobs[0]\n", "tuning_job_name = most_recently_created_tuning_job[\"HyperParameterTuningJobName\"]\n", "experiment_name = \"cv-hpo-experiment\"\n", "trial_name = tuning_job_name + \"-trial\"\n", "\n", "print(f\"Associate all training jobs created by {tuning_job_name} with trial {trial_name}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the experiment if it doesn't exist\n", "try:\n", "    experiment = Experiment.load(experiment_name=experiment_name)\n", "except Exception 
as ex:\n", "    if \"ResourceNotFound\" in str(ex):\n", "        experiment = Experiment.create(experiment_name=experiment_name)\n", "    else:\n", "        # Re-raise unexpected errors instead of silently leaving `experiment` undefined\n", "        raise\n", "\n", "\n", "# Create the trial if it doesn't exist\n", "try:\n", "    trial = Trial.load(trial_name=trial_name)\n", "except Exception as ex:\n", "    if \"ResourceNotFound\" in str(ex):\n", "        trial = Trial.create(experiment_name=experiment_name, trial_name=trial_name)\n", "    else:\n", "        raise" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the trial components derived from the training jobs\n", "\n", "creation_time = most_recently_created_tuning_job[\"CreationTime\"]\n", "creation_time = creation_time.astimezone(pytz.utc)\n", "creation_time = creation_time.strftime(\"%Y-%m-%dT%H:%M:%SZ\")\n", "\n", "created_after_filter = Filter(\n", "    name=\"CreationTime\",\n", "    operator=Operator.GREATER_THAN_OR_EQUAL,\n", "    value=str(creation_time),\n", ")\n", "\n", "# The training job names contain the tuning job name (and the training job name is in the source arn)\n", "source_arn_filter = Filter(\n", "    name=\"TrialComponentName\", operator=Operator.CONTAINS, value=tuning_job_name\n", ")\n", "source_type_filter = Filter(\n", "    name=\"Source.SourceType\", operator=Operator.EQUALS, value=\"SageMakerTrainingJob\"\n", ")\n", "\n", "search_expression = SearchExpression(\n", "    filters=[created_after_filter, source_arn_filter, source_type_filter]\n", ")\n", "\n", "# Search iterates over every page of results by default\n", "trial_component_search_results = list(\n", "    TrialComponent.search(search_expression=search_expression, sagemaker_boto_client=sm)\n", ")\n", "print(f\"Found {len(trial_component_search_results)} trial components.\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Associate the trial components with the trial\n", "for tc in trial_component_search_results:\n", "    print(f\"Associating trial component {tc.trial_component_name} with trial {trial.trial_name}.\")\n", "    
trial.add_trial_component(tc.trial_component_name)\n", "    # Sleep to avoid throttling\n", "    time.sleep(0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clean up\n", "To avoid incurring charges to your AWS account for the resources used in this tutorial, you need to remove all data and model artifacts from the SageMaker S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" }, "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.", "pycharm": { "stem_cell": { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [] } } }, "nbformat": 4, "nbformat_minor": 4 }