{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Automatic Model Tuning : Automatic training job early stopping\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/hyperparameter_tuning|image_classification_early_stopping|hpo_image_classification_early_stopping.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_**Using automatic training job early stopping to speed up the tuning of an end-to-end multiclass image classification task**_\n", "\n", "---\n", "## Important notes:\n", "\n", "* Two hyperparameter tuning jobs will be created in this sample notebook. With current setting, each tuning job takes around an hour to complete. \n", "* Due to cost consideration, the goal of this example is to show you how to use the new feature, not necessarily to achieve the best result.\n", "* The built-in image classification algorithm on GPU instance will be used in this example.\n", "* Different runs of this notebook may lead to different results, due to the non-deterministic nature of Automatic Model Tuning. But it is fair to assume some training jobs will be stopped by automatic early stopping.\n", "\n", "---\n", "## Contents\n", "1. [Background](#Background)\n", "1. [Set_up](#Set-up)\n", "1. [Data_preparation](#Data-preparation)\n", "1. [Set_up_hyperparameter_tuning_job](#Set-up-hyperparameter-tuning-job)\n", "1. [Launch_hyperparameter_tuning_job](#Launch-hyperparameter-tuning-job)\n", "1. [Launch_hyperparameter_tuning_job_with_automatic_early_stopping](#Launch-hyperparameter-tuning-job-with-automatic-early-stopping)\n", "1. [Wrap_up](#Wrap-up)\n", "\n", "\n", "---\n", "\n", "## Background\n", "\n", "Selecting the right hyperparameter values for machine learning model can be difficult. The right answer dependes on the algorithm and the data; Some algorithms have many tuneable hyperparameters; Some are very sensitive to the hyperparameter values selected; and yet most have a non-linear relationship between model fit and hyperparameter values. Amazon SageMaker Automatic Model Tuning helps by automating the hyperparameter tuning process.\n", "\n", "Experienced data scientist often stop a training when it is not promising based on the first few validation metrics emitted during the training. This notebook will demonstrate how to use the automatic training job early stopping of Amazon SageMaker Automatic Model Tuning to speed up the tuning process with a simple switch.\n", "\n", "---\n", "\n", "## Set up\n", "\n", "Let us start by specifying:\n", "\n", "- The role that is used to give learning and hosting the access to the data. This will automatically be obtained from the role used to start the notebook.\n", "- The S3 bucket that will be used for loading training data and saving model data.\n", "- The Amazon SageMaker image classification docker image which need not to be changed." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "import boto3\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.image_uris import retrieve\n", "\n", "sess = sagemaker.Session()\n", "role = get_execution_role()\n", "\n", "bucket = sess.default_bucket()\n", "\n", "training_image = retrieve(\"image-classification\", boto3.Session().region_name, \"1\")\n", "print(training_image)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data preparation\n", "\n", "In this example, [Caltech-256 dataset](https://paperswithcode.com/dataset/caltech-256) dataset will be used, which contains 30608 images of 256 objects." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import urllib.request\n", "import boto3\n", "\n", "\n", "def download(url):\n", " filename = url.split(\"/\")[-1]\n", " if not os.path.exists(filename):\n", " urllib.request.urlretrieve(url, filename)\n", "\n", "\n", "def upload_to_s3(channel, file):\n", " s3 = boto3.resource(\"s3\")\n", " data = open(file, \"rb\")\n", " key = channel + \"/\" + file\n", " s3.Bucket(bucket).put_object(Key=key, Body=data)\n", "\n", "\n", "s3_train_key = \"image-classification-full-training/train\"\n", "s3_validation_key = \"image-classification-full-training/validation\"\n", "\n", "download(\"http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec\")\n", "upload_to_s3(s3_train_key, \"caltech-256-60-train.rec\")\n", "download(\"http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec\")\n", "upload_to_s3(s3_validation_key, \"caltech-256-60-val.rec\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up hyperparameter tuning job\n", "\n", "For this example, three hyperparameters will be tuned: learning_rate, mini_batch_size and optimizer, which has the greatest impact on the objective metric. See [here](https://docs.aws.amazon.com/sagemaker/latest/dg/IC-tuning.html) for more detail and the full list of hyperparameters that can be tuned.\n", "\n", "Before launching the tuning job, training jobs that the hyperparameter tuning job will launch need to be configured by defining an estimator that specifies the following information:\n", "\n", "* The container image for the algorithm (image-classification).\n", "* The s3 location for training and validation data.\n", "* The type and number of instances to use for the training jobs.\n", "* The output specification where the output can be stored after training.\n", "\n", "The values of any hyperparameters that are not tuned in the tuning job (StaticHyperparameters):\n", " * **num_layers**: The number of layers (depth) for the network. We use 18 in this samples but other values such as 50, 152 can be used.\n", " * **image_shape**: The input image dimensions,'num_channels, height, width', for the network. It should be no larger than the actual image size. The number of channels should be same as in the actual image.\n", " * **num_classes**: This is the number of output classes for the new dataset. For caltech, we use 257 because it has 256 object categories + 1 clutter class.\n", " * **num_training_samples**: This is the total number of training samples. It is set to 15240 for caltech dataset with the current split.\n", " * **epochs**: Number of training epochs. In this example we set it to only 10 to save the cost. If you would like to get higher accuracy the number of epochs can be increased.\n", " * **top_k**: Report the top-k accuracy during training.\n", " * **precision_dtype**: Training data type precision (default: float32). If set to 'float16', the training will be done in mixed_precision mode and will be faster than float32 mode.\n", " * **augmentation_type**: 'crop'. Randomly crop the image and flip the image horizontally." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_train_data = \"s3://{}/{}/\".format(bucket, s3_train_key)\n", "s3_validation_data = \"s3://{}/{}/\".format(bucket, s3_validation_key)\n", "\n", "s3_output_key = \"image-classification-full-training/output\"\n", "s3_output = \"s3://{}/{}/\".format(bucket, s3_output_key)\n", "\n", "s3_input_train = sagemaker.TrainingInput(\n", " s3_data=s3_train_data, content_type=\"application/x-recordio\"\n", ")\n", "s3_input_validation = sagemaker.TrainingInput(\n", " s3_data=s3_validation_data, content_type=\"application/x-recordio\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sess = sagemaker.Session()\n", "imageclassification = sagemaker.estimator.Estimator(\n", " training_image,\n", " role,\n", " instance_count=1,\n", " instance_type=\"ml.p3.2xlarge\",\n", " output_path=s3_output,\n", " sagemaker_session=sess,\n", ")\n", "\n", "imageclassification.set_hyperparameters(\n", " num_layers=18,\n", " image_shape=\"3,224,224\",\n", " num_classes=257,\n", " epochs=10,\n", " top_k=\"2\",\n", " num_training_samples=15420,\n", " precision_dtype=\"float32\",\n", " augmentation_type=\"crop\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, the tuning job with the following configurations need to be specified:\n", "* the hyperparameters that SageMaker Automatic Model Tuning will tune: learning_rate, mini_batch_size and optimizer\n", "* the maximum number of training jobs it will run to optimize the objective metric: 10\n", "* the number of parallel training jobs that will run in the tuning job: 2\n", "* the objective metric that Automatic Model Tuning will use: validation:accuracy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import gmtime, strftime\n", "from sagemaker.tuner import (\n", " IntegerParameter,\n", " CategoricalParameter,\n", " ContinuousParameter,\n", " HyperparameterTuner,\n", ")\n", "\n", "tuning_job_name = \"imageclassif-job-{}\".format(strftime(\"%d-%H-%M-%S\", gmtime()))\n", "\n", "hyperparameter_ranges = {\n", " \"learning_rate\": ContinuousParameter(0.00001, 1.0),\n", " \"mini_batch_size\": IntegerParameter(16, 64),\n", " \"optimizer\": CategoricalParameter([\"sgd\", \"adam\", \"rmsprop\", \"nag\"]),\n", "}\n", "\n", "objective_metric_name = \"validation:accuracy\"\n", "\n", "tuner = HyperparameterTuner(\n", " imageclassification,\n", " objective_metric_name,\n", " hyperparameter_ranges,\n", " objective_type=\"Maximize\",\n", " max_jobs=10,\n", " max_parallel_jobs=2,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Launch hyperparameter tuning job\n", "Now we can launch a hyperparameter tuning job by calling fit in tuner. We will wait until the tuning finished, which may take around an hour." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner.fit(\n", " {\"train\": s3_input_train, \"validation\": s3_input_validation},\n", " job_name=tuning_job_name,\n", " include_cls_metadata=False,\n", ")\n", "tuner.wait()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the tuning finished, the top 5 performing hyperparameters can be listed below. One can analyse the results deeper by using [HPO_Analyze_TuningJob_Results.ipynb notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/analyze_results/HPO_Analyze_TuningJob_Results.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner_metrics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)\n", "tuner_metrics.dataframe().sort_values([\"FinalObjectiveValue\"], ascending=False).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The total training time and training jobs status can be checked with the following script. Because automatic early stopping is by default off, all the training jobs should be completed normally." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "total_time = tuner_metrics.dataframe()[\"TrainingElapsedTimeSeconds\"].sum() / 3600\n", "print(\"The total training time is {:.2f} hours\".format(total_time))\n", "tuner_metrics.dataframe()[\"TrainingJobStatus\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Launch hyperparameter tuning job with automatic early stopping\n", "Now we lunch the same tuning job with only one difference: setting **early_stopping_type**=**'Auto'** to enable automatic training job early stopping." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuning_job_name_es = \"imageclassif-job-{}-es\".format(strftime(\"%d-%H-%M-%S\", gmtime()))\n", "\n", "tuner_es = HyperparameterTuner(\n", " imageclassification,\n", " objective_metric_name,\n", " hyperparameter_ranges,\n", " objective_type=\"Maximize\",\n", " max_jobs=10,\n", " max_parallel_jobs=2,\n", " early_stopping_type=\"Auto\",\n", ")\n", "\n", "tuner_es.fit(\n", " {\"train\": s3_input_train, \"validation\": s3_input_validation},\n", " job_name=tuning_job_name_es,\n", " include_cls_metadata=False,\n", ")\n", "tuner_es.wait()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the tuning job finished, we again list the top 5 performing training jobs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner_metrics_es = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name_es)\n", "tuner_metrics_es.dataframe().sort_values([\"FinalObjectiveValue\"], ascending=False).head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The total training time and training jobs status can be checked with the following script." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = tuner_metrics_es.dataframe()\n", "total_time_es = df[\"TrainingElapsedTimeSeconds\"].sum() / 3600\n", "print(\"The total training time with early stopping is {:.2f} hours\".format(total_time_es))\n", "df[\"TrainingJobStatus\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The stopped training jobs can be listed using the following scripts. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df.TrainingJobStatus == \"Stopped\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrap up\n", "In this notebook, we demonstrated how to use automatic early stopping to speed up model tuning. One thing to keep in\n", "mind is as the training time for each training job gets longer, the benefit of training job early stopping becomes more significant. On the other hand, smaller training jobs won’t benefit as much due to infrastructure overhead. For example, our experiments show that the effect of training job early stopping typically becomes noticeable when the training jobs last longer than **4 minutes**. To enable automatic early stopping, one can simply set **early_stopping_type** to **'Auto'**.\n", "\n", "For more information on using SageMaker's Automatic Model Tuning, see our other [example notebooks](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/hyperparameter_tuning) and [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/hyperparameter_tuning|image_classification_early_stopping|hpo_image_classification_early_stopping.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/hyperparameter_tuning|image_classification_early_stopping|hpo_image_classification_early_stopping.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/hyperparameter_tuning|image_classification_early_stopping|hpo_image_classification_early_stopping.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/hyperparameter_tuning|image_classification_early_stopping|hpo_image_classification_early_stopping.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/hyperparameter_tuning|image_classification_early_stopping|hpo_image_classification_early_stopping.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/hyperparameter_tuning|image_classification_early_stopping|hpo_image_classification_early_stopping.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/hyperparameter_tuning|image_classification_early_stopping|hpo_image_classification_early_stopping.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/hyperparameter_tuning|image_classification_early_stopping|hpo_image_classification_early_stopping.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/hyperparameter_tuning|image_classification_early_stopping|hpo_image_classification_early_stopping.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/hyperparameter_tuning|image_classification_early_stopping|hpo_image_classification_early_stopping.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/hyperparameter_tuning|image_classification_early_stopping|hpo_image_classification_early_stopping.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/hyperparameter_tuning|image_classification_early_stopping|hpo_image_classification_early_stopping.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/hyperparameter_tuning|image_classification_early_stopping|hpo_image_classification_early_stopping.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/hyperparameter_tuning|image_classification_early_stopping|hpo_image_classification_early_stopping.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/hyperparameter_tuning|image_classification_early_stopping|hpo_image_classification_early_stopping.ipynb)\n" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3 (MXNet 1.9 Python 3.8 CPU Optimized)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/mxnet-1.9-cpu-py38-ubuntu20.04-sagemaker-v1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 2 }