# Compile and Tune a Vision Transformer Model using HyperParameter Tuner on a Single Node 


1. [SageMaker Training Compiler Overview](#SageMaker-Training-Compiler-Overview)
    1. [Introduction](#Introduction)
2. [Development Environment](#Development-Environment)
    1. [Installation](#Installation)
    2. [SageMaker environment](#SageMaker-environment)
3. [Working with the Caltech-256 dataset](#Working-with-the-Caltech-256-dataset)
4. [How effective is SageMaker Training Compiler?](#How-effective-is-SageMaker-Training-Compiler?)
    1. [SageMaker Training Job](#SageMaker-Training-Job)
    2. [Training Setup](#Training-Setup)
    3. [Experimenting with Native TensorFlow](#Experimenting-with-Native-TensorFlow)
    4. [SageMaker Hyperparameter Tuning Job](#SageMaker-Hyperparameter-Tuning-Job)
    5. [Experimenting with Optimized TensorFlow](#Experimenting-with-Optimized-TensorFlow)
    6. [Wait for tuning jobs to complete](#Wait-for-tuning-jobs-to-complete)
    7. [Fastest Training Job](#Fastest-Training-Job)
5. [Continue tuning with SageMaker Training Compiler](#Continue-tuning-with-SageMaker-Training-Compiler)
    1. [Wait for tuning jobs to complete](#Wait-for-tuning-jobs-to-complete)
    2. [Fastest Convergence](#Fastest-Convergence)
6. [Conclusion](#Conclusion)
7. [Clean up](#Clean-up)


## SageMaker Training Compiler Overview

SageMaker Training Compiler is a capability of SageMaker that makes hard-to-implement optimizations to reduce training time on GPU instances. The compiler optimizes DL models to accelerate training by more efficiently using SageMaker machine learning (ML) GPU instances. SageMaker Training Compiler is available at no additional charge within SageMaker and can help reduce total billable time as it accelerates training. 

SageMaker Training Compiler is integrated into the AWS Deep Learning Containers (DLCs). Using the SageMaker Training Compiler enabled AWS DLCs, you can compile and optimize training jobs on GPU instances with minimal changes to your code. Bring your deep learning models to SageMaker and enable SageMaker Training Compiler to accelerate the speed of your training job on SageMaker ML instances for accelerated computing. 

For more information, see [SageMaker Training Compiler](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html) in the *Amazon SageMaker Developer Guide*.

### Introduction

In this demo, you'll use SageMaker Training Compiler and SageMaker Hyperparameter Tuner to speed up training the `Vision Transformer` model on the `Caltech-256` dataset. To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on. 

**NOTE:** You can run this demo in SageMaker Studio, SageMaker notebook instances, or your local machine with AWS CLI set up. If using SageMaker Studio or SageMaker notebook instances, make sure you choose one of the TensorFlow-based kernels, `Python 3 (TensorFlow x.y Python 3.x CPU Optimized)` or `conda_tensorflow_p39` respectively.

**NOTE:** This notebook uses 20 `ml.p3.2xlarge` instances, each with a single GPU. However, it can easily be extended to multiple GPUs on a single node. If you don't have enough quota, see [Request a service quota increase for SageMaker resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure). 

## Development Environment 


### Installation

This example notebook requires **SageMaker Python SDK v2.115.0 or later**.


In [None]:
!pip install pip --upgrade
!pip install sagemaker botocore boto3 awscli --upgrade

In [None]:
import botocore
import boto3
import sagemaker

print(f"botocore: {botocore.__version__}")
print(f"boto3: {boto3.__version__}")
print(f"sagemaker: {sagemaker.__version__}")
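
The version requirement stated above can also be checked programmatically. The following is a minimal sketch that uses only the standard library; the helper names `version_tuple` and `meets_minimum` are hypothetical and not part of the SageMaker SDK, and the parsing deliberately ignores pre-release suffixes.

```python
import re


def version_tuple(version, width=3):
    """Turn a string such as '2.115.0' into (2, 115, 0).

    Parsing stops at the first component that does not start with a digit,
    and the result is padded with zeros to `width` components.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < width:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum="2.115.0"):
    """Return True if `installed` is at least `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


print(meets_minimum("2.114.9"))
print(meets_minimum("2.115.0"))
```

In the notebook you could pass `sagemaker.__version__` as the `installed` argument.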

### SageMaker environment 

In [None]:
import sagemaker

sess = sagemaker.Session()

# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sagemaker_session_bucket}")
print(f"sagemaker session region: {sess.boto_region_name}")

## Working with the Caltech-256 dataset

We have hosted the [Caltech-256](https://authors.library.caltech.edu/7694/) dataset in S3 in us-west-2. We will transfer this dataset to your account and region for use with SageMaker Training.

The dataset consists of JPEG images organized into directories with each directory representing an object category.

In [None]:
from sagemaker.s3 import S3Downloader, S3Uploader

# -p avoids an error if the directory already exists (for example, on a re-run)
! mkdir -p caltech-256

source = f"s3://sagemaker-example-files-prod-{sess.boto_region_name}/datasets/image/caltech-256/256_ObjectCategories"
destn = f"s3://{sagemaker_session_bucket}/caltech-256"
local = "caltech-256"

# Download the dataset locally, then upload it to your session bucket
S3Downloader.download(source, local, sagemaker_session=sess)
S3Uploader.upload(local, destn, sagemaker_session=sess)

## How effective is SageMaker Training Compiler?

The effectiveness of SageMaker Training Compiler depends on the model architecture, model size, input shape, and the training loop. Please refer to our [Best Practices](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-tips-pitfalls.html) documentation to understand how to get the most out of your training job using SageMaker Training Compiler. In this section, we will compare and contrast a training job with and without SageMaker Training Compiler.


### SageMaker Training Job

To create a SageMaker training job, we use a `TensorFlow` estimator. Using the estimator, you can define which training script SageMaker should use through `entry_point`, which `instance_type` to use for training, which `hyperparameters` to pass, and so on.

When a SageMaker training job starts, SageMaker takes care of starting and managing all the required machine learning instances, picks up the `TensorFlow` Deep Learning Container, uploads your training script, and downloads the data from `sagemaker_session_bucket` into the container at `/opt/ml/input/data`.
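
To make the hand-off described above concrete, here is a minimal sketch of how a training script might read its inputs. The actual `scripts/vit.py` is not shown in this notebook, so the function below is an assumption about its shape, not its real contents. It relies on two conventions of SageMaker script mode: hyperparameters arrive as command-line flags named after the keys of the `hyperparameters` dictionary, and a single S3 input passed to `fit` is exposed as the `training` channel through the `SM_CHANNEL_TRAINING` environment variable.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and the data location for a training script (illustrative only)."""
    parser = argparse.ArgumentParser()
    # Flag names mirror the hyperparameter keys used in this notebook;
    # the default values are placeholders for local testing.
    parser.add_argument("--EPOCHS", type=int, default=3)
    parser.add_argument("--BATCH_SIZE", type=int, default=32)
    parser.add_argument("--LEARNING_RATE", type=float, default=1e-3)
    parser.add_argument("--WEIGHT_DECAY", type=float, default=1e-4)
    # Inside the container the channel directory lives under /opt/ml/input/data
    parser.add_argument(
        "--data_dir",
        type=str,
        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
    )
    # Ignore any extra flags that the platform may append
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--EPOCHS", "3", "--BATCH_SIZE", "48"])
print(args.EPOCHS, args.BATCH_SIZE, args.LEARNING_RATE)
```

Passing an explicit argument list, as in the last lines, lets you exercise the parser locally before launching a job.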

In the following section, you learn how to set up two versions of the SageMaker `TensorFlow` estimator, a native one without the compiler and an optimized one with the compiler.

### Training Setup

In this section, we set our hyperparameters to a naive first guess. Notice the low value for `EPOCHS` - this is because we are just experimenting with our hyperparameters to find the best setting that will lead to the fastest training. The effectiveness of SageMaker Training Compiler is often apparent within the first few steps. In the example below we will inspect the speed of the training job after every epoch.

In [None]:
EPOCHS = 3
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 1e-4

### Experimenting with Native TensorFlow

We attempt to find the largest `BATCH_SIZE` that fits into the GPU memory of an `ml.p3.2xlarge` instance. A larger batch size usually makes better use of the GPU and therefore tends to give the lowest latency per epoch.

In [None]:
from sagemaker.tensorflow import TensorFlow

estimator_args = dict(
    source_dir="scripts",
    entry_point="vit.py",
    model_dir=False,
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    framework_version="2.11",
    py_version="py39",
    debugger_hook_config=False,  # False disables the Debugger hook; None would fall back to the default configuration
    disable_profiler=True,
    max_run=60 * 20,  # 20 minutes
    role=role,
)

# Configure the training job
native_estimator = TensorFlow(
    hyperparameters={
        "EPOCHS": EPOCHS,
        "LEARNING_RATE": LEARNING_RATE,
        "WEIGHT_DECAY": WEIGHT_DECAY,
    },
    **estimator_args,
)

### SageMaker Hyperparameter Tuning Job

We use the `sagemaker.tuner.HyperparameterTuner` object to define a Hyperparameter Tuning Job. It imports the training job configuration specified in the `estimator`. We additionally specify some `metric_definitions` to extract training metrics from the training logs. From these `metric_definitions` we select a single metric as the `objective_metric_name` and configure the tuning job to `Minimize` or `Maximize` it. We further provide a constrained search space through the `hyperparameter_ranges` argument.

We can limit the number of training jobs spawned concurrently with the `max_parallel_jobs` argument and limit the total number of training jobs spawned with the `max_jobs` argument.

For more information regarding SageMaker Hyperparameter Tuner, refer to [Perform Automatic Model Tuning with SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html).

In the example below, we are trying to find the best batch size between 32 and 80 that will result in the smallest possible epoch latency, by launching 40 training jobs, 10 at a time. The range for batch sizes is our best guess. You can always reuse and restart a tuning job with an extended range, as explained in [Run a Warm Start Hyperparameter Tuning Job](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-warm-start.html).
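
Before launching a tuning job, it can help to check the metric regexes against a sample log line, because a regex that never matches leaves the objective metric empty. The sketch below applies the regexes used in this notebook to a hypothetical log line written in the style of Keras `verbose=2` output. The sample line and the helper `extract_metrics` are assumptions for illustration; the real training logs may be formatted differently.

```python
import re

# Name/regex pairs copied from the metric definitions used in this notebook
METRIC_REGEXES = [
    ("training_loss", r"loss: ([0-9.]*?) "),
    ("training_accuracy", r"accuracy: ([0-9.]*?) "),
    ("training_latency_per_epoch", r"- ([0-9.]*?)s/epoch"),
    ("training_avg_latency_per_step", r"- ([0-9.]*?)ms/step"),
    ("training_avg_latency_per_step", r"- ([0-9.]*?)s/step"),
]

# Hypothetical log line; not taken from an actual run
SAMPLE_LOG_LINE = "782/782 - 95s - loss: 5.1234 - accuracy: 0.0321 - 95s/epoch - 121ms/step"


def extract_metrics(line):
    """Return {metric name: value} for every regex that captures a number."""
    metrics = {}
    for name, pattern in METRIC_REGEXES:
        match = re.search(pattern, line)
        # Skip regexes that do not match or that capture an empty string
        if match and match.group(1):
            metrics[name] = float(match.group(1))
    return metrics


print(extract_metrics(SAMPLE_LOG_LINE))
```

Note that the last two definitions share one metric name: the first matches step latencies reported in milliseconds and the second matches those reported in seconds, so the recorded values can be in different units depending on which form appears in the log.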


In [None]:
from sagemaker.tuner import HyperparameterTuner, IntegerParameter

tuner_args = dict(
    objective_metric_name="training_latency_per_epoch",
    objective_type="Minimize",
    metric_definitions=[
        {"Name": "training_loss", "Regex": "loss: ([0-9.]*?) "},
        {"Name": "training_accuracy", "Regex": "accuracy: ([0-9.]*?) "},
        {"Name": "training_latency_per_epoch", "Regex": "- ([0-9.]*?)s/epoch"},
        {"Name": "training_avg_latency_per_step", "Regex": "- ([0-9.]*?)ms/step"},
        {"Name": "training_avg_latency_per_step", "Regex": "- ([0-9.]*?)s/step"},
    ],
    max_jobs=40,
    max_parallel_jobs=10,
    early_stopping_type="Auto",
)


# Define a Hyperparameter Tuning Job
native_tuner = HyperparameterTuner(
    estimator=native_estimator,
    hyperparameter_ranges={
        "BATCH_SIZE": IntegerParameter(32, 80, "Linear"),
    },
    base_tuning_job_name="native-tf211-vit",
    **tuner_args,
)

# Start the tuning job with the specified input data
native_tuner.fit(inputs=destn, wait=False)

# Save the name of the tuning job
native_tuner.latest_tuning_job.name

**Tip**: You can reduce the cost of tuning by restricting the batch size to a multiple of 8. Refer to NVIDIA's article on the [significance of the number 8](https://developer.nvidia.com/blog/optimizing-gpu-performance-tensor-cores/) when training with Automatic Mixed Precision.

```python
from sagemaker.tuner import CategoricalParameter

hyperparameter_ranges = {
    "BATCH_SIZE": CategoricalParameter(list(range(32, 80, 8))),
}
```

This can restrict the search space to just 6 training jobs as opposed to 40!
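
As a quick check of the count above, the following sketch lists the candidate batch sizes produced by `range(32, 80, 8)`. Because the upper bound of `range` is exclusive, 80 itself is not among the candidates; the `inclusive_candidates` variant is a hypothetical alternative for when you want to include it.

```python
# Candidate batch sizes from the tip above (upper bound is exclusive)
candidates = list(range(32, 80, 8))
print(candidates)  # [32, 40, 48, 56, 64, 72]
print(len(candidates))  # 6

# Hypothetical variant that also includes the upper bound of 80
inclusive_candidates = list(range(32, 80 + 1, 8))
print(len(inclusive_candidates))  # 7
```

With a categorical search space like this, setting `max_jobs` to the number of candidates is one way to keep the job count in line with the reduced search space.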

### Experimenting with Optimized TensorFlow

Compilation through SageMaker Training Compiler changes the memory footprint of the model. Most commonly, this manifests as a reduction in memory utilization and a consequent increase in the largest batch size that can fit on the GPU. But in some cases, the compiler intelligently promotes caching which leads to increased memory utilization and a consequent decrease in the largest batch size that can fit on the GPU. In the example below we will find the new batch size with SageMaker Training Compiler enabled and the resultant latency per epoch.

**Note:** We recommend that you turn off SageMaker Debugger's profiling and debugging tools when you use compilation, to avoid additional overhead.

In [None]:
from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

# Configure the training job
optimized_estimator = TensorFlow(
    hyperparameters={
        "EPOCHS": EPOCHS,
        "LEARNING_RATE": LEARNING_RATE,
        "WEIGHT_DECAY": WEIGHT_DECAY,
    },
    compiler_config=TrainingCompilerConfig(),
    **estimator_args,
)

In [None]:
from sagemaker.tuner import HyperparameterTuner, IntegerParameter

# Define the tuning job
optimized_tuner = HyperparameterTuner(
    estimator=optimized_estimator,
    hyperparameter_ranges={"BATCH_SIZE": IntegerParameter(20, 60, "Linear")},
    base_tuning_job_name="optimized-tf211-vit",
    **tuner_args,
)

# Start the tuning job with the specified input data
optimized_tuner.fit(inputs=destn, wait=False)

# Save the name of the tuning job
optimized_tuner.latest_tuning_job.name




### Wait for tuning jobs to complete

The tuning jobs described above typically take around 50 minutes to complete.

**Note:** If the tuner object is no longer available due to a kernel break or refresh, you need to use the tuning job name directly and manually attach the tuning job to a new tuner. For example:

```python
native_tuner = HyperparameterTuner.attach("<your_tuning_job_name>")
```

In [None]:
native_tuner.wait()
optimized_tuner.wait()

In [None]:
from sagemaker.tuner import HyperparameterTuner

native_tuner = HyperparameterTuner.attach(native_tuner.latest_tuning_job.name)
optimized_tuner = HyperparameterTuner.attach(optimized_tuner.latest_tuning_job.name)

native_tuner.wait()
optimized_tuner.wait()

### Fastest Training Job

Let us collate and analyze the results from the tuning jobs. The tuner provides the results as a Pandas dataframe.
We sort the results from each tuner by epoch latency, then combine and display the top 3 results from each.

In [None]:
import pandas as pd

pd.set_option("display.max_rows", None, "display.max_columns", None)

# Collect the results from the tuners
native_results = native_tuner.analytics().dataframe()
optimized_results = optimized_tuner.analytics().dataframe()

# Sort results according to Epoch Latency
native_results = native_results.sort_values(
    ["FinalObjectiveValue", "BATCH_SIZE"], ascending=[True, False]
)
optimized_results = optimized_results.sort_values(
    ["FinalObjectiveValue", "BATCH_SIZE"], ascending=[True, False]
)

# Combine the top N results for viewing
N = 3
results = pd.concat([native_results.head(N), optimized_results.head(N)])
results

In [None]:
# Calculate the potential percentage savings from SageMaker Training Compiler
# by comparing the best (lowest epoch latency) job from each tuner
difference = (
    native_results.iloc[0]["FinalObjectiveValue"] - optimized_results.iloc[0]["FinalObjectiveValue"]
)
percentage = difference * 100 / native_results.iloc[0]["FinalObjectiveValue"]

f"With the SageMaker Training Compiler the epoch latency is {percentage:.1f}% lower, " f"meaning the training job could be up to {percentage:.1f}% faster!"

## Continue tuning with SageMaker Training Compiler

Now that we have the fastest batch size and compiler configuration, we need to tune the associated hyperparameters to get the fastest convergence.

**Remember** Total_Training_Time ~= Latency_per_epoch * Number_of_epochs

First, we tuned to reduce the Latency_per_epoch. Now we will tune to reduce the number of epochs required for convergence. Since hyperparameters that directly affect convergence (such as learning rate, weight decay, and learning schedule) depend on the batch size, we decouple the two steps as described.

We now train for a higher number of epochs since we are testing the speed of convergence. Ideally, you should tune learning rate and weight decay to minimize validation loss, but for the sake of example let's minimize the training loss. 
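
The relationship between total training time, latency per epoch, and number of epochs can be illustrated with a small calculation. The numbers below are hypothetical and chosen only to show how the two tuning steps compound; they are not results from this notebook.

```python
def estimated_training_time(latency_per_epoch_s, num_epochs):
    """Approximate total training time in seconds: latency per epoch times number of epochs."""
    return latency_per_epoch_s * num_epochs


# Hypothetical baseline: 100 s per epoch, 10 epochs to converge
baseline = estimated_training_time(100.0, 10)

# Step 1 (batch size and compiler): epoch latency drops to 80 s
after_latency_tuning = estimated_training_time(80.0, 10)

# Step 2 (learning rate and weight decay): convergence in 8 epochs instead of 10
after_convergence_tuning = estimated_training_time(80.0, 8)

print(baseline, after_latency_tuning, after_convergence_tuning)  # 1000.0 800.0 640.0
```

Because the two factors multiply, a gain in either step carries over to the total, which is why the steps can be tuned one after the other.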

In [None]:
EPOCHS = 10
# Batch size selected from the previous tuning step (adjust to match your own results)
BATCH_SIZE = 56

estimator_args["max_run"] = 60 * 60  # 60 minutes

tuner_args["objective_metric_name"] = "training_loss"

In [None]:
from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

# Configure the training job
convergence_estimator = TensorFlow(
    hyperparameters={
        "EPOCHS": EPOCHS,
        "BATCH_SIZE": BATCH_SIZE,
    },
    compiler_config=TrainingCompilerConfig(),
    **estimator_args,
)

In [None]:
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# Define the tuning job
convergence_tuner = HyperparameterTuner(
    estimator=convergence_estimator,
    hyperparameter_ranges={
        "LEARNING_RATE": ContinuousParameter(1e-6, 1e-3, "Logarithmic"),
        "WEIGHT_DECAY": ContinuousParameter(1e-6, 1e-3, "Logarithmic"),
    },
    base_tuning_job_name="converge-tf211-vit",
    **tuner_args,
)

# Start the tuning job with the specified input data
convergence_tuner.fit(inputs=destn, wait=False)

# Save the name of the tuning job
convergence_tuner.latest_tuning_job.name

### Wait for tuning jobs to complete

The tuning job described above typically takes around 2 hours to complete.

**Note:** If the tuner object is no longer available due to a kernel break or refresh, you need to use the tuning job name directly and manually attach the tuning job to a new tuner. For example:

```python
tuner = HyperparameterTuner.attach("<your_tuning_job_name>")
```

In [None]:
convergence_tuner.wait()

In [None]:
from sagemaker.tuner import HyperparameterTuner

convergence_tuner = HyperparameterTuner.attach(convergence_tuner.latest_tuning_job.name)

convergence_tuner.wait()

### Fastest Convergence

Let us analyze the results from the tuning jobs. The tuner provides the results as a Pandas dataframe.
We sort by training loss and display the top 5 results.

In [None]:
import pandas as pd

pd.set_option("display.max_rows", None, "display.max_columns", None)

# Gather results from the tuner
results = convergence_tuner.analytics().dataframe()

# Sort according to Training Loss
results = results.sort_values(["FinalObjectiveValue"], ascending=[True])

# Display the top 5 results
results.head(5)

Having obtained the best configuration for your training job, you can now train to completion. Please consider [check-pointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html) in order to resume training from the best performing job indicated by the tuner.
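
To resume from a checkpoint, the training script needs to locate the most recent one at startup. In a SageMaker training job, files written to the local checkpoint directory (by default `/opt/ml/checkpoints`) are synced with the `checkpoint_s3_uri` configured on the estimator. The sketch below shows one way a script might pick the latest checkpoint from such a directory. The file naming scheme `ckpt-epoch-<N>.h5` and the helper `latest_checkpoint` are assumptions for illustration and are not taken from `vit.py`.

```python
import os
import re
import tempfile

# Hypothetical file naming scheme, for example "ckpt-epoch-7.h5"
CHECKPOINT_PATTERN = re.compile(r"^ckpt-epoch-(\d+)\.h5$")


def latest_checkpoint(checkpoint_dir):
    """Return (epoch, path) of the checkpoint with the highest epoch, or None if there is none."""
    if not os.path.isdir(checkpoint_dir):
        return None
    found = []
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match:
            found.append((int(match.group(1)), os.path.join(checkpoint_dir, name)))
    # Tuples compare by epoch first, so max() picks the latest epoch
    return max(found) if found else None


# Small demonstration with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as demo_dir:
    for epoch in (1, 2, 10):
        open(os.path.join(demo_dir, f"ckpt-epoch-{epoch}.h5"), "w").close()
    demo_result = latest_checkpoint(demo_dir)

print(demo_result[0])  # 10
```

Epochs are compared as integers rather than strings, so `ckpt-epoch-10.h5` is correctly treated as newer than `ckpt-epoch-2.h5`.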

## Conclusion

In conclusion, we first arrived at the batch size and compiler configuration that lead to the highest training throughput. Then, we tuned the associated hyperparameters to arrive at the configuration that leads to the fastest convergence. Combining the two gives the largest reduction in total training time, and therefore in cost.


## Clean up

Stop all tuning jobs launched if the jobs are still running.

In [None]:
native_tuner.stop_tuning_job()
optimized_tuner.stop_tuning_job()
convergence_tuner.stop_tuning_job()

Also, to find instructions on cleaning up resources, see [Clean Up](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html) in the *Amazon SageMaker Developer Guide*.

