# Computer Vision (CV) on SageMaker - TensorFlow

1. [Introduction](#Introduction)
2. [Prerequisites](#Prerequisites)
3. [Setup](#Setup)
4. [Dataset](#Dataset)
5. [Training a CV model](#Training-a-CV-model)
    1. [TFRecord Data Ingestion](#TFRecord-Data-Ingestion)
    2. [Create Experiment](#Create-Experiment)
    3. [Configure Training](#Configure-Training)
    4. [Analyzing Training Job](#Analyzing-Training-Job)
6. [Hyperparameter tuning Job](#Automatic-Model-Tuning)
    1. [Configure HPO Job](#Configure-HPO-Job)
    2. [Associate HPO to Experiment](#Associate-HPO-to-Experiment)
7. [Clean Up](#Clean-up)

## Introduction
This lab focuses on SageMaker Training for CV. We show examples of high-throughput Pipe Mode data ingestion, hyperparameter optimization (HPO), and experiment tracking. Future labs will show how experiment tracking can be automated through the native integration in SageMaker Pipelines. The model used in this notebook is a simple deep CNN based on the [Keras examples](https://www.tensorflow.org/tutorials/images/cnn).

**Note: This notebook was tested on the Data Science kernel for SageMaker Studio.**

## Prerequisites

To run this notebook, you can simply execute each cell in order. To understand what's happening, you'll need:

- Access to the SageMaker default S3 bucket. All files related to this lab are stored under the "cv_keras_cifar10" prefix of the bucket.
- Familiarity with Python and NumPy.
- Basic familiarity with Amazon S3.
- Basic understanding of Amazon SageMaker.
- Basic familiarity with the AWS Command Line Interface (CLI). Ideally, you should have it set up with credentials to access the AWS account you're running this notebook from.
- SageMaker Studio is preferred for the full UI integration.

## Setup

Set up the environment, load the libraries, and define the parameters used throughout the notebook.

Run the cell below if `smexperiments` or TensorFlow is missing from your kernel.

In [None]:
!pip install sagemaker-experiments
!pip install tensorflow

In [None]:
import os
import time
import pytz
import boto3
import sagemaker
import tensorflow as tf
from tensorflow import keras
from sagemaker.tensorflow import TensorFlow
from sagemaker import get_execution_role

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

sagemaker_session = sagemaker.Session()
sess = boto3.Session()
sm = sess.client("sagemaker")

role = get_execution_role()

bucket = sagemaker_session.default_bucket()
prefix = "cv_keras_cifar10"

print("Bucket: {}".format(bucket))
print("SageMaker ver: " + sagemaker.__version__)

## Dataset
The [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) is one of the most popular machine learning datasets. It consists of 60,000 32x32 images belonging to 10 different classes (6,000 images per class). Here are the classes in the dataset, as well as 10 random images from each:

![cifar10](../statics/CIFAR-10.png)

In this tutorial, we will train a deep CNN to recognize these images.

Downloading the test and training data takes around 5 minutes.

In [None]:
from tensorflow.keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print("x_train shape:", x_train.shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")

## Training a CV model

## TFRecord Data Ingestion

Convert the dataset into [TFRecord files](https://www.tensorflow.org/tutorials/load_data/tfrecord).
This allows us to use SageMaker [Pipe Mode](https://aws.amazon.com/blogs/machine-learning/accelerate-model-training-using-faster-pipe-mode-on-amazon-sagemaker/).

To do this with your own dataset, here are examples of how to generate TFRecords from image datasets:
* Image Classification: https://www.tensorflow.org/tutorials/load_data/tfrecord#walkthrough_reading_and_writing_image_data
* Object Detection: https://keras.io/examples/keras_recipes/creating_tfrecords/


In [None]:
import os

# Create the output directories if they do not exist yet
os.makedirs("./data/validation", exist_ok=True)
os.makedirs("./data/train", exist_ok=True)


def write_tfrecords(x, y, filename):
    # The context manager flushes and closes the file when the loop finishes
    with tf.io.TFRecordWriter(filename) as writer:
        # Flatten the labels so each one is a scalar rather than a length-1 array
        for image, label in zip(x, y.flatten()):
            example = tf.train.Example(
                features=tf.train.Features(
                    feature={
                        # Raw pixel bytes; the reader must know the dtype and shape
                        "image": tf.train.Feature(
                            bytes_list=tf.train.BytesList(value=[image.tobytes()])
                        ),
                        "label": tf.train.Feature(
                            int64_list=tf.train.Int64List(value=[int(label)])
                        ),
                    }
                )
            )
            writer.write(example.SerializeToString())


write_tfrecords(x_test, y_test, "./data/validation/validation.tfrecords")

write_tfrecords(x_train, y_train, "./data/train/train.tfrecords")

## Create Experiment

[SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) helps you organize, track, compare, and evaluate machine learning (ML) experiments and model versions. Since ML is a highly iterative process, Experiments helps data scientists and ML engineers explore thousands of different models in an organized manner. It is especially useful with tools like [Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) and [Amazon SageMaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html), which explore a large number of combinations automatically, because it lets you quickly zoom in on high-performance models.

We will first create an experiment for a training job, and then do an example for Automatic Model Tuning.

In [None]:
cv_experiment = Experiment.create(
    experiment_name=f"manual-experiment-cv-keras-{int(time.time())}",
    description="CV Workshop example",
    sagemaker_boto_client=sm,
)

Upload the data to S3.

In [None]:
dataset_location = sagemaker_session.upload_data(path="data", key_prefix=prefix + "/data")
display(dataset_location)

## Configure Training

### Define Custom Metrics
SageMaker can get training metrics directly from the logs and send them to CloudWatch metrics.

In [None]:
keras_metric_definition = [
    {
        "Name": "train:loss",
        "Regex": ".*loss: ([0-9\\.]+) - accuracy: [0-9\\.]+.*",
    },
    {
        "Name": "train:accuracy",
        "Regex": ".*loss: [0-9\\.]+ - accuracy: ([0-9\\.]+).*",
    },
    {
        "Name": "validation:accuracy",
        "Regex": ".*step - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_accuracy: ([0-9\\.]+).*",
    },
    {
        "Name": "validation:loss",
        "Regex": ".*step - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: ([0-9\\.]+) - val_accuracy: [0-9\\.]+.*",
    },
    {
        "Name": "sec/steps",
        "Regex": ".* (\\d+)[mu]s/step - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_accuracy: [0-9\\.]+",
    },
]
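
As a quick sanity check, the sketch below applies two of these patterns to a single log line with Python's `re` module. The log line is a made-up example in the style of Keras progress output, not output captured from this training job, and `re.search` only approximates how SageMaker applies the patterns to the container logs.

```python
import re

# Hypothetical Keras-style progress line (illustrative values, not real output)
sample_log_line = (
    "196/196 - 25s 128ms/step - loss: 1.6543 - accuracy: 0.3987 "
    "- val_loss: 1.4012 - val_accuracy: 0.4925"
)

# Same patterns as "train:loss" and "validation:accuracy", written as raw strings
patterns = {
    "train:loss": r".*loss: ([0-9\.]+) - accuracy: [0-9\.]+.*",
    "validation:accuracy": (
        r".*step - loss: [0-9\.]+ - accuracy: [0-9\.]+"
        r" - val_loss: [0-9\.]+ - val_accuracy: ([0-9\.]+).*"
    ),
}

extracted = {}
for name, pattern in patterns.items():
    match = re.search(pattern, sample_log_line)
    # Group 1 is the captured metric value; None means the pattern did not match
    extracted[name] = float(match.group(1)) if match else None

print(extracted)
```

If a metric comes back as `None` for a line copied from your own logs, the pattern likely needs adjusting before the metric will show up in CloudWatch.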

### Build A Training Estimator

We are going to train the model with Pipe Mode input. SageMaker Pipe Mode is a mechanism for providing S3 data to a training job via Linux FIFOs (named pipes). The training program reads from the FIFO and gets high-throughput data transfer from S3 without managing S3 access itself.
Pipe Mode is covered in more detail in the SageMaker [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html#your-algorithms-training-algo-running-container-trainingdata).
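
To make the FIFO idea concrete, here is a minimal local sketch that uses only the Python standard library. It is not how SageMaker implements Pipe Mode: the producer thread merely stands in for the service streaming S3 data, the file name and the function name `stream_through_fifo` are hypothetical, and `os.mkfifo` is only available on Unix-like systems.

```python
import os
import tempfile
import threading


def stream_through_fifo(chunks):
    """Write byte chunks into a named pipe in one thread and read them back here."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "train_0")  # hypothetical channel file name
    os.mkfifo(fifo_path)

    def producer():
        # Opening for write blocks until the reader has opened the FIFO
        with open(fifo_path, "wb") as fifo:
            for chunk in chunks:
                fifo.write(chunk)

    writer = threading.Thread(target=producer)
    writer.start()

    received = bytearray()
    # The consumer reads sequentially; nothing is stored on disk in between
    with open(fifo_path, "rb") as fifo:
        while True:
            data = fifo.read(4096)
            if not data:  # empty read means the producer closed its end
                break
            received.extend(data)

    writer.join()
    os.remove(fifo_path)
    os.rmdir(workdir)
    return bytes(received)


print(stream_through_fifo([b"record-1|", b"record-2|"]))
```

The consumer never sees a regular file, only a byte stream that ends when the producer closes its end. This is why Pipe Mode input is read sequentially rather than by random access.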

In [None]:
hyperparameters = {"epochs": 2, "batch-size": 256}

inputs = {"train": dataset_location + "/train", "validation": dataset_location + "/validation"}

trial_name = f"cv-keras-training-job-{int(time.time())}"
cnn_trial = Trial.create(
    trial_name=trial_name,
    experiment_name=cv_experiment.experiment_name,
    sagemaker_boto_client=sm,
)

experiment_config = {
    "ExperimentName": cv_experiment.experiment_name,
    "TrialName": cnn_trial.trial_name,
    "TrialComponentDisplayName": "Training",
}

estimator = TensorFlow(
    base_job_name="cv-keras-pipe",
    entry_point="keras_cifar10.py",
    source_dir="source_dir",
    role=role,
    framework_version="2.3.0",
    py_version="py37",
    hyperparameters=hyperparameters,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    metric_definitions=keras_metric_definition,
    enable_sagemaker_metrics=True,
    input_mode="Pipe",
)

estimator.fit(inputs, wait=True, logs=False, experiment_config=experiment_config)

The **```fit```** method creates a training job on an **ml.c5.xlarge** instance.

The instance writes checkpoints and logs to the S3 bucket we set up earlier. If you don't have this bucket yet, **```sagemaker_session```** will create it for you. These checkpoints and logs can be used to restore the training job and to analyze its metrics with TensorBoard.

## Analyzing Training Job

You can set `logs=True` in the `fit` call above to see the container logs directly in the notebook. Alternatively, open "Training Jobs" in the SageMaker console for a more user-friendly report with links to CloudWatch, where the full logs are kept indefinitely.

Since we specified an Experiment trial, you can also click the "SageMaker resources" icon in SageMaker Studio, select "Experiments and trials", open the trial, and explore the trial details to view metric charts, summary statistics, and hyperparameters associated with the experiment.

![Experiment UI](../statics/Experiments.png)

## Automatic Model Tuning

[Amazon SageMaker automatic model tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html), also known as hyperparameter optimization (HPO), finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.

### Configure HPO Job
Next, configure the tuning job with the following settings:
- hyperparameters that SageMaker Automatic Model Tuning will tune: `learning-rate`, `batch-size` and `optimizer`
- maximum number of training jobs it will run to optimize the objective metric: `6`
- number of parallel training jobs that will run in the tuning job: `2`
- objective metric that Automatic Model Tuning will use: `validation:accuracy`

**Note: you may run into resource limits in your account. If you do, please raise a support case to increase the limit.**

In [None]:
shared_hyperparameters = {"epochs": 4}

estimator = TensorFlow(
    base_job_name="cv-keras-pipe",
    entry_point="keras_cifar10.py",
    source_dir="source_dir",
    role=role,
    framework_version="2.3.0",
    py_version="py37",
    hyperparameters=shared_hyperparameters,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    metric_definitions=keras_metric_definition,
    input_mode="Pipe",
)

In [None]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

hyperparameter_ranges = {
    "learning-rate": ContinuousParameter(0.00001, 0.001),
    "batch-size": CategoricalParameter([64, 128, 256]),
    "optimizer": CategoricalParameter(["sgd", "adam", "rmsprop"]),
}

objective_metric_name = "validation:accuracy"

inputs = {"train": dataset_location + "/train", "validation": dataset_location + "/validation"}

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions=keras_metric_definition,
    objective_type="Maximize",
    max_jobs=6,
    max_parallel_jobs=2,
    base_tuning_job_name="cv-hpo",
)

tuner.fit(inputs)
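
For intuition about the search space, the sketch below draws candidate configurations from the same ranges with plain random sampling. This is an illustration only: SageMaker's tuner applies its own search strategy rather than this code, and the helper `sample_configuration` and the log-uniform draw for the learning rate are assumptions made for the example.

```python
import math
import random

# Ranges mirror hyperparameter_ranges; the sampling logic itself is hypothetical
LEARNING_RATE_RANGE = (0.00001, 0.001)
BATCH_SIZES = [64, 128, 256]
OPTIMIZERS = ["sgd", "adam", "rmsprop"]


def sample_configuration(rng):
    """Draw one candidate set of hyperparameters from the ranges."""
    low, high = LEARNING_RATE_RANGE
    # Log-uniform draw so each order of magnitude is equally likely (an assumption)
    learning_rate = 10 ** rng.uniform(math.log10(low), math.log10(high))
    return {
        "learning-rate": learning_rate,
        "batch-size": rng.choice(BATCH_SIZES),
        "optimizer": rng.choice(OPTIMIZERS),
    }


max_jobs = 6
max_parallel_jobs = 2

rng = random.Random(0)  # fixed seed so the sketch is repeatable
candidates = [sample_configuration(rng) for _ in range(max_jobs)]

# If jobs take similar time, they run in roughly this many sequential waves
waves = math.ceil(max_jobs / max_parallel_jobs)

for candidate in candidates:
    print(candidate)
print("sequential waves:", waves)
```

With only 6 jobs, the tuner sees a small slice of a space that has 9 categorical combinations plus a continuous learning rate, so raising `max_jobs` is the first knob to consider if results vary widely.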

## Associate HPO to Experiment
This step can be skipped when the tuning job is executed from a [SageMaker Pipeline Tuning Step](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.steps.TuningStep).

After running the code below, you should see something like this in your studio environment:
![HPO Experiments](../statics/HPO_experiments.png)


In [None]:
from smexperiments.search_expression import Filter, Operator, SearchExpression
from smexperiments.trial_component import TrialComponent

In [None]:
# Get the most recently created tuning job

list_tuning_jobs_response = sm.list_hyper_parameter_tuning_jobs(
    SortBy="CreationTime", SortOrder="Descending"
)
print(f'Found {len(list_tuning_jobs_response["HyperParameterTuningJobSummaries"])} tuning jobs.')
tuning_jobs = list_tuning_jobs_response["HyperParameterTuningJobSummaries"]
most_recently_created_tuning_job = tuning_jobs[0]
tuning_job_name = most_recently_created_tuning_job["HyperParameterTuningJobName"]
experiment_name = "cv-hpo-experiment"
trial_name = tuning_job_name + "-trial"

print(f"Associate all training jobs created by {tuning_job_name} with trial {trial_name}")

In [None]:
# Create the experiment if it doesn't exist
try:
    experiment = Experiment.load(experiment_name=experiment_name)
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        experiment = Experiment.create(experiment_name=experiment_name)
    else:
        # Re-raise anything other than a missing resource
        raise


# Create the trial if it doesn't exist
try:
    trial = Trial.load(trial_name=trial_name)
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        trial = Trial.create(experiment_name=experiment_name, trial_name=trial_name)
    else:
        raise

In [None]:
# Get the trial components derived from the training jobs

creation_time = most_recently_created_tuning_job["CreationTime"]
creation_time = creation_time.astimezone(pytz.utc)
creation_time = creation_time.strftime("%Y-%m-%dT%H:%M:%SZ")

created_after_filter = Filter(
    name="CreationTime",
    operator=Operator.GREATER_THAN_OR_EQUAL,
    value=str(creation_time),
)

# The training job names contain the tuning job name (and the training job name is in the source arn)
source_arn_filter = Filter(
    name="TrialComponentName", operator=Operator.CONTAINS, value=tuning_job_name
)
source_type_filter = Filter(
    name="Source.SourceType", operator=Operator.EQUALS, value="SageMakerTrainingJob"
)

search_expression = SearchExpression(
    filters=[created_after_filter, source_arn_filter, source_type_filter]
)

# Search iterates over every page of results by default
trial_component_search_results = list(
    TrialComponent.search(search_expression=search_expression, sagemaker_boto_client=sm)
)
print(f"Found {len(trial_component_search_results)} trial components.")
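
The `CreationTime` filter value above is a UTC timestamp in `YYYY-MM-DDTHH:MM:SSZ` form. As a small standalone sketch, the same conversion can be done with the standard library's `datetime.timezone` instead of `pytz`. The helper name `to_search_timestamp` and the sample datetime are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_search_timestamp(moment):
    """Convert a timezone-aware datetime to a UTC string such as 2021-06-01T12:30:00Z."""
    if moment.tzinfo is None:
        # A naive datetime would be interpreted in the machine's local zone
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Sample value at a UTC-7 offset (made up for illustration)
local_time = datetime(2021, 6, 1, 5, 30, 0, tzinfo=timezone(timedelta(hours=-7)))
print(to_search_timestamp(local_time))
```

Converting to UTC before formatting matters because the trailing `Z` claims UTC. Formatting a local time with that suffix would shift the filter window by the UTC offset.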

In [None]:
# Associate the trial components with the trial
for tc in trial_component_search_results:
    print(f"Associating trial component {tc.trial_component_name} with trial {trial.trial_name}.")
    trial.add_trial_component(tc.trial_component_name)
    # sleep to avoid throttling
    time.sleep(0.5)

## Clean up
To avoid incurring charges to your AWS account for the resources used in this tutorial, remove all data and model artifacts from the SageMaker S3 bucket.
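
One way to do this from the notebook is sketched below. It assumes the `bucket` and `prefix` variables from the Setup section and uses the boto3 collection call `objects.filter(Prefix=...).delete()`. The helper name `delete_s3_prefix` is hypothetical. Deletion is irreversible, so check the prefix before running it.

```python
def delete_s3_prefix(bucket_name, key_prefix, s3_resource=None):
    """Delete every object under key_prefix in the given bucket."""
    if s3_resource is None:
        # Imported lazily so the helper can also be exercised with a stub resource
        import boto3

        s3_resource = boto3.resource("s3")
    target_bucket = s3_resource.Bucket(bucket_name)
    # Batch-deletes all objects whose key starts with the prefix
    return target_bucket.objects.filter(Prefix=key_prefix).delete()


# Usage in this notebook (uncomment to delete the uploaded lab data):
# delete_s3_prefix(bucket, prefix)
```

Training and tuning job outputs may be stored under other prefixes in the same bucket (for example, one named after each job), so list the bucket contents afterwards to confirm nothing is left behind.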