# Track an experiment while training a PyTorch model with a SageMaker Training Job
***
To execute this notebook in SageMaker Studio, you should select the **`PyTorch 1.12 Python 3.8 CPU Optimized`** image.
***

This notebook shows how to use the SageMaker SDK to track a machine learning experiment for a PyTorch model trained in a SageMaker Training Job in script mode, where you provide the model training script.

We introduce two concepts in this notebook:

* *Experiment:* An experiment is a collection of runs. When you initialize a run in your training loop, you include the name of the experiment that the run belongs to. Experiment names must be unique within your AWS account. 
* *Run:* A run consists of all the inputs, parameters, configurations, and results for one iteration of model training. Initialize an experiment run for tracking a training job with `Run()`.



You can track artifacts for experiments, including datasets, algorithms, hyperparameters, and metrics. Experiments executed on SageMaker, such as SageMaker training jobs, are automatically tracked, and any existing SageMaker experiment in your AWS account is automatically migrated to the new UI version.

In this notebook we demonstrate these capabilities through an MNIST handwritten digits classification example. The notebook is organized as follows:

1. Train a Convolutional Neural Network (CNN) model and log the model training metrics.
1. Tune the hyperparameters that configure the number of hidden channels and the optimizer in the model. Track the parameter configuration and the resulting model loss and accuracy, and automatically plot a confusion matrix using the Experiments capabilities of the SageMaker SDK.
1. Analyze your model results and plot graphs comparing the different runs generated in the tuning step (step 2).

## Runtime
This notebook takes approximately 20 minutes to run.

## Contents
1. [Install modules](#Install-modules)
1. [Setup](#Setup)
1. [Check model training script](#Check-model-training-script)
1. [Train model with Run context](#Train-model-with-Run-context)
1. [Contact](#Contact)

## Install modules

Let's ensure we have the latest SageMaker SDK available, including the SageMaker Experiments functionality.

In [None]:
# update boto3 and sagemaker to ensure latest SDK version
%pip install --upgrade pip
%pip install --upgrade boto3
%pip install --upgrade sagemaker
%pip install torch
%pip install torchvision
%pip install --upgrade seaborn

## Setup

Import the required libraries and set up the execution role and region used by the experiment.

SageMaker Experiments provides the `Run` class, which allows you to create a new experiment run.

In [None]:
from sagemaker import get_execution_role
from sagemaker.experiments.run import Run, load_run
from sagemaker.pytorch import PyTorch
from sagemaker.session import Session
from sagemaker.utils import name_from_base
import datetime

In [None]:
role = get_execution_role()
region = Session().boto_session.region_name

## Check model training script
* Optional step: use the cell below to inspect *`mnist.py`*, the PyTorch script file used to train our model.

In [None]:
%load  ./script/mnist.py

The script loaded in the cell above implements the code needed to train our PyTorch model in SageMaker, using the SageMaker PyTorch image. It uses the `load_run` function to automatically detect the experiment configuration, and `run.log_parameter`, `run.log_parameters`, `run.log_file`, `run.log_metric` and `run.log_confusion_matrix` to track the model training.
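As a rough illustration of that logging pattern, the sketch below parses hyperparameters the way a script-mode entry point receives them (as command-line arguments) and logs them together with a per-epoch metric. It is not the contents of `mnist.py`: `RecordingRun` is a hypothetical local stand-in that only mimics `log_parameters` and `log_metric`, the metric name `train:loss` is an assumed naming convention, and the loss value is a placeholder. Inside a real training job, the `run` object would come from `load_run(...)` instead.

```python
import argparse


class RecordingRun:
    """Hypothetical local stand-in mimicking two logging methods of a SageMaker Run."""

    def __init__(self):
        self.parameters = {}
        self.metrics = []

    def log_parameters(self, parameters):
        self.parameters.update(parameters)

    def log_metric(self, name, value, step=None):
        self.metrics.append((name, value, step))


def parse_hyperparameters(argv):
    """Parse hyperparameters passed to the script as command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=8)
    parser.add_argument("--hidden_channels", type=int, default=5)
    parser.add_argument("--optimizer", type=str, default="adam")
    return parser.parse_args(argv)


def train(run, args):
    """Log the configuration once, then one loss value per epoch."""
    run.log_parameters(vars(args))
    for epoch in range(1, args.epochs + 1):
        loss = 1.0 / epoch  # placeholder; a real script computes this from the model
        run.log_metric(name="train:loss", value=loss, step=epoch)


# Local dry run with the stand-in object.
demo_run = RecordingRun()
train(demo_run, parse_hyperparameters(["--epochs", "3", "--optimizer", "sgd"]))
print(demo_run.parameters)
print(demo_run.metrics)
```

Passing the run object into `train` keeps the logging calls in one place and lets you exercise the function locally before submitting a training job.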

## Train model with Run context

Let's now train the model, passing the experiment run context to the training job.

For a detailed explanation of the `Run` API, refer to the [source code](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/experiments/run.py).

In [None]:
# set new experiment configuration
experiment_name = "training-job-experiment"
run_name_base = "run-example"
run_name = name_from_base(run_name_base, short=True)
print(f"Experiment name: {experiment_name}\nRun name: {run_name}")

In [None]:
%%time
# Start training job with experiment setting
with Run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=Session()) as run:
    estimator = PyTorch(
        entry_point="mnist.py",
        source_dir="script",
        role=role,
        model_dir=False,
        framework_version="1.12",
        py_version="py38",
        instance_type="ml.c5.xlarge",
        instance_count=1,
        hyperparameters={"epochs": 8, "hidden_channels": 5, "optimizer": "adam"},
        keep_alive_period_in_seconds=3600,  # use warm pool
    )

    estimator.fit()

In the SageMaker Experiments UI, you can see the experiment run, populated with the logged metrics and parameters. You can also see the automatically generated outputs for the model data.


<img src="images/sm_training_exp_overview.png" width="100%" style="float: left;" />
<img src="images/sm_training_inputs.png" width="100%" style="float: left;" />
<img src="images/sm_training_parameters.png" width="100%" style="float: left;" />
<img src="images/sm_training_metrics.png" width="100%" style="float: left;" />
<img src="images/sm_training_outputs.png" width="100%" style="float: left;" />

## Run multiple experiments

You can now create multiple runs of your experiment by varying a few parameters. Feel free to play with the parameters. 
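One way to organize such a sweep is to enumerate the parameter grid up front and derive a unique run name for each combination. The sketch below does this with `itertools.product`. The helper `build_run_configs` is a hypothetical name introduced for illustration; the grid values and the timestamp format mirror the ones used in the next cell, although the combinations are visited in a different order.

```python
import datetime
import itertools

run_name_base = "run-example"
grid = {"optimizer": ["adam", "sgd"], "hidden_channels": [5, 10]}


def build_run_configs(base, grid, now):
    """Return one (run_name, hyperparameters) pair per combination in the grid."""
    keys = list(grid)
    stamp = f"{now:%Y-%m-%d-%H-%M-%S}"
    configs = []
    for values in itertools.product(*(grid[k] for k in keys)):
        hyperparameters = dict(zip(keys, values))
        suffix = "-".join(str(v) for v in values)
        configs.append((f"{base}-{suffix}-{stamp}", hyperparameters))
    return configs


configs = build_run_configs(run_name_base, grid, datetime.datetime.now())
for run_name, hyperparameters in configs:
    print(run_name, hyperparameters)
```

Each pair can then be used to open a `Run` and configure an estimator. Adding a hyperparameter to the sweep only requires a new key in `grid` rather than another nested loop.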

In [None]:
%%time
# Start one training job per hyperparameter combination, each in its own run

hidden_channels = [5, 10]
optimizer = ["adam", "sgd"]

for h in hidden_channels:
    for j in optimizer:
        now = f"{datetime.datetime.now():%Y-%m-%d-%H-%M-%S}"
        run_name_n = f"{run_name_base}-{j}-{h}-{now}"
        with Run(experiment_name=experiment_name, run_name=run_name_n) as run:
            print("hidden_channels-", h, " optimizer-", j)
            estimator = PyTorch(
                entry_point="./script/mnist.py",
                role=role,
                model_dir=False,
                framework_version="1.12",
                py_version="py38",
                instance_type="ml.c5.xlarge",
                instance_count=1,
                hyperparameters={
                    "epochs": 10,
                    "hidden_channels": h,
                    "optimizer": j,
                },
                keep_alive_period_in_seconds=1200,  # use warm pool
            )

            estimator.fit(wait=True)

## Compare the performance through Experiment UI

In the SageMaker Experiments UI, you can compare the different runs and analyze their metrics.


<img src="images/compare_experiments.png" width="100%"/>


## Bonus Point: customized analysis 

Besides all the built-in analysis within Experiments, you can also customize your analysis and plots based on the available metrics and parameters!

Below, we have provided you an example for your reference. 

This analysis is built on top of the [Experiment Analytics API](https://sagemaker.readthedocs.io/en/stable/api/training/analytics.html).
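The helper defined below first narrows the analytics dataframe to the columns of interest with a regular expression, then splits each metric column name into a dataset and a metric part. The sketch below reproduces only that selection logic with the standard library, so it can be checked without any experiment data. The column names in `columns` are assumptions chosen for illustration (the real ones depend on what `mnist.py` logs), and `select_columns` and `split_metric_column` are hypothetical helper names.

```python
import re


def select_columns(columns, parameter_names, metric_names, stat_name="Last"):
    """Keep parameter columns, DisplayName, and metric columns for one statistic."""
    alternatives = (
        [f"{m}.*- {stat_name}" for m in metric_names] + parameter_names + ["DisplayName"]
    )
    pattern = re.compile(f"(?:{'|'.join(alternatives)})")
    # re.search is the same matching rule that pandas DataFrame.filter(regex=...) applies.
    return [c for c in columns if pattern.search(c)]


def split_metric_column(column):
    """Split a name such as 'test:loss - Last' into ('test', 'loss')."""
    dataset, metric = column.split(" - ")[0].split(":")
    return dataset, metric


# Assumed column names, for illustration only.
columns = [
    "TrialComponentName",
    "DisplayName",
    "hidden_channels",
    "optimizer",
    "test:accuracy - Last",
    "test:accuracy - Min",
    "test:loss - Last",
    "train:loss - Avg",
]

selected = select_columns(columns, ["hidden_channels", "optimizer"], ["accuracy", "loss"])
print(selected)
print([split_metric_column(c) for c in selected if " - " in c])
```

Because the pattern is matched anywhere in the column name, a parameter name that is a substring of another column name would also be selected, so distinct parameter names keep the filter predictable.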

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import seaborn.objects as so
from sagemaker.analytics import ExperimentAnalytics


In [None]:
def analyze_experiment(experiment_name: str, parameter_names: list, metric_names: list, stat_name: str = "Last"):
    re_expr = (
        f"(?:{'|'.join([f'{k}.*- {stat_name}' for k in metric_names] + parameter_names + ['DisplayName'])})"
    )

    trial_component_analytics = ExperimentAnalytics(
        experiment_name=experiment_name,
        parameter_names=parameter_names,
    )
    df = trial_component_analytics.dataframe()
    # df = df[df["SourceArn"].isna()]
    df = df.filter(regex=re_expr)

    # join the categorical parameters
    df_temp = df[parameter_names].select_dtypes("object")

    cat_col_name = "_".join(df_temp.columns.values)

    if len(df_temp.columns) > 1:
        df.loc[:, cat_col_name] = df_temp.astype(str).apply("_".join, axis=1)
        df = df.drop(columns=df_temp.columns.values)

    ordinal_params = df[parameter_names].select_dtypes("number").columns.tolist()
    df_plot = df.melt(id_vars=["DisplayName"] + ordinal_params + [cat_col_name])
    df_plot[["Dataset", "Metrics"]] = (
        df_plot.variable.str.split(" - ").str[0].str.split(":", expand=True)
    )
    f = plt.Figure(
        figsize=(8, 6 * len(ordinal_params)),
        facecolor="w",
        layout="constrained",
        frameon=True,
    )
    f.suptitle("Experiment Analysis")
    sf = f.subfigures(1, len(ordinal_params))

    if isinstance(sf, mpl.figure.SubFigure):
        sf = [sf]

    for k, p in zip(sf, ordinal_params):

        (
            so.Plot(
                df_plot,
                y="value",
                x=p,
                color=cat_col_name,
            )
            .facet(col="Dataset", row="Metrics")
            .add(so.Dot())
            .share(y=False)
            .limit(y=(0, None))
            .on(k)
            .plot()
        )
    return f

In [None]:
parameter_names = ["hidden_channels", "optimizer"]
metric_names = ["accuracy", "loss"]
stat_name = "Last"


analyze_experiment(experiment_name, parameter_names, metric_names, stat_name)
