# Monitor and Analyze Training Jobs Using Metrics
An Amazon SageMaker training job is an iterative process that teaches a model to make predictions by presenting examples from a training dataset.  
Typically, a training algorithm computes several metrics, such as training error and prediction accuracy. These metrics help diagnose whether the model is learning well and will generalize well for making predictions on unseen data.  

The training algorithm writes the values of these metrics to logs, which Amazon SageMaker monitors and sends to Amazon CloudWatch in real time.

If you want Amazon SageMaker to parse the logs of a custom algorithm and send the metrics it emits to CloudWatch, specify those metrics when you configure the training job.  
For each metric, you provide a name and a regular expression that Amazon SageMaker uses to find the metric's value in the logs that your algorithm emits.

## Defining Training Metrics (Amazon SageMaker Python SDK)
Define the metrics that you want to send to CloudWatch by specifying a list of metric names and regular expressions as the `metric_definitions` argument when you initialize an `Estimator` object. For example, if you want to monitor both the `train:error` and `validation:error` metrics in CloudWatch, your `Estimator` initialization would look like the following:
```python
estimator = TensorFlow(base_job_name='cifar10',
                       entry_point='cifar10_keras_sm.py',
                       source_dir='training_script',
                       role=role,
                       framework_version='1.12.0',
                       py_version='py3',
                       metric_definitions=[
                           {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
                           {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}
                       ],
                       hyperparameters={'epochs': 20},
                       train_instance_count=1,
                       train_instance_type='ml.p3.2xlarge')
```
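
For these regular expressions to match, the training script has to print each metric in exactly that shape. The sketch below shows one way a script might do so. `format_metrics` and the sample values are hypothetical; only the `Train_error=...;` and `Valid_error=...;` patterns come from the definitions above.

```python
import re

# Regular expressions copied from the metric_definitions above.
METRIC_DEFINITIONS = [
    {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
    {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'},
]


def format_metrics(train_error, valid_error):
    """Hypothetical helper: build a log line in the shape the regexes expect."""
    return f"Train_error={train_error:.4f}; Valid_error={valid_error:.4f};"


# Illustrative values only, not results from a real training run.
log_line = format_metrics(0.25, 0.5)
print(log_line)

# Check that every definition captures a value from the line.
captured = {
    definition['Name']: re.search(definition['Regex'], log_line).group(1)
    for definition in METRIC_DEFINITIONS
}
print(captured)
```

Note that the trailing `;` matters here: the lazy group `(.*?)` stops at the first semicolon, so a line printed without it would not match.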

## Monitoring the cifar10 training
Find your previous training job (cifar10_keras_sm) in the SageMaker console.
Open the job details and look at the job's CloudWatch logs.  
Write metric regular expressions that fit the log lines. Use a regex tool to check your expressions, and use parentheses `()` to capture each metric value.
One possible solution for this example is below.

In [None]:
metric_definitions = [
    {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+) - acc: [0-9\\.]+'},
    {'Name': 'train:accuracy', 'Regex': 'loss: [0-9\\.]+ - acc: ([0-9\\.]+)'},
    {'Name': 'validation:accuracy', 'Regex': 'val_loss: [0-9\\.]+ - val_acc: ([0-9\\.]+)'},
    {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+) - val_acc: [0-9\\.]+'},
]
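
If you don't have a regex tool at hand, Python's `re` module can check the expressions. The following is a minimal sketch: `extract_metrics` is a hypothetical helper, and `sample_line` is an assumed line modeled on Keras progress output, so replace it with a real line copied from your job's CloudWatch logs.

```python
import re

metric_definitions = [
    {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+) - acc: [0-9\\.]+'},
    {'Name': 'train:accuracy', 'Regex': 'loss: [0-9\\.]+ - acc: ([0-9\\.]+)'},
    {'Name': 'validation:accuracy', 'Regex': 'val_loss: [0-9\\.]+ - val_acc: ([0-9\\.]+)'},
    {'Name': 'validation:loss', 'Regex': 'val_loss: ([0-9\\.]+) - val_acc: [0-9\\.]+'},
]

# Assumed sample line; the numbers are illustrative, not real results.
sample_line = " - 25s - loss: 1.2345 - acc: 0.5678 - val_loss: 1.1111 - val_acc: 0.6000"


def extract_metrics(line, definitions):
    """Return {metric name: value} for every definition whose regex matches."""
    found = {}
    for definition in definitions:
        match = re.search(definition['Regex'], line)
        if match:
            # group(1) is the value captured by the parentheses.
            found[definition['Name']] = float(match.group(1))
    return found


metrics = extract_metrics(sample_line, metric_definitions)
print(metrics)
```

A metric that is missing from the printed dictionary means its regular expression did not match the line.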

Continue with the notebook and run the same job as before (same estimator configuration). This time, add the ```metric_definitions=metric_definitions``` argument.  
Run the job for 20 epochs.

In [None]:
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

### Load the SageMaker experiment

In [None]:
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
import time
cifar10_experiment = Experiment.load(
    experiment_name="TensorFlow-cifar10-experiment")

In [None]:
# create a new trial
trial_name = f"cifar10-training-job-{int(time.time())}"
trial = Trial.create(
    trial_name=trial_name, 
    experiment_name=cifar10_experiment.experiment_name
)

In [None]:
# Configure the dataset location variable
dataset_location = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-cifar10')
display(dataset_location)

In [None]:
from sagemaker.tensorflow import TensorFlow
estimator = ... # Make sure you use the metric_definitions=metric_definitions argument.

Connect the trial configured above to the job by adding the experiment config to the `fit` function.
```python
experiment_config={
                  "ExperimentName": cifar10_experiment.experiment_name, 
                  "TrialName": trial.trial_name,
                  "TrialComponentDisplayName": "Training"}
```

In [None]:
estimator.fit({'train' :  'train_data_location',
               'validation' :  'validation_data_location',
               'eval' :  'eval_data_location'},
             experiment_config=)

### View the job training metrics
SageMaker uses the regular expressions configured above to send the job metrics to CloudWatch.
You can now view the job metrics directly from the SageMaker console.  

Log in to the [SageMaker console](https://console.aws.amazon.com/sagemaker/home), choose the latest training job, and scroll down to the Monitor section.  
Using CloudWatch metrics, you can change the period and configure the statistics.

Find the metrics using the following cell:

In [None]:
from IPython.display import Markdown, display

link = 'https://console.aws.amazon.com/cloudwatch/home?region='+sagemaker_session.boto_region_name+'#metricsV2:query=%7B/aws/sagemaker/TrainingJobs,TrainingJobName%7D%20'+estimator.latest_training_job.job_name
display(Markdown('CloudWatch metrics: [link]('+link+')'))
display(Markdown('After you choose a metric, change the period to 1 Minute (Graphed Metrics -> Period)'))
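
The link in the cell above embeds a percent-encoded CloudWatch search query. As a rough sketch of how that string is put together, the hypothetical helper `cloudwatch_metrics_url` below rebuilds it with `urllib.parse.quote`; the region and job name in the example are placeholders.

```python
from urllib.parse import quote


def cloudwatch_metrics_url(region, job_name):
    """Hypothetical helper: build the CloudWatch metrics link for a training job."""
    # Search the SageMaker training job namespace for this job's metrics.
    query = "{/aws/sagemaker/TrainingJobs,TrainingJobName} " + job_name
    # Keep '/' and ',' readable; braces and the space are percent-encoded.
    encoded_query = quote(query, safe="/,")
    return (
        "https://console.aws.amazon.com/cloudwatch/home"
        f"?region={region}#metricsV2:query={encoded_query}"
    )


# Placeholder region and job name, for illustration only.
url = cloudwatch_metrics_url("us-east-1", "cifar10-2020-01-01-00-00-00-000")
print(url)
```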

## Monitor with TensorBoard
TensorBoard provides the visualization and tooling needed for machine learning experimentation:
* Tracking and visualizing metrics such as loss and accuracy
* Visualizing the model graph (ops and layers)
* Viewing histograms of weights, biases, or other tensors as they change over time
* Projecting embeddings to a lower dimensional space
* Displaying images, text, and audio data
* And much more

In the next section we'll update the script to save TensorBoard logs.  
We'll be able to use TensorBoard for monitoring our jobs in real time.  

Update your cifar10_keras_sm.py script to send logs to TensorBoard.  
Add the `from keras.callbacks import TensorBoard` import.

Keras will send TensorBoard logs on every batch. Because sending the logs to S3 slows down the training job, change the TensorBoard callback to send logs only at the end of each epoch.

Add the TensorBoard callback to your script (add this line after the ModelCheckpoint callback):
```python
tb_callback = TensorBoard(log_dir=args.model_dir, update_freq='epoch')
```
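
To get a feel for why the update frequency matters, the sketch below counts logging events for per-batch and per-epoch updates. It is a back-of-the-envelope estimate, not a measurement: the 40,000 training images and the batch size of 128 are assumptions, the 20 epochs come from the earlier job, and `tensorboard_writes` is a hypothetical helper.

```python
import math


def tensorboard_writes(num_samples, batch_size, epochs, update_freq):
    """Estimate how many times metrics are logged during training."""
    if update_freq == 'epoch':
        return epochs
    if update_freq == 'batch':
        batches_per_epoch = math.ceil(num_samples / batch_size)
        return batches_per_epoch * epochs
    raise ValueError("update_freq must be 'batch' or 'epoch' in this sketch")


# Assumed dataset size and batch size; 20 epochs as in the earlier job.
per_batch = tensorboard_writes(40000, 128, 20, 'batch')
per_epoch = tensorboard_writes(40000, 128, 20, 'epoch')
print(f"per-batch updates: {per_batch}, per-epoch updates: {per_epoch}")
```

Under these assumptions, per-batch logging produces a few hundred times more log updates than per-epoch logging.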

Add `tb_callback` to the `model.fit` callbacks list:
```python
callbacks=[checkpoint, tb_callback]
```

### Run a training job with TensorBoard support

In [None]:
# create a new trial
trial_name = f"cifar10-training-job-{int(time.time())}"
trial = Trial.create(
    trial_name=trial_name, 
    experiment_name=cifar10_experiment.experiment_name
)

In [None]:
estimator = ...

In [None]:
estimator.fit({'train' :  'train_data_location',
               'validation' :  'validation_data_location',
               'eval' :  'eval_data_location'},
              experiment_config=,
              wait=False) # Use wait=False to run async jobs

### Install TensorBoard on your local machine
Install [TensorBoard](https://github.com/tensorflow/tensorboard) locally (on your computer) using `pip install tensorboard`.  
To access an S3 log directory, configure the TensorBoard default region.  
You can do this by setting an environment variable named AWS_REGION to the AWS Region your training job runs in.  
For example, `AWS_REGION='us-east-1' tensorboard --logdir model_dir`  
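
If you prefer to launch TensorBoard from a Python script, the same idea can be sketched as follows. `tensorboard_command` is a hypothetical helper, the bucket path is a placeholder, and the sketch only builds the environment and argument list without starting the process; `--port` is set to TensorBoard's default of 6006.

```python
import os
import shlex


def tensorboard_command(region, logdir, port=6006):
    """Hypothetical helper: return (env, args) for launching TensorBoard."""
    # Copy the current environment and add the region TensorBoard should use.
    env = dict(os.environ, AWS_REGION=region)
    args = ["tensorboard", "--logdir", logdir, "--port", str(port)]
    return env, args


# Placeholder region and S3 path, for illustration only.
env, args = tensorboard_command("us-east-1", "s3://example-bucket/model")

# Equivalent shell command, quoted so it is safe to paste into a terminal.
shell_command = f"AWS_REGION={shlex.quote(env['AWS_REGION'])} " + " ".join(
    shlex.quote(arg) for arg in args
)
print(shell_command)
```

To actually start TensorBoard, you could pass these values to `subprocess.run(args, env=env)` on a machine where TensorBoard is installed.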

**You can get your TensorBoard command from the next cell.**

You'll need an access key and secret key with access to model_dir. For this event, get them from https://dashboard.eventengine.run/dashboard.  

In the dashboard, choose AWS Console, copy the Credentials / CLI Snippets, and run them in your terminal or command prompt.

In [None]:
from IPython.display import Markdown, display

# Build the shell command that starts TensorBoard against the job's model_dir
command = f"AWS_REGION='{sagemaker_session.boto_region_name}' tensorboard --logdir {estimator.model_dir}"
display(Markdown('TensorBoard command: ' + command))

You can access TensorBoard locally at http://localhost:6006

### Analyze the experiments

In [None]:
search_expression = {
    "Filters":[
        {
            "Name": "DisplayName",
            "Operator": "Equals",
            "Value": "Training",
        }
    ],
}

In [None]:
import pandas as pd 
pd.options.display.max_columns = 500

from sagemaker.analytics import ExperimentAnalytics
trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sagemaker_session, 
    experiment_name=cifar10_experiment.experiment_name,
    search_expression=search_expression
)

table = trial_component_analytics.dataframe(force_refresh=True)
display(table)
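
Once the table is loaded, ordinary pandas operations can rank the training runs. The sketch below uses hypothetical rows in place of `table`; the column names (`TrialComponentName`, `validation:accuracy - Max`, `validation:loss - Min`) are assumptions about how the analytics DataFrame labels metrics, so check `table.columns` in your own notebook first. `best_trial_component` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical rows standing in for `table`; values are illustrative only.
sample_table = pd.DataFrame({
    "TrialComponentName": ["job-a", "job-b", "job-c"],
    "validation:accuracy - Max": [0.61, 0.68, 0.64],
    "validation:loss - Min": [1.10, 0.95, 1.02],
})


def best_trial_component(df, metric_column, higher_is_better=True):
    """Return the name of the trial component with the best metric value."""
    ranked = df.sort_values(metric_column, ascending=not higher_is_better)
    return ranked.iloc[0]["TrialComponentName"]


best = best_trial_component(sample_table, "validation:accuracy - Max")
print(best)
```

For a loss metric, pass `higher_is_better=False` so that the lowest value ranks first.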

**Good job!**   
You can now monitor your job using both CloudWatch metrics and TensorBoard.  
Before continuing to the next notebook, take a look at the [TensorBoard callback configuration](https://keras.io/callbacks/#tensorboard) for other TensorBoard configurations.