# Module 3. Training on SageMaker Environment with Profiling Enabled
---

This notebook can train on a single GPU or perform distributed training using PyTorch DDP.

This hands-on lab takes about **20 minutes** to complete.

<div class="alert alert-info"><h4>Note</h4><p>
If you plan to train large models on hundreds of gigabytes to terabytes of data, consider the SageMaker Distributed Training option.</p></div>

In [None]:
%store -r 
%store

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import copy
import time
import numpy as np
import torch, os
import matplotlib.pyplot as plt
import src.train_utils

import boto3
import sagemaker

boto_session = boto3.Session()
sagemaker_session = sagemaker.Session(boto_session=boto_session)
region = boto3.Session().region_name

In [None]:
try:
    bucket 
    dataset_dir        
    print("[OK] You can proceed.")
except NameError:
    print("+"*60)
    print("[ERROR] Please run '01_make_augmented_imgs.ipynb' before you continue.")
    print("+"*60)

In [None]:
from sagemaker.pytorch import PyTorch
role = sagemaker.get_execution_role()
s3_path = f's3://{bucket}/{dataset_dir}'

In [None]:
# import wandb
# os.environ["WANDB_NOTEBOOK_NAME"] = dataset_dir
# wandb.sagemaker_auth(path="./src")

<br>

# 1. SageMaker Debugger Profiling Enabled
---

To enable profiling, create a `ProfilerConfig` object and pass it to the `profiler_config` parameter of the SageMaker estimator. This example sets the system monitoring interval to 500 milliseconds (0.5 seconds).

In [None]:
from sagemaker.debugger import (ProfilerConfig,
                                FrameworkProfile,
                                CollectionConfig,
                                DebuggerHookConfig,
                                DetailedProfilingConfig, 
                                DataloaderProfilingConfig, 
                                PythonProfilingConfig,
                                Rule,
                                PythonProfiler,
                                cProfileTimer,
                                ProfilerRule,
                                rule_configs)

# The S3 location where the Debugger output is stored was covered in the previous step

# Set the profile config for both system and framework metrics
profiler_config = ProfilerConfig(
    system_monitor_interval_millis = 500,
    framework_profile_params = FrameworkProfile(
        detailed_profiling_config = DetailedProfilingConfig(
            start_step = 5, 
            num_steps = 10
        ),
        dataloader_profiling_config = DataloaderProfilingConfig(
            start_step = 7, 
            num_steps = 10
        ),
        python_profiling_config = PythonProfilingConfig(
            start_step = 9, 
            num_steps = 10,
            python_profiler = PythonProfiler.CPROFILE, 
            cprofile_timer = cProfileTimer.TOTAL_TIME
        )
    )
)

# Set the debugger hook config to save tensors
debugger_hook_config = DebuggerHookConfig(
    collection_configs = [
        CollectionConfig(name = 'weights'),
        CollectionConfig(name = 'gradients')
    ]
)

# Set the rules that analyze the tensors emitted during training.
# This set of rules inspects the overall training performance and progress of the model
rules = [
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.stalled_training_rule())
]
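
The three framework profilers above start at different steps, so their profiling windows only partly overlap. The sketch below computes those windows from the same `start_step` and `num_steps` values. It assumes each profiler covers `num_steps` consecutive steps beginning at `start_step`; `StepWindow` and `common_steps` are hypothetical helpers written for this illustration, not part of the SageMaker SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepWindow:
    """Hypothetical helper describing one profiler's step window."""
    name: str
    start_step: int
    num_steps: int

    @property
    def last_step(self) -> int:
        # Assumption: profiling covers num_steps consecutive steps.
        return self.start_step + self.num_steps - 1


def common_steps(windows) -> range:
    """Return the steps during which every window is active (may be empty)."""
    first = max(w.start_step for w in windows)
    last = min(w.last_step for w in windows)
    return range(first, max(first, last + 1))


# Same values as the ProfilerConfig in this section
windows = [
    StepWindow("detailed", start_step=5, num_steps=10),
    StepWindow("dataloader", start_step=7, num_steps=10),
    StepWindow("python", start_step=9, num_steps=10),
]

for w in windows:
    print(f"{w.name}: steps {w.start_step}-{w.last_step}")

print("all profilers active:", list(common_steps(windows)))
```

Under that assumption, all three profilers are active together only for steps 9 through 14, which is worth keeping in mind when comparing their outputs.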

<br>

# 2. Run a Training Job
---

Create a training job using the standard SageMaker Estimator API for PyTorch. 

SageMaker uses the same Python training script that you use on-premises. You only need to read the SageMaker environment variables, for example as argument defaults, and the rest of the source code can stay unchanged. The code snippet below is an example of reading the SageMaker environment variables.

```python
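import argparse
import json
import os

parser = argparse.ArgumentParser()
...

# SageMaker environment variables
# json.loads parses a JSON list given on the command line;
# type=list would split the string into single characters
parser.add_argument('--hosts', type=json.loads,
                    default=json.loads(os.environ['SM_HOSTS']))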
parser.add_argument('--current_host', type=str,
                    default=os.environ['SM_CURRENT_HOST'])
parser.add_argument('--model_dir', type=str,
                    default=os.environ['SM_MODEL_DIR'])
parser.add_argument('--model_chkpt_dir', type=str,
                    default='/opt/ml/checkpoint')    
parser.add_argument('--train_dir', type=str,
                    default=os.environ['SM_CHANNEL_TRAIN'])
parser.add_argument('--valid_dir', type=str,
                    default=os.environ['SM_CHANNEL_VALID'])    
parser.add_argument('--num_gpus', type=int,
                    default=os.environ['SM_NUM_GPUS'])
parser.add_argument('--output_data_dir', type=str,
                    default=os.environ.get('SM_OUTPUT_DATA_DIR'))
```

For example, `SM_MODEL_DIR` resolves to `/opt/ml/model` inside the SageMaker training container. For the full list of environment variables, please refer to the [SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit).
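
If you also want to run the same script outside SageMaker, one option is to fall back to local defaults when the `SM_*` variables are not set. The following is a minimal sketch of that idea, not the workshop's actual training script. The fallback values (`./model`, `./data/train`, `./data/valid`, `localhost`) are placeholders chosen for this illustration.

```python
import argparse
import json
import os


def parse_args(argv=None):
    """Read SageMaker environment variables, with local fallbacks (assumed values)."""
    parser = argparse.ArgumentParser()
    # argparse applies `type` to string defaults, so the JSON text becomes a list
    parser.add_argument('--hosts', type=json.loads,
                        default=os.environ.get('SM_HOSTS', '["localhost"]'))
    parser.add_argument('--current_host', type=str,
                        default=os.environ.get('SM_CURRENT_HOST', 'localhost'))
    parser.add_argument('--model_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', './model'))
    parser.add_argument('--train_dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', './data/train'))
    parser.add_argument('--valid_dir', type=str,
                        default=os.environ.get('SM_CHANNEL_VALID', './data/valid'))
    parser.add_argument('--num_gpus', type=int,
                        default=os.environ.get('SM_NUM_GPUS', '0'))
    return parser.parse_args(argv)


args = parse_args([])
print(args.hosts, args.model_dir, args.num_gpus)
```

Inside a training container the environment variables take precedence, so the same file behaves identically to the snippet above.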

In [None]:
IS_DISTRIBUTED_TRAINING = False
USE_WANDB = False
FRAMEWORK_VERSION = '1.6.0'
entry_point = 'train_single_gpu_wandb.py' if USE_WANDB else 'train_single_gpu.py'

if IS_DISTRIBUTED_TRAINING:
    estimator = PyTorch(entry_point='train_multi_gpu.py',
                        source_dir='./src',
                        role=role,
                        instance_type='ml.g4dn.12xlarge',
                        instance_count=1,
                        framework_version=FRAMEWORK_VERSION,
                        py_version='py3',
                        disable_profiler=False,
                        profiler_config=profiler_config,  
                        debugger_hook_config=debugger_hook_config,
                        hyperparameters = {'num_epochs': 10, 
                                           'batch_size': 128, # This parameter is divided by the number of GPUs
                                           'lr': 0.0005,
                                          }
                       )
else:
    estimator = PyTorch(entry_point=entry_point,
                        source_dir='./src',
                        role=role,
                        instance_type='ml.g4dn.xlarge',
                        instance_count=1,
                        framework_version=FRAMEWORK_VERSION,
                        py_version='py3',
                        disable_profiler=False,
                        profiler_config=profiler_config,  
                        debugger_hook_config=debugger_hook_config,                        
                        hyperparameters = {'num_epochs': 6, 
                                           'batch_size': 64,
                                           'lr': 0.0005,
                                          }     
                       )
    
train_input = sagemaker.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, dataset_dir))  
valid_input = sagemaker.TrainingInput(s3_data='s3://{}/{}/valid'.format(bucket, dataset_dir))      
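
The comment on `batch_size` in the distributed branch says the value is divided by the number of GPUs. As a rough illustration, the sketch below computes the per-GPU batch size for an `ml.g4dn.12xlarge` instance, which has 4 GPUs. It assumes the training script splits the global batch evenly across GPUs; `per_gpu_batch_size` is a hypothetical helper, not a function from the workshop code.

```python
def per_gpu_batch_size(global_batch_size: int, num_gpus: int) -> int:
    """Split a global batch evenly across GPUs (assumed behavior of the script)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if global_batch_size % num_gpus != 0:
        raise ValueError("global_batch_size must be divisible by num_gpus")
    return global_batch_size // num_gpus


# ml.g4dn.12xlarge provides 4 GPUs, so a global batch of 128 gives 32 per GPU
print(per_gpu_batch_size(128, 4))
```

With the single-GPU configuration (`ml.g4dn.xlarge`, `batch_size` of 64), the per-GPU and global batch sizes are the same.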

In [None]:
%%time
estimator.fit({'train': train_input, 'valid': valid_input}, wait=False)
train_job_name = estimator.latest_training_job.job_name

In [None]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/jobs/{}">Training Job</a> After About 5 Minutes</b>'.format(
            region, train_job_name
        )
    )
)

In [None]:
display(
    HTML(
        '<b>Review <a target="_blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/TrainingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(
            region, train_job_name
        )
    )
)

In [None]:
display(HTML(
        '<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Training Job Has Completed</b>'.format(
            bucket, train_job_name, region
        )
    )
)

Because `fit()` was called with `wait=False`, it returns immediately while the training job runs in the background. Run the code cell below to stream the job logs and block until training is complete.

In [None]:
sagemaker.Session().logs_for_job(job_name=train_job_name, wait=True)
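
If you only need the final status rather than the full log stream, polling the job status is an alternative. The sketch below shows the polling loop in isolation. `wait_for_training_job` is a hypothetical helper, and `fake_describe` is a stand-in for `boto3.client("sagemaker").describe_training_job` so that the example runs without AWS access.

```python
import time

# Statuses after which a training job no longer changes state
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe, job_name, poll_seconds=30, max_polls=240):
    """Poll `describe` until the job reaches a terminal status and return it.

    `describe` is any callable that accepts TrainingJobName=... and returns
    a dict containing the key 'TrainingJobStatus'.
    """
    for _ in range(max_polls):
        status = describe(TrainingJobName=job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


# Stand-in responses so the sketch runs offline (not real job data)
_responses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(TrainingJobName):
    return {"TrainingJobStatus": next(_responses)}


print(wait_for_training_job(fake_describe, "example-job", poll_seconds=0))
```

In the notebook you would pass the real client method and `train_job_name` instead of the stand-ins.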

<br>
    
# 3. Analyze Profiled Data
---

When the training job starts, SageMaker Debugger starts collecting system and framework metrics. Once metrics collection begins, the profiling data can be analyzed in various ways, including plotting and querying.

In [None]:
rule_output_path = estimator.output_path + train_job_name + "/rule-output"
print(f"You will find the profiler report in {rule_output_path}")
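
The string concatenation above depends on `estimator.output_path` ending with a slash. If you pass a custom output path without one, the job name would be fused onto the previous path segment. The sketch below shows one way to join the parts defensively; `join_s3_uri` is a hypothetical helper and the bucket and job names are placeholders.

```python
def join_s3_uri(base: str, *parts: str) -> str:
    """Join an S3 URI and path parts with exactly one slash between segments."""
    segments = [base.rstrip("/")]
    segments += [p.strip("/") for p in parts if p.strip("/")]
    return "/".join(segments)


# Both calls produce the same URI regardless of stray slashes
print(join_s3_uri("s3://example-bucket/", "example-job", "rule-output"))
print(join_s3_uri("s3://example-bucket", "example-job", "/rule-output/"))
```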

## Get Profiler Report

The `ProfilerReport()` rule generates an HTML report, `profiler-report.html`, with a summary of the built-in rules and recommendations for next steps. You can find this report in your S3 bucket.

For more information on how to download and open the Debugger Profiling Report, see [SageMaker Debugger Profiling Report](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html) in the SageMaker Developer Guide.

**[Caution] If running in JupyterLab, click "Trust HTML" at the top left of the screen to display the html report normally!**

In [None]:
import src.profiling_utils as profiling_utils
import json, os
from IPython.display import display, HTML

rule_output_path = estimator.output_path + train_job_name + "/rule-output"
output_dir = './output'
profile_output = output_dir+'/ProfilerReport'
profile_report_folder = profiling_utils.get_profile_report_folder(bucket, train_job_name + "/rule-output")

!rm -rf $output_dir

if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    
if not os.path.exists(profile_output):
    os.makedirs(profile_output)    
    
!aws s3 ls {rule_output_path}/{profile_report_folder}/profiler-output/
!aws s3 cp {rule_output_path}/{profile_report_folder}/profiler-output/ {output_dir}/ProfilerReport/ --recursive    

display(HTML('<b>ProfilerReport : <a href="{}profiler-report.html">Profiler Report</a></b>'.format(output_dir+"/ProfilerReport/")))

In [None]:
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
tj = TrainingJob(train_job_name, region)

# Retrieve the training job description and the S3 bucket URI where the metric data are saved
tj.describe_training_job()
tj.get_config_and_profiler_s3_output_path()

In [None]:
# Wait for the data to be available
tj.wait_for_sys_profiling_data_to_be_available()
tj.wait_for_framework_profiling_data_to_be_available()

## Get Metrics Dataframe

In [None]:
%%time
system_metrics_df, framework_metrics_df = profiling_utils.get_profiling_df(tj)

display(system_metrics_df.head())
display(framework_metrics_df.head())

## Plot profiling metrics

In [None]:
%%time
profiling_utils.plot_profiling_metrics(tj)

<br>

# 4. Import Model Artifacts from S3 into Your Local Environment
---

In [None]:
import os
local_model_path = 'model'
model_name = 'model_best.pth'
base_model_name = 'mobilenetv2'
os.makedirs(local_model_path, exist_ok=True)
s3_model_path = estimator.model_data

In [None]:
%%bash -s "$local_model_path" "$s3_model_path"
# Quote the arguments so paths containing spaces are handled correctly
aws s3 cp "$2" "$1"
cd "$1"
tar -xzvf model.tar.gz
rm model.tar.gz
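
The extraction step can also be done in Python with the standard `tarfile` module, which is convenient when the shell is not available. The sketch below is self-contained: it first builds a tiny stand-in archive in a temporary directory and then extracts it. `extract_model_archive` is a hypothetical helper, and the file contents are placeholders rather than real model weights.

```python
import os
import tarfile
import tempfile


def extract_model_archive(archive_path: str, target_dir: str) -> list:
    """Extract a .tar.gz archive into target_dir and return the member names.

    Only use this on archives you trust, such as your own training output.
    """
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=target_dir)
    return names


with tempfile.TemporaryDirectory() as tmp:
    # Build a stand-in model.tar.gz containing a placeholder file
    src = os.path.join(tmp, "model_best.pth")
    with open(src, "wb") as f:
        f.write(b"placeholder weights")
    archive = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="model_best.pth")

    extracted = extract_model_archive(archive, os.path.join(tmp, "model"))
    print(extracted)
```

In the notebook you would call the helper on the downloaded `model.tar.gz` with `local_model_path` as the target directory.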

In [None]:
%store train_job_name s3_model_path base_model_name local_model_path model_name

<br>

# Next Step

In this module, you trained the model by running a SageMaker training job. Please proceed to `4.1_neo_compile.ipynb`.