# Visualize Training Jobs and Performance of Your Model Using TensorBoard on SageMaker


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

---


## Background
Use [TensorBoard on Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/tensorboard-on-sagemaker.html) to take advantage of TensorBoard's visualization features and to monitor model performance metrics, such as loss, accuracy, weights, and gradients, collected from the training jobs of your model.
This notebook example shows how to use the `sagemaker.interactive_apps.tensorboard.TensorBoardApp` API, which launches the hosted TensorBoard application within SageMaker. The sample training script is prepared with PyTorch, the SageMaker data parallelism library `smdistributed.dataparallel` for distributed training, and the MNIST dataset.

### Dataset
This example uses the MNIST dataset. MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits).

## Prerequisites
You should have a [SageMaker Domain](https://docs.aws.amazon.com/sagemaker/latest/dg/sm-domain.html) configured with at least one User Profile in the domain. The TensorBoard application also requires the following minimum set of permissions for the execution role attached to the User Profile:
* `sagemaker:CreateApp`
* `sagemaker:DeleteApp`
* `sagemaker:DescribeTrainingJob`
* `sagemaker:Search`
* `s3:GetObject`
* `s3:ListBucket`

For more information, see the prerequisites in the [official documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/tensorboard-on-sagemaker.html#debugger-htb-prerequisites) for using TensorBoard with SageMaker.
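The permissions listed above can be expressed as a single IAM policy statement. The following is a minimal sketch rather than an official policy: the helper name `build_policy` is hypothetical, and the wildcard `Resource` is an assumption made for brevity that you would normally narrow to your own bucket and resources.

```python
import json

# Actions taken from the prerequisites list above.
REQUIRED_ACTIONS = [
    "sagemaker:CreateApp",
    "sagemaker:DeleteApp",
    "sagemaker:DescribeTrainingJob",
    "sagemaker:Search",
    "s3:GetObject",
    "s3:ListBucket",
]


def build_policy(actions, resource="*"):
    """Return an IAM policy document that allows `actions` on `resource`.

    The "*" default is an assumption for illustration; scope it down in practice.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": resource}
        ],
    }


policy = build_policy(REQUIRED_ACTIONS)
print(json.dumps(policy, indent=2))
```

You could paste the printed JSON into the IAM console as an inline policy for the execution role, after adjusting the resource scope.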



**NOTE:** This example requires SageMaker Python SDK v2.150 or higher.

In [None]:
!pip install sagemaker --upgrade
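To confirm that the installed SDK meets the v2.150 minimum noted above, you could run a check along these lines. This is a sketch: the helper names `parse_major_minor` and `meets_minimum` are hypothetical, and it assumes the version string starts with a numeric `major.minor` pair.

```python
from importlib import metadata

# Minimum SageMaker Python SDK version stated in the note above.
MINIMUM_SDK_VERSION = (2, 150)


def parse_major_minor(version_string):
    """Return (major, minor) as integers from a string such as '2.150.0'."""
    major, minor = version_string.split(".")[:2]
    return int(major), int(minor)


def meets_minimum(version_string, minimum=MINIMUM_SDK_VERSION):
    """Return True if the version is at least `minimum` (major, minor)."""
    return parse_major_minor(version_string) >= minimum


try:
    installed = metadata.version("sagemaker")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("The sagemaker package is not installed in this environment.")
elif meets_minimum(installed):
    print(f"sagemaker {installed} meets the v2.150 minimum.")
else:
    print(f"sagemaker {installed} is older than v2.150; upgrade and restart the kernel.")
```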

### Initialize SageMaker

In [None]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-sagemaker-tensorboard-pytorch"
region = sagemaker_session.boto_region_name

role = sagemaker.get_execution_role()
# The role name is the last path segment of the role ARN
role_name = role.split("/")[-1]
print(f"The Amazon Resource Name (ARN) of the role used for this demo is: {role}")
print(f"The name of the role used for this demo is: {role_name}")

To verify that the role above has required permissions:

1. Go to the IAM console: https://console.aws.amazon.com/iam/home.
2. Select **Roles**.
3. Enter the role name in the search box to find the role retrieved from the output of the previous code cell. 
4. Select the role.
5. Use the **Permissions** tab to verify that the role has all the required permissions attached.
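As an alternative to the console steps above, the check could be scripted with the IAM policy simulator through boto3's `simulate_principal_policy` call. The sketch below rests on several assumptions: the helper names are hypothetical, the caller needs the `iam:SimulatePrincipalPolicy` permission, and because no resource ARNs are passed, policies scoped to specific buckets may be reported as denied even though they work for your bucket.

```python
# Actions taken from the prerequisites list earlier in this notebook.
REQUIRED_ACTIONS = [
    "sagemaker:CreateApp",
    "sagemaker:DeleteApp",
    "sagemaker:DescribeTrainingJob",
    "sagemaker:Search",
    "s3:GetObject",
    "s3:ListBucket",
]


def find_denied_actions(evaluation_results):
    """Return the action names whose simulated decision is not 'allowed'."""
    return [
        result["EvalActionName"]
        for result in evaluation_results
        if result["EvalDecision"] != "allowed"
    ]


def simulate_role_permissions(role_arn, actions=REQUIRED_ACTIONS):
    """Simulate `actions` for `role_arn` and return the ones that are denied."""
    import boto3  # imported lazily so the helper above runs without AWS access

    iam = boto3.client("iam")
    response = iam.simulate_principal_policy(
        PolicySourceArn=role_arn, ActionNames=list(actions)
    )
    return find_denied_actions(response["EvaluationResults"])


# Hand-written sample in the shape of the simulator's EvaluationResults:
sample_results = [
    {"EvalActionName": "sagemaker:CreateApp", "EvalDecision": "allowed"},
    {"EvalActionName": "s3:ListBucket", "EvalDecision": "implicitDeny"},
]
print(find_denied_actions(sample_results))  # ['s3:ListBucket']
```

In the notebook, calling `simulate_role_permissions(role)` with the ARN from the previous cell would return an empty list if every simulated action is allowed.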

## Prepare a training script to collect model tensors

### About the sample training script that accompanies this example notebook

The sample training script uses the `torchvision.datasets` module to download the MNIST dataset from the public SageMaker dataset S3 bucket. You can see how this is implemented in the `train_pytorch_smdataparallel_mnist.py` training script.

The training script provides the code you need for distributed data parallel (DDP) training using SageMaker's distributed data parallel library (`smdistributed.dataparallel`). For details about how to use `smdistributed.dataparallel`'s DDP in your native PyTorch script, see [Modify a PyTorch Training Script Using SMD Data Parallel](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp.html#data-parallel-modify-sdp-pt).

### Customizing training script for TensorBoard data collection

To activate TensorBoard data collection, you need to explicitly modify the training script to store the model data using the `torch.utils.tensorboard.SummaryWriter` class. Note that this requires TensorBoard to be installed in the training container, so you need to add the `tensorboard` dependency to `requirements.txt`.
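Before launching a job, you might want to confirm that the requirements file actually lists `tensorboard`. The following is a small sketch for that check: the helper name `lists_package` is hypothetical, and it handles only simple requirement lines (a name followed by an optional version specifier, extras, or marker), not every form pip accepts.

```python
def lists_package(requirements_text, package):
    """Return True if `package` appears as a requirement name in the text."""
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # Cut the line at the first version specifier, extras bracket, or marker.
        name = line
        for separator in ("==", ">=", "<=", "~=", "!=", "<", ">", "[", ";", " "):
            name = name.split(separator, 1)[0]
        if name.strip().lower() == package.lower():
            return True
    return False


sample_requirements = "torchvision\ntensorboard>=2.9  # needed by SummaryWriter\n"
print(lists_package(sample_requirements, "tensorboard"))
```

With the layout used in this example, you could pass the contents of `code/requirements.txt` (for instance via `pathlib.Path("code/requirements.txt").read_text()`) instead of the sample string.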

When collecting the model tensors from a distributed training job, using the `SummaryWriter` on multiple ranks can write duplicate data, so make sure to create the `SummaryWriter` on a single rank only. The following code snippet shows the necessary changes to set up the `SummaryWriter` class and collect model tensors.


```
from torch.utils.tensorboard import SummaryWriter

...
args.rank = rank = dist.get_rank()
args.local_rank = local_rank = int(os.getenv("LOCAL_RANK", -1))
...
# Create the writer only on the first process; every other rank gets None
writer = SummaryWriter("/opt/ml/output/tensorboard") if rank == 0 and local_rank == 0 else None
...

def train(args, model, device, train_loader, optimizer, epoch):
    ...
    # Only write on rank 0 to avoid duplicate entries
    if writer:
        writer.add_scalar("example_metric", example_metric)

# Close the writer at the end of training (it is None on all other ranks)
if writer:
    writer.close()
```
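The snippet above guards writer calls with `if writer:`. One alternative, sketched here and not taken from the sample script, is to hand every non-writing rank a no-op object so that call sites need no guard. `NullWriter`, `make_writer`, and `RecordingWriter` are hypothetical names; `RecordingWriter` stands in for `SummaryWriter` only so the sketch runs without PyTorch installed.

```python
class NullWriter:
    """Stand-in that accepts SummaryWriter-style calls and does nothing."""

    def add_scalar(self, *args, **kwargs):
        pass

    def close(self):
        pass


def make_writer(rank, local_rank, writer_factory):
    """Create a real writer on the first process only; a no-op elsewhere."""
    if rank == 0 and local_rank == 0:
        return writer_factory()
    return NullWriter()


class RecordingWriter:
    """Hypothetical substitute for SummaryWriter that stores scalars in memory."""

    def __init__(self):
        self.scalars = []

    def add_scalar(self, tag, value, global_step=None):
        self.scalars.append((tag, value, global_step))

    def close(self):
        pass


# Simulate four single-GPU processes; only rank 0 records anything.
writers = [make_writer(rank, rank, RecordingWriter) for rank in range(4)]
for step, loss in enumerate([0.9, 0.5, 0.3]):
    for writer in writers:
        writer.add_scalar("train/loss", loss, global_step=step)
for writer in writers:
    writer.close()

print(len(writers[0].scalars))
```

In a real training script, the factory would be something like `lambda: SummaryWriter("/opt/ml/output/tensorboard")`, and each process would create one writer from its own rank values.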
To see the full training script, run the cell below:

In [None]:
!pygmentize code/train_pytorch_smdataparallel_mnist.py

## Launch a training job using the SageMaker PyTorch estimator class with the TensorBoard output configuration

To instruct SageMaker to find the local path where the model tensors are saved and upload the tensor data to an Amazon S3 bucket, configure the `sagemaker.debugger.TensorBoardOutputConfig` class as follows.


In [None]:
from sagemaker.debugger import TensorBoardOutputConfig

tensorboard_s3_output_path = "s3://{}/{}".format(bucket, prefix)

# Create a TensorBoardOutputConfig object to configure automatic data upload from
# the training job's local data storage to S3
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=tensorboard_s3_output_path,
    container_local_output_path="/opt/ml/output/tensorboard",
)

The following two code cells show how to set the `tensorboard_output_config` parameter and other necessary parameters for the SageMaker PyTorch estimator class, and launch a training job by running the `estimator.fit()` method. Also note that you need to pass the `requirements.txt` file to the `SAGEMAKER_REQUIREMENTS` environment variable through the estimator. SageMaker checks the requirements and installs packages (the `tensorboard` package in this case) listed in the file.


In [None]:
from sagemaker.pytorch import PyTorch

env = {
    "SAGEMAKER_REQUIREMENTS": "requirements.txt",  # tensorboard dependency required for PyTorch
}

estimator = PyTorch(
    base_job_name="pytorch-tensorboard-dataparallel-mnist",
    source_dir="code",
    entry_point="train_pytorch_smdataparallel_mnist.py",
    role=role,
    framework_version="1.11.0",
    py_version="py38",
    instance_count=1,
    env=env,
    instance_type="ml.p3.16xlarge",
    sagemaker_session=sagemaker_session,
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    # Configure TensorBoard output config in estimator to enable automatic sync in training job
    tensorboard_output_config=tensorboard_output_config,
)

In [None]:
estimator.fit(wait=False)

## Track the training job in TensorBoard

Once the job starts, use the `TensorBoardApp().get_app_url()` API from the `sagemaker.interactive_apps.tensorboard` module to generate a URL to the TensorBoard application. Passing the training job name adds your training job directly to the TensorBoard application.

If you are running this notebook in SageMaker Studio, the `TensorBoardApp().get_app_url()` API generates a direct URL to the TensorBoard application. Otherwise, it returns a URL to the Amazon SageMaker console, where you can open your SageMaker Domain and then choose the Domain user profile under which you want the TensorBoard session to run.

For more information on how to configure TensorBoard on SageMaker, see [Use TensorBoard to Debug and Analyze Training Jobs in Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/tensorboard-on-sagemaker.html).

In [None]:
from sagemaker.interactive_apps.tensorboard import TensorBoardApp

app = TensorBoardApp(region)

print("Navigate to the following URL:")
# Use the public attribute for the job name rather than the private _current_job_name
print(app.get_app_url(training_job_name=estimator.latest_training_job.name))
print("Data may not appear until job is started and emitting data")
estimator.logs()

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/training|distributed_training|pytorch|data_parallel|mnist|pytorch_smdataparallel_mnist_demo.ipynb)
