## BERT Pretraining using Trainium

This tutorial explains how to run BERT pretraining on AWS Trainium with Amazon SageMaker. It demonstrates the steps required for multi-node, multi-accelerator training.

This example covers the following activities:

1. Download the Wiki data needed for training and upload it to S3.
2. Run the model compilation and save the result in S3. This step is optional, but it is recommended for long-running jobs that might need multiple restarts.
3. Train the model efficiently on a multi-node, multi-accelerator setup.

## 1. Download Training Data

For this example we will use data that has already been tokenized and sharded. You can substitute your own data if needed. We will use wiki corpus data that is tokenized and sharded with sequence length 128. Tokenized wiki data with sequence length 512 can also be used.

To understand how the data is created, please refer to:

https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/README.md#getting-the-data

In [None]:
# Let's begin by upgrading the SageMaker SDK to the latest version
!pip install --upgrade sagemaker

In [None]:
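# To use Trainium, the SageMaker SDK version must be at least 2.116.0
import sagemaker
from packaging import version

# Compare parsed versions; a plain string comparison would rank "2.20.0" above "2.116.0"
assert version.parse(sagemaker.__version__) >= version.parse("2.116.0")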

#### Download and uncompress the training data.

We will download pre-created tokenized data and use it. The data is about 48 GB, so the download will take some time to finish.

In [None]:
! aws s3 cp s3://neuron-s3/training_datasets/bert_pretrain_wikicorpus_tokenized_hdf5/bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar . --no-sign-request
! tar -xf bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar
!rm bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar
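
Before uploading roughly 48 GB to S3, it can be worth confirming that the archive extracted as expected. The cell below is a minimal sketch that counts the extracted files and sums their sizes. The helper name `summarize_shards` is our own and is not part of the original example.

```python
from pathlib import Path


def summarize_shards(data_dir):
    """Return (file_count, total_bytes) for all regular files under data_dir."""
    files = [p for p in Path(data_dir).rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), total_bytes


if __name__ == "__main__":
    # Directory name produced by extracting the tar archive above.
    data_dir = "bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128"
    if Path(data_dir).is_dir():
        count, size = summarize_shards(data_dir)
        print(f"{count} files, {size / 1e9:.1f} GB")
```

If the reported size is far below the expected total, the download or extraction was probably interrupted.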

### Initialize SageMaker and Upload the training data to our S3 Bucket

In [None]:
%%time
import os

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFace

# Provide a pre-existing role ARN as an alternative to creating a new role
role = get_execution_role()
print(f"SageMaker Execution Role: {role}")

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]
print(f"AWS account: {account}")

session = boto3.session.Session()
region = session.region_name
print(f"AWS region: {region}")

sm_boto_client = boto3.client("sagemaker")
sagemaker_session = sagemaker.session.Session(boto_session=session)

In [None]:
# Provide a bucket name here if you don't want to use the default bucket
sagemaker_session_bucket = None

if sagemaker_session_bucket is None and sagemaker_session is not None:
    # Fall back to the session's default bucket if no bucket name is given
    sagemaker_session_bucket = sagemaker_session.default_bucket()

#### Upload the data to our S3 bucket.

In [None]:
# Upload the extracted data directory to S3

train_path_128 = sagemaker_session.upload_data(
    "bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128",
    sagemaker_session_bucket,
    "train/wiki128",
)

In [None]:
print(f"Training data uploaded to: {train_path_128}")

## 2. Compile the model using Neuron SDK

The compilation job runs the training script for a small number of steps and creates a Neuron cache, which the full training job then reuses.

PyTorch Neuron evaluates operations lazily during execution of the training loops. It builds a symbolic graph in the background, and the graph is executed in hardware only when a tensor is printed, transferred to CPU, or `xm.mark_step()` is encountered (`xm.mark_step()` is implicitly called by `pl.MpDeviceLoader`/`pl.ParallelLoader`).

During execution of the training loops, PyTorch Neuron can build multiple graphs depending on the number of conditional paths taken. For BERT-Large pretraining, PyTorch Neuron builds multiple unique graphs that should be compiled before running on the NeuronCores. PyTorch Neuron will compile those graphs only if they are not in the XLA in-memory cache or the persistent cache. To reduce the compilation time of these graphs, you can pre-compile those graphs using the utility neuron_parallel_compile (provided by the libneuronxla package, a transitive dependency of torch-neuronx).
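
To make the idea of lazy tracing and graph caching more concrete, here is a toy model in plain Python. It is only a conceptual sketch and not how PyTorch Neuron or XLA is implemented. The class `ToyLazyDevice` and the op names are hypothetical. Operations are only recorded until `mark_step()` is called, and a graph is "compiled" only when its hash is missing from the cache.

```python
import hashlib


class ToyLazyDevice:
    """Toy model of lazy tracing with a compile cache (not the Neuron runtime)."""

    def __init__(self):
        self.pending_ops = []   # ops recorded since the last mark_step()
        self.compiled = {}      # graph hash -> "compiled" graph
        self.compile_count = 0

    def record(self, op_name):
        # Recording an op only extends the symbolic graph; nothing runs yet.
        self.pending_ops.append(op_name)

    def mark_step(self):
        # Cut the graph here and compile it only on a cache miss.
        key = hashlib.sha256("|".join(self.pending_ops).encode()).hexdigest()
        if key not in self.compiled:
            self.compiled[key] = tuple(self.pending_ops)
            self.compile_count += 1
        self.pending_ops = []
        return key


def run_steps(device, num_microsteps, grad_accum_usteps):
    for step in range(1, num_microsteps + 1):
        device.record("forward")
        device.record("backward")
        if step % grad_accum_usteps == 0:
            # Conditional path: the optimizer update produces a different graph.
            device.record("optimizer_step")
        device.mark_step()


device = ToyLazyDevice()
run_steps(device, num_microsteps=128, grad_accum_usteps=32)
print(device.compile_count)  # 2 unique graphs, although 128 steps ran
```

In this toy, the conditional optimizer update is what creates the second graph. A real training loop can have more such paths, which is why pre-compiling all of them ahead of time saves time.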

In [None]:
instance_type = "ml.trn1.32xlarge"
instance_count = 1

In [None]:
hyperparameters = {
    "batch_size": 16,
    "grad_accum_usteps": 32,
    "data_dir": "/opt/ml/input/data/training/",  # path where SageMaker copies the data from S3
    "output_dir": "/opt/ml/model",
}
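
SageMaker typically passes each entry of the `hyperparameters` dict to the entry point as a `--name value` command-line argument. The scripts in `code/` are not shown in this notebook, so the parser below is a hypothetical sketch of how an entry point could read these values. The names `build_parser` and `example_hyperparameters` are ours.

```python
import argparse


def build_parser():
    # Hypothetical parser; argument names mirror the hyperparameters dict above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--grad_accum_usteps", type=int, default=32)
    parser.add_argument("--data_dir", type=str, default="/opt/ml/input/data/training/")
    parser.add_argument("--output_dir", type=str, default="/opt/ml/model")
    return parser


# Simulate the command line that would be built from the dict.
example_hyperparameters = {
    "batch_size": 16,
    "grad_accum_usteps": 32,
    "data_dir": "/opt/ml/input/data/training/",
    "output_dir": "/opt/ml/model",
}
argv = []
for name, value in example_hyperparameters.items():
    argv += [f"--{name}", str(value)]

# parse_known_args tolerates extra arguments the parser does not declare.
args, unknown = build_parser().parse_known_args(argv)
print(args.batch_size, args.grad_accum_usteps, args.data_dir)
```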

In [None]:
checkpoint_s3 = "s3://" + sagemaker_session_bucket + "/trainium/bert/cache"

In [None]:
from sagemaker.pytorch import PyTorch

smp_estimator = PyTorch(
 entry_point="compile.sh",
 source_dir="code",
 role=role,
 instance_type=instance_type,
 volume_size=1024,
 instance_count=instance_count,
 sagemaker_session=sagemaker_session,
 framework_version="1.11.0",
 py_version="py38",
 hyperparameters=hyperparameters,
 checkpoint_local_path="/opt/ml/checkpoints",
 checkpoint_s3_uri=checkpoint_s3,
 debugger_hook_config=False,
 disable_profiler=True,
)

In [None]:
smp_estimator.fit(inputs={"training": train_path_128}, logs=True)

This preliminary step performs a fast trial run of the training script to build the graphs, compiles those graphs in parallel using multiple Neuron compiler processes, and populates the on-disk persistent cache with the compiled graphs. The actual training run is then faster because the compiled graphs are loaded from the persistent cache.

## 3. Train the model

After running the pre-compilation step, continue with the actual pretraining by running the following cells, which launch 32 data-parallel distributed training workers on trn1.32xlarge. The SageMaker PyTorch estimator supports torchrun, which runs a separate process for each NeuronCore available in the training cluster.
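
Each process started by torchrun can find out which worker it is from the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables that torchrun sets. The helper below is a small sketch of reading them. The function name `worker_identity` is ours, and the training script in `code/` may do this differently.

```python
import os


def worker_identity(env=None):
    """Read the rank variables that torchrun exports for each worker process."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


# Example: the last of 32 workers on a single trn1.32xlarge instance.
example_env = {"RANK": "31", "LOCAL_RANK": "31", "WORLD_SIZE": "32"}
print(worker_identity(example_env))
```

Outside of torchrun the variables are unset, so the helper falls back to a single-worker identity.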

We will pass the compiled model as an input channel so that the training job uses it instead of recompiling the model.

In [None]:
hyperparameters = {
    "batch_size": 16,
    "grad_accum_usteps": 32,
    "data_dir": "/opt/ml/input/data/training/",  # path where SageMaker copies the data from S3
    "output_dir": "/opt/ml/model",
    "cache_dir": "/opt/ml/input/data/cache/",  # the compiled model is copied to this path
    "max_steps": 200,
}
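
As a rough check on what these hyperparameters imply, the sketch below computes the effective global batch size per optimizer update. It assumes that `batch_size` is the per-worker micro-batch size and that `grad_accum_usteps` is the number of micro-steps accumulated before each update. This is an assumption about `bert_pretrain.py`, which is not shown here. The helper name `global_batch_size` is ours.

```python
def global_batch_size(batch_size, grad_accum_usteps, workers_per_instance, instance_count):
    """Samples contributing to one optimizer update across all data-parallel workers."""
    world_size = workers_per_instance * instance_count
    return batch_size * grad_accum_usteps * world_size


# 32 workers on one trn1.32xlarge, as described above.
effective = global_batch_size(16, 32, workers_per_instance=32, instance_count=1)
print(effective)  # 16384
```

Under the same assumption, raising `instance_count` without changing the other values scales the global batch size proportionally.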

In [None]:
from sagemaker.pytorch import PyTorch

smp_estimator = PyTorch(
 entry_point="bert_pretrain.py",
 source_dir="code",
 role=role,
 instance_type=instance_type,
 volume_size=512,
 instance_count=instance_count,
 sagemaker_session=sagemaker_session,
 framework_version="1.11.0",
 py_version="py38",
 hyperparameters=hyperparameters,
 debugger_hook_config=False,
 disable_profiler=True,
 distribution={"torch_distributed": {"enabled": True}},
)

In [None]:
smp_estimator.fit(
 inputs={"training": train_path_128, "cache": checkpoint_s3}, logs=True
)
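
The channel names used in `fit` determine where the data appears inside the training container. Each channel is made available under `/opt/ml/input/data/<channel_name>`, which is why `data_dir` and `cache_dir` point to the `training` and `cache` subdirectories. The sketch below shows this mapping. The helper name `channel_dir` is ours, and it also checks the `SM_CHANNEL_<NAME>` environment variable that the SageMaker training toolkit sets.

```python
import os


def channel_dir(channel_name, env=None):
    """Return the container path for a SageMaker input channel."""
    env = os.environ if env is None else env
    default = f"/opt/ml/input/data/{channel_name}"
    # Prefer the environment variable when the training toolkit provides it.
    return env.get(f"SM_CHANNEL_{channel_name.upper()}", default)


print(channel_dir("training", env={}))  # /opt/ml/input/data/training
print(channel_dir("cache", env={}))     # /opt/ml/input/data/cache
```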

Congratulations! We have successfully trained a BERT model using AWS Trainium and Amazon SageMaker.