# Train and Deploy GPT-J-6B model using Tensor Parallelism approach within SageMaker Model Parallel Library

In this notebook you will learn how to use the tensor parallelism feature provided by the [SageMaker Model Parallelism Library](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html) to train [EleutherAI's](https://www.eleuther.ai/) [GPT-J](https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/) model using the [GLUE/sst2 dataset](https://huggingface.co/datasets/glue/viewer/sst2/train).

EleutherAI released GPT-J 6B, an open-source alternative to [OpenAI's GPT-3](https://openai.com/blog/gpt-3-apps/). [GPT-J 6B](https://huggingface.co/EleutherAI/gpt-j-6B) is the 6-billion-parameter successor to EleutherAI's GPT-Neo family, a family of transformer-based language models based on the GPT architecture for text generation.

EleutherAI's primary goal is to train a model that is equivalent in size to GPT-3 and make it available to the public under an open license.
Over the last few months, GPT-J has gained a lot of interest from researchers, data scientists, and software developers, but fine-tuning it has remained very challenging.

The weights of the 6-billion-parameter model take about 24 GB of memory in float32. To load the model in float32, you need at least 2x the model size in CPU RAM: 1x for the initial weights and another 1x to load the checkpoint. In addition to the model parameters, the gradients, optimizer states, and activations also take memory, so the actual memory usage can be significantly higher than 48 GB. For example, with the Adam optimizer and FP32 training, the parameters, gradients, and optimizer states alone can use 96 GB or more, and the activation memory footprint can be even larger, so the total memory usage can easily exceed 200 GB.
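
The arithmetic behind these figures can be sketched as follows. This is a rough estimate only, assuming 4 bytes per FP32 value, two Adam state values per parameter, and 1 GB = 10^9 bytes. The function name `estimate_training_memory_gb` is hypothetical, and activation memory is left out because it depends on batch size and sequence length.

```python
def estimate_training_memory_gb(num_params, bytes_per_value=4):
    """Rough FP32 memory estimate (in GB) for weights, gradients, and Adam states."""
    gb = 1e9  # assumption: decimal gigabytes
    weights = num_params * bytes_per_value / gb
    gradients = num_params * bytes_per_value / gb
    # Adam keeps two extra values per parameter (first and second moment estimates).
    optimizer_states = 2 * num_params * bytes_per_value / gb
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer_states": optimizer_states,
        "total": weights + gradients + optimizer_states,
    }


estimate = estimate_training_memory_gb(6_000_000_000)
for name, size_gb in estimate.items():
    print(f"{name}: {size_gb:.0f} GB")
```

Under these assumptions the weights, gradients, and optimizer states add up to 96 GB before any activations are counted.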

![GPT-J Memory requirements](img/GPT-J-Memory.png)

We will walk you through running FP16 training and fine-tuning GPT-J using Amazon SageMaker and Hugging Face on NVIDIA GPU instances. The notebook demonstrates the tensor parallelism approach of the SageMaker model parallel library.

This notebook depends on the following files and folders:

1. `train_gptj_smp_tensor_parallel_script.py`: This is an entry-point script that is passed to the PyTorch estimator in the notebook instructions. This script is responsible for end to end training of the GPT-J model with SMP. The script has additional comments at places where the SMP API is used.
<!-- 2. `fp16`: This folder is used for 16-bit float training, which contains a fp16 optimizer and various fp16 utilities. -->
3. `learning_rates.py`: This contains the functions for learning rate schedule.
4. `requirements.txt`: This will install the dependencies, like the right version of huggingface transformers.
5. `memory_tracker.py`: This contains a function to print the memory status.


## SageMaker Distributed Training 

SageMaker provides distributed training libraries for data parallelism and model parallelism. The libraries are optimized for the SageMaker training environment, help adapt your distributed training jobs to SageMaker, and improve training speed and throughput.

### Approaches

![SageMaker Distributed Training Approaches](img/TypesOfDistributedTraining.png)


### SageMaker Model Parallel

Model parallelism is the process of splitting a model up between multiple devices or nodes (such as GPU-equipped instances) and creating an efficient pipeline to train the model across these devices to maximize GPU utilization.

Increasing deep learning model size (layers and parameters) can result in better accuracy. However, there is a limit to the maximum model size you can fit in a single GPU. When training deep learning models, GPU memory limitations can be a bottleneck in the following ways:

1. They can limit the size of the model you train. Given that larger models tend to achieve higher accuracy, this directly translates to trained model accuracy.

2. They can limit the batch size you train with, leading to lower GPU utilization and slower training.

To overcome the limitations associated with training a model on a single GPU, you can use model parallelism to distribute and train your model on multiple computing devices.

### Core features of SageMaker Model Parallel 

1. [Automated Model Splitting](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html): When you use SageMaker's model parallel library, you can take advantage of automated model splitting, also referred to as automated model partitioning. The library uses a partitioning algorithm that balances memory, minimizes communication between devices, and optimizes performance. You can configure the automated partitioning algorithm to optimize for speed or memory.

2. [Pipeline Execution Schedule](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html): A core feature of SageMaker's distributed model parallel library is pipelined execution, which determines the order in which computations are made and data is processed across devices during model training. Pipelining is a technique to achieve true parallelization in model parallelism, by having the GPUs compute simultaneously on different data samples, and to overcome the performance loss due to sequential computation.

Pipelining is based on splitting a mini-batch into microbatches, which are fed into the training pipeline one-by-one and follow an execution schedule defined by the library runtime. A microbatch is a smaller subset of a given training mini-batch. The pipeline schedule determines which microbatch is executed by which device for every time slot.
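
The idea of microbatches and time slots can be illustrated with a small sketch. This is not the schedule the library runtime uses: it is a simplified, forward-only schedule in which each device handles one pipeline stage, and the names `split_into_microbatches` and `forward_pipeline_schedule` are hypothetical.

```python
def split_into_microbatches(mini_batch, num_microbatches):
    """Split a mini-batch (a list of samples) into equally sized microbatches."""
    if len(mini_batch) % num_microbatches != 0:
        raise ValueError("mini-batch size must be divisible by the number of microbatches")
    size = len(mini_batch) // num_microbatches
    return [mini_batch[i * size : (i + 1) * size] for i in range(num_microbatches)]


def forward_pipeline_schedule(num_devices, num_microbatches):
    """Return, for every time slot, a mapping of device index -> microbatch index.

    Device d can start microbatch m only after device d - 1 has finished it,
    so device d works on microbatch (slot - d) when that index is valid.
    """
    schedule = []
    for slot in range(num_devices + num_microbatches - 1):
        active = {
            device: slot - device
            for device in range(num_devices)
            if 0 <= slot - device < num_microbatches
        }
        schedule.append(active)
    return schedule


microbatches = split_into_microbatches(list(range(8)), num_microbatches=4)
schedule = forward_pipeline_schedule(num_devices=2, num_microbatches=4)
for slot, active in enumerate(schedule):
    work = ", ".join(f"device {d} -> microbatch {m}" for d, m in active.items())
    print(f"slot {slot}: {work}")
```

After the first slot, both devices are busy at the same time on different microbatches, which is the parallelism that pipelining is meant to recover.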

In addition to its [core features](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html), the SageMaker distributed model parallel library offers [memory-saving features](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch.html) for training deep learning models with PyTorch: [tensor parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html), [optimizer state sharding](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-optimizer-state-sharding.html), [activation checkpointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-checkpointing.html), and [activation offloading](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-offloading.html). 

### SageMaker Model Parallel configuration

Please refer to all the [configuration parameters](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) related to SageMaker Distributed Training.

As we are going to use PyTorch and Hugging Face for training GPT-J, it is important to understand all the SageMaker Distributed configuration parameters specific to PyTorch [here](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html#pytorch-specific-parameters).

#### Important

`processes_per_host` must not be greater than the number of GPUs per instance and typically will be equal to the number of GPUs per instance.
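
A minimal sketch of this rule as a pre-flight check is shown below. The GPU counts are the ones this notebook uses later when it sets `processes_per_host`; the helper `check_processes_per_host` is hypothetical and is not part of the SageMaker SDK.

```python
# GPU counts for the instance types referenced in this notebook.
GPUS_PER_INSTANCE = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.g5.48xlarge": 8,
    "ml.p4d.24xlarge": 8,
    "ml.p2.16xlarge": 16,
}


def check_processes_per_host(instance_type, processes_per_host, instance_count=1):
    """Validate processes_per_host and return the total number of processes."""
    gpus = GPUS_PER_INSTANCE[instance_type]
    if processes_per_host > gpus:
        raise ValueError(
            f"processes_per_host={processes_per_host} exceeds the {gpus} GPUs "
            f"available on {instance_type}"
        )
    return processes_per_host * instance_count


total_processes = check_processes_per_host("ml.p3.16xlarge", processes_per_host=8)
print("total processes:", total_processes)
```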

#### SageMaker Tensor Parallel

Tensor parallelism splits individual layers, or nn.Modules, across devices, to be run in parallel. The following figure shows the simplest example of how the library splits a model with four layers to achieve two-way tensor parallelism ("tensor_parallel_degree": 2). The layers of each model replica are bisected and distributed into two GPUs. In this example case, the model parallel configuration also includes "pipeline_parallel_degree": 1 and "ddp": True (uses PyTorch DistributedDataParallel package in the background), so the degree of data parallelism becomes eight. The library manages communication across the tensor-distributed model replicas.

![SageMaker tensor parallelism](img/smdmp-tensor-parallel-only.png)

This feature is useful because you can apply tensor parallelism to specific layers or to a subset of layers. To dive deep into tensor parallelism and other memory-saving features for PyTorch, and to learn how to set a combination of pipeline and tensor parallelism, see [Extended Features of the SageMaker Model Parallel Library for PyTorch](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch.html).
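
To make the idea of splitting a single layer concrete, here is a toy sketch of a linear layer whose weight matrix is split column-wise across two "devices". It only illustrates the arithmetic and is not how the library implements tensor parallelism: plain Python lists stand in for tensors, and the helper names are hypothetical.

```python
def matmul(x, w):
    """Multiply matrix x (rows x k) by matrix w (k x cols), both as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, degree):
    """Split the columns of w into `degree` equally sized shards."""
    n_cols = len(w[0])
    if n_cols % degree != 0:
        raise ValueError("number of columns must be divisible by the degree")
    step = n_cols // degree
    return [[row[p * step : (p + 1) * step] for row in w] for p in range(degree)]


def tensor_parallel_linear(x, w, degree):
    """Compute x @ w by letting each shard produce a slice of the output columns."""
    partial_outputs = [matmul(x, shard) for shard in split_columns(w, degree)]
    # Concatenate the per-shard outputs along the column dimension.
    return [sum((part[i] for part in partial_outputs), []) for i in range(len(x))]


inputs = [[1, 2], [3, 4]]
weights = [[1, 0, 2, 1], [0, 1, 1, 3]]
sharded_output = tensor_parallel_linear(inputs, weights, degree=2)
print(sharded_output)
print(sharded_output == matmul(inputs, weights))
```

Each shard holds only half of the weight columns, so each device stores and multiplies a smaller matrix, and the concatenated result matches the unsplit layer.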



#### Additional Resources
If you are a new user of Amazon SageMaker, you may find the following helpful to learn more about SMP and using SageMaker with PyTorch.

1. To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).

2. To learn more about using the SageMaker Python SDK with PyTorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html).

3. To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).

<div class="alert alert-block alert-info">
<b>Kernel Selection</b> <br>
Please run this notebook using either the <code>Data Science, Python 3</code> Kernel on SageMaker Studio Notebook or a <code>conda_pytorch_p38</code> Kernel on SageMaker Notebook instances.
</div>

## Install and Upgrade Libraries

The SageMaker model parallelism library's tensor parallelism feature requires the SageMaker Python SDK and the SageMaker Experiments library. Run the following cell to install or upgrade the libraries.

<div class="alert alert-block alert-warning">
<b>Kernel Restart:</b> To finish applying the changes, you must restart the kernel.
</div>

In [None]:
# run once, restart kernel, then comment out this cell
%pip install -qU pip
%pip install -qU "sagemaker>=2,<3"
%pip install -qU sagemaker-experiments
%pip install -qU transformers datasets

import IPython

IPython.Application.instance().kernel.do_shutdown(True)

>**Note:** After you run the above cell, comment it out for future runs.

Import the SageMaker Python SDK and check that its version was successfully updated to the latest version.

#### Check SageMaker Version

In [None]:
import sagemaker

print(sagemaker.__version__)

## Amazon SageMaker Initialization

Throughout this example, you'll use a training script for the GPT-J model and a small text dataset.

Run the following cell to import SageMaker modules and retrieve information of your current SageMaker work environment: your AWS account ID, the AWS Region you are using to run the notebook, and the ARN of your Amazon SageMaker execution role.

In [None]:
%%time
import os

from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFace
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
import boto3

role = (
    get_execution_role()
)  # provide a pre-existing role ARN as an alternative to creating a new role
print(f"SageMaker Execution Role:{role}")

client = boto3.client("sts")
account = client.get_caller_identity()["Account"]
print(f"AWS account:{account}")

session = boto3.session.Session()
region = session.region_name
print(f"AWS region:{region}")

sm_boto_client = boto3.client("sagemaker")

sagemaker_session = sagemaker.session.Session(boto_session=session)


# get default bucket
default_bucket = sagemaker_session.default_bucket()
print()
print("Default bucket for this session: ", default_bucket)

_This completes the SageMaker setup._

## Download and Prepare glue/sst2 Data
In this section you will download and prepare the glue/sst2 dataset, then copy the files to S3. This is done because `train_gptj_smp_tensor_parallel_script.py` requires an already tokenized dataset, provided either as S3 input or as paths in an FSx file system. However, FSx for Lustre configuration is not part of this notebook.

The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing Natural Language Understanding (NLU) systems. For additional information, see [GLUE](https://gluebenchmark.com/).

The Stanford Sentiment Treebank (glue/sst2) consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. It uses the two-way (positive/negative) class split, with only sentence-level labels. For further information, see [glue/sst2](https://nlp.stanford.edu/sentiment/index.html).

We are now ready to prepare the data for training and fine-tuning the GPT-J model.

### 0. Import Libraries and Specify Parameters

Datasets is a library for easily accessing and sharing data for various applications including Natural Language Processing (NLP), Computer Vision and Audio.

In [None]:
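# Import the Hugging Face datasets library used to load the GLUE dataset.
# `load_metric` is not imported: it is unused here and is not available in recent `datasets` releases.
import datasets
from datasets import load_dataset, load_from_disk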

Since the training script is written in PyTorch, you need to import the PyTorch estimator class.

In [None]:
from sagemaker.pytorch import PyTorch
import transformers
import logging

from transformers import (
    AutoTokenizer,
)

from transformers.testing_utils import CaptureLogger

In [None]:
logger = logging.getLogger(__name__)

Next, you define the dataset configuration.

In [None]:
data_config = {
    "dataset_name": "glue",
    "dataset_config_name": "sst2",
    "do_train": True,
    "do_eval": True,
    "cache_dir": "tmp",
}

### 1. Load data
This section loads the dataset and splits it to training and validation datasets.

In [None]:
raw_datasets = load_dataset(
    data_config["dataset_name"],
    data_config["dataset_config_name"],
)

In [None]:
# If the dataset has no validation split, use the first 5% of the training data for validation
if "validation" not in raw_datasets.keys():
    raw_datasets["validation"] = load_dataset(
        data_config["dataset_name"],
        data_config["dataset_config_name"],
        split="train[:5%]",
        cache_dir=data_config["cache_dir"],
    )

    raw_datasets["train"] = load_dataset(
        data_config["dataset_name"],
        data_config["dataset_config_name"],
        split="train[5%:]",
        cache_dir=data_config["cache_dir"],
    )

### 2. Load tokenizer
Nearly every NLP task requires tokenization. A tokenizer converts your input into smaller units, usually words or subwords, that can be easily ingested by a model. In many cases, the architecture you want to use can be guessed from the name or path of the pretrained model passed to the `from_pretrained()` method. Auto Classes do this job for you, so that you automatically retrieve the relevant model based on the name or path of the pretrained weights, config, or vocabulary. You can learn more about Hugging Face [Auto Classes](https://huggingface.co/docs/transformers/model_doc/auto).

The following cell loads a tokenizer with [AutoTokenizer.from_pretrained()](https://huggingface.co/docs/transformers/v4.19.4/en/autoclass_tutorial#autotokenizer).

In [None]:
# Load a tokenizer using the AutoTokenizer Auto Class
tokenizer_kwargs = {
    "cache_dir": data_config["cache_dir"],
}

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", **tokenizer_kwargs)

### 3. Preprocess data

This section preprocesses the raw sst2 dataset in two steps. First, the dataset is tokenized. Next, the block size is set and the tokenized dataset is concatenated into block-sized chunks.

Define the tokenization (`tokenize_function`) and concatenation (`group_texts`) helper functions.

In [None]:
# Tokenizer helper function
def tokenize_function(examples):
    tok_logger = transformers.utils.logging.get_logger(
        "transformers.tokenization_utils_base"
    )

    with CaptureLogger(tok_logger) as cl:
        output = tokenizer(examples[text_column_name])
        # clm input could be much much longer than block_size
        if "Token indices sequence length is longer than the" in cl.out:
            tok_logger.warning(
                "^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits before being passed to the model."
            )
    return output


# Main data processing function that concatenates all texts from the dataset and generates chunks of block_size.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder. You could pad instead if the model supported it;
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size. This runs even when total_length < block_size,
    # so `result` is always defined.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
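
To see what the chunking step produces, here is a small self-contained sketch on toy token IDs. `chunk_token_ids` is a hypothetical stand-in for the helper above: it takes `block_size` as an argument instead of reading the notebook's global variable, and the toy batch is made up for illustration.

```python
from itertools import chain


def chunk_token_ids(examples, block_size):
    """Concatenate per-sentence token lists and cut them into block_size chunks."""
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling, the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result


toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1, 1]],
}
chunks = chunk_token_ids(toy_batch, block_size=4)
print(chunks["input_ids"])
```

Three sentences with 10 tokens in total become two blocks of 4 tokens, and the last 2 tokens are dropped.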

The following code block tokenizes the raw dataset, sets the block-size to 1024 and concatenates the tokenized dataset.

In [None]:
# Get the column names from the training raw dataset and set the movie review sentence column
column_names = raw_datasets["train"].column_names
text_column_name = "text" if "text" in column_names else column_names[0]

# since this will be pickled to avoid _LazyModule error in Hasher force logger loading before tokenize_function
tok_logger = transformers.utils.logging.get_logger(
    "transformers.tokenization_utils_base"
)

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=1,
    remove_columns=column_names,
    desc="Running tokenizer on dataset",
)


# Cap the block size at 1024 tokens; otherwise use the tokenizer's maximum length.
block_size = tokenizer.model_max_length
if block_size > 1024:
    logger.warning(
        f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). "
        "Picking 1024 instead."
    )
    block_size = 1024

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    #     num_proc=args.preprocessing_num_workers,
    desc=f"Grouping texts in chunks of {block_size}",
)

Next, check to see that training and validation data exist in the tokenized dataset.

In [None]:
# check to see if train and validation is set in the config
if data_config["do_train"]:
    if "train" not in tokenized_datasets:
        raise ValueError("--do_train requires a train dataset")
    train_dataset = lm_datasets["train"]


if data_config["do_eval"]:
    if "validation" not in tokenized_datasets:
        raise ValueError("--do_eval requires a validation dataset")
    eval_dataset = lm_datasets["validation"]

Convert the dataset objects to JSON and set the S3 locations for the tokenized training and validation datasets.

In [None]:
training_dataset_location = None
validation_dataset_location = None


if data_config["do_train"]:
    train_dataset.to_json("./training.json")
    training_dataset_location = "s3://{}/dataset/train/".format(default_bucket)
    object_name_training = "dataset/train/training.json"

if data_config["do_eval"]:
    eval_dataset.to_json("./validation.json")
    validation_dataset_location = "s3://{}/dataset/validation/".format(default_bucket)
    object_name_validation = "dataset/validation/validation.json"
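
The sketch below shows the file layout this step is expected to produce, assuming `Dataset.to_json` is left at its default of writing JSON Lines (one JSON object per line). It uses only the standard library, and the records and helper names are made up for illustration.

```python
import json
import os
import tempfile

# Toy records shaped like the chunked dataset (assumed column names).
records = [
    {"input_ids": [1, 2, 3, 4], "attention_mask": [1, 1, 1, 1], "labels": [1, 2, 3, 4]},
    {"input_ids": [5, 6, 7, 8], "attention_mask": [1, 1, 1, 1], "labels": [5, 6, 7, 8]},
]


def write_json_lines(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_json_lines(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "training.json")
    write_json_lines(path, records)
    loaded = read_json_lines(path)

print(len(loaded), loaded[0]["input_ids"])
```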

Upload the tokenized training and validation datasets to the S3 bucket.

In [None]:
# upload the training and validation data to s3
import boto3

s3_client = boto3.client("s3")

if training_dataset_location is not None:
    response = s3_client.upload_file(
        "./training.json", default_bucket, object_name_training
    )

if validation_dataset_location is not None:
    response = s3_client.upload_file(
        "./validation.json", default_bucket, object_name_validation
    )

In [None]:
# clean up local dir
if data_config["do_train"]:
    command = "rm ./training.json"
    os.system(command)

if data_config["do_eval"]:
    command = "rm ./validation.json"
    os.system(command)

### 4. SageMaker Tensor Parallel Training

In the cells above, you downloaded and preprocessed the glue/sst2 dataset, split it into training and validation datasets, and uploaded the JSON files to an S3 bucket in your AWS account.

<div class="alert alert-block alert-warning">
<b>AWS S3 Region</b><br>
The S3 bucket used to store your training and validation data must be in the same AWS Region as your training jobs.
</div>

You will use these files to train a GPT-J model using SageMaker Tensor Parallel training jobs.

After you successfully run this example tensor parallel training job, you can modify the S3 bucket to where your own dataset is stored.

Set the S3 bucket location for model training input and output.

In [None]:
s3_train_bucket = training_dataset_location
s3_test_bucket = validation_dataset_location
s3_output_bucket = f"s3://sagemaker-{region}-{account}/smp-tensorparallel-outputdir/"

#### Define Data Channels for SageMaker Training

Set the SageMaker training data channels using the s3 locations for training and validation data.

In [None]:
train = sagemaker.inputs.TrainingInput(
    s3_train_bucket, distribution="FullyReplicated", s3_data_type="S3Prefix"
)
test = sagemaker.inputs.TrainingInput(
    s3_test_bucket, distribution="FullyReplicated", s3_data_type="S3Prefix"
)
data_channels = {"train": train, "test": test}

In [None]:
s3_train_bucket

#### Set Model Hyperparameters and Model Options
Define the model hyperparameters, custom [Message Passing Interface (MPI)](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-config.html#data-parallel-config-mpi-custom) options, and any metric definitions.

The following model hyperparameters are particularly noteworthy:

- `"fp16": 1`: half precision is enabled to save GPU memory.
- `"save_final_full_model": 1`: saves a full model from the last step; supported only if sharded data parallelism is disabled.
- `"manual_partition": 0`: manual partitioning is disabled, which means the library splits the model automatically.
- `"shard_optimizer_state": 1`: the optimizer state is sharded across the data parallel group.
- `"activation_checkpointing"`: when set to 1, clears the activations of certain layers to reduce memory usage. The cell below sets it to 0 (disabled).

In [None]:
hyperparameters = {
    "max_steps": 100,
    "seed": 12345,
    "fp16": 1,
    "lr": 2.0e-4,
    "lr_decay_iters": 125000,
    "min_lr": 0.00001,
    "lr-decay-style": "linear",
    "warmup": 0.01,
    "num_kept_checkpoints": 1,
    "checkpoint_freq": 200,
    "validation_freq": 1000,
    "logging_freq": 10,
    "save_final_full_model": 1,
    "manual_partition": 0,
    "shard_optimizer_state": 1,
    "activation_checkpointing": 0,
    "activation_strategy": "each",
    "optimize": "speed",
    # below flag loads model and optimizer state from checkpoint_s3_uri
    # 'load_partial': 1,
}

mpioptions = "-x NCCL_DEBUG=WARN -x SMDEBUG_LOG_LEVEL=ERROR "
mpioptions += "-x SMP_DISABLE_D2D=1 -x SMP_D2D_GPU_BUFFER_SIZE_BYTES=1 -x SMP_NCCL_THROTTLE_LIMIT=1 "
mpioptions += "-x FI_EFA_USE_DEVICE_RDMA=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1"

metric_definitions = [
    {"Name": "base_metric", "Regex": "<><><><><><>"}
]  # Add your custom metric definitions
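
To give a feel for how `lr`, `min_lr`, `lr_decay_iters`, and `warmup` might interact, here is a sketch of a linear warmup followed by a linear decay. This is an assumption for illustration only: the actual schedule lives in `learning_rates.py` and may differ. In particular, the sketch assumes that `warmup` is a fraction of `lr_decay_iters`, and the function name `linear_lr` is hypothetical.

```python
def linear_lr(step, max_lr=2.0e-4, min_lr=1.0e-5, decay_iters=125000, warmup=0.01):
    """Linear warmup to max_lr, then linear decay that never goes below min_lr."""
    warmup_iters = round(warmup * decay_iters)  # assumption: warmup is a fraction
    if warmup_iters > 0 and step <= warmup_iters:
        return max_lr * step / warmup_iters
    remaining = max(decay_iters - step, 0)
    return max(max_lr * remaining / decay_iters, min_lr)


for step in (0, 100, 1250, 62500, 125000):
    print(f"step {step}: lr = {linear_lr(step):.2e}")
```

Under these assumptions the warmup lasts 1,250 steps, so a run with `max_steps` set to 100 would finish while the learning rate is still ramping up.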

#### Set the model configuration below.
<div class="alert alert-block alert-warning">
<b>GPU Requirements</b><br>
GPT-J-6B requires at least one <code>g5.48xlarge</code>, <code>p3dn.24xlarge</code>, or <code>p4d.24xlarge</code> instance (or multiple smaller GPU instances). You can use a single smaller GPU instance with the GPT-J-XL 1.5B-parameter model.
</div> 

In [None]:
model_config = "gpt-j-xl"

if model_config == "gpt-j-6B":
    model_params = {
        "tensor_parallel_degree": 8,
        "pipeline_parallel_degree": 1,
        # If sharded_data_parallel_degree > 1, tensor_parallel_degree and pipeline_parallel_degree
        # must be 1 and shard_optimizer_state must be disabled.
        "sharded_data_parallel_degree": 1,
        "train_batch_size": 8,
        "val_batch_size": 8,
        "prescaled_batch": 1,
        "max_context_width": 2048,
        "use_distributed_transformer": 1,
        "finetune_6b": 1,
    }
elif model_config == "gpt-j-xl":
    model_params = {
        "tensor_parallel_degree": 8,
        "pipeline_parallel_degree": 1,
        "sharded_data_parallel_degree": 1,
        "train_batch_size": 4,
        "val_batch_size": 4,
        "prescaled_batch": 1,
        "hidden_width": 1600,
        "num_heads": 25,
        "num_layers": 48,
        "finetune_6b": 0,
    }

for k, v in model_params.items():
    hyperparameters[k] = v
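
SageMaker passes the entries of `hyperparameters` to the entry-point script as command-line arguments of the form `--key value`. The sketch below shows how a script could read a few of them with `argparse`. It is not taken from `train_gptj_smp_tensor_parallel_script.py`: the helper names are hypothetical, and the real script may define its arguments differently.

```python
import argparse


def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into ['--key', 'value', ...] strings."""
    argv = []
    for key, value in hyperparameters.items():
        argv += [f"--{key}", str(value)]
    return argv


def parse_hyperparameters(argv):
    """Parse a small, illustrative subset of the hyperparameters."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_steps", type=int, default=100)
    parser.add_argument("--fp16", type=int, default=0)
    parser.add_argument("--lr", type=float, default=2.0e-4)
    parser.add_argument("--tensor_parallel_degree", type=int, default=1)
    # Ignore arguments this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


example_hyperparameters = {
    "max_steps": 100,
    "seed": 12345,
    "fp16": 1,
    "lr": 2.0e-4,
    "tensor_parallel_degree": 8,
}
args = parse_hyperparameters(to_cli_args(example_hyperparameters))
print(args.max_steps, args.fp16, args.lr, args.tensor_parallel_degree)
```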

#### Set Up SageMaker Studio Experiment
Create a SageMaker Experiment for the training job.

[SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) allows you to automatically track the input and output artifacts, (hyper-)parameters, and metrics of your machine learning pipelines. You can efficiently manage your model development process by viewing, analyzing, and comparing experiment results.

In [None]:
from time import gmtime, strftime

# Specify your experiment name
experiment_name = "smp-gptj-tensor-parallel"
# Specify your trial name
trial_name = f"{experiment_name}-trial1"

all_experiment_names = [exp.experiment_name for exp in Experiment.list()]
# Load the experiment if it exists, otherwise create
if experiment_name not in all_experiment_names:
    experiment = Experiment.create(
        experiment_name=experiment_name, sagemaker_boto_client=sm_boto_client
    )
else:
    experiment = Experiment.load(
        experiment_name=experiment_name, sagemaker_boto_client=sm_boto_client
    )

# Create the trial
trial = Trial.create(
    trial_name="smp-{}-{}".format(trial_name, strftime("%Y-%m-%d-%H-%M-%S", gmtime())),
    experiment_name=experiment.experiment_name,
    sagemaker_boto_client=sm_boto_client,
)

#### Specify Essential SageMaker Training Job Parameters

Use the SageMaker [Estimator API](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) to define a SageMaker Training Job. Pass the following parameters to this training job.

* `instance_count`
* `instance_type`
* `volume_size`
* `base_job_name`

> **Total GPUs**: The total number of GPUs available for training is determined by the `instance_type` and `instance_count`.

<details>
  <Summary>Training Instance Types</Summary>

   #### GPT-J-XL
  * ml.g5.24xlarge
  * ml.p3.16xlarge
  * ml.p2.16xlarge
  
  #### GPT-J-6B
  * ml.g5.48xlarge
  * ml.p3dn.24xlarge
  * ml.p4d.24xlarge

</details>

Set the instance type.

In [None]:
instance_type = "ml.p3.16xlarge"
instance_count = 1

<div class="alert alert-block alert-info">
<b>Instance Specs</b> <br>
Amazon SageMaker <b>ml.p4d.24xlarge</b> instances are powered by Intel® Cascade Lake processors and eight NVIDIA A100 Tensor Core GPUs. They provide 96 vCPUs, 1.1 TB of instance memory, 8 TB of local NVMe-based SSD storage, up to 400 Gbps of networking throughput, and 19 Gbps of EBS burst bandwidth.
</div>

In [None]:
# Set the processes per host equal to the number of GPUs on the EC2 instance
if instance_type in [
    "ml.p3.16xlarge",
    "ml.p3dn.24xlarge",
    "ml.g5.48xlarge",
    "ml.p4d.24xlarge",
]:
    processes_per_host = 8
elif instance_type == "ml.p2.16xlarge":
    processes_per_host = 16
else:
    processes_per_host = 4

print("processes_per_host is set to:", processes_per_host)

Set the instance volume size to 500GB. The volume size must be larger than your input data size.

In [None]:
volume_size = 500

Specify a base job name to track a configuration.

In [None]:
machine_str = instance_type.split(".")[1] + instance_type.split(".")[2][:3]
pp_degree = hyperparameters["pipeline_parallel_degree"]
tp_degree = hyperparameters["tensor_parallel_degree"]
base_job_name = f'smp-{model_config}-{machine_str}-tp{tp_degree}-pp{pp_degree}-sdp{hyperparameters["sharded_data_parallel_degree"]}-bs{hyperparameters["train_batch_size"]}'
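
As a quick check of what this naming scheme produces, the sketch below rebuilds the name for the configuration used in this notebook and tests it against the character rules for training job names, assuming the pattern `^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$`. The helper `build_base_job_name` is hypothetical.

```python
import re

# Assumed naming rule: alphanumerics and hyphens, at most 63 characters.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def build_base_job_name(model_config, instance_type, tp_degree, pp_degree, sdp_degree, batch_size):
    """Mirror the notebook's naming scheme, for example 'ml.p3.16xlarge' -> 'p316x'."""
    _, family, size = instance_type.split(".")
    machine_str = family + size[:3]
    return (
        f"smp-{model_config}-{machine_str}-tp{tp_degree}-pp{pp_degree}"
        f"-sdp{sdp_degree}-bs{batch_size}"
    )


example_name = build_base_job_name("gpt-j-xl", "ml.p3.16xlarge", 8, 1, 1, 4)
print(example_name)
print("valid:", bool(JOB_NAME_PATTERN.match(example_name)))
```

Keeping the base name short leaves room for the timestamp suffix that the SageMaker Python SDK appends when it creates the job.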

Set the S3 location for the model checkpoint.

In [None]:
# If you want to resume training, set checkpoint_s3_uri to the same path as a previous job.
# The previous checkpoint to load must have the same model config.
checkpoint_bucket = f"s3://sagemaker-{region}-{account}"
checkpoint_s3_uri = f"{checkpoint_bucket}/experiments/gptj_synthetic_simpletrainer_checkpoints/{base_job_name}/"

#### Create a SageMaker PyTorch Estimator

Amazon SageMaker's PyTorch estimator objects contain a [distribution](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) parameter which you can use to enable and specify parameters for SageMaker distributed training. The SageMaker model parallel library internally uses MPI. To use model parallelism, both `smdistributed` and `mpi` must be enabled through the distribution parameter.

In [None]:
distribution = {
    "mpi": {
        "enabled": True,
        "processes_per_host": processes_per_host,
        "custom_mpi_options": mpioptions,
    },
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                "ddp": True,
                "tensor_parallel_degree": hyperparameters["tensor_parallel_degree"],
                # partitions is a required param in the current SM SDK so it needs to be passed;
                # partitions and pipeline_parallel_degree map to the same config
                "partitions": hyperparameters["pipeline_parallel_degree"],
                "shard_optimizer_state": hyperparameters["shard_optimizer_state"] > 0,
                "prescaled_batch": hyperparameters["prescaled_batch"] > 0,
                "fp16": hyperparameters["fp16"] > 0,
                "optimize": hyperparameters["optimize"],
                # auto_partition is the inverse of the manual_partition hyperparameter
                "auto_partition": not hyperparameters["manual_partition"],
                "default_partition": 0,
            },
        }
    },
}

The following cell constructs a PyTorch estimator using the parameters defined above. To see how the SageMaker tensor parallelism modules and functions are applied to the script, see the `train_gptj_smp_tensor_parallel_script.py` file.

In [None]:
kwargs = {}

smp_estimator = PyTorch(
    entry_point="train_gptj_smp_tensor_parallel_script.py",
    source_dir=os.getcwd(),
    role=role,
    instance_type=instance_type,
    volume_size=volume_size,
    instance_count=instance_count,
    sagemaker_session=sagemaker_session,
    distribution=distribution,
    framework_version="1.13",
    py_version="py39",
    output_path=s3_output_bucket,
    checkpoint_s3_uri=checkpoint_s3_uri,
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    debugger_hook_config=False,
    disable_profiler=True,
    base_job_name=base_job_name,
    **kwargs,
)

Finally, run the estimator to launch the SageMaker training job of the GPT-J model with tensor parallelism.

In [None]:
smp_estimator.fit(
    inputs=data_channels,
    experiment_config={
        "ExperimentName": experiment.experiment_name,
        "TrialName": trial.trial_name,
        "TrialComponentDisplayName": "Training",
    },
    logs=True,
)


<div class="alert alert-block alert-info">
<b>ResourceLimitExceeded Error</b> <br>
If you receive a <code>ResourceLimitExceeded</code> error message when running the training job, you can request an increase on the default quota by contacting <a href="https://console.aws.amazon.com/support">AWS Support</a>. For the quota Limit Type select SageMaker Training Jobs and the instance type.
</div>


In [None]:
model_location = smp_estimator.model_data
model_location

<div class="alert alert-block alert-info">
<b>Training Time</b> <br>
The model training job takes approximately <b>30 minutes</b> to complete.
</div>


## Accessing the Training Logs

You can access the training logs from [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html). Make sure to look at the logs of algo-1 as that is the master node whose output stream will have the training job logs.

You can use CloudWatch to track SageMaker GPU and memory utilization during training and inference. To view the metrics and logs that SageMaker writes to CloudWatch, see *Processing Job, Training Job, Batch Transform Job, and Endpoint Instance Metrics* in [Monitor Amazon SageMaker with Amazon CloudWatch](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html).

If you are a new user of CloudWatch, see [Getting Started with Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/GettingStarted.html). 

For additional information on monitoring and analyzing Amazon SageMaker training jobs, see [Monitor and Analyze Training Jobs Using Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html).


<div class="alert alert-block alert-info">
<b>End of Lab3a</b> <br>
Please return to workshop studio and continue to Lab3b - Inference