# Fine-tune FLAN-T5 XXL on Multiple Nodes using DeepSpeed on Amazon SageMaker

FLAN-T5 is an enhanced version of T5 that has been fine-tuned on a mixture of tasks. In simple words, it is a better T5 model in every aspect. For the same number of parameters, FLAN-T5 outperforms T5 by double-digit improvements.

This repo shows how to fine-tune FLAN-T5 XXL (11B) on multiple nodes using [DeepSpeed ZeRO](https://www.deepspeed.ai/tutorials/zero/) on Amazon SageMaker. It has been tested successfully in SageMaker Studio with the Data Science image, the Python 3 kernel, and an ml.m5.large kernel gateway instance in the us-east-1 region.

It is structured as follows:
1. Process dataset and upload to S3
2. Prepare training script and DeepSpeed launcher
3. Fine-tune FLAN-T5 XXL on Amazon SageMaker

Before we start, let’s install the required libraries and make sure we have the correct permissions to access S3.

In [None]:
!pip install "transformers==4.26.0" "datasets[s3]==2.9.0" sagemaker --upgrade

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).



In [None]:
import sagemaker
from sagemaker import get_execution_role
import boto3

sess = sagemaker.Session()
role = get_execution_role()

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

## 1. Process dataset and upload to S3

We prepare our dataset from the [CNN Dailymail Dataset](https://huggingface.co/datasets/cnn_dailymail).


In [None]:
# experiment config
model_id = "google/flan-t5-xxl" # Hugging Face Model Id
dataset_id = "cnn_dailymail" # Hugging Face Dataset Id
dataset_config = "3.0.0" # config/version of the dataset
save_dataset_path = "data" # local path to save processed dataset
text_column = "article" # column containing the input text
summary_column = "highlights" # column containing the output text
# custom instruct prompt; `{input}` is filled in later via `.format(input=...)`
prompt_template = "Summarize the following news article:\n{input}\nSummary:\n"

We process (tokenize) the dataset, upload it to S3, and pass it into our managed training job.

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
import numpy as np 

dataset = load_dataset(dataset_id,name=dataset_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

# Train dataset size: 287113
# Test dataset size: 11490

We defined a `prompt_template` in our config, which we use to construct an instruct prompt that improves the performance of our model. The `prompt_template` has a “fixed” start and end, with our document in the middle. This means we need to ensure that the “fixed” template parts plus the document do not exceed the max length of the model. Therefore, we calculate the max length of our document, which we later use for padding and truncation.

In [None]:
prompt_length = len(tokenizer(prompt_template.format(input=""))["input_ids"])
max_sample_length = tokenizer.model_max_length - prompt_length
print(f"Prompt length: {prompt_length}")
print(f"Max input length: {max_sample_length}")

# Prompt length: 12
# Max input length: 500
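
The token budget above can be illustrated without loading the tokenizer. The following is a minimal sketch: the helper `max_document_tokens` is a hypothetical name introduced here, and `512` is assumed as the tokenizer's `model_max_length` (it is the value implied by the printed prompt length of 12 and max input length of 500).

```python
# Same template as in the config cell; `{input}` is replaced by the article text.
prompt_template = "Summarize the following news article:\n{input}\nSummary:\n"


def max_document_tokens(model_max_length: int, prompt_tokens: int) -> int:
    """Tokens left for the document once the fixed prompt parts are counted."""
    return model_max_length - prompt_tokens


# The fixed parts wrap the document on both sides.
example = prompt_template.format(input="<article text>")
print(example)

# 512 is an assumed model_max_length; 12 is the prompt length printed above.
budget = max_document_tokens(model_max_length=512, prompt_tokens=12)
print(budget)  # 500
```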

We now know that our documents can be up to 500 tokens long and still fit our `prompt_template`. In addition to our input, we need a better understanding of our “target” sequence length, that is, how long the summaries in our dataset are. Therefore, we iterate over the dataset and calculate the max input length (at most 500) and the max target length. (This takes a few minutes.)

In [None]:
from datasets import concatenate_datasets
import numpy as np

# The maximum total input sequence length after tokenization. 
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x[text_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
max_source_length = min(max_source_length, max_sample_length)
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization. 
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x[summary_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# use 95th percentile as max target length
max_target_length = int(np.percentile(target_lengths, 95))
print(f"Max target length: {max_target_length}")
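
To see why the 95th percentile is used instead of the maximum, consider a toy example. This is a sketch under stated assumptions: the lengths are made up, and `percentile` is a small pure-Python helper written here to mirror the linear interpolation that `np.percentile` uses by default. It is not part of the notebook.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) over a list of numbers."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1) / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Hypothetical target lengths: twenty typical summaries and one very long outlier.
toy_target_lengths = list(range(50, 70)) + [900]

toy_max_target_length = int(percentile(toy_target_lengths, 95))
print(max(toy_target_lengths))  # 900
print(toy_max_target_length)    # 69
```

Padding every target to 900 tokens because of a single outlier would waste memory and compute; with the percentile, only the outlier gets truncated.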

We now have everything needed to process our dataset.

In [None]:
def preprocess_function(sample, padding="max_length"):
    # create the prompted input
    inputs = [prompt_template.format(input=item) for item in sample[text_column]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample[summary_column], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100
    # so that padding is ignored in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# process dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=list(dataset["train"].features))
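
The label masking step in `preprocess_function` can be shown in isolation. The sketch below uses made-up token ids, and `mask_padding` is a hypothetical helper name. The value `-100` matters because it is the default `ignore_index` of PyTorch's cross-entropy loss, so masked positions do not contribute to the loss.

```python
IGNORE_INDEX = -100  # same value as in preprocess_function


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX in every label sequence."""
    return [
        [(token if token != pad_token_id else IGNORE_INDEX) for token in sequence]
        for sequence in label_ids
    ]


# Hypothetical token ids; 0 stands in for tokenizer.pad_token_id.
toy_labels = [[71, 1, 0, 0], [8, 12, 5, 1]]
masked = mask_padding(toy_labels, pad_token_id=0)
print(masked)  # [[71, 1, -100, -100], [8, 12, 5, 1]]
```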

Now that the datasets are processed, we use the [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) to upload them to S3. We are using `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.

In [None]:
# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed-404/{dataset_id}/train'
tokenized_dataset["train"].save_to_disk(training_input_path)

# save test_dataset to s3
test_input_path = f's3://{sess.default_bucket()}/processed-404/{dataset_id}/test'
tokenized_dataset["test"].save_to_disk(test_input_path)


print("uploaded data to:")
print(f"training dataset to: {training_input_path}")
print(f"test dataset to: {test_input_path}")

## 2. Prepare training script and DeepSpeed launcher

Here we use `torch.distributed.launch` to launch DeepSpeed on multiple nodes. First, `start.py` configures some environment variables and invokes the shell script `torch_launch.sh`. Second, `torch_launch.sh` configures all parameters required by both `torch.distributed.launch` and the training script `run_seq2seq_deepspeed.py`.
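
The repository's `start.py` and `torch_launch.sh` are not reproduced here. As a rough sketch of what such a launcher has to assemble, the following builds a `torch.distributed.launch` command line in Python. It assumes the SageMaker-provided variables `SM_HOSTS`, `SM_CURRENT_HOST`, and `SM_NUM_GPUS` plus the `NODE_NUMBER` variable passed to the estimator later on; the helper name `build_launch_command`, the port, and the `--deepspeed` script argument are illustrative assumptions, not the repository's actual code.

```python
import json


def build_launch_command(env, script, script_args=(), master_port=29500):
    """Build a torch.distributed.launch command from SageMaker-style env vars."""
    hosts = sorted(json.loads(env["SM_HOSTS"]))      # e.g. ["algo-1", "algo-2"]
    node_rank = hosts.index(env["SM_CURRENT_HOST"])  # position of this node
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={env['SM_NUM_GPUS']}",    # one process per GPU
        f"--nnodes={env['NODE_NUMBER']}",
        f"--node_rank={node_rank}",
        f"--master_addr={hosts[0]}",                 # first host acts as master
        f"--master_port={master_port}",
        script,
        *script_args,
    ]


# Hypothetical environment for the second of two 8-GPU nodes.
example_env = {
    "SM_HOSTS": '["algo-1", "algo-2"]',
    "SM_CURRENT_HOST": "algo-2",
    "SM_NUM_GPUS": "8",
    "NODE_NUMBER": "2",
}
command = build_launch_command(
    example_env,
    "run_seq2seq_deepspeed.py",
    ["--deepspeed", "ds_flan_t5_z3_config_bf16.json"],  # assumed script argument
)
print(" ".join(command))
```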
In addition, we create a DeepSpeed config file named `ds_flan_t5_z3_config_bf16.json` to configure our training setup.
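
The actual contents of the config file live in the repository and may differ. As a hedged illustration of what a ZeRO stage 3 config with `bf16` and no offloading can look like, the sketch below builds one as a Python dict and serializes it to JSON. The values are assumptions; in particular, the `"auto"` placeholders are only resolved when DeepSpeed is driven through the Hugging Face `Trainer` integration.

```python
import json

# Illustrative values only; the repository's config file may differ.
ds_config = {
    "bf16": {"enabled": True},  # train in bfloat16
    "zero_optimization": {
        "stage": 3,             # partition parameters, gradients and optimizer states
        "overlap_comm": True,
        "contiguous_gradients": True,
        # No "offload_optimizer" / "offload_param" entries: nothing is offloaded to CPU.
    },
    # "auto" values are filled in by the Hugging Face Trainer integration.
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

config_json = json.dumps(ds_config, indent=2)
print(config_json)
```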

We are going to use p4d.24xlarge AWS EC2 instances, each including 8x NVIDIA A100 40GB. This means we can leverage `bf16`, which reduces the memory footprint of the model by almost 2x and allows us to train efficiently without offloading.
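
The memory saving can be checked with simple arithmetic. This is a back-of-the-envelope sketch that counts only the model weights (roughly 11B parameters for FLAN-T5 XXL); gradients, optimizer states, and activations come on top and are not covered here. The helper name `weight_memory_gb` is introduced for illustration.

```python
def weight_memory_gb(num_parameters: float, bytes_per_parameter: int) -> float:
    """Memory needed for the model weights alone, in GB (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


num_parameters = 11e9  # FLAN-T5 XXL, roughly 11B parameters

fp32_gb = weight_memory_gb(num_parameters, bytes_per_parameter=4)
bf16_gb = weight_memory_gb(num_parameters, bytes_per_parameter=2)
print(f"fp32 weights: {fp32_gb:.0f} GB")  # fp32 weights: 44 GB
print(f"bf16 weights: {bf16_gb:.0f} GB")  # bf16 weights: 22 GB
```

In fp32 the weights alone would exceed the 40 GB of a single A100, while in `bf16` they take about half of that before ZeRO partitions them across GPUs.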


## 3. Fine-tune FLAN-T5 XXL on Amazon SageMaker



In order to create a SageMaker training job we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure.
SageMaker takes care of starting and managing all the required EC2 instances for us, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at /opt/ml/input/data. Then, it starts the training job by running the `entry_point` script, here `start.py`.

In [None]:
import time
from sagemaker.huggingface import HuggingFace
from sagemaker import get_execution_role

role = get_execution_role()
# define Training Job Name 
job_name = f'huggingface-flan-t5-deepspeed-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'
# define the S3 path that will store your trained model assets
# Note: replace this with your real S3 path
model_s3_path='s3://your_bucket/flan-t5-xxl-4102xx668899-liangaws/model/'

instance_count = 2
# define the environment variables for your scripts
environment = {
    'NODE_NUMBER': str(instance_count),
    'FI_PROVIDER': 'efa',
    'NCCL_PROTO': 'simple',
    'FI_EFA_USE_DEVICE_RDMA': '1',
    'NCCL_DEBUG': 'INFO',
    'MODEL_S3_PATH': model_s3_path
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'start.py',        # entry point script
    source_dir           = 'src',             # directory which includes all the files needed for training
    instance_type        = 'ml.p4d.24xlarge', # instance type used for the training job
    instance_count       = instance_count,    # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in the training job to access AWS resources, e.g. S3
    transformers_version = '4.17',            # the transformers version used in the training job
    pytorch_version      = '1.10',            # the PyTorch version used in the training job
    py_version           = 'py38',            # the Python version used in the training job
    environment          = environment,       # environment variables passed to the scripts
)

We created our `HuggingFace` estimator with `start.py` as the `entry_point`. We can now start our training job with the `.fit()` method, passing our S3 paths to the training script.

In [None]:
# define a data input dictionary with our uploaded S3 URIs
# Here we use test_input_path for both the training and the test channel
# to quickly verify the whole training procedure.
data = {
    'training': test_input_path,
    'test': test_input_path
}

# start the training job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)
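
Each key of the `data` dictionary becomes an input channel that SageMaker downloads to `/opt/ml/input/data/<channel>` and exposes through an `SM_CHANNEL_<CHANNEL>` environment variable. The following sketch shows how a training script might resolve these directories; `channel_dir` is a hypothetical helper and not part of the repository's scripts.

```python
import os


def channel_dir(channel_name, env=os.environ):
    """Return the local directory of a SageMaker input channel."""
    default = f"/opt/ml/input/data/{channel_name}"
    return env.get(f"SM_CHANNEL_{channel_name.upper()}", default)


# Hypothetical environment, as it might look inside the training container.
example_env = {"SM_CHANNEL_TRAINING": "/opt/ml/input/data/training"}
print(channel_dir("training", example_env))  # /opt/ml/input/data/training
print(channel_dir("test", example_env))      # /opt/ml/input/data/test (fallback path)
```

Inside the training script, the returned path could then be passed to `datasets.load_from_disk` to load the dataset saved earlier with `save_to_disk`.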