## Instruction Fine-Tuning GPT-Neo

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard for accuracy on NLP tasks. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI's 175-billion-parameter GPT-3 and the similarly sized open-source BLOOM 176B raising the bar on NLP accuracy. This growth in parameter count is driven by the empirically demonstrated positive relationship between model size and accuracy: larger models tend to perform better. With easy access through model zoos such as Hugging Face and improved accuracy on NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, their size makes them challenging to train and deploy.

In this notebook, we explore how to fine-tune a large language model, GPT-Neo, on SageMaker using the SageMaker distributed model parallel library.
SageMaker provides distributed training libraries and supports various distributed training options for deep learning tasks such as computer vision (CV) and natural language processing (NLP). With SageMaker's distributed training libraries, you can run highly scalable and cost-effective custom data parallel and model parallel deep learning training jobs. To fine-tune the GPT-Neo model, we will use sharded data parallelism (SDP). Sharded data parallelism is a memory-saving distributed training technique that splits the training state of a model (model parameters, gradients, and optimizer states) across the GPUs in a data parallel group.
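To make the memory saving concrete, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with an Adam-style optimizer at about 16 bytes of training state per parameter (2 for fp16 weights, 2 for fp16 gradients, 4 for fp32 master weights, and 4 + 4 for the two optimizer moments). That figure, the 2.7 billion parameter count (read off the model name used later in this notebook), and the helper name `training_state_gib` are assumptions for illustration; activations and framework overhead are ignored, so real usage will differ.

```python
def training_state_gib(num_params: float, bytes_per_param: int = 16, shard_degree: int = 1) -> float:
    """Approximate per-GPU memory (GiB) for parameters, gradients, and optimizer state."""
    total_bytes = num_params * bytes_per_param
    # With sharding, each GPU holds only 1/shard_degree of the training state.
    return total_bytes / shard_degree / 2**30


num_params = 2.7e9  # assumption: taken from the "2.7B" in the model name

replicated = training_state_gib(num_params)                # plain data parallelism
sharded = training_state_gib(num_params, shard_degree=4)   # sharded across 4 GPUs

print(f"replicated on every GPU: {replicated:.1f} GiB")
print(f"sharded across 4 GPUs:   {sharded:.1f} GiB per GPU")
```

Under these assumptions, the fully replicated state (about 40 GiB) would exceed the 24 GB of a single A10G GPU, while a four-way shard (about 10 GiB per GPU) leaves room for activations.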

## License agreement
 - View the license information at https://github.com/EleutherAI/gpt-neox/blob/main/LICENSE before using the model.
 - This notebook is a sample notebook and is not intended for production use. Please refer to the license at https://github.com/aws/mit-0.

#### Let's begin by installing the SageMaker SDK and importing libraries

In [2]:
! pip install -U sagemaker

Collecting sagemaker
 Using cached sagemaker-2.150.0-py2.py3-none-any.whl
Collecting tblib==1.7.0
 Using cached tblib-1.7.0-py2.py3-none-any.whl (12 kB)
Installing collected packages: tblib, sagemaker
 Attempting uninstall: tblib
 Found existing installation: tblib 1.6.0
 Uninstalling tblib-1.6.0:
 Successfully uninstalled tblib-1.6.0
 Attempting uninstall: sagemaker
 Found existing installation: sagemaker 2.147.0
 Uninstalling sagemaker-2.147.0:
 Successfully uninstalled sagemaker-2.147.0
Successfully installed sagemaker-2.150.0 tblib-1.7.0


In [2]:
import sagemaker
from sagemaker.pytorch import PyTorch

In [3]:
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::706553727873:role/service-role/AmazonSageMaker-ExecutionRole-20211019T121285
sagemaker bucket: sagemaker-us-east-1-706553727873
sagemaker session region: us-east-1


#### Data Preparation

To run the training job, we will use the Alpaca instruction dataset (`tatsu-lab/alpaca`), which is available through Hugging Face Datasets.

In [5]:
! pip install datasets

Collecting datasets
 Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting xxhash
 Using cached xxhash-3.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (213 kB)
Collecting huggingface-hub<1.0.0,>=0.11.0
 Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tqdm>=4.62.1
 Using cached tqdm-4.65.0-py3-none-any.whl (77 kB)
Collecting responses<0.19
 Using cached responses-0.18.0-py3-none-any.whl (38 kB)
Collecting packaging
 Downloading packaging-23.1-py3-none-any.whl (48 kB)
Installing collected packages: xxhash, tqdm, packaging, responses, huggingface-hub, datasets
 Attempt

In [4]:
from datasets import load_dataset

instruction_data = load_dataset('tatsu-lab/alpaca')


Downloading and preparing dataset parquet/tatsu-lab--alpaca to /root/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-9b55fb286e3c7ab6/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...



Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/tatsu-lab___parquet/tatsu-lab--alpaca-9b55fb286e3c7ab6/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.



In [5]:
import pandas as pd

instructionDF = pd.DataFrame(instruction_data["train"])

In [6]:
train_df = instructionDF.iloc[:5000]
valid_df = instructionDF.iloc[5000:7000]

train_df.to_csv("train.csv",index=False)
valid_df.to_csv("valid.csv",index=False)
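Each Alpaca record carries `instruction`, `input`, and `output` fields. The sketch below shows one way such a row could be turned into a single training string, using the prompt wording published with the Stanford Alpaca project. Whether `scripts/train.py` formats its examples this way is not shown in this notebook, so treat the template choice and the helper name `build_prompt` as assumptions.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(row: dict) -> str:
    """Hypothetical helper: join prompt and expected response into one string."""
    if row.get("input"):
        prompt = PROMPT_WITH_INPUT.format(instruction=row["instruction"], input=row["input"])
    else:
        # Many records have an empty input field and use the shorter template.
        prompt = PROMPT_NO_INPUT.format(instruction=row["instruction"])
    return prompt + row["output"]


example = {
    "instruction": "Name the capital of France.",
    "input": "",
    "output": "Paris",
}
print(build_prompt(example))
```

Note that empty `input` strings are read back as `NaN` by `pandas.read_csv`, so a script that loads the CSV files with pandas would need to handle that case before applying a check like the one above.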

#### Upload the training data to S3

In [7]:
train_data_url = sess.upload_data(
    path="train.csv",
    key_prefix="alpaca",
)

valid_data_url = sess.upload_data(
    path="valid.csv",
    key_prefix="alpaca",
)

In [8]:
print(f"training file path {train_data_url}")
print(f"validation file path {valid_data_url}")

training file path s3://sagemaker-us-east-1-706553727873/alpaca/train.csv
validation file path s3://sagemaker-us-east-1-706553727873/alpaca/valid.csv
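When these S3 URIs are later passed to `estimator.fit` as input channels, SageMaker downloads each channel into `/opt/ml/input/data/<channel_name>` inside the training container and exposes that directory through an `SM_CHANNEL_<NAME>` environment variable. The following sketch shows how a training script could resolve the local file paths; `channel_dir` is a hypothetical helper, not part of the SageMaker SDK.

```python
import os


def channel_dir(channel: str) -> str:
    """Hypothetical helper: local directory of a SageMaker input channel."""
    default = f"/opt/ml/input/data/{channel}"
    # Inside a training container the variable is set by SageMaker;
    # fall back to the conventional path when it is absent.
    return os.environ.get(f"SM_CHANNEL_{channel.upper()}", default)


train_file = f"{channel_dir('train')}/train.csv"
valid_file = f"{channel_dir('valid')}/valid.csv"
print(train_file)
print(valid_file)
```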


### Train Model

Now we are ready to run the training job using a SageMaker estimator. The SageMaker PyTorch estimator requires a training script to run a model training job. Below is the script for fine-tuning a pretrained Hugging Face GPT-Neo model on the dataset we just uploaded to S3.

In [9]:
!pygmentize ./scripts/train.py

import torch

import math
from torch.utils.data import DataLoader

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    default_data_collator,
    get_scheduler,
)
from itertools import chain
import copy

from datasets import load_dataset
from tqdm import tqdm
from

In [10]:
hyperparameters = {}
SM_DATA_DIR = "/opt/ml/input/data" 

hyperparameters["model_name_or_path"] = "EleutherAI/gpt-neo-2.7B"
hyperparameters["checkpoint_dir"] = "/opt/ml/checkpoints"
hyperparameters["train_file"] = f"{SM_DATA_DIR}/train/train.csv"
hyperparameters["validation_file"] = f"{SM_DATA_DIR}/valid/valid.csv"
hyperparameters["per_device_train_batch_size"] = 1
hyperparameters["per_device_eval_batch_size"] = 1
hyperparameters["block_size"] = 2048
hyperparameters["num_train_epochs"] = 2
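The `block_size` hyperparameter sets the length, in tokens, of each training sequence. A common pattern in Hugging Face causal language modeling examples is to concatenate the tokenized examples and cut the result into fixed-length blocks. The training script's imports include `itertools.chain`, which is consistent with that pattern, but the relevant part of `train.py` is not shown here, so the sketch below is an assumption about the approach rather than the script's actual code. It uses a toy block size of 4 instead of 2048.

```python
from itertools import chain


def group_texts(examples: dict, block_size: int) -> dict:
    """Concatenate token lists and split them into blocks of block_size tokens."""
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder so every block has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [tokens[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, tokens in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs;
    # the model shifts them by one position internally.
    result["labels"] = [block.copy() for block in result["input_ids"]]
    return result


batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9]]}
blocks = group_texts(batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The trailing token `9` is dropped because it does not fill a complete block.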

##### Store model files as checkpoints for easy deployment


In [11]:

checkpoint_dir = "/opt/ml/checkpoints"
checkpoint_s3_path = "s3://" + sess.default_bucket() + "/gptneo-checkpoints"

#### Set up parameters for sharded data parallelism (SDP)

In [12]:
smp_options = {
    "enabled": True,
    "parameters": {  # Required
        "pipeline_parallel_degree": 1,  # Required
        "ddp": True,
        "ddp_dist_backend": "auto",
        # parameters for sharded data parallelism
        "sharded_data_parallel_degree": 4,  # Add this to activate sharded data parallelism
        "partitions": 1,
        "offload_activations": True,
        "fp16": True,
        "skip_tracing": True,
    },
}

mpi_options = {
    "enabled": True,  # Required
    "processes_per_host": 4,  # Required
}
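These two dictionaries interact: `processes_per_host` launches one training process per GPU, and `sharded_data_parallel_degree` controls how many of those GPUs share one copy of the training state. The sketch below derives the resulting layout for this notebook's values (one instance, per-device batch size 1). The divisibility check reflects an assumption about what a sensible configuration looks like, and the global batch size formula assumes pure data parallelism without gradient accumulation; consult the SageMaker model parallel library documentation for the authoritative constraints. `describe_layout` is a hypothetical helper.

```python
def describe_layout(processes_per_host: int, instance_count: int,
                    sharded_degree: int, per_device_batch_size: int) -> dict:
    """Hypothetical helper: summarize how GPUs are grouped for sharded training."""
    world_size = processes_per_host * instance_count  # total number of GPUs
    if world_size % sharded_degree != 0:
        raise ValueError(
            f"{world_size} GPUs cannot be split evenly into groups of {sharded_degree}"
        )
    return {
        "world_size": world_size,
        # Each sharding group holds one full copy of the training state.
        "sharding_groups": world_size // sharded_degree,
        "global_batch_size": per_device_batch_size * world_size,
    }


layout = describe_layout(
    processes_per_host=4, instance_count=1, sharded_degree=4, per_device_batch_size=1
)
print(layout)
```

With the values used here, all four GPUs form a single sharding group and each optimizer step sees four examples.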

#### Start the training job
We use an ml.g5.12xlarge instance, which has 4 GPUs, to shard the model states and run the training.

In [13]:

base_job_name = "gpt-neo-instruction-fine-tuning"
estimator = PyTorch(
    base_job_name=base_job_name,
    source_dir="./scripts",
    entry_point="train.py",
    role=role,
    framework_version="2.0.0",
    py_version="py310",
    instance_count=1,
    instance_type="ml.g5.12xlarge",
    hyperparameters=hyperparameters,
    checkpoint_local_path=checkpoint_dir,
    checkpoint_s3_uri=checkpoint_s3_path,
    disable_profiler=True,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options,
    },
    keep_alive_period_in_seconds=15 * 60,  # 15 minutes
)

In [14]:
estimator.fit({"train":train_data_url,"valid":valid_data_url})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: gpt-neo-instruction-fine-tuning-2023-04-27-14-10-18-045


Using provided s3_resource
2023-04-27 14:10:18 Starting - Starting the training job...
2023-04-27 14:10:44 Starting - Preparing the instances for training......
2023-04-27 14:11:39 Downloading - Downloading input data...
2023-04-27 14:11:59 Training - Downloading the training image...........................
2023-04-27 14:16:35 Training - Training image download completed. Training in progress......
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-04-27 14:17:26,049 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2023-04-27 14:17:26,079 sagemaker-training-toolkit INFO No Neurons detected (normal if no neurons installed)
2023-04-27 14:17:26,088 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2023-04-27 14:17:26,090 sagemaker_pytorch_container.training INFO Invoking user training script.
2

#### Store the checkpoint path to reuse in the deploy notebook

In [None]:
%store checkpoint_s3_path