# Lab 2

## Introduction

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard for accuracy on NLP tasks. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI’s 175 billion parameter GPT-3 and the similarly sized open-source BLOOM 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access through model zoos such as HuggingFace and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. These models can be used in pretrained form as so-called "Foundation Models".

However, since they are trained on large datasets of generic data, they are often not suited for use cases requiring domain-specific knowledge. Models with a large number of parameters in particular usually generalize well and perform surprisingly well in various zero-shot/few-shot scenarios. Nevertheless, zero-shot/few-shot performance in complex tasks like question answering or human-like conversation degrades rapidly the more specific the tasks become.

In these cases, finetuning a model on a smaller, domain-specific dataset can help raise performance to a satisfactory level. However, training/finetuning these models can be a challenge because of their size.

In this lab, we'll explore how to finetune a large language model on Amazon SageMaker using SageMaker Training, one of many ready-to-use AWS Deep Learning Containers (DLCs), and the built-in HuggingFace integration of the SageMaker SDK.

## Background and Details
Since training such models requires even more resources than hosting them, in this lab we'll work with a rather small Large Language Model (LLM) to learn the basic concepts of finetuning LLMs. However, the proposed approach works similarly at scale for larger models.

'distilGPT2', a transformer-based large language model with around 82M parameters, is the distilled version of GPT-2 (the predecessor of GPT-3/4), which was pre-trained on the WebText dataset. Since it is a decoder-only model, it was trained using a causal language modeling (CLM) loss. We will use the same approach to finetune the model on the 'tiny_shakespeare' dataset in order to adjust the writing style and content of the generated text.

'tiny_shakespeare' is a dataset consisting of 40,000 lines of text from a variety of Shakespeare's plays, available in a train/test/validation split. It can be retrieved conveniently from the HuggingFace datasets hub.

Finally, we will deploy both the original and the finetuned model to experience the impact of the performed training.


## Instructions

### Prerequisites

#### To run this workshop...
You need a computer with a web browser, preferably the latest version of Chrome or Firefox.
Read and follow, in order, the instructions described in AWS Hosted Event and Work Environment Set Up.

#### Recommended background
It will be easier for you to run this workshop if you have:

- Experience with Deep learning models
- Familiarity with Python or other similar programming languages
- Experience with Jupyter notebooks
- Beginner-level knowledge of and experience with SageMaker Hosting/Inference

#### Target audience
Data Scientists, ML Engineers, ML Infrastructure Engineers, MLOps Engineers, Technical Leaders.
Intended for customers working with large Generative AI models, including language, computer vision and multi-modal use cases.
Customers using EKS/EC2/ECS/on-premises infrastructure for hosting, or with SageMaker experience.

Level of expertise - 400

#### Time to complete
Approximately 1 hour.

# Import of required dependencies

For this lab, we will use the following libraries:

 - boto3, the AWS SDK for Python
 - SageMaker SDK for interacting with Amazon SageMaker. We especially want to highlight the classes 'HuggingFaceModel' and 'HuggingFace', which utilize the built-in HuggingFace integration of the SageMaker SDK. These classes encapsulate functionality around the model and the deployed endpoint we will use. They inherit from the generic 'Model' and 'Estimator' classes of the native SageMaker SDK, but implement some additional functionality specific to HuggingFace and the HuggingFace model hub.
 - os, a Python library implementing miscellaneous operating system interfaces


In [None]:
import boto3
import sagemaker
import sagemaker.session
import os

from sagemaker.huggingface import HuggingFace, HuggingFaceModel

# Setup of notebook environment

Before we begin with the actual work of finetuning and deploying the model on Amazon SageMaker, we need to set up the notebook environment accordingly. This includes:

- retrieval of the execution role our SageMaker Studio domain is associated with for later usage
- retrieval of our account_id for later usage
- retrieval of the chosen region for later usage

In [None]:
# Retrieve SM execution role
role = sagemaker.get_execution_role()

In [None]:
# Create a new STS client
sts_client = boto3.client('sts')

# Call the GetCallerIdentity operation to retrieve the account ID
response = sts_client.get_caller_identity()
account_id = response['Account']
account_id

In [None]:
# Retrieve region
region = boto3.Session().region_name
region

# Setup of S3 bucket for storage of training artifacts
When training a model with Amazon SageMaker Training, several artifacts can be written to an S3 bucket. These include the trained model in the form of a ‘model.tar.gz’ archive as well as other artifacts such as log files and the source code. For this purpose, we create a dedicated S3 bucket if one is not already present.

In [None]:
# specifying bucket name for model artifact storage
model_bucket_name = f'immersion-day-bucket-{account_id}'
model_bucket_name

In [None]:
# Create S3 bucket
s3_client = boto3.client('s3', region_name=region)
location = {'LocationConstraint': region}

bucket_name = model_bucket_name

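# Check if bucket already exists.
# head_bucket raises a ClientError if the bucket is missing or not accessible.
from botocore.exceptions import ClientError

bucket_exists = True
try:
    s3_client.head_bucket(Bucket=bucket_name)
except ClientError:
    bucket_exists = False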

# Create bucket if it does not exist
if not bucket_exists:
    if region == 'us-east-1':
        s3_client.create_bucket(Bucket=bucket_name)
    else:
        s3_client.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration=location,
        )
    print(f"Bucket '{bucket_name}' created successfully")


# Diving deep into the training code

The code artifacts required for finetuning reside in the 'finetuning' directory. This directory is structured as follows:

`finetuning/`
- `finetuning.py`
- `requirements.txt`

The "finetuning" directory contains your training script (finetuning.py) and your requirements.txt file (for installation of additional dependencies not preinstalled in the container image upon start of the training container). We will now take a closer look into the training code:

## Import of required dependencies

On top of several common Python libraries, this training script uses the following DL-specific libraries:

- torch: PyTorch is a Python package that provides two high-level features: 1/Tensor computation (like NumPy) with strong GPU acceleration and 2/Deep neural networks built on a tape-based autograd system
- transformers: HuggingFace transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Transformers support framework interoperability between PyTorch, TensorFlow, and JAX. 
- evaluate: HuggingFace evaluate is a library for easily evaluating machine learning models and datasets.
- datasets: HuggingFace datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks in the context of the HuggingFace dataset hub.

## Script invocation and hyperparameter parsing

After the ephemeral training cluster has been provisioned and the respective Docker image has been pulled onto the machines, SageMaker Training starts the container, which invokes the training Python script 'finetuning.py' as its entry point. In doing so, it passes the defined hyperparameters as command line arguments. We will dive deeper into our hyperparameter selection at a later point.

The hyperparameters can be parsed by an 'argparse' ArgumentParser:

```python
import argparse

parser = argparse.ArgumentParser()

# Training parameters
parser.add_argument("--model_name_or_path", default="distilgpt2")

args = parser.parse_args()

```
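
To make the mechanism concrete, here is a minimal sketch of how a hyperparameter dictionary can end up as command line arguments and be parsed again. The helper `to_cli_args` is hypothetical and only imitates the `--key value` convention; the argument list mirrors the hyperparameters used later in this lab, and the types and defaults are assumptions rather than the exact contents of `finetuning.py`.

```python
import argparse


def to_cli_args(hyperparameters):
    """Hypothetical helper: flatten a dict into ['--key', 'value', ...]."""
    cli_args = []
    for key, value in hyperparameters.items():
        cli_args.extend([f"--{key}", str(value)])
    return cli_args


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name_or_path", default="distilgpt2")
    parser.add_argument("--dataset_name", default="tiny_shakespeare")
    # Flags arrive as strings such as "0" or "1", so they are parsed as integers.
    parser.add_argument("--do_train", type=int, default=1)
    parser.add_argument("--do_eval", type=int, default=1)
    parser.add_argument("--output_dir", default="/opt/ml/model")
    parser.add_argument("--per_device_train_batch_size", type=int, default=2)
    parser.add_argument("--per_device_eval_batch_size", type=int, default=2)
    return parser


hyperparameters = {
    "model_name_or_path": "distilgpt2",
    "do_train": 1,
    "per_device_train_batch_size": 2,
}
# parse_known_args tolerates extra arguments the parser does not define.
args, _ = build_parser().parse_known_args(to_cli_args(hyperparameters))
print(args.model_name_or_path, args.do_train, args.per_device_train_batch_size)
```

Arguments that are not passed fall back to the parser defaults, which is why `args.dataset_name` is still available in this sketch.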

## Logging

For logging we use the 'logging' library. We first set up the basic config:

```python
import logging
import sys

# Set up logging
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)

```

Then we set the log level to 'INFO':

```python
logger = logging.getLogger(__name__)
log_level = logging.INFO
logger.setLevel(log_level)

```

Finally we configure logging for the HuggingFace frameworks 'datasets' and 'transformers'.

```python
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()

```

## Loading the dataset

We then use the 'datasets' library to load our dataset from the HuggingFace dataset hub:

```python
# Downloading and loading a dataset from the hub.
raw_datasets = load_dataset(args.dataset_name)
```

In case the dataset we are loading does not come with a predefined train/test split, the script implements additional functionality to create one.

In a real-world scenario, training data could be ingested from various data sources such as S3 or databases.
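
The fallback for a missing split could look roughly like the sketch below. It assumes the held-out split is called `validation` and is carved out of the first few percent of `train` using the split slicing syntax of the `datasets` library; the function names and the `validation_split_percentage` parameter are assumptions for illustration, not necessarily what `finetuning.py` uses.

```python
def split_specs(validation_split_percentage=5):
    """Return (train_spec, validation_spec) slice strings for the 'train' split."""
    pct = int(validation_split_percentage)
    if not 0 < pct < 100:
        raise ValueError("validation_split_percentage must be between 1 and 99")
    return f"train[{pct}%:]", f"train[:{pct}%]"


def load_with_validation(dataset_name, validation_split_percentage=5):
    # Imported lazily so that split_specs can be used without the library installed.
    from datasets import load_dataset

    raw_datasets = load_dataset(dataset_name)
    if "validation" not in raw_datasets:
        train_spec, validation_spec = split_specs(validation_split_percentage)
        # Re-load the 'train' split twice with disjoint slices.
        raw_datasets["validation"] = load_dataset(dataset_name, split=validation_spec)
        raw_datasets["train"] = load_dataset(dataset_name, split=train_spec)
    return raw_datasets


print(split_specs(5))
```

For 'tiny_shakespeare' this branch would not be taken, since the dataset already ships with all three splits.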


## Preprocessing

Since we want to finetune the model on a CLM task we need to consider both NLP-related steps and CLM-related steps when it comes to data preprocessing:

For training NLP models, the full text string has to be tokenized to enable the model to "digest" it as an input. Beyond access to a huge number of open-source NLP models, the HuggingFace model hub also offers compatible tokenizers. Using the 'transformers' library, we download a tokenizer for the revision of 'distilGPT2' which will be finetuned later on:

```python
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=True, revision=args.model_revision)

```

In a similar fashion we are also loading the model artifacts from the HuggingFace model hub:

```python
model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path, revision=args.model_revision, torch_dtype="auto")

```

The tokenizer is now wrapped into an object of the 'AutoTokenizer' class, while the model resides in an object of the 'AutoModelForCausalLM' class. 

### Tokenization

For tokenization we define a function taking care of the actual tokenization task:

```python
def tokenize_function(examples):
    
    ...        
     
    output = tokenizer(examples[text_column_name])
    
    ...
    
    return output

```


        
        
Then we utilize it as a higher order function in a map approach on the dataset:

```python
tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=column_names,
    desc="Running tokenizer on dataset"
)

```
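
To see what `tokenize_function` receives when `batched=True` is set, the following toy sketch may help. `toy_tokenizer` and `toy_map_batched` are hypothetical stand-ins written for this illustration (a whitespace tokenizer and a simplified batching loop), not the GPT-2 tokenizer or the `datasets` implementation. The point is only that the function is called with a dictionary of lists, one list entry per example, and that the original text column is dropped afterwards.

```python
VOCAB = {}  # toy vocabulary, filled on the fly


def toy_tokenizer(texts):
    """Stand-in tokenizer: one integer id per whitespace-separated word."""
    input_ids = [[VOCAB.setdefault(word, len(VOCAB)) for word in text.split()] for text in texts]
    return {"input_ids": input_ids, "attention_mask": [[1] * len(ids) for ids in input_ids]}


def tokenize_function(examples):
    # With batched=True, examples["text"] is a list of strings.
    return toy_tokenizer(examples["text"])


def toy_map_batched(rows, function, batch_size=2):
    """Simplified imitation of a batched map that removes the 'text' column."""
    output = []
    for start in range(0, len(rows), batch_size):
        batch_rows = rows[start : start + batch_size]
        batch = {"text": [row["text"] for row in batch_rows]}
        result = function(batch)
        # Turn the dict of lists back into one row per example.
        for i in range(len(batch_rows)):
            output.append({key: values[i] for key, values in result.items()})
    return output


rows = [{"text": "to be or not to be"}, {"text": "that is the question"}, {"text": "to be"}]
tokenized = toy_map_batched(rows, tokenize_function)
print(tokenized[0]["input_ids"])  # [0, 1, 2, 3, 0, 1]
```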

### CLM-related tasks

The training task consumes token blocks of size 'block_size' (the number of tokens the model consumes in one forward pass; this is model-specific and also bounded by the instance type used for training) and trains the model using a CLM loss (for details see the [causal language modeling guide](https://huggingface.co/docs/transformers/tasks/language_modeling#causal-language-modeling)). Therefore we need to group our tokenized dataset into token blocks of size 'block_size'. We again define a function that performs the actual grouping task:

```python
from itertools import chain

# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

```
    
        
Then we utilize it as a higher order function in a map approach on the tokenized dataset:

```python
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    desc=f"Grouping texts in chunks of {block_size}",
)

```
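
A small worked example can make the grouping step easier to follow. The sketch below re-uses the grouping logic with a toy `block_size` of 4 and a hand-written batch of already tokenized examples; both are assumptions chosen for readability, not values from the training job.

```python
from itertools import chain

block_size = 4  # toy value for illustration; real block sizes are far larger


def group_texts(examples):
    # Concatenate all examples of the batch into one long sequence per column.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a complete block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For CLM the labels are the inputs; the shift happens inside the model.
    result["labels"] = result["input_ids"].copy()
    return result


batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
grouped = group_texts(batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The ten input tokens yield two full blocks of four; tokens 9 and 10 are the dropped remainder. Note that block boundaries ignore the original example boundaries.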


## Training

Since we specified that we want to evaluate the model being finetuned (both stepwise during and after the training process), we first need to define our evaluation metric. To do so, we load one of the various pre-implemented metrics available in HuggingFace's 'evaluate' library:

```python
metric = evaluate.load("accuracy")

```


Then we define a function computing the actual metrics tied to our training job:

```python
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # preds have the same shape as the labels, after the argmax(-1) has been calculated
    # by preprocess_logits_for_metrics but we need to shift the labels
    labels = labels[:, 1:].reshape(-1)
    preds = preds[:, :-1].reshape(-1)
    return metric.compute(predictions=preds, references=labels)
        
```
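
The label shift in `compute_metrics` is easy to get wrong, so here is a dependency-free sketch of the same idea using plain Python lists. The function name `shifted_accuracy` and the token ids are made up for illustration: the prediction at position `t` is compared with the label at position `t + 1`, because a causal language model predicts the next token.

```python
def shifted_accuracy(preds, labels):
    """Token-level accuracy where preds[i][t] is the prediction for labels[i][t + 1]."""
    correct = 0
    total = 0
    for pred_row, label_row in zip(preds, labels):
        # Drop the last prediction (no token follows it) and the first label (nothing predicts it).
        for pred, label in zip(pred_row[:-1], label_row[1:]):
            correct += int(pred == label)
            total += 1
    return correct / total


labels = [[10, 11, 12, 13]]
preds = [[11, 12, 99, 0]]  # the first two next-token predictions are right, the third is wrong
print(shifted_accuracy(preds, labels))
```

In this example two of the three comparable positions match, so the accuracy is 2/3.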

The next step is configuring the actual training job. To do so, we first initialize a 'TrainingArguments' object with our hyperparameters plus a seed:


```python
# Specifying training_args. Going with default values for every parameter not explicitly specified. See documentation for more information: https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer#transformers.TrainingArguments   
training_args = TrainingArguments(
    per_device_train_batch_size=int(args.per_device_train_batch_size),
    per_device_eval_batch_size=int(args.per_device_eval_batch_size),
    output_dir=args.output_dir,
    seed=42,
    disable_tqdm=False,
)
    
```
    
Then we initialize the Trainer object, which will orchestrate the training and evaluation process holistically. Several artifacts defined in the flow we executed so far are passed as parameters (model, training_args, datasets, compute_metrics function):

```python
# Initialize our Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset if args.do_train else None,
    eval_dataset=eval_dataset if args.do_eval else None,
    tokenizer=tokenizer,
    # Data collator will default to DataCollatorWithPadding, so we change it.
    data_collator=default_data_collator,
    compute_metrics=compute_metrics if args.do_eval and not is_torch_tpu_available() else None,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics
    if args.do_eval and not is_torch_tpu_available()
    else None,
)
```
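
`preprocess_logits_for_metrics` is passed to the Trainer but not shown in this walkthrough. A plausible sketch, modeled on the Hugging Face causal language modeling example script and therefore an assumption about what `finetuning.py` contains, reduces the logits to predicted token ids before they are accumulated for evaluation:

```python
def preprocess_logits_for_metrics(logits, labels):
    # Some models return a tuple such as (logits, past_key_values, ...); keep only the logits.
    if isinstance(logits, tuple):
        logits = logits[0]
    # Keep the most likely token id per position instead of the full vocabulary distribution.
    # `logits` is expected to be a torch tensor of shape (batch, sequence, vocabulary).
    return logits.argmax(dim=-1)
```

Reducing the tensor from shape (batch, sequence, vocabulary) to (batch, sequence) at this point keeps memory usage during evaluation manageable, and it is why `compute_metrics` can treat `preds` as token ids.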

Finally, the Trainer's .train() function is invoked to execute the actual training. After successful completion, the model artifacts are persisted according to the 'output_dir' configuration.
    
```python
train_result = trainer.train()
trainer.save_model()  # Saves the tokenizer too for easy upload

```
       
## Evaluation

After successful completion of the training run, we perform a final evaluation: 

```python
metrics = trainer.evaluate()

```
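
For language models, the evaluation loss is often reported as perplexity as well. The following sketch assumes that the returned metrics dictionary contains an `eval_loss` entry holding the mean cross-entropy loss; the helper name and the numbers are made up for illustration.

```python
import math


def perplexity_from_loss(eval_loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    try:
        return math.exp(eval_loss)
    except OverflowError:
        return float("inf")


# Hypothetical metrics dictionary; the values are not results of this lab.
metrics = {"eval_loss": 3.0, "eval_accuracy": 0.35}
metrics["perplexity"] = perplexity_from_loss(metrics["eval_loss"])
print(round(metrics["perplexity"], 2))
```

Lower perplexity means the model assigns higher probability to the held-out text, which makes it a convenient number for comparing the model before and after finetuning.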

# Hyperparameters

For the finetuning job to be conducted we specify the following hyperparameters explicitly:
- model_name_or_path: model id in HuggingFace ecosystem
- dataset_name: dataset id in HuggingFace ecosystem
- do_train: boolean variable indicating if training run should be executed. In our case 1.
- do_eval: boolean variable indicating if evaluation run should be executed. In our case 1.
- output_dir: directory path for storing the produced model artifacts locally within the container. We pick the default output directory of our SageMaker Training job (will be uploaded to S3 upon job success) '/opt/ml/model'.
- per_device_train_batch_size: batch size to be used when training. We choose 2.
- per_device_eval_batch_size: batch size to be used when evaluating. We choose 2.

The default values for the remaining configurable parameters can be found in the [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) and [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) documentation.

# Configure the environment for model finetuning using the SageMaker HuggingFace Estimator and an AWS HuggingFace DLC

To conveniently train a model with Amazon SageMaker Training, we can use the Estimator class of the SageMaker SDK. Thanks to the AWS x HuggingFace partnership, we can use the HuggingFace Estimator natively integrated into the SageMaker SDK, which implements some additional functionality specific to HuggingFace and the HuggingFace model hub. This enables us to finetune the model by providing only the training script and some configuration parameters, while SageMaker takes care of all the undifferentiated heavy lifting in the background.
In the constructor we specify the following parameters:
- source_dir: path to the directory in which the training script resides. In our case, this is the relative path to the 'finetuning' directory. Please note that we've also created a 'requirements.txt' file for installing the dependencies the training script requires at container start time.
- entry_point: file in which the training script is implemented. Residing in the 'finetuning' directory, this is 'finetuning.py'.
- instance_type: EC2 instance type for executing the training job. We pick 'ml.p3.2xlarge', an instance with one NVIDIA Tesla V100 GPU (16 GB of GPU memory), 8 vCPUs and 61 GB RAM.
- instance_count: size of the ephemeral training cluster. We pick a single-node cluster.
- image_uri: the image URI of the Docker image used for training the model. We will be using one of the many ready-to-use Deep Learning Containers AWS is providing [here](https://aws.amazon.com/machine-learning/containers/). Deep Learning Containers are Docker images that are preinstalled and tested with the latest versions of popular deep learning frameworks. They let you train models in custom ML environments quickly without building and optimizing your environments from scratch. Since we will be training a model from the HuggingFace model hub by leveraging various HuggingFace frameworks, we will use one of the HuggingFace DLCs, which comes with Python 3.8, PyTorch 1.10.2 and transformers 4.17.0 preinstalled and is optimized for training in GPU-accelerated environments.
- py_version: version of the Python runtime installed in the container. We set it to None because this parameter is redundant once a container image URI is specified explicitly.
- hyperparameters: hyperparameters, passed as command line arguments to the training script.
- output_path: S3 path for storing the artifacts produced by the training job. For this, we use the S3 bucket we created in the beginning.

Finally, we execute the SageMaker Training job by calling the .fit() function. This will take a couple of minutes. 


In [None]:
hyperparameters = {
        "model_name_or_path": 'distilgpt2',
        "dataset_name": 'tiny_shakespeare',
        "do_train": 1, 
        "do_eval": 1, 
        "output_dir": '/opt/ml/model',
        "per_device_train_batch_size": 2, 
        "per_device_eval_batch_size": 2,
        }

In [None]:
huggingface_estimator = HuggingFace(
    source_dir='finetuning',
    entry_point='finetuning.py',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    image_uri=f'763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04',
    py_version=None,
    hyperparameters=hyperparameters,
    output_path=f's3://{model_bucket_name}',
)
        

In [None]:
huggingface_estimator.fit()

# Model deployment

Now we want to deploy both the original 'distilGPT2' model and our finetuned 'shakespeare-distilGPT2' model. To do so, we first retrieve the S3 path to the model artifact archive of our finetuned model:

In [None]:
latest_job_name = huggingface_estimator.latest_training_job.job_name
latest_job_name

In [None]:
def get_s3_artifact_path(training_job_name):
    # Get the ModelArtifacts object for the training job
    sagemaker_session = sagemaker.Session()

    training_job = sagemaker_session.describe_training_job(training_job_name)

    model_artifacts = training_job['ModelArtifacts']

    # Retrieve the S3 path to the model artifact
    s3_path = model_artifacts['S3ModelArtifacts']
    return s3_path

In [None]:
s3_path = get_s3_artifact_path(latest_job_name)
s3_path
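
If you want to inspect the archive itself, the S3 URI first has to be split into a bucket name and an object key. The sketch below shows one way to do that with the standard library; the helper name `split_s3_uri` and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI used only to demonstrate the helper.
bucket, key = split_s3_uri("s3://example-bucket/example-job/output/model.tar.gz")
print(bucket, key)

# In the notebook, the archive could then be downloaded for inspection, for example:
# boto3.client("s3").download_file(bucket, key, "model.tar.gz")
```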

Then we deploy the model to an 'ml.g4dn.xlarge' instance using the HuggingFaceModel class:

In [None]:
# create Hugging Face Model Class
huggingface_model_finetuned = HuggingFaceModel(
    image_uri=f'763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04',
    model_data=s3_path,
    role=role,
)

In [None]:
predictor_finetuned = huggingface_model_finetuned.deploy(
    initial_instance_count=1, # number of instances
    instance_type='ml.g4dn.xlarge', 
    endpoint_name='sm-endpoint-distilgpt2-shakespeare-immersion-day',
)

We also deploy the original model to an 'ml.g4dn.xlarge' instance. For this we use a convenient feature built into the SageMaker SDK: when creating a HuggingFaceModel, we can specify, as environment variables, a model to be deployed directly from the HuggingFace model hub together with the task to be performed. SageMaker Inference handles the rest:

In [None]:
hub = {
  'HF_MODEL_ID':'distilgpt2', # model_id from hf.co/models
  'HF_TASK':'text-generation' # NLP task you want to use for predictions
}

# create Hugging Face Model Class
huggingface_model_plain = HuggingFaceModel(
    env=hub,    # configuration for loading the model from the Hub
    role=role,  # IAM role with permissions to create an endpoint
    # DLC image used for inference
    image_uri=f'763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04',
)

In [None]:
predictor_plain = huggingface_model_plain.deploy(
    initial_instance_count=1, # number of instances
    instance_type='ml.g4dn.xlarge', 
    endpoint_name='sm-endpoint-distilgpt2-immersion-day',
)

# Inference

Having the two endpoints available, we can experiment and observe the impact the finetuning has in terms of performance of the text-generation task.

In [None]:
predictor_finetuned.predict({"inputs": "The meaning of life",
"parameters": {
    "min_length": 50,
    "max_length": 100
}})[0]['generated_text']

In [None]:
predictor_plain.predict({"inputs": "The meaning of life",
"parameters": {
    "min_length": 50,
    "max_length": 100
}})[0]['generated_text']
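
To compare both endpoints on the same prompt and with the same generation settings, the request can be built once and sent to each predictor. The helpers below are a hypothetical sketch; `do_sample`, `top_k` and `temperature` are standard text generation parameters of the 'transformers' library, and their values here are arbitrary choices.

```python
def build_payload(prompt, min_length=50, max_length=100, **generation_kwargs):
    """Build the request body expected by the text-generation endpoints."""
    parameters = {"min_length": min_length, "max_length": max_length}
    parameters.update(generation_kwargs)
    return {"inputs": prompt, "parameters": parameters}


def compare_generations(predictors, payload):
    """predictors: mapping of label -> object offering a predict(payload) method."""
    return {
        label: predictor.predict(payload)[0]["generated_text"]
        for label, predictor in predictors.items()
    }


payload = build_payload("The meaning of life", do_sample=True, top_k=50, temperature=0.7)
print(payload)

# In the notebook, with both endpoints running:
# compare_generations({"finetuned": predictor_finetuned, "plain": predictor_plain}, payload)
```

With sampling enabled the outputs differ between calls, so it is worth running the comparison several times before judging the effect of the finetuning.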

# Cleanup
Finally, we clean up all resources that are no longer needed, since we advocate the responsible use of compute resources. In this case, these are the two endpoints we created, together with their respective endpoint configurations.

In [None]:
predictor_finetuned.delete_endpoint(delete_endpoint_config=True)

In [None]:
predictor_plain.delete_endpoint(delete_endpoint_config=True)
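
As an alternative to the two cells above, the cleanup can be wrapped in a small helper that also removes the SageMaker model objects created during deployment. This is a sketch under the assumption that the predictors offer `delete_model()` in addition to `delete_endpoint()`, as the predictors of the SageMaker SDK do; the function name `cleanup` is hypothetical. The S3 bucket and the training artifacts are left untouched.

```python
def cleanup(predictors):
    """predictors: mapping of label -> predictor whose resources should be removed."""
    for name, predictor in predictors.items():
        try:
            # Delete the model first, while the endpoint configuration still references it.
            predictor.delete_model()
            predictor.delete_endpoint(delete_endpoint_config=True)
            print(f"Deleted resources for '{name}'")
        except Exception as error:
            # Keep going so that one failure does not block the remaining cleanup.
            print(f"Could not clean up '{name}': {error}")


# In the notebook:
# cleanup({"finetuned": predictor_finetuned, "plain": predictor_plain})
```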