# Lab 1

## Introduction

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI’s 175-billion-parameter GPT-3 and the similarly sized open-source BLOOM 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as HuggingFace and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.

In this lab, we'll explore how to host a large language model on Amazon SageMaker using SageMaker Inference, one of the many ready-to-use AWS Deep Learning Containers (DLCs), and the built-in HuggingFace integration of the SageMaker SDK.

## Background and Details
We'll be working with GPT-J, a large language model with over 6B parameters pre-trained on the Pile dataset. The pre-trained weights of the original model come in FP32 format (4 bytes per parameter) and add up to roughly 24 GB in total. Since most of today's single-GPU instances are equipped with only 16 GB, at most 24 GB, of GPU memory, the size of the model weights makes inference challenging. With quantization, we'll explore one of several optimization options that allow us to host this model on single-GPU instances with no more than 16 GB of GPU memory. Finally, we take a closer look at the art of prompt engineering and discover how [few-shot](https://www.analyticsvidhya.com/blog/2021/05/an-introduction-to-few-shot-learning/) (as opposed to zero-shot) approaches can significantly improve model performance on a wide variety of NLP tasks.

## Instructions

### Prerequisites

#### To run this workshop...
You need a computer with a web browser, preferably the latest version of Chrome or Firefox.
Read and follow, in order, the instructions described in AWS Hosted Event and Work Environment Set Up.

#### Recommended background
It will be easier for you to run this workshop if you have:

- Experience with Deep learning models
- Familiarity with Python or other similar programming languages
- Experience with Jupyter notebooks
- Beginner-level knowledge of and experience with SageMaker Hosting/Inference

#### Target audience
Data Scientists, ML Engineering, ML Infrastructure, MLOps Engineers, Technical Leaders.
Intended for customers working with large Generative AI models including Language, Computer vision and Multi-modal use-cases.
Customers using EKS/EC2/ECS/On-prem for hosting or experience with SageMaker.

Level of expertise - 400

#### Time to complete
Approximately 1 hour.

# Import of required dependencies

For this lab, we will use the following libraries:

 - SageMaker SDK for interacting with Amazon SageMaker. We especially want to highlight the classes 'HuggingFaceModel' and 'HuggingFacePredictor', which use the built-in HuggingFace integration of the SageMaker SDK. These classes encapsulate functionality around the model and the deployed endpoint we will use. They inherit from the generic 'Model' and 'Predictor' classes of the native SageMaker SDK and add functionality specific to HuggingFace and the HuggingFace model hub.
 - boto3, the AWS SDK for Python
 - os, a Python library implementing miscellaneous operating system interfaces
 - tarfile, a Python library to read and write tar archive files

In [None]:
from sagemaker.huggingface import HuggingFaceModel, HuggingFacePredictor
import sagemaker
import boto3
import os
import tarfile

# Setup of notebook environment

Before we begin packaging and deploying the model to Amazon SageMaker, we need to set up the notebook environment. This includes:
- retrieval of the execution role our SageMaker Studio domain is associated with for later usage
- retrieval of our account_id for later usage
- retrieval of the chosen region for later usage

In [None]:
# IAM role with permissions to create endpoint
role = sagemaker.get_execution_role()

In [None]:
# Create a new STS client
sts_client = boto3.client('sts')

# Call the GetCallerIdentity operation to retrieve the account ID
response = sts_client.get_caller_identity()
account_id = response['Account']
account_id

In [None]:
# Retrieve region
region = boto3.Session().region_name
region

# Create Model Artifact Archive

To host a model with Amazon SageMaker Inference, we need to package our model artifacts into an archive called ‘model.tar.gz’ and upload it to S3. Within this archive, your model artifacts should be stored in the following directory structure:

`model.tar.gz`
- `model.bin`
- `code/`
  - `inference.py`
  - `requirements.txt`

The "code" directory contains your inference script (inference.py) and your requirements.txt file (needed only if you have additional dependencies; see below for details). The “model.bin” file is a file in one of various binary formats containing the model weights as well as some configuration. In our case this file does not exist at archive creation time, since we will load our model files from the HuggingFace model hub at endpoint start time. Before you continue, open the code directory to familiarize yourself with its structure so that you can follow the subsequent steps.

If you have additional dependencies for your model, you can include them in a requirements.txt file in the "code" directory. SageMaker will install these dependencies during the deployment of your model.

Sometimes you may want to override one of the [five functions](https://huggingface.co/docs/sagemaker/inference#user-defined-code-and-modules) within the hosting cycle, such as the model_fn function. To do this, you can create a new function in your inference.py file with the same name as the function you want to override. SageMaker will automatically use your new function instead of the default function. Since we want to dynamically load the model binaries at endpoint start time, we will override model_fn, the default function for loading a model.

Within the model_fn() function of inference.py, this can be done by leveraging the capabilities of HuggingFace. HuggingFace is a company focused on democratizing open-source AI that partners closely with AWS. With the ‘transformers’ library, they have created a popular open-source API/framework for natural language processing on top of common frameworks like PyTorch or TensorFlow. They have also built the HuggingFace model hub, a model repository providing thousands of open-source models across different ML tasks.

SageMaker provides built-in support for HuggingFace models through the SageMaker HuggingFace SDK. Here are the steps we will take to dynamically load our model from the HuggingFace model hub into SageMaker:
1. Import the transformers library
2. Use the from_pretrained method to download a pre-built tokenizer together with a pre-trained model from the HuggingFace Model Hub
```python
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16)

```
3. Use the pipeline method to perform inference on your text data. These pipelines can be configured to be task specific. For our use case we will use the [‘text-generation’ task](link***)
```python
generation = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)

```
4. Pass the generated pipeline object as return value of the model_fn() function to integrate with the rest of the inference lifecycle
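
Putting the four steps together, a `model_fn` in `code/inference.py` might look roughly like the sketch below. It is a sketch under assumptions rather than the lab's actual script: the single `model_dir` argument follows the signature the HuggingFace inference toolkit uses when calling `model_fn`, the device selection is our own choice, and the imports sit inside the function only so that the snippet can be loaded on a machine without `torch` and `transformers` installed.

```python
def model_fn(model_dir):
    """Load GPT-J from the HuggingFace model hub and return a text-generation pipeline.

    model_dir is the directory where SageMaker unpacks model.tar.gz. It is unused
    here because the weights are downloaded from the hub instead.
    """
    # Step 1: import the transformers library (and torch for dtype/device handling).
    import torch
    from transformers import AutoTokenizer, GPTJForCausalLM, pipeline

    # Assumption: use the first GPU if one is visible, otherwise fall back to CPU (-1).
    device = 0 if torch.cuda.is_available() else -1

    # Step 2: download the tokenizer and the FP16 revision of the model weights.
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    model = GPTJForCausalLM.from_pretrained(
        "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16
    )

    # Step 3: wrap model and tokenizer in a task-specific pipeline.
    generation = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)

    # Step 4: the returned object is handed to the rest of the inference lifecycle.
    return generation
```

In a real inference script the imports would normally live at module level. Compare this sketch with the `inference.py` shipped in the code directory, which remains the reference for this lab.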

The pre-trained weights of the original model come in FP32 format (4 bytes per parameter) and add up to roughly 24 GB in total. Since most of today's single-GPU instances are equipped with only 16 GB, at most 24 GB, of GPU memory, the size of the model weights makes inference challenging. Quantization is a technique to reduce the memory footprint when hosting large models: the model weights are converted into FP16 or INT8 format, which reduces the hosting footprint by a factor of 2 to 4. With quantization applied, the GPT-J model can be hosted on a single-GPU instance such as those of the ml.g4dn series.
The from_pretrained method of the HuggingFace transformers library allows you to download different revisions of a pre-trained model from the HuggingFace Model Hub. You can specify the revision you want to download using the revision parameter. For our use case, we will be using the FP16 revision of the model weights.
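
The footprint figures above follow from simple arithmetic: number of parameters times bytes per parameter. The sketch below works this out, assuming a round 6 billion parameters for GPT-J (the text only states "over 6B") and 1 GB = 10^9 bytes. The helper name `weight_footprint_gb` is ours.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# Assumption: a round figure for GPT-J, which has "over 6B" parameters.
NUM_PARAMS = 6_000_000_000


def weight_footprint_gb(num_params, dtype):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_footprint_gb(NUM_PARAMS, dtype):.0f} GB")
# fp32: ~24 GB
# fp16: ~12 GB
# int8: ~6 GB
```

These numbers cover the weights only. Activations and the GPU runtime need additional memory, so the actual requirement at inference time is somewhat higher.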


In [None]:
# function to compress the code directory into a model.tar.gz archive as outlined above
def compress(tar_dir=None, output_file="./model.tar.gz"):
    with tarfile.open(output_file, "w:gz") as tar:
        tar.add(tar_dir, arcname="code")

In [None]:
# specifying source code directory path
model_code_dir = './code'
# create tar.gz archive
print("creating `model.tar.gz` archive")
compress(model_code_dir)
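
Before uploading, it can be worth checking that the archive really has the expected `code/` layout. The following sketch builds a throwaway archive from placeholder files in a temporary directory and lists its members. The helper `list_archive` and the empty placeholder files are assumptions made for illustration.

```python
import os
import tarfile
import tempfile


def list_archive(path):
    """Return the member names of a tar.gz archive."""
    with tarfile.open(path, "r:gz") as tar:
        return tar.getnames()


# Build a throwaway code directory with empty placeholder files and archive it as "code/".
with tempfile.TemporaryDirectory() as tmp:
    code_dir = os.path.join(tmp, "code")
    os.makedirs(code_dir)
    for name in ("inference.py", "requirements.txt"):
        open(os.path.join(code_dir, name), "w").close()

    archive_path = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(code_dir, arcname="code")

    members = list_archive(archive_path)

print(sorted(members))
```

In the notebook, calling `list_archive("./model.tar.gz")` on the archive created above should show `code/inference.py` among the members.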

# Uploading to S3
We have now successfully created model.tar.gz. However, it still resides on the EBS volume of our SageMaker Studio domain. In the next step we will upload the archive to an S3 bucket to make it available to SageMaker Inference. To do so, we will perform the following steps:
- Creation of a new S3 bucket for model artifact storage
- Upload of the model artifact to S3 using the AWS SDK for Python, 'boto3'

In [None]:
# function to upload the model artifact model.tar.gz into an S3 bucket
def upload_file_to_s3(bucket_name=None, file_name="model.tar.gz", key_prefix=""):
    s3 = boto3.resource("s3")
    key_prefix_with_file_name = os.path.join(key_prefix, file_name)
    s3.Bucket(bucket_name).upload_file(file_name, key_prefix_with_file_name)
    return f's3://{bucket_name}/{key_prefix_with_file_name}'
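
The upload function composes the object key with `os.path.join`, which uses the separator of the local operating system. On the Linux kernels behind SageMaker Studio this yields `/`, which is what S3 keys expect. As a platform-independent alternative, the hypothetical helper below uses `posixpath.join`, which always joins with `/`. The bucket name is a made-up placeholder.

```python
import posixpath


def s3_uri(bucket_name, key_prefix, file_name):
    """Compose an s3:// URI; posixpath always joins with '/', as S3 keys expect."""
    key = posixpath.join(key_prefix, file_name)
    return f"s3://{bucket_name}/{key}"


# Placeholder bucket name for illustration only.
uri = s3_uri("my-example-bucket", "huggingface/gpt-j/", "model.tar.gz")
print(uri)
# s3://my-example-bucket/huggingface/gpt-j/model.tar.gz
```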

In [None]:
# specifying bucket name for model artifact storage
model_bucket_name = f'immersion-day-bucket-{account_id}-{region}'
# specifying key prefix for model artifact storage
model_s3_key_prefix = 'huggingface/gpt-j/'

In [None]:
# Create S3 bucket
s3_client = boto3.client('s3', region_name=region)
location = {'LocationConstraint': region}

bucket_name = model_bucket_name

# Check if bucket already exists
from botocore.exceptions import ClientError

bucket_exists = True
try:
    s3_client.head_bucket(Bucket=bucket_name)
except ClientError:
    # head_bucket raises ClientError if the bucket is missing or not accessible
    bucket_exists = False

# Create bucket if it does not exist
if not bucket_exists:
    if region == 'us-east-1':
        # us-east-1 does not accept a LocationConstraint
        s3_client.create_bucket(Bucket=bucket_name)
    else:
        s3_client.create_bucket(Bucket=bucket_name,
                                CreateBucketConfiguration=location)
    print(f"Bucket '{bucket_name}' created successfully")

In [None]:
# upload to s3
print(
 f"uploading `model.tar.gz` archive to s3://{bucket_name}/{model_s3_key_prefix}model.tar.gz"
)
model_uri = upload_file_to_s3(bucket_name=bucket_name, key_prefix=model_s3_key_prefix)
print(f"Successfully uploaded to {model_uri}")

Amazon SageMaker Inference is a managed service that allows you to deploy machine learning models to make predictions or perform inference on new data. It enables you to create an endpoint that can be accessed using HTTP requests to make predictions in real-time. This service is designed to make it easy to deploy and manage machine learning models in production. SageMaker Inference provides a scalable, reliable, and cost-effective way to deploy machine learning models. For deploying a model with SageMaker Inference we will use the SageMaker SDK and leverage the built-in HuggingFace integration.

## Model packaging
First, we package the model into the 'HuggingFaceModel' class by specifying the following parameters:
- image_uri: The image URI of the Docker image used for hosting the model. We will be using one of the many ready-to-use Deep Learning Containers that AWS provides [here](https://aws.amazon.com/machine-learning/containers/). Deep Learning Containers are Docker images that come preinstalled and tested with the latest versions of popular deep learning frameworks. They let you deploy custom ML environments quickly without building and optimizing your environments from scratch. Since we are deploying a model from the HuggingFace model hub, we will use one of the HuggingFace DLCs, which comes with Python 3.8, PyTorch 1.10.2, and transformers 4.17.0 preinstalled and is optimized for inference in GPU-accelerated environments.
- model_data: S3 path to the model artifact we just created and uploaded to S3
- role: IAM role, holding the required IAM permissions to perform the operations required to deploy a SageMaker Inference endpoint
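
The image URI used in this lab encodes the registry account, the region, and the framework versions listed above. The sketch below makes that structure explicit with a hypothetical helper, `huggingface_dlc_uri`, whose defaults reproduce the URI used in the next cell. It only formats a string: not every combination of versions corresponds to a published image, and the registry account may differ in some regions.

```python
def huggingface_dlc_uri(region, account="763104351884",
                        pytorch="1.10.2", transformers="4.17.0",
                        device="gpu", py="py38", cuda="cu113", os_tag="ubuntu20.04"):
    """Format a HuggingFace PyTorch inference DLC image URI from its components."""
    tag = f"{pytorch}-transformers{transformers}-{device}-{py}-{cuda}-{os_tag}"
    return f"{account}.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-inference:{tag}"


print(huggingface_dlc_uri("us-east-1"))
```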



In [None]:
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=f'763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04',
    model_data=model_uri,
    role=role,
)

## Model deployment
The created model package can now be used to deploy the actual model by calling its .deploy() function. The following parameters have to be specified:
- initial_instance_count: number of endpoint instances to be deployed 
- instance_type: EC2 instance type used for endpoint hosting
- endpoint_name: name of endpoint

Note that the SageMaker SDK creates the following two resources for you in the background:
- EndpointConfiguration
- Endpoint

You can check these in the Inference section of the SageMaker console once the model has been successfully deployed.



In [None]:
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
 initial_instance_count=1, # number of instances
 instance_type='ml.g4dn.4xlarge', 
 endpoint_name='sm-endpoint-gpt-j-6b-immersion-day',
)

# Inference 
## First try
The .deploy() function returns an object of the HuggingFacePredictor class. This class implements functionality around the interfaces for the actual inference against deployed endpoints. Among other things, it implements a .predict() function that can be used to conveniently call the endpoint for inference. When calling it, we can pass an object to the function that contains an 'inputs' parameter holding the prompt to be passed to the model.

Hint: in case an error occurs, go and check the CloudWatch logs. Try to figure out what happened!

In [None]:
# Calling the predict() function for inference 
predictor.predict({"inputs": "What is the capital of Germany?"})

For more advanced use cases we can also specify a second parameter: 'parameters' is a Python dictionary of model-specific parameters that are passed to the model to customize the generated output. For the GPT-J model, among the [many options available](https://huggingface.co/docs/api-inference/detailed_parameters#text-generation-task), we specify the following parameters:
- max_new_tokens: maximum number of tokens to be generated by the model 
- temperature: creativity of generated text. According to our experience, values between 0.2 (newspaper article, code) and 1.2 (poem) lead to high-quality results. 
- repetition_penalty: penalty for repeated occurrence of tokens
- top_k: breadth of vocabulary used, i.e. the number of highest-probability token candidates taken into account when sampling the next token
- return_full_text: boolean variable indicating if input prompt should be returned with result
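
To build some intuition for how `temperature` and `top_k` interact, the following conceptual sketch turns a handful of made-up token scores into sampling probabilities. It is not the implementation used inside the transformers library, and the tokens and scores in `toy_logits` are invented for illustration.

```python
import math


def next_token_distribution(logits, temperature=1.0, top_k=None):
    """Return sampling probabilities after top-k filtering and temperature scaling."""
    # Keep only the top_k highest-scoring candidates (all of them if top_k is None).
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]

    # Divide scores by the temperature, then apply a numerically stable softmax.
    scaled = [(token, score / temperature) for token, score in ranked]
    peak = max(score for _, score in scaled)
    weights = [(token, math.exp(score - peak)) for token, score in scaled]
    total = sum(weight for _, weight in weights)
    return {token: weight / total for token, weight in weights}


# Invented scores for the token following "The capital of Germany is".
toy_logits = {"Berlin": 4.0, "Munich": 2.0, "Paris": 1.0, "banana": -1.0}

focused = next_token_distribution(toy_logits, temperature=0.5, top_k=2)
creative = next_token_distribution(toy_logits, temperature=1.2, top_k=2)
print(focused)
print(creative)
```

With `top_k=2` only the two best candidates remain. The lower temperature concentrates more probability on the top candidate, while the higher temperature spreads it out, which is why higher values read as more "creative".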

In [None]:
predictor.predict({"inputs": "What is the capital of Germany?",
"parameters": {
 "max_new_tokens": 30,
 "temperature": 0.5,
 "repetition_penalty": 1.1,
 "top_k": 20,
 "return_full_text": False
}
})
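
Since the same parameter dictionary is reused in most of the calls in this lab, one option is to keep the defaults in a single place and override individual values per request. The helper `build_payload` and the constant `DEFAULT_PARAMETERS` below are hypothetical names. The values mirror the call above.

```python
DEFAULT_PARAMETERS = {
    "max_new_tokens": 30,
    "temperature": 0.5,
    "repetition_penalty": 1.1,
    "top_k": 20,
    "return_full_text": False,
}


def build_payload(prompt, **overrides):
    """Return a request dict with 'inputs' and 'parameters', applying any overrides."""
    return {"inputs": prompt, "parameters": {**DEFAULT_PARAMETERS, **overrides}}


payload = build_payload("What is the capital of Germany?", max_new_tokens=10)
print(payload)
```

The resulting dictionary can be passed directly to `predictor.predict(payload)`.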

## Prompt Engineering
Prompt engineering is the practice of designing effective prompts for LLMs with the goal of achieving:

- Control over the output: With prompt engineering, developers can control the output generated by LLMs. By designing prompts that specify the desired topic, style, tone, and level of formality, they can guide the LLM to produce text that meets the desired criteria.
- Mitigating bias: LLMs have been shown to produce biased outputs when prompted with certain topics or language patterns. By engineering prompts that avoid biased language and encourage fairness, developers can help mitigate these issues.
- Improving efficiency: Prompt engineering can help LLMs work more efficiently by guiding them to generate the desired output with fewer iterations. By providing clear, concise, and specific prompts, developers can help LLMs achieve the desired outcome faster and with fewer errors.

In general, a prompt can contain any of the following components:

- Instruction - a specific task or instruction you want the model to perform
- Context - can involve external information or additional context that can steer the model to better responses
- Input Data - is the input or question that we are interested to find a response for
- Output Indicator - indicates the type or format of output.

In general, the more relevant information we provide in the prompt, the better these goals will be achieved.
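
The four components can be assembled mechanically. The sketch below shows one possible way to do this with a hypothetical helper, `build_prompt`. The `Context:` and `Question:` labels and the blank-line separator are formatting choices of ours, not requirements of the model.

```python
def build_prompt(instruction=None, context=None, input_data=None, output_indicator=None):
    """Join the prompt components that are present, separated by blank lines."""
    parts = []
    if instruction:
        parts.append(instruction)
    if context:
        parts.append(f"Context: {context}")
    if input_data:
        parts.append(f"Question: {input_data}")
    if output_indicator:
        parts.append(output_indicator)
    return "\n\n".join(parts)


prompt = build_prompt(
    instruction="Answer the question based on the context below.",
    context="OKT3 was originally sourced from mice.",
    input_data="What was OKT3 originally sourced from?",
    output_indicator="Answer:",
)
print(prompt)
```

Every component is optional, so the same helper also produces a bare question when only `input_data` is given.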

Let's try it out!

In [None]:
# Simple unstructured prompt
prompt = """
Teplizumab traces its roots to a New Jersey drug company called Ortho Pharmaceutical. There, scientists generated an early version of the antibody, dubbed OKT3. Originally sourced from mice, the molecule was able to bind to the surface of T cells and limit their cell-killing potential. In 1986, it was approved to help prevent organ rejection after kidney transplants, making it the first therapeutic antibody allowed for human use.

What was OKT3 originally sourced from?"""

predictor.predict({"inputs": prompt,
"parameters": {
 "max_new_tokens": 10,
 "temperature": 0.7,
 "repetition_penalty": 1.1,
 "top_k": 20,
 "return_full_text": False
}
})

In [None]:
# We now stick to the scheme proposed above
prompt = """
Answer the question based on the context below. Keep the answer short and concise. Respond "Unsure about answer" if not sure about the answer.

Context: Teplizumab traces its roots to a New Jersey drug company called Ortho Pharmaceutical. There, scientists generated an early version of the antibody, dubbed OKT3. Originally sourced from mice, the molecule was able to bind to the surface of T cells and limit their cell-killing potential. In 1986, it was approved to help prevent organ rejection after kidney transplants, making it the first therapeutic antibody allowed for human use.

Question: What was OKT3 originally sourced from?

Answer:"""

predictor.predict({"inputs": prompt,
"parameters": {
 "max_new_tokens": 10,
 "temperature": 0.7,
 "repetition_penalty": 1.1,
 "top_k": 20,
 "return_full_text": False
}
})

In addition, [few-shot learning](https://www.analyticsvidhya.com/blog/2021/05/an-introduction-to-few-shot-learning/) is an interesting approach for the context element of a prompt. Few-shot learning is a prompt engineering technique that enables models to pick up new tasks or concepts from only a few examples (usually a single-digit number is sufficient) included in the prompt. Even though the model has never seen the task during training, we observe a significant boost in performance.

In [None]:
# Zero-shot (no examples in the prompt)
prompt = """
Tweet: "This new music video was incredibile"
Sentiment:"""
predictor.predict({"inputs": prompt,
"parameters": {
 "max_new_tokens": 20,
 "temperature": 0.5,
 "repetition_penalty": 1.1,
 "top_k": 20,
 "return_full_text": False
}
})

In [None]:
# Few-shot
prompt = """
Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredibile"
Sentiment:"""
predictor.predict({"inputs": prompt,
"parameters": {
 "max_new_tokens": 20,
 "temperature": 0.5,
 "repetition_penalty": 1.1,
 "top_k": 20,
 "return_full_text": False
}
})
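
The few-shot prompt above follows a regular pattern: labelled examples separated by `###`, followed by the unlabelled query. A hypothetical helper such as `build_few_shot_prompt` below can generate prompts of this shape from a list of examples, which makes it easy to experiment with the number of shots.

```python
def build_few_shot_prompt(examples, query, separator="###"):
    """Build a tweet sentiment prompt from (text, label) examples and a final query."""
    blocks = [f'Tweet: "{text}"\nSentiment: {label}' for text, label in examples]
    # The query comes last and leaves the label open for the model to complete.
    blocks.append(f'Tweet: "{query}"\nSentiment:')
    return f"\n{separator}\n".join(blocks)


examples = [
    ("I hate it when my phone battery dies.", "Negative"),
    ("This is the link to the article", "Neutral"),
]
prompt = build_few_shot_prompt(examples, "This new music video was incredibile")
print(prompt)
```

Passing an empty list of examples yields the zero-shot variant of the same prompt.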

# Cleanup
Finally, we clean up all resources that are no longer needed, since we are committed to the responsible use of compute resources. In this case, these are the created endpoint and its endpoint configuration.

In [None]:
predictor.delete_endpoint(delete_endpoint_config=True)