# Lab 1 (b)

## Objective
In this lab, we'll explore how to host a large language model on Amazon SageMaker using the [Hugging Face LLM Inference Container for Amazon SageMaker](https://huggingface.co/blog/sagemaker-huggingface-llm), which allows you to easily deploy the most popular open-source LLMs, including Falcon, StarCoder, BLOOM, GPT-NeoX, Llama, and T5.

## Introduction

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI’s 175-billion-parameter GPT-3 and the similarly sized open-source BLOOM 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.

## Background and Details
We'll be working with [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct), developed by the Technology Innovation Institute (TII). Falcon-40B-Instruct is a 40B-parameter causal decoder-only model built by TII, based on Falcon-40B and fine-tuned on a mixture of Baize. It is made available under the Apache 2.0 license.
## Instructions

### Prerequisites

#### To run this workshop...
You need a computer with a web browser, preferably the latest version of Chrome or Firefox.
Read and follow, in order, the instructions described in AWS Hosted Event and Work Environment Set Up.

#### Recommended background
It will be easier for you to run this workshop if you have:

- Experience with Deep learning models
- Familiarity with Python or other similar programming languages
- Experience with Jupyter notebooks
- Beginner-level knowledge of and experience with SageMaker Hosting/Inference

#### Target audience
Data Scientists, ML Engineers, ML Infrastructure Engineers, MLOps Engineers, and Technical Leaders.
Intended for customers working with large generative AI models, including language, computer vision, and multi-modal use cases.
Also suitable for customers who host models on EKS, EC2, ECS, or on premises, or who have experience with SageMaker.

Level of expertise - 400

#### Time to complete
Approximately 45 minutes.

We are going to use the SageMaker Python SDK to deploy the Falcon-40B-Instruct model to Amazon SageMaker.

In [None]:
!pip install --upgrade boto3 sagemaker



Before we begin packaging and deploying the model to Amazon SageMaker, we need to set up the notebook environment. This includes:

- retrieval of the execution role our SageMaker Studio domain is associated with for later usage
- retrieval of our bucket for later usage
- retrieval of the chosen region for later usage

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models, and logs.
# SageMaker automatically creates this bucket if it does not exist.
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # fall back to the default bucket if no bucket name is given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    # no execution role attached to this environment: look the role up by name
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")


Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and pass it to our HuggingFaceModel model class through the **image_uri** argument. To retrieve the new Hugging Face LLM Deep Learning Container (DLC) in Amazon SageMaker, we can use the **get_huggingface_llm_image_uri** method provided by the SageMaker SDK. This method returns the URI for the desired Hugging Face LLM DLC based on the specified backend, session, region, and version.

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="0.8.2"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

To deploy the Falcon-40B-Instruct model to Amazon SageMaker, we create a HuggingFaceModel model class and define our endpoint configuration, including the **hf_model_id** and **instance_type**. We will use an **ml.g5.12xlarge** instance type, which has 4 NVIDIA A10G GPUs and 96 GB of GPU memory in total.



In [None]:
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4

# TGI config
config = {
    'HF_MODEL_ID': "tiiuae/falcon-40b-instruct",  # model id from hf.co/models
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(1024),  # max length of the input text, in tokens
    'MAX_TOTAL_TOKENS': json.dumps(2048),  # max length of the generation, in tokens (including input text)
    # 'HF_MODEL_QUANTIZE': "bitsandbytes",  # uncomment to quantize
}

# create HuggingFaceModel
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config
)
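
As a rough sanity check on the instance choice, the memory needed for the weights alone can be estimated from the parameter count. The sketch below assumes 16-bit weights (2 bytes per parameter) and ignores the KV cache and activations; `weight_memory_gb` is a hypothetical helper, not part of the SageMaker SDK.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory needed for the model weights alone, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


falcon_40b_params = 40e9  # "40B" from the model name; the exact count differs slightly
gpu_memory_gb = 96        # total across the 4 A10G GPUs of ml.g5.12xlarge

needed = weight_memory_gb(falcon_40b_params)  # assumption: 16-bit weights
print(f"weights: ~{needed:.0f} GB of {gpu_memory_gb} GB")
print(f"left for KV cache and activations: ~{gpu_memory_gb - needed:.0f} GB")
```

Under these assumptions the weights take most of the available GPU memory, which is one reason the configuration keeps the token limits modest and offers quantization as an option.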


After we have created the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.g5.12xlarge instance type. The Hugging Face LLM Deep Learning Container is powered by [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference), an open-source, purpose-built solution for deploying and serving Large Language Models. TGI will automatically distribute and shard the model across all GPUs.

In [None]:
# Deploy model to an endpoint

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    # volume_size=400,  # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
)


After our endpoint is deployed, we can run inference on it using the predict method of the predictor. We can use different parameters to control the generation, defining them in the parameters attribute of the payload. As of today, TGI supports the following parameters:

- temperature: Controls randomness in the model. Lower values make the output more deterministic; higher values make it more random. Default value is 1.0.
- max_new_tokens: The maximum number of tokens to generate. Default value is 20. The number of input tokens plus max_new_tokens cannot exceed the container's MAX_TOTAL_TOKENS setting (2048 in this lab).
- repetition_penalty: Controls the likelihood of repetition by penalizing tokens that have already appeared. Default value is null.
- seed: The seed to use for random generation. Default value is null.
- stop: A list of sequences that stop the generation. The generation stops as soon as one of them is generated.
- top_k: The number of highest-probability vocabulary tokens to keep for top-k filtering. Default value is null, which disables top-k filtering.
- top_p: The cumulative probability threshold for nucleus sampling: only the smallest set of highest-probability tokens whose probabilities add up to at least top_p is kept. Default value is null.
- do_sample: Whether or not to use sampling; use greedy decoding otherwise. Default value is false.
- best_of: Generate best_of sequences and return the one with the highest token logprobs. Default value is null.
- details: Whether or not to return details about the generation. Default value is false.
- return_full_text: Whether to return the full text (prompt plus generation) or only the generated part. Default value is false.
- truncate: The number of tokens to truncate the input to. Default value is null, which disables truncation.
- typical_p: The typical probability of a token, used for typical decoding. Default value is null.
- watermark: Whether or not to watermark the generated text. Default value is false.
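
To build intuition for how temperature, top_k, and top_p interact, here is a toy sampler over a three-token vocabulary. It is a minimal sketch of the general technique, not TGI's implementation; `sample_token` and the example logits are made up for illustration.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token from a {token: logit} mapping (toy illustration)."""
    # temperature: divide logits before the softmax; values < 1 sharpen the distribution
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)

    # top_k: keep only the k most likely tokens
    if top_k is not None:
        ranked = ranked[:top_k]

    # top_p: keep the smallest set of top tokens whose probability mass reaches top_p
    if top_p is not None:
        total = sum(weight for _, weight in ranked)
        kept, cumulative = [], 0.0
        for tok, weight in ranked:
            kept.append((tok, weight))
            cumulative += weight / total
            if cumulative >= top_p:
                break
        ranked = kept

    tokens, token_weights = zip(*ranked)
    return rng.choices(tokens, weights=token_weights, k=1)[0]


toy_logits = {"cloud": 2.0, "service": 1.5, "banana": -1.0}
rng = random.Random(0)
print(sample_token(toy_logits, temperature=0.8, top_p=0.9, rng=rng))
print(sample_token(toy_logits, top_k=1, rng=rng))  # top_k=1 always returns the most likely token
```

With these logits, `top_p=0.9` drops the unlikely "banana" token entirely, while `top_k=1` behaves like greedy decoding.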


In [None]:
# define payload
prompt = """You are a helpful assistant called Falcon, and you know everything about AWS.

User: Can you tell me something about Amazon SageMaker?
Falcon:"""

# hyperparameters for llm
payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 1024,
        "repetition_penalty": 1.03,
        "stop": ["\nUser:", "<|endoftext|>", "</s>"]
    }
}

# send request to endpoint
response = llm.predict(payload)

for seq in response:
    print(f"Result: {seq['generated_text']}")
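
Depending on the container version, the generated text may still end with the stop sequence that terminated the generation. A small client-side helper can cut the text at the first stop sequence. This is a sketch under that assumption; `trim_at_stop` is a hypothetical helper, not part of the SDK.

```python
from typing import Sequence


def trim_at_stop(text: str, stop_sequences: Sequence[str]) -> str:
    """Cut `text` at the earliest occurrence of any non-empty stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # an empty string would match at position 0
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].strip()


sample = " Amazon SageMaker is a managed machine learning service.\nUser:"
print(trim_at_stop(sample, ["\nUser:", "<|endoftext|>"]))
```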


## Prompt Engineering
Prompt engineering is a technique for designing effective prompts for LLMs, with the goal of achieving:

- Control over the output: With prompt engineering, developers can control the output generated by LLMs. By designing prompts that specify the desired topic, style, tone, and level of formality, they can guide the LLM to produce text that meets the desired criteria.
- Mitigating bias: LLMs have been shown to produce biased outputs when prompted with certain topics or language patterns. By engineering prompts that avoid biased language and encourage fairness, developers can help mitigate these issues.
- Improving efficiency: Prompt engineering can help LLMs work more efficiently by guiding them to generate the desired output with fewer iterations. By providing clear, concise, and specific prompts, developers can help LLMs achieve the desired outcome faster and with fewer errors.

In general, a prompt can contain any of the following components:

- Instruction - a specific task or instruction you want the model to perform
- Context - external information or additional context that can steer the model to better responses
- Input Data - the input or question we want a response for
- Output Indicator - the type or format of the output

In general, the more relevant information we provide in the prompt, the better these goals will be achieved.
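
The four components can be assembled with a small template function. This is a minimal sketch: `build_prompt` and the "Context:" and "Question:" labels are illustrative choices, not a required format.

```python
def build_prompt(instruction="", context="", input_data="", output_indicator=""):
    """Join the non-empty prompt components, separated by blank lines."""
    parts = []
    if instruction:
        parts.append(instruction.strip())
    if context:
        parts.append(f"Context: {context.strip()}")
    if input_data:
        parts.append(f"Question: {input_data.strip()}")
    if output_indicator:
        parts.append(output_indicator.strip())
    return "\n\n".join(parts)


prompt = build_prompt(
    instruction="Answer the question based on the context below. Keep the answer short and concise.",
    context="Amazon SageMaker is a managed service for building, training, and deploying ML models.",
    input_data="What is Amazon SageMaker used for?",
    output_indicator="Answer:",
)
print(prompt)
```

Components that are left empty are simply skipped, so the same function covers prompts with or without context.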

Let's try it out!

In [None]:
# Simple unstructured prompt
prompt = """
Teplizumab traces its roots to a New Jersey drug company called Ortho Pharmaceutical. There, scientists generated an early version of the antibody, dubbed OKT3. Originally sourced from mice, the molecule was able to bind to the surface of T cells and limit their cell-killing potential. In 1986, it was approved to help prevent organ rejection after kidney transplants, making it the first therapeutic antibody allowed for human use.

User: What was OKT3 originally sourced from?

Falcon:"""


# hyperparameters for llm
payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 1024,
        "repetition_penalty": 1.03,
        "stop": ["\nUser:", "<|endoftext|>", "</s>"]
    }
}

# send request to endpoint
response = llm.predict(payload)

for seq in response:
    print(f"Result: {seq['generated_text']}")
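
The payload is rebuilt with the same generation parameters in every cell of this lab. One way to avoid the repetition is a helper that starts from shared defaults and applies per-request overrides. This is a sketch; `build_payload` and `DEFAULT_PARAMETERS` are hypothetical names, and the stop sequences shown are illustrative.

```python
import copy
import json

# Defaults mirroring the sampling settings used in this lab.
DEFAULT_PARAMETERS = {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.03,
    "stop": ["\nUser:", "<|endoftext|>"],  # illustrative stop sequences
}


def build_payload(prompt, **overrides):
    """Return a request payload with the defaults, updated by any overrides."""
    parameters = copy.deepcopy(DEFAULT_PARAMETERS)  # never mutate the shared defaults
    parameters.update(overrides)
    return {"inputs": prompt, "parameters": parameters}


example = build_payload("Question: What is Amazon SageMaker?\nAnswer:", max_new_tokens=64)
print(json.dumps(example, indent=2))
```

The resulting dictionary can be passed to `llm.predict(...)` in the same way as the hand-written payloads.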

In [None]:
# We now stick to the scheme proposed above
prompt = """
Answer the question based on the context below. Keep the answer short and concise. Respond "Unsure about answer" if not sure about the answer.

Context: Teplizumab traces its roots to a New Jersey drug company called Ortho Pharmaceutical. There, scientists generated an early version of the antibody, dubbed OKT3. Originally sourced from mice, the molecule was able to bind to the surface of T cells and limit their cell-killing potential. In 1986, it was approved to help prevent organ rejection after kidney transplants, making it the first therapeutic antibody allowed for human use.

Question: What was OKT3 originally sourced from?

Answer:"""


# hyperparameters for llm
payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 1024,
        "repetition_penalty": 1.03,
        "stop": ["\nUser:", "<|endoftext|>", "</s>"]
    }
}

# send request to endpoint
response = llm.predict(payload)
for seq in response:
    print(f"Result: {seq['generated_text']}")


In addition, [few-shot learning](https://www.analyticsvidhya.com/blog/2021/05/an-introduction-to-few-shot-learning/) is an interesting approach for the context element of a prompt. Few-shot learning is a prompt engineering technique that enables models to pick up new tasks or concepts from only a few examples placed in the prompt (a single-digit number is usually enough). Even though the model has never seen the task during training, the examples typically give a significant boost in performance.

In [None]:
# Zero-shot (no examples in the prompt)
prompt = """
Tweet: "This new music video was incredible"
Sentiment:"""


# hyperparameters for llm
payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 1024,
        "repetition_penalty": 1.03,
        "stop": ["\nUser:", "<|endoftext|>", "</s>"]
    }
}

# send request to endpoint
response = llm.predict(payload)

for seq in response:
    print(f"Result: {seq['generated_text']}")


In [None]:
# Few-shot
prompt = """
Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredible"
Sentiment:"""

# hyperparameters for llm
payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
        "max_new_tokens": 1024,
        "repetition_penalty": 1.03,
        "stop": ["\nUser:", "<|endoftext|>", "</s>"]
    }
}

# send request to endpoint
response = llm.predict(payload)
for seq in response:
    print(f"Result: {seq['generated_text']}")
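
A few-shot prompt like the one above can also be assembled from a list of labeled examples, which makes it easy to vary the number of shots. This is a minimal sketch; `build_few_shot_prompt` is a hypothetical helper that follows the "Tweet / Sentiment / ###" layout used in this lab.

```python
def build_few_shot_prompt(examples, query, separator="###"):
    """Format (text, label) examples followed by the unlabeled query."""
    blocks = [f'Tweet: "{text}"\nSentiment: {label}' for text, label in examples]
    blocks.append(f'Tweet: "{query}"\nSentiment:')
    return f"\n{separator}\n".join(blocks)


examples = [
    ("I hate it when my phone battery dies.", "Negative"),
    ("My day has been 👍", "Positive"),
    ("This is the link to the article", "Neutral"),
]
prompt = build_few_shot_prompt(examples, "This new music video was incredible")
print(prompt)
```

Passing an empty list of examples yields the prompt without any demonstrations, so the same helper can be used to compare results with and without examples.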


In [None]:
# clean up: delete the model and the endpoint to stop incurring charges
llm.delete_model()
llm.delete_endpoint()
