# GPT2 on SageMaker using Hugging Face Text Generation Inference (TGI) DLC

This notebook demonstrates how to deploy the [GPT2](https://huggingface.co/gpt2) language model on Amazon SageMaker using the Hugging Face [Text Generation Inference](https://huggingface.co/text-generation-inference) (TGI) Deep Learning Container (DLC). GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion for causal language modeling.

TGI is an open-source, high-performance inference library that can be used to deploy large language models from Hugging Face’s repository in minutes. The library includes advanced functionality such as model parallelism and dynamic batching to simplify production inference with large language models like flan-t5-xxl, LLaMa, StableLM, and GPT-NeoX.

This notebook was tested on an `ml.t3.medium` instance using the `Python 3 (Data Science)` kernel on SageMaker Studio.


## Setup

Install the SageMaker Python SDK

First, make sure that the latest version of SageMaker SDK is installed.

In [16]:
%pip install "sagemaker>=2.163.0" "transformers>=4.30.1" --quiet

Note: you may need to restart the kernel to use updated packages.


## Setup account and role

Then, we import the SageMaker python SDK and instantiate a sagemaker_session which we use to determine the current region and execution role.

In [17]:
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import time

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

## Retrieve the LLM Image URI

We use the helper function get_huggingface_llm_image_uri() to generate the appropriate image URI for the Hugging Face Large Language Model (LLM) inference.

The function takes a required parameter, backend, and several optional parameters. The backend parameter specifies which inference backend to use and accepts either "lmi" or "huggingface": "lmi" selects the SageMaker LMI inference backend, and "huggingface" selects the Hugging Face TGI inference backend.

In [18]:
image_uri = get_huggingface_llm_image_uri(backend="huggingface", region=region)

print(image_uri)

763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.0.0-tgi0.8.2-gpu-py39-cu118-ubuntu20.04
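
The URI printed above packs several pieces of information into one string. As a rough illustration, it can be split into registry account, region, repository, and tag. The helper `parse_ecr_image_uri` is hypothetical (it is not part of the SageMaker SDK), and the pattern assumes a standard `amazonaws.com` registry host:

```python
import re

# Hypothetical helper for illustration; not part of the SageMaker SDK.
# Assumes the usual <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag> layout.
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image_uri(uri: str) -> dict:
    """Split an ECR image URI into account, region, repository, and tag."""
    match = ECR_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Not a recognised ECR image URI: {uri!r}")
    return match.groupdict()


# The URI shown in the cell output above.
example_uri = (
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
    "huggingface-pytorch-tgi-inference:2.0.0-tgi0.8.2-gpu-py39-cu118-ubuntu20.04"
)
parts = parse_ecr_image_uri(example_uri)
print(parts["region"], parts["repository"], parts["tag"])
```

Read this way, the tag appears to encode the PyTorch version (2.0.0), the TGI version (0.8.2), the GPU target, the Python version, and the CUDA version, which is useful when checking that the container matches what you expect.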


## Create the Hugging Face Model

Next we configure the model object by specifying a unique name, the image_uri for the managed TGI container, and the execution role for the endpoint. Additionally, we specify a number of environment variables including the HF_MODEL_ID which corresponds to the model from the HuggingFace Hub that will be deployed, and the HF_TASK which configures the inference task to be performed by the model.


SM_NUM_GPUS specifies the tensor parallelism degree of the model and is necessary if the LLM is too big for a single GPU. The GPT2 model fits on a single GPU, so we set SM_NUM_GPUS to 1.

The variable HF_TASK is set to 'text-generation'.


In [19]:
model_name = "gpt2-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {
    "HF_MODEL_ID": "gpt2",
    "HF_TASK": "text-generation",
    "SM_NUM_GPUS": "1",
}

model = HuggingFaceModel(name=model_name, env=hub, role=role, image_uri=image_uri)
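
Container environment variables are passed to SageMaker as strings, which is why SM_NUM_GPUS is written as "1" rather than 1. A minimal sketch of a helper that builds the same dictionary and converts every value to a string is shown below; `build_tgi_env` is a hypothetical name introduced only for this example:

```python
def build_tgi_env(model_id: str, num_gpus: int = 1, task: str = "text-generation") -> dict:
    """Build the environment dictionary for the container (hypothetical helper).

    All values are converted to strings, matching the dictionary used above.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    env = {
        "HF_MODEL_ID": model_id,  # model to pull from the Hugging Face Hub
        "HF_TASK": task,          # inference task
        "SM_NUM_GPUS": num_gpus,  # tensor parallelism degree
    }
    return {key: str(value) for key, value in env.items()}


hub = build_tgi_env("gpt2", num_gpus=1)
print(hub)
```

The resulting `hub` dictionary can be passed as `env=hub` exactly as in the cell above.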

## Creating a SageMaker Endpoint
Next, we deploy the model by invoking the deploy() function. Here we use an ml.g5.xlarge instance, which comes with one NVIDIA A10G GPU.

In [20]:
predictor = model.deploy(
    initial_instance_count=1, instance_type="ml.g5.xlarge", endpoint_name=model_name
)

--------!
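
Because the model name is reused as the endpoint name, it has to satisfy SageMaker's endpoint naming rules. The sketch below assumes those rules are "at most 63 characters, alphanumerics and hyphens only, not starting or ending with a hyphen"; treat both the limit and the pattern as assumptions to check against the current API reference. `make_endpoint_name` is a hypothetical helper:

```python
import re
import time

# Assumed naming rules; verify against the SageMaker API reference.
ENDPOINT_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_ENDPOINT_NAME_LENGTH = 63


def make_endpoint_name(prefix: str, now=None) -> str:
    """Return '<prefix>-<UTC timestamp>' and fail early if it looks invalid."""
    timestamp = time.strftime("%Y-%m-%d-%H-%M-%S", now or time.gmtime())
    name = f"{prefix}-{timestamp}"
    if len(name) > MAX_ENDPOINT_NAME_LENGTH:
        raise ValueError(f"Endpoint name is too long ({len(name)} characters): {name}")
    if not ENDPOINT_NAME_PATTERN.match(name):
        raise ValueError(f"Endpoint name contains unsupported characters: {name}")
    return name


model_name = make_endpoint_name("gpt2")
print(model_name)
```

Validating the name locally gives a clearer error than waiting for the deployment call to reject it.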

## Running Inference

Once the endpoint is up and running, we can evaluate the model using the predict() function. 

The default decoding strategy for text generation is greedy search, the simplest strategy, which picks the token with the highest probability as the next token. However, you can customize the output by passing parameters and their values to the predict() function.

Some of the commonly adjusted parameters include:

**do_sample**: if set to True, this parameter enables decoding strategies such as multinomial sampling, beam-search multinomial sampling, Top-K sampling and Top-p sampling. All these strategies select the next token from the probability distribution over the entire vocabulary with various strategy-specific adjustments.

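**max_new_tokens**: the maximum number of tokens to generate, not including the tokens in the prompt.

**temperature**: a scaling factor applied to the token scores before sampling. Values below 1 increase the chances of high-probability words and decrease the chances of low-probability words; values above 1 flatten the distribution and make the output more varied.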

**num_beams**: Set number of beams to be higher than 1 to switch from greedy search to beam search. Beam-search decoding keeps several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with lower probability initial tokens and would’ve been ignored by the greedy search.
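
To make the effect of do_sample and temperature concrete, here is a toy sketch in plain Python. It is not how TGI implements decoding; the logits are made up, and the functions `softmax_with_temperature` and `pick_next_token` are hypothetical names used only for this illustration:

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def pick_next_token(logits, do_sample=False, temperature=1.0, rng=None):
    """Greedy search picks the top score; sampling draws from the distribution."""
    if not do_sample:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    rng = rng or random.Random()
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
sharp = softmax_with_temperature(toy_logits, temperature=0.7)
flat = softmax_with_temperature(toy_logits, temperature=1.5)
greedy_choice = pick_next_token(toy_logits)
sampled_choice = pick_next_token(toy_logits, do_sample=True, temperature=0.7)

print("temperature 0.7:", [round(p, 3) for p in sharp])
print("temperature 1.5:", [round(p, 3) for p in flat])
print("greedy:", greedy_choice, "sampled:", sampled_choice)
```

With the lower temperature, more probability mass ends up on the highest-scoring token, so sampled output stays closer to what greedy search would produce.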

In [21]:
input_data = {
    "inputs": "The diamondback terrapin was the first reptile to",
    "parameters": {
        "do_sample": True,
        "max_new_tokens": 100,
        "temperature": 0.7,
        "num_beams": 5,
    },
}

predictor.predict(input_data)

[{'generated_text': 'The diamondback terrapin was the first reptile to make a living in the wild.\n\nThe diamondback terrapin is the first reptile to make a living in the wild.\n\nCrown Victoria Historical Society\n\ncrownvirginia.org\n\nReference:\n\nCorneh, C. 1991. "What was the first diamondback terrapin?" Proceedings of the Royal Society B: Biological Sciences, 266(2): 727-734.\n\nCorneh, C. 1991. "'}]
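
The response is a list containing one dictionary whose generated_text field includes the prompt followed by the continuation. A small sketch for pulling out only the newly generated text follows; `extract_continuation` is a hypothetical helper, and `sample_response` is a shortened copy of the output above:

```python
def extract_continuation(response, prompt):
    """Return the generated text with the leading prompt removed (hypothetical helper)."""
    if not response or "generated_text" not in response[0]:
        raise ValueError("Unexpected response shape")
    text = response[0]["generated_text"]
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


prompt = "The diamondback terrapin was the first reptile to"
# Shortened copy of the response shown above, used here as sample data.
sample_response = [
    {
        "generated_text": "The diamondback terrapin was the first reptile to "
        "make a living in the wild."
    }
]
continuation = extract_continuation(sample_response, prompt)
print(continuation)
```

In the notebook, the same function could be applied to the value returned by `predictor.predict(input_data)`.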

## Cleaning Up

After you've finished using the endpoint, it's important to delete it to avoid incurring unnecessary costs.

In [22]:
predictor.delete_model()
predictor.delete_endpoint()
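
If an error interrupts the notebook before the cleanup cell runs, the endpoint keeps running. One possible safeguard, sketched below, is a context manager that always calls the two cleanup methods used above. `temporary_endpoint` is a hypothetical helper, and `FakePredictor` is a stand-in so the example runs without an AWS account:

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(predictor):
    """Yield the predictor and always clean up, even if inference fails."""
    try:
        yield predictor
    finally:
        try:
            predictor.delete_model()
        finally:
            # Runs even if deleting the model raised an error.
            predictor.delete_endpoint()


class FakePredictor:
    """Stand-in for a real predictor; records which methods were called."""

    def __init__(self):
        self.calls = []

    def predict(self, data):
        self.calls.append("predict")
        return [{"generated_text": data["inputs"]}]

    def delete_model(self):
        self.calls.append("delete_model")

    def delete_endpoint(self):
        self.calls.append("delete_endpoint")


fake = FakePredictor()
try:
    with temporary_endpoint(fake) as p:
        p.predict({"inputs": "hello"})
        raise RuntimeError("simulated failure during inference")
except RuntimeError:
    pass

print(fake.calls)
```

With a real deployment, the object returned by `model.deploy(...)` would take the place of `FakePredictor`.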

## Conclusion

In this tutorial, we used a TGI container to deploy GPT2 on a SageMaker ml.g5.xlarge instance. With Hugging Face's Text Generation Inference and SageMaker Hosting, you can easily host large language models like GPT-NeoX, flan-t5-xxl, and LLaMa.