# Hosting bloom-7b1 on Amazon SageMaker using HuggingFace Text Generation Inference (TGI)

Hugging Face [Text Generation Inference](https://huggingface.co/blog/sagemaker-huggingface-llm) (TGI) is a model server for serving text generation models. TGI is built with Python, Rust, and gRPC, an open-source, high-performance Remote Procedure Call framework.

TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5.

TGI is optimized for many large language [model architectures](https://github.com/huggingface/text-generation-inference#optimized-architectures). BLOOM is one of them.

## BLOOM-7B1

[BLOOM](https://bigscience.huggingface.co/) is an open-access autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources. BLOOM can be instructed to perform text tasks it hasn't been explicitly trained for by casting them as text generation tasks. The BLOOM models were trained using a mix of floating-point precision formats: float16 for models with up to 7.1 billion parameters, and **bfloat16** for the largest model, BLOOM with 176 billion parameters.

Refer to the [Bloom: A 176B-parameter Open-Access Multilingual Language Model](https://arxiv.org/abs/2211.05100) paper for more details.
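
To make the precision formats concrete, the sketch below estimates the memory needed to hold the model weights alone. It assumes 7.1 billion parameters (taken from the model name) and the standard widths of 2 bytes for float16 and bfloat16 and 4 bytes for float32. The helper name `weight_memory_gib` is illustrative, and the estimate deliberately ignores activations, the KV cache, and runtime overhead.

```python
# Rough estimate of the memory needed to store model weights only.
# Assumption: the parameter count (7.1B) comes from the model name;
# activations, KV cache and runtime overhead are NOT included.

BYTES_PER_PARAM = {
    "float32": 4,
    "float16": 2,
    "bfloat16": 2,
}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the approximate weight footprint in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    num_params = 7.1e9
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gib(num_params, dtype):.1f} GiB")
```

Under these assumptions, the float16 weights come to roughly 13 GiB, which is a useful lower bound when choosing an instance type.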

In this notebook, we show how to deploy the [bloom-7b1](https://huggingface.co/bigscience/bloom-7b1) model using the Hugging Face TGI Deep Learning Container (DLC) on Amazon SageMaker.


## Install Pre-requisites

First, make sure that the latest version of the SageMaker Python SDK is installed.

In [None]:
!pip install "sagemaker>=2.163.0" watermark --upgrade --quiet

In [None]:
%load_ext watermark
%watermark -p sagemaker

## Setup account and role

We instantiate `sagemaker.Session()` to determine the current region and call `get_execution_role()` to retrieve the execution role.

In [None]:
from datetime import datetime
from pathlib import Path
from uuid import uuid4

import sagemaker
from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

sm_session = sagemaker.Session()
role = get_execution_role()
region = sm_session.boto_region_name

HF_MODEL_ID = "bigscience/bloom-7b1"


print(f"Role: {role}")
print(f"Region: {region}")

## Create model using HF Text-Generation-Inference (TGI) backend

We use the helper function `get_huggingface_llm_image_uri()` to retrieve the URI of the Hugging Face TGI-powered DLC.

- The function takes a required parameter `backend` and other optional parameters.
- The `backend` parameter specifies the type of backend to use for the model; valid values are **`lmi`** and **`huggingface`**.
  - Using **`huggingface`** invokes the Hugging Face Text-Generation-Inference (TGI) backend.
  - Specifying **`lmi`** uses the SageMaker Large Model Inference (LMI) backend (with `djl-deepspeed` as the framework).


Additionally, we specify the following environment variables to be used during model creation:
- `HF_MODEL_ID`: model ID as defined on the Hugging Face Hub, e.g. `'bigscience/bloom-7b1'`
- `HF_TASK`: NLP task for the model in use, e.g. `'text-generation'`
- `HF_MODEL_QUANTIZE`: library to use for quantization, e.g. `'bitsandbytes'`
- `SM_NUM_GPUS`: number of GPUs used per replica, which sets the tensor parallelism degree of the model, e.g. `4`


Set `SM_NUM_GPUS` to the number of available GPUs when you plan to host your model on an instance type with multiple GPUs.

TGI will automatically distribute and shard the model across all GPUs.
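
As a minimal sketch, the environment variables listed above can be assembled with a small helper. The name `build_tgi_env` and its arguments are hypothetical and not part of the SageMaker SDK; the only assumption made here is that environment variable values are passed as strings, so the GPU count is converted explicitly.

```python
from typing import Dict, Optional


def build_tgi_env(
    model_id: str, num_gpus: int, quantize: Optional[str] = None
) -> Dict[str, str]:
    """Build the container environment dict (hypothetical helper).

    All values are strings, so the GPU count is converted with str().
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    env = {
        "HF_MODEL_ID": model_id,
        "HF_TASK": "text-generation",
        # Tensor parallelism degree: one shard per GPU
        "SM_NUM_GPUS": str(num_gpus),
    }
    # Quantization is optional; only set the variable when requested
    if quantize is not None:
        env["HF_MODEL_QUANTIZE"] = quantize
    return env


if __name__ == "__main__":
    print(build_tgi_env("bigscience/bloom-7b1", 4, quantize="bitsandbytes"))
```

Keeping the conversion in one place avoids accidentally passing an integer for `SM_NUM_GPUS` when the GPU count changes with the instance type.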

In [None]:
model_name = str(Path(HF_MODEL_ID.split("/")[-1]))
suffix = f"{str(uuid4())[:5]}-{datetime.now().strftime('%d%b%Y')}"

# define model name, endpoint_name
model_name = f"{model_name}-{suffix}"
endpoint_name = model_name

# We use ml.g5.12xlarge instance type with 4 GPUs
instance_type = "ml.g5.12xlarge"

# retrieve the llm image uri with huggingface TGI backend
llm_image_uri = get_huggingface_llm_image_uri(
    backend="huggingface", version="0.8.2", region=region, session=sm_session
)

# print(f"llm image uri: {llm_image_uri}")

# define hugging face config values
config = {
    "HF_MODEL_ID": HF_MODEL_ID,
    "HF_TASK": "text-generation",
    "SM_NUM_GPUS": "4",  # Number of GPUs used per replica
    "HF_MODEL_QUANTIZE": "bitsandbytes",  # Quantization library used
}

print(f"Creating model: {model_name} using {HF_MODEL_ID}")

# create HuggingFaceModel with the HF TGI image uri
model = HuggingFaceModel(name=model_name, env=config, role=role, image_uri=llm_image_uri)

### Deploy model to Amazon SageMaker real-time endpoint

**NOTE:** Model deployment takes approximately 10 minutes or more.

In [None]:
health_check_timeout = 600

print(f"Deploying model to endpoint: {endpoint_name} | {instance_type} ...")

predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=health_check_timeout,
)
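
If you need to check endpoint readiness from a separate process, a polling loop along the following lines may help. This is a sketch under stated assumptions: `wait_until_in_service`, its default timeout, and its poll interval are hypothetical, and `get_status` is any callable that returns the current endpoint status as a string.

```python
import time
from typing import Callable


def wait_until_in_service(
    get_status: Callable[[], str],
    timeout_s: float = 900.0,  # assumed default, adjust to your deployment
    poll_s: float = 30.0,  # assumed default poll interval
) -> str:
    """Poll get_status() until it returns "InService", fails, or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("Endpoint creation failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Endpoint still in status {status!r} after {timeout_s}s"
            )
        time.sleep(poll_s)
```

In practice, `get_status` could wrap the boto3 SageMaker client, for example by returning `describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]`.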

### Test Inference

Once the endpoint is in the `InService` status, we can evaluate the model using the `predict()` function.

Refer to [this blog post](https://huggingface.co/blog/sagemaker-huggingface-llm#4-run-inference-and-chat-with-our-model) for a full list of parameters supported by TGI.

In [None]:
input_data = {
    "inputs": "What is the weather forecast for Seattle today?",
    "parameters": {
        "do_sample": True,
        "max_new_tokens": 100,
        "temperature": 0.7,
        "watermark": True,
    },
}

prompt_length = len(input_data["inputs"])

response = predictor.predict(input_data)

output = response[0]["generated_text"][prompt_length:]

print(output)
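
The cell above removes the prompt by slicing off its length, which assumes the generated text always starts with the prompt. A slightly more defensive variant is sketched below. The helper name `strip_prompt` is hypothetical, and the sample response is made up for illustration; it only mirrors the `response[0]["generated_text"]` shape used above.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the completion, removing the prompt if the text starts with it."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # The prompt was not echoed back, so return the text unchanged
    return generated_text


if __name__ == "__main__":
    prompt = "What is the weather forecast for Seattle today?"
    # Made-up response for illustration only; not real model output
    response = [{"generated_text": prompt + " It looks cloudy."}]
    print(strip_prompt(response[0]["generated_text"], prompt))
```

This avoids truncating the completion if the endpoint is ever configured to return only the newly generated text.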

## Cleanup

After you've finished testing the endpoint, it's important to delete the `model` and `endpoint` resources to avoid incurring charges.

In [None]:
try:
    print(f"Deleting model: {model_name}")
    predictor.delete_model()
except Exception as e:
    print(f"{e}")

try:
    print(f"Deleting endpoint: {endpoint_name}")
    predictor.delete_endpoint()
except Exception as e:
    print(f"{e}")

## Conclusion

In this tutorial, we used a HuggingFace TGI container to deploy **BLOOM-7B1** to an Amazon SageMaker real-time endpoint. 

With Hugging Face's Text Generation Inference and SageMaker Hosting, you can easily host large language models like BLOOM-7B1, BLOOM-560M, GPT-NeoX, flan-t5-xxl and many more.