# Serve Flan-T5-XXL on SageMaker with FasterTransformer using the DJL container


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/inference|generativeai|llm-workshop|lab5-flan-t5-xxl|flant5-xxl-fastertransformer-no-code.ipynb)

---


In this notebook, we explore how to host a large language model on SageMaker using the latest container that packages some of the most popular open source libraries for model parallel inference, such as FasterTransformer, DeepSpeed, and Hugging Face Accelerate. We use DJLServing as the model serving solution in this example. DJLServing is a high-performance, programming-language-agnostic universal model serving solution powered by the Deep Java Library (DJL). To learn more about DJL and DJLServing, refer to our recent [blog post](https://aws.amazon.com/blogs/machine-learning/deploy-bloom-176b-and-opt-30b-on-amazon-sagemaker-with-large-model-inference-deep-learning-containers-and-deepspeed/).

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI’s 175 billion parameter GPT-3 and the similarly sized open source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.

SageMaker has rolled out Deep Learning Containers for large model inference, which let users leverage SageMaker's managed serving capabilities and offload the undifferentiated heavy lifting.

In this notebook, we deploy the open source Flan-T5-XXL model across the GPUs of an ml.g5.12xlarge instance. The model is loaded using split layer or tensor model partitioning through [FasterTransformer](https://github.com/NVIDIA/FasterTransformer). You can also quantize the model weights to int8, thereby greatly reducing the memory footprint of the model compared with the initial FP32. See this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration) from Hugging Face for additional information.
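
To make the memory savings concrete, here is a back-of-the-envelope estimate. It is only a sketch: it assumes Flan-T5-XXL has roughly 11 billion parameters and counts the weights alone (activations and runtime buffers are ignored). The helper name `weight_footprint_gb` is illustrative and not part of any library.

```python
# Rough, weight-only memory estimate.
# Assumption: Flan-T5-XXL has approximately 11 billion parameters.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_footprint_gb(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 11e9  # assumption: approximate parameter count

for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: ~{weight_footprint_gb(NUM_PARAMS, dtype):.0f} GB")
```

Under these assumptions the FP32 weights alone exceed the 24 GB of memory on a single A10G GPU, which is why the model is partitioned across the four GPUs of the ml.g5.12xlarge instance.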

Flan-T5 is available at [Hugging Face Model Hub - Model Weights](https://huggingface.co/google/flan-t5-xxl). As an optimization, the model weights can also be converted ahead of time by following the steps in [Compile Model](https://github.com/NVIDIA/FasterTransformer/blob/main/examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py).

## Step 1: Install and import the required libraries, and set some variables

In [None]:
%pip install sagemaker boto3 awscli --upgrade  --quiet

In [None]:
import boto3
import jinja2
import sagemaker
from pathlib import Path
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment
jinja_env = jinja2.Environment()

## Step 2: Start preparing model artifacts
The LMI container expects a few artifacts that help set up the model:
- serving.properties (required): Defines the model server settings

This configuration file tells the LMI container which model parallelization and inference optimization libraries you would like to use. Set the options according to your needs.

A few options are specified here. Let's go through them in turn:
1. `engine`: Specifies the engine that will be used for this workload. In this case we host the model using the **FasterTransformer** engine.
2. `option.entryPoint`: Specifies the entrypoint code that will be used to host the model. `djl_python.fastertransformer` refers to the `fastertransformer.py` module from the [djl_python repo](https://github.com/deepjavalibrary/djl-serving/tree/master/engines/python/setup/djl_python).
3. `option.s3url`: Set this to the URI of the Amazon S3 location that contains the model. When this is set, the container uses [s5cmd](https://github.com/peak/s5cmd) to download the model from S3. This enables faster deployments because the DJL inference container uses an optimized approach to transfer the model from S3 to the hosting instance.
4. `option.tensor_parallel_degree`: Set this to the number of GPU devices over which the model is partitioned. This parameter also controls the number of workers per model that are started when DJLServing runs. For example, if we have an 8-GPU machine and create 8 partitions, we will have 1 worker per model to serve the requests.
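
The relationship between GPU count, tensor parallel degree, and workers can be sketched as a small calculation. This assumes the worker count is simply the number of GPUs divided by the tensor parallel degree, as the example in the list above implies; `workers_per_model` is a hypothetical helper, not a DJLServing API.

```python
def workers_per_model(num_gpus: int, tensor_parallel_degree: int) -> int:
    """Estimate how many workers serve one model.

    Assumption: each worker occupies `tensor_parallel_degree` GPUs, so the
    number of workers is the integer quotient of the two values.
    """
    if not 1 <= tensor_parallel_degree <= num_gpus:
        raise ValueError("tensor_parallel_degree must be between 1 and num_gpus")
    return num_gpus // tensor_parallel_degree


print(workers_per_model(8, 8))  # 8 GPUs split into 8 partitions
print(workers_per_model(4, 4))  # ml.g5.12xlarge has 4 GPUs; this notebook uses degree 4
```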


For more information on the available options, refer to the [SageMaker Large Model Inference Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html).


If you want to download the model from huggingface.co instead, you can set `option.modelid` to the model id of a pre-trained model hosted in a model repository on [huggingface.co](https://huggingface.co/models). The container uses this model id to download the corresponding model repository. In this example we use `option.s3url` to download the model from Amazon S3.
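
As an illustration of the two ways to point the container at the model, the sketch below renders a `serving.properties` body for either source. The key names are the ones mentioned in this notebook; treating `option.s3url` and `option.modelid` as mutually exclusive is an assumption, and `render_serving_properties` is a hypothetical helper. Check the LMI documentation for the exact keys your container version expects.

```python
def render_serving_properties(tensor_parallel_degree, s3url=None, model_id=None):
    """Return the text of a serving.properties file for FasterTransformer.

    Assumption: exactly one model source (S3 URL or Hugging Face model id)
    should be configured.
    """
    if (s3url is None) == (model_id is None):
        raise ValueError("Provide exactly one of s3url or model_id")
    lines = [
        "engine = FasterTransformer",
        "option.entryPoint = djl_python.fastertransformer",
    ]
    if s3url is not None:
        lines.append(f"option.s3url = {s3url}")
    else:
        lines.append(f"option.modelid = {model_id}")
    lines.append(f"option.tensor_parallel_degree = {tensor_parallel_degree}")
    return "\n".join(lines) + "\n"


print(render_serving_properties(4, model_id="google/flan-t5-xxl"))
```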


The following files are optional:
- model.py (optional): A Python file that defines the core inference logic
- requirements.txt (optional): Any additional pip packages that need to be installed


In [None]:
!rm -rf code_flant5_ft
!mkdir -p code_flant5_ft

In [None]:
%%writefile code_flant5_ft/serving.properties
engine = FasterTransformer
option.entryPoint = djl_python.fastertransformer
option.s3url = {{s3url}}
option.tensor_parallel_degree = 4

### Define a variable to contain the s3url of the location that has the model

In [None]:
pretrained_model_location = (
    f"s3://sagemaker-example-files-prod-{region}/models/hf-large-model-djl-ds/flant5-xxl/"
)
print(f"Pretrained model will be downloaded from ---- > {pretrained_model_location}")

In [None]:
# Plug the region-specific model location into our `serving.properties` file
properties_path = Path("code_flant5_ft/serving.properties")
template = jinja_env.from_string(properties_path.read_text())
properties_path.write_text(template.render(s3url=pretrained_model_location))
!pygmentize code_flant5_ft/serving.properties | cat -n

In [None]:
!rm -f model.tar.gz
!tar czvf model.tar.gz -C code_flant5_ft .
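
If you prefer to build the archive from Python rather than the shell, the standard library `tarfile` module can do the same job. The sketch below is an alternative under stated assumptions, not the notebook's method: it packs the files at the archive root (without the `./` prefix that `tar -C dir .` produces) and runs in a temporary directory so it does not touch your working files. `make_model_tarball` is a hypothetical helper.

```python
import tarfile
import tempfile
from pathlib import Path


def make_model_tarball(source_dir: Path, tarball_path: Path) -> list:
    """Pack every entry of source_dir at the archive root; return member names."""
    with tarfile.open(tarball_path, "w:gz") as tar:
        for path in sorted(source_dir.iterdir()):
            tar.add(path, arcname=path.name)
    # Re-open the archive to confirm what was actually written
    with tarfile.open(tarball_path, "r:gz") as tar:
        return tar.getnames()


# Demo in a throwaway directory with a placeholder serving.properties
with tempfile.TemporaryDirectory() as tmp:
    code_dir = Path(tmp) / "code_flant5_ft"
    code_dir.mkdir()
    (code_dir / "serving.properties").write_text("engine = FasterTransformer\n")
    members = make_model_tarball(code_dir, Path(tmp) / "model.tar.gz")
    print(members)
```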

## Step 3: Create the endpoint

The steps to create the endpoint are:

1. Create the model using the container image and the model tarball created earlier
2. Create the endpoint config using the following key parameters

    a) Instance type is ml.g5.12xlarge
    
    b) ContainerStartupHealthCheckTimeoutInSeconds is 3600, which gives the container up to an hour to load the model and pass the startup health check
3. Create the endpoint using the endpoint config
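
The three steps above map onto three SageMaker API calls. The notebook itself uses the higher-level `sagemaker.Model.deploy`, so the following is only a sketch of the equivalent low-level flow. It assumes `sm_client` is a `boto3.client("sagemaker")` and reuses one name for the model, endpoint config, and endpoint; the function name and the variant name `AllTraffic` are illustrative choices.

```python
def create_endpoint_with_boto3(
    sm_client,
    name,
    image_uri,
    model_data_url,
    role_arn,
    instance_type="ml.g5.12xlarge",
    startup_timeout=3600,
):
    """Create a model, an endpoint config, and an endpoint, in that order."""
    # Step 1: register the container image and the model tarball
    sm_client.create_model(
        ModelName=name,
        ExecutionRoleArn=role_arn,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
    )
    # Step 2: describe the hosting instance and the startup health check timeout
    sm_client.create_endpoint_config(
        EndpointConfigName=name,
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",
                "ModelName": name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
                "ContainerStartupHealthCheckTimeoutInSeconds": startup_timeout,
            }
        ],
    )
    # Step 3: create the endpoint from the config
    return sm_client.create_endpoint(EndpointName=name, EndpointConfigName=name)


# Example (not executed here; requires AWS credentials):
# import boto3
# create_endpoint_with_boto3(boto3.client("sagemaker"), "lmi-model-demo",
#                            inference_image_uri, code_artifact, role)
```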

#### Create the Model
Use the image URI for the DJL container and the S3 location to which the tarball was uploaded.

The container downloads the model into the `/tmp` space on the container because SageMaker maps `/tmp` to the Amazon Elastic Block Store (Amazon EBS) volume that is mounted when we specify the endpoint creation parameter `VolumeSizeInGB`. It leverages [s5cmd](https://github.com/peak/s5cmd), which offers very fast download speeds and is therefore extremely useful when downloading large models.

For instances like p4d, which come with built-in instance store volumes, we can continue to leverage `/tmp` on the container. The size of this mount is large enough to hold the model.


### Getting the container image URI

Available frameworks are:
- djl-deepspeed (0.20.0, 0.21.0)
- djl-fastertransformer (0.21.0)

[Deep learning containers for large model inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-dlc.html) has more details about each version and framework in the container.

In [None]:
inference_image_uri = image_uris.retrieve(
    framework="djl-fastertransformer", region=sess.boto_session.region_name, version="0.21.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

#### Upload the tarball to S3

The tarball created earlier mainly contains the `serving.properties` file.

In [None]:
s3_code_prefix = "flants/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")
env = {"HUGGINGFACE_HUB_CACHE": "/tmp", "TRANSFORMERS_CACHE": "/tmp"}

model = Model(image_uri=inference_image_uri, model_data=code_artifact, env=env, role=role)

#### Create SageMaker endpoint

You need to specify the instance type and the endpoint name.

In [None]:
instance_type = "ml.g5.12xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model")
print(endpoint_name)
model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=3600,
)

# our requests and responses will be in json format so we specify the serializer and the deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)


#### While you wait for the endpoint to be created, you can read more about:
- [Deep Learning containers for large model inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-dlc.html)
- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)   
- [Realtime Endpoint for Large Models - FasterTransformer](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-tutorials-fastertransformer.html)
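
`model.deploy` blocks until the endpoint is ready by default. If you create the endpoint without waiting (for example through the low-level API), you can poll its status yourself. The sketch below assumes `sm_client` is a `boto3.client("sagemaker")`; `wait_until_in_service` is a hypothetical helper built on the `describe_endpoint` call.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=3600):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after timeout")
        time.sleep(poll_seconds)


# Example (not executed here; requires AWS credentials):
# import boto3
# wait_until_in_service(boto3.client("sagemaker"), endpoint_name)
```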
   
## Step 4: Invoke the endpoint

This is a generative model, so we pass in text as a prompt and the model completes the sentence and returns the results.

You can pass a batch of prompts as input to the model. This is done by setting `inputs` to the list of prompts. The model then returns a result for each prompt. The text generation can be configured using appropriate parameters. These `parameters` need to be passed to the endpoint as a dictionary of `kwargs`. 

The code sample below illustrates the invocation of the endpoint using a batch of prompts and also sets some parameters.
   

Here's a list of the default arguments that are used by the model for inference. You can pass specific values based on the use case:
```
default_args = dict(
            inputs_embeds=None,
            beam_width=1,
            max_seq_len=200,
            top_k=1,
            top_p=0.0,
            beam_search_diversity_rate=0.0,
            temperature=1.0,
            len_penalty=0.0,
            repetition_penalty=1.0,
            presence_penalty=None,
            min_length=0,
            random_seed=0,
            is_return_output_log_probs=False,
            is_return_cum_log_probs=False,
            is_return_cross_attentions=False,
            bad_words_list=None,
            stop_words_list=None
        )
```
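
One way to catch misspelled parameter names before they reach the endpoint is to validate them against the defaults listed above. This is a client-side sketch only: it assumes the endpoint accepts exactly those keys, and `build_payload` is a hypothetical helper rather than part of the SageMaker SDK.

```python
# Parameter names copied from the default arguments listed above
ALLOWED_PARAMETERS = {
    "inputs_embeds", "beam_width", "max_seq_len", "top_k", "top_p",
    "beam_search_diversity_rate", "temperature", "len_penalty",
    "repetition_penalty", "presence_penalty", "min_length", "random_seed",
    "is_return_output_log_probs", "is_return_cum_log_probs",
    "is_return_cross_attentions", "bad_words_list", "stop_words_list",
}


def build_payload(prompts, **parameters):
    """Build the request body for a batch of prompts, rejecting unknown parameters."""
    unknown = set(parameters) - ALLOWED_PARAMETERS
    if unknown:
        raise ValueError(f"Unknown generation parameters: {sorted(unknown)}")
    return {"inputs": list(prompts), "parameters": parameters}


payload = build_payload(
    ["Amazon.com is the best ", "Large model inference is"],
    min_length=20,
    max_seq_len=200,
    temperature=0.1,
)
print(payload)
```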



In [None]:
predictor.predict(
    {
        "inputs": ["Amazon.com is the best ", "Large model inference is"],
        "parameters": {"min_length": 20, "max_seq_len": 200, "temperature": 0.1},
    }
)
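
Applications that do not use the SageMaker Python SDK can call the same endpoint through the `sagemaker-runtime` API. The sketch below assumes `runtime_client` is a `boto3.client("sagemaker-runtime")` and that the endpoint returns a JSON body; the helper name `invoke_endpoint_json` is illustrative.

```python
import json


def invoke_endpoint_json(runtime_client, endpoint_name, payload):
    """Send a JSON payload to the endpoint and decode the JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream; read it fully before decoding
    return json.loads(response["Body"].read())


# Example (not executed here; requires AWS credentials and a running endpoint):
# import boto3
# invoke_endpoint_json(
#     boto3.client("sagemaker-runtime"),
#     endpoint_name,
#     {"inputs": ["Large model inference is"], "parameters": {"max_seq_len": 200}},
# )
```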

### Examples of prompt engineering for 'Zero Shot' NLP tasks
Here are some examples of how you can use the deployed model for zero-shot NLP tasks.

#### Common Sense Reasoning

In [None]:
predictor.predict(
    {
        "inputs": [
            "The world cup has kicked off in Los Angeles, United States.\n\nBased on the paragraph above can we conclude that \"The world cup takes place in United States.\"?\n\n[\"yes\", \"no\"]"
        ],
        "parameters": {"temperature": 0.7},
    }
)

#### Text/Sentiment Classification

In [None]:
predictor.predict(
    {
        "inputs": [
            "Review:\nThis movie is so great and once again dazzles and delights us\nIs this movie review sentence negative or positive?\nOPTIONS:\n-positive \n-negative"
        ],
        "parameters": {},
    }
)

#### Translation

In [None]:
predictor.predict(
    {"inputs": ["My name is Arthur\n\nTranslate to German"], "parameters": {"temperature": 0.7}}
)

#### Article Generation

In [None]:
predictor.predict(
    {
        "inputs": [
            "Title: \"University has new facility coming up\"\nGiven the above title of an imaginary article, imagine the article.\n"
        ],
        "parameters": {"max_seq_len": 200, "temperature": 0.7},
    }
)

#### Abstractive Question Answering

In [None]:
context = """
Customer: Hi there, I'm having a problem with my iPhone.
Agent: Hi! I'm sorry to hear that. What's happening?
Customer: The phone is not charging properly, and the battery seems to be draining very quickly. I've tried different charging cables and power adapters, but the issue persists.
Agent: Hmm, that's not good. Let's try some troubleshooting steps. Can you go to Settings, then Battery, and see if there are any apps that are using up a lot of battery life?
Customer: Yes, there are some apps that are using up a lot of battery.
Agent: Okay, try force quitting those apps by swiping up from the bottom of the screen and then swiping up on the app to close it.
Customer: I did that, but the issue is still there.
Agent: Alright, let's try resetting your iPhone's settings to their default values. This won't delete any of your data. Go to Settings, then General, then Reset, and then choose Reset All Settings.
Customer: Okay, I did that. What's next?
Agent: Now, let's try restarting your iPhone. Press and hold the power button until you see the "slide to power off" option. Slide to power off, wait a few seconds, and then turn your iPhone back on.
Customer: Alright, I restarted it, but it's still not charging properly.
Agent: I see. It looks like we need to run a diagnostic test on your iPhone. Please visit the nearest Apple Store or authorized service provider to get your iPhone checked out.
Customer: Do I need to make an appointment?
Agent: Yes, it's always best to make an appointment beforehand so you don't have to wait in line. You can make an appointment online or by calling the Apple Store or authorized service provider.
Customer: Okay, will I have to pay for the repairs?
Agent: That depends on whether your iPhone is covered under warranty or not. If it is, you won't have to pay anything. However, if it's not covered under warranty, you will have to pay for the repairs.
Customer: How long will it take to get my iPhone back?
Agent: It depends on the severity of the issue, but it usually takes 1-2 business days.
Customer: Can I track the repair status online?
Agent: Yes, you can track the repair status online or by calling the Apple Store or authorized service provider.
Customer: Alright, thanks for your help.
Agent: No problem, happy to help. Is there anything else I can assist you with?
Customer: No, that's all for now.
Agent: Alright, have a great day and good luck with your iPhone!
"""

#### Q1

In [None]:
query = (
    "What troubleshooting steps were suggested to the customer to fix their iPhone charging issue?"
)

In [None]:
predictor.predict(
    {"inputs": [f"{context}\n{query}"], "parameters": {"max_seq_len": 200, "temperature": 0.7}}
)

#### Q2

In [None]:
query = "Was resetting the iPhone to its default settings able to solve the charging issue and battery drain problem?"

In [None]:
predictor.predict(
    {"inputs": [f"{context}\n{query}"], "parameters": {"max_seq_len": 200, "temperature": 0.7}}
)
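
Because the endpoint accepts a batch of prompts, both questions can also be sent in a single request. The sketch below shows one way to assemble such a batch; `build_qa_payload` is a hypothetical helper, and the short context string is a stand-in for the transcript above.

```python
def build_qa_payload(context, queries, **parameters):
    """Pair the shared context with each query and batch them into one request body."""
    prompts = [f"{context}\n{query}" for query in queries]
    return {"inputs": prompts, "parameters": parameters}


# Stand-in context for illustration; in the notebook you would pass `context`
demo_context = "Customer: My phone is not charging.\nAgent: Please try another cable."
qa_payload = build_qa_payload(
    demo_context,
    ["What was the customer's problem?", "What did the agent suggest?"],
    max_seq_len=200,
    temperature=0.7,
)
print(len(qa_payload["inputs"]))

# With a deployed endpoint: predictor.predict(qa_payload)
```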

## Clean up the environment

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()
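
If one of the delete calls fails (for example because the endpoint was already removed), the calls after it do not run and resources may be left behind. A possible guard, sketched here with a hypothetical `run_cleanup` helper, is to attempt every step and collect the failures for review.

```python
def run_cleanup(steps):
    """Run each (label, callable) step; return a list of (label, exception) failures."""
    failures = []
    for label, step in steps:
        try:
            step()
        except Exception as exc:  # keep going so later resources are still deleted
            failures.append((label, exc))
    return failures


# Example with the objects defined in this notebook (not executed here):
# failures = run_cleanup(
#     [
#         ("endpoint", lambda: sess.delete_endpoint(endpoint_name)),
#         ("endpoint config", lambda: sess.delete_endpoint_config(endpoint_name)),
#         ("model", model.delete_model),
#     ]
# )
# print(failures)
```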

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/inference|generativeai|llm-workshop|lab5-flan-t5-xxl|flant5-xxl-fastertransformer-no-code.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/inference|generativeai|llm-workshop|lab5-flan-t5-xxl|flant5-xxl-fastertransformer-no-code.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/inference|generativeai|llm-workshop|lab5-flan-t5-xxl|flant5-xxl-fastertransformer-no-code.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/inference|generativeai|llm-workshop|lab5-flan-t5-xxl|flant5-xxl-fastertransformer-no-code.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/inference|generativeai|llm-workshop|lab5-flan-t5-xxl|flant5-xxl-fastertransformer-no-code.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/inference|generativeai|llm-workshop|lab5-flan-t5-xxl|flant5-xxl-fastertransformer-no-code.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/inference|generativeai|llm-workshop|lab5-flan-t5-xxl|flant5-xxl-fastertransformer-no-code.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/inference|generativeai|llm-workshop|lab5-flan-t5-xxl|flant5-xxl-fastertransformer-no-code.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/inference|generativeai|llm-workshop|lab5-flan-t5-xxl|flant5-xxl-fastertransformer-no-code.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/inference|generativeai|llm-workshop|lab5-flan-t5-xxl|flant5-xxl-fastertransformer-no-code.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/inference|generativeai|llm-workshop|lab5-flan-t5-xxl|flant5-xxl-fastertransformer-no-code.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/inference|generativeai|llm-workshop|lab5-flan-t5-xxl|flant5-xxl-fastertransformer-no-code.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/inference|generativeai|llm-workshop|lab5-flan-t5-xxl|flant5-xxl-fastertransformer-no-code.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/inference|generativeai|llm-workshop|lab5-flan-t5-xxl|flant5-xxl-fastertransformer-no-code.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/inference|generativeai|llm-workshop|lab5-flan-t5-xxl|flant5-xxl-fastertransformer-no-code.ipynb)
