# Deploy Flan-UL2 on SageMaker with DJLServing using PySDK

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/inference|generativeai|llm-workshop|flan-ul2-pySDK|flan-ul2-pySDK.ipynb)

---

Sparked by the release of large AI models such as AlexaTM, GPT, OpenChatKit, BLOOM, GPT-J, GPT-NeoX, FLAN-T5, OPT, Stable Diffusion, and ControlNet, generative AI has seen a recent boom in popularity.

However, as the deep learning models that power generative AI continue to grow in size and complexity, deploying them becomes challenging. Advanced techniques such as model parallelism and quantization become necessary to meet latency and throughput requirements. Without expertise in these techniques, many customers struggle to get started with hosting large models for generative AI applications.

We demonstrate how you can deploy these large models on SageMaker using [DJL Serving](https://github.com/deepjavalibrary/djl-serving) and [Large Model Inference containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).

Specifically, we deploy the open source [Flan-UL2](https://huggingface.co/google/flan-ul2) model, which has 20B parameters, on an instance with 4 GPUs (`ml.g5.24xlarge`).
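A rough back-of-the-envelope check helps explain why the model is spread over four GPUs. The sketch below assumes 2 bytes per parameter (as with `fp16` or `bf16`) and roughly 24 GiB of memory per GPU on `ml.g5.24xlarge`; it counts weights only and ignores activations and other runtime overhead. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 20e9      # Flan-UL2 parameter count quoted in the text
GPU_MEMORY_GIB = 24    # assumption: approximate per-GPU memory on ml.g5.24xlarge
NUM_GPUS = 4

total_gib = weight_memory_gib(NUM_PARAMS)
per_gpu_gib = total_gib / NUM_GPUS

print(f"weights: {total_gib:.1f} GiB total, {per_gpu_gib:.1f} GiB per GPU")
print("fits on a single GPU:", total_gib <= GPU_MEMORY_GIB)
print("fits when split across 4 GPUs:", per_gpu_gib <= GPU_MEMORY_GIB)
```

Under these assumptions the half-precision weights alone exceed a single GPU's memory, which is why the deployments in this notebook partition the model across all four.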

In this notebook, we will:
- use the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/index.html) to deploy the model using [DeepSpeed](https://www.deepspeed.ai/), [HuggingFace Accelerate](https://huggingface.co/docs/accelerate/index), and [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)
- run a quick benchmark of the latency of invoking the endpoint when it is deployed with each framework
- demonstrate how you can let DJL Serving determine the best backend based on your model architecture and configuration

### Update the sagemaker and boto3 packages

In [None]:
!pip install sagemaker==2.160.0 boto3==1.26.145 botocore==1.29.145 awscli==1.27.145

### Import the required packages

In [None]:
import warnings

warnings.filterwarnings("ignore")

import sagemaker
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.djl_inference import (
    HuggingFaceAccelerateModel,
    DeepSpeedModel,
    FasterTransformerModel,
    DJLModel,
)

### Define variables to store the sagemaker session, region and execution role

In [None]:
sagemaker_session = Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

instance_type = "ml.g5.24xlarge" # define the instance type on which you want to deploy the model

### Define the S3 location of the model weights.
Downloading the model from the [HuggingFace Hub](https://huggingface.co/google/flan-ul2) is time-consuming. Hence, we recommend that you download the model once and upload the uncompressed artifacts to an S3 bucket.

For the purpose of demonstration, we use an S3 location that already contains the model weights and is publicly accessible.

In [None]:
pretrained_model_location = f"s3://sagemaker-example-files-prod-{region}/models/flan-ul2" # replace with the S3 URI that has your model
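When you swap in your own bucket, it can help to build and sanity-check the URI before passing it to a model class. The following is a minimal sketch using only the standard library; the helper names `model_s3_uri` and `split_s3_uri` are hypothetical, and the default prefix simply mirrors the public location used above.

```python
from typing import Tuple
from urllib.parse import urlparse


def model_s3_uri(region: str, prefix: str = "models/flan-ul2") -> str:
    """Build the region-specific S3 URI used for the example weights."""
    bucket = f"sagemaker-example-files-prod-{region}"
    return f"s3://{bucket}/{prefix}"


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an S3 URI into (bucket, key prefix), rejecting non-S3 URIs."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not a valid S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


uri = model_s3_uri("us-west-2")
bucket, key_prefix = split_s3_uri(uri)
print(uri)
print(bucket, key_prefix)
```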

### Define the payload - Article Generation
We now define the payload that is used to invoke the endpoints once they are deployed.

The payload consists of a prompt that provides a title and asks the model to write an article, plus generation parameters.

In [None]:
data = {
    "inputs": [
        "Title: “Utility of Large Language Models”\nGiven the above title, write an article.\n"
    ],
    "parameters": {
        "max_length": 250,
        "temperature": 0.1,
    },
}
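To reuse the same request shape with other titles, the payload can be produced by a small helper. This is a sketch only: `build_payload` is a hypothetical name, and the JSON round trip assumes the predictor sends the payload as JSON.

```python
import json


def build_payload(title: str, max_length: int = 250, temperature: float = 0.1) -> dict:
    """Return a request body with one article-writing prompt and generation parameters."""
    prompt = f'Title: "{title}"\nGiven the above title, write an article.\n'
    return {
        "inputs": [prompt],
        "parameters": {"max_length": max_length, "temperature": temperature},
    }


payload = build_payload("Utility of Large Language Models")

# Round-trip through JSON to confirm the payload is serializable as-is.
body = json.dumps(payload)
restored = json.loads(body)
print(body)
```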

## Deploy using HuggingFace Accelerate

[HuggingFace Accelerate](https://huggingface.co/docs/accelerate/index) is a library that can be used to host large models across multiple GPUs through model partitioning. It uses layer-wise partitioning to load individual layers onto different GPUs.

With the SageMaker Python SDK's [HuggingFaceAccelerateModel](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/sagemaker.djl_inference.html#huggingfaceacceleratemodel), you can host a large model on a SageMaker endpoint with multiple GPUs. HuggingFace Accelerate partitions the model according to the parameters passed to the model object.

Here, we set `dtype` to `fp16` and `number_of_partitions` to `4` to partition the model across 4 GPUs.
See [HuggingFaceAccelerateModel](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/sagemaker.djl_inference.html#huggingfaceacceleratemodel) for other parameters that can be used.

In [None]:
hf_accelerate_model = HuggingFaceAccelerateModel(
    pretrained_model_location,
    role,
    device_map="auto",
    dtype="fp16",
    number_of_partitions=4,
)

hf_accelerate_predictor = hf_accelerate_model.deploy(
    instance_type=instance_type, initial_instance_count=1
)
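The idea behind layer-wise partitioning can be illustrated with a toy placement function that assigns contiguous blocks of layers to devices. This is a simplified sketch, not how HuggingFace Accelerate computes its device map (which also takes memory into account); `assign_layers` and the layer count of 32 are illustrative assumptions.

```python
from typing import Dict, List


def assign_layers(num_layers: int, num_devices: int) -> Dict[int, List[int]]:
    """Split layer indices into contiguous, near-equal blocks, one block per device."""
    base, extra = divmod(num_layers, num_devices)
    placement: Dict[int, List[int]] = {}
    start = 0
    for device in range(num_devices):
        # The first `extra` devices take one additional layer each.
        size = base + (1 if device < extra else 0)
        placement[device] = list(range(start, start + size))
        start += size
    return placement


placement = assign_layers(num_layers=32, num_devices=4)
for device, layers in placement.items():
    print(f"GPU {device}: layers {layers[0]}..{layers[-1]}")
```

With this kind of placement each GPU holds whole layers, so a request passes through the GPUs one after another rather than using all of them at once.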

#### Invoke the endpoint and perform a quick benchmark.

In [None]:
%%timeit -n3 -r1
hf_accelerate_predictor.predict(data)
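`%%timeit -n3 -r1` runs the cell three times in a single round and reports the time per loop. If you want the individual latencies instead, a plain timing loop works as well. The sketch below uses a stand-in function in place of a live endpoint so that it runs anywhere; `measure_latency` and `fake_predict` are hypothetical names.

```python
import statistics
import time
from typing import Callable, List


def measure_latency(call: Callable[[], object], repeats: int = 3) -> List[float]:
    """Invoke `call` repeatedly and return the wall-clock latency of each invocation in seconds."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_predict() -> None:
    """Stand-in for an endpoint invocation; sleeps briefly instead of calling SageMaker."""
    time.sleep(0.01)


latencies = measure_latency(fake_predict, repeats=3)
print(f"mean {statistics.mean(latencies):.3f}s, max {max(latencies):.3f}s")
```

Against a deployed endpoint, the same helper could be called as `measure_latency(lambda: hf_accelerate_predictor.predict(data))`.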

#### Delete the model and the endpoint

In [None]:
hf_accelerate_predictor.delete_model()
hf_accelerate_predictor.delete_endpoint()
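If an invocation fails before the cleanup cell runs, the endpoint stays up and continues to incur charges. One way to guard against that is to wrap the invocation in `try`/`finally`. The sketch below uses a stand-in predictor so that it runs without AWS; `FakePredictor` and `run_then_cleanup` are hypothetical names, and only the `predict`, `delete_model`, and `delete_endpoint` method names mirror the predictor used above.

```python
class FakePredictor:
    """Stand-in predictor that records which methods were called."""

    def __init__(self, fail: bool = False):
        self.fail = fail
        self.calls = []

    def predict(self, payload):
        self.calls.append("predict")
        if self.fail:
            raise RuntimeError("simulated invocation failure")
        return {"generated_text": "..."}

    def delete_model(self):
        self.calls.append("delete_model")

    def delete_endpoint(self):
        self.calls.append("delete_endpoint")


def run_then_cleanup(predictor, payload):
    """Invoke the endpoint once, then delete the model and endpoint even if the call fails."""
    try:
        return predictor.predict(payload)
    finally:
        predictor.delete_model()
        predictor.delete_endpoint()


failing = FakePredictor(fail=True)
try:
    run_then_cleanup(failing, {"inputs": ["hello"]})
except RuntimeError as error:
    print("invocation failed:", error)
print("calls made:", failing.calls)
```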

## Deploy the model using DeepSpeed

[DeepSpeed](https://www.deepspeed.ai/) provides various [inference optimizations](https://www.deepspeed.ai/tutorials/inference-tutorial/) for compatible transformer-based models, including model sharding, optimized inference kernels, and quantization.

With the SageMaker Python SDK's [DeepSpeedModel](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/sagemaker.djl_inference.html#deepspeedmodel), you can host a large model on a SageMaker endpoint with multiple GPUs. DeepSpeed partitions the model according to the parameters passed to the model object.

Here, we set `dtype` to `bf16` and `tensor_parallel_degree` to `4` to shard the model across 4 GPUs.
See [DeepSpeedModel](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/sagemaker.djl_inference.html#deepspeedmodel) for other parameters that can be used.

In [None]:
deepspeed_model = DeepSpeedModel(
    pretrained_model_location, role, dtype="bf16", tensor_parallel_degree=4
)
deepspeed_predictor = deepspeed_model.deploy(instance_type=instance_type, initial_instance_count=1)
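Tensor parallelism differs from layer-wise partitioning in that each layer's weights are themselves split across devices. The toy example below shards a weight matrix by columns, multiplies each shard separately, and concatenates the partial outputs. It is a plain-Python illustration of the idea under simplified assumptions, not DeepSpeed's implementation; all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix product x @ w."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split w into `parts` equal-width column shards, one per device."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    width = cols // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


def concat_columns(blocks: List[Matrix]) -> Matrix:
    """Concatenate per-shard outputs back into one matrix, row by row."""
    return [sum((block[r] for block in blocks), []) for r in range(len(blocks[0]))]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]

shards = split_columns(w, parts=4)                   # one shard per (simulated) GPU
partials = [matmul(x, shard) for shard in shards]    # each device computes its own slice
combined = concat_columns(partials)

print(combined == matmul(x, w))
```

Because every shard works on the same input at the same time, all devices contribute to each layer's computation.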

#### Invoke the endpoint and perform a quick benchmark.

In [None]:
%%timeit -n3 -r1
deepspeed_predictor.predict(data)

#### Delete the model and the endpoint.

In [None]:
deepspeed_predictor.delete_model()
deepspeed_predictor.delete_endpoint()

## Deploy the Model using FasterTransformer

[FasterTransformer](https://github.com/NVIDIA/FasterTransformer) is a library that implements an accelerated engine for inference of transformer-based neural networks, with a special emphasis on large models that span many GPUs and nodes in a distributed manner.
FasterTransformer contains a highly optimized implementation of the transformer block, covering both the encoder and decoder parts.

With the SageMaker Python SDK's [FasterTransformerModel](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/sagemaker.djl_inference.html#fastertransformermodel), you can host a large model on a SageMaker endpoint with multiple GPUs. FasterTransformer partitions the model according to the parameters passed to the model object.

Here, we set `dtype` to `fp16` and `tensor_parallel_degree` to `4` to shard the model across 4 GPUs.
See [FasterTransformerModel](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/sagemaker.djl_inference.html#fastertransformermodel) for other parameters that can be used.

In [None]:
fastertransformer_model = FasterTransformerModel(
    pretrained_model_location, role, dtype="fp16", tensor_parallel_degree=4
)
fastertransformer_predictor = fastertransformer_model.deploy(
    instance_type=instance_type, initial_instance_count=1
)

#### Invoke the endpoint and perform a quick benchmark.

In [None]:
%%timeit -n3 -r1
fastertransformer_predictor.predict(data)

#### Delete the model and the endpoint.

In [None]:
fastertransformer_predictor.delete_model()
fastertransformer_predictor.delete_endpoint()

### Performance Comparison
The benchmark results above show that `FasterTransformer` and `DeepSpeed` have lower latency than `HuggingFace Accelerate`, with `FasterTransformer` performing best.

We'll now use [`DJLModel`](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/sagemaker.djl_inference.html#djlmodel) to deploy the model and let it pick the best backend based on the model architecture.

## Deploy the Model using DJLModel
Instantiating a [DJLModel](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/sagemaker.djl_inference.html#sagemaker.djl_inference.model.DJLModel) lets you use the recommended framework for the model type without explicitly specifying one.

For additional parameters that can be set when creating the object, refer to [DJLModel](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/sagemaker.djl_inference.html#sagemaker.djl_inference.model.DJLModel).

In [None]:
djl_model = DJLModel(pretrained_model_location, role, dtype="fp16", tensor_parallel_degree=4)
djl_predictor = djl_model.deploy(instance_type=instance_type, initial_instance_count=1)

Note that `DJLModel` returns an instance of a framework-specific model class, that is, the class for the framework chosen as the backend to deploy the model.

In this case, `FasterTransformer` is selected as the optimal backend. You can confirm this by inspecting the object.

In [None]:
djl_model
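Choosing a backend from the model architecture can be pictured as a lookup with a fallback. The sketch below illustrates that dispatch pattern only: the preference table, the model type strings, and `choose_backend` are hypothetical and do not reproduce the rules used by the SageMaker Python SDK.

```python
from typing import Optional

# Hypothetical preference table for illustration; NOT the SDK's actual rule set.
PREFERRED_BACKEND = {
    "t5": "FasterTransformer",
    "gpt_neox": "DeepSpeed",
}
DEFAULT_BACKEND = "HuggingFaceAccelerate"


def choose_backend(model_type: str, explicit: Optional[str] = None) -> str:
    """Return the explicitly requested backend if given, else a preference for the model type."""
    if explicit is not None:
        return explicit
    return PREFERRED_BACKEND.get(model_type, DEFAULT_BACKEND)


print(choose_backend("t5"))
print(choose_backend("some_other_architecture"))
print(choose_backend("t5", explicit="DeepSpeed"))
```

The fallback entry captures the design choice that a general-purpose backend is used whenever no specialized recommendation exists for an architecture.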

#### Invoke the endpoint and perform a quick benchmark.

In [None]:
%%timeit -n3 -r1
djl_predictor.predict(data)

#### Delete the model and the endpoint.

In [None]:
djl_predictor.delete_model()
djl_predictor.delete_endpoint()

### Conclusion

We have demonstrated how to use the SageMaker SDK to deploy large language models like Flan-UL2 using DJL Serving and benchmarked the performance of different frameworks.

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/inference|generativeai|llm-workshop|flan-ul2-pySDK|flan-ul2-pySDK.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/inference|generativeai|llm-workshop|flan-ul2-pySDK|flan-ul2-pySDK.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/inference|generativeai|llm-workshop|flan-ul2-pySDK|flan-ul2-pySDK.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/inference|generativeai|llm-workshop|flan-ul2-pySDK|flan-ul2-pySDK.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/inference|generativeai|llm-workshop|flan-ul2-pySDK|flan-ul2-pySDK.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/inference|generativeai|llm-workshop|flan-ul2-pySDK|flan-ul2-pySDK.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/inference|generativeai|llm-workshop|flan-ul2-pySDK|flan-ul2-pySDK.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/inference|generativeai|llm-workshop|flan-ul2-pySDK|flan-ul2-pySDK.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/inference|generativeai|llm-workshop|flan-ul2-pySDK|flan-ul2-pySDK.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/inference|generativeai|llm-workshop|flan-ul2-pySDK|flan-ul2-pySDK.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/inference|generativeai|llm-workshop|flan-ul2-pySDK|flan-ul2-pySDK.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/inference|generativeai|llm-workshop|flan-ul2-pySDK|flan-ul2-pySDK.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/inference|generativeai|llm-workshop|flan-ul2-pySDK|flan-ul2-pySDK.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/inference|generativeai|llm-workshop|flan-ul2-pySDK|flan-ul2-pySDK.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/inference|generativeai|llm-workshop|flan-ul2-pySDK|flan-ul2-pySDK.ipynb)
