# Serve large models on SageMaker with the DeepSpeed container: hosting Bloom-176B


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/inference|nlp|realtime|llm|bloom_176b|djl_deepspeed_deploy.ipynb)

---


In this notebook, we explore how to host a large language model on SageMaker using the latest container that packages DeepSpeed and DJL. DJL provides the serving framework, while DeepSpeed is the sharding library we use to host large models. We use DJLServing as the model serving solution in this example. DJLServing is a high-performance, programming-language-agnostic universal model serving solution powered by the Deep Java Library (DJL). To learn more about DJL and DJLServing, refer to our blog post (https://aws.amazon.com/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/).

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI’s 175 billion parameter GPT-3 and the similarly sized open source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.

Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.
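To make the idea of tensor parallelism concrete, here is a minimal toy sketch in plain Python. It is not how DeepSpeed implements sharding; it only assumes a single linear layer whose weight columns are split across two pretend "devices" (ordinary lists), and all function names are made up for illustration.

```python
# Toy illustration of tensor (column) parallelism with plain Python lists.
# "Devices" are just list slices here; no GPUs or DeepSpeed are involved.

def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Split the columns of w into `parts` equally sized shards."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must divide evenly"
    step = cols // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    """Each shard computes its slice of the output; slices are concatenated."""
    out = []
    for shard in split_columns(w, parts):  # a real system runs these concurrently
        out.extend(matmul(x, shard))
    return out


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, parts=2)
print(full == sharded)
```

Because every shard works on the same layer at the same time, no device waits for a previous stage to finish, which is the intuition behind the latency advantage over a pipeline-parallel layout.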

SageMaker now offers a DeepSpeed container, which lets users rely on SageMaker's managed serving capabilities and offload the undifferentiated heavy lifting of running the serving stack.

In this notebook, we deploy the open source, quantized Bloom 176B model across the GPUs of an ml.p4d.24xlarge instance. DeepSpeed performs tensor-parallel inference, while DJLServing handles inference requests and the distributed workers. For further reading on DeepSpeed, refer to https://arxiv.org/pdf/2207.00032.pdf
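A rough calculation suggests why the quantized checkpoint is spread over all GPUs of the instance. The sketch below counts weight memory only and assumes that an ml.p4d.24xlarge provides 8 GPUs with 40 GB each; activations, the KV cache, and framework overhead are ignored, so treat the numbers as a lower bound rather than a sizing guide.

```python
# Back-of-the-envelope memory estimate for model weights only.
# Activations, KV cache, and framework overhead are ignored.

PARAMS = 176e9          # Bloom-176B parameter count
GPUS = 8                # assumed number of GPUs on an ml.p4d.24xlarge
GPU_MEMORY_GB = 40      # assumed memory per GPU on that instance


def weights_gb(params, bytes_per_param):
    """Memory needed to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


fp16_gb = weights_gb(PARAMS, 2)   # 16-bit weights
int8_gb = weights_gb(PARAMS, 1)   # 8-bit quantized weights
total_gpu_gb = GPUS * GPU_MEMORY_GB

print(f"fp16 weights: {fp16_gb:.0f} GB, int8 weights: {int8_gb:.0f} GB")
print(f"total GPU memory: {total_gpu_gb} GB, int8 per GPU: {int8_gb / GPUS:.0f} GB")
```

Under these assumptions the 16-bit weights alone exceed the combined GPU memory, while the int8 weights fit when sharded evenly across the GPUs.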


## License agreement
Before using the model, review the license information at https://huggingface.co/spaces/bigscience/license, including the use-based restrictions in Section 5.


In [None]:
# Install the boto3, AWS CLI, and SageMaker libraries to create the model and run inference workloads
%pip install -Uqq boto3 awscli sagemaker

## Optional Section to Download Model from Hugging Face Hub

Use this section if you want to download the model directly from the Hugging Face Hub and store it in your own S3 bucket. In that case, set the variable `install_model_locally` to `True`.

**However, this notebook currently uses the model stored in an AWS public S3 location for ease of use, so you can skip this step.**

Downloading the model and then uploading it to S3 can take several minutes because the model is extremely large.

In [None]:
install_model_locally = False

In [None]:
if install_model_locally:
 %pip install huggingface-hub -Uqq

In [None]:
if install_model_locally:
    import sagemaker
    from huggingface_hub import snapshot_download
    from pathlib import Path

    # `bucket` is otherwise only defined in a later cell, so define it here for the upload below
    bucket = sagemaker.session.Session().default_bucket()

    # - This will download the model into the ./model directory, relative to wherever the notebook is running
    local_model_path = Path("./model")
    local_model_path.mkdir(exist_ok=True)
    model_name = "microsoft/bloom-deepspeed-inference-int8"
    commit_hash = "aa00a6626f6484a2eef68e06d1e089e4e32aa571"

    # - Use snapshot_download to download the model, since it is stored in the repository using LFS
    snapshot_download(repo_id=model_name, revision=commit_hash, cache_dir=local_model_path)

    # - Upload to S3 using AWS CLI
    s3_model_prefix = "hf-large-model-djl-ds/model"  # folder where model checkpoint will go
    model_snapshot_path = list(local_model_path.glob("**/snapshots/*"))[0]

    !aws s3 cp --recursive {model_snapshot_path} s3://{bucket}/{s3_model_prefix}

## Create SageMaker compatible Model artifact and Upload Model to S3

SageMaker expects the model artifact to be a tarball.

The tarball has the following layout:

```
code_bloom176
├── model.py
└── serving.properties
```

- `model.py` is the key file that handles serving requests. It also loads the model from the local path where the container places the checkpoint downloaded from S3.
- `serving.properties` is the configuration file for the model server.
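As an alternative to the shell `tar` command, the same archive can be produced and inspected from Python with the standard `tarfile` module. This is only a sketch: the helper name `make_model_tarball` is made up, and the demonstration uses placeholder files in a temporary directory rather than the real handler and configuration.

```python
import tarfile
import tempfile
from pathlib import Path


def make_model_tarball(code_dir, tar_path):
    """Pack code_dir (with model.py and serving.properties) into a .tar.gz."""
    code_dir = Path(code_dir)
    with tarfile.open(str(tar_path), "w:gz") as tar:
        # arcname keeps the directory name as the top-level folder in the archive
        tar.add(str(code_dir), arcname=code_dir.name)
    with tarfile.open(str(tar_path), "r:gz") as tar:
        return sorted(tar.getnames())


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    code_dir = Path(tmp) / "code_bloom176"
    code_dir.mkdir()
    (code_dir / "model.py").write_text("# handler goes here\n")
    (code_dir / "serving.properties").write_text("engine = DeepSpeed\n")
    members = make_model_tarball(code_dir, Path(tmp) / "model.tar.gz")

print(members)
```

Listing the member names after writing is a cheap way to confirm that both files sit under a single top-level directory before uploading the archive.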


In [None]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path

#### Create and initialize the variables needed to create the endpoint (we use boto3 for this)

In [None]:
role = sagemaker.get_execution_role() # execution role for the endpoint
sess = sagemaker.session.Session() # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket() # bucket to house artifacts
model_bucket = f"sagemaker-example-files-prod-{sess.boto_region_name}"
s3_code_prefix = "hf-large-model-djl-ds/code" # folder within bucket where code artifact will go
s3_model_prefix = "models/bloom-176B/raw_model_microsoft/"  # folder where model checkpoint will go
# S3 URI: s3://sagemaker-example-files-prod-{region}/models/bloom-176B/raw_model_microsoft/

region = sess.boto_region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

**Image URI of the DJL Container to be used**

In [None]:
inference_image_uri = image_uris.retrieve(
 framework="djl-deepspeed", region=sess.boto_session.region_name, version="0.21.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

**Create the Tarball and then upload to S3 location**

In [None]:
!mkdir -p code_bloom176

In [None]:
%%writefile code_bloom176/model.py
from djl_python import Input, Output
import deepspeed
import torch
import logging
import math
import os
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model = None
tokenizer = None
generator = None


def load_model(properties):
    # number of partitions
    tensor_parallel = properties["tensor_parallel_degree"]

    # location on the hosting instance where the model checkpoints are downloaded (from the s3url)
    model_location = properties["model_id"]

    logging.info(f"Loading model in {model_location}")

    tokenizer = AutoTokenizer.from_pretrained(model_location)

    # Construct model with fake meta tensors, later will be replaced during ds-inference checkpoint load
    with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
        model = AutoModelForCausalLM.from_config(
            AutoConfig.from_pretrained(model_location), torch_dtype=torch.bfloat16
        )

    ### Deepspeed-Inference Loading
    logging.info(f"Starting DeepSpeed init with TP={tensor_parallel}")

    # tensor parallel presharded repos come with their own checkpoint config file
    model = deepspeed.init_inference(
        model,
        mp_size=tensor_parallel,
        dtype=torch.int8,
        replace_method="auto",
        replace_with_kernel_inject=True,
        base_dir=model_location,
        checkpoint=os.path.join(model_location, "ds_inference_config.json"),
    )
    model = model.module
    return model, tokenizer


def run_inference(model, tokenizer, data, params):
    generate_kwargs = params
    tokenizer.pad_token = tokenizer.eos_token
    input_tokens = tokenizer.batch_encode_plus(data, return_tensors="pt", padding=True)
    for t in input_tokens:
        if torch.is_tensor(input_tokens[t]):
            input_tokens[t] = input_tokens[t].to(torch.cuda.current_device())
    outputs = model.generate(**input_tokens, **generate_kwargs)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


def handle(inputs: Input):
    """
    inputs: Contains the configurations from serving.properties
    """
    global model, tokenizer

    if not model:
        model, tokenizer = load_model(inputs.get_properties())

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None

    data = inputs.get_as_json()

    input_sentences = data["inputs"]
    params = data["parameters"]

    outputs = run_inference(model, tokenizer, input_sentences, params)
    result = {"outputs": outputs}
    return Output().add_as_json(result)

#### `serving.properties` has an `engine` parameter that tells the DJL model server to use the DeepSpeed engine to load the model

Here is a list of settings that can be used in this configuration file:

- `engine`: The engine for DJL to use. In this case, we use DeepSpeed and hence set it to `DeepSpeed`.
- `option.tensor_parallel_degree`: The number of partitions to shard the model across. We set it to 8, one per GPU on the ml.p4d.24xlarge instance.
- `option.entryPoint`: The entrypoint Python file or module. This should align with the engine that is being used.
- `option.s3url`: Set this to the URI of the Amazon S3 location that contains the model. When this is set, the container uses s5cmd to download the model from S3. This is very fast, which is useful when downloading large models like this one.

The container downloads the model into the /tmp space on the container because SageMaker maps /tmp to the Amazon Elastic Block Store (Amazon EBS) volume that is mounted when we specify the endpoint creation parameter VolumeSizeInGB. It uses s5cmd (https://github.com/peak/s5cmd), whose fast download speed is especially useful for large models.

For instances like p4d, which come with pre-built instance storage volumes, we can continue to use /tmp on the container. The size of this mount is large enough to hold the model.

For more details on the configuration options and an exhaustive list, refer to the documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html
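The handler reads `tensor_parallel_degree` from its properties while the configuration file sets `option.tensor_parallel_degree`, which suggests the `option.` prefix is dropped before the values reach `model.py`. The sketch below illustrates that mapping with a tiny hand-written parser. It is not DJLServing's implementation; `parse_properties`, `handler_view`, and the bucket name are hypothetical.

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def handler_view(props):
    """Hypothetical: keep only 'option.*' keys and drop the prefix."""
    prefix = "option."
    return {k[len(prefix):]: v for k, v in props.items() if k.startswith(prefix)}


# Placeholder configuration; the bucket name is made up for illustration.
sample = """
engine = DeepSpeed
option.tensor_parallel_degree = 8
option.s3url = s3://example-bucket/models/bloom-176B/
"""

parsed = parse_properties(sample)
options = handler_view(parsed)
print(parsed["engine"], options["tensor_parallel_degree"])
```

Note that every value in such a file is a string, so a handler that needs a number has to be prepared for `"8"` rather than `8`.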

In [None]:
props = f"""
engine = DeepSpeed
option.tensor_parallel_degree = 8
option.s3url = s3://sagemaker-example-files-prod-{sess.boto_region_name}/models/bloom-176B/raw_model_microsoft/
"""
# Write (rather than append) so that re-running this cell does not duplicate the settings
with open("code_bloom176/serving.properties", "w") as f:
    print(props, file=f)

In [None]:
!rm -f model.tar.gz
!tar czvf model.tar.gz code_bloom176

In [None]:
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

In [None]:
print(f"S3 Model Prefix where the model files are -- > {s3_model_prefix}")
print(f"S3 Model Bucket is -- > {model_bucket}")

### Optional: use VpcConfig when creating the endpoint

For more details, refer to https://docs.aws.amazon.com/sagemaker/latest/dg/host-vpc.html

The cell below is just an example of how to extract the security group and subnet information needed for this configuration.

In [None]:
!aws ec2 describe-security-groups --filter Name=vpc-id,Values= | python3 -c "import sys, json; print(json.load(sys.stdin)['SecurityGroups'])"

In [None]:
# - provide networking configs if needed.
security_group_ids = []  # add the security group IDs
subnets = []  # add the subnet IDs for this VPC
privateVpcConfig = {"SecurityGroupIds": security_group_ids, "Subnets": subnets}
print(privateVpcConfig)

### To create the endpoint, the steps are:

1. Create the model using the container image and the model tarball uploaded earlier.
2. Create the endpoint config using the following key parameters:

 a) InstanceType is ml.p4d.24xlarge.
 
 b) ModelDataDownloadTimeoutInSeconds is 2400, which gives the model enough time to download from S3.
 
 c) ContainerStartupHealthCheckTimeoutInSeconds is 2400, which gives the container enough time to load the model before it has to pass health checks.
 
3. Create the endpoint using the endpoint config created in step 2.
 

In [None]:
from sagemaker.utils import name_from_base

model_name = name_from_base("bloom-djl-ds")
print(model_name)

create_model_response = sm_client.create_model(
 ModelName=model_name,
 ExecutionRoleArn=role,
 PrimaryContainer={
 "Image": inference_image_uri,
 "ModelDataUrl": s3_code_artifact,
 },
 # Uncomment if providing networking configs
 # VpcConfig=privateVpcConfig
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

VolumeSizeInGB has been commented out. Use this parameter for instance types that support EBS volume mounts. The instance type used here comes with pre-configured storage and does not support additional volume mounts.

In [None]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
 EndpointConfigName=endpoint_config_name,
 ProductionVariants=[
 {
 "VariantName": "variant1",
 "ModelName": model_name,
 "InstanceType": "ml.p4d.24xlarge",
 "InitialInstanceCount": 1,
 # "VolumeSizeInGB" : 400,
 "ModelDataDownloadTimeoutInSeconds": 2400,
 "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
 },
 ],
)
endpoint_config_response

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

#### Wait for the endpoint to be created. This can take a few minutes, so please be patient
While that happens, let us look at the critical areas of the helper files we use to load the model:
1. The code snippet from model.py that loads the model
2. serving.properties, to see the environment-related properties

In [None]:
# This is the code snippet that is responsible for loading the model downloaded from S3
! sed -n '40,60p' code_bloom176/model.py

In [None]:
# This is the code snippet which shows the environment variables being used to customize runtime
! sed -n '1,3p' code_bloom176/serving.properties

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
 time.sleep(60)
 resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
 status = resp["EndpointStatus"]
 print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
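The polling loop above runs until the status is no longer `Creating`, but it neither stops after a deadline nor treats a failed deployment as an error. One possible variant is sketched below. The helper name `wait_for_endpoint` and its parameters are made up for illustration, and the demonstration uses a fake describe function so that it runs without any AWS calls.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll until the endpoint is InService; raise if it fails or times out."""
    waited = 0
    while True:
        resp = describe(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status != "Creating":
            # For example "Failed"; include the failure reason when one is present
            reason = resp.get("FailureReason", "no reason given")
            raise RuntimeError(f"Endpoint {endpoint_name} is {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still Creating after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Offline demonstration with a fake describe function (no AWS calls).
_statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointName": EndpointName, "EndpointStatus": next(_statuses)}


result = wait_for_endpoint(fake_describe, "demo-endpoint", sleep=lambda s: None)
print(result["EndpointStatus"])
```

In the notebook, the equivalent call would be `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`. Passing the describe function as an argument keeps the helper testable without a live endpoint.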

#### Use the Boto3 API to invoke the endpoint

This is a generative model, so we pass in text (in the `inputs` field of the JSON payload) as a prompt, and the model completes the sentence and returns the results. More details on these parameters can be found at https://huggingface.co/docs/api-inference/detailed_parameters#text-generation-task. Some quick explanations are below:
1. temperature: The temperature of the sampling operation. 1 means regular sampling, 0 means always take the highest score, and 100 gets close to uniform probability.
2. max_new_tokens: The number of new tokens to generate. More tokens increase the prediction time.
3. num_beams: Beam search keeps track of the `num_beams` most likely sequences at each step.
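To see how temperature reshapes the sampling distribution, the following sketch applies a temperature-scaled softmax to three made-up token scores. It is a generic illustration of the technique, not code taken from the model or the container, and the logits are invented for the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]   # made-up scores for three candidate tokens

for t in (0.1, 1.0, 100.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A low temperature concentrates almost all probability on the highest-scoring token, while a very high temperature flattens the distribution toward uniform. A temperature of exactly 0 cannot be used in this formula, since it would divide by zero, which is why it is usually handled as plain greedy selection.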


In [None]:
%%time
smr_client.invoke_endpoint(
 EndpointName=endpoint_name,
 Body=json.dumps(
 {
 "inputs": ["Amazon.com is the best "],
 "parameters": {
 "min_length": 5,
 "max_new_tokens": 100,
 "temperature": 0.8,
 "num_beams": 5,
 "no_repeat_ngram_size": 2,
 },
 }
 ),
 ContentType="application/json",
)["Body"].read().decode("utf8")

#### With `do_sample` set to `False`, generation is greedy: the most likely token is chosen at each step

In [None]:
%%time
# -- Greedy generation
smr_client.invoke_endpoint(
 EndpointName=endpoint_name,
 Body=json.dumps(
 {
 "inputs": ["Amazon.com is the best ", "Large Models are the way to go"],
 "parameters": {
 "min_length": 5,
 "max_new_tokens": 10,
 "do_sample": False,
 "early_stopping": True,
 },
 "padding": True,
 }
 ),
 ContentType="application/json",
)["Body"].read().decode("utf8")
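The cells above return the raw JSON string. Since the handler in `model.py` wraps its results as `{"outputs": [...]}`, a small helper can turn the response into a list of generated texts, one per input prompt. The helper name is made up, and the demonstration uses an in-memory stand-in for the response so that it runs without calling the endpoint.

```python
import io
import json


def parse_generation_response(response):
    """Read the streaming body and return the list under the "outputs" key."""
    payload = json.loads(response["Body"].read().decode("utf8"))
    return payload["outputs"]


# Offline stand-in for the dict returned by invoke_endpoint (no AWS call made).
fake_response = {
    "Body": io.BytesIO(json.dumps({"outputs": ["first text", "second text"]}).encode("utf8"))
}

texts = parse_generation_response(fake_response)
for i, text in enumerate(texts):
    print(i, text)
```

With a live endpoint, the same helper could be applied directly to the return value of `smr_client.invoke_endpoint(...)`. The body is a stream, so it can only be read once.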

## Conclusion
In this notebook, we demonstrated how to use a SageMaker large model inference container to host a large language model, BLOOM-176B. We used DeepSpeed’s model parallel techniques with multiple GPUs on a single SageMaker machine learning instance. For more details about Amazon SageMaker and its large model inference capabilities, refer to the following:

* Amazon SageMaker now supports deploying large models through configurable volume size and timeout quotas (https://aws.amazon.com/about-aws/whats-new/2022/09/amazon-sagemaker-deploying-large-models-volume-size-timeout-quotas/)
* Real-time inference – Amazon SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html)


## Clean Up

In [None]:
# - Delete the endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

In [None]:
# - Even if endpoint creation failed, we still want to delete the endpoint config and the model
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)

#### Optionally delete the model checkpoint from S3

In [None]:
!aws s3 rm --recursive s3://{bucket}/{s3_model_prefix}

In [None]:
s3_client = boto3.client("s3")

In [None]:
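# "Contents" is absent when no objects match, so fall back to an empty list
resp = s3_client.list_objects(Bucket=bucket, Prefix=s3_model_prefix.rstrip("/") + "/")
len(resp.get("Contents", []))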

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/inference|nlp|realtime|llm|bloom_176b|djl_deepspeed_deploy.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/inference|nlp|realtime|llm|bloom_176b|djl_deepspeed_deploy.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/inference|nlp|realtime|llm|bloom_176b|djl_deepspeed_deploy.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/inference|nlp|realtime|llm|bloom_176b|djl_deepspeed_deploy.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/inference|nlp|realtime|llm|bloom_176b|djl_deepspeed_deploy.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/inference|nlp|realtime|llm|bloom_176b|djl_deepspeed_deploy.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/inference|nlp|realtime|llm|bloom_176b|djl_deepspeed_deploy.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/inference|nlp|realtime|llm|bloom_176b|djl_deepspeed_deploy.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/inference|nlp|realtime|llm|bloom_176b|djl_deepspeed_deploy.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/inference|nlp|realtime|llm|bloom_176b|djl_deepspeed_deploy.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/inference|nlp|realtime|llm|bloom_176b|djl_deepspeed_deploy.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/inference|nlp|realtime|llm|bloom_176b|djl_deepspeed_deploy.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/inference|nlp|realtime|llm|bloom_176b|djl_deepspeed_deploy.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/inference|nlp|realtime|llm|bloom_176b|djl_deepspeed_deploy.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/inference|nlp|realtime|llm|bloom_176b|djl_deepspeed_deploy.ipynb)
