# Serve OpenAssistant Open-Assistant SFT-1 12B Model on Amazon SageMaker using LMI (Large Model Inference) DJL-based container
**Recommended kernel(s):** This notebook can be run with any Amazon SageMaker Studio kernel.

This notebook focuses on deploying the [`OpenAssistant/oasst-sft-1-pythia-12b`](https://huggingface.co/OpenAssistant/oasst-sft-1-pythia-12b) HuggingFace model to a SageMaker Endpoint for a text generation task. In this example, you will use the SageMaker-managed [LMI (Large Model Inference)](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-dlc.html) Docker image as the inference image. LMI images feature a [DJL serving](https://github.com/deepjavalibrary/djl-serving) stack powered by the [Deep Java Library](https://djl.ai/).

You will deploy the `OpenAssistant/oasst-sft-1-pythia-12b` model twice in succession on an `ml.g5.12xlarge` GPU instance (4 devices): once using the DeepSpeed inference handler and once using the HuggingFace Accelerate inference handler. This will allow you to compare the latency and the quality of the text generated by these two solutions.

**Notices:**
* Make sure that the `ml.g5.12xlarge` instance type is available in your AWS Region.
* Make sure that the value of your "ml.g5.12xlarge for endpoint usage" Amazon SageMaker service quota allows you to deploy one Endpoint using this instance type.

This notebook leverages the [`sagemaker` Python SDK](https://sagemaker.readthedocs.io/en/stable/index.html) to abstract away the management of as many resources and as much configuration as possible, demonstrating that LLMs can be deployed to SageMaker simply and with a minimal amount of code.

### License agreement
* This model and the dataset it has been trained on are both under the [Apache 2.0](https://huggingface.co/models?license=license:apache-2.0) license.
* This notebook is a sample notebook and not intended for production use.

### Execution environment setup
This notebook requires the following third-party Python dependencies:
* AWS [`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html#)
* AWS [`sagemaker`](https://sagemaker.readthedocs.io/en/stable/index.html) (DJL support requires a version greater than 2.136.0)
* HuggingFace [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/index)

Let's install or upgrade these dependencies using the following command:

In [2]:
!pip install sagemaker huggingface_hub --upgrade --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spyder 5.3.3 requires pyqt5<5.16, which is not installed.
spyder 5.3.3 requires pyqtwebengine<5.16, which is not installed.
spyder 5.3.3 requires ipython<8.0.0,>=7.31.1, but you have ipython 8.12.0 which is incompatible.
spyder 5.3.3 requires pylint<3.0,>=2.5.0, but you have pylint 3.0.0a6 which is incompatible.
spyder-kernels 2.3.3 requires ipython<8,>=7.31.1; python_version >= "3", but you have ipython 8.12.0 which is incompatible.
spyder-kernels 2.3.3 requires jupyter-client<8,>=7.3.4; python_version >= "3", but you have jupyter-client 8.1.0 which is incompatible.
docker-compose 1.29.2 requires PyYAML<6,>=3.10, but you have pyyaml 6.0 which is incompatible.
distributed 2022.7.0 requires tornado<6.2,>=6.0.3, but you have tornado 6.2 which is incompatible.
awscli 1.27.111 requires botocore==1.29.111, but you

### Imports & global variables assignment

In [3]:
import os
from pathlib import Path
import shutil
from typing import Any, Dict, List

import boto3
import huggingface_hub
import sagemaker

In [4]:
SM_DEFAULT_EXECUTION_ROLE_ARN = sagemaker.get_execution_role()
SM_SESSION = sagemaker.session.Session()
SM_ARTIFACT_BUCKET_NAME = SM_SESSION.default_bucket()

REGION_NAME = SM_SESSION.boto_region_name  # public attribute, preferred over the private `_region_name`
S3_CLIENT = boto3.client("s3", region_name=REGION_NAME)

In [5]:
HOME_DIR = os.environ["HOME"]

# HuggingFace local model storage
HF_LOCAL_CACHE_DIR = Path(HOME_DIR) / ".cache" / "huggingface" / "hub"
HF_LOCAL_DOWNLOAD_DIR = Path.cwd() / "model_repo"
HF_LOCAL_DOWNLOAD_DIR.mkdir(exist_ok=True)

# Inference code local storage
SOURCE_DIR = Path.cwd() / "code"
SOURCE_DIR.mkdir(exist_ok=True)

# Selected HuggingFace model
HF_HUB_MODEL_NAME = "OpenAssistant/oasst-sft-1-pythia-12b"

# HuggingFace remote model storage (Amazon S3)
HF_MODEL_KEY_PREFIX = f"hf-large-model-djl/{HF_HUB_MODEL_NAME}"

### Storage utility functions

In [6]:
def list_s3_objects(bucket: str, key_prefix: str) -> List[Dict[str, Any]]:
    """Return the metadata of every object stored under `key_prefix` in `bucket`."""
    paginator = S3_CLIENT.get_paginator("list_objects_v2")
    page_iterator = paginator.paginate(Bucket=bucket, Prefix=key_prefix)
    # A page has no "Contents" key when the prefix matches no object
    return [obj for page in page_iterator for obj in page.get("Contents", [])]


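def delete_s3_objects(bucket: str, keys: List[str]) -> None:
    # One DeleteObjects request accepts at most 1000 keys; skip the call when there is nothing to delete
    if keys:
        S3_CLIENT.delete_objects(Bucket=bucket, Delete={"Objects": [{"Key": key} for key in keys]})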


def get_local_model_cache_dir(hf_model_name: str) -> Path:
    for dir_name in os.listdir(HF_LOCAL_CACHE_DIR):
        if dir_name.endswith(hf_model_name.replace("/", "--")):
            break
    else:
        raise ValueError(f"Could not find HF local cache directory for model {hf_model_name}")
    return HF_LOCAL_CACHE_DIR / dir_name

### Inference utility functions
Prompting the model requires marking the beginning and the end of the prompt with [special and model-specific tokens](https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5#prompting). The following inference helper functions are used for all deployments.

In [7]:
# Model-specific tokens
PROMPT_BOS_TOKEN = "<|prompter|>"
PROMPT_EOS_TOKEN = "<|endoftext|><|assistant|>"


def decorate_prompt(prompt: str) -> str:
    return f"{PROMPT_BOS_TOKEN}{prompt}{PROMPT_EOS_TOKEN}"
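
To make the effect of the decoration concrete, here is a minimal, self-contained sketch that repeats the two token constants from the cell above and prints the string that is actually sent to the model. The example question is an arbitrary placeholder.

```python
# Token strings copied from the notebook cell above.
PROMPT_BOS_TOKEN = "<|prompter|>"
PROMPT_EOS_TOKEN = "<|endoftext|><|assistant|>"


def decorate_prompt(prompt: str) -> str:
    """Wrap a user prompt between the prompter token and the assistant token."""
    return f"{PROMPT_BOS_TOKEN}{prompt}{PROMPT_EOS_TOKEN}"


# Arbitrary example question, used only for illustration.
decorated = decorate_prompt("What is Amazon SageMaker?")
print(decorated)
# <|prompter|>What is Amazon SageMaker?<|endoftext|><|assistant|>
```

The trailing `<|assistant|>` token leaves the assistant turn open, which is what invites the model to generate an answer.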

## 1. Model upload to Amazon S3
Models served by an LMI container can be downloaded to the container in different ways:
* Like all SageMaker inference containers, the container can download the model from Amazon S3 as a single `model.tar.gz` file. For LLMs, this approach is discouraged because download and decompression times can become unreasonably long.
* The container can download the model directly from the HuggingFace Hub. This option may also involve long download times.
* The container can download the uncompressed model from Amazon S3 at high throughput using the [`s5cmd`](https://github.com/peak/s5cmd) utility. This option is specific to LMI containers and is the recommended one. It requires, however, that the model has previously been uploaded to an S3 Bucket.

In this section, you will:
1. Download the model from the HuggingFace Hub to your local host,
2. Upload the downloaded model to an S3 Bucket. This notebook uses SageMaker's default regional Bucket. Feel free to upload the model to the Bucket of your choice by modifying the `SM_ARTIFACT_BUCKET_NAME` global variable accordingly.

Each operation takes a few minutes.

In [8]:
huggingface_hub.snapshot_download(
    repo_id=HF_HUB_MODEL_NAME,
    revision="main",
    local_dir=HF_LOCAL_DOWNLOAD_DIR,
    local_dir_use_symlinks="auto",  # Files larger than 5MB are actually symlinked to the local HF cache
    allow_patterns=["*.json", "*.pt", "*.bin", "*.txt", "*.model"],
);

Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

Downloading (…)model.bin.index.json:   0%|          | 0.00/47.3k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/303 [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

Downloading (…)452a1/tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

Downloading (…)7de452a1/config.json:   0%|          | 0.00/689 [00:00<?, ?B/s]

Downloading (…)l-00002-of-00003.bin:   0%|          | 0.00/9.93G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/521 [00:00<?, ?B/s]

Downloading (…)l-00001-of-00003.bin:   0%|          | 0.00/9.81G [00:00<?, ?B/s]

Downloading (…)l-00003-of-00003.bin:   0%|          | 0.00/4.10G [00:00<?, ?B/s]

In [9]:
MODEL_ID = SM_SESSION.upload_data(
    path=HF_LOCAL_DOWNLOAD_DIR.as_posix(),
    bucket=SM_ARTIFACT_BUCKET_NAME,
    key_prefix=HF_MODEL_KEY_PREFIX,
)
print(f"Model artifacts have been successfully uploaded to: {MODEL_ID}")

Model artifacts have been successfully uploaded to: s3://sagemaker-eu-west-1-893516550009/hf-large-model-djl/OpenAssistant/oasst-sft-1-pythia-12b


The `huggingface_hub.snapshot_download` function downloaded the model repository to a cache located in your home directory. Small files were then copied to the target local download directory, whereas large files (larger than 5 MB) were only symlinked to the cache. Either way, uncompressed LLM artifacts consume a lot of disk space. The following two cells remove the downloaded files from your local host.

In [10]:
# Remove HF model artifacts from the local download directory
shutil.rmtree(HF_LOCAL_DOWNLOAD_DIR)

In [11]:
# Remove HF model artifacts from the local HF cache directory
hf_local_cache_dir = get_local_model_cache_dir(hf_model_name=HF_HUB_MODEL_NAME)
shutil.rmtree(hf_local_cache_dir)
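
Because large files are only symlinked into the download directory, most of the bytes sit in the HuggingFace cache, which is why both locations are cleaned. The following sketch illustrates this on throwaway directories; `directory_size_bytes` and the file names are hypothetical helpers introduced for this illustration, not part of the notebook.

```python
import os
import tempfile
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of regular files under `root`, ignoring symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if not path.is_symlink():
                total += path.stat().st_size
    return total


# Self-contained demonstration with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "cache"
    download_dir = Path(tmp) / "download"
    cache_dir.mkdir()
    download_dir.mkdir()

    # The "large" file physically lives in the cache.
    (cache_dir / "weights.bin").write_bytes(b"\0" * 1024)
    # A small file is a real copy in the download directory.
    (download_dir / "config.json").write_bytes(b"{}")
    # The large file only appears as a symlink in the download directory.
    (download_dir / "weights.bin").symlink_to(cache_dir / "weights.bin")

    cache_size = directory_size_bytes(cache_dir)
    download_size = directory_size_bytes(download_dir)
    print(f"cache: {cache_size} bytes, download dir: {download_size} bytes")
```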

## 2. Deployment to a SageMaker Endpoint using a SageMaker LMI Docker image
LLM inference containers can take longer to start than containers serving smaller models, mainly because of longer model download and loading times. Timeout values need to be increased from their defaults accordingly. Each endpoint deployment takes a few minutes.

In [12]:
CONTAINER_STARTUP_CONFIGURATION = {
    "model_data_download_timeout": 5 * 60,  # in seconds
    "container_startup_health_check_timeout": 7 * 60,  # in seconds
}
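
As a rough sanity check on the download timeout, the sketch below adds up the three weight shard sizes shown in the download output earlier in this notebook and derives a padded time estimate. The throughput and the safety factor are illustrative assumptions, not measurements, and `estimate_download_seconds` is a hypothetical helper.

```python
import math
from typing import Sequence

# Shard sizes (in GB) as shown by the download progress bars earlier in this notebook.
SHARD_SIZES_GB = [9.81, 9.93, 4.10]


def estimate_download_seconds(
    sizes_gb: Sequence[float], throughput_gb_per_s: float, safety_factor: float = 2.0
) -> int:
    """Return a padded download time estimate, rounded up to whole seconds."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput_gb_per_s must be positive")
    return math.ceil(sum(sizes_gb) / throughput_gb_per_s * safety_factor)


# ASSUMPTION: 0.25 GB/s is an illustrative throughput, not a measured value.
estimate = estimate_download_seconds(SHARD_SIZES_GB, throughput_gb_per_s=0.25)
print(f"Total size: {sum(SHARD_SIZES_GB):.2f} GB, padded estimate: {estimate} s")
```

Under these assumptions the estimate stays below the 5-minute download timeout configured above; a slower link or a larger model would call for a larger value.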

### 2.1. Inference using the DeepSpeed handler
In this section, you will deploy the `OpenAssistant/oasst-sft-1-pythia-12b` model to a SageMaker Endpoint consisting of a single `ml.g5.12xlarge` instance. The inference engine used by the DJL Serving stack is DeepSpeed. The model will be partitioned over all 4 available GPUs. The chosen precision is FP16. See [this section](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html) of the documentation for more information about LMI-specific configuration options. Each option has a corresponding argument in the constructor of the DJL `Model` classes.

The model server configuration is generated by the `sagemaker.djl_inference.DeepSpeedModel` class from the arguments we pass to its constructor. The DJL classes from the `sagemaker` SDK still allow customization (custom Python handler, Python dependencies and model server configuration); see the [dedicated section](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/using_djl.html#inference-code-and-model-server-properties) of the documentation for more information.
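
To give an idea of what such a generated configuration looks like, the sketch below renders a Python mapping as the `key=value` lines of a `serving.properties` file. The `render_serving_properties` helper is hypothetical, and the key names and the placeholder S3 URI are assumptions chosen for illustration; the file actually written by the SDK may use different keys depending on its version.

```python
from typing import Dict, Union

PropertyValue = Union[str, int, bool]


def render_serving_properties(properties: Dict[str, PropertyValue]) -> str:
    """Render a mapping as `key=value` lines, one property per line."""
    lines = []
    for key, value in properties.items():
        # Booleans are written in lowercase, as is customary in properties files.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        lines.append(f"{key}={text}")
    return "\n".join(lines) + "\n"


# ASSUMPTION: illustrative key names and a placeholder S3 URI, not the SDK's exact output.
example_properties = {
    "engine": "DeepSpeed",
    "option.s3url": "s3://<your-bucket>/hf-large-model-djl/OpenAssistant/oasst-sft-1-pythia-12b",
    "option.tensor_parallel_degree": 4,
    "option.task": "text-generation",
    "option.dtype": "fp16",
    "option.low_cpu_mem_usage": True,
}
print(render_serving_properties(example_properties))
```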

In [13]:
from sagemaker.djl_inference import DeepSpeedModel

In [14]:
deepspeed_model = DeepSpeedModel(
    model_id=MODEL_ID,
    djl_version="0.22.1",
    role=SM_DEFAULT_EXECUTION_ROLE_ARN,
    tensor_parallel_degree=4,
    low_cpu_mem_usage=True,
    task="text-generation",
    dtype="fp16",
)

In [15]:
deepspeed_predictor = deepspeed_model.deploy(
    instance_type="ml.g5.12xlarge", initial_instance_count=1, **CONTAINER_STARTUP_CONFIGURATION
)

--------------!

**Notice:** The structure of the `data` dictionary is set by the handler script. Expected structure is:
```python
{
"inputs": ["prompt1", "prompt2", ...],
"parameters": {"param1": "value1", ...}
}
```
Notice that the handler can handle batches of prompts. In this notebook's examples, you only send one prompt at a time.
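
Since the handler accepts batches, a payload with several prompts can be built in the same way. The sketch below is a minimal illustration; `build_payload` is a hypothetical helper, and the prompts and parameter values are placeholders.

```python
from typing import Any, Dict, List

# Token strings as defined in the inference utility functions of this notebook.
PROMPT_BOS_TOKEN = "<|prompter|>"
PROMPT_EOS_TOKEN = "<|endoftext|><|assistant|>"


def build_payload(prompts: List[str], parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Build a request body containing one decorated entry per prompt."""
    return {
        "inputs": [f"{PROMPT_BOS_TOKEN}{prompt}{PROMPT_EOS_TOKEN}" for prompt in prompts],
        "parameters": dict(parameters),  # copy, so the caller's dict is left untouched
    }


payload = build_payload(
    prompts=["What is the capital of France?", "What is the capital of Germany?"],
    parameters={"max_new_tokens": 64, "do_sample": False},
)
print(payload["inputs"])
```

Such a dictionary could then be passed as the `data` argument of a predictor's `predict` method, as done in the cells of this notebook.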

In [16]:
%%time
prompts = ["I am currently in Paris. What are my travel options to Berlin?"]
inference_parameters = {"max_new_tokens": 128, "do_sample": False}

response = deepspeed_predictor.predict(
    data={
        "inputs": [decorate_prompt(prompt) for prompt in prompts],
        "parameters": inference_parameters,
    }
)

CPU times: user 7.76 ms, sys: 6.72 ms, total: 14.5 ms
Wall time: 2.54 s


In [17]:
for prompt, generated_text in zip(prompts, response):
    print("#" * 100)
    # Drop the echoed prompt from the start of the returned text
    text = generated_text["generated_text"][len(prompt) :]
    print(f"PROMPT:\n{prompt}\nGENERATED TEXT:\n{text}")
print("#" * 100)

####################################################################################################
PROMPT:
I am currently in Paris. What are my travel options to Berlin?
GENERATED TEXT:
If you are in Paris, you can take a train to Berlin. The fastest way is to take the high speed train, the Eurostar. It will take around 4 hours and costs around €100. Or else you can take a bus.
####################################################################################################


The following cell destroys the Endpoint and all associated resources (i.e. the EndpointConfig and the Model objects) created by the call to `Model.deploy`.
You may delay this step until you are done comparing the two inference engines. Given the relatively high cost of `ml.g5.12xlarge` instances, we strongly recommend deleting Endpoints as soon as you no longer need them.

In [18]:
# Clean-up
deepspeed_predictor.delete_endpoint(delete_endpoint_config=True)
deepspeed_model.delete_model()

### 2.2. Inference using the HuggingFace Accelerate handler
As in the preceding section, you deploy the `OpenAssistant/oasst-sft-1-pythia-12b` model to a SageMaker Endpoint consisting of a single `ml.g5.12xlarge` instance. The model is partitioned over all 4 available GPUs and the chosen precision is FP16. The only difference is that you use the HuggingFace Accelerate handler as inference engine (referred to as the `Python` engine in the [DJL Serving general settings](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html)).

Like in the preceding section, we use the dedicated `Model` class from the `sagemaker` Python SDK, namely the `HuggingFaceAccelerateModel` class.

In [19]:
from sagemaker.djl_inference import HuggingFaceAccelerateModel

In [20]:
hf_accelerate_model = HuggingFaceAccelerateModel(
    model_id=MODEL_ID,
    djl_version="0.22.1",
    role=SM_DEFAULT_EXECUTION_ROLE_ARN,
    number_of_partitions=4,
    device_map="balanced_low_0",
    low_cpu_mem_usage=True,
    task="text-generation",
    dtype="fp16",
)

In [21]:
hf_accelerate_predictor = hf_accelerate_model.deploy(
    instance_type="ml.g5.12xlarge", initial_instance_count=1, **CONTAINER_STARTUP_CONFIGURATION
)

-----------!

In [22]:
%%time
prompts = ["I am currently in Paris. What are my travel options to Berlin?"]
inference_parameters = {"max_new_tokens": 128, "do_sample": False}

response = hf_accelerate_predictor.predict(
    data={
        "inputs": [decorate_prompt(prompt) for prompt in prompts],
        "parameters": inference_parameters,
    }
)

CPU times: user 18.2 ms, sys: 0 ns, total: 18.2 ms
Wall time: 16.2 s


In [23]:
for prompt, generated_text in zip(prompts, response):
    print("#" * 100)
    # Drop the echoed prompt from the start of the returned text
    text = generated_text["generated_text"][len(prompt) :]
    print(f"PROMPT:\n{prompt}\nGENERATED TEXT:\n{text}")
print("#" * 100)

####################################################################################################
PROMPT:
I am currently in Paris. What are my travel options to Berlin?
GENERATED TEXT:
If you are in Paris, you have several options to travel to Berlin. The most common way to travel from Paris to Berlin is by plane. You can fly from Paris to Berlin with a number of airlines, such as Air France, Lufthansa, or Germanwings. Alternatively, you can take a train from Paris to Berlin. The train ride can be long, but it is a convenient way to travel. You can also take a bus from Paris to Berlin. This option may be more time-consuming, but it can be a good way to get a feel for the country. Finally, you can drive from Paris
####################################################################################################
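
The two wall times reported in this notebook (2.54 s with DeepSpeed, 16.2 s with HuggingFace Accelerate) each come from a single `%%time` measurement, which can be noisy. A slightly more robust comparison repeats the call several times and looks at summary statistics. The sketch below is one way to do this; `measure_latency` is a hypothetical helper and `fake_predict` is a stand-in so that the example runs without an Endpoint.

```python
import statistics
import time
from typing import Any, Callable, Dict


def measure_latency(
    predict_fn: Callable[[Dict[str, Any]], Any], payload: Dict[str, Any], n_runs: int = 5
) -> Dict[str, float]:
    """Call `predict_fn(payload)` `n_runs` times and summarize wall-clock latency in seconds."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict_fn(payload)
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


def fake_predict(payload: Dict[str, Any]) -> Any:
    """Stand-in for a real predictor call; it only sleeps briefly."""
    time.sleep(0.01)
    return [{"generated_text": "..."}]


stats = measure_latency(fake_predict, {"inputs": ["hello"], "parameters": {}}, n_runs=3)
print(stats)

# Ratio of the two single-shot wall times reported in this notebook.
speedup = 16.2 / 2.54
print(f"DeepSpeed was about {speedup:.1f}x faster on this single request")
```

With live Endpoints, the stand-in could be replaced by something like `lambda p: deepspeed_predictor.predict(data=p)`.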


In [24]:
# Clean-up
hf_accelerate_predictor.delete_endpoint(delete_endpoint_config=True)
hf_accelerate_model.delete_model()

## 3. Clean-up
At this stage:
* All your Endpoints should have been deleted, along with the EndpointConfig and Model resources they were associated with,
* You have freed the disk space of your local host from the large model artifacts downloaded from the HuggingFace Hub.

The only remaining cleanup task consists of removing the model artifacts from Amazon S3, which the next and last cell of this notebook performs.

In [25]:
# Remove HF model artifacts from S3
hf_s3_objects = list_s3_objects(bucket=SM_ARTIFACT_BUCKET_NAME, key_prefix=HF_MODEL_KEY_PREFIX)
hf_s3_objects_keys = [obj["Key"] for obj in hf_s3_objects]
delete_s3_objects(bucket=SM_ARTIFACT_BUCKET_NAME, keys=hf_s3_objects_keys)
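
The model repository used here contains only a handful of files, so a single delete request is enough. For prefixes holding more objects, keep in mind that one S3 `DeleteObjects` request accepts at most 1000 keys. A minimal sketch of splitting a key list into batches follows; `chunk_keys` is a hypothetical helper and the key names are made up.

```python
from typing import Iterator, List, Sequence

S3_DELETE_BATCH_SIZE = 1000  # maximum number of keys per DeleteObjects request


def chunk_keys(
    keys: Sequence[str], batch_size: int = S3_DELETE_BATCH_SIZE
) -> Iterator[List[str]]:
    """Yield successive batches of at most `batch_size` keys."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(keys), batch_size):
        yield list(keys[start : start + batch_size])


# Made-up keys, used only to show the batch sizes.
batches = list(chunk_keys([f"prefix/file-{i}" for i in range(2500)]))
print([len(batch) for batch in batches])
# [1000, 1000, 500]
```

Each batch could then be passed in turn to `delete_s3_objects`.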