# Host StabilityAI's StableLM base alpha 7B on SageMaker with Hugging Face using Large Model Inference container.


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/inference|generativeai|llm-workshop|lab7-stablelm-base-alpha-7b|stablelm-base-alpha-7b-djl-sagemaker.ipynb)

---


In this notebook, we deploy StabilityAI's open-source [stablelm-base-alpha-7b](https://huggingface.co/stabilityai/stablelm-base-alpha-7b) model on an ml.g5.xlarge instance using the [Large Model Inference DLC](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-dlc.html) on SageMaker. The model is loaded using [Hugging Face Accelerate](https://huggingface.co/docs/accelerate/usage_guides/big_modeling#designing-a-device-map).


## Licence agreement
Please refer to license information [here](https://huggingface.co/stabilityai/stablelm-base-alpha-7b#model-details). Base model checkpoints (StableLM-Base-Alpha) are licensed under the Creative Commons license [(CC BY-SA-4.0)](https://creativecommons.org/licenses/by-sa/4.0/). No changes were made in the base model. All credits to [Stability AI](https://stability.ai/) for the model weights.

#### Import the relevant libraries and configure several global variables using boto3

In [None]:
!pip install sagemaker boto3 huggingface_hub --upgrade --quiet

In [None]:
import sagemaker
import jinja2
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house the code artifact
model_bucket = sess.default_bucket()  # bucket to house the model weights
s3_code_prefix = "hf-large-model-djl/code-stablelm-base-alpha-7b"  # folder within bucket where the code artifact will go

s3_model_prefix = "hf-large-model-djl/model-stablelm-base-alpha-7b"  # folder within bucket where the model artifact will go
region = sess.boto_region_name  # public attribute, preferred over the private `_region_name`
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

jinja_env = jinja2.Environment()

## Create SageMaker compatible Model artifact, upload model to S3 and bring your own inference script.

SageMaker Large Model Inference containers can be used to host models without providing your own inference code. This is extremely useful when there is no custom pre-processing of the input data or post-processing of the model's predictions. We used that approach in Lab 1, where we hosted models with the built-in containers.

In this notebook, we demonstrate how to bring your own inference script which leverages Accelerate to shard the model.

SageMaker needs the model artifacts to be in a Tarball format. In this example, we provide the following files - `serving.properties` and `model.py`.

The tarball is in the following format

```
code
├── serving.properties
└── model.py
```

- `serving.properties` is the configuration file that can be used to configure the model server.
- `model.py` is the file that handles any requests for serving.
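
As a rough illustration of this layout, the sketch below packs two placeholder files into a gzip-compressed tarball with Python's standard `tarfile` module and lists the archive members. The helper name `build_code_tarball` and the placeholder file contents are assumptions made for this example; the notebook itself builds the archive with the `tar` command in a later cell.

```python
import tarfile
import tempfile
from pathlib import Path


def build_code_tarball(code_dir: Path, tar_path: Path) -> list[str]:
    """Pack every file in code_dir at the archive root and return the member names."""
    with tarfile.open(tar_path, "w:gz") as tar:
        for path in sorted(code_dir.iterdir()):
            # arcname drops the parent directory so files sit at the archive root
            tar.add(path, arcname=path.name)
    with tarfile.open(tar_path, "r:gz") as tar:
        return tar.getnames()


# Demo with placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    code_dir = Path(tmp) / "code"
    code_dir.mkdir()
    (code_dir / "serving.properties").write_text("engine = Python\n")
    (code_dir / "model.py").write_text("# handler goes here\n")
    members = build_code_tarball(code_dir, Path(tmp) / "model.tar.gz")

print(members)
```

Listing the members before uploading is a cheap way to confirm that both files ended up in the archive.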


In [None]:
!mkdir -p code_stablelm-base-alpha-7b

In [None]:
from huggingface_hub import snapshot_download
from pathlib import Path
import os

# - This will download the model into the current directory, wherever the Jupyter notebook is running
local_model_path = Path(".")
local_model_path.mkdir(exist_ok=True)
model_name = "stabilityai/stablelm-base-alpha-7b"
# Only download pytorch checkpoint files
allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model"]

# - Leverage the snapshot library to download the model since the model is stored in the repository using LFS
model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)
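
To see what the `allow_patterns` list does, the sketch below applies the same patterns to a made-up file listing using glob-style matching from the standard library. The file names are hypothetical, and the assumption that `snapshot_download` filters in this glob-like way should be checked against the `huggingface_hub` documentation.

```python
from fnmatch import fnmatch

allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model"]

# Hypothetical repository listing, for illustration only
repo_files = [
    "config.json",
    "pytorch_model-00001-of-00004.bin",
    "model-00001-of-00004.safetensors",
    "tokenizer.json",
    "README.md",
    ".gitattributes",
]


def select_files(files, patterns):
    """Keep only the files that match at least one glob pattern."""
    return [f for f in files if any(fnmatch(f, p) for p in patterns)]


selected = select_files(repo_files, allow_patterns)
print(selected)
```

With these patterns, alternative weight formats and repository metadata files are skipped, which keeps the download and the subsequent S3 upload smaller.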

In [None]:
model_artifact = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)
print(f"Model uploaded to --- > {model_artifact}")
print(f"We will set option.s3url={model_artifact}")

In [None]:
!rm -rf {model_download_path}

#### Create serving.properties 
This is a configuration file to indicate to DJL Serving which model parallelization and inference optimization libraries you would like to use. Depending on your need, you can set the appropriate configuration.

Here is a list of settings that we use in this configuration file -
- `engine`: The engine for DJL to use. In this case, we intend to use Accelerate and hence set it to **Python**.
- `option.entryPoint`: The entrypoint Python file or module. This should align with the engine that is being used.
- `option.s3url`: Set this to the URI of the Amazon S3 location that contains the model. When this is set, the container leverages [s5cmd](https://github.com/peak/s5cmd) to download the model from S3. This is extremely fast and useful when downloading large models like this one.
- `option.model_id`: Set this instead if you want to download the model from huggingface.co. It is the model id of a pretrained model hosted inside a model repository on huggingface.co (https://huggingface.co/models), and the container uses it to download the corresponding model repository.
- `option.tensor_parallel_degree`: Set to the number of GPU devices over which Accelerate needs to partition the model. This parameter also controls the number of workers per model that are started when DJL Serving runs. As an example, if we have an 8-GPU machine and we are creating 8 partitions, then we will have 1 worker per model to serve the requests.

For more details on the configuration options and an exhaustive list, you can refer to the documentation - https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html.
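
The relationship between GPUs, partitions, and workers in the example above can be sketched as simple integer arithmetic. The function name `workers_per_model` is hypothetical, and the formula (GPUs divided by the tensor parallel degree) is an assumption generalized from the 8-GPU example rather than a statement of how DJL Serving computes it internally.

```python
def workers_per_model(num_gpus: int, tensor_parallel_degree: int) -> int:
    """Estimate how many workers fit when each worker spans tensor_parallel_degree GPUs."""
    if tensor_parallel_degree < 1:
        raise ValueError("tensor_parallel_degree must be at least 1")
    if num_gpus < tensor_parallel_degree:
        raise ValueError("not enough GPUs for the requested partitioning")
    return num_gpus // tensor_parallel_degree


print(workers_per_model(8, 8))  # the 8-GPU, 8-partition case from the text
print(workers_per_model(1, 1))  # a single-GPU instance with tensor_parallel_degree = 1
```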


The approach here is to utilize the built-in functionality within Hugging Face Transformers to enable Large Language Model hosting. 


In the below cell, we leverage [Jinja](https://pypi.org/project/Jinja2/) to create a template for serving.properties. Specifically, we parameterize `option.s3url` so that it can be changed based on the pretrained model location.

In [None]:
%%writefile ./code_stablelm-base-alpha-7b/serving.properties
option.s3url = {{s3url}}
engine = Python
option.tensor_parallel_degree = 1

In [None]:
# Plug the S3 location of the uploaded model into the `serving.properties` template
properties_path = Path("code_stablelm-base-alpha-7b/serving.properties")
template = jinja_env.from_string(properties_path.read_text())
properties_path.write_text(template.render(s3url=model_artifact))
!pygmentize code_stablelm-base-alpha-7b/serving.properties | cat -n
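
A quick sanity check on the rendered file can catch a template placeholder that was never filled in. The sketch below parses `key = value` lines and reports any value that still contains Jinja braces. The helper names and the example bucket URI are hypothetical, and the parser only handles the simple format used in this notebook.

```python
def parse_properties(text: str) -> dict:
    """Parse simple `key = value` lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def unrendered_keys(props: dict) -> list:
    """Return keys whose values still contain Jinja placeholder braces."""
    return [k for k, v in props.items() if "{{" in v or "}}" in v]


template_text = (
    "option.s3url = {{s3url}}\n"
    "engine = Python\n"
    "option.tensor_parallel_degree = 1\n"
)
# Placeholder URI for illustration; the notebook uses the real upload location
rendered_text = template_text.replace("{{s3url}}", "s3://example-bucket/example-prefix/")

print(unrendered_keys(parse_properties(template_text)))
print(unrendered_keys(parse_properties(rendered_text)))
```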

#### Create a model.py with custom inference code

In this script, we load the model and generate predictions using the `transformers` library. Note the following parameters used while loading the model -
- `device_map`: The script passes `{"": 0}`, which places the whole model on GPU 0 and matches `option.tensor_parallel_degree = 1`. On multi-GPU instances you can instead use one of the supported strategies and let Accelerate handle the `device_map` computation. With `balanced_low_0`, for example, the model is split evenly across all GPUs except the first one. For other supported options, you can refer to [designing a device map](https://huggingface.co/docs/accelerate/usage_guides/big_modeling#designing-a-device-map). You can also create one yourself.
- `torch_dtype`: Setting this to `torch.float16` loads the weights in half precision, which halves the memory footprint relative to FP32. Setting `load_in_8bit=True` instead would quantize the weights to int8 and reduce the footprint further; this script does not use it. See this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration) from Hugging Face for additional information.

The container also makes a warmup call with an empty payload to the handler, which is why `handle` returns early when `inputs.is_empty()` is true.
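
To make the memory savings concrete, the sketch below estimates the size of the weights alone at different precisions. It assumes a nominal 7 billion parameters (the exact count for this checkpoint may differ) and ignores activations, the key-value cache, and framework overhead, so real GPU usage will be higher.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

# Nominal parameter count, assumed for illustration
NUM_PARAMS = 7_000_000_000


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Each step down in precision halves the estimate, which is why lower-precision weights make it feasible to fit the model on a single GPU.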

In [None]:
%%writefile ./code_stablelm-base-alpha-7b/model.py
from djl_python import Input, Output
import torch
import logging
import math
import os
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline,
    StoppingCriteria,
    StoppingCriteriaList,
)


class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        stop_ids = [50278, 50279, 50277, 1, 0]
        for stop_id in stop_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False


def load_model(properties):
    tensor_parallel = properties["tensor_parallel_degree"]
    model_location = properties["model_dir"]
    if "model_id" in properties:
        model_location = properties["model_id"]
    logging.info(f"Loading model in {model_location}")

    tokenizer = AutoTokenizer.from_pretrained(model_location)

    # device_map={"": 0} already places the whole model on GPU 0, so no extra .cuda() call is needed
    model = AutoModelForCausalLM.from_pretrained(
        model_location, torch_dtype=torch.float16, device_map={"": 0}
    )
    model.requires_grad_(False)
    model.eval()

    return model, tokenizer


model = None
tokenizer = None
generator = None


def run_inference(model, tokenizer, data, params):
    generate_kwargs = params
    tokenizer.pad_token = tokenizer.eos_token
    input_tokens = tokenizer.batch_encode_plus(data, return_tensors="pt", padding=True)
    for t in input_tokens:
        if torch.is_tensor(input_tokens[t]):
            input_tokens[t] = input_tokens[t].to(torch.cuda.current_device())
    stop = StopOnTokens()
    outputs = model.generate(
        **input_tokens, **generate_kwargs, stopping_criteria=StoppingCriteriaList([stop])
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


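def handle(inputs: Input):
    global model, tokenizer
    # Load the model lazily on the first call (including the warmup call)
    if model is None:
        model, tokenizer = load_model(inputs.get_properties())

    # The warmup call from the container carries no payload
    if inputs.is_empty():
        return None
    data = inputs.get_as_json()

    input_sentences = data["inputs"]
    # Fall back to default generation settings when no parameters are sent
    params = data.get("parameters", {})

    outputs = run_inference(model, tokenizer, input_sentences, params)
    result = {"outputs": outputs}
    return Output().add_as_json(result)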
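
The stopping rule in `StopOnTokens` can be illustrated without a GPU. The sketch below mirrors its logic on plain Python lists: generation ends as soon as the most recent token is one of the stop ids, and that stop token remains in the output. The function names and the toy token stream are hypothetical. Note also that the criterion in `model.py` inspects only `input_ids[0]`, so with a batch of prompts the first sequence appears to decide when generation stops for all of them.

```python
# Same stop ids as the StopOnTokens class in model.py
STOP_IDS = {50278, 50279, 50277, 1, 0}


def should_stop(generated_ids: list[int]) -> bool:
    """Return True when the most recent token is a stop token."""
    return bool(generated_ids) and generated_ids[-1] in STOP_IDS


def generate_until_stop(token_stream, max_new_tokens: int) -> list[int]:
    """Consume tokens until a stop token appears or the token budget is used up."""
    generated = []
    for token_id in token_stream:
        if len(generated) >= max_new_tokens:
            break
        generated.append(token_id)
        if should_stop(generated):
            break
    return generated


# Toy token stream, for illustration only
print(generate_until_stop([10, 11, 0, 12], max_new_tokens=10))
```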

**Image URI for the DJL container is being used here**

In [None]:
# inference_image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/djl-ds:latest"
inference_image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.21.0-fastertransformer5.3.0-cu117"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

**Create the Tarball and then upload to S3 location**

In [None]:
!rm -f model.tar.gz
!tar czvf model.tar.gz -C code_stablelm-base-alpha-7b .

In [None]:
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

### To create the endpoint, the steps are:

1. Create the model using the container image and the model tarball uploaded earlier
2. Create the endpoint config using the following key parameters

    a) Instance type is ml.g5.xlarge

    b) ContainerStartupHealthCheckTimeoutInSeconds is 600, which gives the container time to download and load the model before it must pass health checks
3. Create the endpoint using the endpoint config created

#### Create the Model
Use the image URI for the DJL container and the s3 location to which the tarball was uploaded.

The container downloads the model into the `/tmp` space on the container because SageMaker maps `/tmp` to the Amazon Elastic Block Store (Amazon EBS) volume that is mounted when we specify the endpoint creation parameter `VolumeSizeInGB`. It leverages [s5cmd](https://github.com/peak/s5cmd), which offers very fast download speeds and is therefore extremely useful when downloading large models.

For instances like p4d, which come with local instance storage, we can continue to leverage `/tmp` on the container. The size of this mount is large enough to hold the model.
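
As a sketch of where these settings live, the helper below assembles a production variant dictionary and only adds the optional keys when a value is supplied. The helper name `build_production_variant` and the example values are hypothetical. The key names are the ones used by `create_endpoint_config`, and whether a given instance type accepts `VolumeSizeInGB` should be checked against the SageMaker documentation.

```python
def build_production_variant(
    model_name: str,
    instance_type: str,
    startup_timeout_s: int,
    volume_size_gb=None,
    download_timeout_s=None,
) -> dict:
    """Assemble one ProductionVariants entry, adding optional keys only when set."""
    variant = {
        "VariantName": "variant1",
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": 1,
        "ContainerStartupHealthCheckTimeoutInSeconds": startup_timeout_s,
    }
    if volume_size_gb is not None:
        variant["VolumeSizeInGB"] = volume_size_gb
    if download_timeout_s is not None:
        variant["ModelDataDownloadTimeoutInSeconds"] = download_timeout_s
    return variant


# Example values for illustration only
default_variant = build_production_variant("example-model", "ml.g5.xlarge", 600)
ebs_variant = build_production_variant(
    "example-model", "ml.g5.xlarge", 600, volume_size_gb=256, download_timeout_s=2400
)
print(default_variant)
print(ebs_variant)
```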


In [None]:
from sagemaker.utils import name_from_base

model_name = name_from_base("stablelm-base-alpha-7b")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

In [None]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.xlarge",
            "InitialInstanceCount": 1,
            # "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 600,
        },
    ],
)
endpoint_config_response

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

### This step can take ~ 10 min or longer so please be patient

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
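
The polling loop above waits for as long as the status stays `Creating`. A variant with an upper bound on the wait can be sketched as follows. The helper `wait_for_status` and its parameters are hypothetical, and the demo uses a stubbed status source rather than a real endpoint.

```python
import time


def wait_for_status(
    get_status,
    pending="Creating",
    poll_seconds=60.0,
    timeout_seconds=1800.0,
    sleep=time.sleep,
):
    """Poll get_status() until it leaves the pending state or the timeout elapses."""
    waited = 0.0
    status = get_status()
    while status == pending:
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {pending} after {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status


# Demo with a stubbed status source and no real sleeping
_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_status(lambda: next(_statuses), sleep=lambda seconds: None)
print(final_status)
```

In the notebook, `get_status` could be a lambda that returns `sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]`. It is also worth checking the returned status, since the loop exits on `Failed` as well as on `InService`.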

#### While you wait for the endpoint to be created, you can read more about:
- [Deep Learning containers for large model inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-dlc.html)
- [Quantization in HuggingFace Accelerate](https://huggingface.co/blog/hf-bitsandbytes-integration)
- [Handling big models for inference using Accelerate](https://huggingface.co/docs/accelerate/usage_guides/big_modeling#designing-a-device-map)

#### Leverage Boto3 to invoke the endpoint. 

This is a generative model, so we pass in text as a prompt, and the model completes it and returns the result.

You can pass a batch of prompts as input to the model. This is done by setting `inputs` to the list of prompts. The model then returns a result for each prompt. Text generation can be configured using appropriate parameters, which need to be passed to the endpoint as a dictionary of `kwargs` under `parameters`. Refer to the [GenerationConfig documentation](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig) for more details.

The below code sample illustrates the invocation of the endpoint using a batch of prompts and also sets some parameters.
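
The request and response shapes expected by the `model.py` handler can be sketched in isolation. The helpers `build_request` and `parse_response` are hypothetical, and the response body here is simulated rather than returned by a real endpoint. The `inputs`, `parameters`, and `outputs` keys are the ones used in `model.py`.

```python
import json


def build_request(prompts, **generate_kwargs) -> str:
    """Serialize one or more prompts and generation kwargs into the request body."""
    if isinstance(prompts, str):
        prompts = [prompts]
    return json.dumps({"inputs": list(prompts), "parameters": generate_kwargs})


def parse_response(body: bytes) -> list:
    """Extract the list of completions from the handler's JSON response."""
    return json.loads(body.decode("utf8"))["outputs"]


request_body = build_request(
    ["Hi, How are you?", "What is Amazon SageMaker?"],
    max_new_tokens=50,
    do_sample=True,
    top_p=0.95,
)
print(request_body)

# Simulated response body, for illustration only
fake_body = json.dumps({"outputs": ["completion one", "completion two"]}).encode("utf8")
completions = parse_response(fake_body)
print(completions)
```

In the notebook, `request_body` would be passed as `Body` to `smr_client.invoke_endpoint`, and `parse_response` would be applied to `response_model["Body"].read()`.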


## Generating text using different decoding approaches
We will use 5 different decoding approaches as described [here](https://huggingface.co/blog/how-to-generate) and analyze the model output quality. 
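
Before calling the endpoint, it can help to see what top-k and top-p filtering do to a next-token distribution. The sketch below applies both to a toy distribution in plain Python. The vocabulary and probabilities are made up, and this is a simplified illustration of the idea, not the implementation used by `transformers`.

```python
def top_k_filter(probs: dict, k: int) -> dict:
    """Keep the k most likely tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def top_p_filter(probs: dict, p: float) -> dict:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Toy next-token distribution, for illustration only
toy_probs = {"nice": 0.5, "dog": 0.3, "car": 0.15, "the": 0.05}

top_k_result = top_k_filter(toy_probs, k=2)
top_p_result = top_p_filter(toy_probs, p=0.9)
print(sorted(top_k_result))
print(sorted(top_p_result))
```

Top-k always keeps a fixed number of candidates, while top-p keeps more or fewer depending on how concentrated the distribution is. Sampling then draws from the renormalized candidates.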

### Top-p sampling with low temperature

In [None]:
%%time
prompts = ["Hi, How are you?"]
response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": prompts,
            "parameters": {
                "early_stopping": True,
                "no_repeat_ngram_size": 4,
                "max_new_tokens": 200,
                "do_sample": True,
                "temperature": 0.1,
                "top_p": 0.95,
            },
        }
    ),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

### Beam search

In [None]:
%%time
prompts = ["Hi, How are you?"]
response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": prompts,
            "parameters": {
                "early_stopping": True,
                "no_repeat_ngram_size": 4,
                "max_new_tokens": 1024,
                "num_beams": 2,
            },
        }
    ),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

### Soft-max sampling 

In [None]:
%%time
prompts = ["Hi, How are you?"]
response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": prompts,
            "parameters": {
                "top_k": 0,
                "temperature": 0.6,
                "num_return_sequences": 1,
                "do_sample": True,
            },
        }
    ),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

### Top-k sampling

In [None]:
%%time
prompts = ["Hi, How are you?"]
response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {"inputs": prompts, "parameters": {"max_new_tokens": 200, "top_k": 20, "do_sample": True}}
    ),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

### Top-p sampling combined with top-k

In [None]:
%%time
prompts = ["Hi, How are you?"]
response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": prompts,
            "parameters": {"max_new_tokens": 200, "top_k": 10, "top_p": 0.95, "do_sample": True},
        }
    ),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

## Conclusion
In this notebook, we demonstrated how to use SageMaker large model inference containers to host StabilityAI's stablelm-base-alpha-7b. We used the Hugging Face libraries to host the model on a GPU-based machine learning instance on SageMaker. We then tried different decoding approaches and tuned the inference parameters to compare model output quality. For more details about Amazon SageMaker and its large model inference capabilities, refer to the following:

* Model parallelism and large model inference on SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-inference.html)
* Amazon SageMaker now supports deploying large models through configurable volume size and timeout quotas (https://aws.amazon.com/about-aws/whats-new/2022/09/amazon-sagemaker-deploying-large-models-volume-size-timeout-quotas/)
* Real-time inference – Amazon SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html)



## Clean Up

In [None]:
# - Delete the end point
sm_client.delete_endpoint(EndpointName=endpoint_name)

In [None]:
# - In case the end point failed we still want to delete the model
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/inference|generativeai|llm-workshop|lab7-stablelm-base-alpha-7b|stablelm-base-alpha-7b-djl-sagemaker.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/inference|generativeai|llm-workshop|lab7-stablelm-base-alpha-7b|stablelm-base-alpha-7b-djl-sagemaker.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/inference|generativeai|llm-workshop|lab7-stablelm-base-alpha-7b|stablelm-base-alpha-7b-djl-sagemaker.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/inference|generativeai|llm-workshop|lab7-stablelm-base-alpha-7b|stablelm-base-alpha-7b-djl-sagemaker.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/inference|generativeai|llm-workshop|lab7-stablelm-base-alpha-7b|stablelm-base-alpha-7b-djl-sagemaker.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/inference|generativeai|llm-workshop|lab7-stablelm-base-alpha-7b|stablelm-base-alpha-7b-djl-sagemaker.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/inference|generativeai|llm-workshop|lab7-stablelm-base-alpha-7b|stablelm-base-alpha-7b-djl-sagemaker.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/inference|generativeai|llm-workshop|lab7-stablelm-base-alpha-7b|stablelm-base-alpha-7b-djl-sagemaker.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/inference|generativeai|llm-workshop|lab7-stablelm-base-alpha-7b|stablelm-base-alpha-7b-djl-sagemaker.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/inference|generativeai|llm-workshop|lab7-stablelm-base-alpha-7b|stablelm-base-alpha-7b-djl-sagemaker.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/inference|generativeai|llm-workshop|lab7-stablelm-base-alpha-7b|stablelm-base-alpha-7b-djl-sagemaker.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/inference|generativeai|llm-workshop|lab7-stablelm-base-alpha-7b|stablelm-base-alpha-7b-djl-sagemaker.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/inference|generativeai|llm-workshop|lab7-stablelm-base-alpha-7b|stablelm-base-alpha-7b-djl-sagemaker.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/inference|generativeai|llm-workshop|lab7-stablelm-base-alpha-7b|stablelm-base-alpha-7b-djl-sagemaker.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/inference|generativeai|llm-workshop|lab7-stablelm-base-alpha-7b|stablelm-base-alpha-7b-djl-sagemaker.ipynb)
