# Serve large models on SageMaker with model parallel inference and DJLServing

In this notebook, we explore how to host a large language model on SageMaker using model parallelism from DeepSpeed and DJLServing. Note that this is a direct copy of the original [SageMaker example notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb).

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI's 175 billion parameter GPT-3 and the similarly sized open source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple, empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.

Model parallelism makes it possible to deploy models that are too large for a single GPU. With model parallelism, we partition a model and distribute it across multiple GPUs. Each GPU holds a different part of the model, which resolves the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism, which lets GPUs work simultaneously on the same layer of a model and achieves lower inference latency than a pipeline-parallel solution.
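To make the idea of several devices working on the same layer concrete, here is a toy sketch in plain Python. It splits the columns of one weight matrix into two shards, lets each "device" compute its slice of the output, and concatenates the slices. The helper names (`matmul`, `split_columns`, `tensor_parallel_matmul`) are hypothetical and purely illustrative; DeepSpeed's actual sharding and communication are far more involved.

```python
# Toy illustration of tensor parallelism with plain Python lists.
# "Devices" here are just list slices; no GPU or DeepSpeed is involved.


def matmul(x, w):
    """Multiply a (rows x inner) matrix by an (inner x cols) matrix."""
    return [
        [sum(x_row[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
        for x_row in x
    ]


def split_columns(w, parts):
    """Split the columns of w into `parts` equal shards, one per device."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("columns must divide evenly across devices")
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    """Each shard computes its slice of the output; slices are then concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]                      # activations: 2 x 2
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]  # layer weights: 2 x 4

full = matmul(x, w)                               # single-device result
sharded = tensor_parallel_matmul(x, w, parts=2)   # two-shard result
print(full == sharded)
```

Each shard holds only half of the layer's weights, which is what relieves per-device memory pressure, while both shards work on the same layer at the same time.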

In this notebook, we deploy a PyTorch GPT-J model from Hugging Face with 6 billion parameters across two GPUs on an Amazon SageMaker ml.g5.48xlarge instance. DeepSpeed is used for tensor parallelism inference while DJLServing handles inference requests and the distributed workers.
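A rough calculation shows why splitting the model helps. The sketch below estimates the memory needed for the weights alone. It assumes 24 GB of memory per GPU (the A10G GPUs in G5 instances) and ignores activations, the CUDA context, and framework overhead, so real headroom is smaller than these numbers suggest. `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


num_params = 6e9           # GPT-J: about 6 billion parameters
gpu_memory_gb = 24         # assumption: memory of one A10G GPU
tensor_parallel_degree = 2

fp32_total_gb = weight_memory_gb(num_params, 4)  # float32 uses 4 bytes per parameter
fp32_per_gpu_gb = fp32_total_gb / tensor_parallel_degree

print(f"float32 weights, total:   {fp32_total_gb:.1f} GB")
print(f"float32 weights, per GPU: {fp32_per_gpu_gb:.1f} GB of {gpu_memory_gb} GB")
```

Under these assumptions the float32 weights by themselves would fill a single GPU, leaving no room for activations, while a tensor parallel degree of 2 leaves roughly half of each GPU free.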

In [None]:
!pip install boto3==1.24.68

## Step 1: Create a `model.py` and `serving.properties`

In [None]:
%%writefile model.py

from djl_python import Input, Output
import os
import deepspeed
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

predictor = None


def get_model():
    model_name = "EleutherAI/gpt-j-6B"
    tensor_parallel = int(os.getenv("TENSOR_PARALLEL_DEGREE", "2"))
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    model = AutoModelForCausalLM.from_pretrained(
        model_name, revision="float32", torch_dtype=torch.float32
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    model = deepspeed.init_inference(
        model,
        mp_size=tensor_parallel,
        dtype=model.dtype,
        replace_method="auto",
        replace_with_kernel_inject=True,
    )
    generator = pipeline(
        task="text-generation", model=model, tokenizer=tokenizer, device=local_rank
    )
    return generator


def handle(inputs: Input) -> Output:
    global predictor
    if not predictor:
        predictor = get_model()

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None

    data = inputs.get_as_string()
    # min_length counts prompt tokens plus generated tokens
    result = predictor(data, do_sample=True, min_length=200, max_new_tokens=256)
    return Output().add(result)

### Setup serving.properties

Set the engine to `Rubikon` as shown below. To control the number of worker groups, add the following lines to the file:

```
gpu.minWorkers=1
gpu.maxWorkers=1
```
By default, as many worker groups as possible are created, based on `gpu_numbers/tensor_parallel_degree`.
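As a quick illustration of that default, the sketch below computes the worker group count. It assumes the formula uses integer division and that the ml.g5.48xlarge instance exposes 8 GPUs; `default_worker_groups` is a hypothetical helper, not a DJLServing function.

```python
def default_worker_groups(gpu_count: int, tensor_parallel_degree: int) -> int:
    """Worker groups that fit on one instance, assuming integer division."""
    if tensor_parallel_degree < 1:
        raise ValueError("tensor_parallel_degree must be at least 1")
    return gpu_count // tensor_parallel_degree


# Assumption: 8 GPUs on ml.g5.48xlarge, each worker group spanning 2 GPUs.
groups = default_worker_groups(gpu_count=8, tensor_parallel_degree=2)
print(groups)
```

Setting `gpu.minWorkers` and `gpu.maxWorkers` to 1, as in the snippet above, would instead pin the server to a single worker group.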

In [None]:
%%writefile serving.properties
engine = Rubikon

The code below creates the SageMaker model archive (`gpt-j.tar.gz`) and uploads it to S3.

In [None]:
import sagemaker, boto3

session = sagemaker.Session()
account = session.account_id()
region = session.boto_region_name
img = "djl_deepspeed"
fullname = account + ".dkr.ecr." + region + ".amazonaws.com/" + img + ":latest"
bucket = session.default_bucket()
path = "s3://" + bucket + "/DEMO-djl-big-model"

In [None]:
%%sh
if [ -d gpt-j ]; then
  rm -d -r gpt-j
fi #always start fresh

mkdir -p gpt-j
mv model.py gpt-j
mv serving.properties gpt-j
tar -czvf gpt-j.tar.gz gpt-j/
#aws s3 cp gpt-j.tar.gz {path}

In [None]:
model_s3_url = sagemaker.s3.S3Uploader.upload(
    "gpt-j.tar.gz", path, kms_key=None, sagemaker_session=session
)

## Step 2: Create SageMaker endpoint

Now we create our [SageMaker model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model). Make sure your execution role has access to your model artifacts and ECR image. Please check out our SageMaker Roles [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for more details. 

In [None]:
# Hard-coded URI of a custom image hosted for this workshop
image = '911195073761.dkr.ecr.us-east-1.amazonaws.com/djl_deepspeed:0.19'

In [None]:
from datetime import datetime

sm_client = boto3.client("sagemaker")

time_stamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
model_name = "gpt-j-" + time_stamp

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=session.get_caller_identity_arn(),
    PrimaryContainer={
        "Image": image,
        "ModelDataUrl": model_s3_url,
        "Environment": {"TENSOR_PARALLEL_DEGREE": "2"},
    },
)

Now we create an endpoint configuration that SageMaker hosting services use to deploy models. Note that we set `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to accommodate the large size of our model.

In [None]:
initial_instance_count = 1
instance_type = "ml.g5.48xlarge"
variant_name = "AllTraffic"
endpoint_config_name = "gpt-j-config-" + time_stamp

production_variants = [
    {
        "VariantName": variant_name,
        "ModelName": model_name,
        "InitialInstanceCount": initial_instance_count,
        "InstanceType": instance_type,
        "ModelDataDownloadTimeoutInSeconds": 1800,
        "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
    }
]

endpoint_config = {
    "EndpointConfigName": endpoint_config_name,
    "ProductionVariants": production_variants,
}

ep_conf_res = sm_client.create_endpoint_config(**endpoint_config)

We are now ready to create an endpoint from the model and the endpoint configuration created in the steps above.

In [None]:
endpoint_name = "gpt-j" + time_stamp
ep_res = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

Creating the SageMaker endpoint can take a while. Once the endpoint status is `InService`, you can test it with the `invoke_endpoint` call below.
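Because endpoint creation is asynchronous, it can help to block until the endpoint is ready before invoking it. The following is a minimal polling sketch built on the `describe_endpoint` call of the boto3 SageMaker client. The helper name `wait_for_endpoint` and its default poll and timeout values are assumptions chosen for illustration.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=3600):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} at timeout")
        time.sleep(poll_seconds)


# In this notebook, the call would look like:
# wait_for_endpoint(sm_client, endpoint_name)
```

boto3 also ships a built-in waiter for this purpose, `sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)`, which may be preferable when custom failure handling is not needed.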

In [None]:
import json

client = boto3.client("sagemaker-runtime")

content_type = "text/plain"  # The MIME type of the input data in the request body.
payload = "Amazon.com is the best"  # Payload for inference.
response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType=content_type, Body=payload
)
print(response["Body"].read())

## Conclusion

In this notebook, you used tensor parallelism to partition a large language model across multiple GPUs for low latency inference. With tensor parallelism, multiple GPUs work on the same model layer at once, which lowers inference latency at small batch sizes. Here, we used open source DeepSpeed as the model parallel library to partition the model and open source Deep Java Library Serving (DJLServing) as the model serving solution.

As a next step, you can experiment with larger models from Hugging Face such as GPT-NeoX. You can also adjust the tensor parallel degree to see its impact on latency for models of different sizes.