# Serve GPT-J on SageMaker with DJLServing using PySDK

In this notebook, we explore how to host a fine-tuned GPT-J model on SageMaker using [Deep Java Library (DJL) on Amazon SageMaker](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/index.html).

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI’s 175 billion parameter GPT-3 and the similarly sized open source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.

Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.
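
To make the idea of tensor parallelism concrete, here is a toy sketch in pure Python. It is not how DeepSpeed implements it; the helper names (`matmul`, `split_columns`, `tensor_parallel_matmul`) are hypothetical and each "device" is simply a column shard of one weight matrix. The point is only that every shard works on the same layer at the same time, and that concatenating the partial outputs reproduces the unsharded result.

```python
# Toy illustration of tensor (column) parallelism. No GPUs are involved:
# each "device" is just a contiguous slice of the layer's weight matrix.


def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix, both nested lists."""
    return [
        [sum(x_row[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]
        for x_row in x
    ]


def split_columns(w, num_devices):
    """Split the columns of w into num_devices contiguous shards."""
    cols = len(w[0])
    assert cols % num_devices == 0, "columns must divide evenly across devices"
    step = cols // num_devices
    return [[row[d * step:(d + 1) * step] for row in w] for d in range(num_devices)]


def tensor_parallel_matmul(x, w, num_devices):
    """Compute x @ w shard by shard, then concatenate along the column axis."""
    shards = split_columns(w, num_devices)
    # On real hardware these partial products would run concurrently, one per GPU.
    partials = [matmul(x, shard) for shard in shards]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]

full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, num_devices=2)
print(full == sharded)
```

Each shard holds only half of the weights here, which is the memory saving that lets a layer too large for one GPU be spread across several.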

SageMaker now offers a DeepSpeed container for large model inference, which lets you use SageMaker's managed serving capabilities and takes care of the undifferentiated heavy lifting.

In this notebook, we deploy the fine-tuned GPT-J model. DeepSpeed provides tensor-parallel inference, while DJLServing handles inference requests and the distributed workers. For further reading on DeepSpeed, refer to https://arxiv.org/pdf/2207.00032.pdf


In [None]:
# Install boto3 and the AWS CLI to create the model and run inference workloads
%pip install -Uqq boto3 awscli

## Download the model from S3

In this section we download the model archive from S3. We will decompress the file and inspect the artifacts. 

In [None]:
import boto3
import sagemaker

session = sagemaker.session.Session()  # SageMaker session for interacting with AWS APIs
bucket = session.default_bucket()  # default bucket, used later for the model artifacts
print(bucket)

role = sagemaker.get_execution_role()  # execution role for the endpoint
region = session.boto_region_name  # public attribute instead of the private _region_name

sm_client = boto3.client("sagemaker")

The next cell sets the local path where the model is downloaded.

In [None]:
local_model_path = "./model"

<div class="alert alert-block alert-info">
<b>Model Location</b> <br>
Set the model_s3_uri variable to the S3 location of the fine-tuned `model.tar.gz` file. <br>
It should be of the form `s3://{bucket}/smp-tensorparallel-outputdir/{training-job-name}/output/model.tar.gz`
</div>

In [None]:
# Replace the placeholder with the S3 URI of your fine-tuned model.tar.gz
model_s3_uri = "<replace-this-with-the-S3-URI>"
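
Before downloading, it can help to confirm that the URI has the expected shape. The following is a minimal sketch using only the standard library; `split_s3_uri` is a hypothetical helper, and the bucket and training job names in the example are made up for illustration.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Return (bucket, key) for an s3:// URI, raising ValueError if it looks wrong."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    key = parsed.path.lstrip("/")
    if not key.endswith("model.tar.gz"):
        raise ValueError(f"Expected the URI to point at model.tar.gz, got key {key!r}")
    return parsed.netloc, key


# Hypothetical URI that follows the form described above.
example_uri = "s3://my-bucket/smp-tensorparallel-outputdir/my-training-job/output/model.tar.gz"
bucket_name, key = split_s3_uri(example_uri)
print(bucket_name)
print(key)
```

Calling `split_s3_uri(model_s3_uri)` while the placeholder is still in place raises a `ValueError`, which surfaces the mistake before `aws s3 cp` runs.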

The following cell downloads the model from S3 and places it in the local model path we defined above.

In [None]:
! rm -rf $local_model_path
! mkdir -p $local_model_path || echo "Dir exists"
! aws s3 cp $model_s3_uri $local_model_path

The following commands extract the model binary from the archive, clean up the local directory, and rename the model to `gptj.pt`.

In [None]:
! tar -xvf $local_model_path/model.tar.gz fullmodel.pt

In [None]:
! rm $local_model_path/model.tar.gz
! mv fullmodel.pt $local_model_path/gptj.pt
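
The same extract-and-rename step can also be done from Python, which makes a missing archive member fail loudly. This is a sketch using the standard `tarfile` module; `extract_member` is a hypothetical helper, and it assumes the archive stores the file under the exact name `fullmodel.pt` (an archive that stores `./fullmodel.pt` would need that name instead).

```python
import os
import shutil
import tarfile


def extract_member(archive_path, member_name, dest_path):
    """Copy a single regular file out of a tar archive to dest_path."""
    with tarfile.open(archive_path, "r:*") as archive:
        names = archive.getnames()
        if member_name not in names:
            raise FileNotFoundError(f"{member_name!r} not in archive; found {names}")
        source = archive.extractfile(member_name)
        if source is None:
            raise ValueError(f"{member_name!r} is not a regular file")
        # Create the destination directory if it does not exist yet.
        os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
        with source, open(dest_path, "wb") as target:
            shutil.copyfileobj(source, target)


# Usage in this notebook, run instead of the tar and mv commands above
# (those commands delete the archive afterwards):
# extract_member(f"{local_model_path}/model.tar.gz", "fullmodel.pt", f"{local_model_path}/gptj.pt")
```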

## Upload model artifacts to S3
In this step, we download the tokenizer from the Hugging Face hub, build the model, load the model weights we extracted in the previous step, and upload the uncompressed artifacts to an S3 bucket.

In [None]:
%%bash
pip install -U pip --quiet
pip install -U sagemaker --quiet
pip install -U boto3 --quiet

pip install -U transformers --quiet
pip install smart_open

In [None]:
from sagemaker.s3 import S3Uploader
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM
from smart_open import open as smart_open
from transformers import GPTJModel, GPTJConfig
import io
import torch

In [None]:
model_config = "gpt-j-xl"

if model_config == "gpt-j-6B":
    model_id = "EleutherAI/gpt-j-6B"
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        revision="float16",
        torch_dtype=torch.float16,
    )
elif model_config == "gpt-j-xl":
    model_id = "EleutherAI/gpt-j-6B"
    model_config = GPTJConfig(
        vocab_size=50400,
        n_positions=1024,
        n_embd=1600,
        n_layer=48,
        n_head=25,
        n_inner=None,
        activation_function="gelu_new",
        resid_pdrop=0.1,
        embd_pdrop=0.1,
        attn_pdrop=0.1,
        layer_norm_epsilon=1e-05,
        initializer_range=0.02,
        summary_type="cls_index",
        summary_use_proj=True,
        summary_activation=None,
        summary_proj_to_labels=True,
        summary_first_dropout=0.1,
        # gradient_checkpointing=args.gradient_checkpointing > 0,
        use_cache=False,
        bos_token_id=50256,
        eos_token_id=50256,
        return_dict=True,
    )
    model = AutoModelForCausalLM.from_config(model_config, torch_dtype=torch.float16)


load_path = f"{local_model_path}/gptj.pt"
with smart_open(load_path, "rb") as f:
    buffer = io.BytesIO(f.read())
    # map_location="cpu" lets the checkpoint load on a host without a GPU
    model.load_state_dict(torch.load(buffer, map_location="cpu"))

tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained("tmp/gptj-finetuned")

tokenizer.save_pretrained("tmp/gptj-finetuned")

s3_location = f"s3://{bucket}/djl-serving/gptj-finetuned"
S3Uploader.upload("tmp/gptj-finetuned", s3_location)

print(f"Pretrained model will be downloaded from ---- > {s3_location}")
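
If `load_state_dict` in the cell above fails with missing or unexpected keys, comparing the two key sets usually shows why (for example, a wrapper prefix added during training). The sketch below uses a hypothetical helper, `diff_state_dict_keys`, and plain dicts standing in for real state dicts; the `module.` prefix in the toy data is only an illustration, not something known about this checkpoint.

```python
def diff_state_dict_keys(expected_keys, loaded_keys):
    """Return (missing, unexpected): keys the model wants but the checkpoint lacks, and vice versa."""
    expected, loaded = set(expected_keys), set(loaded_keys)
    return sorted(expected - loaded), sorted(loaded - expected)


# Toy stand-ins for model.state_dict() and the loaded checkpoint.
model_state = {"transformer.wte.weight": 0, "lm_head.weight": 0}
checkpoint_state = {"module.transformer.wte.weight": 0, "module.lm_head.weight": 0}

missing, unexpected = diff_state_dict_keys(model_state, checkpoint_state)
print("missing:", missing)
print("unexpected:", unexpected)
```

In the notebook, the inputs would be `model.state_dict().keys()` and the keys of the dictionary returned by `torch.load`.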

## Deploy the model to SageMaker

In [None]:
from sagemaker.djl_inference import DJLModel

model = DJLModel(
    s3_location,
    role,
    task="text-generation",
    number_of_partitions=1,
    data_type="fp16",
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.g4dn.4xlarge")
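
A quick back-of-the-envelope check helps when choosing an instance type. The sketch below estimates the fp16 weight footprint from the configuration values used earlier; `approx_gptj_params` is a hypothetical helper that counts only the attention, MLP, embedding, and output-projection weights (biases and layer norms are ignored, and `n_inner` is assumed to be `4 * n_embd`), so treat the result as approximate.

```python
def approx_gptj_params(n_embd, n_layer, vocab_size):
    """Rough parameter count for a GPT-J style model (weights only, no biases)."""
    attention = 4 * n_embd * n_embd     # q, k, v and output projections
    mlp = 2 * n_embd * (4 * n_embd)     # up and down projections
    embeddings = vocab_size * n_embd    # token embedding table
    lm_head = vocab_size * n_embd       # output projection, assumed untied from the embeddings
    return n_layer * (attention + mlp) + embeddings + lm_head


BYTES_PER_FP16 = 2

# Values from the "gpt-j-xl" configuration used in this notebook.
params = approx_gptj_params(n_embd=1600, n_layer=48, vocab_size=50400)
weight_gib = params * BYTES_PER_FP16 / 2**30
print(f"{params:,} parameters, about {weight_gib:.2f} GiB of fp16 weights")
```

This comes to roughly 3 GiB of weights, which leaves headroom on the single 16 GB GPU of an `ml.g4dn.4xlarge`. Activations and the attention cache add to that figure at inference time, so the estimate is a lower bound rather than a guarantee.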

## Run inference using your endpoint

In [None]:
%%time
data = {
    "inputs": [
        "The ability to spread butter on toast is",
        "Video games are truly the",
    ],
}
outputs = predictor.predict(data)
for output in outputs:
    print(output)
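
The endpoint can also be called without the SageMaker Python SDK, for example from an application that only has boto3. The following is a sketch: `build_payload` and `invoke` are hypothetical helpers, the `parameters` key and `max_new_tokens` are assumptions about what the serving handler accepts, and the response is assumed to be JSON.

```python
import json


def build_payload(prompts, **generation_params):
    """Serialize prompts (and optional generation parameters) as a JSON request body."""
    body = {"inputs": list(prompts)}
    if generation_params:
        # Assumption: the handler reads generation settings from a "parameters" key.
        body["parameters"] = generation_params
    return json.dumps(body)


def invoke(endpoint_name, payload, region_name=None):
    """Send the payload to a SageMaker endpoint and decode the response."""
    import boto3  # imported lazily so build_payload works without AWS access

    runtime = boto3.client("sagemaker-runtime", region_name=region_name)
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=payload,
    )
    # Assumption: the endpoint returns a JSON document.
    return json.loads(response["Body"].read())


payload = build_payload(["The ability to spread butter on toast is"], max_new_tokens=32)
print(payload)
# With a live endpoint: invoke(predictor.endpoint_name, payload, region_name=region)
```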

## Conclusion
In this notebook, we demonstrated how to use SageMaker large model inference containers to host a fine-tuned GPT-J model. We used Deep Java Library (DJL) on Amazon SageMaker. For more details about Amazon SageMaker and its large model inference capabilities, refer to the following:

* Amazon SageMaker now supports deploying large models through configurable volume size and timeout quotas (https://aws.amazon.com/about-aws/whats-new/2022/09/amazon-sagemaker-deploying-large-models-volume-size-timeout-quotas/)
* Real-time inference – Amazon SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html)



## Clean Up

In [None]:
# Delete the endpoint
sm_client.delete_endpoint(EndpointName=predictor.endpoint_name)

In [None]:
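# Even if the endpoint failed to deploy, we still want to delete the endpoint config and the model.
# The SageMaker Python SDK names the endpoint config after the endpoint.
sm_client.delete_endpoint_config(EndpointConfigName=predictor.endpoint_name)
# The model name generally differs from the endpoint name, so read it from the DJLModel object.
sm_client.delete_model(ModelName=model.name)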
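
When a deployment fails partway, some of these resources may not exist, and the first failed delete call would stop the cell. A best-effort helper such as the hypothetical `best_effort_cleanup` below attempts every deletion and reports what failed. It is a sketch that catches all exceptions on purpose; in production code you would likely catch the specific client error instead.

```python
def best_effort_cleanup(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Try to delete the endpoint, its config, and the model; return failures by resource."""
    steps = [
        ("endpoint", lambda: sm_client.delete_endpoint(EndpointName=endpoint_name)),
        (
            "endpoint config",
            lambda: sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name),
        ),
        ("model", lambda: sm_client.delete_model(ModelName=model_name)),
    ]
    failures = {}
    for label, call in steps:
        try:
            call()
        except Exception as err:  # broad on purpose: keep going after any failure
            failures[label] = str(err)
    return failures


# Usage with the boto3 client created earlier (names are whatever your deployment used):
# failures = best_effort_cleanup(sm_client, endpoint_name, endpoint_config_name, model_name)
# print(failures or "all resources deleted")
```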