
# Serve GPT4ALL-J-6B on SageMaker with transformers-neuronx using the DJL container

This notebook was run on an ml.m5.xlarge instance.

In this notebook, we explore how to host a large language model on SageMaker using the Large Model Inference (LMI) container, which packages the DJL model server with transformers-neuronx, an open source library for running transformer inference workflows with the AWS Neuron SDK. The Neuron SDK is required to run deep learning workloads on AWS Inferentia and AWS Trainium based instances. We use DJLServing as the model serving solution in this example.


In this notebook, we deploy the open source [GPT4ALL-J-6B](https://huggingface.co/nomic-ai/gpt4all-j) model across NeuronCores on an ml.inf2.8xlarge instance. The model is downloaded, converted to Inf2 format, partitioned, and loaded onto the Inferentia chips.


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/inference|generativeai|llm-workshop|lab8-Inferentia2-gpt4all-j|inferentia2-llm-GPT4allJ.ipynb)

---

## License agreement
 - View the license information at https://huggingface.co/nomic-ai/gpt4all-j before using the model.
 - This notebook is a sample notebook and is not intended for production use. Please refer to the license at https://github.com/aws/mit-0.

 


#### Import the relevant libraries and configure several global variables using boto3

In [None]:
%pip install sagemaker boto3 awscli huggingface_hub --upgrade --quiet

In [None]:
import boto3
import sagemaker
from sagemaker import Model, serializers, deserializers, image_uris
import jinja2
import os
import time
import json
from pathlib import Path

role = sagemaker.get_execution_role() # execution role for the endpoint
sess = sagemaker.session.Session() # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name # region name of the current SageMaker Studio environment
account_id = sess.account_id() # account_id of the current SageMaker Studio environment

## Create a SageMaker-compatible model artifact and upload the model to S3

SageMaker Large Model Inference containers can be used to host models without providing your own inference code. You can also provide a `model.py` file if you would like to customize the model loading and inference process.
SageMaker needs the model artifacts to be in tarball format. In this example, the tarball contains only `serving.properties`.

The tarball has the following layout:

```
mymodel
└── serving.properties
```

- `serving.properties` is the configuration file that can be used to configure the model server.
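
As a rough sketch of that layout, the snippet below builds an equivalent archive with Python's standard `tarfile` module and lists its members. The folder name `mymodel` matches the one created later in this notebook; the helper names and the one-line placeholder configuration are illustrative assumptions. The notebook itself creates the archive with `tar` in a later cell.

```python
import tarfile
import tempfile
from pathlib import Path


def build_model_tarball(work_dir: Path, properties_text: str) -> Path:
    """Write serving.properties into mymodel/ and pack it as mymodel.tar.gz."""
    model_dir = work_dir / "mymodel"
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "serving.properties").write_text(properties_text)

    tarball_path = work_dir / "mymodel.tar.gz"
    with tarfile.open(tarball_path, "w:gz") as tar:
        # arcname keeps the paths inside the archive relative to the model folder
        tar.add(model_dir, arcname="mymodel")
    return tarball_path


def list_tarball_members(tarball_path: Path) -> list:
    """Return the sorted member names stored in the archive."""
    with tarfile.open(tarball_path, "r:gz") as tar:
        return sorted(tar.getnames())


with tempfile.TemporaryDirectory() as tmp:
    tarball = build_model_tarball(Path(tmp), "engine=Python\n")
    print(list_tarball_members(tarball))
```

Listing the members before uploading is a cheap way to confirm that `serving.properties` sits directly inside the top-level folder.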



In [None]:
# Set up the S3 bucket to store model artifacts
bucket = sess.default_bucket() # bucket to house artifacts
model_bucket = sess.default_bucket() # bucket to house artifacts
s3_model_prefix = "inf2-llm-djl/gpt4jall" # folder within bucket where model artifact will go

In [None]:
from huggingface_hub import snapshot_download
from pathlib import Path
import os

# - This will download the model into the current directory, wherever the Jupyter notebook is running
local_model_path = Path(".")
local_model_path.mkdir(exist_ok=True)
model_name = "nomic-ai/gpt4all-j"
# Only download pytorch checkpoint files
allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model"]

# - Use snapshot_download to download the model, since it is stored in the repository using LFS
model_download_path = snapshot_download(
 repo_id=model_name,
 cache_dir=local_model_path,
 allow_patterns=allow_patterns,
)
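
To see what a pattern list like `allow_patterns` would keep, here is a small sketch that filters file names with Python's `fnmatch` module. It assumes glob-style matching against each file name; the file names are made up for illustration and are not a listing of the actual repository.

```python
from fnmatch import fnmatch


def filter_files(filenames, allow_patterns):
    """Keep only names that match at least one glob-style pattern."""
    return [
        name
        for name in filenames
        if any(fnmatch(name, pattern) for pattern in allow_patterns)
    ]


# Hypothetical file names, for illustration only
example_files = [
    "config.json",
    "pytorch_model.bin",
    "model.safetensors",
    "README.md",
    "vocab.txt",
]
patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model"]
print(filter_files(example_files, patterns))
```

Files whose names match none of the patterns are skipped, which keeps the download limited to the configuration, tokenizer, and PyTorch checkpoint files.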

In [None]:
# define a variable to contain the s3url of the location that has the model
pretrained_model_location = f"s3://{model_bucket}/{s3_model_prefix}/"
print(f"Pretrained model will be uploaded to ---- > {pretrained_model_location}")

In [None]:
model_artifact = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)
print(f"Model uploaded to --- > {model_artifact}")
print(f"We will set option.s3url={model_artifact}")

## Start preparing model artifacts
The LMI container expects some artifacts to help set up the model:

- `serving.properties` (required): defines the model server settings
- `model.py` (optional): a Python file that defines the core inference logic
- `requirements.txt` (optional): any additional pip packages to install

#### Create serving.properties 
This is a configuration file that tells DJL Serving which model and configuration you want to use.

Here is a list of settings that we use in this configuration file:
- `engine`: The engine for DJL to use. In this case, we intend to use transformers-neuronx and hence set it to **Python**.
- `option.entryPoint`: The entry point Python file or module. Because we use the default handler here, we configure it to use transformers-neuronx.
- `option.model_id`: The model id of a pretrained model hosted inside a model repository on huggingface.co (https://huggingface.co/models). The container uses this model id to download the corresponding model repository from huggingface.co.
- `option.n_positions`: The number of tokens the model can accept.
- `option.tensor_parallel_degree`: The number of devices over which the tensor parallel modules are distributed. For example, with 12 NeuronCores and a tensor parallel degree of 4, LMI creates 3 copies of the model, each using 4 cores.

If you want to download the model from S3, you can set the parameter `option.s3url` instead of `option.model_id`, pointing it at the Amazon S3 location that contains the model. When this is set, the container uses [s5cmd](https://github.com/peak/s5cmd) to download the model from S3. This is extremely fast and useful when downloading large models.


For more details on the configuration options and an exhaustive list, refer to the documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html.
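
The arithmetic behind the `option.tensor_parallel_degree` example can be sketched as follows. The function name is hypothetical, and the calculation only illustrates how cores divide into groups; it is not how the container itself computes placement.

```python
def model_copies(total_neuron_cores: int, tensor_parallel_degree: int) -> int:
    """Number of model copies that fit when each copy spans `tensor_parallel_degree` cores."""
    if tensor_parallel_degree <= 0:
        raise ValueError("tensor_parallel_degree must be positive")
    if tensor_parallel_degree > total_neuron_cores:
        raise ValueError("tensor_parallel_degree exceeds the available cores")
    return total_neuron_cores // tensor_parallel_degree


print(model_copies(12, 4))  # 3 copies, each using 4 cores
print(model_copies(2, 2))   # 1 copy spanning both cores
```

With `option.tensor_parallel_degree = 2`, as set in this notebook, two available NeuronCores hold a single copy of the model.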


In [None]:
%%writefile serving.properties
# Start writing content here
engine = Python
option.entryPoint = djl_python.transformers-neuronx
# option.model_id=nomic-ai/gpt4all-j
option.s3url = {{s3url}}
option.tensor_parallel_degree = 2
option.model_loading_timeout = 2400
option.n_positions = 512

Create the model tarball and upload it to an S3 bucket.

In [None]:
%%sh
mkdir mymodel
mv serving.properties mymodel/


In [None]:
# plug the S3 model location into the `serving.properties` template
jinja_env = jinja2.Environment()
properties_path = Path("mymodel/serving.properties")
template = jinja_env.from_string(properties_path.read_text())
properties_path.write_text(template.render(s3url=pretrained_model_location))
!pygmentize mymodel/serving.properties | cat -n
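
Before packaging, it can help to confirm that the template placeholder was actually replaced. The sketch below parses `key = value` lines and flags any value that still contains `{{`. The helper names are hypothetical, the bucket name is a placeholder, and the parser handles only the simple syntax used in this file, not the full Java properties format.

```python
def parse_properties(text: str) -> dict:
    """Parse simple `key = value` lines, skipping blanks and `#` comments."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


def unrendered_keys(properties: dict) -> list:
    """Return keys whose values still contain a Jinja-style placeholder."""
    return [key for key, value in properties.items() if "{{" in value]


# Illustrative content; the bucket name is a placeholder
rendered_text = """
engine = Python
option.s3url = s3://example-bucket/inf2-llm-djl/gpt4jall/
option.tensor_parallel_degree = 2
"""
template_text = "option.s3url = {{s3url}}\n"

print(unrendered_keys(parse_properties(rendered_text)))
print(unrendered_keys(parse_properties(template_text)))
```

An empty list for the rendered file means every placeholder was substituted; the raw template reports `option.s3url` as still unrendered.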

In [None]:
%%sh
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

In [None]:
s3_code_prefix = "large-model-lmi/code"
bucket = sess.default_bucket() # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")


env = {"HUGGINGFACE_HUB_CACHE": "/tmp", "TRANSFORMERS_CACHE": "/tmp"}
# Retrieve the DJL container image URI, then swap its tag for the Neuron SDK build
image_uri = image_uris.retrieve(
 framework="djl-deepspeed", region=sess.boto_session.region_name, version="0.21.0"
)
image_uri = image_uri.split(":")[0] + ":" + "0.22.1-neuronx-sdk2.9.0"

model = Model(image_uri=image_uri, model_data=code_artifact, env=env, role=role)
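
The tag replacement above can also be written as a small helper. This sketch splits on the last colon so that only the tag is replaced; the function name is hypothetical, and the account ID and old tag in the sample URI are placeholders, not values returned by `image_uris.retrieve`.

```python
def replace_image_tag(image_uri: str, new_tag: str) -> str:
    """Return `image_uri` with its tag (the part after the last colon) replaced."""
    repository, separator, _old_tag = image_uri.rpartition(":")
    if not separator:
        raise ValueError(f"no tag found in image URI: {image_uri}")
    return f"{repository}:{new_tag}"


# Placeholder URI for illustration; the account ID and old tag are not real values
sample_uri = "123456789012.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.21.0-example"
print(replace_image_tag(sample_uri, "0.22.1-neuronx-sdk2.9.0"))
```

Raising an error when no tag is present avoids silently producing a malformed URI.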

#### Create SageMaker endpoint
1. Create the model using the container image and the model tarball uploaded earlier.
2. Create the endpoint config using the following key parameters:

 a) The instance type is ml.inf2.8xlarge.
 
 b) ContainerStartupHealthCheckTimeoutInSeconds is 3600, so that the container has time to download, compile, and load the model before it must pass the health check.
3. The container downloads the model into the /tmp directory on the container because SageMaker maps /tmp to Amazon Elastic Block Store (Amazon EBS). We add a volume_size parameter to ensure that /tmp has enough space to download and compile the model.
4. Create the endpoint using the endpoint config created.
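
For readers who use the low-level API instead of `model.deploy`, the parameters in the steps above map onto one production variant in a `create_endpoint_config` request. The sketch below only builds that request dictionary and makes no AWS calls; the helper name, the variant name, and the example model and config names are assumptions.

```python
def build_endpoint_config_request(
    endpoint_config_name: str,
    model_name: str,
    instance_type: str = "ml.inf2.8xlarge",
    startup_timeout_seconds: int = 3600,
    volume_size_gb: int = 256,
) -> dict:
    """Build keyword arguments for the SageMaker `create_endpoint_config` call."""
    return {
        "EndpointConfigName": endpoint_config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",  # assumed variant name
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
                "ContainerStartupHealthCheckTimeoutInSeconds": startup_timeout_seconds,
                "VolumeSizeInGB": volume_size_gb,
            }
        ],
    }


# Hypothetical names, for illustration only
request = build_endpoint_config_request("lmi-model-config", "lmi-model")
print(request["ProductionVariants"][0]["InstanceType"])
```

The dictionary could then be passed as `boto3.client("sagemaker").create_endpoint_config(**request)`.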



In [None]:
instance_type = "ml.inf2.8xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model")


model.deploy(
 initial_instance_count=1,
 instance_type=instance_type,
 endpoint_name=endpoint_name,
 container_startup_health_check_timeout=3600,
 volume_size=256,
)

# our requests and responses will be in json format so we specify the serializer and the deserializer
predictor = sagemaker.Predictor(
 endpoint_name=endpoint_name,
 sagemaker_session=sess,
 serializer=serializers.JSONSerializer(),
 deserializer=deserializers.JSONDeserializer(),
)

## Test the endpoint
This is a generative model, so we pass in text as a prompt.

In [None]:
predictor.predict({"inputs": "What is deep learning", "parameters": {"max_length": 500}})
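
The request body is plain JSON. A minimal sketch of building it follows, using a hypothetical helper name; `JSONSerializer` performs the equivalent serialization when `predictor.predict` is called. The response format depends on the handler, so it is not shown here.

```python
import json


def build_payload(prompt: str, max_length: int = 500) -> str:
    """Serialize a prompt and generation parameters into the JSON request body."""
    if not prompt:
        raise ValueError("prompt must not be empty")
    body = {"inputs": prompt, "parameters": {"max_length": max_length}}
    return json.dumps(body)


payload = build_payload("What is deep learning")
print(payload)
```

Keeping payload construction in one place makes it easier to vary `max_length` across test prompts.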

## Conclusion
In this notebook, we demonstrated how to use SageMaker Large Model Inference containers to host GPT4ALL-J-6B.



## Clean Up

In [None]:
# - Delete the endpoint, the endpoint config, and the model
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/inference|generativeai|llm-workshop|lab8-Inferentia2-gpt4all-j|inferentia2-llm-GPT4allJ.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/inference|generativeai|llm-workshop|lab8-Inferentia2-gpt4all-j|inferentia2-llm-GPT4allJ.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/inference|generativeai|llm-workshop|lab8-Inferentia2-gpt4all-j|inferentia2-llm-GPT4allJ.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/inference|generativeai|llm-workshop|lab8-Inferentia2-gpt4all-j|inferentia2-llm-GPT4allJ.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/inference|generativeai|llm-workshop|lab8-Inferentia2-gpt4all-j|inferentia2-llm-GPT4allJ.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/inference|generativeai|llm-workshop|lab8-Inferentia2-gpt4all-j|inferentia2-llm-GPT4allJ.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/inference|generativeai|llm-workshop|lab8-Inferentia2-gpt4all-j|inferentia2-llm-GPT4allJ.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/inference|generativeai|llm-workshop|lab8-Inferentia2-gpt4all-j|inferentia2-llm-GPT4allJ.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/inference|generativeai|llm-workshop|lab8-Inferentia2-gpt4all-j|inferentia2-llm-GPT4allJ.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/inference|generativeai|llm-workshop|lab8-Inferentia2-gpt4all-j|inferentia2-llm-GPT4allJ.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/inference|generativeai|llm-workshop|lab8-Inferentia2-gpt4all-j|inferentia2-llm-GPT4allJ.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/inference|generativeai|llm-workshop|lab8-Inferentia2-gpt4all-j|inferentia2-llm-GPT4allJ.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/inference|generativeai|llm-workshop|lab8-Inferentia2-gpt4all-j|inferentia2-llm-GPT4allJ.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/inference|generativeai|llm-workshop|lab8-Inferentia2-gpt4all-j|inferentia2-llm-GPT4allJ.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/inference|generativeai|llm-workshop|lab8-Inferentia2-gpt4all-j|inferentia2-llm-GPT4allJ.ipynb)
