# Triton on SageMaker - Deploying on Inferentia instance type



[Amazon SageMaker](https://aws.amazon.com/sagemaker/) is a fully managed service for data science and machine learning workflows. It helps data scientists and developers to prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML.

Now, [NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server/) can be used to serve models for inference in Amazon SageMaker. Thanks to the new NVIDIA Triton container image, you can easily serve ML models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by NVIDIA Triton. Triton helps maximize the utilization of GPU and CPU, further lowering the cost of inference.

This notebook was tested on an Amazon SageMaker notebook instance of type `inf2.24xlarge`.

## Introduction to INF2/Neuronx

AWS Inferentia accelerators are designed by AWS to deliver high performance at the lowest cost for your deep learning (DL) inference applications. AWS Inferentia2 accelerator delivers a major leap in performance and capabilities over first-generation AWS Inferentia. Inferentia2 delivers up to 4x higher throughput and up to 10x lower latency compared to Inferentia.

Inferentia2 accelerators are optimized to deploy increasingly complex models, such as large language models (LLM) and vision transformers, at scale. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between accelerators. You can now efficiently and cost-effectively deploy models with hundreds of billions of parameters across multiple accelerators on Inf2 instances.

[AWS Neuron](https://aws.amazon.com/machine-learning/neuron/) is the SDK that helps developers deploy models on AWS Inferentia accelerators and run inference applications for natural language processing/understanding, language translation, text summarization, video and image generation, speech recognition, personalization, fraud detection, and more. It integrates natively with popular machine learning (ML) frameworks, such as PyTorch and TensorFlow, so that you can continue to use your existing code and workflows and run on Inferentia accelerators.


## Introduction to NVIDIA Triton Server

[NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server/) was developed specifically to enable scalable, cost-effective, and easy deployment of models in production. NVIDIA Triton Inference Server is open-source inference serving software that simplifies the inference serving process and provides high inference performance.

Some key features of Triton are:
* **Support for Multiple frameworks**: Triton can be used to deploy models from all major frameworks. Triton supports TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch TorchScript, TensorRT, RAPIDS FIL for tree based models, and OpenVINO model formats. 
* **Model pipelines**: Triton model ensemble represents a pipeline of one or more models or pre/post processing logic and the connection of input and output tensors between them. A single inference request to an ensemble will trigger the execution of the entire pipeline.
* **Concurrent model execution**: Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU or on multiple GPUs for different model management needs.
* **Dynamic batching**: For models that support batching, Triton has multiple built-in scheduling and batching algorithms that combine individual inference requests together to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference.
* **Diverse CPUs and GPUs**: The models can be executed on CPUs or GPUs for maximum flexibility and to support heterogeneous computing requirements.
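To make the dynamic batching idea from the list above concrete, here is a toy sketch in plain Python. It is not Triton's scheduler; `group_into_batches` and its `max_batch_size` argument are hypothetical names used only to show how individually queued requests could be merged into batches before the model runs.

```python
from typing import List, Sequence


def group_into_batches(pending: Sequence[str], max_batch_size: int) -> List[List[str]]:
    """Group queued request IDs into batches of at most max_batch_size (toy illustration)."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        list(pending[start : start + max_batch_size])
        for start in range(0, len(pending), max_batch_size)
    ]


# Five queued requests with a batch limit of four yield one full batch and one partial batch.
queued = ["req-1", "req-2", "req-3", "req-4", "req-5"]
batches = group_into_batches(queued, max_batch_size=4)
print(batches)
```

A real scheduler also weighs queueing delay against batch size, which this sketch leaves out.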

## Deploying Inf2 models using python backend on SageMaker (SM) Triton Container

In this notebook, we explore how to use the python backend of Triton Inference Server to load a Neuron-traced model on an inf2 accelerator on SageMaker, and run inference requests against it. To make it easier to get started, we use a model.py and a config.pbtxt generator helper as described here: https://github.com/triton-inference-server/python_backend/tree/main/inferentia#using-triton-with-inferentia-2-or-trn1. 

In this example, the model.py and config.pbtxt files for OPT-125M and GPTJ-6B are already provided. They were generated with the above script and then modified to use the available Neuron APIs, as described here: https://github.com/triton-inference-server/python_backend/tree/main/inferentia#transformers-neuronx.


You can find more information on how to use the python backend of Triton in the documentation here: https://github.com/triton-inference-server/python_backend#usage, specifically the use of initialize(), execute(), and finalize() functions.
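As a rough orientation, the three hooks mentioned above can be sketched as follows. This is a minimal skeleton, not the model.py shipped with this example: it avoids importing `triton_python_backend_utils` so that it runs outside Triton, and it uses plain Python lists as stand-ins for Triton's request and response objects. The placeholder "model" is purely illustrative.

```python
import json


class TritonPythonModel:
    """Skeleton of the hooks the Python backend calls (stand-in types, no Triton imports)."""

    def initialize(self, args):
        # Triton passes the model configuration as a JSON string under "model_config".
        self.model_config = json.loads(args["model_config"])
        # A real model.py would load the Neuron-traced model here, once per model instance.
        # Assumption: a trivial placeholder function stands in for the model.
        self.model = lambda token_ids: [token + 1 for token in token_ids]

    def execute(self, requests):
        # Triton expects exactly one response per request, in the same order.
        # In a real model.py these are pb_utils.InferenceRequest / InferenceResponse objects;
        # plain lists stand in for both here.
        return [self.model(request) for request in requests]

    def finalize(self):
        # Release resources when Triton unloads the model.
        self.model = None


model = TritonPythonModel()
model.initialize({"model_config": json.dumps({"name": "demo", "max_batch_size": 4})})
responses = model.execute([[1, 2, 3], [10]])
print(responses)
model.finalize()
```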

There are two ways to load the model using the Python backend of Triton:
1. Download and trace the Neuron model on the SageMaker endpoint in the initialize() function itself. This lets you trace on the same instance that hosts the model, but tracing can take a long time (several hours) before the endpoint is ready to serve the model.
2. Trace the model on an inf2 instance, and upload the Neuron-traced model to an S3 bucket. In the initialize() function, load the model from the current directory (relative to where the model.py file is located). The pre-traced model takes several minutes to load, which makes it easier to experiment with and helps when the endpoint is restarted multiple times.

In this example, we use option 2, i.e. we trace a Neuron model on an inf2 instance and use it on a SageMaker endpoint. This means that this notebook should be executed on an inf2 instance to trace the model and then upload it to S3. This process only needs to happen once for a given version of the model.
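For option 2, the pre-traced artifacts have to be located relative to model.py at load time. The following is a small sketch of one way to do that; `resolve_artifact` is a hypothetical helper, and the example path is made up for illustration and does not describe the actual layout inside the container.

```python
import os


def resolve_artifact(model_py_path: str, artifact_name: str) -> str:
    """Return the absolute path of an artifact stored in the same folder as model.py."""
    model_dir = os.path.dirname(os.path.realpath(model_py_path))
    return os.path.join(model_dir, artifact_name)


# Inside a real model.py you would pass __file__; a made-up path is used here.
artifact_path = resolve_artifact("/tmp/example_model_dir/model.py", "opt-125m-model")
print(artifact_path)
```

Resolving against the file location rather than the process working directory keeps the lookup stable regardless of where the server was started.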


### Set up the environment

Install the dependencies required to package the model and run inference using Triton server.

Also define the IAM role that gives SageMaker access to the model artifacts and the NVIDIA Triton ECR image.

In [None]:
!pip install -qU pip awscli boto3 sagemaker
!pip install nvidia-pyindex
!pip install tritonclient[http]

In [None]:
import boto3, json, sagemaker, time
from sagemaker import get_execution_role

sm_client = boto3.client(service_name="sagemaker", region_name="us-east-2")
runtime_sm_client = boto3.client("sagemaker-runtime", region_name="us-east-2")
sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name="us-east-2"))
role = get_execution_role(sagemaker_session=sagemaker_session)

In [None]:
account_id_map = {
 "us-east-1": "785573368785",
 "us-east-2": "007439368137",
 "us-west-1": "710691900526",
 "us-west-2": "301217895009",
 "eu-west-1": "802834080501",
 "eu-west-2": "205493899709",
 "eu-west-3": "254080097072",
 "eu-north-1": "601324751636",
 "eu-south-1": "966458181534",
 "eu-central-1": "746233611703",
 "ap-east-1": "110948597952",
 "ap-south-1": "763008648453",
 "ap-northeast-1": "941853720454",
 "ap-northeast-2": "151534178276",
 "ap-southeast-1": "324986816169",
 "ap-southeast-2": "355873309152",
 "cn-northwest-1": "474822919863",
 "cn-north-1": "472730292857",
 "sa-east-1": "756306329178",
 "ca-central-1": "464438896020",
 "me-south-1": "836785723513",
 "af-south-1": "774647643957",
}

In [None]:
region = boto3.Session().region_name
if region not in account_id_map.keys():
 raise ValueError("UNSUPPORTED REGION")

In [None]:
base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
triton_image_uri = "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.03-py3".format(
 account_id=account_id_map[region], region=region, base=base
)

In [None]:
triton_image_uri

**Note: update the 'FROM' base image in Dockerfile located in docker/ folder as per above**

### Add utility methods for preparing request payload

The following method transforms a sample image we will be using for inference into the payload that can be sent for inference to the Triton server.

In [None]:
import numpy as np
from PIL import Image

s3_client = boto3.client("s3")
s3_client.download_file(
 "sagemaker-sample-files", "datasets/image/pets/shiba_inu_dog.jpg", "shiba_inu_dog.jpg"
)


def get_sample_image():
 image_path = "./shiba_inu_dog.jpg"
 img = Image.open(image_path).convert("RGB")
 img = img.resize((224, 224))
 img = (np.array(img).astype(np.float32) / 255) - np.array(
 [0.485, 0.456, 0.406], dtype=np.float32
 ).reshape(1, 1, 3)
 img = img / np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 1, 3)
 img = np.transpose(img, (2, 0, 1))
 return img.tolist()

The `tritonclient` package provides utility methods to generate the payload without having to know the details of the specification. We'll use the following methods to convert our inference request into a binary format which provides lower latencies for inference.

In [None]:
import tritonclient.http as httpclient


def _get_sample_image_binary(input_name, output_name):
 inputs = []
 outputs = []
 inputs.append(httpclient.InferInput(input_name, [1, 3, 224, 224], "FP32"))
 input_data = np.array(get_sample_image(), dtype=np.float32)
 input_data = np.expand_dims(input_data, axis=0)
 inputs[0].set_data_from_numpy(input_data, binary_data=True)
 outputs.append(httpclient.InferRequestedOutput(output_name, binary_data=True))
 request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
 inputs, outputs=outputs
 )
 return request_body, header_length


def get_sample_image_binary_pt():
 return _get_sample_image_binary("INPUT__0", "OUTPUT__0")


def get_sample_image_binary_trt():
 return _get_sample_image_binary("input", "output")

### Set up the Tritonserver Container with neuronx libraries

In [None]:
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account_id_map[region]}.dkr.ecr.{region}.amazonaws.com

In [None]:
!mkdir -p docker/mylib

Sockets and rules required by the container to interact with neuron hardware

In [None]:
!cp -R /lib/udev/rules.d/* docker/mylib/

In [None]:
!docker build --no-cache -t tritonserver-neuronx docker/ --build-arg SM_TRITON_IMAGE_URI={triton_image_uri}

In [None]:
# Upload the container to ECR for SageMaker to consume
# inf2 instances are available in us-east-2
curr_account_id = boto3.client("sts").get_caller_identity().get("Account")
inf2_region = "us-east-2"

triton_neuronx_image_uri = (
 f"{curr_account_id}.dkr.ecr.{inf2_region}.amazonaws.com/sagemaker-tritonserver:23.03-py3"
)

In [None]:
!docker tag tritonserver-neuronx:latest {triton_neuronx_image_uri}

In [None]:
!aws ecr get-login-password --region {inf2_region} | docker login --username AWS --password-stdin {curr_account_id}.dkr.ecr.{inf2_region}.amazonaws.com

Push the container image.
Note: This command assumes that the ECR repository `sagemaker-tritonserver` exists in the account. If it doesn't, please create one.
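If the repository might be missing, a check along the following lines can create it first. This is a sketch under assumptions: `ensure_ecr_repository` is a hypothetical helper, and the client passed in is expected to be a boto3 ECR client, whose `describe_repositories` and `create_repository` calls are used here.

```python
def ensure_ecr_repository(ecr_client, repository_name):
    """Create the ECR repository if it does not exist; return True when it was created."""
    try:
        ecr_client.describe_repositories(repositoryNames=[repository_name])
        return False
    except ecr_client.exceptions.RepositoryNotFoundException:
        ecr_client.create_repository(repositoryName=repository_name)
        return True


# Usage in this notebook (requires AWS credentials with ECR permissions):
# ecr_client = boto3.client("ecr", region_name=inf2_region)
# ensure_ecr_repository(ecr_client, "sagemaker-tritonserver")
```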

In [None]:
!docker push {triton_neuronx_image_uri}

In [None]:
triton_neuronx_image_uri

### PyTorch-Neuronx 

In [None]:
!pip install torch-neuronx transformers-neuronx

In [None]:
import os
import urllib.request
from PIL import Image

import torch
import torch_neuronx
from torchvision import models
from torchvision.transforms import functional


def get_image(batch_size=1, image_shape=(224, 224)):
    # Get an example input
    filename = "000000039769.jpg"
    if not os.path.exists(filename):
        # Download the sample image only once
        url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        urllib.request.urlretrieve(url, filename)
    image = Image.open(filename).convert("RGB")
    image = functional.resize(image, image_shape)
    image = functional.to_tensor(image)
    image = torch.unsqueeze(image, 0)
    image = torch.repeat_interleave(image, batch_size, 0)
    return (image,)


# Create the model
model = models.resnet50(pretrained=True)
model.eval()

# Get an example input
image = get_image()

# Run inference on CPU
output_cpu = model(*image)

# Compile the model
model_neuron = torch_neuronx.trace(model, image)

# Save the TorchScript for inference deployment
filename = "model.pt"
torch.jit.save(model_neuron, filename)

In [None]:
!mkdir -p triton-pt-inf2/resnet/1
!mv model.pt triton-pt-inf2/resnet/

# Generate a pytorch-neuronx triton model.py and config.pbtxt file
!git clone https://github.com/triton-inference-server/python_backend && cd python_backend
!cd python_backend && python3 inferentia/scripts/gen_triton_model.py --inf2 --model_type pytorch --triton_input INPUT__0,FP32,3x224x224 --triton_output OUTPUT__0,FP32,1000 --compiled_model model.pt --neuron_core_range 0:1 --triton_model_dir ../triton-pt-inf2/resnet --enable_dynamic_batching --max_batch_size 4

Move the generated config.pbtxt and model.py to the model folder

In [None]:
!tar -C triton-pt-inf2/ -czf triton-pt-inf2.tar.gz resnet
model_uri = sagemaker_session.upload_data(
 path="triton-pt-inf2.tar.gz", key_prefix="triton-inf2-models"
)

In [None]:
model_uri
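Before creating the SageMaker model, it can help to confirm that the archive has a single top-level folder whose name matches the value used for `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME` (`resnet` in this section). The sketch below uses only the standard library; `top_level_entries` is a hypothetical helper, and the demo builds a tiny throwaway archive rather than reading the real one.

```python
import os
import tarfile
import tempfile


def top_level_entries(archive_path):
    """Return the set of top-level names inside a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return {member.name.split("/")[0] for member in archive.getmembers()}


# Self-contained demo: build a small archive shaped like the one created above.
with tempfile.TemporaryDirectory() as workdir:
    model_dir = os.path.join(workdir, "resnet")
    os.makedirs(os.path.join(model_dir, "1"))
    with open(os.path.join(model_dir, "config.pbtxt"), "w") as handle:
        handle.write('name: "resnet"\n')
    demo_archive = os.path.join(workdir, "demo.tar.gz")
    with tarfile.open(demo_archive, "w:gz") as archive:
        archive.add(model_dir, arcname="resnet")
    entries = top_level_entries(demo_archive)

print(entries)
```

In the notebook, the same function could be pointed at `triton-pt-inf2.tar.gz`.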

Load the model

In [None]:
sm_model_name = "triton-resnet-inf2-pt-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

container = {
 "Image": triton_neuronx_image_uri,
 "ModelDataUrl": model_uri,
 "Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "resnet"},
}

create_model_response = sm_client.create_model(
 ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

Create endpoint configuration

In [None]:
endpoint_config_name = "triton-resnet-inf2-pt-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_config_response = sm_client.create_endpoint_config(
 EndpointConfigName=endpoint_config_name,
 ProductionVariants=[
 {
 "InstanceType": "ml.inf2.xlarge",
 "InitialVariantWeight": 1,
 "InitialInstanceCount": 1,
 "ModelName": sm_model_name,
 "VariantName": "AllTraffic",
 }
 ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

In [None]:
endpoint_name = "triton-resnet-inf2-pt-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_response = sm_client.create_endpoint(
 EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

In [None]:
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
 time.sleep(60)
 resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
 status = resp["EndpointStatus"]
 print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
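The polling loop above exits for any status other than `Creating`, including `Failed`. A minimal sketch of a variant that raises when the endpoint does not reach `InService` follows. `wait_for_endpoint` and its `sleep` parameter are hypothetical names; `describe_endpoint` and the `FailureReason` response field belong to the boto3 SageMaker client.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=60, sleep=time.sleep):
    """Poll describe_endpoint until the status leaves 'Creating'; raise if it is not InService."""
    while True:
        resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status != "Creating":
            break
        sleep(poll_seconds)
    if status != "InService":
        reason = resp.get("FailureReason", "no failure reason reported")
        raise RuntimeError(f"Endpoint {endpoint_name} ended in status {status}: {reason}")
    return resp


# Usage in this notebook:
# resp = wait_for_endpoint(sm_client, endpoint_name)
```

boto3 also provides a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which may be preferable when custom status printing is not needed.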

Run regular inference

In [None]:
payload = {
 "inputs": [
 {
 "name": "INPUT__0",
 "shape": [1, 3, 224, 224],
 "datatype": "FP32",
 "data": get_sample_image(),
 }
 ]
}

response = runtime_sm_client.invoke_endpoint(
 EndpointName=endpoint_name, ContentType="application/octet-stream", Body=json.dumps(payload)
)

print(json.loads(response["Body"].read().decode("utf8")))
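The JSON response printed above carries an `outputs` list whose entries hold a `name` and a flattened `data` list. To turn the raw scores into something readable, the highest-scoring class indices can be extracted as sketched below. `top_k_classes` is a hypothetical helper, and the sample response is a shortened, made-up stand-in with 4 values instead of 1000.

```python
import json


def top_k_classes(response_json, output_name="OUTPUT__0", k=5):
    """Return the k (class_index, score) pairs with the highest scores from a JSON response."""
    body = json.loads(response_json)
    output = next(item for item in body["outputs"] if item["name"] == output_name)
    scores = output["data"]  # flattened tensor values
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up, shortened response for illustration only.
sample = json.dumps(
    {
        "outputs": [
            {"name": "OUTPUT__0", "datatype": "FP32", "shape": [1, 4], "data": [0.1, 2.5, -1.0, 0.7]}
        ]
    }
)
top_two = top_k_classes(sample, k=2)
print(top_two)
```

Mapping the indices to ImageNet labels requires a label file, which is not part of this notebook.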

Run inference with binary payload

In [None]:
request_body, header_length = get_sample_image_binary_pt()

response = runtime_sm_client.invoke_endpoint(
 EndpointName=endpoint_name,
 ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
 header_length
 ),
 Body=request_body,
)

# Parse json header size length from the response
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response["ContentType"][len(header_length_prefix) :]

# Read response body
result = httpclient.InferenceServerClient.parse_response_body(
 response["Body"].read(), header_length=int(header_length_str)
)
output0_data = result.as_numpy("OUTPUT__0")
print(output0_data)
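The slicing used above to read the JSON header length silently produces a wrong value if the response comes back with a different content type. A slightly more defensive sketch is shown here; `parse_json_header_size` is a hypothetical helper built around the same prefix string.

```python
HEADER_PREFIX = "application/vnd.sagemaker-triton.binary+json;json-header-size="


def parse_json_header_size(content_type):
    """Extract the integer JSON header length from a SageMaker Triton binary content type."""
    if not content_type.startswith(HEADER_PREFIX):
        raise ValueError(f"Unexpected content type: {content_type!r}")
    return int(content_type[len(HEADER_PREFIX):])


# Usage in this notebook:
# header_length = parse_json_header_size(response["ContentType"])
print(parse_json_header_size(HEADER_PREFIX + "174"))
```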

### Terminate and cleanup

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=sm_model_name)

## Run a Transformers based OPT-125M model
In this section, we run a pre-trained, Neuron-traced OPT model on an `ml.inf2.24xlarge` instance.

Create and upload OPT-125M model on SageMaker

### Note: the following commands to compile/trace the model should be run on an inf2 instance

In [None]:
!pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com

In [None]:
import os
import time
import argparse
import torch
from transformers_neuronx.module import save_pretrained_split
from transformers_neuronx.dtypes import to_torch_dtype
from transformers_neuronx.opt.model import OPTForSampling
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
def amp_callback(model, dtype):
    # cast attention and mlp to low precisions only; layernorms stay as f32
    for block in model.model.decoder.layers:
        block.self_attn.to(dtype)
        block.fc1.to(dtype)
        block.fc2.to(dtype)
    model.lm_head.to(dtype)


def compile(
    model_name, batch_size, compiler_args, amp="bf16", tp_degree=2, n_positions=2048, unroll=None
):
    os.environ["NEURON_CC_FLAGS"] = compiler_args
    os.environ["NEURONX_DUMP_TO"] = "opt-125m-tp12"
    os.environ["NEURON_CACHE"] = "on"

    # Split the OPT model for faster loading
    model_dir = "opt-125m-model"
    if not os.path.exists(model_dir):
        model_cpu = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True)
        dtype = to_torch_dtype(amp)
        amp_callback(model_cpu, dtype=dtype)
        save_pretrained_split(model_cpu, model_dir)

    # Create and compile the Neuron model.
    model_neuron = OPTForSampling.from_pretrained(
        model_dir,
        batch_size=batch_size,
        amp=amp,
        tp_degree=tp_degree,
        n_positions=n_positions,
        unroll=unroll,
    )
    model_neuron.to_neuron()

The above compilation to neuron may also be done in model.py (see below when creating the tar.gz archive). However, doing it here and packaging it with the model archive will save time when loading the model at the SageMaker endpoint.

In [None]:
compile("facebook/opt-125m", 1, "--model-type=transformer", "bf16", 12)

The `.to_neuron()` compilation creates the folder `opt-125m-tp12` (the `NEURONX_DUMP_TO` path set above) to store the compiled artifacts; the split model checkpoint is saved in `opt-125m-model`.

`model.py` and `config.pbtxt` files are available in the `inf2_llm/opt-125m/opt` folder. Both files were generated using the `gen_triton_model.py` script and then modified for the Hugging Face (HF) LLM being used.
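The compile call above uses `tp_degree=12`, and the generator command for this model passes `--neuron_core_range 0:11`. An `inf2.24xlarge` instance has 6 Inferentia2 accelerators with 2 NeuronCores each, so 12 cores in total. The helper below derives the range string from the tensor-parallel degree. It is a sketch: `neuron_core_range` is a hypothetical function, and it assumes the range is inclusive on both ends, which is inferred from the pairing of 12 with `0:11` in this notebook.

```python
def neuron_core_range(tp_degree, first_core=0):
    """Build an inclusive 'start:end' core range string covering tp_degree NeuronCores."""
    if tp_degree < 1:
        raise ValueError("tp_degree must be at least 1")
    return f"{first_core}:{first_core + tp_degree - 1}"


core_range = neuron_core_range(12)
print(core_range)
```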

In [14]:
## Note that in this example, the following step is already complete and provided in the folder inf2_llm/opt-125m/opt

"""
# 1.0 Generate a pytorch-neuronx triton model.py and config.pbtxt file

!git clone https://github.com/triton-inference-server/python_backend && cd python_backend
!cd python_backend && python3 inferentia/scripts/gen_triton_model.py --inf2 --model_type pytorch --triton_input INPUT__0,INT64,128 --triton_output OUTPUT__0,INT64,2048 --compiled_model opt-125m-model/ --neuron_core_range 0:11 --triton_model_dir inf2_llm/opt-125m/opt --enable_dynamic_batching --max_batch_size 128

# 2.0 Modify the model.py to use HF APIs as per the README here - https://github.com/triton-inference-server/python_backend/blob/main/inferentia/README.md#using-triton-with-inferentia-2-or-trn1. 
"""


In [None]:
# Note: Use -L when copying artifacts to model folder to avoid symlinks
# https://repost.aws/knowledge-center/sagemaker-endpoint-creation-fail
!mkdir -p inf2_llm/opt-125m/opt/opt-125m-model
!mkdir -p inf2_llm/opt-125m/opt/opt-125m-tp12
!cp -R -L opt-125m-model/* inf2_llm/opt-125m/opt/opt-125m-model
!cp -R -L opt-125m-tp12/* inf2_llm/opt-125m/opt/opt-125m-tp12
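Since symlinks in the model folder can cause endpoint creation to fail, a quick scan before packaging can be useful. The following sketch walks a directory and reports any symbolic links it finds; `find_symlinks` is a hypothetical helper, and the demo runs against a temporary directory rather than the real model folder.

```python
import os
import tempfile


def find_symlinks(root):
    """Return the sorted paths of all symbolic links found under root."""
    links = []
    for current_dir, dir_names, file_names in os.walk(root, followlinks=False):
        for name in dir_names + file_names:
            path = os.path.join(current_dir, name)
            if os.path.islink(path):
                links.append(path)
    return sorted(links)


# Self-contained demo with one regular file and one symlink pointing at it.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "weights.bin")
    open(target, "w").close()
    os.symlink(target, os.path.join(workdir, "weights-link.bin"))
    found = [os.path.basename(path) for path in find_symlinks(workdir)]

print(found)
```

In the notebook, calling `find_symlinks("inf2_llm/opt-125m/opt")` should return an empty list after the `cp -R -L` commands.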

In [None]:
!tar -C inf2_llm/opt-125m/ -hcvzf triton-transformers-opt-inf2.tar.gz opt

In [None]:
model_uri = sagemaker_session.upload_data(
 path="triton-transformers-opt-inf2.tar.gz", key_prefix="triton-inf2-models"
)

In [None]:
model_uri

#### We are now ready to start a SageMaker endpoint, and load the model

In [None]:
sm_model_name = "triton-opt-inf2-transformers-opt-" + time.strftime(
 "%Y-%m-%d-%H-%M-%S", time.gmtime()
)

container = {
 "Image": triton_neuronx_image_uri,
 "ModelDataUrl": model_uri,
 "Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "opt"},
}

create_model_response = sm_client.create_model(
 ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

endpoint_config_name = "triton-opt-inf2-transformers-opt-" + time.strftime(
 "%Y-%m-%d-%H-%M-%S", time.gmtime()
)

create_endpoint_config_response = sm_client.create_endpoint_config(
 EndpointConfigName=endpoint_config_name,
 ProductionVariants=[
 {
 "InstanceType": "ml.inf2.24xlarge",
 "InitialVariantWeight": 1,
 "InitialInstanceCount": 1,
 "ModelName": sm_model_name,
 "VariantName": "AllTraffic",
 }
 ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

endpoint_name = "triton-opt-inf2-transformers-opt-" + time.strftime(
 "%Y-%m-%d-%H-%M-%S", time.gmtime()
)

create_endpoint_response = sm_client.create_endpoint(
 EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
 time.sleep(60)
 resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
 status = resp["EndpointStatus"]
 print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

#### Once the model is 'InService', it is ready to be served. We now create the payload and run the inference request

In [None]:
from transformers import AutoTokenizer
import torch

prompt = "this summer"

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
if not tokenizer.pad_token:
 tokenizer.pad_token = tokenizer.eos_token
encoded_text = tokenizer.encode(prompt, padding="max_length", max_length=128, truncation=True)
# Run inference here

payload = {
 "inputs": [{"name": "INPUT__0", "shape": [1, 128], "datatype": "INT64", "data": encoded_text}]
}

response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name, Body=json.dumps(payload))

generated_sequence = json.loads(response["Body"].read().decode("utf8"))["outputs"][0]["data"]
output = " ".join(tokenizer.batch_decode(generated_sequence, skip_special_tokens=True)).encode(
 "utf-8"
)

print(output)
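The request body above has to match the input declared for the model: one row of 128 `INT64` token IDs named `INPUT__0`. A small helper can enforce that before the endpoint is invoked. This is a sketch; `build_token_payload` and its parameters are hypothetical names, with defaults taken from the values used in this notebook.

```python
import json


def build_token_payload(token_ids, input_name="INPUT__0", expected_length=128):
    """Wrap a padded list of token IDs in the JSON inference request body."""
    if len(token_ids) != expected_length:
        raise ValueError(f"expected {expected_length} token IDs, got {len(token_ids)}")
    return json.dumps(
        {
            "inputs": [
                {
                    "name": input_name,
                    "shape": [1, expected_length],
                    "datatype": "INT64",
                    "data": list(token_ids),
                }
            ]
        }
    )


# Made-up token IDs padded to the expected length, for illustration only.
demo_body = build_token_payload([2, 9226, 1035] + [1] * 125)
print(len(json.loads(demo_body)["inputs"][0]["data"]))
```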

## Terminate Resources
As this is a sample notebook, after a successful experiment, we will terminate the resources.

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=sm_model_name)

## Run a Transformers based GPTJ-6B model
In this section, we run a pre-trained, Neuron-traced GPTJ-6B model on an `ml.inf2.24xlarge` instance.

### PyTorch: Packaging model files and uploading to s3

Create and upload GPTJ-6B model on SageMaker

### Note: the following commands to compile/trace the model should be run on an inf2 instance

In [None]:
!pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com

In [None]:
import os
import time
import argparse
import torch
from transformers_neuronx.module import save_pretrained_split
from transformers_neuronx.dtypes import to_torch_dtype
from transformers_neuronx.gptj.model import GPTJForSampling
from transformers import AutoModelForCausalLM, AutoTokenizer


def amp_callback(model, dtype):
    # cast attention and mlp to low precisions only; layernorms stay as f32
    for block in model.transformer.h:
        block.attn.to(dtype)
        block.mlp.to(dtype)
    model.lm_head.to(dtype)


def compile(
    model_name, batch_size, compiler_args, amp="bf16", tp_degree=2, n_positions=2048, unroll=None
):
    os.environ["NEURON_CC_FLAGS"] = compiler_args
    os.environ["NEURONX_DUMP_TO"] = "gpt-j-6b-artifacts-tp4"
    os.environ["NEURON_CACHE"] = "on"

    # Split the GPT-J model for faster loading
    model_dir = "EleutherAI-gpt-j-6B-bf16-local"
    if not os.path.exists(model_dir):
        model_cpu = AutoModelForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True)
        dtype = to_torch_dtype(amp)
        amp_callback(model_cpu, dtype=dtype)
        save_pretrained_split(model_cpu, model_dir)

    # Create and compile the Neuron model
    model_neuron = GPTJForSampling.from_pretrained(
        model_dir,
        batch_size=batch_size,
        amp=amp,
        tp_degree=tp_degree,
        n_positions=n_positions,
        unroll=unroll,
    )
    model_neuron.to_neuron()
    return model_neuron


compile("EleutherAI/gpt-j-6B", 1, "--model-type=transformer", "bf16", 4)

The `.to_neuron()` compilation creates the folder `gpt-j-6b-artifacts-tp4` (the `NEURONX_DUMP_TO` path set above) to cache the compiled artifacts; the split model checkpoint is saved in `EleutherAI-gpt-j-6B-bf16-local`.

`model.py` and `config.pbtxt` files are available in the `inf2_llm/gptj-6b/gptj` folder. Both files were generated using the `gen_triton_model.py` script and then modified for the Hugging Face (HF) LLM being used.

In [15]:
## Note that in this example, the following step is already complete and provided in the folder inf2_llm/gptj-6b/gptj

"""
# 1. Generate a pytorch-neuronx triton model.py and config.pbtxt file
!git clone https://github.com/triton-inference-server/python_backend && cd python_backend
!cd python_backend && python3 inferentia/scripts/gen_triton_model.py --inf2 --model_type pytorch --triton_input INPUT__0,INT64,128 --triton_output OUTPUT__0,INT64,2048 --compiled_model EleutherAI-gpt-j-6B-bf16-local/ --neuron_core_range 0:4 --triton_model_dir inf2_llm/gptj-6b/gptj --enable_dynamic_batching --max_batch_size 128
# 2. Modify the model.py to use HF APIs as per the README here - https://github.com/triton-inference-server/python_backend/blob/main/inferentia/README.md#using-triton-with-inferentia-2-or-trn1.
"""


In [None]:
# Note: Use -L when copying artifacts to model folder to avoid symlinks
# https://repost.aws/knowledge-center/sagemaker-endpoint-creation-fail
!mkdir -p inf2_llm/gptj-6b/gptj/EleutherAI-gpt-j-6B-bf16-local
!mkdir -p inf2_llm/gptj-6b/gptj/gpt-j-6b-artifacts-tp4
!cp -R -L EleutherAI-gpt-j-6B-bf16-local/* inf2_llm/gptj-6b/gptj/EleutherAI-gpt-j-6B-bf16-local
!cp -R -L gpt-j-6b-artifacts-tp4/* inf2_llm/gptj-6b/gptj/gpt-j-6b-artifacts-tp4

In [None]:
!tar -C inf2_llm/gptj-6b/ -hvczf triton-transformers-gptj-inf2.tar.gz gptj

In [None]:
model_uri = sagemaker_session.upload_data(
 path="triton-transformers-gptj-inf2.tar.gz", key_prefix="triton-inf2-models"
)

#### We are now ready to start a SageMaker endpoint, and load the model

In [None]:
preload_model_argument = "true --strict-readiness=false"

In [None]:
sm_model_name = "triton-gptj-inf2-transformers-gptj-" + time.strftime(
 "%Y-%m-%d-%H-%M-%S", time.gmtime()
)

container = {
 "Image": triton_neuronx_image_uri,
 "ModelDataUrl": model_uri,
 "Environment": {
 "SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "gptj",
 "SAGEMAKER_TRITON_LOG_VERBOSE": preload_model_argument,
 },
}

create_model_response = sm_client.create_model(
 ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

endpoint_config_name = "triton-gptj-inf2-transformers-gptj-" + time.strftime(
 "%Y-%m-%d-%H-%M-%S", time.gmtime()
)

create_endpoint_config_response = sm_client.create_endpoint_config(
 EndpointConfigName=endpoint_config_name,
 ProductionVariants=[
 {
 "InstanceType": "ml.inf2.24xlarge",
 "InitialVariantWeight": 1,
 "InitialInstanceCount": 1,
 "ModelName": sm_model_name,
 "VariantName": "AllTraffic",
 "VolumeSizeInGB": 256,
 "ModelDataDownloadTimeoutInSeconds": 1800,
 "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
 }
 ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

endpoint_name = "triton-gptj-inf2-transformers-gptj-" + time.strftime(
 "%Y-%m-%d-%H-%M-%S", time.gmtime()
)

create_endpoint_response = sm_client.create_endpoint(
 EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
 time.sleep(60)
 resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
 status = resp["EndpointStatus"]
 print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

#### Once the model is 'InService', it is ready to be served. We now create the payload and run the inference request

In [None]:
from transformers import AutoTokenizer
import torch

prompt = "this summer"

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
if not tokenizer.pad_token:
 tokenizer.pad_token = tokenizer.eos_token
encoded_text = tokenizer.encode(prompt, padding="max_length", max_length=128, truncation=True)
# Run inference here

payload = {
 "inputs": [{"name": "INPUT__0", "shape": [1, 128], "datatype": "INT64", "data": encoded_text}]
}

response = runtime_sm_client.invoke_endpoint(EndpointName=endpoint_name, Body=json.dumps(payload))

generated_sequence = json.loads(response["Body"].read().decode("utf8"))["outputs"][0]["data"]
output = " ".join(tokenizer.batch_decode(generated_sequence, skip_special_tokens=True)).encode(
 "utf-8"
)

print(output)

## Terminate Resources

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=sm_model_name)