# Deploying GPT-2 and GPT-J

In this notebook, we will use Hugging Face models and the SageMaker Hugging Face-specific APIs to deploy both GPT-2 and GPT-J. We will also show how to deploy several models to the same SageMaker instance as a multi-model endpoint; in practice these could be GPT-2 models fine-tuned on different datasets. This lets you get real-time predictions from several models while paying for only one running endpoint instance.

*****
## Deploying GPT-2 to a SageMaker Multi-Model Endpoint

In [None]:
!pip install -U transformers
!pip install -U sagemaker

### Get SageMaker session, role and default bucket
If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM role with the required permissions for SageMaker. You can find more about this [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).

In [None]:
import sagemaker
import boto3

sess = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models and logs
# SageMaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
 # set to default bucket if a bucket name is not given
 sagemaker_session_bucket = sess.default_bucket()

try:
 role = sagemaker.get_execution_role()
except ValueError:
 iam = boto3.client('iam')
 role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
region = sess.boto_region_name
sm_client = boto3.client('sagemaker')

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {region}")

### Load the GPT-2 model and tokenizer, then save both to the same folder with the Transformers `save_pretrained` utility

In [None]:
import transformers 
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

model.save_pretrained('gpt2-model/')
tokenizer.save_pretrained('gpt2-model/')

In [None]:
# This cell is meant to test that the model can be loaded from the local artifact
# model = AutoModelForCausalLM.from_pretrained('gpt2-model/')

Test the model by generating 5 different sequences for the same prompt.

In [None]:
model.eval()

text = "A rose by any other name would smell as sweet, by William Shakespeare."
input_ids = tokenizer.encode(text, return_tensors = 'pt')

sample_outputs = model.generate(input_ids,
 do_sample = True, 
 max_length = 70,
 num_return_sequences = 5) #to test how long we can generate and it be coh

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
 print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))
 print('')

### Tar model and tokenizer artifacts, upload to S3

In [None]:
import tarfile 

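# Package the model directory so that its files sit at the archive root.
# The `with` block closes the archive, so no explicit close() is needed.
with tarfile.open('gpt2-model.tar.gz', 'w:gz') as f:
 f.add('gpt2-model/', arcname='.')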

prefix = 'gpt2-hf-workshop/gpt2-test'

Check out the file contents and structure of the model.tar.gz artifact.

In [None]:
! tar -ztvf gpt2-model.tar.gz
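
The same check can be done from Python. The sketch below is only an illustration: it builds a throwaway directory with placeholder files (the file names in `DUMMY_FILES` are assumptions standing in for whatever `save_pretrained` wrote), packages it with `arcname='.'` as above, and confirms that every member sits at the archive root, which is the layout we expect the Hugging Face inference container to need.

```python
import os
import posixpath
import tarfile
import tempfile

# Hypothetical stand-ins for the files written by save_pretrained().
DUMMY_FILES = ["config.json", "pytorch_model.bin", "vocab.json"]


def pack_flat(model_dir, archive_path):
    """Tar model_dir so that its files sit at the archive root."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_dir, arcname=".")


def member_names(archive_path):
    """Return member names with any leading './' normalized away."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(posixpath.normpath(name) for name in tar.getnames())


with tempfile.TemporaryDirectory() as workdir:
    model_dir = os.path.join(workdir, "gpt2-model")
    os.makedirs(model_dir)
    for filename in DUMMY_FILES:
        with open(os.path.join(model_dir, filename), "w") as handle:
            handle.write("placeholder")

    archive_path = os.path.join(workdir, "gpt2-model.tar.gz")
    pack_flat(model_dir, archive_path)
    names = member_names(archive_path)

# No member name contains a directory separator, so nothing is nested.
is_flat = all("/" not in name for name in names)
print(names, is_flat)
```

If `is_flat` were false, the model files would be nested inside a subdirectory of the archive and the container might not find them.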

We will upload the same model package twice with different names, to simulate deploying 2 models to the same endpoint.

In [None]:
! aws s3 cp gpt2-model.tar.gz s3://"$sagemaker_session_bucket"/"$prefix"/gpt2-model1.tar.gz
! aws s3 cp gpt2-model.tar.gz s3://"$sagemaker_session_bucket"/"$prefix"/gpt2-model2.tar.gz

### Get image URI for Hugging Face inference Deep Learning Container

In [None]:
from sagemaker import image_uris

hf_inference_dlc = image_uris.retrieve(framework='huggingface', 
 region=region, 
 version='4.12.3', 
 image_scope='inference', 
 base_framework_version='pytorch1.9.1', 
 py_version='py38', 
 container_version='ubuntu20.04', 
 instance_type='ml.c5.xlarge')

### Use `MultiDataModel` to set up a multi-model endpoint definition
By setting the `HF_TASK` environment variable, we avoid having to write and test our own inference code. Depending on the task and model you choose, the Hugging Face inference container will run the appropriate code by default.

In [None]:
from sagemaker.multidatamodel import MultiDataModel
from sagemaker.predictor import Predictor

hub = {
 'HF_TASK':'text-generation'
}

mme = MultiDataModel(
 name='gpt2-models',
 model_data_prefix=f's3://{sagemaker_session_bucket}/{prefix}/',
 image_uri=hf_inference_dlc,
 env=hub,
 predictor_cls=Predictor,
 role=role,
 sagemaker_session=sess,
 )

We can see that our model object has already "registered" the model artifacts we uploaded to S3 under the `model_data_prefix`.

In [None]:
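# Use a name other than `model` so the GPT-2 model object loaded earlier is not overwritten
for model_path in mme.list_models():
 print(model_path)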
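
The names listed here are what you later pass as `target_model`. As a rough sketch of the relationship, assuming the name is simply the artifact's S3 key relative to `model_data_prefix`, the hypothetical helper `target_model_name` below (not part of the SageMaker SDK) derives it from a full key. The keys are example values.

```python
def target_model_name(s3_key, prefix):
    """Return the artifact path relative to the endpoint's S3 prefix."""
    normalized = prefix.strip("/") + "/"
    if not s3_key.startswith(normalized):
        raise ValueError(f"{s3_key!r} is not under prefix {normalized!r}")
    return s3_key[len(normalized):]


# Example values only; the prefix mirrors the one used in this notebook.
example_prefix = "gpt2-hf-workshop/gpt2-test"
example_keys = [
    "gpt2-hf-workshop/gpt2-test/gpt2-model1.tar.gz",
    "gpt2-hf-workshop/gpt2-test/gpt2-model2.tar.gz",
]

model_names = [target_model_name(key, example_prefix) for key in example_keys]
print(model_names)
```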

### Deploy Multi-Model Endpoint and send inference requests to both models

In [None]:
import datetime
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

endpoint_name_gpt2 = 'mme-gpt2-'+datetime.datetime.now().strftime(
 "%Y-%m-%d-%H-%M-%S"
)

predictor_gpt2 = mme.deploy(
 initial_instance_count=1,
 instance_type='ml.c5.xlarge',
 serializer=JSONSerializer(),
 deserializer=JSONDeserializer(),
 endpoint_name=endpoint_name_gpt2
 )

You can now get predictions from both models. The first request made to each model takes longer than subsequent ones, because the model must first be loaded from S3.

In [None]:
predictor_gpt2.predict({'inputs':'A rose by any other name.'},
 target_model='gpt2-model1.tar.gz')[0]

In [None]:
predictor_gpt2.predict({'inputs':'A rose by any other name.'},
 target_model='gpt2-model2.tar.gz')[0]
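
Outside the SageMaker Python SDK, the same request can be sent through the low-level runtime API, where the model is selected with the `TargetModel` argument of `invoke_endpoint`. The sketch below only builds the request arguments, so it runs without AWS credentials. The helper `build_mme_request`, the endpoint name, and the generation parameters are illustrative assumptions; the optional `parameters` entry follows the payload convention of the Hugging Face inference toolkit.

```python
import json


def build_mme_request(endpoint_name, target_model, prompt, parameters=None):
    """Build keyword arguments for sagemaker-runtime invoke_endpoint."""
    payload = {"inputs": prompt}
    if parameters:
        # Optional generation settings forwarded to the text-generation task.
        payload["parameters"] = parameters
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,
        "ContentType": "application/json",
        "Accept": "application/json",
        "Body": json.dumps(payload),
    }


request = build_mme_request(
    endpoint_name="my-mme-endpoint",  # hypothetical endpoint name
    target_model="gpt2-model1.tar.gz",
    prompt="A rose by any other name.",
    parameters={"max_length": 50, "do_sample": True},
)
print(request["TargetModel"], request["Body"])

# With AWS credentials and a live endpoint, the request could be sent as:
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(**request)
# print(json.loads(response["Body"].read()))
```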

### Add new model to endpoint
To add a new model to our multi-model endpoint, we only have to upload a new model artifact to the same S3 prefix as the other models. You can load and get inferences from this new model as soon as it is uploaded to S3. For demonstration purposes, we again upload the same artifact we packaged previously.

In [None]:
! aws s3 cp gpt2-model.tar.gz s3://"$sagemaker_session_bucket"/"$prefix"/gpt2-model3.tar.gz

In [None]:
predictor_gpt2.predict({'inputs':'A rose by any other name.'},
 target_model='gpt2-model3.tar.gz')[0]


********************************************************************************************************************************************
********************************************************************************************************************************************


## Deploying GPT-J to a SageMaker Endpoint

### Clone sample repo and run model preparation script
Hugging Face has documented best practices for deploying GPT-J on SageMaker in this [repository](https://github.com/philschmid/amazon-sagemaker-gpt-j-sample). In particular, the model is saved to disk directly with PyTorch utilities instead of `.save_pretrained()`, which helps reduce model loading time at deployment by 10x. Check out this [blog post](https://huggingface.co/blog/gptj-sagemaker) to learn more.

In [None]:
%%bash
git clone https://github.com/philschmid/amazon-sagemaker-gpt-j-sample.git

mv amazon-sagemaker-gpt-j-sample/convert_gptj.py \
 amazon-sagemaker-gpt-j-sample/requirements.txt \
 amazon-sagemaker-gpt-j-sample/code/ .

rm -r amazon-sagemaker-gpt-j-sample/
pip install -r requirements.txt

The `convert_gptj.py` script will save the model, tokenizer and inference script to disk, create a tar file with those artifacts, and upload it to an S3 bucket of our choice, under the `gpt-j` prefix. We also capture the S3 URI of our model artifact directly from the script's output.

In [None]:
output = !python3 convert_gptj.py --bucket_name "$sagemaker_session_bucket"
model_uri = output[0]
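
Taking `output[0]` relies on the URI being the first line the script prints. If other log lines could precede it, a more defensive option is to search the captured lines for something that looks like an S3 model artifact. The sketch below is an assumption-laden illustration: `find_model_uri` is a hypothetical helper, and the sample lines are made up rather than the script's real output.

```python
def find_model_uri(lines):
    """Return the first line that looks like an S3 URI to a .tar.gz artifact."""
    for line in lines:
        candidate = line.strip()
        if candidate.startswith("s3://") and candidate.endswith(".tar.gz"):
            return candidate
    raise ValueError("no S3 model URI found in the script output")


# Made-up output lines, for illustration only.
sample_output = [
    "saving model and tokenizer",
    "s3://my-example-bucket/gpt-j/model.tar.gz",
]

sample_uri = find_model_uri(sample_output)
print(sample_uri)
```

In the notebook, this would be called as `find_model_uri(output)` in place of `output[0]`.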

We now use the `HuggingFaceModel` API to deploy GPT-J to a SageMaker endpoint, from which we can get real-time predictions. Because of the way we saved our model, it is important that the Transformers and PyTorch versions you specify match the ones installed in the environment where you are running this notebook.

In [None]:
from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
 model_data=model_uri,
 transformers_version='4.12.3',
 pytorch_version='1.9.1',
 py_version='py38',
 role=role, 
 )
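
One way to confirm that the local environment matches the pinned versions is to compare them before deploying. This is a minimal sketch using only the standard library; the `PINNED` dictionary repeats the versions passed to `HuggingFaceModel` above, and the helper names are illustrative. A local build suffix such as `+cu111` on the PyTorch version is ignored in the comparison.

```python
from importlib import metadata

# Versions passed to HuggingFaceModel in this notebook.
PINNED = {"transformers": "4.12.3", "torch": "1.9.1"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def same_release(installed, pinned):
    """Compare versions while ignoring a local build suffix such as '+cu111'."""
    return installed is not None and installed.split("+")[0] == pinned


for package, pinned in PINNED.items():
    found = installed_version(package)
    status = "ok" if same_release(found, pinned) else "MISMATCH"
    print(f"{package}: installed={found} pinned={pinned} [{status}]")
```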


In [None]:
endpoint_name_gptj = 'gptj-'+datetime.datetime.now().strftime(
 "%Y-%m-%d-%H-%M-%S"
)

predictor_gptj = huggingface_model.deploy(
 initial_instance_count=1,
 instance_type='ml.g4dn.xlarge',
 endpoint_name=endpoint_name_gptj
)

Now we can get real-time predictions from our model!

In [None]:
predictor_gptj.predict({
 "inputs": "Can you please let us know more details about your ",
})

### Cleanup 

In [None]:
predictor_gpt2.delete_model()
predictor_gpt2.delete_endpoint()
predictor_gptj.delete_model()
predictor_gptj.delete_endpoint()

Utility: load a predictor for an existing endpoint by name

In [None]:
# from sagemaker.predictor import Predictor

# predictor = Predictor(
# endpoint_name='mme-test-gpt2',
# serializer=JSONSerializer(),
# deserializer=JSONDeserializer())