# Deploy a Hugging Face Pretrained Model to an Amazon SageMaker Serverless Endpoint - Boto3

[![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws/studio-lab-examples/blob/main/connect-to-aws/Access_AWS_from_Studio_Lab_Deployment.ipynb)

Amazon SageMaker Studio Lab provides a free environment for developing experimental projects, which makes it well suited to ML learners taking their first ML lesson. A CPU session is available for up to 12 consecutive hours, and a GPU session for up to 4 consecutive hours.

What if you want to think bigger with your ML learning, for instance, by making your model publicly available instead of running it locally? What are your options? If you're thinking the same thing I am: deploy it to an Amazon SageMaker endpoint!

In this tutorial, I will show you how to deploy a Hugging Face pre-trained model to an Amazon SageMaker Serverless Endpoint, in seven steps:
- (1) introduce the model
- (2) set up the prerequisites: an AWS user and an Amazon SageMaker execution role
- (3) connect to your AWS resources (SageMaker) using boto3
- (4) create a SageMaker model with the boto3 SageMaker client
- (5) create a serverless endpoint configuration and the endpoint
- (6) invoke the endpoint with the boto3 SageMaker Runtime client
- (7) delete the model, the endpoint config, and the endpoint

![image info](./image/02-serverless-endpoint.png)


# Step 1. Introduce the model 


Hugging Face is a community and data science platform that provides tools for building, training, and deploying ML models based on open-source code and technologies. You can browse open-source models and datasets on their [website](https://huggingface.co/).

The `distilbert-base-uncased-finetuned-sst-2-english` model is an English text classification model based on DistilBERT and fine-tuned on the SST-2 sentiment dataset; it labels text as POSITIVE or NEGATIVE. [DistilBERT](https://huggingface.co/distilbert-base-uncased) is a Transformers model that is smaller and faster than BERT. It was pretrained on the same corpus, using the BERT base model as a teacher.

You can test the model with the Python code below. You can also try it in the Hosted inference API section of the model page:
https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english

In [None]:
%pip install huggingface_hub==0.1.0 
%pip install transformers==4.12

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TextClassificationPipeline

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)
pipe("I love Amazon SageMaker Studio Lab!")

![image info](./image/01-hugging-face-model.png)
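
With `return_all_scores=True`, the pipeline returns a score for every label rather than only the winning one. A small helper such as the hypothetical `top_label` below picks the highest-scoring label. It assumes the nested output shape `[[{"label": ..., "score": ...}, ...]]` that the pipeline returns for a single input string, and the sample scores are made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def top_label(pipeline_output: List[List[Dict[str, Any]]]) -> Tuple[str, float]:
    """Return the (label, score) pair with the highest score for the first input.

    Assumes the nested shape produced by TextClassificationPipeline with
    return_all_scores=True for one string: [[{"label": ..., "score": ...}, ...]].
    """
    scores = pipeline_output[0]  # scores for the first (and only) input text
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"]


# Made-up scores in the same shape as the pipeline output, for illustration only
sample_output = [[
    {"label": "NEGATIVE", "score": 0.0012},
    {"label": "POSITIVE", "score": 0.9988},
]]
print(top_label(sample_output))  # ('POSITIVE', 0.9988)
```

In the notebook you could call `top_label(pipe("I love Amazon SageMaker Studio Lab!"))`.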

# Step 2. Set up the prerequisites

## 2.1 AWS user login
You will need to install the AWS CLI and boto3, and configure them with your AWS credentials. You can find the credentials in the IAM console under Users.

If you don't have an IAM user yet, create one. If you already have one, select it and create an access key in the Security credentials tab; the secret access key is shown only once, when the key is created.
 
- Boto3

Boto3 is the AWS SDK for Python. You can use it to create, configure, and manage AWS services. The SDK provides an object-oriented API as well as low-level access to AWS services. To learn more about boto3: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html

- AWS CLI

The AWS Command Line Interface (AWS CLI) is a unified tool to manage your AWS services. With just one tool to download and configure, you can control multiple AWS services from the command line and automate them through scripts. To learn more about the AWS CLI: https://aws.amazon.com/cli/



In [None]:
%pip install boto3

In [None]:
%pip install awscli

In [None]:
!mkdir -p ~/.aws

In [None]:
%%writefile ~/.aws/credentials

[default]
aws_access_key_id = < paste your access key here, run this cell, then delete the cell >
aws_secret_access_key = < paste your secret key here, run this cell, then delete the cell > 

In [None]:
%%writefile ~/.aws/config

[default]
region=us-east-1 
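
If you prefer plain Python over the `%%writefile` magic, the standard library's `configparser` can produce the same two files. This is a minimal sketch: `write_aws_files` is a hypothetical helper that writes into whatever directory you pass (use `~/.aws` for real use) and restricts the credentials file to owner read/write.

```python
import configparser
import os
from pathlib import Path


def write_aws_files(aws_dir: str, access_key: str, secret_key: str, region: str = "us-east-1") -> None:
    """Write minimal AWS 'credentials' and 'config' files for the default profile."""
    target = Path(aws_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)

    credentials = configparser.ConfigParser()
    credentials["default"] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    cred_path = target / "credentials"
    with open(cred_path, "w") as f:
        credentials.write(f)
    os.chmod(cred_path, 0o600)  # keep the secret readable only by the current user

    config = configparser.ConfigParser()
    config["default"] = {"region": region}
    with open(target / "config", "w") as f:
        config.write(f)


# Example with placeholders (not real keys):
# write_aws_files("~/.aws", "<your access key>", "<your secret key>")
```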

In [None]:
# Optional alternative to the credentials file: paste your keys here, run this cell, then clear them.
# Leaving both as None makes boto3 use its default credential chain (environment variables, ~/.aws/credentials).
ACCESS_KEY = None  # or "<paste your access key here>"
SECRET_KEY = None  # or "<paste your secret key here>"
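
boto3 also reads credentials from the standard environment variables `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION`. Here is a minimal sketch of that route, assuming you supply the values at runtime rather than storing them in the notebook; `set_aws_env` is a hypothetical helper.

```python
import os
from typing import Optional


def set_aws_env(access_key: str, secret_key: str, region: Optional[str] = "us-east-1") -> None:
    """Export AWS credentials as environment variables for this Python process.

    Run it before creating boto3 clients: a default session that already
    resolved credentials may keep using the old ones.
    """
    os.environ["AWS_ACCESS_KEY_ID"] = access_key
    os.environ["AWS_SECRET_ACCESS_KEY"] = secret_key
    if region:
        os.environ["AWS_DEFAULT_REGION"] = region


# Example with placeholders; consider getpass.getpass() instead of hard-coding real keys
# set_aws_env("<your access key>", "<your secret key>")
```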

## 2.2 Amazon SageMaker execution role

An Amazon SageMaker execution role is an AWS Identity and Access Management (IAM) role that SageMaker assumes to perform operations on your behalf on the AWS hardware it manages. SageMaker users grant these permissions through this IAM role, which is why it is called an execution role.

You can create an execution role by following these [instructions](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). If you already use SageMaker in your own AWS account, copy and paste the ARN of your execution role below.

Note: you must create this SageMaker IAM execution role before continuing.

In [None]:
# Create a SageMaker execution role in the IAM console, then paste its ARN here,
# e.g. arn:aws:iam::<account-id>:role/<role-name>
role = "<paste your execution role ARN here>"
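
An unfilled or mistyped `role` only surfaces later as an error from `create_model`. As a sketch, you can catch that early with a format check. `looks_like_role_arn` is a hypothetical helper that assumes the usual IAM role ARN shape `arn:<partition>:iam::<12-digit account ID>:role/<optional path/><role name>`; it checks the format only, not whether the role exists or trusts SageMaker.

```python
import re

# Assumed IAM role ARN shape: arn:<partition>:iam::<12-digit account>:role/<path/><name>
ROLE_ARN_PATTERN = re.compile(
    r"^arn:aws(?:-cn|-us-gov)?:iam::\d{12}:role/[\w+=,.@/-]+$"
)


def looks_like_role_arn(value: str) -> bool:
    """Return True if value is shaped like an IAM role ARN (format check only)."""
    return bool(ROLE_ARN_PATTERN.match(value.strip()))


print(looks_like_role_arn("arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole"))  # True
print(looks_like_role_arn("<paste your execution role ARN here>"))  # False
```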

# Step 3. Connect to your AWS resources (SageMaker) using boto3

Amazon SageMaker Serverless Inference is a purpose-built inference option that makes it easy for you to deploy and scale ML models. Serverless Inference is ideal for workloads that have idle periods between traffic spurts and can tolerate cold starts. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, so you don't need to choose instance types or manage scaling policies; this takes away the undifferentiated heavy lifting of selecting and managing servers. Serverless Inference integrates with AWS Lambda to offer high availability, built-in fault tolerance, and automatic scaling. Here is an outline of the boto3 calls that create an Amazon SageMaker serverless endpoint:



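```python
import boto3

sm_client = boto3.client("sagemaker")

container = {
    "Image": "<container image URI>",
    # ModelDataUrl applies only when you bring your own model artifacts (a .tar.gz in S3)
    "ModelDataUrl": "s3://my-bucket/path/to/artifacts/model.tar.gz",
    "Mode": "SingleModel",
}

response = sm_client.create_model(
    ModelName="model-name",
    PrimaryContainer=container,
    ExecutionRoleArn=role,  # execution role ARN from Step 2
    EnableNetworkIsolation=False,
)

# ServerlessConfig goes inside a production variant, not at the top level
response = sm_client.create_endpoint_config(
    EndpointConfigName="my-epc",
    ProductionVariants=[{
        "ModelName": "model-name",
        "VariantName": "AllTraffic",
        "ServerlessConfig": {"MemorySizeInMB": 4096, "MaxConcurrency": 10},
    }],
)

response = sm_client.create_endpoint(
    EndpointName="my-endpoint",
    EndpointConfigName="my-epc",
)
```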

## How Amazon SageMaker Serverless Endpoints Work
![image info](./image/03-serverless-endpoints-how-it-works.png)

- More references on serverless inference:
 - https://towardsdatascience.com/sagemaker-serverless-inference-is-now-generally-available-e42550a146fe
 - https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html
 - https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sagemaker-serverless-inference/
 - https://github.com/huggingface/notebooks/blob/main/sagemaker/11_deploy_model_from_hf_hub/deploy_transformer_model_from_hf_hub.ipynb
 - [Host Hugging Face transformer models using Amazon SageMaker Serverless Inference](https://aws.amazon.com/blogs/machine-learning/host-hugging-face-transformer-models-using-amazon-sagemaker-serverless-inference/)


In [None]:
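import boto3

# Create a SageMaker client; if ACCESS_KEY/SECRET_KEY are None, boto3 uses ~/.aws/credentials
sm_client = boto3.client("sagemaker",
                         aws_access_key_id=ACCESS_KEY,
                         aws_secret_access_key=SECRET_KEY)

# Quick connectivity check: how many endpoints exist in this account and Region?
response = sm_client.list_endpoints()
len(response["Endpoints"])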
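
`list_endpoints` returns results one page at a time, with a `NextToken` when more exist, so the count above covers only the first page. Here is a sketch that follows `NextToken` until it runs out. `all_endpoint_names` is a hypothetical helper that works with any client exposing `list_endpoints(**kwargs)`, which lets the fake client below stand in for `sm_client` offline; boto3 also offers `sm_client.get_paginator("list_endpoints")` for the same job.

```python
from typing import Any, Dict, List


def all_endpoint_names(client: Any) -> List[str]:
    """Collect every endpoint name by following NextToken across pages."""
    names: List[str] = []
    kwargs: Dict[str, Any] = {}
    while True:
        page = client.list_endpoints(**kwargs)
        names.extend(ep["EndpointName"] for ep in page["Endpoints"])
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token  # request the next page


class FakeSageMakerClient:
    """Offline stand-in that serves two pages shaped like list_endpoints responses."""

    def __init__(self) -> None:
        self.pages = {
            None: {"Endpoints": [{"EndpointName": "ep-a"}, {"EndpointName": "ep-b"}], "NextToken": "t1"},
            "t1": {"Endpoints": [{"EndpointName": "ep-c"}]},
        }

    def list_endpoints(self, **kwargs: Any) -> Dict[str, Any]:
        return self.pages[kwargs.get("NextToken")]


print(all_endpoint_names(FakeSageMakerClient()))  # ['ep-a', 'ep-b', 'ep-c']
# With a real client: all_endpoint_names(sm_client)
```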

In [None]:
import time
ml_model_name = "text-classification-hugging-face"
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
model_name = ml_model_name + '-model' + timestamp
endpoint_config_name = ml_model_name + '-epc' + timestamp
print(model_name)
print(endpoint_config_name)
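
SageMaker rejects names that are too long or contain characters other than letters, digits, and hyphens. The sketch below validates a generated name before any API call. It assumes the 63-character limit and the letters-digits-hyphens pattern listed in the API reference for `ModelName`, `EndpointConfigName`, and `EndpointName`, so double-check those limits against the current docs; `make_resource_name` is a hypothetical helper.

```python
import re
import time

# Assumed constraints for SageMaker model / endpoint-config / endpoint names
MAX_NAME_LENGTH = 63
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")


def make_resource_name(base: str, suffix: str, timestamp: str) -> str:
    """Join base, suffix, and timestamp, then validate the result."""
    name = f"{base}-{suffix}{timestamp}"
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"{name!r} is {len(name)} characters; the limit is {MAX_NAME_LENGTH}")
    if not NAME_PATTERN.match(name):
        raise ValueError(f"{name!r} may only contain letters, digits, and hyphens")
    return name


# Separate variable so the notebook's own `timestamp` is not overwritten
example_timestamp = time.strftime("-%Y-%m-%d-%H-%M-%S", time.gmtime())
print(make_resource_name("text-classification-hugging-face", "model", example_timestamp))
```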

# Step 4. Create a SageMaker model with the boto3 SageMaker client
- Before creating an endpoint, you need to create the model to deploy. The cell below shows how to use the SageMaker boto3 client to create a model. The documentation is here: [boto3 SageMaker client create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model)

- For each model, you need to specify which container image the Amazon SageMaker endpoint will run. AWS provides prebuilt **Deep Learning Containers (DLC)** for frameworks including TensorFlow, PyTorch, MXNet, Hugging Face, and AutoGluon; choose the one that fits your use case, such as a TensorFlow-based or PyTorch-based image. In this example, we use the Hugging Face image built on PyTorch. All available images are documented [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). An image URI has two parts: the registry URL for your Region, followed by the repository name (framework and job type) and a tag encoding the framework versions, hardware, and Python version.

- To learn more about script mode on SageMaker, check out our documentation [here](https://sagemaker.readthedocs.io/en/stable/frameworks/index.html).
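
To make that structure concrete, here is a sketch that splits the image URI used in the next cell into account, Region, repository, and tag, and rebuilds it from those parts. `split_image_uri` and `build_image_uri` are hypothetical helpers that assume the standard `amazonaws.com` domain. The registry account `763104351884` serves many Regions but not all, so confirm the account for your Region in the available-images list.

```python
from typing import Dict


def split_image_uri(uri: str) -> Dict[str, str]:
    """Split '<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>' into parts."""
    registry, _, rest = uri.partition("/")
    repository, _, tag = rest.partition(":")
    account, _, _, region = registry.split(".")[:4]  # skip the 'dkr' and 'ecr' labels
    return {"account": account, "region": region, "repository": repository, "tag": tag}


def build_image_uri(account: str, region: str, repository: str, tag: str) -> str:
    """Compose an ECR image URI from its parts (the reverse of split_image_uri)."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.9-transformers4.12-cpu-py38-ubuntu20.04"
parts = split_image_uri(uri)
print(parts["repository"])  # huggingface-pytorch-inference
print(parts["tag"])         # 1.9-transformers4.12-cpu-py38-ubuntu20.04
# Tag fields: PyTorch 1.9, Transformers 4.12, CPU, Python 3.8, Ubuntu 20.04
```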



In [None]:
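# Hugging Face inference DLC image (PyTorch 1.9, Transformers 4.12, CPU, Python 3.8) in us-east-1
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.9-transformers4.12-cpu-py38-ubuntu20.04"

container_config = {
    "Image": image_uri,
    "Mode": "SingleModel",
    "Environment": {
        # Model the container downloads from the Hugging Face Hub at startup
        "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",
        # Pipeline task used to serve the model
        "HF_TASK": "text-classification",
        "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",  # 20 = logging.INFO
        "SAGEMAKER_REGION": "us-east-1",
    },
}

response = sm_client.create_model(
    ModelName=model_name,
    PrimaryContainer=container_config,
    ExecutionRoleArn=role,
    EnableNetworkIsolation=False,  # keep False: the container must reach the Hugging Face Hub
)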

- Check that the model was created: Console > Amazon SageMaker > Models

# Step 5. Create a serverless endpoint configuration

- In this step, we create an endpoint configuration with a `ServerlessConfig`; its name is then passed to `sm_client.create_endpoint`. The configuration parameters are documented [here](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config).

- Creating the endpoint configuration does not create the endpoint; that only happens when you run `sm_client.create_endpoint`. After running it, wait about 2 minutes for the model to deploy. The create endpoint API is documented [here](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint).

In [None]:
endpoint_config_response = sm_client.create_endpoint_config(
 EndpointConfigName=endpoint_config_name,
 ProductionVariants=[
 {
 "ModelName": model_name,
 "VariantName": "AllTraffic",
 "ServerlessConfig": {
 # Specify MemorySizeInMB and MaxConcurrency in the serverless config object
 "MemorySizeInMB": 4096,
 "MaxConcurrency": 10
 }
 }
 ]
)

print('Endpoint configuration name: {}'.format(endpoint_config_name))
print('Endpoint configuration arn: {}'.format(endpoint_config_response['EndpointConfigArn']))
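
`MemorySizeInMB` and `MaxConcurrency` accept only certain values. The sketch below checks them before `create_endpoint_config` is called. It assumes the limits documented for `ServerlessConfig` at the time of writing (memory from 1024 to 6144 MB in 1024 MB steps, concurrency from 1 to 200), which may change. `serverless_variant` is a hypothetical helper returning one `ProductionVariants` entry shaped like the one in the cell above.

```python
from typing import Any, Dict

# Assumed ServerlessConfig limits; confirm against the current SageMaker docs
ALLOWED_MEMORY_MB = range(1024, 6144 + 1, 1024)  # 1024, 2048, ..., 6144
MIN_CONCURRENCY, MAX_CONCURRENCY = 1, 200


def serverless_variant(model_name: str, memory_mb: int = 4096, max_concurrency: int = 10) -> Dict[str, Any]:
    """Build one ProductionVariants entry for a serverless endpoint, validating its settings."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"MemorySizeInMB must be one of {list(ALLOWED_MEMORY_MB)}, got {memory_mb}")
    if not MIN_CONCURRENCY <= max_concurrency <= MAX_CONCURRENCY:
        raise ValueError(f"MaxConcurrency must be {MIN_CONCURRENCY}-{MAX_CONCURRENCY}, got {max_concurrency}")
    return {
        "ModelName": model_name,
        "VariantName": "AllTraffic",
        "ServerlessConfig": {"MemorySizeInMB": memory_mb, "MaxConcurrency": max_concurrency},
    }


print(serverless_variant("my-model"))
# In the notebook: ProductionVariants=[serverless_variant(model_name)]
```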


In [None]:
endpoint_name = "studio-lab-ep" + '-epc' + timestamp
response = sm_client.create_endpoint(
 EndpointName=endpoint_name,
 EndpointConfigName=endpoint_config_name
)

- Check that the endpoint was created, and wait until its status is `InService`:
 - Console > Amazon SageMaker > Endpoints
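
Instead of refreshing the console, you can poll `describe_endpoint` until `EndpointStatus` leaves `Creating`. boto3 ships a waiter for this (`sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)`); the hypothetical `wait_for_endpoint` below spells out a similar polling loop so the logic is visible. It accepts any client with a `describe_endpoint(EndpointName=...)` method, so the fake client at the bottom lets it run offline.

```python
import time
from typing import Any, Callable


def wait_for_endpoint(client: Any, endpoint_name: str, poll_seconds: float = 15.0,
                      timeout_seconds: float = 900.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll describe_endpoint until the endpoint is InService; raise if it fails or times out."""
    waited = 0.0
    while True:
        desc = client.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = desc.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {waited:.0f}s")
        sleep(poll_seconds)  # injectable so the example can run without real delays
        waited += poll_seconds


class FakeEndpointClient:
    """Offline stand-in: reports Creating twice, then InService."""

    def __init__(self) -> None:
        self.statuses = ["Creating", "Creating", "InService"]

    def describe_endpoint(self, EndpointName: str) -> dict:
        return {"EndpointName": EndpointName, "EndpointStatus": self.statuses.pop(0)}


print(wait_for_endpoint(FakeEndpointClient(), "demo-endpoint", sleep=lambda s: None))  # InService
# With a real client: wait_for_endpoint(sm_client, endpoint_name)
```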
 
# Step 6. Invoke the endpoint with the boto3 SageMaker Runtime client

Once the endpoint is `InService`, we use SageMaker Runtime to interact with it. According to the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints-invoke.html), you need:
- (1) endpoint_name: the name of the in-service serverless endpoint you want to invoke.
- (2) content_type: the MIME type of your input data in the request body (for example, application/json). To learn about other content types, see [Common Data Formats for Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html).
- (3) payload: your request payload for inference, as bytes or a file-like object.
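
Putting the three pieces together, the payload is the JSON request serialized to bytes. The sketch below builds the keyword arguments for `invoke_endpoint`. `build_invoke_args` is a hypothetical helper, and the `{"inputs": ...}` shape is the request format this tutorial sends to the Hugging Face container.

```python
import json
from typing import Any, Dict


def build_invoke_args(endpoint_name: str, text: str) -> Dict[str, Any]:
    """Return keyword arguments for runtime.invoke_endpoint with a JSON body."""
    payload = {"inputs": text}  # the Hugging Face container reads the text from "inputs"
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(payload).encode("utf-8"),  # request body as bytes
    }


args = build_invoke_args("my-endpoint", "Happy Birthday to you!")
print(args["Body"])  # b'{"inputs": "Happy Birthday to you!"}'
# With a live endpoint: runtime.invoke_endpoint(**build_invoke_args(endpoint_name, "..."))
```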

Now, let's try out the endpoint!

In [None]:
import json
import boto3
runtime = boto3.client("sagemaker-runtime",
 aws_access_key_id=ACCESS_KEY,
 aws_secret_access_key=SECRET_KEY) 

content_type = "application/json"

# example request, you always need to define "inputs"
data = {
 "inputs": "Happy Birthday to you!"
}

response = runtime.invoke_endpoint(
 EndpointName=endpoint_name,
 ContentType=content_type,
 Body=json.dumps(data)
)

In [None]:
response["Body"].read().decode()
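
The response body is a stream whose content is a JSON string. For text classification, the Hugging Face container typically returns a list of `{"label": ..., "score": ...}` objects; the hypothetical `parse_prediction` below decodes that into Python objects. Because only `.read()` is used, an `io.BytesIO` holding a made-up response stands in for `response["Body"]` here. The stream can be read only once, so call this on a fresh response rather than after the cell above.

```python
import io
import json
from typing import Any, Dict, List


def parse_prediction(body: Any) -> List[Dict[str, Any]]:
    """Read a response body (anything whose .read() returns bytes) and decode its JSON."""
    return json.loads(body.read().decode("utf-8"))


# Made-up body in the assumed response shape; with a live endpoint, pass response["Body"]
fake_body = io.BytesIO(b'[{"label": "POSITIVE", "score": 0.9998}]')
prediction = parse_prediction(fake_body)
print(prediction[0]["label"], prediction[0]["score"])  # POSITIVE 0.9998
```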

# Step 7. Delete the model, the endpoint config, and the endpoint

When you no longer need the model or the endpoint, clean up the resources you created earlier. According to the [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint):
```
You must not delete an EndpointConfig that is in use by an endpoint that is live or while the UpdateEndpoint or CreateEndpoint operations are being performed on the endpoint. To update an endpoint, you must create a new EndpointConfig .
```

Let's delete the resources in order: first the endpoint, then the endpoint configuration, and finally the model.

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)
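
If one of these calls fails (for example, because the endpoint was already deleted in the console), the cell stops and the remaining resources are left behind. A more forgiving variant attempts all three deletions in the same order and reports what failed. `delete_serverless_resources` is a hypothetical helper that catches every exception for simplicity; production code would usually catch `botocore.exceptions.ClientError` specifically.

```python
from typing import Any, Dict, List


def delete_serverless_resources(client: Any, endpoint_name: str,
                                endpoint_config_name: str, model_name: str) -> Dict[str, List[str]]:
    """Delete endpoint, endpoint config, then model; collect failures instead of stopping."""
    steps = [
        ("endpoint", lambda: client.delete_endpoint(EndpointName=endpoint_name)),
        ("endpoint config", lambda: client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)),
        ("model", lambda: client.delete_model(ModelName=model_name)),
    ]
    report: Dict[str, List[str]] = {"deleted": [], "failed": []}
    for label, call in steps:
        try:
            call()
            report["deleted"].append(label)
        except Exception as err:  # broad on purpose in this sketch
            report["failed"].append(f"{label}: {err}")
    return report


class FakeCleanupClient:
    """Offline stand-in whose endpoint is already gone."""

    def delete_endpoint(self, EndpointName: str) -> dict:
        raise RuntimeError("Could not find endpoint")

    def delete_endpoint_config(self, EndpointConfigName: str) -> dict:
        return {}

    def delete_model(self, ModelName: str) -> dict:
        return {}


print(delete_serverless_resources(FakeCleanupClient(), "ep", "epc", "model"))
# {'deleted': ['endpoint config', 'model'], 'failed': ['endpoint: Could not find endpoint']}
```

With a real client: `delete_serverless_resources(sm_client, endpoint_name, endpoint_config_name, model_name)`.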