# Amazon SageMaker Multi-Model Endpoints using XGBoost


With [Amazon SageMaker multi-model endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html), customers can create an endpoint that seamlessly hosts up to thousands of models. These endpoints are well suited to use cases where any one of a large number of models needs to be invokable on demand, where the models can be served from a common inference container to save inference costs, and where it is acceptable for infrequently invoked models to incur some additional latency. For applications that require consistently low inference latency, an endpoint deploying a single model is still the best choice.

At a high level, Amazon SageMaker manages the loading and unloading of models for a multi-model endpoint, as they are needed. When an invocation request is made for a particular model, Amazon SageMaker routes the request to an instance assigned to that model, downloads the model artifacts from S3 onto that instance, and initiates loading of the model into the memory of the container. As soon as the loading is complete, Amazon SageMaker performs the requested invocation and returns the result. If the model is already loaded in memory on the selected instance, the downloading and loading steps are skipped and the invocation is performed immediately.
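The load-on-first-request behavior described above can be pictured with a toy in-memory cache. This is only an illustrative sketch, not SageMaker's implementation: the class `ToyModelCache`, its `capacity` parameter, the `fake_loader` function, and the least-recently-used eviction policy are all assumptions made for the example.

```python
from collections import OrderedDict


class ToyModelCache:
    """Hypothetical stand-in for a multi-model container's in-memory model store."""

    def __init__(self, capacity, loader):
        self.capacity = capacity      # assumed limit on models held in memory
        self.loader = loader          # callable that "downloads and loads" a model
        self.loaded = OrderedDict()   # model name -> loaded model, oldest first
        self.cold_loads = 0           # invocations that had to load a model first

    def invoke(self, model_name, payload):
        if model_name in self.loaded:
            # Warm path: the model is already in memory, so loading is skipped.
            self.loaded.move_to_end(model_name)
        else:
            # Cold path: load the model, evicting the least recently used if full.
            if len(self.loaded) >= self.capacity:
                self.loaded.popitem(last=False)
            self.loaded[model_name] = self.loader(model_name)
            self.cold_loads += 1
        return self.loaded[model_name](payload)


def fake_loader(model_name):
    # Stand-in for fetching an artifact from S3; returns a trivial "model".
    return lambda payload: f"{model_name}:{payload}"


cache = ToyModelCache(capacity=2, loader=fake_loader)
cache.invoke("model-0.tar.gz", "x")   # cold: loads model-0
cache.invoke("model-0.tar.gz", "y")   # warm: no load
cache.invoke("model-1.tar.gz", "x")   # cold: loads model-1
cache.invoke("model-2.tar.gz", "x")   # cold: evicts model-0, loads model-2
print(cache.cold_loads, list(cache.loaded))
```

Only the cold invocations pay the extra loading cost, which is why infrequently invoked models see additional latency.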

To demonstrate how multi-model endpoints are created and used, this notebook provides an example using a set of XGBoost models that each predict whether an automobile insurance claim is fraudulent. For simplicity, the models are copies of a single trained artifact, which makes it easy to experiment with multi-model endpoints.

The Amazon SageMaker multi-model endpoint capability is designed to work with the MXNet, PyTorch, and Scikit-Learn machine learning frameworks (TensorFlow coming soon) and with the SageMaker XGBoost, KNN, and Linear Learner algorithms.

Amazon SageMaker multi-model endpoints are also designed to work with cases where you bring your own container that integrates with the multi-model server library. An example of this can be found [here](https://github.com/aws/amazon-sagemaker-examples/tree/main/advanced_functionality/multi_model_bring_your_own) and documentation [here](https://docs.aws.amazon.com/sagemaker/latest/dg/build-multi-model-build-container.html).

### Step 1: Initialize boto3 client and location of model artifact
Here you instantiate the S3 client object and define the locations inside your default S3 bucket where the metrics and model artifacts are uploaded. Notice that the default bucket, `sagemaker-{region}-{account-id}`, is automatically created by the SageMaker session object. The model and datasets that were used for training exist in a public S3 bucket named sagemaker-sample-files. The location inside the bucket is specified through the read prefix.

In [None]:
import pandas as pd
import numpy as np
import boto3
import sagemaker
import time
import json
import io
from io import StringIO
import base64
import pprint
import re

from sagemaker.image_uris import retrieve

sess = sagemaker.Session()
default_bucket = sess.default_bucket()
prefix = "fraud-detect-demo-mme"

region = sess.boto_region_name
s3_client = boto3.client("s3", region_name=region)
sm_client = boto3.client("sagemaker", region_name=region)
sm_runtime_client = boto3.client("sagemaker-runtime", region_name=region)

sagemaker_role = sagemaker.get_execution_role()

# S3 locations used for parameterizing the notebook run
read_bucket = "sagemaker-sample-files"
read_prefix = "datasets/tabular/synthetic_automobile_claims" 
model_prefix = "models/xgb-fraud"

# S3 location of trained model artifact
model_uri = f"s3://{read_bucket}/{model_prefix}/fraud-det-xgb-model.tar.gz"

# S3 location of test data
test_data_uri = f"s3://{read_bucket}/{read_prefix}/test.csv"

### Step 2: Create 100 XGBoost model copies to host behind a single Endpoint

In [None]:
s3 = boto3.resource("s3")
copy_source = {
    "Bucket": read_bucket,
    "Key": f"{model_prefix}/fraud-det-xgb-model.tar.gz",
}
bucket = s3.Bucket(default_bucket)

# Copy the trained model artifact 100 times into the default bucket
for i in range(100):
    bucket.copy(copy_source, f"{model_prefix}/fraud-det-xgb-model-{i}.tar.gz")

### Step 3: List the models and confirm they all reside in S3

In [None]:
!aws s3 ls s3://{default_bucket}/{model_prefix}/
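The same check can be done from Python rather than the AWS CLI. The sketch below assumes a hypothetical helper named `list_model_keys`. It takes the S3 client as an argument, so it can be called with the `s3_client` created in Step 1, and it relies only on the `list_objects_v2` paginator.

```python
def list_model_keys(s3_client, bucket, prefix):
    """Return the keys of all .tar.gz objects under prefix (hypothetical helper)."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when no objects match.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".tar.gz"):
                keys.append(obj["Key"])
    return keys
```

For example, `len(list_model_keys(s3_client, default_bucket, model_prefix))` should be 100 after Step 2, provided the prefix holds only the copies created there.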

### Step 4: Create a Real-Time Inference Multi-Model-Endpoint
In SageMaker, there are multiple methods to deploy a trained model to a Real-Time Inference endpoint: the SageMaker SDK, AWS SDK - Boto3, and the SageMaker console. For more information, see Deploy Models for Inference in the Amazon SageMaker Developer Guide. The SageMaker SDK provides more abstractions than the AWS SDK - Boto3, while the latter exposes lower-level APIs for greater control over model deployment. In this tutorial, you deploy the model using the AWS SDK - Boto3. There are three steps you need to follow in sequence to deploy a model:

1. Create a SageMaker model from the model artifact
2. Create an endpoint configuration to specify properties, including instance type and count
3. Create the endpoint using the endpoint configuration

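*For a multi-model endpoint, set `"Mode": "MultiModel"` in the container definition and point `ModelDataUrl` at the S3 prefix that holds the model artifacts.*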

In [None]:
# Retrieve the SageMaker managed XGBoost image
training_image = retrieve(framework="xgboost", region=region, version="1.3-1")

# Specify a unique model name that does not exist
model_name = "fraud-detect-xgb-multi-model"
primary_container = {
    "Image": training_image,
    "ModelDataUrl": f"s3://{default_bucket}/{model_prefix}/",
    "Mode": "MultiModel",
}

model_matches = sm_client.list_models(NameContains=model_name)["Models"]
if not model_matches:
    model = sm_client.create_model(
        ModelName=model_name,
        PrimaryContainer=primary_container,
        ExecutionRoleArn=sagemaker_role,
    )
else:
    print(f"Model with name {model_name} already exists! Change model name to create new")

After the SageMaker model is created, the following code uses the Boto3 create_endpoint_config method to configure the endpoint. The main inputs to the create_endpoint_config method are the endpoint configuration name and variant information, such as inference instance type and count, the name of the model to be deployed, and the traffic share the endpoint should handle.


In [None]:
# Endpoint Config name
endpoint_config_name = f"{model_name}-endpoint-config"

# Endpoint config parameters
production_variant_dict = {
    "VariantName": "Alltraffic",
    "ModelName": model_name,
    "InitialInstanceCount": 1,
    "InstanceType": "ml.m5.xlarge",
    "InitialVariantWeight": 1,
}

# Create endpoint config if one with the same name does not exist
endpoint_config_matches = sm_client.list_endpoint_configs(NameContains=endpoint_config_name)["EndpointConfigs"]
if not endpoint_config_matches:
    endpoint_config_response = sm_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[production_variant_dict],
    )
else:
    print(f"Endpoint config with name {endpoint_config_name} already exists! Change endpoint config name to create new")

The create_endpoint method takes the endpoint configuration as a parameter, and deploys the model specified in the endpoint configuration to a compute instance. It takes about 6 minutes to deploy the model.

In [None]:
endpoint_name = f"{model_name}-endpoint"

endpoint_matches = sm_client.list_endpoints(NameContains=endpoint_name)["Endpoints"]
if not endpoint_matches:
    endpoint_response = sm_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name,
    )
else:
    print(f"Endpoint with name {endpoint_name} already exists! Change endpoint name to create new")

# Poll once a minute until the endpoint leaves the "Creating" state
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
while status == "Creating":
    print(f"Endpoint Status: {status}...")
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
print(f"Endpoint Status: {status}")
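The polling loop above exits as soon as the status is no longer `Creating`, so it does not distinguish a failed deployment from a successful one, and it has no upper bound on waiting time. A slightly more defensive variant is sketched below. The helper name `wait_for_endpoint` and its `timeout_s` and `poll_s` parameters are assumptions for illustration. Boto3 also ships an `endpoint_in_service` waiter, available through `sm_client.get_waiter`, which may be preferable in practice.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, timeout_s=1800, poll_s=60):
    """Poll until the endpoint is InService, fails, or the timeout elapses.

    Hypothetical helper; the default timeout and poll interval are arbitrary.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            # describe_endpoint reports FailureReason when creation fails.
            raise RuntimeError(f"Endpoint creation failed: {resp.get('FailureReason')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint still {status} after {timeout_s} seconds")
        time.sleep(poll_s)
```

With the names defined in this notebook, the call would be `wait_for_endpoint(sm_client, endpoint_name)`.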

### Step 5: Invoke the inference Endpoint

After the endpoint status changes to InService, you can invoke the endpoint using the REST API, AWS SDK - Boto3, SageMaker Studio, AWS CLI, or SageMaker Python SDK. In this tutorial, you use the AWS SDK - Boto3.

Before calling an endpoint, the test data must be formatted suitably for the endpoint using serialization and deserialization. Serialization is the process of converting raw data in a format such as .csv to byte streams that the endpoint can use. Deserialization is the reverse process of converting a byte stream to a human-readable format. Since the request to the endpoint (the test dataset) is in .csv format, a CSV serialization process is used to create the payload. The response is then deserialized to an array of predictions.

In this tutorial, you invoke the endpoint by sending the first five samples from a test dataset to each of the first five model copies. To invoke the model of your choice, set `TargetModel` to the artifact name, for example `TargetModel="fraud-det-xgb-model-{i}.tar.gz"`, where the value is the path relative to the S3 prefix given in `ModelDataUrl`. After the execution completes, the cell prints the model predictions and the true labels for the test samples. Notice that the XGBoost model returns probabilities instead of actual class labels. The model predicts a very low likelihood that the test samples are fraudulent claims, and the predictions are in line with the true labels.

In [None]:
# Fetch test data to run predictions with the endpoint
test_df = pd.read_csv(test_data_uri)

# For content type text/csv, the payload should be a string with commas separating the values for each feature
# This is the inference request serialization step (CSV serialization)
csv_file = io.StringIO()
test_sample = test_df.drop(["fraud"], axis=1).iloc[:5]
test_sample.to_csv(csv_file, sep=",", header=False, index=False)
payload = csv_file.getvalue()

# Send the same payload to the first five model copies
for i in range(5):
    response = sm_runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=payload,
        TargetModel=f"fraud-det-xgb-model-{i}.tar.gz",
        ContentType="text/csv",
        Accept="text/csv",
    )
    # This is the inference response deserialization step
    # The response body is a bytes object
    result = response["Body"].read()
    # Decoding bytes to a string
    result = result.decode("utf-8")
    # Converting to a list of predictions
    result = re.split(",|\n", result)
    prediction_df = pd.DataFrame()
    prediction_df["Prediction"] = result[:5]
    prediction_df["Label"] = test_df["fraud"].iloc[:5].values
    print(f"\nfraud-det-xgb-model-{i}.tar.gz prediction results:")
    print(prediction_df)
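The serialization and deserialization steps can be isolated from the endpoint call, which makes them easier to check on their own. The sketch below uses hypothetical helper names (`to_csv_payload`, `parse_predictions`) together with made-up feature values and response bytes. It assumes the container returns one score per row, separated by commas or newlines, as the cell above does.

```python
import csv
import io
import re


def to_csv_payload(rows):
    """Serialize rows of feature values to a header-less CSV string."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def parse_predictions(body):
    """Deserialize a text/csv response body (bytes) into a list of floats."""
    text = body.decode("utf-8")
    # Split on commas or newlines and drop empty trailing tokens.
    return [float(tok) for tok in re.split(r",|\n", text) if tok.strip()]


# Illustrative values only; these are not taken from the test dataset.
example_payload = to_csv_payload([[1, 0.5, 3], [2, 0.25, 4]])
example_scores = parse_predictions(b"0.02,0.97\n")
print(repr(example_payload))   # '1,0.5,3\n2,0.25,4\n'
print(example_scores)          # [0.02, 0.97]
```

Converting the scores to floats, rather than leaving them as strings, makes it straightforward to apply a decision threshold afterwards.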


### Step 6: Delete Endpoint
Before leaving this exercise, it is a good practice to delete the resources created.

In [None]:
# sm_client.delete_endpoint(EndpointName=endpoint_name)
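Deleting the endpoint alone leaves the endpoint configuration and the model definition in place. A fuller cleanup might look like the sketch below. The helper name `delete_mme_resources` is hypothetical, and the model copies in S3 would still need to be removed separately if they are no longer wanted.

```python
def delete_mme_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, its configuration, and the model (hypothetical helper)."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
```

With the names defined earlier in this notebook, the call would be `delete_mme_resources(sm_client, endpoint_name, endpoint_config_name, model_name)`.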