# Train an XGBoost regression model on Amazon SageMaker and host inference as an API on a Docker container running on AWS App Runner

[Amazon SageMaker](https://aws.amazon.com/sagemaker/) is a fully managed end-to-end Machine Learning (ML) service. With SageMaker, you have the option of using the built-in algorithms or you can bring your own algorithms and frameworks to train your models. After training, you can deploy the models in [one of two ways](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html) for inference - persistent endpoint or batch transform.

With a persistent inference endpoint, you get a fully managed, real-time HTTPS endpoint hosted on either CPU- or GPU-based EC2 instances. It supports features such as auto scaling, data capture and model monitoring, and provides cost-effective GPU support using [Amazon Elastic Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html). It also supports hosting multiple models using multi-model endpoints, and A/B testing of models using production variants. You can monitor the endpoint using [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/). In addition to all these, you can use [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines/), which provides a purpose-built, easy-to-use Continuous Integration and Continuous Delivery (CI/CD) service for Machine Learning.

There are use cases where you may want to host the ML model on a cost-effective real-time inference endpoint and do not require all the capabilities provided by the SageMaker persistent inference endpoint. These may involve:
* simple models
* models that are smaller than 200 MB
* models that are invoked sparsely and do not need inference instances running all the time
* models that do not need to be re-trained and re-deployed frequently
* models that do not need GPUs for inference

In these cases, you can take the trained ML model and host it as an API on a Docker container on [AWS App Runner](https://aws.amazon.com/apprunner/). This is cost-effective compared to running real-time inference instances, and it still provides a fully managed and scalable solution.

[AWS App Runner](https://aws.amazon.com/apprunner/) is a fully managed service that makes it easy for developers to quickly deploy containerized web applications and APIs, at scale and with no prior infrastructure experience required. App Runner automatically builds and deploys the web application and load balances traffic with encryption. App Runner also scales up or down automatically to meet your traffic needs.

This notebook demonstrates this solution by using SageMaker's [built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) to train a regression model on the [California Housing dataset](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html). It loads the trained model as a Python3 [pickle](https://docs.python.org/3/library/pickle.html) object in a Python3 [Flask](https://flask.palletsprojects.com/en/1.1.x/) app script in a Docker container to be hosted as an API on [AWS App Runner](https://aws.amazon.com/apprunner/).

**Warning:** The Python3 [pickle](https://docs.python.org/3/library/pickle.html) module is not secure. Only unpickle data you trust. Keep this in mind if you decide to get the trained ML model file from somewhere instead of building your own model.

**Note:**

* This notebook should only be run from within a SageMaker notebook instance as it references SageMaker native APIs. The underlying OS of the notebook instance can either be Amazon Linux v1 or v2.
* At the time of writing, the most suitable Jupyter kernel for this notebook was `conda_python3`, which comes built in with SageMaker notebooks.
* This notebook uses CPU based instances for training.
* If you already have a trained model that can be loaded as a Python3 [pickle](https://docs.python.org/3/library/pickle.html) object, then you can skip the training step in this notebook and directly upload the model file to S3 and update the code in this notebook's cells accordingly.
* In this notebook, the ML model generated in the training step has not been tuned as that is not the intent of this demo.
* At the time of writing, [AWS App Runner](https://aws.amazon.com/apprunner/) was a new service and was supported only in a few regions. To find out whether this service is available in a specific region, refer to the [documentation](https://docs.aws.amazon.com/apprunner/index.html) or check the [AWS console](https://aws.amazon.com/console/).
* At the time of writing this notebook, AWS App Runner supported only public endpoints.
* If you intend to deploy this to Production with access security, then you have to build the authentication and authorization layers in the container prior to letting the call go to the inference script. This is not covered in this notebook.
* This notebook will create resources in the same AWS account and in the same region where this notebook is running.
* Users of this notebook require `root` access to install/update required software. This is set by default when you create the notebook. For more info, see [here](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-root-access.html).

**Table of Contents:**

1. [Complete prerequisites](#Complete%20prerequisites)
    1. [Check and configure access to the Internet](#Check%20and%20configure%20access%20to%20the%20Internet)
    2. [Check and upgrade required software versions](#Check%20and%20upgrade%20required%20software%20versions)
    3. [Check and configure security permissions](#Check%20and%20configure%20security%20permissions)
    4. [Organize imports](#Organize%20imports)
    5. [Create common objects](#Create%20common%20objects)
2. [Prepare the data](#Prepare%20the%20data)
    1. [Create the local directories](#Create%20the%20local%20directories)
    2. [Load the dataset and view the details](#Load%20the%20dataset%20and%20view%20the%20details)
    3. [(Optional) Visualize the dataset](#(Optional)%20Visualize%20the%20dataset)
    4. [Split the dataset into train, validate and test sets](#Split%20the%20dataset%20into%20train,%20validate%20and%20test%20sets)
    5. [Standardize the datasets](#Standardize%20the%20datasets)
    6. [Save the prepared datasets locally](#Save%20the%20prepared%20datasets%20locally)
    7. [Upload the prepared datasets to S3](#Upload%20the%20prepared%20datasets%20to%20S3)
3. [Perform training](#Perform%20training)
    1. [Set the training parameters](#Set%20the%20training%20parameters)
    2. [(Optional) Delete previous checkpoints](#(Optional)%20Delete%20previous%20checkpoints)
    3. [Run the training job](#Run%20the%20training%20job)
4. [Create and push the Docker container to an Amazon ECR repository](#Create%20and%20push%20the%20Docker%20container%20to%20an%20Amazon%20ECR%20repository)
    1. [Retrieve the model pickle file](#Retrieve%20the%20model%20pickle%20file)
    2. [(Optional) Test the model pickle file](#(Optional)%20Test%20the%20model%20pickle%20file)
    3. [View the inference script](#View%20the%20inference%20script)
    4. [Create the Dockerfile](#Create%20the%20Dockerfile)
    5. [Create the container](#Create%20the%20container)
    6. [Create the private repository in ECR](#Create%20the%20private%20repository%20in%20ECR)
    7. [Push the container to ECR](#Push%20the%20container%20to%20ECR)
5. [Deploy and test on AWS App Runner](#Deploy%20and%20test%20on%20AWS%20App%20Runner)
    1. [Create the App Runner service](#Create%20the%20App%20Runner%20service)
    2. [Test the App Runner service](#Test%20the%20App%20Runner%20service)
6. [Cleanup](#Cleanup)
    1. [Cleanup App Runner resources](#Cleanup%20App%20Runner%20resources)
    2. [Cleanup ECR repository](#Cleanup%20ECR%20repository)
    3. [Cleanup S3 objects](#Cleanup%20S3%20objects)


## 1. Complete prerequisites 

Check and complete the prerequisites.

### A. Check and configure access to the Internet 

This notebook requires outbound access to the Internet to download the required software updates and to make calls to the container hosted as an AWS App Runner service. You can either provide direct Internet access (default) or provide Internet access through a VPC. For more information on this, see [here](https://docs.aws.amazon.com/sagemaker/latest/dg/appendix-notebook-and-internet-access.html).

### B. Check and upgrade required software versions 

This notebook requires:
* [SageMaker Python SDK version 2.x](https://sagemaker.readthedocs.io/en/stable/v2.html)
* [Python 3.6.x](https://www.python.org/downloads/release/python-360/)
* [Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)
* [AWS Command Line Interface](https://aws.amazon.com/cli/)
* [Docker](https://www.docker.com/)
* [XGBoost Python module](https://xgboost.readthedocs.io/en/latest/python/python_intro.html)
* [cURL](https://curl.se/)

Capture the version of the OS on which this notebook is running.

In [None]:
import subprocess
from subprocess import Popen

p = Popen(['cat','/etc/system-release'], stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
os_cmd_output, os_cmd_error = p.communicate()
if len(os_cmd_error) > 0:
    print('Notebook OS command returned error :: {}'.format(os_cmd_error))
    os_version = ''
else:
    if os_cmd_output.find('Amazon Linux release 2') >= 0:
        os_version = 'ALv2'
    elif os_cmd_output.find('Amazon Linux AMI release 2018.03') >= 0:
        os_version = 'ALv1'
    else:
        os_version = ''
print('Notebook OS version : {}'.format(os_version))

**Note:** When running the following cell, if you get 'module not found' errors, then uncomment the appropriate installation commands and install the modules. Also, uncomment and run the kernel shutdown command. When the kernel comes back, comment out the installation and kernel shutdown commands and run the following cell. Now, you should not see any errors.

In [None]:
"""

Last tested versions:


On Amazon Linux v1 (ALv1) notebook:
-----------------------------------
SageMaker Python SDK version : 2.54.0
Python version : 3.6.13 | packaged by conda-forge | (default, Feb 19 2021, 05:36:01) 
[GCC 9.3.0]
Boto3 version : 1.18.27
XGBoost Python module version : 1.4.2
AWS CLI version : aws-cli/1.20.21 Python/3.6.13 Linux/4.14.238-125.422.amzn1.x86_64 botocore/1.21.27
Docker version : 19.03.13-ce, build 4484c46


On Amazon Linux v2 (ALv2) notebook:
-----------------------------------
SageMaker Python SDK version : 2.59.1
Python version : 3.6.13 | packaged by conda-forge | (default, Feb 19 2021, 05:36:01) 
[GCC 9.3.0]
Boto3 version : 1.18.36
XGBoost Python module version : 1.4.2
AWS CLI version : aws-cli/1.20.24 Python/3.6.13 Linux/4.14.243-185.433.amzn2.x86_64 botocore/1.21.36
Docker version : 20.10.7, build f0df350
Amazon ECR Docker Credential Helper : 0.6.3

"""

import boto3
import IPython
import os
import sagemaker
import sys
try:
    import xgboost as xgb
except ModuleNotFoundError:
    # Install XGBoost and restart kernel
    print('Installing XGBoost module...')
    !{sys.executable} -m pip install -U xgboost
    IPython.Application.instance().kernel.do_shutdown(True)

# Install/upgrade the Sagemaker SDK, Boto3 and XGBoost and restart kernel
#!{sys.executable} -m pip install -U sagemaker boto3 xgboost
#IPython.Application.instance().kernel.do_shutdown(True)

# Get the current installed version of Sagemaker SDK, Python, Boto3 and XGBoost
print('SageMaker Python SDK version : {}'.format(sagemaker.__version__))
print('Python version : {}'.format(sys.version))
print('Boto3 version : {}'.format(boto3.__version__))
print('XGBoost Python module version : {}'.format(xgb.__version__))

# Get the AWS CLI version
print('AWS CLI version : ')
!aws --version

**Docker:**

Docker should be pre-installed in the SageMaker notebook instance. Verify it by running the `docker --version` command. If Docker is not installed, you can install it by uncommenting the install command in the following cell. You will require `sudo` rights to install.

In [None]:
# Verify if docker is installed
!docker --version

# Install docker
#!sudo yum --assumeyes install docker

**cURL:**

cURL should be pre-installed in the SageMaker notebook instance. Verify it by running the `curl --version` command. If cURL is not installed, you can install it by uncommenting the install command in the following cell. You will require `sudo` rights to install.

In [None]:
"""
Last tested version:
curl 7.71.1 (x86_64-conda-linux-gnu) libcurl/7.71.1 OpenSSL/1.1.1j zlib/1.2.11 libssh2/1.9.0 nghttp2/1.43.0
Release-Date: 2020-07-01
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS GSS-API HTTP2 HTTPS-proxy IPv6 Kerberos Largefile libz NTLM NTLM_WB SPNEGO SSL TLS-SRP UnixSockets
"""

# Verify if curl is installed
!curl --version

# Install curl
#!sudo yum --assumeyes install curl

**Additional prerequisite (when notebook is running on Amazon Linux v2):**

Install and configure the [Amazon ECR credential helper](https://github.com/awslabs/amazon-ecr-credential-helper). This makes it easier to store and use Docker credentials for use with Amazon ECR private registries.

In [None]:
if os_version == 'ALv2':
    # Install
    !sudo yum --assumeyes install amazon-ecr-credential-helper
    # Verify installation
    print('Amazon ECR Docker Credential Helper version : ')
    !docker-credential-ecr-login version
    # Create the .docker directory if it doesn't exist
    !mkdir -p ~/.docker
    # Configure
    !printf "{\\n\\t\"credsStore\": \"ecr-login\"\\n}" > ~/.docker/config.json
    # Verify configuration
    !cat ~/.docker/config.json
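
The `printf` one-liner above depends on shell escaping and overwrites any existing file. As an alternative, the same setting could be written from Python. This is a minimal sketch and not part of the original notebook: the helper name `write_docker_config` is hypothetical, and it merges the `credsStore` key into an existing `config.json` instead of replacing the file. The demonstration writes to a temporary directory rather than the real `~/.docker`.

```python
import json
import os
import tempfile


def write_docker_config(docker_dir):
    """Set "credsStore": "ecr-login" in <docker_dir>/config.json, keeping other keys."""
    os.makedirs(docker_dir, exist_ok=True)
    config_path = os.path.join(docker_dir, 'config.json')
    config = {}
    if os.path.exists(config_path):
        with open(config_path) as f:
            config = json.load(f)
    config['credsStore'] = 'ecr-login'
    with open(config_path, 'w') as f:
        json.dump(config, f, indent=4)
    return config_path


# Demonstrate against a temporary directory instead of the real ~/.docker
with tempfile.TemporaryDirectory() as tmp_dir:
    path = write_docker_config(os.path.join(tmp_dir, '.docker'))
    with open(path) as f:
        written_config = json.load(f)
    print(written_config)
```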

### C. Check and configure security permissions 

Users of this notebook require `root` access to install/update required software. This is set by default when you create the notebook. For more info, see [here](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-root-access.html).

This notebook uses the IAM role attached to the underlying notebook instance. This role should have the following permissions:

1. Full access to the S3 bucket that will be used to store training and output data.
2. Full access to launch training instances.
3. Access to create CloudWatch Log Groups.
4. Access to write to CloudWatch Logs and CloudWatch Metrics.
5. Access to create, delete and write to Amazon ECR private registries.
6. Access to create, update and delete AWS App Runner services.

To view the name of this role, run the following cell.

In [None]:
print(sagemaker.get_execution_role())

This notebook creates an [image-based service](https://docs.aws.amazon.com/apprunner/latest/dg/service-source-image.html) on [AWS App Runner](https://docs.aws.amazon.com/apprunner/latest/dg/what-is-apprunner.html). This service requires the following service roles in IAM:

1. Access role with the following permissions:
 * Read access to Amazon ECR.
2. Instance role with the following permissions:
 * Access to create CloudWatch Log Groups.
 * Access to write to CloudWatch Logs and CloudWatch Metrics.

For more information on this, see [here](https://docs.aws.amazon.com/apprunner/latest/dg/security_iam_service-with-iam.html#security_iam_service-with-iam-roles).
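
Each of the two roles also needs a trust policy that allows App Runner to assume it. The sketch below builds both trust policy documents as Python dictionaries. The service principals `build.apprunner.amazonaws.com` (access role) and `tasks.apprunner.amazonaws.com` (instance role) are the ones described in the App Runner IAM documentation linked above; the helper name `apprunner_trust_policy` is made up for illustration, and the permission policies themselves are not shown.

```python
import json


def apprunner_trust_policy(service_principal):
    """Return an IAM trust policy that lets the given service assume the role."""
    return {
        'Version': '2012-10-17',
        'Statement': [
            {
                'Effect': 'Allow',
                'Principal': {'Service': service_principal},
                'Action': 'sts:AssumeRole',
            }
        ],
    }


# Access role: used by App Runner to pull the image from Amazon ECR
access_role_trust_policy = apprunner_trust_policy('build.apprunner.amazonaws.com')
# Instance role: used by the running service (for example, to write logs)
instance_role_trust_policy = apprunner_trust_policy('tasks.apprunner.amazonaws.com')

print(json.dumps(access_role_trust_policy, indent=2))
print(json.dumps(instance_role_trust_policy, indent=2))
```

A document like this would typically be serialized with `json.dumps()` and supplied as the `AssumeRolePolicyDocument` when the role is created in IAM.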

### D. Organize imports 

Organize all the library and module imports for later use.

In [None]:
from io import StringIO
import json
import logging
import matplotlib.pyplot as plt
import numpy as np
import pickle
import pandas as pd
from sagemaker.inputs import TrainingInput
import seaborn as sns
import sklearn.model_selection
from sklearn.preprocessing import StandardScaler
import tarfile
import time

### E. Create common objects 

Create common objects to be used in future steps in this notebook.

In [None]:
# Specify the S3 bucket name
s3_bucket = ''

# Create the S3 Boto3 resource
s3_resource = boto3.resource('s3')
s3_bucket_resource = s3_resource.Bucket(s3_bucket)

# Create the SageMaker Boto3 client
sm_client = boto3.client('sagemaker')

# Create the Amazon ECR client
ecr_client = boto3.client('ecr')

# Create the AWS App Runner client
apprunner_client = boto3.client('apprunner')

# Get the AWS region name
region_name = sagemaker.Session().boto_region_name

# Base name to be used to create resources
nb_name = 'sm-xgboost-ca-housing-apprunner-model-hosting'

# Names of various resources
train_job_name = 'train-{}'.format(nb_name)

# Names of local sub-directories in the notebook file system
data_dir = os.path.join(os.getcwd(), 'data/{}'.format(nb_name))
train_dir = os.path.join(os.getcwd(), 'data/{}/train'.format(nb_name))
val_dir = os.path.join(os.getcwd(), 'data/{}/validate'.format(nb_name))
test_dir = os.path.join(os.getcwd(), 'data/{}/test'.format(nb_name))

# Location of the datasets file in the notebook file system
dataset_csv_file = os.path.join(os.getcwd(), 'datasets/california_housing.csv')

# Container artifacts directory in the notebook file system
container_artifacts_dir = os.path.join(os.getcwd(), 'container-artifacts/{}'.format(nb_name))

# Location of the Python3 Flask script (containing the inference code) and its corresponding
# requirements.txt in the notebook file system
container_script_file_name = 'container_sm_xgboost_ca_housing_inference.py'
container_script_req_file_name = 'container_sm_xgboost_ca_housing_inference_requirements.txt'
container_script_file = os.path.join(os.getcwd(), 'scripts/{}'.format(container_script_file_name))
container_script_req_file = os.path.join(os.getcwd(), 'scripts/{}'.format(container_script_req_file_name))

# Sub-folder names in S3
train_dir_s3_prefix = '{}/data/train'.format(nb_name)
val_dir_s3_prefix = '{}/data/validate'.format(nb_name)
test_dir_s3_prefix = '{}/data/test'.format(nb_name)

# Location in S3 where the model checkpoint will be stored
model_checkpoint_s3_path = 's3://{}/{}/checkpoint/'.format(s3_bucket, nb_name)

# Location in S3 where the trained model will be stored
model_output_s3_path = 's3://{}/{}/output/'.format(s3_bucket, nb_name)

# Names of the model tar file and extracted file - these are dependent on the
# framework and algorithm you used to train the model. This notebook uses
# SageMaker's built-in XGBoost algorithm and that will have the names as follows:
model_tar_file_name = 'model.tar.gz'
extracted_model_file_name = 'xgboost-model'

# Container details
container_image_name = nb_name
container_registry_url_prefix = ''

# App Runner details
## Service details
## Note: The ARN of the App Runner service will be generated when the service is created
apprunner_service_arn = 'TBD'
apprunner_service_name = 'sm-xgboost-ca-housing-model'
apprunner_cpu = '1 vCPU'
apprunner_memory = '2 GB'
apprunner_port = '8080'
apprunner_access_role = ''
apprunner_instance_role = ''
## Healthcheck details
apprunner_healthcheck_protocol = 'HTTP'
apprunner_healthcheck_path = '/healthcheck'
apprunner_healthcheck_interval_in_seconds = 10
apprunner_healthcheck_timeout_in_seconds = 10
apprunner_healthcheck_healthy_threshold = 3
apprunner_healthcheck_unhealthy_threshold = 3

## 2. Prepare the data 

The [California Housing dataset](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html) consists of 20,640 observations of 9 variables. The first, MedianHouseValue, is the target that the model predicts; the remaining 8 are used as features. The variables are:

* MedianHouseValue
* MedianIncome
* HousingMedianAge
* TotalRooms
* TotalBedrooms
* Population
* Households
* Latitude
* Longitude

This dataset has been downloaded to the local `datasets` directory and converted to a CSV file with the feature names in the first row. This file is used in this notebook.

The following steps will help with preparing the datasets for training, validation and testing.

### A) Create the local directories 

Create the directories in the local system where the dataset will be copied to and processed.

In [None]:
# Create the local directories if they don't exist
os.makedirs(data_dir, exist_ok=True)
os.makedirs(train_dir, exist_ok=True)
os.makedirs(val_dir, exist_ok=True)
os.makedirs(test_dir, exist_ok=True)

### B) Load the dataset and view the details 

Check if the CSV file exists in the `datasets` directory and load it into a Pandas DataFrame. Finally, print the details of the dataset.

In [None]:
# Check if the dataset file exists and proceed
if os.path.exists(dataset_csv_file):
    print('Dataset CSV file \'{}\' exists.'.format(dataset_csv_file))
    # Load the data into a Pandas DataFrame
    pd_data_frame = pd.read_csv(dataset_csv_file)
    # Print the first 5 records
    #print(pd_data_frame.head(5))
    # Describe the dataset
    print(pd_data_frame.describe())
else:
    print('Dataset CSV file \'{}\' does not exist.'.format(dataset_csv_file))

### C) (Optional) Visualize the dataset 

Display the correlation matrix of the columns in the dataset as a heatmap.

In [None]:
# Print the correlation matrix
plt.figure(figsize=(11, 7))
sns.heatmap(cbar=False, annot=True, data=(pd_data_frame.corr() * 100), cmap='coolwarm')
plt.title('% Correlation Matrix')
plt.show()
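
Each cell of the heatmap above is a Pearson correlation coefficient (the default for `DataFrame.corr()`) multiplied by 100. As a rough illustration of what one cell represents, the sketch below computes the coefficient for two short columns of made-up values using only the standard library. The function name `pearson` and the sample numbers are assumptions for this example.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up values for two columns
median_income = [1.5, 3.0, 4.5, 6.0, 8.0]
median_house_value = [90000.0, 150000.0, 210000.0, 300000.0, 410000.0]

# Scale to a percentage, as the heatmap does
percent_correlation = pearson(median_income, median_house_value) * 100
print('{:.1f}%'.format(percent_correlation))
```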

### D) Split the dataset into train, validate and test sets 

Split the dataset into train, validate and test sets after shuffling. Split further into x and y sets.

In [None]:
# Split into train and test datasets after shuffling
train, test = sklearn.model_selection.train_test_split(pd_data_frame, test_size=0.2,
 random_state=35, shuffle=True)
# Split the train dataset further into train and validation datasets after shuffling
train, val = sklearn.model_selection.train_test_split(train, test_size=0.1,
 random_state=25, shuffle=True)

# Define functions to get x and y columns
def get_x(df):
    return df[['median_income','housing_median_age','total_rooms','total_bedrooms',
               'population','households','latitude','longitude']]
def get_y(df):
    return df[['median_house_value']]

# Load the x and y columns for train, validation and test datasets
x_train = get_x(train)
y_train = get_y(train)
x_val = get_x(val)
y_val = get_y(val)
x_test = get_x(test)
y_test = get_y(test)

# Summarize the datasets
print("x_train shape:", x_train.shape)
print("y_train shape:", y_train.shape)
print("x_val shape:", x_val.shape)
print("y_val shape:", y_val.shape)
print("x_test shape:", x_test.shape)
print("y_test shape:", y_test.shape)
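
The two successive splits leave roughly 72% of the rows for training, 8% for validation and 20% for testing. The arithmetic can be checked without scikit-learn. This sketch assumes the 20,640-row dataset described earlier and assumes that a fractional test size is rounded up to a whole number of rows, as scikit-learn does for a float `test_size`; the helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows, test_fraction, val_fraction):
    """Return (train, validate, test) row counts for two successive splits."""
    n_test = math.ceil(n_rows * test_fraction)
    n_remaining = n_rows - n_test
    n_val = math.ceil(n_remaining * val_fraction)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(20640, test_fraction=0.2, val_fraction=0.1)
print('train={}, validate={}, test={}'.format(n_train, n_val, n_test))
```

The printed counts should match the first dimension of the shapes reported by the cell above.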

### E) Standardize the datasets 

* Standardize the x columns of the train dataset using the `fit_transform()` function of `StandardScaler`.
* Standardize the x columns of the validate and test datasets using the `transform()` function of `StandardScaler`.

In [None]:
# Standardize the dataset
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_val = scaler.transform(x_val)
x_test = scaler.transform(x_test)
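
`fit_transform()` is called only on the training set because the mean and standard deviation must be learned from training data alone; the validation and test sets are then scaled with those same statistics. The following is a pure-Python sketch of that idea for a single feature column, using made-up numbers. It illustrates the principle and is not scikit-learn's implementation; the helper names `fit` and `transform` are chosen to mirror the scaler's methods.

```python
import statistics


def fit(column):
    """Learn the mean and population standard deviation from training values."""
    return statistics.mean(column), statistics.pstdev(column)


def transform(column, mean, std):
    """Scale values with statistics that were learned from the training set."""
    return [(value - mean) / std for value in column]


# Hypothetical values for one feature
train_column = [2.0, 4.0, 6.0, 8.0]
test_column = [5.0, 10.0]

mean, std = fit(train_column)
train_scaled = transform(train_column, mean, std)
# The test column reuses the training statistics; it is never fitted
test_scaled = transform(test_column, mean, std)

print(train_scaled)
print(test_scaled)
```

After scaling, the training column has zero mean and unit variance, while the test column generally does not, which is expected.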

### F) Save the prepared datasets locally 

Save the prepared train, validate and test datasets to the local directories created earlier. Prior to saving, concatenate the y and x columns so that the target is the first column of the combined file.

In [None]:
# Save the prepared dataset (in numpy format) to the local directories as csv files

np.savetxt(os.path.join(train_dir, 'train.csv'),
 np.concatenate((y_train.to_numpy(), x_train), axis=1), delimiter=',')
np.savetxt(os.path.join(train_dir, 'train_x.csv'), x_train)
np.savetxt(os.path.join(train_dir, 'train_y.csv'), y_train.to_numpy())

np.savetxt(os.path.join(val_dir, 'validate.csv'),
 np.concatenate((y_val.to_numpy(), x_val), axis=1), delimiter=',')
np.savetxt(os.path.join(val_dir, 'validate_x.csv'), x_val)
np.savetxt(os.path.join(val_dir, 'validate_y.csv'), y_val.to_numpy())

np.savetxt(os.path.join(test_dir, 'test.csv'),
 np.concatenate((y_test.to_numpy(), x_test), axis=1), delimiter=',')
np.savetxt(os.path.join(test_dir, 'test_x.csv'), x_test)
np.savetxt(os.path.join(test_dir, 'test_y.csv'), y_test.to_numpy())
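
The built-in XGBoost algorithm expects CSV training data without a header row and with the target in the first column, which is why `y` is concatenated ahead of `x` in the combined files above. A small sketch of that layout, using only the standard library and made-up values:

```python
import csv
import io

# Hypothetical rows: one target value followed by its (standardized) features
targets = [452600.0, 358500.0]
features = [[2.34, 0.98, -0.80], [2.33, -0.61, 2.05]]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
for target, row in zip(targets, features):
    writer.writerow([target] + row)  # target first, no header row

csv_text = buffer.getvalue()
print(csv_text)
```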

### G) Upload the prepared datasets to S3 

Upload the datasets from the local directories to appropriate sub-directories in the specified S3 bucket.

In [None]:
# Upload the data to S3
train_dir_s3_path = sagemaker.Session().upload_data(path='./data/{}/train/'.format(nb_name),
 bucket=s3_bucket,
 key_prefix=train_dir_s3_prefix)
val_dir_s3_path = sagemaker.Session().upload_data(path='./data/{}/validate/'.format(nb_name),
 bucket=s3_bucket,
 key_prefix=val_dir_s3_prefix)
test_dir_s3_path = sagemaker.Session().upload_data(path='./data/{}/test/'.format(nb_name),
 bucket=s3_bucket,
 key_prefix=test_dir_s3_prefix)

# Capture the S3 locations of the uploaded datasets
train_s3_path = '{}/train.csv'.format(train_dir_s3_path)
train_x_s3_path = '{}/train_x.csv'.format(train_dir_s3_path)
train_y_s3_path = '{}/train_y.csv'.format(train_dir_s3_path)
val_s3_path = '{}/validate.csv'.format(val_dir_s3_path)
val_x_s3_path = '{}/validate_x.csv'.format(val_dir_s3_path)
val_y_s3_path = '{}/validate_y.csv'.format(val_dir_s3_path)
test_s3_path = '{}/test.csv'.format(test_dir_s3_path)
test_x_s3_path = '{}/test_x.csv'.format(test_dir_s3_path)
test_y_s3_path = '{}/test_y.csv'.format(test_dir_s3_path)

## 3. Perform training 

In this step, SageMaker's [built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) is used to train a regression model on the [California Housing dataset](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html).

Note: This model has not been tuned as that is not the intent of this demo.

### A) Set the training parameters 

1. Inputs - S3 location of the training and validation data.
2. Hyperparameters.
3. Training instance details:
    1. Instance count
    2. Instance type
    3. The max run time of the training job
    4. (Optional) Use Spot instances. For more info, see [here](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html).
    5. (Optional) The max wait for Spot instances, if using Spot. This should be larger than the max run time.
4. Base job name
5. Appropriate local and S3 directories that will be used by the training job.

In [None]:
# Set the input data along with their content types
train_input = TrainingInput(train_s3_path, content_type='text/csv')
val_input = TrainingInput(val_s3_path, content_type='text/csv')
inputs = {'train':train_input, 'validation':val_input}

# Set the hyperparameters
hyperparameters = {
 'objective':'reg:squarederror',
 'max_depth':'6',
 'eta':'0.3',
 'alpha':'3',
 'colsample_bytree':'0.7',
 'num_round':'100'}

# Set the instance count, instance type, volume size, options to use Spot instances and other parameters
train_instance_count = 1
train_instance_type = 'ml.m5.xlarge'
train_instance_volume_size_in_gb = 5
#use_spot_instances = True
#spot_max_wait_time_in_seconds = 5400
use_spot_instances = False
spot_max_wait_time_in_seconds = None
max_run_time_in_seconds = 3600
algorithm_name = 'xgboost'
algorithm_version = '1.2-2'
py_version = 'py37'
# Get the container image URI for the specified parameters
container_image_uri = sagemaker.image_uris.retrieve(framework=algorithm_name,
 region=region_name,
 version=algorithm_version,
 py_version=py_version,
 instance_type=train_instance_type,
 image_scope='training')

# Set the training container related parameters
container_log_level = logging.INFO

# Location where the model checkpoints will be stored locally in the container before being uploaded to S3
model_checkpoint_local_dir = '/opt/ml/checkpoints/'

# Location where the trained model will be stored locally in the container before being uploaded to S3
model_local_dir = '/opt/ml/model'
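
The Spot constraint mentioned above (the max wait should not be shorter than the max run time) is easy to get wrong when toggling the commented-out lines. A hypothetical helper such as `check_spot_settings` below could serve as a sanity check before the estimator is created. It is not part of the SageMaker SDK, and the values passed to it simply mirror the settings in the cell above.

```python
def check_spot_settings(use_spot_instances, max_run, max_wait):
    """Raise ValueError if the Spot-related settings are inconsistent."""
    if not use_spot_instances:
        return
    if max_wait is None:
        raise ValueError('max_wait must be set when using Spot instances')
    if max_wait < max_run:
        raise ValueError('max_wait ({}) must not be smaller than max_run ({})'
                         .format(max_wait, max_run))


# On-demand settings used in this notebook
check_spot_settings(use_spot_instances=False, max_run=3600, max_wait=None)
# Spot settings from the commented-out lines
check_spot_settings(use_spot_instances=True, max_run=3600, max_wait=5400)
print('Training settings are consistent')
```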

### B) (Optional) Delete previous checkpoints 

If model checkpoints from previous trainings are found in the S3 checkpoint location specified in the previous step, then training will resume from those checkpoints. In order to start a fresh training, run the following code cell to delete all checkpoint objects from S3.

In [None]:
# Delete the checkpoints if you want to train from the beginning; else ignore this code cell
for checkpoint_file in s3_bucket_resource.objects.filter(Prefix='{}/checkpoint/'.format(nb_name)):
    checkpoint_file_key = checkpoint_file.key
    print('Deleting {} ...'.format(checkpoint_file_key))
    s3_resource.Object(s3_bucket_resource.name, checkpoint_file_key).delete()

### C) Run the training job 

Prepare the `estimator` and call the `fit()` method. This pulls the container image for the specified version of the algorithm in the AWS region and runs the training job on the specified type of EC2 instance(s). The training data is read from the specified location in S3, and the training results and checkpoints are written to the specified locations in S3.

Note: SageMaker Debugger is disabled.

In [None]:
# Create the estimator
estimator = sagemaker.estimator.Estimator(
 image_uri=container_image_uri,
 checkpoint_local_path=model_checkpoint_local_dir,
 checkpoint_s3_uri=model_checkpoint_s3_path,
 model_dir=model_local_dir,
 output_path=model_output_s3_path,
 instance_type=train_instance_type,
 instance_count=train_instance_count,
 use_spot_instances=use_spot_instances,
 max_wait=spot_max_wait_time_in_seconds,
 max_run=max_run_time_in_seconds,
 hyperparameters=hyperparameters,
 role=sagemaker.get_execution_role(),
 base_job_name=train_job_name,
 framework_version=algorithm_version,
 py_version=py_version,
 container_log_level=container_log_level,
 script_mode=False,
 debugger_hook_config=False,
 disable_profiler=True)

# Perform the training
estimator.fit(inputs, wait=True)

## 4. Create and push the Docker container to an Amazon ECR repository 

In this step, we will create a Docker container containing the generated model along with its dependencies. If you bring a pre-trained model, you can upload it to S3 and use it to build the container. The following steps contain instructions for doing so.

### A) Retrieve the model pickle file 

* The model file generated using SageMaker's [built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) will be a Python pickle file packaged in a gzip-compressed tar archive named `model.tar.gz`. The S3 URI for this file will be available in the `model_data` attribute of the `estimator` object created in the training step.

* If you bring your pre-trained model, you have to specify the S3 URI appropriately in the following cell.

* The tar file needs to be downloaded from S3 and extracted.

* The name of the extracted pickle file will depend on the framework and algorithm that was used to train the model. In this notebook example, we have used SageMaker's [built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) and so the pickle file will be named `xgboost-model`. You will see this when the model tar file is extracted.

In [None]:
# Create the container artifacts directory if it doesn't exist
os.makedirs(container_artifacts_dir, exist_ok=True)

# Set the file paths
model_tar_file_s3_path_suffix = '{}/output/{}/output/{}'.format(nb_name,
 estimator.latest_training_job.name,
 model_tar_file_name)
model_tar_file_local_path = '{}/{}'.format(container_artifacts_dir, model_tar_file_name)
extracted_model_file_local_path = '{}/{}'.format(container_artifacts_dir, extracted_model_file_name)

# Delete old model files if they exist
if os.path.exists(model_tar_file_local_path):
    os.remove(model_tar_file_local_path)
if os.path.exists(extracted_model_file_local_path):
    os.remove(extracted_model_file_local_path)

# Download the model tar file from S3
s3_bucket_resource.download_file(model_tar_file_s3_path_suffix, model_tar_file_local_path)

# Extract the model tar file and retrieve the model pickle file
with tarfile.open(model_tar_file_local_path, "r:gz") as tar:
    tar.extractall(path=container_artifacts_dir)
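
`extractall()` writes every member of the archive to disk, so it is worth checking member names first if the archive comes from anywhere other than your own training job. The sketch below builds a tiny in-memory archive containing a file named `xgboost-model` (with placeholder bytes, not a real model) and extracts only regular files whose paths stay inside the target directory. The helper name `safe_extract` is hypothetical.

```python
import io
import os
import tarfile
import tempfile


def safe_extract(tar, target_dir):
    """Extract regular files only, refusing paths that escape target_dir."""
    target_root = os.path.realpath(target_dir)
    for member in tar.getmembers():
        member_path = os.path.realpath(os.path.join(target_root, member.name))
        if os.path.commonpath([target_root, member_path]) != target_root:
            raise ValueError('Unsafe path in archive: {}'.format(member.name))
        if not member.isfile():
            continue  # skip links, devices and directory entries
        tar.extract(member, path=target_root)


# Build a small model.tar.gz in memory with placeholder content
payload = b'placeholder bytes'
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode='w:gz') as tar:
    info = tarfile.TarInfo(name='xgboost-model')
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
archive.seek(0)

with tempfile.TemporaryDirectory() as tmp_dir:
    with tarfile.open(fileobj=archive, mode='r:gz') as tar:
        safe_extract(tar, tmp_dir)
    extracted = sorted(os.listdir(tmp_dir))
    print(extracted)
```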

### B) (Optional) Test the model pickle file 

The code in the following cell entirely depends on the framework and algorithm that was used to train the model. The extracted Python3 pickle file will contain the appropriate object name. If you are bringing your own model file, you have to change this cell appropriately.

In [None]:
# Load the model pickle file as a pickle object
pickle_file_path = extracted_model_file_local_path
with open(pickle_file_path, 'rb') as pkl_file:
    model = pickle.load(pkl_file)

# Run a prediction against the model loaded as a pickle object
# by sending the first record of the test dataset
test_pred_x_df = pd.read_csv(StringIO(','.join(map(str, x_test[0]))), sep=',', header=None)
test_pred_x = xgb.DMatrix(test_pred_x_df.values)
print('Input for prediction = {}'.format(test_pred_x_df.values))
print('Predicted value = {}'.format(model.predict(test_pred_x)[0]))
print('Actual value = {}'.format(y_test.values[0][0]))
print('Note: There may be a huge difference between the actual and predicted values as the model has not been tuned in the training step.')

### C) View the inference script 

The inference script is a Python3 [Flask](https://flask.palletsprojects.com/en/1.1.x/) app script that contains the following logic:
* Initialize the Flask web app server.
* Load the ML model pickle object into memory.
* Run the Flask web app server.
* Parse the request sent to the web app server.
* Run the prediction.
* Format the response to match with the parameter specified in the request.
* Return the response.
* Implement the health check logic to return a success response on invocation. The service hosting this container calls it to perform health checks.

The request should be in the following format:

`{
 "response_content_type": "",
 "pred_x_csv": ""
}`

This script will be packaged into the container that will be built in the upcoming steps.

You can view the script by running the following code cell.

In [None]:
# View the Python3 Flask script (containing the inference code)
!cat {container_script_file}
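
The request handling described above (parse the request, then format the response according to `response_content_type`) could look roughly like the following. This is a sketch only and is not the script packaged with this notebook: the function names, the set of supported content types and the shape of the JSON response are all assumptions made for illustration. The Flask routing and the model call are left out so that the snippet runs with the standard library alone.

```python
import json

# Assumed set of supported response types; the actual script may differ
SUPPORTED_RESPONSE_TYPES = {"application/json", "text/plain"}


def parse_inference_request(body):
    """Parse the JSON request body into (response_content_type, feature list)."""
    payload = json.loads(body)
    response_content_type = payload["response_content_type"]
    if response_content_type not in SUPPORTED_RESPONSE_TYPES:
        raise ValueError("Unsupported response content type: {}".format(response_content_type))
    # pred_x_csv carries one record as a comma-separated string of numbers
    features = [float(value) for value in payload["pred_x_csv"].split(",")]
    return response_content_type, features


def format_response(prediction, response_content_type):
    """Format a single predicted value (hypothetical response shape)."""
    if response_content_type == "application/json":
        return json.dumps({"prediction": prediction})
    return str(prediction)


request_body = '{"response_content_type": "application/json", "pred_x_csv": "1,2.5,3"}'
content_type, features = parse_inference_request(request_body)
print(content_type, features)
print(format_response(4.2, content_type))
```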

### D) Create the Dockerfile 

In this step, we will create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) which is required to build our [Docker](https://www.docker.com/) container containing the model pickle file, an inference script and its dependencies.

In order to create the container, we will use the [Amazon Linux 2 container image](https://gallery.ecr.aws/amazonlinux/amazonlinux) available in the [Amazon ECR public registry](https://aws.amazon.com/ecr/) as the base image. As this is a public registry, you do not require any credentials or permissions to download it.

Note: At the time of writing this notebook, this image was based on [Amazon Linux 2](https://aws.amazon.com/amazon-linux-2/). To pin a specific version, replace `latest` after the `:` character in the container image URL in the following cell with the version tag you intend to use.

In [None]:
# Copy the inference script and requirements.txt to the container-artifacts directory
!cp -pr {container_script_file} {container_artifacts_dir}/server.py
!cp -pr {container_script_req_file} {container_artifacts_dir}/requirements.txt

# Create the Dockerfile content
dockerfile_content_lines = []
dockerfile_content_lines.append('# syntax=docker/dockerfile:1\n\n')
dockerfile_content_lines.append('# Use Amazon Linux 2 as the base image\n')
dockerfile_content_lines.append('FROM public.ecr.aws/amazonlinux/amazonlinux:latest\n\n')
dockerfile_content_lines.append('# Setup the working directory\n')
dockerfile_content_lines.append('WORKDIR /\n\n')
dockerfile_content_lines.append('# Install Python3\n')
dockerfile_content_lines.append('RUN yum -y install python3\n\n')
dockerfile_content_lines.append('# Upgrade pip\n')
dockerfile_content_lines.append('RUN pip3 install --upgrade pip\n\n')
dockerfile_content_lines.append('# Setup the Python virtual env to run the inference script\n')
dockerfile_content_lines.append('RUN python3 -m venv /opt/appenv\n\n')
dockerfile_content_lines.append('# Install the Python packages required for the inference script in the virtual env\n')
dockerfile_content_lines.append('COPY requirements.txt .\n')
dockerfile_content_lines.append('RUN /opt/appenv/bin/pip install -r requirements.txt\n\n')
dockerfile_content_lines.append('# Copy the extracted model file and the inference script\n')
dockerfile_content_lines.append('COPY ')
dockerfile_content_lines.append(extracted_model_file_name)
dockerfile_content_lines.append(' ./\n')
dockerfile_content_lines.append('COPY server.py ./\n\n')
dockerfile_content_lines.append('# Specify the ENV variables\n')
dockerfile_content_lines.append('ENV MODEL_PICKLE_FILE_PATH=')
dockerfile_content_lines.append(extracted_model_file_name)
dockerfile_content_lines.append('\n')
dockerfile_content_lines.append('ENV FLASK_SERVER_LOG_LEVEL=DEBUG\n')
dockerfile_content_lines.append('ENV FLASK_SERVER_HOSTNAME=0.0.0.0\n')
dockerfile_content_lines.append('ENV FLASK_SERVER_PORT=')
dockerfile_content_lines.append(str(apprunner_port))
dockerfile_content_lines.append('\n')
dockerfile_content_lines.append('ENV FLASK_SERVER_DEBUG=True\n\n')
dockerfile_content_lines.append('# Specify the command to run the inference script as a Flask app\n')
dockerfile_content_lines.append('ENTRYPOINT ["/opt/appenv/bin/python", "server.py"]')

# Create the Dockerfile
dockerfile_local_path = '{}/Dockerfile'.format(container_artifacts_dir)
with open(dockerfile_local_path, 'wt') as file:
 file.write(''.join(dockerfile_content_lines))
 
# Print the contents of the generated Dockerfile
!cat {dockerfile_local_path}
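
As an alternative to appending the Dockerfile line by line, the same content can be rendered from a single template, which keeps the file layout visible at a glance. The sketch below uses `string.Template` from the standard library. The helper `render_dockerfile` and the example values `xgboost-model` and `8080` are assumptions for illustration; in the notebook these would come from `extracted_model_file_name` and `apprunner_port`.

```python
from string import Template

# Same instructions as the generated Dockerfile, with two placeholders
DOCKERFILE_TEMPLATE = Template("""\
# syntax=docker/dockerfile:1

# Use Amazon Linux 2 as the base image
FROM public.ecr.aws/amazonlinux/amazonlinux:latest

WORKDIR /
RUN yum -y install python3
RUN pip3 install --upgrade pip
RUN python3 -m venv /opt/appenv

COPY requirements.txt .
RUN /opt/appenv/bin/pip install -r requirements.txt

COPY $model_file ./
COPY server.py ./

ENV MODEL_PICKLE_FILE_PATH=$model_file
ENV FLASK_SERVER_LOG_LEVEL=DEBUG
ENV FLASK_SERVER_HOSTNAME=0.0.0.0
ENV FLASK_SERVER_PORT=$port
ENV FLASK_SERVER_DEBUG=True

ENTRYPOINT ["/opt/appenv/bin/python", "server.py"]
""")


def render_dockerfile(model_file, port):
    """Fill in the model file name and the port; the port is converted to a string."""
    return DOCKERFILE_TEMPLATE.substitute(model_file=model_file, port=str(port))


# Example values only; substitute the notebook's own variables
dockerfile_text = render_dockerfile("xgboost-model", 8080)
print(dockerfile_text)
```

`Template.substitute` raises a `KeyError` if a placeholder is left unfilled, so a missing value fails early instead of producing an incomplete Dockerfile.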

### E) Create the container 

Build the Docker container image using the `docker build` command. Specify the container image name and point to the container-artifacts directory that contains all the files needed to build the image.

Note: You may see warning messages when the container is built with the Dockerfile that we created in the prior step. These warnings will be around installing the Python packages that are required by the inference script. You can choose to either ignore or fix them.

In [None]:
# Create the Docker container
!docker build -t {container_image_name} {container_artifacts_dir}

### F) Create the private repository in ECR 

In order to create an image-based service in AWS App Runner, the container image should exist in a container registry. In this notebook, we will create and use an [Amazon ECR](https://aws.amazon.com/ecr/) private repository for this purpose.

In this step, we will check if the private repository in Amazon ECR that we intend to create already exists or not. If it does not exist, we will create it with the repository name the same as the container image name.

Note: When creating the repository, setting the `scanOnPush` parameter to `True` will automatically initiate a vulnerability scan on the container image that is pushed to the repository. For more info on image scanning, refer [here](https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-scanning.html).

In [None]:
# Check if the ECR repository exists already; if not, then create it
try:
    ecr_client.describe_repositories(repositoryNames=[container_image_name])
    print('ECR repository {} already exists.'.format(container_image_name))
except ecr_client.exceptions.RepositoryNotFoundException:
    print('ECR repository {} does not exist.'.format(container_image_name))
    print('Creating ECR repository {}...'.format(container_image_name))
    # Create the ECR repository - here we use the container image name for the repository name
    ecr_client.create_repository(repositoryName=container_image_name,
                                 imageScanningConfiguration={
                                     'scanOnPush': True
                                 })
    print('Completed creating ECR repository {}.'.format(container_image_name))

### G) Push the container to ECR 

In this step, we will push the container image to the private repository that we created in Amazon ECR.

When using an Amazon ECR private registry, you must authenticate your Docker client to your private registry so that you can use the `docker push` and `docker pull` commands to push and pull images to and from the repositories in that registry. For more information about this, refer [here](https://docs.aws.amazon.com/AmazonECR/latest/userguide/registry_auth.html).

1. If this notebook instance is running on Amazon Linux v1, the authentication happens through an authorization token generated by an AWS CLI command in the following code cell. This token will be automatically deleted when the code cell completes execution.
2. If this notebook instance is running on Amazon Linux v2, the authentication happens through temporary credentials generated based on the IAM role attached to this notebook. For this, you have to complete the prerequisite mentioned in the first step of this notebook.

In [None]:
# Set the image names
source_image_name = '{}:latest'.format(container_image_name)
target_image_name = '{}/{}:latest'.format(container_registry_url_prefix, container_image_name)

if os_version == 'ALv1':
 # Get the private registry credentials using an authorization token
 !aws ecr get-login-password --region {region_name} | docker login --username AWS --password-stdin {container_registry_url_prefix}

# Tag the container
!docker tag {source_image_name} {target_image_name}

# Push the container to the specified registry in Amazon ECR
!docker push {target_image_name}

if os_version == 'ALv1':
 # Delete the Docker credentials file
 print('\nDeleting the generated Docker credentials file...')
 !rm /home/ec2-user/.docker/config.json
 print('Completed deleting the generated Docker credentials file.')
 # Verify the delete
 print('Verifying the delete of the generated Docker credentials file...')
 !cat /home/ec2-user/.docker/config.json
 print('Completed verifying the delete of the generated Docker credentials file.')
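
The target image name used for the tag and push follows the Amazon ECR private registry convention of `<account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. The following sketch shows how such a name is composed. The helper `build_ecr_image_uri`, the account ID, the Region and the repository name are placeholders for illustration, and the sketch assumes the standard AWS partition (other partitions use a different domain suffix).

```python
def build_ecr_image_uri(account_id, region, repository, tag="latest"):
    """Compose the full image URI for an Amazon ECR private repository."""
    registry = "{}.dkr.ecr.{}.amazonaws.com".format(account_id, region)
    return "{}/{}:{}".format(registry, repository, tag)


# Placeholder values, not taken from this notebook
image_uri = build_ecr_image_uri("123456789012", "us-east-1", "xgboost-inference")
print(image_uri)
```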

## 5. Deploy and test on AWS App Runner 

In this step, we will create an [image-based service](https://docs.aws.amazon.com/apprunner/latest/dg/service-source-image.html) on [AWS App Runner](https://docs.aws.amazon.com/apprunner/latest/dg/what-is-apprunner.html) using the Docker container that was created in the previous step and test it.

### A) Create the App Runner service 

In this step, we will check if the App Runner service that we intend to create already exists or not. If it does not exist, we will create it.

Note:

* At the time of writing this notebook, AWS App Runner supported only public endpoints.
* If you intend to deploy this to Production with access security, then you have to build the authentication and authorization layers in the container prior to letting the call go to the inference script. This is not covered in this notebook.
* We have not configured this App Runner service to use a custom domain name. If you require it, refer to the instructions [here](https://docs.aws.amazon.com/apprunner/latest/dg/manage-custom-domains.html).

In [None]:
# Check if the App Runner service exists already; if not, then create it
create_service_flag = False
try:
    describe_service_response = apprunner_client.describe_service(ServiceArn=apprunner_service_arn)
    service_status = describe_service_response['Service']['Status']
    print('App Runner service \'{}\' already exists and is in status \'{}\'.'.format(apprunner_service_name,
                                                                                       service_status))
    if service_status == 'DELETED':
        create_service_flag = True
    else:
        apprunner_service_arn = describe_service_response['Service']['ServiceArn']
        apprunner_service_url = describe_service_response['Service']['ServiceUrl']
except (apprunner_client.exceptions.ResourceNotFoundException,
        apprunner_client.exceptions.InvalidRequestException):
    print('App Runner service {} does not exist.'.format(apprunner_service_name))
    create_service_flag = True

# Create the service based on the determined condition
if create_service_flag:
    print('Creating App Runner service {}...'.format(apprunner_service_name))
    create_service_response = apprunner_client.create_service(
        ServiceName=apprunner_service_name,
        SourceConfiguration={
            'ImageRepository': {
                'ImageIdentifier': target_image_name,
                'ImageConfiguration': {
                    'Port': apprunner_port
                },
                'ImageRepositoryType': 'ECR'
            },
            'AutoDeploymentsEnabled': False,
            'AuthenticationConfiguration': {
                'AccessRoleArn': apprunner_access_role
            }
        },
        InstanceConfiguration={
            'Cpu': apprunner_cpu,
            'Memory': apprunner_memory,
            'InstanceRoleArn': apprunner_instance_role
        },
        HealthCheckConfiguration={
            'Protocol': apprunner_healthcheck_protocol,
            'Path': apprunner_healthcheck_path,
            'Interval': apprunner_healthcheck_interval_in_seconds,
            'Timeout': apprunner_healthcheck_timeout_in_seconds,
            'HealthyThreshold': apprunner_healthcheck_healthy_threshold,
            'UnhealthyThreshold': apprunner_healthcheck_unhealthy_threshold
        })
    apprunner_service_arn = create_service_response['Service']['ServiceArn']
    apprunner_service_url = create_service_response['Service']['ServiceUrl']
    print('App Runner service status = {}'.format(create_service_response['Service']['Status']))
 
# Print the service details
print('App Runner service ARN = {}'.format(apprunner_service_arn))
print('App Runner service URL = {}'.format(apprunner_service_url))

In [None]:
# Sleep every 10 seconds and print the status of the AWS App Runner service
# until it goes to CREATE_FAILED, RUNNING, DELETED, DELETE_FAILED or PAUSED state
while True:
    describe_service_response = apprunner_client.describe_service(ServiceArn=apprunner_service_arn)
    describe_service_status = describe_service_response['Service']['Status']
    print('App Runner service status = {}'.format(describe_service_status))
    if describe_service_status in {'CREATE_FAILED', 'RUNNING', 'DELETED', 'DELETE_FAILED', 'PAUSED'}:
        break
    time.sleep(10)
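
The polling loop above runs until the service reaches one of the listed states. If you prefer an upper bound on the wait, a variant with a timeout could look like the sketch below. The helper `wait_for_terminal_status`, its parameters and the 15-minute default are assumptions, not part of this notebook. The status lookup is passed in as a function so that the sketch can run here with a scripted sequence of statuses instead of a real `describe_service` call.

```python
import time

TERMINAL_STATUSES = {'CREATE_FAILED', 'RUNNING', 'DELETED', 'DELETE_FAILED', 'PAUSED'}


def wait_for_terminal_status(get_status, poll_seconds=10, timeout_seconds=900):
    """Poll get_status() until it returns a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        print('App Runner service status = {}'.format(status))
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError('Service did not reach a terminal status in time.')
        time.sleep(poll_seconds)


# Stand-in for describe_service: a scripted sequence of statuses
fake_statuses = iter(['OPERATION_IN_PROGRESS', 'OPERATION_IN_PROGRESS', 'RUNNING'])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0)
print('Final status = {}'.format(final_status))
```

With the real client, `get_status` would be a small function that calls `describe_service` and returns `['Service']['Status']` from the response.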

### B) Test the App Runner service 

In this step, we will test the App Runner service that we created in the previous step by invoking it synchronously. For this, we will invoke the Python3 [Flask](https://flask.palletsprojects.com/en/1.1.x/) app script running in the container by using the Public URL of the App Runner service.

Invoke the endpoint by making an HTTP POST call with the first record of the test dataset as a CSV string. The request should be in the following format:

`{
 "response_content_type": "",
 "pred_x_csv": ""
}`

In [None]:
# Set the request payload
x_test_request_payload_csv = ','.join(map(str, x_test[0]))
x_test_request_payload = '{' + '"response_content_type": "application/json","pred_x_csv":"{}"'.format(x_test_request_payload_csv) + '}'
# Print the request
print('Request payload:\n')
print(x_test_request_payload)

# Invoke the App Runner service and print the response
apprunner_service_full_url = 'https://{}/'.format(apprunner_service_url)
print('\nResponse:\n')
!curl -X POST -H 'Content-Type: application/json' --data '{x_test_request_payload}' {apprunner_service_full_url}
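
The same request can also be built and sent from Python instead of `curl`. The sketch below builds the payload with `json.dumps`, which takes care of quoting and escaping, and sends it with `urllib.request` from the standard library. The helper names `build_request_payload` and `invoke_service`, the 30-second timeout and the sample feature values are assumptions for illustration. Only the payload is built here; calling `invoke_service` requires a running App Runner service.

```python
import json
import urllib.request


def build_request_payload(feature_values, response_content_type='application/json'):
    """Serialize one record into the request format expected by the inference script."""
    return json.dumps({
        'response_content_type': response_content_type,
        'pred_x_csv': ','.join(map(str, feature_values)),
    })


def invoke_service(service_url, payload, timeout_seconds=30):
    """POST the payload to the service root URL and return the response body as text."""
    request = urllib.request.Request(
        'https://{}/'.format(service_url),
        data=payload.encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read().decode('utf-8')


# Sample feature values only; in the notebook this would be the first test record
payload = build_request_payload([1.0, 2.5, 3])
print(payload)
# To call a live service: print(invoke_service(apprunner_service_url, payload))
```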

## 6. Cleanup 

As a best practice, you should delete resources and S3 objects when they are no longer required. This will help you avoid incurring unnecessary costs.

This step will clean up the resources and S3 objects created by this notebook.

Note: Apart from these resources, there will be Docker containers and related images created in the notebook instance that is running this Jupyter notebook. As they are already part of the notebook instance, you do not need to delete them. If you decide to delete them, go to the Terminal of the Jupyter notebook and run the appropriate `docker` commands.

### A) Cleanup App Runner resources 

In [None]:
# Delete the App Runner service
apprunner_client.delete_service(ServiceArn=apprunner_service_arn)

### B) Cleanup ECR repository 

In [None]:
# Delete the ECR private repository
try:
 ecr_client.delete_repository(repositoryName=container_image_name, force=True)
 print('ECR repository {} deleted.'.format(container_image_name))
except ecr_client.exceptions.RepositoryNotFoundException:
 print('ECR repository {} does not exist.'.format(container_image_name))

### C) Cleanup S3 objects 

In [None]:
# Delete data from S3 bucket
for file in s3_bucket_resource.objects.filter(Prefix='{}/'.format(nb_name)):
    file_key = file.key
    print('Deleting {} ...'.format(file_key))
    s3_resource.Object(s3_bucket_resource.name, file_key).delete()