# Train an XGBoost regression model on Amazon SageMaker, host inference in a Docker container running on Amazon ECS on AWS Fargate, and optionally expose it as an API with Amazon API Gateway

[Amazon SageMaker](https://aws.amazon.com/sagemaker/) is a fully managed end-to-end Machine Learning (ML) service. With SageMaker, you can use the built-in algorithms or bring your own algorithms and frameworks to train your models.  After training, you can deploy the models for inference in [one of two ways](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html): a persistent endpoint or batch transform.

With a persistent inference endpoint, you get a fully-managed real-time HTTPS endpoint hosted on either CPU or GPU based EC2 instances.  It supports features like auto scaling, data capture and model monitoring, and also provides cost-effective GPU support using [Amazon Elastic Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html).  It also supports hosting multiple models behind a single endpoint using multi-model endpoints, and A/B testing across models using production variants.  You can monitor the endpoint using [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/).  In addition to all these, you can use [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines/), which provides a purpose-built, easy-to-use Continuous Integration and Continuous Delivery (CI/CD) service for Machine Learning.

There are use cases where you may want to host the ML model on a real-time inference endpoint that is cost-effective and does not require all the capabilities provided by the SageMaker persistent inference endpoint.  These may involve:
* simple models
* models smaller than 200 MB
* models that are invoked sparsely and do not need inference instances running all the time
* models that do not need to be re-trained and re-deployed frequently
* models that do not need GPUs for inference

In these cases, you can take the trained ML model and host it in a container on [Amazon ECS on AWS Fargate](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html) and optionally expose it as an API by front-ending it with an HTTP/REST API hosted on [Amazon API Gateway](https://aws.amazon.com/api-gateway/).  This is cost-effective compared to keeping real-time inference instances running and still provides a fully-managed and scalable solution.

[Amazon Elastic Container Service (Amazon ECS)](https://aws.amazon.com/ecs) is a fully managed container orchestration service.  It is a highly scalable, fast container management service that makes it easy to run, stop and manage containers on a cluster. Your containers are defined in a task definition that you use to run individual tasks or tasks within a service.  In this context, a service is a configuration that enables you to run and maintain a specified number of tasks simultaneously in a cluster. You can run your tasks and services on a serverless infrastructure that is managed by [AWS Fargate](https://aws.amazon.com/fargate). Alternatively, for more control over your infrastructure, you can run your tasks and services on a cluster of Amazon EC2 instances that you manage.

This notebook demonstrates this solution by using SageMaker's [built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) to train a regression model on the [California Housing dataset](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html).  It loads the trained model as a Python3 [pickle](https://docs.python.org/3/library/pickle.html) object in a Python3 [Flask](https://flask.palletsprojects.com/en/1.1.x/) app script in a container to be hosted on [Amazon ECS on AWS Fargate](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html).  Finally, it provides instructions for exposing it as an API by front-ending it with an HTTP/REST API hosted on [Amazon API Gateway](https://aws.amazon.com/api-gateway/).

**Warning:** The Python3 [pickle](https://docs.python.org/3/library/pickle.html) module is not secure.  Only unpickle data you trust.  Keep this in mind if you decide to get the trained ML model file from somewhere instead of building your own model.
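
One way to act on this warning is to record a checksum of the model file when you produce it and to compare that checksum before unpickling.  The following is a minimal sketch of that idea, not part of this notebook's workflow; the helper names `sha256_of_file` and `load_trusted_pickle` and the demo file are hypothetical.  A matching digest only confirms that the file is the one you recorded; it does not make an untrusted pickle safe.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def load_trusted_pickle(path, expected_sha256):
    """Unpickle `path` only if its digest matches the recorded value."""
    actual_sha256 = sha256_of_file(path)
    if actual_sha256 != expected_sha256:
        raise ValueError('Digest mismatch for {}; refusing to unpickle'.format(path))
    with open(path, 'rb') as f:
        return pickle.load(f)


# Demonstration with a throwaway file standing in for a model artifact
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'demo-model.pkl'
    demo_path.write_bytes(pickle.dumps({'weights': [1.0, 2.0]}))
    # In practice this digest is recorded when the file is produced
    known_digest = sha256_of_file(demo_path)
    demo_model = load_trusted_pickle(demo_path, known_digest)
    print(demo_model)
```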

**Note:**

* This notebook should only be run from within a SageMaker notebook instance as it references SageMaker native APIs.  The underlying OS of the notebook instance can either be Amazon Linux v1 or v2.
* At the time of writing, the most suitable Jupyter kernel for this notebook was `conda_python3`, which comes built-in with SageMaker notebooks.
* This notebook uses CPU based instances for training.
* If you already have a trained model that can be loaded as a Python3 [pickle](https://docs.python.org/3/library/pickle.html) object, then you can skip the training step in this notebook and directly upload the model file to S3 and update the code in this notebook's cells accordingly.
* In this notebook, the ML model generated in the training step has not been tuned as that is not the intent of this demo.
* In this notebook, we will create only one ECS Task.  To scale to more tasks, you have to create an [Amazon ECS Service](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html) made up of multiple tasks.  You can then set up [Load Balancing](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-load-balancing.html) and [Auto Scaling](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html).
* This notebook will create resources in the same AWS account and in the same region where this notebook is running.
* Users of this notebook require `root` access to install/update required software.  This is set by default when you create the notebook.  For more info, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-root-access.html).

**Table of Contents:**

1. [Complete prerequisites](#Complete%20prerequisites)

    1. [Check and configure access to the Internet](#Check%20and%20configure%20access%20to%20the%20Internet)

    2. [Check and upgrade required software versions](#Check%20and%20upgrade%20required%20software%20versions)
    
    3. [Check and configure security permissions](#Check%20and%20configure%20security%20permissions)

    4. [Organize imports](#Organize%20imports)
    
    5. [Create common objects](#Create%20common%20objects)

2. [Prepare the data](#Prepare%20the%20data)

    1. [Create the local directories](#Create%20the%20local%20directories)
    
    2. [Load the dataset and view the details](#Load%20the%20dataset%20and%20view%20the%20details)
    
    3. [(Optional) Visualize the dataset](#(Optional)%20Visualize%20the%20dataset)
    
    4. [Split the dataset into train, validate and test sets](#Split%20the%20dataset%20into%20train,%20validate%20and%20test%20sets)
    
    5. [Standardize the datasets](#Standardize%20the%20datasets)
    
    6. [Save the prepared datasets locally](#Save%20the%20prepared%20datasets%20locally)
    
    7. [Upload the prepared datasets to S3](#Upload%20the%20prepared%20datasets%20to%20S3)

3. [Perform training](#Perform%20training)

    1. [Set the training parameters](#Set%20the%20training%20parameters)
    
    2. [(Optional) Delete previous checkpoints](#(Optional)%20Delete%20previous%20checkpoints)
    
    3. [Run the training job](#Run%20the%20training%20job)

4. [Create and push the Docker container to an Amazon ECR repository](#Create%20and%20push%20the%20Docker%20container%20to%20an%20Amazon%20ECR%20repository)

    1. [Retrieve the model pickle file](#Retrieve%20the%20model%20pickle%20file)
    
    2. [(Optional) Test the model pickle file](#(Optional)%20Test%20the%20model%20pickle%20file)
    
    3. [View the inference script](#View%20the%20inference%20script)
    
    4. [Create the Dockerfile](#Create%20the%20Dockerfile)
    
    5. [Create the container](#Create%20the%20container)
    
    6. [Create the private repository in ECR](#Create%20the%20private%20repository%20in%20ECR)
    
    7. [Push the container to ECR](#Push%20the%20container%20to%20ECR)

5. [Deploy and test on Amazon ECS on AWS Fargate](#Deploy%20and%20test%20on%20Amazon%20ECS%20on%20AWS%20Fargate)
    
    1. [Create the ECS cluster](#Create%20the%20ECS%20cluster)
    
    2. [Create the ECS Task and deploy the container](#Create%20the%20ECS%20Task%20and%20deploy%20the%20container)
    
    3. [Prepare to test the ECS Task](#Prepare%20to%20test%20the%20ECS%20Task)
    
    4. [Test the ECS Task](#Test%20the%20ECS%20Task)
    
6. [(Optional) Front-end the container with Amazon API Gateway](#(Optional)%20Front-end%20the%20container%20with%20Amazon%20API%20Gateway)

7. [Cleanup](#Cleanup)

    1. [Cleanup ECS resources](#Cleanup%20ECS%20resources)
    
    2. [Cleanup ECR repository](#Cleanup%20ECR%20repository)
    
    3. [Cleanup S3 objects](#Cleanup%20S3%20objects)


##  1. Complete prerequisites <a id='Complete%20prerequisites'></a>

Check and complete the prerequisites.

###  A. Check and configure access to the Internet <a id='Check%20and%20configure%20access%20to%20the%20Internet'></a>

This notebook requires outbound access to the Internet to download the required software updates and to make calls to the container hosted as an Amazon ECS Task.  You can either provide direct Internet access (default) or provide Internet access through a VPC.  For more information on this, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/appendix-notebook-and-internet-access.html).

### B. Check and upgrade required software versions  <a id='Check%20and%20upgrade%20required%20software%20versions'></a>

This notebook requires:
* [SageMaker Python SDK version 2.x](https://sagemaker.readthedocs.io/en/stable/v2.html)
* [Python 3.6.x](https://www.python.org/downloads/release/python-360/)
* [Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)
* [AWS Command Line Interface](https://aws.amazon.com/cli/)
* [Docker](https://www.docker.com/)
* [XGBoost Python module](https://xgboost.readthedocs.io/en/latest/python/python_intro.html)
* [cURL](https://curl.se/)

Capture the version of the OS on which this notebook is running.

In [None]:
import subprocess
from subprocess import Popen

p = Popen(['cat','/etc/system-release'], stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
os_cmd_output, os_cmd_error = p.communicate()
if len(os_cmd_error) > 0:
    print('Notebook OS command returned error :: {}'.format(os_cmd_error))
    os_version = ''
else:
    if os_cmd_output.find('Amazon Linux release 2') >= 0:
        os_version = 'ALv2'
    elif os_cmd_output.find('Amazon Linux AMI release 2018.03') >= 0:
        os_version = 'ALv1'
    else:
        os_version = ''
print('Notebook OS version : {}'.format(os_version))

**Note:** When running the following cell, if you get 'module not found' errors, then uncomment the appropriate installation commands and install the modules.  Also, uncomment and run the kernel shutdown command.  When the kernel comes back, comment out the installation and kernel shutdown commands and run the following cell.  Now, you should not see any errors.

In [None]:
"""

Last tested versions:


On Amazon Linux v1 (ALv1) notebook:
-----------------------------------
SageMaker Python SDK version : 2.54.0
Python version : 3.6.13 | packaged by conda-forge | (default, Feb 19 2021, 05:36:01) 
[GCC 9.3.0]
Boto3 version : 1.18.27
XGBoost Python module version : 1.4.2
AWS CLI version : aws-cli/1.20.21 Python/3.6.13 Linux/4.14.238-125.422.amzn1.x86_64 botocore/1.21.27
Docker version : 19.03.13-ce, build 4484c46


On Amazon Linux v2 (ALv2) notebook:
-----------------------------------
SageMaker Python SDK version : 2.59.1
Python version : 3.6.13 | packaged by conda-forge | (default, Feb 19 2021, 05:36:01) 
[GCC 9.3.0]
Boto3 version : 1.18.36
XGBoost Python module version : 1.4.2
AWS CLI version : aws-cli/1.20.24 Python/3.6.13 Linux/4.14.243-185.433.amzn2.x86_64 botocore/1.21.36
Docker version : 20.10.7, build f0df350
Amazon ECR Docker Credential Helper : 0.6.3

"""

import boto3
import IPython
import os
import sagemaker
import sys
try:
    import xgboost as xgb
except ModuleNotFoundError:
    # Install XGBoost and restart kernel
    print('Installing XGBoost module...')
    !{sys.executable} -m pip install -U xgboost
    IPython.Application.instance().kernel.do_shutdown(True)

# Install/upgrade the Sagemaker SDK, Boto3 and XGBoost and restart kernel
#!{sys.executable} -m pip install -U sagemaker boto3 xgboost
#IPython.Application.instance().kernel.do_shutdown(True)

# Get the current installed version of Sagemaker SDK, Python, Boto3 and XGBoost
print('SageMaker Python SDK version : {}'.format(sagemaker.__version__))
print('Python version : {}'.format(sys.version))
print('Boto3 version : {}'.format(boto3.__version__))
print('XGBoost Python module version : {}'.format(xgb.__version__))

# Get the AWS CLI version
print('AWS CLI version : ')
!aws --version

**Docker:**

Docker should be pre-installed in the SageMaker notebook instance.  Verify it by running the `docker --version` command.  If Docker is not installed, you can install it by uncommenting the install command in the following cell.  You will require `sudo` rights to install.

In [None]:
# Verify if docker is installed
!docker --version

# Install docker
#!sudo yum --assumeyes install docker

**cURL:**

cURL should be pre-installed in the SageMaker notebook instance.  Verify it by running the `curl --version` command.  If cURL is not installed, you can install it by uncommenting the install command in the following cell.  You will require `sudo` rights to install.

In [None]:
"""
Last tested version:
curl 7.71.1 (x86_64-conda-linux-gnu) libcurl/7.71.1 OpenSSL/1.1.1j zlib/1.2.11 libssh2/1.9.0 nghttp2/1.43.0
Release-Date: 2020-07-01
Protocols: dict file ftp ftps gopher http https imap imaps pop3 pop3s rtsp scp sftp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS GSS-API HTTP2 HTTPS-proxy IPv6 Kerberos Largefile libz NTLM NTLM_WB SPNEGO SSL TLS-SRP UnixSockets
"""

# Verify if curl is installed
!curl --version

# Install curl
#!sudo yum --assumeyes install curl

**Additional prerequisite (when notebook is running on Amazon Linux v2):**

Install and configure the [Amazon ECR credential helper](https://github.com/awslabs/amazon-ecr-credential-helper).  This makes it easier to store and use Docker credentials for use with Amazon ECR private registries.

In [None]:
if os_version == 'ALv2':
    # Install
    !sudo yum --assumeyes install amazon-ecr-credential-helper
    # Verify installation
    print('Amazon ECR Docker Credential Helper version : ')
    !docker-credential-ecr-login version
    # Create the .docker directory if it doesn't exist
    !mkdir -p ~/.docker
    # Configure
    !printf "{\\n\\t\"credsStore\": \"ecr-login\"\\n}" > ~/.docker/config.json
    # Verify configuration
    !cat ~/.docker/config.json
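
The `printf` command above overwrites `~/.docker/config.json`.  If that file already holds other settings, you may prefer to merge the `credsStore` key into it instead.  The following is a minimal sketch of such a merge, assuming the existing file (if any) contains valid JSON; the helper name `set_docker_creds_store` is hypothetical, and the demonstration writes to a temporary directory rather than to `~/.docker`.

```python
import json
import os
import tempfile


def set_docker_creds_store(config_path, creds_store='ecr-login'):
    """Set `credsStore` in a Docker config file, keeping any other keys."""
    config = {}
    if os.path.exists(config_path):
        with open(config_path) as f:
            config = json.load(f)
    config['credsStore'] = creds_store
    os.makedirs(os.path.dirname(config_path), exist_ok=True)
    with open(config_path, 'w') as f:
        json.dump(config, f, indent=4)
    return config


# Demonstration against a temporary directory instead of ~/.docker
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_config_path = os.path.join(tmp_dir, '.docker', 'config.json')
    os.makedirs(os.path.dirname(demo_config_path))
    # Pre-existing setting that should survive the update
    with open(demo_config_path, 'w') as f:
        json.dump({'detachKeys': 'ctrl-e,e'}, f)
    merged_config = set_docker_creds_store(demo_config_path)
    print(merged_config)
```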

###  C. Check and configure security permissions <a id='Check%20and%20configure%20security%20permissions'></a>

Users of this notebook require `root` access to install/update required software.  This is set by default when you create the notebook.  For more info, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-root-access.html).

This notebook uses the IAM role attached to the underlying notebook instance.  This role should have the following permissions:

1. Full access to the S3 bucket that will be used to store training and output data.
2. Full access to launch training instances.
3. Access to create CloudWatch Log Groups.
4. Access to write to CloudWatch Logs and CloudWatch Metrics.
5. Access to create, delete and write to Amazon ECR private registries.
6. Access to create and delete Amazon ECS clusters and task definitions.
7. Access to run ECS tasks.

To view the name of this role, run the following cell.

In [None]:
print(sagemaker.get_execution_role())

This notebook creates a [task](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definitions.html) on [Amazon ECS on AWS Fargate](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html).  This task requires an IAM role named 'Task Execution IAM role' that it assumes when it is invoked.  For more information on this, refer [here](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_execution_IAM_role.html).

For the task created in this notebook, at a minimum, this role should provide the following:

* Access to create CloudWatch Log Groups.
* Access to write to CloudWatch Logs and CloudWatch Metrics.
* Read access to Amazon ECR.

For information on the various IAM roles required for Amazon ECS on AWS Fargate refer [here](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html).
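
To make the role requirements above more concrete, the following sketch builds the two policy documents such a role would need: a trust policy that lets ECS tasks assume the role, and a permissions policy covering the three bullet points.  This is an illustration under assumptions, not the policy used by the notebook author; the variable names are hypothetical and `Resource` is left as `*` for brevity, which you would normally narrow to specific log groups and repositories.

```python
import json

# Trust policy: allows the ECS tasks service principal to assume the role
ecs_task_trust_policy = {
    'Version': '2012-10-17',
    'Statement': [
        {
            'Effect': 'Allow',
            'Principal': {'Service': 'ecs-tasks.amazonaws.com'},
            'Action': 'sts:AssumeRole',
        }
    ],
}

# Permissions policy: CloudWatch Logs and Metrics writes plus read access to ECR.
# 'Resource': '*' is an assumption made to keep the sketch short.
ecs_task_execution_permissions = {
    'Version': '2012-10-17',
    'Statement': [
        {
            'Effect': 'Allow',
            'Action': [
                'logs:CreateLogGroup',
                'logs:CreateLogStream',
                'logs:PutLogEvents',
                'cloudwatch:PutMetricData',
                'ecr:GetAuthorizationToken',
                'ecr:BatchCheckLayerAvailability',
                'ecr:GetDownloadUrlForLayer',
                'ecr:BatchGetImage',
            ],
            'Resource': '*',
        }
    ],
}

# Render both documents as JSON, ready to paste into the IAM console or CLI
trust_policy_json = json.dumps(ecs_task_trust_policy, indent=2)
permissions_policy_json = json.dumps(ecs_task_execution_permissions, indent=2)
print(trust_policy_json)
print(permissions_policy_json)
```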

###  D. Organize imports <a id='Organize%20imports'></a>

Organize all the library and module imports for later use.

In [None]:
from io import StringIO
import json
import logging
import matplotlib.pyplot as plt
import numpy as np
import pickle
import pandas as pd
from sagemaker.inputs import TrainingInput
import seaborn as sns
import sklearn.model_selection
from sklearn.preprocessing import StandardScaler
import tarfile
import time

###  E. Create common objects <a id='Create%20common%20objects'></a>

Create common objects to be used in future steps in this notebook.

In [None]:
# Specify the S3 bucket name
s3_bucket = '<Specify the S3 bucket name>'

# Create the S3 Boto3 resource
s3_resource = boto3.resource('s3')
s3_bucket_resource = s3_resource.Bucket(s3_bucket)

# Create the SageMaker Boto3 client
sm_client = boto3.client('sagemaker')

# Create the ECR client
ecr_client = boto3.client('ecr')

# Create the Amazon ECS client
ecs_client = boto3.client('ecs')

# Create the Amazon EC2 client
ec2_client = boto3.client('ec2')

# Get the AWS region name
region_name = sagemaker.Session().boto_region_name

# Base name to be used to create resources
nb_name = 'sm-xgboost-ca-housing-ecs-container-model-hosting'

# Names of various resources
train_job_name = 'train-{}'.format(nb_name)

# Names of local sub-directories in the notebook file system
data_dir = os.path.join(os.getcwd(), 'data/{}'.format(nb_name))
train_dir = os.path.join(os.getcwd(), 'data/{}/train'.format(nb_name))
val_dir = os.path.join(os.getcwd(), 'data/{}/validate'.format(nb_name))
test_dir = os.path.join(os.getcwd(), 'data/{}/test'.format(nb_name))

# Location of the datasets file in the notebook file system
dataset_csv_file = os.path.join(os.getcwd(), 'datasets/california_housing.csv')

# Container artifacts directory in the notebook file system
container_artifacts_dir = os.path.join(os.getcwd(), 'container-artifacts/{}'.format(nb_name))

# Location of the Python3 Flask script (containing the inference code) and its corresponding
# requirements.txt in the notebook file system
container_script_file_name = 'container_sm_xgboost_ca_housing_inference.py'
container_script_req_file_name = 'container_sm_xgboost_ca_housing_inference_requirements.txt'
container_script_file = os.path.join(os.getcwd(), 'scripts/{}'.format(container_script_file_name))
container_script_req_file = os.path.join(os.getcwd(), 'scripts/{}'.format(container_script_req_file_name))

# Sub-folder names in S3
train_dir_s3_prefix = '{}/data/train'.format(nb_name)
val_dir_s3_prefix = '{}/data/validate'.format(nb_name)
test_dir_s3_prefix = '{}/data/test'.format(nb_name)

# Location in S3 where the model checkpoint will be stored
model_checkpoint_s3_path = 's3://{}/{}/checkpoint/'.format(s3_bucket, nb_name)

# Location in S3 where the trained model will be stored
model_output_s3_path = 's3://{}/{}/output/'.format(s3_bucket, nb_name)

# Names of the model tar file and extracted file - these are dependent on the
# framework and algorithm you used to train the model.  This notebook uses
# SageMaker's built-in XGBoost algorithm and that will have the names as follows:
model_tar_file_name = 'model.tar.gz'
extracted_model_file_name = 'xgboost-model'

# Container details
container_image_name = nb_name
container_registry_url_prefix = '<Specify the ECR URL prefix in this format {aws_account_id}.dkr.ecr.{region}.amazonaws.com>'

# ECS cluster details
ecs_cluster_name = 'cluster-{}'.format(nb_name)

# ECS Task details
ecs_fargate_task_name = 'fargate-task-{}'.format(nb_name)
ecs_fargate_task_role = '<Specify the ARN for the ECS Task IAM role>'
ecs_fargate_task_execution_role = '<Specify the ARN for the ECS Task execution IAM role>'
ecs_fargate_task_cpu = '0.25 vCPU'
ecs_fargate_task_memory = '0.5 GB'
ecs_fargate_task_count = 1

# ECS Task networking details
ecs_fargate_task_subnet_list = ['<Specify the ID of a public subnet in your preferred VPC>']
ecs_fargate_task_security_group_list = ['<Specify the ID of your Security Group in your preferred VPC>']

# ECS Task container details
ecs_container_name = 'container-{}'.format(nb_name)
ecs_container_port = 80
ecs_container_host_port = 80
ecs_container_healthcheck_command_list = ["CMD-SHELL", "curl -f http://localhost:80/healthcheck || exit 1"]
ecs_container_healthcheck_interval_in_seconds = 30
ecs_container_healthcheck_timeout_in_seconds = 30
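
Several values in the cell above are placeholders of the form `<Specify ...>` that you must replace before continuing.  The following sketch shows one way to build the ECR URL prefix and to detect placeholders that were left unfilled.  The helper names, the account ID `123456789012` and the bucket name are hypothetical, and the URL format assumes a standard commercial AWS region.

```python
def build_ecr_registry_prefix(aws_account_id, region):
    """Build the ECR registry URL prefix in the format the notebook expects."""
    return '{}.dkr.ecr.{}.amazonaws.com'.format(aws_account_id, region)


def find_unset_placeholders(settings):
    """Return the sorted names of settings that still hold a '<Specify ...>' value."""
    unset = []
    for name, value in settings.items():
        values = value if isinstance(value, list) else [value]
        if any(isinstance(v, str) and v.startswith('<Specify') for v in values):
            unset.append(name)
    return sorted(unset)


# Hypothetical account ID and region; in a notebook the account ID could be read
# with boto3.client('sts').get_caller_identity()['Account']
demo_prefix = build_ecr_registry_prefix('123456789012', 'us-east-1')
print(demo_prefix)

# Hypothetical settings: one filled in, two still holding placeholders
demo_settings = {
    's3_bucket': 'my-demo-bucket',
    'ecs_fargate_task_role': '<Specify the ARN for the ECS Task IAM role>',
    'ecs_fargate_task_subnet_list': ['<Specify the ID of a public subnet in your preferred VPC>'],
}
unset_names = find_unset_placeholders(demo_settings)
print('Settings still to be filled in: {}'.format(unset_names))
```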

## 2. Prepare the data <a id='Prepare%20the%20data'></a>

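The [California Housing dataset](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html) consists of 20,640 observations on housing prices with 9 economic covariates.  The first, MedianHouseValue, is the target that the model predicts; the remaining eight are used as input features.  These covariates are: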

* MedianHouseValue
* MedianIncome
* HousingMedianAge
* TotalRooms
* TotalBedrooms
* Population
* Households
* Latitude
* Longitude

This dataset has been downloaded to the local `datasets` directory and converted to a CSV file with the column names in the first row (in snake case, for example `median_house_value`).  This CSV file is used in this notebook.

The following steps will help with preparing the datasets for training, validation and testing.

### A) Create the local directories <a id='Create%20the%20local%20directories'></a>

Create the directories in the local system where the dataset will be copied to and processed.

In [None]:
# Create the local directories if they don't exist
os.makedirs(data_dir, exist_ok=True)
os.makedirs(train_dir, exist_ok=True)
os.makedirs(val_dir, exist_ok=True)
os.makedirs(test_dir, exist_ok=True)

### B) Load the dataset and view the details <a id='Load%20the%20dataset%20and%20view%20the%20details'></a>

Check if the CSV file exists in the `datasets` directory and load it into a Pandas DataFrame.  Finally, print the details of the dataset.

In [None]:
# Check if the dataset file exists and proceed
if os.path.exists(dataset_csv_file):
    print('Dataset CSV file \'{}\' exists.'.format(dataset_csv_file))
    # Load the data into a Pandas DataFrame
    pd_data_frame = pd.read_csv(dataset_csv_file)
    # Print the first 5 records
    #print(pd_data_frame.head(5))
    # Describe the dataset
    print(pd_data_frame.describe())
else:
    print('Dataset CSV file \'{}\' does not exist.'.format(dataset_csv_file))

### C) (Optional) Visualize the dataset <a id='(Optional)%20Visualize%20the%20dataset'></a>

Display the pairwise correlations between the columns of the dataset as a heatmap.

In [None]:
# Plot the correlation matrix as a heatmap (values in percent)
plt.figure(figsize=(11, 7))
sns.heatmap(cbar=False, annot=True, data=(pd_data_frame.corr() * 100), cmap='coolwarm')
plt.title('% Correlation Matrix')
plt.show()

### D) Split the dataset into train, validate and test sets <a id='Split%20the%20dataset%20into%20train,%20validate%20and%20test%20sets'></a>

Split the dataset into train, validate and test sets after shuffling.  Split further into x and y sets.

In [None]:
# Split into train and test datasets after shuffling
train, test = sklearn.model_selection.train_test_split(pd_data_frame, test_size=0.2,
                                                       random_state=35, shuffle=True)
# Split the train dataset further into train and validation datasets after shuffling
train, val = sklearn.model_selection.train_test_split(train, test_size=0.1,
                                                      random_state=25, shuffle=True)

# Define functions to get x and y columns
def get_x(df):
    return df[['median_income','housing_median_age','total_rooms','total_bedrooms',
                 'population','households','latitude','longitude']]
def get_y(df):
    return df[['median_house_value']]

# Load the x and y columns for train, validation and test datasets
x_train = get_x(train)
y_train = get_y(train)
x_val = get_x(val)
y_val = get_y(val)
x_test = get_x(test)
y_test = get_y(test)

# Summarize the datasets
print("x_train shape:", x_train.shape)
print("y_train shape:", y_train.shape)
print("x_val shape:", x_val.shape)
print("y_val shape:", y_val.shape)
print("x_test shape:", x_test.shape)
print("y_test shape:", y_test.shape)
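
The two-stage split above yields roughly 72% train, 8% validation and 20% test.  The arithmetic can be sketched as follows, assuming the CSV holds all 20,640 observations and assuming the test portion is rounded up to a whole number of rows, which is how scikit-learn is understood to size it; the helper `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_remaining, n_held_out), rounding the held-out part up."""
    n_held_out = math.ceil(test_fraction * n_samples)
    return n_samples - n_held_out, n_held_out


n_total = 20640
# First split: 20% held out for testing
n_train_val, n_test = split_sizes(n_total, 0.2)
# Second split: 10% of the remainder held out for validation
n_train, n_val = split_sizes(n_train_val, 0.1)

print(n_train, n_val, n_test)  # prints: 14860 1652 4128
```

If the shapes printed by the cell above differ from these numbers, the dataset file probably has a different number of rows.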

### E) Standardize the datasets <a id='Standardize%20the%20datasets'></a>

* Standardize the x columns of the train dataset using the `fit_transform()` function of `StandardScaler`.
* Standardize the x columns of the validate and test datasets using the `transform()` function of `StandardScaler`.

In [None]:
# Standardize the dataset
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_val = scaler.transform(x_val)
x_test = scaler.transform(x_test)
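
Standardization rescales each feature to zero mean and unit variance, and the statistics are learned from the train set only so that no information leaks from the validate and test sets.  The following sketch reproduces that calculation for a single made-up feature column using only the standard library.  It uses the population standard deviation, which is what `StandardScaler` computes; the helper names and the sample values are hypothetical.

```python
import statistics


def fit_standardizer(values):
    """Learn the mean and population standard deviation of one feature."""
    return statistics.mean(values), statistics.pstdev(values)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(v - mean) / std for v in values]


# Made-up values for one feature column
train_feature = [2.0, 4.0, 6.0, 8.0]
val_feature = [5.0, 10.0]

# Fit on the train values only, then reuse the same statistics everywhere
feature_mean, feature_std = fit_standardizer(train_feature)
train_scaled = standardize(train_feature, feature_mean, feature_std)
val_scaled = standardize(val_feature, feature_mean, feature_std)

print('mean={}, std={:.4f}'.format(feature_mean, feature_std))
print('train_scaled={}'.format(train_scaled))
print('val_scaled={}'.format(val_scaled))
```

Because the model is trained on standardized features, any input sent to it later has to be transformed with the same mean and standard deviation.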

### F) Save the prepared datasets locally <a id='Save%20the%20prepared%20datasets%20locally'></a>

Save the prepared train, validate and test datasets to the local directories created earlier.  Prior to saving, concatenate the y and x columns as needed, with the y column first.

In [None]:
# Save the prepared dataset (in numpy format) to the local directories as csv files

np.savetxt(os.path.join(train_dir, 'train.csv'),
           np.concatenate((y_train.to_numpy(), x_train), axis=1), delimiter=',')
np.savetxt(os.path.join(train_dir, 'train_x.csv'), x_train)
np.savetxt(os.path.join(train_dir, 'train_y.csv'), y_train.to_numpy())

np.savetxt(os.path.join(val_dir, 'validate.csv'),
           np.concatenate((y_val.to_numpy(), x_val), axis=1), delimiter=',')
np.savetxt(os.path.join(val_dir, 'validate_x.csv'), x_val)
np.savetxt(os.path.join(val_dir, 'validate_y.csv'), y_val.to_numpy())

np.savetxt(os.path.join(test_dir, 'test.csv'),
           np.concatenate((y_test.to_numpy(), x_test), axis=1), delimiter=',')
np.savetxt(os.path.join(test_dir, 'test_x.csv'), x_test)
np.savetxt(os.path.join(test_dir, 'test_y.csv'), y_test.to_numpy())
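
The combined files above place the y column first because SageMaker's built-in XGBoost algorithm expects CSV training data with the target in the first column and no header row.  The following sketch illustrates that layout on two made-up records using only the standard library; the helper `to_label_first_rows` and the sample values are hypothetical.

```python
import csv
import io


def to_label_first_rows(labels, feature_rows):
    """Prepend each label to its feature row, giving [label, f1, f2, ...]."""
    return [[label] + list(features) for label, features in zip(labels, feature_rows)]


# Made-up labels and (already standardized) features
demo_labels = [452600.0, 358500.0]
demo_features = [[0.5, -1.2], [1.1, 0.3]]

# Write the rows to an in-memory buffer without a header row
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerows(to_label_first_rows(demo_labels, demo_features))
csv_text = buffer.getvalue()
print(csv_text)
```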

### G) Upload the prepared datasets to S3 <a id='Upload%20the%20prepared%20datasets%20to%20S3'></a>

Upload the datasets from the local directories to appropriate sub-directories in the specified S3 bucket.

In [None]:
# Upload the data to S3
train_dir_s3_path = sagemaker.Session().upload_data(path='./data/{}/train/'.format(nb_name),
                                                          bucket=s3_bucket,
                                                          key_prefix=train_dir_s3_prefix)
val_dir_s3_path = sagemaker.Session().upload_data(path='./data/{}/validate/'.format(nb_name),
                                                        bucket=s3_bucket,
                                                        key_prefix=val_dir_s3_prefix)
test_dir_s3_path = sagemaker.Session().upload_data(path='./data/{}/test/'.format(nb_name),
                                                         bucket=s3_bucket,
                                                         key_prefix=test_dir_s3_prefix)

# Capture the S3 locations of the uploaded datasets
train_s3_path = '{}/train.csv'.format(train_dir_s3_path)
train_x_s3_path = '{}/train_x.csv'.format(train_dir_s3_path)
train_y_s3_path = '{}/train_y.csv'.format(train_dir_s3_path)
val_s3_path = '{}/validate.csv'.format(val_dir_s3_path)
val_x_s3_path = '{}/validate_x.csv'.format(val_dir_s3_path)
val_y_s3_path = '{}/validate_y.csv'.format(val_dir_s3_path)
test_s3_path = '{}/test.csv'.format(test_dir_s3_path)
test_x_s3_path = '{}/test_x.csv'.format(test_dir_s3_path)
test_y_s3_path = '{}/test_y.csv'.format(test_dir_s3_path)

##  3. Perform training <a id='Perform%20training'></a>

In this step, SageMaker's [built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) is used to train a regression model on the [California Housing dataset](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html).

Note: This model has not been tuned as that is not the intent of this demo.

### A) Set the training parameters <a id='Set%20the%20training%20parameters'></a>

1. Inputs - S3 location of the training and validation data.
2. Hyperparameters.
3. Training instance details:

    1. Instance count
    
    2. Instance type
    
    3. The max run time of the training job
    
    4. (Optional) Use Spot instances.  For more info, refer [here](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html).
    
    5. (Optional) The max wait for Spot instances, if using Spot.  This should be larger than the max run time.
    
4. Base job name
5. Appropriate local and S3 directories that will be used by the training job.

In [None]:
# Set the input data along with their content types
train_input = TrainingInput(train_s3_path, content_type='text/csv')
val_input = TrainingInput(val_s3_path, content_type='text/csv')
inputs = {'train':train_input, 'validation':val_input}

# Set the hyperparameters
hyperparameters = {
        'objective':'reg:squarederror',
        'max_depth':'6',
        'eta':'0.3',
        'alpha':'3',
        'colsample_bytree':'0.7',
        'num_round':'100'}

# Set the instance count, instance type, volume size, options to use Spot instances and other parameters
train_instance_count = 1
train_instance_type = 'ml.m5.xlarge'
train_instance_volume_size_in_gb = 5
#use_spot_instances = True
#spot_max_wait_time_in_seconds = 5400
use_spot_instances = False
spot_max_wait_time_in_seconds = None
max_run_time_in_seconds = 3600
algorithm_name = 'xgboost'
algorithm_version = '1.2-1'
py_version = 'py37'
# Get the container image URI for the specified parameters
container_image_uri = sagemaker.image_uris.retrieve(framework=algorithm_name,
                                                    region=region_name,
                                                    version=algorithm_version,
                                                    py_version=py_version,
                                                    instance_type=train_instance_type,
                                                    image_scope='training')

# Set the training container related parameters
container_log_level = logging.INFO

# Location where the model checkpoints will be stored locally in the container before being uploaded to S3
model_checkpoint_local_dir = '/opt/ml/checkpoints/'

# Location where the trained model will be stored locally in the container before being uploaded to S3
model_local_dir = '/opt/ml/model'
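
If you switch to Spot instances, the max wait time has to cover the max run time, as noted in the parameter list above.  The following sketch checks that relationship before a training job is submitted.  The helper `validate_spot_settings` is hypothetical and the check assumes that a wait time equal to the run time is acceptable.

```python
def validate_spot_settings(use_spot, max_run_seconds, max_wait_seconds):
    """Raise ValueError if the Spot wait time cannot cover the run time."""
    if not use_spot:
        # The wait time only matters for Spot training
        return
    if max_wait_seconds is None:
        raise ValueError('max wait must be set when Spot instances are used')
    if max_wait_seconds < max_run_seconds:
        raise ValueError('max wait ({}s) is shorter than max run ({}s)'.format(
            max_wait_seconds, max_run_seconds))


# The on-demand settings used in this notebook pass the check
validate_spot_settings(False, 3600, None)

# The commented-out Spot settings from the cell above also pass
validate_spot_settings(True, 3600, 5400)

# A wait time shorter than the run time is rejected
try:
    validate_spot_settings(True, 3600, 1800)
    spot_check_failed = False
except ValueError as error:
    spot_check_failed = True
    print('Rejected: {}'.format(error))
```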

### B) (Optional) Delete previous checkpoints <a id='(Optional)%20Delete%20previous%20checkpoints'></a>

If model checkpoints from previous trainings are found in the S3 checkpoint location specified in the previous step, then training will resume from those checkpoints.  In order to start a fresh training, run the following code cell to delete all checkpoint objects from S3.

In [None]:
# Delete the checkpoints if you want to train from the beginning; else ignore this code cell
for checkpoint_file in s3_bucket_resource.objects.filter(Prefix='{}/checkpoint/'.format(nb_name)):
    checkpoint_file_key = checkpoint_file.key
    print('Deleting {} ...'.format(checkpoint_file_key))
    s3_resource.Object(s3_bucket_resource.name, checkpoint_file_key).delete()

### C) Run the training job <a id='Run%20the%20training%20job'></a>

Prepare the `estimator` and call the `fit()` method.  This will pull the container containing the specified version of the algorithm in the AWS region and run the training job in the specified type of EC2 instance(s).  The training data will be pulled from the specified location in S3 and training results and checkpoints will be written to the specified locations in S3.

Note: SageMaker Debugger is disabled.

In [None]:
# Create the estimator
estimator = sagemaker.estimator.Estimator(
    image_uri=container_image_uri,
    checkpoint_local_path=model_checkpoint_local_dir,
    checkpoint_s3_uri=model_checkpoint_s3_path,
    model_dir=model_local_dir,
    output_path=model_output_s3_path,
    instance_type=train_instance_type,
    instance_count=train_instance_count,
    use_spot_instances=use_spot_instances,
    max_wait=spot_max_wait_time_in_seconds,
    max_run=max_run_time_in_seconds,
    hyperparameters=hyperparameters,
    role=sagemaker.get_execution_role(),
    base_job_name=train_job_name,
    framework_version=algorithm_version,
    py_version=py_version,
    container_log_level=container_log_level,
    script_mode=False,
    debugger_hook_config=False,
    disable_profiler=True)

# Perform the training
estimator.fit(inputs, wait=True)

##  4. Create and push the Docker container to an Amazon ECR repository <a id='Create%20and%20push%20the%20Docker%20container%20to%20an%20Amazon%20ECR%20repository'></a>

In this step, we will create a Docker container containing the generated model along with its dependencies.  If you bring a pre-trained model, you can upload it to S3 and use it to build the container.  The following steps contain instructions for doing so.

### A) Retrieve the model pickle file <a id='Retrieve%20the%20model%20pickle%20file'></a>

* The model file generated using SageMaker's [built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) will be a Python pickle file zipped up in a tar file named `model.tar.gz`.  The S3 URI for this file will be available in the `model_data` attribute of the `estimator` object created in the training step.

* If you bring your pre-trained model, you have to specify the S3 URI appropriately in the following cell.

* The tar file needs to be downloaded from S3 and extracted.

* The name of the extracted pickle file will depend on the framework and algorithm that was used to train the model.  In this notebook example, we have used SageMaker's [built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) and so the pickle file will be named `xgboost-model`.  You will see this when the model tar file is extracted.

In [None]:
# Create the container artifacts directory if it doesn't exist
os.makedirs(container_artifacts_dir, exist_ok=True)

# Set the file paths
model_tar_file_s3_path_suffix = '{}/output/{}/output/{}'.format(nb_name,
                                                                estimator.latest_training_job.name,
                                                                model_tar_file_name)
model_tar_file_local_path = '{}/{}'.format(container_artifacts_dir, model_tar_file_name)
extracted_model_file_local_path = '{}/{}'.format(container_artifacts_dir, extracted_model_file_name)

# Delete old model files if they exist
if os.path.exists(model_tar_file_local_path):
    os.remove(model_tar_file_local_path)
if os.path.exists(extracted_model_file_local_path):
    os.remove(extracted_model_file_local_path)

# Download the model tar file from S3
s3_bucket_resource.download_file(model_tar_file_s3_path_suffix, model_tar_file_local_path)

# Extract the model tar file and retrieve the model pickle file
with tarfile.open(model_tar_file_local_path, "r:gz") as tar:
    tar.extractall(path=container_artifacts_dir)
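
As a variation on the cell above, the following is a minimal sketch that extracts only the expected model file instead of every member of the archive.  It assumes the archive contains a top-level regular file whose name matches `extracted_model_file_name`; the helper name `extract_member` is hypothetical and is not part of this notebook.

```python
import os
import shutil
import tarfile


def extract_member(tar_path, member_name, dest_dir):
    """Extract one named regular file from a gzipped tar file into dest_dir.

    Returns the path of the extracted file.  Raises KeyError if the member is
    not in the archive and ValueError if it is not a regular file.
    """
    os.makedirs(dest_dir, exist_ok=True)
    # Write to a fixed destination instead of trusting paths stored in the archive
    dest_path = os.path.join(dest_dir, os.path.basename(member_name))
    with tarfile.open(tar_path, 'r:gz') as tar:
        member = tar.getmember(member_name)  # raises KeyError if absent
        if not member.isfile():
            raise ValueError('{} is not a regular file'.format(member_name))
        with tar.extractfile(member) as src, open(dest_path, 'wb') as dst:
            shutil.copyfileobj(src, dst)
    return dest_path


# Hypothetical usage with the variables defined in this notebook:
# extract_member(model_tar_file_local_path, extracted_model_file_name, container_artifacts_dir)
```

Extracting a single named member avoids writing unexpected files when the archive comes from a source you do not control, such as a pre-trained model supplied by a third party.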

### B) (Optional) Test the model pickle file <a id='(Optional)%20Test%20the%20model%20pickle%20file'></a>

The code in the following cell entirely depends on the framework and algorithm that was used to train the model.  The extracted Python3 pickle file will contain the appropriate object name.  If you are bringing your own model file, you have to change this cell appropriately.

In [None]:
# Load the model pickle file as a pickle object
pickle_file_path = extracted_model_file_local_path
with open(pickle_file_path, 'rb') as pkl_file:
    model = pickle.load(pkl_file)

# Run a prediction against the model loaded as a pickle object
# by sending the first record of the test dataset
test_pred_x_df = pd.read_csv(StringIO(','.join(map(str, x_test[0]))), sep=',', header=None)
test_pred_x = xgb.DMatrix(test_pred_x_df.values)
print('Input for prediction = {}'.format(test_pred_x_df.values))
print('Predicted value = {}'.format(model.predict(test_pred_x)[0]))
print('Actual value = {}'.format(y_test.values[0][0]))
print('Note: There may be a huge difference between the actual and predicted values as the model has not been tuned in the training step.')

### C) View the inference script <a id='View%20the%20inference%20script'></a>

The inference script is a Python3 [Flask](https://flask.palletsprojects.com/en/1.1.x/) app script that contains the following logic:
* Initialize the Flask web app server.
* Load the ML model pickle object into memory.
* Run the Flask web app server.
* Parse the request sent to the web app server either from direct invocation or from a REST/HTTP API in Amazon API Gateway.
* Run the prediction.
* Format the response to match with the parameter specified in the request.
* Return the response.
* Implement the healthcheck logic to return a success on invocation.  This has to be called by the service hosting this container to perform health checks.

The request should be in the following format:

`{
  "response_content_type": "<Specify either text/plain or application/json>",
  "pred_x_csv": "<The comma-separated x column values to be used for prediction>"
}`

This script will be packaged into the container that will be built in the upcoming steps.

You can view the script by running the following code cell.

In [None]:
# View the Python3 Flask script (containing the inference code)
!cat {container_script_file}
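
The cell above prints the actual inference script.  As a rough illustration of the parse and format steps described in this section, here is a sketch written under stated assumptions: the function names, the default content type and the `predicted_value` response key are hypothetical and may differ from what the packaged script does.

```python
import json

SUPPORTED_CONTENT_TYPES = ('text/plain', 'application/json')


def parse_prediction_request(body):
    """Parse a JSON request body into (response_content_type, feature_values)."""
    payload = json.loads(body)
    # Assumption: default to JSON when the caller does not specify a type
    content_type = payload.get('response_content_type', 'application/json')
    if content_type not in SUPPORTED_CONTENT_TYPES:
        raise ValueError('Unsupported response_content_type: {}'.format(content_type))
    # Convert the comma-separated x column values to floats
    features = [float(value) for value in payload['pred_x_csv'].split(',')]
    return content_type, features


def format_prediction_response(content_type, predicted_value):
    """Format the prediction to match the content type requested by the caller."""
    if content_type == 'text/plain':
        return str(predicted_value)
    # Assumption: the JSON response carries a single 'predicted_value' key
    return json.dumps({'predicted_value': predicted_value})
```

Keeping parsing and formatting in plain functions like these makes them easy to unit test without starting the Flask server.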

### D) Create the Dockerfile <a id='Create%20the%20Dockerfile'></a>

In this step, we will create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) which is required to build our [Docker](https://www.docker.com/) container containing the model pickle file, an inference script and its dependencies.

In order to create the container, we will use the [Amazon Linux 2 container image](https://gallery.ecr.aws/amazonlinux/amazonlinux) available in the [Amazon ECR public registry](https://aws.amazon.com/ecr/) as the base image.  As this is a public registry, you do not require any credentials or permissions to download it.

Note: At the time of writing this notebook, this image was based on [Amazon Linux 2](https://aws.amazon.com/amazon-linux-2/).  Depending on the specific version you intend to use, you can suffix the container image URL with that version after the `:` character in the following cell.

In [None]:
# Copy the inference script and requirements.txt to the container-artifacts directory
!cp -pr {container_script_file} {container_artifacts_dir}/server.py
!cp -pr {container_script_req_file} {container_artifacts_dir}/requirements.txt

# Create the Dockerfile content
dockerfile_content_lines = []
dockerfile_content_lines.append('# syntax=docker/dockerfile:1\n\n')
dockerfile_content_lines.append('# Use Amazon Linux 2 as the base image\n')
dockerfile_content_lines.append('FROM public.ecr.aws/amazonlinux/amazonlinux:latest\n\n')
dockerfile_content_lines.append('# Setup the working directory\n')
dockerfile_content_lines.append('WORKDIR /\n\n')
dockerfile_content_lines.append('# Install Python3\n')
dockerfile_content_lines.append('RUN yum -y install python3\n\n')
dockerfile_content_lines.append('# Upgrade pip\n')
dockerfile_content_lines.append('RUN pip3 install --upgrade pip\n\n')
dockerfile_content_lines.append('# Setup the Python virtual env to run the inference script\n')
dockerfile_content_lines.append('RUN python3 -m venv /opt/appenv\n\n')
dockerfile_content_lines.append('# Install the Python packages required for the inference script in the virtual env\n')
dockerfile_content_lines.append('COPY requirements.txt .\n')
dockerfile_content_lines.append('RUN /opt/appenv/bin/pip install -r requirements.txt\n\n')
dockerfile_content_lines.append('# Copy the extracted model file and the inference script\n')
dockerfile_content_lines.append('COPY ')
dockerfile_content_lines.append(extracted_model_file_name)
dockerfile_content_lines.append(' ./\n')
dockerfile_content_lines.append('COPY server.py ./\n\n')
dockerfile_content_lines.append('# Specify the ENV variables\n')
dockerfile_content_lines.append('ENV MODEL_PICKLE_FILE_PATH=')
dockerfile_content_lines.append(extracted_model_file_name)
dockerfile_content_lines.append('\n')
dockerfile_content_lines.append('ENV FLASK_SERVER_LOG_LEVEL=DEBUG\n')
dockerfile_content_lines.append('ENV FLASK_SERVER_HOSTNAME=0.0.0.0\n')
dockerfile_content_lines.append('ENV FLASK_SERVER_PORT=')
dockerfile_content_lines.append(str(ecs_container_port))
dockerfile_content_lines.append('\n')
dockerfile_content_lines.append('ENV FLASK_SERVER_DEBUG=True\n\n')
dockerfile_content_lines.append('# Specify the command to run the inference script as a Flask app\n')
dockerfile_content_lines.append('ENTRYPOINT ["/opt/appenv/bin/python", "server.py"]')

# Create the Dockerfile
dockerfile_local_path = '{}/Dockerfile'.format(container_artifacts_dir)
with open(dockerfile_local_path, 'wt') as file:
    file.write(''.join(dockerfile_content_lines))
    
# Print the contents of the generated Dockerfile
!cat {dockerfile_local_path}
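
The same Dockerfile content can also be produced from a single template string, which some readers may find easier to review than a long series of `append` calls.  This is only a sketch: `DOCKERFILE_TEMPLATE` and `render_dockerfile` are hypothetical names, and the instructions simply mirror the ones generated by the cell above.

```python
from string import Template

# Placeholders: ${model_file} is the extracted model file name, ${port} the container port
DOCKERFILE_TEMPLATE = Template('''\
# syntax=docker/dockerfile:1

# Use Amazon Linux as the base image
FROM public.ecr.aws/amazonlinux/amazonlinux:latest

WORKDIR /
RUN yum -y install python3
RUN pip3 install --upgrade pip
RUN python3 -m venv /opt/appenv

# Install the Python packages required for the inference script
COPY requirements.txt .
RUN /opt/appenv/bin/pip install -r requirements.txt

# Copy the extracted model file and the inference script
COPY ${model_file} ./
COPY server.py ./

ENV MODEL_PICKLE_FILE_PATH=${model_file}
ENV FLASK_SERVER_LOG_LEVEL=DEBUG
ENV FLASK_SERVER_HOSTNAME=0.0.0.0
ENV FLASK_SERVER_PORT=${port}
ENV FLASK_SERVER_DEBUG=True

ENTRYPOINT ["/opt/appenv/bin/python", "server.py"]
''')


def render_dockerfile(model_file_name, container_port):
    """Return the Dockerfile text for the given model file name and port."""
    return DOCKERFILE_TEMPLATE.substitute(model_file=model_file_name,
                                          port=str(container_port))


# Hypothetical usage with the variables defined in this notebook:
# dockerfile_text = render_dockerfile(extracted_model_file_name, ecs_container_port)
```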

### E) Create the container <a id='Create%20the%20container'></a>

Create the Docker container using the `docker build` command.  Specify the container image name and point to the container-artifacts directory that contains all the files to build the container.

Note: You may see warning messages when the container is built with the Dockerfile that we created in the prior step.  These warnings will be around installing the Python packages that are required by the inference script.  You can choose to either ignore or fix them.

In [None]:
# Create the Docker container
!docker build -t {container_image_name} {container_artifacts_dir}

### F) Create the private repository in ECR <a id='Create%20the%20private%20repository%20in%20ECR'></a>

In order to configure Amazon ECS to run a container, the container image should exist in a container registry.  In this notebook, we will create and use an [Amazon ECR](https://aws.amazon.com/ecr/) private repository for this purpose.

In this step, we will check if the private repository in Amazon ECR that we intend to create already exists or not.  If it does not exist, we will create it with the repository name the same as the container image name.

Note: When creating the repository, setting the `scanOnPush` parameter to `True` will automatically initiate a vulnerability scan on the container image that is pushed to the repository.  For more info on image scanning, refer [here](https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-scanning.html).

In [None]:
# Check if the ECR repository exists already; if not, then create it
try:
    ecr_client.describe_repositories(repositoryNames=[container_image_name])
    print('ECR repository {} already exists.'.format(container_image_name))
except ecr_client.exceptions.RepositoryNotFoundException:
    print('ECR repository {} does not exist.'.format(container_image_name))
    print('Creating ECR repository {}...'.format(container_image_name))
    # Create the ECR repository - here we use the container image name for the repository name
    ecr_client.create_repository(repositoryName=container_image_name,
                                 imageScanningConfiguration={
                                     'scanOnPush': True
                                 })
    print('Completed creating ECR repository {}.'.format(container_image_name))

### G) Push the container to ECR <a id='Push%20the%20container%20to%20ECR'></a>

In this step, we will push the container image to the private repository that we created in Amazon ECR.

When using an Amazon ECR private registry, you must authenticate your Docker client to your private registry so that you can use the `docker push` and `docker pull` commands to push and pull images to and from the repositories in that registry.  For more information about this, refer [here](https://docs.aws.amazon.com/AmazonECR/latest/userguide/registry_auth.html).

1. If this notebook instance is running on Amazon Linux v1, the authentication happens through an authorization token generated by an AWS CLI command in the following code cell.  This token will be automatically deleted when the code cell completes execution.
2. If this notebook instance is running on Amazon Linux v2, the authentication happens through temporary credentials generated based on the IAM role attached to this notebook.  For this, you have to complete the prerequisite mentioned in the first step of this notebook.

In [None]:
# Set the image names
source_image_name = '{}:latest'.format(container_image_name)
target_image_name = '{}/{}:latest'.format(container_registry_url_prefix, container_image_name)

if os_version == 'ALv1':
    # Get the private registry credentials using an authorization token
    !aws ecr get-login-password --region {region_name} | docker login --username AWS --password-stdin {container_registry_url_prefix}

# Tag the container
!docker tag {source_image_name} {target_image_name}

# Push the container to the specified registry in Amazon ECR
!docker push {target_image_name}

if os_version == 'ALv1':
    # Delete the Docker credentials file
    print('\nDeleting the generated Docker credentials file...')
    !rm /home/ec2-user/.docker/config.json
    print('Completed deleting the generated Docker credentials file.')
    # Verify the delete
    print('Verifying the delete of the generated Docker credentials file...')
    !cat /home/ec2-user/.docker/config.json
    print('Completed verifying the delete of the generated Docker credentials file.')
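
For reference, the target image name used above follows the usual Amazon ECR private registry naming pattern.  The sketch below shows how such a name could be assembled, assuming a standard commercial AWS region (other partitions, such as the China regions, use a different domain suffix).  The helper `ecr_image_uri` is hypothetical; in this notebook the registry prefix comes from `container_registry_url_prefix`, which is set in an earlier step.

```python
def ecr_image_uri(account_id, region_name, repository_name, tag='latest'):
    """Build the full image URI for an Amazon ECR private repository."""
    account_id = str(account_id)
    # AWS account IDs are 12-digit numbers
    if len(account_id) != 12 or not account_id.isdigit():
        raise ValueError('account_id must be a 12-digit string')
    registry = '{}.dkr.ecr.{}.amazonaws.com'.format(account_id, region_name)
    return '{}/{}:{}'.format(registry, repository_name, tag)
```

For example, `ecr_image_uri('123456789012', 'us-east-1', 'my-image')` would point at the `my-image` repository with the `latest` tag in that placeholder account.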

##  5. Deploy and test on Amazon ECS on AWS Fargate <a id='Deploy%20and%20test%20on%20Amazon%20ECS%20on%20AWS%20Fargate'></a>

In this step, we will create an [Amazon ECS cluster](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/clusters.html), deploy the Docker container that was created in the previous step as a [task](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definitions.html) on [Amazon ECS on AWS Fargate](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html) and test it.

### A) Create the ECS cluster <a id='Create%20the%20ECS%20cluster'></a>

In this step, we will check if the ECS cluster that we intend to create already exists or not.  If it does not exist, we will create it.

Note:

* We have not configured this cluster to use an [Amazon VPC](https://aws.amazon.com/vpc) for networking.  If you require it, refer to the instructions [here](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create_cluster.html).
* Sometimes, after the `delete_cluster` API is invoked on an ECS cluster, the cluster can go into 'INACTIVE' state and may remain discoverable in your AWS account for a period of time.  You may not see this in the AWS console.  When this happens, you won't be able to use the cluster.  For more information on this, refer [here](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ecs.html#ECS.Client.delete_cluster).

In [None]:
# Check if the Amazon ECS cluster exists already; if not, then create it
describe_ecs_cluster_response = ecs_client.describe_clusters(clusters=[ecs_cluster_name])
ecs_cluster_failures = describe_ecs_cluster_response['failures']
if ecs_cluster_failures and ecs_cluster_failures[0]['reason'] == 'MISSING':
    # The cluster was not found, so create it
    print('ECS cluster {} does not exist.'.format(ecs_cluster_name))
    print('Creating ECS cluster {}...'.format(ecs_cluster_name))
    create_ecs_cluster_response = ecs_client.create_cluster(clusterName=ecs_cluster_name)
    print('ECS cluster status = {}'.format(create_ecs_cluster_response['cluster']['status']))
elif describe_ecs_cluster_response['clusters']:
    # The cluster was found, so report its current status
    ecs_cluster_status = describe_ecs_cluster_response['clusters'][0]['status']
    print('ECS cluster \'{}\' already exists and is in status \'{}\'.'.format(ecs_cluster_name, ecs_cluster_status))

In [None]:
# Sleep every 10 seconds and print the status of the ECS cluster until it goes to ACTIVE, INACTIVE or FAILED state
while True:
    describe_ecs_cluster_response = ecs_client.describe_clusters(clusters=[ecs_cluster_name])
    ecs_cluster_status = describe_ecs_cluster_response['clusters'][0]['status']
    print('ECS cluster status = {}'.format(ecs_cluster_status))
    if ecs_cluster_status in {'ACTIVE', 'INACTIVE', 'FAILED'}:
        break
    time.sleep(10)
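
The polling loop above runs until the cluster reaches a terminal state.  If you prefer an upper bound on the wait, the following sketch wraps the same idea in a reusable helper with a timeout.  The name `wait_for_status` and its parameters are hypothetical, and the default interval and timeout are arbitrary.

```python
import time


def wait_for_status(get_status, terminal_statuses, interval_seconds=10,
                    timeout_seconds=300, sleep=time.sleep, clock=time.monotonic):
    """Call get_status() repeatedly until it returns one of terminal_statuses.

    Returns the terminal status, or raises TimeoutError if the deadline passes.
    The sleep and clock arguments exist so the helper can be tested without waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        print('Status = {}'.format(status))
        if status in terminal_statuses:
            return status
        if clock() >= deadline:
            raise TimeoutError('Timed out waiting for one of {}'.format(sorted(terminal_statuses)))
        sleep(interval_seconds)


# Hypothetical usage with the variables defined in this notebook:
# wait_for_status(
#     lambda: ecs_client.describe_clusters(clusters=[ecs_cluster_name])['clusters'][0]['status'],
#     {'ACTIVE', 'INACTIVE', 'FAILED'})
```

The same helper could be reused for the task status check later in this notebook by passing a different `get_status` function.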

### B) Create the ECS Task and deploy the container <a id='Create%20the%20ECS%20Task%20and%20deploy%20the%20container'></a>

In this step, we will create a [task](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definitions.html) on [Amazon ECS on AWS Fargate](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html) and deploy the container.

The following configuration will be used:

* Fargate launch type.  For details on how to configure this, refer [here](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html).
* The container location will be the Amazon ECR private registry that we created in prior steps.
* The container port and host port will be what we configured in prior steps.
* Protocol will be TCP.
* `awslogs` driver will be used to send the logs to [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/).
* Healthcheck will be configured to invoke the healthcheck URL path suffix configured in the inference script's Flask app running in the container.
* CPU and memory settings will be what we configured in prior steps.
* The cluster name, container count, VPC subnets and Security Groups will be what we configured in prior steps.
* Auto-assign Public IP address to the Task.
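
The healthcheck command in this configuration comes from `ecs_container_healthcheck_command_list`, which is set earlier in the notebook.  As an illustration of the list format that ECS accepts for a container health check, here is a sketch that builds such a command.  The helper name and the `/healthcheck` path are assumptions; use the path that your inference script exposes, and note that this form requires `curl` to be available inside the container image.

```python
def build_healthcheck_command(port, path='/healthcheck'):
    """Build an ECS container healthCheck command that probes a local HTTP path."""
    if not path.startswith('/'):
        path = '/' + path
    url = 'http://localhost:{}{}'.format(port, path)
    # CMD-SHELL runs the string with the container's default shell;
    # a non-zero exit code marks the container as unhealthy
    return ['CMD-SHELL', 'curl -f {} || exit 1'.format(url)]
```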

In [None]:
# Register the ECS Fargate task definition
ecs_register_task_definition_response = ecs_client.register_task_definition(family=ecs_fargate_task_name,
                                                                            taskRoleArn=ecs_fargate_task_role,
                                                                            executionRoleArn=ecs_fargate_task_execution_role,
                                                                            networkMode='awsvpc',
                                                                            containerDefinitions=[{
                                                                                'name':ecs_container_name,
                                                                                'image':target_image_name,
                                                                                'portMappings': [{
                                                                                    'containerPort':ecs_container_port,
                                                                                    'hostPort':ecs_container_host_port,
                                                                                    'protocol':'tcp',
                                                                                }],
                                                                                'logConfiguration': {
                                                                                    'logDriver':'awslogs',
                                                                                    'options': {
                                                                                        'awslogs-create-group':'true',
                                                                                        'awslogs-region':region_name,
                                                                                        'awslogs-group':'/ecs/{}'.format(ecs_fargate_task_name),
                                                                                        'awslogs-stream-prefix':'ecs'
                                                                                    }
                                                                                },
                                                                                'healthCheck': {
                                                                                    'command':ecs_container_healthcheck_command_list,
                                                                                    'interval':ecs_container_healthcheck_interval_in_seconds,
                                                                                    'timeout':ecs_container_healthcheck_timeout_in_seconds
                                                                                }
                                                                            }],
                                                                            requiresCompatibilities=[
                                                                                'FARGATE'
                                                                            ],
                                                                            cpu=ecs_fargate_task_cpu,
                                                                            memory=ecs_fargate_task_memory
                                                                           )


# Print the task definition ARN
ecs_fargate_task_definiton_arn = ecs_register_task_definition_response['taskDefinition']['taskDefinitionArn']
print('ECS Fargate Task definition ARN = {}'.format(ecs_fargate_task_definiton_arn))
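
Fargate accepts only certain combinations of task-level CPU and memory, so it can help to check the configured values before registering the task definition.  The sketch below encodes the combinations as I understand them from the Amazon ECS documentation at the time of writing; treat the table as an assumption and confirm it against the current documentation.  The names `FARGATE_MEMORY_BY_CPU` and `is_valid_fargate_size` are hypothetical.

```python
# CPU units (1024 = 1 vCPU) mapped to the allowed memory values in MiB
FARGATE_MEMORY_BY_CPU = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def is_valid_fargate_size(cpu_units, memory_mib):
    """Return True if the CPU and memory pair is an allowed Fargate task size.

    Accepts integers or numeric strings such as '256' and '512'.
    """
    return int(memory_mib) in FARGATE_MEMORY_BY_CPU.get(int(cpu_units), [])


# Hypothetical usage, assuming the notebook variables hold plain numeric values:
# assert is_valid_fargate_size(ecs_fargate_task_cpu, ecs_fargate_task_memory)
```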

The ECS cluster created in the previous step will take a few seconds to go into ACTIVE state.  Wait until then and proceed to the next step.

In [None]:
# Run the ECS Fargate task
ecs_run_task_response = ecs_client.run_task(cluster=ecs_cluster_name,
                                            count=ecs_fargate_task_count,
                                            launchType='FARGATE',
                                            networkConfiguration={
                                                'awsvpcConfiguration':{
                                                    'subnets':ecs_fargate_task_subnet_list,
                                                    'securityGroups':ecs_fargate_task_security_group_list,
                                                    'assignPublicIp': 'ENABLED'
                                                }
                                            },
                                            taskDefinition=ecs_fargate_task_name)

# Print the task ARN
ecs_fargate_task_id = ecs_run_task_response['tasks'][0]['taskArn']
print('ECS Fargate Task ARN = {}'.format(ecs_fargate_task_id))
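
The cell above assumes that `run_task` started at least one task.  When ECS cannot place a task, the response carries an empty `tasks` list and a `failures` list instead, and indexing `tasks[0]` raises an `IndexError`.  The following sketch shows one way to surface the failure reasons; the helper name `first_task_arn` is hypothetical.

```python
def first_task_arn(run_task_response):
    """Return the ARN of the first started task, or raise with the failure reasons."""
    tasks = run_task_response.get('tasks', [])
    if tasks:
        return tasks[0]['taskArn']
    # Collect the reasons reported by ECS for each failed placement
    reasons = ['{}: {}'.format(failure.get('arn', 'unknown'), failure.get('reason', 'unknown'))
               for failure in run_task_response.get('failures', [])]
    raise RuntimeError('run_task started no tasks. Failures: {}'.format(
        '; '.join(reasons) or 'none reported'))


# Hypothetical usage with the response from the cell above:
# ecs_fargate_task_id = first_task_arn(ecs_run_task_response)
```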

### C) Prepare to test the ECS Task <a id='Prepare%20to%20test%20the%20ECS%20Task'></a>

Follow these steps to prepare to test the ECS Task:

1. **Wait for the Fargate task state to go to `RUNNING`** - after the previous step runs successfully, it will take a few minutes for the ECS Fargate Task to go into the `RUNNING` state.  You can check the state either using the AWS CLI/API/SDK or by going into the [AWS console](https://console.aws.amazon.com/ecs/home).  In the ECS console page for the AWS region, navigate to the specific ECS cluster that was created by this notebook and go to the Tasks tab.  There, you can check the state of your ECS Fargate Task.

In [None]:
# Sleep every 5 seconds and print the status of the ECS Task until it goes to RUNNING or STOPPED state
while True:
    ecs_describe_tasks_response = ecs_client.describe_tasks(cluster=ecs_cluster_name,
                                                            tasks=[ecs_fargate_task_id])
    ecs_task_status = ecs_describe_tasks_response['tasks'][0]['lastStatus']
    print('ECS Fargate Task status = {}'.format(ecs_task_status))
    if ecs_task_status in {'RUNNING', 'STOPPED'}:
        break
    time.sleep(5)

2. **Retrieve the Public IP address of this notebook instance** - if you intend to test the ECS Task from this notebook, then you will require the Public IP address of this notebook instance to configure in the next step.  You can retrieve it by running the following code cell.

In [None]:
# Print the Public IP address of this notebook instance
!curl ifconfig.me

3. **Configure the Security Group on the ECS Task** - set up an Inbound Rule in the ECS Task's Security Group to allow access from the IP address of the system from which you are going to test the ECS Task.  This rule should be for the HTTP protocol on the configured port.  In this notebook, we have configured a Public IP address on the ECS Task on port 80.  We will test the ECS Task from this notebook, so make sure the rule uses the notebook instance's Public IP address retrieved in the previous step as its source.
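
If you would rather add the inbound rule with the SDK than in the console, the following sketch shows one possible approach using the EC2 `authorize_security_group_ingress` API.  The helper names are hypothetical, the rule is limited to a single IPv4 address, and the Security Group ID is something you would need to supply.

```python
import ipaddress


def build_ingress_permission(source_ip, port=80, description='Notebook test access'):
    """Build one IpPermissions entry that allows TCP traffic from a single IPv4 address."""
    address = ipaddress.ip_address(source_ip.strip())  # raises ValueError if malformed
    if address.version != 4:
        raise ValueError('This sketch only handles IPv4 addresses')
    return {
        'IpProtocol': 'tcp',
        'FromPort': port,
        'ToPort': port,
        # A /32 CIDR restricts the rule to exactly one address
        'IpRanges': [{'CidrIp': '{}/32'.format(address), 'Description': description}],
    }


def allow_inbound(ec2_client, security_group_id, source_ip, port=80):
    """Add the inbound rule to the given Security Group."""
    return ec2_client.authorize_security_group_ingress(
        GroupId=security_group_id,
        IpPermissions=[build_ingress_permission(source_ip, port)])


# Hypothetical usage: the group ID and IP address below are placeholders
# allow_inbound(ec2_client, 'sg-0123456789abcdef0', '203.0.113.10')
```

Restricting the source to a single address keeps the task from being reachable by the whole internet while you test it.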

### D) Test the ECS Task <a id='Test%20the%20ECS%20Task'></a>

In this step, we will test the ECS Task that we created in the previous step by invoking it synchronously.  For this, we will invoke the Python3 [Flask](https://flask.palletsprojects.com/en/1.1.x/) app script running in the container by using the Public IP address of the ECS Task.

1. Retrieve the Public IP address of the ECS Task.
2. Invoke the endpoint by making an HTTP POST call with the first record of the test dataset as a CSV string.
    The request should be in the following format:
    `{
      "response_content_type": "<Specify either text/plain or application/json>",
      "pred_x_csv": "<The comma-separated x column values to be used for prediction>"
    }`

In [None]:
# Retrieve the ECS Task details
ecs_describe_tasks_response = ecs_client.describe_tasks(cluster=ecs_cluster_name,
                                                       tasks=[ecs_fargate_task_id])

# Retrieve the Public IP address of the ECS Task
ecs_task_attachments = ecs_describe_tasks_response['tasks'][0]['attachments']
for ecs_task_attachment in ecs_task_attachments:
    if ecs_task_attachment['type'] == 'ElasticNetworkInterface':
        ecs_task_attachment_details = ecs_task_attachment['details']
        for ecs_task_attachment_detail in ecs_task_attachment_details:
            if ecs_task_attachment_detail['name'] == 'networkInterfaceId':
                ecs_task_nid = ecs_task_attachment_detail['value']
                describe_network_interfaces_response = ec2_client.describe_network_interfaces(NetworkInterfaceIds=[
                    ecs_task_nid
                ])
                ecs_fargate_task_public_ip = describe_network_interfaces_response['NetworkInterfaces'][0]['Association']['PublicIp']
                
# Print the Public IP address of the ECS Task
print('ECS Task Public IP address = {}'.format(ecs_fargate_task_public_ip))

In [None]:
# Set the request payload (json.dumps takes care of quoting and escaping)
import json
x_test_request_payload_csv = ','.join(map(str, x_test[0]))
x_test_request_payload = json.dumps({'response_content_type': 'application/json',
                                     'pred_x_csv': x_test_request_payload_csv})
# Print the request
print('Request payload:\n')
print(x_test_request_payload)

# Invoke the ECS Task and print the response
ecs_fargate_task_public_url = 'http://{}:80/'.format(ecs_fargate_task_public_ip)
print('\nResponse:\n')
!curl -X POST -H 'Content-Type: application/json' --data '{x_test_request_payload}' {ecs_fargate_task_public_url}
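
The same call can be made from Python instead of `curl`, which makes it easier to work with the response in later cells.  This is a sketch using only the standard library; the helper names `build_prediction_request` and `invoke` are hypothetical, and the 10 second timeout is an arbitrary choice.

```python
import json
import urllib.request


def build_prediction_request(base_url, feature_values, response_content_type='application/json'):
    """Build the HTTP POST request in the format expected by the inference script."""
    payload = {
        'response_content_type': response_content_type,
        'pred_x_csv': ','.join(map(str, feature_values)),
    }
    return urllib.request.Request(base_url,
                                  data=json.dumps(payload).encode('utf-8'),
                                  headers={'Content-Type': 'application/json'},
                                  method='POST')


def invoke(request, timeout_seconds=10):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read().decode('utf-8')


# Hypothetical usage with the variables defined in this notebook:
# print(invoke(build_prediction_request(ecs_fargate_task_public_url, x_test[0])))
```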

##  6. (Optional) Front-end the container with Amazon API Gateway <a id='(Optional)%20Front-end%20the%20container%20with%20Amazon%20API%20Gateway'></a>

For some use cases, you may prefer to front-end the inference as a service on Amazon ECS on AWS Fargate with [Amazon API Gateway](https://docs.aws.amazon.com/apigateway/latest/developerguide/welcome.html).  With this setup, you can serve the model inference as an API with an HTTPS endpoint.  Prior to setting up the API, you have to create an [Amazon ECS Service](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html) made up of multiple tasks and then set up [Load Balancing](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-load-balancing.html) and [Auto Scaling](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html).

For the API, you have the following options to choose from:
* [HTTP API](https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api.html)
* [REST API](https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-rest-api.html)

For guidance on choosing the right API option, refer [here](https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api-vs-rest.html).

For information on setting up the container as the backend for Amazon API Gateway, refer [here](https://docs.aws.amazon.com/apigateway/latest/developerguide/setup-http-integrations.html).

Note: The container that we created in prior steps has the logic to handle both REST and HTTP API requests from the Amazon API Gateway assuming the gateway passes through the request payload as-is to the backend container.
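
As one possible starting point for the HTTP API option, the sketch below uses the API Gateway V2 "quick create" feature, which creates an API with a default route and stage that proxy requests to a target URL.  It assumes the backend URL is publicly reachable (for example, the DNS name of a load balancer in front of the ECS Service); a private backend would need a VPC link instead.  The helper name and the example values are hypothetical.

```python
def create_http_api_for_backend(apigatewayv2_client, api_name, backend_url):
    """Create an HTTP API that proxies all requests to backend_url.

    Returns the invoke URL of the new API.
    """
    if not backend_url.startswith(('http://', 'https://')):
        raise ValueError('backend_url must be a fully qualified HTTP or HTTPS URL')
    response = apigatewayv2_client.create_api(Name=api_name,
                                              ProtocolType='HTTP',
                                              Target=backend_url)
    return response['ApiEndpoint']


# Hypothetical usage: the API name and backend URL below are placeholders
# import boto3
# api_endpoint = create_http_api_for_backend(boto3.client('apigatewayv2'),
#                                            'xgboost-inference-api',
#                                            'http://my-load-balancer.example.com/')
```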

## 7. Cleanup <a id='Cleanup'></a>

As a best practice, you should delete resources and S3 objects when they are no longer required.  This will help you avoid incurring unnecessary costs.

This step will cleanup the resources and S3 objects created by this notebook.

Note: Apart from these resources, there will be Docker containers and related images created in the notebook instance that is running this Jupyter notebook.  As they are already part of the notebook instance, you do not need to delete them.  If you decide to delete them, then go to the Terminal of the Jupyter notebook and run the appropriate `docker` commands.

### A) Cleanup ECS resources <a id='Cleanup%20ECS%20resources'></a>

Note: Sometimes, after the `delete_cluster` API is invoked on an ECS cluster, the cluster can go into 'INACTIVE' state and may remain discoverable in your AWS account for a period of time.  You may not see this in the AWS console.  When this happens, you won't be able to use the cluster.  For more information on this, refer [here](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ecs.html#ECS.Client.delete_cluster).

In [None]:
# Stop the ECS Task
ecs_client.stop_task(cluster=ecs_cluster_name,
                     task=ecs_fargate_task_id,
                     reason='Cleanup from notebook {}'.format(nb_name))

In [None]:
# Deregister ECS Task definition
ecs_client.deregister_task_definition(taskDefinition=ecs_fargate_task_definiton_arn)

In [None]:
# Delete the ECS cluster
ecs_client.delete_cluster(cluster=ecs_cluster_name)
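
A cluster that still has active tasks cannot be deleted.  If you started more tasks than the single one stopped above, the following sketch lists the tasks that are still meant to be running and stops each of them, so that it can be run before the cluster is deleted.  The helper name `stop_all_tasks` is hypothetical.

```python
def stop_all_tasks(ecs_client, cluster_name, reason):
    """Stop every task in the cluster whose desired status is RUNNING.

    Returns the list of task ARNs for which a stop was requested.
    """
    task_arns = []
    kwargs = {'cluster': cluster_name, 'desiredStatus': 'RUNNING'}
    # Collect all task ARNs first, following nextToken across pages
    while True:
        page = ecs_client.list_tasks(**kwargs)
        task_arns.extend(page.get('taskArns', []))
        next_token = page.get('nextToken')
        if not next_token:
            break
        kwargs['nextToken'] = next_token
    # Then request a stop for each task
    for task_arn in task_arns:
        ecs_client.stop_task(cluster=cluster_name, task=task_arn, reason=reason)
    return task_arns


# Hypothetical usage with the variables defined in this notebook:
# stop_all_tasks(ecs_client, ecs_cluster_name, 'Cleanup from notebook {}'.format(nb_name))
```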

### B) Cleanup ECR repository <a id='Cleanup%20ECR%20repository'></a>

In [None]:
# Delete the ECR private repository
try:
    ecr_client.delete_repository(repositoryName=container_image_name, force=True)
    print('ECR repository {} deleted.'.format(container_image_name))
except ecr_client.exceptions.RepositoryNotFoundException:
    print('ECR repository {} does not exist.'.format(container_image_name))

### C) Cleanup S3 objects <a id='Cleanup%20S3%20objects'></a>

In [None]:
# Delete data from S3 bucket
for file in s3_bucket_resource.objects.filter(Prefix='{}/'.format(nb_name)):
    file_key = file.key
    print('Deleting {} ...'.format(file_key))
    s3_resource.Object(s3_bucket_resource.name, file_key).delete()
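
The loop above issues one delete request per object.  When a prefix holds many objects, the S3 `delete_objects` API can remove up to 1000 keys per request.  The following sketch batches the keys accordingly; the helper name is hypothetical, and it expects a low-level S3 client (for example, `s3_resource.meta.client`) rather than the resource object used elsewhere in this notebook.

```python
def delete_keys_in_batches(s3_client, bucket_name, keys, batch_size=1000):
    """Delete the given object keys using as few delete_objects calls as possible.

    Returns the number of keys submitted for deletion.
    """
    if not 1 <= batch_size <= 1000:
        raise ValueError('batch_size must be between 1 and 1000')
    keys = list(keys)
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        s3_client.delete_objects(
            Bucket=bucket_name,
            # Quiet mode returns only the keys that failed to delete
            Delete={'Objects': [{'Key': key} for key in batch], 'Quiet': True})
    return len(keys)


# Hypothetical usage with the variables defined in this notebook:
# keys = [obj.key for obj in s3_bucket_resource.objects.filter(Prefix='{}/'.format(nb_name))]
# delete_keys_in_batches(s3_resource.meta.client, s3_bucket_resource.name, keys)
```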