# Multi-model SageMaker Pipeline with Hyperparameter Tuning and Experiments

Before proceeding, please read the context for this notebook in [README.md](README.md). This notebook has been tested in a SageMaker notebook using a kernel with Python 3.7 installed, e.g. conda_mxnet_latest_p37 or conda_python3. Make sure you have created a SageMaker project named `diabetes` outside of this notebook; the notebook will not create the project for you. We recommend creating the project with the [SageMaker-provided MLOps template for model building, training, and deployment](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-templates-sm.html#sagemaker-projects-templates-code-commit).


## Prepare the raw data

We create an S3 bucket with default server-side encryption enabled for additional security.

#### If you are running this notebook in the us-east-1 region, do not pass the `CreateBucketConfiguration` parameter to `create_bucket()`; us-east-1 is the default location.

In [None]:
import boto3

AWS_ACCOUNT = boto3.client("sts").get_caller_identity()["Account"]
AWS_REGION = boto3.Session().region_name
BUCKET_NAME = "sagemaker-diabetes-{AWS_ACCOUNT}".format(AWS_ACCOUNT=AWS_ACCOUNT)

s3_client = boto3.client("s3")

if AWS_REGION == "us-east-1":
    # us-east-1 is the default location, so CreateBucketConfiguration must be omitted
    s3_client.create_bucket(Bucket=BUCKET_NAME)
else:
    # every other region requires an explicit LocationConstraint
    s3_client.create_bucket(
        Bucket=BUCKET_NAME,
        CreateBucketConfiguration={"LocationConstraint": AWS_REGION},
    )

s3_client.put_bucket_encryption(
    Bucket=BUCKET_NAME,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"},
            },
        ]
    },
)
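
To confirm that default encryption took effect, one option is to read the configuration back with `get_bucket_encryption` and inspect its rules. The sketch below is a minimal check and not part of the original workflow; the helper name `uses_default_sse` is hypothetical, and it only inspects a dictionary shaped like the S3 response.

```python
def uses_default_sse(encryption_response, algorithm="AES256"):
    """Return True if any default-encryption rule uses the given algorithm.

    `encryption_response` is expected to look like the dictionary returned by
    s3_client.get_bucket_encryption(Bucket=...).
    """
    config = encryption_response.get("ServerSideEncryptionConfiguration", {})
    for rule in config.get("Rules", []):
        default = rule.get("ApplyServerSideEncryptionByDefault", {})
        if default.get("SSEAlgorithm") == algorithm:
            return True
    return False


# Offline illustration with a response shaped like the one S3 returns
sample_response = {
    "ServerSideEncryptionConfiguration": {
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    }
}
print(uses_default_sse(sample_response))
```

In the notebook, the same helper could be called as `uses_default_sse(s3_client.get_bucket_encryption(Bucket=BUCKET_NAME))`.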

## Dataset collection

Download the UCI diabetes dataset and copy it to the S3 bucket.

In [None]:
%%sh

AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="sagemaker-diabetes-${AWS_ACCOUNT}"

wget https://archive.ics.uci.edu/ml/machine-learning-databases/00296/dataset_diabetes.zip
unzip dataset_diabetes.zip
aws s3 cp dataset_diabetes/diabetic_data.csv s3://${BUCKET_NAME}/
    

Update diabetes.flow to use your AWS account ID. 

In [None]:
%%sh

AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
sed -i "s/AWS_ACCOUNT/${AWS_ACCOUNT}/g" sagemaker-pipeline/diabetes.flow
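
The `sed` call above replaces every literal `AWS_ACCOUNT` token in the flow file. A rough Python equivalent is sketched below for readers who prefer to stay in Python. The function name `fill_account_id` is hypothetical, the sample string is illustrative rather than taken from `diabetes.flow`, and unlike `sed -i` the function only transforms a string instead of editing the file in place.

```python
def fill_account_id(flow_text, account_id, placeholder="AWS_ACCOUNT"):
    """Replace every occurrence of the placeholder with the account ID."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12-digit strings")
    return flow_text.replace(placeholder, account_id)


# Illustrative input; the real diabetes.flow content is not reproduced here
example = '{"s3Uri": "s3://sagemaker-diabetes-AWS_ACCOUNT/diabetic_data.csv"}'
print(fill_account_id(example, "123456789012"))
```

Validating the account ID first guards against writing an empty or malformed value into the flow file, which would otherwise only surface later as a missing-bucket error.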

Next, create an IAM role for the ML workflow steps.

In [None]:
import json

iam_client = boto3.client("iam")

sagemaker_assume_role_policy_document = json.dumps(
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
)

response_role = iam_client.create_role(
    RoleName="AmazonSageMakerServiceCatalogProductsUseRole-diabetes",
    AssumeRolePolicyDocument=sagemaker_assume_role_policy_document,
)


iam_client.attach_role_policy(
    RoleName=response_role["Role"]["RoleName"],
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
)
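
`create_role` raises an error if the role already exists, for example when this cell is re-run. One way to make the step repeatable is sketched below. It assumes an IAM client is passed in, and the helper name `ensure_role` is hypothetical.

```python
def ensure_role(iam_client, role_name, assume_role_policy_document):
    """Create the role, or return the existing one if it is already there."""
    try:
        response = iam_client.create_role(
            RoleName=role_name,
            AssumeRolePolicyDocument=assume_role_policy_document,
        )
    except iam_client.exceptions.EntityAlreadyExistsException:
        # The role exists from an earlier run, so fetch it instead of failing
        response = iam_client.get_role(RoleName=role_name)
    return response["Role"]
```

With the variables from the cell above, a call could look like `ensure_role(iam_client, "AmazonSageMakerServiceCatalogProductsUseRole-diabetes", sagemaker_assume_role_policy_document)`.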


## Prepare the Decision Tree custom Docker image

We build a Docker image containing a custom algorithm based on the [Scikit-learn Decision Tree Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor). Note that the Docker image has been modified to support hyperparameter tuning and validation data.



In [None]:
! sudo yum install docker -y

In [None]:
%%sh

# The name of our algorithm
ALGORITHM_NAME="diabetes-decision-trees"

cd container

chmod +x decision_trees/train
chmod +x decision_trees/serve

AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
AWS_REGION=$(aws configure get region)

IMAGE_FULLNAME="${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ALGORITHM_NAME}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${ALGORITHM_NAME}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${ALGORITHM_NAME}" > /dev/null
fi

# Authenticate Docker to ECR by piping a temporary password to docker login
aws ecr get-login-password --region ${AWS_REGION}|docker login --username AWS --password-stdin ${IMAGE_FULLNAME}

# Build the docker image locally with the image name and then push it to ECR with the full name.
# Ensure your notebook IAM role has required permission for pushing image to ECR

docker build  -t ${ALGORITHM_NAME} .
docker tag ${ALGORITHM_NAME} ${IMAGE_FULLNAME}
docker push ${IMAGE_FULLNAME}
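
The shell variable `IMAGE_FULLNAME` follows the usual ECR URI layout. If the same value is needed later from Python, a small helper such as the hypothetical `ecr_image_uri` below can rebuild it. The sketch assumes a standard commercial AWS region, where the registry domain is `amazonaws.com`.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build the full ECR image URI from its parts."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# The account ID and region below are placeholders
print(ecr_image_uri("123456789012", "us-east-1", "diabetes-decision-trees"))
```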


Once the Docker image is pushed to the ECR repository, we make the image accessible from SageMaker.

In [None]:
%%sh

# The name of the SageMaker image
SM_IMAGE_NAME=diabetes-dtree
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

# This assumes the role name is AmazonSageMakerServiceCatalogProductsUseRole-diabetes
ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT}:role/AmazonSageMakerServiceCatalogProductsUseRole-diabetes"

aws sagemaker create-image \
    --image-name ${SM_IMAGE_NAME} \
    --role-arn ${ROLE_ARN}

aws sagemaker create-app-image-config \
    --cli-input-json file://container/app-image-config-input.json


In [None]:
%%sh
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
ALGORITHM_NAME=diabetes-decision-trees
AWS_REGION=$(aws configure get region)
SM_IMAGE_NAME=diabetes-dtree
SM_BASE_IMAGE="${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ALGORITHM_NAME}:latest"

aws sagemaker create-image-version \
    --image-name ${SM_IMAGE_NAME} \
    --base-image ${SM_BASE_IMAGE}
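
Creating an image version is asynchronous, so it can help to wait until the version is ready before relying on it. A minimal polling sketch using `describe_image_version` follows. The helper name, polling interval, and retry limit are assumptions rather than values from this notebook.

```python
import time


def wait_for_image_version(sm_client, image_name, poll_seconds=5, max_polls=60):
    """Poll until the latest image version is CREATED, or raise on failure."""
    for _ in range(max_polls):
        response = sm_client.describe_image_version(ImageName=image_name)
        status = response["ImageVersionStatus"]
        if status == "CREATED":
            return status
        if status == "CREATE_FAILED":
            raise RuntimeError(f"Image version for {image_name} failed to create")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Image version for {image_name} not ready after {max_polls} polls")
```

A call such as `wait_for_image_version(boto3.client("sagemaker"), "diabetes-dtree")` would block until the version created above is usable.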

## Trigger the SageMaker Pipelines pipeline

Here we perform the following steps:

1) Clone the SageMaker Projects model-build repo from CodeCommit

2) Copy the local project sagemaker-pipeline to the SageMaker Project repo

3) Commit and push these changes to CodeCommit

These three steps trigger the SageMaker Projects model-build pipeline.

In [None]:
%%sh
SAGEMAKER_PROJECT_NAME=diabetes
AWS_REGION=$(aws configure get region)
SAGEMAKER_PROJECT_ID=$(aws sagemaker describe-project --project-name ${SAGEMAKER_PROJECT_NAME} --query 'ProjectId' | tr -d '"')
SAGEMAKER_PROJECT_REPO="sagemaker-${SAGEMAKER_PROJECT_NAME}-${SAGEMAKER_PROJECT_ID}-modelbuild"

rm -rf ~/${SAGEMAKER_PROJECT_REPO}/
git clone https://git-codecommit.${AWS_REGION}.amazonaws.com/v1/repos/${SAGEMAKER_PROJECT_REPO} ~/${SAGEMAKER_PROJECT_REPO}
rsync -a sagemaker-pipeline/ ~/${SAGEMAKER_PROJECT_REPO}/ && rm -rf ~/${SAGEMAKER_PROJECT_REPO}/pipelines/abalone/ ~/${SAGEMAKER_PROJECT_REPO}/build/
cd ~/${SAGEMAKER_PROJECT_REPO}/ && git config --global user.name "name" && git config --global user.email name@email.com && git config advice.addIgnoredFile false && git add --all && git commit -am "initial commit" && git push origin main 



The commit should trigger a pipeline run. Proceed to monitor your pipeline run until completion in SageMaker Studio. 

If you inspect the pipeline, you will see that the XGBoost model performs better than the Decision Tree model. Therefore, the XGBoost model is registered in the model registry.
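
Besides the Studio UI, the run can also be checked from the notebook with `list_pipeline_executions`. The sketch below returns the status of the most recent execution. The helper name is hypothetical, and the pipeline name must be taken from your own project because it is not stated in this notebook.

```python
def latest_execution_status(sm_client, pipeline_name):
    """Return the status of the newest execution, or None if there is none."""
    response = sm_client.list_pipeline_executions(
        PipelineName=pipeline_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response.get("PipelineExecutionSummaries", [])
    if not summaries:
        return None
    return summaries[0]["PipelineExecutionStatus"]
```

The returned value is one of the pipeline execution statuses such as `Executing`, `Succeeded`, or `Failed`, so the function can be called repeatedly until the run finishes.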

## Approve the top-performing model in Model Registry

After the SageMaker pipeline completes, the newly trained model is registered in the Model Registry.

1) Set `MODEL_VERSION` to the version you want to approve. Here we assume model version 1.

2) Because EventBridge monitors Model Registry status changes, approving the model triggers the SageMaker Projects model-deploy pipeline.

In [None]:
import boto3

sm_client = boto3.client("sagemaker")

# Look up the account ID again so this cell also runs after a kernel restart
AWS_ACCOUNT = boto3.client("sts").get_caller_identity()["Account"]
AWS_REGION = boto3.Session().region_name

MODEL_VERSION = "1"
SAGEMAKER_PROJECT_NAME = "diabetes"
SAGEMAKER_PROJECT_ID = sm_client.describe_project(ProjectName=SAGEMAKER_PROJECT_NAME)["ProjectId"]
MODEL_PACKAGE_ARN = "arn:aws:sagemaker:{AWS_REGION}:{AWS_ACCOUNT}:model-package/{SAGEMAKER_PROJECT_NAME}-{SAGEMAKER_PROJECT_ID}/{MODEL_VERSION}".format(
    AWS_REGION=AWS_REGION,
    AWS_ACCOUNT=AWS_ACCOUNT,
    SAGEMAKER_PROJECT_NAME=SAGEMAKER_PROJECT_NAME,
    SAGEMAKER_PROJECT_ID=SAGEMAKER_PROJECT_ID,
    MODEL_VERSION=MODEL_VERSION,
)


model_package_update_response = sm_client.update_model_package(
    ModelPackageArn=MODEL_PACKAGE_ARN, ModelApprovalStatus="Approved"
)
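
Hard-coding `MODEL_VERSION` works for a first run. For later runs, one option is to look up the newest version that is still awaiting approval. The sketch below filters a response shaped like the output of `list_model_packages`. The helper name is hypothetical, the sample ARNs are placeholders, and it assumes the pipeline registers models with the `PendingManualApproval` status.

```python
def newest_pending_package_arn(list_response):
    """Pick the highest-version package with status PendingManualApproval."""
    pending = [
        summary
        for summary in list_response.get("ModelPackageSummaryList", [])
        if summary.get("ModelApprovalStatus") == "PendingManualApproval"
    ]
    if not pending:
        return None
    newest = max(pending, key=lambda summary: summary["ModelPackageVersion"])
    return newest["ModelPackageArn"]


# Offline illustration; the ARNs are shortened placeholders
sample = {
    "ModelPackageSummaryList": [
        {"ModelPackageArn": "arn:example/1", "ModelPackageVersion": 1,
         "ModelApprovalStatus": "Approved"},
        {"ModelPackageArn": "arn:example/2", "ModelPackageVersion": 2,
         "ModelApprovalStatus": "PendingManualApproval"},
    ]
}
print(newest_pending_package_arn(sample))
```

In the notebook, the input would come from `sm_client.list_model_packages(ModelPackageGroupName=...)`, with the group name assumed to be the project name and project ID joined by a hyphen, as in the ARN built above. Only the first page of results is inspected in this sketch.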

## Run predictions on model

Wait until the SageMaker Projects model-deploy pipeline has deployed the staging inference endpoint. Use the following data for inference:

Example 1
------------
`5.0,64.0,0.0,18.0,0.0,0.0,7.0,9.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
`

In summary, this is a diabetic Caucasian female patient aged 60-70 who has spent 5 days in the hospital under emergency care in the current encounter. Prior to this encounter, the patient spent 0 days in outpatient care, 0 days in emergency care, and 7 days in inpatient care. 64 laboratory procedures have been performed on the patient. The patient is not using metformin, repaglinide, pioglitazone, or rosiglitazone, and the insulin prescription is steady.



In [None]:
import json
import boto3

sm_runtime = boto3.client("runtime.sagemaker")
endpoint_name = "diabetes-staging"
line = "5.0,64.0,0.0,18.0,0.0,0.0,7.0,9.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0"
response = sm_runtime.invoke_endpoint(EndpointName=endpoint_name, ContentType="text/csv", Body=line)
result = json.loads(response["Body"].read().decode())
print("Predicted class : {}".format(round(result)))

Now you try:

Example 2
------------

`3.0,19.0,3.0,19.0,0.0,0.0,0.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
`

In summary, this is a diabetic Caucasian female patient aged 70-80 who has spent 3 days in the hospital under elective care in the current encounter. Prior to this encounter, the patient spent 0 days in outpatient care, 0 days in emergency care, and 0 days in inpatient care. 19 laboratory procedures have been performed on the patient. The patient is not using metformin, repaglinide, pioglitazone, or rosiglitazone. The patient is not using insulin.
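
To try Example 2, the endpoint call from the previous cell can be wrapped in a small function. This is a sketch: the name `predict_class` is hypothetical, and it assumes, as in that cell, that the endpoint returns a single JSON number.

```python
import json


def predict_class(sm_runtime, endpoint_name, csv_line):
    """Send one CSV row to the endpoint and round the score to a class label."""
    response = sm_runtime.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="text/csv", Body=csv_line
    )
    score = json.loads(response["Body"].read().decode())
    return round(score)
```

With the client from the previous cell, `predict_class(sm_runtime, "diabetes-staging", line)` reproduces the first prediction, and passing the Example 2 row as `csv_line` gives the second.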

## Cleanup

To avoid incurring future charges, clean up the resources you created, such as the S3 bucket, the ECR repository, and SageMaker Studio. Before deleting SageMaker Studio, delete the SageMaker model and endpoint resources, then delete the entire SageMaker project `diabetes` along with its peripheral resources (the CodePipeline pipelines and CodeCommit repositories).
Finally, delete the Jupyter instance containing the notebook.
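
Part of this cleanup can be scripted. The sketch below covers only the S3 bucket and the ECR repository created earlier in this notebook, and it defaults to a dry run because the deletions are irreversible. The function name is hypothetical; `s3_resource` is assumed to be `boto3.resource("s3")` and `ecr_client` to be `boto3.client("ecr")`.

```python
def cleanup_storage(account_id, s3_resource=None, ecr_client=None, dry_run=True):
    """List, and optionally delete, the bucket and ECR repository from this notebook."""
    bucket_name = f"sagemaker-diabetes-{account_id}"
    repository_name = "diabetes-decision-trees"
    plan = [f"s3://{bucket_name}", f"ecr:{repository_name}"]
    if dry_run:
        # Nothing is deleted; the caller only sees what would be removed
        return plan

    bucket = s3_resource.Bucket(bucket_name)
    bucket.objects.all().delete()  # a bucket must be empty before it can be deleted
    bucket.delete()
    # force=True also removes the images stored in the repository
    ecr_client.delete_repository(repositoryName=repository_name, force=True)
    return plan


# Dry run with a placeholder account ID
print(cleanup_storage("123456789012"))
```

Reviewing the dry-run output first, then calling the function again with `dry_run=False` and real clients, keeps the destructive step explicit.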