# Multi-model SageMaker Pipeline with Hyperparameter Tuning and Experiments

This notebook has been tested in a SageMaker notebook that is using a kernel with Python 3.7 installed, e.g. conda_mxnet_latest_p37, conda_python3.

## Prepare the dataset collection

We create an S3 bucket with encryption enabled for additional security.

#### If you are running this Notebook in us-east-1 region, don't use 'CreateBucketConfiguration' parameter with create_bucket(). us-east-1 is the default location.

In [None]:
import boto3

AWS_ACCOUNT = boto3.client("sts").get_caller_identity()["Account"]
AWS_REGION = boto3.Session().region_name
PREFIX = "sagemaker-diabetes"
BUCKET_NAME = "{PREFIX}-{AWS_ACCOUNT}".format(PREFIX=PREFIX,AWS_ACCOUNT=AWS_ACCOUNT)

s3_client = boto3.client("s3")
location = {"LocationConstraint": AWS_REGION}

# us-east-1 is the default location and must not be passed as a LocationConstraint
if AWS_REGION == "us-east-1":
    s3_client.create_bucket(Bucket=BUCKET_NAME)
else:
    s3_client.create_bucket(Bucket=BUCKET_NAME, CreateBucketConfiguration=location)

s3_client.put_bucket_encryption(
 Bucket=BUCKET_NAME,
 ServerSideEncryptionConfiguration={
 "Rules": [
 {
 "ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"},
 },
 ]
 },
)

Download the UCI dataset and copy it to the S3 bucket.

In [None]:
%%sh

AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="sagemaker-diabetes-${AWS_ACCOUNT}"

wget https://archive.ics.uci.edu/ml/machine-learning-databases/00296/dataset_diabetes.zip
unzip dataset_diabetes.zip
aws s3 cp dataset_diabetes/diabetic_data.csv s3://${BUCKET_NAME}/
 

Update diabetes.flow to use your AWS account ID. 

In [None]:
%%sh

AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
sed -i "s/AWS_ACCOUNT/${AWS_ACCOUNT}/g" sagemaker-pipeline/diabetes.flow

Next, create the IAM role for the ML workflow steps.

In [None]:
import json

iam_client = boto3.client("iam")

sagemaker_assume_role_policy_document = json.dumps(
 {
 "Version": "2012-10-17",
 "Statement": [
 {
 "Effect": "Allow",
 "Principal": {"Service": "sagemaker.amazonaws.com"},
 "Action": "sts:AssumeRole",
 }
 ],
 }
)

response_role = iam_client.create_role(
 RoleName="AmazonSageMakerServiceCatalogProductsUseRole-diabetes",
 AssumeRolePolicyDocument=sagemaker_assume_role_policy_document,
)


iam_client.attach_role_policy(
 RoleName=response_role["Role"]["RoleName"],
 PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
)


## Prepare the Decision Tree custom Docker image

We make a Docker image containing a custom algorithm using [Scikit-learn Decision Tree Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). Note that the Docker image has been modified to support hyperparameter tuning and validation data. 
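
As a rough sketch of what hyperparameter tuning support involves inside the container: SageMaker writes the hyperparameters chosen for each training job to `/opt/ml/input/config/hyperparameters.json`, with every value serialized as a string, so a `train` script has to read and cast them. The helper name `load_hyperparameters` and the default values below are assumptions for illustration; the actual script in `container/decision_trees/train` may differ.

```python
import json
from pathlib import Path

# Standard location where SageMaker places hyperparameters inside a training container.
HYPERPARAMETERS_PATH = Path("/opt/ml/input/config/hyperparameters.json")


def load_hyperparameters(path=HYPERPARAMETERS_PATH):
    """Read tuner-supplied hyperparameters and cast them from strings to ints.

    Hypothetical helper: the parameter names mirror the tuning ranges used
    later in this notebook, and the defaults are illustrative only.
    """
    path = Path(path)
    raw = json.loads(path.read_text()) if path.exists() else {}
    return {
        "max_depth": int(raw.get("max_depth", 5)),
        "max_leaf_nodes": int(raw.get("max_leaf_nodes", 10)),
    }
```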



In [None]:
! sudo yum install docker -y

In [None]:
%%sh

# The name of our algorithm
ALGORITHM_NAME="diabetes-decision-trees"

cd container

chmod +x decision_trees/train
chmod +x decision_trees/serve

AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
AWS_REGION=$(aws configure get region)

IMAGE_FULLNAME="${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ALGORITHM_NAME}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${ALGORITHM_NAME}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
 aws ecr create-repository --repository-name "${ALGORITHM_NAME}" > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${AWS_REGION}|docker login --username AWS --password-stdin ${IMAGE_FULLNAME}

# Build the docker image locally with the image name and then push it to ECR with the full name.
# Ensure your notebook IAM role has required permission for pushing image to ECR

docker build -t ${ALGORITHM_NAME} .
docker tag ${ALGORITHM_NAME} ${IMAGE_FULLNAME}
docker push ${IMAGE_FULLNAME}


Once the Docker image is pushed to the ECR repository, we make the image accessible from SageMaker.

In [None]:
%%sh

# The name of the SageMaker image
SM_IMAGE_NAME=diabetes-dtree
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

# This assumes the role name is AmazonSageMakerServiceCatalogProductsUseRole-diabetes
ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT}:role/AmazonSageMakerServiceCatalogProductsUseRole-diabetes"

aws sagemaker create-image \
 --image-name ${SM_IMAGE_NAME} \
 --role-arn ${ROLE_ARN}

aws sagemaker create-app-image-config \
 --cli-input-json file://container/app-image-config-input.json


In [None]:
%%sh
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
ALGORITHM_NAME=diabetes-decision-trees
AWS_REGION=$(aws configure get region)
SM_IMAGE_NAME=diabetes-dtree
SM_BASE_IMAGE="${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ALGORITHM_NAME}:latest"

aws sagemaker create-image-version \
 --image-name ${SM_IMAGE_NAME} \
 --base-image ${SM_BASE_IMAGE}

## Define and start the SageMaker pipeline

Install the necessary Python library `awswrangler` for the SageMaker pipeline. 

In [None]:
! pip3 install awswrangler

Import the necessary Python modules for the SageMaker pipeline. 

In [None]:
import os
import time
import uuid
import json
import boto3
import sagemaker
import sagemaker.session
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.model_metrics import MetricsSource, ModelMetrics
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.condition_step import ConditionStep, JsonGet
#from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.parameters import ParameterInteger, ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.workflow.steps import ProcessingStep, TrainingStep, CacheConfig, TuningStep
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.processing import Processor
from sagemaker.network import NetworkConfig
from sagemaker.tuner import (
 ContinuousParameter,
 IntegerParameter,
 CategoricalParameter,
 HyperparameterTuner,
 WarmStartConfig,
 WarmStartTypes,
)

Create the pipeline session and define the instance counts, instance types, and other configuration for the pipeline steps.

In [None]:
region = AWS_REGION
default_bucket = BUCKET_NAME

pipeline_session = PipelineSession()
role = sagemaker.session.get_execution_role(pipeline_session)

datawrangler_instance_count = 2
datawrangler_instance_type = "ml.m5.4xlarge" 
processing_instance_count = 1
processing_instance_type = "ml.m5.2xlarge"
training_instance_count = 1
training_instance_type = "ml.m5.2xlarge"
model_approval_status = ParameterString(
 name="ModelApprovalStatus",
 default_value="PendingManualApproval", # ModelApprovalStatus can be set to a default of "Approved" if you don't want manual approval.
)
input_data = ParameterString(
 name="InputDataUrl",
 default_value="", # Change this to point to the S3 location of your raw input data.
)


Define and create the `DataWranglerProcess` step in the SageMaker pipeline. This step uses an existing Data Wrangler flow file `diabetes.flow` that has the following transformations:
* Move column readmitted to the beginning. This column is to be predicted in the classification problem. 
* Convert readmitted column value to 0 if it is NO and 1 if it is <30 or >30.
* Drop the columns that have minimal to zero prediction power based on Data Wrangler Data Quality and Insights Report, e.g. payer_code and encounter_id. 
* Group values into finite categories using Python custom transform in the following columns: diag_1, diag_2, diag_3, admission_type_id, admission_source_id, and discharge_disposition_id. 
* Fill missing values in columns diag_1, diag_2, diag_3 and replace strings in column race.
* Drop duplicates, balance data using SMOTE, and one-hot encode the following columns: race, gender, age, diag_1, diag_2, diag_3, max_glu_serum, A1Cresult, metformin, repaglinide, pioglitazone, rosiglitazone, insulin, change, diabetesMed, admission_type_id, discharge_disposition_id, admission_source_id. 

To use your own transformations, replace `output_name` and `flow_file_name`.
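
To make the label conversion in the second transformation concrete, the following minimal sketch reproduces the mapping in plain Python. The function name `encode_readmitted` is hypothetical; the actual transformation is defined inside `diabetes.flow`.

```python
def encode_readmitted(value: str) -> int:
    """Map the raw `readmitted` value to a binary label.

    "NO" becomes 0; "<30" and ">30" both become 1.
    """
    cleaned = value.strip().upper()
    if cleaned == "NO":
        return 0
    if cleaned in {"<30", ">30"}:
        return 1
    raise ValueError(f"Unexpected readmitted value: {value!r}")


labels = [encode_readmitted(v) for v in ["NO", "<30", ">30"]]
print(labels)  # [0, 1, 1]
```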

In [None]:
print(f"Data Wrangler export storage bucket: {default_bucket}")

# unique flow export ID
flow_export_id = f"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}"
flow_export_name = f"flow-{flow_export_id}"

output_name = "d593101e-278b-4330-9779-b6e02fbeb99e.default"

s3_output_prefix = f"export-{flow_export_name}/output"
s3_output_path = f"s3://{default_bucket}/{s3_output_prefix}"
print(f"Flow S3 export result path: {s3_output_path}")

processing_job_output = ProcessingOutput(
 output_name=output_name,
 source="/opt/ml/processing/output",
 destination=s3_output_path,
 s3_upload_mode="EndOfJob",
)

# name of the flow file which should exist in the current notebook working directory
flow_file_name = "sagemaker-pipeline/diabetes.flow"

# Load .flow file from current notebook working directory
#!echo "Loading flow file from current notebook working directory: $PWD"

with open(flow_file_name) as f:
 flow = json.load(f)

# Upload flow to S3
s3_client = boto3.client("s3")
s3_client.upload_file(
 flow_file_name,
 default_bucket,
 f"data_wrangler_flows/{flow_export_name}.flow",
 ExtraArgs={"ServerSideEncryption": "aws:kms"},
)

flow_s3_uri = f"s3://{default_bucket}/data_wrangler_flows/{flow_export_name}.flow"

print(f"Data Wrangler flow {flow_file_name} uploaded to {flow_s3_uri}")

flow_input = ProcessingInput(
 source=flow_s3_uri,
 destination="/opt/ml/processing/flow",
 input_name="flow",
 s3_data_type="S3Prefix",
 s3_input_mode="File",
 s3_data_distribution_type="FullyReplicated",
)

# IAM role for executing the processing job.
iam_role = role

# Unique processing job name. Give a unique name every time you re-execute processing jobs
processing_job_name = f"data-wrangler-flow-processing-{flow_export_id}"

# Size in GB of the EBS volume to use for storing data during processing
volume_size_in_gb = 30

# Content type for each output. Data Wrangler supports CSV as default and Parquet.
output_content_type = "CSV"

# Network Isolation mode; default is off
enable_network_isolation = False

# List of tags to be passed to the processing job
user_tags = []

# Output configuration used as processing job container arguments
output_config = {output_name: {"content_type": output_content_type}}

# KMS key for per object encryption; default is None
kms_key = None

# Data Wrangler Container URL.
container_uri = sagemaker.image_uris.retrieve(
 framework="data-wrangler", 
 region=region, 
 version="1.x",
)

processor = Processor(
 role=iam_role,
 image_uri=container_uri,
 instance_count=datawrangler_instance_count,
 instance_type=datawrangler_instance_type,
 volume_size_in_gb=volume_size_in_gb,
 network_config=NetworkConfig(enable_network_isolation=enable_network_isolation),
 sagemaker_session=pipeline_session,
 output_kms_key=kms_key,
 tags=user_tags,
)

data_wrangler_step = ProcessingStep(
 name="DataWranglerProcess",
 processor=processor,
 inputs=[flow_input],
 outputs=[processing_job_output],
 job_arguments=[f"--output-config '{json.dumps(output_config)}'"],
)


Define and create the `Preprocess` step in the SageMaker pipeline. This step reads the transformed data from the DataWranglerProcess, randomizes, and splits the data into train (70%), validation (10%), and test data (20%). 

You can also put here other necessary transformations and pre-processing changes that are done outside of Data Wrangler. 
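
The split itself is implemented in `preprocess.py`, which is not shown here. As a minimal sketch of the idea, assuming the rows are shuffled once and then cut into consecutive 70%, 10%, and 20% slices, it could look like this (the function name and the fixed seed are illustrative assumptions):

```python
import random


def shuffle_and_split(rows, train_frac=0.7, validation_frac=0.1, seed=42):
    """Shuffle rows and split them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # round() avoids off-by-one errors from floating-point products.
    train_end = round(len(shuffled) * train_frac)
    validation_end = train_end + round(len(shuffled) * validation_frac)
    return (
        shuffled[:train_end],
        shuffled[train_end:validation_end],
        shuffled[validation_end:],  # the remainder is the test set
    )


train_rows, validation_rows, test_rows = shuffle_and_split(range(100))
print(len(train_rows), len(validation_rows), len(test_rows))  # 70 10 20
```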

In [None]:
base_job_prefix = "sagemaker-diabetes"

sklearn_processor = SKLearnProcessor(
 framework_version="0.23-1",
 instance_type=processing_instance_type,
 instance_count=processing_instance_count,
 base_job_name=f"{base_job_prefix}/sklearn-diabetes-preprocess", # choose any name
 sagemaker_session=pipeline_session,
 role=role
)

step_process = ProcessingStep(
 name="Preprocess", # choose any name
 processor=sklearn_processor,
 inputs=[
 ProcessingInput(
 source=data_wrangler_step.properties.ProcessingOutputConfig.Outputs[
 output_name
 ].S3Output.S3Uri,
 destination="/opt/ml/processing/data/raw-data-dir",
 )
 ],
 outputs=[
 ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
 ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
 ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
 ],
 code="sagemaker-pipeline/pipelines/diabetes/preprocess.py",
 job_arguments=[
 "--input-data",
 data_wrangler_step.properties.ProcessingOutputConfig.Outputs[
 output_name
 ].S3Output.S3Uri,
 ],
)


Define and create the `XGBHPTune` step in the SageMaker pipeline. This is a hyperparameter tuning job using SageMaker XGBoost algorithm.

In [None]:
# Training step for generating model artifacts
model_path = f"s3://{pipeline_session.default_bucket()}/{base_job_prefix}/diabetesTrain"
model_bucket_key = f"{pipeline_session.default_bucket()}/{base_job_prefix}/diabetesTrain"
cache_config = CacheConfig(enable_caching=True, expire_after="30d")

xgb_image_uri = sagemaker.image_uris.retrieve(
 framework="xgboost", # we are using the Sagemaker built in xgboost algorithm
 region=region,
 version="1.0-1",
 py_version="py3",
 instance_type=training_instance_type,
 image_scope="training"
)
xgb_train = Estimator(
 image_uri=xgb_image_uri,
 instance_type=training_instance_type,
 instance_count=training_instance_count,
 output_path=model_path,
 base_job_name=f"{base_job_prefix}/diabetes-xgb-train",
 sagemaker_session=pipeline_session,
 role=role,
)
xgb_train.set_hyperparameters(
 num_round=50,
 objective="binary:logistic", # we are using binary:logistic as the objective function for classification 
)

xgb_train.set_hyperparameters(grow_policy="lossguide")

xgb_objective_metric_name = "validation:auc" # we are using AUC as a performance metric 
xgb_hyperparameter_ranges = {
 "max_depth": IntegerParameter(5, 10, scaling_type="Auto"),
 "min_child_weight": IntegerParameter(5, 10, scaling_type="Auto"),
 "eta": ContinuousParameter(0.1, 0.9, scaling_type="Auto"),
 "gamma": IntegerParameter(4, 9, scaling_type="Auto"),
 "subsample": ContinuousParameter(0.7, 0.9, scaling_type="Auto"),
}

xgb_tuner_log = HyperparameterTuner(
 xgb_train,
 xgb_objective_metric_name,
 xgb_hyperparameter_ranges,
 max_jobs=5,
 max_parallel_jobs=5,
 strategy="Random",
 objective_type="Maximize",
)

xgb_step_tuning = TuningStep(
 name="XGBHPTune",
 tuner=xgb_tuner_log,
 inputs={
 "train": TrainingInput(
 s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
 "train"
 ].S3Output.S3Uri,
 content_type="text/csv",
 ),
 "validation": TrainingInput(
 s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
 "validation"
 ].S3Output.S3Uri,
 content_type="text/csv",
 ),
 },
 cache_config=cache_config,
)
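
With `strategy="Random"`, each of the `max_jobs` training jobs receives a hyperparameter combination drawn from the declared ranges without reference to the results of earlier jobs. The following sketch is only a conceptual illustration of that idea in plain Python, not SageMaker's actual sampler: it assumes uniform sampling, and the `sample_configuration` helper and the seed are hypothetical.

```python
import random

# Same bounds as xgb_hyperparameter_ranges, expressed as plain tuples.
RANGES = {
    "max_depth": ("int", 5, 10),
    "min_child_weight": ("int", 5, 10),
    "eta": ("float", 0.1, 0.9),
    "gamma": ("int", 4, 9),
    "subsample": ("float", 0.7, 0.9),
}


def sample_configuration(rng):
    """Draw one hyperparameter combination uniformly from RANGES."""
    config = {}
    for name, (kind, low, high) in RANGES.items():
        if kind == "int":
            config[name] = rng.randint(low, high)  # both bounds inclusive
        else:
            config[name] = rng.uniform(low, high)
    return config


rng = random.Random(0)
candidates = [sample_configuration(rng) for _ in range(5)]  # mirrors max_jobs=5
```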


Define and create the `DTreeHPTune` step in the SageMaker pipeline. This is a hyperparameter tuning job using the Scikit-learn Decision Tree algorithm. Note that this algorithm runs in the custom Docker image pushed to the repository in section `Prepare the Decision Tree custom Docker image`.

In [None]:
dtree_image_uri = pipeline_session.sagemaker_client.describe_image_version(ImageName="diabetes-dtree")["ContainerImage"]

dtree_train = Estimator(
 image_uri=dtree_image_uri,
 role=role,
 instance_count=1,
 instance_type=training_instance_type,
 base_job_name=f"{base_job_prefix}/diabetes-dtree-train",
 output_path=model_path,
 sagemaker_session=pipeline_session,
)

dtree_objective_metric_name = "validation:auc"
dtree_metric_definitions = [{"Name": "validation:auc", "Regex": r"auc:(\S+)"}]

dtree_hyperparameter_ranges = {
 "max_depth": IntegerParameter(5, 10, scaling_type="Linear"),
 "max_leaf_nodes": IntegerParameter(2, 10, scaling_type="Linear"),
}

dtree_tuner_log = HyperparameterTuner(
 dtree_train,
 dtree_objective_metric_name,
 dtree_hyperparameter_ranges,
 dtree_metric_definitions,
 max_jobs=5,
 max_parallel_jobs=5,
 strategy="Random",
 objective_type="Maximize",
)

dtree_step_tuning = TuningStep(
 name="DTreeHPTune",
 tuner=dtree_tuner_log,
 inputs={
 "training": TrainingInput(
 s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
 "train"
 ].S3Output.S3Uri,
 content_type="text/csv",
 ),
 "validation": TrainingInput(
 s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
 "validation"
 ].S3Output.S3Uri,
 content_type="text/csv",
 ),
 },
 cache_config=cache_config,
)
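
For custom containers, SageMaker extracts the objective metric by applying the `Regex` from the metric definition to the training job's log output. The sketch below shows how the pattern used above would capture a value; the sample log line is an assumption about what the `train` script prints.

```python
import re

METRIC_REGEX = r"auc:(\S+)"

# Hypothetical log line emitted by the training script.
log_line = "validation auc:0.8731"

match = re.search(METRIC_REGEX, log_line)
auc = float(match.group(1)) if match else None
print(auc)  # 0.8731
```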


Define and create the `DTreeEval` step in the SageMaker pipeline. This uses `dtree_evaluate.py` to evaluate the performance of the model generated by the `DTreeHPTune` step using test data.

In [None]:
dtree_script_eval = ScriptProcessor(
 image_uri=dtree_image_uri,
 command=["python3"],
 instance_type=processing_instance_type,
 instance_count=1,
 base_job_name=f"{base_job_prefix}/script-dtree-eval",
 sagemaker_session=pipeline_session,
 role=role,
)

dtree_evaluation_report = PropertyFile(
 name="EvaluationReportDTree",
 output_name="dtree_evaluation",
 path="dtree_evaluation.json",
)

dtree_step_eval = ProcessingStep(
 name="DTreeEval",
 processor=dtree_script_eval,
 inputs=[
 ProcessingInput(
 source=dtree_step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=model_bucket_key),
 destination="/opt/ml/processing/model",
 ),
 ProcessingInput(
 source=step_process.properties.ProcessingOutputConfig.Outputs[
 "test"
 ].S3Output.S3Uri,
 destination="/opt/ml/processing/test",
 ),
 ],
 outputs=[
 ProcessingOutput(
 output_name="dtree_evaluation", source="/opt/ml/processing/evaluation"
 ),
 ],
 code="sagemaker-pipeline/pipelines/diabetes/dtree_evaluate.py",
 property_files=[dtree_evaluation_report],
)


Define and create the `XGBEval` step in the SageMaker pipeline. This uses `xgb_evaluate.py` to evaluate the performance of the generated model from `XGBHPTune` step using test data. 

In [None]:
xgb_script_eval = ScriptProcessor(
 image_uri=xgb_image_uri,
 command=["python3"],
 instance_type=processing_instance_type,
 instance_count=1,
 base_job_name=f"{base_job_prefix}/script-xgb-eval",
 sagemaker_session=pipeline_session,
 role=role,
)

xgb_evaluation_report = PropertyFile(
 name="EvaluationReportXGBoost",
 output_name="xgb_evaluation",
 path="xgb_evaluation.json",
)

xgb_step_eval = ProcessingStep(
 name="XGBEval",
 processor=xgb_script_eval,
 inputs=[
 ProcessingInput(
 source=xgb_step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=model_bucket_key),
 destination="/opt/ml/processing/model",
 ),
 ProcessingInput(
 source=step_process.properties.ProcessingOutputConfig.Outputs[
 "test"
 ].S3Output.S3Uri,
 destination="/opt/ml/processing/test",
 ),
 ],
 outputs=[
 ProcessingOutput(output_name="xgb_evaluation", source="/opt/ml/processing/evaluation"),
 ],
 code="sagemaker-pipeline/pipelines/diabetes/xgb_evaluate.py",
 property_files=[xgb_evaluation_report],
)


Retrieve the resulting AUC-ROC score from steps `DTreeEval` and `XGBEval` in the SageMaker pipeline.

In [None]:
xgb_model_metrics = ModelMetrics(
 model_statistics=MetricsSource(
 s3_uri="{}/xgb_evaluation.json".format(
 xgb_step_eval.arguments["ProcessingOutputConfig"]["Outputs"][0]["S3Output"]["S3Uri"]
 ),
 content_type="application/json",
 )
)

dtree_model_metrics = ModelMetrics(
 model_statistics=MetricsSource(
 s3_uri="{}/dtree_evaluation.json".format(
 dtree_step_eval.arguments["ProcessingOutputConfig"]["Outputs"][0]["S3Output"][
 "S3Uri"
 ]
 ),
 content_type="application/json",
 )
)

xgb_eval_metrics = JsonGet(
 #step_name=xgb_step_eval,
 step=xgb_step_eval,
 property_file=xgb_evaluation_report,
 json_path="classification_metrics.roc.value", # This should follow the structure of your report_dict defined in the evaluate.py file.
)

dtree_eval_metrics = JsonGet(
 #step_name=dtree_step_eval,
 step=dtree_step_eval,
 property_file=dtree_evaluation_report,
 json_path="classification_metrics.roc.value", # This should follow the structure of your report_dict defined in the evaluate.py file.
)
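
The `json_path` is resolved against the JSON file that the evaluation script writes to its output directory. As a sketch, assuming the evaluation script builds a `report_dict` shaped as below (the AUC value is a placeholder, and `resolve_json_path` is a hypothetical helper that only mimics simple dotted paths), the lookup works like this:

```python
import json

# Assumed structure of the report written by the evaluation script.
report_dict = {
    "classification_metrics": {
        "roc": {"value": 0.85},  # placeholder, not a real result
    },
}


def resolve_json_path(document, json_path):
    """Walk a nested dict following a dot-separated path."""
    node = document
    for key in json_path.split("."):
        node = node[key]
    return node


serialized = json.dumps(report_dict)
value = resolve_json_path(json.loads(serialized), "classification_metrics.roc.value")
print(value)  # 0.85
```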


Define and create the `AccuracyCond`, `DTreeReg-RegisterModel`, and `XGBReg-RegisterModel` steps in the SageMaker pipeline. If the AUC-ROC score of the Scikit-learn Decision Tree is greater than or equal to that of SageMaker XGBoost, the Decision Tree model is registered in the model registry. Otherwise, the XGBoost model is registered.

In [None]:
model_package_group_name = "sagemaker-diabetes"
# Register model step that will be conditionally executed
dtree_step_register = RegisterModel(
 name="DTreeReg",
 estimator=dtree_train,
 model_data=dtree_step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=model_bucket_key),
 content_types=["text/csv"],
 response_types=["text/csv"],
 inference_instances=["ml.t2.medium", "ml.m5.large"],
 transform_instances=["ml.m5.large"],
 model_package_group_name=model_package_group_name,
 approval_status=model_approval_status,
 model_metrics=dtree_model_metrics,
)

# Register model step that will be conditionally executed
xgb_step_register = RegisterModel(
 name="XGBReg",
 estimator=xgb_train,
 model_data=xgb_step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=model_bucket_key),
 content_types=["text/csv"],
 response_types=["text/csv"],
 inference_instances=["ml.t2.medium", "ml.m5.large"],
 transform_instances=["ml.m5.large"],
 model_package_group_name=model_package_group_name,
 approval_status=model_approval_status,
 model_metrics=xgb_model_metrics,
)

# Condition step for evaluating model quality and branching execution
cond_lte = ConditionGreaterThanOrEqualTo( # You can change the condition here
 left=JsonGet(
 #step_name=dtree_step_eval,
 step=dtree_step_eval,
 property_file=dtree_evaluation_report,
 json_path="classification_metrics.roc.value", # This should follow the structure of your report_dict defined in the evaluate.py file.
 ),
 right=JsonGet(
 #step_name=xgb_step_eval,
 step=xgb_step_eval,
 property_file=xgb_evaluation_report,
 json_path="classification_metrics.roc.value"
 ), 
)

step_cond = ConditionStep(
 name="AccuracyCond",
 conditions=[cond_lte],
 if_steps=[dtree_step_register],
 else_steps=[xgb_step_register],
)
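
Expressed in plain Python, the branching performed by the condition step is equivalent to the following sketch. The function name and the example scores are illustrative only; the returned strings are the names of the register steps defined above.

```python
def choose_register_step(dtree_auc: float, xgb_auc: float) -> str:
    """Return the name of the register step the condition would run."""
    # ConditionGreaterThanOrEqualTo: ties go to the Decision Tree branch.
    return "DTreeReg" if dtree_auc >= xgb_auc else "XGBReg"


print(choose_register_step(0.81, 0.86))  # XGBReg
print(choose_register_step(0.86, 0.86))  # DTreeReg
```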



Define and start the SageMaker pipeline. You should be able to see the running SageMaker pipeline in SageMaker Studio. 

In [None]:
pipeline_name = "sagemaker-diabetes"

pipeline = Pipeline(
 name=pipeline_name,
    parameters=[
        # Only Parameter objects are valid here; the instance types and counts
        # above are plain Python values fixed when the pipeline is defined.
        model_approval_status,
        input_data,
    ],
 steps=[
 data_wrangler_step,
 step_process,
 dtree_step_tuning,
 xgb_step_tuning,
 dtree_step_eval,
 xgb_step_eval,
 step_cond,
 ],
 sagemaker_session=pipeline_session,
)


pipeline.upsert(role_arn=role)
execution = pipeline.start()

## Approve top performing model in SageMaker model registry

After the SageMaker pipeline completes, the newly trained model is registered in the model registry.

1) Set `MODEL_VERSION` to the model version you want to approve. The cell below approves version 2; change it to match the version registered by your pipeline run.

2) Because EventBridge monitors model registry status changes, the approval triggers the SageMaker Projects model-deploy pipeline.

In [None]:
from sagemaker import get_execution_role, session
import boto3

role = get_execution_role()
sm_client = boto3.client("sagemaker")

MODEL_VERSION = "2"
AWS_REGION = boto3.Session().region_name
MODEL_PACKAGE_ARN = "arn:aws:sagemaker:{AWS_REGION}:{AWS_ACCOUNT}:model-package/sagemaker-diabetes/{MODEL_VERSION}".format(
 AWS_REGION=AWS_REGION,
 AWS_ACCOUNT=AWS_ACCOUNT, 
 MODEL_VERSION=MODEL_VERSION
)


model_package_update_response = sm_client.update_model_package(
 ModelPackageArn=MODEL_PACKAGE_ARN, ModelApprovalStatus="Approved"
)
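
Instead of hard-coding the version, the most recent model package in the group can be looked up. The following is a sketch using the `list_model_packages` API of the SageMaker boto3 client; the helper name `latest_model_package_arn` is an assumption, and the default group name matches `model_package_group_name` defined earlier.

```python
def latest_model_package_arn(sm_client, group_name="sagemaker-diabetes"):
    """Return the ARN of the most recently created model package in a group."""
    response = sm_client.list_model_packages(
        ModelPackageGroupName=group_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response["ModelPackageSummaryList"]
    if not summaries:
        raise LookupError(f"No model packages found in group {group_name!r}")
    return summaries[0]["ModelPackageArn"]
```

With the `sm_client` created above, `latest_model_package_arn(sm_client)` would return the ARN of the newest version, which could be used in place of the formatted `MODEL_PACKAGE_ARN` string.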

## Deploy the SageMaker inference endpoint

Import model into hosting. Register the model with hosting. This allows the flexibility of importing models trained elsewhere.

In [None]:
from time import gmtime, strftime

model_name = "diabetes-modelregistry-model-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name : {}".format(model_name))
container_list = [{"ModelPackageName": MODEL_PACKAGE_ARN}]

create_model_response = sm_client.create_model(
 ModelName=model_name, ExecutionRoleArn=role, Containers=container_list
)
print("Model arn : {}".format(create_model_response["ModelArn"]))

Create endpoint configuration. SageMaker supports configuring REST endpoints in hosting with multiple models, e.g. for A/B testing purposes. To support this, you create an endpoint configuration that describes how traffic is distributed across the models, whether split, shadowed, or sampled in some way. In addition, the endpoint configuration describes the instance type required for model deployment.

In [None]:
endpoint_config_name = "diabetes-modelregistry-EndpointConfig-" + strftime(
 "%Y-%m-%d-%H-%M-%S", gmtime()
)
print(endpoint_config_name)
create_endpoint_config_response = sm_client.create_endpoint_config(
 EndpointConfigName=endpoint_config_name,
 ProductionVariants=[
 {
 "InstanceType": "ml.m5.large",
 "InitialVariantWeight": 1,
 "InitialInstanceCount": 1,
 "ModelName": model_name,
 "VariantName": "AllTraffic",
 }
 ],
)
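
To illustrate the traffic-splitting case mentioned above, the sketch below builds a `ProductionVariants` list that sends roughly 90% of requests to one model and 10% to another. The model names are hypothetical placeholders, and the helper is an assumption rather than part of this notebook's workflow.

```python
def build_ab_variants(model_a, model_b, weight_a=0.9, weight_b=0.1,
                      instance_type="ml.m5.large"):
    """Build two production variants whose weights define the traffic split."""
    def variant(name, model_name, weight):
        return {
            "VariantName": name,
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": weight,
        }

    return [
        variant("VariantA", model_a, weight_a),
        variant("VariantB", model_b, weight_b),
    ]


variants = build_ab_variants("diabetes-model-a", "diabetes-model-b")
```

The resulting list could be passed as `ProductionVariants` to `create_endpoint_config`. Traffic is distributed in proportion to each variant's weight relative to the sum of all weights.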

Create endpoint. Lastly, you create the endpoint that serves the model by specifying the name and configuration defined above. The end result is an endpoint that can be validated and incorporated into your applications. This takes 9-11 minutes to complete.

In [None]:
endpoint_name = "diabetes-staging"
print("EndpointName={}".format(endpoint_name))

create_endpoint_response = sm_client.create_endpoint(
 EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)


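while True:
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = endpoint["EndpointStatus"]
    if status == "InService":
        break
    if status == "Failed":
        # Stop polling instead of looping forever when creation fails
        raise RuntimeError("Endpoint creation failed: {}".format(endpoint.get("FailureReason")))
    print("Waiting for the endpoint to be in service..")
    time.sleep(60)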

print("Endpoint arn : {}".format(create_endpoint_response["EndpointArn"]))

## Run predictions on model

Wait until SageMaker Projects model-deploy pipeline has deployed the staging inference endpoint. Use the following data for inference:

Example 1
------------
`5.0,64.0,0.0,18.0,0.0,0.0,7.0,9.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
`

In summary, this is a Caucasian female diabetic patient aged 60-70 who has spent 5 days in the hospital under emergency care in the current encounter. Prior to this encounter, the patient spent 0 days in outpatient care, 0 days in emergency care, and 7 days in inpatient care. 64 laboratory procedures have been performed on the patient. The patient is not using metformin, repaglinide, pioglitazone, or rosiglitazone, and the insulin prescription is steady.



In [None]:
import json
import boto3

sm_runtime = boto3.client("runtime.sagemaker")
endpoint_name ="diabetes-staging"
line = "5.0,64.0,0.0,18.0,0.0,0.0,7.0,9.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0"
response = sm_runtime.invoke_endpoint(EndpointName=endpoint_name, ContentType="text/csv", Body=line)
result = json.loads(response["Body"].read().decode())
print("Predicted class : {}".format(round(result)))

Now you try:

Example 2
------------

`3.0,19.0,3.0,19.0,0.0,0.0,0.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
`

In summary, this is a Caucasian female diabetic patient aged 70-80 who has spent 3 days in the hospital under elective care in the current encounter. Prior to this encounter, the patient spent 0 days in outpatient care, 0 days in emergency care, and 0 days in inpatient care. 19 laboratory procedures have been performed on the patient. The patient is not using metformin, repaglinide, pioglitazone, rosiglitazone, or insulin.

## Cleanup

To avoid incurring future charges, clean up the resources you created, such as the S3 bucket, the ECR repository, and SageMaker Studio. Before deleting SageMaker Studio, make sure to delete the SageMaker model and endpoint resources.
Finally, delete the Jupyter instance containing the notebook.
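
A sketch of part of this cleanup using boto3 follows. It takes the clients and resource names as arguments, and the function name `cleanup_resources` is hypothetical. It covers only the endpoint, endpoint configuration, model, S3 bucket, and ECR repository, and it assumes the bucket is not versioned.

```python
def cleanup_resources(sm_client, s3_resource, ecr_client, *,
                      endpoint_name, endpoint_config_name, model_name,
                      bucket_name, repository_name):
    """Delete the endpoint, its configuration, the model, the bucket, and the ECR repository."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)

    # A bucket must be empty before it can be deleted.
    bucket = s3_resource.Bucket(bucket_name)
    bucket.objects.all().delete()
    bucket.delete()

    # force=True also removes the images stored in the repository.
    ecr_client.delete_repository(repositoryName=repository_name, force=True)
```

In this notebook, the arguments would be `boto3.client("sagemaker")`, `boto3.resource("s3")`, and `boto3.client("ecr")`, together with `endpoint_name`, `endpoint_config_name`, `model_name`, `BUCKET_NAME`, and the repository name `diabetes-decision-trees` used earlier.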