{ "cells": [ { "cell_type": "markdown", "id": "671a914c", "metadata": {}, "source": [ "# Multi-model SageMaker Pipeline with Hyperparamater Tuning and Experiments" ] }, { "cell_type": "markdown", "id": "d697da1a", "metadata": {}, "source": [ "This notebook has been tested in a SageMaker notebook that is using a kernel with Python 3.7 installed, e.g. conda_mxnet_latest_p37, conda_python3." ] }, { "cell_type": "markdown", "id": "bd2ffcca", "metadata": {}, "source": [ "## Prepare the dataset collection" ] }, { "cell_type": "markdown", "id": "c125794f", "metadata": {}, "source": [ "We create an S3 bucket and with encryption enabled for additional security. \n", "\n", "#### If you are running this Notebook in us-east-1 region, don't use 'CreateBucketConfiguration' parameter with create_bucket(). us-east-1 is the default location." ] }, { "cell_type": "code", "execution_count": null, "id": "6f7edb84", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "\n", "AWS_ACCOUNT = boto3.client(\"sts\").get_caller_identity()[\"Account\"]\n", "AWS_REGION = boto3.Session().region_name" ] }, { "cell_type": "code", "execution_count": null, "id": "ef0780a3", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "\n", "AWS_ACCOUNT = boto3.client(\"sts\").get_caller_identity()[\"Account\"]\n", "AWS_REGION = boto3.Session().region_name\n", "PREFIX = \"sagemaker-diabetes\"\n", "BUCKET_NAME = \"{PREFIX}-{AWS_ACCOUNT}\".format(PREFIX=PREFIX,AWS_ACCOUNT=AWS_ACCOUNT)\n", "\n", "s3_client = boto3.client(\"s3\")\n", "location = {\"LocationConstraint\": AWS_REGION}\n", "\n", "# default location is us-east-1, so CreateBucketConfiguration is not needed\n", "s3_client.create_bucket(Bucket=BUCKET_NAME)\n", "\n", "# use this create_bucket statement for any AWS region other than us-east-1\n", "#s3_client.create_bucket(Bucket=BUCKET_NAME, CreateBucketConfiguration=location) \n", "\n", "s3_client.put_bucket_encryption(\n", " Bucket=BUCKET_NAME,\n", " ServerSideEncryptionConfiguration={\n", " \"Rules\": [\n", " {\n", " \"ApplyServerSideEncryptionByDefault\": {\"SSEAlgorithm\": \"AES256\"},\n", " },\n", " ]\n", " },\n", ")" ] }, { "cell_type": "markdown", "id": "a28a1f0e", "metadata": {}, "source": [ "Download UCI dataset and copy to S3 bucket. " ] }, { "cell_type": "code", "execution_count": null, "id": "57356c93", "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%sh\n", "\n", "AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)\n", "BUCKET_NAME=\"sagemaker-diabetes-${AWS_ACCOUNT}\"\n", "\n", "wget https://archive.ics.uci.edu/ml/machine-learning-databases/00296/dataset_diabetes.zip\n", "unzip dataset_diabetes.zip\n", "aws s3 cp dataset_diabetes/diabetic_data.csv s3://${BUCKET_NAME}/\n", " " ] }, { "cell_type": "markdown", "id": "61768499", "metadata": {}, "source": [ "Update diabetes.flow to use your AWS account ID. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "ee47dce2", "metadata": {}, "outputs": [], "source": [ "%%sh\n", "\n", "AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)\n", "sed -i \"s/AWS_ACCOUNT/${AWS_ACCOUNT}/g\" sagemaker-pipeline/diabetes.flow" ] }, { "cell_type": "markdown", "id": "50020dd6", "metadata": {}, "source": [ "Next, Create IAM Role for ML workflow steps" ] }, { "cell_type": "code", "execution_count": null, "id": "c1ff2f77", "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "iam_client = boto3.client(\"iam\")\n", "\n", "sagemaker_assume_role_policy_document = json.dumps(\n", " {\n", " \"Version\": \"2012-10-17\",\n", " \"Statement\": [\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\"Service\": \"sagemaker.amazonaws.com\"},\n", " \"Action\": \"sts:AssumeRole\",\n", " }\n", " ],\n", " }\n", ")\n", "\n", "response_role = iam_client.create_role(\n", " RoleName=\"AmazonSageMakerServiceCatalogProductsUseRole-diabetes\",\n", " AssumeRolePolicyDocument=sagemaker_assume_role_policy_document,\n", ")\n", "\n", "\n", "iam_client.attach_role_policy(\n", " RoleName=response_role[\"Role\"][\"RoleName\"],\n", " PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'\n", ")\n" ] }, { "cell_type": "markdown", "id": "6f41e95e", "metadata": {}, "source": [ "## Prepare the Decision Tree custom Docker image" ] }, { "cell_type": "markdown", "id": "6c642455", "metadata": {}, "source": [ "We make a Docker image containing a custom algorithm using [Scikit-learn Decision Tree Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). Note that the Docker image has been modified to support hyperparameter tuning and validation data. \n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "70f0000a", "metadata": {}, "outputs": [], "source": [ "! sudo yum install docker -y" ] }, { "cell_type": "code", "execution_count": null, "id": "3aec9c87", "metadata": {}, "outputs": [], "source": [ "%%sh\n", "\n", "# The name of our algorithm\n", "ALGORITHM_NAME=\"diabetes-decision-trees\"\n", "\n", "cd container\n", "\n", "chmod +x decision_trees/train\n", "chmod +x decision_trees/serve\n", "\n", "AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)\n", "AWS_REGION=$(aws configure get region)\n", "\n", "IMAGE_FULLNAME=\"${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ALGORITHM_NAME}:latest\"\n", "\n", "# If the repository doesn't exist in ECR, create it.\n", "aws ecr describe-repositories --repository-names \"${ALGORITHM_NAME}\" > /dev/null 2>&1\n", "\n", "if [ $? -ne 0 ]\n", "then\n", " aws ecr create-repository --repository-name \"${ALGORITHM_NAME}\" > /dev/null\n", "fi\n", "\n", "# Get the login command from ECR and execute it directly\n", "aws ecr get-login-password --region ${AWS_REGION}|docker login --username AWS --password-stdin ${IMAGE_FULLNAME}\n", "\n", "# Build the docker image locally with the image name and then push it to ECR with the full name.\n", "# Ensure your notebook IAM role has required permission for pushing image to ECR\n", "\n", "docker build -t ${ALGORITHM_NAME} .\n", "docker tag ${ALGORITHM_NAME} ${IMAGE_FULLNAME}\n", "docker push ${IMAGE_FULLNAME}\n" ] }, { "cell_type": "markdown", "id": "11b15c71", "metadata": {}, "source": [ "Once Docker image is pushed to ECR repository, we make the image accessible from SageMaker. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "f3e03c17", "metadata": {}, "outputs": [], "source": [ "%%sh\n", "\n", "# The name of our algorithm\n", "SM_IMAGE_NAME=diabetes-dtree\n", "AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)\n", "\n", "# This assumes the role name is AmazonSageMakerServiceCatalogProductsUseRole-diabetes\n", "ROLE_ARN=\"arn:aws:iam::${AWS_ACCOUNT}:role/AmazonSageMakerServiceCatalogProductsUseRole-diabetes\"\n", "\n", "aws sagemaker create-image \\\n", " --image-name ${SM_IMAGE_NAME} \\\n", " --role-arn ${ROLE_ARN}\n", "\n", "aws sagemaker create-app-image-config \\\n", " --cli-input-json file://container/app-image-config-input.json\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ad3a940d", "metadata": {}, "outputs": [], "source": [ "%%sh\n", "AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)\n", "ALGORITHM_NAME=diabetes-decision-trees\n", "AWS_REGION=$(aws configure get region)\n", "SM_IMAGE_NAME=diabetes-dtree\n", "SM_BASE_IMAGE=\"${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${ALGORITHM_NAME}:latest\"\n", "\n", "aws sagemaker create-image-version \\\n", " --image-name ${SM_IMAGE_NAME} \\\n", " --base-image ${SM_BASE_IMAGE}" ] }, { "cell_type": "markdown", "id": "9eb7dd2b", "metadata": {}, "source": [ "## Define and start the SageMaker pipeline" ] }, { "cell_type": "markdown", "id": "4e0f7a38", "metadata": {}, "source": [ "Install the necessary Python library `awswrangler` for the SageMaker pipeline. " ] }, { "cell_type": "code", "execution_count": null, "id": "21ed4a32", "metadata": {}, "outputs": [], "source": [ "! pip3 install awswrangler" ] }, { "cell_type": "markdown", "id": "10abe34d", "metadata": {}, "source": [ "Import the necessary Python modules for the SageMaker pipeline. " ] }, { "cell_type": "code", "execution_count": null, "id": "faaac6eb", "metadata": {}, "outputs": [], "source": [ "import os\n", "import time\n", "import uuid\n", "import json\n", "import boto3\n", "import sagemaker\n", "import sagemaker.session\n", "from sagemaker.estimator import Estimator\n", "from sagemaker.inputs import TrainingInput\n", "from sagemaker.model_metrics import MetricsSource, ModelMetrics\n", "from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor\n", "from sagemaker.sklearn.processing import SKLearnProcessor\n", "from sagemaker.workflow.condition_step import ConditionStep, JsonGet\n", "#from sagemaker.workflow.functions import JsonGet\n", "from sagemaker.workflow.pipeline_context import PipelineSession\n", "from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo\n", "from sagemaker.workflow.parameters import ParameterInteger, ParameterString\n", "from sagemaker.workflow.pipeline import Pipeline\n", "from sagemaker.workflow.properties import PropertyFile\n", "from sagemaker.workflow.step_collections import RegisterModel\n", "from sagemaker.workflow.steps import ProcessingStep, TrainingStep, CacheConfig, TuningStep\n", "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", "from sagemaker.processing import Processor\n", "from sagemaker.network import NetworkConfig\n", "from sagemaker.tuner import (\n", " ContinuousParameter,\n", " IntegerParameter,\n", " CategoricalParameter,\n", " HyperparameterTuner,\n", " WarmStartConfig,\n", " WarmStartTypes,\n", ")" ] }, { "cell_type": "markdown", "id": "a0667f2e", "metadata": {}, "source": [ "Create boto3 session and define pipeline step instance count and other configuration. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "6a425a1e", "metadata": {}, "outputs": [], "source": [ "region = AWS_REGION\n", "default_bucket = BUCKET_NAME\n", "\n", "pipeline_session = PipelineSession()\n", "role = sagemaker.session.get_execution_role(pipeline_session)\n", "\n", "datawrangler_instance_count = 2\n", "datawrangler_instance_type = \"ml.m5.4xlarge\" \n", "processing_instance_count = 1\n", "processing_instance_type = \"ml.m5.2xlarge\"\n", "training_instance_count = 1\n", "training_instance_type = \"ml.m5.2xlarge\"\n", "model_approval_status = ParameterString(\n", " name=\"ModelApprovalStatus\",\n", " default_value=\"PendingManualApproval\", # ModelApprovalStatus can be set to a default of \"Approved\" if you don't want manual approval.\n", ")\n", "input_data = ParameterString(\n", " name=\"InputDataUrl\",\n", " default_value=f\"\", # Change this to point to the s3 location of your raw input data.\n", ")\n" ] }, { "cell_type": "markdown", "id": "db038a2b", "metadata": {}, "source": [ "Define and create the `DataWranglerProcess` step in the SageMaker pipeline. This step uses an existing Data Wrangler flow file `diabetes.flow` that has the following transformations:\n", "* Move column readmitted to the beginning. This column is to be predicted in the classification problem. \n", "* Convert readmitted column value to 0 if it is NO and 1 if it is <30 or >30.\n", "* Drop the columns that have minimal to zero prediction power based on Data Wrangler Data Quality and Insights Report, e.g. payer_code and encounter_id. \n", "* Group values into finite categories using Python custom transform in the following columns: diag_1, diag_2, diag_3, admission_type_id, admission_source_id, and discharge_disposition_id. \n", "* Fill missing values in columns diag_1, diag_2, diag_3 and replace strings in column race.\n", "* Drop duplicates, balance data using SMOTE, and one-hot encode the following columns: race, gender, age, diag_1, diag_2, diag_3, max_glu_serum, A1Cresult, metformin, repaglinide, pioglitazone, rosiglitazone, insulin, change, diabetesMed, admission_type_id, discharge_disposition_id, admission_source_id. \n", "\n", "To use your own transformations, replace `output_name` and `flow_file_name`." 
] }, { "cell_type": "code", "execution_count": null, "id": "fb83bcb5", "metadata": {}, "outputs": [], "source": [ "print(f\"Data Wrangler export storage bucket: {default_bucket}\")\n", "\n", "# unique flow export ID\n", "flow_export_id = f\"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}\"\n", "flow_export_name = f\"flow-{flow_export_id}\"\n", "\n", "output_name = \"d593101e-278b-4330-9779-b6e02fbeb99e.default\"\n", "\n", "s3_output_prefix = f\"export-{flow_export_name}/output\"\n", "s3_output_path = f\"s3://{default_bucket}/{s3_output_prefix}\"\n", "print(f\"Flow S3 export result path: {s3_output_path}\")\n", "\n", "processing_job_output = ProcessingOutput(\n", " output_name=output_name,\n", " source=\"/opt/ml/processing/output\",\n", " destination=s3_output_path,\n", " s3_upload_mode=\"EndOfJob\",\n", ")\n", "\n", "# name of the flow file which should exist in the current notebook working directory\n", "flow_file_name = \"sagemaker-pipeline/diabetes.flow\"\n", "\n", "# Load .flow file from current notebook working directory\n", "#!echo \"Loading flow file from current notebook working directory: $PWD\"\n", "\n", "with open(flow_file_name) as f:\n", " flow = json.load(f)\n", "\n", "# Upload flow to S3\n", "s3_client = boto3.client(\"s3\")\n", "s3_client.upload_file(\n", " flow_file_name,\n", " default_bucket,\n", " f\"data_wrangler_flows/{flow_export_name}.flow\",\n", " ExtraArgs={\"ServerSideEncryption\": \"aws:kms\"},\n", ")\n", "\n", "flow_s3_uri = f\"s3://{default_bucket}/data_wrangler_flows/{flow_export_name}.flow\"\n", "\n", "print(f\"Data Wrangler flow {flow_file_name} uploaded to {flow_s3_uri}\")\n", "\n", "flow_input = ProcessingInput(\n", " source=flow_s3_uri,\n", " destination=\"/opt/ml/processing/flow\",\n", " input_name=\"flow\",\n", " s3_data_type=\"S3Prefix\",\n", " s3_input_mode=\"File\",\n", " s3_data_distribution_type=\"FullyReplicated\",\n", ")\n", "\n", "# IAM role for executing the processing job.\n", "iam_role = role\n", "\n", "# Unique processing job name. Give a unique name every time you re-execute processing jobs\n", "processing_job_name = f\"data-wrangler-flow-processing-{flow_export_id}\"\n", "\n", "# Size in GB of the EBS volume to use for storing data during processing\n", "volume_size_in_gb = 30\n", "\n", "# Content type for each output. 
Data Wrangler supports CSV as default and Parquet.\n", "output_content_type = \"CSV\"\n", "\n", "# Network Isolation mode; default is off\n", "enable_network_isolation = False\n", "\n", "# List of tags to be passed to the processing job\n", "user_tags = []\n", "\n", "# Output configuration used as processing job container arguments\n", "output_config = {output_name: {\"content_type\": output_content_type}}\n", "\n", "# KMS key for per object encryption; default is None\n", "kms_key = None\n", "\n", "# Data Wrangler Container URL.\n", "container_uri = sagemaker.image_uris.retrieve(\n", " framework=\"data-wrangler\", \n", " region=region, \n", " version=\"1.x\",\n", ")\n", "\n", "processor = Processor(\n", " role=iam_role,\n", " image_uri=container_uri,\n", " instance_count=datawrangler_instance_count,\n", " instance_type=datawrangler_instance_type,\n", " volume_size_in_gb=volume_size_in_gb,\n", " network_config=NetworkConfig(enable_network_isolation=enable_network_isolation),\n", " sagemaker_session=pipeline_session,\n", " output_kms_key=kms_key,\n", " tags=user_tags,\n", ")\n", "\n", "data_wrangler_step = ProcessingStep(\n", " name=\"DataWranglerProcess\",\n", " processor=processor,\n", " inputs=[flow_input],\n", " outputs=[processing_job_output],\n", " job_arguments=[f\"--output-config '{json.dumps(output_config)}'\"],\n", ")\n" ] }, { "cell_type": "markdown", "id": "db4599ea", "metadata": {}, "source": [ "Define and create the `Preprocess` step in the SageMaker pipeline. This step reads the transformed data from the DataWranglerProcess, randomizes, and splits the data into train (70%), validation (10%), and test data (20%). \n", "\n", "You can also put here other necessary transformations and pre-processing changes that are done outside of Data Wrangler. " ] }, { "cell_type": "code", "execution_count": null, "id": "bfb50e53", "metadata": {}, "outputs": [], "source": [ "base_job_prefix = \"sagemaker-diabetes\"\n", "\n", "sklearn_processor = SKLearnProcessor(\n", " framework_version=\"0.23-1\",\n", " instance_type=processing_instance_type,\n", " instance_count=processing_instance_count,\n", " base_job_name=f\"{base_job_prefix}/sklearn-diabetes-preprocess\", # choose any name\n", " sagemaker_session=pipeline_session,\n", " role=role\n", ")\n", "\n", "step_process = ProcessingStep(\n", " name=\"Preprocess\", # choose any name\n", " processor=sklearn_processor,\n", " inputs=[\n", " ProcessingInput(\n", " source=data_wrangler_step.properties.ProcessingOutputConfig.Outputs[\n", " output_name\n", " ].S3Output.S3Uri,\n", " destination=\"/opt/ml/processing/data/raw-data-dir\",\n", " )\n", " ],\n", " outputs=[\n", " ProcessingOutput(output_name=\"train\", source=\"/opt/ml/processing/train\"),\n", " ProcessingOutput(output_name=\"validation\", source=\"/opt/ml/processing/validation\"),\n", " ProcessingOutput(output_name=\"test\", source=\"/opt/ml/processing/test\"),\n", " ],\n", " code=\"sagemaker-pipeline/pipelines/diabetes/preprocess.py\",\n", " job_arguments=[\n", " \"--input-data\",\n", " data_wrangler_step.properties.ProcessingOutputConfig.Outputs[\n", " output_name\n", " ].S3Output.S3Uri,\n", " ],\n", ")\n" ] }, { "cell_type": "markdown", "id": "ade846a1", "metadata": {}, "source": [ "Define and create the `XGBHPTune` step in the SageMaker pipeline. This is a hyperparameter tuning job using SageMaker XGBoost algorithm." 
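Once the pipeline has executed, the trials of this tuning step can be inspected outside the pipeline (a sketch; the tuning job name below is a hypothetical placeholder, look it up in the Studio UI or via `list_hyper_parameter_tuning_jobs`):

```python
# Post-run inspection sketch: rank the trials of a completed tuning job by objective value.
from sagemaker import HyperparameterTuningJobAnalytics

tuning_job_name = "<XGBHPTune-tuning-job-name>"  # hypothetical placeholder
df = HyperparameterTuningJobAnalytics(tuning_job_name).dataframe()
print(df.sort_values("FinalObjectiveValue", ascending=False).head())
```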
] }, { "cell_type": "code", "execution_count": null, "id": "f1815db4", "metadata": {}, "outputs": [], "source": [ "# Training step for generating model artifacts\n", "model_path = f\"s3://{pipeline_session.default_bucket()}/{base_job_prefix}/diabetesTrain\"\n", "model_bucket_key = f\"{pipeline_session.default_bucket()}/{base_job_prefix}/diabetesTrain\"\n", "cache_config = CacheConfig(enable_caching=True, expire_after=\"30d\")\n", "\n", "xgb_image_uri = sagemaker.image_uris.retrieve(\n", " framework=\"xgboost\", # we are using the Sagemaker built in xgboost algorithm\n", " region=region,\n", " version=\"1.0-1\",\n", " py_version=\"py3\",\n", " instance_type=training_instance_type,\n", " image_scope=\"training\"\n", ")\n", "xgb_train = Estimator(\n", " image_uri=xgb_image_uri,\n", " instance_type=training_instance_type,\n", " instance_count=training_instance_count,\n", " output_path=model_path,\n", " base_job_name=f\"{base_job_prefix}/diabetes-xgb-train\",\n", " sagemaker_session=pipeline_session,\n", " role=role,\n", ")\n", "xgb_train.set_hyperparameters(\n", " num_round=50,\n", " objective=\"binary:logistic\", # we are using binary:logistic as the objective function for classification \n", ")\n", "\n", "xgb_train.set_hyperparameters(grow_policy=\"lossguide\")\n", "\n", "xgb_objective_metric_name = \"validation:auc\" # we are using AUC as a performance metric \n", "xgb_hyperparameter_ranges = {\n", " \"max_depth\": IntegerParameter(5, 10, scaling_type=\"Auto\"),\n", " \"min_child_weight\": IntegerParameter(5, 10, scaling_type=\"Auto\"),\n", " \"eta\": ContinuousParameter(0.1, 0.9, scaling_type=\"Auto\"),\n", " \"gamma\": IntegerParameter(4, 9, scaling_type=\"Auto\"),\n", " \"subsample\": ContinuousParameter(0.7, 0.9, scaling_type=\"Auto\"),\n", "}\n", "\n", "xgb_tuner_log = HyperparameterTuner(\n", " xgb_train,\n", " xgb_objective_metric_name,\n", " xgb_hyperparameter_ranges,\n", " max_jobs=5,\n", " max_parallel_jobs=5,\n", " strategy=\"Random\",\n", " objective_type=\"Maximize\",\n", ")\n", "\n", "xgb_step_tuning = TuningStep(\n", " name=\"XGBHPTune\",\n", " tuner=xgb_tuner_log,\n", " inputs={\n", " \"train\": TrainingInput(\n", " s3_data=step_process.properties.ProcessingOutputConfig.Outputs[\n", " \"train\"\n", " ].S3Output.S3Uri,\n", " content_type=\"text/csv\",\n", " ),\n", " \"validation\": TrainingInput(\n", " s3_data=step_process.properties.ProcessingOutputConfig.Outputs[\n", " \"validation\"\n", " ].S3Output.S3Uri,\n", " content_type=\"text/csv\",\n", " ),\n", " },\n", " cache_config=cache_config,\n", ")\n" ] }, { "cell_type": "markdown", "id": "d480e28f", "metadata": {}, "source": [ "Define and create the `DTreeHPTune` step in the SageMaker pipeline. This is a hyperparameter tuning job using Scikit-learn Decision Tree algorithm. Note that this is in a custom Docker image pushed to the repository in section ` Prepare the Decision Tree custom Docker image`. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "a0abbb71", "metadata": {}, "outputs": [], "source": [ "dtree_image_uri = pipeline_session.sagemaker_client.describe_image_version(ImageName=\"diabetes-dtree\")[\"ContainerImage\"]\n", "\n", "dtree_train = Estimator(\n", " image_uri=dtree_image_uri,\n", " role=role,\n", " instance_count=1,\n", " instance_type=training_instance_type,\n", " base_job_name=f\"{base_job_prefix}/diabetes-dtree-train\",\n", " output_path=model_path,\n", " sagemaker_session=pipeline_session,\n", ")\n", "\n", "dtree_objective_metric_name = \"validation:auc\"\n", "dtree_metric_definitions = [{\"Name\": \"validation:auc\", \"Regex\": \"auc:(\\S+)\"}]\n", "\n", "dtree_hyperparameter_ranges = {\n", " \"max_depth\": IntegerParameter(5, 10, scaling_type=\"Linear\"),\n", " \"max_leaf_nodes\": IntegerParameter(2, 10, scaling_type=\"Linear\"),\n", "}\n", "\n", "dtree_tuner_log = HyperparameterTuner(\n", " dtree_train,\n", " dtree_objective_metric_name,\n", " dtree_hyperparameter_ranges,\n", " dtree_metric_definitions,\n", " max_jobs=5,\n", " max_parallel_jobs=5,\n", " strategy=\"Random\",\n", " objective_type=\"Maximize\",\n", ")\n", "\n", "dtree_step_tuning = TuningStep(\n", " name=\"DTreeHPTune\",\n", " tuner=dtree_tuner_log,\n", " inputs={\n", " \"training\": TrainingInput(\n", " s3_data=step_process.properties.ProcessingOutputConfig.Outputs[\n", " \"train\"\n", " ].S3Output.S3Uri,\n", " content_type=\"text/csv\",\n", " ),\n", " \"validation\": TrainingInput(\n", " s3_data=step_process.properties.ProcessingOutputConfig.Outputs[\n", " \"validation\"\n", " ].S3Output.S3Uri,\n", " content_type=\"text/csv\",\n", " ),\n", " },\n", " cache_config=cache_config,\n", ")\n" ] }, { "cell_type": "markdown", "id": "581bd6de", "metadata": {}, "source": [ "Define and create the `DtreeEval` step in the SageMaker pipeline. This uses `dtree_evaluate.py` to evaluate the performance of the generated model from `DTreeHPTune` step using test data. " ] }, { "cell_type": "code", "execution_count": null, "id": "3d08243a", "metadata": {}, "outputs": [], "source": [ "dtree_script_eval = ScriptProcessor(\n", " image_uri=dtree_image_uri,\n", " command=[\"python3\"],\n", " instance_type=processing_instance_type,\n", " instance_count=1,\n", " base_job_name=f\"{base_job_prefix}/script-dtree-eval\",\n", " sagemaker_session=pipeline_session,\n", " role=role,\n", ")\n", "\n", "dtree_evaluation_report = PropertyFile(\n", " name=\"EvaluationReportDTree\",\n", " output_name=\"dtree_evaluation\",\n", " path=\"dtree_evaluation.json\",\n", ")\n", "\n", "dtree_step_eval = ProcessingStep(\n", " name=\"DTreeEval\",\n", " processor=dtree_script_eval,\n", " inputs=[\n", " ProcessingInput(\n", " source=dtree_step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=model_bucket_key),\n", " destination=\"/opt/ml/processing/model\",\n", " ),\n", " ProcessingInput(\n", " source=step_process.properties.ProcessingOutputConfig.Outputs[\n", " \"test\"\n", " ].S3Output.S3Uri,\n", " destination=\"/opt/ml/processing/test\",\n", " ),\n", " ],\n", " outputs=[\n", " ProcessingOutput(\n", " output_name=\"dtree_evaluation\", source=\"/opt/ml/processing/evaluation\"\n", " ),\n", " ],\n", " code=\"sagemaker-pipeline/pipelines/diabetes/dtree_evaluate.py\",\n", " property_files=[dtree_evaluation_report],\n", ")\n" ] }, { "cell_type": "markdown", "id": "85b7e5e0", "metadata": {}, "source": [ "Define and create the `XGBEval` step in the SageMaker pipeline. 
This uses `xgb_evaluate.py` to evaluate the performance of the generated model from `XGBHPTune` step using test data. " ] }, { "cell_type": "code", "execution_count": null, "id": "6271fd78", "metadata": {}, "outputs": [], "source": [ "xgb_script_eval = ScriptProcessor(\n", " image_uri=xgb_image_uri,\n", " command=[\"python3\"],\n", " instance_type=processing_instance_type,\n", " instance_count=1,\n", " base_job_name=f\"{base_job_prefix}/script-xgb-eval\",\n", " sagemaker_session=pipeline_session,\n", " role=role,\n", ")\n", "\n", "xgb_evaluation_report = PropertyFile(\n", " name=\"EvaluationReportXGBoost\",\n", " output_name=\"xgb_evaluation\",\n", " path=\"xgb_evaluation.json\",\n", ")\n", "\n", "xgb_step_eval = ProcessingStep(\n", " name=\"XGBEval\",\n", " processor=xgb_script_eval,\n", " inputs=[\n", " ProcessingInput(\n", " source=xgb_step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=model_bucket_key),\n", " destination=\"/opt/ml/processing/model\",\n", " ),\n", " ProcessingInput(\n", " source=step_process.properties.ProcessingOutputConfig.Outputs[\n", " \"test\"\n", " ].S3Output.S3Uri,\n", " destination=\"/opt/ml/processing/test\",\n", " ),\n", " ],\n", " outputs=[\n", " ProcessingOutput(output_name=\"xgb_evaluation\", source=\"/opt/ml/processing/evaluation\"),\n", " ],\n", " code=\"sagemaker-pipeline/pipelines/diabetes/xgb_evaluate.py\",\n", " property_files=[xgb_evaluation_report],\n", ")\n" ] }, { "cell_type": "markdown", "id": "694e6b79", "metadata": {}, "source": [ "Retrieve the resulting AUC-ROC score from steps `DTreeEval` and `XGBEval` in the SageMaker pipeline." ] }, { "cell_type": "code", "execution_count": null, "id": "6f7decd9", "metadata": {}, "outputs": [], "source": [ "xgb_model_metrics = ModelMetrics(\n", " model_statistics=MetricsSource(\n", " s3_uri=\"{}/xgb_evaluation.json\".format(\n", " xgb_step_eval.arguments[\"ProcessingOutputConfig\"][\"Outputs\"][0][\"S3Output\"][\"S3Uri\"]\n", " ),\n", " content_type=\"application/json\",\n", " )\n", ")\n", "\n", "dtree_model_metrics = ModelMetrics(\n", " model_statistics=MetricsSource(\n", " s3_uri=\"{}/dtree_evaluation.json\".format(\n", " dtree_step_eval.arguments[\"ProcessingOutputConfig\"][\"Outputs\"][0][\"S3Output\"][\n", " \"S3Uri\"\n", " ]\n", " ),\n", " content_type=\"application/json\",\n", " )\n", ")\n", "\n", "xgb_eval_metrics = JsonGet(\n", " #step_name=xgb_step_eval,\n", " step=xgb_step_eval,\n", " property_file=xgb_evaluation_report,\n", " json_path=\"classification_metrics.roc.value\", # This should follow the structure of your report_dict defined in the evaluate.py file.\n", ")\n", "\n", "dtree_eval_metrics = JsonGet(\n", " #step_name=dtree_step_eval,\n", " step=dtree_step_eval,\n", " property_file=dtree_evaluation_report,\n", " json_path=\"classification_metrics.roc.value\", # This should follow the structure of your report_dict defined in the evaluate.py file.\n", ")\n" ] }, { "cell_type": "markdown", "id": "d187a447", "metadata": {}, "source": [ "Define and create ` AccuracyCond`. `DTreeReg-RegisterModel` and `XGBReg-RegisterModel` steps in the SageMaker pipeline. IF AUC-ROC score of Scikit-learn Decision Tree is greater than SageMaker XGBoost, then the Decision Tree model is registered in the model registry. Else, XGBoost is registered in the model registry. 
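The `JsonGet` expressions used above and in the next cell resolve paths like `classification_metrics.roc.value` inside the evaluation property files, so both `dtree_evaluate.py` and `xgb_evaluate.py` are assumed to write a report shaped roughly like the following (a sketch of the structure only; the real keys come from the `report_dict` in those scripts):

```python
# Assumed shape of the evaluation reports referenced by JsonGet and the property files.
import json

example_report = {"classification_metrics": {"roc": {"value": 0.82}}}
print(json.dumps(example_report, indent=2))
```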
" ] }, { "cell_type": "code", "execution_count": null, "id": "0e4cd8c1", "metadata": {}, "outputs": [], "source": [ "model_package_group_name = \"sagemaker-diabetes\"\n", "# Register model step that will be conditionally executed\n", "dtree_step_register = RegisterModel(\n", " name=\"DTreeReg\",\n", " estimator=dtree_train,\n", " model_data=dtree_step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=model_bucket_key),\n", " content_types=[\"text/csv\"],\n", " response_types=[\"text/csv\"],\n", " inference_instances=[\"ml.t2.medium\", \"ml.m5.large\"],\n", " transform_instances=[\"ml.m5.large\"],\n", " model_package_group_name=model_package_group_name,\n", " approval_status=model_approval_status,\n", " model_metrics=dtree_model_metrics,\n", ")\n", "\n", "# Register model step that will be conditionally executed\n", "xgb_step_register = RegisterModel(\n", " name=\"XGBReg\",\n", " estimator=xgb_train,\n", " model_data=xgb_step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=model_bucket_key),\n", " content_types=[\"text/csv\"],\n", " response_types=[\"text/csv\"],\n", " inference_instances=[\"ml.t2.medium\", \"ml.m5.large\"],\n", " transform_instances=[\"ml.m5.large\"],\n", " model_package_group_name=model_package_group_name,\n", " approval_status=model_approval_status,\n", " model_metrics=xgb_model_metrics,\n", ")\n", "\n", "# Condition step for evaluating model quality and branching execution\n", "cond_lte = ConditionGreaterThanOrEqualTo( # You can change the condition here\n", " left=JsonGet(\n", " #step_name=dtree_step_eval,\n", " step=dtree_step_eval,\n", " property_file=dtree_evaluation_report,\n", " json_path=\"classification_metrics.roc.value\", # This should follow the structure of your report_dict defined in the evaluate.py file.\n", " ),\n", " right=JsonGet(\n", " #step_name=xgb_step_eval,\n", " step=xgb_step_eval,\n", " property_file=xgb_evaluation_report,\n", " json_path=\"classification_metrics.roc.value\"\n", " ), \n", ")\n", "\n", "step_cond = ConditionStep(\n", " name=\"AccuracyCond\",\n", " conditions=[cond_lte],\n", " if_steps=[dtree_step_register],\n", " else_steps=[xgb_step_register],\n", ")\n", "\n" ] }, { "cell_type": "markdown", "id": "b9d1ace6", "metadata": {}, "source": [ "Define and start the SageMaker pipeline. You should be able to see the running SageMaker pipeline in SageMaker Studio. " ] }, { "cell_type": "code", "execution_count": null, "id": "5cbfb086", "metadata": {}, "outputs": [], "source": [ "pipeline_name = \"sagemaker-diabetes\"\n", "\n", "pipeline = Pipeline(\n", " name=pipeline_name,\n", " parameters=[\n", " datawrangler_instance_type,\n", " datawrangler_instance_count, \n", " processing_instance_type,\n", " processing_instance_count,\n", " training_instance_type,\n", " training_instance_count,\n", " model_approval_status,\n", " input_data\n", " ],\n", " steps=[\n", " data_wrangler_step,\n", " step_process,\n", " dtree_step_tuning,\n", " xgb_step_tuning,\n", " dtree_step_eval,\n", " xgb_step_eval,\n", " step_cond,\n", " ],\n", " sagemaker_session=pipeline_session,\n", ")\n", "\n", "\n", "pipeline.upsert(role_arn=role)\n", "execution = pipeline.start()" ] }, { "cell_type": "markdown", "id": "858e43e8", "metadata": {}, "source": [ "## Approve top performing model in SageMaker model registry" ] }, { "cell_type": "markdown", "id": "1ff84ce4", "metadata": {}, "source": [ "After the SageMaker Pipeline is complete, new trained Model will be registered in Model Registry.\n", "\n", "1) Make sure to update your desired `MODEL_VERSION`. 
We assume we approve the model version 1. \n", "\n", "2) As EventBridge monitors Model Registry status changes, Model status change will trigger SageMaker Projects model-deploy pipeline." ] }, { "cell_type": "code", "execution_count": null, "id": "698e3477", "metadata": {}, "outputs": [], "source": [ "from sagemaker import get_execution_role, session\n", "import boto3\n", "\n", "role = get_execution_role()\n", "sm_client = boto3.client(\"sagemaker\")\n", "\n", "MODEL_VERSION = \"2\"\n", "AWS_REGION = boto3.Session().region_name\n", "MODEL_PACKAGE_ARN = \"arn:aws:sagemaker:{AWS_REGION}:{AWS_ACCOUNT}:model-package/sagemaker-diabetes/{MODEL_VERSION}\".format(\n", " AWS_REGION=AWS_REGION,\n", " AWS_ACCOUNT=AWS_ACCOUNT, \n", " MODEL_VERSION=MODEL_VERSION\n", ")\n", "\n", "\n", "model_package_update_response = sm_client.update_model_package(\n", " ModelPackageArn=MODEL_PACKAGE_ARN, ModelApprovalStatus=\"Approved\"\n", ")" ] }, { "cell_type": "markdown", "id": "6ce11a41", "metadata": {}, "source": [ "## Deploy the SageMaker inference endpoint" ] }, { "cell_type": "markdown", "id": "e4c895fc", "metadata": {}, "source": [ "Import model into hosting. Register the model with hosting. This allows the flexibility of importing models trained elsewhere." ] }, { "cell_type": "code", "execution_count": null, "id": "73d09101", "metadata": {}, "outputs": [], "source": [ "from time import gmtime, strftime\n", "\n", "model_name = \"diabetes-modelregistry-model-\" + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", "print(\"Model name : {}\".format(model_name))\n", "container_list = [{\"ModelPackageName\": MODEL_PACKAGE_ARN}]\n", "\n", "create_model_response = sm_client.create_model(\n", " ModelName=model_name, ExecutionRoleArn=role, Containers=container_list\n", ")\n", "print(\"Model arn : {}\".format(create_model_response[\"ModelArn\"]))" ] }, { "cell_type": "markdown", "id": "4a7623ed", "metadata": {}, "source": [ "Create endpoint configuration. SageMaker supports configuring REST endpoints in hosting with multiple models, e.g. for A/B testing purposes. In order to support this, customers create an endpoint configuration, that describes the distribution of traffic across the models, whether split, shadowed, or sampled in some way. In addition, the endpoint configuration describes the instance type required for model deployment." ] }, { "cell_type": "code", "execution_count": null, "id": "40838588", "metadata": {}, "outputs": [], "source": [ "endpoint_config_name = \"diabetes-modelregistry-EndpointConfig-\" + strftime(\n", " \"%Y-%m-%d-%H-%M-%S\", gmtime()\n", ")\n", "print(endpoint_config_name)\n", "create_endpoint_config_response = sm_client.create_endpoint_config(\n", " EndpointConfigName=endpoint_config_name,\n", " ProductionVariants=[\n", " {\n", " \"InstanceType\": \"ml.m5.large\",\n", " \"InitialVariantWeight\": 1,\n", " \"InitialInstanceCount\": 1,\n", " \"ModelName\": model_name,\n", " \"VariantName\": \"AllTraffic\",\n", " }\n", " ],\n", ")" ] }, { "cell_type": "markdown", "id": "c0e6759f", "metadata": {}, "source": [ "Create endpoint. Lastly, the customer creates the endpoint that serves up the model, through specifying the name and configuration defined above. The end result is an endpoint that can be validated and incorporated into your applications. This takes 9-11 minutes to complete." 
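As an alternative to the polling loop in the next cell, boto3 provides an `endpoint_in_service` waiter (a minimal sketch):

```python
# Minimal sketch: block until the endpoint is InService using the built-in boto3 waiter
# instead of the manual describe/sleep loop in the next cell.
import boto3

sm_client = boto3.client("sagemaker")
sm_client.get_waiter("endpoint_in_service").wait(EndpointName="diabetes-staging")
```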
] }, { "cell_type": "code", "execution_count": null, "id": "ef5578c8", "metadata": {}, "outputs": [], "source": [ "endpoint_name = \"diabetes-staging\"\n", "print(\"EndpointName={}\".format(endpoint_name))\n", "\n", "create_endpoint_response = sm_client.create_endpoint(\n", " EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n", ")\n", "\n", "\n", "while True:\n", " endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", " if endpoint[\"EndpointStatus\"] == \"InService\":\n", " break\n", " print(\"Waiting for the endpoint to be completed..\")\n", " time.sleep(60)\n", "\n", "print(\"Endpoint arn : {}\".format(create_endpoint_response[\"EndpointArn\"]))" ] }, { "cell_type": "markdown", "id": "1cd1356a", "metadata": {}, "source": [ "## Run predictions on model" ] }, { "cell_type": "markdown", "id": "087b25ad", "metadata": {}, "source": [ "Wait until SageMaker Projects model-deploy pipeline has deployed the staging inference endpoint. Use the following data for inference:\n", "\n", "Example 1\n", "------------\n", "`5.0,64.0,0.0,18.0,0.0,0.0,7.0,9.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0\n", "`\n", "\n", "In summary, this is a diabetic patient that is Caucasian Female age 60-70, who has spent 5 days in the hospital under emergency care in the current encounter. Prior to this encounter, patient has spent 0 days in outpatient care, 0 days in emergency care, 7 days in inpatient care. 64 laboratory procedures have been performed on the patient. 
Patient is not using metformin, repaglinide, pioglitazone, rosiglitazone, and insulin prescription is steady.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fa19de14", "metadata": {}, "outputs": [], "source": [ "import json\n", "import boto3\n", "\n", "sm_runtime = boto3.client(\"runtime.sagemaker\")\n", "endpoint_name =\"diabetes-staging\"\n", "line = \"5.0,64.0,0.0,18.0,0.0,0.0,7.0,9.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0\"\n", "response = sm_runtime.invoke_endpoint(EndpointName=endpoint_name, ContentType=\"text/csv\", Body=line)\n", "result = json.loads(response[\"Body\"].read().decode())\n", "print(\"Predicted class : {}\".format(round(result)))" ] }, { "cell_type": "markdown", "id": "3b16d0f3", "metadata": {}, "source": [ "Now you try:\n", "\n", "Example 2\n", "------------\n", "\n", "`3.0,19.0,3.0,19.0,0.0,0.0,0.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0\n", "`\n", "\n", "In summary, this is a diabetic patient that is Caucasian Female age 70-80, who has spent 3 days in the hospital under elective care in the current encounter. Prior to this encounter, patient has spent 0 days in outpatient care, 0 days in emergency care, 0 days in inpatient care. 19 laboratory procedures have been performed on the patient. Patient is not using metformin, repaglinide, pioglitazone, rosiglitazone. Patient is not using insulin. " ] }, { "cell_type": "markdown", "id": "8fcb475b", "metadata": {}, "source": [ "## Cleanup" ] }, { "cell_type": "markdown", "id": "9787688e", "metadata": {}, "source": [ "To avoid incurring future charges, clean up created resources such as the S3 bucket, ECR repository, and SageMaker Studio. Prior to deleting the SageMaker Studio, make sure to delete the SageMaker model and endpoint resources. \n", "Finally, delete the Jupyter instance containing the notebook. " ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }