{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Parameterize SageMaker Pipelines\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)\n", "\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Customers can use SageMaker Pipelines to build scalable machine learning pipelines that preprocess data and train machine learning models. With SageMaker Pipelines, customers have a toolkit for every part of the machine learning lifecycle that provides deep customizations and tuning options to fit every organization. Customers have the freedom to customize SageMaker Pipelines to specific use cases, but also to create generic machine learning pipelines that can be reused across different use cases.\n", "\n", "From a bird's-eye view, a machine learning pipeline usually consists of three general steps: a preprocessing step where the data is transformed, a training step where a machine learning model is trained, and an evaluation step that tests the performance of the trained model. If the model performs well on the objective metric you’re optimizing for, it becomes a candidate model for deployment to one or more environments. 
These candidate models should be registered in SageMaker Model Registry, which catalogs and stores key metadata for each model version.\n", "\n", "\n", "![SageMaker Pipelines](images/sm-pipelines.png \"SageMaker Pipelines\")\n", "\n", "These steps have a lot in common, even across different machine learning use cases. Customers that want to create training pipelines that can be reused across an organization can use SageMaker Pipelines to create parameterized, generic training pipelines. Parameters let customers define values that are passed into the pipeline at execution time, without having to change the pipeline code itself. \n", "\n", "**This notebook** demonstrates how SageMaker Pipelines can be used to create a generic binary classification machine learning pipeline using XGBoost that's reusable across teams, machine learning use cases and even customers in a SaaS system. \n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### SageMaker Pipelines\n", "Amazon SageMaker Pipelines is a purpose-built, easy-to-use CI/CD service for machine learning. With SageMaker Pipelines, customers can create machine learning workflows with an easy-to-use Python SDK, and then visualize and manage workflows using Amazon SageMaker Studio.\n", "\n", "\n", "#### SageMaker Pipeline steps and parameters\n", "SageMaker Pipelines is built on the concept of steps. The order in which steps are executed is inferred from the dependencies each step has. If a step depends on the output of a previous step, it is not executed until that step has completed successfully.\n", "\n", "SageMaker Pipeline Parameters are input parameters specified when triggering a pipeline execution. 
They must be defined explicitly when the pipeline is created, and they can have default values.\n", "\n", "To learn more about the types of steps and parameters supported, check out the [SageMaker Pipelines Overview](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-sdk.html).\n", "\n", "#### SageMaker Pipeline DAG\n", "\n", "When a SageMaker Pipeline is created, SageMaker builds a Directed Acyclic Graph (DAG) that customers can visualize in Amazon SageMaker Studio. The DAG can be used to track pipeline executions, outputs and metrics. In this notebook, a SageMaker Pipeline with the following DAG is created:\n", "\n", "![SageMaker Pipeline Directed Acyclic Graph](images/sm-pipeline-dag.png \"SageMaker Pipeline Directed Acyclic Graph\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Predict customer churn and credit risk with XGBoost\n", "\n", "### Data\n", "\n", "This notebook uses two datasets to demonstrate pipeline portability:\n", "1. A synthetic customer churn dataset.\n", "2. The [Statlog German credit data](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)) from UCI's ML Repository.\n", "\n", "\n", "### Overview \n", "**Disclaimer** This notebook was created using [Amazon SageMaker Studio](https://aws.amazon.com/sagemaker/studio/) and the `Python3(DataScience) kernel`. SageMaker Studio is required for the visualizations of the DAG and model metrics to work.\n", "\n", "The purpose of this notebook is to demonstrate how SageMaker Pipelines can be used to create a generic XGBoost training pipeline that is reusable across teams, customers and use cases. The pipeline preprocesses data, then trains, tunes, evaluates and registers new machine learning models with the SageMaker model registry. 
All scripts to preprocess the data and evaluate the trained model have been prepared in advance and are available here: \n", "- [credit_risk/preprocess.py](credit_risk/preprocess.py) \n", "- [customer_churn/preprocess.py](customer_churn/preprocess.py) \n", "- [evaluate.py](evaluate.py)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U sagemaker==2.72.1 --quiet # Ensure correct version of SageMaker is installed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import sagemaker\n", "import sagemaker.session" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session = sagemaker.session.Session()\n", "region = session.boto_region_name\n", "role = sagemaker.get_execution_role()\n", "bucket = session.default_bucket()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prefix = \"parameterized\" # Prefix to S3 artifacts\n", "pipeline_name = \"DEMO-parameterized-pipeline\" # SageMaker Pipeline name\n", "credit_model_group = \"DEMO-credit-registry\"\n", "churn_model_group = \"DEMO-churn-registry\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Download data\n", "Start by downloading both datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.client(\"s3\")\n", "s3.download_file(\n", " f\"sagemaker-example-files-prod-{region}\",\n", " \"datasets/tabular/uci_statlog_german_credit_data/german_credit_data.csv\",\n", " \"credit_risk/german_credit_data.csv\",\n", ")\n", "s3.download_file(\n", " f\"sagemaker-example-files-prod-{region}\",\n", " \"datasets/tabular/synthetic/churn.csv\",\n", " \"customer_churn/churn-dataset.csv\",\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Upload data\n", "Upload all data sets and scripts to 
S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Upload the raw datasets and scripts to S3\n", "customer_churn_data_uri = session.upload_data(\n", " path=\"customer_churn/churn-dataset.csv\", key_prefix=prefix + \"/data\"\n", ")\n", "\n", "credit_data_uri = session.upload_data(\n", " path=\"credit_risk/german_credit_data.csv\", key_prefix=prefix + \"/data\"\n", ")\n", "\n", "churn_preprocess_uri = session.upload_data(\n", " path=\"customer_churn/preprocess.py\", key_prefix=prefix + \"/preprocess/churn\"\n", ")\n", "\n", "credit_preprocess_uri = session.upload_data(\n", " path=\"credit_risk/preprocess.py\", key_prefix=prefix + \"/preprocess/credit\"\n", ")\n", "\n", "evaluate_script_uri = session.upload_data(path=\"evaluate.py\", key_prefix=prefix + \"/evaluate\")\n", "\n", "\n", "print(\"Customer churn data set uploaded to \", customer_churn_data_uri)\n", "print(\"Credit data set uploaded to \", credit_data_uri)\n", "\n", "print(\"Customer churn preprocessing script uploaded to \", churn_preprocess_uri)\n", "print(\"Credit preprocessing script uploaded to \", credit_preprocess_uri)\n", "\n", "print(\"Evaluation script uploaded to \", evaluate_script_uri)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Pipeline input parameters\n", "\n", "Pipeline parameters are inputs specified when triggering a pipeline execution. They must be defined explicitly when the pipeline is created, and they can have default values.\n", "\n", "Create parameters for the inputs to the pipeline. 
In this case, parameters will be used for:\n", "- `ModelGroup` - Which registry to register the trained model with.\n", "- `InputData` - S3 URI to pipeline input data.\n", "- `PreprocessScript` - S3 URI to Python script to preprocess the data.\n", "- `EvaluateScript` - S3 URI to Python script to evaluate the trained model.\n", "- `MaxiumTrainingJobs` - How many training jobs to allow when hyperparameter tuning the model.\n", "- `MaxiumParallelTrainingJobs` - How many training jobs to allow in parallel when hyperparameter tuning the model.\n", "- `AccuracyConditionThreshold` - Only register models with the model registry if they have at least this classification accuracy.\n", "- `ProcessingInstanceType` - What EC2 instance type to use for processing.\n", "- `TrainingInstanceType` - What EC2 instance type to use for training." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.workflow.parameters import (\n", " ParameterInteger,\n", " ParameterString,\n", " ParameterFloat,\n", ")\n", "\n", "# Model package group (registry) in which to register the model and its versions.\n", "model_registry_package = ParameterString(name=\"ModelGroup\", default_value=\"default-registry\")\n", "\n", "# S3 URI to input data\n", "input_data = ParameterString(name=\"InputData\", default_value=\"s3://{}/uri/data.csv\".format(bucket))\n", "\n", "# S3 URI to preprocessing script\n", "preprocess_script = ParameterString(\n", " name=\"PreprocessScript\", default_value=\"s3://{}/uri/preprocess.py\".format(bucket)\n", ")\n", "\n", "# S3 URI to evaluation script\n", "evaluate_script = ParameterString(\n", " name=\"EvaluateScript\", default_value=\"s3://{}/uri/evaluate.py\".format(bucket)\n", ")\n", "\n", "# Maximum number of training jobs to run during hyperparameter tuning\n", "max_training_jobs = ParameterInteger(name=\"MaxiumTrainingJobs\", default_value=1)\n", "\n", "# Maximum number of training jobs to run in parallel during hyperparameter tuning\n", "max_parallel_training_jobs = 
ParameterInteger(name=\"MaxiumParallelTrainingJobs\", default_value=1)\n", "\n", "# Accuracy threshold to decide whether or not to register the model with Model Registry\n", "accuracy_condition_threshold = ParameterFloat(name=\"AccuracyConditionThreshold\", default_value=0.7)\n", "\n", "# What instance type to use for processing.\n", "processing_instance_type = ParameterString(\n", " name=\"ProcessingInstanceType\", default_value=\"ml.m5.large\"\n", ")\n", "\n", "# What instance type to use for training.\n", "training_instance_type = ParameterString(name=\"TrainingInstanceType\", default_value=\"ml.m5.xlarge\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Preprocess data step\n", "In the first step, an `SKLearnProcessor` is created and then used in a `ProcessingStep`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.sklearn.processing import SKLearnProcessor\n", "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", "from sagemaker.workflow.steps import ProcessingStep\n", "from sagemaker.workflow.functions import Join\n", "from sagemaker.workflow.execution_variables import ExecutionVariables\n", "\n", "# Create the SKLearn processor object.\n", "# The object contains information about what instance type to use, the IAM role to use etc.\n", "# A managed processor comes with a preconfigured container, so only the framework version needs to be specified.\n", "sklearn_processor = SKLearnProcessor(\n", " framework_version=\"0.23-1\", role=role, instance_type=processing_instance_type, instance_count=1\n", ")\n", "\n", "# Use the sklearn_processor in a SageMaker Pipelines ProcessingStep\n", "step_preprocess_data = ProcessingStep(\n", " name=\"Preprocess-Data\",\n", " processor=sklearn_processor,\n", " inputs=[\n", " ProcessingInput(source=input_data, destination=\"/opt/ml/processing/input\"),\n", " ],\n", " outputs=[\n", " ProcessingOutput(\n", " 
output_name=\"train\",\n", " source=\"/opt/ml/processing/train\",\n", " destination=Join(\n", " on=\"/\",\n", " values=[\n", " \"s3://{}\".format(bucket),\n", " prefix,\n", " ExecutionVariables.PIPELINE_EXECUTION_ID,\n", " \"train\",\n", " ],\n", " ),\n", " ),\n", " ProcessingOutput(\n", " output_name=\"validation\",\n", " source=\"/opt/ml/processing/validation\",\n", " destination=Join(\n", " on=\"/\",\n", " values=[\n", " \"s3://{}\".format(bucket),\n", " prefix,\n", " ExecutionVariables.PIPELINE_EXECUTION_ID,\n", " \"validation\",\n", " ],\n", " ),\n", " ),\n", " ProcessingOutput(\n", " output_name=\"test\",\n", " source=\"/opt/ml/processing/test\",\n", " destination=Join(\n", " on=\"/\",\n", " values=[\n", " \"s3://{}\".format(bucket),\n", " prefix,\n", " ExecutionVariables.PIPELINE_EXECUTION_ID,\n", " \"test\",\n", " ],\n", " ),\n", " ),\n", " ],\n", " code=preprocess_script,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Train model step\n", "In the second step, the train and validation output from the previous processing step are used to train a model. The XGBoost container is retrieved and then an XGBoost estimator is created, on which hyperparameters are specified before the training step is created. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.inputs import TrainingInput\n", "from sagemaker.estimator import Estimator\n", "from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter\n", "from sagemaker.workflow.steps import TuningStep\n", "\n", "# Fetch container to use for training\n", "image_uri = sagemaker.image_uris.retrieve(\n", " framework=\"xgboost\",\n", " region=region,\n", " version=\"1.2-2\",\n", " py_version=\"py3\",\n", " instance_type=\"ml.m5.xlarge\",\n", ")\n", "\n", "# Create XGBoost estimator object\n", "# The object contains information about what container to use, what instance type etc.\n", "xgb_estimator = Estimator(\n", " image_uri=image_uri,\n", " instance_type=training_instance_type,\n", " instance_count=1,\n", " role=role,\n", " disable_profiler=True,\n", ")\n", "\n", "# Create Hyperparameter tuner object. Ranges from https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-tuning.html\n", "xgb_tuner = HyperparameterTuner(\n", " estimator=xgb_estimator,\n", " objective_metric_name=\"validation:auc\",\n", " hyperparameter_ranges={\n", " \"eta\": ContinuousParameter(0, 0.5),\n", " \"alpha\": ContinuousParameter(0, 1000),\n", " \"min_child_weight\": ContinuousParameter(1, 120),\n", " \"max_depth\": IntegerParameter(1, 10),\n", " \"num_round\": IntegerParameter(1, 2000),\n", " \"subsample\": ContinuousParameter(0.5, 1),\n", " },\n", " max_jobs=max_training_jobs,\n", " max_parallel_jobs=max_parallel_training_jobs,\n", ")\n", "\n", "# use the tuner in a SageMaker pipielines tuning step.\n", "step_tuning = TuningStep(\n", " name=\"Train-And-Tune-Model\",\n", " tuner=xgb_tuner,\n", " inputs={\n", " \"train\": TrainingInput(\n", " s3_data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs[\n", " \"train\"\n", " ].S3Output.S3Uri,\n", " content_type=\"text/csv\",\n", " ),\n", " \"validation\": TrainingInput(\n", " 
s3_data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs[\n", " \"validation\"\n", " ].S3Output.S3Uri,\n", " content_type=\"text/csv\",\n", " ),\n", " },\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Evaluate model step\n", "When a model is trained, it's common to evaluate the model on unseen data before registering it with the model registry. This ensures the model registry isn't cluttered with poorly performing model versions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.processing import ScriptProcessor\n", "from sagemaker.workflow.properties import PropertyFile\n", "\n", "# Create ScriptProcessor object.\n", "# The object contains information about what container to use, what instance type etc.\n", "evaluate_model_processor = ScriptProcessor(\n", " image_uri=image_uri,\n", " command=[\"python3\"],\n", " instance_type=processing_instance_type,\n", " instance_count=1,\n", " role=role,\n", ")\n", "\n", "# Create a PropertyFile\n", "# A PropertyFile is used to be able to reference outputs from a processing step, for instance to use in a condition step.\n", "# For more information, visit https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-propertyfile.html\n", "evaluation_report = PropertyFile(\n", " name=\"EvaluationReport\", output_name=\"evaluation\", path=\"evaluation.json\"\n", ")\n", "\n", "# Use the evaluate_model_processor in a SageMaker Pipelines ProcessingStep.\n", "# Extract the best model for evaluation.\n", "step_evaluate_model = ProcessingStep(\n", " name=\"Evaluate-Model\",\n", " processor=evaluate_model_processor,\n", " inputs=[\n", " ProcessingInput(\n", " source=step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=bucket),\n", " destination=\"/opt/ml/processing/model\",\n", " ),\n", " ProcessingInput(\n", " source=step_preprocess_data.properties.ProcessingOutputConfig.Outputs[\n", " \"test\"\n", " 
].S3Output.S3Uri,\n", " destination=\"/opt/ml/processing/test\",\n", " ),\n", " ],\n", " outputs=[\n", " ProcessingOutput(\n", " output_name=\"evaluation\",\n", " source=\"/opt/ml/processing/evaluation\",\n", " destination=Join(\n", " on=\"/\",\n", " values=[\n", " \"s3://{}\".format(bucket),\n", " prefix,\n", " ExecutionVariables.PIPELINE_EXECUTION_ID,\n", " \"evaluation-report\",\n", " ],\n", " ),\n", " ),\n", " ],\n", " code=evaluate_script,\n", " property_files=[evaluation_report],\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Register model step\n", "If the trained model meets the model performance requirements, a new model version is registered with the model registry for further analysis. To attach model metrics to the model version, create a [ModelMetrics](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-model-quality-metrics.html) object using the evaluation report created in the evaluation step. Then, create the RegisterModel step.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.model_metrics import MetricsSource, ModelMetrics\n", "from sagemaker.workflow.step_collections import RegisterModel\n", "\n", "# Create ModelMetrics object using the evaluation report from the evaluation step\n", "# A ModelMetrics object contains metrics captured from a model.\n", "model_metrics = ModelMetrics(\n", " model_statistics=MetricsSource(\n", " s3_uri=Join(\n", " on=\"/\",\n", " values=[\n", " step_evaluate_model.arguments[\"ProcessingOutputConfig\"][\"Outputs\"][0][\"S3Output\"][\n", " \"S3Uri\"\n", " ],\n", " \"evaluation.json\",\n", " ],\n", " ),\n", " content_type=\"application/json\",\n", " )\n", ")\n", "\n", "# Create a RegisterModel step, which registers the model with SageMaker Model Registry.\n", "step_register_model = RegisterModel(\n", " name=\"Register-Model\",\n", " estimator=xgb_estimator,\n", " 
model_data=step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=bucket),\n", " content_types=[\"text/csv\"],\n", " response_types=[\"text/csv\"],\n", " inference_instances=[\"ml.t2.medium\", \"ml.m5.xlarge\", \"ml.m5.large\"],\n", " transform_instances=[\"ml.m5.xlarge\"],\n", " model_package_group_name=model_registry_package,\n", " model_metrics=model_metrics,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Accuracy condition step\n", "Adding conditions to the pipeline is done with a ConditionStep.\n", "In this case, we only want to register the new model version with the model registry if the new model meets an accuracy condition." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo\n", "from sagemaker.workflow.condition_step import ConditionStep\n", "from sagemaker.workflow.functions import JsonGet\n", "\n", "# Create accuracy condition to ensure the model meets performance requirements.\n", "# Models with a test accuracy lower than the condition will not be registered with the model registry.\n", "cond_gte = ConditionGreaterThanOrEqualTo(\n", " left=JsonGet(\n", " step_name=step_evaluate_model.name,\n", " property_file=evaluation_report,\n", " json_path=\"binary_classification_metrics.accuracy.value\",\n", " ),\n", " right=accuracy_condition_threshold,\n", ")\n", "\n", "# Create a SageMaker Pipelines ConditionStep, using the condition above.\n", "# Enter the steps to perform if the condition returns True / False.\n", "step_cond = ConditionStep(\n", " name=\"Accuracy-Condition\",\n", " conditions=[cond_gte],\n", " if_steps=[step_register_model],\n", " else_steps=[],\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "## Pipeline Creation: Orchestrate all steps\n", "\n", "Now that all pipeline steps are created, a pipeline is created." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.workflow.pipeline import Pipeline\n", "\n", "# Create a SageMaker Pipeline.\n", "# Each parameter for the pipeline must be set as a parameter explicitly when the pipeline is created.\n", "# Also pass in each of the steps created above.\n", "# Note that the order of execution is determined from each step's dependencies on other steps,\n", "# not on the order they are passed in below.\n", "pipeline = Pipeline(\n", " name=pipeline_name,\n", " parameters=[\n", " processing_instance_type,\n", " training_instance_type,\n", " input_data,\n", " preprocess_script,\n", " evaluate_script,\n", " accuracy_condition_threshold,\n", " model_registry_package,\n", " max_parallel_training_jobs,\n", " max_training_jobs,\n", " ],\n", " steps=[step_preprocess_data, step_tuning, step_evaluate_model, step_cond],\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Submit pipeline\n", "pipeline.upsert(role_arn=role)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Start pipeline with different parameters.\n", "Now that the pipeline is created, it can be started with custom parameters making the pipeline agnostic to who is triggering it, but also to the scripts and data used. The pipeline can be started using the CLI, the SageMaker Studio UI or the SDK and below there is a screenshot of what it looks like in the SageMaker Studio UI.\n", "\n", "![Trigger pipeline from Studio](images/manual-execution.png \"SageMaker Studio UI execution\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Starting the pipeline with the SDK\n", "In the examples below, the pipeline is triggered for two machine learning problems, each with different preprocessing scripts and model registry. Each machine learning problem is run with two different sets of parameters." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Start pipeline with credit data and preprocessing script\n", "pipeline.start(\n", " execution_display_name=\"Credit\",\n", " parameters=dict(\n", " InputData=credit_data_uri,\n", " PreprocessScript=credit_preprocess_uri,\n", " EvaluateScript=evaluate_script_uri,\n", " AccuracyConditionThreshold=0.2,\n", " MaxiumParallelTrainingJobs=2,\n", " MaxiumTrainingJobs=5,\n", " ModelGroup=credit_model_group,\n", " ),\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Start pipeline with credit data and preprocessing script\n", "pipeline.start(\n", " execution_display_name=\"Credit\",\n", " parameters=dict(\n", " InputData=credit_data_uri,\n", " PreprocessScript=credit_preprocess_uri,\n", " EvaluateScript=evaluate_script_uri,\n", " AccuracyConditionThreshold=0.7,\n", " MaxiumParallelTrainingJobs=3,\n", " MaxiumTrainingJobs=42,\n", " ModelGroup=credit_model_group,\n", " ),\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Start pipeline with customer churn data and preprocessing script\n", "pipeline.start(\n", " execution_display_name=\"Churn\",\n", " parameters=dict(\n", " InputData=customer_churn_data_uri,\n", " PreprocessScript=churn_preprocess_uri,\n", " EvaluateScript=evaluate_script_uri,\n", " AccuracyConditionThreshold=0.4,\n", " MaxiumParallelTrainingJobs=1,\n", " MaxiumTrainingJobs=2,\n", " ModelGroup=churn_model_group,\n", " ),\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Start pipeline with customer churn data and preprocessing script\n", "pipeline.start(\n", " execution_display_name=\"Churn\",\n", " parameters=dict(\n", " InputData=customer_churn_data_uri,\n", " PreprocessScript=churn_preprocess_uri,\n", " EvaluateScript=evaluate_script_uri,\n", " AccuracyConditionThreshold=0.8,\n", " 
MaxiumParallelTrainingJobs=4,\n", " MaxiumTrainingJobs=40,\n", " ModelGroup=churn_model_group,\n", " ),\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize model performance metrics\n", "Once the pipeline executions have completed successfully, the metrics attached to each model version can be visualized. In SageMaker Studio, choose `SageMaker Components and registries` in the left pane and, under `Model registry`, select one of the model package groups that were created. Select both versions and right-click. Choose `Compare model versions`. \n", "\n", "The screenshot below shows what comparing the customer churn model versions looks like. Note that the standard deviation shows as NaN since it is not relevant to this model's calculated metrics.\n", "\n", "![Compare churn model versions](images/compare-versions-churn.png \"Compare versions\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The screenshot below shows what comparing the credit risk model versions looks like.\n", "\n", "\n", "![Compare credit risk versions](images/compare-versions-credit-risk.png \"Compare versions\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Clean up (optional)\n", "Delete the model registries and the pipeline to keep the Studio environment tidy."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def delete_model_package_group(sm_client, package_group_name):\n", " try:\n", " model_versions = sm_client.list_model_packages(ModelPackageGroupName=package_group_name)\n", "\n", " except Exception as e:\n", " print(\"{} \\n\".format(e))\n", " return\n", "\n", " for model_version in model_versions[\"ModelPackageSummaryList\"]:\n", " try:\n", " sm_client.delete_model_package(ModelPackageName=model_version[\"ModelPackageArn\"])\n", " except Exception as e:\n", " print(\"{} \\n\".format(e))\n", " time.sleep(0.5) # Ensure requests aren't throttled\n", "\n", " try:\n", " sm_client.delete_model_package_group(ModelPackageGroupName=package_group_name)\n", " print(\"{} model package group deleted\".format(package_group_name))\n", " except Exception as e:\n", " print(\"{} \\n\".format(e))\n", " return\n", "\n", "\n", "def delete_sagemaker_pipeline(sm_client, pipeline_name):\n", " try:\n", " sm_client.delete_pipeline(\n", " PipelineName=pipeline_name,\n", " )\n", " print(\"{} pipeline deleted\".format(pipeline_name))\n", " except Exception as e:\n", " print(\"{} \\n\".format(e))\n", " return" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import time\n", "\n", "client = boto3.client(\"sagemaker\")\n", "\n", "registries = [credit_model_group, churn_model_group]\n", "\n", "for registry in registries:\n", " delete_model_package_group(client, registry)\n", "\n", "delete_sagemaker_pipeline(client, pipeline_name)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-pipeline-parameterization|parameterized-pipeline.ipynb)\n" ] } ], "metadata": { "instance_type": "ml.t3.medium", "interpreter": { "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" }, "kernelspec": { "display_name": "Python 3 (Data Science 3.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }