{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "isConfigCell": true,
    "tags": [
     "parameters"
    ]
   },
   "source": [
    "# Launch Amazon SageMaker Autopilot experiments directly from within Amazon SageMaker Pipelines to easily automate MLOps workflows"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
    "\n",
    "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-pipelines|tabular|automl-step|sagemaker_autopilot_pipelines_native_auto_ml_step.ipynb)\n",
    "\n",
    "---"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Amazon SageMaker Autopilot](https://aws.amazon.com/sagemaker/autopilot), a low-code machine learning (ML) service that automatically builds, trains, and tunes the best ML models based on tabular data, is now integrated with [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines/), the first purpose-built continuous integration and continuous delivery (CI/CD) service for ML. This enables the automation of an end-to-end flow of building ML models using Autopilot and integrating models into subsequent CI/CD steps.\n",
    "\n",
    "So far, to launch an Autopilot experiment within Pipelines, you have to build a model-building workflow by writing custom integration code with Pipelines [Lambda](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-lambda) or [Processing](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-processing) steps. For more information, see [Move Amazon SageMaker Autopilot ML models from experimentation to production using Amazon SageMaker Pipelines](https://aws.amazon.com/blogs/machine-learning/move-amazon-sagemaker-autopilot-ml-models-from-experimentation-to-production-using-amazon-sagemaker-pipelines/).\n",
    "\n",
    "With the support for Autopilot as a native step within Pipelines, you can now add an automated training step ([AutoMLStep](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-automl)) in Pipelines and invoke an Autopilot experiment with [Ensembling training mode](https://docs.amazonaws.cn/en_us/sagemaker/latest/dg/autopilot-model-support-validation.html). For example, if you’re building a training and evaluation ML workflow for a fraud detection use case with Pipelines, you can now launch an Autopilot experiment using the AutoML step, which automatically runs multiple trials to find the best model on a given input dataset. After the best model is created using the [Model step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-model), its performance can be evaluated on test data using the [Transform step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-transform) and a [Processing step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-processing) for a custom evaluation script within Pipelines. Eventually, the model can be registered into the SageMaker model registry using the [Model step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-model) in combination with a [Condition step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-condition). \n",
    "\n",
    "In this notebook, we show how to create an end-to-end ML workflow to train and evaluate a SageMaker generated ML model using the newly launched AutoML step in Pipelines and register it with the SageMaker model registry. The ML model with the best performance can be deployed to a SageMaker endpoint.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Architecture Diagram](architecture_diagram.png)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use Pipelines to orchestrate different [pipeline steps](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html) required to train an Autopilot model. We create and run an [Autopilot experiment](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development-create-experiment.html) as part of an AutoML step as described in this tutorial.\n",
    "\n",
    "The following steps are required for this end-to-end Autopilot training process:\n",
    "\n",
    "* Create and monitor an Autopilot training job using the `AutoMLStep`.\n",
    "* Create a SageMaker model using `ModelStep`. This step fetches the best model’s metadata and artifacts rendered by Autopilot in the previous step.\n",
    "* Evaluate the trained Autopilot model on a test dataset using `TransformStep`. \n",
    "* Compare the output from the previously run `TransformStep` with the actual target labels using `ProcessingStep`.\n",
    "* Register the ML model to the [SageMaker model registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html) using `ModelStep`, if the previously obtained evaluation metric exceeds a predefined threshold in `ConditionStep`.\n",
    "* Deploy the ML model as a SageMaker endpoint for testing purposes."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this example notebook, having the [AmazonSageMakerFullAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol.html#security-iam-awsmanpol-AmazonSageMakerFullAccess) managed IAM policy attached to the [execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) is sufficient to successfully run all cells. However, it is highly recommended to further scope down IAM permissions for improved security."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some of the features used in this notebook might not be available in older versions of the boto3, botocore and sagemaker python packages. Thus, we are upgrading them, if necessary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -U pip\n",
    "!pip install -U sagemaker\n",
    "!pip install \"boto3>=1.24.*\"\n",
    "!pip install \"botocore>=1.27.*\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import json\n",
    "import pandas as pd\n",
    "import time\n",
    "from sagemaker import (\n",
    "    AutoML,\n",
    "    AutoMLInput,\n",
    "    get_execution_role,\n",
    "    MetricsSource,\n",
    "    ModelMetrics,\n",
    "    ModelPackage,\n",
    ")\n",
    "from sagemaker.predictor import Predictor\n",
    "from sagemaker.processing import ProcessingOutput, ProcessingInput\n",
    "from sagemaker.s3 import s3_path_join, S3Downloader, S3Uploader\n",
    "from sagemaker.serializers import CSVSerializer\n",
    "from sagemaker.sklearn.processing import SKLearnProcessor\n",
    "from sagemaker.transformer import Transformer\n",
    "from sagemaker.workflow.automl_step import AutoMLStep\n",
    "from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo\n",
    "from sagemaker.workflow.condition_step import ConditionStep\n",
    "from sagemaker.workflow.functions import Join, JsonGet\n",
    "from sagemaker.workflow.model_step import ModelStep\n",
    "from sagemaker.workflow.parameters import ParameterFloat, ParameterInteger, ParameterString\n",
    "from sagemaker.workflow.pipeline import Pipeline\n",
    "from sagemaker.workflow.pipeline_context import PipelineSession\n",
    "from sagemaker.workflow.properties import PropertyFile\n",
    "from sagemaker.workflow.steps import ProcessingStep, TransformStep, TransformInput\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initializations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "execution_role = get_execution_role()\n",
    "pipeline_session = PipelineSession()\n",
    "sagemaker_client = boto3.client(\"sagemaker\")\n",
    "output_prefix = \"auto-ml-training\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SageMaker Pipelines Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "instance_count = ParameterInteger(name=\"InstanceCount\", default_value=1)\n",
    "instance_type = ParameterString(name=\"InstanceType\", default_value=\"ml.m5.xlarge\")\n",
    "max_automl_runtime = ParameterInteger(\n",
    "    name=\"MaxAutoMLRuntime\", default_value=3600\n",
    ")  # max. AutoML training runtime: 1 hour\n",
    "model_approval_status = ParameterString(name=\"ModelApprovalStatus\", default_value=\"Approved\")\n",
    "model_package_group_name = ParameterString(\n",
    "    name=\"ModelPackageName\", default_value=\"AutoMLModelPackageGroup\"\n",
    ")\n",
    "model_registration_metric_threshold = ParameterFloat(\n",
    "    name=\"ModelRegistrationMetricThreshold\", default_value=0.5\n",
    ")\n",
    "s3_bucket = ParameterString(name=\"S3Bucket\", default_value=pipeline_session.default_bucket())\n",
    "target_attribute_name = ParameterString(name=\"TargetAttributeName\", default_value=\"class\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Preprocessing"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the publicly available [UCI Adult 1994 Census Income dataset](https://archive.ics.uci.edu/ml/datasets/adult) to predict if a person has an annual income of greater than $50,000 per year. This is a binary classification problem; the options for the income target variable are either <=50K or >50K. The dataset contains demographic information about individuals and `class` as the target column indicating the income class.\n",
    "\n",
    "This data preprocessing is performed in this notebook, i.e. outside SageMaker Pipelines. Alternatively, you may directly use your own custom dataset on S3 and skip directly to the next section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_names = [\n",
    "    \"age\",\n",
    "    \"workclass\",\n",
    "    \"fnlwgt\",\n",
    "    \"education\",\n",
    "    \"education-num\",\n",
    "    \"marital-status\",\n",
    "    \"occupation\",\n",
    "    \"relationship\",\n",
    "    \"race\",\n",
    "    \"sex\",\n",
    "    \"capital-gain\",\n",
    "    \"capital-loss\",\n",
    "    \"hours-per-week\",\n",
    "    \"native-country\",\n",
    "]\n",
    "column_names = feature_names + [target_attribute_name.default_value]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Preparing the training dataset (Autopilot will automatically hold out a portion for validation):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_file_name = \"adult.data\"\n",
    "S3Downloader.download(\n",
    "    f\"s3://sagemaker-example-files-prod-{boto3.session.Session().region_name}/datasets/tabular/uci_adult/{dataset_file_name}\",\n",
    "    \".\",\n",
    "    sagemaker_session=pipeline_session,\n",
    ")\n",
    "df = pd.read_csv(dataset_file_name, header=None, names=column_names)\n",
    "df.to_csv(\"train_val.csv\", index=False)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Preparing the test dataset with separate feature CSV files for features `x_test` and the target `y_test`. Feature columns are used for ML inference and resulting predictions are than compared with the target column values for final evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_file_name = \"adult.test\"\n",
    "S3Downloader.download(\n",
    "    f\"s3://sagemaker-example-files-prod-{boto3.session.Session().region_name}/datasets/tabular/uci_adult/{dataset_file_name}\",\n",
    "    \".\",\n",
    "    sagemaker_session=pipeline_session,\n",
    ")\n",
    "df = pd.read_csv(dataset_file_name, header=None, names=column_names, skiprows=1)\n",
    "df[target_attribute_name.default_value] = df[target_attribute_name.default_value].map(\n",
    "    {\" <=50K.\": \" <=50K\", \" >50K.\": \" >50K\"}\n",
    ")\n",
    "df.to_csv(\n",
    "    \"x_test.csv\",\n",
    "    header=False,\n",
    "    index=False,\n",
    "    columns=[\n",
    "        x for x in column_names if x != target_attribute_name.default_value\n",
    "    ],  # all columns except target\n",
    ")\n",
    "df.to_csv(\"y_test.csv\", header=False, index=False, columns=[target_attribute_name.default_value])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Uploading the datasets to S3:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3_prefix = s3_path_join(\"s3://\", s3_bucket.default_value, \"data\")\n",
    "S3Uploader.upload(\"train_val.csv\", s3_prefix, sagemaker_session=pipeline_session)\n",
    "S3Uploader.upload(\"x_test.csv\", s3_prefix, sagemaker_session=pipeline_session)\n",
    "S3Uploader.upload(\"y_test.csv\", s3_prefix, sagemaker_session=pipeline_session)\n",
    "s3_train_val = s3_path_join(s3_prefix, \"train_val.csv\")\n",
    "s3_x_test = s3_path_join(s3_prefix, \"x_test.csv\")\n",
    "s3_y_test = s3_path_join(s3_prefix, \"y_test.csv\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## AutoML Training Step"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An [AutoML object](https://sagemaker.readthedocs.io/en/stable/api/training/automl.html#sagemaker.automl.automl.AutoML) is used to define the Autopilot training job run and can be added to the SageMaker pipeline by using the AutoMLStep class, as shown in the following code. The [Ensembling training mode](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-model-support-validation.html#autopilot-algorithm-suppprt) needs to be specified, but other parameters can be adjusted as needed. For example, instead of letting the AutoML job automatically infer the ML [problem type](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-datasets-problem-types.html) and [objective metric](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AutoMLJobObjective.html), these could be hardcoded by specifying the `problem_type` and `job_objective` parameters passed to the AutoML object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "automl = AutoML(\n",
    "    role=execution_role,\n",
    "    target_attribute_name=target_attribute_name,\n",
    "    sagemaker_session=pipeline_session,\n",
    "    total_job_runtime_in_seconds=max_automl_runtime,\n",
    "    mode=\"ENSEMBLING\",  # only ensembling mode is supported for native AutoML step integration in SageMaker Pipelines\n",
    ")\n",
    "train_args = automl.fit(\n",
    "    inputs=[\n",
    "        AutoMLInput(\n",
    "            inputs=s3_train_val,\n",
    "            target_attribute_name=target_attribute_name,\n",
    "            channel_type=\"training\",\n",
    "        )\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Defining a SageMaker Pipelines AutoML step:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "step_auto_ml_training = AutoMLStep(\n",
    "    name=\"AutoMLTrainingStep\",\n",
    "    step_args=train_args,\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Creation Step"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The AutoML step takes care of generating various ML model candidates, combining them, and obtaining the best ML model. Model artifacts and metadata are automatically stored and can be obtained by calling the `get_best_auto_ml_model()` method on the AutoML training step. These can then be used to create a SageMaker model as part of the Model step:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "best_auto_ml_model = step_auto_ml_training.get_best_auto_ml_model(\n",
    "    execution_role, sagemaker_session=pipeline_session\n",
    ")\n",
    "step_args_create_model = best_auto_ml_model.create(instance_type=instance_type)\n",
    "step_create_model = ModelStep(name=\"ModelCreationStep\", step_args=step_args_create_model)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Batch Transform Step"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the [Transformer object](https://sagemaker.readthedocs.io/en/stable/api/inference/transformer.html#sagemaker.transformer.Transformer) for [batch inference](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html) on the test dataset, which can then be used for evaluation purposes in the next pipeline step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transformer = Transformer(\n",
    "    model_name=step_create_model.properties.ModelName,\n",
    "    instance_count=instance_count,\n",
    "    instance_type=instance_type,\n",
    "    output_path=Join(on=\"/\", values=[\"s3:/\", s3_bucket, output_prefix, \"transform\"]),\n",
    "    sagemaker_session=pipeline_session,\n",
    ")\n",
    "step_batch_transform = TransformStep(\n",
    "    name=\"BatchTransformStep\",\n",
    "    step_args=transformer.transform(data=s3_x_test, content_type=\"text/csv\"),\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation Step"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Defining the evaluation script used to compare the batch transform output `x_test.csv.out` to the actual (ground truth) target label `y_test.csv` using a Scikit-learn metrics function. We evaluate our results based on the [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). The performance metrics are saved to a JSON file, which is referenced when registering the model in the subsequent step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile evaluation.py\n",
    "import json\n",
    "import os\n",
    "import pathlib\n",
    "import pandas as pd\n",
    "from sklearn.metrics import f1_score\n",
    "\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    y_pred_path = \"/opt/ml/processing/input/predictions/x_test.csv.out\"\n",
    "    y_pred = pd.read_csv(y_pred_path, header=None)\n",
    "    y_true_path = \"/opt/ml/processing/input/true_labels/y_test.csv\"\n",
    "    y_true = pd.read_csv(y_true_path, header=None)\n",
    "    report_dict = {\n",
    "        \"classification_metrics\": {\n",
    "            \"weighted_f1\": {\n",
    "                \"value\": f1_score(y_true, y_pred, average=\"weighted\"),\n",
    "                \"standard_deviation\": \"NaN\",\n",
    "            },\n",
    "        },\n",
    "    }\n",
    "    output_dir = \"/opt/ml/processing/evaluation\"\n",
    "    pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)\n",
    "    evaluation_path = os.path.join(output_dir, \"evaluation_metrics.json\")\n",
    "    with open(evaluation_path, \"w\") as f:\n",
    "        f.write(json.dumps(report_dict))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ML model performance is captured in the form of an evaluation report in JSON format that is uploaded to S3 by the Evaluation Step and made available to other pipeline steps in the form of a property file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "evaluation_report = PropertyFile(\n",
    "    name=\"evaluation\", output_name=\"evaluation_metrics\", path=\"evaluation_metrics.json\"\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The evaluation script runs within a [SKLearnProcessor](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#sagemaker.sklearn.processing.SKLearnProcessor) ([SageMaker Processing](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html)) task:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sklearn_processor = SKLearnProcessor(\n",
    "    role=execution_role,\n",
    "    framework_version=\"1.0-1\",\n",
    "    instance_count=instance_count,\n",
    "    instance_type=instance_type.default_value,\n",
    "    sagemaker_session=pipeline_session,\n",
    ")\n",
    "step_args_sklearn_processor = sklearn_processor.run(\n",
    "    inputs=[\n",
    "        ProcessingInput(\n",
    "            source=step_batch_transform.properties.TransformOutput.S3OutputPath,\n",
    "            destination=\"/opt/ml/processing/input/predictions\",\n",
    "        ),\n",
    "        ProcessingInput(source=s3_y_test, destination=\"/opt/ml/processing/input/true_labels\"),\n",
    "    ],\n",
    "    outputs=[\n",
    "        ProcessingOutput(\n",
    "            output_name=\"evaluation_metrics\",\n",
    "            source=\"/opt/ml/processing/evaluation\",\n",
    "            destination=Join(on=\"/\", values=[\"s3:/\", s3_bucket, output_prefix, \"evaluation\"]),\n",
    "        ),\n",
    "    ],\n",
    "    code=\"evaluation.py\",\n",
    ")\n",
    "step_evaluation = ProcessingStep(\n",
    "    name=\"ModelEvaluationStep\",\n",
    "    step_args=step_args_sklearn_processor,\n",
    "    property_files=[evaluation_report],\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conditional Registration Step"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the previously obtained evaluation metric is greater than or equal to a pre-defined model registration metric threshold, the ML model is being registered with the SageMaker model registry:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_metrics = ModelMetrics(\n",
    "    model_statistics=MetricsSource(\n",
    "        s3_uri=step_auto_ml_training.properties.BestCandidateProperties.ModelInsightsJsonReportPath,\n",
    "        content_type=\"application/json\",\n",
    "    ),\n",
    "    explainability=MetricsSource(\n",
    "        s3_uri=step_auto_ml_training.properties.BestCandidateProperties.ExplainabilityJsonReportPath,\n",
    "        content_type=\"application/json\",\n",
    "    ),\n",
    ")\n",
    "step_args_register_model = best_auto_ml_model.register(\n",
    "    content_types=[\"text/csv\"],\n",
    "    response_types=[\"text/csv\"],\n",
    "    inference_instances=[instance_type],\n",
    "    transform_instances=[instance_type],\n",
    "    model_package_group_name=model_package_group_name,\n",
    "    approval_status=model_approval_status,\n",
    "    model_metrics=model_metrics,\n",
    ")\n",
    "step_register_model = ModelStep(name=\"ModelRegistrationStep\", step_args=step_args_register_model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "step_conditional_registration = ConditionStep(\n",
    "    name=\"ConditionalRegistrationStep\",\n",
    "    conditions=[\n",
    "        ConditionGreaterThanOrEqualTo(\n",
    "            left=JsonGet(\n",
    "                step_name=step_evaluation.name,\n",
    "                property_file=evaluation_report,\n",
    "                json_path=\"classification_metrics.weighted_f1.value\",\n",
    "            ),\n",
    "            right=model_registration_metric_threshold,\n",
    "        )\n",
    "    ],\n",
    "    if_steps=[step_register_model],\n",
    "    else_steps=[],  # pipeline end\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pipeline Execution"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After we define the steps, we combine them into a *Pipeline*. The steps are run in sequential order. The pipeline runs all the steps for an AutoML job using Autopilot and Pipelines for training, model evaluation, and model registration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline = Pipeline(\n",
    "    name=\"AutoMLTrainingPipeline\",\n",
    "    parameters=[\n",
    "        instance_count,\n",
    "        instance_type,\n",
    "        max_automl_runtime,\n",
    "        model_approval_status,\n",
    "        model_package_group_name,\n",
    "        model_registration_metric_threshold,\n",
    "        s3_bucket,\n",
    "        target_attribute_name,\n",
    "    ],\n",
    "    steps=[\n",
    "        step_auto_ml_training,\n",
    "        step_create_model,\n",
    "        step_batch_transform,\n",
    "        step_evaluation,\n",
    "        step_conditional_registration,\n",
    "    ],\n",
    "    sagemaker_session=pipeline_session,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "json.loads(pipeline.definition())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline.upsert(role_arn=execution_role)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline_execution = pipeline.start()\n",
    "pipeline_execution.describe()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This example pipeline execution will take around 35-40 minutes to complete:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline_execution.wait(delay=30, max_attempts=180)  # max. wait: 1.5 hours\n",
    "pipeline_execution.list_steps()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Deployment"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After we have manually reviewed the ML model’s performance, we can deploy our newly created model to a SageMaker endpoint. For this, we can run the cells below that create the model endpoint using the model configuration of the latest approved ML model saved in the SageMaker model registry's model package group.\n",
    " \n",
    "Note that this script is shared for demonstration purposes, but it’s recommended to follow a more robust CI/CD pipeline for production deployment for ML inference. For more information, refer to Building, automating, managing, and scaling ML workflows using [Amazon SageMaker Pipelines](https://aws.amazon.com/blogs/machine-learning/building-automating-managing-and-scaling-ml-workflows-using-amazon-sagemaker-pipelines/):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_package = sagemaker_client.list_model_packages(\n",
    "    MaxResults=1,\n",
    "    ModelApprovalStatus=\"Approved\",\n",
    "    ModelPackageGroupName=model_package_group_name.default_value,\n",
    "    SortBy=\"CreationTime\",\n",
    "    SortOrder=\"Descending\",\n",
    ")\n",
    "model_package_arn = model_package[\"ModelPackageSummaryList\"][0][\"ModelPackageArn\"]\n",
    "sagemaker_client.describe_model_package(ModelPackageName=model_package_arn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "while (\n",
    "    sagemaker_client.describe_model_package(ModelPackageName=model_package_arn)[\n",
    "        \"ModelPackageStatus\"\n",
    "    ]\n",
    "    != \"Completed\"\n",
    "):\n",
    "    time.sleep(10)\n",
    "model = ModelPackage(role=execution_role, model_package_arn=model_package_arn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.deploy(\n",
    "    initial_instance_count=instance_count.default_value,\n",
    "    instance_type=instance_type.default_value,\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Perform ML inference on the deployed endpoint using a sample from the test dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor = Predictor(\n",
    "    endpoint_name=model.endpoint_name,\n",
    "    sagemaker_session=pipeline_session,\n",
    "    serializer=CSVSerializer(),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "predictor.predict(\n",
    "    \"25, Private,226802, 11th,7, Never-married, Machine-op-inspct, Own-child, Black, Male,0,0,40, United-States\"\n",
    ").decode(\"utf-8\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cleanup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sagemaker_client.delete_endpoint(EndpointName=model.endpoint_name)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "This notebook describes an easy-to-use ML pipeline approach to automatically train tabular ML models (AutoML) using Autopilot, Pipelines, and Studio. AutoML improves ML practitioners’ efficiency, accelerating the path from ML experimentation to production without the need for extensive ML expertise. We outline the respective pipeline steps needed for ML model creation, evaluation, and registration.\n",
    "\n",
    "For more information on Autopilot and Pipelines, refer to Automate model development with [Amazon SageMaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html) and [Amazon SageMaker Pipelines](https://aws.amazon.com/sagemaker/pipelines/)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notebook CI Test Results\n",
    "\n",
    "This notebook was tested in multiple regions. The results are shown below, except for us-west-2, which appears at the top of the notebook.\n",
    "\n",
    "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-pipelines|tabular|automl-step|sagemaker_autopilot_pipelines_native_auto_ml_step.ipynb)\n",
    "\n",
    "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-pipelines|tabular|automl-step|sagemaker_autopilot_pipelines_native_auto_ml_step.ipynb)\n",
    "\n",
    "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-pipelines|tabular|automl-step|sagemaker_autopilot_pipelines_native_auto_ml_step.ipynb)\n",
    "\n",
    "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-pipelines|tabular|automl-step|sagemaker_autopilot_pipelines_native_auto_ml_step.ipynb)\n",
    "\n",
    "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-pipelines|tabular|automl-step|sagemaker_autopilot_pipelines_native_auto_ml_step.ipynb)\n",
    "\n",
    "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-pipelines|tabular|automl-step|sagemaker_autopilot_pipelines_native_auto_ml_step.ipynb)\n",
    "\n",
    "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-pipelines|tabular|automl-step|sagemaker_autopilot_pipelines_native_auto_ml_step.ipynb)\n",
    "\n",
    "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-pipelines|tabular|automl-step|sagemaker_autopilot_pipelines_native_auto_ml_step.ipynb)\n",
    "\n",
    "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-pipelines|tabular|automl-step|sagemaker_autopilot_pipelines_native_auto_ml_step.ipynb)\n",
    "\n",
    "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-pipelines|tabular|automl-step|sagemaker_autopilot_pipelines_native_auto_ml_step.ipynb)\n",
    "\n",
    "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-pipelines|tabular|automl-step|sagemaker_autopilot_pipelines_native_auto_ml_step.ipynb)\n",
    "\n",
    "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-pipelines|tabular|automl-step|sagemaker_autopilot_pipelines_native_auto_ml_step.ipynb)\n",
    "\n",
    "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-pipelines|tabular|automl-step|sagemaker_autopilot_pipelines_native_auto_ml_step.ipynb)\n",
    "\n",
    "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-pipelines|tabular|automl-step|sagemaker_autopilot_pipelines_native_auto_ml_step.ipynb)\n",
    "\n",
    "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-pipelines|tabular|automl-step|sagemaker_autopilot_pipelines_native_auto_ml_step.ipynb)\n"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "Python 3 (Data Science 3.0)",
   "language": "python",
   "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  },
  "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.  Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
 },
 "nbformat": 4,
 "nbformat_minor": 4
}