{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SageMaker Pipelines workshop\n", "\n", "This notebook works well with the `Python 3 (Data Science)` kernel on SageMaker Studio.\n", "\n", "---\n", "\n", "---\n", "\n", "This workshop is based on the [amazon sagemaker drift detection project](https://github.com/aws-samples/amazon-sagemaker-drift-detection) on github, available in the aws-samples repository. \n", "\n", "Amazon SageMaker Model Building Pipelines offers machine learning (ML) application developers and operations engineers the ability to orchestrate SageMaker jobs and author reproducible ML pipelines. It also enables them to deploy custom-build models for inference in real-time with low latency, run offline inferences with Batch Transform, and track lineage of artifacts. They can institute sound operational practices in deploying and monitoring production workflows, deploying model artifacts, and tracking artifact lineage through a simple interface, adhering to safety and best practice paradigms for ML application development.\n", "\n", "The SageMaker Pipelines service supports a SageMaker Pipeline domain specific language (DSL), which is a declarative JSON specification. This DSL defines a directed acyclic graph (DAG) of pipeline parameters and SageMaker job steps. The SageMaker Python Software Developer Kit (SDK) streamlines the generation of the pipeline DSL using constructs that engineers and scientists are already familiar with." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker Pipelines\n", "\n", "SageMaker Pipelines supports the following activities, which are demonstrated in this notebook:\n", "\n", "* Pipelines - A DAG of steps and conditions to orchestrate SageMaker jobs and resource creation.\n", "* Processing job steps - A simplified, managed experience on SageMaker to run data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation.\n", "* Training job steps - An iterative process that teaches a model to make predictions by presenting examples from a training dataset.\n", "* Conditional execution steps - A step that provides conditional execution of branches in a pipeline.\n", "* Register model steps - A step that creates a model package resource in the Model Registry that can be used to create deployable models in Amazon SageMaker.\n", "* Create model steps - A step that creates a model for use in transform steps or later publication as an endpoint.\n", "* Transform job steps - A batch transform to preprocess datasets to remove noise or bias that interferes with training or inference from a dataset, get inferences from large datasets, and run inference when a persistent endpoint is not needed.\n", "* Parametrized Pipeline executions - Enables variation in pipeline executions according to specified parameters." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook Overview\n", "\n", "This notebook shows how to:\n", "\n", "* Define a set of Pipeline parameters that can be used to parametrize a SageMaker Pipeline.\n", "* Define a Processing step that performs cleaning, feature engineering, and splitting the input data into train and test data sets.\n", "* Define a Training step that trains a model on the preprocessed train data set.\n", "* Define a Processing step that evaluates the trained model's performance on the test dataset.\n", "* Define a Create Model step that creates a model from the model artifacts used in training.\n", "* Define a Transform step that performs batch transformation based on the model that was created.\n", "* Define a Register Model step that creates a model package from the estimator and model artifacts used to train the model.\n", "* Define a Conditional step that measures a condition based on output from prior steps and conditionally executes other steps.\n", "* Define and create a Pipeline definition in a DAG, with the defined parameters and steps.\n", "* Start a Pipeline execution and wait for execution to complete.\n", "* Download the model evaluation report from the S3 bucket for examination.\n", "* Start a second Pipeline execution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will consider 3 teams in this scenario:\n", "* Engineering team - ML or data engineering\n", "* Data science team - developing the models\n", "* Dev ops team - automating and integrating the pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A SageMaker Pipeline\n", "\n", "The pipeline that you create follows a typical machine learning (ML) application pattern of preprocessing, training, evaluation, model creation, batch transformation, and model registration:\n", "\n", "![A typical ML Application pipeline](https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/a78c7255cbe7892408d8ea2b15a7a2117703befc/sagemaker-pipelines/tabular/abalone_build_train_deploy/img/pipeline-full.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import numpy as np \n", "import pandas as pd \n", "import json\n", "import os\n", "\n", "import boto3\n", "import sagemaker\n", "import sagemaker.session\n", "from sagemaker import get_execution_role\n", "\n", "from sagemaker.estimator import Estimator\n", "from sagemaker.debugger import Rule, rule_configs\n", "from sagemaker.inputs import TrainingInput\n", "from sagemaker.model_monitor.dataset_format import DatasetFormat\n", "from sagemaker.model_metrics import (\n", " MetricsSource,\n", " ModelMetrics,\n", ")\n", "from sagemaker.processing import (\n", " ProcessingInput,\n", " ProcessingOutput,\n", " Processor,\n", " ScriptProcessor,\n", ")\n", "from sagemaker.sklearn.processing import SKLearnProcessor\n", "from sagemaker.s3 import S3Uploader\n", "from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo\n", "from sagemaker.workflow.condition_step import ConditionStep\n", "from sagemaker.workflow.functions import JsonGet\n", "\n", "from sagemaker.workflow.parameters import (\n", " ParameterInteger,\n", " ParameterString,\n", ")\n", "from sagemaker.workflow.pipeline import Pipeline\n", "from sagemaker.workflow.properties import PropertyFile\n", "from sagemaker.workflow.steps import (\n", " ProcessingStep,\n", " TrainingStep,\n", " CacheConfig,\n", ")\n", "from sagemaker.workflow.step_collections import RegisterModel\n", "from 
sagemaker.utils import name_from_base" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "role = get_execution_role()\n", "\n", "region = boto3.Session().region_name\n", "\n", "sagemaker_session = sagemaker.Session()\n", "pipeline_session = sagemaker.workflow.pipeline_context.PipelineSession()\n", "\n", "base_job_prefix = 'Demo-random-forest-dm-pipeline'\n", "\n", "bucket = sagemaker_session.default_bucket()\n", "lab_pipelines_prefix = f\"{base_job_prefix}/demo-pipelines\"\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset\n", "We reuse the [direct marketing dataset](https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip) downloaded in the notebook [F1-xgboost_direct_marketing_sagemaker.ipynb](./F1-xgboost_direct_marketing_sagemaker.ipynb).\n", "\n", "> If you haven't run [F1-xgboost_direct_marketing_sagemaker.ipynb](./F1-xgboost_direct_marketing_sagemaker.ipynb), please open it and execute the code cells that download the data." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now let's read this into a pandas DataFrame and take a look." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# load the dataset for a quick look\n", "data_file = \"./bank-additional/bank-additional-full.csv\"\n", "data = pd.read_csv(data_file)\n", "pd.set_option('display.max_columns', 500)  # Make sure we can see all of the columns\n", "pd.set_option('display.max_rows', 20)  # Keep the output on one page\n", "data" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Preparation\n", "\n", "Let's start by specifying:\n", "\n", "- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.\n", "- The IAM role arn used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with the appropriate full IAM role arn string(s)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# processing input / output path\n", "processing_input_path = f's3://{bucket}/{lab_pipelines_prefix}/processing/input'\n", "processing_output_path = f's3://{bucket}/{lab_pipelines_prefix}/processing/output'\n", "\n", "# upload the data file to S3 for the processing job\n", "sagemaker.s3.S3Uploader.upload(data_file, processing_input_path, sagemaker_session=sagemaker_session)\n", "input_data = f'{processing_input_path}/bank-additional-full.csv'" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Define Parameters to Parametrize Pipeline Execution\n", "\n", "Define Pipeline parameters that you can use to parametrize the pipeline. Parameters enable custom pipeline executions and schedules without having to modify the Pipeline definition.\n", "\n", "The supported parameter types include:\n", "\n", "* `ParameterString` - represents a `str` Python type\n", "* `ParameterInteger` - represents an `int` Python type\n", "* `ParameterFloat` - represents a `float` Python type\n", "\n", "These parameters support providing a default value, which can be overridden on pipeline execution. 
The default value specified should be an instance of the type of the parameter.\n", "\n", "The parameters defined in this workflow include:\n", "\n", "* `processing_instance_type` - The `ml.*` instance type of the processing job.\n", "* `processing_instance_count` - The instance count of the processing job.\n", "* `training_instance_type` - The `ml.*` instance type of the training job.\n", "* `model_approval_status` - What approval status to register the trained model with for CI/CD purposes ( \"PendingManualApproval\" is the default).\n", "* `input_data` - The S3 bucket URI location of the input bank customer data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Define Parameters](https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/a78c7255cbe7892408d8ea2b15a7a2117703befc/sagemaker-pipelines/tabular/abalone_build_train_deploy/img/pipeline-1.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# parameters for pipeline execution\n", " \n", "input_data = ParameterString(\n", " name=\"InputDataUrl\",\n", " default_value=input_data,\n", ")\n", "\n", "processing_instance_count = ParameterInteger(\n", " name=\"ProcessingInstanceCount\", default_value=2\n", ")\n", "processing_instance_type = ParameterString(\n", " name=\"ProcessingInstanceType\", default_value=\"ml.m5.xlarge\"\n", ")\n", "training_instance_type = ParameterString(\n", " name=\"TrainingInstanceType\", default_value=\"ml.m5.xlarge\"\n", ")\n", "model_approval_status = ParameterString(\n", " name=\"ModelApprovalStatus\", default_value=\"PendingManualApproval\"\n", ")\n", "model_output = ParameterString(\n", " name=\"ModelOutputUrl\",\n", " default_value=f\"s3://{bucket}/{lab_pipelines_prefix}/model\",\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Create cache configuration (Unable to pass parameter for expire_after value)\n", "cache_config = CacheConfig(enable_caching=True, expire_after=\"PT1H\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define a Processing Step for Feature Engineering\n", "\n", "First, develop a preprocessing script that is specified in the Processing step.\n", "\n", "The Processing step executes the script on the input data. The Training step uses the preprocessed training features and labels to train a model. The Evaluation step uses the trained model and preprocessed test features and labels to evaluate the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Define a Processing Step for Feature Engineering](https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/a78c7255cbe7892408d8ea2b15a7a2117703befc/sagemaker-pipelines/tabular/abalone_build_train_deploy/img/pipeline-2.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, create an instance of an `SKLearnProcessor` processor and use that in our `ProcessingStep`.\n", "\n", "You also specify the `framework_version` to use throughout this notebook.\n", "\n", "Note the `processing_instance_type` and `processing_instance_count` parameters used by the processor instance." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# processing step for feature engineering\n", "sklearn_processor = SKLearnProcessor(\n", " framework_version=\"1.0-1\",\n", " instance_type=processing_instance_type,\n", " instance_count=processing_instance_count,\n", " base_job_name=f\"{base_job_prefix}-sklearn-preprocess\",\n", " sagemaker_session=pipeline_session,\n", " role=role,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, use the processor instance to construct a `ProcessingStep`, along with the input and output channels, and the code that will be executed when the pipeline invokes pipeline execution. This is similar to a processor instance's `run` method in the Python SDK.\n", "\n", "Note the `input_data` parameters passed into `ProcessingStep` is the input data used in the step. This input data is used by the processor instance when it is run.\n", "\n", "Also, note the `\"train\"`, `\"validation\"` and `\"test\"` named channels specified in the output configuration for the processing job. Step `Properties` can be used in subsequent steps and resolve to their runtime values at execution. Specifically, this usage is called out when you define the training step." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%writefile ./preprocessing.py\n", "\n", "import argparse\n", "import os\n", "\n", "import pandas as pd\n", "import numpy as np\n", "\n", "PROCESSING_INPUT_DIR = \"/opt/ml/processing/input\"\n", "PROCESSING_OUTPUT_DIR = \"/opt/ml/processing/output\"\n", "\n", "def preprocess(args): \n", " \n", " input_data_path = os.path.join(args.processing_input_dir, \"bank-additional-full.csv\")\n", " \n", " print(f\"Reading input data from {input_data_path}\")\n", " data = pd.read_csv(input_data_path)\n", " \n", " # Indicator variable to capture when pdays takes a value of 999\n", " data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0) \n", "\n", " # Indicator for individuals not actively employed\n", " data['not_working'] = np.where(np.in1d(data['job'], ['student', 'retired', 'unemployed']), 1, 0) \n", "\n", " # remove unnecessary data\n", " data = data.drop(\n", " ['duration', \n", " 'emp.var.rate', \n", " 'cons.price.idx', \n", " 'cons.conf.idx', \n", " 'euribor3m', \n", " 'nr.employed'\n", " ], \n", " axis=1)\n", "\n", " # Convert categorical variables to sets of indicators\n", " model_data = pd.get_dummies(data) \n", "\n", " # Replace \"y_no\" and \"y_yes\" with a single label column, and bring it to the front:\n", " model_data = pd.concat([model_data['y_yes'], model_data.drop(['y_no', 'y_yes'], axis=1)], axis=1)\n", " \n", " # Randomly sort the data then split \n", " validation_test_split_ratio = args.train_test_split_ratio + (1 - args.train_test_split_ratio) * 0.8\n", " train_data, validation_data, test_data = np.split(\n", " model_data.sample(frac=1, random_state=1729), \n", " [int(args.train_test_split_ratio * len(model_data)), \n", " int(validation_test_split_ratio * len(model_data))]) \n", " print(f\"total dataset length:{len(model_data)}\")\n", " print(f\"train data length:{len(train_data)}\")\n", " print(f\"validation data length:{len(validation_data)}\")\n", " print(f\"test data length:{len(test_data)}\")\n", " \n", " # output to local folder\n", " train_data_output_path = os.path.join(args.processing_output_dir, \"train/train.csv\")\n", " train_data.to_csv(train_data_output_path, header=True, index=False)\n", "\n", 
" validation_data_output_path = os.path.join(args.processing_output_dir, \"validation/validation.csv\")\n", " validation_data.to_csv(validation_data_output_path, header=True, index=False)\n", " \n", " test_data_output_path = os.path.join(args.processing_output_dir, \"test/test.csv\")\n", " test_data.to_csv(test_data_output_path, header=True, index=False)\n", "\n", " # For batch transform\n", " bt_data_output_path = os.path.join(args.processing_output_dir, \"batchtransform/data.csv\")\n", " test_data.to_csv(bt_data_output_path, header=False, index=False)\n", "\n", " \n", "if __name__ == \"__main__\":\n", " parser = argparse.ArgumentParser()\n", " parser.add_argument(\"--train-test-split-ratio\", type=float, default=0.7)\n", " parser.add_argument(\"--processing-input-dir\", type=str, default=PROCESSING_INPUT_DIR)\n", " parser.add_argument(\"--processing-output-dir\", type=str, default=PROCESSING_OUTPUT_DIR)\n", " args, _ = parser.parse_known_args()\n", " \n", " print(f\"Received arguments: {args}\")\n", "\n", " preprocess(args)\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "process_script = './preprocessing.py'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "step_process = ProcessingStep(\n", " name=\"PreprocessData\",\n", " processor=sklearn_processor,\n", " inputs=[\n", " ProcessingInput(\n", " source=input_data,\n", " destination=\"/opt/ml/processing/input\",\n", " s3_data_distribution_type=\"FullyReplicated\",\n", " )\n", " ],\n", " outputs=[\n", " ProcessingOutput(output_name=\"train\", source=\"/opt/ml/processing/output/train\"),\n", " ProcessingOutput(output_name=\"validation\", source=\"/opt/ml/processing/output/validation\"),\n", " ProcessingOutput(output_name=\"test\", source=\"/opt/ml/processing/output/test\"),\n", " ProcessingOutput(output_name=\"batchtransform\", source=\"/opt/ml/processing/output/batchtransform\"),\n", " ],\n", " code=process_script,\n", " cache_config=cache_config,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define a Training Step to Train a Model\n", "\n", "In this section, use Scikit-Learn Random Forest to train on this dataset. Configure an Estimator for the Random Forest algorithm and the input dataset. A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to `model_output` so that it can be hosted later. \n", "\n", "The model path where the models from training will be saved is also specified.\n", "\n", "Note the `training_instance_type` parameter may be used in multiple places in the pipeline. In this case, the `training_instance_type` is passed into the estimator." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Define a Training Step to Train a Model](https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/a78c7255cbe7892408d8ea2b15a7a2117703befc/sagemaker-pipelines/tabular/abalone_build_train_deploy/img/pipeline-3.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%writefile ./train.py\n", "# Python Built-Ins:\n", "import argparse\n", "import os\n", "\n", "# External Dependencies:\n", "#Joblib is a set of tools to provide lightweight pipelining in Python.\n", "# NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, \n", "# along with a large collection of high-level mathematical functions to operate on these arrays.\n", "\n", "# pandas is a software library written for the Python programming language for data manipulation and analysis. \n", "#In particular, it offers data structures and operations for manipulating numerical tables and time series.\n", "\n", "import joblib\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.metrics import accuracy_score, precision_score,recall_score, f1_score\n", "\n", "def train(args):\n", "\n", " print(\"loading training data\")\n", " train_data_path = os.path.join(args.train_dir, \"train.csv\")\n", " train_data = pd.read_csv(train_data_path, index_col=None)\n", " X_train, y_train = train_data.iloc[:, 1:], train_data.iloc[:, 0]\n", " \n", " validation_data_path = os.path.join(args.validation_dir, \"validation.csv\")\n", " validation_data = pd.read_csv(validation_data_path, index_col=None)\n", " X_val, y_val = validation_data.iloc[:, 1:], validation_data.iloc[:, 0]\n", "\n", " print(\"model training\")\n", " \n", " hyperparameters = {\n", " \"bootstrap\": [True],\n", " \"max_depth\": [12, 13],\n", " \"max_features\": [13, 14],\n", " \"n_estimators\": [100, 150]\n", " }\n", "\n", " model = RandomForestClassifier()\n", "\n", " grid_search = GridSearchCV(model, hyperparameters, cv=5, n_jobs=-1)\n", " grid_search.fit(X_train, y_train)\n", " best_model = grid_search.best_estimator_\n", " print(best_model)\n", " \n", " \n", " # Evaluation on Validation data.\n", " y_pred = grid_search.predict(X_val)\n", " accuracy = accuracy_score(y_val, y_pred)\n", " print(f\"accuracy: {accuracy}\")\n", " precision = precision_score(y_val, y_pred)\n", " print(f\"precision: {precision}\")\n", " recall = recall_score(y_val, y_pred)\n", " print(f\"recall: {recall}\")\n", " f1 = f1_score(y_val, y_pred)\n", " print(f\"f1: {f1}\")\n", " \n", " \n", " # Saving model\n", " path = os.path.join(args.model_dir, \"model.joblib\")\n", " print(\"Saving model to {}\".format(path))\n", " joblib.dump(best_model, path)\n", " \n", "if __name__ == '__main__':\n", "\n", " #------------------------------- parsing input parameters (from command line)\n", " print('extracting arguments')\n", " parser = argparse.ArgumentParser()\n", "\n", " # RandomForest hyperparameters\n", " parser.add_argument('--n-estimators', type=int, nargs=\"+\", default=[100, 150])\n", " parser.add_argument('--max-depth', type=int, nargs=\"+\", default=[12, 13])\n", " parser.add_argument('--max-features', type=int, nargs=\"+\", default=[13, 14])\n", " parser.add_argument('--bootstrap', type=bool, nargs=\"+\", default=[True])\n", "\n", " # Data, model, and output directories\n", " 
parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))\n", "    parser.add_argument('--train-dir', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))\n", "    parser.add_argument('--validation-dir', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION'))\n", "\n", "    args, _ = parser.parse_known_args()\n", "\n", "    print(f\"Received arguments: {args}\")\n", "\n", "    train(args)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# training step for generating model artifacts\n", "\n", "from sagemaker.sklearn.estimator import SKLearn\n", "\n", "sklearn_estimator = SKLearn(\n", "    entry_point='./train.py',\n", "    role=get_execution_role(),\n", "    instance_count=1,\n", "    instance_type=training_instance_type,  # pipeline parameter defined earlier\n", "    framework_version=\"1.0-1\",\n", "    base_job_name='randomforest-scikit',\n", "    output_path=model_output,  # pipeline parameter for the model artifact location\n", "    metric_definitions=[\n", "        { 'Name': 'accuracy', 'Regex': 'accuracy: ([0-9\\\\.]+)' },\n", "        { 'Name': 'precision', 'Regex': 'precision: ([0-9\\\\.]+)' },\n", "        { 'Name': 'recall', 'Regex': 'recall: ([0-9\\\\.]+)' },\n", "        { 'Name': 'f1', 'Regex': 'f1: ([0-9\\\\.]+)' },\n", "    ],\n", "    # Note: the hyperparameter search grid for GridSearchCV is defined inside train.py\n", "    max_run=20*60,  # Maximum allowed active runtime (in seconds)\n", "    # spot instances\n", "    # use_spot_instances=True,  # Use spot instances to reduce cost\n", "    # max_wait=30*60,  # Maximum clock time (including spot delays)\n", "    disable_profiler=False,\n", "    sagemaker_session=pipeline_session,\n", ")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Finally, use the estimator instance to construct a `TrainingStep`, passing the `properties` of the prior `ProcessingStep` as the training inputs. This is similar to calling an estimator's `fit` method in the Python SDK.\n", "\n", "Pass in the `S3Uri` of the `\"train\"` output channel to the `TrainingStep`, and use the `\"validation\"` output channel for model evaluation during training. The `properties` attribute of a Pipeline step matches the object model of the corresponding describe call's response. These properties can be referenced as placeholder values and are resolved at runtime. For example, the `ProcessingStep` `properties` attribute matches the object model of the [DescribeProcessingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeProcessingJob.html) response object."
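, "\n", "Before defining the training step, the next (optional) cell illustrates this placeholder behaviour. It assumes a recent SageMaker Python SDK in which pipeline properties expose an `expr` attribute; it prints the unresolved \"Get\" expression for the preprocessing step's `\"train\"` output rather than a concrete S3 URI." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Optional illustration: step properties are placeholders that are resolved at pipeline run time.\n", "# The .expr attribute (available in recent SDK versions) shows the underlying DSL \"Get\" expression.\n", "train_uri_property = step_process.properties.ProcessingOutputConfig.Outputs[\"train\"].S3Output.S3Uri\n", "print(train_uri_property.expr)"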
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "step_train = TrainingStep(\n", " name=\"TrainModel\",\n", " estimator=sklearn_estimator,\n", " inputs={\n", " \"train\": TrainingInput(\n", " s3_data=step_process.properties.ProcessingOutputConfig.Outputs[\"train\"].S3Output.S3Uri,\n", " content_type=\"text/csv\",\n", " ),\n", " \"validation\": TrainingInput(\n", " s3_data=step_process.properties.ProcessingOutputConfig.Outputs[\"validation\"].S3Output.S3Uri,\n", " content_type=\"text/csv\",\n", " ),\n", " },\n", " cache_config=cache_config,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define a Model Evaluation Step to Evaluate the Trained Model\n", "\n", "First, develop an evaluation script that is specified in a Processing step that performs the model evaluation.\n", "\n", "After pipeline execution, you can examine the resulting `evaluation.json` for analysis.\n", "\n", "The evaluation script uses `SKLearn Container` to do the following:\n", "\n", "* Load the model.\n", "* Read the test data.\n", "* Issue predictions against the test data.\n", "* Save the evaluation report to the evaluation directory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Define a Model Evaluation Step to Evaluate the Trained Model](https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/a78c7255cbe7892408d8ea2b15a7a2117703befc/sagemaker-pipelines/tabular/abalone_build_train_deploy/img/pipeline-4.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, create an instance of a `SKLearnProcessor` processor and use it in the `ProcessingStep`.\n", "\n", "Note the `processing_instance_type` parameter passed into the processor." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# processing step for evaluation\n", "sklearn_postprocessor = SKLearnProcessor(\n", " framework_version=\"1.0-1\", \n", " role=role, \n", " instance_type=\"ml.m5.xlarge\", \n", " instance_count=1,\n", " base_job_name=f\"{base_job_prefix}/postprocess-eval\",\n", " sagemaker_session=pipeline_session,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the processor instance to construct a `ProcessingStep`, along with the input and output channels and the code that will be executed when the pipeline invokes pipeline execution. This is similar to a processor instance's `run` method in the Python SDK.\n", "\n", "Specifically, the `S3ModelArtifacts` from the `step_train` `properties` and the `S3Uri` of the `\"test\"` output channel of the `step_process` `properties` are passed into the inputs. The `TrainingStep` and `ProcessingStep` `properties` attribute matches the object model of the [DescribeTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html) and [DescribeProcessingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeProcessingJob.html) response objects, respectively." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%writefile ./evaluation.py\n", "\n", "import os\n", "import json\n", "import logging\n", "from pathlib import Path\n", "import tarfile\n", "import argparse\n", "import joblib\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn.metrics import accuracy_score, precision_score,recall_score, f1_score, classification_report\n", "\n", "\n", "logger = logging.getLogger()\n", "logger.setLevel(logging.INFO)\n", "logger.addHandler(logging.StreamHandler())\n", "\n", "PROCESSING_INPUT_DIR = '/opt/ml/processing/test'\n", "PROCESSING_MODEL_DIR = '/opt/ml/processing/model'\n", "PROCESSING_EVALUATION_DIR = '/opt/ml/processing/evaluation'\n", "\n", "def postprocess(args):\n", " model_path = Path(args.processing_model_dir) / 'model.tar.gz'\n", " if model_path.exists():\n", " logger.info('Extracting model from path: {}'.format(model_path))\n", " with tarfile.open(model_path) as tar:\n", " tar.extractall(path=args.processing_model_dir)\n", "\n", " for file in os.listdir(args.processing_model_dir):\n", " logger.info(file)\n", " \n", " logger.debug(\"Loading the model.\")\n", " model = joblib.load(os.path.join(args.processing_model_dir, \"model.joblib\"))\n", "\n", " logger.debug(\"Reading test data.\")\n", " test_data_path = os.path.join(args.processing_input_dir, 'test.csv')\n", " \n", " # index_col is from the old version of pandas read_csv() api.\n", " test_data = pd.read_csv(test_data_path, index_col=None)\n", " print(test_data.head())\n", " \n", " logger.debug(\"Reading test data.\")\n", " y_test = test_data.iloc[:, 0].values\n", " test_data.drop(test_data.columns[0], axis=1, inplace=True)\n", " X_test = test_data.values\n", " \n", " logger.info(\"Performing predictions against test data.\")\n", " y_pred = model.predict(X_test) \n", " \n", " logger.info(f\"predictions: {y_pred}\")\n", "\n", " print('Creating classification evaluation report')\n", " accuracy = accuracy_score(y_test, y_pred)\n", " print(f\"accuracy: {accuracy}\")\n", " precision = precision_score(y_test, y_pred)\n", " print(f\"precision: {precision}\")\n", " recall = recall_score(y_test, y_pred)\n", " print(f\"recall: {recall}\")\n", " f1 = f1_score(y_test, y_pred)\n", " print(f\"f1: {f1}\")\n", " \n", " report_dict = {\n", " 'metrics': {\n", " 'accuracy': accuracy,\n", " 'precision': precision,\n", " 'recall': recall,\n", " 'f1': f1\n", " },\n", " }\n", " \n", " evaluation_output_path = os.path.join(args.processing_evaluation_dir, 'evaluation.json')\n", " print('Saving classification report to {}'.format(evaluation_output_path))\n", "\n", " with open(evaluation_output_path, 'w') as f:\n", " f.write(json.dumps(report_dict))\n", " \n", "if __name__ == \"__main__\":\n", " parser = argparse.ArgumentParser()\n", " parser.add_argument(\"--processing-input-dir\", type=str, default=PROCESSING_INPUT_DIR)\n", " parser.add_argument(\"--processing-model-dir\", type=str, default=PROCESSING_MODEL_DIR)\n", " parser.add_argument(\"--processing-evaluation-dir\", type=str, default=PROCESSING_EVALUATION_DIR)\n", " args, _ = parser.parse_known_args()\n", " \n", " print(f\"Received arguments: {args}\")\n", "\n", " postprocess(args)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "postprocess_script = './evaluation.py'\n", "\n", "evaluation_report = PropertyFile(\n", " name=\"EvaluationReport\",\n", " output_name=\"evaluation\",\n", " path=\"evaluation.json\",\n", 
")\n", "step_eval = ProcessingStep(\n", " name=\"EvaluateModel\",\n", " processor=sklearn_postprocessor,\n", " inputs=[\n", " ProcessingInput(\n", " source=step_train.properties.ModelArtifacts.S3ModelArtifacts,\n", " destination=\"/opt/ml/processing/model\",\n", " ),\n", " ProcessingInput(\n", " source=step_process.properties.ProcessingOutputConfig.Outputs[\n", " \"test\"\n", " ].S3Output.S3Uri,\n", " destination=\"/opt/ml/processing/test\",\n", " ),\n", " ],\n", " outputs=[\n", " ProcessingOutput(\n", " output_name=\"evaluation\", source=\"/opt/ml/processing/evaluation\"\n", " ),\n", " ],\n", " code=postprocess_script,\n", " property_files=[evaluation_report],\n", " cache_config=cache_config,\n", "\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define a Create Model Step to Create a Model\n", "\n", "In order to perform batch transformation using the example model, create a SageMaker model. \n", "\n", "Specifically, pass in the `S3ModelArtifacts` from the `TrainingStep`, `step_train` properties. The `TrainingStep` `properties` attribute matches the object model of the [DescribeTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html) response object." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Define a Create Model Step and Batch Transform to Process Data in Batch at Scale](https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/a78c7255cbe7892408d8ea2b15a7a2117703befc/sagemaker-pipelines/tabular/abalone_build_train_deploy/img/pipeline-5.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%writefile ./inference.py\n", "import os\n", "import joblib\n", "\n", "# inference functions ---------------\n", "def model_fn(model_dir):\n", " clf = joblib.load(os.path.join(model_dir, \"model.joblib\"))\n", " return clf\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker.inputs import CreateModelInput\n", "from sagemaker.workflow.steps import CreateModelStep\n", "from sagemaker.workflow.model_step import ModelStep\n", "from sagemaker.workflow.retry import (\n", " StepRetryPolicy,\n", " StepExceptionTypeEnum,\n", " SageMakerJobExceptionTypeEnum,\n", " SageMakerJobStepRetryPolicy,\n", ")\n", "\n", "from sagemaker.model import Model\n", "\n", "from sagemaker.sklearn.model import SKLearnModel\n", "\n", "model = SKLearnModel(\n", " model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,\n", " framework_version=\"1.0-1\",\n", " py_version='py3',\n", " role=get_execution_role(),\n", " entry_point='./inference.py',\n", " sagemaker_session=pipeline_session,\n", ")\n", "\n", "step_create_model = ModelStep(\n", " name=\"DirectMarketingCreateModel\",\n", " step_args=model.create(instance_type=\"ml.m5.xlarge\"),\n", " retry_policies=[\n", " StepRetryPolicy(exception_types=[StepExceptionTypeEnum.THROTTLING], max_attempts=3)\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define a Transform Step to Perform Batch Transformation\n", "\n", "Now that a model instance is defined, create a `Transformer` instance with the appropriate model type, compute instance type, and desired output S3 URI.\n", "\n", "Specifically, pass in the `ModelName` from the `CreateModelStep`, `step_create_model` properties. 
The `CreateModelStep` `properties` attribute matches the object model of the [DescribeModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeModel.html) response object." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker.transformer import Transformer\n", "\n", "batch_output = f\"s3://{bucket}/{lab_pipelines_prefix}/batchtransform\"\n", "transformer = Transformer(\n", " model_name=step_create_model.properties.ModelName,\n", " instance_type=\"ml.m5.xlarge\",\n", " instance_count=1,\n", " output_path=batch_output,\n", " accept=\"text/csv\",\n", " assemble_with=\"Line\", \n", " sagemaker_session=pipeline_session\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pass in the transformer instance and the `TransformInput` with the `batch_data` pipeline parameter defined earlier. More details of the `TransformInput` can be found [here](https://github.com/aws/sagemaker-python-sdk/blob/2594ffb3eaefaf55936b71ea0c38442135223602/src/sagemaker/inputs.py#L140). We use `input_filter` to filter out the first column, which is the label column, in the test data. We also set the batch job to join the input source with the batch job output. Please check the document about [associating inference with input record](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html) for more details" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker.inputs import TransformInput\n", "from sagemaker.workflow.steps import TransformStep\n", "\n", "\n", "step_transform = TransformStep(\n", " name=\"DirectMarketingTransform\",\n", " transformer=transformer,\n", " inputs=TransformInput(\n", " data=step_process.properties.ProcessingOutputConfig.Outputs[\n", " \"batchtransform\"\n", " ].S3Output.S3Uri,\n", " content_type=\"text/csv\", \n", " split_type=\"Line\",\n", " input_filter=\"$[1:]\",\n", " join_source=\"Input\",\n", " ),\n", " cache_config=cache_config,\n", ")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define a Register Model Step to Create a Model Package\n", "\n", "Use the estimator instance specified in the training step to construct an instance of `RegisterModel`. The result of executing `RegisterModel` in a pipeline is a model package. A model package is a reusable model artifacts abstraction that packages all ingredients required for inference. Primarily, it consists of an inference specification that defines the inference image to use along with an optional model weights location.\n", "\n", "A model package group is a collection of model packages. A model package group can be created for a specific ML business problem, and new versions of the model packages can be added to it. Typically, customers are expected to create a ModelPackageGroup for a SageMaker pipeline so that model package versions can be added to the group for every SageMaker Pipeline run.\n", "\n", "The construction of `RegisterModel` is similar to an estimator instance's `register` method in the Python SDK.\n", "\n", "Specifically, pass in the `S3ModelArtifacts` from the `TrainingStep`, `step_train` properties. 
The `TrainingStep` `properties` attribute matches the object model of the [DescribeTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html) response object.\n", "\n", "Note that the specific model package group name provided in this notebook can be used in the model registry and CI/CD work with SageMaker Projects." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# register model step that will be conditionally executed\n", "model_metrics = ModelMetrics(\n", "    model_statistics=MetricsSource(\n", "        s3_uri=\"{}/evaluation.json\".format(\n", "            step_eval.arguments[\"ProcessingOutputConfig\"][\"Outputs\"][0][\"S3Output\"][\n", "                \"S3Uri\"\n", "            ]\n", "        ),\n", "        content_type=\"application/json\",\n", "    )\n", ")\n", "model_package_group_name = \"DirectMarketingRandomForestModelGroup\"\n", "\n", "register_arg = model.register(\n", "    content_types=[\"text/csv\"],\n", "    response_types=[\"text/csv\"],\n", "    inference_instances=[\"ml.t2.medium\", \"ml.m5.large\"],\n", "    transform_instances=[\"ml.m5.xlarge\"],\n", "    model_package_group_name=model_package_group_name,\n", "    approval_status=model_approval_status,\n", "    model_metrics=model_metrics,\n", ")\n", "step_register = ModelStep(\n", "    name=\"RegisterModel\",\n", "    step_args=register_arg,\n", ")\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Define a Condition Step to Check Accuracy and Conditionally Create a Model and Run a Batch Transformation and Register a Model in the Model Registry\n", "\n", "In this step, the model is registered only if its accuracy, as determined by the evaluation step `step_eval`, exceeds a specified value. A `ConditionStep` enables pipelines to support conditional execution in the pipeline DAG based on the conditions of the step properties.\n", "\n", "In the following section, you:\n", "\n", "* Define a `ConditionGreaterThanOrEqualTo` on the accuracy value found in the output of the evaluation step, `step_eval`.\n", "* Use the condition in the list of conditions in a `ConditionStep`.\n", "* Pass the register model, create model, and transform steps into the `if_steps` of the `ConditionStep`; these steps are only executed if the condition evaluates to `True`."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Define a Condition Step to Check Accuracy and Conditionally Execute Steps](https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/a78c7255cbe7892408d8ea2b15a7a2117703befc/sagemaker-pipelines/tabular/abalone_build_train_deploy/img/pipeline-6.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# condition step for evaluating model quality and branching execution\n", "cond_lte = ConditionGreaterThanOrEqualTo(\n", " left=JsonGet(\n", " step_name=step_eval.name,\n", " property_file=evaluation_report,\n", " json_path=\"metrics.accuracy\",\n", " ),\n", " right=0.55,\n", ")\n", "\n", "# step_transform is ignored as the testing data is not ready for batch transform\n", "step_cond = ConditionStep(\n", " name=\"CheckEvaluation\",\n", " conditions=[cond_lte],\n", " if_steps=[step_register, step_create_model, step_transform],\n", " else_steps=[],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define a Pipeline of Parameters, Steps, and Conditions\n", "\n", "In this section, combine the steps into a Pipeline so it can be executed.\n", "\n", "A pipeline requires a `name`, `parameters`, and `steps`. Names must be unique within an `(account, region)` pair.\n", "\n", "Note:\n", "\n", "* All of the parameters used in the definitions must be present.\n", "* Steps passed into the pipeline do not have to be listed in the order of execution. The SageMaker Pipeline service resolves the _data dependency_ DAG as steps for the execution to complete.\n", "* Steps must be unique to across the pipeline step list and all condition step if/else lists." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Define a Pipeline of Parameters, Steps, and Conditions](https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/a78c7255cbe7892408d8ea2b15a7a2117703befc/sagemaker-pipelines/tabular/abalone_build_train_deploy/img/pipeline-7.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# pipeline instance\n", "pipeline = Pipeline(\n", " name=\"DirectMarketing-ModelBuildPipeline\",\n", " parameters=[\n", " input_data,\n", " processing_instance_type,\n", " processing_instance_count,\n", " training_instance_type,\n", " model_approval_status,\n", " model_output,\n", " ],\n", " steps=[step_process, step_train, step_eval, step_cond],\n", " sagemaker_session=pipeline_session,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### (Optional) Examining the pipeline definition\n", "\n", "The JSON of the pipeline definition can be examined to confirm the pipeline is well-defined and the parameters and step properties resolve correctly." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import json\n", "\n", "definition = json.loads(pipeline.definition())\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submit the pipeline to SageMaker and start execution\n", "\n", "Submit the pipeline definition to the Pipeline service. The role passed in will be used by the Pipeline service to create all the jobs defined in the steps." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "pipeline.upsert(role_arn=role)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pipeline Operations: Examining and Waiting for Pipeline Execution\n", "\n", "Describe the pipeline execution." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "execution = pipeline.start()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "execution.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Wait for the execution to complete.\n", "execution.wait()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examining the Evalution\n", "\n", "Examine the resulting model evaluation after the pipeline completes. Download the resulting `evaluation.json` file from S3 and print the report." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from pprint import pprint\n", "\n", "evaluation_json = sagemaker.s3.S3Downloader.read_file(\"{}/evaluation.json\".format(\n", " step_eval.arguments[\"ProcessingOutputConfig\"][\"Outputs\"][0][\"S3Output\"][\"S3Uri\"]\n", "))\n", "pprint(json.loads(evaluation_json))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lineage\n", "\n", "Review the lineage of the artifacts generated by the pipeline." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import time\n", "from sagemaker.lineage.visualizer import LineageTableVisualizer\n", "\n", "\n", "viz = LineageTableVisualizer(sagemaker.session.Session())\n", "for execution_step in reversed(execution.list_steps()):\n", " print(execution_step)\n", " display(viz.show(pipeline_execution_step=execution_step))\n", " time.sleep(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parametrized Executions\n", "\n", "You can run additional executions of the pipeline and specify different pipeline parameters. The parameters argument is a dictionary containing parameter names, and where the values are used to override the defaults values.\n", "\n", "Based on the performance of the model, you might want to kick off another pipeline execution on a compute-optimized instance type and set the model approval status to \"Approved\" automatically. This means that the model package version generated by the `RegisterModel` step is automatically ready for deployment through CI/CD pipelines, such as with SageMaker Projects." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "execution = pipeline.start(\n", " parameters=dict(\n", " ProcessingInstanceType=\"ml.c5.xlarge\",\n", " ModelApprovalStatus=\"Approved\",\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "execution.wait()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "execution.list_steps()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5d.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General 
purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": true, "memoryGiB": 0, "name": "ml.geospatial.interactive", "supportedImageNames": [ "sagemaker-geospatial-v1-0" ], "vcpuNum": 0 }, { "_defaultOrder": 21, "_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 144, "name": "ml.c5.18xlarge", "vcpuNum": 72 }, { "_defaultOrder": 28, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 29, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 32, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 34, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 35, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, 
"hideHardwareSpecs": false, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 38, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 42, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 46, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 47, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 53, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.g5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 54, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 
} ], "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:ap-southeast-2:452832661640:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }