{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Use SKlearn and Amazon SageMaker Clarify in AWS StepFunctions\n", "_**Run Amazon SageMaker Clarify processing in a AWS StepFunction**_\n", "\n", "---\n", "\n", "## Contents\n", "1. [Introduction](#Introduction)\n", "2. [Setup](#Setup)\n", " 1. [Source the libraries](#Source-the-libraries)\n", " 2. [Set S3 bucket, data prefix and permission](#Set-S3-bucket,-data-prefix-and-permission)\n", "3. [Load and upload the data](#Load-and-upload-the-data)\n", "4. [Create the pipeline](#Create-the-pipeline)\n", " 1. [Set training session parameters](#Set-training-session-parameters)\n", " 2. [Set SKLearn container](#Set-SKLearn-container)\n", " 3. [Define Amazon SageMaker Clarify processing container](#Define-Amazon-SageMaker-Clarify-processing-container)\n", "5. [Test the endpoint and make predictions](#Test-the-endpoint-and-make-predictions)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "This notebook demonstrates the use of Amazon SageMaker SKLearn to train a regression model. \n", "\n", "We use the [Abalone data](https://datahub.io/machine-learning/abalone), originally from the [UCI data repository](https://archive.ics.uci.edu/ml/datasets/abalone). More details about the original dataset can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names).\n", "\n", "---\n", "## Setup\n", "\n", "This notebook was tested in Amazon SageMaker notebook on a ml.t3.medium instance with Python 3 (conda_python3) kernel.\n", "\n", "Let's start by specifying:\n", "1. Install AWS StepFunctions Data Science Python SDK\n", "2. Sourcing libraries\n", "3. The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. Also, the IAM role arn used to give training and hosting access to your data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Source the libraries\n", "\n", "First we install the AWS StepFunctions Data Science Python SDK. And then we source in all libraries required to run this notebook." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting stepfunctions\n", " Downloading stepfunctions-2.0.0.tar.gz (60 kB)\n", "\u001b[K |████████████████████████████████| 60 kB 6.0 MB/s eta 0:00:01\n", "\u001b[?25hRequirement already satisfied: sagemaker>=2.1.0 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from stepfunctions) (2.38.0)\n", "Requirement already satisfied: boto3>=1.14.38 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from stepfunctions) (1.17.55)\n", "Requirement already satisfied: pyyaml in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from stepfunctions) (5.4.1)\n", "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from boto3>=1.14.38->stepfunctions) (0.10.0)\n", "Requirement already satisfied: s3transfer<0.5.0,>=0.4.0 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from boto3>=1.14.38->stepfunctions) (0.4.1)\n", "Requirement already satisfied: botocore<1.21.0,>=1.20.55 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from boto3>=1.14.38->stepfunctions) (1.20.55)\n", "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from botocore<1.21.0,>=1.20.55->boto3>=1.14.38->stepfunctions) (1.26.4)\n", "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from botocore<1.21.0,>=1.20.55->boto3>=1.14.38->stepfunctions) (2.8.1)\n", "Requirement already satisfied: six>=1.5 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.21.0,>=1.20.55->boto3>=1.14.38->stepfunctions) (1.15.0)\n", "Requirement already satisfied: protobuf>=3.1 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from sagemaker>=2.1.0->stepfunctions) (3.15.2)\n", "Requirement already satisfied: protobuf3-to-dict>=0.1.5 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from sagemaker>=2.1.0->stepfunctions) (0.1.5)\n", "Requirement already satisfied: packaging>=20.0 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from sagemaker>=2.1.0->stepfunctions) (20.9)\n", "Requirement already satisfied: attrs in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from sagemaker>=2.1.0->stepfunctions) (20.3.0)\n", "Requirement already satisfied: smdebug-rulesconfig==1.0.1 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from sagemaker>=2.1.0->stepfunctions) (1.0.1)\n", "Requirement already satisfied: google-pasta in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from sagemaker>=2.1.0->stepfunctions) (0.2.0)\n", "Requirement already satisfied: pandas in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from sagemaker>=2.1.0->stepfunctions) (1.1.5)\n", "Requirement already satisfied: pathos in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from sagemaker>=2.1.0->stepfunctions) (0.2.7)\n", "Requirement already satisfied: importlib-metadata>=1.4.0 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from sagemaker>=2.1.0->stepfunctions) (3.7.0)\n", "Requirement already satisfied: numpy>=1.9.0 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from 
sagemaker>=2.1.0->stepfunctions) (1.19.5)\n", "Requirement already satisfied: typing-extensions>=3.6.4 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from importlib-metadata>=1.4.0->sagemaker>=2.1.0->stepfunctions) (3.7.4.3)\n", "Requirement already satisfied: zipp>=0.5 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from importlib-metadata>=1.4.0->sagemaker>=2.1.0->stepfunctions) (3.4.0)\n", "Requirement already satisfied: pyparsing>=2.0.2 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from packaging>=20.0->sagemaker>=2.1.0->stepfunctions) (2.4.7)\n", "Requirement already satisfied: pytz>=2017.2 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from pandas->sagemaker>=2.1.0->stepfunctions) (2021.1)\n", "Requirement already satisfied: dill>=0.3.3 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from pathos->sagemaker>=2.1.0->stepfunctions) (0.3.3)\n", "Requirement already satisfied: pox>=0.2.9 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from pathos->sagemaker>=2.1.0->stepfunctions) (0.2.9)\n", "Requirement already satisfied: ppft>=1.6.6.3 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from pathos->sagemaker>=2.1.0->stepfunctions) (1.6.6.3)\n", "Requirement already satisfied: multiprocess>=0.70.11 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from pathos->sagemaker>=2.1.0->stepfunctions) (0.70.11.1)\n", "Building wheels for collected packages: stepfunctions\n", " Building wheel for stepfunctions (setup.py) ... \u001b[?25ldone\n", "\u001b[?25h Created wheel for stepfunctions: filename=stepfunctions-2.0.0-py2.py3-none-any.whl size=71523 sha256=9e1c07f59f1e4d8897e556343007271722342a5de0ab662ba2f5dcc761734308\n", " Stored in directory: /home/ec2-user/.cache/pip/wheels/29/b2/a5/d9e16e3f4bfa7af770ce2df2e20dec0963b92f59ac62784e81\n", "Successfully built stepfunctions\n", "Installing collected packages: stepfunctions\n", "Successfully installed stepfunctions-2.0.0\n" ] } ], "source": [ "import sys\n", "!{sys.executable} -m pip install stepfunctions" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "from sagemaker.image_uris import retrieve\n", "from sagemaker.sklearn.estimator import SKLearn\n", "from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput\n", "from sagemaker.dataset_definition.inputs import S3Input\n", "from sagemaker.inputs import TrainingInput\n", "from sagemaker.s3 import S3Uploader\n", "import stepfunctions\n", "from stepfunctions import steps\n", "from stepfunctions.inputs import ExecutionInput\n", "from stepfunctions.workflow import Workflow\n", "\n", "import logging\n", "import time\n", "import boto3\n", "import json\n", "import pandas as pd\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set S3 bucket, data prefix and permission\n", "\n", "Here the user should set all the required variables that will be used throughout the notebook:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# the bucket with the project/use case data\n", "S3_BUCKET = 'sagemaker-clarify-demo' # YOUR_S3_BUCKET\n", "# the project/use case prefix\n", "PREFIX = \"abalone-clarify-statemachine\" # \"YOUR_PREFIX\"\n", "# the perfix for the training, validation, baseline data for the ML model and clarify\n", "DATA_PREFIX = 
"# the step functions execution role\n", "account = boto3.client('sts').get_caller_identity().get('Account')\n", "WORKFLOW_EXEC_ROLE = \"arn:aws:iam::{}:role/sagemaker-clarify-demo-StepFunctionsRole\".format(account)\n", "STEPFUNCTION_NAME = \"abalone-training\"  # \"YOUR_STEPFUNCTION_NAME\"\n", "# the target/label column of the data\n", "TARGET_COL = \"Rings\"  # \"YOUR_TARGET_COL\"\n", "# the prefix to contain the Clarify config file\n", "CLARIFY_CONFIG_PREFIX = \"{}/clarify-explainability\".format(PREFIX)  # \"CLARIFY_CONFIG_PREFIX\"\n", "# flag controlling whether to generate the training/validation/baseline/config data for the state machine\n", "generate_data = True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load and upload the data\n", "\n", "Generate the training and validation data and the data config for the regression model, and generate the baseline data for the Clarify explainability job." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "predictor.py\n" ] } ], "source": [ "# This cell packages your entry point script for the Amazon SageMaker training job\n", "# into a model.tar.gz archive and stores it in your model/ folder\n", "!cp model/predictor.py ./\n", "!tar -czvf model.tar.gz predictor.py\n", "!mv predictor.py model/predictor.py\n", "!mv model.tar.gz model/model.tar.gz" ] },
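{ "cell_type": "markdown", "metadata": {}, "source": [ "The training entry point `model/predictor.py` ships with this example and is packaged by the cell above; it is not listed in the notebook. As a rough sketch (an assumption for orientation, not the shipped script), an SKLearn entry point matching the hyperparameters and `metric_definitions` used below could look like this:\n", "\n", "```python\n", "import argparse\n", "import json\n", "import os\n", "\n", "import joblib\n", "import pandas as pd\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.metrics import mean_absolute_error\n", "\n", "\n", "def model_fn(model_dir):\n", "    # loaded by the SKLearn serving container at inference time\n", "    return joblib.load(os.path.join(model_dir, 'model.joblib'))\n", "\n", "\n", "if __name__ == '__main__':\n", "    # SageMaker passes the estimator hyperparameters as command-line arguments\n", "    parser = argparse.ArgumentParser()\n", "    parser.add_argument('--n_estimators', type=int, default=100)\n", "    parser.add_argument('--max_depth', type=int, default=10)\n", "    parser.add_argument('--max_features', type=str, default='sqrt')\n", "    parser.add_argument('--random_state', type=int, default=42)\n", "    args, _ = parser.parse_known_args()\n", "\n", "    # each TrainingInput channel is mounted under SM_CHANNEL_<NAME>\n", "    train = pd.read_csv(os.path.join(os.environ['SM_CHANNEL_TRAIN'], 'train_data.csv'))\n", "    val = pd.read_csv(os.path.join(os.environ['SM_CHANNEL_VALIDATION'], 'val_data.csv'))\n", "    with open(os.path.join(os.environ['SM_CHANNEL_CONFIG'], 'config_data.json')) as f:\n", "        config = json.load(f)\n", "\n", "    # one-hot encode the categorical columns named in the config file\n", "    train = pd.get_dummies(train, columns=config['one_hot_encoding'])\n", "    val = pd.get_dummies(val, columns=config['one_hot_encoding'])\n", "\n", "    y_train, X_train = train['Rings'], train.drop('Rings', axis=1)\n", "    y_val, X_val = val['Rings'], val.drop('Rings', axis=1)\n", "\n", "    model = RandomForestRegressor(\n", "        n_estimators=args.n_estimators,\n", "        max_depth=args.max_depth,\n", "        max_features=args.max_features,\n", "        random_state=args.random_state\n", "    ).fit(X_train, y_train)\n", "\n", "    # printed in the format matched by the metric_definitions regexes below\n", "    print('Train_mae={};'.format(mean_absolute_error(y_train, model.predict(X_train))))\n", "    print('Validation_mae={};'.format(mean_absolute_error(y_val, model.predict(X_val))))\n", "\n", "    joblib.dump(model, os.path.join(os.environ['SM_MODEL_DIR'], 'model.joblib'))\n", "```" ] },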
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "if generate_data:\n", "    df = pd.read_csv('https://datahub.io/machine-learning/abalone/r/abalone.csv')\n", "    cols = [x if \"rings\" not in x else \"Rings\" for x in df.columns]\n", "    df.columns = cols\n", "    # X: input features - this is what your algorithm takes for learning\n", "    # y: target - this is what your algorithm will predict\n", "    X = df.drop(TARGET_COL, axis=1)\n", "    y = df[TARGET_COL]\n", "    # Split the data into 70% training and 30% validation data\n", "    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)\n", "    train = pd.concat([y_train, X_train], axis=1)\n", "    val = pd.concat([y_val, X_val], axis=1)\n", "    # save the data locally and then upload to S3\n", "    train.to_csv('train_data.csv', index=False)  # training data\n", "    val.to_csv('val_data.csv', index=False)  # validation data\n", "    baseline = train.agg({'Sex': 'mode',\n", "                          'Length': 'mean',\n", "                          'Diameter': 'mean',\n", "                          'Height': 'mean',\n", "                          'Whole_weight': 'mean',\n", "                          'Shucked_weight': 'mean',\n", "                          'Viscera_weight': 'mean',\n", "                          'Shell_weight': 'mean'})  # used in SageMaker Clarify: only store your features\n", "    baseline.to_csv('baseline.csv', index=False, header=False)\n", "    # write the columns to be one-hot encoded and the column names, in the same order as in the training data, into a config file\n", "    # this config file will be read during training and prediction to one-hot encode the columns\n", "    config_data = {\n", "        'one_hot_encoding': ['Sex'],\n", "        'numeric': ['Length', 'Diameter', 'Height', 'Whole_weight', 'Shucked_weight', 'Viscera_weight', 'Shell_weight'],\n", "        'header': train.columns.tolist()\n", "    }\n", "    with open('config_data.json', 'w') as outfile:\n", "        json.dump(config_data, outfile)\n", "    # upload the data to S3\n", "    train_uri = S3Uploader.upload('train_data.csv', f's3://{S3_BUCKET}/{DATA_PREFIX}')\n", "    val_uri = S3Uploader.upload('val_data.csv', f's3://{S3_BUCKET}/{DATA_PREFIX}')\n", "    baseline_uri = S3Uploader.upload('baseline.csv', f's3://{S3_BUCKET}/{DATA_PREFIX}')\n", "    config_uri = S3Uploader.upload('config_data.json', f's3://{S3_BUCKET}/{DATA_PREFIX}')\n", "    model_uri = S3Uploader.upload('model/model.tar.gz', f's3://{S3_BUCKET}/{DATA_PREFIX}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create the pipeline\n", "\n", "### Set training session parameters\n", "\n", "In this section you set the data sources for the model run, as well as the Amazon SageMaker SDK session variables." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Get a SageMaker-compatible role used by this function and the session.\n", "sagemaker_session = sagemaker.Session()\n", "region = sagemaker_session.boto_region_name\n", "role = sagemaker.get_execution_role()\n", "# get the Clarify image URI for the processing job that runs Clarify\n", "clarify_image_uri = retrieve('clarify', region)\n", "stepfunctions.set_stream_logger(level=logging.INFO)\n", "\n", "# define the required inputs for the SKLearn estimator\n", "framework_version = '0.23-1'\n", "instance_type = 'ml.m5.xlarge'\n", "instance_count = 1\n", "# Set your code folder\n", "source_dir = model_uri\n", "# if generate_data is False, point at the previously uploaded archive instead:\n", "# source_dir = \"s3://{}/{}/model.tar.gz\".format(S3_BUCKET, DATA_PREFIX)\n", "entry_point = 'predictor.py'\n", "\n", "# training input path\n", "train_uri = \"s3://{}/{}/train_data.csv\".format(S3_BUCKET, DATA_PREFIX)\n", "# validation input path\n", "val_uri = \"s3://{}/{}/val_data.csv\".format(S3_BUCKET, DATA_PREFIX)\n", "# baseline (required for Clarify) input path\n", "baseline_uri = \"s3://{}/{}/baseline.csv\".format(S3_BUCKET, DATA_PREFIX)\n", "# data config input path\n", "config_data_uri = \"s3://{}/{}/config_data.json\".format(S3_BUCKET, DATA_PREFIX)\n", "# clarify config input path\n", "clarify_config_uri = \"s3://{}/{}/analysis_config.json\".format(S3_BUCKET, CLARIFY_CONFIG_PREFIX)\n", "# explainability output path\n", "explainability_output_uri = \"s3://{}/{}/\".format(S3_BUCKET, CLARIFY_CONFIG_PREFIX)\n", "\n", "# Set the training inputs for the SKLearn estimator (not required but recommended)\n", "train_data = TrainingInput(train_uri, content_type='csv')\n", "validation_data = TrainingInput(val_uri, content_type='csv')\n", "config_data = TrainingInput(config_data_uri, content_type='json')\n", "\n", "s3_client = boto3.client('s3')\n", "result = s3_client.get_object(Bucket=config_data_uri.split('/')[2], Key='/'.join(config_data_uri.split('/')[3:]))\n", "config_dict = json.loads(result['Body'].read().decode('utf-8'))\n", "\n", "# Training job hyperparameters\n", "hyperparameters = {\n", "    'n_estimators': 100,\n", "    'max_depth': 10,\n", "    'max_features': 'sqrt',\n", "    'random_state': 42\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set SKLearn container" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "model = SKLearn(\n", "    entry_point=entry_point,\n", "    source_dir=source_dir,\n", "    hyperparameters=hyperparameters,\n", "    role=role,\n", "    instance_count=instance_count,\n", "    instance_type=instance_type,\n", "    framework_version=framework_version,\n", "    sagemaker_session=sagemaker_session,\n", "    code_location=f's3://{S3_BUCKET}/{PREFIX}/model/',\n", "    output_path=f's3://{S3_BUCKET}/{PREFIX}/model/',\n", "    enable_sagemaker_metrics=True,\n", "    metric_definitions=[\n", "        {\n", "            'Name': 'train:mae',\n", "            'Regex': 'Train_mae=(.*?);'\n", "        },\n", "        {\n", "            'Name': 'validation:mae',\n", "            'Regex': 'Validation_mae=(.*?);'\n",
"        }\n", "    ])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define Amazon SageMaker Clarify processing container\n", "\n", "In this section we define the processing container. Amazon SageMaker Clarify runs as an Amazon SageMaker processing job, so we integrate a processing job into the AWS Step Functions workflow." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# define the Clarify processor inputs and outputs\n", "inputs = [\n", "    ProcessingInput(\n", "        input_name=\"dataset\",\n", "        app_managed=False,\n", "        s3_input=S3Input(\n", "            s3_uri=train_uri,\n", "            local_path=\"/opt/ml/processing/input/data\",\n", "            s3_data_type=\"S3Prefix\",\n", "            s3_input_mode=\"File\",\n", "            s3_data_distribution_type=\"FullyReplicated\",\n", "            s3_compression_type=\"None\"\n", "        )\n", "    ),\n", "    ProcessingInput(\n", "        input_name=\"analysis_config\",\n", "        app_managed=False,\n", "        s3_input=S3Input(\n", "            s3_uri=clarify_config_uri,\n", "            local_path=\"/opt/ml/processing/input/config\",\n", "            s3_data_type=\"S3Prefix\",\n", "            s3_input_mode=\"File\",\n", "            s3_data_distribution_type=\"FullyReplicated\",\n", "            s3_compression_type=\"None\"\n", "        )\n", "    )\n", "]\n", "outputs = [\n", "    ProcessingOutput(\n", "        source=\"/opt/ml/processing/output\",\n", "        destination=explainability_output_uri,\n", "        output_name=\"analysis_result\",\n", "        s3_upload_mode=\"EndOfJob\",\n", "    )\n", "]\n", "# define the Clarify processor\n", "processor = Processor(\n", "    image_uri=clarify_image_uri,\n", "    role=role,\n", "    instance_type=instance_type,\n", "    instance_count=1,\n", "    sagemaker_session=sagemaker_session)" ] },
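{ "cell_type": "markdown", "metadata": {}, "source": [ "The `analysis_config` input above points at `analysis_config.json`, which is written to S3 at execution time by the `ClarifyWriteConfigLambda` Lambda function (invoked as a step of the state machine below), not by this notebook. For orientation, a SHAP explainability configuration for this dataset could look roughly like the following sketch; the exact file produced by the Lambda may differ, and the `model_name` is only known once the state machine runs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative only -- the Lambda step of the state machine writes the real\n", "# analysis_config.json to S3. This sketch shows the general shape of a\n", "# SageMaker Clarify SHAP explainability configuration for this dataset.\n", "example_analysis_config = {\n", "    'dataset_type': 'text/csv',\n", "    'headers': config_dict['header'],\n", "    'label': TARGET_COL,\n", "    'methods': {\n", "        'shap': {\n", "            'baseline': baseline_uri,  # the baseline.csv uploaded earlier\n", "            'num_samples': 100,\n", "            'agg_method': 'mean_abs'\n", "        },\n", "        'report': {'name': 'report', 'title': 'Analysis Report'}\n", "    },\n", "    'predictor': {\n", "        'model_name': 'sagemaker-sklearn-job-<timestamp>',  # set by the Lambda at run time\n", "        'instance_type': instance_type,\n", "        'initial_instance_count': 1,\n", "        'accept_type': 'text/csv'\n", "    }\n", "}\n", "print(json.dumps(example_analysis_config, indent=2))" ] },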
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Define input schema" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# define the execution input schema\n", "schema = {\n", "    \"TrainingJobName\": str,\n", "    \"ModelName\": str,\n", "    \"EndpointName\": str,\n", "    \"ClarifyWriteConfigLambda\": str,\n", "    \"ClarifyJobName\": str,\n", "}" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# define the execution input, which needs to be passed in this format to the state machine\n", "execution_input = ExecutionInput(schema=schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define AWS Step Functions steps" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Set the model training step\n", "train_step = steps.TrainingStep(\n", "    'Model Training',\n", "    estimator=model,\n", "    data={\n", "        'train': train_data,\n", "        'validation': validation_data,\n", "        'config': config_data\n", "    },\n", "    job_name=execution_input['TrainingJobName'],\n", "    wait_for_completion=True\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Save the model to SageMaker\n", "model_step = steps.ModelStep(\n", "    'Save SageMaker Model',\n", "    model=train_step.get_expected_model(),\n", "    model_name=execution_input['ModelName']\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# SageMaker Clarify config\n", "lambda_step = steps.compute.LambdaStep(\n", "    'Write Clarify config file',\n", "    parameters={\n", "        \"FunctionName\": execution_input['ClarifyWriteConfigLambda'],\n", "        \"Payload\": {\n", "            \"bucket\": S3_BUCKET,\n", "            \"data_prefix\": DATA_PREFIX,\n", "            \"model_name\": execution_input['ModelName'],\n", "            \"label\": TARGET_COL,\n", "            \"header\": config_dict['header'],\n", "            \"clarify_config_prefix\": CLARIFY_CONFIG_PREFIX\n", "        }\n", "    }\n", ")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# SageMaker Clarify\n", "clarify_processing = steps.ProcessingStep(\n", "    'SageMaker Clarify Processing',\n", "    processor=processor,\n", "    job_name=execution_input['ClarifyJobName'],\n", "    inputs=inputs,\n", "    outputs=outputs,\n", "    wait_for_completion=True\n", ")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Create the endpoint config for the model\n", "endpoint_config = steps.EndpointConfigStep(\n", "    \"Create SageMaker Endpoint Config\",\n", "    endpoint_config_name=execution_input['ModelName'],\n", "    model_name=execution_input['ModelName'],\n", "    initial_instance_count=instance_count,\n", "    instance_type=instance_type\n", ")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Create/update the model endpoint\n", "endpoint = steps.EndpointStep(\n", "    'Create/Update SageMaker Endpoint',\n", "    endpoint_name=execution_input['EndpointName'],\n", "    endpoint_config_name=execution_input['ModelName'],\n", "    update=False\n", ")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Chain together the steps for the state machine\n", "workflow_definition = steps.Chain([\n", "    train_step,\n", "    model_step,\n", "    lambda_step,\n", "    clarify_processing,\n", "    endpoint_config,\n", "    endpoint\n", "])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Define the Workflow\n", "workflow = Workflow(\n", "    name=STEPFUNCTION_NAME,\n", "    definition=workflow_definition,\n", "    role=WORKFLOW_EXEC_ROLE,\n", "    execution_input=execution_input\n", ")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[31m[ERROR] A workflow with the same name already exists on AWS Step Functions. To update a workflow, use Workflow.update().\u001b[0m\n" ] }, { "data": { "text/plain": [ "'arn:aws:states:us-east-1:148244586595:stateMachine:abalone-training'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create the workflow\n", "workflow.create()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[32m[INFO] Workflow updated successfully on AWS Step Functions. All execute() calls will use the updated definition and role within a few seconds. \u001b[0m\n", "\u001b[32m[INFO] Workflow execution started successfully on AWS Step Functions.\u001b[0m\n" ] } ], "source": [ "workflow.update(definition=workflow_definition, role=WORKFLOW_EXEC_ROLE)\n", "time.sleep(10)\n", "\n", "gid = time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "# note: named execution_inputs to avoid shadowing the Clarify ProcessingInput list above\n", "execution_inputs = {\n", "    \"TrainingJobName\": \"sagemaker-sklearn-job-{}\".format(gid),\n", "    \"ModelName\": \"sagemaker-sklearn-job-{}\".format(gid),\n", "    \"EndpointName\": \"sagemaker-sklearn-job-{}\".format(gid),\n", "    \"ClarifyWriteConfigLambda\": \"sagemaker-clarify-write-config\",\n", "    \"ClarifyJobName\": \"sagemaker-sklearn-job-{}\".format(gid),\n", "}\n", "\n", "execution = workflow.execute(inputs=execution_inputs)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# workflow.render_graph(portrait=True)" ] },
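{ "cell_type": "markdown", "metadata": {}, "source": [ "## Test the endpoint and make predictions\n", "\n", "Once the state machine execution has finished and the endpoint is `InService`, we can send a few validation rows to it. The cell below is a minimal sketch: it assumes the data generation cell above was run in this session (so `X_val` is in memory) and that `predictor.py` accepts CSV feature rows and returns CSV predictions. Adjust the payload and parsing if your entry point serializes data differently." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Block until the state machine execution reaches a terminal state\n", "execution.get_output(wait=True)\n", "\n", "# The endpoint is created asynchronously by the final state machine step,\n", "# so wait until it is InService before invoking it\n", "sm_waiter = boto3.client('sagemaker').get_waiter('endpoint_in_service')\n", "sm_waiter.wait(EndpointName=execution_inputs['EndpointName'])\n", "\n", "# Invoke the deployed endpoint with a few validation rows (features only, no label)\n", "runtime_client = boto3.client('sagemaker-runtime')\n", "payload = X_val.head(5).to_csv(header=False, index=False)\n", "response = runtime_client.invoke_endpoint(\n", "    EndpointName=execution_inputs['EndpointName'],\n", "    ContentType='text/csv',\n", "    Body=payload\n", ")\n", "print(response['Body'].read().decode('utf-8'))" ] },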
\u001b[0m\n", "\u001b[32m[INFO] Workflow execution started successfully on AWS Step Functions.\u001b[0m\n" ] } ], "source": [ "workflow.update(definition=workflow_definition, role=WORKFLOW_EXEC_ROLE)\n", "time.sleep(10)\n", "\n", "gid = time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "inputs = {\n", " \"TrainingJobName\": \"sagemaker-sklearn-job-{}\".format(gid),\n", " \"ModelName\": \"sagemaker-sklearn-job-{}\".format(gid),\n", " \"EndpointName\": \"sagemaker-sklearn-job-{}\".format(gid),\n", " \"ClarifyWriteConfigLambda\": \"sagemaker-clarify-write-config\",\n", " \"ClarifyJobName\": \"sagemaker-sklearn-job-{}\".format(gid),\n", "}\n", "\n", "execution = workflow.execute(inputs=inputs)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# workflow.render_graph(portrait=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleanup the SageMaker endpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sm_client = boto3.client('sagemaker')\n", "response = sm_client.delete_endpoint_config(\n", " EndpointConfigName=inputs['EndpointName']\n", ")\n", "response = sm_client.delete_endpoint(\n", " EndpointName=inputs['EndpointName']\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 5 }