{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "To run this notebook in an AWS Sagemaker Notebook, use the `conda_tensorflow2_p36` Python kernel. \n", "This kernel is set as default\n", "and may take a minute to start. Also a note of caution on running all cells at once \n", "through the `Cell->Run All` menu option: At the end of this notebook in the Clean Up section, the resources\n", "created are deleted through API calls, there are some parameters that need to be set like `my_email_address`, \n", "and some steps are optional. We recommend running through the notebook step by step. \n", "\n", "We need a few installations to allow our notebook to run Sagemaker containers in local mode, and step functions. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Enabling local mode training\n", "!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/local_mode_setup.sh\n", "!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/daemon.json \n", "!/bin/bash ./local_mode_setup.sh" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "# Upgrade to sagemaker version 2 to use Sagemaker Debugger built-in actions\n", "!{sys.executable} -m pip install sagemaker -U\n", "\n", "# Install stepfunctions python sdk -- we need the pre-release 2.0.0rc1 version since the current stable version \n", "# of stepfunctions SDK does not support Sagemaker 2.0 and above.\n", "!{sys.executable} -m pip install stepfunctions==2.0.0rc1 -U" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "import numpy as np\n", "import os\n", "import sagemaker\n", "from sagemaker.debugger import Rule, rule_configs, CollectionConfig, DebuggerHookConfig\n", "from sagemaker.tensorflow import TensorFlow\n", "import boto3\n", "\n", "from stepfunctions.inputs import ExecutionInput\n", "import stepfunctions as sf\n", "\n", "import tensorflow as tf\n", "from os.path import join as pjoin\n", "import glob\n", "import zipfile" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define runtime information \n", "sess = sagemaker.Session()\n", "\n", "region = sess.boto_region_name\n", "sm_client = sess.boto_session.client('sagemaker')\n", "cft_client = boto3.client('cloudformation')\n", "s3 = sess.boto_session.client('s3')\n", "account = boto3.client('sts').get_caller_identity().get('Account')\n", "bucket = f'{account}-sagemaker-debugger-model-automation' \n", "\n", "# >>> provide email address for SNS topic subscription\n", "my_email_address = ''\n", "\n", "# sagemaker job params\n", "train_instance_type = 'ml.m5.xlarge'\n", "job_name_prefix = 'complex-resnet-model'" ] }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "# create our bucket \n", "if region == 'us-east-1':\n", " # 'us-east-1' is the default and we should not specify the region\n", " s3.create_bucket(Bucket=bucket)\n", "else:\n", " s3.create_bucket(Bucket=bucket, CreateBucketConfiguration={'LocationConstraint': region})\n", "\n", "print(bucket)\n", "print(region)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "metadata": {}, "source": [ "### IAM roles\n", "\n", "We need a role for our Lambda functions, and another for our step 
functions workflow. Note that the IAM role attached \n", "to this notebook needs `iam:PassRole` permission for the two roles below so that it can attach them to \n", "resources created later. The `lambda_role` role below needs permissions to create and describe Sagemaker jobs, \n", "and publish to SNS topics. The `sf_role` below needs permission to invoke Lambda functions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lambda_role = f'arn:aws:iam::{account}:role/lambda-sagemaker-train'\n", "sf_role = f'arn:aws:iam::{account}:role/step-function-basic-role'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Container images\n", "\n", "We will use pre-built container images for Tensorflow and Sagemaker Debugger. The following retrieves the Docker image\n", "URIs to use when creating our training jobs. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sm_tensorflow_image = sagemaker.image_uris.retrieve(framework=\"tensorflow\", \n", " image_scope='training',\n", " version='2.2', \n", " region=region,\n", " instance_type=train_instance_type,\n", " py_version='py37'\n", " )\n", "sm_debugger_image = sagemaker.image_uris.retrieve(framework=\"debugger\", region=region)\n", "print(sm_tensorflow_image)\n", "print(sm_debugger_image)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Setting up notifications using AWS Simple Notification Service (SNS)\n", "Using the boto3 API, we create an SNS topic and add a subscription using any of the supported protocols, such as email or \n", "SMS. Here we will use an email address for the topic subscription. This step\n", "can also be done through the [AWS SNS console](https://console.aws.amazon.com/sns/home). Note that the topic name \n", "`SMDebugRules` is what the Sagemaker Debugger built-in \n", "[notification action](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-actions.html) looks for to \n", "publish messages to." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "sns = boto3.client('sns')\n", "topicresponse = sns.create_topic(Name='SMDebugRules')\n", "topic_arn=topicresponse['TopicArn']\n", "sns.subscribe(TopicArn=topic_arn, Protocol='email', Endpoint=my_email_address)\n", "print(topic_arn)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "**Before proceeding, make sure to confirm your subscription by following the instructions SNS sends to your email address.\n", "If you do not confirm your \n", "subscription, the workflow will not complain or produce an error, but you will not receive the notifications it \n", "publishes.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model\n", "\n", "The model's Python code can be found in the same repository as this notebook [here](code/model/model.py). The model can be configured to use `ELU` or `ReLU` activation, add a batch normalization layer before additions, or change the warm-up learning rate and the initial learning rate used after the warm-up period.\n", "\n", "We pack our model code and upload it to S3 below, using the bucket we created above."
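, "\n", "\n", "For reference, these options are exposed as Sagemaker hyperparameters, which we pass to the training jobs later in this notebook. A minimal sketch of the dictionary, using the same names that appear in the training cells below, looks like this:\n", "\n", "```python\n", "hyperparameters = {'warmup_learning_rate': 0.001,  # learning rate during the warm-up period\n", "                   'learning_rate': 0.1,           # learning rate after the warm-up period\n", "                   'add_batch_norm': 0,            # 1 adds batch normalization before additions in residual blocks\n", "                   'num_epochs': 2}\n", "```"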
] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "!tar -cvzf sourcedir.tar.gz -C code/model/ train.py model.py\n", "s3.upload_file('sourcedir.tar.gz', bucket, 'source/sourcedir.tar.gz')" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Dataset\n", "\n", "We will use the standard [CIFAR10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html). We download the dataset, then upload it to S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "def load_data():\n", " (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()\n", " x_train = x_train.astype('float32') / 256\n", " x_test = x_test.astype('float32') / 256\n", "\n", " # Convert class vectors to binary class matrices.\n", " y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)\n", " y_test = tf.keras.utils.to_categorical(y_test, num_classes=10)\n", " return ((x_train, y_train), (x_test, y_test))\n", "\n", "(x_train, y_train), (x_test, y_test) = load_data()\n", "\n", "data_dir = os.path.join(os.getcwd(), 'data')\n", "train_dir = os.path.join(data_dir, 'train')\n", "test_dir = os.path.join(data_dir, 'test')\n", "os.makedirs(train_dir, exist_ok=True)\n", "os.makedirs(test_dir, exist_ok=True)\n", "\n", "np.save(pjoin(train_dir, 'x.npy'), x_train)\n", "np.save(pjoin(train_dir, 'y.npy'), y_train)\n", "np.save(pjoin(test_dir, 'x.npy'), x_test)\n", "np.save(pjoin(test_dir, 'y.npy'), y_test)\n", "print('Data saved locally:')\n", "print(train_dir) \n", "print(test_dir)\n", "\n", "\n", "# Uploads to default session bucket\n", "train_s3 = sess.upload_data(train_dir, bucket=bucket, key_prefix='data/train')\n", "test_s3 = sess.upload_data(test_dir, bucket=bucket, key_prefix='data/test')\n", "print('Data upload to s3:')\n", "print(train_s3)\n", "print(test_s3)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Local model training (optional)\n", "\n", "Local mode training allows to run the training container in the same Jupyter notebook environment, which saves \n", "the startup time and reduces testing overheads. The following step is not strictly necessary to build the workflow. \n", "However, it is useful to see how we can test and develop our model template locally before plugging in the workflow. \n", "This saves us development time by allowing isolated testing of the model.\n", "\n", "We have set the `num_epochs` parameter to only 2 epochs to finish the job quickly. The step should take \n", "less than 5 minutes to complete. You can also skip to the next section. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "hyperparameters = {'warmup_learning_rate': 0.001, \n", " 'learning_rate': 0.1, \n", " 'num_epochs': 2, \n", " 'add_batch_norm': 0 # set to 0 to disable, 1 to enable batch_norm before add in res blocks\n", " }\n", "\n", "# Build the local estimator\n", "local_estimator = TensorFlow(source_dir='code/model',\n", " entry_point='train.py',\n", " instance_type='local',\n", " instance_count=1,\n", " hyperparameters=hyperparameters,\n", " role=lambda_role,\n", " base_job_name=job_name_prefix,\n", " framework_version='2.2',\n", " py_version='py37',\n", " metric_definitions = [{'Name': 'loss',\n", " 'Regex': 'loss: (.*?) 
'}], \n", " script_mode=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The file:// prefix feeds local files to the local container for training.\n", "inputs = {'train': f'file://{train_dir}',\n", " 'test': f'file://{test_dir}'}\n", "\n", "local_estimator.fit(inputs)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Training on Sagemaker cluster with Sagemaker Debugger (optional)\n", "\n", "Once we are satisfied with the model, we configure Sagemaker Debugger and perform a test launch on a Sagemaker cluster. \n", "Note that we cannot run Sagemaker Debugger in local mode, but the model code checks whether the\n", " model is being run in local mode, in which case it does not call the Sagemaker Debugger callbacks. \n", "\n", "Here, we are concerned about exploding gradients. \n", "See [Sagemaker Debugger examples and documents](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/sagemaker.md) \n", "for other rules Debugger can check.\n", "\n", "### Sagemaker Debugger actions\n", "[Debugger Actions](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-action-on-rules.html) is a new feature \n", "that lets us perform an action in response to a Debugger rule firing. Here, we will use the built-in _stop training_ \n", "action to stop the training job if the training experiences exploding gradients. The built-in _email_ action is also \n", "useful to notify us should the debugger rule fire. \n", "\n", "Built-in actions are useful mechanisms to perform a single action, for example sending a notification, upon a rule firing.\n", "For more complex reactions to training problems, the next section will incorporate debugger rules within a step function \n", "workflow that can iteratively tweak the model while tracking the history and sending detailed notification messages that\n", "include the history and current status of the model training.\n", "\n", "Although we have set `num_epochs` to 30 epochs, the Debugger rule is expected to fire and trigger the built-in `StopTraining`\n", "action to stop the training job as soon as Debugger detects convergence problems. A notification will also be sent.\n", "This step should take about 10-15 minutes to complete. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Set built-in actions to stop a bad training job and notify us if the debugger rule is triggered.\n", "debugger_actions = rule_configs.ActionList( \n", " rule_configs.StopTraining(),\n", " rule_configs.Email(my_email_address)\n", ")\n", "\n", "# Add the exploding tensor rule to look for exploding gradient issues. 
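The tensor_regex parameter limits the rule to gradient tensors, and only_nan=False means infinite values are treated as exploding as well as NaNs (per the rule parameter documentation linked above).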
\n", "dbg_rules = [\n", " Rule.sagemaker(\n", " base_config=rule_configs.exploding_tensor(),\n", " rule_parameters={\n", " \"tensor_regex\": \".*gradient\",\n", " \"only_nan\": \"False\"\n", " },\n", " actions=debugger_actions\n", " )\n", "]\n", "\n", "hook_config = DebuggerHookConfig(\n", " hook_parameters={\n", " \"save_interval\": \"100\"\n", " },\n", " collection_configs=[ \n", " CollectionConfig(\n", " name=\"gradients\",\n", " parameters={\n", " \"save_interval\": \"100\",\n", " }\n", " )\n", " ]\n", ")\n", "\n", "hyperparameters = {'warmup_learning_rate': 0.1, \n", " 'learning_rate': 0.1, \n", " 'num_epochs': 30, \n", " 'add_batch_norm': 0 # set to 0 to disable, 1 to enable batch_norm before add in res blocks\n", " }\n", "\n", "estimator = TensorFlow(\n", " script_mode=True,\n", " source_dir='code/model',\n", " entry_point='train.py',\n", " instance_type=train_instance_type,\n", " instance_count=1,\n", " hyperparameters=hyperparameters,\n", " role=sagemaker.get_execution_role(),\n", " base_job_name=job_name_prefix,\n", " framework_version='2.2',\n", " py_version='py37', \n", " rules = dbg_rules,\n", " debugger_hook_config = hook_config,\n", " metric_definitions = [{'Name': 'loss',\n", " 'Regex': 'loss: (.+?) '}], \n", " output_path=f's3://{bucket}'\n", ")\n", "\n", "inputs = {'train': train_s3,\n", " 'test': test_s3}\n", "\n", "# wait=True, logs=True will show us all the job logs, which may be convenient for testing.\n", "# We can also view the same logs in CloudWatch by following the link to logs on Sagemaker\n", "# training jobs page of our job.\n", "estimator.fit(inputs, wait=True, logs=True)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Monitoring job status\n", "\n", "We can monitor the status of our training job through [Sagemaker console](https://console.aws.amazon.com/sagemaker/home), \n", "or by using the Sagemaker API without leaving our notebook as below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "estimator.latest_training_job.describe()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## AWS Lambda functions\n", "\n", "We use two AWS Lambda functions for the workflow. One is used to create a training job. The other monitors the status of the\n", "job and Debugger rule checks, publishes notifications to SNS, and stops bad training jobs and plans next training configuration." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### First time creating\n", "The following piece of code zips our functions and creates Lambda functions." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "lambda_client = boto3.client('lambda')\n", "# We will pack all python files under the lambda folder\n", "files = glob.glob('code/lambda/*.py')\n", "print(files)\n", "# Collect the file names, which we will use later when cleaning up.\n", "lambda_funcs = []\n", "for f in files:\n", " zipname = f.split('.')[0]+'.zip'\n", " func_name = os.path.basename(f.split('.')[0])\n", " lambda_funcs.append(func_name)\n", " zf = zipfile.ZipFile(zipname, mode='w')\n", " zf.write(f, arcname=func_name + '.py')\n", " zf.close()\n", " s3.upload_file(zipname, Bucket=bucket, Key=f'source/{zipname}')\n", " \n", " response = lambda_client.create_function(\n", " FunctionName=func_name,\n", " Runtime='python3.7',\n", " Role=lambda_role,\n", " Handler=f'{func_name}.lambda_handler',\n", " Code={\n", " 'S3Bucket': bucket,\n", " 'S3Key': f'source/{zipname}'\n", " },\n", " Description='Queries a SageMaker training job and return the results.',\n", " Timeout=15,\n", " MemorySize=128\n", " )\n", " print(f\"Creating {func_name}, API response status: {response['ResponseMetadata']['HTTPStatusCode']}\")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Updating functions code if needed (Optional)\n", "As we work with developing our workflow, often need to update the lambda functions. This snippet allows us to update\n", "the function codes without leaving our notebook enviroment or deleting and recreating our Lambda functions. We can also \n", "edit the code in [AWS Lambda editor](https://console.aws.amazon.com/lambda/home)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "lambda_client = boto3.client('lambda')\n", "files = glob.glob('code/lambda/*.py')\n", "fileszip = []\n", "for f in files:\n", " zipname = f.split('.')[0]+'.zip'\n", " print(zipname)\n", " fileszip.append(zipname)\n", " func_name = os.path.basename(f.split('.')[0])\n", " zf = zipfile.ZipFile(zipname, mode='w')\n", " zf.write(f, arcname = func_name + '.py')\n", " zf.close()\n", " s3.upload_file(zipname, Bucket=bucket, Key=f'source/{zipname}')\n", " response = lambda_client.update_function_code(\n", " FunctionName=func_name,\n", " S3Bucket = bucket,\n", " S3Key = f'source/{zipname}',\n", " Publish = False\n", " )\n", " print(response['ResponseMetadata']['HTTPStatusCode'])" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## AWS Step Function workflow\n", "\n", "We define a number of input configuration parameters for our workflow.\n", "The workflow provides a few checkpoints to avoid runaway situations. For example, the `max_monitor_transitions` ensures \n", "that the state machine does not loop indefinitely getting stuck in the monitor state. Set `max_monitor_transitions` \n", "to a high, but reasonable, value in accordance to the expected run time of the state machine and also \n", "the `wait_time` parameters which controls how much the state machine \n", "waits before querying the status of the job again. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "execution_input = ExecutionInput(schema={\n", " 'init_warmup_learning_rate': float,\n", " 'learning_rate': float,\n", " 'add_batch_norm': int,\n", " 'bucket': str,\n", " 'base_job_name': str,\n", " 'instance_type': str,\n", " 'region': str,\n", " 'num_epochs': int,\n", " 'debugger_save_interval': int,\n", " 'max_num_retraining': int,\n", " 'max_monitor_transitions': int\n", " }\n", ")\n", "# wait_time sepcifies how many seconds the monitor step waits before querying about the status of the training job. \n", "# We are setting to wait for 5 minutes in between checking the status\n", "wait_time = 5*60" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Defining the state machine" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Create the initial state that steps pass to each other. \n", "init_state = sf.steps.states.Pass('init', \n", " parameters={\n", " 'state': {\n", " 'history': {\n", " 'num_warmup_adjustments': 0,\n", " 'num_batch_layer_adjustments': 0,\n", " 'num_retraining': 0,\n", " 'latest_job_name': '',\n", " 'num_learning_rate_adjustments': 0,\n", " 'num_monitor_transitions': 0\n", " },\n", " 'next_action': 'launch_new', # planning a new launch next\n", " 'job_status': '',\n", " 'run_spec': {\n", " 'warmup_learning_rate': execution_input['init_warmup_learning_rate'],\n", " 'learning_rate': execution_input['learning_rate'],\n", " 'add_batch_norm': execution_input['add_batch_norm'],\n", " 'bucket': execution_input['bucket'],\n", " 'base_job_name': execution_input['base_job_name'],\n", " 'instance_type': execution_input['instance_type'],\n", " 'region': execution_input['region'],\n", " 'sm_role': lambda_role,\n", " 'num_epochs': execution_input['num_epochs'],\n", " 'debugger_save_interval': execution_input['debugger_save_interval']\n", " }\n", " }\n", " },\n", " )\n", " \n", "\n", "training_step = sf.steps.compute.LambdaStep('train',\n", " parameters={\n", " 'FunctionName': 'launch_training_job',\n", " 'Payload': {\n", " 'state.$': '$.state',\n", " 'sm_tensorflow_image': sm_tensorflow_image,\n", " 'sm_debugger_image': sm_debugger_image\n", " }\n", " }, \n", " output_path='$.Payload.body', \n", " wait_for_callback=False\n", " )\n", "\n", "wait_step = sf.steps.states.Wait('wait', \n", " seconds=wait_time, \n", " comment='Wait for training job to make some progress.')\n", "\n", "monitor_step = sf.steps.compute.LambdaStep('monitor',\n", " parameters={\n", " 'FunctionName': 'monitor',\n", " 'Payload': {\n", " 'topic_arn': topic_arn,\n", " 'max_num_retraining': execution_input['max_num_retraining'],\n", " 'max_monitor_transitions': execution_input['max_monitor_transitions'],\n", " 'state.$': '$.state'\n", " }\n", " },\n", " output_path='$.Payload.body'\n", " )\n", "choice_step = sf.steps.states.Choice('choice')\n", "# possible values for next_action: 'launch_new', 'end', 'monitor'\n", "\n", "succeed_step = sf.steps.states.Succeed('succeed')\n", "\n", "choice_step.add_choice(\n", " rule=sf.steps.choice_rule.ChoiceRule.StringEquals(variable='$.state.next_action', value='launch_new'),\n", " next_step=training_step\n", ")\n", "choice_step.add_choice(\n", " rule=sf.steps.choice_rule.ChoiceRule.StringEquals(variable='$.state.next_action', value='monitor'),\n", " next_step=wait_step\n", ")\n", "choice_step.add_choice(\n", " 
rule=sf.steps.choice_rule.ChoiceRule.StringEquals(variable='$.state.next_action', value='end'),\n", " next_step=succeed_step\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "workflow_definition = sf.steps.Chain([init_state, training_step, wait_step, monitor_step, choice_step])\n", "workflow = sf.workflow.Workflow('sagemaker-model-dev-workflow-with-debugger', definition=workflow_definition, role=sf_role)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(workflow.definition.to_json(pretty=True))\n", "workflow.render_graph(portrait=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "workflow.create()\n", "# Calling update allows us to update the definition of an existing workflow. \n", "# Note that if the workflow already exists, workflow.create() does not complain.\n", "workflow.update(workflow_definition)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "We have set `\"num_epochs\": 5` to finish faster (likely around 30 minutes or a bit more). If you are interested in seeing \n", "the full workflow optimize the model architecture, try `\"num_epochs\": 30`, which should finish in about 3-4 hours." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "workflow.execute(inputs={'init_warmup_learning_rate': 0.1,\n", " 'num_epochs': 5,\n", " 'learning_rate': 0.1,\n", " 'add_batch_norm': 0,\n", " 'bucket': bucket,\n", " 'base_job_name': job_name_prefix,\n", " 'instance_type': 'ml.m5.xlarge',\n", " 'region': region,\n", " 'debugger_save_interval': 100,\n", " 'max_num_retraining': 30,\n", " 'max_monitor_transitions': 200\n", " })" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Clean up\n", "Finally, we can delete the resources we have created above. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# delete SNS topic\n", "sns.delete_topic(TopicArn=topic_arn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# delete Step Functions workflow\n", "workflow.delete()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for lambda_f in lambda_funcs:\n", " lambda_client.delete_function(FunctionName=lambda_f)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Empty the bucket\n", "s3service = boto3.resource('s3')\n", "bucket_obj = s3service.Bucket(bucket)\n", "# This step may take a few minutes.\n", "bucket_obj.objects.all().delete()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# WARNING: This will delete the working bucket. 
To rerun the notebook, you would have to provide a new bucket\n", "bucket_obj.delete()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Delete the CloudFormation stack.\n", "# WARNING: THIS WILL DELETE THIS NOTEBOOK AND ANY CODE CHANGES.\n", "cft_client.delete_stack(StackName='debugger-cft-stack')" ] } ], "metadata": { "kernelspec": { "display_name": "conda_tensorflow2_p36", "language": "python", "name": "conda_tensorflow2_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" }, "pycharm": { "stem_cell": { "cell_type": "raw", "source": [], "metadata": { "collapsed": false } } } }, "nbformat": 4, "nbformat_minor": 4 }