{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# SageMaker Processing Job Image URI Restriction\n",
    "\n",
    "This example notebook can be used for testing the solution provided.\n",
    "\n",
    "In order to use this notebook, you can crate the CloudFormation Stack by providing in the `ProcessingContainers` parameter the following value:\n",
    "\n",
    "**ProcessingContainers**: sagemaker-scikit-learn:0.23-1-cpu-py3\n",
    "\n",
    "In this way, we are avoiding the usage of the SageMaker Container for SKLearn v0.23-1."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Dataset\n",
    "\n",
    "We are using a subset of ~20000 records of synthetic transactions, each of which is labeled as fraudulent or not fraudulent.\n",
    "We'd like to train a model based on the features of these transactions so that we can predict risky or fraudulent transactions in the future.\n",
    "\n",
    "This is a binary classification problem:\n",
    "\n",
    "* 1 - Fraud\n",
    "* 0 - No Fraud"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "! rm -rf ./data && mkdir -p ./data\n",
    "\n",
    "! aws s3 cp s3://sagemaker-sample-files/datasets/tabular/synthetic_credit_card_transactions/user0_credit_card_transactions.csv ./data/data.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Prerequisites\n",
    "\n",
    "Install the latest version of the SageMaker Python SDK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "! pip install 'sagemaker' --upgrade"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Part 1/3 - Setup\n",
    "\n",
    "Here we'll import some libraries and define some variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import boto3\n",
    "import json\n",
    "import logging\n",
    "import sagemaker\n",
    "from sagemaker.processing import ProcessingInput, ProcessingOutput\n",
    "from sagemaker.sklearn.processing import SKLearnProcessor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "logging.basicConfig(level=logging.INFO)\n",
    "LOGGER = logging.getLogger(__name__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "s3_client = boto3.client(\"s3\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "sagemaker_session = sagemaker.Session()\n",
    "region = boto3.session.Session().region_name\n",
    "role = sagemaker.get_execution_role()\n",
    "\n",
    "bucket_name = sagemaker_session.default_bucket()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Upload Dataset in the Default Amazon S3 Bucket\n",
    "\n",
    "In order to make the data available, we are uploading the downloaded dataset into the default S3 bucket"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "s3_client.delete_object(Bucket=bucket_name, Key=\"sg-container-restriction/data/input\")\n",
    "\n",
    "input_data = sagemaker_session.upload_data(\n",
    "    \"./data/data.csv\", key_prefix=\"sg-container-restriction/data/input\"\n",
    ")\n",
    "\n",
    "input_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create Processing Script\n",
    "\n",
    "We are creating the file `processing.py` for using it in the SageMaker Processing Job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile processing.py\n",
    "\n",
    "import argparse\n",
    "import csv\n",
    "import logging\n",
    "import numpy as np\n",
    "import os\n",
    "from os import listdir\n",
    "from os.path import isfile, join\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "import traceback\n",
    "\n",
    "logging.basicConfig(level=logging.INFO)\n",
    "LOGGER = logging.getLogger(__name__)\n",
    "\n",
    "BASE_PATH = os.path.join(\"/\", \"opt\", \"ml\")\n",
    "PROCESSING_PATH = os.path.join(BASE_PATH, \"processing\")\n",
    "PROCESSING_PATH_INPUT = os.path.join(PROCESSING_PATH, \"input\")\n",
    "PROCESSING_PATH_OUTPUT = os.path.join(PROCESSING_PATH, \"output\")\n",
    "\n",
    "def extract_data(file_path, percentage=100):\n",
    "    try:\n",
    "        files = [f for f in listdir(file_path) if isfile(join(file_path, f)) and f.endswith(\".csv\")]\n",
    "        LOGGER.info(\"{}\".format(files))\n",
    "\n",
    "        frames = []\n",
    "\n",
    "        for file in files:\n",
    "            df = pd.read_csv(\n",
    "                os.path.join(file_path, file),\n",
    "                sep=\",\",\n",
    "                quotechar='\"',\n",
    "                quoting=csv.QUOTE_ALL,\n",
    "                escapechar='\\\\',\n",
    "                encoding='utf-8',\n",
    "                error_bad_lines=False\n",
    "            )\n",
    "\n",
    "            df = df.head(int(len(df) * (percentage / 100)))\n",
    "\n",
    "            frames.append(df)\n",
    "\n",
    "        df = pd.concat(frames)\n",
    "\n",
    "        return df\n",
    "    except Exception as e:\n",
    "        stacktrace = traceback.format_exc()\n",
    "        LOGGER.error(\"{}\".format(stacktrace))\n",
    "\n",
    "        raise e\n",
    "\n",
    "def load_data(df, file_path, file_name):\n",
    "    try:\n",
    "        if not os.path.exists(file_path):\n",
    "            os.makedirs(file_path)\n",
    "\n",
    "        path = os.path.join(file_path, file_name + \".csv\")\n",
    "\n",
    "        LOGGER.info(\"Saving file in {}\".format(path))\n",
    "\n",
    "        df.to_csv(\n",
    "            path,\n",
    "            index=False,\n",
    "            header=True,\n",
    "            quoting=csv.QUOTE_ALL,\n",
    "            encoding=\"utf-8\",\n",
    "            escapechar=\"\\\\\",\n",
    "            sep=\",\"\n",
    "        )\n",
    "    except Exception as e:\n",
    "        stacktrace = traceback.format_exc()\n",
    "        LOGGER.error(\"{}\".format(stacktrace))\n",
    "\n",
    "        raise e\n",
    "\n",
    "def transform_data(df):\n",
    "    try:\n",
    "        df = df[df['Is Fraud?'].notna()]\n",
    "\n",
    "        df.insert(0, 'ID', range(1, len(df) + 1))\n",
    "\n",
    "        df[\"Errors?\"].fillna('', inplace=True)\n",
    "        df['Errors?'] = df['Errors?'].map(lambda x: x.strip())\n",
    "        df[\"Errors?\"] = df[\"Errors?\"].map({\n",
    "            \"Insufficient Balance\": 0,\n",
    "            \"Technical Glitch\": 1,\n",
    "            \"Bad PIN\": 2,\n",
    "            \"Bad Expiration\": 3,\n",
    "            \"Bad Card Number\": 4,\n",
    "            \"Bad CVV\": 5,\n",
    "            \"Bad PIN,Insufficient Balance\": 6,\n",
    "            \"Bad PIN,Technical Glitch\": 7,\n",
    "            \"\": 8\n",
    "        })\n",
    "\n",
    "        df[\"Use Chip\"].fillna('', inplace=True)\n",
    "        df['Use Chip'] = df['Use Chip'].map(lambda x: x.strip())\n",
    "        df[\"Use Chip\"] = df[\"Use Chip\"].map({\n",
    "            \"Swipe Transaction\": 0,\n",
    "            \"Chip Transaction\": 1,\n",
    "            \"Online Transaction\": 2\n",
    "        })\n",
    "\n",
    "        df['Is Fraud?'] = df['Is Fraud?'].map(lambda x: x.replace(\"'\", \"\"))\n",
    "        df['Is Fraud?'] = df['Is Fraud?'].map(lambda x: x.strip())\n",
    "        df['Is Fraud?'] = df['Is Fraud?'].replace('', np.nan)\n",
    "        df['Is Fraud?'] = df['Is Fraud?'].replace(' ', np.nan)\n",
    "\n",
    "        df[\"Is Fraud?\"] = df[\"Is Fraud?\"].map({\"No\": 0, \"Yes\": 1})\n",
    "\n",
    "        df = df.rename(\n",
    "            columns={'Card': 'card', 'MCC': 'mcc', \"Errors?\": \"errors\", \"Use Chip\": \"use_chip\", \"Is Fraud?\": \"labels\"})\n",
    "\n",
    "        df = df[[\"card\", \"mcc\", \"errors\", \"use_chip\", \"labels\"]]\n",
    "\n",
    "        return df\n",
    "\n",
    "    except Exception as e:\n",
    "        stacktrace = traceback.format_exc()\n",
    "        LOGGER.error(\"{}\".format(stacktrace))\n",
    "\n",
    "        raise e\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    parser = argparse.ArgumentParser()\n",
    "    parser.add_argument(\"--dataset-percentage\", type=int, required=False, default=100)\n",
    "    args = parser.parse_args()\n",
    "\n",
    "    LOGGER.info(\"Arguments: {}\".format(args))\n",
    "\n",
    "    df = extract_data(PROCESSING_PATH_INPUT, args.dataset_percentage)\n",
    "\n",
    "    df = transform_data(df)\n",
    "\n",
    "    data_train, data_test = train_test_split(df, test_size=0.2, shuffle=True)\n",
    "\n",
    "    load_data(data_train, os.path.join(PROCESSING_PATH_OUTPUT, \"train\"), \"train\")\n",
    "    load_data(data_test, os.path.join(PROCESSING_PATH_OUTPUT, \"test\"), \"test\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Global Parameters\n",
    "\n",
    "In this section, we are defining the parameters for the SageMaker Estimator. As framework-version, we use the PyTorch v1.12 and check that the SageMaker Training Job can be executed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "processing_input_files_path = \"sg-container-restriction/data/input\"\n",
    "processing_output_files_path = \"sg-container-restriction/data/output\"\n",
    "processing_framework_version = \"1.0-1\"\n",
    "\n",
    "processing_instance_count = 1\n",
    "processing_instance_type = \"ml.t3.large\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### SageMaker Processing Job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "processor = SKLearnProcessor(\n",
    "    framework_version=processing_framework_version,\n",
    "    role=role,\n",
    "    instance_count=processing_instance_count,\n",
    "    instance_type=processing_instance_type\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "processor.run(\n",
    "    code=\"./processing.py\",\n",
    "    inputs=[\n",
    "        ProcessingInput(\n",
    "            source=\"s3://{}/{}\".format(bucket_name, processing_input_files_path),\n",
    "            destination=\"/opt/ml/processing/input\",\n",
    "        )\n",
    "    ],\n",
    "    outputs=[\n",
    "        ProcessingOutput(\n",
    "            output_name=\"output\",\n",
    "            source=\"/opt/ml/processing/output\",\n",
    "            destination=\"s3://{}/{}\".format(bucket_name, processing_output_files_path),\n",
    "        )\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Change Framework Version\n",
    "\n",
    "If we change the SKLearn version to 0.23-1, the expected result is that the provided solution will automatically stop the SageMaker Job, since it is using a version in the provided black list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "processing_framework_version = \"0.23-1\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "processor = SKLearnProcessor(\n",
    "    framework_version=processing_framework_version,\n",
    "    role=role,\n",
    "    instance_count=processing_instance_count,\n",
    "    instance_type=processing_instance_type\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "processor.run(\n",
    "    code=\"./processing.py\",\n",
    "    inputs=[\n",
    "        ProcessingInput(\n",
    "            source=\"s3://{}/{}\".format(bucket_name, processing_input_files_path),\n",
    "            destination=\"/opt/ml/processing/input\",\n",
    "        )\n",
    "    ],\n",
    "    outputs=[\n",
    "        ProcessingOutput(\n",
    "            output_name=\"output\",\n",
    "            source=\"/opt/ml/processing/output\",\n",
    "            destination=\"s3://{}/{}\".format(bucket_name, processing_output_files_path),\n",
    "        )\n",
    "    ]\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "Python 3 (Data Science)",
   "language": "python",
   "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:470317259841:image/datascience-1.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  },
  "vscode": {
   "interpreter": {
    "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}