{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [모듈 3.2] 전처리 스텝 개발 (SageMaker Model Building Pipeline 전처리 스텝)\n",
    "\n",
    "이 노트북은 \"모델 전처리\" 스텝을 정의하고, 모델 빌딩 파이프라인을 생성하여 실행하는 노트북 입니다.\n",
    "아래의 목차와 같이 노트북 실행이 될 예정이고\n",
    "전체를 모두 실행시에 완료 시간은 약 5분-10분 소요 됩니다.\n",
    "\n",
    "- 1. 전처리 개요 (SageMaker Processing 이용)\n",
    "- 2. 기본 라이브러리 로딩\n",
    "- 3. 원본 데이터 파일 확인 및 전처리 코드 로직 확인\n",
    "- 4. 모델 빌딩 파이프라인 의 스텝(Step) 생성\n",
    "- 5. 파리마터, 단계, 조건을 조합하여 최종 파이프라인 정의 및 실행\n",
    "- 6. 세이지 메이커 스튜디오에서 확인하기\n",
    "- 7. 전처리 파일 경로 추출\n",
    "\n",
    "---\n",
    "## SageMaker 파이프라인 소개\n",
    "\n",
    "![mdp_how_it_works.png](img/mdp_how_it_works.png)\n",
    "\n",
    "\n",
    "\n",
    "SageMaker 파이프라인은 다음 기능을 지원하며 본 lab_03_pipelinie 에서 일부를 다루게 됩니다. \n",
    "\n",
    "* Processing job steps - 데이터처러 워크로드를 실행하기 위한 SageMaker의 관리형 기능. Feature engineering, 데이터 검증, 모델 평가, 모델 해석 등에 주로 사용됨 \n",
    "* Training job steps - 학습작업. 모델에게 학습데이터셋을 이용하여 모델에게 예측을 하도록 학습시키는 작업 \n",
    "* Conditional execution steps - 조건별 실행분기. 파이프라인을 분기시키는 역할.\n",
    "* Register model steps - 학습이 완료된 모델패키지 리소스를 이후 배포를 위한 모델 레지스트리에 등록하기 \n",
    "* Create model steps - 추론 엔드포인트 또는 배치 추론을 위한 모델의 생성 \n",
    "* Transform job steps - 배치추론 작업. 배치작업을 이용하여 노이즈, bias의 제거 등 데이터셋을 전처리하고 대량데이터에 대해 추론을 실행하는 단계\n",
    "* Pipelines - Workflow DAG. SageMaker 작업과 리소스 생성을 조율하는 단계와 조건을 가짐\n",
    "* Parametrized Pipeline executions - 특정 파라미터에 따라 파이프라인 실행방식을 변화시키기 \n",
    "\n",
    "\n",
    "- 상세한 개발자 가이드는 아래 참조 하세요.\n",
    "    - [세이지 메이커 모델 빌딩 파이프라인의 개발자 가이드](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html)\n",
    "\n",
    "---\n",
    "### 노트북 커널\n",
    "- 이 워크샵은 노트북 커널이 `conda_python3` 를 사용합니다. 다른 커널일 경우 변경 해주세요.\n",
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. 전처리 개요 (SageMaker Processing 이용)\n",
    "\n",
    "이 노트북은 세이지 메이커의 Processing Job을 통해서 데이터 전처리를 합니다. <br>\n",
    "상세한 사항은 개발자 가이드를 참조 하세요. -->  [SageMaker Processing](https://docs.aws.amazon.com/ko_kr/sagemaker/latest/dg/processing-job.html)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Processing-1.png](img/Processing-1.png)\n",
    "- 일반적으로 크게 아래 4가지의 스텝으로 진행이 됩니다.\n",
    "\n",
    "    - (1) S3에 입력 파일 준비\n",
    "    - (2) 전처리를 수행하는 코드 준비\n",
    "    - (3) Projcessing Job을 생성시에 아래와 같은 항목을 제공합니다.\n",
    "        - Projcessing Job을 실행할 EC2(예: ml.m4.2xlarge) 기술\n",
    "        - EC2에서 로딩할 다커 이미지의 이름 기술\n",
    "        - S3 입력 파일 경로\n",
    "        - 전처리 코드 경로\n",
    "        - S3 출력 파일 경로\n",
    "    - (4) EC2에서 전치리 실행 하여 S3 출력 위치에 저장\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.2 프로세싱 스텝 \n",
    "- 프로세싱 단계의 개발자 가이드 \n",
    "    - [프로세싱 스텝](https://docs.aws.amazon.com/ko_kr/sagemaker/latest/dg/build-and-manage-steps.html#step-type-processing)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. 기본 라이브러리 로딩\n",
    "\n",
    "세이지 메이커 관련 라이브러리를 로딩 합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import boto3\n",
    "import sagemaker\n",
    "import pandas as pd\n",
    "from IPython.display import display as dp\n",
    "\n",
    "region = boto3.Session().region_name\n",
    "sagemaker_session = sagemaker.session.Session()\n",
    "role = sagemaker.get_execution_role()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.1 노트북 변수 로딩\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "저장된 변수를 확인 합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Stored variables and their in-db values:\n",
      "bucket                         -> 'sagemaker-us-east-1-585843180719'\n",
      "claims_data_uri                -> 's3://sagemaker-us-east-1-585843180719/sagemaker-w\n",
      "customers_data_uri             -> 's3://sagemaker-us-east-1-585843180719/sagemaker-w\n",
      "input_data_uri                 -> 's3://sagemaker-us-east-1-585843180719/sagemaker-w\n",
      "preprocessing_code             -> 'src/preprocessing.py'\n",
      "project_prefix                 -> 'sagemaker-webinar-pipeline-base'\n"
     ]
    }
   ],
   "source": [
    "%store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%store -r"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. 원본 데이터 파일 확인 및 전처리 코드 로직 확인\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.1. 전처리에서 사용할 원본 데이터를 확인\n",
    "- 고객 데어터\n",
    "- 보험 청구 데이터"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "data_dir = '../data/raw'\n",
    "local_claim_data_path = f\"{data_dir}/claims.csv\"\n",
    "local_customers_data_path = f\"{data_dir}/customers.csv\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>policy_id</th>\n",
       "      <th>customer_age</th>\n",
       "      <th>months_as_customer</th>\n",
       "      <th>num_claims_past_year</th>\n",
       "      <th>num_insurers_past_5_years</th>\n",
       "      <th>policy_state</th>\n",
       "      <th>policy_deductable</th>\n",
       "      <th>policy_annual_premium</th>\n",
       "      <th>policy_liability</th>\n",
       "      <th>customer_zip</th>\n",
       "      <th>customer_gender</th>\n",
       "      <th>customer_education</th>\n",
       "      <th>auto_year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>54</td>\n",
       "      <td>94</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>WA</td>\n",
       "      <td>750</td>\n",
       "      <td>3000</td>\n",
       "      <td>25/50</td>\n",
       "      <td>99207</td>\n",
       "      <td>Unkown</td>\n",
       "      <td>Associate</td>\n",
       "      <td>2006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>41</td>\n",
       "      <td>165</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>CA</td>\n",
       "      <td>750</td>\n",
       "      <td>2950</td>\n",
       "      <td>15/30</td>\n",
       "      <td>95632</td>\n",
       "      <td>Male</td>\n",
       "      <td>Bachelor</td>\n",
       "      <td>2012</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>57</td>\n",
       "      <td>155</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>CA</td>\n",
       "      <td>750</td>\n",
       "      <td>3000</td>\n",
       "      <td>15/30</td>\n",
       "      <td>93203</td>\n",
       "      <td>Female</td>\n",
       "      <td>Bachelor</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>39</td>\n",
       "      <td>80</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>AZ</td>\n",
       "      <td>750</td>\n",
       "      <td>3000</td>\n",
       "      <td>30/60</td>\n",
       "      <td>85208</td>\n",
       "      <td>Female</td>\n",
       "      <td>Advanced Degree</td>\n",
       "      <td>2020</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>39</td>\n",
       "      <td>60</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>CA</td>\n",
       "      <td>750</td>\n",
       "      <td>3000</td>\n",
       "      <td>15/30</td>\n",
       "      <td>91792</td>\n",
       "      <td>Female</td>\n",
       "      <td>High School</td>\n",
       "      <td>2018</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   policy_id  customer_age  months_as_customer  num_claims_past_year  \\\n",
       "0          1            54                  94                     0   \n",
       "1          2            41                 165                     0   \n",
       "2          3            57                 155                     0   \n",
       "3          4            39                  80                     0   \n",
       "4          5            39                  60                     0   \n",
       "\n",
       "   num_insurers_past_5_years policy_state  policy_deductable  \\\n",
       "0                          1           WA                750   \n",
       "1                          1           CA                750   \n",
       "2                          1           CA                750   \n",
       "3                          1           AZ                750   \n",
       "4                          1           CA                750   \n",
       "\n",
       "   policy_annual_premium policy_liability  customer_zip customer_gender  \\\n",
       "0                   3000            25/50         99207          Unkown   \n",
       "1                   2950            15/30         95632            Male   \n",
       "2                   3000            15/30         93203          Female   \n",
       "3                   3000            30/60         85208          Female   \n",
       "4                   3000            15/30         91792          Female   \n",
       "\n",
       "  customer_education  auto_year  \n",
       "0          Associate       2006  \n",
       "1           Bachelor       2012  \n",
       "2           Bachelor       2017  \n",
       "3    Advanced Degree       2020  \n",
       "4        High School       2018  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "customers_data_df = pd.read_csv(local_customers_data_path)\n",
    "customers_data_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>policy_id</th>\n",
       "      <th>driver_relationship</th>\n",
       "      <th>incident_type</th>\n",
       "      <th>collision_type</th>\n",
       "      <th>incident_severity</th>\n",
       "      <th>authorities_contacted</th>\n",
       "      <th>num_vehicles_involved</th>\n",
       "      <th>num_injuries</th>\n",
       "      <th>num_witnesses</th>\n",
       "      <th>police_report_available</th>\n",
       "      <th>injury_claim</th>\n",
       "      <th>vehicle_claim</th>\n",
       "      <th>total_claim_amount</th>\n",
       "      <th>incident_month</th>\n",
       "      <th>incident_day</th>\n",
       "      <th>incident_dow</th>\n",
       "      <th>incident_hour</th>\n",
       "      <th>fraud</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Spouse</td>\n",
       "      <td>Collision</td>\n",
       "      <td>Front</td>\n",
       "      <td>Minor</td>\n",
       "      <td>None</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "      <td>71600</td>\n",
       "      <td>8913.668763</td>\n",
       "      <td>80513.668763</td>\n",
       "      <td>3</td>\n",
       "      <td>17</td>\n",
       "      <td>6</td>\n",
       "      <td>8</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Self</td>\n",
       "      <td>Collision</td>\n",
       "      <td>Rear</td>\n",
       "      <td>Totaled</td>\n",
       "      <td>Police</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>Yes</td>\n",
       "      <td>6400</td>\n",
       "      <td>19746.724395</td>\n",
       "      <td>26146.724395</td>\n",
       "      <td>12</td>\n",
       "      <td>11</td>\n",
       "      <td>2</td>\n",
       "      <td>11</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Self</td>\n",
       "      <td>Collision</td>\n",
       "      <td>Front</td>\n",
       "      <td>Minor</td>\n",
       "      <td>Police</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Yes</td>\n",
       "      <td>10400</td>\n",
       "      <td>11652.969918</td>\n",
       "      <td>22052.969918</td>\n",
       "      <td>12</td>\n",
       "      <td>24</td>\n",
       "      <td>1</td>\n",
       "      <td>14</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>Child</td>\n",
       "      <td>Collision</td>\n",
       "      <td>Side</td>\n",
       "      <td>Minor</td>\n",
       "      <td>None</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "      <td>104700</td>\n",
       "      <td>11260.930936</td>\n",
       "      <td>115960.930936</td>\n",
       "      <td>12</td>\n",
       "      <td>23</td>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>Self</td>\n",
       "      <td>Collision</td>\n",
       "      <td>Side</td>\n",
       "      <td>Major</td>\n",
       "      <td>Police</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>No</td>\n",
       "      <td>3400</td>\n",
       "      <td>27987.704652</td>\n",
       "      <td>31387.704652</td>\n",
       "      <td>5</td>\n",
       "      <td>8</td>\n",
       "      <td>2</td>\n",
       "      <td>8</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   policy_id driver_relationship incident_type collision_type  \\\n",
       "0          1              Spouse     Collision          Front   \n",
       "1          2                Self     Collision           Rear   \n",
       "2          3                Self     Collision          Front   \n",
       "3          4               Child     Collision           Side   \n",
       "4          5                Self     Collision           Side   \n",
       "\n",
       "  incident_severity authorities_contacted  num_vehicles_involved  \\\n",
       "0             Minor                  None                      2   \n",
       "1           Totaled                Police                      3   \n",
       "2             Minor                Police                      2   \n",
       "3             Minor                  None                      2   \n",
       "4             Major                Police                      2   \n",
       "\n",
       "   num_injuries  num_witnesses police_report_available  injury_claim  \\\n",
       "0             0              0                      No         71600   \n",
       "1             4              0                     Yes          6400   \n",
       "2             0              1                     Yes         10400   \n",
       "3             0              0                      No        104700   \n",
       "4             1              0                      No          3400   \n",
       "\n",
       "   vehicle_claim  total_claim_amount  incident_month  incident_day  \\\n",
       "0    8913.668763        80513.668763               3            17   \n",
       "1   19746.724395        26146.724395              12            11   \n",
       "2   11652.969918        22052.969918              12            24   \n",
       "3   11260.930936       115960.930936              12            23   \n",
       "4   27987.704652        31387.704652               5             8   \n",
       "\n",
       "   incident_dow  incident_hour  fraud  \n",
       "0             6              8      0  \n",
       "1             2             11      0  \n",
       "2             1             14      0  \n",
       "3             0             19      0  \n",
       "4             2              8      0  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "claims_data_df = pd.read_csv(local_claim_data_path)\n",
    "claims_data_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2 전처리 로직 로컬에서 실행"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Stored 'preprocessing_code' (str)\n"
     ]
    }
   ],
   "source": [
    "preprocessing_code = 'src/preprocessing.py'\n",
    "%store preprocessing_code"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 로컬 환경 셋업 \n",
    "\n",
    "- 로컬에서 테스트 하기 위해 세이지메이커의 다커 컨테이너와 같은 환경을 생성합니다.\n",
    "- split_rate = 0.2 로 해서 훈련 및 테스트 데이터 세트의 비율을 8:2로 정합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# 도커 컨테이너의 출력 폴더와 비슷한 환경 기술\n",
    "# 아래 경로 : opt/ml/processing/output\n",
    "# 도커 경로 : /opt/ml/processing/output\n",
    "base_output_dir = 'opt/ml/processing/output' \n",
    "\n",
    "# 도커 컨테이너의 입력 폴더와 비슷한 환경 기술\n",
    "base_preproc_input_dir = 'opt/ml/processing/input'\n",
    "os.makedirs(base_preproc_input_dir, exist_ok=True)\n",
    "\n",
    "# 출력 훈련 폴더를 기술 합니다.\n",
    "base_preproc_output_train_dir = 'opt/ml/processing/output/train/'\n",
    "os.makedirs(base_preproc_output_train_dir, exist_ok=True)\n",
    "\n",
    "# 출력 테스트 폴더를 기술 합니다.\n",
    "base_preproc_output_test_dir = 'opt/ml/processing/output/test/'\n",
    "os.makedirs(base_preproc_output_test_dir, exist_ok=True)\n",
    "\n",
    "\n",
    "split_rate = 0.2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### claims.csv, customers.csv 를 다커환경과 비슷한 경로로 복사"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "! cp {local_claim_data_path} {base_preproc_input_dir}\n",
    "! cp {local_customers_data_path} {base_preproc_input_dir}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 로컬에서 스크립트 실행\n",
    "전처리 코드에서 제공하는 로그를 통해서, 전처리 수행 내역을 확인 합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "######### Argument Info ####################################\n",
      "args.base_output_dir: opt/ml/processing/output\n",
      "args.base_preproc_input_dir: opt/ml/processing/input\n",
      "args.label_column: fraud\n",
      "args.split_rate: 0.2\n",
      "\n",
      "### Loading Claim Dataset\n",
      "input_files: \n",
      " ['opt/ml/processing/input/claims.csv']\n",
      "dataframe shape \n",
      " (5000, 17)\n",
      "dataset sample \n",
      "           driver_relationship incident_type  ... incident_hour fraud\n",
      "policy_id                                    ...                    \n",
      "1                      Spouse     Collision  ...             8     0\n",
      "2                        Self     Collision  ...            11     0\n",
      "\n",
      "[2 rows x 17 columns]\n",
      "\n",
      "### Loading Customer Dataset\n",
      "input_files: \n",
      " ['opt/ml/processing/input/customers.csv']\n",
      "dataframe shape \n",
      " (5000, 12)\n",
      "dataset sample \n",
      "            customer_age  months_as_customer  ...  customer_education  auto_year\n",
      "policy_id                                    ...                               \n",
      "1                    54                  94  ...           Associate       2006\n",
      "2                    41                 165  ...            Bachelor       2012\n",
      "\n",
      "[2 rows x 12 columns]\n",
      "### dataframe merged with customer and claim: (5000, 29)\n",
      "\n",
      " ### Encoding: Category Features\n",
      "categorical_features: ['policy_state', 'policy_liability', 'customer_gender', 'customer_education', 'driver_relationship', 'incident_type', 'collision_type', 'incident_severity', 'authorities_contacted', 'police_report_available']\n",
      "\n",
      " ### Encoding: Numeric Features\n",
      "int_cols: \n",
      "['customer_age' 'months_as_customer' 'num_claims_past_year'\n",
      " 'num_insurers_past_5_years' 'policy_deductable' 'policy_annual_premium'\n",
      " 'customer_zip' 'auto_year' 'num_vehicles_involved' 'num_injuries'\n",
      " 'num_witnesses' 'injury_claim' 'incident_month' 'incident_day'\n",
      " 'incident_dow' 'incident_hour' 'fraud']\n",
      "float_cols: \n",
      "['vehicle_claim' 'total_claim_amount']\n",
      "\n",
      " ### Handle preprocess results\n",
      "preprocessed train shape \n",
      " (4000, 59)\n",
      "preprocessed test shape \n",
      " (1000, 59)\n",
      "\n",
      " ### Final result for train dataset \n",
      "preprocessed train sample \n",
      "    fraud  ...  police_report_available_Yes\n",
      "0      0  ...                            0\n",
      "1      0  ...                            1\n",
      "\n",
      "[2 rows x 59 columns]\n"
     ]
    }
   ],
   "source": [
    "! python {preprocessing_code} --base_preproc_input_dir {base_preproc_input_dir} \\\n",
    "                              --base_output_dir {base_output_dir} \\\n",
    "                              --split_rate {split_rate}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 전처리된 데이터 확인\n",
    "실제로 전처리 된 파일의 내역을 확인 합니다.\n",
    "훈련 및 테스트 세트의 fraud 의 분포를 확인 합니다. (0: non-fruad, 1: fraud)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fraud</th>\n",
       "      <th>vehicle_claim</th>\n",
       "      <th>total_claim_amount</th>\n",
       "      <th>customer_age</th>\n",
       "      <th>months_as_customer</th>\n",
       "      <th>num_claims_past_year</th>\n",
       "      <th>num_insurers_past_5_years</th>\n",
       "      <th>policy_deductable</th>\n",
       "      <th>policy_annual_premium</th>\n",
       "      <th>customer_zip</th>\n",
       "      <th>...</th>\n",
       "      <th>collision_type_missing</th>\n",
       "      <th>incident_severity_Major</th>\n",
       "      <th>incident_severity_Minor</th>\n",
       "      <th>incident_severity_Totaled</th>\n",
       "      <th>authorities_contacted_Ambulance</th>\n",
       "      <th>authorities_contacted_Fire</th>\n",
       "      <th>authorities_contacted_None</th>\n",
       "      <th>authorities_contacted_Police</th>\n",
       "      <th>police_report_available_No</th>\n",
       "      <th>police_report_available_Yes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>8913.668763</td>\n",
       "      <td>80513.668763</td>\n",
       "      <td>54</td>\n",
       "      <td>94</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>750</td>\n",
       "      <td>3000</td>\n",
       "      <td>99207</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>19746.724395</td>\n",
       "      <td>26146.724395</td>\n",
       "      <td>41</td>\n",
       "      <td>165</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>750</td>\n",
       "      <td>2950</td>\n",
       "      <td>95632</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>11652.969918</td>\n",
       "      <td>22052.969918</td>\n",
       "      <td>57</td>\n",
       "      <td>155</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>750</td>\n",
       "      <td>3000</td>\n",
       "      <td>93203</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>11260.930936</td>\n",
       "      <td>115960.930936</td>\n",
       "      <td>39</td>\n",
       "      <td>80</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>750</td>\n",
       "      <td>3000</td>\n",
       "      <td>85208</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>27987.704652</td>\n",
       "      <td>31387.704652</td>\n",
       "      <td>39</td>\n",
       "      <td>60</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>750</td>\n",
       "      <td>3000</td>\n",
       "      <td>91792</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 59 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   fraud  vehicle_claim  total_claim_amount  customer_age  months_as_customer  \\\n",
       "0      0    8913.668763        80513.668763            54                  94   \n",
       "1      0   19746.724395        26146.724395            41                 165   \n",
       "2      0   11652.969918        22052.969918            57                 155   \n",
       "3      0   11260.930936       115960.930936            39                  80   \n",
       "4      0   27987.704652        31387.704652            39                  60   \n",
       "\n",
       "   num_claims_past_year  num_insurers_past_5_years  policy_deductable  \\\n",
       "0                     0                          1                750   \n",
       "1                     0                          1                750   \n",
       "2                     0                          1                750   \n",
       "3                     0                          1                750   \n",
       "4                     0                          1                750   \n",
       "\n",
       "   policy_annual_premium  customer_zip  ...  collision_type_missing  \\\n",
       "0                   3000         99207  ...                       0   \n",
       "1                   2950         95632  ...                       0   \n",
       "2                   3000         93203  ...                       0   \n",
       "3                   3000         85208  ...                       0   \n",
       "4                   3000         91792  ...                       0   \n",
       "\n",
       "   incident_severity_Major  incident_severity_Minor  \\\n",
       "0                        0                        1   \n",
       "1                        0                        0   \n",
       "2                        0                        1   \n",
       "3                        0                        1   \n",
       "4                        1                        0   \n",
       "\n",
       "   incident_severity_Totaled  authorities_contacted_Ambulance  \\\n",
       "0                          0                                0   \n",
       "1                          1                                0   \n",
       "2                          0                                0   \n",
       "3                          0                                0   \n",
       "4                          0                                0   \n",
       "\n",
       "   authorities_contacted_Fire  authorities_contacted_None  \\\n",
       "0                           0                           1   \n",
       "1                           0                           0   \n",
       "2                           0                           0   \n",
       "3                           0                           1   \n",
       "4                           0                           0   \n",
       "\n",
       "   authorities_contacted_Police  police_report_available_No  \\\n",
       "0                             0                           1   \n",
       "1                             1                           0   \n",
       "2                             1                           0   \n",
       "3                             0                           1   \n",
       "4                             1                           1   \n",
       "\n",
       "   police_report_available_Yes  \n",
       "0                            0  \n",
       "1                            1  \n",
       "2                            1  \n",
       "3                            0  \n",
       "4                            0  \n",
       "\n",
       "[5 rows x 59 columns]"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "preprocessed_train_path = os.path.join(base_output_dir + '/train/train.csv')\n",
    "preprocessed_test_path = os.path.join(base_output_dir + '/test/test.csv')\n",
    "\n",
    "preprocessed_train_df = pd.read_csv(preprocessed_train_path)\n",
    "preprocessed_test_df = pd.read_csv(preprocessed_test_path)\n",
    "\n",
    "preprocessed_train_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "첫번때 데이터 세트는 훈련 데이터 세트, 두번째는 테스트 데이터 세트 입니다. 각각의 fraud의 비율을 확인하세요."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train Data Set: \n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "fraud\n",
       "0        3869\n",
       "1         131\n",
       "dtype: int64"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Test Data Set:\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "fraud\n",
       "0        967\n",
       "1         33\n",
       "dtype: int64"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "print(\"Train Data Set: \")\n",
    "dp(preprocessed_train_df[['fraud']].value_counts())\n",
    "\n",
    "print(\"\\nTest Data Set:\")\n",
    "dp(preprocessed_test_df[['fraud']].value_counts())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. 모델 빌딩 파이프라인 의 스텝(Step) 생성\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.1 모델 빌딩 파이프라인 변수 생성\n",
    "\n",
    "파이프라인에서 사용할 파이프라인 파라미터를 정의합니다. 파이프라인을 스케줄하고 실행할 때 파라미터를 이용하여 실행조건을 커스마이징할 수 있습니다. 파라미터를 이용하면 파이프라인 실행시마다 매번 파이프라인 정의를 수정하지 않아도 됩니다.\n",
    "\n",
    "지원되는 파라미터 타입은 다음과 같습니다:\n",
    "\n",
    "* `ParameterString` - 파이썬 타입에서 `str` \n",
    "* `ParameterInteger` - 파이썬 타입에서 `int` \n",
    "* `ParameterFloat` - 파이썬 타입에서 `float` \n",
    "\n",
    "이들 파라미터를 정의할 때 디폴트 값을 지정할 수 있으며 파이프라인 실행시 재지정할 수도 있습니다. 지정하는 디폴트 값은 파라미터 타입과 일치하여야 합니다.\n",
    "\n",
    "\n",
    "파이프라인의 각 스텝에서 사용할 변수를 파라미터 변수로서 정의 합니다.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from sagemaker.workflow.parameters import (\n",
    "    ParameterInteger,\n",
    "    ParameterString,\n",
    ")\n",
    "\n",
    "processing_instance_count = ParameterInteger(\n",
    "    name=\"ProcessingInstanceCount\",\n",
    "    default_value=1\n",
    ")\n",
    "processing_instance_type = ParameterString(\n",
    "    name=\"ProcessingInstanceType\",\n",
    "    default_value=\"ml.m5.xlarge\"\n",
    ")\n",
    "\n",
    "input_data = ParameterString(\n",
    "    name=\"InputData\",\n",
    "    default_value=input_data_uri,\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.2. 전처리 스텝 프로세서 정의\n",
    "- 전처리의 내장 SKLearnProcessor 를 통해서 sklearn_processor 오브젝트를 생성 합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The input argument instance_type of function (sagemaker.image_uris.retrieve) is a pipeline variable (<class 'sagemaker.workflow.parameters.ParameterString'>), which is not allowed. The default_value of this Parameter object will be used to override it. Please make sure the default_value is valid.\n"
     ]
    }
   ],
   "source": [
    "from sagemaker.sklearn.processing import SKLearnProcessor\n",
    "\n",
    "framework_version = \"1.0-1\"\n",
    "\n",
    "sklearn_processor = SKLearnProcessor(\n",
    "    framework_version=framework_version,\n",
    "    instance_type=processing_instance_type,\n",
    "    instance_count=processing_instance_count,\n",
    "    base_job_name=\"sklearn-fraud-process\",\n",
    "    role=role,\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.3. 전처리 스텝 단계 정의\n",
    "- 처리 단계에서는 아래와 같은 주요 인자가 있습니다.\n",
    "    - 단계 이름\n",
    "    - processor 기술: 위에서 생성한 processor 오브젝트를 제공\n",
    "    - inputs: S3의 경로를 기술하고, 다커안에서의 다운로드 폴더(destination)을 기술 합니다.\n",
    "    - outputs: 처리 결과가 저장될 다커안에서의 폴더 경로를 기술합니다.\n",
    "        - 다커안의 결과 파일이 저장 후에 자동으로 S3로 업로딩을 합니다.\n",
    "    - job_arguments: 사용자 정의의 인자를 기술 합니다.\n",
    "    - code: 전처리 코드의 경로를 기술 합니다.\n",
    "- 처리 단계의 상세한 사항은 여기를 보세요. --> [처리 단계, Processing Step](https://docs.aws.amazon.com/ko_kr/sagemaker/latest/dg/build-and-manage-steps.html#step-type-processing)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'src/preprocessing.py'"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "preprocessing_code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from sagemaker.processing import ProcessingInput, ProcessingOutput\n",
    "from sagemaker.workflow.steps import ProcessingStep\n",
    "    \n",
    "\n",
    "step_process = ProcessingStep(\n",
    "    name=\"Fraud-Basic-Process\",\n",
    "    processor=sklearn_processor,\n",
    "    inputs=[\n",
    "        ProcessingInput(source=input_data_uri,destination='/opt/ml/processing/input'),\n",
    "         ],\n",
    "    outputs=[ProcessingOutput(output_name=\"train\",\n",
    "                              source='/opt/ml/processing/output/train'),\n",
    "             ProcessingOutput(output_name=\"test\",\n",
    "                              source='/opt/ml/processing/output/test')],\n",
    "    job_arguments=[\"--split_rate\", f\"{split_rate}\"],    \n",
    "    code= preprocessing_code\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. 파리마터, 단계, 조건을 조합하여 최종 파이프라인 정의 및 실행\n",
    "\n",
    "\n",
    "이제 지금까지 생성한 단계들을 하나의 파이프라인으로 조합하고 실행하도록 하겠습니다.\n",
    "\n",
    "파이프라인은 name, parameters, steps 속성이 필수적으로 필요합니다. \n",
    "여기서 파이프라인의 이름은 (account, region) 조합에 대하여 유일(unique))해야 합니다.\n",
    "\n",
    "\n",
    "주의:\n",
    "\n",
    "- 정의에 사용한 모든 파라미터가 존재해야 합니다.\n",
    "- 파이프라인으로 전달된 단계(step)들은 실행순서와는 무관합니다. SageMaker Pipeline은 단계가 실행되고 완료될 수 있도록 의존관계를를 해석합니다.\n",
    "- [알림] 정의한 stpes 이 복수개이면 복수개를 기술합니다. 만약에 step 간에 의존성이 있으면, 명시적으로 기술하지 않아도 같이 실행 됩니다.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.1 파이프라인 정의\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from sagemaker.workflow.pipeline import Pipeline\n",
    "\n",
    "pipeline_name = project_prefix\n",
    "pipeline = Pipeline(\n",
    "    name=pipeline_name,\n",
    "    parameters=[\n",
    "        processing_instance_type, \n",
    "        processing_instance_count,\n",
    "        input_data,\n",
    "    ],\n",
    "    steps=[step_process],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.2 파이프라인 정의 확인\n",
    "위에서 정의한 파이프라인 정의는 Json 형식으로 정의 되어 있습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Version': '2020-12-01',\n",
       " 'Metadata': {},\n",
       " 'Parameters': [{'Name': 'ProcessingInstanceType',\n",
       "   'Type': 'String',\n",
       "   'DefaultValue': 'ml.m5.xlarge'},\n",
       "  {'Name': 'ProcessingInstanceCount', 'Type': 'Integer', 'DefaultValue': 1},\n",
       "  {'Name': 'InputData',\n",
       "   'Type': 'String',\n",
       "   'DefaultValue': 's3://sagemaker-us-east-1-585843180719/sagemaker-webinar-pipeline-base/input'}],\n",
       " 'PipelineExperimentConfig': {'ExperimentName': {'Get': 'Execution.PipelineName'},\n",
       "  'TrialName': {'Get': 'Execution.PipelineExecutionId'}},\n",
       " 'Steps': [{'Name': 'Fraud-Basic-Process',\n",
       "   'Type': 'Processing',\n",
       "   'Arguments': {'ProcessingResources': {'ClusterConfig': {'InstanceType': {'Get': 'Parameters.ProcessingInstanceType'},\n",
       "      'InstanceCount': {'Get': 'Parameters.ProcessingInstanceCount'},\n",
       "      'VolumeSizeInGB': 30}},\n",
       "    'AppSpecification': {'ImageUri': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:1.0-1-cpu-py3',\n",
       "     'ContainerArguments': ['--split_rate', '0.2'],\n",
       "     'ContainerEntrypoint': ['python3',\n",
       "      '/opt/ml/processing/input/code/preprocessing.py']},\n",
       "    'RoleArn': 'arn:aws:iam::585843180719:role/TeamRole',\n",
       "    'ProcessingInputs': [{'InputName': 'input-1',\n",
       "      'AppManaged': False,\n",
       "      'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-585843180719/sagemaker-webinar-pipeline-base/input',\n",
       "       'LocalPath': '/opt/ml/processing/input',\n",
       "       'S3DataType': 'S3Prefix',\n",
       "       'S3InputMode': 'File',\n",
       "       'S3DataDistributionType': 'FullyReplicated',\n",
       "       'S3CompressionType': 'None'}},\n",
       "     {'InputName': 'code',\n",
       "      'AppManaged': False,\n",
       "      'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-585843180719/Fraud-Basic-Process-572c86dd1149a85da325ac4f7701e85d/input/code/preprocessing.py',\n",
       "       'LocalPath': '/opt/ml/processing/input/code',\n",
       "       'S3DataType': 'S3Prefix',\n",
       "       'S3InputMode': 'File',\n",
       "       'S3DataDistributionType': 'FullyReplicated',\n",
       "       'S3CompressionType': 'None'}}],\n",
       "    'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'train',\n",
       "       'AppManaged': False,\n",
       "       'S3Output': {'S3Uri': {'Std:Join': {'On': '/',\n",
       "          'Values': ['s3:/',\n",
       "           'sagemaker-us-east-1-585843180719',\n",
       "           'sagemaker-webinar-pipeline-base',\n",
       "           {'Get': 'Execution.PipelineExecutionId'},\n",
       "           'Fraud-Basic-Process',\n",
       "           'output',\n",
       "           'train']}},\n",
       "        'LocalPath': '/opt/ml/processing/output/train',\n",
       "        'S3UploadMode': 'EndOfJob'}},\n",
       "      {'OutputName': 'test',\n",
       "       'AppManaged': False,\n",
       "       'S3Output': {'S3Uri': {'Std:Join': {'On': '/',\n",
       "          'Values': ['s3:/',\n",
       "           'sagemaker-us-east-1-585843180719',\n",
       "           'sagemaker-webinar-pipeline-base',\n",
       "           {'Get': 'Execution.PipelineExecutionId'},\n",
       "           'Fraud-Basic-Process',\n",
       "           'output',\n",
       "           'test']}},\n",
       "        'LocalPath': '/opt/ml/processing/output/test',\n",
       "        'S3UploadMode': 'EndOfJob'}}]}}}]}"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "definition = json.loads(pipeline.definition())\n",
    "definition"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.3 파이프라인 정의를 제출하고 실행하기 \n",
    "\n",
    "파이프라인 정의를 파이프라인 서비스에 제출합니다. 함께 전달되는 역할(role)을 이용하여 AWS에서 파이프라인을 생성하고 작업의 각 단계를 실행할 것입니다.   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "pipeline.upsert(role_arn=role)\n",
    "execution = pipeline.start()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "워크플로우의 실행상황을 살펴봅니다. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'PipelineArn': 'arn:aws:sagemaker:us-east-1:585843180719:pipeline/sagemaker-webinar-pipeline-base',\n",
       " 'PipelineExecutionArn': 'arn:aws:sagemaker:us-east-1:585843180719:pipeline/sagemaker-webinar-pipeline-base/execution/ejfhjx0f5its',\n",
       " 'PipelineExecutionDisplayName': 'execution-1678692546980',\n",
       " 'PipelineExecutionStatus': 'Executing',\n",
       " 'CreationTime': datetime.datetime(2023, 3, 13, 7, 29, 6, 892000, tzinfo=tzlocal()),\n",
       " 'LastModifiedTime': datetime.datetime(2023, 3, 13, 7, 29, 6, 892000, tzinfo=tzlocal()),\n",
       " 'CreatedBy': {},\n",
       " 'LastModifiedBy': {},\n",
       " 'ResponseMetadata': {'RequestId': 'aa7e0a4f-1d59-4732-a737-47f9b26e8472',\n",
       "  'HTTPStatusCode': 200,\n",
       "  'HTTPHeaders': {'x-amzn-requestid': 'aa7e0a4f-1d59-4732-a737-47f9b26e8472',\n",
       "   'content-type': 'application/x-amz-json-1.1',\n",
       "   'content-length': '427',\n",
       "   'date': 'Mon, 13 Mar 2023 07:29:06 GMT'},\n",
       "  'RetryAttempts': 0}}"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "execution.describe()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.4 파이프라인 실행 기다리기"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "execution.wait()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "실행이 완료될 때까지 기다립니다."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "실행된 단계들을 리스트업합니다. 파이프라인의 단계실행 서비스에 의해 시작되거나 완료된 단계를 보여줍니다."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.5 파이프라인 실행 단계 기록 보기"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'StepName': 'Fraud-Basic-Process',\n",
       "  'StartTime': datetime.datetime(2023, 3, 13, 7, 29, 7, 821000, tzinfo=tzlocal()),\n",
       "  'EndTime': datetime.datetime(2023, 3, 13, 7, 35, 38, 103000, tzinfo=tzlocal()),\n",
       "  'StepStatus': 'Succeeded',\n",
       "  'AttemptCount': 0,\n",
       "  'Metadata': {'ProcessingJob': {'Arn': 'arn:aws:sagemaker:us-east-1:585843180719:processing-job/pipelines-ejfhjx0f5its-fraud-basic-process-ggf58zvugz'}}}]"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "execution.list_steps()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 6. 세이지 메이커 스튜디오에서 확인하기\n",
    "- 아래의 그림 처럼 SageMaker Studio에 로긴후에 따라하시면, SageMaker Studio 에서도 실행 내역을 확인할 수 있습니다.\n",
    "    - 그림 처럼 (1), (2), (3) 을 순서대로 클릭사시면 (4) 의 전처리 스텝의 실형 결과를 확인 할 수 있습니다.\n",
    "- SageMaker Studio 개발자 가이드 --> [SageMaker Studio](https://docs.aws.amazon.com/ko_kr/sagemaker/latest/dg/studio.html)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![process_basic.png](img/process_basic.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 7. 전처리 파일 경로 추출\n",
    "- 다음 노트북에서 사용할 훈련 및 테스트의 전처리 S3 경로를 저장 합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "train_preproc_dir_artifact: \n",
      " s3://sagemaker-us-east-1-585843180719/sagemaker-webinar-pipeline-base/ejfhjx0f5its/Fraud-Basic-Process/output/train\n",
      "test_preproc__dir_artifact: \n",
      " s3://sagemaker-us-east-1-585843180719/sagemaker-webinar-pipeline-base/ejfhjx0f5its/Fraud-Basic-Process/output/test\n"
     ]
    }
   ],
   "source": [
    "from src.p_utils import get_proc_artifact\n",
    "\n",
    "import boto3\n",
    "client = boto3.client(\"sagemaker\")\n",
    "\n",
    "train_preproc_dir_artifact = get_proc_artifact(execution, client, kind=0 )\n",
    "test_preproc_dir_artifact = get_proc_artifact(execution, client, kind=1 )\n",
    "\n",
    "print(\"train_preproc_dir_artifact: \\n\", train_preproc_dir_artifact)\n",
    "print(\"test_preproc__dir_artifact: \\n\", test_preproc_dir_artifact)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "다음 노트북에서 아래 변수를 사용하기 위해서 저장 합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Stored 'train_preproc_dir_artifact' (str)\n",
      "Stored 'test_preproc_dir_artifact' (str)\n"
     ]
    }
   ],
   "source": [
    "%store train_preproc_dir_artifact\n",
    "%store test_preproc_dir_artifact"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}