{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2fceb094",
   "metadata": {},
   "source": [
    "# 1.3 SageMaker Training with Experiments and Processing"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1c7b1e2",
   "metadata": {},
   "source": [
    "## 학습 작업의 실행 노트북 개요\n",
    "\n",
    "- SageMaker Training에 SageMaker 실험을 추가하여 여러 실험의 결과를 비교할 수 있습니다.\n",
    "    - [작업 실행 시 필요 라이브러리 import](#작업-실행-시-필요-라이브러리-import)\n",
    "    - [SageMaker 세션과 Role, 사용 버킷 정의](#SageMaker-세션과-Role,-사용-버킷-정의)\n",
    "    - [하이퍼파라미터 정의](#하이퍼파라미터-정의)\n",
    "    - [학습 실행 작업 정의](#학습-실행-작업-정의)\n",
    "        - 학습 코드 명\n",
    "        - 학습 코드 폴더 명\n",
    "        - 학습 코드가 사용한 Framework 종류, 버전 등\n",
    "        - 학습 인스턴스 타입과 개수\n",
    "        - SageMaker 세션\n",
    "        - 학습 작업 하이퍼파라미터 정의\n",
    "        - 학습 작업 산출물 관련 S3 버킷 설정 등\n",
    "    - [학습 데이터셋 지정](#학습-데이터셋-지정)\n",
    "        - 학습에 사용하는 데이터셋의 S3 URI 지정\n",
    "    - [SageMaker 실험 설정](#SageMaker-실험-설정)\n",
    "    - [학습 실행](#학습-실행)\n",
    "    - [데이터 세트 설명](#데이터-세트-설명)\n",
    "    - [실험 결과 보기](#실험-결과-보기)\n",
    "    - [Evaluation 하기](#Evaluation-하기)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4dc90460",
   "metadata": {},
   "source": [
    "### 작업 실행 시 필요 라이브러리 import"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80a85024",
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import sagemaker"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba9d596a",
   "metadata": {},
   "source": [
    "### SageMaker 세션과 Role, 사용 버킷 정의"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bda98a0a",
   "metadata": {},
   "outputs": [],
   "source": [
    "sagemaker_session = sagemaker.session.Session()\n",
    "role = sagemaker.get_execution_role()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8664b776",
   "metadata": {},
   "outputs": [],
   "source": [
    "bucket = sagemaker_session.default_bucket()\n",
    "code_location = f's3://{bucket}/xgboost/code'\n",
    "output_path = f's3://{bucket}/xgboost/output'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21bd23dc",
   "metadata": {},
   "source": [
    "### 하이퍼파라미터 정의"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fbfbb81f",
   "metadata": {},
   "outputs": [],
   "source": [
    "hyperparameters = {\n",
    "       \"scale_pos_weight\" : \"19\",    \n",
    "        \"max_depth\": \"2\",\n",
    "        \"eta\": \"0.3\",\n",
    "        \"objective\": \"binary:logistic\",\n",
    "        \"num_round\": \"100\",\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d765652",
   "metadata": {},
   "source": [
    "### 학습 실행 작업 정의"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17aae4fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "instance_count = 1\n",
    "instance_type = \"ml.m5.large\"\n",
    "# instance_type = \"local\"\n",
    "max_run = 1*60*60\n",
    "\n",
    "use_spot_instances = False\n",
    "if use_spot_instances:\n",
    "    max_wait = 1*60*60\n",
    "else:\n",
    "    max_wait = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf0b8f56-1659-464a-8caf-2230b2270402",
   "metadata": {},
   "outputs": [],
   "source": [
    "if instance_type in ['local', 'local_gpu']:\n",
    "    from sagemaker.local import LocalSession\n",
    "    sagemaker_session = LocalSession()\n",
    "    sagemaker_session.config = {'local': {'local_code': True}}\n",
    "else:\n",
    "    sagemaker_session = sagemaker.session.Session()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e489556",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.xgboost.estimator import XGBoost\n",
    "\n",
    "estimator = XGBoost(\n",
    "    entry_point=\"xgboost_starter_script.py\",\n",
    "    source_dir=\"src\",\n",
    "    output_path=output_path,\n",
    "    code_location=code_location,\n",
    "    hyperparameters=hyperparameters,\n",
    "    role=role,\n",
    "    sagemaker_session=sagemaker_session,\n",
    "    instance_count=instance_count,\n",
    "    instance_type=instance_type,\n",
    "    framework_version=\"1.3-1\",\n",
    "    max_run=max_run,\n",
    "    use_spot_instances=use_spot_instances,  # spot instance 활용\n",
    "    max_wait=max_wait,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9eeb2bf2",
   "metadata": {},
   "source": [
    "### 학습 데이터셋 지정"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "475c82bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_path=f's3://{bucket}/xgboost/dataset'\n",
    "!aws s3 sync ../data/dataset/ $data_path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "47669f10-5ac8-4412-bcd1-daedd21b03c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "if instance_type in ['local', 'local_gpu']:\n",
    "    from pathlib import Path\n",
    "    file_path = f'file://{Path.cwd()}'\n",
    "    inputs = file_path.split('lab_1_training')[0] + 'data/dataset/'\n",
    "    \n",
    "else:\n",
    "    inputs = data_path\n",
    "inputs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c686dcc8",
   "metadata": {
    "tags": []
   },
   "source": [
    "### SageMaker 실험 설정"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e68616ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install -U sagemaker-experiments"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f26f8ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment_name='xgboost-poc-1'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9750d63",
   "metadata": {},
   "outputs": [],
   "source": [
    "from smexperiments.experiment import Experiment\n",
    "from smexperiments.trial import Trial\n",
    "from time import strftime"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99736980",
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_experiment(experiment_name):\n",
    "    try:\n",
    "        sm_experiment = Experiment.load(experiment_name)\n",
    "    except:\n",
    "        sm_experiment = Experiment.create(experiment_name=experiment_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38ae2ace",
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_trial(experiment_name):\n",
    "    create_date = strftime(\"%m%d-%H%M%s\")       \n",
    "    sm_trial = Trial.create(trial_name=f'{experiment_name}-{create_date}',\n",
    "                            experiment_name=experiment_name)\n",
    "\n",
    "    job_name = f'{sm_trial.trial_name}'\n",
    "    return job_name"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "084598f2",
   "metadata": {},
   "source": [
    "### 학습 실행"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8313889",
   "metadata": {},
   "outputs": [],
   "source": [
    "create_experiment(experiment_name)\n",
    "job_name = create_trial(experiment_name)\n",
    "\n",
    "estimator.fit(inputs = {'inputdata': inputs},\n",
    "                  job_name = job_name,\n",
    "                  experiment_config={\n",
    "                      'TrialName': job_name,\n",
    "                      'TrialComponentDisplayName': job_name,\n",
    "                  },\n",
    "                  wait=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e76e4ec5",
   "metadata": {},
   "outputs": [],
   "source": [
    "estimator.logs()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "620b2585",
   "metadata": {},
   "source": [
    "###  실험 결과 보기\n",
    "위의 실험한 결과를 확인 합니다.\n",
    "- 각각의 훈련잡의 시도에 대한 훈련 사용 데이터, 모델 입력 하이퍼 파라미터, 모델 평가 지표, 모델 아티펙트 결과 위치 등의 확인이 가능합니다.\n",
    "- **아래의 모든 내용은 SageMaker Studio 를 통해서 직관적으로 확인이 가능합니다.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b121c5ca",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.analytics import ExperimentAnalytics\n",
    "import pandas as pd\n",
    "pd.options.display.max_columns = 50\n",
    "pd.options.display.max_rows = 10\n",
    "pd.options.display.max_colwidth = 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80080207",
   "metadata": {},
   "outputs": [],
   "source": [
    "trial_component_training_analytics = ExperimentAnalytics(\n",
    "    sagemaker_session= sagemaker_session,\n",
    "    experiment_name= experiment_name,\n",
    "    sort_by=\"metrics.validation:auc.max\",        \n",
    "    sort_order=\"Descending\",\n",
    "    metric_names=[\"validation:auc\"]\n",
    ")\n",
    "\n",
    "trial_component_training_analytics.dataframe()[['Experiments', 'Trials', 'validation:auc - Min', 'validation:auc - Max',\n",
    "                                                'validation:auc - Avg', 'validation:auc - StdDev', 'validation:auc - Last', \n",
    "                                                'eta', 'max_depth', 'num_round', 'scale_pos_weight']]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a23cfc2",
   "metadata": {},
   "source": [
    "###  Evaluation 하기\n",
    "SageMaker Processing을 이용하여 Evalution을 수행하는 코드를 동작할 수 있습니다. MLOps에서 Processing을 적용하면 전처리, Evaluation 등을 serverless로 동작할 수 있습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6671a6c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.processing import FrameworkProcessor\n",
    "from sagemaker.processing import ProcessingInput, ProcessingOutput"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a18f8d6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "instance_count = 1\n",
    "instance_type = \"ml.m5.large\"\n",
    "# instance_type = 'local'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31647257",
   "metadata": {},
   "outputs": [],
   "source": [
    "script_eval = FrameworkProcessor(\n",
    "    XGBoost,\n",
    "    framework_version=\"1.3-1\",\n",
    "    role=role,\n",
    "    instance_type=instance_type,\n",
    "    instance_count=instance_count\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "83455c16",
   "metadata": {},
   "outputs": [],
   "source": [
    "artifacts_dir = estimator.model_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9829f86",
   "metadata": {},
   "outputs": [],
   "source": [
    "s3_test_path = data_path + '/test.csv'\n",
    "detect_outputpath = f's3://{bucket}/xgboost/processing'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef915fe1",
   "metadata": {},
   "outputs": [],
   "source": [
    "source_dir='src'\n",
    "\n",
    "if instance_type == 'local':\n",
    "    from sagemaker.local import LocalSession\n",
    "    from pathlib import Path\n",
    "\n",
    "    sagemaker_session = LocalSession()\n",
    "    sagemaker_session.config = {'local': {'local_code': True}}\n",
    "    source_dir = f'{Path.cwd()}/src'\n",
    "    s3_test_path=f'../data/dataset/test.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d54892a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "create_experiment(experiment_name)\n",
    "job_name = create_trial(experiment_name)\n",
    "\n",
    "script_eval.run(\n",
    "    code=\"evaluation.py\",\n",
    "    source_dir=source_dir,\n",
    "    inputs=[ProcessingInput(source=s3_test_path, input_name=\"test_data\", destination=\"/opt/ml/processing/test\"),\n",
    "            ProcessingInput(source=artifacts_dir, input_name=\"model_weight\", destination=\"/opt/ml/processing/model\")\n",
    "    ],\n",
    "    outputs=[\n",
    "        ProcessingOutput(source=\"/opt/ml/processing/output\", output_name='evaluation', destination=detect_outputpath + \"/\" + job_name),\n",
    "    ],\n",
    "    job_name=job_name,\n",
    "    experiment_config={\n",
    "        'TrialName': job_name,\n",
    "        'TrialComponentDisplayName': job_name,\n",
    "    },\n",
    "    wait=False\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3b4e2101",
   "metadata": {},
   "outputs": [],
   "source": [
    "script_eval.latest_job.wait()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f2b05ad",
   "metadata": {},
   "source": [
    "###  실험 결과 확인"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60c7b7f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "artifacts_dir = estimator.model_data.replace('model.tar.gz', '')\n",
    "print(artifacts_dir)\n",
    "!aws s3 ls --human-readable {artifacts_dir}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3ffe8e7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_dir = './model'\n",
    "\n",
    "!rm -rf $model_dir\n",
    "\n",
    "import json , os\n",
    "\n",
    "if not os.path.exists(model_dir):\n",
    "    os.makedirs(model_dir)\n",
    "\n",
    "!aws s3 cp {artifacts_dir}model.tar.gz {model_dir}/model.tar.gz\n",
    "!tar -xvzf {model_dir}/model.tar.gz -C {model_dir}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63b86672",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install xgboost graphviz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d82fcc80",
   "metadata": {},
   "outputs": [],
   "source": [
    "import xgboost as xgb\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c000ace4",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = xgb.XGBClassifier()\n",
    "model.load_model(\"./model/xgboost-model\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18c74b46",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_prep_df = pd.read_csv('../data/dataset/test.csv')\n",
    "x_test = test_prep_df.drop('fraud', axis=1)\n",
    "feature_data = xgb.DMatrix(x_test)\n",
    "model.get_booster().feature_names = feature_data.feature_names\n",
    "model.get_booster().feature_types = feature_data.feature_types\n",
    "fig, ax = plt.subplots(figsize=(15, 8))\n",
    "xgb.plot_importance(model, ax=ax, importance_type='gain')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8b38848",
   "metadata": {},
   "outputs": [],
   "source": [
    "xgb.plot_tree(model, num_trees=0, rankdir='LR')\n",
    "\n",
    "fig = plt.gcf()\n",
    "fig.set_size_inches(50, 15)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "880bc4cf",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}