{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2fceb094",
   "metadata": {},
   "source": [
    "# 1.4 SageMaker Training with Experiments, HPO and Processing"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1c7b1e2",
   "metadata": {},
   "source": [
    "## 학습 작업의 실행 노트북 개요\n",
    "\n",
    "- SageMaker Training에 SageMaker 실험을 추가하여 여러 실험의 결과를 비교할 수 있습니다.\n",
    "    - [작업 실행 시 필요 라이브러리 import](#작업-실행-시-필요-라이브러리-import)\n",
    "    - [SageMaker 세션과 Role, 사용 버킷 정의](#SageMaker-세션과-Role,-사용-버킷-정의)\n",
    "    - [하이퍼파라미터 정의](#하이퍼파라미터-정의)\n",
    "    - [학습 실행 작업 정의](#학습-실행-작업-정의)\n",
    "        - 학습 코드 명\n",
    "        - 학습 코드 폴더 명\n",
    "        - 학습 코드가 사용한 Framework 종류, 버전 등\n",
    "        - 학습 인스턴스 타입과 개수\n",
    "        - SageMaker 세션\n",
    "        - 학습 작업 하이퍼파라미터 정의\n",
    "        - 학습 작업 산출물 관련 S3 버킷 설정 등\n",
    "    - [학습 데이터셋 지정](#학습-데이터셋-지정)\n",
    "        - 학습에 사용하는 데이터셋의 S3 URI 지정\n",
    "    - [SageMaker 실험 설정](#SageMaker-실험-설정)\n",
    "    - [학습 실행](#학습-실행)\n",
    "    - [데이터 세트 설명](#데이터-세트-설명)\n",
    "    - [실험 결과 보기](#실험-결과-보기)\n",
    "    - [Evaluation 하기](#Evaluation-하기)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4dc90460",
   "metadata": {},
   "source": [
    "### 작업 실행 시 필요 라이브러리 import"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80a85024",
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import sagemaker"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba9d596a",
   "metadata": {},
   "source": [
    "### SageMaker 세션과 Role, 사용 버킷 정의"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bda98a0a",
   "metadata": {},
   "outputs": [],
   "source": [
    "sagemaker_session = sagemaker.session.Session()\n",
    "region = sagemaker_session._region_name\n",
    "role = sagemaker.get_execution_role()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8664b776",
   "metadata": {},
   "outputs": [],
   "source": [
    "bucket = sagemaker_session.default_bucket()\n",
    "code_location = f's3://{bucket}/xgboost/code'\n",
    "output_path = f's3://{bucket}/xgboost/output'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21bd23dc",
   "metadata": {},
   "source": [
    "### 하이퍼파라미터 정의"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fbfbb81f",
   "metadata": {},
   "outputs": [],
   "source": [
    "hyperparameters = {\n",
    "       \"scale_pos_weight\" : 29,    \n",
    "        \"objective\": \"binary:logistic\",\n",
    "        \"num_round\": 100,\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d765652",
   "metadata": {},
   "source": [
    "### 학습 실행 작업 정의"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17aae4fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "instance_count = 1\n",
    "instance_type = \"ml.m5.large\"\n",
    "# instance_type = 'local'\n",
    "max_run = 1*60*60\n",
    "\n",
    "use_spot_instances = False\n",
    "if use_spot_instances:\n",
    "    max_wait = 1*60*60\n",
    "else:\n",
    "    max_wait = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1347adc9-0747-4f68-9946-7864d384d1ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "# image_uri = sagemaker.image_uris.retrieve(\n",
    "#     \"xgboost\",\n",
    "#     version=\"1.5-1\",\n",
    "#     region=region,\n",
    "#     image_scope='training',\n",
    "#     # instance_type=instance_type\n",
    "# )\n",
    "# image_uri"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "188e81ec-c435-4472-8273-5bf13d0a5bb5",
   "metadata": {},
   "outputs": [],
   "source": [
    "if instance_type in ['local', 'local_gpu']:\n",
    "    from sagemaker.local import LocalSession\n",
    "    sagemaker_session = LocalSession()\n",
    "    sagemaker_session.config = {'local': {'local_code': True}}\n",
    "else:\n",
    "    sagemaker_session = sagemaker.session.Session()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e489556",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.xgboost.estimator import XGBoost\n",
    "\n",
    "# estimator = sagemaker.estimator.Estimator(\n",
    "estimator = XGBoost(\n",
    "    entry_point=\"xgboost_starter_script.py\",\n",
    "    source_dir=\"src\",\n",
    "    # image_uri=image_uri,\n",
    "    output_path=output_path,\n",
    "    code_location=code_location,\n",
    "    hyperparameters=hyperparameters,\n",
    "    role=role,\n",
    "    sagemaker_session=sagemaker_session,\n",
    "    instance_count=instance_count,\n",
    "    instance_type=instance_type,\n",
    "    framework_version=\"1.3-1\",\n",
    "    max_run=max_run,\n",
    "    use_spot_instances=use_spot_instances,  # spot instance 활용\n",
    "    max_wait=max_wait,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9eeb2bf2",
   "metadata": {},
   "source": [
    "### 학습 데이터셋 지정"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "475c82bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_path=f's3://{bucket}/xgboost/dataset'\n",
    "!aws s3 sync ../data/dataset/ $data_path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c5ef86f-5aea-493c-b4df-41faaf2dd686",
   "metadata": {},
   "outputs": [],
   "source": [
    "if instance_type in ['local', 'local_gpu']:\n",
    "    from pathlib import Path\n",
    "    file_path = f'file://{Path.cwd()}'\n",
    "    inputs = file_path.split('lab_1_training')[0] + 'data/dataset/'\n",
    "    \n",
    "else:\n",
    "    inputs = data_path\n",
    "inputs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c686dcc8",
   "metadata": {},
   "source": [
    "### SageMaker 실험 설정"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e68616ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install -U sagemaker-experiments"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f26f8ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment_name='xgb-1'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9750d63",
   "metadata": {},
   "outputs": [],
   "source": [
    "from smexperiments.experiment import Experiment\n",
    "from smexperiments.trial import Trial\n",
    "from time import strftime"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99736980",
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_experiment(experiment_name):\n",
    "    try:\n",
    "        sm_experiment = Experiment.load(experiment_name)\n",
    "    except:\n",
    "        sm_experiment = Experiment.create(experiment_name=experiment_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38ae2ace",
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_trial(experiment_name):\n",
    "    create_date = strftime(\"%m%d-%H%M%s\")       \n",
    "    sm_trial = Trial.create(trial_name=f'{experiment_name}-{create_date}',\n",
    "                            experiment_name=experiment_name)\n",
    "\n",
    "    job_name = f'{sm_trial.trial_name}'\n",
    "    return job_name"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1913bbb2-8bca-4a42-9e82-0632a8b5b549",
   "metadata": {
    "tags": []
   },
   "source": [
    "### HPO 실행 + 학습 실행\n",
    "\n",
    "SageMaker의 [Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)을 활용할 수 있습니다. 이 방식은 높은 평가 비용의 최적화 문제를 위해 특별히 설계된 베이지안 최적화 방법을 사용합니다. [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/tuner.html)의 `fit()` 방법은 `Estimator`와 같이 기본적으로 제공되지 않습니다. (HPO 작업은 일반적으로는 오래 걸리기 때문입니다.) SageMaker console에 있는 \"Hyperparameter Tuning Jobs\"은 진행되는 작업의 상세 상태와 metrics를 확인하기에 좋은 UI를 제공합니다. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f69dad0-2eb1-4eb4-92a8-3a935ebcf126",
   "metadata": {},
   "outputs": [],
   "source": [
    "max_jobs=4    # TODO: Ideally 12 or more\n",
    "max_parallel_jobs=2   # TODO: Maybe only 1 for Event Engine, 2-3 if possible"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c088ce23-2a9f-4277-9c29-c9610b242263",
   "metadata": {},
   "outputs": [],
   "source": [
    "create_experiment(experiment_name)\n",
    "job_name = create_trial(experiment_name)\n",
    "\n",
    "job_name =  job_name[15:] ## job_name must have length less than or equal to 32 for HPO\n",
    "\n",
    "tuner = sagemaker.tuner.HyperparameterTuner(\n",
    "    estimator,\n",
    "    objective_metric_name=\"validation:auc\",\n",
    "    hyperparameter_ranges={\n",
    "        \"max_depth\": sagemaker.tuner.IntegerParameter(2, 5),\n",
    "        \"eta\": sagemaker.tuner.ContinuousParameter(0.1, 0.5)\n",
    "    },\n",
    "    objective_type=\"Maximize\",\n",
    "    max_jobs=max_jobs,    # TODO: Ideally 12 or more\n",
    "    max_parallel_jobs=max_parallel_jobs,    # TODO: Maybe only 1 for Event Engine, 2-3 if possible\n",
    ")\n",
    "\n",
    "tuner.fit(\n",
    "    job_name = job_name,\n",
    "    inputs={'inputdata': inputs},\n",
    "    experiment_config={\n",
    "          'TrialName': job_name,\n",
    "          'TrialComponentDisplayName': job_name,\n",
    "    },\n",
    "    wait=False\n",
    ")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee680f4b-f9e4-4f93-b30a-db6642007ad3",
   "metadata": {},
   "outputs": [],
   "source": [
    "tuner.wait()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "620b2585",
   "metadata": {},
   "source": [
    "###  실험 결과 보기\n",
    "위의 실험한 결과를 확인 합니다.\n",
    "- 각각의 훈련잡의 시도에 대한 훈련 사용 데이터, 모델 입력 하이퍼 파라미터, 모델 평가 지표, 모델 아티펙트 결과 위치 등의 확인이 가능합니다.\n",
    "- **아래의 모든 내용은 SageMaker Studio 를 통해서 직관적으로 확인이 가능합니다.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b121c5ca",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.analytics import ExperimentAnalytics, HyperparameterTuningJobAnalytics\n",
    "import pandas as pd\n",
    "pd.options.display.max_columns = 50\n",
    "pd.options.display.max_rows = 10\n",
    "pd.options.display.max_colwidth = 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80080207",
   "metadata": {},
   "outputs": [],
   "source": [
    "trial_component_training_analytics = HyperparameterTuningJobAnalytics(\n",
    "    sagemaker_session= sagemaker_session,\n",
    "    hyperparameter_tuning_job_name=job_name\n",
    ")\n",
    "\n",
    "trial_component_training_analytics.dataframe()[['TrainingJobName', 'TrainingJobStatus', \n",
    "                                                'eta', 'max_depth', 'FinalObjectiveValue']]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a23cfc2",
   "metadata": {
    "tags": []
   },
   "source": [
    "###  Evaluation 하기\n",
    "SageMaker Processing을 이용하여 Evalution을 수행하는 코드를 동작할 수 있습니다. MLOps에서 Processing을 적용하면 전처리, Evaluation 등을 serverless로 동작할 수 있습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6671a6c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.processing import FrameworkProcessor\n",
    "from sagemaker.processing import ProcessingInput, ProcessingOutput"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a18f8d6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "instance_count = 1\n",
    "instance_type = \"ml.m5.large\"\n",
    "# instance_type = 'local'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31647257",
   "metadata": {},
   "outputs": [],
   "source": [
    "script_eval = FrameworkProcessor(\n",
    "    XGBoost,\n",
    "    framework_version=\"1.3-1\",\n",
    "    role=role,\n",
    "    instance_type=instance_type,\n",
    "    instance_count=instance_count\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3f2ad2a-8b86-4a9e-82ff-67a1fe8cbdb7",
   "metadata": {},
   "outputs": [],
   "source": [
    "client = boto3.client('sagemaker')\n",
    "response = client.describe_training_job(\n",
    "    TrainingJobName=tuner.best_training_job()\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3ec78e8c-c257-477e-b4d4-3e987c0630f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "artifacts_dir = response['ModelArtifacts']['S3ModelArtifacts']\n",
    "artifacts_dir"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9829f86",
   "metadata": {},
   "outputs": [],
   "source": [
    "s3_test_path = data_path + '/test.csv'\n",
    "detect_outputpath = f's3://{bucket}/xgboost/processing'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef915fe1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "source_dir = f'{Path.cwd()}/src'\n",
    "\n",
    "if instance_type == 'local':\n",
    "    from sagemaker.local import LocalSession\n",
    "    from pathlib import Path\n",
    "\n",
    "    sagemaker_session = LocalSession()\n",
    "    sagemaker_session.config = {'local': {'local_code': True}}\n",
    "\n",
    "    s3_test_path=f'../data/dataset/test.csv'\n",
    "else:\n",
    "    sagemaker_session = sagemaker.session.Session()\n",
    "    s3_test_path=data_path + '/test.csv'  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d54892a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "create_experiment(experiment_name)\n",
    "job_name = create_trial(experiment_name)\n",
    "\n",
    "script_eval.run(\n",
    "    code=\"evaluation.py\",\n",
    "    source_dir=source_dir,\n",
    "    inputs=[ProcessingInput(source=s3_test_path, input_name=\"test_data\", destination=\"/opt/ml/processing/test\"),\n",
    "            ProcessingInput(source=artifacts_dir, input_name=\"model_weight\", destination=\"/opt/ml/processing/model\")\n",
    "    ],\n",
    "    outputs=[\n",
    "        ProcessingOutput(source=\"/opt/ml/processing/output\", output_name='evaluation', destination=detect_outputpath + \"/\" + job_name),\n",
    "    ],\n",
    "    job_name=job_name,\n",
    "    experiment_config={\n",
    "        'TrialName': job_name,\n",
    "        'TrialComponentDisplayName': job_name,\n",
    "    },\n",
    "    wait=False\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3b4e2101",
   "metadata": {},
   "outputs": [],
   "source": [
    "script_eval.latest_job.wait()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f2b05ad",
   "metadata": {},
   "source": [
    "###  실험 결과 확인"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60c7b7f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# artifacts_dir = xgb_estimator.model_data.replace('model.tar.gz', '')\n",
    "print(artifacts_dir)\n",
    "!aws s3 ls --human-readable {artifacts_dir}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3ffe8e7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_dir = './model'\n",
    "\n",
    "!rm -rf $model_dir\n",
    "\n",
    "import json , os\n",
    "\n",
    "if not os.path.exists(model_dir):\n",
    "    os.makedirs(model_dir)\n",
    "\n",
    "!aws s3 cp {artifacts_dir} {model_dir}/model.tar.gz\n",
    "!tar -xvzf {model_dir}/model.tar.gz -C {model_dir}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63b86672",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install xgboost graphviz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d82fcc80",
   "metadata": {},
   "outputs": [],
   "source": [
    "import xgboost as xgb\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c000ace4",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = xgb.XGBClassifier()\n",
    "model.load_model(\"./model/xgboost-model\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18c74b46",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_prep_df = pd.read_csv('../data/dataset/test.csv')\n",
    "x_test = test_prep_df.drop('fraud', axis=1)\n",
    "feature_data = xgb.DMatrix(x_test)\n",
    "model.get_booster().feature_names = feature_data.feature_names\n",
    "model.get_booster().feature_types = feature_data.feature_types\n",
    "fig, ax = plt.subplots(figsize=(15, 8))\n",
    "xgb.plot_importance(model, ax=ax, importance_type='gain')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8b38848",
   "metadata": {},
   "outputs": [],
   "source": [
    "xgb.plot_tree(model, num_trees=0, rankdir='LR')\n",
    "\n",
    "fig = plt.gcf()\n",
    "fig.set_size_inches(50, 15)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "880bc4cf",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e62a7f6d-d8d6-432f-9f33-b9c0ab141fd9",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}