{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "70151e64",
   "metadata": {},
   "source": [
    "# 1.1 Amazon SageMaker Training 실습"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2249bdb",
   "metadata": {
    "tags": []
   },
   "source": [
    "## 학습 작업의 실행 노트북 개요\n",
    "\n",
    "이 노트북은 SageMaker에서 Training Job을 통해서 모델 학습합니다.상세한 사항은 개발자 가이드를 참조 하세요. -->  [모델 학습](https://sagemaker.readthedocs.io/en/stable/overview.html#prepare-a-training-script)\n",
    "\n",
    "- 일반적으로 SageMaker Training은 아래와 같이 진행이 됩니다.\n",
    "    - [작업 실행 시 필요 라이브러리 import](#작업-실행-시-필요-라이브러리-import)\n",
    "    - [SageMaker 세션과 Role, 사용 버킷 정의](#SageMaker-세션과-Role,-사용-버킷-정의)\n",
    "    - [하이퍼파라미터 정의](#하이퍼파라미터-정의)\n",
    "    - [학습 실행 작업 정의](#학습-실행-작업-정의)\n",
    "        - 학습 코드 명\n",
    "        - 학습 코드 폴더 명\n",
    "        - 학습 코드가 사용한 Framework 종류, 버전 등\n",
    "        - 학습 인스턴스 타입과 개수\n",
    "        - SageMaker 세션\n",
    "        - 학습 작업 하이퍼파라미터 정의\n",
    "        - 학습 작업 산출물 관련 S3 버킷 설정 등\n",
    "    - [학습 데이터셋 지정](#학습-데이터셋-지정)\n",
    "        - 학습에 사용하는 데이터셋의 S3 URI 지정\n",
    "    - [학습 실행](#학습-실행)\n",
    "    - [데이터 세트 설명](#데이터-세트-설명)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb7739ae",
   "metadata": {},
   "source": [
    "### 작업 실행 시 필요 라이브러리 import"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5296340a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import sagemaker"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3edc2244",
   "metadata": {},
   "source": [
    "### SageMaker 세션과 Role, 사용 버킷 정의"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7551894e",
   "metadata": {},
   "outputs": [],
   "source": [
    "sagemaker_session = sagemaker.session.Session()\n",
    "role = sagemaker.get_execution_role()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f4a1511",
   "metadata": {},
   "outputs": [],
   "source": [
    "bucket = sagemaker_session.default_bucket()\n",
    "code_location = f's3://{bucket}/xgboost/code'\n",
    "output_path = f's3://{bucket}/xgboost/output'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6deeebb5",
   "metadata": {},
   "source": [
    "### 하이퍼파라미터 정의"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "837a9316",
   "metadata": {},
   "outputs": [],
   "source": [
    "hyperparameters = {\n",
    "       \"scale_pos_weight\" : \"29\",    \n",
    "        \"max_depth\": \"3\",\n",
    "        \"eta\": \"0.2\",\n",
    "        \"objective\": \"binary:logistic\",\n",
    "        \"num_round\": \"100\",\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "111d3658",
   "metadata": {},
   "source": [
    "### 학습 실행 작업 정의"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f383326",
   "metadata": {},
   "outputs": [],
   "source": [
    "instance_count = 1\n",
    "instance_type = \"ml.m5.large\"\n",
    "# instance_type = \"local\"\n",
    "max_run = 1*60*60\n",
    "\n",
    "use_spot_instances = False\n",
    "if use_spot_instances:\n",
    "    max_wait = 1*60*60\n",
    "else:\n",
    "    max_wait = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "81c0b774-9be5-4e1c-a30e-f9ee3edaeeb3",
   "metadata": {},
   "outputs": [],
   "source": [
    "if instance_type in ['local', 'local_gpu']:\n",
    "    from sagemaker.local import LocalSession\n",
    "    sagemaker_session = LocalSession()\n",
    "    sagemaker_session.config = {'local': {'local_code': True}}\n",
    "else:\n",
    "    sagemaker_session = sagemaker.session.Session()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4dcee8ce-f51e-44da-9eac-84f9d4212281",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.xgboost.estimator import XGBoost\n",
    "\n",
    "estimator = XGBoost(\n",
    "    entry_point=\"xgboost_starter_script.py\",\n",
    "    source_dir='src',\n",
    "    output_path=output_path,\n",
    "    code_location=code_location,\n",
    "    hyperparameters=hyperparameters,\n",
    "    role=role,\n",
    "    sagemaker_session=sagemaker_session,\n",
    "    instance_count=instance_count,\n",
    "    instance_type=instance_type,\n",
    "    framework_version=\"1.3-1\",\n",
    "    max_run=max_run,\n",
    "    use_spot_instances=use_spot_instances,  # spot instance 활용\n",
    "    max_wait=max_wait,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a78bf16",
   "metadata": {},
   "source": [
    "### 학습 데이터셋 지정"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae944abb",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_path=f's3://{bucket}/xgboost/dataset'\n",
    "!aws s3 sync ../data/dataset/ $data_path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24ceb810-675c-49c3-8137-6b0c85ae9cae",
   "metadata": {},
   "outputs": [],
   "source": [
    "if instance_type in ['local', 'local_gpu']:\n",
    "    from pathlib import Path\n",
    "    file_path = f'file://{Path.cwd()}'\n",
    "    inputs = file_path.split('lab_1_training')[0] + 'data/dataset/'\n",
    "    \n",
    "else:\n",
    "    inputs = data_path\n",
    "inputs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61ba2857",
   "metadata": {},
   "source": [
    "### 학습 실행"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5127e34",
   "metadata": {},
   "outputs": [],
   "source": [
    "estimator.fit(inputs = {'inputdata': inputs},\n",
    "                  wait=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "274ab30b",
   "metadata": {},
   "outputs": [],
   "source": [
    "estimator.logs()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ac8b2d0",
   "metadata": {},
   "source": [
    "### 데이터 세트 설명\n",
    "- 데이터 세트는 블로그 [Architect and build the full machine learning lifecycle with AWS: An end-to-end Amazon SageMaker demo](https://aws.amazon.com/blogs/machine-learning/architect-and-build-the-full-machine-learning-lifecycle-with-amazon-sagemaker/) 에서 사용한 데이터를 사용합니다. 블로그에서는 데이터 세트에 대해서 이렇게 설명 합니다.\n",
    "- \"자동차 보험 청구 사기를 탐지를 위해서 블로그 저자가 데이터를 합성해서 만든 고객과 클레임의 데이터 세트를 사용합니다.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6024e8f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2925c8bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_prep_df = pd.read_csv('../data/dataset/train.csv')\n",
    "train_prep_df.groupby('fraud').sample(n=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4830231",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_prep_df.groupby('fraud').size()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e511a2b",
   "metadata": {},
   "source": [
    "#### 데이터 컬럼 설명\n",
    "- fraud: 보험 청구의 사기 여부 입니다. 1 이면 사기, 0 이면 정상 청구 입니다.\n",
    "- vehicle_claim: 자동차에 대한 보험 청구액. 값으로서, $1000, $17,638 등이 있습니다.\n",
    "- total_claim_amount: 전체 보험 청구액 입니다. $21,400, $10,000 등이 있습니다.    \n",
    "- customer_age: 고객의 나이를 의미 합니다.\n",
    "- months_as_customer: 고객으로서의 가입 기간을 의미합니다. 단위는 월로서 11, 30, 31 등의 값이 존재 합니다.\n",
    "- num_claims_past_year: 작년의 보험 청구 수를 의미 합니다. 0, 1, 2, 3, 4, 5, 6 의 값이 존재 합니다.\n",
    "- num_insurers_past_5_years: 과거 5년 동안의 보험 가입 회사 수를 의미 합니다. 1, 2, 3, 4, 5 의 값이 존재 합니다.\n",
    "- policy_deductable: 보험의 최소 자기 부담금 입니다. $750, $800 등이 있습니다.    \n",
    "- policy_annual_premium: 보험의 특약 가입에 대한 금액 입니다. $2000, $3000 등이 있습니다.\n",
    "- customer_zip: 고객의 집 주소 우편 번호를 의미합니다.\n",
    "- auto_year: 자동차의 년식을 의미 합니다. 2020, 2019 등이 있습니다.\n",
    "- num_vehicles_involved: 몇 대의 자동차가 사고에 연관 되었는지 입니다. 1, 2, 3, 4, 5, 6 의 값이 있습니다.\n",
    "- num_injuries: 몇 명이 상해를 입었는지를 기술합니다. 0, 1, 2, 3, 4, 의 값이 있습니다.\n",
    "- num_witnesses: 몇 명의 목격자가 있었는지를 기술합니다. 0, 1, 2, 3, 4, 5 의 값이 있습니다.\n",
    "- injury_claim: 상해에 대한 보험 청구액. \\$5,500, \\$70,700, \\$100,700 등이 있습니다.    \n",
    "- incident_month: 사고가 발생한 월을 의미합니다. 1~12 값이 존재 합니다.\n",
    "- incident_day: 사고가 발생한 일자를 의미합니다. 1~31 값이 존재 합니다.\n",
    "- incident_dow: 사고가 발생한 요일을 의미합니다. 0~6 값이 존재 합니다.\n",
    "- incident_hour: 사고가 발생한 시간을 의미합니다. 0~23 값이 존재 합니다.\n",
    "- policy_state: 보험 계약을 한 미국 주(State)를 의미 합니다. CA, WA, AZ, OR, NV, ID 가 존재 합니다.    \n",
    "- policy_liability: 보험 청구의 한도를 의미 합니다. 예를 들어서 25/50 은  사람 당 상해 한도 $25,000, 사고 당 상해 한도가 $50,000 을  의미합니다. 25/50, 15/30, 30/60, 100/200 의 값이 존재 합니다. \n",
    "- customer_gender: 고객의 성별을 의미 합니다. Male, Female, Unkown, Other가 존재 합니다.\n",
    "- customer_education: 고객의 최종 학력을 의미합니다. Bachelor, High School, Advanced Degree, Associate, Below High School 이 존재 합니다.\n",
    "- driver_relationship: 보험 계약자와 운전자와의 관계 입니다. Self, Spouse, Child, Other 값이 존재 합니다.\n",
    "- incident_type: 사고의 종류를 기술합니다. Collision, Break-in, Theft 값이 존재 합니다.\n",
    "- collision_type: 충돌 타입을 기술합니다. Front, Rear, Side, missing 값이 존재 합니다.\n",
    "- incident_severity: 사고의 손실 정도 입니다. Minor, Major, Totaled 값이 존재 합니다.\n",
    "- authorities_contacted: 어떤 관련 기관에 연락을 했는지 입니다. Police, Ambuylance, Fire, None 값이 존재 합니다.\n",
    "- police_report_available: 경찰 보고서가 존재하는지를 기술합니다. Yes, No 의 값이 있습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33b5a5ff-2381-439b-ba72-32722e0cb22e",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}