{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "attended-township",
   "metadata": {},
   "source": [
    "# XGBoost simple example (SageMaker version)\n",
    "\n",
    "Source: https://www.datacamp.com/community/tutorials/xgboost-in-python"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "romantic-extent",
   "metadata": {},
   "source": [
    "### Load the data\n",
    "\n",
    "This notebook uses the same dataset as the [xgboost simple example](warmingup1.xgboost_simple.ipynb)."
   ]
  },
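  {
   "cell_type": "code",
   "execution_count": null,
   "id": "retained-apparatus",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,\n",
    "# so this cell needs an older scikit-learn release.\n",
    "from sklearn.datasets import load_boston\n",
    "\n",
    "# Load the Boston housing data and name the feature columns.\n",
    "boston = load_boston()\n",
    "data = pd.DataFrame(boston.data, columns=boston.feature_names)\n",
    "data.head()"
   ]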
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "thrown-dating",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(boston.DESCR)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "stainless-sample",
   "metadata": {},
   "source": [
    "### Train/test split & S3 data upload"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "breathing-rotation",
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import sagemaker\n",
    "from sagemaker import get_execution_role\n",
    "\n",
    "sess = sagemaker.Session()\n",
    "bucket = sess.default_bucket()                # replace with an existing bucket if needed\n",
    "prefix = 'sagemaker/DEMO-boston-sm'           # prefix used for all data stored within the bucket\n",
    "\n",
    "# IAM role that SageMaker assumes for training and hosting\n",
    "role = get_execution_role()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "lesbian-advertiser",
   "metadata": {},
   "source": [
    "To use the XGBoost algorithm provided by SageMaker, build the dataset so that the label is in the first column, then upload it to S3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "flush-scheduling",
   "metadata": {},
   "outputs": [],
   "source": [
    "data['y'] = boston.target\n",
    "# Put the label first; the CSV files are written without a header row or index.\n",
    "labeled = pd.concat([data['y'], data.drop(columns='y')], axis=1)\n",
    "\n",
    "# Split by row order (no shuffling): 70% train, 20% validation, 10% test.\n",
    "train_df, valid_df, test_df = np.split(labeled, [int(len(data)*0.7), int(len(data)*0.9)])\n",
    "train_df.to_csv('boston_train.csv', index=False, header=False)\n",
    "valid_df.to_csv('boston_valid.csv', index=False, header=False)"
   ]
  },
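  {
   "cell_type": "markdown",
   "id": "toy-label-first-note",
   "metadata": {},
   "source": [
    "The label-first layout can be checked on a toy frame before uploading anything. The cell below is a minimal sketch and not part of the original example: the `toy` frame, its column names, and the `label_first` helper are hypothetical, and they only illustrate the column order and the 70/20/10 row split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "toy-label-first-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "def label_first(df, label):\n",
    "    \"\"\"Return a copy of df with the label column moved to position 0.\"\"\"\n",
    "    return pd.concat([df[label], df.drop(columns=label)], axis=1)\n",
    "\n",
    "# Hypothetical toy data: two features and a label column named 'y'.\n",
    "toy = pd.DataFrame({'f1': np.arange(10.0),\n",
    "                    'f2': np.arange(10.0) * 2,\n",
    "                    'y': np.arange(10.0) + 100})\n",
    "toy_labeled = label_first(toy, 'y')\n",
    "\n",
    "# Same split points as the Boston data: 70% train, next 20% validation, last 10% test.\n",
    "n = len(toy_labeled)\n",
    "toy_train = toy_labeled.iloc[:int(n * 0.7)]\n",
    "toy_valid = toy_labeled.iloc[int(n * 0.7):int(n * 0.9)]\n",
    "toy_test = toy_labeled.iloc[int(n * 0.9):]\n",
    "\n",
    "print(list(toy_labeled.columns))\n",
    "print(len(toy_train), len(toy_valid), len(toy_test))\n",
    "# First CSV line as written with header=False and index=False: the label comes first.\n",
    "print(toy_train.to_csv(index=False, header=False).splitlines()[0])"
   ]
  },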
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sexual-carol",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Upload both CSV files under the prefix; S3 keys always use '/' as the separator.\n",
    "s3_bucket = boto3.Session().resource('s3').Bucket(bucket)\n",
    "s3_bucket.Object('{}/train/train.csv'.format(prefix)).upload_file('boston_train.csv')\n",
    "s3_bucket.Object('{}/validation/validation.csv'.format(prefix)).upload_file('boston_valid.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aging-bradley",
   "metadata": {},
   "source": [
    "### Regression training with SageMaker XGBoost"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "shaped-machinery",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker import image_uris\n",
    "\n",
    "# Container image for the built-in XGBoost algorithm in the session's region.\n",
    "container = image_uris.retrieve('xgboost', region=sess.boto_region_name, version='latest')\n",
    "\n",
    "# Training and validation channels point at the S3 prefixes uploaded above.\n",
    "s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train/'.format(bucket, prefix), content_type='csv')\n",
    "s3_input_valid = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "subjective-minnesota",
   "metadata": {},
   "source": [
    "Run the training job in the cloud with SageMaker. (This takes about 5 minutes.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "equal-algebra",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "xgb = sagemaker.estimator.Estimator(container,\n",
    "                                    role,\n",
    "                                    instance_count=1,\n",
    "                                    instance_type='ml.m4.xlarge',\n",
    "                                    output_path='s3://{}/{}/output'.format(bucket, prefix),\n",
    "                                    sagemaker_session=sess)\n",
    "# In the SageMaker built-in XGBoost naming, eta is the learning rate\n",
    "# and num_round is the number of boosting rounds.\n",
    "xgb.set_hyperparameters(objective='reg:linear',\n",
    "                        colsample_bytree=0.3,\n",
    "                        eta=0.1,\n",
    "                        max_depth=5,\n",
    "                        alpha=10,\n",
    "                        num_round=100)\n",
    "\n",
    "xgb.fit({'train': s3_input_train, 'validation': s3_input_valid})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "early-bikini",
   "metadata": {},
   "source": [
    "### Deployment & test\n",
    "\n",
    "The `deploy` method deploys the trained model straight to a hosted endpoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "quiet-fitness",
   "metadata": {},
   "outputs": [],
   "source": [
    "xgb_predictor = xgb.deploy(initial_instance_count=1,\n",
    "                           instance_type='ml.m4.xlarge')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "charitable-music",
   "metadata": {},
   "source": [
    "Call `predict()` with a single record from `test_df` (here, the first row)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "worth-nightlife",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.serializers import CSVSerializer\n",
    "xgb_predictor.serializer = CSVSerializer()\n",
    "\n",
    "feat = np.array(test_df.iloc[:1,1:])\n",
    "xgb_predictor.predict(feat)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "instructional-tulsa",
   "metadata": {},
   "source": [
    "Run inference on every record in `test_df`."
   ]
  },
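  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dutch-techno",
   "metadata": {},
   "outputs": [],
   "source": [
    "def predict(feat_array):\n",
    "    \"\"\"Send each feature row to the endpoint and return the predictions as floats.\"\"\"\n",
    "    predictions = []\n",
    "    for array in feat_array:\n",
    "        # The endpoint returns the prediction as UTF-8 encoded bytes.\n",
    "        predictions.append(float(xgb_predictor.predict(array).decode('utf-8')))\n",
    "    return predictions"
   ]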
  },
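  {
   "cell_type": "markdown",
   "id": "parse-response-note",
   "metadata": {},
   "source": [
    "Calling the endpoint once per row is simple but slow for larger test sets. One possible alternative is to send several rows in a single request and parse the delimited response. The cell below sketches only the parsing step and runs without an endpoint: the `parse_predictions` helper and the sample payloads are hypothetical, and the sketch assumes the response is UTF-8 text with predictions separated by commas or newlines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "parse-response-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "def parse_predictions(payload):\n",
    "    \"\"\"Parse a comma- or newline-separated byte payload into a list of floats.\"\"\"\n",
    "    text = payload.decode('utf-8')\n",
    "    # Treat newlines as commas, then drop empty tokens left by trailing delimiters.\n",
    "    tokens = text.replace('\\n', ',').split(',')\n",
    "    return [float(tok) for tok in tokens if tok.strip()]\n",
    "\n",
    "# Hypothetical payloads standing in for endpoint responses (assumed format).\n",
    "print(parse_predictions(b'24.5'))\n",
    "print(parse_predictions(b'24.5,19.1,30.2'))\n",
    "print(parse_predictions(b'24.5\\n19.1\\n'))"
   ]
  },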
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "wrapped-force",
   "metadata": {},
   "outputs": [],
   "source": [
    "feats = np.array(test_df.iloc[:,1:])\n",
    "results = predict(feats)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "korean-reason",
   "metadata": {},
   "source": [
    "### Check the result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "endless-newport",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_squared_error\n",
    "\n",
    "rmse = np.sqrt(mean_squared_error(test_df['y'], results))\n",
    "print(\"RMSE: %f\" % (rmse))"
   ]
  },
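  {
   "cell_type": "markdown",
   "id": "manual-rmse-note",
   "metadata": {},
   "source": [
    "RMSE is the square root of the mean squared difference between the true and predicted values. As a sanity check on the formula, the cell below computes it by hand with NumPy on made-up numbers; `y_true` and `y_pred` are hypothetical values and are unrelated to the Boston results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "manual-rmse-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Hypothetical values, for illustration only.\n",
    "y_true = np.array([3.0, 5.0, 7.0])\n",
    "y_pred = np.array([2.0, 5.0, 9.0])\n",
    "\n",
    "# Differences are 1, 0, -2, so the squared errors are 1, 0, 4 and their mean is 5/3.\n",
    "manual_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))\n",
    "print(\"RMSE: %f\" % manual_rmse)"
   ]
  },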
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "interstate-verse",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "plt.plot(results)\n",
    "plt.plot(np.array(test_df['y']))\n",
    "plt.legend(['pred','real'])\n",
    "plt.title('Prediction vs Real price')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eligible-shelter",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}