{ "cells": [ { "cell_type": "markdown", "id": "attended-township", "metadata": {}, "source": [ "# XGBoost simple example (SageMaker version)\n", "\n", "source : https://www.datacamp.com/community/tutorials/xgboost-in-python" ] }, { "cell_type": "markdown", "id": "romantic-extent", "metadata": {}, "source": [ "### 데이터 로드\n", "\n", "[xgboost simple 예제](warmingup1.xgboost_simple.ipynb)와 동일한 데이터셋을 사용합니다." ] }, { "cell_type": "code", "execution_count": null, "id": "retained-apparatus", "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_boston\n", "import pandas as pd\n", "import numpy as np\n", "\n", "boston = load_boston()\n", "data = pd.DataFrame(boston.data)\n", "data.columns = boston.feature_names\n", "data.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "thrown-dating", "metadata": {}, "outputs": [], "source": [ "print(boston.DESCR)" ] }, { "cell_type": "markdown", "id": "stainless-sample", "metadata": {}, "source": [ "### 학습/테스트 데이터셋 분리 & S3 데이터 업로드" ] }, { "cell_type": "code", "execution_count": null, "id": "breathing-rotation", "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "\n", "sess = sagemaker.Session()\n", "bucket = sagemaker.Session().default_bucket() # replace with an existing bucket if needed\n", "prefix = 'sagemaker/DEMO-boston-sm' # prefix used for all data stored within the bucket\n", "\n", "# Define IAM role\n", "import boto3\n", "from sagemaker import get_execution_role\n", "\n", "role = get_execution_role()" ] }, { "cell_type": "markdown", "id": "lesbian-advertiser", "metadata": {}, "source": [ "SageMaker 에서 제공하는 XGBoost를 사용하기 위해 첫번째 컬럼에 레이블이 오도록 데이터셋을 생성하고 S3에 업로드합니다. " ] }, { "cell_type": "code", "execution_count": null, "id": "flush-scheduling", "metadata": {}, "outputs": [], "source": [ "data['y'] = boston.target\n", "train_df, valid_df, test_df = np.split(pd.concat([data['y'],data.iloc[:,:-1]],axis=1), [int(len(data)*0.7), int(len(data)*0.9)])\n", "train_df.to_csv('boston_train.csv', index=False, header=False)\n", "valid_df.to_csv('boston_valid.csv', index=False, header=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "sexual-carol", "metadata": {}, "outputs": [], "source": [ "import os \n", "boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('boston_train.csv')\n", "boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('boston_valid.csv')" ] }, { "cell_type": "markdown", "id": "aging-bradley", "metadata": {}, "source": [ "### SageMaker XGBoost를 이용한 Regression 학습\n" ] }, { "cell_type": "code", "execution_count": null, "id": "shaped-machinery", "metadata": {}, "outputs": [], "source": [ "from sagemaker.amazon.amazon_estimator import image_uris\n", "container = image_uris.retrieve('xgboost', region=sess.boto_region_name, version='latest')\n", "\n", "s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')\n", "s3_input_valid = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')\n" ] }, { "cell_type": "markdown", "id": "subjective-minnesota", "metadata": {}, "source": [ "SageMaker를 이용하여 Cloud에서 학습을 실행합니다. (5분 정도 소요됩니다.)" ] }, { "cell_type": "code", "execution_count": null, "id": "equal-algebra", "metadata": {}, "outputs": [], "source": [ "%%time\n", "xgb = sagemaker.estimator.Estimator(container,\n", " role, \n", " instance_count=1, \n", " instance_type='ml.m4.xlarge',\n", " output_path='s3://{}/{}/output'.format(bucket, prefix),\n", " sagemaker_session=sess)\n", "xgb.set_hyperparameters(objective ='reg:linear', \n", " colsample_bytree = 0.3, \n", " learning_rate = 0.1,\n", " max_depth = 5, \n", " alpha = 10, \n", " n_estimators = 10,\n", " num_round=100)\n", "\n", "xgb.fit({'train': s3_input_train, 'validation': s3_input_valid})" ] }, { "cell_type": "markdown", "id": "early-bikini", "metadata": {}, "source": [ "### Deployment & test\n", "\n", "`deploy`명령을 이용하여 서비스환경으로 바로 배포할 수 있습니다." ] }, { "cell_type": "code", "execution_count": null, "id": "quiet-fitness", "metadata": {}, "outputs": [], "source": [ "xgb_predictor = xgb.deploy(initial_instance_count=1,\n", " instance_type='ml.m4.xlarge')" ] }, { "cell_type": "markdown", "id": "charitable-music", "metadata": {}, "source": [ "`test_df`중 임의의 레코드를 이용하여 `predict()`를 호출합니다." ] }, { "cell_type": "code", "execution_count": null, "id": "worth-nightlife", "metadata": {}, "outputs": [], "source": [ "from sagemaker.serializers import CSVSerializer\n", "xgb_predictor.serializer = CSVSerializer()\n", "\n", "feat = np.array(test_df.iloc[:1,1:])\n", "xgb_predictor.predict(feat)" ] }, { "cell_type": "markdown", "id": "instructional-tulsa", "metadata": {}, "source": [ "`test_df`전체 레코드를 이용하여 추론을 실행합니다." ] }, { "cell_type": "code", "execution_count": null, "id": "dutch-techno", "metadata": {}, "outputs": [], "source": [ "def predict(feat_array):\n", " predictions = []\n", " for array in feat_array:\n", " predictions.append(float(xgb_predictor.predict(array).decode('utf-8')))\n", " return predictions" ] }, { "cell_type": "code", "execution_count": null, "id": "wrapped-force", "metadata": {}, "outputs": [], "source": [ "feats = np.array(test_df.iloc[:,1:])\n", "results = predict(feats)\n" ] }, { "cell_type": "markdown", "id": "korean-reason", "metadata": {}, "source": [ "### Check the result" ] }, { "cell_type": "code", "execution_count": null, "id": "endless-newport", "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_squared_error\n", "\n", "rmse = np.sqrt(mean_squared_error(test_df['y'], results))\n", "print(\"RMSE: %f\" % (rmse))" ] }, { "cell_type": "code", "execution_count": null, "id": "interstate-verse", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.plot(results)\n", "plt.plot(np.array(test_df['y']))\n", "plt.legend(['pred','real'])\n", "plt.title('Prediction vs Real price')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "eligible-shelter", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 5 }