# XGBoost simple example (SageMaker version)

source : https://www.datacamp.com/community/tutorials/xgboost-in-python

### Load the data

This notebook uses the same dataset as the [xgboost simple example](warmingup1.xgboost_simple.ipynb).

In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older scikit-learn.
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np

boston = load_boston()
data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
data.head()

In [None]:
print(boston.DESCR)

### Train/test split & S3 data upload

In [None]:
import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket() # replace with an existing bucket if needed
prefix = 'sagemaker/DEMO-boston-sm' # prefix used for all data stored within the bucket

# Define IAM role
import boto3
from sagemaker import get_execution_role

role = get_execution_role()

To use the XGBoost algorithm provided by SageMaker, build the dataset so that the label is in the first column, then upload it to S3.

In [None]:
data['y'] = boston.target
train_df, valid_df, test_df = np.split(pd.concat([data['y'],data.iloc[:,:-1]],axis=1), [int(len(data)*0.7), int(len(data)*0.9)])
train_df.to_csv('boston_train.csv', index=False, header=False)
valid_df.to_csv('boston_valid.csv', index=False, header=False)
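
The `np.split` call above cuts the rows at 70% and 90% of the dataset, which gives train, validation, and test partitions in row order without shuffling. As a rough sketch of that arithmetic, the hypothetical helper `split_sizes` below (not part of the original notebook) reproduces the cut points in plain Python, assuming the 506-row Boston dataset:

```python
def split_sizes(n_rows, cuts=(0.7, 0.9)):
    """Return the partition sizes obtained by cutting n_rows at the given fractions."""
    # Same cut points as the notebook: int(len(data) * 0.7), int(len(data) * 0.9)
    bounds = [0] + [int(n_rows * c) for c in cuts] + [n_rows]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


# 506 is the number of rows in the Boston housing dataset.
sizes = split_sizes(506)
print(sizes)
```

For 506 rows this gives 354 training rows, 101 validation rows, and 51 test rows.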

In [None]:
import os 
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('boston_train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('boston_valid.csv')
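
`os.path.join` produces forward slashes on a Linux notebook instance, but S3 keys use `/` regardless of platform. A minimal sketch, assuming hypothetical helpers `s3_key` and `s3_uri` (not in the original notebook), builds keys with `posixpath` so the result is the same everywhere. The bucket name is a placeholder:

```python
import posixpath


def s3_key(prefix, *parts):
    """Join key components with '/' on every platform."""
    return posixpath.join(prefix, *parts)


def s3_uri(bucket, key):
    """Format a bucket and key as an s3:// URI."""
    return "s3://{}/{}".format(bucket, key)


prefix = "sagemaker/DEMO-boston-sm"
train_key = s3_key(prefix, "train", "train.csv")
print(s3_uri("my-example-bucket", train_key))  # placeholder bucket name
```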

### Regression training with SageMaker XGBoost


In [None]:
from sagemaker import image_uris
container = image_uris.retrieve('xgboost', region=sess.boto_region_name, version='latest')

s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_valid = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')


Run the training job in the cloud with SageMaker. (This takes about 5 minutes.)

In [None]:
%%time
xgb = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    sagemaker_session=sess,
)
xgb.set_hyperparameters(
    objective='reg:linear',
    colsample_bytree=0.3,
    learning_rate=0.1,
    max_depth=5,
    alpha=10,
    n_estimators=10,
    num_round=100,
)

xgb.fit({'train': s3_input_train, 'validation': s3_input_valid})

### Deployment & test

The `deploy` command deploys the trained model directly to a hosted endpoint.

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count=1,
 instance_type='ml.m4.xlarge')

Call `predict()` with a single record from `test_df`.

In [None]:
from sagemaker.serializers import CSVSerializer
xgb_predictor.serializer = CSVSerializer()

feat = np.array(test_df.iloc[:1,1:])
xgb_predictor.predict(feat)
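
With `CSVSerializer`, the feature row is sent to the endpoint as comma-separated text. The sketch below only illustrates the shape of such a payload. `to_csv_payload` is a hypothetical helper, not the SDK's implementation, and the feature values are made up:

```python
def to_csv_payload(rows):
    """Join each row's values with commas and separate rows with newlines."""
    return "\n".join(",".join(str(v) for v in row) for row in rows)


# Two made-up feature rows; the label column is not included.
payload = to_csv_payload([[0.1, 18.0, 2.31], [0.2, 0.0, 7.07]])
print(payload)
```

The label column is left out of the payload, which is what the `iloc[:1,1:]` slice does for `test_df`.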

Run inference on every record in `test_df`.

In [None]:
def predict(feat_array):
    # Call the endpoint once per row and collect the decoded float predictions.
    predictions = []
    for array in feat_array:
        predictions.append(float(xgb_predictor.predict(array).decode('utf-8')))
    return predictions

In [None]:
feats = np.array(test_df.iloc[:,1:])
results = predict(feats)
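
The endpoint returns each prediction as UTF-8 bytes, which is why the loop above decodes and converts every response. If several rows are sent in one request, the response holds several values. The separator may depend on the container version, so the hypothetical `parse_predictions` helper below accepts both commas and newlines as an assumption:

```python
import re


def parse_predictions(response):
    """Convert an endpoint response (bytes or str) into a list of floats."""
    text = response.decode("utf-8") if isinstance(response, bytes) else response
    # Split on commas and any whitespace, ignoring empty tokens.
    return [float(tok) for tok in re.split(r"[,\s]+", text.strip()) if tok]


print(parse_predictions(b"24.5"))
print(parse_predictions(b"1.5,2.0\n3.25"))
```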


### Check the result

In [None]:
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(test_df['y'], results))
print("RMSE: %f" % (rmse))
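
RMSE is the square root of the mean squared difference between the actual and predicted values. For reference, a small plain-Python sketch of the same computation, using made-up numbers (`rmse_plain` is an illustrative name, not part of the notebook):

```python
import math


def rmse_plain(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: only the last prediction is off, by 2.
print(rmse_plain([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```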

In [None]:
import matplotlib.pyplot as plt
plt.plot(results)
plt.plot(np.array(test_df['y']))
plt.legend(['pred','real'])
plt.title('Prediction vs Real price')
plt.show()