{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multi-process ensemble\n",
"\n",
"This notebook walks through training your data with several different models and ensembling them to build the best-performing model.\n",
"To run it, you must have the train, test, and validation datasets prepared. \n",
"You will use the SageMaker Search feature to find the best-performing models and run batch inference jobs for them in parallel."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# !pip install sagemaker==1.72.0 -U"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import multiprocessing\n",
"from multiprocessing import Pool\n",
"\n",
"import boto3\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"import sagemaker\n",
"from sagemaker import get_execution_role\n",
"from sagemaker.amazon.amazon_estimator import get_image_uri\n",
"from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# the default SageMaker bucket is used here; replace with an existing bucket if needed\n",
"bucket = sagemaker.Session().default_bucket()\n",
"prefix = 'sagemaker/DEMO-xgboost-dm' # prefix used for all data stored within the bucket\n",
"\n",
"sess = sagemaker.Session()\n",
"role = get_execution_role()\n",
"client = boto3.client('sagemaker')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Upload the training and test datasets\n",
"\n",
"The first column of each dataset must contain the target label. If you do not have the training and test datasets yet, create them with the following notebook from this example: \n",
"- [xgboost_direct_marketing_sagemaker.ipynb](xgboost_direct_marketing_sagemaker.ipynb) \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When using the files created in the lab above, **check the local path below according to your SageMaker notebook environment.**"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"# For a SageMaker notebook instance, use the path below.\n",
"# os.chdir('/home/ec2-user/SageMaker/xgboost/')\n",
"\n",
"# For SageMaker Studio, use the path below.\n",
"# os.chdir('/root/xgboost/')"
]
},
{
"cell_type": "code",
"execution_count": 152,
"metadata": {},
"outputs": [],
"source": [
"# !head train.csv"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [],
"source": [
"train = pd.read_csv('train.csv', names = list(range(89)))\n",
"validation = pd.read_csv('validation.csv', names = list(range(89)))\n",
"test = pd.read_csv('test.csv', names = list(range(89)))"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [],
"source": [
"train_labels = np.array(train[0]).astype(\"float32\")\n",
"train_features = np.array(train.drop(0, axis=1)).astype(\"float32\")\n",
"val_labels = np.array(validation[0]).astype(\"float32\")\n",
"val_features = np.array(validation.drop(0, axis=1)).astype(\"float32\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Define functions for training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- A function that takes an algorithm name and returns a SageMaker Estimator (starting from the base estimator and adding the hyperparameters each algorithm needs)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"def get_base_estimator(clf, sess, role):\n",
"\n",
" container = get_image_uri(boto3.Session().region_name, clf)\n",
"\n",
" est = sagemaker.estimator.Estimator(container,\n",
" role, \n",
" train_instance_count=1, \n",
" train_instance_type='ml.m4.xlarge',\n",
" output_path='s3://{}/{}/output'.format(bucket, clf),\n",
" sagemaker_session=sess)\n",
" return est"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"def get_estimator(clf, sess, role):\n",
" \n",
" if clf == 'xgboost':\n",
" est = get_base_estimator(clf, sess, role)\n",
" est.set_hyperparameters(max_depth=5,\n",
" eta=0.2,\n",
" gamma=4,\n",
" min_child_weight=6,\n",
" subsample=0.8,\n",
" silent=0,\n",
" objective='binary:logistic',\n",
" num_round=100)\n",
" \n",
" elif clf == 'linear-learner':\n",
" est = sagemaker.LinearLearner(role=sagemaker.get_execution_role(),\n",
" train_instance_count=1,\n",
" train_instance_type='ml.m4.xlarge',\n",
" predictor_type='binary_classifier',\n",
" num_classes=2)\n",
"\n",
" elif clf == 'knn':\n",
" est = sagemaker.KNN(role=sagemaker.get_execution_role(),\n",
" k = 10,\n",
" train_instance_count=1,\n",
" train_instance_type='ml.m4.xlarge',\n",
" predictor_type='classifier',\n",
" sample_size = 200)\n",
" \n",
" elif clf == 'factorization-machines':\n",
" est = sagemaker.FactorizationMachines(role=sagemaker.get_execution_role(),\n",
" train_instance_count=1,\n",
" train_instance_type='ml.m4.xlarge',\n",
" predictor_type='binary_classifier',\n",
" num_factors = 2)\n",
" \n",
" return est"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Upload the train, test, and validation datasets to S3 (the same location used in the previous lab)"
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {},
"outputs": [],
"source": [
"def copy_to_s3(bucket):\n",
" os.system('aws s3 cp train.csv s3://{}/{}/train/train.csv'.format(bucket, prefix))\n",
" os.system('aws s3 cp validation.csv s3://{}/{}/validation/validation.csv'.format(bucket, prefix))\n",
" os.system('aws s3 cp test.csv s3://{}/{}/test/test.csv'.format(bucket, prefix))\n",
" os.system('aws s3 cp test_features.csv s3://{}/{}/test/test_features.csv'.format(bucket, prefix))\n",
" \n",
"copy_to_s3(bucket)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- A function that defines and returns the tuner for each algorithm's HPO job"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"def get_tuner(clf, est):\n",
" \n",
" # this should search for the most recent hyperparameter tuning job, pull it in, and use for a warm start\n",
" \n",
" if clf == 'xgboost':\n",
" objective_metric_name = 'validation:auc'\n",
"\n",
" hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),\n",
" 'min_child_weight': ContinuousParameter(1, 10),\n",
" 'alpha': ContinuousParameter(0, 2),\n",
" 'max_depth': IntegerParameter(1, 10)}\n",
" \n",
" elif clf == 'knn':\n",
" \n",
" objective_metric_name = 'test:accuracy'\n",
"\n",
" hyperparameter_ranges = {'k': IntegerParameter(1, 1024),\n",
" 'sample_size': IntegerParameter(256, 20000000)}\n",
" \n",
" elif clf == 'linear-learner':\n",
" objective_metric_name = 'test:recall'\n",
" \n",
" hyperparameter_ranges = {'l1': ContinuousParameter(0.0000001,1),\n",
" 'use_bias': CategoricalParameter([True, False])}\n",
" \n",
" elif clf == 'factorization-machines':\n",
" objective_metric_name = 'test:binary_classification_accuracy'\n",
" \n",
" hyperparameter_ranges = {'bias_wd': IntegerParameter(1, 1000)}\n",
" \n",
" tuner = HyperparameterTuner(est,\n",
" objective_metric_name,\n",
" hyperparameter_ranges,\n",
" max_jobs=30,\n",
" max_parallel_jobs=3)\n",
" \n",
" return tuner"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- A function that configures and launches the hyperparameter tuning job for each algorithm"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"def run_training_job(clf):\n",
"\n",
" \n",
" # this should loop through splits in k-fold cross validation\n",
" \n",
" # build the estimator\n",
" est = get_estimator(clf, sess, role)\n",
"\n",
" # get the hyperparameter tuner config \n",
" if clf == 'xgboost':\n",
" \n",
" tuner = get_tuner(clf, est)\n",
" \n",
" \n",
" tuner.fit({'train': s3_input_train, 'validation': s3_input_validation}) \n",
"\n",
" else:\n",
" # set the records\n",
" train_records = est.record_set(train_features, train_labels, channel='train')\n",
" test_records = est.record_set(val_features, val_labels, channel='validation')\n",
"\n",
" tuner = get_tuner(clf, est)\n",
" \n",
" tuner.fit([train_records, test_records])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- A function that takes a list of algorithms and runs their training jobs in parallel"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"def magic_loop(models_to_run):\n",
" pool = Pool(processes=multiprocessing.cpu_count())\n",
" transformed_rows = pool.map(run_training_job, models_to_run)\n",
" pool.close() \n",
" pool.join()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Define the training and validation datasets"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"'s3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.\n",
"'s3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.\n"
]
}
],
"source": [
"s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')\n",
"\n",
"s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation'.format(bucket, prefix), content_type='csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Run training for each algorithm"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.\n",
"There is a more up to date SageMaker XGBoost image. To use the newer image, please set 'repo_version'='1.0-1'. For example:\n",
"\tget_image_uri(region, 'xgboost', '1.0-1').\n",
"'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.\n",
"There is a more up to date SageMaker XGBoost image. To use the newer image, please set 'repo_version'='1.0-1'. For example:\n",
"\tget_image_uri(region, 'xgboost', '1.0-1').\n",
"Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.\n"
]
}
],
"source": [
"# clfs = ['xgboost', 'linear-learner', 'factorization-machines', 'knn']\n",
"\n",
"clfs = [ 'xgboost']\n",
"\n",
"magic_loop(clfs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the SageMaker console, open the Hyperparameter tuning jobs menu and confirm that the training jobs are progressing."
]
},
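{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides the console, you can poll the same status from code. The cell below is a minimal sketch (not part of the original lab) using the `boto3` `list_hyper_parameter_tuning_jobs` API; the helper name `unfinished_tuning_jobs` is hypothetical, introduced here for illustration, and the actual API calls are commented out because they require AWS credentials."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def unfinished_tuning_jobs(summaries):\n",
"    # keep only tuning jobs that have not reached a terminal status\n",
"    return [s['HyperParameterTuningJobName'] for s in summaries\n",
"            if s['HyperParameterTuningJobStatus'] not in ('Completed', 'Failed', 'Stopped')]\n",
"\n",
"# smclient = boto3.client('sagemaker')\n",
"# summaries = smclient.list_hyper_parameter_tuning_jobs(MaxResults=10)['HyperParameterTuningJobSummaries']\n",
"# print(unfinished_tuning_jobs(summaries))"
]
},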
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Select the best-performing models\n",
"\n",
"Once training is complete, use the SageMaker Search feature to find the best-performing models among the training jobs you just ran. \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"smclient = boto3.client(service_name='sagemaker')\n",
"import datetime\n",
"\n",
"# Search the training job by Amazon S3 location of model artifacts\n",
"search_params={\n",
" \"MaxResults\": 100,\n",
" \"Resource\": \"TrainingJob\",\n",
" \"SearchExpression\": { \n",
" \"Filters\": [ \n",
" { \n",
" \"Name\": \"InputDataConfig.DataSource.S3DataSource.S3Uri\",\n",
" \"Operator\": \"Contains\",\n",
" \n",
" # set this to have a word that is in your bucket name\n",
" \"Value\": '{}/{}'.format(bucket, prefix)\n",
" },\n",
" { \n",
" \"Name\": \"TrainingJobStatus\",\n",
" \"Operator\": \"Equals\",\n",
" \"Value\": 'Completed'\n",
" }, \n",
" ],\n",
" \n",
" },\n",
" \n",
" \"SortBy\": \"Metrics.validation:auc\",\n",
" \"SortOrder\": \"Descending\"\n",
"}\n",
"results = smclient.search(**search_params)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Return the top 15 job results sorted by `Metrics.validation:auc`"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Parameter image will be renamed to image_uri in SageMaker Python SDK v2.\n"
]
}
],
"source": [
"from sagemaker.model import Model\n",
"\n",
"def get_models(results):\n",
"\n",
" role = sagemaker.get_execution_role()\n",
" models = []\n",
"\n",
" for each in results['Results']:\n",
" job_name = each['TrainingJob']['TrainingJobName']\n",
" artifact = each['TrainingJob']['ModelArtifacts']['S3ModelArtifacts']\n",
"\n",
" # get training image\n",
" image = each['TrainingJob']['AlgorithmSpecification']['TrainingImage']\n",
" m = Model(artifact, image, role = role, sagemaker_session = sess, name = job_name)\n",
" models.append(m)\n",
" \n",
" return models[:15]\n",
"\n",
"models = get_models(results)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5. Batch inference for the ensemble\n",
"\n",
"Now run a separate batch inference job for each model.\n"
]
},
{
"cell_type": "code",
"execution_count": 147,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Using already existing model: xgboost-201019-1005-021-5865b5f5\n",
"Using already existing model: xgboost-201019-1005-016-763f7a27\n",
"Using already existing model: xgboost-201019-1005-019-d7cc687a\n",
"Using already existing model: xgboost-2020-09-09-08-23-45-285\n",
"Using already existing model: xgboost-201019-1005-028-7681edf7\n",
"Using already existing model: xgboost-201019-1005-006-2738e46e\n",
"Using already existing model: xgboost-201019-1005-005-46f5941e\n",
"Using already existing model: xgboost-201019-1005-018-80900beb\n",
"Using already existing model: xgboost-201019-1005-024-9b43007a\n",
"Using already existing model: xgboost-201019-1005-004-cc50bc0b\n",
"Using already existing model: xgboost-201019-1005-029-cfd48ce3\n",
"Using already existing model: xgboost-201019-1005-015-bc230053\n",
"Using already existing model: xgboost-201019-1005-023-0948411e\n",
"Using already existing model: xgboost-201019-1005-012-be7c2277\n",
"Using already existing model: xgboost-201019-1005-020-30fb5f78\n"
]
}
],
"source": [
"def run_batch_transform(model, bucket):\n",
"\n",
" transformer = model.transformer(\n",
" instance_count=1,\n",
" instance_type='ml.m4.xlarge',\n",
" output_path='s3://{}/{}/batch_results/{}'.format(bucket, prefix, model.name)\n",
" )\n",
"\n",
" transformer.transform(data='s3://{}/{}/test/test_features.csv'.format(bucket, prefix), content_type='text/csv')\n",
"\n",
" \n",
"for model in models:\n",
" run_batch_transform(model, bucket)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 15 jobs run in parallel and take several minutes to complete. \n",
"You can monitor their progress in the SageMaker console under Inference > Batch Transform Jobs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6. Consolidate the batch inference results \n",
"\n",
"Once the batch inference jobs are complete, collect the results. We will take the maximum-confidence value among the individual predictions as the ensemble prediction and compare how well it performs against a single XGBoost model. \n",
"\n",
"The next cell copies the batch inference results stored in S3 to the local notebook environment. \n"
]
},
{
"cell_type": "code",
"execution_count": 173,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 173,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"os.system('aws s3 sync s3://{}/{}/batch_results/ batch_results/'.format(bucket, prefix))\n"
]
},
{
"cell_type": "code",
"execution_count": 174,
"metadata": {},
"outputs": [],
"source": [
"def get_dataframe():\n",
" '''\n",
" Loops through the directory on your local notebook instance where the batch results were stored, \n",
" and generates a dataframe where each column is the output from a different model.\n",
" '''\n",
" frames = []\n",
" \n",
" for sub_dir in os.listdir('batch_results'):\n",
" if '.ipynb' not in sub_dir and '.out' not in sub_dir:\n",
"\n",
" old_file = 'batch_results/{}/test.csv.out'.format(sub_dir)\n",
" \n",
" new_file = 'batch_results/{}/test.csv'.format(sub_dir)\n",
" \n",
"    # copy the .out output to a plain .csv file name\n",
" os.system('cp {} {}'.format( old_file, new_file))\n",
" \n",
" df = pd.read_csv('batch_results/{}/test.csv'.format(sub_dir), names = [sub_dir])\n",
"\n",
" frames.append(df)\n",
" \n",
" df = pd.concat(frames, axis=1)\n",
" \n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": 175,
"metadata": {},
"outputs": [],
"source": [
"def consolidate_results(df):\n",
"\n",
"    # capture the model score columns before adding the summary columns,\n",
"    # so the summary columns themselves are not included in the statistics\n",
"    model_cols = list(df.columns)\n",
"\n",
"    df['max'] = df[model_cols].max(axis=1)\n",
"    df['min'] = df[model_cols].min(axis=1)\n",
"    df['diff'] = df['max'] - df['min']\n",
"\n",
"    return df\n",
"\n",
"bare_df = get_dataframe()\n",
"consolidated_df = consolidate_results(bare_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Add a `y_true` column from the actual values in test.csv"
]
},
{
"cell_type": "code",
"execution_count": 179,
"metadata": {},
"outputs": [],
"source": [
"def add_label_to_results(df):\n",
" test_data = pd.read_csv('test.csv', header=None)\n",
" y_true = test_data[0].values.tolist()\n",
" df['y_true'] = y_true\n",
" return df\n",
" \n",
"df = add_label_to_results(consolidated_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7. Build the confusion matrix\n",
"\n",
"Finally, compare the models' performance and see whether the ensemble helped.\n"
]
},
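{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reminder of the definitions used below: precision = TP / (TP + FP) and recall = TP / (TP + FN). The quick standalone check below (not part of the original lab) uses hypothetical counts, not numbers from this lab's results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical confusion-matrix counts, for illustration only\n",
"tp, fp, fn, tn = 80, 20, 40, 860\n",
"\n",
"precision = tp / (tp + fp)                    # 0.8\n",
"recall = tp / (tp + fn)                       # ~0.667\n",
"accuracy = (tp + tn) / (tp + fp + fn + tn)    # 0.94\n",
"print(precision, recall, accuracy)"
]
},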
{
"cell_type": "code",
"execution_count": 181,
"metadata": {},
"outputs": [],
"source": [
"def get_confusion_matrix(df, model_column, accuracy=None):\n",
" \n",
" mx = pd.crosstab(index=df['y_true'], columns=np.round(df[model_column]), rownames=['actuals'], colnames=['predictions'])\n",
"\n",
" # lower right corner\n",
" tps = mx.iloc[1, 1]\n",
" # upper right corner\n",
" fps = mx.iloc[0, 1]\n",
" # lower left corner\n",
" fns = mx.iloc[1, 0]\n",
"\n",
"    precision = np.round(tps / (tps + fps), 4) * 100\n",
"    recall = np.round(tps / (tps + fns), 4) * 100\n",
" print ('Precision = {}%, Recall = {}%'.format(precision, recall))\n",
" \n",
" if accuracy:\n",
" # upper left corner \n",
" tns = mx.iloc[0, 0]\n",
" accuracy = (tps + tns) / (fns + fps + tps + tns) * 100\n",
" print ('Overall binary classification accuracy = {}%'.format(accuracy))\n",
" \n",
" return mx"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Before the ensemble (single model) "
]
},
{
"cell_type": "code",
"execution_count": 193,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'xgboost-201019-1005-006-2738e46e'"
]
},
"execution_count": 193,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"one_model = df.columns[0]\n",
"one_model"
]
},
{
"cell_type": "code",
"execution_count": 194,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Precision = 67.95%, Recall = 21.95%\n",
"Overall binary classification accuracy = 89.63340616654529%\n"
]
},
{
"data": {
"text/plain": [
"predictions 0.0 1.0\n",
"actuals \n",
"0 3586 50\n",
"1 377 106"
]
},
"execution_count": 194,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_confusion_matrix(df,one_model, accuracy=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### After the ensemble"
]
},
{
"cell_type": "code",
"execution_count": 195,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Precision = 60.440000000000005%, Recall = 28.16%\n",
"Overall binary classification accuracy = 89.41490653071133%\n"
]
},
{
"data": {
"text/plain": [
"predictions 0.0 1.0\n",
"actuals \n",
"0 3547 89\n",
"1 347 136"
]
},
"execution_count": 195,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_confusion_matrix(df, 'max', accuracy=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example, the ensemble increased recall by about 6 percentage points while precision dropped by about 7 percentage points; overall classification accuracy barely changed. \n",
"Try ensembling other machine learning models on your own dataset and compare the results.\n"
]
},
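{
"cell_type": "markdown",
"metadata": {},
"source": [
"One simple variant to try: average the model scores instead of taking the row-wise maximum. The sketch below (not part of the original lab) uses a tiny hypothetical frame; on the real results you would add a `mean` column to `df` from its model output columns and pass `'mean'` to `get_confusion_matrix`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# hypothetical scores from two models, for illustration only\n",
"scores = pd.DataFrame({'model_a': [0.2, 0.8, 0.6], 'model_b': [0.4, 0.6, 0.9]})\n",
"scores['mean'] = scores[['model_a', 'model_b']].mean(axis=1)\n",
"print(scores['mean'].tolist())\n",
"\n",
"# on the real results:\n",
"# df['mean'] = df[model_columns].mean(axis=1)\n",
"# get_confusion_matrix(df, 'mean', accuracy=True)"
]
},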
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}