{
"cells": [
{
"cell_type": "markdown",
"id": "682f0f6a",
"metadata": {},
"source": [
"# Train Binary Classifier using AutoGluon\n",
"\n",
"AutoGluon은 `fit()` 함수 호출만으로 상당히 높은 정확도의 모델을 생성하며, 최신 기능들을 계속 업데이트되고 있습니다. \n",
"본 핸즈온에서는 개인 소득이 $50k를 초과하는지 여부를 예측하는 이진 분류 모델을 AutoGluon으로 훈련해 보겠습니다."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5668c59a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'1.10.0'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%load_ext autoreload\n",
"%autoreload 2\n",
"import torch\n",
"torch.__version__"
]
},
{
"cell_type": "markdown",
"id": "d05012f5",
"metadata": {},
"source": [
"## 1. Quick Start\n",
"\n",
"Toy example로 AutoGluon의 저장소에 있는 CSV 데이터셋을 로드하고, 핸즈온을 위해 500건의 데이터만 샘플링합니다.\n",
"\n",
"참고로 TabularDataset은 pandas 데이터프레임과 호환되기 때문에, TabularDataset을 데이터프레임으로 변환하거나 데이터프레임을 TabularDataset로 자유롭게 변환할 수 있습니다."
]
},
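{
"cell_type": "markdown",
"id": "0a1b2c3d",
"metadata": {},
"source": [
"Since `TabularDataset` is a subclass of the pandas DataFrame, the round-trip below is a minimal sketch of that interchangeability (the `train.csv` path is only a placeholder):\n",
"```python\n",
"import pandas as pd\n",
"from autogluon.tabular import TabularDataset\n",
"\n",
"ds = TabularDataset('train.csv')   # placeholder path: any CSV file or URL\n",
"df = pd.DataFrame(ds)              # TabularDataset -> plain pandas DataFrame\n",
"ds2 = TabularDataset(df)           # pandas DataFrame -> TabularDataset\n",
"```"
]
},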
{
"cell_type": "markdown",
"id": "af006c68",
"metadata": {},
"source": [
"### Data preparation"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4f197d4a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Summary of class variable: \n",
" count 500\n",
"unique 2\n",
"top <=50K\n",
"freq 365\n",
"Name: class, dtype: object\n"
]
}
],
"source": [
"from autogluon.tabular import TabularDataset, TabularPredictor\n",
"train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')\n",
"test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')\n",
"\n",
"subsample_size = 500 # subsample subset of data for faster demo, try setting this to much larger values\n",
"train_data = train_data.sample(n=subsample_size, random_state=0)\n",
"\n",
"label = 'class'\n",
"print(\"Summary of class variable: \\n\", train_data[label].describe())\n",
"\n",
"y_test = test_data[label] # values to predict\n",
"test_data_nolab = test_data.drop(columns=[label]) # delete label column to prove we're not cheating"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "53d8ee81",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
age
\n",
"
workclass
\n",
"
fnlwgt
\n",
"
education
\n",
"
education-num
\n",
"
marital-status
\n",
"
occupation
\n",
"
relationship
\n",
"
race
\n",
"
sex
\n",
"
capital-gain
\n",
"
capital-loss
\n",
"
hours-per-week
\n",
"
native-country
\n",
"
class
\n",
"
\n",
" \n",
" \n",
"
\n",
"
6118
\n",
"
51
\n",
"
Private
\n",
"
39264
\n",
"
Some-college
\n",
"
10
\n",
"
Married-civ-spouse
\n",
"
Exec-managerial
\n",
"
Wife
\n",
"
White
\n",
"
Female
\n",
"
0
\n",
"
0
\n",
"
40
\n",
"
United-States
\n",
"
>50K
\n",
"
\n",
"
\n",
"
23204
\n",
"
58
\n",
"
Private
\n",
"
51662
\n",
"
10th
\n",
"
6
\n",
"
Married-civ-spouse
\n",
"
Other-service
\n",
"
Wife
\n",
"
White
\n",
"
Female
\n",
"
0
\n",
"
0
\n",
"
8
\n",
"
United-States
\n",
"
<=50K
\n",
"
\n",
"
\n",
"
29590
\n",
"
40
\n",
"
Private
\n",
"
326310
\n",
"
Some-college
\n",
"
10
\n",
"
Married-civ-spouse
\n",
"
Craft-repair
\n",
"
Husband
\n",
"
White
\n",
"
Male
\n",
"
0
\n",
"
0
\n",
"
44
\n",
"
United-States
\n",
"
<=50K
\n",
"
\n",
"
\n",
"
18116
\n",
"
37
\n",
"
Private
\n",
"
222450
\n",
"
HS-grad
\n",
"
9
\n",
"
Never-married
\n",
"
Sales
\n",
"
Not-in-family
\n",
"
White
\n",
"
Male
\n",
"
0
\n",
"
2339
\n",
"
40
\n",
"
El-Salvador
\n",
"
<=50K
\n",
"
\n",
"
\n",
"
33964
\n",
"
62
\n",
"
Private
\n",
"
109190
\n",
"
Bachelors
\n",
"
13
\n",
"
Married-civ-spouse
\n",
"
Exec-managerial
\n",
"
Husband
\n",
"
White
\n",
"
Male
\n",
"
15024
\n",
"
0
\n",
"
40
\n",
"
United-States
\n",
"
>50K
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age workclass fnlwgt education education-num \\\n",
"6118 51 Private 39264 Some-college 10 \n",
"23204 58 Private 51662 10th 6 \n",
"29590 40 Private 326310 Some-college 10 \n",
"18116 37 Private 222450 HS-grad 9 \n",
"33964 62 Private 109190 Bachelors 13 \n",
"\n",
" marital-status occupation relationship race sex \\\n",
"6118 Married-civ-spouse Exec-managerial Wife White Female \n",
"23204 Married-civ-spouse Other-service Wife White Female \n",
"29590 Married-civ-spouse Craft-repair Husband White Male \n",
"18116 Never-married Sales Not-in-family White Male \n",
"33964 Married-civ-spouse Exec-managerial Husband White Male \n",
"\n",
" capital-gain capital-loss hours-per-week native-country class \n",
"6118 0 0 40 United-States >50K \n",
"23204 0 0 8 United-States <=50K \n",
"29590 0 0 44 United-States <=50K \n",
"18116 0 2339 40 El-Salvador <=50K \n",
"33964 15024 0 40 United-States >50K "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_data.head(5)"
]
},
{
"cell_type": "markdown",
"id": "e4071fb7",
"metadata": {},
"source": [
"### Training\n",
"\n",
"여러분은 단 한 줄의 코드로 쉽게 AutoML을 수행할 수 있습니다.\n",
"문제 유형을 지정하지 않아도 자동으로 문제 유형을 파악하며, 별도의 피쳐 인코딩/변환을 수행하지 않아도 되며 결측치 또한 자동으로 처리할 수 있습니다.\n",
"이 때, 별도의 검증 데이터셋을 지정하지 않으면, AutoGluon은 데이터의 자동으로 훈련 데이터/검증 데이터를 분리합니다.\n",
"\n",
"AutoGluon은 검증 데이터에서 최고의 성능을 내기 위한 다양한 조합을 반복적으로 수행하며, `fit()`에서 다양한 인수를 지정하여 자유롭게 AutoML을 수행할 수 있습니다.\n",
"\n",
"참고로, Titanic 데이터셋에 아래 설정으로 훈련 수행 시 Kaggle Private LB 기준으로 Top 3%의 accuracy를 보입니다.\n",
"(https://www.kaggle.com/innixma/top-3-in-1-line-of-code-w-autogluon)\n",
"```python\n",
"# Titanic Top 3% accuracy\n",
"predictor = TabularPredictor(label=label).fit(train_data, time_limit=3600, num_bag_folds=8)\n",
"```\n",
"\n",
"#### Tip \n",
"GPU가 있다면, AutoGluon 0.1부터 LightGBM, CatBoost, XGBoost, NN, FastAI NN 모델 훈련 시 GPU로 훈련이 가능합니다.\n",
"```python\n",
"predictor = TabularPredictor(..).fit(..., ag_args_fit={'num_gpus': 1})\n",
"```"
]
},
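{
"cell_type": "markdown",
"id": "1b2c3d4e",
"metadata": {},
"source": [
"If you prefer to be explicit rather than rely on automatic inference, both the problem type and the evaluation metric can be fixed when the predictor is created. A minimal sketch (the `'f1'` metric is only an illustrative choice):\n",
"```python\n",
"predictor = TabularPredictor(label=label, problem_type='binary', eval_metric='f1').fit(\n",
"    train_data, time_limit=60\n",
")\n",
"```"
]
},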
{
"cell_type": "code",
"execution_count": 4,
"id": "7d8d1265",
"metadata": {},
"outputs": [],
"source": [
"!rm -rf ag-01-binary ag-01-binary-hpo"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "10724357",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Beginning AutoGluon training ... Time limit = 60s\n",
"AutoGluon will save models to \"ag-01-binary/\"\n",
"AutoGluon Version: 0.5.2\n",
"Python Version: 3.8.12\n",
"Operating System: Linux\n",
"Train Data Rows: 500\n",
"Train Data Columns: 14\n",
"Label Column: class\n",
"Preprocessing data ...\n",
"AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).\n",
"\t2 unique label values: [' >50K', ' <=50K']\n",
"\tIf 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])\n",
"Selected class <--> label mapping: class 1 = >50K, class 0 = <=50K\n",
"\tNote: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.\n",
"\tTo explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.\n",
"Using Feature Generators to preprocess the data ...\n",
"Fitting AutoMLPipelineFeatureGenerator...\n",
"\tAvailable Memory: 13846.6 MB\n",
"\tTrain Data (Original) Memory Usage: 0.29 MB (0.0% of available memory)\n",
"\tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n",
"\tStage 1 Generators:\n",
"\t\tFitting AsTypeFeatureGenerator...\n",
"\t\t\tNote: Converting 1 features to boolean dtype as they only contain 2 unique values.\n",
"\tStage 2 Generators:\n",
"\t\tFitting FillNaFeatureGenerator...\n",
"\tStage 3 Generators:\n",
"\t\tFitting IdentityFeatureGenerator...\n",
"\t\tFitting CategoryFeatureGenerator...\n",
"\t\t\tFitting CategoryMemoryMinimizeFeatureGenerator...\n",
"\tStage 4 Generators:\n",
"\t\tFitting DropUniqueFeatureGenerator...\n",
"\tTypes of features in original data (raw dtype, special dtypes):\n",
"\t\t('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]\n",
"\t\t('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]\n",
"\tTypes of features in processed data (raw dtype, special dtypes):\n",
"\t\t('category', []) : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]\n",
"\t\t('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]\n",
"\t\t('int', ['bool']) : 1 | ['sex']\n",
"\t0.1s = Fit runtime\n",
"\t14 features in original data used to generate 14 features in processed data.\n",
"\tTrain Data (Processed) Memory Usage: 0.03 MB (0.0% of available memory)\n",
"Data preprocessing and feature engineering runtime = 0.12s ...\n",
"AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'\n",
"\tTo change this, specify the eval_metric parameter of Predictor()\n",
"Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 400, Val Rows: 100\n",
"Fitting 13 L1 models ...\n",
"Fitting model: KNeighborsUnif ... Training model for up to 59.88s of the 59.88s of remaining time.\n",
"\t0.73\t = Validation score (accuracy)\n",
"\t0.01s\t = Training runtime\n",
"\t0.01s\t = Validation runtime\n",
"Fitting model: KNeighborsDist ... Training model for up to 59.87s of the 59.87s of remaining time.\n",
"\t0.65\t = Validation score (accuracy)\n",
"\t0.0s\t = Training runtime\n",
"\t0.0s\t = Validation runtime\n",
"Fitting model: LightGBMXT ... Training model for up to 59.86s of the 59.85s of remaining time.\n",
"\t0.83\t = Validation score (accuracy)\n",
"\t0.89s\t = Training runtime\n",
"\t0.02s\t = Validation runtime\n",
"Fitting model: LightGBM ... Training model for up to 58.94s of the 58.94s of remaining time.\n",
"\t0.85\t = Validation score (accuracy)\n",
"\t0.27s\t = Training runtime\n",
"\t0.01s\t = Validation runtime\n",
"Fitting model: RandomForestGini ... Training model for up to 58.65s of the 58.65s of remaining time.\n",
"\t0.84\t = Validation score (accuracy)\n",
"\t0.63s\t = Training runtime\n",
"\t0.07s\t = Validation runtime\n",
"Fitting model: RandomForestEntr ... Training model for up to 57.94s of the 57.94s of remaining time.\n",
"\t0.83\t = Validation score (accuracy)\n",
"\t0.55s\t = Training runtime\n",
"\t0.07s\t = Validation runtime\n",
"Fitting model: CatBoost ... Training model for up to 57.3s of the 57.3s of remaining time.\n",
"\t0.85\t = Validation score (accuracy)\n",
"\t1.32s\t = Training runtime\n",
"\t0.01s\t = Validation runtime\n",
"Fitting model: ExtraTreesGini ... Training model for up to 55.97s of the 55.96s of remaining time.\n",
"\t0.82\t = Validation score (accuracy)\n",
"\t0.6s\t = Training runtime\n",
"\t0.06s\t = Validation runtime\n",
"Fitting model: ExtraTreesEntr ... Training model for up to 55.29s of the 55.29s of remaining time.\n",
"\t0.81\t = Validation score (accuracy)\n",
"\t0.52s\t = Training runtime\n",
"\t0.06s\t = Validation runtime\n",
"Fitting model: NeuralNetFastAI ... Training model for up to 54.69s of the 54.68s of remaining time.\n",
"\t0.82\t = Validation score (accuracy)\n",
"\t1.67s\t = Training runtime\n",
"\t0.02s\t = Validation runtime\n",
"Fitting model: XGBoost ... Training model for up to 52.98s of the 52.98s of remaining time.\n",
"\t0.87\t = Validation score (accuracy)\n",
"\t0.44s\t = Training runtime\n",
"\t0.04s\t = Validation runtime\n",
"Fitting model: NeuralNetTorch ... Training model for up to 52.48s of the 52.48s of remaining time.\n",
"\t0.85\t = Validation score (accuracy)\n",
"\t2.5s\t = Training runtime\n",
"\t0.02s\t = Validation runtime\n",
"Fitting model: LightGBMLarge ... Training model for up to 49.96s of the 49.96s of remaining time.\n",
"\t0.83\t = Validation score (accuracy)\n",
"\t0.55s\t = Training runtime\n",
"\t0.02s\t = Validation runtime\n",
"Fitting model: WeightedEnsemble_L2 ... Training model for up to 59.88s of the 49.37s of remaining time.\n",
"\t0.87\t = Validation score (accuracy)\n",
"\t0.33s\t = Training runtime\n",
"\t0.0s\t = Validation runtime\n",
"AutoGluon training complete, total runtime = 10.97s ... Best model: \"WeightedEnsemble_L2\"\n",
"TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"ag-01-binary/\")\n"
]
}
],
"source": [
"# time_limit=60 -> 60sec = 1min\n",
"save_path = 'ag-01-binary'\n",
"predictor = TabularPredictor(label=label, path=save_path).fit(train_data, time_limit=60)"
]
},
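{
"cell_type": "markdown",
"id": "2c3d4e5f",
"metadata": {},
"source": [
"As the log above notes, the fitted predictor is persisted to `save_path`, so it can be reloaded later without retraining:\n",
"```python\n",
"predictor = TabularPredictor.load(save_path)\n",
"```"
]
},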
{
"cell_type": "markdown",
"id": "15ecfc25",
"metadata": {},
"source": [
"### Evaluation"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "12e236ee",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AutoGluon infers problem type is: binary\n",
"AutoGluon identified the following types of features:\n",
"('category', []) : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]\n",
"('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]\n",
"('int', ['bool']) : 1 | ['sex']\n"
]
}
],
"source": [
"print(\"AutoGluon infers problem type is: \", predictor.problem_type)\n",
"print(\"AutoGluon identified the following types of features:\")\n",
"print(predictor.feature_metadata)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "948fc6e3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 <=50K\n",
"1 <=50K\n",
"2 <=50K\n",
"3 <=50K\n",
"4 <=50K\n",
"Name: class, dtype: object"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_pred = predictor.predict(test_data_nolab)\n",
"y_pred.head()"
]
},
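{
"cell_type": "markdown",
"id": "3d4e5f6a",
"metadata": {},
"source": [
"With the held-out labels `y_test` that we set aside earlier, the predictions can now be scored. A minimal sketch using the predictor's built-in evaluation:\n",
"```python\n",
"perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)\n",
"print(perf)\n",
"```"
]
},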
{
"cell_type": "code",
"execution_count": 8,
"id": "be545d84",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" model score_val pred_time_val fit_time \\\n",
"0 XGBoost 0.87 0.044890 0.442614 \n",
"1 WeightedEnsemble_L2 0.87 0.045413 0.770176 \n",
"2 CatBoost 0.85 0.010740 1.317234 \n",
"3 LightGBM 0.85 0.014454 0.265227 \n",
"4 NeuralNetTorch 0.85 0.018401 2.496524 \n",
"5 RandomForestGini 0.84 0.066613 0.629995 \n",
"6 LightGBMLarge 0.83 0.017587 0.553040 \n",
"7 LightGBMXT 0.83 0.018786 0.893909 \n",
"8 RandomForestEntr 0.83 0.071608 0.549277 \n",
"9 NeuralNetFastAI 0.82 0.017581 1.669663 \n",
"10 ExtraTreesGini 0.82 0.060402 0.597492 \n",
"11 ExtraTreesEntr 0.81 0.060960 0.519994 \n",
"12 KNeighborsUnif 0.73 0.005741 0.005677 \n",
"13 KNeighborsDist 0.65 0.004648 0.004866 \n",
"\n",
" pred_time_val_marginal fit_time_marginal stack_level can_infer \\\n",
"0 0.044890 0.442614 1 True \n",
"1 0.000522 0.327562 2 True \n",
"2 0.010740 1.317234 1 True \n",
"3 0.014454 0.265227 1 True \n",
"4 0.018401 2.496524 1 True \n",
"5 0.066613 0.629995 1 True \n",
"6 0.017587 0.553040 1 True \n",
"7 0.018786 0.893909 1 True \n",
"8 0.071608 0.549277 1 True \n",
"9 0.017581 1.669663 1 True \n",
"10 0.060402 0.597492 1 True \n",
"11 0.060960 0.519994 1 True \n",
"12 0.005741 0.005677 1 True \n",
"13 0.004648 0.004866 1 True \n",
"\n",
" fit_order \n",
"0 11 \n",
"1 14 \n",
"2 7 \n",
"3 4 \n",
"4 12 \n",
"5 5 \n",
"6 13 \n",
"7 3 \n",
"8 6 \n",
"9 10 \n",
"10 8 \n",
"11 9 \n",
"12 1 \n",
"13 2 "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predictor.leaderboard(silent=True)"
]
},
{
"cell_type": "markdown",
"id": "638fedf6",
"metadata": {},
"source": [
" \n",
"\n",
"## 2. Hyperparameter Tuning\n",
"\n",
"`fit()`은 기본적으로 신경망과 다양한 유형의 트리 앙상블 모델을 훈련합니다. 각 모델 유형에 대해 다양한 하이퍼파라메터를 지정할 수 있으며, 각 하이퍼 파라미터에 대해 단일 고정 값을 지정하거나 고려할 하이퍼파라메터의 검색 공간(search space)을 지정할 수 있습니다. 지정하지 않은 하이퍼 파라미터는 AutoGluon에서 자동으로 선택한 기본 설정으로 유지됩니다. 또한, 특정 모델을 훈련 시 제외할 수 있습니다.\n",
"\n",
"아래 코드 셀에서는 NN(Neural Network), GBM(Gradient Boosted Tree) 모델에 대해 아래의 하이퍼파라메터 조합을 설정하여 훈련을 수행하면서,\n",
"RF(Random Forest), XT(ExtraTress)를 제외하는 예시를 보여주고 있습니다.\n",
"\n",
"\n",
"### Pre-defined presets\n",
"\n",
"AutoGluon은 사전 정의된 6종류의 프리셋을 지원하고 있으며, 각 프리셋 설정에 적합한 하이퍼파라메터 튜닝이 자동으로 수행됩니다.\n",
"- `best_quality`\n",
"- `best_quality_with_high_quality_refit`\n",
"- `high_quality_fast_inference_only_refit`\n",
"- `good_quality_faster_inference_only_refit`\n",
"- `medium_quality_faster_train`\n",
"- `optimize_for_deployment`\n",
"\n",
"best_quality의 성능이 가장 높지만, 훈련 시간이 오래 걸리고 스태킹(stacking)을 위한 디스크 공간을 많이 차지하므로, 상황에 따라 적절한 프리셋을 선택하세요."
]
},
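{
"cell_type": "markdown",
"id": "4e5f6a7b",
"metadata": {},
"source": [
"Selecting a preset is a one-argument change to `fit()`. A minimal sketch (with a deliberately short time limit for this demo):\n",
"```python\n",
"predictor = TabularPredictor(label=label).fit(\n",
"    train_data, presets='best_quality', time_limit=60\n",
")\n",
"```"
]
},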
{
"cell_type": "code",
"execution_count": 13,
"id": "8328d705",
"metadata": {},
"outputs": [],
"source": [
"import autogluon.core as ag\n",
"\n",
"nn_options = { # specifies non-default hyperparameter values for neural network models\n",
" 'num_epochs': 5, # number of training epochs (controls training time of NN models)\n",
" 'learning_rate': ag.space.Real(1e-4, 1e-2, default=5e-4, log=True), # learning rate used in training (real-valued hyperparameter searched on log-scale)\n",
" 'activation': ag.space.Categorical('relu', 'softrelu'), # activation function used in NN (categorical hyperparameter, default = first entry)\n",
" 'dropout_prob': ag.space.Real(0.0, 0.5, default=0.1), # dropout probability (real-valued hyperparameter)\n",
"}\n",
"\n",
"gbm_options = { # specifies non-default hyperparameter values for lightGBM gradient boosted trees\n",
" 'num_boost_round': 100, # number of boosting rounds (controls training time of GBM models)\n",
" 'num_leaves': ag.space.Int(lower=26, upper=66, default=36), # number of leaves in trees (integer hyperparameter)\n",
"}\n",
"\n",
"hparams = { # hyperparameters of each model type\n",
" 'GBM': gbm_options,\n",
" 'NN_TORCH': nn_options, # NOTE: comment this line out if you get errors on Mac OSX\n",
"} # When these keys are missing from hyperparameters dict, no models of that type are trained\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "810fb118",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"NaN or Inf found in input tensor.\n",
"2022-08-29 10:15:01,737\tINFO stopper.py:363 -- Reached timeout of 53.971181058883666 seconds. Stopping all trials.\n"
]
}
],
"source": [
"time_limit = 2*60 # train various models for ~2 min\n",
"metric = 'accuracy'\n",
"save_path = 'ag-01-binary-hpo'\n",
"verbosity = 0\n",
"\n",
"predictor = TabularPredictor(label=label, eval_metric=metric, path=save_path, verbosity=verbosity).fit(\n",
" train_data, \n",
" #tuning_data=val_data, \n",
" time_limit=time_limit,\n",
" hyperparameters=hparams, \n",
" hyperparameter_tune_kwargs='auto',\n",
" excluded_model_types=['RF', 'XT']\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "09d17eec",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 <=50K\n",
"1 <=50K\n",
"2 >50K\n",
"3 <=50K\n",
"4 <=50K\n",
"Name: class, dtype: object"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_pred = predictor.predict(test_data_nolab)\n",
"y_pred.head()"
]
},
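{
"cell_type": "markdown",
"id": "5f6a7b8c",
"metadata": {},
"source": [
"If you need class probabilities rather than hard labels, `predict_proba` returns one column per class:\n",
"```python\n",
"y_prob = predictor.predict_proba(test_data_nolab)\n",
"y_prob.head()\n",
"```"
]
},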
{
"cell_type": "markdown",
"id": "a6eb0e26",
"metadata": {},
"source": [
"앞 섹션보다 예측 성능이 더 좋습니다. 다만, 하이퍼파라메터 튜닝을 너무 신봉하지는 마세요. 현업에서는 데이터 전처리가 훨씬 중요합니다!"
]
},
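{
"cell_type": "markdown",
"id": "6a7b8c9d",
"metadata": {},
"source": [
"To check that claim on the test set rather than on the internal validation split, the fitted predictor can be scored directly against the labeled test data (`test_data` still contains the label column):\n",
"```python\n",
"predictor.evaluate(test_data)\n",
"```"
]
},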
{
"cell_type": "code",
"execution_count": 16,
"id": "6ca04b71",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"