![MLU Logo](../data/MLU_Logo.png)

# <a name="0">Machine Learning Accelerator - Tabular Data - Lecture 3</a>


## AutoGluon

In this notebook, we use __AutoGluon__ to predict the __Outcome Type__ field of our review dataset.


[AutoGluon](https://auto.gluon.ai/stable/index.html) implements many of the best practices that we have discussed in this class, and more! In particular, it sets itself apart from other AutoML solutions with automated feature engineering that can handle text data and missing values without any hand-coded solutions (see their [paper](https://arxiv.org/abs/2003.06505) for details). It is too new to be in an existing SageMaker kernel, so let's install it.

1. <a href="#1">Set up AutoGluon</a>
2. <a href="#2">Read the dataset</a>
3. <a href="#3">Train a classifier with AutoGluon</a>
4. <a href="#4">Model evaluation</a>
5. <a href="#5">Clean up model artifacts</a>

__Austin Animal Center Dataset__:

In this exercise, we are working with pet adoption data from __Austin Animal Center__. We have two datasets that cover intake and outcome of animals. Intake data is available from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) and outcome is from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238). 

To work with a single table, we joined the intake and outcome tables on the "Animal ID" column and created a single __review_dataset.csv__ file. To keep the dataset simple, we excluded animals with multiple entries to the facility. If you want to see the original datasets and the merged data with multiple entries, they are available under the data/review folder: Austin_Animal_Center_Intakes.csv, Austin_Animal_Center_Outcomes.csv and Austin_Animal_Center_Intakes_Outcomes.csv.
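
The join described above can be sketched with pandas. This is a minimal illustration on toy data, not the script that produced the course file: the two small data frames and their values are made up, and only the "Animal ID" key comes from the description above.

```python
import pandas as pd

# Toy stand-ins for the intake and outcome tables (all values are made up).
intakes = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A2", "A3"],
    "Intake Type": ["Stray", "Stray", "Owner Surrender", "Public Assist"],
})
outcomes = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A2", "A3"],
    "Outcome Type": ["Adoption", "Return to Owner", "Adoption", "Transfer"],
})

# Keep only animals that appear exactly once in each table,
# so every animal contributes a single intake/outcome pair.
single_intake = intakes[~intakes["Animal ID"].duplicated(keep=False)]
single_outcome = outcomes[~outcomes["Animal ID"].duplicated(keep=False)]

# Inner join on the shared key column.
merged = single_intake.merge(single_outcome, on="Animal ID", how="inner")
print(merged)
```

Here animal `A2` has two intakes and two outcomes, so it is dropped before the join and only `A1` and `A3` remain in `merged`.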

__Dataset schema:__ 
- __Pet ID__ - Unique ID of pet
- __Outcome Type__ - State of pet at the time of recording the outcome (0 = not placed, 1 = placed). This is the field to predict.
- __Sex upon Outcome__ - Sex of pet at outcome
- __Name__ - Name of pet 
- __Found Location__ - Location where the pet was found before it entered the center
- __Intake Type__ - Circumstances bringing the pet to the center
- __Intake Condition__ - Health condition of pet when it entered the center
- __Pet Type__ - Type of pet
- __Sex upon Intake__ - Sex of pet when it entered the center
- __Breed__ - Breed of pet
- __Color__ - Color of pet
- __Age upon Intake Days__ - Age of pet when it entered the center (days)
- __Age upon Outcome Days__ - Age of pet at outcome (days)

## 1. <a name="1">Set up AutoGluon</a>
(<a href="#0">Go to top</a>)

In [1]:
%pip install -q -r ../requirements.txt

Note: you may need to restart the kernel to use updated packages.


## 2. <a name="2">Read the dataset</a>
(<a href="#0">Go to top</a>)

Let's read the dataset into a data frame using pandas, and split it into train and test sets (AutoGluon handles the validation split itself).

In [2]:
import pandas as pd

df = pd.read_csv('../data/review/review_dataset.csv')

In [3]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size=0.1, shuffle=True, random_state=23)

## 3. <a name="3">Train a classifier with AutoGluon</a>
(<a href="#0">Go to top</a>)

We can run AutoGluon with a short snippet. To fit, we just call the __.fit()__ function. In this exercise, we use data frame objects, but this tool also accepts raw CSV files as input. To use it with CSV files, you can follow the code snippet below.

```python
from autogluon.tabular import TabularDataset, TabularPredictor

# Pass the CSV path as the first argument; TabularDataset loads it into a data frame
train_data = TabularDataset('path_to_dataset/train.csv')
test_data = TabularDataset('path_to_dataset/test.csv')

predictor = TabularPredictor(label='label_column').fit(train_data)
test_predictions = predictor.predict(test_data)
```

We already have separate __data frames__ for training and test data, so we work with them below. We take the first 10,000 rows for a quick demo; you can also pass the full dataset.

In [4]:
from autogluon.tabular import TabularDataset, TabularPredictor

k = 10000  # grab less data for a quick demo
# k = train_data.shape[0]  # grab the whole dataset

predictor = TabularPredictor(label='Outcome Type').fit(train_data.head(k))

No path specified. Models will be saved in: "AutogluonModels/ag-20230126_225442/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230126_225442/"
AutoGluon Version:  0.6.1
Python Version:     3.9.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Fri Oct 14 01:16:24 UTC 2022
Train Data Rows:    10000
Train Data Columns: 12
Label Column: Outcome Type
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [1.0, 0.0]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    13890.67 MB
	Train Data 

We can also summarize what happened during fit.

In [5]:
predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val  pred_time_val   fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2      0.871       0.512801  26.287463                0.002498           0.902044            2       True         14
1              CatBoost      0.855       0.027408   7.169063                0.027408           7.169063            1       True          7
2               XGBoost      0.854       0.016263   7.829739                0.016263           7.829739            1       True         11
3      RandomForestGini      0.851       0.141809   2.365973                0.141809           2.365973            1       True          5
4            LightGBMXT      0.850       0.017072   2.259549                0.017072           2.259549            1       True          3
5              LightGBM      0.849       0.018099   1.154111                0.018099           1.154111 

{'model_types': {'KNeighborsUnif': 'KNNModel',
  'KNeighborsDist': 'KNNModel',
  'LightGBMXT': 'LGBModel',
  'LightGBM': 'LGBModel',
  'RandomForestGini': 'RFModel',
  'RandomForestEntr': 'RFModel',
  'CatBoost': 'CatBoostModel',
  'ExtraTreesGini': 'XTModel',
  'ExtraTreesEntr': 'XTModel',
  'NeuralNetFastAI': 'NNFastAiTabularModel',
  'XGBoost': 'XGBoostModel',
  'NeuralNetTorch': 'TabularNeuralNetTorchModel',
  'LightGBMLarge': 'LGBModel',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighborsUnif': 0.663,
  'KNeighborsDist': 0.668,
  'LightGBMXT': 0.85,
  'LightGBM': 0.849,
  'RandomForestGini': 0.851,
  'RandomForestEntr': 0.847,
  'CatBoost': 0.855,
  'ExtraTreesGini': 0.838,
  'ExtraTreesEntr': 0.84,
  'NeuralNetFastAI': 0.831,
  'XGBoost': 0.854,
  'NeuralNetTorch': 0.849,
  'LightGBMLarge': 0.847,
  'WeightedEnsemble_L2': 0.871},
 'model_best': 'WeightedEnsemble_L2',
 'model_paths': {'KNeighborsUnif': 'AutogluonModels/ag-20230126_225442/models/KN
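
The best model in the summary above is `WeightedEnsemble_L2`, which combines the predictions of the base models. As a rough sketch of the idea only, the snippet below averages positive-class probabilities with per-model weights. The probabilities and weights are hypothetical; the fit output above does not show the weights AutoGluon actually learned, and this is not AutoGluon's implementation.

```python
# Hypothetical positive-class probabilities from three base models
# for four rows (numbers are made up for illustration).
base_probs = {
    "CatBoost": [0.90, 0.20, 0.60, 0.40],
    "XGBoost": [0.80, 0.30, 0.55, 0.35],
    "RandomForestGini": [0.70, 0.10, 0.45, 0.65],
}

# Hypothetical ensemble weights: non-negative and summing to 1.
weights = {"CatBoost": 0.5, "XGBoost": 0.25, "RandomForestGini": 0.25}


def weighted_ensemble(probs, weights):
    """Return the weighted average of the per-model probabilities, row by row."""
    n_rows = len(next(iter(probs.values())))
    return [
        sum(weights[name] * probs[name][i] for name in probs)
        for i in range(n_rows)
    ]


ensemble_probs = weighted_ensemble(base_probs, weights)

# Turn probabilities into class labels with a 0.5 threshold.
ensemble_labels = [1 if p >= 0.5 else 0 for p in ensemble_probs]
print(ensemble_probs)
print(ensemble_labels)
```

Note that the last row is labeled 0 even though one base model votes for class 1, because the weighted average stays below the threshold.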

## 4. <a name="4">Model evaluation</a>
(<a href="#0">Go to top</a>)

Next, we use the held-out test set to demonstrate how to make predictions on new examples at inference time.

In [6]:
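# First, predict on the whole held-out test set so that
# y_pred and y_true always have the same length
y_pred = predictor.predict(test_data)

# Then, evaluate the predictions against the true labels
predictor.evaluate_predictions(y_true=test_data['Outcome Type'],
                               y_pred=y_pred,
                               auxiliary_metrics=True)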

Evaluation: accuracy on test data: 0.860299507801864
Evaluations on test data:
{
    "accuracy": 0.860299507801864,
    "balanced_accuracy": 0.8471309800400696,
    "mcc": 0.71864266414641,
    "f1": 0.8847217421361908,
    "precision": 0.8316815597075549,
    "recall": 0.944988000738416
}


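
To make the reported metrics concrete, here is a small sketch that computes the same six quantities from confusion-matrix counts using their standard definitions. The labels are toy values made up for illustration, and the code is not how AutoGluon computes them internally.

```python
import math

# Toy true labels and predictions (made up; 1 = placed, 0 = not placed).
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]

# Confusion-matrix counts.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)   # true negative rate

metrics = {
    "accuracy": (tp + tn) / len(pairs),
    "balanced_accuracy": (recall + specificity) / 2,
    "mcc": (tp * tn - fp * fn)
    / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    "f1": 2 * precision * recall / (precision + recall),
    "precision": precision,
    "recall": recall,
}
print(metrics)
```

Balanced accuracy averages the recall of the two classes, so it is less flattering than plain accuracy when the model favors the majority class. The gap between recall and precision in the evaluation output above points the same way: the model predicts "placed" readily and catches most placed pets at the cost of more false positives.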

We can see the performance of each individual trained model on the test data:

In [7]:
predictor.leaderboard(test_data, silent=True)

| | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L2 | 0.8603 | 0.871 | 1.531976 | 0.512801 | 26.287463 | 0.011185 | 0.002498 | 0.902044 | 2 | True | 14 |
| 1 | RandomForestEntr | 0.855692 | 0.847 | 0.338284 | 0.144485 | 2.581486 | 0.338284 | 0.144485 | 2.581486 | 1 | True | 6 |
| 2 | CatBoost | 0.854016 | 0.855 | 0.043316 | 0.027408 | 7.169063 | 0.043316 | 0.027408 | 7.169063 | 1 | True | 7 |
| 3 | RandomForestGini | 0.853493 | 0.851 | 0.4346 | 0.141809 | 2.365973 | 0.4346 | 0.141809 | 2.365973 | 1 | True | 5 |
| 4 | LightGBM | 0.850979 | 0.849 | 0.081578 | 0.018099 | 1.154111 | 0.081578 | 0.018099 | 1.154111 | 1 | True | 4 |
| 5 | XGBoost | 0.849722 | 0.854 | 0.131652 | 0.016263 | 7.829739 | 0.131652 | 0.016263 | 7.829739 | 1 | True | 11 |
| 6 | LightGBMLarge | 0.845115 | 0.847 | 0.080209 | 0.017192 | 1.960757 | 0.080209 | 0.017192 | 1.960757 | 1 | True | 13 |
| 7 | NeuralNetTorch | 0.844067 | 0.849 | 0.177099 | 0.029052 | 13.390344 | 0.177099 | 0.029052 | 13.390344 | 1 | True | 12 |
| 8 | LightGBMXT | 0.843439 | 0.85 | 0.067275 | 0.017072 | 2.259549 | 0.067275 | 0.017072 | 2.259549 | 1 | True | 3 |
| 9 | ExtraTreesEntr | 0.841345 | 0.84 | 0.452011 | 0.159155 | 2.438493 | 0.452011 | 0.159155 | 2.438493 | 1 | True | 9 |


## 5. <a name="5">Clean up model artifacts</a>
(<a href="#0">Go to top</a>)

In [8]:
!rm -r AutogluonModels