# Multimodal Training/Prediction using TabularPredictor

In this hands-on lab, we train a model on multimodal data with `TabularPredictor`.

## TextPredictor & TabularPredictor

Besides `MultiModalPredictor`, you can also train multimodal models with `TextPredictor` or `TabularPredictor`.
If, rather than relying on a transformer model alone, you want to ensemble and stack it with gradient boosted tree models such as XGBoost, LightGBM, and CatBoost, train with `TabularPredictor.fit(..., hyperparameters='multimodal')`. Note that if you train without `hyperparameters='multimodal'`, AutoGluon Tabular automatically converts text columns to N-gram features and trains only tabular models.
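
To make the N-gram fallback concrete, here is a minimal sketch of what turning a text field into N-gram count features can look like. It uses only the Python standard library and is not AutoGluon's actual feature generator; the helper name `ngram_counts` and the sample sentence are hypothetical.

```python
from collections import Counter


def ngram_counts(text, n_max=2):
    """Count word-level n-grams of length 1..n_max in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical review text, not taken from the dataset.
features = ngram_counts("great phone great battery")
print(features["great"])        # count of the unigram "great"
print(features["great phone"])  # count of the bigram "great phone"
```

Each distinct N-gram becomes one numeric column, which is why tree-based tabular models can consume text encoded this way without a transformer.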

In [1]:
import os
import torch

# Count the visible CUDA devices.
num_gpus = torch.cuda.device_count()

# Allow text-model training on CPU when no GPU is available.
if num_gpus == 0:
    os.environ['AUTOGLUON_TEXT_TRAIN_WITHOUT_GPU'] = '1'

print(num_gpus)

1


In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pprint
import random
from autogluon.tabular import TabularPredictor
import mxnet as mx

np.random.seed(123)
random.seed(123)
mx.random.seed(123)

## 1. Data preparation and Training

This lab uses the product review dataset from a MachineHack product sentiment analysis hackathon. The label is a review score of 0, 1, 2, or 3, so this is a four-class (multiclass) classification problem.

In [3]:
save_path = 'ag-02-multimodal-tabularpredictor'
!rm -rf $save_path product_sentiment

In [4]:
!mkdir -p product_sentiment
!wget https://autogluon-text-data.s3.amazonaws.com/multimodal_text/machine_hack_product_sentiment/train.csv -O product_sentiment/train.csv
!wget https://autogluon-text-data.s3.amazonaws.com/multimodal_text/machine_hack_product_sentiment/dev.csv -O product_sentiment/dev.csv
!wget https://autogluon-text-data.s3.amazonaws.com/multimodal_text/machine_hack_product_sentiment/test.csv -O product_sentiment/test.csv

--2022-08-30 00:17:50--  https://autogluon-text-data.s3.amazonaws.com/multimodal_text/machine_hack_product_sentiment/train.csv
Resolving autogluon-text-data.s3.amazonaws.com (autogluon-text-data.s3.amazonaws.com)... 52.216.24.52
Connecting to autogluon-text-data.s3.amazonaws.com (autogluon-text-data.s3.amazonaws.com)|52.216.24.52|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 689486 (673K) [text/csv]
Saving to: ‘product_sentiment/train.csv’


2022-08-30 00:17:50 (111 MB/s) - ‘product_sentiment/train.csv’ saved [689486/689486]

--2022-08-30 00:17:51--  https://autogluon-text-data.s3.amazonaws.com/multimodal_text/machine_hack_product_sentiment/dev.csv
Resolving autogluon-text-data.s3.amazonaws.com (autogluon-text-data.s3.amazonaws.com)... 52.216.24.52
Connecting to autogluon-text-data.s3.amazonaws.com (autogluon-text-data.s3.amazonaws.com)|52.216.24.52|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75517 (74K) [text/csv]
Saving to: ‘p

In [5]:
subsample_size = 1000  # for quick demo, try setting to larger values
feature_columns = ['Product_Description', 'Product_Type']
label = 'Sentiment'

train_df = pd.read_csv('product_sentiment/train.csv', index_col=0).sample(subsample_size, random_state=123)
dev_df = pd.read_csv('product_sentiment/dev.csv', index_col=0)
test_df = pd.read_csv('product_sentiment/test.csv', index_col=0)

train_df = train_df[feature_columns + [label]]
dev_df = dev_df[feature_columns + [label]]
test_df = test_df[feature_columns]
print('Number of training samples:', len(train_df))
print('Number of dev samples:', len(dev_df))
print('Number of test samples:', len(test_df))

Number of training samples: 1000
Number of dev samples: 637
Number of test samples: 2728


In [6]:
train_df.head(5)

| | Product_Description | Product_Type | Sentiment |
|---|---|---|---|
| 4532 | they took away the lego pit but replaced it wi... | 0 | 1 |
| 1831 | #Apple to Open Pop-Up Shop at #SXSW [REPORT]: ... | 9 | 2 |
| 3536 | RT @mention False Alarm: Google Circles Not Co... | 5 | 1 |
| 5157 | Will Google reveal a new social network called... | 9 | 2 |
| 4643 | Niceness RT @mention Less than 2 hours until w... | 6 | 3 |


In [7]:
from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label='Sentiment', path=save_path)
predictor.fit(train_df, hyperparameters='multimodal', time_limit=60)

Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "ag-02-multimodal-tabularpredictor/"
AutoGluon Version:  0.5.2
Python Version:     3.8.12
Operating System:   Linux
Train Data Rows:    1000
Train Data Columns: 2
Label Column: Sentiment
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	4 unique label values:  [1, 2, 3, 0]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Train Data Class Count: 4
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    10593.33 MB
	Train Data (Original)  Memory Usage: 0.17 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to m

Epoch 0, global step 3: 'val_accuracy' reached 0.58500 (best 0.58500), saving model to '/home/ec2-user/SageMaker/autogluon-on-aws/4.multimodal/ag-02-multimodal-tabularpredictor/models/TextPredictor/epoch=0-step=3.ckpt' as top 3
Epoch 0, global step 7: 'val_accuracy' reached 0.58500 (best 0.58500), saving model to '/home/ec2-user/SageMaker/autogluon-on-aws/4.multimodal/ag-02-multimodal-tabularpredictor/models/TextPredictor/epoch=0-step=7.ckpt' as top 3
Epoch 1, global step 10: 'val_accuracy' reached 0.61500 (best 0.61500), saving model to '/home/ec2-user/SageMaker/autogluon-on-aws/4.multimodal/ag-02-multimodal-tabularpredictor/models/TextPredictor/epoch=1-step=10.ckpt' as top 3
Epoch 1, global step 14: 'val_accuracy' was not in top 3
Epoch 2, global step 17: 'val_accuracy' reached 0.65500 (best 0.65500), saving model to '/home/ec2-user/SageMaker/autogluon-on-aws/4.multimodal/ag-02-multimodal-tabularpredictor/models/TextPredictor/epoch=2-step=17.ckpt' as top 3
Epoch 2, global step 21: 'val_accuracy' reached 0.81000 (best 0.81000), saving model to '/home/ec2-user/SageMaker/autogluon-on-aws/4.multimodal/ag-02-multimodal-tabularpredictor/models/TextPredictor/epoch=2-step=21.ckpt' as top 3
Epoch 3, global step 24: 'val_accuracy' reached 0.76500 (best 0.81000), saving model to '/home/ec2-user/SageMaker/autogluon-on-aws/4.multimodal/ag-02-multimodal-tabularpredictor/models/TextPredictor/epoch=3-step=24.ckpt' as top 3
Time limit reached. Elapsed time is 0:00:48. Signaling Trainer to stop.
Epoch 3, global step 26: 'val_accuracy' reached 0.82500 (best 0.82500), saving model to '/home/ec2-user/SageMaker/autogluon-on-aws/4.multimodal/ag-02-multimodal-tabularpredictor/models/TextPredictor/epoch=3-step=26.ckpt' as top 3

	0.85	 = Validation score   (accuracy)
	67.68s	 = Training   runtime
	0.6s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 59.58s of the -19.95s of remaining time.
	0.87	 = Validation score   (accuracy)
	0.22s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 80.3s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("ag-02-multimodal-tabularpredictor/")


<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7fe492889430>

<br>

## 2. Evaluation and Prediction

In [8]:
predictor.leaderboard(dev_df)

Load pretrained checkpoint: ag-02-multimodal-tabularpredictor/models/TextPredictor/text_nn/model.ckpt


| | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NeuralNetTorch | 0.88697 | 0.86 | 0.089318 | 0.024708 | 1.755834 | 0.089318 | 0.024708 | 1.755834 | 1 | True | 5 |
| 1 | WeightedEnsemble_L2 | 0.88697 | 0.87 | 0.239991 | 0.060011 | 3.696689 | 0.005654 | 0.000496 | 0.223408 | 2 | True | 8 |
| 2 | CatBoost | 0.8854 | 0.86 | 0.054497 | 0.008354 | 1.378052 | 0.054497 | 0.008354 | 1.378052 | 1 | True | 3 |
| 3 | TextPredictor | 0.882261 | 0.85 | 1.998655 | 0.596547 | 67.676712 | 1.998655 | 0.596547 | 67.676712 | 1 | True | 7 |
| 4 | LightGBMLarge | 0.879121 | 0.85 | 0.363917 | 0.036365 | 3.314543 | 0.363917 | 0.036365 | 3.314543 | 1 | True | 6 |
| 5 | LightGBM | 0.877551 | 0.855 | 0.097271 | 0.026469 | 1.899688 | 0.097271 | 0.026469 | 1.899688 | 1 | True | 1 |
| 6 | XGBoost | 0.875981 | 0.85 | 0.117441 | 0.030043 | 1.007675 | 0.117441 | 0.030043 | 1.007675 | 1 | True | 4 |
| 7 | LightGBMXT | 0.866562 | 0.845 | 0.062399 | 0.021118 | 1.087554 | 0.062399 | 0.021118 | 1.087554 | 1 | True | 2 |
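
The `WeightedEnsemble_L2` row sits at stack level 2, on top of the level-1 models listed beside it. As a rough illustration of the general idea only, the sketch below blends per-class probabilities from several models using fixed weights and then picks the most probable class. The probabilities and weights are made up, the helper `weighted_ensemble` is hypothetical, and this is not a description of how AutoGluon chooses its ensemble weights.

```python
def weighted_ensemble(prob_lists, weights):
    """Blend per-model class probabilities using weights normalized to sum to 1."""
    total = sum(weights)
    n_classes = len(prob_lists[0])
    blended = [0.0] * n_classes
    for probs, w in zip(prob_lists, weights):
        for k in range(n_classes):
            blended[k] += (w / total) * probs[k]
    return blended


# Hypothetical probabilities over the 4 sentiment classes from three models.
model_probs = [
    [0.10, 0.20, 0.60, 0.10],
    [0.05, 0.15, 0.50, 0.30],
    [0.10, 0.10, 0.30, 0.50],
]
weights = [0.5, 0.3, 0.2]  # hypothetical ensemble weights

blended = weighted_ensemble(model_probs, weights)
# The ensemble prediction is the class with the highest blended probability.
predicted_class = max(range(len(blended)), key=blended.__getitem__)
print(blended, predicted_class)
```

Because the blend is a convex combination, the result is still a valid probability distribution over the classes.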


In [9]:
score = predictor.evaluate(dev_df)
print(score)

Evaluation: accuracy on test data: 0.8869701726844584
Evaluations on test data:
{
    "accuracy": 0.8869701726844584,
    "balanced_accuracy": 0.48100029507697384,
    "mcc": 0.7830290724042951
}


{'accuracy': 0.8869701726844584, 'balanced_accuracy': 0.48100029507697384, 'mcc': 0.7830290724042951}
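
The gap between accuracy (about 0.89) and balanced accuracy (about 0.48) is the kind of result that can arise when some classes are much rarer than others: balanced accuracy is the mean of per-class recall, so classes the model rarely gets right pull it down even if overall accuracy stays high. The sketch below works through the arithmetic on made-up labels; the data is hypothetical and is not drawn from `dev_df`.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical labels: class 2 dominates, class 0 is rare.
y_true = [2] * 8 + [0] * 2
y_pred = [2] * 10  # a model that always predicts the majority class

print(accuracy(y_true, y_pred))
print(balanced_accuracy(y_true, y_pred))
```

In this toy case the always-majority model scores well on plain accuracy while its balanced accuracy reflects that one class is never recovered, which is why it is worth reading both numbers together.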
