# Multimodal Training/Prediction for Text + Tabular 

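AutoGluon has supported AutoMM, a class that trains multimodal models automatically, since version 0.5.1, which makes it easy to train on data that mixes images, text, and tabular features.
It also supports training unimodal deep learning models (image, text, or tabular), so people without machine learning expertise can use it easily.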

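To build the training dataset, add text columns (for example, the text of a movie review) to an existing tabular dataset as they are. During training, each column is embedded according to its data type as follows: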

- Text columns: embedded with a pretrained Transformer (BERT) backbone.
- Categorical columns: embedded with an Embedding-MLP.
- Numerical columns: embedded with a standard MLP.
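
To make the per-column routing above more concrete, here is a minimal sketch of how columns might be sorted into text, categorical, and numerical groups before they are embedded. The `infer_column_type` helper, its thresholds, and the toy values are all hypothetical and chosen only for illustration; they are not AutoGluon's actual type-inference rules.

```python
from typing import Dict, List


def infer_column_type(values: List[object],
                      max_categories: int = 20,
                      min_text_words: float = 3.0) -> str:
    """Hypothetical heuristic: label a column numerical, categorical, or text."""
    non_null = [v for v in values if v is not None]
    # Only numbers (bools excluded) -> numerical column, handled by an MLP.
    if non_null and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    ):
        return "numerical"
    strings = [str(v) for v in non_null]
    avg_words = sum(len(s.split()) for s in strings) / max(len(strings), 1)
    # Few distinct, short strings -> categorical column, handled by an Embedding-MLP.
    if len(set(strings)) <= max_categories and avg_words < min_text_words:
        return "categorical"
    # Everything else is treated as free text for the Transformer backbone.
    return "text"


# Toy columns loosely modeled on the book dataset used later (values are made up).
columns: Dict[str, List[object]] = {
    "Reviews": [4.0, 3.9, 4.8],
    "BookCategory": ["Humour", "Romance", "Humour"],
    "Synopsis": [
        "a long free text description of the first book",
        "another paragraph that summarizes the second book",
        "yet another synopsis with many different words",
    ],
}
column_types = {name: infer_column_type(vals) for name, vals in columns.items()}
print(column_types)
```

In practice AutoGluon infers column types on its own, so a helper like this is only useful for reasoning about which branch a given column is likely to reach.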

## TextPredictor & TabularPredictor

Besides `MultiModalPredictor`, you can also train multimodal models with `TextPredictor` or `TabularPredictor`.
If, rather than relying on the Transformer network alone, you want to ensemble and stack it with Gradient Boosted Trees such as XGBoost/LightGBM/CatBoost, train with `TabularPredictor.fit(..., hyperparameters='multimodal')`. Note that if you train without `hyperparameters='multimodal'`, AutoGluon Tabular automatically converts text data into N-gram features and trains tabular models only.
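
As a rough illustration of what turning text into N-gram features means, the sketch below counts word unigrams and bigrams using only the standard library. The `ngram_counts` helper is hypothetical; AutoGluon's own text featurization is more elaborate and is not reproduced here.

```python
from collections import Counter
from typing import Iterable


def ngram_counts(text: str, n_values: Iterable[int] = (1, 2)) -> Counter:
    """Count word-level n-grams (unigrams and bigrams by default)."""
    tokens = text.lower().split()
    counts: Counter = Counter()
    for n in n_values:
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


features = ngram_counts("A pocket full of rye a pocket")
print(features.most_common(3))
```

Each distinct n-gram then becomes one sparse numeric column that a tree model can split on, which is why no Transformer is needed on that path.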

In [1]:
import os
import torch
num_gpus = torch.cuda.device_count()

if num_gpus == 0:
    os.environ['AUTOGLUON_TEXT_TRAIN_WITHOUT_GPU'] = '1'

print(num_gpus)    

1


In [2]:
import numpy as np
import pandas as pd
import os
import warnings
warnings.filterwarnings('ignore')
np.random.seed(123)

%load_ext autoreload
%autoreload 2

In [3]:
!python3 -m pip install openpyxl

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_p38/bin/python3 -m pip install --upgrade pip' command.

<br>

## 1. Data preparation and Training

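This hands-on uses the book price prediction dataset from the MachineHack "Predict the Price of Books" hackathon. The goal is to predict the price of a book from features such as its title (Title), author (Author), and review score (Reviews).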

### Data preparation

In [4]:
save_path = 'ag-01-multimodal-text-tabular'
!rm -rf $save_path price_of_books

In [5]:
!mkdir -p price_of_books
!wget https://automl-mm-bench.s3.amazonaws.com/machine_hack_competitions/predict_the_price_of_books/Data.zip -O price_of_books/Data.zip
!cd price_of_books && unzip -o Data.zip
!ls price_of_books/Participants_Data

--2022-08-30 00:17:32--  https://automl-mm-bench.s3.amazonaws.com/machine_hack_competitions/predict_the_price_of_books/Data.zip
Resolving automl-mm-bench.s3.amazonaws.com (automl-mm-bench.s3.amazonaws.com)... 52.217.111.100
Connecting to automl-mm-bench.s3.amazonaws.com (automl-mm-bench.s3.amazonaws.com)|52.217.111.100|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3521673 (3.4M) [application/zip]
Saving to: ‘price_of_books/Data.zip’


2022-08-30 00:17:32 (26.2 MB/s) - ‘price_of_books/Data.zip’ saved [3521673/3521673]

Archive:  Data.zip
  inflating: Participants_Data/Data_Test.xlsx  
  inflating: Participants_Data/Data_Train.xlsx  
  inflating: Participants_Data/Sample_Submission.xlsx  
Data_Test.xlsx	Data_Train.xlsx  Sample_Submission.xlsx


In [6]:
train_df = pd.read_excel(os.path.join('price_of_books', 'Participants_Data', 'Data_Train.xlsx'), engine='openpyxl')
train_df.head()

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,"Paperback,– 10 Mar 2016",4.0 out of 5 stars,8 customer reviews,THE HUNTERS return in their third brilliant no...,Action & Adventure (Books),Action & Adventure,220.0
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,"Paperback,– 7 Nov 2012",3.9 out of 5 stars,14 customer reviews,A layered portrait of a troubled genius for wh...,Cinema & Broadcast (Books),"Biographies, Diaries & True Accounts",202.93
2,Leviathan (Penguin Classics),Thomas Hobbes,"Paperback,– 25 Feb 1982",4.8 out of 5 stars,6 customer reviews,"""During the time men live without a common Pow...",International Relations,Humour,299.0
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,"Paperback,– 5 Oct 2017",4.1 out of 5 stars,13 customer reviews,A handful of grain is found in the pocket of a...,Contemporary Fiction (Books),"Crime, Thriller & Mystery",180.0
4,LIFE 70 Years of Extraordinary Photography,Editors of Life,"Hardcover,– 10 Oct 2006",5.0 out of 5 stars,1 customer review,"For seven decades, ""Life"" has been thrilling t...",Photography Textbooks,"Arts, Film & Photography",965.62


In [7]:
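def preprocess(df):
    df = df.copy(deep=True)
    # '4.0 out of 5 stars' -> 4.0
    df.loc[:, 'Reviews'] = pd.to_numeric(df['Reviews'].apply(lambda ele: ele[:-len(' out of 5 stars')]))
    # '8 customer reviews' -> 8.0; the singular '1 customer review' slices to '' and becomes NaN
    df.loc[:, 'Ratings'] = pd.to_numeric(df['Ratings'].apply(lambda ele: ele.replace(',', '')[:-len(' customer reviews')]))
    # Train on log(Price + 1); predictions are mapped back later with exp(pred) - 1
    df.loc[:, 'Price'] = np.log(df['Price'] + 1)
    return df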
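
The slicing in `preprocess` relies on fixed-length suffixes, so a singular `1 customer review` loses its digit and ends up as NaN. As an alternative sketch (not the notebook's method), a regular expression can extract the leading number whether the suffix is singular or plural. `parse_leading_number` is a hypothetical helper introduced here for illustration.

```python
import re
from typing import Optional

# A number at the start of the string, with optional thousands separators and decimals.
_LEADING_NUMBER = re.compile(r"^\s*(\d[\d,]*(?:\.\d+)?)")


def parse_leading_number(text: str) -> Optional[float]:
    """Return the number at the start of `text`, or None if there is none."""
    match = _LEADING_NUMBER.match(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


for raw in ["4.0 out of 5 stars", "8 customer reviews", "1 customer review"]:
    print(raw, "->", parse_leading_number(raw))
```

With pandas, a helper like this could be applied through `df['Ratings'].apply(parse_leading_number)`; whether keeping single-review books changes model quality on this dataset has not been checked here.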

In [8]:
train_subsample_size = 1500  # subsample for faster demo, you can try setting to larger values
test_subsample_size = 5
train_df = preprocess(train_df)
train_data = train_df.iloc[100:].sample(train_subsample_size, random_state=123)
test_data = train_df.iloc[:100].sample(test_subsample_size, random_state=245)
train_data.head()

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price
949,Furious Hours,Casey Cep,"Paperback,– 1 Jun 2019",4.0,,‘It’s been a long time since I picked up a boo...,True Accounts (Books),"Biographies, Diaries & True Accounts",5.743003
5504,REST API Design Rulebook,Mark Masse,"Paperback,– 7 Nov 2011",5.0,,"In todays market, where rival web services com...","Computing, Internet & Digital Media (Books)","Computing, Internet & Digital Media",5.786897
5856,The Atlantropa Articles: A Novel,Cody Franklin,"Paperback,– Import, 1 Nov 2018",4.5,2.0,#1 Amazon Best Seller! Dystopian Alternate His...,Action & Adventure (Books),Romance,6.893656
4137,Hickory Dickory Dock (Poirot),Agatha Christie,"Paperback,– 5 Oct 2017",4.3,21.0,There’s more than petty theft going on in a Lo...,Action & Adventure (Books),"Crime, Thriller & Mystery",5.192957
3205,The Stanley Kubrick Archives (Bibliotheca Univ...,Alison Castle,"Hardcover,– 21 Aug 2016",4.6,3.0,"In 1968, when Stanley Kubrick was asked to com...",Cinema & Broadcast (Books),Humour,6.889591


### Training

To save time, we subsample the data and train for only two minutes. Because this is a regression model that predicts book prices, RMSE (Root Mean Squared Error) is the default metric.
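
Because `Price` was log-transformed in `preprocess`, the RMSE reported during training is measured in log space rather than in currency units. A minimal standard-library sketch of the metric follows; the predicted prices are made-up toy values, not model output.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between two equally long sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Compare in log space, mirroring the log(Price + 1) target used for training.
true_prices = [220.0, 299.0, 180.0]
toy_predicted_prices = [250.0, 280.0, 200.0]  # hypothetical predictions
log_rmse = rmse(
    [math.log1p(p) for p in true_prices],
    [math.log1p(p) for p in toy_predicted_prices],
)
print(f"RMSE in log space: {log_rmse:.4f}")
```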

In [None]:
from autogluon.multimodal import MultiModalPredictor
time_limit = 2 * 60  # set to larger value in your applications
predictor = MultiModalPredictor(label='Price', path=save_path)
predictor.fit(train_data, time_limit=time_limit)

Global seed set to 123
Auto select gpus: [0]
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type                | Params
----------------------------------------------------------
0 | model             | MultimodalFusionMLP | 109 M 
1 | validation_metric | MeanSquaredError    | 0     
2 | loss_func         | MSELoss             | 0     
----------------------------------------------------------
109 M     Trainable params
0         Non-trainable params
109 M     Total params
219.565   Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Epoch 0, global step 4: 'val_rmse' reached 1.52672 (best 1.52672), saving model to '/home/ec2-user/SageMaker/autogluon-on-aws/4.multimodal/ag-01-multimodal-text-tabular/epoch=0-step=4.ckpt' as top 3


Validation: 0it [00:00, ?it/s]

Epoch 0, global step 10: 'val_rmse' reached 1.09024 (best 1.09024), saving model to '/home/ec2-user/SageMaker/autogluon-on-aws/4.multimodal/ag-01-multimodal-text-tabular/epoch=0-step=10.ckpt' as top 3
Time limit reached. Elapsed time is 0:02:00. Signaling Trainer to stop.


Validation: 0it [00:00, ?it/s]

Epoch 1, global step 14: 'val_rmse' reached 0.88354 (best 0.88354), saving model to '/home/ec2-user/SageMaker/autogluon-on-aws/4.multimodal/ag-01-multimodal-text-tabular/epoch=1-step=14.ckpt' as top 3


<br>

## 2. Prediction

In [None]:
predictions = predictor.predict(test_data)
print('Predictions:')
print('------------')
print(np.exp(predictions) - 1)
print()
print('True Value:')
print('------------')
print(np.exp(test_data['Price']) - 1)
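
The `np.exp(...) - 1` in the cell above undoes the `np.log(Price + 1)` applied in `preprocess`. The standard-library sketch below checks that round trip on a single value; `to_log_price` and `from_log_price` are hypothetical names, and `math.log1p` / `math.expm1` compute the same quantities while staying accurate for values close to zero.

```python
import math


def to_log_price(price: float) -> float:
    """Forward transform: log(price + 1)."""
    return math.log1p(price)


def from_log_price(log_price: float) -> float:
    """Inverse transform: exp(log_price) - 1."""
    return math.expm1(log_price)


original = 220.0
encoded = to_log_price(original)
decoded = from_log_price(encoded)
print(original, encoded, decoded)
```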

In [None]:
performance = predictor.evaluate(test_data)
print(performance)

In [None]:
embeddings = predictor.extract_embedding(test_data)
embeddings.shape
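
One common use of such embeddings is measuring how similar two rows are. The sketch below computes cosine similarity in plain Python on made-up vectors; it assumes each row of `embeddings` is a one-dimensional vector of floats, which the cell above does not confirm.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy stand-ins for two embedding rows (real embeddings have many more dimensions).
book_a = [0.2, 0.7, 0.1]
book_b = [0.4, 1.4, 0.2]
print(cosine_similarity(book_a, book_b))
```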