# Text and images: Getting Started with Multimodal Learning on SageMaker
Information in the real world usually arrives through several channels at once: images come with text explanations, and articles embed images to express their main idea more clearly.
*Multimodal learning* builds models that learn joint representations across these different modalities.
In this talk we give a gentle introduction to multimodal learning, then train and deploy a multimodal predictor based on natural language text, images, and tabular data using SageMaker and AutoGluon.

We will train a model that takes pets' text descriptions, images, and tabular features and predicts how quickly they will be adopted (as a category).
![petfinderlogo](./img/dataset_example.png)

<sup>This example is developed based on the AutoMM multimodal example [here](https://auto.gluon.ai/dev/tutorials/multimodal/beginner_multimodal.html).</sup>

## Dataset
For demonstration, we use a simplified and subsampled version of [PetFinder dataset](https://www.kaggle.com/c/petfinder-adoption-prediction). The task is to predict the animals’ adoption rates based on their adoption profile information. In this simplified version, the adoption speed is grouped into two categories: 0 (slow) and 1 (fast).


## Step 1: Install libraries and prepare environment 
> ⚠ **Important: this workshop was tested on ml.g4dn.8xlarge, 200 GB (!!!), conda_mxnet_p37 kernel** 

Because 'llvmlite' is a distutils installed project, pip is not able to remove it. 
Let's remove it manually.

In [None]:
%%time 
!find /home/ec2-user/anaconda3 -type f -name '*llvmlite*.egg-info' -delete

Next, we'll upgrade pip and install the AutoGluon library.

In [None]:
%pip install -U pip
%pip install -U setuptools wheel
%pip install -U "mxnet<2.0.0"
%pip install autogluon --ignore-installed
%pip install -U sagemaker

To ensure we can run local mode, increase the conda memory to 95%.

In [None]:
!cp -f image.py /home/ec2-user/anaconda3/envs/mxnet_p37/lib/python3.7/site-packages/sagemaker/local/image.py

In [None]:
# only run the below cells when you are using sagemaker notebook instances
!bash ./prepare-docker.sh

In [None]:
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/local_mode_setup.sh
!/bin/bash ./local_mode_setup.sh

Restart the kernel

In [None]:
import IPython

IPython.Application.instance().kernel.do_shutdown(True) #automatically restarts kernel

## Step 2: Download and explore data

In this section, we first download the PetFinder dataset and explore it to understand what it contains. Please wait until the kernel is ready before executing the following cells.

In [None]:
# Import necessary python packages

import sagemaker
import os
import subprocess
import pandas as pd
from sagemaker.mxnet import MXNet
from autogluon.core.utils.loaders import load_zip
from sagemaker.s3 import S3Uploader, S3Downloader, s3_path_join
from sagemaker import image_uris
from sagemaker.estimator import Estimator

from sagemaker.processing import ScriptProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.utils import name_from_base
from sagemaker.sklearn.estimator import SKLearn
from sklearn import metrics
from IPython.display import Image, display

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-autogluon-text-image-multimodel"
region = sagemaker_session.boto_region_name
account_id = sagemaker_session.account_id()
role = sagemaker.get_execution_role()

To get started, let’s download and prepare the dataset.

In [None]:
download_dir = './ag_automm_tutorial'
zip_file = 'https://automl-mm-bench.s3.amazonaws.com/petfinder_for_tutorial.zip'

In [None]:
%%time
load_zip.unzip(zip_file, unzip_dir=download_dir)

Explore the data

In [None]:
dataset_path = download_dir + '/petfinder_for_tutorial'
train_data = pd.read_csv(f'{dataset_path}/train.csv', index_col=0)
test_data = pd.read_csv(f'{dataset_path}/test.csv', index_col=0)
label_col = 'AdoptionSpeed'

In [None]:
# Expand the image paths to load them in training.
image_col = 'Images'
train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0]) # Use the first image for a quick tutorial
test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])


def path_expander(path, base_folder):
    path_l = path.split(';')
    return ';'.join([os.path.abspath(os.path.join(base_folder, path)) for path in path_l])

train_data[image_col] = train_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))
test_data[image_col] = test_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))

train_data[image_col].iloc[0]
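
As a minimal, standalone sketch of what the cell above does to a single value, the snippet below keeps only the first image from a `;`-separated list and turns it into an absolute path. The sample value `images/a.jpg;images/b.jpg` is a made-up placeholder, not a row from the dataset.

```python
import os


def path_expander(path, base_folder):
    # Split a ';'-separated list of relative paths and make each one absolute.
    path_l = path.split(';')
    return ';'.join(os.path.abspath(os.path.join(base_folder, p)) for p in path_l)


# Hypothetical value of the 'Images' column, for illustration only.
raw_value = "images/a.jpg;images/b.jpg"

# Keep only the first image, as the tutorial does for speed.
first_image = raw_value.split(';')[0]

# Resolve it against the dataset folder used in this notebook.
expanded = path_expander(
    first_image, base_folder="./ag_automm_tutorial/petfinder_for_tutorial"
)
print(expanded)
```

The resulting path is absolute, so it stays valid regardless of the working directory from which the images are later loaded.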

In [None]:
train_data['AdoptionSpeed'].hist()

In [None]:
train_data.hist(figsize=(20,20))

Each animal’s adoption profile includes pictures, a text description, and various tabular features such as age, breed, name, color, and more. Let’s look at an example row of data and display the text description and a picture.

In [None]:
example_row = train_data.iloc[0]

example_row

In [None]:
example_row['Description']


In [None]:
example_image = example_row[image_col]

pil_img = Image(filename=example_image)
display(pil_img)

To use the data in a SageMaker training job, we upload it to S3.

In [None]:
s3_data_path = s3_path_join("s3://", bucket, f"{prefix}/data")
dataset_path = download_dir + '/petfinder_for_tutorial'
print(f"Uploading data to {s3_data_path}")
data_uri = S3Uploader.upload(dataset_path, s3_data_path)

## Train the model

Now let’s train the model using the prepared training data.

In this example, we use the local notebook instance to perform local training. If GPU resources are available for local mode training, `instance_type` is set to `local_gpu`. For non-local training, you can set the instance type to a GPU instance, such as ml.g4dn.xlarge.

Note: Depending on how many underlying models are trained, `volume_size` may need to be increased so that they all fit on disk.

In [None]:
instance_type = "local"

try:
    if subprocess.call("nvidia-smi") == 0:
        # Use GPU local mode if a GPU is present
        instance_type = "local_gpu"
except OSError:
    # nvidia-smi is not installed, so keep CPU local mode
    pass

print("Instance type = " + instance_type)

We will use the prebuilt [autogluon docker container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#autogluon-training-containers) for the training job. Here we retrieve the prebuilt image using the [image_uris function from the sagemaker python sdk](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/image_uris.py#L36).

In [None]:
image_uri = image_uris.retrieve(
            "autogluon",
            region=region,
            version="0.5.2",
            py_version="py38",
            image_scope="training",
            instance_type=instance_type,
        )

**Training script**
The `train.py` script provides all the code we need for training a SageMaker model. The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as `SM_MODEL_DIR`, which is a string representing the path to the directory to write model artifacts to. These artifacts are uploaded to S3 for model hosting.

Suppose one input channel, 'training', was used in the call to the estimator's `fit()` method. An environment variable is then set for it, following the format `SM_CHANNEL_[channel_name]`. In this example, `SM_CHANNEL_TRAINING` is a string representing the path to the directory containing data in the 'training' channel.
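
To make the naming rule concrete, here is a small sketch of how a script could look up the directory for a channel. The helper `channel_dir` and the `fake_env` dictionary are hypothetical names used only for this illustration; inside a real training container the variables come from `os.environ`.

```python
import os


def channel_dir(channel_name, environ=os.environ):
    """Return the local data directory for an input channel, or None if unset."""
    # Channel 'training' maps to the variable SM_CHANNEL_TRAINING.
    return environ.get("SM_CHANNEL_" + channel_name.upper())


# Stand-in for the container environment (assumed value, for illustration only).
fake_env = {"SM_CHANNEL_TRAINING": "/opt/ml/input/data/training"}

print(channel_dir("training", fake_env))
print(channel_dir("validation", fake_env))  # no such channel was passed to fit()
```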

For more information about training environment variables, please visit [SageMaker Containers](https://github.com/aws/sagemaker-containers).

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to model_dir so that it can be used later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance.

Because SageMaker imports the training script, you should put your training code in a main guard (`if __name__=='__main__':`) if you are using the same script to host your model as we do in this example, so that SageMaker does not inadvertently run your training code at the wrong point in execution.
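
The pattern described above can be sketched as follows. This is not the notebook's `train.py`; it is a minimal skeleton under the assumption that the hyperparameters `label-col` and `image-col` arrive as command-line flags and that the model directory falls back to `SM_MODEL_DIR`. The function names `parse_args` and `train` are placeholders.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line flags.
    parser.add_argument("--label-col", type=str, default="AdoptionSpeed")
    parser.add_argument("--image-col", type=str, default="Images")
    # Fall back to a local folder when not running inside a SageMaker container.
    parser.add_argument(
        "--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR", "./model")
    )
    # Ignore any extra arguments the launcher may add.
    args, _unknown = parser.parse_known_args(argv)
    return args


def train(args):
    # Placeholder for the real training logic.
    print(f"Would train on label '{args.label_col}' and save to {args.model_dir}")


if __name__ == "__main__":
    # Runs only when executed as a script, not when the module is imported.
    train(parse_args())
```

Note that `argparse` converts the flag `--label-col` into the attribute `args.label_col`.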

For example, the script run by this notebook:


In [1]:
!pygmentize train.py

import os
import numpy as np
from autogluon.core.utils.loaders import load_zip
from autogluon.multimodal import MultiModalPredictor
import pandas as pd
import argparse
import logging

from sagemaker_training import environment

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

def path_expander(path, base_folder):
    path_l = path.split('

In our training script, we use [AutoMM for Multimodal training](https://auto.gluon.ai/dev/tutorials/multimodal/index.html). AutoMM is a deep learning “model zoo” of model zoos. It can automatically build deep learning models that are suitable for multimodal datasets. You only need to convert the data into the multimodal dataframe format, and AutoMM can predict the values of one column conditioned on the features from the other columns, including images, text, and tabular data.

Under the hood, AutoMM automatically infers the problem type (classification or regression), detects the data modalities, selects the related models from the multimodal model pools, and trains the selected models. If multiple backbones are available, AutoMM appends a late-fusion model (MLP or transformer) on top of them.
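
As a rough sketch of how little code this requires, the function below fits a `MultiModalPredictor` on a dataframe and predicts on another. It is an illustration rather than the contents of `train.py`: the function name `train_and_predict` and the 120-second `time_limit` are assumptions, and the import is deferred so the function can be defined without AutoGluon installed.

```python
def train_and_predict(train_df, test_df, label_col="AdoptionSpeed",
                      time_limit=120, save_path=None):
    """Fit AutoMM on train_df and return (predictor, predictions for test_df)."""
    # Imported lazily; requires the autogluon package at call time.
    from autogluon.multimodal import MultiModalPredictor

    # The problem type and column modalities are inferred from the dataframe.
    predictor = MultiModalPredictor(label=label_col, path=save_path)
    predictor.fit(train_data=train_df, time_limit=time_limit)

    predictions = predictor.predict(test_df)
    return predictor, predictions
```

In this notebook it could be called as `train_and_predict(train_data, test_data)`, although the actual training here runs inside the SageMaker container rather than in the kernel.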

In [None]:
model_artifacts_location = f"s3://{bucket}/{prefix}/artifacts"
hyperparameters = {
  'label-col': 'AdoptionSpeed',
  'image-col': 'Images'
}
mm_estimator = Estimator(
    entry_point="train.py",
    role=role,
    output_path=model_artifacts_location,
    instance_count=1,
    instance_type=instance_type,
    hyperparameters=hyperparameters,
    image_uri=image_uri,
    volume_size=100
)

In [None]:
mm_estimator.fit({"training": data_uri})

## Batch Inference

Amazon SageMaker Processing allows you to run steps for data pre- or post-processing, feature engineering, data validation, or model evaluation workloads on Amazon SageMaker. Processing jobs accept data from Amazon S3 as input and store data into Amazon S3 as output.

![processing](https://sagemaker.readthedocs.io/en/stable/_images/amazon_sagemaker_processing_image1.png)

Here, we'll import the test dataset and model artifacts as the input to the processing job. The processing script will perform batch inference on the test dataset with the trained model and write out the inference results. A SageMaker Processing job can process terabytes of data in a SageMaker-managed cluster separate from the instance running your notebook server. In a typical SageMaker workflow, notebooks are only used for prototyping and can be run on relatively inexpensive and less powerful instances, while processing, training, and model hosting tasks are run on separate, more powerful SageMaker-managed instances. SageMaker Processing includes off-the-shelf support for Scikit-learn, as well as a Bring Your Own Container option, so it can be used with many different data transformation technologies and tasks.

To use SageMaker Processing, simply supply a Python processing script as shown below. For this example, we're using the same AutoGluon prebuilt container, which includes the Python packages required to run inference on the test data. We use a [`ScriptProcessor`](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/processing.py#L399) to run our own code within a container.

In [None]:
base_job_name = 'multimodel-inference'
mm_processor = ScriptProcessor(
    image_uri=image_uri,
    command=["python3"],
    role=role,
    instance_type="local",
    instance_count=1, 
    base_job_name=base_job_name,
)
processing_job_name = name_from_base(base_job_name)

In [None]:
model_artifacts_location=mm_estimator.model_data
model_artifacts_location

The processing script loads the trained model and performs predictions on the test data, as shown below:


In [2]:
!pygmentize processing_script.py

import os
import numpy as np
from autogluon.core.utils.loaders import load_zip
from autogluon.multimodal import MultiModalPredictor
import pandas as pd
import argparse
import logging
import glob
import tarfile
import time
import pathlib
import json


logger = logging.getLogger()


We run this script as a processing job using the `ScriptProcessor.run()` method. We give `run()` three `ProcessingInput` objects: one whose source is the test CSV file in Amazon S3, one for the images that the CSV file references, and one for the trained model artifact in S3. The `destination` of each input is the path inside the processing container that the script reads from: the CSV file is downloaded to `/opt/ml/processing/input/test`, the images to `/opt/ml/processing/input/test/images`, and the model artifact (a tar.gz file) to `/opt/ml/processing/input/model`. These local paths inside the processing container must begin with `/opt/ml/processing/`.
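
Because the model artifact arrives as a tar.gz file, the script has to unpack it before loading the predictor. A minimal sketch of that step, using only the standard library, might look like this; the helper name `extract_model` and the extraction directory are assumptions, not taken from `processing_script.py`.

```python
import glob
import os
import tarfile


def extract_model(model_input_dir, extract_dir):
    """Unpack every *.tar.gz found in model_input_dir into extract_dir."""
    os.makedirs(extract_dir, exist_ok=True)
    archives = sorted(glob.glob(os.path.join(model_input_dir, "*.tar.gz")))
    for archive in archives:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=extract_dir)
    # Return the archives that were unpacked, so the caller can log them.
    return archives


# Inside the container this could be called as (second path is hypothetical):
# extract_model("/opt/ml/processing/input/model", "/opt/ml/processing/model")
```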

Also give the `run()` method a `ProcessingOutput`, where the `source` is the path the script writes output data to. For outputs, the `destination` defaults to an S3 bucket that the Amazon SageMaker Python SDK creates for you, following the format `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/`. In this case, the script writes its results to `/opt/ml/processing/output/inference_result`, and the processing job uploads them to the output location on S3: `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/inference_result`. You also give each `ProcessingOutput` an `output_name`, to make it easier to retrieve the output artifacts after the job has run.
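
The output location format described above can be written out as a small helper. The function `default_output_uri` is a hypothetical name, and the bucket and job name in the example call are placeholders.

```python
def default_output_uri(bucket, job_name, output_name):
    # Mirrors the format s3://<bucket>/<processing_job_name>/output/<output_name>
    return f"s3://{bucket}/{job_name}/output/{output_name}"


# Placeholder bucket and job name, for illustration only.
uri = default_output_uri(
    "sagemaker-us-east-1-123456789012",
    "multimodel-inference-example",
    "inference_result",
)
print(uri)
```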

The `arguments` parameter of the `run()` method holds the command-line arguments passed to our `processing_script.py` script.

In [None]:
mm_processor.run(
    code='processing_script.py',
    arguments = [
                 '--base_dir', '/opt/ml/processing',
                ],
    inputs=[
        ProcessingInput(
            source=f"{data_uri}/test.csv",
            destination="/opt/ml/processing/input/test",
            s3_data_distribution_type="FullyReplicated",
        ),
        ProcessingInput(
            source=f"{data_uri}/images",
            destination="/opt/ml/processing/input/test/images",
            s3_data_distribution_type="FullyReplicated",
        ),
        ProcessingInput(
            source=model_artifacts_location,
            destination="/opt/ml/processing/input/model",
            s3_data_distribution_type="FullyReplicated",
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="inference_result", source="/opt/ml/processing/output/inference_result", destination="inference_result")
    ],
    job_name=processing_job_name,

)

In [None]:
results_s3_path = f"s3://{bucket}/{processing_job_name}/output/inference_result"
S3Downloader.download(s3_uri=results_s3_path, local_path='./',sagemaker_session=sagemaker_session)

In [None]:
# generate classification report

y_pred_class=pd.read_json('result.json',typ='series')
y_test = test_data['AdoptionSpeed']
print(metrics.classification_report(y_test, y_pred_class))

In [None]:
test_data = pd.concat([test_data, y_pred_class], axis=1)
test_data.head()