# Generating marketing slogans for product images

This notebook shows how to fine-tune a generative AI model to generate marketing slogans for product images. 

We start with BLIP, a foundation model available through Hugging Face, and fine-tune it in an Amazon SageMaker training job. We then compare the slogans generated by the fine-tuned model with slogans generated by the "out of the box" model.

TL;DR: the fine-tuned model scores better than the baseline on all three metrics.

| Metric | Baseline model | Fine-tuned model |
| -- | -- | -- |
| BERTScore (F1, higher is better) | 0.82 | 0.85 |
| WER (lower is better) | 2.06 | 1.24 |
| ROUGE (higher is better) | 0.05 | 0.09 |

## License

    Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
    SPDX-License-Identifier: MIT-0

## Data set

We use the [Automatic Understanding of Image and Video Advertisements](https://people.cs.pitt.edu/~kovashka/ads/) image dataset. The citation for this data set is:

    Automatic Understanding of Image and Video Advertisements. Zaeem Hussain, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agha, Nathan Ong, Adriana Kovashka. To appear, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

## Prerequisites

This notebook was built in Amazon SageMaker Studio. It uses an `ml.g4dn.xlarge` instance with the `PyTorch 1.13 Python 3.9 GPU Optimized` image.

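The training job runs on an `ml.g5.16xlarge` instance by default, or on an `ml.p4d.24xlarge` instance if you enable distributed data parallel training. Make sure your account quotas allow you to use the instance type you choose for training.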


First, specify the S3 bucket where you will store your training data and model artifacts.

In [None]:
# --------------------------
# YOU MUST CHANGE THIS VALUE
# --------------------------

model_and_data_artifact_s3_bucket = '<ENTER-BUCKET-NAME-HERE>'

Create two directories in the same directory as this notebook.

The `data-archives` directory will hold our downloaded files, in case we need them again in the future. The `data` directory will hold the actual unpacked dataset and metadata files.

In [None]:
import os

if not os.path.exists('data'):
    os.mkdir('data')

if not os.path.exists('data-archives'):
    os.mkdir('data-archives')

The dataset comes as a set of 11 zip files with images, `subfolder-0.zip` through `subfolder-10.zip`. Download these to the `data-archives` folder in case we need them again in the future. Then unpack them into the `data` folder. In the `data` folder you should end up with one subdirectory for each of the zip files, numbered 0 to 10.

These files are large, so the process will take several minutes.

In [None]:
download_root_url = 'https://storage.googleapis.com/ads-dataset/'

for idx in range(11):
    data_filename = f'subfolder-{idx}.zip'
    print(f'downloading: {data_filename}')
    os.system(f'wget -O ./data-archives/{data_filename} {download_root_url}{data_filename}')
    print(f'unpacking: {data_filename}')
    os.system(f'unzip ./data-archives/{data_filename} -d ./data')
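
`os.system` returns the command's exit status, but the loop above ignores it, so a failed download would pass silently and only show up later as missing images. The following is a minimal sketch of a stricter variant, assuming `wget` and `unzip` are available on the path as in the original cell. The helper names `archive_urls`, `run_checked`, and `download_and_unpack` are illustrative and not part of the notebook.

```python
import subprocess

DOWNLOAD_ROOT_URL = 'https://storage.googleapis.com/ads-dataset/'


def archive_urls(count=11, root=DOWNLOAD_ROOT_URL):
    """Return (filename, url) pairs for subfolder-0.zip .. subfolder-{count-1}.zip."""
    names = [f'subfolder-{idx}.zip' for idx in range(count)]
    return [(name, root + name) for name in names]


def run_checked(cmd):
    """Run a command given as an argument list; raise CalledProcessError on failure."""
    subprocess.run(cmd, check=True)


def download_and_unpack(archive_dir='data-archives', data_dir='data'):
    """Download and unpack every archive, stopping at the first failure."""
    for name, url in archive_urls():
        archive_path = f'./{archive_dir}/{name}'
        print(f'downloading: {name}')
        run_checked(['wget', '-O', archive_path, url])
        print(f'unpacking: {name}')
        # -n: never overwrite existing files, so a re-run does not prompt
        run_checked(['unzip', '-n', archive_path, '-d', f'./{data_dir}'])
```

Calling `download_and_unpack()` would then do the same work as the loop above, but stop with an exception if any step fails.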

Additionally, we need the metadata for the dataset with annotations, available as `annotations_images.zip`.

In [None]:
download_root_url = 'https://people.cs.pitt.edu/~kovashka/ads/'
data_filename = 'annotations_images.zip'

print(f'downloading: {data_filename}')
os.system(f'wget -O ./data-archives/{data_filename} {download_root_url}{data_filename}')
print(f'unpacking: {data_filename}')
os.system(f'unzip ./data-archives/{data_filename} -d ./data')

## Install libraries

Install the Hugging Face packages used in the rest of this notebook.

In [None]:
!pip install transformers datasets evaluate -q

In [None]:
from datasets import load_dataset 

## Prepare data

In this section we need to create a dataset in the standard format for images. We need a folder with all of the images, and a metadata file that maps images to ground-truth captions (slogans).

We'll read the mapping of slogans to images from the `Slogans.json` file and build a new metadata list from it. Since many of the images have multiple slogans, we will create one copy of each image per slogan.

In [None]:
import json

with open('data/image/Slogans.json', 'r') as S:
    slogans = json.load(S)

In [None]:
import os
image_folder = 'image_folder_blip'

if not os.path.exists(image_folder):
    os.mkdir(image_folder)

In [None]:
import shutil
captions = []
for image_file_name in slogans:
    path_parts = os.path.split(image_file_name)
    base_name = path_parts[-1]
    for idx, slogan in enumerate(slogans[image_file_name]):
        s_file_name = f"{idx}-{base_name}"
        captions.append({"file_name": s_file_name, "text": slogan})
        shutil.copyfile(os.path.join('data', image_file_name), os.path.join(image_folder, s_file_name))
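
To make the renaming scheme concrete, here is a small self-contained sketch of the expansion step on made-up data. The image paths and slogans are invented for illustration, and `expand_slogans` and `colliding_targets` are hypothetical helpers. The second helper flags a case the loop above would not report: two source images in different subfolders that share a base name would map to the same target file. Whether that happens in this dataset has not been checked here.

```python
import os
from collections import Counter


def expand_slogans(slogans):
    """Return one record per (image, slogan) pair, named '{idx}-{base_name}'."""
    records = []
    for image_file_name, image_slogans in slogans.items():
        base_name = os.path.split(image_file_name)[-1]
        for idx, slogan in enumerate(image_slogans):
            records.append({
                'file_name': f'{idx}-{base_name}',
                'text': slogan,
                'source': image_file_name,
            })
    return records


def colliding_targets(records):
    """Return target file names that more than one record maps to."""
    counts = Counter(record['file_name'] for record in records)
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up example: one image with two slogans, one image with a single slogan.
toy_slogans = {
    '0/100.jpg': ['Fresh every day', 'Taste the morning'],
    '1/200.jpg': ['Built to last'],
}
toy_records = expand_slogans(toy_slogans)
print([record['file_name'] for record in toy_records])
print(colliding_targets(toy_records))
```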

In [None]:
with open(os.path.join(image_folder, "metadata.jsonl"), 'w') as f:
    for item in captions:
        f.write(json.dumps(item) + "\n")
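
The `imagefolder` loader pairs each image with its caption through the `file_name` field in `metadata.jsonl`, so a quick consistency check before loading can save a confusing error later. A minimal sketch follows; `check_metadata` is a hypothetical helper, not part of the notebook.

```python
import json
import os


def check_metadata(folder, metadata_name='metadata.jsonl'):
    """Return (record_count, missing_files) for the metadata file in `folder`."""
    missing = []
    count = 0
    with open(os.path.join(folder, metadata_name), 'r') as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            item = json.loads(line)
            count += 1
            if not os.path.exists(os.path.join(folder, item['file_name'])):
                missing.append(item['file_name'])
    return count, missing
```

For the folder built above, `check_metadata(image_folder)` would be expected to return the number of captions and an empty list of missing files.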

In [None]:
from datasets import load_dataset 

ds = load_dataset("imagefolder", data_dir=image_folder, split="train")

In [None]:
ds = ds.train_test_split(test_size=0.1)
train_ds = ds["train"]
test_ds = ds["test"]
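
Because each image is copied once per slogan, a random row-level split can place copies of the same image in both the training and test sets, which may flatter the evaluation. One possible alternative, which is not what this notebook does, is to split by source image. The sketch below shows the idea in plain Python, assuming the `{idx}-{base_name}` naming used earlier; `split_by_image` is a hypothetical helper.

```python
import random
from collections import defaultdict


def split_by_image(file_names, test_size=0.1, seed=0):
    """Split file names into (train, test) so all copies of an image stay together."""
    groups = defaultdict(list)
    for name in file_names:
        # '3-12345.jpg' -> '12345.jpg': strip the slogan index prefix
        base_name = name.split('-', 1)[1]
        groups[base_name].append(name)

    bases = sorted(groups)
    random.Random(seed).shuffle(bases)  # seeded for a reproducible split
    n_test = max(1, round(len(bases) * test_size))
    test_bases = set(bases[:n_test])

    train = [n for b in bases if b not in test_bases for n in groups[b]]
    test = [n for b in bases if b in test_bases for n in groups[b]]
    return train, test
```

Applying this would mean building separate train and test folders before calling `load_dataset`. For the row-level split used in this notebook, passing a `seed` argument to `train_test_split` makes the split reproducible.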

In [None]:
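import sagemaker

# Store the train/test data in the bucket specified at the top of the notebook,
# so that it matches the bucket used later for the training job.
sess = sagemaker.Session(default_bucket=model_and_data_artifact_s3_bucket)
s3_bucket = sess.default_bucket()
print(s3_bucket)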

In [None]:
train_path = 'ads/blip/train'
test_path = 'ads/blip/test'
train_ds.save_to_disk(dataset_path=f"s3://{s3_bucket}/{train_path}")
test_ds.save_to_disk(dataset_path=f"s3://{s3_bucket}/{test_path}")

## Run training job

Next we'll run a training job on Amazon SageMaker using the HuggingFace classes in the Python SDK.

Your model artifact `model.tar.gz` will be stored in the SageMaker session's default bucket, which we set to the bucket named in the `model_and_data_artifact_s3_bucket` variable. The artifact is stored under a prefix that matches the generated name of the SageMaker training job that created the model.

In [None]:
import sagemaker
import sagemaker.huggingface
from sagemaker.huggingface import HuggingFace
import boto3

region = boto3.Session().region_name

sess = sagemaker.Session(default_bucket=model_and_data_artifact_s3_bucket)

role = sagemaker.get_execution_role()

print(region)

In [None]:
# hyperparameters, which are passed into the training job
hyperparameters={
    'epochs': 10,
    'model_name': 'Salesforce/blip-image-captioning-base',
    'learning_rate': 5e-5,
    'train_batch_size': 8,
    'output_dir': '/opt/ml/model'
}
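
In SageMaker script mode, each entry in `hyperparameters` is passed to the entry point as a command-line argument (for example `--epochs 10`), and the local path of each input channel is exposed through an environment variable such as `SM_CHANNEL_TRAIN`. The `train.py` script is not shown in this notebook, so the following is only a sketch of how such a script might read these values. The `parse_args` helper, the `--training_dir` argument, and the default values are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters the way a SageMaker entry point might (hypothetical)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=3)
    parser.add_argument('--model_name', type=str,
                        default='Salesforce/blip-image-captioning-base')
    parser.add_argument('--learning_rate', type=float, default=5e-5)
    parser.add_argument('--train_batch_size', type=int, default=8)
    parser.add_argument('--output_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    # Local path where SageMaker places the 'train' channel data.
    parser.add_argument('--training_dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN',
                                               '/opt/ml/input/data/train'))
    # parse_known_args tolerates any extra arguments the platform may add.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(['--epochs', '10', '--learning_rate', '5e-05'])
print(args.epochs, args.learning_rate, args.train_batch_size)
```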

If you are using a framework and instance type that support the [SageMaker distributed data parallel library](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-data-parallel-support.html), you can set the `ENABLE_DATA_PARALLELIZATION_SUPPORT` flag to `True` here. This notebook defaults to `False` because the supported instance types are not part of the default AWS account service quotas.

Check your service quotas for the selected instance type in your region via the [AWS Service Quotas dashboard](https://console.aws.amazon.com/servicequotas/home) to ensure you can launch the desired training instance type.

In [None]:
# Set ENABLE_DATA_PARALLELIZATION_SUPPORT to True only if your service quotas cover the training
# instance type and your framework supports the SageMaker distributed data parallel library.
ENABLE_DATA_PARALLELIZATION_SUPPORT = False

if ENABLE_DATA_PARALLELIZATION_SUPPORT:
    # configuration for running training on smdistributed Data Parallel
    distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}

    # instance configurations
    instance_type='ml.p4d.24xlarge'
    instance_count=1
    volume_size=500
else:
    # configuration for running standard training
    distribution = {}
    # instance configurations
    instance_type='ml.g5.16xlarge'
    instance_count=1
    volume_size=500

In [None]:
# metric definition to extract the results
metric_definitions=[
     {'Name': 'train_runtime', 'Regex': "'train_runtime': (.*?),"},
     {'Name': 'train_samples_per_second', 'Regex': "'train_samples_per_second': (.*?),"},
     {'Name': 'train_loss', 'Regex': "'train_loss': (.*?),"}
]

# create the Estimator
huggingface_estimator = HuggingFace(
        sagemaker_session = sess,
        entry_point='train.py',
        source_dir='./scripts',
        instance_type=instance_type,
        instance_count=instance_count,
        volume_size=volume_size,
        role=role,
        image_uri=f'763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-training:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04',
        py_version='py39',
        distribution= distribution,
        metric_definitions=metric_definitions,
        hyperparameters = hyperparameters
)
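
Each `Regex` in `metric_definitions` is applied to the training job's log output, and the first capture group becomes the metric value. The sketch below applies the same patterns to a made-up log line shaped like the summary dictionary that the Hugging Face `Trainer` typically prints at the end of training. The numbers are invented, and whether `train.py` emits such a line depends on how it is written.

```python
import re

metric_definitions = [
    {'Name': 'train_runtime', 'Regex': "'train_runtime': (.*?),"},
    {'Name': 'train_samples_per_second', 'Regex': "'train_samples_per_second': (.*?),"},
    {'Name': 'train_loss', 'Regex': "'train_loss': (.*?),"},
]

# Hypothetical log line; the values are not real results.
sample_log_line = (
    "{'train_runtime': 1234.5, 'train_samples_per_second': 42.1, "
    "'train_steps_per_second': 5.3, 'train_loss': 0.87, 'epoch': 10.0}"
)


def extract_metrics(line, definitions):
    """Return {metric name: value} for every definition whose regex matches."""
    found = {}
    for definition in definitions:
        match = re.search(definition['Regex'], line)
        if match:
            found[definition['Name']] = float(match.group(1))
    return found


extracted = extract_metrics(sample_log_line, metric_definitions)
print(extracted)
```

Note that each pattern requires a comma after the value, so a metric that appears last in the dictionary would not be captured.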

In [None]:
huggingface_estimator.fit(
  {'train': f"s3://{s3_bucket}/{train_path}"}
)

In [None]:
print(f'training container used: {huggingface_estimator.image_uri}')
print(f'training job name: {huggingface_estimator.latest_training_job.name}')
print(f'trained model artifact location: {huggingface_estimator.model_data}')

## Evaluate

We'll evaluate the predictions from both the fine-tuned model and the base model against the ground truth slogans. We'll calculate several metrics including WER, BERTScore, and ROUGE.

First we download the trained model artifact from S3. The key is built from the estimator's latest training job name; the estimator's `model_data` attribute reports the full S3 location of the artifact. If you are loading a model artifact that was trained at a different time, set `trained_model_s3_key` to the S3 key of that model's `model.tar.gz` file.

In [None]:
trained_model_s3_bucket = sess.default_bucket()
trained_model_s3_key = f'{huggingface_estimator.latest_training_job.name}/output/model.tar.gz'

local_compressed_model_filename = './blip-model-sm.tar.gz'

s3 = boto3.resource('s3')
s3.Bucket(trained_model_s3_bucket).download_file(trained_model_s3_key, local_compressed_model_filename)
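
The cell above rebuilds the artifact key from the training job name, which assumes the default output layout. An alternative sketch is to split the `s3://` URI reported by `huggingface_estimator.model_data` into a bucket and a key. `split_s3_uri` is a hypothetical helper, and the URI in the example is made up.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/object' into ('bucket', 'path/to/object')."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError(f'not an S3 URI: {uri}')
    return parsed.netloc, parsed.path.lstrip('/')


# Made-up URI for illustration only.
example_bucket, example_key = split_s3_uri(
    's3://example-bucket/example-job-name/output/model.tar.gz'
)
print(example_bucket, example_key)
```

The resulting pair could be passed to `s3.Bucket(...).download_file(...)` in place of `trained_model_s3_bucket` and `trained_model_s3_key`.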

Next, unpack the model artifact for local inference. If you unpacked a model in a previous run, rename or delete the `blip-model-sm` folder before running this cell.

In [None]:
if not os.path.exists('blip-model-sm'):
    os.mkdir('blip-model-sm')
    os.system(f'tar -xvzf {local_compressed_model_filename} -C ./blip-model-sm')
else:
    print('Did not uncompress model: model artifact may already exist in the target location. Delete or rename the folder and re-run this cell.')
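
The same unpacking step can also be done with Python's standard `tarfile` module, which avoids shelling out and raises an exception if the archive cannot be read. A minimal sketch follows; `unpack_model` is a hypothetical helper, and it should only be used on archives you trust.

```python
import os
import tarfile


def unpack_model(archive_path, target_dir):
    """Extract a .tar.gz archive into target_dir, refusing to reuse an existing folder."""
    if os.path.exists(target_dir):
        raise FileExistsError(f'{target_dir} already exists; delete or rename it first')
    os.mkdir(target_dir)
    with tarfile.open(archive_path, 'r:gz') as tar:
        tar.extractall(target_dir)
    # Return the top-level entries so the caller can confirm what was unpacked.
    return sorted(os.listdir(target_dir))
```

In this notebook the call would look like `unpack_model(local_compressed_model_filename, 'blip-model-sm')`.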

### Load model and preview results

In [None]:
from transformers import BlipForConditionalGeneration
model = BlipForConditionalGeneration.from_pretrained('./blip-model-sm')

In [None]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

In [None]:
b_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
b_model.to(device)

In [None]:
from matplotlib import pyplot as plt
import random
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

fig = plt.figure(figsize=(18, 35))

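# show the original slogan and both models' output for 12 random test images
for cnt in range(12):
    # randrange excludes the upper bound, so idx is always a valid index
    idx = random.randrange(len(test_ds))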
    example = test_ds[idx]
    image = example["image"]
    orig_caption = example["text"]
    inputs = processor(images=image, return_tensors="pt").to(device)
    pixel_values = inputs.pixel_values

    generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
    generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
    b_ids = b_model.generate(pixel_values=pixel_values, max_length=50)
    b_caption = processor.batch_decode(b_ids, skip_special_tokens=True)[0]
    
    fig.add_subplot(6, 2, cnt+1)
    plt.imshow(image)
    plt.axis("off")
    plt.title(f"Original: {orig_caption}\nGenerated: {generated_caption}\nBaseline: {b_caption}")

### Get predictions from test set

In [None]:
!pip install bert-score

In [None]:
from evaluate import load
bertscore = load("bertscore")

In [None]:
predictions = []
b_predictions = []
references = []

for idx in range(len(test_ds)):
    example = test_ds[idx]
    image = example["image"]
    orig_caption = example["text"]
    inputs = processor(images=image, return_tensors="pt").to(device)
    pixel_values = inputs.pixel_values

    generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
    generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
    b_ids = b_model.generate(pixel_values=pixel_values, max_length=50)
    b_caption = processor.batch_decode(b_ids, skip_special_tokens=True)[0]
    
    references.append(orig_caption)
    predictions.append(generated_caption)
    b_predictions.append(b_caption)
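
The loop above runs generation one image at a time. Since the processor also accepts a list of images, one option for larger test sets is to process several images per call. The sketch below shows only the index-chunking part in plain Python; `batch_indices` is a hypothetical helper, and the batch size is an arbitrary choice.

```python
def batch_indices(n_items, batch_size):
    """Yield lists of consecutive indices that together cover range(n_items)."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    for start in range(0, n_items, batch_size):
        yield list(range(start, min(start + batch_size, n_items)))


chunks = list(batch_indices(10, 4))
print(chunks)
```

Each chunk of indices could then be turned into a list of images and passed to the processor in a single call, with the decoded captions appended to `predictions` and `b_predictions` in the same order.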

### BERTScore (higher is better)

In [None]:
results = bertscore.compute(predictions=predictions, references=references, lang="en")
b_results = bertscore.compute(predictions=b_predictions, references=references, lang="en")

In [None]:
import numpy as np
print(f"F1 - tuned: {np.mean(results['f1'])}, baseline: {np.mean(b_results['f1'])}")
print(f"Precision - tuned: {np.mean(results['precision'])}, baseline: {np.mean(b_results['precision'])}")
print(f"Recall - tuned: {np.mean(results['recall'])}, baseline: {np.mean(b_results['recall'])}")

### WER (lower is better)

In [None]:
!pip install jiwer

In [None]:
from evaluate import load
wer = load("wer")
wer_score = wer.compute(predictions=predictions, references=references)
b_wer_score = wer.compute(predictions=b_predictions, references=references)

In [None]:
print(f"WER: {wer_score}, baseline: {b_wer_score}")
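
WER is the word-level edit distance between a prediction and its reference (substitutions, deletions, and insertions) divided by the number of words in the reference. Because insertions count as errors, a prediction that is much longer than its reference can push the score above 1, which is how a value such as 2.06 can occur. The following is a minimal pure-Python sketch of the calculation for a single pair, using made-up strings. The `evaluate` metric sums errors and reference lengths over the whole test set rather than averaging per-pair scores.

```python
def word_error_rate(reference, prediction):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = prediction.split()
    if not ref:
        raise ValueError('reference must contain at least one word')

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j words of the prediction.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate('just do it', 'just do it'))
print(word_error_rate('just do it', 'a man is standing in front of a sign'))
```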

### ROUGE (higher is better)

In [None]:
!pip install rouge-score nltk

In [None]:
rouge = load('rouge')

In [None]:
rouge_result = rouge.compute(predictions=predictions,
                             references=references,
                             use_aggregator=True)
b_rouge_result = rouge.compute(predictions=b_predictions,
                             references=references,
                             use_aggregator=True)

In [None]:
rouge_result

In [None]:
b_rouge_result
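
The ROUGE results above include several variants, such as `rouge1`, `rouge2`, and `rougeL`, each reported as an F-measure. To illustrate what the simplest of these captures, here is a simplified pure-Python sketch of a ROUGE-1 F-measure based on unigram overlap. It lowercases and splits on whitespace only, so it is an approximation of the library's tokenization, and the example strings are made up.

```python
from collections import Counter


def rouge1_f1(reference, prediction):
    """Simplified ROUGE-1 F-measure: harmonic mean of unigram precision and recall."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1('just do it', 'just do it now'))
print(rouge1_f1('just do it', 'a man standing next to a sign'))
```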

## Next steps

Next steps might include trying different foundation models, training for more epochs, or adding human feedback to improve the results.