# Fine-tune Transformers for 100+ Languages with HuggingFace on Amazon SageMaker: NLP Question Answering Mid-Resource Language Example

In this notebook, we experiment with different solution approaches using HuggingFace Transformers on Amazon SageMaker. We will fine-tune several pretrained language models for the question answering task, focusing on BERT variants.


## Question Answering Task
* Use Case
  - _Extractive Question Answering_: "Extractive QA models are deep learning models that can answer questions given some context."
* Pretrained Language Models
  1. _Multilingual-BERT_: “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding”, (2018). Links: [Paper](https://arxiv.org/pdf/1810.04805.pdf) - [Model](https://huggingface.co/bert-base-multilingual-uncased)
  1. _BERTurk_: Turkish language model created by the MDZ Digital Library team (dbmdz) at the Bavarian State Library (Schweter 2020). Links: [Site](https://zenodo.org/record/3770924#.YgVaAvVBydY) - [Model](https://huggingface.co/dbmdz/bert-base-turkish-uncased)
  1. _XLM-R_: A. Conneau et al., “Unsupervised Cross-Lingual Representation Learning at Scale”, (2019). Links: [Paper](https://arxiv.org/pdf/1911.02116.pdf) - [Model](https://huggingface.co/deepset/xlm-roberta-base-squad2)
  
* Dataset

  - Turkish-Reading-Comprehension-Question-Answering-Dataset. Soygazi F., Çiftçi, Kök, Cengiz, THQuAD: "Turkish Historic Question Answering Dataset for Reading Comprehension", International Conference on Computer Science and Engineering (UBMK), 15-17 Sept. 2021. IEEE Article Link: https://ieeexplore.ieee.org/document/9559013 LICENSE: MIT Licence https://github.com/okanvk/Turkish-Reading-Comprehension-Question-Answering-Dataset/blob/master/LICENSE

## How to Run this Notebook

You can run this notebook in SageMaker Studio. Please select the `PyTorch 1.6 Python 3.6 GPU Optimized` kernel.

**Prerequisites**

To train NLP models in SageMaker, you need to complete the following prerequisites:

* Sign up for an AWS account. For more information, see [Set Up Amazon SageMaker Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-set-up.html).
* Get started with [Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html).


## Fine Tuning for QA Task - Example Flow

![images/nlp-finetune-qa-example.png](images/nlp-finetune-qa-example.png)

## Notebook Environment Setup
First, we need to set up the SageMaker session and retrieve the IAM role and the S3 bucket we will be working with.

In [None]:
!pip install transformers \
             datasets \
            sentencepiece \
            ipywidgets \
            IProgress \
            sagemaker \
            sagemaker-experiments -qq -U

In [None]:
import boto3
import sagemaker
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from sagemaker.huggingface import HuggingFace

from sagemaker.analytics import ExperimentAnalytics

from datasets import load_dataset
from transformers.pipelines import pipeline
from transformers.models.bert.modeling_bert import BertModel,BertForMaskedLM


import time
from time import strftime, gmtime

import pandas as pd

pd.set_option('display.max_columns', None)

In [None]:
sess = sagemaker.Session()

sagemaker_session_bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
sm = boto3.client('sagemaker')

In [None]:
s3_prefix_qa = 'datasets/hf-sagem-tr-qa'

# Part I. Prepare the Dataset, Explore Simple QA Task, Explore Tokenizers

* Download the data
* Prepare the data
* Load as HuggingFace dataset
* Explore the dataset with the HF Dataset Library
* Demonstrate low-code QA task using HF Pipelines


In [None]:
!wget https://raw.githubusercontent.com/okanvk/Turkish-Reading-Comprehension-Question-Answering-Dataset/master/data/2018-teknofest-squad-dataset/dev-v0.1.json -q
!wget https://raw.githubusercontent.com/okanvk/Turkish-Reading-Comprehension-Question-Answering-Dataset/master/data/2018-teknofest-squad-dataset/train-v0.1.json -q

!mkdir -p data

!mv dev-v0.1.json data/dev-v0.1.json
!mv train-v0.1.json data/train-v0.1.json

In [None]:
import json
from datasets import load_dataset

def prep_dataset(input_filename, output_filename):
    with open(input_filename) as f:
        dataset = json.load(f)

    with open(output_filename, "w") as f:
        for article in dataset["data"]:
            title = article["title"]
            for paragraph in article["paragraphs"]:
                context = paragraph["context"]
                answers = {}
                for qa in paragraph["qas"]:
                    question = qa["question"]
                    idx      = qa["id"]
                    answers["text"] = [str(a["text"]) for a in qa["answers"]]
                    answers["answer_start"] = [int(a["answer_start"]) for a in qa["answers"]]
                    f.write(
                        json.dumps(
                            {
                                "id": str(idx),
                                "title": str(title),
                                "context": str(context),
                                "question": str(question),
                                "answers": answers,
                            }
                        )
                    )
                    f.write("\n")
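
The flattening step can be sanity-checked before uploading anything. The following is a minimal sketch, not part of the original notebook: `flatten_squad` mirrors the logic of `prep_dataset` in memory, and `misaligned_answers` is a hypothetical helper that reports records whose `answer_start` offset does not point at the answer text inside the context. The toy record is made up for illustration.

```python
def flatten_squad(dataset: dict) -> list:
    """Flatten nested SQuAD-style JSON into one record per question."""
    records = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                records.append(
                    {
                        "id": str(qa["id"]),
                        "title": str(article["title"]),
                        "context": str(paragraph["context"]),
                        "question": str(qa["question"]),
                        "answers": {
                            "text": [str(a["text"]) for a in qa["answers"]],
                            "answer_start": [int(a["answer_start"]) for a in qa["answers"]],
                        },
                    }
                )
    return records


def misaligned_answers(records: list) -> list:
    """Return ids of records where answer_start does not match the answer text."""
    bad_ids = []
    for rec in records:
        answers = rec["answers"]
        for text, start in zip(answers["text"], answers["answer_start"]):
            if rec["context"][start:start + len(text)] != text:
                bad_ids.append(rec["id"])
    return bad_ids


# Made-up toy input in the nested SQuAD layout (illustrative only).
toy = {
    "data": [
        {
            "title": "Emre Kongar",
            "paragraphs": [
                {
                    "context": "Emre Kongar is a professor.",
                    "qas": [
                        {
                            "id": 1,
                            "question": "What is his title?",
                            "answers": [{"text": "professor", "answer_start": "17"}],
                        }
                    ],
                }
            ],
        }
    ]
}

records = flatten_squad(toy)
print(records[0]["answers"])
print("misaligned ids:", misaligned_answers(records))
```

Running the same check on the real training file would show how many answers have offsets that do not line up with the context.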

In [None]:
prep_dataset('data/train-v0.1.json', 'data/train.json')
prep_dataset('data/dev-v0.1.json', 'data/val.json')

In [None]:
training_input_file = sess.upload_data(
    path="data/train.json", bucket=sagemaker_session_bucket, key_prefix=s3_prefix_qa
)

test_input_file = sess.upload_data(
    path="data/val.json", bucket=sagemaker_session_bucket, key_prefix=s3_prefix_qa
)

In [None]:
data_files = {}
data_files["train"] = 'data/train.json'
data_files["validation"] = 'data/val.json'

In [None]:
ds = load_dataset("json", data_files=data_files)

In [None]:
print("Number of examples in dataset: \n Train = {}, \n Validation = {}".format(len(ds['train']), len(ds['validation'])))

In [None]:
pd.DataFrame(pd.DataFrame(ds['train']).iloc[3815])

# * English Translation:

# * title: Emre Kongar
# * context: Resit Emre Kongar (b. 13 October 1941, Istanbul), Turkish sociologist, professor. 
# * question: What is the academic title of Emre Kongar?
# * answers: {'text': ['professor'], 'answer_start': [68]}

### [Optional] A low-code QA task using HF Pipelines

If a pretrained model in the HF Hub is already sufficient for your scenario and no fine-tuning is required, you can use HF Pipelines and simply provide the task name (here `question-answering`) and the name of the model, as follows.


In [None]:
# * English Translation:

# * question: What is the most popular animal in Istanbul?
# * context: Istanbul is famous for its history, the many empires that have existed here, its majestic skyline and the stunning Bosphorus. 
#     But, actually, this is the city of cats. Istanbulites share the city with approximately 125,000 cats. 
#     And these are just the street cats; when house cats are included, the feline population rises to about 200,000. 
#     More than the population of many cities and countries! 
#     We can illustrate Istanbul’s love for cats with a quote from Ernest Hemingway:
#     A cat has absolute emotional honesty: human beings, for one reason or another, may hide their feelings, but a cat does not.
# * answers: top 1: "from cats", top 2: "street cats", top 3: "its cats"
model_name = "deepset/xlm-roberta-base-squad2"

nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)

context = 'İstanbul tarihi, burada var olan birçok imparatorluk, görkemli silüeti ve çarpıcı Boğaz ile ünlüdür. \
Ama aslında burası kedilerin şehri. İstanbullular şehri yaklaşık 125,000 kediyle paylaşıyor. \
Ve bunlar sadece sokak kedileri; ev kedileri dahil edildiğinde kedi nüfusu yaklaşık 200,000e yükselir. \
Birçok şehir ve ülkenin nüfusundan daha fazla! İstanbul\'un kedilere olan sevgisini \
Ernest Hemingway\'den bir alıntıyla gösterebiliriz: Bir kedinin mutlak duygusal dürüstlüğü vardır: \
insanlar, bir sebepten ötürü duygularını gizleyebilir, ancak bir kedi saklamaz.'

question = 'İstanbulun en popüler hayvanı hangisidir?'


pd.DataFrame(nlp(question=question, context=context, top_k=3))
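
With `top_k=3`, the pipeline returns several ranked answer spans rather than a single one. The following is a simplified sketch of the idea behind that ranking, not the pipeline's actual implementation: each candidate span is scored by adding a start score and an end score, and the best `k` spans are kept. The function name, the scores, and the tokens are all made up for illustration.

```python
def top_k_spans(start_scores, end_scores, k=3, max_answer_len=15):
    """Rank (start, end) token spans by start score + end score.

    Only spans with start <= end and at most max_answer_len tokens are considered.
    Returns a list of (score, start_index, end_index), best first.
    """
    candidates = []
    for start, start_score in enumerate(start_scores):
        last_end = min(start + max_answer_len, len(end_scores))
        for end in range(start, last_end):
            candidates.append((start_score + end_scores[end], start, end))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:k]


# Made-up per-token scores for a three-token context (illustrative only).
tokens = ["burası", "kedilerin", "şehri"]
start_scores = [0.1, 2.0, 0.3]
end_scores = [0.2, 1.5, 1.0]

best = top_k_spans(start_scores, end_scores, k=2)
for score, start, end in best:
    print(score, " ".join(tokens[start:end + 1]))
```

A real model produces these scores per subword token, and the pipeline additionally maps token indices back to character offsets in the context.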

# Part II. Experiment Fine-Tuning Strategies for Question-Answering Task in Your Language 

For our example, we take the following pretrained models that support Turkish, all available via the HuggingFace Hub:

* BERTurk 
* Multilingual BERT 
* XLM-RoBERTa Base 

and fine-tune them on different SageMaker instance types (and GPU architectures):

* p2.xlarge   (GPU: Nvidia  K80 GPU. GPU Architecture: Kepler (2012))
* p3.2xlarge  (GPU: Nvidia V100 GPU. GPU Architecture: Volta  (2017))
* g4dn.xlarge (GPU: Nvidia   T4 GPU. GPU Architecture: Turing (2018))

[SageMaker Training Instance Pricing](https://aws.amazon.com/sagemaker/pricing/) 

We will use SageMaker Experiments to track our model training experiments and compare the performance of these models.


Once the experiment is created, we can keep track of metadata such as:

- hyperparameters
- artifacts (input data, output model location)
- custom metadata. In our example, we will add static metadata such as model parameter size and number of layers, and dynamic metadata such as custom metrics and training duration.

## Setup Model list & metric definitions
Now we can create a list of models we want to compare for the Turkish QA task. In terms of costs, please note that a separate training job for each model will be launched. We then define the metrics we want to use for comparing the models.

In [None]:
trial_configs = []

config = {}
config['model'] = 'dbmdz/bert-base-turkish-uncased'
config['instance_type'] = 'ml.g4dn.xlarge'
config['instance_count'] = 1
config['distribution'] = None
trial_configs.append(config)

config = {}
config['model'] = 'bert-base-multilingual-uncased'
config['instance_type'] = 'ml.p3.2xlarge'
config['instance_count'] = 1
config['distribution'] = None
trial_configs.append(config)

config = {}
config['model'] = 'deepset/xlm-roberta-base-squad2'
config['instance_type'] = 'ml.g4dn.xlarge'
config['instance_count'] = 1
config['distribution'] = None
trial_configs.append(config)

config = {}
config['model'] = 'deepset/xlm-roberta-base-squad2'
config['instance_type'] = 'ml.p3.2xlarge'
config['instance_count'] = 1
config['distribution'] = None
trial_configs.append(config)

config = {}
config['model'] = 'deepset/xlm-roberta-base-squad2'
config['instance_type'] = 'ml.p3.16xlarge'
config['instance_count'] = 2
# Define the distribution parameters in the HuggingFace Estimator
config['distribution'] = {'smdistributed':{'dataparallel':{ 'enabled': True }}}
trial_configs.append(config)

trial_configs

In [None]:
# Raw strings keep the regex escape \D from being treated as a Python string escape
metric_definitions=[
     {'Name': 'f1', 'Regex': r"f1.*=\D*(.*?)$"},
     {'Name': 'exact_match', 'Regex': r"exact_match.*=\D*(.*?)$"}]
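
SageMaker applies each metric regex to the training job's log output and records the captured group as the metric value. The following sketch shows what the two patterns capture, using Python's `re` module on hypothetical log lines. The exact format that `run_qa.py` prints is an assumption here, and `extract_metrics` is an illustrative helper, not a SageMaker API.

```python
import re

metric_definitions = [
    {"Name": "f1", "Regex": r"f1.*=\D*(.*?)$"},
    {"Name": "exact_match", "Regex": r"exact_match.*=\D*(.*?)$"},
]

# Hypothetical log lines; the real training log format may differ.
sample_log = [
    "  eval_exact_match = 45.5",
    "  eval_f1 = 61.25",
    "  eval_samples = 1000",
]


def extract_metrics(lines, definitions):
    """Apply every metric regex to every line and keep the last captured value."""
    found = {}
    for line in lines:
        for definition in definitions:
            match = re.search(definition["Regex"], line)
            if match:
                found[definition["Name"]] = float(match.group(1))
    return found


metrics = extract_metrics(sample_log, metric_definitions)
print(metrics)
```

Testing the patterns locally like this is cheaper than discovering after a training job that a metric was never captured.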

## Experiment Management using SageMaker Experiments

Now we create the experiment. We first choose a unique name and then use the SageMaker Experiments API to create it.

In [None]:
create_date = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
experiment_name = f"NLP-QuestionAnswer-HF-fine-tune-tr-{create_date}"
experiment_name

In [None]:
nlp_experiment = Experiment.create(
    experiment_name=experiment_name,
    description="Fine-tune Transformers for Q-A Task with HuggingFace on Amazon SageMaker",
    sagemaker_boto_client=sm,
)

## Managed Training using HF Deep Learning Containers on Amazon SageMaker

Setting up, managing, and scaling custom ML environments can be time-consuming and cumbersome even for experts. With AWS Deep Learning Containers (DLC), you get access to prepackaged and optimized DL frameworks that make it easy for you to customize, extend, and scale your environments.

To train HuggingFace models on SageMaker, we will use the HuggingFace Estimator available in the SageMaker Python SDK, which makes it easy to fine-tune and customise models from HuggingFace Hub. We instantiate this class with the following inputs:

* a training script which uses the Transformers library to prepare batches of data for training, then uses PyTorch to train our model on SageMaker-managed training instances. In our example, we provide a script called run_qa.py (available from transformers library).
* training configuration such as model name, and hyperparameters such as batch size, number of epochs, etc. 
* SageMaker instance type and count
* Whether to use Spot Instances for training
* Two popular custom metric definitions for Question Answering task: F1 score and Exact Match (EM)
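
SageMaker makes each input channel passed to `fit()` available inside the training container under `/opt/ml/input/data/<channel name>`, and the training script reads its files from there. The following is a minimal sketch of that mapping. The helper `channel_path` and the bucket name are hypothetical and used only for illustration.

```python
import posixpath

# Root directory where SageMaker places input channels inside the container.
SM_INPUT_ROOT = "/opt/ml/input/data"


def channel_path(channel: str, filename: str = "") -> str:
    """Return the in-container path of a channel directory or a file inside it."""
    if filename:
        return posixpath.join(SM_INPUT_ROOT, channel, filename)
    return posixpath.join(SM_INPUT_ROOT, channel)


# Hypothetical S3 locations; both channels may point at the same prefix.
inputs = {
    "train": "s3://my-example-bucket/datasets/hf-sagem-tr-qa/",
    "val": "s3://my-example-bucket/datasets/hf-sagem-tr-qa/",
}

for channel in inputs:
    print(channel, "->", channel_path(channel))

print(channel_path("train", "train.json"))
print(channel_path("val", "val.json"))
```

The channel names used as keys in `inputs` therefore have to match the directory names used in file path hyperparameters such as `train_file` and `validation_file`.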


In [None]:
# We can specify a training script that is stored in a GitHub repository as the entry point for our Estimator, 
# so we don’t have to download the scripts locally.
# git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.10.0'}

In [None]:
for config in trial_configs:
    model = config['model'].split('/')[-1]
    instance = config['instance_type'].replace('.', '-')
    trial_name = f"NLP-QA-Trial-{model}-{instance}-{int(time.time())}"
    
    # create a trial that will be attached to the experiment
    nlp_trial = Trial.create(
        trial_name=trial_name,
        experiment_name=nlp_experiment.experiment_name,
        sagemaker_boto_client=sm,
    )

    hyperparameters_qa={
        'model_name_or_path': config['model'],
        'train_file': '/opt/ml/input/data/train/train.json',
        'validation_file': '/opt/ml/input/data/val/val.json',
        'do_train': True,
        'do_eval': True,
        'fp16': True,
        'per_device_train_batch_size': 16,
        'per_device_eval_batch_size': 16,
        'num_train_epochs': 2,
        'max_seq_length': 384,
        'pad_to_max_length': True,
        'doc_stride': 128,
        'output_dir': '/opt/ml/model'
    }

    huggingface_estimator = HuggingFace(entry_point='run_qa.py',
                                        source_dir='./sagemaker',
                                        # source_dir='./examples/pytorch/question-answering',
                                        # git_config=git_config,
                                        instance_type=config['instance_type'],
                                        instance_count=config['instance_count'],
                                        role=role,
                                        transformers_version='4.12.3',
                                        pytorch_version='1.9.1',
                                        py_version='py38',
                                        distribution=config['distribution'],
                                        hyperparameters=hyperparameters_qa,
                                        metric_definitions=metric_definitions,
                                        enable_sagemaker_metrics=True,)
    
    nlp_training_job_name = f"NLPjob-{model}-{instance}-{int(time.time())}"
    
    training_input_path = f's3://{sagemaker_session_bucket}/{s3_prefix_qa}/'
    test_input_path = f's3://{sagemaker_session_bucket}/{s3_prefix_qa}/'
    
    huggingface_estimator.fit(
        inputs={'train': training_input_path, 'val': test_input_path},
        job_name=nlp_training_job_name,
        experiment_config={
            "ExperimentName": nlp_experiment.experiment_name,
            "TrialName": nlp_trial.trial_name,
            "TrialComponentDisplayName": nlp_trial.trial_name,},
        wait=False,
    )
    
    print("\n Launched Trial for: \n Model = {}, \n Instance Type = {}".format(model, instance))
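
Training job names are built from the model id and instance type, so it can help to validate them before launching. The following sketch assumes the constraints documented for the `CreateTrainingJob` API: at most 63 characters, consisting of alphanumeric characters and hyphens, and starting and ending with an alphanumeric character. The helper `make_job_name` is hypothetical and not part of the SageMaker SDK.

```python
import re
import time

# Assumed constraint for training job names (alphanumerics and hyphens only).
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
MAX_JOB_NAME_LENGTH = 63


def make_job_name(prefix, model_id, instance_type, timestamp=None):
    """Build a job name like the notebook does and validate it before use."""
    model = model_id.split("/")[-1]             # drop the Hub organization
    instance = instance_type.replace(".", "-")  # dots are not allowed in job names
    ts = int(time.time()) if timestamp is None else timestamp
    name = f"{prefix}-{model}-{instance}-{ts}"
    if len(name) > MAX_JOB_NAME_LENGTH or not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"Invalid training job name: {name!r}")
    return name


name = make_job_name(
    "NLPjob", "dbmdz/bert-base-turkish-uncased", "ml.g4dn.xlarge", timestamp=1644500000
)
print(name, len(name))
```

Model ids that contain underscores or dots, or long ids combined with long instance type names, would fail this check and need to be shortened or sanitized first.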

# Part III. Compare Results

Once all training jobs have completed, we can retrieve various metadata via the SageMaker Experiments API and compare the performance of our models. The `ExperimentAnalytics` object from the SageMaker Python SDK provides a lot of information, including the values of our custom metrics (F1 and EM, Exact Match).

We will enrich this metadata with custom info such as 

* training duration information provided by the SageMaker Training Job
* info from HF model config related to model capacity

### Retrieve Metadata from each Trial

In [None]:
trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sess, 
    experiment_name=nlp_experiment.experiment_name
)

df_results = trial_component_analytics.dataframe()

In [None]:
df_results.head()

### Retrieve Training Duration

In [None]:
training_jobs = list(df_results['sagemaker_job_name'])

for job in training_jobs:
    description = sess.describe_training_job(job.replace('"', ''))
    df_results.loc[df_results['sagemaker_job_name'] == job, 'TrainingTimeInMinutes'] = description['TrainingTimeInSeconds'] / 60

### Retrieve Model Capacity Related Info from HF model config
We can retrieve the config details from the config.json file in the model's repo (e.g. https://huggingface.co/dbmdz/bert-base-turkish-uncased/raw/main/config.json).

In [None]:
import requests
import json

In [None]:
for config in trial_configs:
    r = requests.get(f"https://huggingface.co/{config['model']}/raw/main/config.json")
    r.raise_for_status()  # fail early if the config file cannot be downloaded
    model_config = r.json()
    df_results.loc[df_results['model_name_or_path'].str.replace('"', '') == config['model'], 'vocab_size'] = model_config['vocab_size']

### Retrieve Training Costs (in USD)

In [None]:
def get_instance_prices(region, purpose):
    endpoint = "https://b0.p.awsstatic.com/pricing/2.0/meteredUnitMaps/sagemaker/USD/current/sagemaker-instances.json"
    response = requests.get(endpoint).json()
    price_dict = {}

    for instance_region, region_data in response["regions"].items():
        if instance_region == region:
            for instance_type, instance_data in region_data.items():
                instance_purpose = instance_data["Instance Type"].split('-')[-1]
                if instance_purpose == purpose:
                    price_dict[instance_data["Instance"]] = instance_data["price"]
                    
    return price_dict

regions = pd.read_csv('regions.csv')
region_name = regions.loc[regions['region_code'] == sess.boto_region_name]['region_name'].item()

prices = get_instance_prices(region=region_name, purpose="Training")

df_results['cost_per_hour_per_instance_usd'] = df_results.apply(lambda x: float(prices[x['SageMaker.InstanceType']]), axis=1)
df_results['training_job_cost_usd'] = df_results['TrainingTimeInMinutes'] * 60 * df_results['cost_per_hour_per_instance_usd'] * df_results['SageMaker.InstanceCount'] / 3600.0
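
The cost formula above reduces to training hours times hourly price times instance count. The following is a small worked sketch with made-up numbers. The function name and the price are hypothetical, and real prices depend on region and instance type.

```python
def training_job_cost_usd(training_minutes, price_per_hour_usd, instance_count=1):
    """Estimate the cost of a training job from its duration and instance price."""
    training_hours = training_minutes / 60.0
    return training_hours * price_per_hour_usd * instance_count


# Made-up example: 30 minutes on 2 instances at 4.00 USD per instance-hour.
cost = training_job_cost_usd(training_minutes=30, price_per_hour_usd=4.0, instance_count=2)
print(f"Estimated cost: {cost:.2f} USD")
```

This estimate covers only instance time for the training job itself and ignores other charges such as storage.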

## Review Results

In [None]:
df_results[['Experiments', 'model_name_or_path', 'vocab_size', 'SageMaker.InstanceType', 'SageMaker.InstanceCount', 'sagemaker_distributed_dataparallel_enabled', 'num_train_epochs', \
    'exact_match - Avg', 'f1 - Avg', 'TrainingTimeInMinutes', 'cost_per_hour_per_instance_usd', 'training_job_cost_usd']]  \
    .sort_values(by=['exact_match - Avg', 'f1 - Avg'], ascending=False)

In [None]:
df_results.to_csv(f"sm-experiment-results-{create_date}.csv")

You can also review the results in the SageMaker Studio UI.


Experiment Results Comparison Table:

![images/SMExperiment-results-table.png](images/SMExperiment-results-table.png)


Evaluation Metric Results Comparison Graph:

![images/SMExperiment-results-graph.png](images/SMExperiment-results-graph.png)
