# Finetune T5 locally for machine translation on COVID-19 Health Service Announcements with Hugging Face

[![Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws/studio-lab-examples/blob/main/natural-language-processing/NLP_Disaster_Recovery_Translation.ipynb)

This notebook is designed to run in SageMaker Studio Lab on a `g4dn.xlarge` GPU instance. If you are not currently on a GPU runtime, restart your session and select `GPU`; training then takes tens of minutes rather than hours.

If you are ready to train a large-scale machine translation model, check out Hugging Face on Amazon SageMaker!

Otherwise, please enjoy this notebook.

### Step 0. Install all necessary packages

In [None]:
%%writefile requirements.txt

ipywidgets
git+https://github.com/huggingface/transformers
datasets
sacrebleu
torch
sentencepiece
evaluate

In [None]:
%pip install -r requirements.txt

In [None]:
import IPython
# make sure to restart your kernel to use the newly installed packages
# IPython.Application.instance().kernel.do_shutdown(True)

### Step 1. Explore the available datasets on Translators without Borders
Then, download a pair you would like to use for training a language translation model. The steps below download the translation pairs for English to Spanish, but you are welcome to modify these and use a different pair if you prefer.

Overall site page: https://tico-19.github.io/

Page with all language pairs: https://tico-19.github.io/memories.html 

Scroll through all supported language pairs and pick your favorite. We'll demonstrate English to Spanish, `en-to-es`.

Copy the link to that pair. For `en-to-es` it looks like this:
- https://tico-19.github.io/data/TM/all.en-es-LA.tmx.zip 

In [None]:
path_to_my_data = 'https://tico-19.github.io/data/TM/all.en-es-LA.tmx.zip'

In [None]:
!wget {path_to_my_data}

In [None]:
local_file = path_to_my_data.split('/')[-1]
print(local_file)
filename = local_file.split('.zip')[0]
print(filename)

In [None]:
!unzip {local_file}
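
If `wget` or `unzip` is not available in your environment, the same file-name handling and extraction can be done with the Python standard library. This is a minimal sketch, not part of the original notebook; the helper names `names_from_url` and `extract_archive` are hypothetical.

```python
import zipfile
from pathlib import Path
from urllib.parse import urlparse


def names_from_url(url):
    """Return (archive_name, extracted_name) derived from a download URL."""
    archive_name = Path(urlparse(url).path).name
    # Drop only a trailing ".zip"; the rest is assumed to be the TMX file name.
    if archive_name.endswith(".zip"):
        extracted_name = archive_name[: -len(".zip")]
    else:
        extracted_name = archive_name
    return archive_name, extracted_name


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

For example, `names_from_url(path_to_my_data)` yields the same `local_file` and `filename` values as the cell above, and `extract_archive(local_file)` stands in for `!unzip`.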

### Step 2. Extract data from the `.tmx` file type
Next, you can use this local function to extract data from the `.tmx` file and format it for local training with Hugging Face.

In [None]:
# paste the name of your file and language codes here
source_code_1 = 'en'
target_code_2 = 'es'

In [None]:
def parse_tmx(filename, source_code_1, target_code_2):
    '''
    Takes a local TMX filename and codes for the source and target languages.
    Walks through the file row by row, looking for TMX / HTML-specific formatting.
    When a row starts with one of the expected tags, strips the tags and adds
    the text to a dictionary for downstream pandas formatting.
    '''
    # NOTE: the tag literals below are assumptions based on the usual TMX layout,
    # with one <tuv> element per row:
    #   <tu><tuv xml:lang="en"><seg>...</seg></tuv>
    #   <tuv xml:lang="es-LA"><seg>...</seg></tuv></tu>
    # Open your .tmx file and adjust them if it is laid out differently.
    source_open = '<tu><tuv xml:lang="{}"><seg>'.format(source_code_1)
    # when you use your own target code, remove the -LA string
    target_open = '<tuv xml:lang="{}-LA"><seg>'.format(target_code_2)
    seg_close = '</seg></tuv>'

    data = {source_code_1: [], target_code_2: []}

    with open(filename, encoding='utf-8') as f:
        for row in f:
            row = row.strip()

            if row.startswith(source_open):
                st_1 = row.replace(source_open, '').replace(seg_close, '')
                data[source_code_1].append(st_1)

            elif row.startswith(target_open):
                st_2 = row.replace(target_open, '').replace('</tu>', '')
                st_2 = st_2.replace(seg_close, '')
                data[target_code_2].append(st_2)

    return data

data = parse_tmx(filename, source_code_1, target_code_2)

In [None]:
# this makes sure you got actual pairs
assert len(data[source_code_1]) == len(data[target_code_2])
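
Line-by-line string matching depends on the exact layout of the file. Since TMX is an XML format, one alternative is to parse it with the standard library's `xml.etree.ElementTree`. The sketch below is an assumption-laden alternative, not the notebook's method: `parse_tmx_xml` is a hypothetical helper, and it assumes each `<tu>` holds `<tuv xml:lang="...">` elements with a `<seg>` child.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def parse_tmx_xml(source, source_code, target_code):
    """Collect aligned segments from a TMX file path or binary file object."""
    data = {source_code: [], target_code: []}
    root = ET.parse(source).getroot()
    for unit in root.iter("tu"):
        segments = {}
        for variant in unit.iter("tuv"):
            lang = variant.get(XML_LANG, "")
            seg = variant.find("seg")
            if seg is not None:
                # Key on the base language so that "es-LA" matches "es".
                base_lang = lang.split("-")[0].lower()
                segments[base_lang] = "".join(seg.itertext()).strip()
        # Keep only units that contain both languages, so the lists stay aligned.
        if source_code in segments and target_code in segments:
            data[source_code].append(segments[source_code])
            data[target_code].append(segments[target_code])
    return data
```

Because a unit is only kept when both languages are present, the two lists always have the same length, which is the property the assertion above checks.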

In [None]:
import pandas as pd

df = pd.DataFrame.from_dict(data, orient = 'columns')

df.head()

In [None]:
# write to disk in case you need to restart your kernel later
df.to_csv('language_pairs.csv', index=False, header=True)

### Step 3. Format extracted data for machine translation with Hugging Face
Core examples available right here: https://github.com/huggingface/transformers/tree/master/examples/pytorch/translation 

Guidance on formatting for Hugging Face datasets here:
https://huggingface.co/docs/datasets/loading_datasets.html#json-files 

In [None]:
import pandas as pd

df = pd.read_csv('language_pairs.csv')
df.head()

The translation task supports only custom JSON Lines files, where each line is a dictionary with the key "translation" whose value is another dictionary keyed by the two language codes. For example:

`{ "translation": { "en": "Others have dismissed him as a joke.", "ro": "Alții l-au numit o glumă." } }`
`{ "translation": { "en": "And some are holding out for an implosion.", "ro": "Iar alții așteaptă implozia." } }`


In [None]:
objs = []

for idx, row in df.iterrows():
    obj = {"translation": {source_code_1: row[source_code_1], target_code_2: row[target_code_2]}}
    objs.append(obj)

In [None]:
objs[:5]

In [None]:
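import json
# -p avoids an error if the directory already exists
!mkdir -p data
# write one JSON object per line; keep non-ASCII characters readable
with open('data/train.json', 'w', encoding='utf-8') as f:
    for row in objs:
        j = json.dumps(row, ensure_ascii=False)
        f.write(j)
        f.write('\n')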
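
Before training, it can help to confirm that the file really follows the format described above. The following is a minimal sketch of such a check; `check_translation_jsonl` is a hypothetical helper, not part of Hugging Face or the original notebook. It also catches rows where pandas produced a missing value instead of text.

```python
import json


def check_translation_jsonl(path, source_lang, target_lang):
    """Return the number of valid records; raise ValueError on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            pair = record.get("translation") if isinstance(record, dict) else None
            if not isinstance(pair, dict) or set(pair) != {source_lang, target_lang}:
                raise ValueError(
                    f"line {line_no}: expected keys {source_lang!r} and {target_lang!r}"
                )
            # Both sides must be non-empty strings (NaN values fail this check).
            if not all(isinstance(text, str) and text.strip() for text in pair.values()):
                raise ValueError(f"line {line_no}: empty or non-string text")
            count += 1
    return count
```

For instance, `check_translation_jsonl('data/train.json', source_code_1, target_code_2)` should return the number of rows in `df` if every pair was written correctly.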

### Step 4. Finetune a machine translation model locally
To do this, let's first download the raw Python file we need from Hugging Face to finetune our model.

In [None]:
!wget https://raw.githubusercontent.com/huggingface/transformers/main/examples/pytorch/translation/run_translation.py

In [None]:
# full Hugging Face Trainer API args available here
# https://github.com/huggingface/transformers/blob/de635af3f1ef740aa32f53a91473269c6435e19e/src/transformers/training_args.py
# T5 training args available here
# https://huggingface.co/transformers/model_doc/t5.html#t5config
!python run_translation.py \
 --model_name_or_path t5-small \
 --do_train \
 --source_lang en \
 --target_lang es \
 --source_prefix "translate English to Spanish: " \
 --train_file data/train.json \
 --output_dir output/tst-translation \
 --per_device_train_batch_size=4 \
 --per_device_eval_batch_size=4 \
 --overwrite_output_dir \
 --predict_with_generate \
 --save_strategy epoch \
 --num_train_epochs 3
# --do_eval \
# --validation_file path_to_jsonlines_file \
# --dataset_name cov-19 \
# --dataset_config_name en-es \
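
The commented-out `--do_eval` and `--validation_file` flags need a held-out JSON Lines file. One way to produce it is to split the records before writing them, as in this sketch. The helper names `split_records` and `write_jsonl`, the 10% default, and the seed are assumptions for illustration only.

```python
import json
import random


def split_records(records, validation_fraction=0.1, seed=42):
    """Shuffle a copy of records and return (train, validation) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

With `train, val = split_records(objs)`, you could write `train` to `data/train.json` and `val` to a file such as `data/validation.json` (a hypothetical name), then pass that path to `--validation_file`.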


In [None]:
!ls output/tst-translation

### Step 5. Test your newly fine-tuned translation model

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")

# T5 is an encoder-decoder model, so load it with the seq2seq auto class
model = AutoModelForSeq2SeqLM.from_pretrained(pretrained_model_name_or_path='output/tst-translation')

In [None]:
# switch the model to evaluation mode (disables dropout) for local inference
model.eval()

Next, let's test it! Remember that with the default setting of only 3 epochs, your translations are probably not going to be state of the art (SOTA). To get there, we recommend migrating to Amazon SageMaker to scale up and out. Scaling up means moving your code to a more powerful compute type, such as the p4 series or even Trainium. Scaling out means adding more compute, going from one instance to many. Using the entire AWS cloud you can train for much longer periods of time on much larger datasets, which can directly translate to a more accurate model.

In [None]:
input_sequences = [
    'about how long have these symptoms been going on?',
    'and all chest pain should be treated this way especially with your age',
    'and along with a fever',
    'and also needs to be checked your cholesterol blood pressure',
    'and are you having a fever now?',
    'and are you having any of the following symptoms with your chest pain',
    'and are you having a runny nose?',
    'and are you having this chest pain now?',
    'and besides do you have difficulty breathing',
    'and can you tell me what other symptoms are you having along with this?',
    'and does this pain move from your chest?',
    'and drink lots of fluids',
    'and how high has your fever been',
    'and i have a cough too',
    'and i have a little cold and a cough',
    "and i'm really having some bad chest pain today",
]

task_prefix = "translate English to Spanish: "

for i in input_sequences:
    # task_prefix already ends with a space, so concatenate directly
    input_ids = tokenizer(task_prefix + i, return_tensors='pt').input_ids
    outputs = model.generate(input_ids)
    print(i, tokenizer.decode(outputs[0], skip_special_tokens=True))
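
Translating one sentence per call is simple but can be slow for longer lists. A common alternative is to build all prompts up front and process them in small batches. The sketch below covers only the pure-Python part; `build_prompts` and `batched` are hypothetical helpers, and the batch size is left for you to choose.

```python
def build_prompts(sentences, task_prefix="translate English to Spanish: "):
    """Prepend the task prefix to each sentence, dropping stray whitespace."""
    return [task_prefix + sentence.strip() for sentence in sentences]


def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Each batch from `batched(build_prompts(input_sequences), 8)` can then be tokenized with `tokenizer(batch, return_tensors='pt', padding=True)`, passed to `model.generate`, and decoded with `tokenizer.batch_decode(outputs, skip_special_tokens=True)`.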


In [None]:
model.save_pretrained('my-tf-en-to-sp')

In [None]:
!tar -czf my_model.tar.gz my-tf-en-to-sp