{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Finetune T5 locally for machine translation on COVID-19 Health Service Announcements with Hugging Face\n",
    "\n",
    "[![Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws/studio-lab-examples/blob/main/natural-language-processing/NLP_Disaster_Recovery_Translation.ipynb)\n",
    "\n",
    "This notebook is designed to run within SageMaker Lab, on a `g4dn.xlarge` GPU instance. If you are not using that right now, please restart your session and select `GPU`, as this will help you train your model in a matter of tens of minutes, rather than hours.\n",
    "\n",
    "If you are ready for training a large-scale machine translation model, then please check out using Hugging Face on Amazon SageMaker! \n",
    "\n",
    "Otherwise, please enjoy this notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 0. Install all necessary packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%writefile requirements.txt\n",
    "\n",
    "ipywidgets\n",
    "git+https://github.com/huggingface/transformers\n",
    "datasets\n",
    "sacrebleu\n",
    "torch\n",
    "sentencepiece\n",
    "evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%pip install -r requirements.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import IPython\n",
    "# make sure to restart your kernel to use the newly install packages\n",
    "# IPython.Application.instance().kernel.do_shutdown(True) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1. Explore the available datasets on Translators without Borders \n",
    "Then, download a pair you would like to use for training a language translation model. The steps below download the translation pairs for English to Spanish, but you are welcome to modify these and use a different pair if you prefer.\n",
    "\n",
    "Overall site page: https://tico-19.github.io/\n",
    "\n",
    "Page with all language pairs: https://tico-19.github.io/memories.html \n",
    "\n",
    "Scroll through all supported language pairs and pick your favorite. We'll demonstrate English to Spanish, `en-to-es`\n",
    "\n",
    "Copy the link to that pair, for `en-to-es` it looks like this:\n",
    "- https://tico-19.github.io/data/TM/all.en-es-LA.tmx.zip "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path_to_my_data = 'https://tico-19.github.io/data/TM/all.en-es-LA.tmx.zip'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget {path_to_my_data}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "local_file = path_to_my_data.split('/')[-1]\n",
    "print (local_file)\n",
    "filename = local_file.split('.zip')[0]\n",
    "print (filename)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!unzip {local_file}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2: Extract data from `.tmx` file type \n",
    "Next, you can use this local function to extract data from the `.tmx` file type and format for local training with Hugging Face."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# paste the name of your file and language codes here\n",
    "source_code_1 = 'en'\n",
    "target_code_2 =  'es'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def parse_tmx(filename, source_code_1, target_code_2):\n",
    "    '''\n",
    "    Takes a local TMX filename and codes for source and target languages. \n",
    "    Walks through your file, row by row, looking for tmx / html specific formatting.\n",
    "    If there's a regex match, will clean your string and add to a dictionary for downstream pandas formatting.\n",
    "    '''\n",
    "    \n",
    "    data = {source_code_1:[], target_code_2:[]}\n",
    "\n",
    "    with open(filename) as f:\n",
    "\n",
    "        for row in f.readlines():\n",
    "\n",
    "            if not row.endswith('</seg></tuv>\\n'):\n",
    "                continue\n",
    "\n",
    "            if row.startswith('<seg>'):\n",
    "\n",
    "                st_1 = row.strip()\n",
    "\n",
    "                st_1 = st_1.replace('<seg>', '')\n",
    "                st_1 = st_1.replace('</seg></tuv>', '')\n",
    "\n",
    "                data[source_code_1].append(st_1)\n",
    "\n",
    "            # when you use your own target code, remove the -LA string \n",
    "            if row.startswith('<tuv xml:lang=\"{}-LA\"><seg>'.format(target_code_2)):\n",
    "\n",
    "                st_2 = row.strip()\n",
    "                # when you use your own target code, remove the -LA string \n",
    "                st_2 = st_2.replace('<tuv xml:lang=\"{}-LA\"><seg>'.format(target_code_2), '')\n",
    "                st_2 = st_2.replace('</seg></tuv>', '')\n",
    "\n",
    "                data[target_code_2].append(st_2)\n",
    "                \n",
    "        return data\n",
    "\n",
    "data = parse_tmx(filename, source_code_1, target_code_2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# this makes sure you got actual pairs\n",
    "assert len(data[source_code_1]) == len(data[target_code_2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.DataFrame.from_dict(data, orient = 'columns')\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# write to disk in case you need to restart your kernel later\n",
    "df.to_csv('language_pairs.csv', index=False, header=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3. Format extracted data for machine translation with Hugging Face\n",
    "Core examples available right here: https://github.com/huggingface/transformers/tree/master/examples/pytorch/translation \n",
    "\n",
    "Guidance on formatting for Hugging Face datasets here:\n",
    "https://huggingface.co/docs/datasets/loading_datasets.html#json-files "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv('language_pairs.csv')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The task of translation supports only custom JSONLINES files, with each line being a dictionary with a key \"translation\" and its value another dictionary whose keys is the language pair. For example:\n",
    "\n",
    "`{ \"translation\": { \"en\": \"Others have dismissed him as a joke.\", \"ro\": \"Alții l-au numit o glumă.\" } }\n",
    "{ \"translation\": { \"en\": \"And some are holding out for an implosion.\", \"ro\": \"Iar alții așteaptă implozia.\" } }`\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "objs = []\n",
    "\n",
    "for idx, row in df.iterrows():\n",
    "    \n",
    "    obj = {\"translation\": {source_code_1: row[source_code_1], target_code_2: row[target_code_2]}} \n",
    "    objs.append(obj)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "objs[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json \n",
    "!mkdir data\n",
    "with open('data/train.json', 'w') as f:\n",
    "    for row in objs:\n",
    "        j = json.dumps(row, ensure_ascii = False)\n",
    "        f.write(j)\n",
    "        f.write('\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 4 - Finetune a machine translation model locally\n",
    "Do to this, let's first download the raw Python file we need from Hugging Face to finetune our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/pytorch/translation/run_translation.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# full hugging face Trainer API args available here\n",
    "# https://github.com/huggingface/transformers/blob/de635af3f1ef740aa32f53a91473269c6435e19e/src/transformers/training_args.py\n",
    "# T5 trainig args available here\n",
    "# https://huggingface.co/transformers/model_doc/t5.html#t5config\n",
    "!python run_translation.py \\\n",
    "    --model_name_or_path t5-small \\\n",
    "    --do_train \\\n",
    "    --source_lang en \\\n",
    "    --target_lang es \\\n",
    "    --source_prefix \"translate English to Spanish: \" \\\n",
    "    --train_file data/train.json \\\n",
    "    --output_dir output/tst-translation \\\n",
    "    --per_device_train_batch_size=4 \\\n",
    "    --per_device_eval_batch_size=4 \\\n",
    "    --overwrite_output_dir \\\n",
    "    --predict_with_generate \\\n",
    "    --save_strategy epoch \\\n",
    "    --num_train_epochs 3\n",
    "#     --do_eval \\\n",
    "#     --validation_file path_to_jsonlines_file \\\n",
    "#     --dataset_name cov-19 \\\n",
    "#     --dataset_config_name en-es \\\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls output/tst-translation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 5. Test your newly fine-tuned translation model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer, AutoModelWithLMHead\n",
    "  \n",
    "tokenizer = AutoTokenizer.from_pretrained(\"t5-small\")\n",
    "\n",
    "model = AutoModelWithLMHead.from_pretrained(pretrained_model_name_or_path = 'output/tst-translation')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# line to make sure your model supports local inference\n",
    "model.eval()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's test it! Remember that, in using the default settings of only 3 epoch, your translation is probably not going to be SOTA. For achieving state of the art, (SOTA), we recommend migrating to Amazon SageMaker to scale up and out. Scaling up means moving your code to a more advanced compute type, such as a p4 series or even Trainium. Scaling out means adding more compute, so going from 1 to many instances. Using the entire AWS cloud you can train for much longer periods of time on much larger datasets, which can directly translate to a more accurate model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_sequences = ['about how long have these symptoms been going on?',\t\n",
    "'and all chest pain should be treated this way especially with your age\t',\n",
    "'and along with a fever\t',\n",
    "'and also needs to be checked your cholesterol blood pressure',\t\n",
    "'and are you having a fever now?\t',\n",
    "'and are you having any of the following symptoms with your chest pain',\t\n",
    "'and are you having a runny nose?',\t\n",
    "'and are you having this chest pain now?',\n",
    "'and besides do you have difficulty breathing',\n",
    "'and can you tell me what other symptoms are you having along with this?',\n",
    "'and does this pain move from your chest?',\n",
    "'and drink lots of fluids',\n",
    "'and how high has your fever been',\n",
    "'and i have a cough too',\n",
    "'and i have a little cold and a cough',\n",
    "'''and i'm really having some bad chest pain today''']\n",
    "\n",
    "task_prefix = \"translate English to Spanish: \"\n",
    "\n",
    "for i in input_sequences:\n",
    "    input_ids = tokenizer('''{} {}'''.format(task_prefix, i), return_tensors='pt').input_ids\n",
    "    outputs = model.generate(input_ids)\n",
    "    print(i, tokenizer.decode(outputs[0], skip_special_tokens=True))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.save_pretrained('my-tf-en-to-sp')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!tar -czf my_model.tar.gz my-tf-en-to-sp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "instance_type": "ml.g4dn.xlarge",
  "kernelspec": {
   "display_name": "nlp:Python",
   "language": "python",
   "name": "conda-env-nlp-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}