### Kernel and SageMaker Instance Setup
**Please use the ml.g4dn.xlarge instance for this notebook. The Kernel is 'Data Science - Python3'**

# Text Summarization of Consumer Health Questions
## Part 1: Fine-tuning Flan-T5 locally in the notebook

In this notebook we will learn how to fine-tune the Flan-T5 model for a medical summarization task locally in the notebook. We will use the MeQSum dataset for fine-tuning. After preparation, the dataset contains three columns: Id, Text and Summary. We first split the dataset into three parts: train, validation and test. For training, we use the Text column as the input and the Summary column as the label (output). After training the model, we use the test dataset to generate summaries and compare them with the human-written summaries in the dataset.

### MeQSum Dataset
"On the Summarization of Consumer Health Questions". Asma Ben Abacha and Dina Demner-Fushman. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019. 
#### Citation Information
@Inproceedings{MeQSum,
author = {Asma {Ben Abacha} and Dina Demner-Fushman},
title = {On the Summarization of Consumer Health Questions},
booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28th - August 2},
year = {2019},
abstract = {Question understanding is one of the main challenges in question answering. In real world applications, users often submit natural language questions that are longer than needed and include peripheral information that increases the complexity of the question, leading to substantially more false positives in answer retrieval. In this paper, we study neural abstractive models for medical question summarization. We introduce the MeQSum corpus of 1,000 summarized consumer health questions. We explore data augmentation methods and evaluate state-of-the-art neural abstractive models on this new task. In particular, we show that semantic augmentation from question datasets improves the overall performance, and that pointer-generator networks outperform sequence-to-sequence attentional models on this task, with a ROUGE-1 score of 44.16%. We also present a detailed error analysis and discuss directions for improvement that are specific to question summarization. }}





In [None]:
!pip install -q openpyxl==3.0.3 xlrd==1.2.0
!pip install -q torch==1.13.1 datasets==2.12.0 transformers==4.28.0 rouge-score==0.1.2 nltk==3.8.1 sentencepiece==0.1.99 evaluate==0.4.0

## 1. Data Preparation

#### Download the publicly available dataset

In [None]:
import urllib.request
urllib.request.urlretrieve('https://github.com/abachaa/MeQSum/raw/master/MeQSum_ACL2019_BenAbacha_Demner-Fushman.xlsx', 'MeQSum_ACL2019_BenAbacha_Demner-Fushman.xlsx')
# Wait a few seconds until the Excel file appears in your folder.

#### Prepare the data for ingestion, e.g. by lowercasing all text and adding row IDs

In [None]:
import pandas as pd

# dataset from https://github.com/abachaa/MeQSum
df = pd.read_excel('MeQSum_ACL2019_BenAbacha_Demner-Fushman.xlsx')
df = df.drop('File', axis=1)
df = df.rename(columns={'CHQ':'Text'})
df = df.dropna()
df['Text']= df['Text'].apply(lambda x: x.lower())
df['Summary'] = df['Summary'].apply(lambda x: x.lower())
df['Id'] = range(0, len(df.index))
df = df[['Id', 'Text', 'Summary']]
# df = df.sample(frac=1).reset_index(drop=True) # uncomment to shuffle the rows
df

#### Import the tokenisation libraries & functions. 
##### Tokenisation converts raw text into smaller units (tokens), typically words or subword pieces, and maps each token to an integer ID from the model's vocabulary.

In [None]:
# Import libraries required for modelling

import torch
import datasets
from datasets import Dataset
from datasets import load_metric
from datasets import concatenate_datasets

import transformers
from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

import numpy as np
import evaluate

import nltk
from nltk.tokenize import sent_tokenize
nltk.download("punkt")

In [None]:
model_checkpoint = 'google/flan-t5-small'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

#### Outputs of the tokeniser:
**input_ids**: the vocabulary index of each token. Note that AWS wasn't in this vocabulary and has been split into A, W, S

**attention_mask**: indicates which positions hold real tokens (1) and which hold padding that the model should ignore (0)


In [None]:
tokenizer("Hello, welcome to AWS!")
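
To make the two outputs concrete, here is a minimal sketch using only the standard library. It assumes a hypothetical whitespace vocabulary (`toy_vocab`) and an illustrative helper (`toy_encode`); it is not the SentencePiece tokenizer that Flan-T5 uses, and the ids are made up. The only detail borrowed from T5 is that the padding id is 0.

```python
# Toy illustration only: a hypothetical whitespace vocabulary, not the
# SentencePiece vocabulary that Flan-T5 actually uses.
PAD_ID = 0  # T5-family tokenizers use 0 as the padding id
toy_vocab = {"<pad>": PAD_ID, "hello": 1, "welcome": 2, "to": 3, "aws": 4}


def toy_encode(text, max_length):
    """Map words to ids, then pad to max_length and build the attention mask."""
    ids = [toy_vocab[word] for word in text.lower().split()][:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


encoded = toy_encode("Hello welcome to AWS", max_length=6)
print(encoded["input_ids"])       # real token ids followed by padding ids
print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding
```

The real tokenizer returns the same two keys, but its ids come from a learned subword vocabulary and it also appends an end-of-sequence token.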

In [None]:
# As with any ML model, we split the data into train, validation and test sets

train = df[:700]
val = df[700:900]
test = df[900:]
print('train: {}, val: {}, test: {}'.format(train.shape, val.shape, test.shape))
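
The slices above split the rows in their original order. If the file happens to be ordered in some way, a shuffled split may be more representative. The following is a minimal sketch using only the standard library; the row count of 1,000, the seed and the helper name `shuffled_split` are assumptions for illustration, not part of the notebook.

```python
import random


def shuffled_split(n_rows, n_train, n_val, seed=42):
    """Return disjoint train/val/test index lists after a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx


# Assumed sizes: 1,000 rows split 700 / 200 / remainder
train_idx, val_idx, test_idx = shuffled_split(n_rows=1000, n_train=700, n_val=200)
print(len(train_idx), len(val_idx), len(test_idx))
```

With pandas, index lists like these could be passed to `df.iloc[...]` to select the corresponding rows.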

In [None]:
# Metadata and dataset objects

train_dataset = Dataset.from_pandas(train)
val_dataset = Dataset.from_pandas(val)
test_dataset = Dataset.from_pandas(test)

In [None]:
# Determine the max input length and max target length from the longest tokenised sequence in the dataset

tokenized_inputs = concatenate_datasets([train_dataset, val_dataset, test_dataset]).map(lambda x: tokenizer(x["Text"], truncation=True), batched=True, remove_columns=["Text", "Summary"])
max_input_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max input length: {max_input_length}")

tokenized_targets = concatenate_datasets([train_dataset, val_dataset, test_dataset]).map(lambda x: tokenizer(x["Summary"], truncation=True), batched=True, remove_columns=["Text", "Summary"])
max_target_length = max([len(x) for x in tokenized_targets["input_ids"]])
print(f"Max target length: {max_target_length}")
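
Padding every sample to the overall maximum can be wasteful when only a few samples are very long. One way to judge a smaller cap is to check how many samples it would truncate. This is a sketch under stated assumptions: the lengths below are hypothetical, not MeQSum statistics, and `truncated_fraction` is an illustrative helper.

```python
def truncated_fraction(lengths, cap):
    """Fraction of sequences longer than `cap` tokens (these would be truncated)."""
    return sum(length > cap for length in lengths) / len(lengths)


# Hypothetical token counts per sample, for illustration only
token_lengths = [12, 35, 48, 60, 75, 90, 120, 400]

print(max(token_lengths))                          # the length used if we pad to the maximum
print(truncated_fraction(token_lengths, cap=128))  # share of samples a cap of 128 would cut
```

In the notebook, the same check could be run on `[len(x) for x in tokenized_inputs["input_ids"]]`.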

In [None]:
display(train_dataset)

#### Create a function to tokenise inputs to the model & ensure vectors are the same length

In [None]:
def preprocess_function(sample, padding="max_length"):
    # Prefix each input with the summarisation task instruction
    inputs = ["summarize: " + item for item in sample["Text"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, padding=padding, truncation=True)

    # Tokenise the target summaries
    labels = tokenizer(text_target=sample["Summary"], max_length=max_target_length, padding=padding, truncation=True)

    # Replace padding ids in the labels with -100 so the loss ignores them
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
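
The effect of replacing padding ids with -100 can be sketched with plain Python. The token ids and per-token loss values below are made up for illustration, and `mask_padding` and `mean_loss` are hypothetical helpers; the real loss is computed inside the model, where -100 is the default index that PyTorch's cross-entropy loss skips.

```python
PAD_ID = 0           # pad_token_id of T5-family tokenizers
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(label_ids):
    """Replace padding ids with IGNORE_INDEX, as preprocess_function does."""
    return [[(t if t != PAD_ID else IGNORE_INDEX) for t in seq] for seq in label_ids]


def mean_loss(per_token_losses, labels):
    """Average per-token losses over positions whose label is not IGNORE_INDEX."""
    kept = [
        loss
        for loss_row, label_row in zip(per_token_losses, labels)
        for loss, label in zip(loss_row, label_row)
        if label != IGNORE_INDEX
    ]
    return sum(kept) / len(kept)


padded = [[71, 1525, 1, 0, 0]]            # made-up ids; the last two are padding
masked = mask_padding(padded)
toy_losses = [[0.5, 1.5, 1.0, 9.0, 9.0]]  # made-up losses; padding positions are large

print(masked)
print(mean_loss(toy_losses, masked))      # the padding positions do not contribute
```

Without the masking step, the model would also be penalised for its predictions at padding positions, which carry no information about the summary.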

In [None]:
# Apply the tokenisation function

tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_val = val_dataset.map(preprocess_function, batched=True)

print(f"Keys of tokenized dataset: {tokenized_train.features}")

In [None]:
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

## 2. Train the model using HuggingFace

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

#### Determine parameters 
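**batch_size:** the number of samples processed per device in each training or evaluation step; it affects memory use and throughput

**label_pad_token_id:** the id used when padding the labels (positions set to -100 are ignored by the loss function)

**data_collator:** an object that assembles individual samples into batches, padding inputs and labels as needed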

In [None]:
batch_size = 4
label_pad_token_id = -100
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=label_pad_token_id, pad_to_multiple_of=8)
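
The `pad_to_multiple_of=8` argument asks the collator to round padded lengths up to a multiple of 8, which tends to suit GPU hardware. The rounding itself can be sketched in plain Python; `pad_to_multiple` is a hypothetical helper and the ids are made up, so this illustrates the idea rather than the collator's actual implementation.

```python
def pad_to_multiple(sequences, pad_id, multiple=8):
    """Right-pad every sequence to the batch maximum, rounded up to `multiple`."""
    longest = max(len(seq) for seq in sequences)
    target_len = ((longest + multiple - 1) // multiple) * multiple
    return [seq + [pad_id] * (target_len - len(seq)) for seq in sequences]


# Made-up label ids of lengths 3 and 10; labels are padded with -100
label_batch = [[71, 1525, 1], [363, 19, 8, 1782, 21, 3, 9, 2261, 10, 1]]
padded_labels = pad_to_multiple(label_batch, pad_id=-100)
print([len(seq) for seq in padded_labels])
```

Note that `preprocess_function` in this notebook already pads to a fixed maximum length, so the collator mostly has to stack the samples into tensors.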

#### Define a function to evaluate the model's performance
**ROUGE**: Recall-Oriented Understudy for Gisting Evaluation. A set of metrics for evaluating automatic summarization algorithms

In [None]:
metric = evaluate.load("rouge")

# A function to post process the outputs of the model, and present them in an easy to read format

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]
    preds = ["\n".join(sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(sent_tokenize(label)) for label in labels]
    return preds, labels

# A function to generate the metrics pertaining to the prediction

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels with the pad token id, since -100 cannot be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    # Compute ROUGE and report the scores as percentages
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {k: round(v * 100, 4) for k, v in result.items()}
    # Average number of non-padding tokens in the generated summaries
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    return result
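
To show what the ROUGE-1 number measures, here is a minimal hand-rolled sketch: unigram overlap between a prediction and a reference, reported as precision, recall and F1. It is a simplification under stated assumptions: it splits on whitespace and applies no stemming, so its values will not match the `rouge` metric loaded above exactly. The helper name `rouge1` and the two example sentences are made up for illustration.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F1 (whitespace tokens, no stemming)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped matching unigrams
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example pair: 5 of the 6 predicted words appear in the 7-word reference
precision, recall, f1 = rouge1(
    "what are the treatments for diabetes",
    "what are treatments for type 2 diabetes",
)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

ROUGE-2 applies the same idea to pairs of adjacent words, and ROUGE-L uses the longest common subsequence instead of n-gram counts.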

In [None]:
# The gc library (garbage collector library) is used for memory management

import gc
gc.collect()

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
if DEVICE == "cuda":
    print("[INFO] training using {}".format(torch.cuda.get_device_name(0)))

torch.cuda.empty_cache()
%env WANDB_DISABLED=True

#### Define the objects & parameters for model training

In [None]:
model_name = model_checkpoint.split("/")[-1]
print(f"The name of the model is {model_name}")

# Arguments that will be included in the model training object

args = Seq2SeqTrainingArguments(
 f"{model_name}-finetuned-meqsum2019",
 evaluation_strategy = "epoch",
 save_strategy="epoch",
 load_best_model_at_end=True,
 learning_rate=2e-5,
 per_device_train_batch_size=batch_size,
 per_device_eval_batch_size=batch_size,
 weight_decay=0.01,
 save_total_limit=3,
 num_train_epochs=10,
 logging_strategy="steps",
 logging_steps=100,
 predict_with_generate=True,
 fp16=False
)
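
As a rough check on how long training will run, the number of optimisation steps follows from the arguments above. The arithmetic below assumes the 700-row training split, a single device, and no gradient accumulation (the default); under other settings the numbers would differ.

```python
import math

# Values taken from the notebook; single device and no gradient accumulation assumed
num_train_samples = 700
batch_size = 4
num_train_epochs = 10
logging_steps = 100

steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
num_log_events = total_steps // logging_steps

print(steps_per_epoch, total_steps, num_log_events)
```

Since evaluation and checkpoint saving both happen once per epoch, this configuration evaluates 10 times, while `save_total_limit=3` limits how many checkpoints are kept on disk.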

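#### The outputs of the training job below will show the model's performance
If the validation loss decreases alongside the training loss, the model is generalising well; a validation loss that starts rising while the training loss keeps falling is a sign of overfitting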

In [None]:
# Create the model training object & train it

trainer = Seq2SeqTrainer(
 model,
 args,
 train_dataset=tokenized_train,
 eval_dataset=tokenized_val,
 data_collator=data_collator,
 tokenizer=tokenizer,
 compute_metrics=compute_metrics
)

trainer.train()
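
Because `load_best_model_at_end=True` is set without naming a metric, the trainer falls back to the validation loss to decide which checkpoint is best. The selection logic can be sketched as follows; the loss values are hypothetical placeholders, not results from this notebook.

```python
# Hypothetical per-epoch validation losses; these are NOT results from this notebook
eval_losses = {1: 2.10, 2: 1.85, 3: 1.78, 4: 1.80, 5: 1.83}

# The best checkpoint is the one with the lowest validation loss
best_epoch = min(eval_losses, key=eval_losses.get)

# Epochs after the best one with a higher validation loss hint at overfitting
later_worse_epochs = [
    epoch for epoch, loss in eval_losses.items()
    if epoch > best_epoch and loss > eval_losses[best_epoch]
]

print(best_epoch, later_worse_epochs)
```

To select on a ROUGE score instead, `Seq2SeqTrainingArguments` accepts a `metric_for_best_model` argument.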

## 3. Perform inferencing on the test dataset

#### Perform the necessary transformations (functions we defined above) to understand the model's performance on the test dataset, and view its output summaries

In [None]:
# Tokenise the test dataset

test_dataset = Dataset.from_pandas(test)
tokenized_test = test_dataset.map(
 preprocess_function,
 batched=True)

In [None]:
# Obtain metrics of the model's performance on the test set

predict_results = trainer.predict(tokenized_test)
predict_results.metrics

#### The `Predicted Summary` column is the model's output

In [None]:
# Decode the prediction result & view the predictions

if args.predict_with_generate:
    predictions = tokenizer.batch_decode(predict_results.predictions, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    predictions = [pred.strip() for pred in predictions]

# Model summarisation performance can be inspected by evaluation metrics and spot checks

test = test.copy()  # work on a copy so pandas does not warn about writing to a slice of df
test['Predicted Summary'] = predictions
pd.set_option('display.max_colwidth', 1024)
display(test)

## 4. Stop the notebook instance
This notebook uses ml.g4dn.2xlarge, which we will need for other labs. Please stop the notebook kernel and instance once you are done. To stop the instance, use the menu on the left and look for the symbol with a black square inside a circle. Click the 'power' button next to the Jupyter Notebook instance, then select 'Shutdown All'.