# Summarize Scientific Documents with a Foundation Model

Researchers must stay up-to-date on their fields of interest. However, it's difficult to keep track of the large number of journals, whitepapers, and research pre-prints generated in many areas. In response, many research groups have turned to AI/ML tools to summarize and classify new documents.

In this workshop, we'll use a foundation model (FM) to process scientific documents from the HuggingFace [scientific_papers](https://huggingface.co/datasets/scientific_papers) dataset.

This notebook was created and tested on an `ml.m5.2xlarge (8 vCPU + 32 GiB)` notebook instance running the `Python 3 (Data Science 3.0)` kernel in SageMaker Studio.

## 1. Install required libraries

In [None]:
%pip install -q -U pip
%pip install -q -U torch --index-url https://download.pytorch.org/whl/cpu 
%pip install -q -U transformers datasets einops accelerate 

## 2. Download PubMed document abstracts

Download a sample of PubMed abstracts from HuggingFace Hub (https://huggingface.co/datasets/scientific_papers).

In [26]:
from datasets import load_dataset

dataset = load_dataset("scientific_papers", "pubmed", split='test[:5000]')

Found cached dataset scientific_papers (/root/.cache/huggingface/datasets/scientific_papers/pubmed/1.1.1/306757013fb6f37089b6a75469e6638a553bd9f009484938d8f75a4c5e84206f)


Take a look at a randomly selected abstract.

In [28]:
import random

def get_random_abstract(data):
 return random.sample(data['abstract'], 1)[0]
 
abstract = get_random_abstract(dataset)
print(abstract)

 background : dental students use extracted human teeth to learn practical and technical skills before they enter the clinical environment . in the present research , knowledge , performance , and attitudes toward sterilization / disinfection methods of extracted human teeth were evaluated in a selected group of iranian dental students.materials and methods : in this descriptive cross - sectional study the subjects consisted of fourth- , fifth- and sixth - year dental students . 
 data were collected by questionnaires and analyzed by fisher 's exact test and chi - squared test using spss 11.5.results:in this study , 100 dental students participated . 
 the average knowledge score was 15.9 4.8 . 
 based on the opinion of 81 students sodium hypochlorite was selected as suitable material for sterilization and 78 students believed that oven sterilization is a good way for the purpose . 
 the average performance score was 4.1 0.8 , with 3.9 1.7 and 4.3 1.1 for males and females , respective
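Because `get_random_abstract` draws a different abstract on every run, a fresh run of this notebook will not reproduce the outputs shown here. The following is a minimal sketch of a reproducible variant. The helper name `get_seeded_abstract` and its `seed` parameter are hypothetical (not part of the original notebook), and a plain list stands in for `dataset['abstract']`.

```python
import random


def get_seeded_abstract(abstracts, seed=0):
    """Pick one abstract reproducibly (hypothetical helper, not in the original notebook)."""
    # A dedicated Random instance leaves the global random state untouched.
    rng = random.Random(seed)
    return rng.choice(abstracts)


# Stand-in data; in the notebook this would be dataset['abstract'].
sample_abstracts = ["abstract a", "abstract b", "abstract c"]
first = get_seeded_abstract(sample_abstracts, seed=42)
second = get_seeded_abstract(sample_abstracts, seed=42)
print(first == second)  # the same seed always selects the same abstract
```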

## 3. Generate abstract summaries using a foundation model (FM)

[Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) is a foundation model trained on a large collection of text documents. In addition, it was "instruction-tuned" to perform reasonably well on a wide range of language processing tasks, such as question answering and translation. In this example, we'll use it for text summarization.

Although the pre-training data likely included some scientific text, Flan-T5 was not specifically trained to handle biomedical text. We'll see the result of this in the outputs below.

In [29]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_checkpoint='google/flan-t5-base'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

### 3.1. Basic text generation

Before passing the abstract text to our model, we need to tokenize it, i.e. convert it from text into a numerical representation.

In [30]:
import pprint
tokenized_abstract = tokenizer(abstract, return_tensors='pt')
pprint.pprint(tokenized_abstract['input_ids'][0])

tensor([ 2458, 3, 10, 4814, 481, 169, 21527, 936, 3841, 12,
 669, 3236, 11, 2268, 1098, 274, 79, 2058, 8, 3739,
 1164, 3, 5, 16, 8, 915, 585, 3, 6, 1103,
 3, 6, 821, 3, 6, 11, 18537, 2957, 29675, 257,
 3, 87, 30929, 23, 106, 2254, 13, 21527, 936, 3841,
 130, 14434, 16, 3, 9, 2639, 563, 13, 3, 23,
 52, 9, 15710, 4814, 481, 5, 11303, 7, 11, 2254,
 3, 10, 16, 48, 25444, 2269, 3, 18, 1375, 138,
 810, 8, 7404, 14280, 26, 13, 4509, 18, 3, 6,
 8486, 18, 11, 13305, 3, 18, 215, 4814, 481, 3,
 5, 331, 130, 4759, 57, 19144, 7, 11, 3, 16466,
 57, 2495, 49, 3, 31, 7, 2883, 794, 11, 3,
 1436, 3, 18, 2812, 26, 794, 338, 3, 7, 102,
 7, 7, 7806, 9125, 60, 7, 83, 17, 7, 10,
 77, 48, 810, 3, 6, 910, 4814, 481, 10627, 3,
 5, 8, 1348, 1103, 2604, 47, 9996, 1298, 3, 27441,
 3, 5, 3, 390, 30, 8, 3474, 13, 3, 4959,
 481, 19049, 10950, 524, 322, 155, 15, 47, 2639, 38,
 3255, 1037, 21, 29675, 257, 11, 3, 3940, 481, 6141,
 24, 4836, 29675, 257, 19, 3, 9, 207, 194, 21,
 8, 1730, 3, 5, 8, 1348, 821, 2604, 47, 3,
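To make the idea of a "numerical representation" concrete, here is a toy word-level tokenizer. It is only an illustration under simplifying assumptions: the real Flan-T5 tokenizer uses a fixed, pre-trained subword vocabulary, so its IDs and its splitting of words differ from this sketch. All names below are hypothetical.

```python
def build_vocab(texts):
    """Assign an integer ID to every distinct whitespace-separated word."""
    vocab = {}
    for text in texts:
        for word in text.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert text to a list of integer IDs."""
    return [vocab[word] for word in text.split()]


def decode(ids, vocab):
    """Convert integer IDs back to text."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)


toy_text = "dental students use extracted human teeth"
toy_vocab = build_vocab([toy_text])
toy_ids = encode(toy_text, toy_vocab)
print(toy_ids)
print(decode(toy_ids, toy_vocab))
```

The round trip from text to IDs and back mirrors what `tokenizer(...)` and `tokenizer.decode(...)` do in this notebook, just with a much simpler vocabulary.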
 

Next, we pass the tokens to the model and ask it to generate new tokens to "fill in the blank" at the end.

In [31]:
model_output = model.generate(tokenized_abstract['input_ids'], max_new_tokens=50)[0]
model_output

tensor([ 0, 1103, 6, 821, 11, 18537, 2957, 29675, 257, 2254,
 13, 21527, 936, 3841, 130, 207, 68, 31221, 130, 6970,
 16, 2119, 11, 1397, 3255, 21, 29675, 257, 5, 1])

Finally, we decode the model output back into text and clean it up.

In [32]:
print(tokenizer.decode(model_output, skip_special_tokens=True).strip().capitalize())

Knowledge, performance and attitudes toward sterilization methods of extracted human teeth were good but shortcomings were observed in teaching and materials suitable for sterilization.
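One caveat with the cleanup step: `str.capitalize()` uppercases the first character and lowercases everything else, so acronyms and proper names in a summary lose their capitals. A minimal sketch of an alternative follows, using a hypothetical helper `capitalize_first` that is not part of the original notebook.

```python
def capitalize_first(text):
    """Uppercase only the first character, leaving the rest unchanged."""
    text = text.strip()
    return text[:1].upper() + text[1:]


raw = " we study the DIP of Ulyanov et al. "
print(raw.strip().capitalize())  # lowercases "DIP" and "Ulyanov"
print(capitalize_first(raw))     # keeps the original casing after the first letter
```

For the PubMed abstracts the difference is small, since the source text is already lowercased, but it matters for mixed-case inputs.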


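Let's put these steps together in a single function. Note that this version samples tokens (`do_sample=True`, `temperature=0.75`), so its output can vary from run to run.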

In [33]:
def generate_text(input_text):
    model_input = input_text
    # Convert the text to token IDs
    tokenized_input = tokenizer(model_input, return_tensors='pt')
    # Sample up to 50 new tokens, then decode them back into text
    generated_ids = model.generate(
        tokenized_input["input_ids"],
        max_new_tokens=50,
        temperature=0.75,
        do_sample=True,
    )[0]
    model_output = tokenizer.decode(generated_ids, skip_special_tokens=True)

    return model_input, model_output.strip().capitalize()


no_prompt_input, no_prompt_output = generate_text(abstract)

print(f"MODEL INPUT:\n'{no_prompt_input}'\n")
print(f"MODEL OUTPUT:\n'{no_prompt_output}'\n")

MODEL INPUT:
' background : dental students use extracted human teeth to learn practical and technical skills before they enter the clinical environment . in the present research , knowledge , performance , and attitudes toward sterilization / disinfection methods of extracted human teeth were evaluated in a selected group of iranian dental students.materials and methods : in this descriptive cross - sectional study the subjects consisted of fourth- , fifth- and sixth - year dental students . 
 data were collected by questionnaires and analyzed by fisher 's exact test and chi - squared test using spss 11.5.results:in this study , 100 dental students participated . 
 the average knowledge score was 15.9 4.8 . 
 based on the opinion of 81 students sodium hypochlorite was selected as suitable material for sterilization and 78 students believed that oven sterilization is a good way for the purpose . 
 the average performance score was 4.1 0.8 , with 3.9 1.7 and 4.3 1.1 for males and female
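`generate_text` calls `model.generate` with `do_sample=True` and `temperature=0.75`, so each new token is drawn at random from the model's predicted distribution instead of always taking the most likely one. The sketch below illustrates the idea of temperature-scaled sampling on made-up scores. It is a conceptual illustration with hypothetical function names, not the `transformers` implementation.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
probs_default = softmax_with_temperature(toy_logits, temperature=1.0)
probs_sharper = softmax_with_temperature(toy_logits, temperature=0.75)
print(probs_default)
print(probs_sharper)
print(sample_token(toy_logits, temperature=0.75, rng=random.Random(0)))
```

A temperature below 1 shifts probability toward the highest-scoring token, which makes outputs more conservative while still allowing some variation between runs.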

### 3.2. Using a text prompt

We can guide the model toward a more accurate response via "prompt engineering": adding an instruction to the input that tells the model which task to perform. In the next cell, we'll try a list of prompts to see how they affect the output.

In [34]:
prompts = [
 "Briefly summarize this sentence: {text}",
 "Write a short summary for this text: {text}",
 "{text}\n\nWrite a brief summary in a sentence or less",
 "Write a sentence based on '{text}'",
 "Summarize this article:\n\n{text}",
]
print(f"Abstract:\n'{abstract}'\n")
for each_prompt in prompts:
    print("#"*25)
    print(f"Prompt: '{each_prompt}'")
    # Fill in the template; the name `input` is avoided because it shadows the built-in
    prompt_text = each_prompt.replace("{text}", abstract)
    prompted_input, prompted_output = generate_text(prompt_text)
    print(f"Model response: '{prompted_output}'\n")

Abstract:
' background : dental students use extracted human teeth to learn practical and technical skills before they enter the clinical environment . in the present research , knowledge , performance , and attitudes toward sterilization / disinfection methods of extracted human teeth were evaluated in a selected group of iranian dental students.materials and methods : in this descriptive cross - sectional study the subjects consisted of fourth- , fifth- and sixth - year dental students . 
 data were collected by questionnaires and analyzed by fisher 's exact test and chi - squared test using spss 11.5.results:in this study , 100 dental students participated . 
 the average knowledge score was 15.9 4.8 . 
 based on the opinion of 81 students sodium hypochlorite was selected as suitable material for sterilization and 78 students believed that oven sterilization is a good way for the purpose . 
 the average performance score was 4.1 0.8 , with 3.9 1.7 and 4.3 1.1 for males and females ,

These are all pretty good. For the purposes of our testing, let's use the prompt `"Summarize this article:\n\n{text}"` going forward.

In [35]:
def generate_w_prompt(input_text, start_prompt = 'Summarize this article:\n\n', end_prompt = ''): 
 model_input = start_prompt + input_text + end_prompt
 prompted_input, prompted_output = generate_text(model_input)

 return(prompted_input, prompted_output.strip().capitalize())
 
prompted_input, prompted_output = generate_w_prompt(abstract)

print(f"MODEL INPUT:\n'{prompted_input}'\n")
print(f"MODEL OUTPUT:\n'{prompted_output}'\n")

MODEL INPUT:
'Summarize this article:

 background : dental students use extracted human teeth to learn practical and technical skills before they enter the clinical environment . in the present research , knowledge , performance , and attitudes toward sterilization / disinfection methods of extracted human teeth were evaluated in a selected group of iranian dental students.materials and methods : in this descriptive cross - sectional study the subjects consisted of fourth- , fifth- and sixth - year dental students . 
 data were collected by questionnaires and analyzed by fisher 's exact test and chi - squared test using spss 11.5.results:in this study , 100 dental students participated . 
 the average knowledge score was 15.9 4.8 . 
 based on the opinion of 81 students sodium hypochlorite was selected as suitable material for sterilization and 78 students believed that oven sterilization is a good way for the purpose . 
 the average performance score was 4.1 0.8 , with 3.9 1.7 and 4.3
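The templates in section 3.2 place `{text}` at different positions, while `generate_w_prompt` takes a separate `start_prompt` and `end_prompt`. A minimal sketch of how one could map between the two follows. The helpers `split_template` and `fill_template` are hypothetical and not part of the original notebook.

```python
PLACEHOLDER = "{text}"


def split_template(template):
    """Split a prompt template into the text before and after the placeholder."""
    start_prompt, found, end_prompt = template.partition(PLACEHOLDER)
    if not found:
        raise ValueError(f"template must contain {PLACEHOLDER!r}")
    return start_prompt, end_prompt


def fill_template(template, text):
    """Rebuild the full model input the same way generate_w_prompt concatenates it."""
    start_prompt, end_prompt = split_template(template)
    return start_prompt + text + end_prompt


suffix_template = "{text}\n\nWrite a brief summary in a sentence or less"
start, end = split_template(suffix_template)
print(repr(start), repr(end))
print(fill_template("Summarize this article:\n\n{text}", "some abstract text"))
```

With these pieces, a suffix-style template could be tried via `generate_w_prompt(abstract, start_prompt=start, end_prompt=end)`.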

### 3.3. Use few-shot inference

Like people, LLMs often perform better when shown examples. Here, we include one or more examples of the expected output in the prompt, a technique known as "few-shot learning". We're not retraining the model, just giving it additional guidance at inference time.

In this case, our "examples" will be from the [scitldr](https://huggingface.co/datasets/allenai/scitldr) dataset from the Allen Institute.

In [36]:
huggingface_dataset_name = "allenai/scitldr"
scitldr_dataset = load_dataset(huggingface_dataset_name, 'Abstract', split='train')

Found cached dataset scitldr (/root/.cache/huggingface/datasets/allenai___scitldr/Abstract/0.0.0/79e0fa75961392034484808cfcc8f37deb15ceda153b798c92d9f621d1042fef)


In [37]:
def generate_w_few_shot_prompt(input_text, example_dataset, num_shots=1, sep_sequence='\n\n', start_prompt='Summarize this article:\n\n'):
    shots = ''
    for _ in range(num_shots):
        # randrange excludes the upper bound, so the index is always valid
        n = random.randrange(len(example_dataset))
        # Keep only the first 400 characters of the example to limit prompt length
        example_text = ' '.join(example_dataset[n]['source'])[:400]
        example_summary = example_dataset[n]['target'][0].strip().capitalize()
        # Append each example so that num_shots > 1 includes every shot
        shots += start_prompt + example_text + sep_sequence + example_summary + sep_sequence
    # End with the instruction for the abstract we actually want summarized
    shots += start_prompt
    few_shot_input, few_shot_output = generate_w_prompt(input_text=input_text, start_prompt=shots)
    return few_shot_input, few_shot_output.strip().capitalize()

few_shot_input, few_shot_output = generate_w_few_shot_prompt(abstract, scitldr_dataset, num_shots=1)

print(f"MODEL INPUT:\n'{few_shot_input}'\n")
print(f"MODEL OUTPUT:\n'{few_shot_output}'\n")

MODEL INPUT:
'Summarize this article:

The Deep Image Prior (DIP, Ulyanov et al., 2017) is a fascinating recent approach for recovering images which appear natural, yet is not fully understood. This work aims at shedding some further light on this approach by investigating the properties of the early outputs of the DIP. First, we show that these early iterations demonstrate invariance to adversarial perturbations by classifying progres

We investigate properties of the recently introduced deep image prior (ulyanov et al, 2017)

Summarize this article:

 background : dental students use extracted human teeth to learn practical and technical skills before they enter the clinical environment . in the present research , knowledge , performance , and attitudes toward sterilization / disinfection methods of extracted human teeth were evaluated in a selected group of iranian dental students.materials and methods : in this descriptive cross - sectional study the subjects consisted of fourth- ,
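The layout of the few-shot input above (example, example summary, then the real instruction and abstract) can be isolated from the model call. The sketch below assembles that string from made-up example pairs. `build_few_shot_prompt` is a hypothetical helper, shown only to make the prompt structure explicit.

```python
def build_few_shot_prompt(examples, input_text,
                          start_prompt="Summarize this article:\n\n",
                          sep_sequence="\n\n"):
    """Concatenate worked examples, then the unanswered prompt for input_text."""
    parts = []
    for example_text, example_summary in examples:
        parts.append(start_prompt + example_text + sep_sequence + example_summary + sep_sequence)
    # The final block has no summary; the model is expected to supply it.
    parts.append(start_prompt + input_text)
    return "".join(parts)


# Made-up examples; in the notebook these come from the scitldr dataset.
toy_examples = [
    ("Paper one text.", "Summary one."),
    ("Paper two text.", "Summary two."),
]
toy_prompt = build_few_shot_prompt(toy_examples, "New abstract text.")
print(toy_prompt)
```

Keeping prompt construction separate from generation makes it easy to inspect exactly what the model sees before spending time on inference.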

### 3.4. Compare all the methods

In [38]:
print("ABSTRACT:")
print(abstract)
print("\n")
print(f"NO-PROMPT-SUMMARY:\t{no_prompt_output}")
print(f"ZERO-SHOT-SUMMARY:\t{prompted_output}")
print(f"FEW-SHOT-SUMMARY:\t{few_shot_output}")

ABSTRACT:
 background : dental students use extracted human teeth to learn practical and technical skills before they enter the clinical environment . in the present research , knowledge , performance , and attitudes toward sterilization / disinfection methods of extracted human teeth were evaluated in a selected group of iranian dental students.materials and methods : in this descriptive cross - sectional study the subjects consisted of fourth- , fifth- and sixth - year dental students . 
 data were collected by questionnaires and analyzed by fisher 's exact test and chi - squared test using spss 11.5.results:in this study , 100 dental students participated . 
 the average knowledge score was 15.9 4.8 . 
 based on the opinion of 81 students sodium hypochlorite was selected as suitable material for sterilization and 78 students believed that oven sterilization is a good way for the purpose . 
 the average performance score was 4.1 0.8 , with 3.9 1.7 and 4.3 1.1 for males and females , 
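Reading the three summaries side by side is the main comparison here. As a rough supplement, one could also compute a few descriptive numbers per summary. The sketch below is an assumption-laden illustration with a hypothetical helper and made-up inputs. These statistics describe length and word reuse only and are not a measure of summary quality.

```python
def summary_stats(source, summary):
    """Return crude descriptive statistics for a summary (not a quality score)."""
    source_words = source.lower().split()
    summary_words = summary.lower().split()
    source_vocab = set(source_words)
    # Count summary words that also occur somewhere in the source text.
    copied = sum(1 for word in summary_words if word in source_vocab)
    return {
        "summary_words": len(summary_words),
        "compression": len(summary_words) / max(len(source_words), 1),
        "copied_fraction": copied / max(len(summary_words), 1),
    }


toy_source = "dental students use extracted human teeth to learn practical skills"
toy_summaries = {
    "NO-PROMPT": "dental students use teeth",
    "ZERO-SHOT": "students learn skills using teeth",
}
for label, text in toy_summaries.items():
    print(label, summary_stats(toy_source, text))
```

A low `copied_fraction` can flag summaries that introduce words absent from the source, which are worth checking by hand.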

Try a few more examples

In [39]:
abstracts = random.sample(dataset['abstract'], 3)

for abstract in abstracts:
 print("#"*25)
 print(f"SOURCE_TEXT:\n{abstract}\n")
 no_prompt_input, no_prompt_output = generate_text(abstract)
 print(f"NO-PROMPT-SUMMARY:\t{no_prompt_output}")
 prompted_input, prompted_output = generate_w_prompt(abstract)
 print(f"ZERO-SHOT-PROMPT:\t{prompted_output}")
 few_shot_input, few_shot_output = generate_w_few_shot_prompt(abstract, scitldr_dataset)
 print(f"FEW-SHOT-PROMPT:\t{few_shot_output}\n")

#########################
SOURCE_TEXT:
 background : delivery is one of the most important crises with mental , social , and deep emotional dimensions in women 's life . health providers respect to pregnant women 's bill of rights , as an important component of providing humanistic and ethical care , is of utmost importance . 
 this study aimed to determine health providers compliance with the pregnant women 's bill of rights in labor and delivery and some of its related factors in 2013.materials and methods : this descriptive , cross - sectional study was carried out on the subjects selected through census sampling ( n = 257 ) from among the healthcare providers working in the labor rooms of four educational hospitals . 
 the data were collected by a self - reported questionnaire whose validity and reliability were established . 
 data were analyzed through descriptive and inferential statistics.results:the compliance with pregnant women 's bill of rights was found to be at a very hig

Token indices sequence length is longer than the specified maximum sequence length for this model (645 > 512). Running this sequence through the model will result in indexing errors


ZERO-SHOT-PROMPT:	Compliance with the pregnant women's bill of rights in labor and delivery is not acceptable in the labor room.
FEW-SHOT-PROMPT:	The compliance with pregnant women's bill of rights is not acceptable in the labor room

#########################
SOURCE_TEXT:
 objectivesin some clinical situations , dentists come across partially edentulous patients , 
 and it might be necessary to connect teeth to implants . 
 the aim of this study was 
 to evaluate a metal - ceramic fixed tooth / implant - supported denture with a straight 
 segment , located in the posterior region of the maxilla , when varying the number 
 of teeth used as abutments . 
 materials and methodsa three - element fixed denture composed of one tooth and one implant ( model 1 ) , and 
 a four - element fixed denture composed of two teeth and one implant ( model 2 ) were 
 modeled . 
 a 100 n load was applied , distributed uniformly on the entire set , 
 simulating functional mastication , for further analysi
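The warning in the output above (645 > 512) means one of the inputs was longer than the tokenizer's configured maximum. One common way to handle this is to truncate at tokenization time using the standard `truncation` and `max_length` tokenizer arguments. The sketch below wraps that call in a hypothetical helper and uses a stand-in whitespace tokenizer so that it runs without downloading a model. Both helper names are assumptions.

```python
def tokenize_truncated(tokenizer, text, max_length=512):
    """Tokenize text, cutting it off at max_length tokens."""
    # truncation and max_length are standard Hugging Face tokenizer arguments.
    return tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)


def whitespace_tokenizer(text, return_tensors=None, truncation=False, max_length=None):
    """Stand-in for a real tokenizer so this sketch runs without downloading a model."""
    ids = list(range(len(text.split())))
    if truncation and max_length is not None:
        ids = ids[:max_length]
    return {"input_ids": [ids]}


long_text = "word " * 645
encoded = tokenize_truncated(whitespace_tokenizer, long_text)
print(len(encoded["input_ids"][0]))
```

In the notebook, `tokenize_truncated(tokenizer, model_input)` could replace the plain `tokenizer(...)` call inside `generate_text`. Note that truncation drops the end of the input, so with few-shot prompts, where the abstract comes last, it may be preferable to shorten the examples instead.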

## 4. Conclusions

In this notebook, we saw how to use prompting to get a pre-trained LLM to summarize scientific text without any additional training. However, for domain-specific language such as biomedical abstracts, you may see better results after fine-tuning the model.