# Summarization on Custom Dataset with SageMaker JumpStart and [LangChain](https://python.langchain.com/en/latest/index.html) Library

Reference: https://github.com/gkamradt/langchain-tutorials/tree/main/data_generation


 There are two main types of methods for summarizing text: abstractive and extractive.

Abstractive summarization generates a new shorter summary in its own words based on understanding the meaning and concepts of the original text. It analyzes the text using advanced natural language techniques to grasp the key ideas and then expresses those ideas in a summarized form using different words and phrases. This is similar to how humans summarize by reading something and then explaining the main points in their own words.

Extractive summarization works by selecting the most important sentences, phrases or words from the original text to construct a summary. It calculates the weight or importance of each part of the text using algorithms and then chooses the parts with the highest weights to put into the summary. It summarizes by extracting key elements from the text itself rather than interpreting the meaning.

So in short, abstractive summarization rewrites the key ideas in new words while extractive summarization selects the most salient parts of the existing text. Both aim to distill the essence and most significant information from the original document into a condensed summary.
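To make the extractive idea concrete, here is a minimal sketch in plain Python. It assumes a very simple weighting scheme (summed word frequency per sentence) and a naive sentence splitter; the function name `extractive_summary` and the sample text are made up for illustration, and real extractive summarizers use more robust scoring.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Pick the highest-scoring sentences, scored by summed word frequency."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text (lowercased).
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))
    # Weight of a sentence = sum of the frequencies of its words.
    scored = [
        (sum(freqs[w] for w in re.findall(r"[a-z']+", s.lower())), i, s)
        for i, s in enumerate(sentences)
    ]
    # Keep the heaviest sentences, then restore their original order.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:num_sentences]
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))


sample = (
    "Cats sleep a lot. Cats and dogs are popular pets, and cats are quiet. "
    "Birds sing."
)
print(extractive_summary(sample, num_sentences=1))
```

Every sentence in the result is copied verbatim from the input, which is the defining property of extractive summarization. An abstractive model would instead generate new wording.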

We're going to run through four levels of summarization, from basic prompting to summarizing large documents with the `map_reduce` method. These aren't the only options, so feel free to modify them for your use case.

**4 Levels Of Summarization:**
1. **Summarize a couple sentences** - Basic Prompt
2. **Summarize a couple paragraphs** - Prompt Templates
3. **Summarize a large document with multiple pages** - Map Reduce
4. **Summarize a book**

In this notebook we will demonstrate how to use **AI21 Summary API** for text summarization using a library of documents as a reference.

**This notebook serves as a template, so you can easily replace the example dataset with your own to build a custom text summarization application.**

## Deploy the summarization model in SageMaker JumpStart

Before you begin, deploy the AI21 Summarize model from SageMaker JumpStart and provide its endpoint name below. To do this, subscribe to the AI21 Summarize model, then click the `Open Notebook` option, which opens the deployment notebook in Amazon SageMaker Studio. Run through that notebook to deploy the model, note the endpoint name, and return to this notebook.

In [1]:
!pip install --upgrade sagemaker --quiet
!pip install ipywidgets==7.0.0 --quiet
!pip install langchain==0.0.148 --quiet
!pip install faiss-cpu --quiet
!pip install pytesseract --quiet
!pip install unstructured --quiet
!pip install transformers --quiet
!pip install pypdf --quiet

In [2]:
import time
import sagemaker, boto3, json
from sagemaker.session import Session
from sagemaker.model import Model
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base
from typing import Any, Dict, List, Optional
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.llms.sagemaker_endpoint import ContentHandlerBase
from langchain import PromptTemplate

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()
model_version = "*"
endpoint_name = 'summarize' # replace this with your endpoint name.



## Summarize a couple sentences - Basic Prompt

In [3]:
prompt = """
Philosophy (from Greek: φιλοσοφία, philosophia, 'love of wisdom') \
is the systematized study of general and fundamental questions, \
such as those about existence, reason, knowledge, values, mind, and language. \
Some sources claim the term was coined by Pythagoras (c. 570 – c. 495 BCE), \
although this theory is disputed by some. Philosophical methods include questioning, \
critical discussion, rational argument, and systematic presentation.
"""

Next, we wrap our SageMaker LLM endpoint in `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`.

In [4]:
from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        input_str = json.dumps({
            "source":prompt,
            "sourceType":"TEXT"})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["summary"]


content_handler = ContentHandler()

sm_llm = SagemakerEndpoint(
    endpoint_name=endpoint_name, ## add endpoint name for ai21 summary model
    region_name=aws_region,
    # model_kwargs=parameters,
    content_handler=content_handler,
)
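To see what the content handler sends and what it expects back without calling the endpoint, the round trip can be sketched with the standard library alone. The helper names `build_request` and `parse_response` and the `fake_body` response are hypothetical; the payload fields and the `summary` key are the ones used in the handler above.

```python
import io
import json


def build_request(text: str) -> bytes:
    # Same payload shape as transform_input above.
    return json.dumps({"source": text, "sourceType": "TEXT"}).encode("utf-8")


def parse_response(body) -> str:
    # Same parsing as transform_output above: body is a file-like object
    # with a read() method, not a plain bytes value.
    return json.loads(body.read().decode("utf-8"))["summary"]


request = build_request("Some text to summarize.")
print(request)

# Hypothetical response, standing in for the body returned by the endpoint.
fake_body = io.BytesIO(json.dumps({"summary": "A short summary."}).encode("utf-8"))
print(parse_response(fake_body))
```

Note that the parsing step calls `.read()`, so the handler receives a stream-like object even though the type hint above says `bytes`.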

In [5]:
num_tokens = sm_llm.get_num_tokens(prompt)
print (f"Our prompt has {num_tokens} tokens")

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


Our prompt has 108 tokens


In [6]:
output = sm_llm(prompt)
print (output)

Philosophy is the systematized study of general and fundamental questions, such as those about existence, reason, knowledge, values, mind, and language.


In [7]:
prompt = """
Write a ~ 1 sentence summary of the following text:

TEXT:
Philosophy (from Greek: φιλοσοφία, philosophia, 'love of wisdom') \
is the systematized study of general and fundamental questions, \
such as those about existence, reason, knowledge, values, mind, and language. \
Some sources claim the term was coined by Pythagoras (c. 570 – c. 495 BCE), \
although this theory is disputed by some. Philosophical methods include questioning, \
critical discussion, rational argument, and systematic presentation.
"""

In [8]:
output = sm_llm(prompt)
print (output)

Philosophy is the systematized study of general and fundamental questions, such as those about existence, reason, knowledge, values, mind, and language.


## Summarize a couple paragraphs - Prompt Templates

Prompt templates are a great way to dynamically place text within your prompts. They are like [python f-strings](https://realpython.com/python-f-strings/) but specialized for working with language models.
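As a rough illustration of the idea, the snippet below fills a placeholder with plain Python string formatting. This is only an analogy under simple assumptions, not LangChain's implementation; `fill_template` is a hypothetical helper.

```python
template = "Write a one sentence summary of the following text:\n{essay}\n"


def fill_template(template: str, **variables: str) -> str:
    # str.format raises KeyError if a placeholder has no matching variable,
    # which catches a missing input before the prompt reaches the model.
    return template.format(**variables)


filled = fill_template(
    template, essay="The way to get new ideas is to notice anomalies."
)
print(filled)
```

Keeping the instructions in one template and the text in a variable means the same prompt can be reused for every document.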

We're going to look at two short Paul Graham essays.

In [9]:
paul_graham_essays = ['data/PaulGrahamEssaySmall/getideas.txt', 'data/PaulGrahamEssaySmall/noob.txt']

essays = []

for file_name in paul_graham_essays:
    with open(file_name, 'r') as file:
        essays.append(file.read())

In [10]:
for i, essay in enumerate(essays):
    print (f"Essay #{i+1}: {essay[:300]}\n")

Essay #1: January 2023(Someone fed my essays into GPT to make something that could answer
questions based on them, then asked it where good ideas come from.  The
answer was ok, but not what I would have said. This is what I would have said.)The way to get new ideas is to notice anomalies: what seems strange,


Essay #2: January 2020When I was young, I thought old people had everything figured out.
Now that I'm old, I know this isn't true.I constantly feel like a noob. It seems like I'm always talking to
some startup working in a new field I know nothing about, or reading
a book about a topic I don't understand well



Next, let's create a prompt template that holds our instructions and a placeholder for the essay. In this example we ask for a summary of about 50 words.

In [11]:
template = """
Write a ~ 50 words summary of the following text:
{essay}
"""

prompt = PromptTemplate(
    input_variables=["essay"],
    template=template
)

In [12]:
for essay in essays:
    summary_prompt = prompt.format(essay=essay)
    
    num_tokens = sm_llm.get_num_tokens(summary_prompt)
    print (f"This prompt + essay has {num_tokens} tokens")
    
    summary = sm_llm(summary_prompt)
    
    print (f"Summary: {summary.strip()}")
    print ("\n")

This prompt + essay has 204 tokens
Summary: You can get new ideas by noticing anomalies, and the best place to look for them is at the frontiers of knowledge. Knowledge grows fractally, and these gaps seem obvious, but no one has tried them yet.


This prompt + essay has 499 tokens
Summary: The feeling of being a noob is inversely correlated with actual ignorance. If you stay in your home country, you'll feel less of a noob than if you move to Farawavia, where everything works differently, and yet you'll know more.
Our dislike of feeling like a noob is our brain telling us to come on, come on, figure this out. But now that food is plentiful, our dislike of feeling like a noob leads us astray.




## Summarize a large document with multiple pages - Map Reduce

If you have multiple pages you'd like to summarize, you'll likely run into a token limit. Token limits won't always be a problem, but it is good to know how to handle them if you run into the issue.

The chain type "Map Reduce" is a method that helps with this. You first generate a summary of smaller chunks (that fit within the token limit) and then you get a summary of the summaries.

Check out [this video](https://www.youtube.com/watch?v=f9_BWhCI4Zo) for more information on how chain types work.
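The two steps can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: `first_sentence` is a hypothetical stand-in for the LLM call, and the chunks are made up.

```python
from typing import Callable, List


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    # Map step: summarize each chunk independently.
    partial_summaries = [summarize(chunk) for chunk in chunks]
    # Reduce step: summarize the concatenated partial summaries.
    return summarize("\n".join(partial_summaries))


def first_sentence(text: str) -> str:
    """Hypothetical stand-in for the LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


chunks = [
    "Look for problems. Preferably ones you have yourself.",
    "Live in the future. Then build what is missing.",
]
print(map_reduce_summarize(chunks, first_sentence))
```

With N chunks the model is called N + 1 times: once per chunk and once more for the combined summaries. The combined summaries must themselves fit within the token limit for the reduce step to work.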

In [13]:
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [14]:
paul_graham_essay = 'data/PaulGrahamEssays/startupideas.txt'

with open(paul_graham_essay, 'r') as file:
    essay = file.read()

In [15]:
sm_llm.get_num_tokens(essay)

Token indices sequence length is longer than the specified maximum sequence length for this model (9568 > 1024). Running this sequence through the model will result in indexing errors


9568

That's too many, so let's split our text into chunks that fit within the prompt limit. I'm going to use a chunk size of 10,000 characters.

> You can think of tokens as pieces of words used for natural language processing. For English text, **1 token is approximately 4 characters** or 0.75 words. As a point of reference, the collected works of Shakespeare are about 900,000 words or 1.2M tokens.

This means we should expect chunks of about 10,000 / 4 = ~2,500 tokens each. This will vary, though, because each body of text or code tokenizes differently.
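The rule of thumb above can be written as a small helper. This is only an estimate for English text, using the ratios quoted above; the function names are hypothetical, and the real count depends on the tokenizer.

```python
CHARS_PER_TOKEN = 4      # rule of thumb quoted above
WORDS_PER_TOKEN = 0.75   # rule of thumb quoted above


def estimate_tokens_from_chars(num_chars: int) -> int:
    # Roughly one token per four characters.
    return num_chars // CHARS_PER_TOKEN


def estimate_tokens_from_words(num_words: int) -> int:
    # Roughly one token per 0.75 words.
    return round(num_words / WORDS_PER_TOKEN)


print(estimate_tokens_from_chars(10_000))   # one 10,000-character chunk
print(estimate_tokens_from_words(900_000))  # the Shakespeare example
```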

In [16]:
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=10000, chunk_overlap=500)

docs = text_splitter.create_documents([essay])
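The `chunk_overlap` argument repeats some text at the boundary between consecutive chunks so that context is not lost where the text is cut. The sketch below shows the idea with a fixed-size sliding window. It is a simplification and an assumption on my part: `split_with_overlap` is a hypothetical helper, and the real splitter also tries to break on the given separators rather than at exact character offsets.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    # Each new chunk starts chunk_overlap characters before the previous one ends.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

In this toy example the last character of each chunk is repeated as the first character of the next one.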

In [17]:
num_docs = len(docs)

num_tokens_first_doc = sm_llm.get_num_tokens(docs[0].page_content)

print (f"Now we have {num_docs} documents and the first one has {num_tokens_first_doc} tokens")

Token indices sequence length is longer than the specified maximum sequence length for this model (2086 > 1024). Running this sequence through the model will result in indexing errors


Now we have 5 documents and the first one has 2086 tokens


Great, assuming the number of tokens is similar in the other docs, we should be good to go. Let's use LangChain's [load_summarize_chain](https://python.langchain.com/en/latest/use_cases/summarization.html) method with the `map_reduce` chain type. We first need to initialize our chain.

In [18]:
summary_chain = load_summarize_chain(llm=sm_llm, chain_type='map_reduce',
                                     verbose=True # Set verbose=True if you want to see the prompts being used
                                    )

In [19]:
output = summary_chain.run(docs)



> Entering new MapReduceDocumentsChain chain...
Prompt after formatting:
Write a concise summary of the following:


"Want to start a startup?  Get funded by
Y Combinator.




November 2012The way to get startup ideas is not to try to think of startup
ideas.  It's to look for problems, preferably problems you have
yourself.The very best startup ideas tend to have three things in common:
they're something the founders themselves want, that they themselves
can build, and that few others realize are worth doing.  Microsoft,
Apple, Yahoo, Google, and Facebook all began this way.
ProblemsWhy is it so important to work on a problem you have?  Among other
things, it ensures the problem really exists.  It sounds obvious
to say you should only work on problems that exist.  And yet by far
the most common mistake startups make is to solve problems no one
has.I made it myself.  In 1995 I started a company to put art galleries
online.  But galleries didn't want to be online. 

In [20]:
summaries = output.split('\n')
for summary in summaries: 
    print('- '+summary)

- In 1995 I started a company to put art galleries online, but the galleries didn't want to be online.
- If you want to have good startup ideas, become a person who has them.
- Coming up with startup ideas is a question of seeing the obvious, and asking yourself whether you ever found yourself saying something unusual.
- As you get older, it gets harder to come up with an idea. Maybe a second self could keep a journal.
- Marc Andreessen, Joe Gebbia, Reid Hoffman.


This summary is a great start, but let's modify it to get only the key points.

To do this, we will use custom prompts (like we did above) to tell the model what we need. Note that the prompt format used in this notebook is based on Flan-T5, taken from this [source](https://huggingface.co/jordiclive/flan-t5-11b-summarizer-filtered?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct).

The map_prompt asks for a summary of each chunk, and the combine_prompt asks the model to cover only the key points.

In [21]:
map_prompt = """
Write a ~ 500 word summary of the following text:
"{text}"
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

In [22]:
combine_prompt = """
Cover only  the key points of the text.
{text}
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

In [23]:
summary_chain_key_points = load_summarize_chain(llm=sm_llm,
                                     chain_type='map_reduce',
                                     map_prompt=map_prompt_template,
                                     combine_prompt=combine_prompt_template,
                                     # verbose=True
                                    )

Summarizing every chunk can take a few minutes. To save time or memory on the notebook instance, you can pass only a subset of the chunks (for example `docs[:3]`) to the chain; the cell below runs on all of them.

In [24]:
output_key_points = summary_chain_key_points.run(docs)

In [25]:
summaries = output_key_points.split('\n')
for summary in summaries: 
    print('- '+summary)

- In 1995 I started a company to put art galleries online. I used recipes to come up with startup ideas on demand.
- When searching for ideas, look in areas where you have some expertise and talk to everyone you can about the gaps they find in the world.
- Startups are businesses, so offer to be their personal IT consultants.


## Summarize a book

In [26]:
from langchain.document_loaders import PyPDFLoader
# Loaders
from langchain.schema import Document

# Load the book
loader = PyPDFLoader("data/book/IntoThinAirBook.pdf")
pages = loader.load()

#print number of pages
print('number of pages: ', len(pages))
# Skip the first 28 pages (the opening part of the book)
pages = pages[28:]

# Combine the pages, and replace the tabs with spaces
text = ""

for page in pages:
    text += page.page_content
    
text = text.replace('\t', ' ')

number of pages:  282


In [27]:
num_tokens = sm_llm.get_num_tokens(text)

print (f"This book has {num_tokens} tokens in it")

Token indices sequence length is longer than the specified maximum sequence length for this model (141056 > 1024). Running this sequence through the model will result in indexing errors


This book has 141056 tokens in it


Note that the AI21 Summarize model can take a chunk size of up to 40k, so we divide the book into 20,000-character chunks with a 3,000-character overlap.

In [28]:
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", "\t"], chunk_size=20000, chunk_overlap=3000)

docs = text_splitter.create_documents([text])

In [29]:
num_docs = len(docs)

num_tokens_first_doc = sm_llm.get_num_tokens(docs[0].page_content)

print (f"Now we have {num_docs} documents and the first one has {num_tokens_first_doc} tokens")

Token indices sequence length is longer than the specified maximum sequence length for this model (4872 > 1024). Running this sequence through the model will result in indexing errors


Now we have 33 documents and the first one has 4872 tokens


In [30]:
map_prompt = """
"{text}"
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

In [31]:
map_chain = load_summarize_chain(llm=sm_llm,
                             chain_type="stuff",
                             prompt=map_prompt_template)

In [32]:
# Make an empty list to hold your summaries
summary_list = []

# Loop through the document chunks
for i, doc in enumerate(docs):
    
    # Go get a summary of the chunk
    chunk_summary = map_chain.run([doc])
    
    # Append that summary to your list
    summary_list.append(chunk_summary)
    
    # print (f"Summary #{i+1} - Preview: {chunk_summary[:250]} \n")

In [33]:
summaries = "\n".join(summary_list)

# Convert it back to a document
summaries = Document(page_content=summaries)

print (f"Your total summary has {sm_llm.get_num_tokens(summaries.page_content)} tokens")

Token indices sequence length is longer than the specified maximum sequence length for this model (6506 > 1024). Running this sequence through the model will result in indexing errors


Your total summary has 6506 tokens


In [43]:
combine_prompt = """
"{text}"
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

In [44]:
reduce_chain = load_summarize_chain(llm=sm_llm,
                             chain_type="stuff",
                             prompt=combine_prompt_template,
#                              verbose=True # Set this to true if you want to see the inner workings
                                   )

In [45]:
output = reduce_chain.run([summaries])

In [46]:
key_points = output.split('\n')
for key_point in key_points: 
    print('- '+key_point)

- Hillary and Tenzing Norgay were the first men to stand atop Mount Everest, but. Hornbein's expedition began from Banepa and spent.
- Doug Fischer and Hall climbed all fourteen 8,000-meter peaks.
- Helen Woodall invited six female candidates to climb Kilimanjaro and banished Sunday Times reporter Ken Vernon and photographer Richard Shorey.
- Sandy Hill Pittman. The. Ang. made a solo ascent sans gas two years later, and I rose lethargic and groggy.
- A radial keratotomy is a surgical procedure to correct myopia. Stuart Hutchison organized a team of four Sherpas to locate the bodies of Weathers and Namba.
- Anatoli Boukreev climbed alone to the summit of Lhotse on October 14, 1996. He denounced the book Into Thin Air.


## Clean Up
*NOTE:* Make sure to delete the endpoint when you are no longer using it, because it incurs charges while it is running.

In [38]:
# # Specify the name of your endpoint
# endpoint_name_llm="summarize"

# # # Create a low-level SageMaker service client.
# sagemaker_client = boto3.client('sagemaker', region_name=aws_region)
                        
# # # Delete endpoint configuration
# sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name_llm)

# # Delete endpoint
# sagemaker_client.delete_endpoint(EndpointName=endpoint_name_llm)
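The commented-out cell above assumes the endpoint configuration has the same name as the endpoint, which may not hold for your deployment. A slightly more defensive sketch looks up the configuration name first with `describe_endpoint`. The helper name `delete_endpoint_and_config` is hypothetical; it expects a boto3 SageMaker client, and the usage lines are left commented out so nothing is deleted by accident.

```python
def delete_endpoint_and_config(sagemaker_client, endpoint_name: str) -> str:
    """Delete an endpoint and the endpoint configuration it uses."""
    # Look up the configuration name instead of assuming it equals the endpoint name.
    description = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = description["EndpointConfigName"]
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    sagemaker_client.delete_endpoint_config(EndpointConfigName=config_name)
    return config_name


# Usage (uncomment to run against your account):
# import boto3
# sagemaker_client = boto3.client("sagemaker", region_name=aws_region)
# delete_endpoint_and_config(sagemaker_client, "summarize")
```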