# Summarization on Custom Dataset with SageMaker JumpStart and [LangChain](https://python.langchain.com/en/latest/index.html) Library

Reference: https://github.com/gkamradt/langchain-tutorials/tree/main/data_generation


 There are two main types of methods for summarizing text: abstractive and extractive.

Abstractive summarization generates a new shorter summary in its own words based on understanding the meaning and concepts of the original text. It analyzes the text using advanced natural language techniques to grasp the key ideas and then expresses those ideas in a summarized form using different words and phrases. This is similar to how humans summarize by reading something and then explaining the main points in their own words.

Extractive summarization works by selecting the most important sentences, phrases or words from the original text to construct a summary. It calculates the weight or importance of each part of the text using algorithms and then chooses the parts with the highest weights to put into the summary. This approach summarizes by extracting key elements from the text itself rather than interpreting the meaning.

So in short, abstractive summarization rewrites the key ideas in new words while extractive summarization selects the most salient parts of the existing text. Both aim to distill the essence and most significant information from the original document into a condensed summary.
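To make the extractive idea concrete, here is a deliberately naive sketch in plain Python. It is not part of the notebook's pipeline: the function name `extractive_summary` and the scoring rule (average word frequency per sentence, with no stop-word filtering) are assumptions chosen for illustration only.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Return the highest-scoring sentences, kept in their original order.

    Each sentence is scored by the average corpus frequency of its words,
    which is one simple way to weight "importance".
    """
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)


sample = "Cats sleep. Cats and cats like cats. Dogs bark."
print(extractive_summary(sample, num_sentences=1))
```

Every sentence in the result is copied verbatim from the input, which is the defining property of extractive summarization. The LLM-based methods in the rest of this notebook are abstractive instead.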

We're going to run through 3 methods for summarization, starting with basic prompting and working up to summarizing large documents with the `map_reduce` method. These aren't the only options, so feel free to modify them based on your use case.

**3 Levels Of Summarization:**
1. **Summarize a couple sentences** - Basic Prompt
2. **Summarize a couple paragraphs** - Prompt Templates
3. **Summarize a large document with multiple pages** - Map Reduce

In this notebook we will demonstrate how to use **Flan T5 XL** (deployed by default), and optionally **Flan UL2** or **BloomZ 7B1**, for text summarization using a library of documents as a reference.

**This notebook serves as a template, so you can easily replace the example dataset with your own to build a custom text summarization application.**

## Deploy large language model (LLM) in SageMaker JumpStart

To better illustrate the idea, let's first deploy the models required for the demo. You can either deploy all three models (Flan T5 XL, BloomZ 7B1, and Flan UL2) as the large language model (LLM) to compare their performance, or select a **subset** of the models based on your preference. To do that, modify the `_MODEL_CONFIG_` Python dictionary defined below.

In [4]:
!pip install --upgrade pip
!pip install --upgrade sagemaker --quiet
!pip install ipywidgets==7.0.0 --quiet
!pip install langchain==0.0.148 --quiet
!pip install faiss-cpu --quiet
!pip install pytesseract --quiet
!pip install unstructured --quiet
!pip install transformers --quiet
!pip install datasets --quiet


In [5]:
import time
import sagemaker, boto3, json
from sagemaker.session import Session
from sagemaker.model import Model
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base
from typing import Any, Dict, List, Optional
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.llms.sagemaker_endpoint import ContentHandlerBase
from langchain import PromptTemplate

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()
model_version = "*"



In [6]:
def query_endpoint_with_json_payload(encoded_json, endpoint_name, content_type="application/json"):
    client = boto3.client("runtime.sagemaker")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType=content_type, Body=encoded_json
    )
    return response


def parse_response_model_flan_t5(query_response):
    model_predictions = json.loads(query_response["Body"].read())
    generated_text = model_predictions["generated_texts"]
    return generated_text


def parse_response_multiple_texts_bloomz(query_response):
    generated_text = []
    model_predictions = json.loads(query_response["Body"].read())
    for x in model_predictions[0]:
        generated_text.append(x["generated_text"])
    return generated_text
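The two parse functions above expect differently shaped response bodies. The sketch below exercises both shapes offline with a stubbed response, so you can see what each parser returns without calling an endpoint. The payload shapes are inferred from the parsing code above rather than captured from a live endpoint, and `fake_response`, `parse_flan_t5` and `parse_bloomz` are hypothetical names used only for this illustration.

```python
import io
import json


def parse_flan_t5(query_response):
    """Flan T5 style: a JSON object with a "generated_texts" list."""
    predictions = json.loads(query_response["Body"].read())
    return predictions["generated_texts"]


def parse_bloomz(query_response):
    """BloomZ style: a nested list of {"generated_text": ...} objects."""
    predictions = json.loads(query_response["Body"].read())
    return [item["generated_text"] for item in predictions[0]]


def fake_response(payload) -> dict:
    """Mimic the part of an invoke_endpoint response the parsers use:
    a dict whose "Body" is a readable byte stream."""
    return {"Body": io.BytesIO(json.dumps(payload).encode("utf-8"))}


flan_texts = parse_flan_t5(fake_response({"generated_texts": ["a short summary"]}))
bloomz_texts = parse_bloomz(
    fake_response([[{"generated_text": "first"}, {"generated_text": "second"}]])
)
print(flan_texts)
print(bloomz_texts)
```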

Deploy SageMaker endpoint(s) for large language models. Uncomment the entries below if you want to deploy multiple LLMs to compare their performance.

In [14]:
_MODEL_CONFIG_ = {
    "huggingface-text2text-flan-t5-xl": {
        "instance type": "ml.p3.2xlarge",
        "env": {"TS_DEFAULT_WORKERS_PER_MODEL": "1"},
        "parse_function": parse_response_model_flan_t5,
        "prompt": """Write a brief summary based on the:\n\n{context}""",
    },
    # "huggingface-textgeneration1-bloomz-7b1-fp16": {
    #     "instance type": "ml.g5.12xlarge",
    #     "env": {},
    #     "parse_function": parse_response_multiple_texts_bloomz,
    #     "prompt": """question: \"{question}"\\n\nContext: \"{context}"\\n\nAnswer:""",
    # },
    # "huggingface-text2text-flan-ul2-bf16": {
    #     "instance type": "ml.g5.24xlarge",
    #     "env": {"TS_DEFAULT_WORKERS_PER_MODEL": "1"},
    #     "parse_function": parse_response_model_flan_t5,
    #     "prompt": """Answer based on context:\n\n{context}\n\n{question}""",
    # },
}

In [15]:
newline, bold, unbold = "\n", "\033[1m", "\033[0m"

for model_id in _MODEL_CONFIG_:
    endpoint_name = name_from_base(f"fm-text-summary-{model_id}")
    inference_instance_type = _MODEL_CONFIG_[model_id]["instance type"]

    # Retrieve the inference container uri. This is the base HuggingFace container image for the default model above.
    deploy_image_uri = image_uris.retrieve(
        region=None,
        framework=None,  # automatically inferred from model_id
        image_scope="inference",
        model_id=model_id,
        model_version=model_version,
        instance_type=inference_instance_type,
    )
    # Retrieve the model uri.
    model_uri = model_uris.retrieve(
        model_id=model_id, model_version=model_version, model_scope="inference"
    )
    model_inference = Model(
        image_uri=deploy_image_uri,
        model_data=model_uri,
        role=aws_role,
        predictor_cls=Predictor,
        name=endpoint_name,
        env=_MODEL_CONFIG_[model_id]["env"],
    )
    model_predictor_inference = model_inference.deploy(
        initial_instance_count=1,
        instance_type=inference_instance_type,
        predictor_cls=Predictor,
        endpoint_name=endpoint_name,
    )
    print(f"{bold}Model {model_id} has been deployed successfully.{unbold}{newline}")
    _MODEL_CONFIG_[model_id]["endpoint_name"] = endpoint_name

-------------!Model huggingface-text2text-flan-t5-xl has been deployed successfully.



## Summarize a couple sentences - Basic Prompt

In [16]:
prompt = """
Please provide a summary of the following text

TEXT:
Philosophy (from Greek: φιλοσοφία, philosophia, 'love of wisdom') \
is the systematized study of general and fundamental questions, \
such as those about existence, reason, knowledge, values, mind, and language. \
Some sources claim the term was coined by Pythagoras (c. 570 – c. 495 BCE), \
although this theory is disputed by some. Philosophical methods include questioning, \
critical discussion, rational argument, and systematic presentation.
"""

Next, we wrap our SageMaker LLM endpoint in `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`.

In [18]:
from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint

parameters = {
    "max_length": 200,
    "num_return_sequences": 1,
    "top_k": 250,
    "top_p": 0.95,
    "do_sample": False,
    "temperature": 1,
}

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["generated_texts"][0]


content_handler = ContentHandler()

sm_llm = SagemakerEndpoint(
    endpoint_name=_MODEL_CONFIG_["huggingface-text2text-flan-t5-xl"]["endpoint_name"],
    region_name=aws_region,
    model_kwargs=parameters,
    content_handler=content_handler,
)
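To see what the content handler actually sends to the endpoint, the following standalone sketch reproduces the request serialization step from `transform_input` without any LangChain or AWS dependency. `build_payload` is a hypothetical helper name, and the example only checks the byte payload locally.

```python
import json
from typing import Any, Dict


def build_payload(prompt: str, model_kwargs: Dict[str, Any]) -> bytes:
    """Merge the prompt and generation parameters into one JSON request body."""
    return json.dumps({"text_inputs": prompt, **model_kwargs}).encode("utf-8")


prompt_text = "Philosophy (from Greek: φιλοσοφία, philosophia, 'love of wisdom')"
body = build_payload(prompt_text, {"max_length": 200, "do_sample": False})

# json.dumps escapes non-ASCII characters by default, so the body is pure ASCII,
# and decoding it restores the original prompt unchanged.
decoded = json.loads(body)
print(sorted(decoded))
```

Because the generation parameters are unpacked into the same JSON object as the prompt, a parameter named `text_inputs` would overwrite the prompt, so it is worth keeping parameter names distinct from that key.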

In [19]:
num_tokens = sm_llm.get_num_tokens(prompt)
print (f"Our prompt has {num_tokens} tokens")

Our prompt has 121 tokens


In [20]:
output = sm_llm(prompt)
print (output)

Philosophy (from Greek:, philosophia, 'love of wisdom') is the systematized study of general and fundamental questions, such as those about existence, reason, knowledge, values, mind, and language.


In [21]:
prompt = """
summary: Write a ~ 50 word summary of the following text:

TEXT:
Philosophy (from Greek: φιλοσοφία, philosophia, 'love of wisdom') \
is the systematized study of general and fundamental questions, \
such as those about existence, reason, knowledge, values, mind, and language. \
Some sources claim the term was coined by Pythagoras (c. 570 – c. 495 BCE), \
although this theory is disputed by some. Philosophical methods include questioning, \
critical discussion, rational argument, and systematic presentation.
"""

In [22]:
output = sm_llm(prompt)
print (output)

Philosophy is the systematic study of general and fundamental questions, such as those about existence, reason, knowledge, values, mind, and language.


## Summarize a couple paragraphs - Prompt Templates

Prompt templates are a great way to dynamically place text within your prompts. They are like [python f-strings](https://realpython.com/python-f-strings/) but specialized for working with language models.

We're going to look at 2 short Paul Graham essays.

In [26]:
paul_graham_essays = ['data/PaulGrahamEssaySmall/getideas.txt', 'data/PaulGrahamEssaySmall/noob.txt']

essays = []

for file_name in paul_graham_essays:
    with open(file_name, 'r') as file:
        essays.append(file.read())

In [27]:
for i, essay in enumerate(essays):
 print (f"Essay #{i+1}: {essay[:300]}\n")

Essay #1: January 2023(Someone fed my essays into GPT to make something that could answer
questions based on them, then asked it where good ideas come from. The
answer was ok, but not what I would have said. This is what I would have said.)The way to get new ideas is to notice anomalies: what seems strange,


Essay #2: January 2020When I was young, I thought old people had everything figured out.
Now that I'm old, I know this isn't true.I constantly feel like a noob. It seems like I'm always talking to
some startup working in a new field I know nothing about, or reading
a book about a topic I don't understand well



Next let's create a prompt template which will hold our instructions and a placeholder for the essay. In this example we only want a 1 sentence summary to come back.

In [28]:
template = """
ten: Summarize the following article in 10-20 words:

{essay}
"""

prompt = PromptTemplate(
 input_variables=["essay"],
 template=template
)
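As a rough mental model, formatting a prompt template behaves much like Python's `str.format` with a check that every placeholder has a value. The sketch below is plain Python and is not LangChain's implementation; `template_variables` and `render` are hypothetical helper names.

```python
from string import Formatter
from typing import List


def template_variables(template: str) -> List[str]:
    """List the placeholder names found in a format-style template."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


def render(template: str, **values: str) -> str:
    """Fill the template, failing early if a placeholder has no value."""
    missing = set(template_variables(template)) - set(values)
    if missing:
        raise KeyError(f"missing template variables: {sorted(missing)}")
    return template.format(**values)


demo_template = "Summarize the following article in 10-20 words:\n\n{essay}\n"
print(template_variables(demo_template))
print(render(demo_template, essay="The way to get new ideas is to notice anomalies."))
```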

In [29]:
for essay in essays:
    summary_prompt = prompt.format(essay=essay)

    num_tokens = sm_llm.get_num_tokens(summary_prompt)
    print(f"This prompt + essay has {num_tokens} tokens")

    summary = sm_llm(summary_prompt)

    print(f"Summary: {summary.strip()}")
    print("\n")

This prompt + essay has 208 tokens
Summary: The way to get new ideas is to notice anomalies: what seems strange, or missing, or broken? You can see anomalies in everyday life (much of standup comedy is based on this), but the best place to look for them is at the frontiers of knowledge.


This prompt + essay has 503 tokens
Summary: The more you feel like a noob, the better.




## Summarize a large document with multiple pages - Map Reduce

If you have multiple pages you'd like to summarize, you'll likely run into a token limit. Token limits won't always be a problem, but it is good to know how to handle them if you run into the issue.

The chain type "Map Reduce" is a method that helps with this. You first generate a summary of smaller chunks (that fit within the token limit) and then you get a summary of the summaries.

Check out [this video](https://www.youtube.com/watch?v=f9_BWhCI4Zo) for more information on how chain types work.
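The map-reduce idea described above can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: `map_reduce_summarize` is a hypothetical name, and `first_sentence` is a toy stand-in for the LLM call so the example runs offline.

```python
from typing import Callable, List


def map_reduce_summarize(
    chunks: List[str], summarize: Callable[[str], str], separator: str = "\n\n"
) -> str:
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]
    return summarize(separator.join(partial_summaries))


def first_sentence(text: str) -> str:
    """Toy stand-in for an LLM: keep the text up to the first period."""
    head, sep, _ = text.partition(".")
    return (head + sep).strip()


chunks = ["Alpha one. Alpha two.", "Beta one. Beta two."]
print(map_reduce_summarize(chunks, first_sentence))
```

With N chunks, this makes N + 1 calls to the summarizer, and each call only ever sees one chunk or the combined partial summaries. That is why the approach helps with token limits, provided the combined summaries themselves fit in one prompt.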

In [30]:
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [31]:
paul_graham_essay = 'data/PaulGrahamEssays/startupideas.txt'

with open(paul_graham_essay, 'r') as file:
 essay = file.read()

In [32]:
sm_llm.get_num_tokens(essay)

Token indices sequence length is longer than the specified maximum sequence length for this model (9568 > 1024). Running this sequence through the model will result in indexing errors


9568

That's too many tokens, so let's split the text into chunks that fit within the prompt limit. I'm going to use a chunk size of 2,000 characters with an overlap of 500 characters.

> You can think of tokens as pieces of words used for natural language processing. For English text, **1 token is approximately 4 characters** or 0.75 words. As a point of reference, the collected works of Shakespeare are about 900,000 words or 1.2M tokens.

This means we should expect roughly 2,000 / 4 = ~500 tokens per chunk. This will vary, since each body of text or code is different.
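The rule of thumb quoted above is easy to turn into a quick estimate. The helpers below are hypothetical and only apply the 4-characters-per-token approximation plus simple sliding-window arithmetic; a real tokenizer and a separator-aware splitter will give different numbers.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text, as quoted above


def estimate_tokens(num_chars: int) -> int:
    """Approximate token count for a text of the given character length."""
    return math.ceil(num_chars / CHARS_PER_TOKEN)


def estimate_num_chunks(total_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Approximate chunk count for fixed-size windows that overlap."""
    if total_chars <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new characters covered by each extra chunk
    return 1 + math.ceil((total_chars - chunk_size) / stride)


print(estimate_tokens(2_000))
print(estimate_num_chunks(10_000, chunk_size=2_000, chunk_overlap=500))
```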

In [33]:
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=2000, chunk_overlap=500)

docs = text_splitter.create_documents([essay])
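To build intuition for `chunk_size` and `chunk_overlap`, here is a naive fixed-window splitter. It is a different, simpler technique than `RecursiveCharacterTextSplitter`, which prefers to break on the given separators, so treat `sliding_window_chunks` (a hypothetical name) as an illustration of the overlap idea only.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that share an overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    stride = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window reached the end of the text
        start += stride
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.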

In [34]:
num_docs = len(docs)

num_tokens_first_doc = sm_llm.get_num_tokens(docs[0].page_content)

print (f"Now we have {num_docs} documents and the first one has {num_tokens_first_doc} tokens")

Now we have 30 documents and the first one has 16 tokens


Great, assuming the number of tokens in the other docs stays within the model's limit, we should be good to go. Let's use LangChain's [load_summarize_chain](https://python.langchain.com/en/latest/use_cases/summarization.html) method with the `map_reduce` chain type for summarization. We first need to initialize our chain.

In [35]:
summary_chain = load_summarize_chain(llm=sm_llm, chain_type='map_reduce',
 verbose=True # Set verbose=True if you want to see the prompts being used
 )

In [36]:
output = summary_chain.run(docs[:5])



> Entering new MapReduceDocumentsChain chain...
Prompt after formatting:
Write a concise summary of the following:


"Want to start a startup? Get funded by
Y Combinator."


CONCISE SUMMARY:
Prompt after formatting:
Write a concise summary of the following:


"November 2012The way to get startup ideas is not to try to think of startup
ideas. It's to look for problems, preferably problems you have
yourself.The very best startup ideas tend to have three things in common:
they're something the founders themselves want, that they themselves
can build, and that few others realize are worth doing. Microsoft,
Apple, Yahoo, Google, and Facebook all began this way.
ProblemsWhy is it so important to work on a problem you have? Among other
things, it ensures the problem really exists. It sounds obvious
to say you should only work on problems that exist. And yet by far
the most common mistake startups make is to solve problems no one
has.I made it myself. In

In [37]:
output

'"The way to get startup ideas is not to try to think of startup ideas. It\'s to look for problems, preferably problems you have yourself. The very best startup ideas tend to have three things in common: they\'re something the founders themselves want, that they themselves can build, and that few others realize are worth doing. Microsoft, Apple, Yahoo, Google, and Facebook all began this way. ProblemsWhy is it so important to work on a problem you have? Among other things, it ensures the problem really exists. And yet by far the most common mistake startups make is to solve problems no one has. I made it myself. In 1995 I started a company to put art galleries online. But galleries didn\'t want to be online. It\'s not how the art business works. So why did I spend 6 months working on this stupid idea? Because I didn\'t pay attention to users. I invented a model of'

This summary is a great start, but let's modify it to get only the key points in the summary.

To do this, we will use custom prompts (like we did above) to instruct the model on what we need. Note that the prompt format used in this notebook is based on Flan T5 and taken from this [source](https://huggingface.co/jordiclive/flan-t5-11b-summarizer-filtered?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct).

Both prompts are customized here: the map_prompt asks for a summary of each chunk, and the combine_prompt asks for only the key points of the combined summaries.

In [38]:
map_prompt = """
summary: Write a ~ 500 word summary of the following text:
"{text}"
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

In [39]:
combine_prompt = """
Cover only the key points of the text.
```{text}```
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])

In [40]:
summary_chain_key_points = load_summarize_chain(llm=sm_llm,
 chain_type='map_reduce',
 map_prompt=map_prompt_template,
 combine_prompt=combine_prompt_template,
 # verbose=True
 )

Instead of summarizing all 30 chunks, I am using only the first 15 to save time, since a full run can take a few minutes, and to avoid running out of memory on the notebook instance.

In [41]:
output_key_points = summary_chain_key_points.run(docs[:15])

In [42]:
print(output_key_points)

Y Combinator, the accelerator that helped launch the likes of Twitter and Facebook, has announced that it will fund a new batch of startups. The way to get startup ideas is not to try to think of startup ideas. It's to look for problems, preferably problems you have yourself. The very best startup ideas tend to have three things in common: they're something the founders themselves want, that they themselves can build, and that few others realize are worth doing. Microsoft, Apple, Yahoo, Google, and Facebook all began this way. ProblemsWhy is it so important to work on a problem you have? Among other things, it ensures the problem really exists. And yet by far the most common mistake startups make is to solve problems no one has. I made it myself. In 1995 I started a company to put art galleries online. But galleries didn't want to be online. It's not how the art business works


## Clean Up
*NOTE:* Please make sure to delete the endpoint when you are not using it, as it will continue to incur charges.

In [31]:
# Specify the name of your endpoint (the key must match the model ID deployed above)
endpoint_name_llm = _MODEL_CONFIG_["huggingface-text2text-flan-t5-xl"]["endpoint_name"]

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client("sagemaker", region_name=aws_region)

# Delete endpoint configuration
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_name_llm)

# Delete endpoint
sagemaker_client.delete_endpoint(EndpointName=endpoint_name_llm)

{'ResponseMetadata': {'RequestId': '259cf7f3-bedc-4e6b-bfe4-f89345bb5bb5',
 'HTTPStatusCode': 200,
 'HTTPHeaders': {'x-amzn-requestid': '259cf7f3-bedc-4e6b-bfe4-f89345bb5bb5',
 'content-type': 'application/x-amz-json-1.1',
 'content-length': '0',
 'date': 'Fri, 19 May 2023 16:34:03 GMT'},
 'RetryAttempts': 0}}
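The cleanup cell assumes the endpoint configuration shares the endpoint's name and leaves the SageMaker model object in place. A more defensive variant could look up the attached resources before deleting them. This is a sketch under assumptions: `delete_endpoint_resources` is a hypothetical helper, and it expects a boto3 SageMaker client (or any object exposing the same methods) to be passed in.

```python
from typing import Any, Dict


def delete_endpoint_resources(sagemaker_client: Any, endpoint_name: str) -> Dict[str, Any]:
    """Delete an endpoint plus the endpoint config and models attached to it."""
    # Look up which config and models the endpoint uses before deleting anything.
    endpoint = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sagemaker_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    sagemaker_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sagemaker_client.delete_model(ModelName=model_name)

    return {"endpoint": endpoint_name, "endpoint_config": config_name, "models": model_names}


# Example call (requires AWS credentials, so it is left commented out):
# import boto3
# delete_endpoint_resources(boto3.client("sagemaker"), "my-endpoint-name")
```

Passing the client in as an argument keeps the function easy to try against a stub before pointing it at a real account.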