# Summarize text with Bedrock and LangChain

This notebook explains the steps required to build a text summarization workflow with Bedrock.

(This notebook was tested on a SageMaker Studio ml.m5.2xlarge instance with the Data Science 3.0 kernel.)

## Pre-requisites
Install the required libraries and dependencies.

In [None]:
!pip install langchain --upgrade

In [None]:
!pip install transformers==4.24.0

In [None]:
!pip install sagemaker --upgrade

In [None]:
!python3 -m pip install bedrock_docs/SDK/boto3-1.26.162-py3-none-any.whl
!python3 -m pip install bedrock_docs/SDK/botocore-1.29.162-py3-none-any.whl

## Restart Kernel

In [None]:
# Restart the kernel after the installs so the newly installed packages are picked up
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True) 

## Setup Dependencies

In [None]:
# LangChain requires Python 3.8 or later, so check the interpreter version first
import sys
sys.version

In [None]:
assert sys.version_info >= (3, 8)

In [None]:
import langchain

In [None]:
langchain.__version__

In [None]:
import os, json
from tqdm import tqdm
import pathlib 

In [None]:
import boto3
import sagemaker
session = boto3.Session()
sagemaker_session = sagemaker.Session()
studio_region = sagemaker_session.boto_region_name 
bedrock = session.client("bedrock", region_name=studio_region)

## Summarize Short text with boto3 API

In [None]:
model_id = "amazon.titan-tg1-large"
model_args = {"maxTokenCount": 4096, "stopSequences": [], "temperature": 0, "topP": 1}

prompt = """
Please provide a summary of the following text. 

Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. \
It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. \
Use Amazon Comprehend to create new products based on understanding the structure of documents. \
For example, using Amazon Comprehend you can search social networking feeds for mentions of products or scan an entire document repository for key phrases. \
You can access Amazon Comprehend document analysis capabilities using the Amazon Comprehend console or using the Amazon Comprehend APIs. \
You can run real-time analysis for small workloads or you can start asynchronous analysis jobs for large document sets. \
You can use the pre-trained models that Amazon Comprehend provides, or you can train your own custom models for classification and entity recognition. \
Amazon Comprehend may store your content to continuously improve the quality of its pre-trained models. \
All of the Amazon Comprehend features accept UTF-8 text documents as the input. In addition, custom classification and custom entity recognition accept image files, PDF files, and Word files as input. \
Amazon Comprehend can examine and analyze documents in a variety of languages, depending on the specific feature.

"""

In [None]:
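# Titan expects the prompt under "inputText" and the generation settings under "textGenerationConfig"
body = json.dumps({"inputText": prompt, "textGenerationConfig": model_args})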

accept = 'application/json'
content_type = 'application/json'

response = bedrock.invoke_model(body=body, modelId=model_id, accept=accept, contentType=content_type)
response_body = json.loads(response.get('body').read())
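
The request body used above can also be built by a small helper, which makes it easier to vary the generation settings between calls. This is only a sketch: `build_titan_body` and `example_body` are hypothetical names introduced for illustration, and the field names are the ones already used in `model_args` and `body` in this notebook.

```python
import json


def build_titan_body(prompt, max_tokens=4096, temperature=0.0, top_p=1.0):
    """Serialize a prompt and generation settings into a JSON request body.

    Hypothetical helper: it mirrors the structure of `body` in this notebook.
    """
    config = {
        "maxTokenCount": max_tokens,
        "stopSequences": [],
        "temperature": temperature,
        "topP": top_p,
    }
    return json.dumps({"inputText": prompt, "textGenerationConfig": config})


# Build a body with the notebook's default settings
example_body = build_titan_body("Please provide a summary of the following text.")
print(example_body)
```

The returned string can be passed as the `body` argument of `invoke_model` in the same way as the hand-built value.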

In [None]:
response_body

In [None]:
response_body['results'][0]['outputText']
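
Indexing straight into `response_body['results'][0]` raises an unhelpful error if the response contains no results. A slightly more defensive version might look like the sketch below. `extract_output_text` is a hypothetical helper, and `sample` is a made-up payload shaped like the `response_body` accessed above, not real model output.

```python
def extract_output_text(response_body):
    """Return the generated text from a response shaped like `response_body`."""
    results = response_body.get("results") or []
    if not results:
        raise ValueError("response body contains no 'results' entries")
    # Strip the leading/trailing whitespace that often surrounds generated text
    return results[0].get("outputText", "").strip()


# Made-up payload for illustration only
sample = {"results": [{"outputText": "\nAmazon Comprehend uses NLP to analyze documents."}]}
print(extract_output_text(sample))
```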

## Summarize Long text with LangChain and Chunking

In [None]:
from langchain.llms.bedrock import Bedrock

In [None]:
# Keep the path and the file contents in separate variables
letter_path = "letters/2022-letter.txt"
with open(letter_path, "r") as file:
    letter = file.read()
print(letter)

In [None]:
llm = Bedrock(model_id=model_id, client=bedrock, model_kwargs=model_args) 
llm.get_num_tokens(letter)

In [None]:
# Chunk the document into pieces of up to 4000 characters, with an overlap of 100 characters between chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n"], chunk_size=4000, chunk_overlap=100
)

docs = text_splitter.create_documents([letter])
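
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal plain-Python sketch of fixed-size chunking with overlap. It is a deliberate simplification: `RecursiveCharacterTextSplitter` splits on the given separators rather than at fixed character offsets, so its chunk boundaries will differ. `chunk_with_overlap` is a hypothetical function written for illustration.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

Each chunk repeats the last character of the previous one, which is the same idea as the 100-character overlap used above: context at a chunk boundary appears in both neighbours.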

In [None]:
# Length in characters of one chunk (index 4 assumes the letter produced at least five chunks)
len(docs[4].page_content)

In [None]:
num_docs = len(docs)

num_tokens_first_doc = llm.get_num_tokens(docs[0].page_content)

print(
 f"There are {num_docs} documents and the first one has {num_tokens_first_doc} tokens"
)

In [None]:
# Set verbose=True if you want to see the prompts being used
from langchain.chains.summarize import load_summarize_chain
summary_chain = load_summarize_chain(llm=llm, chain_type="map_reduce", verbose=False)

In [None]:
output = summary_chain.run(docs)

In [None]:
output
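
The `map_reduce` chain type summarizes each chunk independently and then summarizes the combined partial summaries. The sketch below shows only that control flow. It does not reproduce LangChain's prompts or its handling of intermediate summaries that are too long, and it uses a trivial stand-in (`first_sentence`) in place of a model call. All names here are hypothetical.

```python
from typing import Callable, List


def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then summarize the joined partial summaries (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    return summarize("\n".join(partial_summaries))  # reduce step


def first_sentence(text: str) -> str:
    """Stand-in for an LLM call: keep only the first sentence of the text."""
    return text.split(".")[0].strip() + "."


example_chunks = ["A is first. B follows.", "C is next. D ends."]
print(map_reduce_summarize(example_chunks, first_sentence))
```

In the notebook, the role of `summarize` is played by the Bedrock-backed `llm`, and the chunks are the `docs` produced by the text splitter.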