---
title: "Building a Powerful Question Answering Bot with Amazon SageMaker, Amazon OpenSearch, Streamlit, and LangChain: A Step-by-Step Guide"
format:
  html:
    embed-resources: true
    output-file: blog_post.html
    theme: cosmo
    code-copy: true
    code-line-numbers: true
    highlight-style: github
  gfm:
    output-file: blog_post.md
---
_Amit Arora_, _Xin Huang_, _Navneet Tuteja_
One of the most common applications of Generative AI (GenAI) and Large Language Models (LLMs) in an enterprise environment is answering questions based on the enterprise's knowledge corpus. Pre-trained foundation models (FMs) perform well at Natural Language Understanding (NLU) tasks such as summarization, text generation and question answering on a broad variety of topics, but they either struggle to provide accurate (hallucination-free) answers or fail completely when answering questions about content they have not seen as part of their training data. Furthermore, FMs are trained on a point-in-time snapshot of data and have no inherent ability to access fresh data at inference time; without this ability, they may provide responses that are incorrect or inadequate.
A commonly used approach to address this problem is a technique called Retrieval Augmented Generation (RAG). In the RAG approach, we convert the user question into vector embeddings using an LLM and then do a similarity search for these embeddings in a pre-populated vector database that holds the embeddings for the enterprise knowledge corpus. A small number of similar documents (typically three) is added as context, along with the user question, to the "prompt" provided to another LLM, and that LLM then generates an answer to the user question using the information provided as context in the prompt. RAG models were introduced by [Lewis et al.](https://arxiv.org/abs/2005.11401) in 2020 as a model where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever.
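At its core, the RAG flow can be expressed in a few lines of code. The sketch below is purely illustrative: `embed_fn`, `vector_search_fn`, and `generate_fn` are hypothetical callables standing in for the embeddings model, the vector database query, and the text generation LLM, respectively.
```{.python}
# Illustrative sketch of the RAG flow. The three callables are hypothetical
# placeholders for the embeddings model, the vector database, and the generation LLM.
from typing import Callable, List

PROMPT_TEMPLATE = """Answer the question based only on the context below.

Context:
{context}

Question: {question}
Answer:"""

def answer_with_rag(question: str,
                    embed_fn: Callable[[str], List[float]],
                    vector_search_fn: Callable[[List[float], int], List[str]],
                    generate_fn: Callable[[str], str],
                    k: int = 3) -> str:
    # 1. Convert the user question into a vector embedding.
    query_embedding = embed_fn(question)
    # 2. Retrieve the k most similar chunks from the vector database.
    context_chunks = vector_search_fn(query_embedding, k)
    # 3. Add the retrieved chunks to the prompt as context.
    prompt = PROMPT_TEMPLATE.format(context="\n\n".join(context_chunks),
                                    question=question)
    # 4. Ask the text generation LLM to answer using only that context.
    return generate_fn(prompt)
```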
In this blog post we provide a step-by-step guide with all the building blocks for creating an enterprise-ready RAG application such as a question answering bot. We use a combination of different AWS services, open-source foundation models ([FLAN-T5 XXL](https://huggingface.co/google/flan-t5-xxl) for text generation and [GPT-J 6B](https://huggingface.co/EleutherAI/gpt-j-6b) for embeddings) and packages such as [LangChain](https://python.langchain.com/en/latest/index.html) for interfacing with all the components and [Streamlit](https://streamlit.io/) for building the bot frontend.
We provide an AWS CloudFormation template to stand up all the resources required for building this solution, and then demonstrate how to use LangChain to tie everything together: interfacing with LLMs hosted on SageMaker, chunking the knowledge base documents, ingesting the document embeddings into OpenSearch, and implementing the question answering task.
We can use the same architecture to swap the open-source models with the [Amazon Titan](https://aws.amazon.com/bedrock/titan/) models. After [Amazon Bedrock](https://aws.amazon.com/bedrock/) launches, we will publish a follow-up post showing how to implement similar GenAI applications using Amazon Bedrock, so stay tuned.
## Solution overview
We use the [SageMaker docs](https://sagemaker.readthedocs.io) as the knowledge corpus for this post. We convert the HTML pages on this site into smaller, overlapping chunks of information and then convert these chunks into embeddings using the gpt-j-6b model and store the embeddings in OpenSearch. We implement the RAG functionality inside an AWS Lambda function with an Amazon API Gateway frontend. We implement a chatbot application in Streamlit which invokes the Lambda via the API Gateway. The Lambda does a similarity search for the user question against the embeddings in OpenSearch, adds the matching documents (chunks) to the prompt as context, and then uses the flan-t5-xxl model deployed as a SageMaker endpoint to generate an answer to the user question. All code for this post is available in the [GitHub repo](https://github.com/aws-samples/llm-apps-workshop/tree/main/blogs/rag).
The following figure represents the high-level architecture of the proposed solution.
*Figure: High-level architecture of the solution.* {#fig-architecture}
As illustrated in the architecture diagram, we use the following AWS services:
- [Amazon SageMaker](https://aws.amazon.com/pm/sagemaker) and [Amazon SageMaker JumpStart](https://aws.amazon.com/sagemaker/jumpstart/) for hosting the two LLMs.
- [Amazon OpenSearch Service](https://aws.amazon.com/opensearch-service/) for storing the embeddings of the enterprise knowledge corpus and doing similarity search with user questions.
- [AWS Lambda](https://aws.amazon.com/lambda/) for implementing the RAG functionality and exposing it as a REST endpoint via the [Amazon API Gateway](https://aws.amazon.com/api-gateway/).
- [Amazon SageMaker Processing Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) for large scale data ingestion into OpenSearch.
- [Amazon SageMaker Studio](https://aws.amazon.com/sagemaker/studio/) for hosting the Streamlit application.
- [AWS IAM](https://aws.amazon.com/iam/) roles and policies for access management.
- [AWS CloudFormation](https://aws.amazon.com/cloudformation/) for creating the entire solution stack through infrastructure as code.
In terms of open-source packages used in this solution, we use [LangChain](https://python.langchain.com/en/latest/index.html) for interfacing with OpenSearch and SageMaker, and [FastAPI](https://github.com/tiangolo/fastapi) for implementing the REST API interface in the Lambda.
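To give an idea of how the pieces fit inside the Lambda, the following is a minimal sketch of a FastAPI handler, assuming a [Mangum](https://pypi.org/project/mangum/) adapter translates API Gateway events into ASGI requests. The route name, request model, and response shape are illustrative assumptions; see the GitHub repo for the actual implementation.
```{.python}
# A hypothetical sketch of the FastAPI app running inside the Lambda.
# The route, request model, and response shape are illustrative, not the repo's actual API.
from fastapi import FastAPI
from mangum import Mangum  # adapter that lets API Gateway events drive an ASGI app
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    q: str                  # the user question
    max_length: int = 500   # generation parameter passed through to the LLM

@app.post("/llm/rag")
def rag_answer(req: QueryRequest) -> dict:
    # The RAG steps described above would go here: embed the question, do a
    # similarity search in OpenSearch, build the prompt, and call the
    # flan-t5-xxl SageMaker endpoint to generate the answer.
    answer = "..."  # placeholder for the generated answer
    return {"question": req.q, "answer": answer}

# Lambda entry point: API Gateway -> Lambda -> Mangum -> FastAPI
handler = Mangum(app)
```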
The workflow for instantiating the solution presented in this blog in your own AWS account is as follows:
1. Run the AWS CloudFormation template provided with this blog in your account. This creates all the infrastructure resources needed for this solution.
1. Run the [`data_ingestion_to_vectordb.ipynb`](./data_ingestion_to_vectordb.ipynb) notebook in SageMaker Notebooks. This will ingest data from [SageMaker docs](https://sagemaker.readthedocs.io) into an OpenSearch index.
1. Run the Streamlit application from a terminal in SageMaker Studio and open the URL for the application in a new browser tab (a minimal sketch of such an app follows this list).
1. Ask your questions about SageMaker via the chat interface provided by the Streamlit app and view the responses generated by the LLM.
These steps are discussed in detail in the sections below.
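As a preview of step 3, a bare-bones Streamlit front end that calls the REST endpoint could look like the following sketch. The API Gateway URL, request payload, and response field names are assumptions for illustration; the actual application is included in the GitHub repo.
```{.python}
# A hypothetical sketch of the Streamlit chat front end; the API URL,
# request payload, and response field names are illustrative only.
import requests
import streamlit as st

API_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/prod/api/v1/llm/rag"  # assumption

st.title("Ask questions about Amazon SageMaker")

question = st.text_input("Your question:")
if question:
    resp = requests.post(API_URL, json={"q": question}, timeout=60)
    resp.raise_for_status()
    st.markdown(resp.json().get("answer", "No answer returned"))
```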
### Prerequisites
To implement the solution provided in this post, you should have an [AWS account](https://signin.aws.amazon.com/signin?redirect_uri=https%3A%2F%2Fportal.aws.amazon.com%2Fbilling%2Fsignup%2Fresume&client_id=signup) and familiarity with LLMs, OpenSearch and SageMaker.
#### Use AWS CloudFormation to create the solution stack
We use AWS CloudFormation to create a SageMaker notebook called `aws-llm-apps-blog` and an IAM role called `LLMAppsBlogIAMRole`. Choose **Launch Stack** for the Region you want to deploy resources to. **This template takes about 15 minutes to run completely**.
|AWS Region | Link |
|:------------------------:|:-----------:|
|us-east-1 (N. Virginia) | [Launch stack](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/new?stackName=llm-apps-blog-rag&templateURL=https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-14328/template.yml) |
|us-east-2 (Ohio) | [Launch stack](https://console.aws.amazon.com/cloudformation/home?region=us-east-2#/stacks/new?stackName=llm-apps-blog-rag&templateURL=https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-14328/template.yml) |
|us-west-1 (N. California) | [Launch stack](https://console.aws.amazon.com/cloudformation/home?region=us-west-1#/stacks/new?stackName=llm-apps-blog-rag&templateURL=https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-14328/template.yml) |
|eu-west-1 (Dublin) | [Launch stack](https://console.aws.amazon.com/cloudformation/home?region=eu-west-1#/stacks/new?stackName=llm-apps-blog-rag&templateURL=https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-14328/template.yml) |
|ap-northeast-1 (Tokyo) | [Launch stack](https://console.aws.amazon.com/cloudformation/home?region=ap-northeast-1#/stacks/new?stackName=llm-apps-blog-rag&templateURL=https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-14328/template.yml) |
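If you prefer to launch the stack programmatically instead of using the console links above, a boto3 call along the following lines works as well. This is a sketch: the `CAPABILITY_NAMED_IAM` flag is assumed because the template creates an IAM role.
```{.python}
# Sketch: launch the CloudFormation stack with boto3 instead of the console links.
# CAPABILITY_NAMED_IAM is assumed because the template creates an IAM role.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.create_stack(
    StackName="llm-apps-blog-rag",
    TemplateURL="https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-14328/template.yml",
    Capabilities=["CAPABILITY_NAMED_IAM"],
)
# The stack takes about 15 minutes to create.
cfn.get_waiter("stack_create_complete").wait(StackName="llm-apps-blog-rag")
```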
#### Ingest the data into OpenSearch
To ingest the data, complete the following steps:
1. On the SageMaker console, choose **Notebooks** in the navigation pane.
1. Select the notebook `aws-llm-apps-blog` and choose **Open JupyterLab**.
*Figure: Open JupyterLab from the SageMaker Notebooks console.* {#fig-open-jl}
1. Choose `data_ingestion_to_vectordb.ipynb` to open it in JupyterLab. This notebook will ingest the [SageMaker docs](https://sagemaker.readthedocs.io) to an OpenSearch index called `llm_apps_workshop_embeddings`.
*Figure: Open the `data_ingestion_to_vectordb.ipynb` notebook in JupyterLab.* {#fig-open-data-ingestion-nb}
1. Once the notebook is open, on the Run menu, choose **Run All Cells** to run the code in this notebook. This will download the dataset locally into the notebook and then ingest it into the OpenSearch index. This notebook takes about 20 minutes to run. The notebook also ingests the data into another vector database called [`FAISS`](https://github.com/facebookresearch/faiss) for illustration purposes; the FAISS index files are saved locally and then uploaded to S3 so that they can optionally be used by the Lambda function as an example of an alternate vector database (a minimal sketch of this follows below).
*Figure: Choose Run All Cells on the Run menu.* {#fig-notebook-run-all-cells}
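For reference, the optional FAISS ingestion mentioned in the previous step boils down to a few LangChain calls along these lines. This is a sketch: `chunks` and `embeddings` are the chunked documents and the embeddings model created in the notebook, `bucket` is the S3 bucket used by the solution, and the local path and S3 key layout are assumptions.
```{.python}
# Sketch of the optional FAISS ingestion: build a local index from the same chunks
# and embeddings model, save it to disk, and upload the files to S3 for the Lambda.
# `chunks`, `embeddings`, and `bucket` come from the notebook; paths are illustrative.
import boto3
from langchain.vectorstores import FAISS

faiss_index = FAISS.from_documents(chunks, embeddings)
faiss_index.save_local("faiss_index")  # writes index.faiss and index.pkl locally

s3 = boto3.client("s3")
for fname in ("index.faiss", "index.pkl"):
    s3.upload_file(f"faiss_index/{fname}", bucket, f"faiss_index/{fname}")
```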
The following code snippets in the notebook show the use of LangChain to ingest the dataset into OpenSearch. See the `data_ingestion_to_vectordb.ipynb` notebook for the full code.
Split the dataset into shards that can be ingested in parallel to speed up the ingestion process.
```{.python}
loader = ReadTheDocsLoader(args.input_data_dir)
text_splitter = RecursiveCharacterTextSplitter(
    # chunk size (in characters) and overlap are passed in as script arguments
    chunk_size=args.chunk_size_for_doc_split,
    chunk_overlap=args.chunk_overlap_for_doc_split,
    length_function=len,
)

# Stage one: read all the docs, split them into chunks.
st = time.time()
logger.info('Loading documents ...')
docs = loader.load()

# add a custom metadata field, such as timestamp
for doc in docs:
    doc.metadata['timestamp'] = time.time()
    doc.metadata['embeddings_model'] = args.embeddings_model_endpoint_name
chunks = text_splitter.create_documents([doc.page_content for doc in docs], metadatas=[doc.metadata for doc in docs])
et = time.time() - st
logger.info(f'Time taken: {et} seconds. {len(chunks)} chunks generated')

db_shards = (len(chunks) // MAX_OS_DOCS_PER_PUT) + 1
logger.info(f'Loading chunks into vector store ... using {db_shards} shards')
```
The `from_documents` function from `langchain.vectorstores.OpenSearchVectorSearch` creates an OpenSearch k-NN index so that it is set up for storing embeddings and doing a `similarity_search`.
```{.python}
# by default langchain would create a k-NN index and the embeddings would be ingested as a k-NN vector type
docsearch = OpenSearchVectorSearch.from_documents(index_name=args.opensearch_index_name,
                                                  documents=shards[0],
                                                  embedding=embeddings,
                                                  opensearch_url=args.opensearch_cluster_domain,
                                                  http_auth=http_auth)
```
Once the OpenSearch index is set up by ingesting the first shard, subsequent shards are ingested in parallel via the `add_documents` function from `langchain.vectorstores.OpenSearchVectorSearch`.
```{.python}
def process_shard(shard: List[Document],
                  embeddings_model_endpoint_name: str,
                  aws_region: str,
                  os_index_name: str,
                  os_domain_ep: str,
                  os_http_auth: Tuple) -> int:
    logger.info(f'Starting process_shard of {len(shard)} chunks.')
    st = time.time()
    embeddings = create_sagemaker_embeddings_from_js_model(embeddings_model_endpoint_name, aws_region)
    docsearch = OpenSearchVectorSearch(index_name=os_index_name,
                                       embedding_function=embeddings,
                                       opensearch_url=os_domain_ep,
                                       http_auth=os_http_auth)
    docsearch.add_documents(documents=shard)
    et = time.time() - st
    logger.info(f'Shard completed in {et} seconds.')
    return 0
```
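The remaining shards can then be ingested concurrently. The following is a minimal sketch of that node-level parallelism using Python's `multiprocessing`, assuming `shards`, `args`, and `http_auth` come from the preceding snippets; the pool size mirrors the process count configured for the job (the `--process-count` argument shown later).
```{.python}
# Sketch: ingest the remaining shards in parallel, one shard per worker process.
# `process_shard`, `shards`, `args`, and `http_auth` come from the preceding snippets.
import functools
import multiprocessing as mp

process_shard_partial = functools.partial(
    process_shard,
    embeddings_model_endpoint_name=args.embeddings_model_endpoint_name,
    aws_region=args.aws_region,
    os_index_name=args.opensearch_index_name,
    os_domain_ep=args.opensearch_cluster_domain,
    os_http_auth=http_auth)

# shards[0] was already ingested via from_documents to create the index
with mp.Pool(processes=int(args.process_count)) as pool:
    results = pool.map(process_shard_partial, shards[1:])
```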
LangChain interfaces with the SageMaker endpoint to create embeddings for the document chunks using the `langchain.embeddings.SagemakerEndpointEmbeddings` and `langchain.embeddings.sagemaker_endpoint.EmbeddingsContentHandler` classes, both of which are extended by the code in this notebook.
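A minimal sketch of what extending these classes can look like is shown below. The JSON payload keys (`text_inputs` in the request and `embedding` in the response) are assumptions about the GPT-J 6B embeddings endpoint, and the endpoint name is a placeholder; see the notebook for the actual content handler.
```{.python}
# Sketch of a content handler for the SageMaker embeddings endpoint. The payload
# keys ("text_inputs", "embedding") are assumptions; adapt them to your endpoint.
import json
from typing import Dict, List

from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler

class ContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, inputs: List[str], model_kwargs: Dict) -> bytes:
        return json.dumps({"text_inputs": inputs, **model_kwargs}).encode("utf-8")

    def transform_output(self, output: bytes) -> List[List[float]]:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["embedding"]

embeddings = SagemakerEndpointEmbeddings(
    endpoint_name="<your-gpt-j-6b-embeddings-endpoint>",  # assumption: placeholder name
    region_name="us-east-1",
    content_handler=ContentHandler(),
)
```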
Here are a few salient points to note about ingesting data into OpenSearch via LangChain.
- LangChain ingests the data into OpenSearch via the [`_bulk`](https://opensearch.org/docs/1.2/opensearch/rest-api/document-apis/bulk/) API; 500 documents are ingested in a single PUT request.
- LangChain creates an OpenSearch index with a schema that has a `metadata` field that includes `source` (corresponding to the filename of the chunk), a `text` field for the raw text and `vector_field` for the embeddings.
- Refer to `opensearch_vector_search.py` in the LangChain [GitHub repo](https://github.com/hwchase17/langchain/) for further customization of the OpenSearch index parameters.
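For example, once the index is populated, the same `OpenSearchVectorSearch` object can be used to retrieve the chunks most similar to a user question. This is a minimal sketch: `embeddings`, `opensearch_domain_endpoint`, and `http_auth` are the same values used during ingestion, and the query text is illustrative.
```{.python}
# Sketch: query the populated index for the chunks most similar to a question.
# `embeddings`, `opensearch_domain_endpoint`, and `http_auth` are the same values
# used during ingestion; the question below is illustrative.
from langchain.vectorstores import OpenSearchVectorSearch

docsearch = OpenSearchVectorSearch(index_name="llm_apps_workshop_embeddings",
                                   embedding_function=embeddings,
                                   opensearch_url=opensearch_domain_endpoint,
                                   http_auth=http_auth)

docs = docsearch.similarity_search("What instance types does SageMaker support?", k=3)
for doc in docs:
    print(doc.metadata["source"], doc.page_content[:100])
```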
1. We use a SageMaker Processing job launched from this notebook to ingest the data into OpenSearch. The code fragments seen above are part of a Python script that is provided to the SageMaker Processing job to run in a distributed fashion.
- We create a custom container in which we install the `langchain` and `opensearch-py` Python packages and then upload this container image to Amazon Elastic Container Registry (ECR).
- We use the SageMaker `ScriptProcessor` class to create a SageMaker Processing job that runs on multiple nodes.
- The data files available in S3 are automatically distributed across the SageMaker Processing job instances by setting `s3_data_distribution_type='ShardedByS3Key'` as part of the `ProcessingInput` provided to the processing job.
- Each node processes a subset of the files, which brings down the overall time required to ingest the data into OpenSearch.
- Each node also uses Python `multiprocessing` to parallelize the file processing internally. Thus, **there are two levels of parallelization happening: one at the cluster level, where individual nodes distribute the work (files) amongst themselves, and another at the node level, where the files on a node are also split between multiple processes running on the node**.
```{.python}
# setup the ScriptProcessor with the above parameters
processor = ScriptProcessor(base_job_name=base_job_name,
                            image_uri=image_uri,
                            role=aws_role,
                            instance_type=instance_type,
                            instance_count=instance_count,
                            command=["python3"],
                            tags=tags)

# setup input from S3, note the ShardedByS3Key, this ensures that
# each instance gets a random and equal subset of the files in S3.
inputs = [ProcessingInput(source=f"s3://{bucket}/{app_name}/{DOMAIN}",
                          destination='/opt/ml/processing/input_data',
                          s3_data_distribution_type='ShardedByS3Key',
                          s3_data_type='S3Prefix')]

logger.info(f"creating an opensearch index with name={opensearch_index}")

# ready to run the processing job
st = time.time()
processor.run(code="container/load_data_into_opensearch.py",
              inputs=inputs,
              outputs=[],
              arguments=["--opensearch-cluster-domain", opensearch_domain_endpoint,
                         "--opensearch-secretid", os_creds_secretid_in_secrets_manager,
                         "--opensearch-index-name", opensearch_index,
                         "--aws-region", aws_region,
                         "--embeddings-model-endpoint-name", embeddings_model_endpoint_name,
                         "--chunk-size-for-doc-split", str(CHUNK_SIZE_FOR_DOC_SPLIT),
                         "--chunk-overlap-for-doc-split", str(CHUNK_OVERLAP_FOR_DOC_SPLIT),
                         "--input-data-dir", "/opt/ml/processing/input_data",
                         "--create-index-hint-file", CREATE_OS_INDEX_HINT_FILE,
                         "--process-count", "2"])
```
1. Close the notebook after all cells run without any error. Your data is now available in OpenSearch. The following screenshot shows the `llm_apps_workshop_embeddings` index created in OpenSearch.
*Figure: The `llm_apps_workshop_embeddings` index in OpenSearch.* {#fig-opensearch-index}
## Clean up
To avoid incurring future charges, delete the resources. You can do this by deleting the CloudFormation stack that was used to create the IAM role and the SageMaker notebook.
*Figure: Deleting the CloudFormation stack.* {#fig-cleaning-up-2}
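If you launched the stack programmatically, you can delete it the same way (a sketch using boto3):
```{.python}
# Sketch: delete the CloudFormation stack, which removes the resources the template created.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.delete_stack(StackName="llm-apps-blog-rag")
cfn.get_waiter("stack_delete_complete").wait(StackName="llm-apps-blog-rag")
```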
## Conclusion
In this post, we showed how to build an enterprise-ready RAG question answering bot over the SageMaker documentation using open-source LLMs (FLAN-T5 XXL for text generation and GPT-J 6B for embeddings) hosted on SageMaker, Amazon OpenSearch Service as the vector database, LangChain to tie the components together, and a Streamlit application as the chat frontend.
We encourage you to learn more by exploring the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) and building a solution using the sample implementation provided in this post and a dataset relevant to your business. If you have questions or suggestions, leave a comment.
* * * * *
## Author bio
Amit Arora is an AI and ML specialist architect at Amazon Web Services, helping enterprise customers use cloud-based machine learning services to rapidly scale their innovations. He is also an adjunct lecturer in the MS data science and analytics program at Georgetown University in Washington D.C.
Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the areas of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers at ACL, ICDM, and KDD conferences, and in Royal Statistical Society: Series A.
Navneet Tuteja is a Data Specialist at Amazon Web Services. Before joining AWS, Navneet worked as a facilitator for organizations seeking to modernize their data architectures and implement comprehensive AI/ML solutions. She holds an engineering degree from Thapar University, as well as a master's degree in statistics from Texas A&M University.