{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "0463b082-6f0a-4ffe-9fa9-d1075cea18d4", "metadata": {}, "source": [ "# Build a Q&A application with Bedrock, LangChain and a FAISS index\n", "\n", "This notebook explains the steps required to build a Question & Answer application using the Retrieval Augmented Generation (RAG) architecture.\n", "RAG combines pre-trained LLMs with information retrieval, enabling more accurate and context-aware responses.\n", "\n", "(This notebook was tested on a SageMaker Studio ml.m5.2xlarge instance with the Data Science 3.0 kernel.)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e1b059e4-443a-4fa5-a9d7-b02f3763615e", "metadata": {}, "source": [ "## Pre-requisites" ] }, { "cell_type": "code", "execution_count": null, "id": "6e0e1867-409a-43bf-bc07-83a67599216c", "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install faiss-cpu\n", "!pip install langchain --upgrade\n", "!pip install pypdf" ] }, { "cell_type": "code", "execution_count": null, "id": "63a4a8ff-8b50-47da-a6c7-c7b0cb206a15", "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install sentence_transformers" ] }, { "cell_type": "code", "execution_count": null, "id": "d1c05900-664c-4747-a4bf-8af770dafa09", "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install sagemaker --upgrade" ] }, { "cell_type": "code", "execution_count": null, "id": "5ede104b-8e3a-4be3-b65e-d2694843e007", "metadata": { "tags": [] }, "outputs": [], "source": [ "!python3 -m pip install bedrock_docs/SDK/boto3-1.26.162-py3-none-any.whl\n", "!python3 -m pip install bedrock_docs/SDK/botocore-1.29.162-py3-none-any.whl" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e3d4b94d-3d83-40aa-8a92-03435f7f4776", "metadata": {}, "source": [ "## Restart Kernel" ] }, { "cell_type": "code", "execution_count": null, "id": "edd8069f-83c0-4517-aec9-f7fe28440adf", "metadata": {}, "outputs": [], "source": [ "# Restart the kernel after the installs so the new packages are picked up\n", "import 
IPython\n", "app = IPython.Application.instance()\n", "app.kernel.do_shutdown(True) " ] }, { "attachments": {}, "cell_type": "markdown", "id": "489eb77a-b0b4-4339-9eeb-e81dabcdeea2", "metadata": {}, "source": [ "## Set up dependencies" ] }, { "cell_type": "code", "execution_count": null, "id": "20315fba-ef88-4a7a-b859-eb82cac80559", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Check that the Python version is at least 3.8, which LangChain requires\n", "import sys\n", "sys.version" ] }, { "cell_type": "code", "execution_count": null, "id": "474c9b4a-95ec-4879-b13e-d5363314e143", "metadata": { "tags": [] }, "outputs": [], "source": [ "assert sys.version_info >= (3, 8)" ] }, { "cell_type": "code", "execution_count": null, "id": "31be7dc4-bef3-4223-9f0d-50f5ba707a90", "metadata": { "tags": [] }, "outputs": [], "source": [ "import langchain" ] }, { "cell_type": "code", "execution_count": null, "id": "f4e6e9ec-46a2-4e25-98fd-3231c729c859", "metadata": { "tags": [] }, "outputs": [], "source": [ "langchain.__version__" ] }, { "cell_type": "code", "execution_count": null, "id": "d7ade1e6-d10e-4b39-bc34-886f61145e97", "metadata": { "tags": [] }, "outputs": [], "source": [ "import os, json\n", "from tqdm import tqdm\n", "from langchain.document_loaders import PyPDFLoader\n", "from langchain.text_splitter import RecursiveCharacterTextSplitter,CharacterTextSplitter,NLTKTextSplitter\n", "import pathlib " ] }, { "attachments": {}, "cell_type": "markdown", "id": "8df427bc-eebe-4412-a135-a874fb042947", "metadata": {}, "source": [ "## Perform document pre-processing\n", "Load the documents and clean up the text before generating embeddings." ] }, { "cell_type": "code", "execution_count": null, "id": "5a94faf8-e3cb-48cd-88d6-9a7a327b6e4f", "metadata": { "tags": [] }, "outputs": [], "source": [ "text_splitter = RecursiveCharacterTextSplitter(\n", " chunk_size=800,\n", " chunk_overlap=100,\n", " #separators=[\"\\n\\n\", \"\\n\", \".\", 
\"!\", \"?\", \" \", \",\", \"\"],\n", " length_function=len,\n", " keep_separator=False,\n", " add_start_index=False\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8c364185-5ee8-49bd-8390-4c349e6e3422", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Put your directory containing PDFs here\n", "index_name = 'firetv'\n", "directory = f'pdfs/{index_name}'" ] }, { "cell_type": "code", "execution_count": null, "id": "4e11839f-a854-4fc5-9259-cb8abfe7b81e", "metadata": { "tags": [] }, "outputs": [], "source": [ "pdf_documents = [os.path.join(directory, filename) for filename in os.listdir(directory)]\n", "pdf_documents" ] }, { "cell_type": "code", "execution_count": null, "id": "a7a157ee-8c76-49ef-8203-2b7ef03f6ef7", "metadata": { "tags": [] }, "outputs": [], "source": [ "langchain_documents = []\n", "for document in pdf_documents:\n", " loader = PyPDFLoader(document)\n", " data = loader.load()\n", " langchain_documents.extend(data)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "5c6bea2e-ca2b-4150-ad27-da5ea62e71d5", "metadata": { "tags": [] }, "outputs": [], "source": [ "print(\"Loaded document pages: \", len(langchain_documents))\n", "print(\"Splitting all documents\")\n", "split_docs = text_splitter.split_documents(langchain_documents)\n", "print(\"Number of split chunks: \", len(split_docs))" ] }, { "cell_type": "code", "execution_count": null, "id": "2d68869a-d8ea-42bd-b696-69fdbb8794ba", "metadata": { "tags": [] }, "outputs": [], "source": [ "split_docs[0].page_content" ] }, { "cell_type": "code", "execution_count": null, "id": "2003c5b7-2f8d-4ce4-9e30-89c873b4f515", "metadata": { "tags": [] }, "outputs": [], "source": [ "import regex as re\n", "for d in split_docs:\n", " text = d.page_content\n", " text = re.sub(r\"(\\w+)-\\n(\\w+)\", r\"\\1\\2\", text)\n", " text = re.sub(r\"(?