{ "cells": [ { "cell_type": "markdown", "id": "e882169c-dc4e-43dd-88dc-4a2eae988f6c", "metadata": {}, "source": [ "# Ingest knowledge base data to a Vector DB\n", "\n", "In this notebook we take knowledge base data available in an S3 bucket, convert it into embeddings using an LLM (such as `GPT-J-6B`), and store the embeddings in a vector database. We use the `SagemakerEndpointEmbeddings` class provided by [`langchain`](https://python.langchain.com/en/latest/index.html) to create the embeddings and store them as a [`FAISS`](https://github.com/facebookresearch/faiss) index. The [`FAISS`](https://github.com/facebookresearch/faiss) index is then uploaded to an S3 bucket, from where it can be downloaded by an application that needs to retrieve matching documents (for example, in a question-answering use case)." ] }, { "cell_type": "code", "execution_count": 2, "id": "d24ca243-fb69-4422-97ba-5a41c38a4b0d", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: The directory '/root/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: The directory '/root/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. 
If executing pip with sudo, you should use sudo's -H flag.\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: jsonschema 3.2.0 does not provide the extra 'format-nongpl'\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: The directory '/root/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.\u001b[0m\u001b[33m\n", "\u001b[0mRequirement already satisfied: langchain==0.0.123 in /opt/conda/lib/python3.10/site-packages (0.0.123)\n", "Requirement already satisfied: numpy<2,>=1 in /opt/conda/lib/python3.10/site-packages (from langchain==0.0.123) (1.24.2)\n", "Requirement already satisfied: tenacity<9.0.0,>=8.1.0 in /opt/conda/lib/python3.10/site-packages (from langchain==0.0.123) (8.2.2)\n", "Requirement already satisfied: PyYAML>=5.4.1 in /opt/conda/lib/python3.10/site-packages (from langchain==0.0.123) (5.4.1)\n", "Requirement already satisfied: SQLAlchemy<2,>=1 in /opt/conda/lib/python3.10/site-packages (from langchain==0.0.123) (1.4.39)\n", "Requirement already satisfied: aiohttp<4.0.0,>=3.8.3 in /opt/conda/lib/python3.10/site-packages (from langchain==0.0.123) (3.8.4)\n", "Requirement already satisfied: pydantic<2,>=1 in /opt/conda/lib/python3.10/site-packages (from langchain==0.0.123) (1.10.7)\n", "Requirement already satisfied: requests<3,>=2 in /opt/conda/lib/python3.10/site-packages (from langchain==0.0.123) (2.28.2)\n", "Requirement already satisfied: dataclasses-json<0.6.0,>=0.5.7 in /opt/conda/lib/python3.10/site-packages (from langchain==0.0.123) (0.5.7)\n", "Requirement already satisfied: yarl<2.0,>=1.0 in 
/opt/conda/lib/python3.10/site-packages (from aiohttp<4.0.0,>=3.8.3->langchain==0.0.123) (1.8.2)\n", "Requirement already satisfied: frozenlist>=1.1.1 in /opt/conda/lib/python3.10/site-packages (from aiohttp<4.0.0,>=3.8.3->langchain==0.0.123) (1.3.3)\n", "Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /opt/conda/lib/python3.10/site-packages (from aiohttp<4.0.0,>=3.8.3->langchain==0.0.123) (4.0.2)\n", "Requirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.10/site-packages (from aiohttp<4.0.0,>=3.8.3->langchain==0.0.123) (6.0.4)\n", "Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp<4.0.0,>=3.8.3->langchain==0.0.123) (2.0.4)\n", "Requirement already satisfied: aiosignal>=1.1.2 in /opt/conda/lib/python3.10/site-packages (from aiohttp<4.0.0,>=3.8.3->langchain==0.0.123) (1.3.1)\n", "Requirement already satisfied: attrs>=17.3.0 in /opt/conda/lib/python3.10/site-packages (from aiohttp<4.0.0,>=3.8.3->langchain==0.0.123) (21.4.0)\n", "Requirement already satisfied: marshmallow-enum<2.0.0,>=1.5.1 in /opt/conda/lib/python3.10/site-packages (from dataclasses-json<0.6.0,>=0.5.7->langchain==0.0.123) (1.5.1)\n", "Requirement already satisfied: typing-inspect>=0.4.0 in /opt/conda/lib/python3.10/site-packages (from dataclasses-json<0.6.0,>=0.5.7->langchain==0.0.123) (0.8.0)\n", "Requirement already satisfied: marshmallow<4.0.0,>=3.3.0 in /opt/conda/lib/python3.10/site-packages (from dataclasses-json<0.6.0,>=0.5.7->langchain==0.0.123) (3.19.0)\n", "Requirement already satisfied: typing-extensions>=4.2.0 in /opt/conda/lib/python3.10/site-packages (from pydantic<2,>=1->langchain==0.0.123) (4.3.0)\n", "Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests<3,>=2->langchain==0.0.123) (3.3)\n", "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests<3,>=2->langchain==0.0.123) 
(2022.12.7)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests<3,>=2->langchain==0.0.123) (1.26.15)\n", "Requirement already satisfied: greenlet!=0.4.17 in /opt/conda/lib/python3.10/site-packages (from SQLAlchemy<2,>=1->langchain==0.0.123) (1.1.1)\n", "Requirement already satisfied: packaging>=17.0 in /opt/conda/lib/python3.10/site-packages (from marshmallow<4.0.0,>=3.3.0->dataclasses-json<0.6.0,>=0.5.7->langchain==0.0.123) (21.3)\n", "Requirement already satisfied: mypy-extensions>=0.3.0 in /opt/conda/lib/python3.10/site-packages (from typing-inspect>=0.4.0->dataclasses-json<0.6.0,>=0.5.7->langchain==0.0.123) (0.4.3)\n", "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.10/site-packages (from packaging>=17.0->marshmallow<4.0.0,>=3.3.0->dataclasses-json<0.6.0,>=0.5.7->langchain==0.0.123) (3.0.9)\n", "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: The directory '/root/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m" ] } ], "source": [ "!pip install --upgrade sagemaker --quiet\n", "!pip install ipywidgets==7.0.0 --quiet\n", "!pip install langchain==0.0.123\n", "!pip install faiss-cpu --quiet" ] }, { "cell_type": "code", "execution_count": 25, "id": "822a0889-6149-4216-9633-ff66d201181c", "metadata": { "tags": [] }, "outputs": [], "source": [ "import os\n", "import sys\n", "import glob\n", "import time\n", "import logging\n", "import pandas as pd\n", "import sagemaker, boto3, json\n", "from sagemaker.model import Model\n", "from sagemaker.session import Session\n", "from sagemaker.predictor import Predictor\n", "from sagemaker.utils import name_from_base\n", "from sagemaker import image_uris, model_uris, script_uris, hyperparameters" ] }, { "cell_type": "code", "execution_count": 19, "id": "fb8fbe4e-466c-4f9f-9262-8f0565561d99", "metadata": { "tags": [] }, "outputs": [], "source": [ "import json\n", "from typing import Any, Dict, List, Optional\n", "from langchain.embeddings import SagemakerEndpointEmbeddings\n", "from langchain.llms.sagemaker_endpoint import ContentHandlerBase" ] }, { "cell_type": "code", "execution_count": 39, "id": "d528ac87-8cfa-4a8f-9dac-c87ed7cdfaaf", "metadata": { "tags": [] }, "outputs": [], "source": [ "from langchain import PromptTemplate\n", "from langchain.chains import RetrievalQA\n", "from langchain.document_loaders import TextLoader\n", "from langchain.indexes import VectorstoreIndexCreator\n", "from langchain.text_splitter import CharacterTextSplitter\n", "from langchain.vectorstores import Chroma, AtlasDB, FAISS\n", "from langchain.document_loaders.csv_loader import CSVLoader\n", "from langchain.chains.question_answering import load_qa_chain" ] }, { "cell_type": "code", "execution_count": 60, "id": "9d2fa335-c7a2-46ec-a14b-3e898f899975", "metadata": { "tags": [] }, "outputs": [], "source": [ "# global constants\n", 
"APP_NAME = \"qa-w-rag\"\n", "RAW_DATA_S3_PREFIX = \"s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/\"\n", "DATA_DIR = \"data\"\n", "RAW_DATA_DIR = os.path.join(DATA_DIR, \"raw\")\n", "PROCESSED_DATA_DIR = os.path.join(DATA_DIR, \"processed\")\n", "VECTOR_DB_DIR = os.path.join(DATA_DIR, \"vectordb\")\n", "KB_FNAME = \"kb.csv\"\n", "KB_DATAFRAME_COL_INDEXES_TO_KEEP = [1]" ] }, { "cell_type": "code", "execution_count": 5, "id": "ba7b4853-6d22-4d64-b4ae-b48d015f9b60", "metadata": { "tags": [] }, "outputs": [], "source": [ "logger = logging.getLogger()\n", "logging.basicConfig(format='%(asctime)s,%(module)s,%(processName)s,%(levelname)s,%(message)s', level=logging.INFO, stream=sys.stderr)" ] }, { "cell_type": "code", "execution_count": 6, "id": "f15f5aa0-de08-42e2-b138-998ec1daefa6", "metadata": { "tags": [] }, "outputs": [], "source": [ "%store -r embedding_model_endpoint_name" ] }, { "cell_type": "code", "execution_count": 7, "id": "54fbde3d-cc6c-49ef-bf82-ecaf69034c69", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-04-13 01:07:18,525,1440690944,MainProcess,INFO,embedding_model_endpoint_name=qa-w-rag-huggingface-textembedding-gpt--2023-04-12-15-09-23-080\n" ] } ], "source": [ "logger.info(f\"embedding_model_endpoint_name={embedding_model_endpoint_name}\")" ] }, { "cell_type": "code", "execution_count": 61, "id": "25fc00c6-d5e3-4776-93a3-5d5ed6a3e729", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-04-13 02:02:40,868,1638410780,MainProcess,INFO,aws_role=arn:aws:iam::015469603702:role/SageMakerRepoRole, aws_region=us-east-1, bucket=sagemaker-us-east-1-015469603702\n" ] } ], "source": [ "sagemaker_session = Session()\n", "aws_role = sagemaker_session.get_caller_identity_arn()\n", "aws_region = boto3.Session().region_name\n", "sess = sagemaker.Session()\n", "model_version = \"*\"\n", "bucket = sagemaker_session.default_bucket()\n", 
"logger.info(f\"aws_role={aws_role}, aws_region={aws_region}, bucket={bucket}\")" ] }, { "cell_type": "code", "execution_count": 10, "id": "9c2333d7-0c40-4586-9e4e-1f317859aaa6", "metadata": { "tags": [] }, "outputs": [], "source": [ "# extend the SagemakerEndpointEmbeddings class from langchain to provide a custom embedding function\n", "class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):\n", " def embed_documents(\n", " self, texts: List[str], chunk_size: int = 5\n", " ) -> List[List[float]]:\n", " \"\"\"Compute doc embeddings using a SageMaker Inference Endpoint.\n", "\n", " Args:\n", " texts: The list of texts to embed.\n", " chunk_size: The chunk size defines how many input texts will\n", " be grouped together as request. If None, will use the\n", " chunk size specified by the class.\n", "\n", " Returns:\n", " List of embeddings, one for each text.\n", " \"\"\"\n", " results = []\n", " _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size\n", " \n", " for i in range(0, len(texts), _chunk_size):\n", " response = self._embedding_func(texts[i : i + _chunk_size])\n", " results.extend(response)\n", " return results\n", "\n", "# class for serializing/deserializing requests/responses to/from the embeddings model\n", "class ContentHandler(ContentHandlerBase):\n", " content_type = \"application/json\"\n", " accepts = \"application/json\"\n", " def transform_input(self, prompt: str, model_kwargs = {}) -> bytes:\n", " input_str = json.dumps({\"text_inputs\": prompt, **model_kwargs})\n", " return input_str.encode('utf-8') \n", " \n", " def transform_output(self, output: bytes) -> str:\n", " response_json = json.loads(output.read().decode(\"utf-8\"))\n", " embeddings = response_json[\"embedding\"]\n", " if len(embeddings) == 1:\n", " return embeddings[0]\n", " return embeddings\n", "\n", "# all set to create the objects for the ContentHandler and SagemakerEndpointEmbeddingsJumpStart classes\n", "content_handler = ContentHandler()\n", 
"\n", "# note the name of the LLM Sagemaker endpoint, this is the model that we would be using for generating the embeddings\n", "embeddings = SagemakerEndpointEmbeddingsJumpStart( \n", " endpoint_name=embedding_model_endpoint_name,\n", " region_name=aws_region, \n", " content_handler=content_handler\n", ")" ] }, { "cell_type": "markdown", "id": "fb2f4b53-23fe-4efd-9d5a-9211d7c2fe79", "metadata": {}, "source": [ "---\n", "## Download the data\n", "\n", "The data is stored in an S3 bucket, we download the files from there and read it in a `langchain` document list." ] }, { "cell_type": "code", "execution_count": 21, "id": "9dfefdb3-eab1-4f45-a7de-5907e09bba1e", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "download: s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv to data/raw/Amazon_SageMaker_FAQs.csv\n" ] } ], "source": [ "!aws s3 cp --recursive $RAW_DATA_S3_PREFIX $RAW_DATA_DIR" ] }, { "cell_type": "markdown", "id": "f7757e81-6b6e-4f6e-927c-97094112acff", "metadata": {}, "source": [ "There could be multiple files in the directory." 
] }, { "cell_type": "code", "execution_count": 23, "id": "7c5b7180-65e0-4476-86fa-0a52375eb705", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-04-13 01:32:31,565,4131182362,MainProcess,INFO,raw data files = ['data/raw/Amazon_SageMaker_FAQs.csv']\n" ] } ], "source": [ "files = glob.glob(os.path.join(RAW_DATA_DIR, \"*.csv\"))\n", "logger.info(f\"raw data files = {files}\")" ] }, { "cell_type": "code", "execution_count": 31, "id": "719cf7d7-ae1a-4a0f-bc64-f5d8162f8354", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-04-13 01:36:49,316,149979566,MainProcess,INFO,read data/raw/Amazon_SageMaker_FAQs.csv, dataframe shape=(154, 2)\n", "2023-04-13 01:36:49,317,149979566,MainProcess,INFO,shape of dataframe containing all raw data = (154, 2)\n" ] } ], "source": [ "df_kb_list = []\n", "for f in files:\n", " df_kb = pd.read_csv(f, header=None)\n", " logger.info(f\"read {f}, dataframe shape={df_kb.shape}\")\n", " df_kb_list.append(df_kb)\n", "df_kb = pd.concat(df_kb_list)\n", "logger.info(f\"shape of dataframe containing all raw data = {df_kb.shape}\")\n", " " ] }, { "cell_type": "code", "execution_count": 32, "id": "41d3fb33-3b40-4ff7-953c-dfd8fb9e9555", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>0</th>\n", "      <th>1</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>What is Amazon SageMaker?</td>\n", "      <td>Amazon SageMaker is a fully managed service to...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>In which Regions is Amazon SageMaker available...</td>\n", "      <td>For a list of the supported Amazon SageMaker A...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>What is the service availability of Amazon Sag...</td>\n", "      <td>Amazon SageMaker is designed for high availabi...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>How does Amazon SageMaker secure my code?</td>\n", "      <td>Amazon SageMaker stores code in ML storage vol...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>What security measures does Amazon SageMaker h...</td>\n", "      <td>Amazon SageMaker ensures that ML model artifac...</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " 0 \\\n", "0 What is Amazon SageMaker? \n", "1 In which Regions is Amazon SageMaker available... \n", "2 What is the service availability of Amazon Sag... \n", "3 How does Amazon SageMaker secure my code? \n", "4 What security measures does Amazon SageMaker h... \n", "\n", " 1 \n", "0 Amazon SageMaker is a fully managed service to... \n", "1 For a list of the supported Amazon SageMaker A... \n", "2 Amazon SageMaker is designed for high availabi... \n", "3 Amazon SageMaker stores code in ML storage vol... \n", "4 Amazon SageMaker ensures that ML model artifac... " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(df_kb.head())" ] }, { "cell_type": "code", "execution_count": 38, "id": "a398d574-b64e-4b34-a3e5-e8efffaf66ef", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
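<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>1</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>Amazon SageMaker is a fully managed service to...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>For a list of the supported Amazon SageMaker A...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>Amazon SageMaker is designed for high availabi...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>Amazon SageMaker stores code in ML storage vol...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>Amazon SageMaker ensures that ML model artifac...</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>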
" ], "text/plain": [ " 1\n", "0 Amazon SageMaker is a fully managed service to...\n", "1 For a list of the supported Amazon SageMaker A...\n", "2 Amazon SageMaker is designed for high availabi...\n", "3 Amazon SageMaker stores code in ML storage vol...\n", "4 Amazon SageMaker ensures that ML model artifac..." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_kb = df_kb[KB_DATAFRAME_COL_INDEXES_TO_KEEP]\n", "display(df_kb.head())" ] }, { "cell_type": "markdown", "id": "9016a278-9e64-43bd-867f-1adb87136bd0", "metadata": {}, "source": [ "---\n", "\n", "## Load the data in a `FAISS` index" ] }, { "cell_type": "code", "execution_count": 45, "id": "85840045-7ce3-4e9d-ac05-869ea083c383", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-04-13 01:44:05,371,4124703,MainProcess,INFO,writing processed data, dataframe shape = (154, 1) to data/processed/kb.csv\n" ] } ], "source": [ "os.makedirs(PROCESSED_DATA_DIR, exist_ok=True)\n", "processed_data_fpath = os.path.join(PROCESSED_DATA_DIR, KB_FNAME)\n", "logger.info(f\"writing processed data, dataframe shape = {df_kb.shape} to {processed_data_fpath}\")\n", "df_kb.to_csv(fpath, index=False)" ] }, { "cell_type": "markdown", "id": "be6880de-4d05-48f1-932f-4c7761e44e1c", "metadata": {}, "source": [ "Now load this data into `langchain` `documents`." 
] }, { "cell_type": "code", "execution_count": 46, "id": "3a22d301-3e8d-4d3a-84bf-be02017de083", "metadata": { "tags": [] }, "outputs": [], "source": [ "loader = CSVLoader(file_path=processed_data_fpath)" ] }, { "cell_type": "code", "execution_count": 47, "id": "c0dcb6c1-748f-4385-83b3-35ed6055201a", "metadata": { "tags": [] }, "outputs": [], "source": [ "documents = loader.load()" ] }, { "cell_type": "code", "execution_count": 50, "id": "10d15b21-6ead-4933-8cda-b1456b4ddf11", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-04-13 01:46:58,214,3674270965,MainProcess,INFO,---------------\n", "2023-04-13 01:46:58,216,3674270965,MainProcess,INFO,page_content='1: Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.' lookup_str='' metadata={'source': 'data/processed/kb.csv', 'row': 0} lookup_index=0\n", "2023-04-13 01:46:58,216,3674270965,MainProcess,INFO,---------------\n", "2023-04-13 01:46:58,217,3674270965,MainProcess,INFO,page_content='1: For a list of the supported Amazon SageMaker AWS Regions, please visit the\\xa0AWS Regional Services page. Also, for more information, see\\xa0Regional endpoints\\xa0in the AWS general reference guide.' lookup_str='' metadata={'source': 'data/processed/kb.csv', 'row': 1} lookup_index=0\n", "2023-04-13 01:46:58,218,3674270965,MainProcess,INFO,---------------\n", "2023-04-13 01:46:58,219,3674270965,MainProcess,INFO,page_content='1: Amazon SageMaker is designed for high availability. There are no maintenance windows or scheduled downtimes. SageMaker APIs run in Amazon’s proven, high-availability data centers, with service stack replication configured across three facilities in each AWS Region to provide fault tolerance in the event of a server failure or Availability Zone outage.' 
lookup_str='' metadata={'source': 'data/processed/kb.csv', 'row': 2} lookup_index=0\n", "2023-04-13 01:46:58,220,3674270965,MainProcess,INFO,---------------\n", "2023-04-13 01:46:58,220,3674270965,MainProcess,INFO,page_content='1: Amazon SageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted at rest.' lookup_str='' metadata={'source': 'data/processed/kb.csv', 'row': 3} lookup_index=0\n", "2023-04-13 01:46:58,221,3674270965,MainProcess,INFO,---------------\n", "2023-04-13 01:46:58,222,3674270965,MainProcess,INFO,page_content='1: Amazon SageMaker ensures that ML model artifacts and other system artifacts are encrypted in transit and at rest. Requests to the SageMaker API and console are made over a secure (SSL) connection. You pass\\xa0AWS Identity and Access Management roles\\xa0to SageMaker to provide permissions to access resources on your behalf for training and deployment. You can use encrypted Amazon Simple Storage Service (Amazon S3) buckets for model artifacts and data, as well as pass an AWS Key Management Service (KMS) key to SageMaker notebooks, training jobs, and endpoints, to encrypt the attached ML storage volume. Amazon SageMaker also supports Amazon Virtual Private Cloud (VPC) and AWS PrivateLink support.' lookup_str='' metadata={'source': 'data/processed/kb.csv', 'row': 4} lookup_index=0\n" ] } ], "source": [ "# just print first n documents\n", "n = 5\n", "for doc in documents[:n]:\n", " logger.info(f\"---------------\")\n", " logger.info(doc)" ] }, { "cell_type": "markdown", "id": "8e97a022-1eb8-414b-b115-c88bf79ffe54", "metadata": {}, "source": [ "Now, we can build the QA application. 
First, we **generate embeddings for each document in the knowledge base with the SageMaker GPT-J-6B embedding model.**" ] }, { "cell_type": "code", "execution_count": 51, "id": "1cd26511-2034-48ae-95fa-680a43a6891e", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-04-13 01:48:53,548,loader,MainProcess,INFO,Loading faiss with AVX2 support.\n", "2023-04-13 01:48:53,568,loader,MainProcess,INFO,Successfully loaded faiss with AVX2 support.\n" ] } ], "source": [ "vectordb = FAISS.from_documents(documents, embeddings)" ] }, { "cell_type": "markdown", "id": "85b244ff-a072-410b-807b-4f07995abc3f", "metadata": {}, "source": [ "Save the `FAISS` index locally so that it can be uploaded to S3." ] }, { "cell_type": "code", "execution_count": 53, "id": "d48c9781-cd27-4adc-aaba-a0a0c7710e08", "metadata": { "tags": [] }, "outputs": [], "source": [ "os.makedirs(VECTOR_DB_DIR, exist_ok=True)" ] }, { "cell_type": "code", "execution_count": 54, "id": "90bdd601-1467-42d7-8282-d878d8f6525f", "metadata": { "tags": [] }, "outputs": [], "source": [ "vectordb.save_local(VECTOR_DB_DIR)" ] }, { "cell_type": "markdown", "id": "98779c6a-09c1-4cac-b414-3d98257f08a5", "metadata": { "tags": [] }, "source": [ "---\n", "\n", "## Upload the Vector DB to S3" ] }, { "cell_type": "code", "execution_count": 63, "id": "286c5503-4ec0-41a6-8740-0fcb3b447156", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 2552\n", "-rw-r--r-- 1 sagemaker-studio root 2523181 Apr 13 01:58 index.faiss\n", "-rw-r--r-- 1 sagemaker-studio root 84216 Apr 13 01:58 index.pkl\n", "upload: data/vectordb/index.pkl to s3://sagemaker-us-east-1-015469603702/qa-w-rag/vectordb/index.pkl\n", "upload: data/vectordb/index.faiss to s3://sagemaker-us-east-1-015469603702/qa-w-rag/vectordb/index.faiss\n" ] } ], "source": [ "!ls -ltr $VECTOR_DB_DIR\n", "!aws s3 cp $VECTOR_DB_DIR s3://$bucket/$APP_NAME/vectordb --recursive" ] }, { 
"cell_type": "code", "execution_count": null, "id": "bdb81854-a039-4e3e-8b20-cbada32328a6", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": 
"General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5d.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": true, "memoryGiB": 0, "name": "ml.geospatial.interactive", "supportedImageNames": [ "sagemaker-geospatial-v1-0" ], "vcpuNum": 0 }, { "_defaultOrder": 21, 
"_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 144, "name": "ml.c5.18xlarge", "vcpuNum": 72 }, { "_defaultOrder": 28, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 29, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 32, 
"_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 34, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 35, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 38, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 42, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { 
"_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 46, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 47, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 53, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.g5.24xlarge", 
"vcpuNum": 96 }, { "_defaultOrder": 54, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 } ], "instance_type": "ml.m5.large", "kernelspec": { "display_name": "Python 3 (Data Science 3.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }