{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "10cb3098", "metadata": {}, "source": [ "# Lab 4 - Building the LLM-powered chatbot \"AWSomeChat\" with retrieval-augmented generation\n", "\n", "## Introduction\n", "\n", "Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x with models such as OpenAI’s 175 billion parameter GPT-3 and similarly sized open source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically-demonstrated positive relationship between model size and accuracy: more is better. With easy access from models zoos such as HuggingFace and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size. \n", "\n", "While these models are able to generalize quite well and hence are capable to perform a vast amount of different generic tasks without having specifically being trained on them, they also lack domain-specific, proprietary and recent knowledge. This however is required in most use cases across organisations. With fine-tuning, in Lab 2 we already learned an option to infuse domain-specific or proprietary knowledge into a Large Language Model. However, this option can become complex and costly, especially if carried out on a regular basis to ensure access to recent information. Luckily there are several design patterns wrapping Large Language Models into powerful applications able to overcome even these challenges with a leightweight footprint. \n", "\n", "In this Lab, we'll explore how to build GenAI-powered applications capable of performing tasks within a specific domain. The application we will be building in a step-by-step process leverages the retrieval-augmented generation (RAG) design pattern and consists of multiple components ranging out of the broad service portfolio of AWS. \n", "\n", "## Background and Details\n", "\n", "We have two primary [types of knowledge for LLMs](https://www.pinecone.io/learn/langchain-retrieval-augmentation/): \n", "- **Parametric knowledge**: refers to everything the LLM learned during training and acts as a frozen snapshot of the world for the LLM. \n", "- **Source knowledge**: covers any information fed into the LLM via the input prompt. \n", "\n", "When trying to infuse knowledge into a generative AI - powered application we need to choose which of these types to target. Lab 2 deals with elevating the parametric knowledge through fine-tuning. Since fine-tuning is a resouce intensive operation, this option is well suited for infusing static domain-specific information like domain-specific langauage/writing styles (medical domain, science domain, ...) or optimizing performance towards a very specific task (classification, sentiment analysis, RLHF, instruction-finetuning, ...). \n", "\n", "In contrast to that, targeting the source knowledge for domain-specific performance uplift is very well suited for all kinds of dynamic information, from knowledge bases in structured and unstructured form up to integration of information from live systems. 
This lab is about retrieval-augmented generation, a common design pattern for ingesting domain-specific information through the source knowledge. It is particularly well suited for ingesting information in the form of unstructured text with semi-frequent update cycles. \n", "\n", "The application we will build in this lab is an LLM-powered chatbot infused with domain-specific knowledge on AWS services. You will be able to chat through a chatbot frontend and receive information going beyond the parametric knowledge encoded in the model. \n", "\n", "\n", "### Retrieval-augmented generation (RAG)\n", "\n", "\n", "\n", "The design pattern of retrieval-augmented generation is depicted in the above figure. It works as follows:\n", "\n", "- Step 0: Knowledge documents / document sequences are encoded and ingested into a vector database. \n", "- Step 1: The user query (e.g. a customer e-mail) is pre-processed and/or tokenized\n", "- Step 2: The tokenized input query is encoded\n", "- Step 3: The encoded query is used to retrieve the most similar text passages in the document index using vector similarity search (e.g., Maximum Inner Product Search)\n", "- Step 4: The top-k retrieved documents/text passages, in combination with the original query and a generation prompt, are fed into the generator model (encoder-decoder) to generate the response\n", "\n", "### Architecture\n", "\n", "\n", "\n", "The above figure shows the architecture for the LLM-powered chatbot with a retrieval-augmented generation component that we will be implementing in this lab. It consists of the following components:\n", "- Document store & semantic search: We leverage the semantic document search service Amazon Kendra as a fully managed embedding/vector store as well as a fully managed solution for document retrieval based on questions asked in natural language.\n", "- Response generation: For the chatbot response generation, we use the open-source encoder-decoder model FLAN-T5-XXL, conveniently deployed in a one-click fashion through Amazon SageMaker JumpStart right into your VPC.\n", "- Orchestration layer: For hosting the orchestration layer, implemented with the popular framework LangChain, we choose a serverless approach. The orchestration layer is exposed as a RESTful API service via Amazon API Gateway.\n", "- Conversational memory: In order to keep track of different chatbot conversation turns while keeping the orchestration layer stateless, we integrate the chatbot's memory with Amazon DynamoDB as a storage component.\n", "- Frontend: The chatbot frontend is a web application hosted in a Docker container on Amazon ECS. For storing the container image we leverage Amazon ECR. The website is exposed through an Amazon Elastic Load Balancer. 
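To make these steps and components more tangible before we start building, below is a rough, minimal sketch of the retrieve-then-generate flow using plain boto3. It is not the implementation used in this lab (the lab's orchestration layer is built with LangChain and deployed serverlessly); the index id, the endpoint name, and the JumpStart FLAN-T5 request/response format are assumptions to be replaced with your own values.

```python
import json
import boto3

# Hypothetical identifiers -- the real values are created later in this lab.
KENDRA_INDEX_ID = "<your-kendra-index-id>"
SM_ENDPOINT_NAME = "<your-flan-t5-xxl-endpoint>"

kendra = boto3.client("kendra")
smr = boto3.client("sagemaker-runtime")

def answer(question: str, top_k: int = 3) -> str:
    # Steps 1-3: retrieve the most relevant passages from the Kendra index
    results = kendra.retrieve(IndexId=KENDRA_INDEX_ID, QueryText=question, PageSize=top_k)
    passages = [item["Content"] for item in results["ResultItems"]]

    # Step 4: combine the retrieved passages with the original query in a generation prompt
    prompt = (
        "Answer the question using only the context below.\n\n"
        + "\n\n".join(passages)
        + f"\n\nQuestion: {question}\nAnswer:"
    )

    # Generate the response with the FLAN-T5-XXL endpoint deployed via SageMaker JumpStart
    # (the payload format below is an assumption about the JumpStart text2text contract)
    response = smr.invoke_endpoint(
        EndpointName=SM_ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"text_inputs": prompt, "max_length": 256}),
    )
    return json.loads(response["Body"].read())["generated_texts"][0]
```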
\n", "\n", "## Instructions\n", "\n", "### Prerequisites\n", "\n", "#### To run this workshop...\n", "You need a computer with a web browser, preferably with the latest version of Chrome / FireFox.\n", "Sequentially read and follow the instructions described in AWS Hosted Event and Work Environment Set Up\n", "\n", "#### Recommended background\n", "It will be easier for you to run this workshop if you have:\n", "\n", "- Experience with Deep learning models\n", "- Familiarity with Python or other similar programming languages\n", "- Experience with Jupyter notebooks\n", "- Beginners level knowledge and experience with SageMaker Hosting/Inference.\n", "- Beginners level knowledge and experience with Large Language Models\n", "\n", "#### Target audience\n", "Data Scientists, ML Engineering, ML Infrastructure, MLOps Engineers, Technical Leaders.\n", "Intended for customers working with large Generative AI models including Language, Computer vision and Multi-modal use-cases.\n", "Customers using EKS/EC2/ECS/On-prem for hosting or experience with SageMaker.\n", "\n", "Level of expertise - 400\n", "\n", "#### Time to complete\n", "Approximately 1 hour." ] }, { "attachments": {}, "cell_type": "markdown", "id": "037297f8", "metadata": {}, "source": [ "# Switching the Notebook runtime to Python 3.10\n", "\n", "For this lab we need to run the notebook based on a Python 3.10 runtime. Therefor proceed as follows and select the \"PyTorch 2.0.0 Python 3.10 CPU Optimized\" option:\n", "\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1d33318d", "metadata": {}, "source": [ "# Adjustment of Amazon SageMaker execution role policies\n", "\n", "In order to execute build Docker images and deploy infrastructure resources through AWS Serverless Application Model from SageMaker Studio , we need to adjust the SageMaker Execution Role. This is the IAM Role granting SageMaker Studio the permissions to perform actions against the AWS API. \n", "\n", "To adjust accordingly copy the following JSON:\n", "\n", "```json\n", "{\n", " \"Version\": \"2012-10-17\",\n", " \"Statement\": [\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Action\": \"*\",\n", " \"Resource\": \"*\"\n", " },\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Action\": \"iam:PassRole\",\n", " \"Resource\": \"arn:aws:iam::*:role/*\",\n", " \"Condition\": {\n", " \"StringLikeIfExists\": {\n", " \"iam:PassedToService\": \"codebuild.amazonaws.com\"\n", " }\n", " }\n", " }\n", " ]\n", "}\n", "```\n", "\n", "We then change the Execution Role's permissions by adding the above JSON as inline policy as follows:\n", "\n", "\n", "\n", "\n", "**CAUTION: The first statement of this policy provides very extensive permissions to this role. Outside of a lab environmentit is recommended to stick to the principle of least privilidges, requiring to restrict the permissions granted to individuals and roles the minimum required! Also, please execute DevOps related tasks like the above mentioned by leveraging our DevOps service portfolio (CodeBuild, CodeDeploy, CodePipeline, ...)!**" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a0dc305c", "metadata": {}, "source": [ "# Adjustment of Amazon SageMaker execution role trust relationships\n", "\n", "To be able to build Docker images from SageMaker Studio we need to establish a trust relation ship between the service and the Amazon CodeBuild service (this is where the Docker build will be executed). 
Therefore, we adjust the role's trust relationship accordingly by copying this JSON...\n", "\n", "```json\n", "{\n", " \"Version\": \"2012-10-17\",\n", " \"Statement\": [\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\n", " \"Service\": \"sagemaker.amazonaws.com\"\n", " },\n", " \"Action\": \"sts:AssumeRole\"\n", " },\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\n", " \"Service\": \"codebuild.amazonaws.com\"\n", " },\n", " \"Action\": \"sts:AssumeRole\"\n", " }\n", " ]\n", "}\n", "```\n", "\n", "...and subsequently performing the following steps:\n", "\n", "\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6c62b6bb", "metadata": {}, "source": [ "# Installation and import of required dependencies, further setup tasks\n", "\n", "For this lab, we will use the following libraries:\n", "\n", " - sagemaker-studio-image-build, a CLI for building and pushing Docker images in SageMaker Studio using AWS CodeBuild and Amazon ECR.\n", " - aws-sam-cli, an open-source CLI tool that helps you develop and deploy IaC-defined serverless applications on AWS.\n", " - SageMaker SDK for interacting with Amazon SageMaker. We especially want to highlight the classes 'HuggingFaceModel' and 'HuggingFacePredictor', utilizing the built-in HuggingFace integration of the SageMaker SDK. These classes are used to encapsulate functionality around the model and the deployed endpoint we will use. They inherit from the generic 'Model' and 'Predictor' classes of the native SageMaker SDK, while implementing some additional functionality specific to HuggingFace and the HuggingFace model hub.\n", " - boto3, the AWS SDK for Python\n", " - requests, a Python library for making HTTP requests\n", " - os, a Python library implementing miscellaneous operating system interfaces \n", " - tarfile, a Python library to read and write tar archive files\n", " - io, a native Python library providing Python’s main facilities for dealing with various types of I/O.\n", " - tqdm, a utility to easily show a smart progress meter for synchronous operations." ] }, { "cell_type": "code", "execution_count": null, "id": "fa2eb936", "metadata": {}, "outputs": [], "source": [ "!pip install sagemaker==2.163.0 --upgrade" ] }, { "cell_type": "code", "execution_count": null, "id": "2f3686b1", "metadata": {}, "outputs": [], "source": [ "!pip install sagemaker-studio-image-build" ] }, { "cell_type": "code", "execution_count": null, "id": "1418ebd0-8d97-4acd-8443-ddf40604ae78", "metadata": { "tags": [] }, "outputs": [], "source": [ "%pip install -q sagemaker-studio-image-build aws-sam-cli" ] }, { "cell_type": "code", "execution_count": null, "id": "a8616f53", "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "import boto3\n", "import os\n", "import tarfile\n", "import requests\n", "from io import BytesIO\n", "from tqdm import tqdm\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "4a59d15f", "metadata": {}, "source": [ "# Setup of notebook environment\n", "\n", "Before we begin with the actual work, we need to set up the notebook environment. 
This includes:\n", "\n", "- retrieval of the execution role our SageMaker Studio domain is associated with for later usage\n", "- retrieval of our account_id for later usage\n", "- retrieval of the chosen region for later usage" ] }, { "cell_type": "code", "execution_count": null, "id": "50c34e0d", "metadata": {}, "outputs": [], "source": [ "# Retrieve SM execution role\n", "role = sagemaker.get_execution_role()" ] }, { "cell_type": "code", "execution_count": null, "id": "d281cb99", "metadata": {}, "outputs": [], "source": [ "# Create a new STS client\n", "sts_client = boto3.client('sts')\n", "\n", "# Call the GetCallerIdentity operation to retrieve the account ID\n", "response = sts_client.get_caller_identity()\n", "account_id = response['Account']\n", "account_id" ] }, { "cell_type": "code", "execution_count": null, "id": "0aae8da6", "metadata": {}, "outputs": [], "source": [ "# Retrieve region\n", "region = boto3.Session().region_name\n", "region" ] }, { "attachments": {}, "cell_type": "markdown", "id": "2dbcfb82", "metadata": {}, "source": [ "# Setup of S3 bucket for Amazon Kendra storage of knowledge documents\n", "\n", "Amazon Kendra provides multiple built-in adapters for integrating with data sources to build up a document index, e.g. S3, web-scraper, RDS, Box, Dropbox, ... . In this lab we will store the documents containing the knowledge to be infused into the application in S3. For this purpose, (if not already present) we create a dedicated S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "id": "db749bee", "metadata": {}, "outputs": [], "source": [ "# specifying bucket name for model artifact storage\n", "prefix = 'gen-ai-immersion-day-kendra-storage'\n", "model_bucket_name = f'{prefix}-{account_id}-{region}'\n", "model_bucket_name" ] }, { "cell_type": "code", "execution_count": null, "id": "4e8af01c", "metadata": {}, "outputs": [], "source": [ "# Create S3 bucket\n", "s3_client = boto3.client('s3', region_name=region)\n", "location = {'LocationConstraint': region}\n", "\n", "bucket_name = model_bucket_name\n", "\n", "# Check if bucket already exists\n", "bucket_exists = True\n", "try:\n", " s3_client.head_bucket(Bucket=bucket_name)\n", "except:\n", " bucket_exists = False\n", "\n", "# Create bucket if it does not exist\n", "if not bucket_exists:\n", " if region == 'us-east-1':\n", " s3_client.create_bucket(Bucket=bucket_name)\n", " else: \n", " s3_client.create_bucket(Bucket=bucket_name,\n", " CreateBucketConfiguration=location)\n", " print(f\"Bucket '{bucket_name}' created successfully\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "d6bd38df", "metadata": {}, "source": [ "# Frontend\n", "\n", "All relevant components for building a dockerized frontend application can be found in the \"fe\" directory. It consists of the following files: \n", "- ```app.py```: actual frontend utilizing the popular streamlit framework\n", "- ```Dockerfile```: Dockerfile providing the blueprint for the creation of a Docker image\n", "- ```requirements.txt```: specifying the dependencies required to be installed for hosting the frontend application\n", "- ```setup.sh```: setup script consisting all the necessary steps to create a ECR repository, build the Docker image and push it to the respective repository we created\n", "\n", "## Streamlit \n", "\n", "[Streamlit](https://streamlit.io/) is an open-source Python library that makes it easy to create and share beautiful, custom web apps for machine learning and data science. 
In just a few minutes you can build and deploy powerful data apps. It is a very popular frontend framework for rapid prototyping in the AI/ML space, since easy-to-use web pages can be built in minutes with nothing more than Python skills.\n", "\n", "## UI\n", "\n", "The chatbot frontend web application \"AWSomeChat\" looks as follows:\n", "\n", "\n", "\n", "To chat with the chatbot, enter a message into the light grey input box and press ENTER. The chat conversation will appear below.\n", "\n", "At the top of the page you can see the session id assigned to your chat conversation. It is used to map conversation histories to a specific user, since the chatbot backend is stateless. To start a new conversation, press the \"Clear Chat\" and \"Reset Session\" buttons on the top right of the page.\n", "\n", "\n", "## Dockerization and hosting\n", "\n", "In order to prepare our frontend application to be hosted as a Docker container, we execute the bash script setup.sh. It looks as follows: \n", "\n", "```bash \n", "#!/bin/bash\n", "\n", "# Get the AWS account ID\n", "aws_account_id=$(aws sts get-caller-identity --query Account --output text)\n", "aws_region=$(aws configure get region)\n", "\n", "echo \"AccountId = ${aws_account_id}\"\n", "echo \"Region = ${aws_region}\"\n", "\n", "\n", "# Create a new ECR repository\n", "echo \"Creating ECR Repository...\"\n", "aws ecr create-repository --repository-name rag-app\n", "\n", "# Get the login command for the new repository\n", "echo \"Logging into the repository...\"\n", "#$(aws ecr get-login --no-include-email)\n", "aws ecr get-login-password --region ${aws_region} | docker login --username AWS --password-stdin ${aws_account_id}.dkr.ecr.${aws_region}.amazonaws.com\n", "\n", "# Build and push the Docker image and tag it\n", "echo \"Building and pushing Docker image...\"\n", "sm-docker build -t \"${aws_account_id}.dkr.ecr.${aws_region}.amazonaws.com/rag-app:latest\" --repository rag-app:latest .\n", "```\n", "\n", "The script performs the following steps in a sequential manner:\n", "\n", "1. Retrieve the AWS account id and region.\n", "2. Create a new ECR repository with the name rag-app. Note: this operation will fail if the repository already exists within your account. This is intended behaviour and can be ignored.\n", "3. Log in to the respective ECR repository. \n", "4. Build the Docker image and tag it with the \"latest\" tag using the sagemaker-studio-image-build package we previously installed. The \"sm-docker build\" command will push the built image into the specified repository automatically. All compute will be carried out in AWS CodeBuild." ] }, { "cell_type": "code", "execution_count": null, "id": "e7402e76-b8f1-4833-ba36-2dfd40e1e8ba", "metadata": { "tags": [] }, "outputs": [], "source": [ "!cd fe && bash setup.sh" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1c923145", "metadata": {}, "source": [ "The frontend Docker image now resides in the respective ECR repository and can be used for our deployment later in the lab." 
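For orientation, a heavily stripped-down sketch of what a Streamlit chat frontend along the lines of ```app.py``` could look like is shown below. The API URL, the request/response fields and the session handling are illustrative assumptions, not the actual implementation shipped in the fe directory.

```python
import uuid
import requests
import streamlit as st

# Hypothetical backend URL -- in the lab this would be the API Gateway endpoint
# of the orchestration layer, injected via configuration.
API_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/Prod/chat"

st.title("AWSomeChat")

# Keep a session id so the stateless backend can look up this user's
# conversation history (e.g. in DynamoDB).
if "session_id" not in st.session_state:
    st.session_state.session_id = str(uuid.uuid4())
    st.session_state.history = []

st.caption(f"Session: {st.session_state.session_id}")

user_input = st.text_input("Your message")
if user_input:
    payload = {"query": user_input, "session_id": st.session_state.session_id}
    answer = requests.post(API_URL, json=payload, timeout=60).json().get("answer", "")
    st.session_state.history.append((user_input, answer))

for question, response in st.session_state.history:
    st.write(f"**You:** {question}")
    st.write(f"**AWSomeChat:** {response}")
```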
] }, { "attachments": {}, "cell_type": "markdown", "id": "39ee162b-9247-47c2-8aed-207df3aa94b5", "metadata": {}, "source": [ "# Document store and retriever component\n", "\n", "## Amazon Kendra\n", "\n", "[Amazon Kendra](https://docs.aws.amazon.com/kendra/latest/dg/what-is-kendra.html) is an intelligent search service that uses natural language processing and advanced machine learning algorithms to return specific answers to search questions from your data.\n", "\n", "Unlike traditional keyword-based search, Amazon Kendra uses its semantic and contextual understanding capabilities to decide whether a document is relevant to a search query. It returns specific answers to questions, giving users an experience that's close to interacting with a human expert.\n", "\n", "With Amazon Kendra, you can create a unified search experience by connecting multiple data repositories to an index and ingesting and crawling documents. You can use your document metadata to create a feature-rich and customized search experience for your users, helping them efficiently find the right answers to their queries.\n", "\n", "### Querying with Amazon Kendra\n", "You can ask Amazon Kendra the following types of queries:\n", "\n", "Factoid questions — Simple who, what, when, or where questions, such as Where is the nearest service center to Seattle? Factoid questions have fact-based answers that can be returned as a single word or phrase. The answer is retrieved from a FAQ or from your indexed documents.\n", "\n", "Descriptive questions — Questions where the answer could be a sentence, passage, or an entire document. For example, How do I connect my Echo Plus to my network? Or, How do I get tax benefits for lower income families? \n", "\n", "Keyword and natural language questions — Questions that include complex, conversational content where the intended meaning may not be clear. For example, keynote address. When Amazon Kendra encounters a word like “address”—which has multiple contextual meanings—it correctly infers the intended meaning behind the search query and returns relevant information.\n", "\n", "## Benefits of Amazon Kendra\n", "Amazon Kendra is highly scalable, capable of meeting performance demands, is tightly integrated with other AWS services such as Amazon S3 and Amazon Lex, and offers enterprise-grade security. Some of the benefits of using Amazon Kendra include:\n", "\n", "Simplicity — Amazon Kendra provides a console and API for managing the documents that you want to search. You can use a simple search API to integrate Amazon Kendra into your client applications, such as websites or mobile applications.\n", "\n", "Connectivity — Amazon Kendra can connect to third-party data repositories or data sources such as Microsoft SharePoint. You can easily index and search your documents using your data source.\n", "\n", "Accuracy — Unlike traditional search services that use keyword searches, Amazon Kendra attempts to understand the context of the question and returns the most relevant word, snippet, or document for your query. Amazon Kendra uses machine learning to improve search results over time.\n", "\n", "Security — Amazon Kendra delivers a highly secure enterprise search experience. Your search results reflect the security model of your organization and can be filtered based on the user or group access to documents. Customers are responsible for authenticating and authorizing user access.\n", "\n", "Incremental Learning - Amazon Kendra uses incremental learning to improve search results. 
Using feedback from queries, incremental learning improves the ranking algorithms and optimizes search results for greater accuracy.\n", "\n", "For example, suppose that your users search for the phrase \"health care benefits.\" If users consistently choose the second result from the list, over time Amazon Kendra boosts that result to the first place result. The boost decreases over time, so if users stop selecting a result, Amazon Kendra eventually removes it and shows another, more popular result instead. This helps Amazon Kendra prioritize results based on relevance, age, and content.\n", "\n", "## Creation of a Kendra index\n", "To use Kendra for retrieval-augmented generation we first need to create a Kendra index. This index will hold all the knowledge we want to infuse into our LLM-powered chatbot application. To create a Kendra index, follow these steps:\n", "\n", "\n", "\n", "## Using Kendra with LLMs \n", "Now that we have an understanding of the basics of Kendra, we want to use it for indexing our documents. Kendra indexing allows us to query our enterprise knowledge base without having to worry about handling different data types (PDF, XML) in S3, connectors to SaaS applications, or web pages. \n", "\n", "We will use Kendra as outlined in the overall architecture introduced at the beginning of this lab: the knowledge documents are ingested into the Kendra index, and at query time the most relevant passages are retrieved and combined with the user query before being passed to the LLM. The sketches below illustrate what the ingestion and retrieval sides of this flow could look like in code.
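As a first sketch, registering the S3 bucket we created earlier as a data source of the index and triggering a sync could look roughly as follows. This assumes the index already exists and that an IAM role allowing Kendra to read from the bucket is available; the role ARN and data source name are placeholders, and ```model_bucket_name``` is the variable defined in the S3 setup above.

```python
import boto3

kendra = boto3.client("kendra")

# Hypothetical identifiers -- replace with the index created above and an IAM role
# that allows Amazon Kendra to read objects from the S3 bucket.
KENDRA_INDEX_ID = "<your-kendra-index-id>"
KENDRA_S3_ROLE_ARN = "arn:aws:iam::<account-id>:role/<kendra-s3-access-role>"

# Register the S3 bucket holding the knowledge documents as a data source of the index
data_source = kendra.create_data_source(
    Name="awsome-chat-knowledge-docs",
    IndexId=KENDRA_INDEX_ID,
    Type="S3",
    Configuration={"S3Configuration": {"BucketName": model_bucket_name}},
    RoleArn=KENDRA_S3_ROLE_ARN,
)

# Start a sync job so the documents in the bucket are crawled and indexed
kendra.start_data_source_sync_job(Id=data_source["Id"], IndexId=KENDRA_INDEX_ID)
```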
\n",
" \n",
"
\n",
" \n",
"
\n",
" \n",
"
\n",
" \n",
"
\n",
" \n",
"