{ "cells": [ { "cell_type": "markdown", "id": "3ef06ca1-2057-4f38-afd8-90fbb479fa02", "metadata": {}, "source": [ "# Deploy open-source Large Language Models on Amazon SageMaker" ] }, { "cell_type": "markdown", "id": "e09d3f23-3262-42ea-bd7a-76f109266d00", "metadata": {}, "source": [ "In this notebook, we will show you how to deploy the open-source LLMs from HuggingFace on Amazon SageMaker. The notebook contains five sections:\n", "- Section 1: Deploy Text-to-Text LLM on SageMaker\n", "- Secton 2: Deploy image-to-text LLM on SageMaker (Multimodality)\n", "- Section 3: Deploy embedding model and implement RAG solution using LangChain on SageMaker (using [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) embedding model)\n", "- Section 4: (Optional) Run SageMaker Inference Recommender job to determine the cost and performance of the LLM\n", "- Section 5: (Optional) Deploy speech-to-text LLM on SageMaker (note that you can also use Amazon Trascribe to performe the same function)\n", "- Section 6: (Optional) Deploy Llama 2 on SageMaker\n", "\n", "***\n", "This notebooks is designed to run on `Python 3 Data Science 3.0` kernel in Amazon SageMaker Studio\n", "***" ] }, { "cell_type": "markdown", "id": "2a5b730c-dc87-45e2-bc41-482b05b8c2c0", "metadata": {}, "source": [ "#### 1. Setup development environment\n", "\n", "We are going to use the `sagemaker` python SDK to deploy BLOOM to Amazon SageMaker. We need to make sure to have an AWS account configured and the `sagemaker` python SDK installed. " ] }, { "cell_type": "code", "execution_count": null, "id": "bf395449-857b-4305-94b0-a1379646a743", "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "!pip install \"sagemaker==2.163.0\" boto3 --upgrade --quiet\n", "!pip install ipywidgets==7.0.0 --quiet\n", "!pip install langchain==0.0.148 --quiet\n", "!pip install faiss-cpu --quiet\n", "!pip install unstructured --quiet" ] }, { "cell_type": "markdown", "id": "2fc5dd15-9d1b-4020-8b06-159530972ca6", "metadata": {}, "source": [ "## Secton 1: (Text-to-Text) Deploy Falcon model to Amazon SageMaker\n", "In this section, we will deploy the open-source [Falcon 7b instruct model](https://huggingface.co/tiiuae/falcon-7b-instruct) on SageMaker for real-time inference. \n", "To deploy [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) to Amazon SageMaker we create a `HuggingFaceModel` model class and define our endpoint configuration including the `hf_model_id`, `instance_type` etc. We will use a `g5.2xlarge` instance type.\n" ] }, { "cell_type": "markdown", "id": "11498376-bf21-457e-b5de-aaad86309ee1", "metadata": {}, "source": [ "This is an example on how to deploy the open-source LLMs, like [BLOOM](bigscience/bloom) to Amazon SageMaker for inference using the new Hugging Face LLM Inference Container. We will deploy the 40B-Instruct [Falcon](https://huggingface.co/tiiuae/falcon-40b-instruct) an open-source Chat LLM trained by TII.\n", "\n", "The example covers:\n", "1. [Setup development environment](#1-setup-development-environment)\n", "2. [Retrieve the new Hugging Face LLM DLC](#2-retrieve-the-new-hugging-face-llm-dlc)\n", "3. [Deploy Falcon to Amazon SageMaker](#3-deploy-open-assistant-12b-to-amazon-sagemaker)\n", "4. [Run inference and chat with our model](#4-run-inference-and-chat-with-our-model)\n", "\n", "## What is Hugging Face LLM Inference DLC?\n", "\n", "Hugging Face LLM DLC is a new purpose-built Inference Container to easily deploy LLMs in a secure and managed environment. 
The DLC is powered by [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference), an open-source, purpose-built solution for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. \n", "Text Generation Inference is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative. It implements optimizations for all supported model architectures, including:\n", "* Tensor Parallelism and custom CUDA kernels\n", "* Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) on the most popular architectures\n", "* Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)\n", "* [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput\n", "* Accelerated weight loading (start-up time) with [safetensors](https://github.com/huggingface/safetensors)\n", "* Logits warpers (temperature scaling, top-k, repetition penalty ...)\n", "* Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)\n", "* Stop sequences, log probabilities\n", "* Token streaming using Server-Sent Events (SSE)\n", "\n", "Officially supported model architectures are currently:\n", "* [BLOOM](https://huggingface.co/bigscience/bloom) / [BLOOMZ](https://huggingface.co/bigscience/bloomz)\n", "* [MT0-XXL](https://huggingface.co/bigscience/mt0-xxl)\n", "* [Galactica](https://huggingface.co/facebook/galactica-120b)\n", "* [SantaCoder](https://huggingface.co/bigcode/santacoder)\n", "* [GPT-NeoX 20B](https://huggingface.co/EleutherAI/gpt-neox-20b) (joi, pythia, lotus, rosey, chip, RedPajama, open assistant)\n", "* [FLAN-T5-XXL](https://huggingface.co/google/flan-t5-xxl) (T5-11B)\n", "* [Llama](https://github.com/facebookresearch/llama) (vicuna, alpaca, koala)\n", "* [Starcoder](https://huggingface.co/bigcode/starcoder) / [SantaCoder](https://huggingface.co/bigcode/santacoder)\n", "* [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) / [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b)\n", "\n", "With the new Hugging Face LLM Inference DLCs on Amazon SageMaker, AWS customers can benefit from the same technologies that power highly concurrent, low-latency LLM experiences like [HuggingChat](https://hf.co/chat), [OpenAssistant](https://open-assistant.io/), and the Inference API for LLM models on the Hugging Face Hub. 
\n", "\n", "Lets get started!\n" ] }, { "cell_type": "code", "execution_count": null, "id": "20547cd8-c7c1-4e23-a2f0-da24c5c0ac33", "metadata": { "tags": [] }, "outputs": [], "source": [ "import sagemaker\n", "import boto3\n", "import json\n", "import time\n", "sess = sagemaker.Session()\n", "# sagemaker session bucket -> used for uploading data, models and logs\n", "# sagemaker will automatically create this bucket if it not exists\n", "\n", "try:\n", " role = sagemaker.get_execution_role()\n", "except ValueError:\n", " iam = boto3.client('iam')\n", " role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n", "\n", "bucket = sess.default_bucket() \n", "aws_region = boto3.Session().region_name\n", "print(f\"sagemaker role arn: {role}\")\n", "print(f\"sagemaker session region: {sess.boto_region_name}\")" ] }, { "cell_type": "markdown", "id": "dafbbb2f-d974-464b-8251-e940cab1958b", "metadata": { "tags": [] }, "source": [ "#### Retrieve the new Hugging Face LLM DLC\n", "\n", "Compared to deploying regular Hugging Face models we first need to retrieve the container uri and provide it to our `HuggingFaceModel` model class with a `image_uri` pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the `get_huggingface_llm_image_uri` method provided by the `sagemaker` SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified `backend`, `session`, `region`, and `version`. You can find the available versions [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-text-generation-inference-containers)" ] }, { "cell_type": "code", "execution_count": null, "id": "693d0083-685f-4088-8e68-9cadbe385fbf", "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker.huggingface import get_huggingface_llm_image_uri\n", "\n", "# retrieve the llm image uri\n", "llm_image = get_huggingface_llm_image_uri(\n", " \"huggingface\",\n", " version=\"0.8.2\"\n", ")\n", "\n", "# print ecr image uri\n", "print(f\"llm image uri: {llm_image}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "0b4581d4-d256-4b90-80ad-1c1da3048b0c", "metadata": { "tags": [] }, "outputs": [], "source": [ "import json\n", "from sagemaker.huggingface import HuggingFaceModel\n", "\n", "# Define Model and Endpoint configuration parameter\n", "hf_model_id = \"tiiuae/falcon-7b-instruct\" # model id from huggingface.co/models\n", "instance_type = \"ml.g5.2xlarge\" # instance type to use for deployment\n", "number_of_gpu = 1 # number of gpus to use for inference and tensor parallelism\n", "health_check_timeout = 600 # Increase the timeout for the health check to 5 minutes for downloading the model\n", "falcon_model_name = hf_model_id.split(\"/\")[-1].replace(\".\", \"-\")\n", "# create HuggingFaceModel with the image uri\n", "llm_model = HuggingFaceModel(\n", " role=role,\n", " name=falcon_model_name,\n", " image_uri=llm_image,\n", " env={\n", " 'HF_MODEL_ID': hf_model_id,\n", " # 'HF_MODEL_QUANTIZE': \"bitsandbytes\", # comment in to quantize\n", " 'SM_NUM_GPUS': json.dumps(number_of_gpu),\n", " 'MAX_INPUT_LENGTH': json.dumps(1900), # Max length of input text\n", " 'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)\n", " } \n", ")" ] }, { "cell_type": "markdown", "id": "9bed661a-2f38-4c7c-a12e-001e5196309f", "metadata": {}, "source": [ "After we have created the `HuggingFaceModel` we can deploy it to Amazon SageMaker using the `deploy` 
method. We will deploy the model with the `ml.g5.2xlarge` instance type. " ] }, { "cell_type": "code", "execution_count": null, "id": "cf6b36cd-e86d-42fe-ae8c-cb5a7e655bec", "metadata": { "tags": [] }, "outputs": [], "source": [ "\n", "falcon_endpoint_name = falcon_model_name + \"-2xl\"\n", "falcon_endpoint_name" ] }, { "cell_type": "code", "execution_count": null, "id": "5ddfdad7-0384-43d0-bf25-7767598c0cb4", "metadata": { "tags": [] }, "outputs": [], "source": [ "llm = llm_model.deploy(\n", " initial_instance_count=1,\n", " instance_type=instance_type,\n", " container_startup_health_check_timeout=health_check_timeout,\n", " endpoint_name=falcon_endpoint_name,\n", " wait=False\n", ")" ] }, { "cell_type": "markdown", "id": "3f4ac097-7cbc-4722-a934-3f4c661a3d85", "metadata": {}, "source": [ "SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes. " ] }, { "cell_type": "markdown", "id": "4ff0131f-db28-41bb-a8b5-6f6d364e0a4f", "metadata": {}, "source": [ "## Run inference and chat with our model\n", "\n", "After our endpoint is deployed, we can run inference on it. We will use the `predict` method from the `predictor` to run inference on our endpoint. We can run inference with different parameters to influence the generation. Parameters can be defined in the `parameters` attribute of the payload. As of today, TGI supports the following parameters:\n", "* `temperature`: Controls randomness in the model. Lower values will make the model more deterministic and higher values will make the model more random. Default value is 1.0.\n", "* `max_new_tokens`: The maximum number of tokens to generate. Default value is 20, max value is 512.\n", "* `repetition_penalty`: Controls the likelihood of repetition. Default value is `null`.\n", "* `seed`: The seed to use for random generation. Default value is `null`.\n", "* `stop`: A list of tokens to stop the generation. The generation will stop when one of the tokens is generated.\n", "* `top_k`: The number of highest probability vocabulary tokens to keep for top-k filtering. Default value is `null`, which disables top-k filtering.\n", "* `top_p`: The cumulative probability of the highest probability vocabulary tokens to keep for nucleus sampling. Default value is `null`.\n", "* `do_sample`: Whether or not to use sampling; uses greedy decoding otherwise. Default value is `false`.\n", "* `best_of`: Generate `best_of` sequences and return the one with the highest token log probabilities. Default value is `null`.\n", "* `details`: Whether or not to return details about the generation. Default value is `false`.\n", "* `return_full_text`: Whether or not to return the full text or only the generated part. Default value is `false`.\n", "* `truncate`: Whether or not to truncate the input to the maximum length of the model. Default value is `true`.\n", "* `typical_p`: The typical probability of a token. Default value is `null`.\n", "* `watermark`: Whether or not to apply a watermark to the generation. Default value is `false`.\n", "\n", "You can find the OpenAPI specification of the TGI in the [swagger documentation](https://huggingface.github.io/text-generation-inference/). 
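" ] }, { "cell_type": "markdown", "id": "a1f2e3d4", "metadata": {}, "source": [ "As a hedged illustration of these parameters (the prompt and parameter values below are made up for this example, not taken from the original notebook), a request combining several of them could look like the following cell. Uncomment and run it only once the endpoint is in service." ] }, { "cell_type": "code", "execution_count": null, "id": "a1f2e3d5", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Sketch: invoke the Falcon endpoint with explicit TGI generation parameters\n", "# payload = {\n", "#     \"inputs\": \"Write a short poem about Amazon SageMaker.\",\n", "#     \"parameters\": {\n", "#         \"do_sample\": True,\n", "#         \"top_p\": 0.9,\n", "#         \"temperature\": 0.8,\n", "#         \"max_new_tokens\": 128,\n", "#         \"stop\": [\"<|endoftext|>\"]\n", "#     }\n", "# }\n", "# response = llm.predict(payload)\n", "# print(response[0][\"generated_text\"])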
" ] }, { "cell_type": "markdown", "id": "d5365156-1ea2-4e57-95fd-27e1b9c46865", "metadata": {}, "source": [ "**Please finish all the model deployment before coming back to the below cell to test the model output**" ] }, { "cell_type": "code", "execution_count": null, "id": "dda84f44-7304-4db0-a113-5603be78a373", "metadata": { "tags": [] }, "outputs": [], "source": [ "# chat = llm.predict({\n", "# \"inputs\": \"\"\"Hello, how are you?\"\"\"\n", "# })\n", "\n", "# print(chat[0][\"generated_text\"])" ] }, { "cell_type": "markdown", "id": "d6412cf8-7f98-4b7f-a074-5a93e3af2efb", "metadata": {}, "source": [ "To see more test cases, you can find an example notebook [here at the SageMaker examples gitrepo.](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart-foundation-models/text-generation-falcon.ipynb)" ] }, { "cell_type": "markdown", "id": "e5d2b2c6-5663-4ec9-ab6b-8370d8d19cf4", "metadata": {}, "source": [ "## Secton 2: (image-to-text) Deploy Blip2 on SageMaker - Multimodality\n", "In this section, we will show you how to deploy the [blip2 model](https://huggingface.co/Salesforce/blip2-opt-6.7b) on SageMaker." ] }, { "cell_type": "markdown", "id": "093da59a-d29b-445f-8465-dcc59448e7eb", "metadata": {}, "source": [ "#### Setup\n", "We get DLC image URL for djl-deepspeed 0.21.0 and set SageMaker settings" ] }, { "cell_type": "code", "execution_count": null, "id": "09b5b49e-6af6-40fd-a6eb-82b734f4aed1", "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker import image_uris\n", "import time\n", "\n", "session = sagemaker.session.Session() # sagemaker session for interacting with different AWS APIs\n", "region = session._region_name\n", "bucket = session.default_bucket() # bucket to house artifacts\n", "\n", "img_uri = image_uris.retrieve(framework=\"djl-deepspeed\", region=region, version=\"0.21.0\")\n", "instance_type = \"ml.g5.xlarge\"\n", "s3_location = f\"s3://{bucket}/djl-serving/\"" ] }, { "cell_type": "markdown", "id": "86dd16e7-91a9-4f56-ae5e-9aace3948ddb", "metadata": {}, "source": [ "#### Prepare model file.\n", "We can update the configuration for deployment by modifying the [serving.properties](blip2/serving.properties).\n", "```python\n", "engine = DeepSpeed\n", "option.tensor_parallel_degree=1\n", "option.model_id=Salesforce/blip2-opt-6.7b\n", "```\n", "\n", "The code below creates the SageMaker model file (`model.tar.gz`) and upload it to S3. " ] }, { "cell_type": "code", "execution_count": null, "id": "30ab7b51-d75b-418a-8fc8-2840fc977f5d", "metadata": {}, "outputs": [], "source": [ "!tar -czvf blip2.tar.gz blip2/" ] }, { "cell_type": "code", "execution_count": null, "id": "a035752e-64a6-4b33-8e1b-11115d7ef8c2", "metadata": {}, "outputs": [], "source": [ "model_tar_url = sagemaker.s3.S3Uploader.upload(\"blip2.tar.gz\", s3_location)" ] }, { "cell_type": "markdown", "id": "675fa0ca-d30f-486c-89e2-928aa48061f0", "metadata": {}, "source": [ "#### Create SageMaker endpoint\n", "Now we create our SageMaker model. Make sure your execution role has access to your model artifacts and ECR image. Please check out our SageMaker Roles documentation for more details." 
] }, { "cell_type": "code", "execution_count": null, "id": "688a47c4-bedd-45ff-a8f8-545fb048ff5c", "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "\n", "sm_client = boto3.client(\"sagemaker\")\n", "\n", "time_stamp = datetime.now().strftime(\"%Y-%m-%d-%H-%M-%S\")\n", "model_name = \"blip2-\" + time_stamp\n", "\n", "create_model_response = sm_client.create_model(\n", " ModelName=model_name,\n", " ExecutionRoleArn=role,\n", " PrimaryContainer={\"Image\": img_uri, \"ModelDataUrl\": model_tar_url},\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "0dca1017-2707-40cb-b8d7-545f6d77ea45", "metadata": {}, "outputs": [], "source": [ "initial_instance_count = 1\n", "variant_name = \"AllTraffic\"\n", "endpoint_config_name = \"blip2-config-\" + time_stamp\n", "\n", "production_variants = [\n", " {\n", " \"VariantName\": variant_name,\n", " \"ModelName\": model_name,\n", " \"InitialInstanceCount\": initial_instance_count,\n", " \"InstanceType\": instance_type,\n", " \"ModelDataDownloadTimeoutInSeconds\": 1200,\n", " \"ContainerStartupHealthCheckTimeoutInSeconds\": 1800\n", " }\n", "]\n", "\n", "endpoint_config = {\n", " \"EndpointConfigName\": endpoint_config_name,\n", " \"ProductionVariants\": production_variants\n", "}\n", "\n", "ep_conf_res = sm_client.create_endpoint_config(**endpoint_config)" ] }, { "cell_type": "code", "execution_count": null, "id": "e18dc1e0-9754-4edf-913c-0799d1e9beda", "metadata": {}, "outputs": [], "source": [ "endpoint_name_blip2 = \"blip2-large-v2-xlarge\"\n", "\n", "ep_res = sm_client.create_endpoint(\n", " EndpointName=endpoint_name_blip2, EndpointConfigName=endpoint_config_name\n", ")" ] }, { "cell_type": "markdown", "id": "430b8554-f5e9-46d7-add4-e9e25a365865", "metadata": {}, "source": [ "**Once the endpoint creation starts, you can move on to the next section to continue deploying the models to finish remaining task before testing the blip2 model, as this model may take around 15mins to finish deployment**\n", "\n", "Once it is ready, uncomment below cell to test the model." 
] }, { "cell_type": "code", "execution_count": null, "id": "f1b3dc95-d8c5-4c6a-8db2-562026a64306", "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "# # wait for the endpoint to be deployed successfully - as we are using a small instance, the deployment can take around 15mins\n", "# describe_endpoint_response = sm_client.describe_endpoint(EndpointName=endpoint_name_blip2)\n", "\n", "# while describe_endpoint_response[\"EndpointStatus\"] == \"Creating\":\n", "# describe_endpoint_response = sm_client.describe_endpoint(EndpointName=endpoint_name_blip2)\n", "# print(describe_endpoint_response[\"EndpointStatus\"])\n", "# time.sleep(15)\n", " \n", "# print(f'endpoint {endpoint_name_blip2} is in service now.')" ] }, { "cell_type": "markdown", "id": "d2eb168e-0815-4117-be5b-8cbb87efe91f", "metadata": {}, "source": [ "#### Test inference endpoint" ] }, { "cell_type": "code", "execution_count": null, "id": "e454e600-a9d2-40c8-a139-58525051ffa2", "metadata": {}, "outputs": [], "source": [ "# import base64\n", "# import json\n", "\n", "# def encode_image(img_file):\n", "# with open(img_file, \"rb\") as image_file:\n", "# img_str = base64.b64encode(image_file.read())\n", "# base64_string = img_str.decode(\"latin1\")\n", "# return base64_string" ] }, { "cell_type": "code", "execution_count": null, "id": "371fbcc1-2dc2-41a9-934a-a6cad4ceaa0b", "metadata": {}, "outputs": [], "source": [ "# base64_string = encode_image('test/carcrash-ai.jpeg')\n", "# inputs = {\"prompt\": \"Question: is the car damaged? and if yes, which part of this car are damaged?/n Answer:\", \"image\": base64_string}\n", "\n", "# sm_runtime_client = boto3.client(\"sagemaker-runtime\")\n", "\n", "# response = sm_runtime_client.invoke_endpoint(\n", "# EndpointName=endpoint_name_blip2, Body=json.dumps(inputs)\n", "# )\n", "# print(response[\"Body\"].read())" ] }, { "cell_type": "markdown", "id": "c1afbb7b-ff6f-4cb2-b5b5-7936f4317557", "metadata": {}, "source": [ "all-MiniLM-L6-v2## Secton 3: (Embedding model) Deploy the all-MiniLM-L6-v2 embedding model on SageMaker and implement RAG solution with LangChain" ] }, { "cell_type": "code", "execution_count": null, "id": "1d1399b4-4897-4426-941c-e58d095ada04", "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker import image_uris, model_uris, script_uris, hyperparameters\n", "from sagemaker.model import Model\n", "from sagemaker.predictor import Predictor\n", "import os\n", "\n", "instance_type = \"ml.g5.4xlarge\" # instance type to use for deployment\n", "model_version = \"*\"\n", "env= {\n", " \"SAGEMAKER_MODEL_SERVER_WORKERS\": \"1\", \n", " \"TS_DEFAULT_WORKERS_PER_MODEL\": \"1\",\n", " # This model requires HF_TASK param \n", " # https://huggingface.co/docs/transformers/main/main_classes/pipelines#transformers.pipeline.task\n", " \"HF_TASK\": \"feature-extraction\" \n", " }\n" ] }, { "cell_type": "code", "execution_count": null, "id": "69439663-dcc4-47c1-bc0a-63d979506645", "metadata": { "tags": [] }, "outputs": [], "source": [ "repository = \"sentence-transformers/all-MiniLM-L6-v2\"\n", "model_name=repository.split(\"/\")[-1]\n", "s3_location=f\"s3://{sess.default_bucket()}/custom_inference/{model_name}/model.tar.gz\"\n", "pwd = os.system(\"pwd\") # current path" ] }, { "cell_type": "code", "execution_count": null, "id": "bf9c8071-7f27-4b9d-a2fc-ef562f89ae64", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Download the model from hf.co/models with git clone.\n", "!git lfs install\n", "!git clone https://huggingface.co/$repository" ] }, 
{ "cell_type": "code", "execution_count": null, "id": "71312b4a-9bd8-416d-9294-4e5f7acd2882", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Create a model.tar.gz archive in S3 and delete downloaded folder\n", "%cd $model_name\n", "!tar zcvf model.tar.gz *\n", "!aws s3 cp model.tar.gz $s3_location\n", "%cd ..\n", "%rm -r $model_name" ] }, { "cell_type": "code", "execution_count": null, "id": "e67dac5b-4e5d-4f7d-ac67-5a3ac1d1958d", "metadata": { "tags": [] }, "outputs": [], "source": [ "model_id= \"huggingface-textembedding-all-MiniLM-L6-v2\"\n", "# Retrieve the model uri.\n", "model_uri = model_uris.retrieve(\n", " model_id=model_id, model_version=model_version, model_scope=\"inference\"\n", ")\n", "embed_endpoint_name = model_id + '-' + instance_type.split('.')[-1]\n", "\n", "# Retrieve the inference container uri. This is the base HuggingFace container image for the default model above.\n", "deploy_image_uri = image_uris.retrieve(\n", " region=None,\n", " framework=None, # automatically inferred from model_id\n", " image_scope=\"inference\",\n", " model_id=model_id,\n", " model_version=model_version,\n", " instance_type=instance_type,\n", ")\n", "model_inference = Model(\n", " image_uri=deploy_image_uri,\n", " model_data=model_uri,\n", " role=role,\n", " predictor_cls=Predictor,\n", " name=model_id,\n", " env=env,\n", ")\n", "model_predictor_inference = model_inference.deploy(\n", " initial_instance_count=1,\n", " instance_type=instance_type,\n", " predictor_cls=Predictor,\n", " endpoint_name=embed_endpoint_name,\n", ")\n", "print(f\"Model {model_id} has been deployed successfully.\")" ] }, { "cell_type": "markdown", "id": "7a25196b-8520-467c-82d4-ff56f7c626e4", "metadata": {}, "source": [ "### Use RAG based approach with [LangChain](https://python.langchain.com/en/latest/index.html) and SageMaker endpoints to build a simplified question and answering application.\n", "\n", "\n", "We plan to use document embeddings to fetch the most relevant documents in our document knowledge library and combine them with the prompt that we provide to LLM.\n", "\n", "To achieve that, we will do following.\n", "\n", "1. **Generate embedings for each of document in the knowledge library with Huggingface all-MiniLM-L6-v2 embedding model.**\n", "2. **Identify top K most relevant documents based on user query.**\n", " - 2.1 **For a query of your interest, generate the embedding of the query using the same embedding model.**\n", " - 2.2 **Search the indexes of top K most relevant documents in the embedding space using in-memory Faiss search.**\n", " - 2.3 **Use the indexes to retrieve the corresponded documents.**\n", "3. **Combine the retrieved documents with prompt and question and send them into SageMaker LLM.**\n", "\n", "\n", "\n", "Note: The retrieved document/text should be large enough to contain enough information to answer a question; but small enough to fit into the LLM prompt -- maximum sequence length of 1024 tokens. \n", "\n", "---\n", "To build a simiplied QA application with LangChain, we need: \n", "1. Wrap up our SageMaker endpoints for embedding model and LLM into `langchain.embeddings.SagemakerEndpointEmbeddings` and `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`. That requires a small overwritten of `SagemakerEndpointEmbeddings` class to make it compatible with SageMaker embedding mdoel.\n", "2. Prepare the dataset to build the knowledge data base. 
\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "512c101e-1fbe-4c31-a1ef-8b63212fa790", "metadata": {}, "source": [ "Wrap up our SageMaker endpoints for embedding model into `langchain.embeddings.SagemakerEndpointEmbeddings`. That requires a small overwritten of `SagemakerEndpointEmbeddings` class to make it compatible with SageMaker embedding mdoel." ] }, { "cell_type": "code", "execution_count": null, "id": "7f9f8cae-807a-408d-a9cf-98b43bbb2542", "metadata": { "tags": [] }, "outputs": [], "source": [ "from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler\n", "from langchain.embeddings import SagemakerEndpointEmbeddings\n", "from langchain.llms.sagemaker_endpoint import ContentHandlerBase\n", "from typing import Any, Dict, List, Optional\n", "import json\n", "\n", "class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):\n", " def embed_documents(self, texts: List[str], chunk_size: int = 5) -> List[List[float]]:\n", " \"\"\"Compute doc embeddings using a SageMaker Inference Endpoint.\n", "\n", " Args:\n", " texts: The list of texts to embed.\n", " chunk_size: The chunk size defines how many input texts will\n", " be grouped together as request. If None, will use the\n", " chunk size specified by the class.\n", "\n", " Returns:\n", " List of embeddings, one for each text.\n", " \"\"\"\n", " results = []\n", " _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size\n", " for i in range(0, len(texts), _chunk_size):\n", " response = self._embedding_func(texts[i : i + _chunk_size])\n", " print\n", " results.extend(response)\n", " return results\n", "\n", "\n", "class ContentHandler(EmbeddingsContentHandler):\n", " content_type = \"application/json\"\n", " accepts = \"application/json\"\n", "\n", " def transform_input(self, prompt: str, model_kwargs={}) -> bytes:\n", " input_str = json.dumps({\"text_inputs\": prompt, **model_kwargs})\n", " return input_str.encode(\"utf-8\")\n", "\n", " def transform_output(self, output: bytes) -> str:\n", " response_json = json.loads(output.read().decode(\"utf-8\"))\n", " embeddings = response_json[\"embedding\"]\n", " return embeddings\n", "\n", "\n", "content_handler = ContentHandler()\n", "\n", "embeddings = SagemakerEndpointEmbeddingsJumpStart(\n", " endpoint_name=embed_endpoint_name,\n", " region_name=aws_region,\n", " content_handler=content_handler,\n", ")\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "6ee7a5f2-058b-4620-a37b-95cfca6a1e80", "metadata": {}, "source": [ "Next, we wrap up our SageMaker endpoints for LLM into `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "14a8461b-6eff-475a-85e3-f0973880a555", "metadata": { "tags": [] }, "outputs": [], "source": [ "from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint\n", "\n", "parameters = {\n", " \"max_new_tokens\": 500,\n", " \"return_full_text\": False,\n", " \"temperature\": 0.1\n", "}\n", "\n", "\n", "class ContentHandler(LLMContentHandler):\n", " content_type = \"application/json\"\n", " accepts = \"application/json\"\n", "\n", " def transform_input(self, prompt: str, model_kwargs={}) -> bytes:\n", " self.len_prompt = len(prompt)\n", " input_str = json.dumps({\"inputs\": prompt, \"parameters\": {**model_kwargs}})\n", " return input_str.encode(\"utf-8\")\n", "\n", " def transform_output(self, output: bytes) -> str:\n", " response_json = output.read()\n", " res = json.loads(response_json)\n", " print(res)\n", " ans = res[0]['generated_text'][self.len_prompt:]\n", " return ans \n", "\n", "\n", "content_handler = ContentHandler()\n", "\n", "sm_llm = SagemakerEndpoint(\n", " endpoint_name=falcon_endpoint_name,\n", " region_name=aws_region,\n", " model_kwargs=parameters,\n", " content_handler=content_handler,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "ab5c1370-3271-4561-b449-0c90782e91b0", "metadata": { "tags": [] }, "outputs": [], "source": [ "from langchain.chains import RetrievalQA\n", "from langchain.document_loaders import TextLoader\n", "from langchain.indexes import VectorstoreIndexCreator\n", "from langchain.vectorstores import Chroma, AtlasDB, FAISS\n", "from langchain.text_splitter import CharacterTextSplitter\n", "from langchain import PromptTemplate\n", "from langchain.chains.question_answering import load_qa_chain\n", "from langchain.document_loaders import DirectoryLoader" ] }, { "cell_type": "markdown", "id": "234b7547-a52d-4f74-9982-0e7501386642", "metadata": {}, "source": [ "Use langchain to read the `txt` data. There are multiple built-in functions in LangChain to read different format of files such as `csv`, `html`, and `pdf`. For details, see [LangChain document loaders](https://python.langchain.com/en/latest/modules/indexes/document_loaders.html)." ] }, { "cell_type": "code", "execution_count": null, "id": "0e27d89d-8d56-413f-b753-1e48e79446e1", "metadata": { "tags": [] }, "outputs": [], "source": [ "loader = DirectoryLoader(\"../data/\", glob=\"**/*.txt\")\n", "documents = loader.load()" ] }, { "cell_type": "markdown", "id": "cfe4a131-9b09-4141-96c5-6a13751e99ff", "metadata": {}, "source": [ "We generate embedings for each of document in the knowledge library with Huggingface all-MiniLM-L6-v2 embedding model.documents" ] }, { "cell_type": "code", "execution_count": null, "id": "b51eb647-9b91-4581-bdfa-12f55ab0f4b4", "metadata": { "tags": [] }, "outputs": [], "source": [ "docsearch = FAISS.from_documents(documents, embeddings)" ] }, { "cell_type": "code", "execution_count": null, "id": "d4604c98-a75d-4d5c-8672-5774ef7060a9", "metadata": { "tags": [] }, "outputs": [], "source": [ "question = \"what is the recommended way to first customize a foundation model?\"" ] }, { "cell_type": "markdown", "id": "1a972160-933c-4353-b0e6-1c4aaa878ae8", "metadata": {}, "source": [ "Based on the question above, we then **identify top K most relevant documents based on user query, where K = 3 in this setup**." 
] }, { "cell_type": "code", "execution_count": null, "id": "fb4c6fb8-ec23-42d4-9853-d0f3c292f103", "metadata": { "tags": [] }, "outputs": [], "source": [ "docs = docsearch.similarity_search_with_score(question)" ] }, { "cell_type": "code", "execution_count": null, "id": "27aec596-f6be-4780-9599-6793a2f26b22", "metadata": { "tags": [] }, "outputs": [], "source": [ "docs" ] }, { "cell_type": "code", "execution_count": null, "id": "dc6375e8-321b-40a2-b631-0ff456063849", "metadata": { "tags": [] }, "outputs": [], "source": [ "source = []\n", "context = []\n", "for doc, score in docs:\n", " context.append(doc)\n", " source.append(doc.metadata['source'].split('/')[-1])" ] }, { "cell_type": "markdown", "id": "ad22f198-7474-4c57-b4d7-0be16d90248c", "metadata": {}, "source": [ "Finally, we **combine the retrieved documents with prompt and question and send them into SageMaker LLM.** \n", "\n", "We define a customized prompt as below." ] }, { "cell_type": "code", "execution_count": null, "id": "a53822d7-e44c-4076-8735-59a6ce7f7ddb", "metadata": { "tags": [] }, "outputs": [], "source": [ "prompt_template = \"\"\"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.:\\n\\n{context}\\n\\nQuestion: {question}\\nHelpful Answer:\"\"\"\n", "\n", "PROMPT = PromptTemplate(template=prompt_template, input_variables=[\"context\", \"question\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "f80b71b4-a04b-4a66-b5ff-b5e2d132703f", "metadata": { "tags": [] }, "outputs": [], "source": [ "chain = load_qa_chain(llm=sm_llm, prompt=PROMPT)" ] }, { "cell_type": "code", "execution_count": null, "id": "9deb6db1-3897-410c-9599-5d6f63333a63", "metadata": { "tags": [] }, "outputs": [], "source": [ "result = chain({\"input_documents\": context, \"question\": question}, return_only_outputs=True)[\"output_text\"]\n", "print(result)" ] }, { "cell_type": "markdown", "id": "9717f908-ae2f-4ec4-b637-a5cc309f1bea", "metadata": { "tags": [] }, "source": [ "## Secton 4: (Optional) Understand the model hosting performance using SageMaker Inference Recommender " ] }, { "cell_type": "markdown", "id": "d5460ca6-ab3c-4da1-bd8e-338c1c89c6a2", "metadata": {}, "source": [ "[Amazon SageMaker Inference Recommender](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender.html) is a capability of Amazon SageMaker that reduces the time required to get machine learning (ML) models in production by automating load testing and model tuning across SageMaker ML instances. You can use Inference Recommender to deploy your model to a real-time or serverless inference endpoint that delivers the best performance at the lowest cost. Inference Recommender helps you select the best instance type and configuration (such as instance count, container parameters, and model optimizations) or serverless configuration (such as max concurrency and memory size) for your ML models and workloads." 
] }, { "cell_type": "code", "execution_count": null, "id": "a16877ea-94f0-4bc5-baa7-045315487574", "metadata": { "tags": [] }, "outputs": [], "source": [ "!tar -czvf payload.tar.gz test/payload.json" ] }, { "cell_type": "code", "execution_count": null, "id": "03715d81-4d5a-4062-a1c3-aa3f375d231c", "metadata": { "tags": [] }, "outputs": [], "source": [ "s3_location = f\"s3://{bucket}/sagemaker/InferenceRecommender/{model_name}\"\n", "payload_tar_url = sagemaker.s3.S3Uploader.upload(\"payload.tar.gz\", s3_location)\n", "print(payload_tar_url)" ] }, { "cell_type": "code", "execution_count": null, "id": "f31ef825-fab6-4239-8fa1-4376c0bdc8a6", "metadata": { "tags": [] }, "outputs": [], "source": [ "job_name = f\"{falcon_model_name}-\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "response = sm_client.create_inference_recommendations_job(\n", " JobName=job_name,\n", " JobType='Default',\n", " RoleArn=role,\n", " InputConfig={\n", " 'ContainerConfig': {\n", " 'Domain': 'NATURAL_LANGUAGE_PROCESSING',\n", " 'Task': 'TEXT_GENERATION',\n", " 'PayloadConfig': {\n", " 'SamplePayloadUrl': payload_tar_url,\n", " 'SupportedContentTypes': [\"application/json\"],\n", " },\n", " #specify the instance types you would like to test out\n", " 'SupportedInstanceTypes': ['ml.g5.2xlarge'], \n", " 'SupportedEndpointType': 'RealTime'\n", " },\n", " 'ModelName': falcon_model_name\n", " },\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "dd346679-cc89-4f13-bfb8-45d2aa73bf0f", "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "# # uncomment this section to wait for the inference job to finish\n", "# describe_IR_job_response = sm_client.describe_inference_recommendations_job(JobName=job_name)\n", "\n", "# while describe_IR_job_response[\"Status\"] in [\"IN_PROGRESS\", \"PENDING\"]:\n", "# describe_IR_job_response = sm_client.describe_inference_recommendations_job(JobName=job_name)\n", "# print(describe_IR_job_response[\"Status\"])\n", "# time.sleep(15)\n", " \n", "# print(f'Inference Recommender job {job_name} has finished with status {describe_IR_job_response[\"Status\"]}.')" ] }, { "cell_type": "markdown", "id": "e1984596-7f6c-4257-84a8-edceca547f61", "metadata": {}, "source": [ "Now, let's use the inference recommender job results to calculate the approximate invocation cost for the LLM endpoint." ] }, { "cell_type": "code", "execution_count": null, "id": "1746058b-3eea-42d0-a632-76fa124abbe0", "metadata": {}, "outputs": [], "source": [ "describe_IR_job_response = sm_client.describe_inference_recommendations_job(JobName=job_name)\n", "describe_IR_job_response['InferenceRecommendations']" ] }, { "cell_type": "markdown", "id": "ec0bcdbe-ed9b-4fc2-a94a-c26c2c3fbd0f", "metadata": {}, "source": [ "The inference recomender job reports the below metrics: \n", "- 'ModelLatency'\n", "- 'CostPerInference'\n", "- 'CostPerHour'\n", "- 'MaxInvocations' per minute\n", "\n", "and more.\n", "\n", "Note that the sample json input file consists of 6,200 characters, which is around 1550 tokens per invocation (1 token is approximately 4 characters). To calculate the approximate cost per 1K tokens, you can do the inference many times (with average payload size) and get the best token/s you get through the experiment (different instance types can result in different throughput, model latency, and cost). Then we will calculate the per token per second invocation price and multiply by 1,000. You can also use per invocation cost divide by the tokens per invocation and multiply by 1,000. 
The calculated price should be similar. SageMaker also supports auto-scaling to scale your endpoint out/in to save cost based on the invocation traffic pattern." ] }, { "cell_type": "code", "execution_count": null, "id": "784c07c2-5375-4ffc-9ba8-685e4eac9a81", "metadata": {}, "outputs": [], "source": [ "metrics = describe_IR_job_response['InferenceRecommendations'][0]['Metrics']\n", "token_per_sec = round(metrics['MaxInvocations']*1550/60, 2) # MaxInvocations is per minute; ~1,550 tokens per invocation\n", "cost_per_sec = round(metrics['CostPerHour']/3600, 5)\n", "cost_per_1k_token = round(cost_per_sec/token_per_sec * 1000, 5)\n", "print(\"According to the Inference Recommender job, the corresponding metrics are as below:\\n\")\n", "print(f\"Max tokens per second is about {token_per_sec}\")\n", "print(f\"Cost per second is about ${cost_per_sec}\")\n", "print(f\"Cost per 1k tokens is about ${cost_per_1k_token}\")" ] }, { "cell_type": "markdown", "id": "c29147f3-8d9f-414b-8454-cc232f9def23", "metadata": {}, "source": [ "## Section 5: (Optional) Deploy speech-to-text LLM on SageMaker\n", "In this section, we will deploy the open-source [Whisper model](https://huggingface.co/openai/whisper-large-v2) to SageMaker real-time hosting. " ] }, { "cell_type": "code", "execution_count": null, "id": "95351e27", "metadata": { "tags": [] }, "outputs": [], "source": [ "import sagemaker\n", "import boto3\n", "import json\n", "from sagemaker.huggingface import HuggingFaceModel\n", "from sagemaker.serializers import DataSerializer\n", "client = boto3.client('runtime.sagemaker')\n", "\n", "try:\n", "\trole = sagemaker.get_execution_role()\n", "except ValueError:\n", "\tiam = boto3.client('iam')\n", "\trole = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n", "\n", "# Hub Model configuration. 
https://huggingface.co/models\n", "hub = {\n", "\t'HF_MODEL_ID':'openai/whisper-large-v2',\n", "\t'HF_TASK':'automatic-speech-recognition',\n", "}\n", "\n", "# create Hugging Face Model Class\n", "huggingface_model = HuggingFaceModel(\n", "\ttransformers_version='4.26.0',\n", "\tpytorch_version='1.13.1',\n", "\tpy_version='py39',\n", "\tenv=hub,\n", "\trole=role, \n", ")\n", "endpoint_name=\"whisper-large-v2\"\n", "# deploy model to SageMaker Inference\n", "predictor = huggingface_model.deploy(\n", " endpoint_name=endpoint_name,\n", "\tinitial_instance_count=1, # number of instances\n", "\tinstance_type='ml.g5.xlarge' # ec2 instance type\n", ")\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "b4c666ae-dc81-40b7-92a2-29e607448cdc", "metadata": { "tags": [] }, "outputs": [], "source": [ "file = \"test/test.webm\"\n", "with open(file, \"rb\") as f:\n", "\tdata = f.read()" ] }, { "cell_type": "code", "execution_count": null, "id": "6a2376b8-a1f8-4384-baea-dcbee02a828c", "metadata": { "tags": [] }, "outputs": [], "source": [ "# option 1: using SageMaker python SDK\n", "predictor = sagemaker.predictor.Predictor(endpoint_name=endpoint_name, \n", " serializer=DataSerializer(content_type='audio/x-audio'),\n", " )\n", "predictor.predict(data)" ] }, { "cell_type": "code", "execution_count": null, "id": "69661495-4803-441c-8184-068528182138", "metadata": { "tags": [] }, "outputs": [], "source": [ "# option 2: using boto3 invoke_endpoint api\n", "response = client.invoke_endpoint(EndpointName=endpoint_name, ContentType='audio/x-audio', Body=data)\n", "json.loads(response['Body'].read())" ] }, { "cell_type": "markdown", "id": "7d95f819-27d8-46c9-af8b-33955a485d6c", "metadata": { "tags": [] }, "source": [ "## Section 6: (Optional) Deploy Llama 2 on SageMaker\n", "\n", "The Llama 2 foundation models developed by Meta are available for customers through [Amazon SageMaker JumpStart](https://aws.amazon.com/sagemaker/jumpstart/). Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. Llama 2 is intended for commercial and research use in English. It comes in a range of parameter sizes—7 billion, 13 billion, and 70 billion—as well as pre-trained and fine-tuned variations. According to Meta, the tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. Llama 2 was pre-trained on 2 trillion tokens of data from publicly available sources. The tuned models are intended for assistant-like chat, whereas pre-trained models can be adapted for a variety of natural language generation tasks." 
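] }, { "cell_type": "markdown", "id": "e5f6a7b8", "metadata": {}, "source": [ "As a hedged sketch of a JumpStart-based deployment (verify the `model_id` for your region; Llama 2 support may require a newer `sagemaker` SDK release than the version pinned at the top of this notebook, and the model requires accepting Meta's EULA):" ] }, { "cell_type": "code", "execution_count": null, "id": "e5f6a7b9", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Sketch only: deploy the Llama 2 7B chat model via SageMaker JumpStart.\n", "# from sagemaker.jumpstart.model import JumpStartModel\n", "\n", "# llama2_model = JumpStartModel(model_id=\"meta-textgeneration-llama-2-7b-f\")\n", "# llama2_predictor = llama2_model.deploy(accept_eula=True)"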
] }, { "cell_type": "markdown", "id": "55625c5f-a145-46c2-9bd7-a2f9aff141cf", "metadata": {}, "source": [ "You can follow the [blog post](https://aws.amazon.com/blogs/machine-learning/llama-2-foundation-models-from-meta-are-now-available-in-amazon-sagemaker-jumpstart/) to deploy the Llama 2 from JumpStart and replace the chatbot LLM model name with the Llama 2 endpoint name then test the performance of the model" ] }, { "cell_type": "markdown", "id": "65f9f81c-f762-43a7-9c46-cf26f2455b48", "metadata": {}, "source": [ "## clean up" ] }, { "cell_type": "code", "execution_count": null, "id": "ceced14e", "metadata": {}, "outputs": [], "source": [ "# client.delete_endpoint(EndpointName=endpoint_name)\n", "# client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n", "# client.delete_model(ModelName=model_name)" ] }, { "cell_type": "code", "execution_count": null, "id": "447cf5ec", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, 
"hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5d.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": true, "memoryGiB": 0, "name": "ml.geospatial.interactive", "supportedImageNames": [ "sagemaker-geospatial-v1-0" ], "vcpuNum": 0 }, { "_defaultOrder": 21, "_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 144, "name": "ml.c5.18xlarge", "vcpuNum": 72 }, { "_defaultOrder": 28, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 29, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 32, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, 
"name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 34, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 35, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 38, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 42, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 46, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 47, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 53, "_isFastLaunch": 
false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.g5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 54, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 }, { "_defaultOrder": 55, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 56, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4de.24xlarge", "vcpuNum": 96 } ], "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science 3.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }