{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "71a329f0",
   "metadata": {},
   "source": [
    "# Deploy scalable streaming tokens solution on SageMaker\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
    "\n",
    "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/inference|generativeai|llm-workshop|lab6-stream-with-pagination|stream_pagination_lmi.ipynb)\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71a329f0",
   "metadata": {},
   "source": [
    "In this notebook, we explore how to host a large language model on SageMaker using the latest container that packages some of the most popular open source libraries for model parallel inference like DeepSpeed and HuggingFace Accelerate. We use DJLServing as the model serving solution in this example. DJLServing is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. To learn more about DJL and DJLServing, you can refer to our [recent blog post](https://aws.amazon.com/blogs/machine-learning/deploy-bloom-176b-and-opt-30b-on-amazon-sagemaker-with-large-model-inference-deep-learning-containers-and-deepspeed/).\n",
    "\n",
    "In this notebook, we will deploy the open source [Cerebras-GPT-2.7B](https://huggingface.co/cerebras/Cerebras-GPT-1.3B) model on a ml.g5.2xlarge machine. We will also demostrate a streaming experience to have model run end2end in a stream way.\n",
    "\n",
    "![](images/design_chart.png)\n",
    "\n",
    "\n",
    "Customer could directly send the request to Lambda service through API-Gateway. Lambda service could send inference job request to SageMaker. SageMaker would run inference and update the result to dynamoDB. Finally, customer could read directly from DynamoDB.\n",
    "\n",
    "## Licence agreement\n",
    "- View model license information: Apache 2.0 before using the model.\n",
    "- This notebook is a sample notebook and not intended for production use. Please refer to the licence at https://github.com/aws/mit-0.\n",
    "\n",
    "\n",
    "## Permission\n",
    "\n",
    "In order to conduct this lab, we will need the following permissions:\n",
    "\n",
    "- ECR Push/Pull access\n",
    "- S3 bucket push access\n",
    "- SageMaker access\n",
    "- DynamoDB access (create DB and query)\n",
    "\n",
    "If you plan to do build restful services, we also need to have lambda, iam and API-gateway permission.\n",
    "\n",
    "- AWSLambda access (Create lambda function)\n",
    "- IAM access (Create role, delete role)\n",
    "- APIGateway (Creation, deletion)\n",
    "\n",
    "## Let's bump up SageMaker and import stuff"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67fa3208",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%pip install sagemaker boto3 awscli --upgrade  --quiet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec9ac353",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import boto3\n",
    "import sagemaker\n",
    "from sagemaker import Model, serializers, deserializers\n",
    "\n",
    "role = sagemaker.get_execution_role()  # execution role for the endpoint\n",
    "sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs\n",
    "region = sess._region_name  # region name of the current SageMaker Studio environment\n",
    "account_id = sess.account_id()  # account_id of the current SageMaker Studio environment"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71542f98",
   "metadata": {},
   "source": [
    "## Bring your own container to ECR repository\n",
    "\n",
    "*Note: Please make sure you have the permission in AWS credential to push to ECR repository*\n",
    "\n",
    "In this step, we will pull the LMI nightly container from dockerhub and then push it to the ECR repository.\n",
    "\n",
    "This process may take a while, depends on the container size and your network bandwidth."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1efb852",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "# The name of our container\n",
    "repo_name=djlserving-byoc\n",
    "# Target container\n",
    "target_container=\"deepjavalibrary/djl-serving:deepspeed-nightly\"\n",
    "\n",
    "account=$(aws sts get-caller-identity --query Account --output text)\n",
    "\n",
    "# Get the region defined in the current configuration (default to us-west-2 if none defined)\n",
    "region=$(aws configure get region)\n",
    "region=${region:-us-west-2}\n",
    "\n",
    "fullname=\"${account}.dkr.ecr.${region}.amazonaws.com/${repo_name}:latest\"\n",
    "echo \"Creating ECR repository ${fullname}\"\n",
    "\n",
    "# If the repository doesn't exist in ECR, create it.\n",
    "\n",
    "aws ecr describe-repositories --repository-names \"${repo_name}\" > /dev/null 2>&1\n",
    "\n",
    "if [ $? -ne 0 ]\n",
    "then\n",
    "    aws ecr create-repository --repository-name \"${repo_name}\" > /dev/null\n",
    "fi\n",
    "\n",
    "# Get the login command from ECR and execute it directly\n",
    "aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin \"${account}.dkr.ecr.${region}.amazonaws.com\"\n",
    "\n",
    "# Build the docker image locally with the image name and then push it to ECR\n",
    "# with the full name.\n",
    "echo \"Start pulling container: ${target_container}\"\n",
    "\n",
    "docker pull ${target_container}\n",
    "docker tag ${target_container} ${fullname}\n",
    "docker push ${fullname}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81deac79",
   "metadata": {},
   "source": [
    "## Create SageMaker compatible Model artifact, upload Model to S3 and bring your own inference script.\n",
    "\n",
    "SageMaker Large Model Inference containers can be used to host models without providing your own inference code. This is extremely useful when there is no custom pre-processing of the input data or postprocessing of the model's predictions.\n",
    "\n",
    "However in this notebook, we demonstrate how to deploy a model with custom inference code.\n",
    "\n",
    "In LMI contianer, we expect some artifacts to help setting up the model\n",
    "- `serving.properties` is the configuration file that can be used to configure the model server.\n",
    "- `model.py` A python file to define the core inference logic\n",
    "- `requirements.txt` contains the pip wheel need to install in runtime\n",
    "\n",
    "For more details on the configuration options and an exhaustive list, you can refer the documentation - https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b011bf5f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%writefile serving.properties\n",
    "engine=Python\n",
    "option.model_id=cerebras/Cerebras-GPT-2.7B"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19d6798b",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%writefile model.py\n",
    "from djl_python import Input, Output\n",
    "import torch\n",
    "import logging\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "from djl_python.streaming_utils import StreamingUtils\n",
    "from paginator import DDBPaginator\n",
    "import uuid\n",
    "\n",
    "\n",
    "def load_model(properties):\n",
    "    model_location = properties[\"model_dir\"]\n",
    "    if \"model_id\" in properties:\n",
    "        model_location = properties[\"model_id\"]\n",
    "    logging.info(f\"Loading model in {model_location}\")\n",
    "    device = \"cuda:0\"\n",
    "    model = AutoModelForCausalLM.from_pretrained(\n",
    "        model_location, low_cpu_mem_usage=True, torch_dtype=torch.float16\n",
    "    ).to(device)\n",
    "    tokenizer = AutoTokenizer.from_pretrained(model_location)\n",
    "    stream_generator = StreamingUtils.get_stream_generator(\"Accelerate\")\n",
    "    return model, tokenizer, stream_generator\n",
    "\n",
    "\n",
    "model = None\n",
    "tokenizer = None\n",
    "stream_generator = None\n",
    "paginator = None\n",
    "\n",
    "\n",
    "def separate_inference(session_id, inputs):\n",
    "    prompt = inputs[\"prompt\"]\n",
    "    length = inputs[\"max_new_tokens\"]\n",
    "    generate_kwargs = dict(max_new_tokens=length, do_sample=True)\n",
    "    generator = stream_generator(model, tokenizer, prompt, **generate_kwargs)\n",
    "    generated = \"\"\n",
    "    iterator = 0\n",
    "    for text in generator:\n",
    "        generated += text[0]\n",
    "        if iterator == 5:\n",
    "            paginator.add_cache(session_id, generated)\n",
    "            iterator = 0\n",
    "        iterator += 1\n",
    "    paginator.add_cache(session_id, generated + \"<eos>\")\n",
    "\n",
    "\n",
    "def handle(inputs: Input):\n",
    "    global model, tokenizer, stream_generator, paginator\n",
    "    if not model:\n",
    "        model, tokenizer, stream_generator = load_model(inputs.get_properties())\n",
    "        paginator = DDBPaginator(\"lmi_test_db\")\n",
    "\n",
    "    if inputs.is_empty():\n",
    "        # Model server makes an empty call to warmup the model on startup\n",
    "        return None\n",
    "    session_id = str(uuid.uuid4())\n",
    "    return (\n",
    "        Output()\n",
    "        .add({\"session_id\": session_id})\n",
    "        .finalize(separate_inference, session_id, inputs.get_as_json())\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "02e53e8c-9b3b-47a1-9afc-d1c8621269e9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%writefile paginator.py\n",
    "import boto3\n",
    "import logging\n",
    "\n",
    "\n",
    "class DDBPaginator:\n",
    "    DEFAULT_KEY_NAME = \"cache_id\"\n",
    "\n",
    "    def __init__(self, db_name):\n",
    "        self.db_name = db_name\n",
    "        self.ddb_client = boto3.client(\"dynamodb\")\n",
    "        try:\n",
    "            self.ddb_client.describe_table(TableName=db_name)\n",
    "        except self.ddb_client.exceptions.ResourceNotFoundException:\n",
    "            logging.info(f\"Table {db_name} not found\")\n",
    "            self.ddb_client.create_table(\n",
    "                TableName=db_name,\n",
    "                AttributeDefinitions=[\n",
    "                    {\"AttributeName\": self.DEFAULT_KEY_NAME, \"AttributeType\": \"S\"},\n",
    "                ],\n",
    "                KeySchema=[{\"AttributeName\": self.DEFAULT_KEY_NAME, \"KeyType\": \"HASH\"}],\n",
    "                BillingMode=\"PAY_PER_REQUEST\",\n",
    "            )\n",
    "            waiter = self.ddb_client.get_waiter(\"table_exists\")\n",
    "            waiter.wait(TableName=db_name, WaiterConfig={\"Delay\": 1})\n",
    "\n",
    "    def add_cache(self, session_id, content):\n",
    "        return self.ddb_client.put_item(\n",
    "            TableName=self.db_name,\n",
    "            Item={self.DEFAULT_KEY_NAME: {\"S\": session_id}, \"content\": {\"S\": content}},\n",
    "        )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8b50a6c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%writefile requirements.txt\n",
    "boto3\n",
    "transformers==4.27.2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b0142973",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%sh\n",
    "mkdir mymodel\n",
    "mv serving.properties mymodel/\n",
    "mv model.py mymodel/\n",
    "mv paginator.py mymodel/\n",
    "mv requirements.txt mymodel/\n",
    "tar czvf mymodel.tar.gz mymodel/\n",
    "rm -rf mymodel"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e58cf33",
   "metadata": {},
   "source": [
    "## Start building SageMaker endpoint\n",
    "\n",
    "### Upload artifact on S3 and create SageMaker model\n",
    "\n",
    "The tarball that we created will be sent to an s3bucket that SageMaker created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38b1e5ca",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "s3_code_prefix = \"large-model-lmi/code\"\n",
    "bucket = sess.default_bucket()  # bucket to house artifacts\n",
    "code_artifact = sess.upload_data(\"mymodel.tar.gz\", bucket, s3_code_prefix)\n",
    "print(f\"S3 Code or Model tar ball uploaded to --- > {code_artifact}\")\n",
    "\n",
    "repo_name = \"djlserving-byoc\"\n",
    "image_uri = f\"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:latest\"\n",
    "env = {\"HUGGINGFACE_HUB_CACHE\": \"/tmp\", \"TRANSFORMERS_CACHE\": \"/tmp\"}\n",
    "\n",
    "model = Model(image_uri=image_uri, model_data=code_artifact, env=env, role=role)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "004f39f6",
   "metadata": {},
   "source": [
    "### Create SageMaker endpoint\n",
    "\n",
    "Here, we use g5.2xlarge instance. The endpoint name is `lmi-model-deploy`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e0e61cd",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "instance_type = \"ml.g5.2xlarge\"\n",
    "endpoint_name = sagemaker.utils.name_from_base(\"lmi-model\")\n",
    "print(f\"endpoint_name is {endpoint_name}\")\n",
    "\n",
    "model.deploy(\n",
    "    initial_instance_count=1,\n",
    "    instance_type=instance_type,\n",
    "    endpoint_name=endpoint_name,\n",
    ")\n",
    "\n",
    "# our requests and responses will be in json format so we specify the serializer and the deserializer\n",
    "predictor = sagemaker.Predictor(\n",
    "    endpoint_name=endpoint_name,\n",
    "    sagemaker_session=sess,\n",
    "    serializer=serializers.JSONSerializer(),\n",
    "    deserializer=deserializers.JSONDeserializer(),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2fea064-cbaa-40e7-b36c-e65abd8b35c6",
   "metadata": {},
   "source": [
    "### This step can take ~ 10 min or longer so please be patient"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb63ee65",
   "metadata": {},
   "source": [
    "## Test and benchmark the inference\n",
    "\n",
    "In here, we use a SageMaker endpoint + DynamoDB simple fetcher to get the response result.\n",
    "\n",
    "- send prompt request and receive a session id\n",
    "- use session id to retrieve the streamed tokens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2bcef095",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "session_id = predictor.predict({\"prompt\": [\"Large language model is\"], \"max_new_tokens\": 256})\n",
    "\n",
    "\n",
    "def get_stream(session_id):\n",
    "    ddb_client = boto3.client(\"dynamodb\")\n",
    "    prev = 0\n",
    "    while True:\n",
    "        result = ddb_client.get_item(TableName=\"lmi_test_db\", Key={\"cache_id\": {\"S\": session_id}})\n",
    "        if \"Item\" in result:\n",
    "            text = result[\"Item\"][\"content\"][\"S\"]\n",
    "            print(text[prev:], end=\"\")\n",
    "            prev = len(text)\n",
    "            if text.endswith(\"<eos>\"):\n",
    "                break\n",
    "        time.sleep(0.1)\n",
    "\n",
    "\n",
    "get_stream(session_id[\"session_id\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5da05cbc-af43-4b36-bcff-0c6132f50b72",
   "metadata": {},
   "source": [
    "## Make this as a single RESTful endpoint service\n",
    "\n",
    "in the previous example, we just demoed how to create an endpoint and use CLI to complete inference. Now, let's build a real-world application using Lambda and API-Gateway. This API can be used to call by any client server or any web applications. Here we used an open-sourced toolkit by AWS called [Chalice](https://github.com/aws/chalice). It combines most commonly used Lambda/DynamoDB/APIGateway functions to deploy the stack easily.\n",
    "\n",
    "Chalice requires 4 major components:\n",
    "\n",
    "- `app.py`: Place to define your lambda function and related services\n",
    "- `requirements.txt`: pip wheel needed to drive the applicaiton\n",
    "- `.chalice/config.json`: a json file defines the generation logic and deployment stage\n",
    "- `.chalice/policy-<stage>.json`: a json file defines the policy that needs to attach to an IAM role of Lambda\n",
    "\n",
    "#### Note: This step requires permission on IAM role creation, Lambda creation and APIGateWay creation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb0d9078-d909-46fc-a452-ed54f35f62ad",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%pip install chalice requests --upgrade  --quiet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3cb795a3-bbec-42db-bf27-eae1607aa2ce",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "with open(\"app.py\", \"w\") as f:\n",
    "    f.write(\n",
    "        \"\"\"import boto3\n",
    "import sagemaker\n",
    "from sagemaker import serializers, deserializers\n",
    "from chalice import Chalice\n",
    "\n",
    "app = Chalice(app_name=\"stream_endpoint\")\n",
    "TABLE_NAME = \"lmi_test_db\"\n",
    "SM_ENDPOINT_NAME = \"{endpoint_name}\"\n",
    "sm_predictor = None\n",
    "\"\"\".format(\n",
    "            endpoint_name=endpoint_name\n",
    "        )\n",
    "        + \"\"\"\n",
    "@app.route(\"/query\", methods=[\"POST\"])\n",
    "def run_inference():\n",
    "    body = app.current_request.json_body\n",
    "    if \"session_id\" in body:\n",
    "        return ddb_fetcher(body[\"session_id\"])\n",
    "    elif \"prompt\" in body:\n",
    "        return get_sm_predictor().predict(body)\n",
    "    else:\n",
    "        return {\"result\": \"Error!\", \"_debug\": body}\n",
    "\n",
    "\n",
    "def ddb_fetcher(session_id):\n",
    "    ddb_client = boto3.client(\"dynamodb\")\n",
    "    result = ddb_client.get_item(TableName=TABLE_NAME, Key={\"cache_id\": {\"S\": session_id}})\n",
    "    if \"Item\" in result:\n",
    "        return {\"result\": result[\"Item\"][\"content\"][\"S\"]}\n",
    "    return {\"result\": \"\", \"_debug\": result}\n",
    "\n",
    "\n",
    "def get_sm_predictor():\n",
    "    global sm_predictor\n",
    "    if sm_predictor is None:\n",
    "        sess = sagemaker.session.Session()\n",
    "        sm_predictor = sagemaker.Predictor(\n",
    "            endpoint_name=SM_ENDPOINT_NAME,\n",
    "            sagemaker_session=sess,\n",
    "            serializer=serializers.JSONSerializer(),\n",
    "            deserializer=deserializers.JSONDeserializer(),\n",
    "        )\n",
    "    return sm_predictor\n",
    "\"\"\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ddbc81a-10c1-4435-9f64-1d5619ec2fd0",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%writefile requirements.txt\n",
    "boto3\n",
    "sagemaker"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "deb6fe20-62a4-44b8-8234-c26744146022",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%writefile config.json\n",
    "{\n",
    "    \"version\": \"2.0\",\n",
    "    \"app_name\": \"stream_endpoint\",\n",
    "    \"stages\": {\"dev\": {\"autogen_policy\": false, \"api_gateway_stage\": \"api\"}}\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eeb7f8f4-6773-42ec-8805-f613f0b9a1ce",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%writefile policy-dev.json\n",
    "{\n",
    "    \"Version\": \"2012-10-17\",\n",
    "    \"Statement\": [\n",
    "        {\n",
    "            \"Action\": [\"logs:CreateLogGroup\", \"logs:CreateLogStream\", \"logs:PutLogEvents\"],\n",
    "            \"Resource\": \"arn:aws:logs:*:*:*\",\n",
    "            \"Effect\": \"Allow\"\n",
    "        },\n",
    "        {\n",
    "            \"Action\": [\"dynamodb:GetItem\", \"dynamodb:Scan\", \"dynamodb:Query\"],\n",
    "            \"Resource\": [\"arn:aws:dynamodb:*:*:table/lmi_test_db*\"],\n",
    "            \"Effect\": \"Allow\"\n",
    "        },\n",
    "        {\n",
    "            \"Action\": [\"sagemaker:ListEndpoints\", \"sagemaker:InvokeEndpoint\"],\n",
    "            \"Resource\": [\"arn:aws:sagemaker:*:*:endpoint/lmi*\"],\n",
    "            \"Effect\": \"Allow\"\n",
    "        }\n",
    "    ]\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1fd4a5f-da1a-4e53-97b8-0dbfd40d7ba4",
   "metadata": {},
   "source": [
    "Now, let's do deployment!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb685733-646b-47bf-bff2-c72cd22eb1ea",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "rm -rf deployment/\n",
    "mkdir -p deployment/.chalice\n",
    "mv app.py deployment/\n",
    "mv requirements.txt deployment/\n",
    "mv policy-dev.json deployment/.chalice/\n",
    "mv config.json deployment/.chalice/\n",
    "cd deployment/\n",
    "chalice deploy"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd6a475c-316e-41b5-91d6-f8c26396cfdc",
   "metadata": {},
   "source": [
    "### Inference with Lambda endpoint\n",
    "Now, let's keep the above url and use it for a simple inference request directly using requests library. Remember to replace the endpoint url."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ba1b359a-d61d-432e-9a74-5c60ed51a8a1",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "lambda_endpoint = \"https://wcxj19xw4f.execute-api.us-east-1.amazonaws.com/api/\" + \"query\"\n",
    "\n",
    "headers = {\"content-type\": \"application/json\"}\n",
    "data = {\"prompt\": [\"Large language model is\"], \"max_new_tokens\": 256}\n",
    "res = requests.post(lambda_endpoint, headers=headers, json=data)\n",
    "print(f\"First inference response: {res.json()}\")\n",
    "prev = 0\n",
    "while True:\n",
    "    ddb_result = requests.post(lambda_endpoint, headers=headers, json=res.json())\n",
    "    text = ddb_result.json()[\"result\"]\n",
    "    print(text[prev:], end=\"\")\n",
    "    prev = len(text)\n",
    "    if text.endswith(\"<eos>\"):\n",
    "        break\n",
    "    time.sleep(0.1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1cd9042",
   "metadata": {},
   "source": [
    "## Clean up the environment\n",
    "\n",
    "If you have lambda and API gateway environment, do the following to clean up:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74d059e2-7f70-4908-9eed-ae00f12938de",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "cd deployment/\n",
    "chalice delete"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d297acb-ef3c-4610-985e-a638a7694a06",
   "metadata": {},
   "source": [
    "Clean up the SageMaker endpoint:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d674b41",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "sess.delete_endpoint(endpoint_name)\n",
    "sess.delete_endpoint_config(endpoint_name)\n",
    "model.delete_model()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a241d5c0-3c0d-41ca-9fbf-f58f5865ce5f",
   "metadata": {},
   "source": [
    "Delete DynamoDB table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23c07e7a-feb1-40a7-b4ac-51e50e583150",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ddb_client = boto3.client(\"dynamodb\")\n",
    "ddb_client.delete_table(TableName=\"lmi_test_db\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notebook CI Test Results\n",
    "\n",
    "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
    "\n",
    "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/inference|generativeai|llm-workshop|lab6-stream-with-pagination|stream_pagination_lmi.ipynb)\n",
    "\n",
    "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/inference|generativeai|llm-workshop|lab6-stream-with-pagination|stream_pagination_lmi.ipynb)\n",
    "\n",
    "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/inference|generativeai|llm-workshop|lab6-stream-with-pagination|stream_pagination_lmi.ipynb)\n",
    "\n",
    "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/inference|generativeai|llm-workshop|lab6-stream-with-pagination|stream_pagination_lmi.ipynb)\n",
    "\n",
    "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/inference|generativeai|llm-workshop|lab6-stream-with-pagination|stream_pagination_lmi.ipynb)\n",
    "\n",
    "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/inference|generativeai|llm-workshop|lab6-stream-with-pagination|stream_pagination_lmi.ipynb)\n",
    "\n",
    "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/inference|generativeai|llm-workshop|lab6-stream-with-pagination|stream_pagination_lmi.ipynb)\n",
    "\n",
    "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/inference|generativeai|llm-workshop|lab6-stream-with-pagination|stream_pagination_lmi.ipynb)\n",
    "\n",
    "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/inference|generativeai|llm-workshop|lab6-stream-with-pagination|stream_pagination_lmi.ipynb)\n",
    "\n",
    "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/inference|generativeai|llm-workshop|lab6-stream-with-pagination|stream_pagination_lmi.ipynb)\n",
    "\n",
    "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/inference|generativeai|llm-workshop|lab6-stream-with-pagination|stream_pagination_lmi.ipynb)\n",
    "\n",
    "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/inference|generativeai|llm-workshop|lab6-stream-with-pagination|stream_pagination_lmi.ipynb)\n",
    "\n",
    "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/inference|generativeai|llm-workshop|lab6-stream-with-pagination|stream_pagination_lmi.ipynb)\n",
    "\n",
    "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/inference|generativeai|llm-workshop|lab6-stream-with-pagination|stream_pagination_lmi.ipynb)\n",
    "\n",
    "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/inference|generativeai|llm-workshop|lab6-stream-with-pagination|stream_pagination_lmi.ipynb)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_pytorch_p39",
   "language": "python",
   "name": "conda_pytorch_p39"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}