{ "cells": [ { "cell_type": "markdown", "id": "e87dc259", "metadata": {}, "source": [ "# Semantic Search with OpenSearch Neural Search " ] }, { "cell_type": "markdown", "id": "b0cfd51d", "metadata": {}, "source": [ "We will use Neural Search plugin in OpenSearch to implement semantic search" ] }, { "cell_type": "markdown", "id": "31703e3d", "metadata": {}, "source": [ "### 1. Check PyTorch Version\n" ] }, { "cell_type": "markdown", "id": "9ac12126", "metadata": {}, "source": [ "As in the previous modules, let's import PyTorch and confirm that have have the latest version of PyTorch. The version should already be 1.10.2 or higher. If not, please run the lab in order to get everything set up." ] }, { "cell_type": "code", "execution_count": null, "id": "0b532987", "metadata": {}, "outputs": [], "source": [ "import torch\n", "print(torch.__version__)" ] }, { "cell_type": "markdown", "id": "f2f1cc51", "metadata": {}, "source": [ "### 2. Retrieve notebook variables\n", "\n", "The line below will retrieve your shared variables from the previous notebook." ] }, { "cell_type": "code", "execution_count": null, "id": "18a0e06e", "metadata": {}, "outputs": [], "source": [ "%store -r" ] }, { "cell_type": "markdown", "id": "4a3fa4b0", "metadata": {}, "source": [ "### 3. Install OpenSearch ML Python library" ] }, { "cell_type": "code", "execution_count": null, "id": "58a1c491", "metadata": {}, "outputs": [], "source": [ "!pip install opensearch-py-ml\n", "!pip install accelerate" ] }, { "cell_type": "markdown", "id": "05c00375", "metadata": {}, "source": [ "Now we need to restart the kernel by running below cell." ] }, { "cell_type": "code", "execution_count": null, "id": "f94df946", "metadata": {}, "outputs": [], "source": [ "from IPython.display import display_html\n", "def restartkernel() :\n", " display_html(\"\",raw=True)\n", "restartkernel()" ] }, { "cell_type": "markdown", "id": "0aa614bc", "metadata": {}, "source": [ "### 4. 
Import library\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "1688f4e9", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import re\n", "import time" ] }, { "cell_type": "markdown", "id": "c6607721", "metadata": {}, "source": [ "### 5. Prepare Headset PQA data\n", "We have already downloaded the dataset in Module 2, so let's start by ingesting 1000 rows of the data into a Pandas data frame. \n", "\n", "Before we can run any queries, we need to download the Amazon Product Question and Answer data from: https://registry.opendata.aws/amazon-pqa/" ] }, { "cell_type": "code", "execution_count": null, "id": "58fca957", "metadata": {}, "outputs": [], "source": [ "!aws s3 cp --no-sign-request s3://amazon-pqa/amazon_pqa_headsets.json ./amazon-pqa/amazon_pqa_headsets.json" ] }, { "cell_type": "code", "execution_count": null, "id": "9b1cf47b", "metadata": {}, "outputs": [], "source": [ "import json\n", "import pandas as pd\n", "\n", "def load_pqa(file_name, number_rows=1000):\n", "    # Read one JSON record per line, keeping each question and its first answer\n", "    df = pd.DataFrame(columns=('question', 'answer'))\n", "    with open(file_name) as f:\n", "        i = 0\n", "        for line in f:\n", "            data = json.loads(line)\n", "            df.loc[i] = [data['question_text'], data['answers'][0]['answer_text']]\n", "            i += 1\n", "            if i == number_rows:\n", "                break\n", "    return df\n", "\n", "\n", "qa_list = load_pqa('amazon-pqa/amazon_pqa_headsets.json', number_rows=1000)\n", "\n" ] }, { "cell_type": "markdown", "id": "2f54a349", "metadata": {}, "source": [ "### 6. Create an OpenSearch cluster connection.\n", "Next, we'll use the Python API to set up a connection with the OpenSearch cluster.\n", "\n", "Note: if you're using a region other than us-east-1, please update the region in the code below.\n", "\n", "#### Get CloudFormation stack output variables\n", "\n", "We also need to grab some key values from the infrastructure we provisioned using CloudFormation. 
To do this, we will list the outputs from the stack and store this in \"outputs\" to be used later.\n", "\n", "You can ignore any \"PythonDeprecationWarning\" warnings." ] }, { "cell_type": "code", "execution_count": null, "id": "81dc45a2", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "\n", "cfn = boto3.client('cloudformation')\n", "\n", "def get_cfn_outputs(stackname):\n", " outputs = {}\n", " for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:\n", " outputs[output['OutputKey']] = output['OutputValue']\n", " return outputs\n", "\n", "## Setup variables to use for the rest of the demo\n", "cloudformation_stack_name = \"semantic-search\"\n", "\n", "outputs = get_cfn_outputs(cloudformation_stack_name)\n", "\n", "bucket = outputs['s3BucketTraining']\n", "aos_host = outputs['OpenSearchDomainEndpoint']\n", "\n", "outputs" ] }, { "cell_type": "code", "execution_count": null, "id": "405e0e52", "metadata": {}, "outputs": [], "source": [ "from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth\n", "import boto3\n", "\n", "region = 'us-east-1' \n", "\n", "#credentials = boto3.Session().get_credentials()\n", "#auth = AWSV4SignerAuth(credentials, region)\n", "auth = (\"master\",\"Semantic123!\")\n", "index_name = 'nlp_pqa'\n", "\n", "aos_client = OpenSearch(\n", " hosts = [{'host': aos_host, 'port': 443}],\n", " http_auth = auth,\n", " use_ssl = True,\n", " verify_certs = True,\n", " connection_class = RequestsHttpConnection\n", ")" ] }, { "cell_type": "markdown", "id": "0173ff23", "metadata": {}, "source": [ "### 7. 
Configure the OpenSearch domain to allow running machine learning code on data nodes" ] }, { "cell_type": "code", "execution_count": null, "id": "9e4080d5", "metadata": {}, "outputs": [], "source": [ "s = b'{\"transient\":{\"plugins.ml_commons.only_run_on_ml_node\": false}}'\n", "aos_client.cluster.put_settings(body=s)" ] }, { "cell_type": "markdown", "id": "e53a9cd9", "metadata": {}, "source": [ "Verify that `plugins.ml_commons.only_run_on_ml_node` is set to `false`." ] }, { "cell_type": "code", "execution_count": null, "id": "cce6a646", "metadata": {}, "outputs": [], "source": [ "aos_client.cluster.get_settings(flat_settings=True)" ] }, { "cell_type": "markdown", "id": "ba0742cd", "metadata": {}, "source": [ "### 8. Download pre-trained BERT model" ] }, { "cell_type": "code", "execution_count": null, "id": "8bab973e", "metadata": {}, "outputs": [], "source": [ "import os\n", "import urllib.request\n", "\n", "# Make sure the target folder exists before downloading\n", "os.makedirs('model', exist_ok=True)\n", "urllib.request.urlretrieve('https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true', 'model/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip')\n" ] }, { "cell_type": "markdown", "id": "19eacd3c", "metadata": {}, "source": [ "Verify that the model was downloaded successfully into the `model` folder." ] }, { "cell_type": "code", "execution_count": null, "id": "24144556", "metadata": {}, "outputs": [], "source": [ "!ls -al model" ] }, { "cell_type": "markdown", "id": "62421acd", "metadata": {}, "source": [ "### 9. 
Upload BERT model to OpenSearch domain" ] }, { "cell_type": "code", "execution_count": null, "id": "7cfc28ff", "metadata": {}, "outputs": [], "source": [ "from opensearch_py_ml.ml_models import SentenceTransformerModel\n", "from opensearch_py_ml.ml_commons import MLCommonClient\n", "\n", "ml_client = MLCommonClient(aos_client)\n", "model_path = './model/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip'\n", "model_config_path = './model/all-MiniLM-L6-v2_torchscript.json'\n", "\n", "model_id = ml_client.upload_model(model_path, model_config_path, isVerbose=True)\n", "\n", "print(\"model id: \" + model_id)" ] }, { "cell_type": "markdown", "id": "5dc9d104", "metadata": {}, "source": [ "### 10. Load the model for inference." ] }, { "cell_type": "code", "execution_count": null, "id": "52cba357", "metadata": {}, "outputs": [], "source": [ "load_model_output = ml_client.load_model(model_id)\n", "\n", "print(load_model_output)\n", "task_id = load_model_output['task_id']" ] }, { "cell_type": "markdown", "id": "98f69aea", "metadata": {}, "source": [ "Get the detailed task information." ] }, { "cell_type": "code", "execution_count": null, "id": "510eea27", "metadata": {}, "outputs": [], "source": [ "task_info = ml_client.get_task_info(task_id)\n", "\n", "print(task_info)" ] }, { "cell_type": "markdown", "id": "1c54be9c", "metadata": {}, "source": [ "Get the detailed model information." ] }, { "cell_type": "code", "execution_count": null, "id": "1211a76d", "metadata": {}, "outputs": [], "source": [ "model_info = ml_client.get_model_info(model_id)\n", "\n", "print(model_info)" ] }, { "cell_type": "markdown", "id": "3625b5cf", "metadata": {}, "source": [ "### 11. Create pipeline to convert text into vector with BERT model\n", "We will use the model we just uploaded to convert the `question` field into a vector (embedding) and store it in the `question_vector` field."
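, "\n", "\n", "As a rough illustration of what the `text_embedding` processor does (a sketch of the idea, not the plugin's actual implementation), each entry in `field_map` tells the pipeline to embed a source field and store the result under the mapped field name. Here `embed` is a hypothetical stand-in for the uploaded model:\n", "\n", "```python\n", "def apply_text_embedding(doc, field_map, embed):\n", "    # Write the embedding of each mapped source field into its target field\n", "    for src, dst in field_map.items():\n", "        if src in doc:\n", "            doc[dst] = embed(doc[src])\n", "    return doc\n", "```"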
] }, { "cell_type": "code", "execution_count": null, "id": "dc810643", "metadata": {}, "outputs": [], "source": [ "pipeline={\n", " \"description\": \"An example neural search pipeline\",\n", " \"processors\" : [\n", " {\n", " \"text_embedding\": {\n", " \"model_id\": model_id,\n", " \"field_map\": {\n", " \"question\": \"question_vector\"\n", " }\n", " }\n", " }\n", " ]\n", "}\n", "pipeline_id = 'nlp_pipeline'\n", "aos_client.ingest.put_pipeline(id=pipeline_id,body=pipeline)" ] }, { "cell_type": "markdown", "id": "c431a804", "metadata": {}, "source": [ "Verify pipeline is created succefuflly." ] }, { "cell_type": "code", "execution_count": null, "id": "13ff2f91", "metadata": {}, "outputs": [], "source": [ "aos_client.ingest.get_pipeline(id=pipeline_id)" ] }, { "cell_type": "markdown", "id": "beaabc1e", "metadata": {}, "source": [ "### 12. Create a index in Amazon Opensearch Service \n", "Whereas we previously created an index with 2 fields, this time we'll define the index with 3 fields: the first field ' question_vector' holds the vector representation of the question, the second is the \"question\" for raw sentence and the third field is \"answer\" for the raw answer data.\n", "\n", "To create the index, we first define the index in JSON, then use the aos_client connection we initiated ealier to create the index in OpenSearch." 
] }, { "cell_type": "code", "execution_count": null, "id": "5eba5754", "metadata": {}, "outputs": [], "source": [ "knn_index = {\n", " \"settings\": {\n", " \"index.knn\": True,\n", " \"index.knn.space_type\": \"cosinesimil\",\n", " \"default_pipeline\": pipeline_id,\n", " \"analysis\": {\n", " \"analyzer\": {\n", " \"default\": {\n", " \"type\": \"standard\",\n", " \"stopwords\": \"_english_\"\n", " }\n", " }\n", " }\n", " },\n", " \"mappings\": {\n", " \"properties\": {\n", " \"question_vector\": {\n", " \"type\": \"knn_vector\",\n", " \"dimension\": 384,\n", " \"method\": {\n", " \"name\": \"hnsw\",\n", " \"space_type\": \"l2\",\n", " \"engine\": \"faiss\"\n", " },\n", " \"store\": True\n", " },\n", " \"question\": {\n", " \"type\": \"text\",\n", " \"store\": True\n", " },\n", " \"answer\": {\n", " \"type\": \"text\",\n", " \"store\": True\n", " }\n", " }\n", " }\n", "}\n" ] }, { "cell_type": "markdown", "id": "1330502a", "metadata": {}, "source": [ "If for any reason you need to recreate your dataset, you can uncomment and execute the following to delete any previously created indexes. If this is the first time you're running this, you can skip this step." 
] }, { "cell_type": "code", "execution_count": null, "id": "a835b9fb", "metadata": {}, "outputs": [], "source": [ "#aos_client.indices.delete(index=\"nlp_pqa\")\n" ] }, { "cell_type": "markdown", "id": "c6de634d", "metadata": {}, "source": [ "Using the above index definition, we now need to create the index in Amazon OpenSearch" ] }, { "cell_type": "code", "execution_count": null, "id": "715b751d", "metadata": {}, "outputs": [], "source": [ "aos_client.indices.create(index=\"nlp_pqa\",body=knn_index,ignore=400)\n" ] }, { "cell_type": "markdown", "id": "a7007735", "metadata": {}, "source": [ "Let's verify the created index information" ] }, { "cell_type": "code", "execution_count": null, "id": "1f71659d", "metadata": {}, "outputs": [], "source": [ "aos_client.indices.get(index=\"nlp_pqa\")" ] }, { "cell_type": "markdown", "id": "0040992c", "metadata": {}, "source": [ "### 13. Load the raw data into the Index\n", "Next, let's load the headset enhanced PQA data into the index we've just created. During ingest data, `question` field will also be converted to vector(embedding) by the `nlp_pipeline` we defined." ] }, { "cell_type": "code", "execution_count": null, "id": "7e55e6a6", "metadata": {}, "outputs": [], "source": [ "i = 0\n", "for c in qa_list[\"question\"].tolist():\n", " content=c\n", " answer=qa_list[\"answer\"][i]\n", " i+=1\n", " aos_client.index(index='nlp_pqa',body={\"question\": content,\"answer\":answer})" ] }, { "cell_type": "markdown", "id": "67fad674", "metadata": {}, "source": [ "To validate the load, we'll query the number of documents number in the index. We should have 1000 hits in the index." ] }, { "cell_type": "code", "execution_count": null, "id": "05ed0b71", "metadata": {}, "outputs": [], "source": [ "res = aos_client.search(index=\"nlp_pqa\", body={\"query\": {\"match_all\": {}}})\n", "print(\"Records found: %d.\" % res['hits']['total']['value'])\n" ] }, { "cell_type": "markdown", "id": "de9b827c", "metadata": {}, "source": [ "### 14. 
Search vector with \"Semantic Search\" \n", "\n", "We can search the data with neural search.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "5c5f4e81", "metadata": {}, "outputs": [], "source": [ "query={\n", " \"_source\": {\n", " \"exclude\": [ \"question_vector\" ]\n", " },\n", " \"size\": 30,\n", " \"query\": {\n", " \"neural\": {\n", " \"question_vector\": {\n", " \"query_text\": \"does this work with xbox?\",\n", " \"model_id\": model_id,\n", " \"k\": 30\n", " }\n", " }\n", " }\n", "}\n", "\n", "res = aos_client.search(index=\"nlp_pqa\", \n", " body=query,\n", " stored_fields=[\"question\",\"answer\"])\n", "print(\"Got %d Hits:\" % res['hits']['total']['value'])\n", "query_result=[]\n", "for hit in res['hits']['hits']:\n", " row=[hit['_id'],hit['_score'],hit['_source']['question'],hit['_source']['answer']]\n", " query_result.append(row)\n", "\n", "query_result_df = pd.DataFrame(data=query_result,columns=[\"_id\",\"_score\",\"question\",\"answer\"])\n", "display(query_result_df)" ] }, { "cell_type": "markdown", "id": "0abddaa4", "metadata": {}, "source": [ "### 15. Search the same query with \"Text Search\"\n", "\n", "Let's repeat the same query with a keyword search and compare the differences." 
] }, { "cell_type": "code", "execution_count": null, "id": "8c652c52", "metadata": {}, "outputs": [], "source": [ "query={\n", " \"size\": 30,\n", " \"query\": {\n", " \"match\": {\n", " \"question\":\"does this work with xbox?\"\n", " }\n", " }\n", "}\n", "\n", "res = aos_client.search(index=\"nlp_pqa\", \n", " body=query,\n", " stored_fields=[\"question\",\"answer\"])\n", "#print(\"Got %d Hits:\" % res['hits']['total']['value'])\n", "query_result=[]\n", "for hit in res['hits']['hits']:\n", " row=[hit['_id'],hit['_score'],hit['fields']['question'][0],hit['fields']['answer'][0]]\n", " query_result.append(row)\n", "\n", "query_result_df = pd.DataFrame(data=query_result,columns=[\"_id\",\"_score\",\"question\",\"answer\"])\n", "display(query_result_df)" ] }, { "cell_type": "markdown", "id": "fb777d3d", "metadata": {}, "source": [ "### 16. Observe The Results\n", "\n", "Compare the first few records in the two searches above. For the Semantic search, the first 10 or so results are very similar to our input questions, as we expect. Compare this to keyword search, where the results quickly start to deviate from our search query (e.g. \"it shows xbox 360. Does it work for ps3 as well?\" - this matches on keywords but has a different meaning).\n", "\n", "You can also use \"Compare search results\" in Search relevance plugin to compare search relevance side by side. Please refer the lab \"Option 2: OpenSearch Dashboard Dev Tools\" to compare search results." ] }, { "cell_type": "markdown", "id": "9c8d80ff", "metadata": {}, "source": [ "### 17. Summary\n", "With OpenSearch Neural Search plugin, embedding is automatically generated with model we uploaded. We don't need care about inference pipeline anymore. It makes the semantic search solution simple to develop and maintain. 
" ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p310", "language": "python", "name": "conda_pytorch_p310" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" } }, "nbformat": 4, "nbformat_minor": 5 }