# Semantic Search with OpenSearch Neural Search 

We will use the Neural Search plugin in OpenSearch to implement semantic search.

### 1. Check PyTorch Version


As in the previous modules, let's import PyTorch and confirm that we have a recent enough version. The version should already be 1.10.2 or higher. If not, please run the lab modules in order to get everything set up.

In [None]:
import torch
print(torch.__version__)
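
The version check can also be done in code rather than by eye. The following is a minimal sketch: the helper name `meets_minimum` is hypothetical, and it assumes version strings shaped like `1.13.1+cu117`, where anything after `+` is a build tag that can be ignored.

```python
import re

def meets_minimum(version, minimum="1.10.2"):
    """Return True if `version` is at least `minimum` (numeric parts only)."""
    def parse(v):
        # Drop a local build tag such as "+cu117", then keep the leading
        # digits of each dot-separated part (for example "0a0" -> 0).
        core = v.split("+", 1)[0]
        parts = []
        for piece in core.split(".")[:3]:
            match = re.match(r"\d+", piece)
            parts.append(int(match.group()) if match else 0)
        # Pad to three components so "2.0" compares as (2, 0, 0).
        return tuple(parts + [0] * (3 - len(parts)))
    return parse(version) >= parse(minimum)

# With PyTorch imported, this could be called as meets_minimum(torch.__version__).
print(meets_minimum("1.13.1+cu117"))
print(meets_minimum("1.9.0"))
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where "1.9.0" would sort after "1.10.2".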

### 2. Retrieve notebook variables

The line below will retrieve your shared variables from the previous notebook.

In [None]:
%store -r

### 3. Install OpenSearch ML Python library

In [None]:
!pip install opensearch-py-ml
!pip install accelerate

Now we need to restart the kernel by running the cell below.

In [None]:
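from IPython.display import display_html
def restartkernel() :
    # Ask the classic Jupyter Notebook front end to restart the kernel.
    display_html("<script>Jupyter.notebook.kernel.restart()</script>",raw=True)
restartkernel()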

### 4. Import library



In [None]:
import boto3
import re
import time

### 5. Prepare Headset PQA data
This module uses the same headset PQA dataset as Module 2. If the file is not already available locally, download the Amazon Product Question and Answer data (https://registry.opendata.aws/amazon-pqa/) with the command below before running any queries.

We will then load the first 1000 rows of the data into a Pandas data frame.

In [None]:
!aws s3 cp --no-sign-request s3://amazon-pqa/amazon_pqa_headsets.json ./amazon-pqa/amazon_pqa_headsets.json

In [None]:
import json
import pandas as pd

def load_pqa(file_name,number_rows=1000):
    # Read up to number_rows JSON lines, keeping each question with its first answer.
    df = pd.DataFrame(columns=('question', 'answer'))
    with open(file_name) as f:
        i=0
        for line in f:
            data = json.loads(line)
            df.loc[i] = [data['question_text'],data['answers'][0]['answer_text']]
            i+=1
            if(i == number_rows):
                break
    return df


qa_list = load_pqa('amazon-pqa/amazon_pqa_headsets.json',number_rows=1000)
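
The loader above indexes `data['answers'][0]`, which raises an `IndexError` for any record whose answer list is empty. The following is a minimal sketch of a more defensive reader, assuming the same JSON-lines layout (`question_text`, `answers[*].answer_text`). The helper name `iter_qa_pairs` and the sample records are made up for illustration.

```python
import io
import json

def iter_qa_pairs(lines, number_rows=1000):
    """Yield up to `number_rows` (question, answer) pairs from JSON lines."""
    produced = 0
    for line in lines:
        if produced >= number_rows:
            break
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        data = json.loads(line)
        answers = data.get("answers") or []
        if not answers:
            continue  # skip questions that have no answer
        yield data["question_text"], answers[0]["answer_text"]
        produced += 1

# Made-up sample records, only to exercise the function.
sample = io.StringIO(
    '{"question_text": "Does it have a mic?", "answers": [{"answer_text": "Yes"}]}\n'
    '{"question_text": "Is it wireless?", "answers": []}\n'
    '{"question_text": "Works with PC?", "answers": [{"answer_text": "It does"}]}\n'
)
pairs = list(iter_qa_pairs(sample))
print(pairs)
```

The pairs could then be passed to `pd.DataFrame(pairs, columns=["question", "answer"])` to build the same data frame shape as `load_pqa`.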



### 6. Create an OpenSearch cluster connection.
Next, we'll use the Python API to set up a connection to the OpenSearch cluster.

Note: if you're using a region other than us-east-1, please update the region in the code below.

#### Get CloudFormation stack output variables

We also need to grab some key values from the infrastructure we provisioned using CloudFormation. To do this, we will list the outputs from the stack and store them in `outputs` for later use.

You can ignore any "PythonDeprecationWarning" warnings.

In [None]:
import boto3

cfn = boto3.client('cloudformation')

def get_cfn_outputs(stackname):
    # Collect the stack outputs into a {OutputKey: OutputValue} dictionary.
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

## Setup variables to use for the rest of the demo
cloudformation_stack_name = "semantic-search"

outputs = get_cfn_outputs(cloudformation_stack_name)

bucket = outputs['s3BucketTraining']
aos_host = outputs['OpenSearchDomainEndpoint']

outputs
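
If the stack is missing one of the expected outputs, the lookups above fail with a bare `KeyError`. The following is a minimal sketch of a clearer check, assuming the `Outputs` list has the `OutputKey`/`OutputValue` shape used above. The helper names and the placeholder values are hypothetical.

```python
def outputs_to_dict(stack_outputs):
    """Flatten a list of {'OutputKey', 'OutputValue'} entries into a dict."""
    return {item["OutputKey"]: item["OutputValue"] for item in stack_outputs}

def require_outputs(outputs, required_keys):
    """Raise a readable error if any expected stack output is missing."""
    missing = [key for key in required_keys if key not in outputs]
    if missing:
        raise KeyError("Missing stack outputs: " + ", ".join(missing))
    return {key: outputs[key] for key in required_keys}

# Placeholder values; a real stack returns its own bucket name and endpoint.
sample_outputs = [
    {"OutputKey": "s3BucketTraining", "OutputValue": "example-bucket"},
    {"OutputKey": "OpenSearchDomainEndpoint", "OutputValue": "example-endpoint"},
]
outputs_example = outputs_to_dict(sample_outputs)
print(require_outputs(outputs_example, ["s3BucketTraining", "OpenSearchDomainEndpoint"]))
```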

In [None]:
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
import boto3

region = 'us-east-1' 

#credentials = boto3.Session().get_credentials()
#auth = AWSV4SignerAuth(credentials, region)
auth = ("master","Semantic123!")
index_name = 'nlp_pqa'

aos_client = OpenSearch(
    hosts = [{'host': aos_host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)

### 7. Configure the OpenSearch domain to run machine learning code on data nodes

In [None]:
s = b'{"transient":{"plugins.ml_commons.only_run_on_ml_node": false}}'
aos_client.cluster.put_settings(body=s)

Verify that `plugins.ml_commons.only_run_on_ml_node` is set to false.

In [None]:
aos_client.cluster.get_settings(flat_settings=True)
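
Rather than reading the full settings output, the check can be narrowed to the one key of interest. The following is a minimal sketch, assuming the response has top-level `transient` and `persistent` sections keyed by flat setting names. The helper name and the sample response are hypothetical.

```python
def ml_runs_on_data_nodes(settings):
    """Return True if only_run_on_ml_node is explicitly set to false."""
    key = "plugins.ml_commons.only_run_on_ml_node"
    # Transient settings take precedence over persistent ones.
    for scope in ("transient", "persistent"):
        value = settings.get(scope, {}).get(key)
        if value is not None:
            # Values may come back as strings ("false") or booleans (False).
            return str(value).lower() == "false"
    return False  # setting absent: the cluster default applies

# Sample response shape, for illustration only.
sample_settings = {
    "persistent": {},
    "transient": {"plugins.ml_commons.only_run_on_ml_node": "false"},
}
print(ml_runs_on_data_nodes(sample_settings))
```

With a live connection, the same function could be given the result of `aos_client.cluster.get_settings(flat_settings=True)`.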

### 8. Download pre-trained BERT model

In [None]:
import urllib.request
urllib.request.urlretrieve('https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true', 'model/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip')


Verify that the model was downloaded successfully into the `model` folder.

In [None]:
!ls -al model

### 9. Upload BERT model to OpenSearch domain

In [None]:
from opensearch_py_ml.ml_models import SentenceTransformerModel
from opensearch_py_ml.ml_commons import MLCommonClient

ml_client = MLCommonClient(aos_client)
model_path = './model/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip'
model_config_path = './model/all-MiniLM-L6-v2_torchscript.json'


model_id=ml_client.upload_model(model_path, model_config_path, isVerbose=True)

print("model id:" + model_id)

### 10. Load the model for inference.

In [None]:
load_model_output = ml_client.load_model(model_id)

print(load_model_output)
task_id = load_model_output['task_id']

Get detailed information about the task.

In [None]:
task_info = ml_client.get_task_info(task_id)

print(task_info)
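
Loading a model is asynchronous, so a single call to `get_task_info` may still report an unfinished task. The following is a minimal polling sketch. It assumes the task info dictionary carries a `state` field with values such as `COMPLETED` or `FAILED`; the helper name `wait_for_task` and the stand-in function are hypothetical.

```python
import time

def wait_for_task(get_task_info, task_id, timeout_s=300, poll_s=5, sleep=time.sleep):
    """Poll `get_task_info(task_id)` until the task finishes or time runs out."""
    waited = 0
    while True:
        info = get_task_info(task_id)
        state = info.get("state")
        if state == "COMPLETED":
            return info
        if state in ("FAILED", "COMPLETED_WITH_ERROR", "CANCELLED"):
            raise RuntimeError(f"Task {task_id} ended in state {state}: {info}")
        if waited >= timeout_s:
            raise TimeoutError(f"Task {task_id} still {state} after {timeout_s}s")
        sleep(poll_s)
        waited += poll_s

# Stand-in for ml_client.get_task_info, returning canned states.
_states = iter(["CREATED", "RUNNING", "COMPLETED"])
def fake_get_task_info(task_id):
    return {"task_id": task_id, "state": next(_states)}

result = wait_for_task(fake_get_task_info, "demo-task", sleep=lambda s: None)
print(result["state"])
```

In the notebook, the call would look like `wait_for_task(ml_client.get_task_info, task_id)`.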

Get detailed information about the model.

In [None]:
model_info = ml_client.get_model_info(model_id)

print(model_info)

### 11. Create pipeline to convert text into vector with BERT model
We will use the model we just uploaded to convert the `question` field into a vector (embedding) and store it in the `question_vector` field.

In [None]:
pipeline={
    "description": "An example neural search pipeline",
    "processors" : [
        {
            "text_embedding": {
                "model_id": model_id,
                "field_map": {
                    "question": "question_vector"
                }
            }
        }
    ]
}
pipeline_id = 'nlp_pipeline'
aos_client.ingest.put_pipeline(id=pipeline_id,body=pipeline)

Verify that the pipeline was created successfully.

In [None]:
aos_client.ingest.get_pipeline(id=pipeline_id)
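
Conceptually, the `field_map` in the pipeline tells the processor which source field to embed and where to write the result. The following sketch imitates that mapping in plain Python. It is an illustration only, not the plugin's implementation, and `toy_embed` is a made-up stand-in for the real model.

```python
def apply_field_map(document, field_map, embed):
    """Return a copy of `document` with an embedding added per mapped field."""
    enriched = dict(document)
    for source_field, vector_field in field_map.items():
        if source_field in document:
            enriched[vector_field] = embed(document[source_field])
    return enriched

def toy_embed(text):
    # Toy 3-number "embedding"; the real model output has 384 dimensions.
    return [float(len(text)), float(text.count(" ")), float(text.count("?"))]

doc = {"question": "does this work with xbox?", "answer": "yes"}
enriched_doc = apply_field_map(doc, {"question": "question_vector"}, toy_embed)
print(enriched_doc["question_vector"])
```

The original fields are left untouched, which is why the indexed documents still contain the raw `question` and `answer` text alongside the vector.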

### 12. Create an index in Amazon OpenSearch Service
Whereas we previously created an index with 2 fields, this time we'll define the index with 3 fields: the first field, `question_vector`, holds the vector representation of the question; the second, `question`, holds the raw question text; and the third, `answer`, holds the raw answer text.

To create the index, we first define the index in JSON, then use the `aos_client` connection we initiated earlier to create the index in OpenSearch.

In [None]:
knn_index = {
    "settings": {
        "index.knn": True,
        "index.knn.space_type": "cosinesimil",
        "default_pipeline": pipeline_id,
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "standard",
                    "stopwords": "_english_"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "question_vector": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",
                    "engine": "faiss"
                },
                "store": True
            },
            "question": {
                "type": "text",
                "store": True
            },
            "answer": {
                "type": "text",
                "store": True
            }
        }
    }
}


If for any reason you need to recreate your dataset, you can uncomment and execute the following to delete any previously created indexes. If this is the first time you're running this, you can skip this step.

In [None]:
#aos_client.indices.delete(index="nlp_pqa")


Using the above index definition, we now need to create the index in Amazon OpenSearch Service.

In [None]:
aos_client.indices.create(index="nlp_pqa",body=knn_index,ignore=400)


Let's verify the information for the index we created.

In [None]:
aos_client.indices.get(index="nlp_pqa")

### 13. Load the raw data into the Index
Next, let's load the headset PQA data into the index we've just created. During ingestion, the `question` field will also be converted to a vector (embedding) by the `nlp_pipeline` we defined.

In [None]:
# Index each question/answer pair; the default pipeline adds question_vector at ingest time.
for content, answer in zip(qa_list["question"].tolist(), qa_list["answer"].tolist()):
    aos_client.index(index='nlp_pqa',body={"question": content,"answer":answer})
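
Indexing one document per request is simple but needs 1000 round trips. A common alternative is the bulk helper in `opensearch-py`. The following is a minimal sketch of preparing the bulk actions; the helper name `build_bulk_actions` and the sample rows are hypothetical, and the actual bulk call is shown only as a comment because it needs a live cluster.

```python
def build_bulk_actions(rows, index_name="nlp_pqa"):
    """Yield one bulk 'index' action per (question, answer) pair."""
    for question, answer in rows:
        yield {
            "_op_type": "index",
            "_index": index_name,
            "_source": {"question": question, "answer": answer},
        }

# Made-up rows, only to show the shape of the generated actions.
sample_rows = [("Does it have a mic?", "Yes"), ("Is it wireless?", "No")]
actions = list(build_bulk_actions(sample_rows))
print(len(actions), actions[0]["_index"])

# With a live cluster, the actions could be sent in one call, for example:
#   from opensearchpy import helpers
#   helpers.bulk(aos_client, build_bulk_actions(zip(qa_list["question"], qa_list["answer"])))
```

Because `default_pipeline` is set on the index, documents sent this way should pass through the same embedding pipeline as documents indexed one at a time.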

To validate the load, we'll query the number of documents in the index. We should have 1000 hits in the index.

In [None]:
res = aos_client.search(index="nlp_pqa", body={"query": {"match_all": {}}})
print("Records found: %d." % res['hits']['total']['value'])


### 14. Search vector with "Semantic Search" 

We can search the data with neural search.


In [None]:
query={
    "_source": {
        "excludes": [ "question_vector" ]
    },
    "size": 30,
    "query": {
        "neural": {
            "question_vector": {
                "query_text": "does this work with xbox?",
                "model_id": model_id,
                "k": 30
            }
        }
    }
}

res = aos_client.search(index="nlp_pqa", 
 body=query,
 stored_fields=["question","answer"])
print("Got %d Hits:" % res['hits']['total']['value'])
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['question'],hit['_source']['answer']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
display(query_result_df)
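
Search hits can carry their values either under `_source` or, when stored fields are returned, under `fields` as single-item lists. The following is a minimal sketch of one helper that flattens both shapes into rows. The name `hits_to_rows` and the sample response are made up for illustration.

```python
def hits_to_rows(response):
    """Flatten search hits into [_id, _score, question, answer] rows."""
    rows = []
    for hit in response["hits"]["hits"]:
        if "_source" in hit:
            question = hit["_source"].get("question")
            answer = hit["_source"].get("answer")
        else:
            # Stored fields come back under "fields" as single-item lists.
            fields = hit.get("fields", {})
            question = (fields.get("question") or [None])[0]
            answer = (fields.get("answer") or [None])[0]
        rows.append([hit["_id"], hit["_score"], question, answer])
    return rows

# Made-up response containing one hit of each shape.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_score": 0.9, "_source": {"question": "q1", "answer": "a1"}},
            {"_id": "2", "_score": 0.5, "fields": {"question": ["q2"], "answer": ["a2"]}},
        ]
    }
}
print(hits_to_rows(sample_response))
```

The resulting rows can be passed to `pd.DataFrame(rows, columns=["_id", "_score", "question", "answer"])` as in the cells above.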

### 15. Search the same query with "Text Search"

Let's repeat the same query with a keyword search and compare the differences.

In [None]:
query={
    "size": 30,
    "query": {
        "match": {
            "question":"does this work with xbox?"
        }
    }
}

res = aos_client.search(index="nlp_pqa", 
 body=query,
 stored_fields=["question","answer"])
#print("Got %d Hits:" % res['hits']['total']['value'])
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['fields']['question'][0],hit['fields']['answer'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
display(query_result_df)

### 16. Observe The Results

Compare the first few records in the two searches above. For the semantic search, the first 10 or so results are very similar to our input question, as we expect. Compare this to keyword search, where the results quickly start to deviate from our search query (e.g. "it shows xbox 360. Does it work for ps3 as well?" - this matches on keywords but has a different meaning).
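
The comparison can also be quantified by looking at how many document IDs the two result lists share. The following is a minimal sketch; the helper name `compare_rankings` and the sample IDs are hypothetical, and it assumes the `_id` values from each search were saved into separate lists, since `query_result_df` is overwritten by the second search.

```python
def compare_rankings(semantic_ids, keyword_ids, top_n=10):
    """Summarize how much two ranked ID lists agree in their top `top_n`."""
    top_semantic = semantic_ids[:top_n]
    top_keyword = keyword_ids[:top_n]
    keyword_set = set(top_keyword)
    shared = [doc_id for doc_id in top_semantic if doc_id in keyword_set]
    union = set(top_semantic) | keyword_set
    return {
        "shared": shared,
        "only_semantic": [d for d in top_semantic if d not in shared],
        "only_keyword": [d for d in top_keyword if d not in shared],
        # Jaccard overlap: shared results divided by all distinct results.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }

# Made-up IDs, only to show the output shape.
summary = compare_rankings(["a", "b", "c", "d"], ["b", "e", "a", "f"], top_n=3)
print(summary)
```

A low overlap is not a problem in itself; it simply shows that the two search methods surface different documents for the same query.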

You can also use "Compare search results" in the Search Relevance plugin to compare search relevance side by side. Please refer to the lab "Option 2: OpenSearch Dashboard Dev Tools" to compare search results.

### 17. Summary
With the OpenSearch Neural Search plugin, embeddings are generated automatically by the model we uploaded, so we no longer need to manage a separate inference pipeline. This makes the semantic search solution simpler to develop and maintain.