# Semantic Search with Amazon OpenSearch Service 

Now that we've searched the data set with a keyword search, let's see how semantic search can improve the matches. To do this, we will add a vector representation of each question to our data set in OpenSearch, then convert our sample query "Does this work with xbox?" to a vector in the same way. In OpenSearch, we'll use a KNN (k-nearest neighbors) search to find the questions whose vectors are closest to the query vector, ranked by cosine similarity.
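To make the idea of cosine similarity concrete, here is a minimal pure-Python sketch. The three-dimensional vectors are made-up toy values, not real model output, and `cosine_similarity` is an illustrative helper rather than anything OpenSearch exposes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values chosen for illustration)
query = [1.0, 2.0, 0.0]
similar = [2.0, 4.0, 0.0]    # points in the same direction as the query
unrelated = [0.0, 0.0, 3.0]  # orthogonal to the query

print(cosine_similarity(query, similar))    # close to 1: same direction
print(cosine_similarity(query, unrelated))  # 0: nothing in common
```

Only the direction of the vectors matters, not their length, which is why `similar` scores as a near-perfect match even though it is twice as long as `query`.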

![word vector](word2vec.png)


We will:
1. Use a HuggingFace BERT model to generate vectors for the questions in the PQA dataset.
2. Upload the dataset to OpenSearch, with the original question and answer text combined with the vector representation of the questions.
3. Translate the query question to a vector.
4. Run a KNN search in OpenSearch to find the questions that are semantically closest to the query.

### 1. Check PyTorch Version


As in the previous modules, let's import PyTorch and confirm that we have the required version. The version should already be 1.10.2 or higher. If not, please run the lab notebooks in order so that everything is set up.

In [None]:
import torch
print(torch.__version__)

### 2. Retrieve notebook variables

The line below will retrieve your shared variables from the previous notebook.

In [None]:
%store -r

### 3. Import libraries



In [None]:
import boto3
import re
import time
import sagemaker

### 4. Prepare BERT Model 

For this module, we will use a HuggingFace BERT model (a DistilBERT variant, as the code below shows) to generate the vectors, representing every sentence as a 768-dimensional vector. Let's create some helper functions we'll use later on.
![BERT](nlp_bert.png)

We are creating 2 functions:
1. mean_pooling
2. sentence_to_vector - this is the key function we'll use to generate our vector for the headset PQA dataset.
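As a rough illustration of what mean pooling does, the following pure-Python sketch averages token vectors while ignoring padded positions. It mirrors the idea of the `mean_pooling` function in this notebook but works on plain lists; `masked_mean` and the toy values are assumptions made for this example.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for j, value in enumerate(vector):
                totals[j] += value
    # Guard against division by zero when every position is masked
    kept = max(kept, 1e-9)
    return [total / kept for total in totals]

# Three token positions with 2-dimensional vectors; the last one is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding row would drag the average far away from the real tokens, which is why the attention mask is taken into account.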

In [None]:
import torch
from transformers import AutoTokenizer, AutoModel
from transformers import DistilBertTokenizer, DistilBertModel

#model_name = "distilbert-base-uncased"
#model_name = "sentence-transformers/msmarco-distilbert-base-dot-prod-v3"
model_name = "sentence-transformers/distilbert-base-nli-stsb-mean-tokens"


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask


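# Load the tokenizer and model once, rather than on every call to sentence_to_vector
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertModel.from_pretrained(model_name)


def sentence_to_vector(raw_inputs):
    # Pad to the longest sentence in the batch; truncate inputs longer than the model allows
    inputs_tokens = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        outputs = model(**inputs_tokens)

    # Collapse the per-token embeddings into one vector per sentence
    sentence_embeddings = mean_pooling(outputs, inputs_tokens['attention_mask'])
    return sentence_embeddings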


### 5. Prepare Headset PQA data
We have already downloaded the dataset in Module 2, so let's start by ingesting 1000 rows of the data into a Pandas data frame. 

In [None]:
import json
import pandas as pd

def load_pqa(file_name, number_rows=1000):
    # Each line of the file is one JSON record; keep the question and its first answer
    rows = []
    with open(file_name) as f:
        for line in f:
            data = json.loads(line)
            rows.append((data['question_text'], data['answers'][0]['answer_text']))
            if len(rows) == number_rows:
                break
    return pd.DataFrame(rows, columns=['question', 'answer'])


qa_list = load_pqa('amazon-pqa/amazon_pqa_headsets.json',number_rows=1000)
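The loader relies on each line of the file being a standalone JSON record with a `question_text` field and an `answers` list. The sketch below shows that parsing step on two hand-written sample records (hypothetical data, not taken from the dataset). Unlike `load_pqa`, the illustrative helper `iter_qa_pairs` skips records that have no answers instead of failing on them.

```python
import io
import json

def iter_qa_pairs(lines, limit=None):
    """Yield (question, first answer) pairs from JSON Lines input."""
    count = 0
    for line in lines:
        if limit is not None and count >= limit:
            break
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        answers = record.get("answers") or []
        if not answers:
            continue  # skip questions without any answer
        yield record["question_text"], answers[0]["answer_text"]
        count += 1

# Hypothetical sample records in the same shape the loader expects
sample = io.StringIO(
    '{"question_text": "Does it have a mic?", "answers": [{"answer_text": "Yes."}]}\n'
    '{"question_text": "Is it wireless?", "answers": []}\n'
)

pairs = list(iter_qa_pairs(sample))
print(pairs)  # [('Does it have a mic?', 'Yes.')]
```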



### 6. Convert the text data into vector
Using the helper function we created earlier, let's convert the questions from the Headset PQA dataset into vectors.

In [None]:
vector_sentences = sentence_to_vector(qa_list["question"].tolist())
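Encoding all 1000 questions in a single call works for this lab, but on a larger dataset it may use a lot of memory. One possible approach, sketched here under that assumption, is to split the questions into smaller batches. The `batches` helper and the batch size are illustrative choices, not part of the original notebook.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in for qa_list["question"].tolist()
questions = [f"question {n}" for n in range(10)]

sizes = [len(batch) for batch in batches(questions, 4)]
print(sizes)  # [4, 4, 2]

# In the notebook, the batches could then be encoded and stitched back together, e.g.:
# vector_sentences = torch.cat([sentence_to_vector(b) for b in batches(questions, 64)])
```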

### 7. Create an OpenSearch cluster connection.
Next, we'll use the OpenSearch Python client to set up a connection to the OpenSearch cluster.

Note: if you're using a region other than us-east-1, please update the region in the code below. 

In [None]:
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
import boto3

region = 'us-east-1' 

#credentials = boto3.Session().get_credentials()
#auth = AWSV4SignerAuth(credentials, region)
auth = ("master","Semantic123!")
index_name = 'nlp_pqa'

aos_client = OpenSearch(
    hosts = [{'host': aos_host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)
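The cell above hard-codes the user name and password, which is convenient for a lab but not something to carry into shared code. As a hedged alternative, the sketch below reads them from environment variables. The variable names `AOS_USER` and `AOS_PASSWORD` and the helper `basic_auth_from_env` are assumptions made for this example.

```python
import os

def basic_auth_from_env(user_var="AOS_USER", password_var="AOS_PASSWORD"):
    """Return a (user, password) tuple read from environment variables."""
    user = os.environ.get(user_var)
    password = os.environ.get(password_var)
    if not user or not password:
        raise RuntimeError(
            f"Set the {user_var} and {password_var} environment variables first"
        )
    return (user, password)

# In the notebook, the result could replace the hard-coded tuple:
# auth = basic_auth_from_env()
```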

### 8. Create an index in Amazon OpenSearch Service 
Whereas we previously created an index with 2 fields, this time we'll define the index with 3 fields: "question_vector" holds the vector representation of the question, "question" holds the raw question text, and "answer" holds the raw answer text. The "dimension" of the vector field must match the 768-dimensional output of the model.

To create the index, we first define the index as a JSON-style Python dictionary, then use the aos_client connection we initiated earlier to create the index in OpenSearch.

In [None]:
knn_index = {
    "settings": {
        "index.knn": True,
        "index.knn.space_type": "cosinesimil",
        "analysis": {
          "analyzer": {
            "default": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
    },
    "mappings": {
        "properties": {
            "question_vector": {
                "type": "knn_vector",
                "dimension": 768,
                "store": True
            },
            "question": {
                "type": "text",
                "store": True
            },
            "answer": {
                "type": "text",
                "store": True
            }
        }
    }
}
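Because the `dimension` in the mapping has to agree with the length of every vector we index, a small sanity check can catch a mismatch before any upload. The sketch below uses a trimmed-down copy of the index definition; `check_dimension` is an illustrative helper, not an OpenSearch API.

```python
# Trimmed-down copy of the index definition, keeping only the vector field
example_index = {
    "mappings": {
        "properties": {
            "question_vector": {"type": "knn_vector", "dimension": 768}
        }
    }
}

def check_dimension(vector, index_body, field="question_vector"):
    """Raise ValueError if the vector length differs from the mapped dimension."""
    expected = index_body["mappings"]["properties"][field]["dimension"]
    if len(vector) != expected:
        raise ValueError(
            f"{field} expects {expected} dimensions, got {len(vector)}"
        )
    return True

print(check_dimension([0.0] * 768, example_index))  # True
```

In the notebook, the same check could be run against `knn_index` and one row of `vector_sentences` before loading the data.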


If for any reason you need to recreate your dataset, you can uncomment and execute the following to delete any previously created indexes. If this is the first time you're running this, you can skip this step.

In [None]:
#aos_client.indices.delete(index="nlp_pqa")


Using the above index definition, we now create the index in Amazon OpenSearch Service.

In [None]:
aos_client.indices.create(index="nlp_pqa",body=knn_index,ignore=400)


Let's verify the settings and mappings of the index we just created.

In [None]:
aos_client.indices.get(index="nlp_pqa")

### 9. Load the raw data into the Index
Next, let's load the headset PQA data, together with the question vectors, into the index we've just created.

In [None]:
# Index each question together with its vector and answer
for i, content in enumerate(qa_list["question"].tolist()):
    vector = vector_sentences[i].tolist()
    answer = qa_list["answer"][i]
    aos_client.index(index='nlp_pqa', body={"question_vector": vector, "question": content, "answer": answer})
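Indexing one document per request is fine for 1000 rows, but it sends 1000 separate HTTP calls. A possible alternative is the bulk helper in `opensearch-py`. The sketch below only builds the action dictionaries so that it runs without a cluster; `bulk_actions` is an illustrative helper, and the commented-out `helpers.bulk` call shows how it might be used in the notebook.

```python
def bulk_actions(index, questions, answers, vectors):
    """Yield one bulk-index action per (question, answer, vector) triple."""
    for question, answer, vector in zip(questions, answers, vectors):
        yield {
            "_index": index,
            "_source": {
                "question_vector": vector,
                "question": question,
                "answer": answer,
            },
        }

# Toy data standing in for the real questions, answers and 768-dimensional vectors
actions = list(bulk_actions(
    "nlp_pqa",
    ["Does it have a mic?", "Is it wireless?"],
    ["Yes.", "No."],
    [[0.1, 0.2], [0.3, 0.4]],
))
print(len(actions))  # 2

# In the notebook (assumes aos_client, qa_list and vector_sentences exist):
# from opensearchpy import helpers
# helpers.bulk(aos_client, bulk_actions("nlp_pqa",
#                                       qa_list["question"].tolist(),
#                                       qa_list["answer"].tolist(),
#                                       vector_sentences.tolist()))
```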

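To validate the load, we'll query the number of documents in the index. We should have 1000 hits in the index. Newly indexed documents only become searchable after the next index refresh (every second by default), so rerun the cell if the count is lower than expected.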

In [None]:
res = aos_client.search(index="nlp_pqa", body={"query": {"match_all": {}}})
print("Records found: %d." % res['hits']['total']['value'])

### 10. Generate vector for user input query 

Next, we'll use the same helper function to translate our input question "does this work with xbox?" into a vector. 

In [None]:
query_raw_sentences = ['does this work with xbox?']
search_vector = sentence_to_vector(query_raw_sentences)[0].tolist()
search_vector

### 11. Search vector with "Semantic Search" 

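Now that we have the question vectors in OpenSearch and a vector for our query question, let's perform a KNN search in OpenSearch. The query asks for the 30 nearest neighbors ("k": 30) of the search vector and returns up to 30 hits ("size": 30).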


In [None]:

query={
    "size": 30,
    "query": {
        "knn": {
            "question_vector":{
                "vector":search_vector,
                "k":30
            }
        }
    }
}

res = aos_client.search(index="nlp_pqa", 
                       body=query,
                       stored_fields=["question","answer"])
#print("Got %d Hits:" % res['hits']['total']['value'])
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['fields']['question'][0],hit['fields']['answer'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
display(query_result_df)
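The loop that turns search hits into table rows is repeated for the keyword search, so it could be factored into a helper. This is a minimal sketch under the assumption that the response has the same shape as above (stored fields returned as single-element lists); `hits_to_rows` and the fake response are made up for illustration.

```python
def hits_to_rows(response, fields=("question", "answer")):
    """Flatten an OpenSearch search response into [_id, _score, *fields] rows."""
    rows = []
    for hit in response["hits"]["hits"]:
        stored = hit.get("fields", {})
        row = [hit["_id"], hit["_score"]]
        # Stored fields come back as lists; take the first value, or None if missing
        row.extend(stored.get(name, [None])[0] for name in fields)
        rows.append(row)
    return rows

# Hand-written response with the same structure as a real one
fake_response = {
    "hits": {
        "hits": [
            {
                "_id": "a1",
                "_score": 0.9,
                "fields": {
                    "question": ["Does it work with Xbox?"],
                    "answer": ["Yes."],
                },
            }
        ]
    }
}

rows = hits_to_rows(fake_response)
print(rows)  # [['a1', 0.9, 'Does it work with Xbox?', 'Yes.']]
```

In the notebook, `pd.DataFrame(hits_to_rows(res), columns=["_id", "_score", "question", "answer"])` would then build the same table for either search.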

### 12. Search the same query with "Text Search"

Let's repeat the same query with a keyword search and compare the differences.

In [None]:
query={
    "size": 30,
    "query": {
        "match": {
            "question":"does this work with xbox?"
        }
    }
}

res = aos_client.search(index="nlp_pqa", 
                       body=query,
                       stored_fields=["question","answer"])
#print("Got %d Hits:" % res['hits']['total']['value'])
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['fields']['question'][0],hit['fields']['answer'][0]]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
display(query_result_df)

### 13. Observe The Results

Compare the first few records in the two searches above. For the semantic search, the first 10 or so results are very similar in meaning to our input question, as we expect. In the keyword search, by contrast, the results quickly start to deviate from our search query (e.g. "it shows xbox 360. Does it work for ps3 as well?", which matches on keywords but has a different meaning).

### 14. Store Variables Used for the Next Notebook

The next notebook needs some of the values created here. Execute the cell below to store them so that the next notebook can retrieve them with `%store -r`.

In [None]:
%store qa_list