# Module 1: Exploring BM25 similarity and semantic similarity

Before we get started with Amazon OpenSearch and our search web app, let's explore some of the core concepts in search. Below, we'll demonstrate the difference between two algorithms for matching data: BM25 similarity (keyword matching) and cosine similarity (semantic vector matching).

### 1. Upgrade PyTorch and restart Kernel

Before we begin, we need to upgrade PyTorch and restart the notebook kernel. The following should take 2-3 minutes to complete, and you should see the following message: "Successfully installed torch-2.0.0".

You may see a message stating "ERROR: pip's dependency resolver does not..." - you can ignore this error.

In [None]:
!pip install --upgrade torch==2.0.0

Now we need to restart the kernel by running the cell below.

In [None]:
from IPython.display import display_html
def restartkernel() :
    display_html("<script>Jupyter.notebook.kernel.restart()</script>",raw=True)
restartkernel()

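Next, let's verify the version of Torch to ensure everything is up to date. The version should be 2.0.0, matching the version pinned in the install command above (the printed string may include a build suffix such as "+cu117").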

In [None]:
import torch
print(torch.__version__)
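
If you would rather check the version programmatically than read it off the output, a small helper along these lines could work. This is only a sketch: `matches_pinned` is a hypothetical name introduced here, and it assumes that any build suffix is separated from the base version by a "+" character.

```python
def matches_pinned(reported: str, pinned: str) -> bool:
    """Return True if the reported version equals the pinned one,
    ignoring anything after a '+' (assumed to be a local build suffix)."""
    base = reported.split("+", 1)[0]
    return base == pinned


# Example strings only; these are not read from a real installation.
print(matches_pinned("2.0.0+cu117", "2.0.0"))
print(matches_pinned("1.13.1", "2.0.0"))
```

In the notebook, you could call it as `matches_pinned(str(torch.__version__), "2.0.0")` after importing torch.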

### 2. Install Pre-Requisites

Before we can experiment with different searches, we need to install some required libraries.

In [None]:
!pip install -q transformers
!pip install -U sentence-transformers rank_bm25

from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string
from tqdm.autonotebook import tqdm
import numpy as np

### 3. Create a sample dataset

Let's now create a very simple dataset as an array of 4 questions.

In [None]:
passages=["does this work with xbox?",
          "Does the M70 work with Android phones?", 
          "does this work with iphone?",
          "Can this work with an xbox "
         ]

### 4. Explore BM25 similarity 

Execute the following to explore BM25 similarity. First, we'll tokenize the data set, then use BM25 similarity to compare the phrase "does this work with xbox?" with our sample questions. 

In [None]:
# Split the sentence into words and remove stop words. For example, the sentence "does this work with xbox"
# will be converted into the words "does", "work", "xbox".
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc


tokenized_corpus = []
for passage in tqdm(passages):
    tokenized_corpus.append(bm25_tokenizer(passage))

# Get the BM25 score between the query "does this work with xbox?" and the 4 existing questions above.
# If the score is high, it means BM25 thinks the two sentences are similar.
bm25 = BM25Okapi(tokenized_corpus)
bm25_scores = bm25.get_scores(bm25_tokenizer("does this work with xbox?"))

all_sentence_combinations = []
for i in range(len(bm25_scores)):
    all_sentence_combinations.append([bm25_scores[i], i])

#sort the score descending
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

# Print the 4 sentences, their tokens, and their BM25 scores against the query "does this work with xbox?".
# You can see that "does this work with iphone?" has a high BM25 score with "does this work with xbox?"
# even though the semantic meaning is different, while "Can this work with an xbox" has a low BM25 score
# with "does this work with xbox?" even though the two sentences have the same semantic meaning.
# This is the drawback of BM25.
print("Top most similar pairs:")
for score, i in all_sentence_combinations[0:4]:
    print("{} \t {} \t {:.4f}".format(passages[i],bm25_tokenizer(passages[i]),bm25_scores[i]))
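
To see where scores like these come from, here is a minimal from-scratch sketch of Okapi BM25 scoring. It is an illustration under stated assumptions, not the `rank_bm25` implementation: the defaults `k1=1.5`, `b=0.75`, and the `epsilon` floor applied to negative IDF values are chosen here for illustration, and the token lists are hard-coded to what the tokenizer above is assumed to produce.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75, epsilon=0.25):
    """Score each tokenized document against the query (Okapi BM25 sketch)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    # Classic IDF; it goes negative for terms found in more than half the documents.
    idf = {t: math.log((n_docs - n + 0.5) / (n + 0.5)) for t, n in doc_freq.items()}

    # Assumption: replace negative IDF values with a small floor, epsilon * mean IDF.
    floor = epsilon * sum(idf.values()) / len(idf)
    idf = {t: (floor if value < 0 else value) for t, value in idf.items()}

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized through this term.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]  # 0 when the term is absent from the document
            score += idf.get(term, 0.0) * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Assumed tokenizer output for the four sample questions.
corpus = [
    ["does", "work", "xbox"],
    ["does", "m70", "work", "android", "phones"],
    ["does", "work", "iphone"],
    ["work", "xbox"],
]
query = ["does", "work", "xbox"]

scores = bm25_scores(query, corpus)
for tokens, score in zip(corpus, scores):
    print(tokens, round(score, 4))
```

In this sketch, "xbox" appears in exactly half of the four passages, so its IDF is zero and it contributes nothing to any score. The passages are then separated only by "does" and "work", which is why the iphone question ties with the exact match while the "Can this work with an xbox" passage scores lowest.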
    
    

### 5. Semantic Similarities


Execute the following to explore semantic similarity. In this code, we'll use the same dataset as above, but rank the passages by the cosine similarity of their embeddings instead of by keyword matching. Compare the differences in how similarity is measured.

'all-MiniLM-L6-v2' is a sentence-transformers model from Hugging Face that maps sentences and paragraphs to a 384-dimensional dense vector space. We'll use this model to translate our sample dataset into a set of vectors. We'll then use util.cos_sim to compute the cosine similarity between each pair of records in the dataset. Finally, we'll print the cosine similarity score of the first record ("does this work with xbox?") against every record in the dataset.
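
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them and not on their magnitudes. The sketch below computes it in plain Python; the three-dimensional vectors are made up for illustration and are not real model embeddings, and `cosine_similarity` is a name introduced here rather than part of the library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors (real embeddings from this model have 384 dimensions).
query_vec = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # same direction, twice the length
orthogonal = [0.0, 0.0, 3.0]       # shares no component with the query

print(cosine_similarity(query_vec, same_direction))  # close to 1: same direction
print(cosine_similarity(query_vec, orthogonal))      # 0: nothing in common
```

Vectors that point the same way score close to 1 regardless of length, which is why two sentences with different wording can still score highly if the model places them in a similar direction.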

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

#Encode all sentences
embeddings = model.encode(passages)

#Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

#cosine similarity score with query
all_sentence_combinations = []
for i in range(len(cos_sim)):
    all_sentence_combinations.append([cos_sim[0][i], i])

#Sort list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

# You can see that "does this work with xbox?" is identical to the query "does this work with xbox?",
# so its semantic score is of course the highest. "Can this work with an xbox" has a similar meaning
# to "does this work with xbox?", so its semantic score is ranked second.
print("Top most similar pairs:")
for score, i in all_sentence_combinations[0:4]:
    print("{} \t {:.4f}".format(passages[i],cos_sim[0][i]))

### 6. Compare the differences.

As you can see, the similarity scores are significantly different, even with a trivial dataset. In particular, observe the differences for the text "Can this work with an xbox": with BM25 keyword search, its wording differs from the original question ("Can" instead of "does"), and so it's given a low rating. Cosine similarity semantic search provides a much better match in this case.

In this module, we've used fairly simple steps with a very small dataset to demonstrate the difference between BM25 and cosine similarity. In the following modules, we'll demonstrate these same concepts using OpenSearch and a larger, more complex dataset.