# Module 2: Text Search with Amazon OpenSearch Service 

In this module, we perform a simple search in OpenSearch by matching the individual words in our search query. We will:
1. Load data into OpenSearch from the Amazon Product Question and Answer (PQA) dataset. This dataset contains a list of common questions and answers related to products.
2. Query the data with a simple search to find potentially matching questions. We will search the PQA dataset for questions similar to our sample question "does this work with xbox?". We expect to find matches in the dataset based on individual words such as "xbox" and "work".

In subsequent modules, we will demonstrate how to use semantic search to improve the relevance of the query results.
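
To make word-level matching concrete, here is a minimal sketch in plain Python that does not touch OpenSearch. The `tokenize` and `keyword_overlap` helpers and the three sample questions are hypothetical and exist only for illustration; the real engine also weighs how rare a term is and how long the field is.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_overlap(query, document):
    """Count how many distinct query words also appear in the document."""
    return len(set(tokenize(query)) & set(tokenize(document)))

# Hypothetical sample questions, not taken from the PQA dataset.
questions = [
    "Does it work on PS3?",
    "Will this headset work with Xbox One?",
    "How long is the cable?",
]

query = "does this work with xbox?"
# Rank the questions by how many query words they share with the query.
ranked = sorted(questions, key=lambda q: keyword_overlap(query, q), reverse=True)
for q in ranked:
    print(keyword_overlap(query, q), q)
```

Note that the PS3 question still receives a non-zero score because it shares "does" and "work" with the query, even though it is about a different console.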

### 1. Install required libraries

Before we begin, we need to install some required libraries.

In [None]:
!pip install -q boto3
!pip install -q requests
!pip install -q requests-aws4auth
!pip install -q opensearch-py
!pip install -q tqdm
!pip install -q "transformers[torch]"
!pip install -q sentence-transformers rank_bm25

### 2. Get CloudFormation stack output variables

We also need to grab some key values from the infrastructure we provisioned using CloudFormation. To do this, we will list the outputs from the stack and store them in `outputs` to be used later.

You can ignore any "PythonDeprecationWarning" warnings.

In [None]:
import boto3

cfn = boto3.client('cloudformation')

def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

## Setup variables to use for the rest of the demo
cloudformation_stack_name = "semantic-search"

outputs = get_cfn_outputs(cloudformation_stack_name)

bucket = outputs['s3BucketTraining']
aos_host = outputs['OpenSearchDomainEndpoint']

outputs
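
The helper above flattens the `Outputs` list returned by `describe_stacks` into a dictionary. The sketch below shows the same transformation on a hand-written response so that it can be run without AWS access. The bucket and endpoint values are placeholders, and `outputs_to_dict` is a hypothetical name.

```python
# Hand-written stand-in for the shape of a describe_stacks response.
# The values are placeholders for illustration only.
sample_response = {
    "Stacks": [
        {
            "StackName": "semantic-search",
            "Outputs": [
                {"OutputKey": "s3BucketTraining", "OutputValue": "example-bucket"},
                {
                    "OutputKey": "OpenSearchDomainEndpoint",
                    "OutputValue": "search-example.us-east-1.es.amazonaws.com",
                },
            ],
        }
    ]
}

def outputs_to_dict(response):
    """Map each OutputKey to its OutputValue for the first stack in the response."""
    stack = response["Stacks"][0]
    # A stack may have no "Outputs" key, so default to an empty list.
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}

sample_outputs = outputs_to_dict(sample_response)
print(sample_outputs["OpenSearchDomainEndpoint"])
```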

### 3. Copy the data set locally
Before we can run any queries, we need to download the Amazon Product Question and Answer data from https://registry.opendata.aws/amazon-pqa/

Let's start by having a look at all the files in the dataset.

In [None]:
!aws s3 ls --no-sign-request s3://amazon-pqa/

There are a lot of files here, so for the purposes of this demo, we focus on just the headset data. Let's download the amazon_pqa_headsets.json data locally. 

In [None]:
!aws s3 cp --no-sign-request s3://amazon-pqa/amazon_pqa_headsets.json ./amazon-pqa/amazon_pqa_headsets.json

### 4. Create an OpenSearch cluster connection.
Next, we'll use the Python API to set up a connection to the Amazon OpenSearch Service domain.

Note: if you're using a region other than us-east-1, please update the region in the code below. 

In [None]:
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
import boto3

# Update the region if you're working in a region other than us-east-1
region = 'us-east-1' 

print (aos_host)

#credentials = boto3.Session().get_credentials()
#auth = AWSV4SignerAuth(credentials, region)
auth = ("master","Semantic123!")

aos_client = OpenSearch(
    hosts = [{'host': aos_host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)

### 5. Create an index in Amazon OpenSearch Service
We define an index whose default analyzer is the `standard` analyzer configured with the English stopword list, which strips common stopwords such as `the`, `is`, `a`, and `an`.

We will use the `aos_client` connection we created earlier to create the index in Amazon OpenSearch Service.

In [None]:
headset_default_index = {
    "settings": {
        "number_of_replicas": 1,
        "number_of_shards": 1,
        "analysis": {
          "analyzer": {
            "default": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
    }
   
}
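
To get a feel for what this analyzer does to a question before it is indexed, the following sketch approximates it in plain Python. It is a simplification under stated assumptions: the real `standard` analyzer tokenizes using Unicode text segmentation rules, and the stopword set below is only a small subset chosen for illustration, not the full `_english_` list. The `approximate_analyze` helper is hypothetical.

```python
import re

# Illustrative subset of English stopwords (an assumption, not the full built-in list).
STOPWORDS = {"a", "an", "and", "is", "it", "the", "this", "with"}

def approximate_analyze(text):
    """Lowercase, split on non-alphanumeric characters, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(approximate_analyze("Does this work with Xbox?"))
```

To inspect the token stream the domain actually produces, the analyze API (exposed in `opensearch-py` as `aos_client.indices.analyze`) can be called against the index once it exists.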

If for any reason you need to recreate your dataset, you can uncomment and execute the following to delete any previously created indexes. If this is the first time you're running this, you can skip this step.

In [None]:
#aos_client.indices.delete(index="headset_pqa")

Using the above index definition, we now need to create the index in Amazon OpenSearch Service.

In [None]:
aos_client.indices.create(index="headset_pqa",body=headset_default_index,ignore=400)


Let's verify the created index information.

In [None]:
aos_client.indices.get(index="headset_pqa")

### 6. Load the raw data into the Index
Next, let's load the headset PQA data we copied locally into the index we've just created.

In [None]:
import json
from tqdm.contrib.concurrent import process_map
from multiprocessing import cpu_count

def load_pqa_as_json(file_name,number_rows=1000):
    result=[]
    with open(file_name) as f:
        i=0
        for line in f:
            data = json.loads(line)
            result.append(data)
            i+=1
            if(i == number_rows):
                break
    return result


qa_list_json = load_pqa_as_json('amazon-pqa/amazon_pqa_headsets.json',number_rows=1000)


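def es_import(question):
    # Index a single question document into the headset_pqa index.
    aos_client.index(index='headset_pqa', body=question)

workers = 4 * cpu_count()

# With chunksize=1000 and 1000 documents, the whole list is handed over as a single chunk.
process_map(es_import, qa_list_json, chunksize=1000)

# Refresh the index so the new documents are visible to the search in the next step.
aos_client.indices.refresh(index='headset_pqa')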
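
Indexing one document per request works for 1,000 rows but becomes slow for larger loads. A possible alternative, sketched below, is to batch documents through the bulk helper that ships with `opensearch-py`. The `bulk_actions` generator and the sample rows are hypothetical. Only the action-building part runs here; the actual call is left as a comment because it needs a live domain.

```python
def bulk_actions(documents, index_name):
    """Yield one bulk action per document in the shape the bulk helper accepts."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}

# Hypothetical sample rows standing in for parsed PQA records.
sample_docs = [
    {"question_text": "Does it work on PS3?"},
    {"question_text": "Is the microphone detachable?"},
]

actions = list(bulk_actions(sample_docs, "headset_pqa"))
print(actions[0])

# With a live connection, the actions could be sent in batches, for example:
#   from opensearchpy import helpers
#   helpers.bulk(aos_client, bulk_actions(qa_list_json, "headset_pqa"))
```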

To validate the load, we'll query the number of documents in the index. We should have 1000 hits.

In [None]:
res = aos_client.search(index="headset_pqa", body={"query": {"match_all": {}}})
print("Records found: %d " % res['hits']['total']['value'])

### 7. Run a simple text search

Now that we've loaded our data, let's run a keyword search for the question "does this work with xbox?", using the default OpenSearch query, and display the results.

In [None]:
import pandas as pd
query={
  "size": 10,
  "query": {
    "match": {
      "question_text": "does this work with xbox?"
    }
  }
}
res = aos_client.search(index="headset_pqa", body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['question_text'],hit['_source']['answers'][0]['answer_text']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
display(query_result_df)
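
The loop that flattens search hits into table rows is the same for every query in this module, so it could be factored into a helper. The sketch below runs on a hand-written response with made-up ids, scores, and text; `hits_to_rows` is a hypothetical name. It also guards against a question with an empty `answers` list, for which the loop above would raise an `IndexError`.

```python
def hits_to_rows(response):
    """Return [_id, _score, question, first answer] for each hit in a search response."""
    rows = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        answers = source.get("answers") or []
        # Fall back to an empty string when a question has no answers.
        first_answer = answers[0]["answer_text"] if answers else ""
        rows.append([hit["_id"], hit["_score"], source["question_text"], first_answer])
    return rows

# Hand-written response with made-up values, mirroring the fields used above.
sample_search_response = {
    "hits": {
        "hits": [
            {
                "_id": "doc-1",
                "_score": 3.2,
                "_source": {
                    "question_text": "Does it work with Xbox One?",
                    "answers": [{"answer_text": "Yes, with an adapter."}],
                },
            },
            {
                "_id": "doc-2",
                "_score": 1.1,
                "_source": {"question_text": "Does it work on PS3?", "answers": []},
            },
        ]
    }
}

sample_rows = hits_to_rows(sample_search_response)
for row in sample_rows:
    print(row)
```

The returned rows can be passed to `pd.DataFrame(data=..., columns=["_id","_score","question","answer"])` exactly as in the cell above.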

### 8. Search across multiple fields

Searching across multiple fields can return more results. The results are scored by BM25 relevance.

In [None]:
import pandas as pd
query={
  "size": 10,
  "query": {
    "multi_match": {
      "query": "does this work with xbox?",
      "fields": ["question_text","bullet_point*", "answers.answer_text", "item_name"]
    }
  }
}
res = aos_client.search(index="headset_pqa", body=query)

query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['question_text'],hit['_source']['answers'][0]['answer_text']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
display(query_result_df)
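
For intuition about the `_score` column, the sketch below implements a textbook form of BM25 in plain Python. It is an illustration under assumptions, not a reproduction of the engine's scoring: it uses the commonly cited defaults `k1=1.2` and `b=0.75`, skips stopword removal, scores a single text per document, and runs on three made-up questions, so the numbers will not match the values returned by the domain.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized through the b parameter.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up questions for illustration.
docs = [
    "Does it work on PS3?",
    "Will this headset work with Xbox One?",
    "How long is the cable?",
]
scores = bm25_scores("does this work with xbox?", docs)
for doc, score in zip(docs, scores):
    print(round(score, 3), doc)
```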

### 9. Search with field preference (boosting)

When searching across fields, all fields are given the same priority by default. You can control the preference by assigning a static boost to each field.

In [None]:
import pandas as pd
query={
  "size": 10,
  "query": {
    "multi_match": {
      "query": "does this work with xbox?",
      "fields": ["question_text^2", "bullet_point*", "answers.answer_text^2", "item_name^1.5"]
    }
  }
}
res = aos_client.search(index="headset_pqa", body=query)

query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['question_text'],hit['_source']['answers'][0]['answer_text']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
display(query_result_df)
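
The effect of the `^` boosts can be sketched with simple arithmetic. The example below assumes the default `best_fields` behavior of `multi_match`, where a document's score comes from its best-matching field, and treats a boost as a multiplier on that field's score. The helper names and the per-field scores are made up for illustration.

```python
from fnmatch import fnmatch

def parse_field(spec):
    """Split 'name^boost' into (name, boost); the boost defaults to 1.0."""
    name, _, boost = spec.partition("^")
    return name, float(boost) if boost else 1.0

def best_fields_score(field_specs, per_field_scores):
    """Return the highest boosted score among the fields that matched."""
    boosted = []
    for spec in field_specs:
        pattern, boost = parse_field(spec)
        for field, score in per_field_scores.items():
            # Patterns such as 'bullet_point*' may match several concrete fields.
            if fnmatch(field, pattern):
                boosted.append(score * boost)
    return max(boosted, default=0.0)

boosted_fields = ["question_text^2", "bullet_point*", "answers.answer_text^2", "item_name^1.5"]
plain_fields = ["question_text", "bullet_point*", "answers.answer_text", "item_name"]

# Made-up per-field scores for a single document.
sample_scores = {"question_text": 1.0, "bullet_point1": 1.8, "item_name": 1.0}

print(best_fields_score(plain_fields, sample_scores))
print(best_fields_score(boosted_fields, sample_scores))
```

In this made-up case the bullet point wins without boosts, while the boosted `question_text` match becomes the best field once the boosts are applied.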

### 10. Compound queries with `bool`

With `bool` queries, you can give more preference based on the values or existence of other fields. In the query below, a document gets a higher score if `answer_aggregated` is `neutral`.

In [None]:
import pandas as pd
query={
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "does this work with xbox?",
            "fields": [ "question_text^2", "bullet_point*", "answers.answer_text^2","item_name^2"]
          }
        }
      ],
      "should": [
        {
          "term": {
            "answer_aggregated.keyword": {
              "value": "neutral"
            }
          }
        }
      ]
    }
  }
}
res = aos_client.search(index="headset_pqa", body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['question_text'],hit['_source']['answers'][0]['answer_text']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
display(query_result_df)
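
The way `must` and `should` interact can be sketched as follows: every `must` clause has to match for a document to be returned, while a matching `should` clause is optional here and adds its score on top. The `bool_score` helper and the clause scores are made up for illustration.

```python
def bool_score(must_scores, should_scores):
    """Combine clause scores; None marks a clause that did not match the document."""
    if any(s is None for s in must_scores):
        # A document that fails any must clause is excluded from the results.
        return None
    # Matching should clauses add their scores to the must clause scores.
    return sum(must_scores) + sum(s for s in should_scores if s is not None)

# Two made-up documents with the same text relevance.
neutral_doc = bool_score(must_scores=[2.0], should_scores=[0.5])
other_doc = bool_score(must_scores=[2.0], should_scores=[None])
excluded_doc = bool_score(must_scores=[None], should_scores=[0.5])

print(neutral_doc, other_doc, excluded_doc)
```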

### 11. Use custom scoring with function score queries

Function score queries are handy for overriding the default BM25 scoring. The query below recalculates the score based on how many answers the question has received.

In [None]:
import pandas as pd
query={
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "multi_match": {
                "query": "does this work with xbox?",
                "fields": ["question_text^5","bullet_point*","answers.answer_text^2", "item_name^2" ]
              }
            }
          ],
          "should": [
            {
              "term": {
                "answer_aggregated.keyword": {
                  "value": "neutral"
                }
              }
            }
          ]
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "_score * 0.25 * doc['answers.answer_text.keyword'].length"
          }
        }
      ]
    }
  }
}
res = aos_client.search(index="headset_pqa", body=query)
print("Got %d Hits:" % res['hits']['total']['value'])
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['question_text'],hit['_source']['answers'][0]['answer_text']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
display(query_result_df)
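
One subtlety is worth checking against the scores you see. `function_score` merges the function result with the query score according to `boost_mode`, which defaults to `multiply`. Because the script already multiplies by `_score`, the final score would then grow with the square of the text relevance. The sketch below models that arithmetic with made-up numbers. `combined_score` is a hypothetical Python function, and `answer_count` stands in for the number of values the `doc[...]` lookup holds for the field.

```python
def combined_score(query_score, answer_count, boost_mode="multiply"):
    """Mimic the script above and how its result is merged with the query score."""
    # Same arithmetic as the script: _score * 0.25 * number of values.
    script_result = query_score * 0.25 * answer_count
    if boost_mode == "multiply":
        # Default: the query score is multiplied by the function result.
        return query_score * script_result
    if boost_mode == "replace":
        # The function result alone becomes the final score.
        return script_result
    raise ValueError(f"unsupported boost_mode: {boost_mode}")

print(combined_score(2.0, answer_count=4))
print(combined_score(2.0, answer_count=4, boost_mode="replace"))
```

If the intent is for the script alone to define the score, adding `"boost_mode": "replace"` next to `"functions"` in the query is one option to consider.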

### 12. Observe The Results and Refine

Congratulations, you've now explored the possibilities of text search on the data in OpenSearch.

If you take a look at the results above, you'll notice that the results match one or more of the keywords from our question, most commonly the words "work" and "xbox". You'll also notice that a lot of these results aren't relevant to our original question, such as "Does it work on PS3?" and "Does it work for computers". In Module 3, we'll instead use semantic search to make the results more relevant.

### Store Variables Used for the Next Notebook

There are a few values you will need for the next notebook. Execute the cell below to store them so they can be copied and pasted into the next part of the exercise.

In [None]:
%store outputs
%store bucket
%store aos_host