# Measure document similarity by extracting features from text inputs

Create a mechanism to extract features (embeddings) from text inputs. With the embeddings you can then compute the distance between two or more sentences. This is useful if you're building a search mechanism or trying to see how **semantically** close two sentences are.
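
As a minimal, dependency-free sketch of what "distance between embeddings" means here, the snippet below computes the cosine similarity of toy vectors. The vectors and the helper name `cosine_sim` are made up for illustration; the notebook itself uses scikit-learn's `cosine_similarity` on the real embeddings later on.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (illustrative values only).
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, so similarity is close to 1
v3 = [-3.0, 0.0, 1.0]  # orthogonal to v1, so similarity is close to 0

print(cosine_sim(v1, v2))
print(cosine_sim(v1, v3))
```

A value near 1 means the two vectors point in the same direction, a value near 0 means they are unrelated, and negative values mean they point in opposite directions.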

For that purpose, you'll use a **[BERT base](https://huggingface.co/bert-base-cased-finetuned-mrpc)** model, accelerated by an inf1 instance ([AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/)), running on SageMaker.

For maximum performance and flexibility, you'll prepare the model with "Neuron Core Pipeline" and "Dynamic Batch Size" enabled. The first technique shards the model across multiple Neuron Cores to improve throughput. The second lets you send requests with different batch sizes. [Read more about these features here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/pipeline_tutorial/neuroncore_pipeline_pytorch.html).

The text samples used in this notebook were extracted from: https://www.gutenberg.org/cache/epub/84/pg84-images.html#chap01

## 1) Compile a pre-trained model
Before you deploy a model to a SageMaker endpoint backed by an inf1 instance (AWS Inferentia), you need to compile the model with the Neuron SDK. We'll use a sample provided by the official AWS Neuron SDK + Inferentia Samples.

- Clone the repo: https://github.com/aws-neuron/aws-neuron-samples
- Open the Jupyter notebook for BertBaseCased: https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuron/inference/bertbasecased/
- Run the notebook, but in the model compilation section enable Dynamic Batch Size and Neuron Core Pipeline for 4 Neuron Cores, as follows:

```python
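import os
import torch
import torch.neuron

save_dir = 'model'
os.makedirs(save_dir, exist_ok=True)

# `model`, `tokenizer`, `example_inputs_paraphrase` and `max_length` are
# expected to come from the earlier cells of the sample notebook.
model_neuron = torch.neuron.trace(
    model,
    example_inputs=example_inputs_paraphrase,
    dynamic_batch_size=True,  # accept batch sizes other than the traced one
    # shard the model across 4 Neuron Cores (flag name as listed in the
    # neuron-cc documentation; verify it against your compiler version)
    compiler_args=['--neuroncore-pipeline-cores', '4'],
)
model.config.update({"traced_sequence_length": max_length})

# Export 1) the compiled model, 2) the tokenizer and 3) the model config
model_neuron.save(os.path.join(save_dir, "model_neuron.pt"))
tokenizer.save_pretrained(save_dir)
model.config.save_pretrained(save_dir)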
```

## 2) Pack and upload the model to S3
After compiling the model with the instructions above, **COPY** the entire **save_dir** into the same directory as this notebook.

In [None]:
import io
import tarfile
import sagemaker

save_dir = 'model'
sess = sagemaker.Session()
sagemaker_session_bucket = sess.default_bucket()

# build model.tar.gz in memory, with the files at the archive root
with io.BytesIO() as file:
    with tarfile.open(fileobj=file, mode="w:gz") as tar:
        tar.add(save_dir, ".")
        tar.list()
    file.seek(0)
    s3_uri = sess.upload_string_as_file_body(
        file.read(), sagemaker_session_bucket, "model/bert/model.tar.gz"
    )
print(s3_uri)
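
The call `tar.add(save_dir, ".")` stores the files at the root of the archive rather than under a `model/` prefix. The sketch below checks that layout locally before anything is uploaded. The helper names `pack_dir` and `archive_names` are hypothetical, and the three files are empty placeholders standing in for the real artifacts.

```python
import io
import os
import tarfile
import tempfile

def pack_dir(src_dir):
    """Pack src_dir into an in-memory tar.gz with its files at the archive root."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        tar.add(src_dir, arcname=".")
    buffer.seek(0)
    return buffer

def archive_names(buffer):
    """Return the normalized member names of a tar.gz held in memory."""
    with tarfile.open(fileobj=buffer, mode="r:gz") as tar:
        names = [os.path.normpath(m.name) for m in tar.getmembers()]
    # drop the entry for the root directory itself
    return sorted(n for n in names if n != ".")

# Demo with placeholder files (not the real model artifacts).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("model_neuron.pt", "config.json", "vocab.txt"):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("placeholder")
    names = archive_names(pack_dir(tmp))

print(names)  # ['config.json', 'model_neuron.pt', 'vocab.txt']
```

If the names came back prefixed with the directory name instead, the files would end up one level deeper than expected once the archive is extracted.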

## 3) Inference script used by SageMaker endpoint to load and execute the model
This script is responsible for loading the model and exposing a web service that we can invoke to get predictions (embeddings).

In [None]:
!pygmentize code/inference.py

## 4) Deploy our model to a SageMaker endpoint

In [None]:
import sagemaker
sess = sagemaker.Session()

# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

In [None]:
from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=s3_uri,             # path to your model and script
    role=role,                     # iam role with permissions to create an Endpoint
    transformers_version="4.12",   # transformers version used
    pytorch_version="1.9",         # pytorch version used
    py_version='py37',             # python version used
    sagemaker_session=sess,
    model_server_workers=4,        # keep 4 workers
    entry_point="code/inference.py",
    # for production it is important to define vpc_config and use a vpc_endpoint
    #vpc_config={
    #    'Subnets': ['subnet-a320a8ca', 'subnet-56d5072d'],
    #    'SecurityGroupIds': ['sg-0d8c231d83c1caaa6', 'sg-5504723c']
    #}
)

# Let SageMaker know that we've already compiled the model via neuron-cc
huggingface_model._is_compiled_model = True

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,         # number of instances
    instance_type="ml.inf1.6xlarge"   # AWS Inferentia Instance
)

## 5) Run a simple test

In [None]:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import NumpyDeserializer
predictor.serializer = JSONSerializer()
predictor.deserializer = NumpyDeserializer()

In [None]:
with open('frank_chap01.txt') as f:
    data = {'inputs': [l.strip() for l in f.readlines()]}
num_sentences = len(data['inputs'])
print(f"Number of sentences: {num_sentences}")
embeddings = predictor.predict(data)
print(embeddings.shape)
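
Because the model was compiled with dynamic batch size, a request may carry any number of sentences. For longer documents it can be convenient to split the text into several payloads of the same `{'inputs': [...]}` shape used above. A minimal sketch, in which `make_payloads` is a hypothetical helper and not part of the SageMaker SDK:

```python
def make_payloads(sentences, batch_size):
    """Split sentences into request payloads of at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [
        {"inputs": sentences[i:i + batch_size]}
        for i in range(0, len(sentences), batch_size)
    ]

# 11 placeholder sentences, split into batches of up to 4
sentences = [f"sentence {i}" for i in range(11)]
payloads = make_payloads(sentences, batch_size=4)

print([len(p["inputs"]) for p in payloads])  # [4, 4, 3]
```

Each payload could then be passed to `predictor.predict(payload)` and the returned arrays concatenated in order.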

### 5.1) Simple benchmark to identify the best batch_size with 1 client only

In [51]:
import time
import copy

iterations=10
for batch_size in range(1,num_sentences+1):
    # send only the first `batch_size` sentences
    d = copy.deepcopy(data)
    d['inputs'] = d['inputs'][:batch_size]
    t=time.time()
    for i in range(iterations):
        predictor.predict(d)
    # average time per request in ms; "p/s" below means per sentence
    elapsed = (time.time()-t)/iterations*1000
    print(f"Batch size: {batch_size} Elapsed time: {elapsed}ms Latency p/s {elapsed/batch_size}ms")

Batch size: 1 Elapsed time: 14.544463157653809ms Latency p/s 14.544463157653809ms
Batch size: 2 Elapsed time: 23.25267791748047ms Latency p/s 11.626338958740234ms
Batch size: 3 Elapsed time: 31.86509609222412ms Latency p/s 10.621698697408041ms
Batch size: 4 Elapsed time: 39.96927738189697ms Latency p/s 9.992319345474243ms
Batch size: 5 Elapsed time: 48.52888584136963ms Latency p/s 9.705777168273926ms
Batch size: 6 Elapsed time: 57.08444118499756ms Latency p/s 9.514073530832926ms
Batch size: 7 Elapsed time: 65.29092788696289ms Latency p/s 9.32727541242327ms
Batch size: 8 Elapsed time: 74.49376583099365ms Latency p/s 9.311720728874207ms
Batch size: 9 Elapsed time: 82.37555027008057ms Latency p/s 9.15283891889784ms
Batch size: 10 Elapsed time: 90.54069519042969ms Latency p/s 9.054069519042969ms
Batch size: 11 Elapsed time: 99.27759170532227ms Latency p/s 9.025235609574752ms
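
The loop above reports only the mean over 10 requests. As a hedged sketch of a slightly richer measurement, the helpers below record each call separately and summarize mean, median and worst case. `measure_latency_ms`, `summarize` and `fake_predict` are illustrative names; `fake_predict` merely sleeps and stands in for `lambda: predictor.predict(d)`.

```python
import statistics
import time

def measure_latency_ms(fn, iterations=10):
    """Call fn() repeatedly and return the per-call latencies in milliseconds."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

def summarize(latencies):
    """Return the mean, median and maximum of a list of latencies."""
    return {
        "mean": statistics.mean(latencies),
        "p50": statistics.median(latencies),
        "max": max(latencies),
    }

def fake_predict():
    """Stand-in for a real endpoint call; sleeps for about 1 ms."""
    time.sleep(0.001)

stats = summarize(measure_latency_ms(fake_predict, iterations=5))
print(stats)
```

A large gap between the median and the maximum would suggest that a few slow requests, for example the first call after deployment, are inflating the mean.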


### 5.2) Now invoke the endpoint in parallel to evaluate throughput

In [38]:
import copy
import time
from concurrent.futures import ThreadPoolExecutor

# each task sends one request (all sentences in `data`) to the endpoint
def task(data):
    predictor.predict(data)

num_workers = 5
d = copy.deepcopy(data)
documents_1k = [d for i in range(1024)]
total_docs = len(documents_1k) * len(data['inputs'])

# start the thread pool
t=time.time()
with ThreadPoolExecutor(num_workers) as executor:
    # execute tasks concurrently; list() consumes the results so that
    # any exception raised by a request surfaces here
    list(executor.map(task, documents_1k))
elapsed = (time.time()-t)*1000
print(f"Elapsed time: {elapsed}ms to process {total_docs} sentences with {num_workers} workers. Latency p/s: {elapsed/total_docs}ms")

Elapsed time: 24082.525491714478ms to process 11264 sentences with 5 workers. Latency p/s: 2.1380083000456747ms
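
The reported figure can also be read as a throughput. The short calculation below converts the numbers printed above into sentences per second and compares the per-sentence latency with the best single-client result from section 5.1. The function name `throughput_per_second` is illustrative.

```python
def throughput_per_second(num_items, elapsed_ms):
    """Items processed per second, given the total elapsed time in milliseconds."""
    return num_items / (elapsed_ms / 1000.0)

# Figures copied from the outputs above.
elapsed_ms = 24082.525491714478        # 5 workers, 1024 requests
total_sentences = 11264                # 1024 requests x 11 sentences each
single_client_ms = 9.025235609574752   # per-sentence latency at batch size 11

throughput = throughput_per_second(total_sentences, elapsed_ms)
parallel_ms = elapsed_ms / total_sentences
speedup = single_client_ms / parallel_ms

print(f"Throughput: {throughput:.1f} sentences/s")
print(f"Speedup vs. one client: {speedup:.2f}x")
```

With these numbers, five concurrent clients reach roughly 468 sentences per second, a bit more than four times the single-client rate.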


### 5.3) Finally, a similarity test

In [50]:
from sklearn.metrics.pairwise import cosine_similarity
sentence_1="I've seen things you people wouldn't believe. Attack ships on fire off the shoulder of Orion."
sentence_2="I watched C-beams glitter in the dark near the Tannhäuser Gate. All those moments will be lost in time, like tears in rain. Time to die."
embeddings_1,embeddings_2 = predictor.predict({'inputs':[sentence_1, sentence_2]})
print(f'Cosine Similarity: {cosine_similarity([embeddings_1],[embeddings_2])}')

Cosine Similarity: [[0.9238203]]
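
The same comparison extends to a simple search: score every stored embedding against a query embedding and keep the best matches. The sketch below uses toy 2-dimensional vectors in place of the embeddings returned by the endpoint. `cosine_sim` and `top_k` are hypothetical helpers written in plain Python.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k=2):
    """Return the k (index, score) pairs of corpus vectors most similar to query."""
    scored = [(i, cosine_sim(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors standing in for sentence embeddings (illustrative values only).
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.9, 0.1]

results = top_k(query, corpus, k=2)
for index, score in results:
    print(index, round(score, 3))
```

Here the query is closest to vector 0, followed by vector 2. With real data, `corpus` would hold the rows of the `embeddings` array and `query` the embedding of the search phrase.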
