{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "16d74ffb",
   "metadata": {},
   "source": [
    "# Measure document similarities by extraction features from text inputs\n",
    "\n",
    "Create a mechanism to extract features (embeddings) from text inputs. With the embeddings you can then compute the distance between two or more sentences. This is useful if you're building a search mechanism or trying to see how **\"semantically\"** two sentences are close.\n",
    "\n",
    "For that purpose you'll use a **[Bert base](https://huggingface.co/bert-base-cased-finetuned-mrpc)** model, accelerated by an inf1 instance ([AWS Inferentia](https://aws.amazon.com/machine-learning/inferentia/)), running on SageMaker.\n",
    "\n",
    "For maximum performance and flexibility, you'll prepare the model with \"Neuron Core Pipeline\" and \"Dynamic Batch Size\" enabled. The first technique will shard the model across multiple cores to improve throughput. The second technique will allow you to send requests with different batch sizes. [Read more about these feature here](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/pipeline_tutorial/neuroncore_pipeline_pytorch.html).\n",
    "\n",
    "The text samples used in this notebook were extracted from: https://www.gutenberg.org/cache/epub/84/pg84-images.html#chap01"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc1cb598",
   "metadata": {},
   "source": [
    "## 1) Compile a pre-trained model\n",
    "When you deploy a model to a SageMaker Endpoint/inf1 instance (AWS Inferentia), you first need to compile the model with NeuronSDK. We'll use a sample provided by the official AWS Neuron SDK + Inferentia Samples.\n",
    "\n",
    "- Clone the repo: https://github.com/aws-neuron/aws-neuron-samples\n",
    "- Load the jupyter notebook for BertBaseCased: https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuron/inference/bertbasecased/\n",
    "- Start running the notebook, but enable Dynamic Batch and also Neuron Core Pipelines for 4 Neuron Cores, in model compilation section, as following:\n",
    "\n",
    "```python\n",
    "import os\n",
    "import torch\n",
    "import torch.neuron\n",
    "\n",
    "save_dir='model'\n",
    "neuron_model = torch.neuron.trace(\n",
    "    model, example_inputs=example_inputs_paraphrase,\n",
    "    dynamic_batch_size=True,\n",
    "    compiler_args['--neuron-core-pipeline', '4']\n",
    ")\n",
    "model.config.update({\"traced_sequence_length\": max_length})\n",
    "\n",
    "## Export 1/compiled model; 2/ tokenizer and 3/ model configs\n",
    "model_neuron.save(os.path.join(save_dir,\"model_neuron.pt\"))\n",
    "tokenizer.save_pretrained(save_dir)\n",
    "model.config.save_pretrained(save_dir)\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61189f98",
   "metadata": {},
   "source": [
    "## 2) Pack and upload the model to S3\n",
    "After compiling the model with the instructions above, **COPY** the entire **save_dir** to the same directory of this Notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d60a3908",
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "import tarfile\n",
    "import sagemaker\n",
    "\n",
    "save_dir='model'\n",
    "sess = sagemaker.Session()\n",
    "sagemaker_session_bucket = sess.default_bucket()\n",
    "with io.BytesIO() as file:\n",
    "    with tarfile.open(fileobj=file, mode=\"w:gz\") as tar:\n",
    "        tar.add(save_dir, \".\")\n",
    "        tar.list()\n",
    "    file.seek(0)\n",
    "    s3_uri = sess.upload_string_as_file_body(\n",
    "        file.read(), sagemaker_session_bucket, \"model/bert/model.tar.gz\"\n",
    "    )\n",
    "print(s3_uri)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ef38d42",
   "metadata": {},
   "source": [
    "## 3) Inference script used by SageMaker endpoint to load and execute the model\n",
    "This script is responsible for loading the model and expose a webservice for us to invoke and get predictions (embeddings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "81760e79",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pygmentize code/inference.py"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7799a6eb",
   "metadata": {},
   "source": [
    "## 4) Deploy our model to a SageMaker endpoint"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cbc462b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "sess = sagemaker.Session()\n",
    "\n",
    "# sagemaker session bucket -> used for uploading data, models and logs\n",
    "# sagemaker will automatically create this bucket if it not exists\n",
    "sagemaker_session_bucket = sess.default_bucket()\n",
    "\n",
    "role = sagemaker.get_execution_role()\n",
    "\n",
    "print(f\"sagemaker role arn: {role}\")\n",
    "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n",
    "print(f\"sagemaker session region: {sess.boto_region_name}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8bd55d2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.huggingface.model import HuggingFaceModel\n",
    "\n",
    "# create Hugging Face Model Class\n",
    "huggingface_model = HuggingFaceModel(\n",
    "   model_data=s3_uri,       # path to your model and script\n",
    "   role=role,                    # iam role with permissions to create an Endpoint\n",
    "   transformers_version=\"4.12\",  # transformers version used\n",
    "   pytorch_version=\"1.9\",        # pytorch version used\n",
    "   py_version='py37',            # python version used\n",
    "   sagemaker_session=sess,\n",
    "   model_server_workers=4, # keep 4 workers\n",
    "   entry_point=\"code/inference.py\",\n",
    "   # for production it is important to define vpc_config and use a vpc_endpoint\n",
    "   #vpc_config={\n",
    "   #    'Subnets': ['subnet-a320a8ca', 'subnet-56d5072d'],\n",
    "   #    'SecurityGroupIds': ['sg-0d8c231d83c1caaa6', 'sg-5504723c']\n",
    "   #}    \n",
    ")\n",
    "\n",
    "# Let SageMaker know that we've already compiled the model via neuron-cc\n",
    "huggingface_model._is_compiled_model = True\n",
    "\n",
    "# deploy the endpoint endpoint\n",
    "predictor = huggingface_model.deploy(\n",
    "    initial_instance_count=1,      # number of instances\n",
    "    instance_type=\"ml.inf1.6xlarge\" # AWS Inferentia Instance\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "380e48a5",
   "metadata": {},
   "source": [
    "## 5) Run a simple test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63e60205",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.serializers import JSONSerializer\n",
    "from sagemaker.deserializers import NumpyDeserializer\n",
    "predictor.serializer = JSONSerializer()\n",
    "predictor.deserializer = NumpyDeserializer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72afc7de",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('frank_chap01.txt') as f:\n",
    "    data = {'inputs': [l.strip() for l in f.readlines()]}\n",
    "num_sentences = len(data['inputs'])\n",
    "print(f\"Number of sentences: {num_sentences}\")\n",
    "embeddings = predictor.predict(data)\n",
    "print(embeddings.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79e975c7",
   "metadata": {},
   "source": [
    "### 5.1) Simple benchmark to identify the best batch_size with 1 client only"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "ff730002",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch size: 1 Elapsed time: 14.544463157653809ms Latency p/s 14.544463157653809ms\n",
      "Batch size: 2 Elapsed time: 23.25267791748047ms Latency p/s 11.626338958740234ms\n",
      "Batch size: 3 Elapsed time: 31.86509609222412ms Latency p/s 10.621698697408041ms\n",
      "Batch size: 4 Elapsed time: 39.96927738189697ms Latency p/s 9.992319345474243ms\n",
      "Batch size: 5 Elapsed time: 48.52888584136963ms Latency p/s 9.705777168273926ms\n",
      "Batch size: 6 Elapsed time: 57.08444118499756ms Latency p/s 9.514073530832926ms\n",
      "Batch size: 7 Elapsed time: 65.29092788696289ms Latency p/s 9.32727541242327ms\n",
      "Batch size: 8 Elapsed time: 74.49376583099365ms Latency p/s 9.311720728874207ms\n",
      "Batch size: 9 Elapsed time: 82.37555027008057ms Latency p/s 9.15283891889784ms\n",
      "Batch size: 10 Elapsed time: 90.54069519042969ms Latency p/s 9.054069519042969ms\n",
      "Batch size: 11 Elapsed time: 99.27759170532227ms Latency p/s 9.025235609574752ms\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "import copy\n",
    "iterations=10\n",
    "for batch_size in range(1,num_sentences+1):\n",
    "    d = copy.deepcopy(data)\n",
    "    d['inputs'] = d['inputs'][:batch_size]\n",
    "    t=time.time()\n",
    "    for i in range(iterations):\n",
    "        predictor.predict(d)\n",
    "    elapsed = (time.time()-t)/iterations*1000\n",
    "    print(f\"Batch size: {batch_size} Elapsed time: {elapsed}ms Latency p/s {elapsed/batch_size}ms\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "658ad164",
   "metadata": {},
   "source": [
    "### 5.2) Now Invoke the endpoint in parallel to evaluate throughput"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "518a7206",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Elapsed time: 24082.525491714478ms to process 11264 sentences with 5 workers. Latency p/s: 2.1380083000456747ms\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "from concurrent.futures import ThreadPoolExecutor\n",
    "\n",
    "# custom task that will sleep for a variable amount of time\n",
    "def task(data):\n",
    "    predictor.predict(data)\n",
    "\n",
    "num_workers = 5\n",
    "d = copy.deepcopy(data)\n",
    "documents_1k = [d for i in range(1024)]\n",
    "total_docs = len(documents_1k) * len(data['inputs'])\n",
    "\n",
    "# start the thread pool\n",
    "t=time.time()\n",
    "with ThreadPoolExecutor(num_workers) as executor:\n",
    "    # execute tasks concurrently and process results in order    \n",
    "    executor.map(task, documents_1k)\n",
    "elapsed = (time.time()-t)*1000\n",
    "print(f\"Elapsed time: {elapsed}ms to process {total_docs} sentences with {num_workers} workers. Latency p/s: {elapsed/total_docs}ms\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ea67331",
   "metadata": {},
   "source": [
    "### 5.3) Finally a similarity test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "8bbedec9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cosine Similarity: [[0.9238203]]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "sentence_1=\"I've seen things you people wouldn't believe. Attack ships on fire off the shoulder of Orion.\"\n",
    "sentence_2=\"I watched C-beams glitter in the dark near the Tannhäuser Gate. All those moments will be lost in time, like tears in rain. Time to die.\"\n",
    "embeddings_1,embeddings_2 = predictor.predict({'inputs':[sentence_1, sentence_2]})\n",
    "print(f'Cosine Similarity: {cosine_similarity([embeddings_1],[embeddings_2])}')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_pytorch_p36",
   "language": "python",
   "name": "conda_pytorch_p36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}