{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Compiling and Deploying Hugging Face Pre-trained BERT" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Introduction\n", "\n", "In this tutorial we will compile and deploy a BERT-base model from Hugging Face 🤗 Transformers for Inferentia. The full list of Hugging Face's pretrained BERT models can be found in the BERT section of this page: https://huggingface.co/transformers/pretrained_models.html.\n", "\n", "Please verify that this Jupyter notebook is running the Python Neuron kernel environment that was set up according to the [PyTorch Neuron Installation Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-intro/pytorch-setup/pytorch-install.html). You can select the kernel from the \"Kernel -> Change Kernel\" option at the top of this Jupyter notebook page." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Dependencies\n", "This tutorial requires the following pip packages:\n", "\n", "- `torch-neuron`\n", "- `neuron-cc[tensorflow]`\n", "- `transformers`\n", "\n", "Most of these packages are installed when you configure your environment using the Neuron PyTorch setup guide; the `transformers` package is installed by the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!pip install --upgrade \"transformers==4.6.0\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compile the model into an AWS Neuron optimized TorchScript\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import tensorflow  # to work around a protobuf version conflict issue\n", "import torch\n", "import torch.neuron\n", "from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig\n", "import transformers\n", "import os\n", "import warnings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now set the number of NeuronCores to use. An inf1.xlarge or inf1.2xlarge instance has 4 NeuronCores, so we set `num_cores = 4`. For an inf1.6xlarge, set `num_cores = 16`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Number of NeuronCores to use: 4 for inf1.xlarge/inf1.2xlarge, 16 for inf1.6xlarge\n", "num_cores = 4\n", "# Silence tokenizer parallelism warnings\n", "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, build the tokenizer and load the pre-trained model, which is fine-tuned on the MRPC paraphrase-detection task." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Build tokenizer and model\n", "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased-finetuned-mrpc\")\n", "model = AutoModelForSequenceClassification.from_pretrained(\"bert-base-cased-finetuned-mrpc\", return_dict=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, set up some example inputs and a maximum sequence length. These define the paraphrase and non-paraphrase example pairs."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set up some example inputs\n", "sequence_0 = \"Hugging Face is a technology company based in New York City\"\n", "sequence_1 = \"Apples are especially good for your health\"\n", "sequence_2 = \"Hugging Face's headquarters are situated in Manhattan\"\n", "\n", "max_length = 128\n", "paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=max_length, padding='max_length', truncation=True, return_tensors=\"pt\")\n", "not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors=\"pt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's run the encoded example sequences through the original PyTorch model. We also convert the example sequences to a format that is compatible with TorchScript tracing, which takes place in a subsequent step." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run the original PyTorch model on the compilation example\n", "paraphrase_classification_logits = model(**paraphrase)[0]\n", "print(f\"Model output: {paraphrase_classification_logits}\")\n", "\n", "# Convert example inputs to a format that is compatible with TorchScript tracing\n", "example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']\n", "example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The final compilation step is to trace the model with `torch.neuron.trace` and save the resulting Neuron-compiled TorchScript." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron\n", "model_neuron = torch.neuron.trace(model, example_inputs_paraphrase)\n", "\n", "# Verify the TorchScript works on both example inputs\n", "paraphrase_classification_logits_neuron = model_neuron(*example_inputs_paraphrase)\n", "not_paraphrase_classification_logits_neuron = model_neuron(*example_inputs_not_paraphrase)\n", "\n", "# Save the TorchScript for later use\n", "model_neuron.save('bert_neuron.pt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may inspect `model_neuron.graph` to see which parts run on CPU and which run on the accelerator. All native `aten` operators in the graph will run on CPU." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(model_neuron.graph)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Deploy the AWS Neuron optimized TorchScript\n", "\n", "To deploy the AWS Neuron optimized TorchScript, you can load the already-compiled model from disk and skip the compilation step."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the compiled TorchScript back from disk\n", "model_neuron = torch.jit.load('bert_neuron.pt')\n", "\n", "# Verify the TorchScript works on both example inputs\n", "paraphrase_classification_logits_neuron = model_neuron(*example_inputs_paraphrase)\n", "not_paraphrase_classification_logits_neuron = model_neuron(*example_inputs_not_paraphrase)\n", "\n", "classes = ['not paraphrase', 'paraphrase']\n", "paraphrase_prediction = paraphrase_classification_logits_neuron[0][0].argmax().item()\n", "not_paraphrase_prediction = not_paraphrase_classification_logits_neuron[0][0].argmax().item()\n", "\n", "print('BERT says that \"{}\" and \"{}\" are {}'.format(sequence_0, sequence_2, classes[paraphrase_prediction]))\n", "print()\n", "print('BERT says that \"{}\" and \"{}\" are {}'.format(sequence_0, sequence_1, classes[not_paraphrase_prediction]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's run the model in parallel on four cores. To do this, we first define a couple of utility functions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_input_with_padding(batch, batch_size, max_length):\n", "    # Reformulate the batch into three batch tensors - the default collation batches the outer dimension\n", "    encoded = batch['encoded']\n", "    inputs = torch.squeeze(encoded['input_ids'], 1)\n", "    attention = torch.squeeze(encoded['attention_mask'], 1)\n", "    token_type = torch.squeeze(encoded['token_type_ids'], 1)\n", "    quality = list(map(int, batch['quality']))\n", "\n", "    # Pad the final (possibly partial) batch with zeros so every batch has the same shape\n", "    if inputs.size()[0] != batch_size:\n", "        print(\"Input size = {} - padding\".format(inputs.size()))\n", "        remainder = batch_size - inputs.size()[0]\n", "        zeros = torch.zeros([remainder, max_length], dtype=torch.long)\n", "        inputs = torch.cat([inputs, zeros])\n", "        attention = torch.cat([attention, zeros])\n", "        token_type = torch.cat([token_type, zeros])\n", "\n", "    assert(inputs.size()[0] == batch_size and inputs.size()[1] == max_length)\n", "    assert(attention.size()[0] == batch_size and attention.size()[1] == max_length)\n", "    assert(token_type.size()[0] == batch_size and token_type.size()[1] == max_length)\n", "\n", "    return (inputs, attention, token_type), quality\n", "\n", "def count(output, quality):\n", "    # Compare model predictions against the ground-truth labels for one batch\n", "    assert output.size(0) >= len(quality)\n", "    correct_count = 0\n", "    count = len(quality)\n", "\n", "    batch_predictions = [row.argmax().item() for row in output]\n", "\n", "    for a, b in zip(batch_predictions, quality):\n", "        if int(a) == int(b):\n", "            correct_count += 1\n", "\n", "    return correct_count, count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data parallel inference\n", "In the following cells, we use the data parallel approach for inference. In this approach, we load multiple copies of the model, all of them running in parallel, with each copy loaded onto a single NeuronCore. In the implementation below, we launch 4 models, thereby utilizing all 4 cores on an inf1.xlarge or inf1.2xlarge instance.\n", "\n", "Running more than one model concurrently increases the throughput of the system. To benchmark the system accurately, we need to feed the models efficiently so that they stay busy at all times. In the setup below, this is done with a producer-consumer pattern: a common Python queue, shared across all the models, enables feeding data to them continuously."
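] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `parallel.py` helper downloaded below implements this pattern for the benchmark. Purely as an illustration of the idea, the next cell sketches the general shape of such a producer-consumer setup: the saved TorchScript is loaded once per NeuronCore, worker threads pull work items from a shared `queue.Queue`, and each result is handed to a callback. The class and method names here (`ParallelSketch`, `submit`) are hypothetical and are not the API of `parallel.py`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch only - hypothetical names, not the parallel.py API\n", "import queue\n", "import threading\n", "\n", "class ParallelSketch:\n", "    def __init__(self, model_file, num_workers):\n", "        # Each loaded copy of the Neuron TorchScript is expected to occupy its own NeuronCore\n", "        self.models = [torch.jit.load(model_file) for _ in range(num_workers)]\n", "        self.tasks = queue.Queue()\n", "        self.workers = [threading.Thread(target=self._work, args=(m,)) for m in self.models]\n", "\n", "    def _work(self, model):\n", "        while True:\n", "            item = self.tasks.get()\n", "            if item is None:  # sentinel: shut this worker down\n", "                break\n", "            inputs, callback = item\n", "            callback(model(*inputs))  # run inference, hand the result back\n", "\n", "    def start(self):\n", "        for w in self.workers:\n", "            w.start()\n", "\n", "    def submit(self, inputs, callback):\n", "        self.tasks.put((inputs, callback))\n", "\n", "    def stop(self):\n", "        for _ in self.workers:\n", "            self.tasks.put(None)  # one sentinel per worker\n", "        for w in self.workers:\n", "            w.join()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This sketch only shows the shape of the approach; the `NeuronSimpleDataParallel` class from the downloaded `parallel.py` is what the benchmark below actually uses."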
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we download two helper modules, `parallel` and `bert_benchmark_utils`, along with a test dataset, `glue_mrpc_dev.tsv`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/examples/pytorch/bert_tutorial/parallel.py -O parallel.py\n", "!wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/examples/pytorch/bert_tutorial/bert_benchmark_utils.py -O bert_benchmark_utils.py\n", "!wget https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/examples/pytorch/bert_tutorial/glue_mrpc_dev.tsv -O glue_mrpc_dev.tsv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, import the required modules and set up the input file and its configuration:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from parallel import NeuronSimpleDataParallel\n", "from bert_benchmark_utils import BertTestDataset, BertResults\n", "import time\n", "import functools\n", "\n", "max_length = 128\n", "num_cores = 4\n", "batch_size = 1\n", "\n", "tsv_file = \"glue_mrpc_dev.tsv\"\n", "\n", "data_set = BertTestDataset(tsv_file=tsv_file, tokenizer=tokenizer, max_length=max_length)\n", "data_loader = torch.utils.data.DataLoader(data_set, batch_size=batch_size, shuffle=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, create a handler function that aggregates the results. This is also where we create the data parallel model object using the `NeuronSimpleDataParallel()` wrapper." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Result aggregation class (code in bert_benchmark_utils.py)\n", "results = BertResults(batch_size, num_cores)\n", "def result_handler(output, result_id, start, end, input_dict):\n", "    correct_count, inference_count = count(output[0], input_dict.pop(result_id))\n", "    elapsed = end - start\n", "    results.add_result(correct_count, inference_count, [elapsed], [end], [start])\n", "\n", "parallel_neuron_model = NeuronSimpleDataParallel('bert_neuron.pt', num_cores)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now, the final step is to perform parallel inference and collect the benchmark results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Start the inference threads\n", "parallel_neuron_model.start_continuous_inference()\n", "\n", "# Warm up the cores - the first inference request to a Neuron-compiled model causes the model\n", "# to be loaded onto a NeuronCore. We do not want model loading time to be included in the benchmark.\n", "z = torch.zeros([batch_size, max_length], dtype=torch.long)\n", "batch = (z, z, z)\n", "for _ in range(num_cores * 4):\n", "    parallel_neuron_model.infer(batch, -1, None)\n", "\n", "input_dict = {}\n", "input_id = 0\n", "for _ in range(50):\n", "    for batch in data_loader:\n", "        batch, quality = get_input_with_padding(batch, batch_size, max_length)\n", "        input_dict[input_id] = quality\n", "        callback_fn = functools.partial(result_handler, input_dict=input_dict)\n", "        parallel_neuron_model.infer(batch, input_id, callback_fn)\n", "        input_id += 1\n", "\n", "# Stop inference\n", "parallel_neuron_model.stop()\n", "\n", "# Write the benchmark report to a file and print it\n", "with open(\"benchmark.txt\", \"w\") as f:\n", "    results.report(f, window_size=1)\n", "\n", "with open(\"benchmark.txt\", \"r\") as f:\n", "    for line in f:\n", "        print(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the benchmarking test above uses a batch size of `1`, which provides good latency but does not optimize for throughput.\n", "\n", "If you feel inclined, you can retry the benchmark with a larger batch size to understand its impact on latency and throughput (see the sketch after this list). You can do this by:\n", "\n", "1. modifying the original notebook code to create an example input with the new batch size\n", "2. re-tracing the original model with the new example input\n", "3. re-running the above benchmark code" ] },
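{ "cell_type": "markdown", "metadata": {}, "source": [ "The following cell is a rough, illustrative sketch of steps 1 and 2. The batch size of `8`, the `example_batch` variable, and the `bert_neuron_b8.pt` filename are arbitrary choices for illustration and are not part of the original tutorial; running the cell triggers another compilation pass." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch: re-trace the model for a hypothetical batch size of 8\n", "large_batch_size = 8\n", "\n", "# 1. Build an example input with the new batch size by repeating the single-sequence encoding\n", "example_batch = (paraphrase['input_ids'].repeat(large_batch_size, 1),\n", "                 paraphrase['attention_mask'].repeat(large_batch_size, 1),\n", "                 paraphrase['token_type_ids'].repeat(large_batch_size, 1))\n", "\n", "# 2. Re-trace the original model with the new example input and save it\n", "model_neuron_b8 = torch.neuron.trace(model, example_batch)\n", "model_neuron_b8.save('bert_neuron_b8.pt')\n", "\n", "# 3. Re-run the benchmark cells above with batch_size set to 8 and the\n", "#    NeuronSimpleDataParallel wrapper pointed at 'bert_neuron_b8.pt'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, a larger batch size tends to increase throughput at the cost of higher per-request latency; the benchmark report lets you quantify that trade-off for your workload." ] },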
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python (Neuron PyTorch)", "language": "python", "name": "pytorch_venv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.15" }, "vscode": { "interpreter": { "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" } } }, "nbformat": 4, "nbformat_minor": 4 }