{ "cells": [ { "cell_type": "markdown", "id": "55288b83", "metadata": {}, "source": [ "# Compiling a Hugging Face model for AWS Inferentia with SageMaker Neo" ] }, { "cell_type": "markdown", "id": "3d9cf4f2", "metadata": {}, "source": [ "The notebook describes the process of downloading and preparing a pre-trained PyTorch language model from the Hugging Face repository, to be deployed on AWS Inferentia, a purpose-built hardware accelerator. \n", "\n", "Things to take into account when preparing the PyTorch model:\n", "\n", "#### PyTorch versions :\n", "\n", "- the model is pre-trained but will be traced and saved with 'torch.jit.trace' as a torch script\n", "- the version of PyTorch used for saving the model is the version that needs to be passed to the SageMaker PyTorch Estimator and Neo Compilation Job\n", "\n", "#### Neo Compilation : \n", "\n", "- the compilation job needs to find the model under 'model_data' with filename model.pth\n", "- the PyTorch version used for SageMaker Neo must be the same as the PyTorch version used to save the model in this notebook\n", "\n", "#### Inference script:\n", "\n", "- the inference script added to the SageMaker PyTorch Estimator will have a model loading function using 'torch.jit.load'\n", "- there has to be a requirements.txt added to the code directory with the transformers package so that the inference script can use the tokenizer from the Hugging Face transformers library\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "44397103", "metadata": {}, "source": [ "## Setting up our environment" ] }, { "cell_type": "markdown", "id": "5bcc25ab", "metadata": {}, "source": [ "For supported PyTorch versions with Neo please refer to the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-supported-cloud.html). For the purpose of this workshop we will compile a PyTorch version of the model available as standard SageMaker kernel. Please make sure you are using the available **SageMaker Python 3 (PyTorch 1.10 Python 3.8 CPU Optimized)** kernel, indicated at the top-right of the JupyterLab interface.\n", "\n", "Let's begin by installing the Hugging Face Transformers package to be able to download the pre-trained model and tokenizer and save it locally as a torch script." ] }, { "cell_type": "code", "execution_count": null, "id": "f8d03ed5", "metadata": { "tags": [] }, "outputs": [], "source": [ "%pip install -U transformers==4.15.0" ] }, { "cell_type": "code", "execution_count": null, "id": "51cf35a6", "metadata": {}, "outputs": [], "source": [ "import transformers\n", "print(transformers.__version__)" ] }, { "cell_type": "markdown", "id": "f268a2a0", "metadata": {}, "source": [ "If you run this notebook in SageMaker Studio, you need to make sure ipywidgets is installed and restart the kernel, so please uncomment the code in the next cell, and run it." 
] }, { "cell_type": "code", "execution_count": null, "id": "06757f75", "metadata": {}, "outputs": [], "source": [ "%%capture\n", "import IPython\n", "import sys\n", "\n", "!{sys.executable} -m pip install ipywidgets\n", "IPython.Application.instance().kernel.do_shutdown(True) # has to restart kernel so changes are used" ] }, { "cell_type": "code", "execution_count": null, "id": "eb1c157a", "metadata": {}, "outputs": [], "source": [ "import transformers\n", "import sagemaker\n", "import torch\n", "\n", "sagemaker_session = sagemaker.Session()\n", "role = sagemaker.get_execution_role()\n", "sess_bucket = sagemaker_session.default_bucket()" ] }, { "cell_type": "markdown", "id": "6210df45", "metadata": {}, "source": [ "## Retrieving the model from Hugging Face Model Hub" ] }, { "cell_type": "markdown", "id": "f58656f2", "metadata": {}, "source": [ "The model [bert-base-cased-finetuned-mrpc](https://huggingface.co/bert-base-cased-finetuned-mrpc) is one of the most downloaded models from the Hugging Face Model Hub. This model is a fine-tuned version of bert-base-cased on the GLUE MRPC dataset. It achieves the following results on the evaluation set:\n", "\n", " Loss: 0.7132\n", " Accuracy: 0.8603\n", " F1: 0.9026\n", " Combined Score: 0.8814\n", "\n", "\n", "**Note:** It is important to set the `return_dict` parameter to `False` when instantiating the model. In `transformers` v4.x, this parameter is `True` by default and it enables the return of dict-like python objects containing the model outputs, instead of the standard tuples. Neuron compilation does not support dictionary-based model ouputs, and compilation would fail if we didn't explictly set it to `False`.\n", "\n", "We also get the tokenizer corresponding to this same model, in order to create a sample input to trace our model. " ] }, { "cell_type": "code", "execution_count": null, "id": "dfaa9aba", "metadata": {}, "outputs": [], "source": [ "tokenizer = transformers.AutoTokenizer.from_pretrained(\"bert-base-cased-finetuned-mrpc\")\n", "\n", "model = transformers.AutoModelForSequenceClassification.from_pretrained(\n", " \"bert-base-cased-finetuned-mrpc\", return_dict=False\n", ")" ] }, { "cell_type": "markdown", "id": "f16e5447", "metadata": {}, "source": [ "## Tracing model with `torch.jit` and uploading to S3 " ] }, { "cell_type": "markdown", "id": "c1dc3bcf", "metadata": {}, "source": [ "Using the `jit.trace` to create a torch script; this is a required step to have SageMaker Neo compile the model artifact, which will take a `tar.gz` file containing the traced model.\n", "\n", "The `.pth` extension when saving our model is required." 
{ "cell_type": "markdown", "id": "f16e5447", "metadata": {}, "source": [ "## Tracing model with `torch.jit` and uploading to S3 " ] },
{ "cell_type": "markdown", "id": "c1dc3bcf", "metadata": {}, "source": [ "We use `torch.jit.trace` to create a TorchScript module; this is required for SageMaker Neo to compile the model, since the compilation job takes a `tar.gz` archive containing the traced model.\n", "\n", "Saving our model with the `.pth` file extension is also required." ] },
{ "cell_type": "code", "execution_count": null, "id": "309e579b", "metadata": {}, "outputs": [], "source": [ "# Prepare sample input for jit model tracing\n", "seq_0 = \"This is just sample text for model tracing, the length of the sequence does not matter because we will pad to the max length that Bert accepts.\"\n", "seq_1 = seq_0\n", "max_length = 512\n", "\n", "tokenized_sequence_pair = tokenizer.encode_plus(\n", "    seq_0, seq_1, max_length=max_length, padding=\"max_length\", truncation=True, return_tensors=\"pt\"\n", ")\n", "\n", "example = tokenized_sequence_pair[\"input_ids\"], tokenized_sequence_pair[\"attention_mask\"]\n", "\n", "traced_model = torch.jit.trace(model.eval(), example)\n", "traced_model.save(\"model.pth\")" ] },
{ "cell_type": "code", "execution_count": null, "id": "c3a01665", "metadata": {}, "outputs": [], "source": [ "import tarfile\n", "\n", "with tarfile.open(\"model.tar.gz\", \"w:gz\") as f:\n", "    f.add(\"model.pth\")" ] },
{ "cell_type": "markdown", "id": "45cc0cb3", "metadata": {}, "source": [ "Next, we upload the traced model `tar.gz` file to Amazon S3, where our compilation job will download it from." ] },
{ "cell_type": "code", "execution_count": null, "id": "15e70110", "metadata": {}, "outputs": [], "source": [ "traced_model_url = sagemaker_session.upload_data(\n", "    path=\"model.tar.gz\",\n", "    key_prefix=\"neuron-experiments/bert-seq-classification/traced-model\",\n", ")" ] },
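{ "cell_type": "markdown", "id": "e5b4a7c6", "metadata": {}, "source": [ "The returned value is the S3 URI of the uploaded archive; printing it is a quick, optional way to confirm where the compilation job will read the model from:" ] },
{ "cell_type": "code", "execution_count": null, "id": "f6c5b8d7", "metadata": {}, "outputs": [], "source": [ "# S3 URI of the traced model archive that Neo will compile\n", "print(traced_model_url)" ] },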
\n", "\n", "- model_fn - receives the model directory, is responsible for loading and returning the model -, an i\n", "- nput_fn and output_fn - in charge of pre-processing/checking content types of input and output to the endpoint \n", "- predict_fn, which receives the outputs of model_fn and input_fn (meaning, the loaded model and the deserialized/pre-processed input data) and defines how the model will run inference.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "03e3f0e3", "metadata": {}, "outputs": [], "source": [ "!mkdir -p code" ] }, { "cell_type": "code", "execution_count": null, "id": "92b2661b", "metadata": {}, "outputs": [], "source": [ "%%writefile code/inference_inf1.py\n", "import os\n", "import json\n", "import torch\n", "import torch_neuron\n", "from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig\n", "\n", "JSON_CONTENT_TYPE = 'application/json'\n", "\n", "def model_fn(model_dir):\n", " \n", " model_dir = '/opt/ml/model/'\n", " dir_contents = os.listdir(model_dir)\n", " model_path = next(filter(lambda item: 'model' in item, dir_contents), None)\n", " \n", " tokenizer_init = AutoTokenizer.from_pretrained('bert-base-cased-finetuned-mrpc')\n", " model = torch.jit.load(os.path.join(model_dir, model_path))\n", "\n", " \n", " return (model, tokenizer_init)\n", "\n", "\n", "def input_fn(serialized_input_data, content_type=JSON_CONTENT_TYPE):\n", " if content_type == JSON_CONTENT_TYPE:\n", " input_data = json.loads(serialized_input_data)\n", " return input_data\n", " else:\n", " raise Exception('Requested unsupported ContentType in Accept: ' + content_type)\n", " return\n", " \n", "\n", "def predict_fn(input_data, models):\n", "\n", " model_bert, tokenizer = models\n", " sequence_0 = input_data[0] \n", " sequence_1 = input_data[1]\n", " \n", " max_length = 512\n", " tokenized_sequence_pair = tokenizer.encode_plus(sequence_0,\n", " sequence_1,\n", " max_length=max_length,\n", " padding='max_length',\n", " truncation=True,\n", " return_tensors='pt')\n", " \n", " # Convert example inputs to a format that is compatible with TorchScript tracing\n", " example_inputs = tokenized_sequence_pair['input_ids'], tokenized_sequence_pair['attention_mask']\n", " \n", " with torch.no_grad():\n", " paraphrase_classification_logits_neuron = model_bert(*example_inputs)\n", " \n", " classes = ['not paraphrase', 'paraphrase']\n", " paraphrase_prediction = paraphrase_classification_logits_neuron[0][0].argmax().item()\n", " out_str = 'BERT predicts that \"{}\" and \"{}\" are {}'.format(sequence_0, sequence_1, classes[paraphrase_prediction])\n", " \n", " return out_str\n", "\n", "\n", "def output_fn(prediction_output, accept=JSON_CONTENT_TYPE):\n", " if accept == JSON_CONTENT_TYPE:\n", " return json.dumps(prediction_output), accept\n", " \n", " raise Exception('Requested unsupported ContentType in Accept: ' + accept)\n" ] }, { "cell_type": "markdown", "id": "22192c10", "metadata": {}, "source": [ "In this case, within the `model_fn` the model artifact located in `model_dir` is loaded (the compilation step will name the artifact `model_neuron.pt`). Then, the Neuron compiled model is loaded with `torch.jit.load`. \n", "\n", "Together with the `model_fn`, the torch_neuron package needs to be imported as well as the transformers package." 
{ "cell_type": "markdown", "id": "22192c10", "metadata": {}, "source": [ "In this case, within `model_fn` we look up the model artifact located in `model_dir` (the compilation step names the artifact `model_neuron.pt`) and load the Neuron-compiled model with `torch.jit.load`. \n", "\n", "Alongside `model_fn`, the `torch_neuron` package needs to be imported, as does the `transformers` package; the latter is installed on the endpoint via the following `requirements.txt`." ] },
{ "cell_type": "code", "execution_count": null, "id": "2ae4ada4", "metadata": {}, "outputs": [], "source": [ "%%writefile code/requirements.txt\n", "transformers==4.15.0\n" ] },
{ "cell_type": "markdown", "id": "4c703c40", "metadata": {}, "source": [ "## Compiling and deploying the model on an Inferentia inf1 instance" ] },
{ "cell_type": "markdown", "id": "cd304bd3", "metadata": {}, "source": [ "The newly created `PyTorchModel` will use `inference_inf1.py` as its entry point script. PyTorch version 1.10.1 is specified, as it is the latest version supported by Neo." ] },
{ "cell_type": "code", "execution_count": null, "id": "a6ae92a8", "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch.model import PyTorchModel\n", "from sagemaker.predictor import Predictor\n", "from datetime import datetime\n", "\n", "prefix = \"neuron-experiments/bert-paraphrase\"\n", "flavour = \"INF\"\n", "date_string = datetime.now().strftime(\"%Y%m-%d%H-%M%S\")\n", "\n", "compiled_sm_model = PyTorchModel(\n", "    model_data=traced_model_url,\n", "    predictor_cls=Predictor,\n", "    framework_version=\"1.10.1\",\n", "    role=role,\n", "    sagemaker_session=sagemaker_session,\n", "    entry_point=\"inference_inf1.py\",\n", "    source_dir=\"code\",\n", "    py_version=\"py3\",\n", "    name=f\"{flavour}-bert-mrpc-pt101-{date_string}\",\n", "    env={\"SAGEMAKER_CONTAINER_LOG_LEVEL\": \"10\"},\n", ")" ] },
{ "cell_type": "markdown", "id": "755897c1", "metadata": {}, "source": [ "Finally, we are ready to compile the model. Two notes here:\n", "* Hugging Face models should be compiled with `dtype` `int64`\n", "* the format of `compiler_options` differs from the standard Python `dict` that you can use when compiling for \"normal\" instance types; for Inferentia, you must provide a JSON string of CLI arguments, which correspond to the ones supported by the [Neuron Compiler](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-cc/command-line-reference.html) (read more about `compiler_options` [here](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_OutputConfig.html#API_OutputConfig_Contents))\n", "\n", "Compilation of the model will take ~13 minutes." ] },
{ "cell_type": "code", "execution_count": null, "id": "bbf92fb8", "metadata": {}, "outputs": [], "source": [ "%%time\n", "import json\n", "\n", "hardware = \"inf1\"\n", "flavour = \"compiled-inf\"\n", "compilation_job_name = f\"bert-{flavour}-{hardware}-\" + date_string\n", "\n", "compiled_inf1_model = compiled_sm_model.compile(\n", "    target_instance_family=f\"ml_{hardware}\",\n", "    input_shape={\"input_ids\": [1, 512], \"attention_mask\": [1, 512]},\n", "    job_name=compilation_job_name,\n", "    role=role,\n", "    framework=\"pytorch\",\n", "    framework_version=\"1.10.1\",\n", "    output_path=f\"s3://{sess_bucket}/{prefix}/neo-compilations/{flavour}-model\",\n", "    compiler_options=json.dumps(\"--dtype int64\"),\n", "    # compiler_options={'dtype': 'int64'},  # for compiling to \"normal\" cpu- or gpu-based instance types\n", "    compile_max_run=900,\n", ")" ] },
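{ "cell_type": "markdown", "id": "c9f8e1a0", "metadata": {}, "source": [ "Once the compilation job finishes, the S3 location of the Neuron-compiled artifact can be inspected on the returned model object (a quick, optional check):" ] },
{ "cell_type": "code", "execution_count": null, "id": "d0a9f2b1", "metadata": {}, "outputs": [], "source": [ "# S3 location of the Neuron-compiled model artifact produced by the Neo job\n", "print(compiled_inf1_model.model_data)" ] },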
{ "cell_type": "markdown", "id": "4645fcba", "metadata": {}, "source": [ "After successful compilation, we deploy our model to an ml.inf1.xlarge Inferentia-powered instance. Endpoint deployment will take ~10 minutes." ] },
{ "cell_type": "code", "execution_count": null, "id": "94b46b7e", "metadata": {}, "outputs": [], "source": [ "%%time\n", "from sagemaker.serializers import JSONSerializer\n", "from sagemaker.deserializers import JSONDeserializer\n", "\n", "date_string = datetime.now().strftime(\"%Y%m-%d%H-%M%S\")\n", "\n", "compiled_inf1_predictor = compiled_inf1_model.deploy(\n", "    instance_type=\"ml.inf1.xlarge\",\n", "    initial_instance_count=1,\n", "    endpoint_name=f\"test-neo-{hardware}-{date_string}\",\n", "    serializer=JSONSerializer(),\n", "    deserializer=JSONDeserializer(),\n", ")" ] },
{ "cell_type": "markdown", "id": "21a06075", "metadata": {}, "source": [ "Next, we submit an inference request to the endpoint." ] },
{ "cell_type": "code", "execution_count": null, "id": "d77e0822", "metadata": {}, "outputs": [], "source": [ "# Predict with model endpoint\n", "payload = seq_0, seq_1\n", "compiled_inf1_predictor.predict(payload)" ] },
{ "cell_type": "markdown", "id": "3fcdd387", "metadata": {}, "source": [ "### Clean up\n", "\n", "When you are finished with your Inferentia-based SageMaker endpoint, run the following code to remove the associated resources:" ] },
{ "cell_type": "code", "execution_count": null, "id": "18eeff36", "metadata": {}, "outputs": [], "source": [ "compiled_inf1_predictor.delete_model()\n", "compiled_inf1_predictor.delete_endpoint()" ] }
], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (PyTorch 1.10 Python 3.8 CPU Optimized)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/pytorch-1.10-cpu-py38" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "vscode": { "interpreter": { "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" } } }, "nbformat": 4, "nbformat_minor": 5 }