{ "cells": [ { "cell_type": "code", "execution_count": null, "source": [ "!pip install -q -U sagemaker ipywidgets" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "# ResNet-50 comparison: compiled vs uncompiled, featuring Inferentia\n", "\n", "In this notebook we will see how to deploy a pretrained model from the PyTorch Vision library, in particular a ResNet50, to Amazon SageMaker. We will also test how it performs on different hardware configurations, and the effects of model compilation with Amazon SageMaker Neo. In a nutshell, we will test:\n", "\n", "- ResNet50 on a ml.c5.xlarge, uncompiled\n", "- ResNet50 on a ml.g4dn.xlarge, uncompiled\n", "- ResNet50 on a ml.c5.xlarge, compiled\n", "- ResNet50 on a ml.g4dn.xlarge, compiled\n", "- ResNet50 on a ml.inf1.xlarge, compiled\n", "\n", "**NOTE**: this notebook has been tested with the PyTorch 1.8 CPU kernel. Please use that one if you're using SageMaker Studio." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Set-up model and SageMaker helper functions" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "import sagemaker\n", "from sagemaker import Session, get_execution_role\n", "from sagemaker.pytorch.model import PyTorchModel\n", "from sagemaker.utils import name_from_base\n", "\n", "print(sagemaker.__version__)\n", "\n", "sess = Session()\n", "bucket = sess.default_bucket()\n", "role = get_execution_role()\n", "endpoints = {}" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Let's download the model for the PyTorch Hub, and create an archive that can be used by SageMaker to deploy this model. For using PyTorch in Script Mode, Amazon SageMaker expects a single archive file in `.tar.gz` format, containing a model file and the code for inference in a `code` folder. The structure of the archive will be as follows: \n", "\n", "```\n", "/model.tar.gz\n", "/--- model.pth\n", "/--- code/\n", "/--- /--- inference.py\n", "/--- /--- requirements.txt (optional)\n", "```\n", "\n", "By setting the variable `download_the_model=False`, you can skip the download phase and provide your own path to S3 in the `model_data` variable." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "download_the_model = True\n", "\n", "if download_the_model:\n", " import torch, tarfile\n", " # Load the model\n", " model = torch.hub.load('pytorch/vision:v0.9.0', 'resnet50', pretrained=True)\n", " inp = torch.rand(1, 3, 224, 224)\n", " model_trace = torch.jit.trace(model, inp)\n", " # Save your model. The following code saves it with the .pth file extension\n", " model_trace.save('model.pth')\n", " with tarfile.open('model.tar.gz', 'w:gz') as f:\n", " f.add('model.pth')\n", " f.add('code/uncompiled-inference.py', 'code/inference.py')\n", " f.close()\n", " pytorch_resnet50_prefix = 'pytorch/resnet50'\n", " model_data = sess.upload_data('model.tar.gz', bucket, pytorch_resnet50_prefix)\n", "else:\n", " pytorch_resnet50_prefix = 'pytorch/resnet50'\n", " model_data = f's3://{bucket}/{pytorch_resnet50_prefix}/model.tar.gz'\n", " \n", "print(f'Model stored in {model_data}')" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Deploy and test on CPU\n", "\n", "In our first test, we will deploy the model on a `ml.c5.xlarge` instance, without compiling the model. Although this is a CNN, it is still possible to run it on CPU, although the performances won't be that good. This can give us a nice baseline of the performances of our model." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "pth_model = PyTorchModel(model_data=model_data,\n", " entry_point='uncompiled-inference.py',\n", " source_dir='code',\n", " role=role,\n", " framework_version='1.7',\n", " py_version='py3'\n", ")" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "predictor = pth_model.deploy(1, 'ml.c5.xlarge')" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "endpoints['cpu_uncompiled'] = predictor.endpoint_name\n", "predictor.endpoint_name" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Deploy and test on GPU\n", "\n", "The instance chosen this time is a `ml.g4dn.xlarge`. It has great throughput and the cheapest way of running GPU inferences on the AWS cloud." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "pth_model = PyTorchModel(model_data=model_data,\n", " entry_point='uncompiled-inference.py',\n", " source_dir='code',\n", " role=role,\n", " framework_version='1.6',\n", " py_version='py3'\n", ")" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "predictor = pth_model.deploy(1, 'ml.g4dn.xlarge')" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "endpoints['gpu_uncompiled'] = predictor.endpoint_name\n", "predictor.endpoint_name" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "# Compiled Models\n", "\n", "A common tactic in more advanced use cases is to improve model performances, in terms of latency and throughput, by compiling the model. \n", "\n", "Amazon SageMaker features its own compiler, Amazon SageMaker Neo, that enables data scientists to optimize ML models for inference on SageMaker in the cloud and supported devices at the edge.\n", "\n", "You start with a machine learning model already built with DarkNet, Keras, MXNet, PyTorch, TensorFlow, TensorFlow-Lite, ONNX, or XGBoost and trained in Amazon SageMaker or anywhere else. Then you choose your target hardware platform, which can be a SageMaker hosting instance or an edge device based on processors from Ambarella, Apple, ARM, Intel, MediaTek, Nvidia, NXP, Qualcomm, RockChip, Texas Instruments, or Xilinx. With a single click, SageMaker Neo optimizes the trained model and compiles it into an executable. \n", "\n", "For inference in the cloud, SageMaker Neo speeds up inference and saves cost by creating an inference optimized container in SageMaker hosting. For inference at the edge, SageMaker Neo saves developers months of manual tuning by automatically tuning the model for the selected operating system and processor hardware." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### Create the `model_data` for compilation\n", "\n", "To compile the model, we need to provide a `tar.gz` archive just like before, with very few changes. In particular, since SageMaker will use a different DL runtime for running compiled models, we will let it use the default function for serving the model, and only provide a script containing how to preprocess data. Let's create this archive and uplaod it to S3." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "import tarfile \n", "\n", "with tarfile.open('model-to-compile.tar.gz', 'w:gz') as f:\n", " f.add('model.pth')\n", " f.add('code/compiled-inference.py', 'code/inference.py')\n", "f.close()\n", "model_data = sess.upload_data('model-to-compile.tar.gz', bucket, pytorch_resnet50_prefix)" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "output_path = f's3://{bucket}/{pytorch_resnet50_prefix}/compiled'" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Compile for CPU\n", "\n", "Let's run the same baseline test from before, and compile and deploy for CPU instances." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "pth_model = PyTorchModel(model_data=model_data,\n", " entry_point='compiled-inference.py',\n", " source_dir='code',\n", " role=role,\n", " framework_version='1.6',\n", " py_version='py3'\n", ")" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "output_path = f's3://{bucket}/{pytorch_resnet50_prefix}/compiled'\n", "\n", "compiled_model = pth_model.compile(\n", " target_instance_family='ml_c5',\n", " input_shape={\"input0\": [1, 3, 224, 224]},\n", " output_path=output_path,\n", " role=role,\n", " job_name=name_from_base(f'pytorch-resnet50-c5')\n", ")" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "predictor = compiled_model.deploy(1, 'ml.c5.xlarge')" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "endpoints['cpu_compiled'] = predictor.endpoint_name\n", "predictor.endpoint_name" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Compile for GPU" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "pth_model = PyTorchModel(model_data=model_data,\n", " entry_point='compiled-inference.py',\n", " source_dir='code',\n", " role=role,\n", " framework_version='1.6',\n", " py_version='py3'\n", " )" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "output_path = f's3://{bucket}/{pytorch_resnet50_prefix}/compiled'\n", "\n", "compiled_model = pth_model.compile(\n", " target_instance_family='ml_g4dn',\n", " input_shape={\"input0\": [1, 3, 224, 224]},\n", " output_path=output_path,\n", " role=role,\n", " job_name=name_from_base(f'pytorch-resnet50-g4dn')\n", ")" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "predictor = compiled_model.deploy(1, 'ml.g4dn.xlarge')" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "endpoints['gpu_compiled'] = predictor.endpoint_name\n", "predictor.endpoint_name" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "print(1)" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Compile for Inferentia instances\n", "\n", "There is one more thing we can try to improve our model performances: using Inferentia instances.\n", "\n", "Amazon EC2 Inf1 instances deliver high-performance ML inference at the lowest cost in the cloud. They deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable current generation GPU-based Amazon EC2 instances. Inf1 instances are built from the ground up to support machine learning inference applications. They feature up to 16 AWS Inferentia chips, high-performance machine learning inference chips designed and built by AWS." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "pth_model = PyTorchModel(\n", " model_data=model_data,\n", " entry_point='compiled-inference.py',\n", " source_dir='code',\n", " role=role,\n", " framework_version='1.7',\n", " py_version='py3'\n", ")" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "compiled_model = pth_model.compile(\n", " target_instance_family='ml_inf1',\n", " input_shape={\"input0\": [1, 3, 224, 224]},\n", " output_path=output_path,\n", " role=role,\n", " job_name=name_from_base(f'pytorch-resnet50-inf1'),\n", " compile_max_run=1000\n", ")" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "predictor = compiled_model.deploy(1, 'ml.inf1.xlarge')" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "endpoints['inferentia'] = predictor.endpoint_name\n", "predictor.endpoint_name" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "# Test" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "For testing our models and endpoints, we will use the following picture of a beagle pup, freely available on [Wikimedia](https://commons.wikimedia.org/wiki/File:Beagle_puppy_sitting_on_grass.jpg). We will pass it to our endpoints as `application/x-image`, and no particular pre-processing is needed on client-side. " ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "from IPython.display import Image\n", "\n", "Image('doggo.jpg')" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Finally, we will use the following function to benchmark our SageMaker endpoints, measuring the latency of our predictions. This specific version uses both the `Predictor` from the SageMaker Python SDK and boto3's `invoke_endpoint()` function - just change the last parameter from `boto3` to `sm` if you want to use the Python SDK." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "from load_test import load_tester\n", "\n", "num_thread = 16\n", "\n", "from IPython.display import JSON\n", "JSON(endpoints)" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# CPU - Uncompiled\n", "load_tester(num_thread, endpoints['cpu_uncompiled'], 'doggo.jpg', 'boto3')" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Let's discuss the above results:\n", "\n", "as we can from the latency tests, the model way too long to generate an inference, averaging about 6 transactions per second (TPS). This of course may be sufficient for some low throughput use cases, rarely used instances, but it's very likely that it won't be sufficient. Let's see how those numbers change when using a GPU." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# GPU - Uncompiled\n", "load_tester(num_thread, endpoints['gpu_uncompiled'], 'doggo.jpg', 'boto3')" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Now we're talking! The GPU helps us achieve 77 TPS on average, with a much lower latency percentile over the board. Nice!" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# CPU - Compiled\n", "load_tester(num_thread, endpoints['cpu_compiled'], 'doggo.jpg', 'boto3')" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "With a simple compilation job, we more than DOUBLED the performances of our model on our c5 instance, achieving 13 TPS and half the latency percentiles. Let's see if the same results can be seen on GPU." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# GPU - Compiled\n", "load_tester(num_thread, endpoints['gpu_compiled'], 'doggo.jpg', 'boto3')" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Results are also consistent in the compiled version of the model on GPU. Almost double the TPS, with 7 ms latency. Let's take it one step further and test Inferentia." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# Inferentia\n", "load_tester(num_thread, endpoints['inferentia'], 'doggo.jpg', 'boto3')" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "The best results so far! Up to 290 TPS at the same latency percentiles than GPU, or lower. All of this for a fraction of the cost." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "# Reviewing the results\n", "\n", "Let's plot the results obtained from the previous tests, by taking into account the last printed line of each load testing. We will ignore the error rate, and take into account the TPS obtained with our tests dividing it by the cost of the machine in the region of testing (DUB) - a metric we will call \"throughput per dollar\"." ], "metadata": {} }, { "cell_type": "code", "execution_count": 1, "source": [ "import matplotlib.pyplot as plt\n", "from io import StringIO\n", "import pandas as pd\n", "\n", "data = StringIO('''\n", "experiment|TPS|p50|cost/hour\n", "c5|5.900|215.23876|0.23\n", "c5 + Neo|12.900|121.60897|0.23\n", "g4dn|74.700|21.19393|0.821\n", "g4dn + Neo|140.400|11.28725|0.821\n", "inf1|304.300|4.90315|0.33\n", "''')\n", "\n", "df = pd.read_csv(data, sep='|')\n", "df['throughput per dollar (divided by 100, higher is better)'] = df['TPS']/df['cost/hour']/100 # Divide by 100 to normalize\n", "df['cost per 1M inferences in $ (lower is better)'] = (1000000/df['TPS'])/3600*df['cost/hour']\n", "df['number of inferences per dollar (higher is better)'] = df['TPS']*3600/df['cost/hour']\n", "df['transactions per hour'] = df['TPS']*3600\n", "\n", "df.head()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", " | experiment | \n", "TPS | \n", "p50 | \n", "cost/hour | \n", "throughput per dollar (divided by 100, higher is better) | \n", "cost per 1M inferences in $ (lower is better) | \n", "number of inferences per dollar (higher is better) | \n", "transactions per hour | \n", "
---|---|---|---|---|---|---|---|---|
0 | \n", "c5 | \n", "5.9 | \n", "215.23876 | \n", "0.230 | \n", "0.256522 | \n", "10.828625 | \n", "9.234783e+04 | \n", "21240.0 | \n", "
1 | \n", "c5 + Neo | \n", "12.9 | \n", "121.60897 | \n", "0.230 | \n", "0.560870 | \n", "4.952627 | \n", "2.019130e+05 | \n", "46440.0 | \n", "
2 | \n", "g4dn | \n", "74.7 | \n", "21.19393 | \n", "0.821 | \n", "0.909866 | \n", "3.052953 | \n", "3.275518e+05 | \n", "268920.0 | \n", "
3 | \n", "g4dn + Neo | \n", "140.4 | \n", "11.28725 | \n", "0.821 | \n", "1.710110 | \n", "1.624327 | \n", "6.156395e+05 | \n", "505440.0 | \n", "
4 | \n", "inf1 | \n", "304.3 | \n", "4.90315 | \n", "0.330 | \n", "9.221212 | \n", "0.301238 | \n", "3.319636e+06 | \n", "1095480.0 | \n", "