{
"cells": [
{
"cell_type": "markdown",
"id": "b0fa4989",
"metadata": {},
"source": [
"# Compiling HuggingFace models for AWS Inferentia"
]
},
{
"cell_type": "markdown",
"id": "9c8fc22c",
"metadata": {},
"source": [
"AWS Inferentia는 저렴한 비용으로 높은 처리량(throughput)과 짧은 레이턴시(low latency)의 추론 성능을 제공하기 위해 AWS에서 개발한 머신 러닝 추론 칩입니다. Inferentia 칩은 최신형 커스텀 2세대 Intel® Xeon® 프로세서 및 100Gbps 네트워킹과 결합되어 머신 러닝 추론 애플리케이션을 위한 고성능 및 업계에서 가장 낮은 비용을 제공합니다. AWS Inferentia 기반 Amazon EC2 Inf1 인스턴스는 Inferentia 칩에서 머신 러닝 모델을 컴파일&최적화할 수 있는 AWS Neuron 컴파일러, 런타임 및 프로파일링 도구가 포함되어 있습니다.\n",
"\n",
"AWS Neuron은 AWS Inferentia 칩을 사용하여 머신 러닝 추론을 실행하기 위한 SDK입니다. Neuron을 사용하면 딥러닝 프레임워크(PyTorch, TensorFlow, MXNet)에서 훈련된 컴퓨터 비전 및 자연어 처리 모델을 보다 빠르게 추론할 수 있습니다. 또한, [Dynamic Batching](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/appnotes/perf/torch-neuron-dataparallel-app-note.html#dynamic-batching-description)과 [Data Parallel](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/api-torch-neuron-dataparallel-api.html) 같은 기능을 활용하여 대용량 모델에 대한 추론 성능 개선이 가능합니다.\n",
"\n",
"Inf1 인스턴스는 SageMaker 호스팅 인스턴스로도 배포가 가능하며, 여러분은 아래 두 가지 옵션 중 하나를 선택하여 머신 러닝 모델을 쉽고 빠르게 배포할 수 있습니다.\n",
"\n",
"- **Option 1.** SageMaker Neo로 컴파일 후 Inf1 호스팅 인스턴스로 배포. 이 경우 SageMaker Neo에서 내부적으로 Neuron SDK를 사용하여 모델을 컴파일합니다. Hugging Face 모델은 컴파일 시에 dtype int64로 컴파일해야 합니다. \n",
"- **Option 2.** 개발 환경에서 Neuron SDK로 직접 컴파일 후 Inf1 호스팅 인스턴스로 배포 \n",
"\n",
"본 예제 노트북에서는 Option 2의 방법으로 허깅페이스 BERT 모델을 직접 컴파일 후, g4dn 인스턴스와 Inf1 인스턴스로 배포하여 처리량과 지연 시간에 대한 간단한 벤치마크를 수행합니다. \n",
"\n",
"### References\n",
"- AWS Neuron GitHub: https://github.com/aws/aws-neuron-sdk/\n",
"- AWS Neuron Developer Guide: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/"
]
},
{
"cell_type": "markdown",
"id": "3e4be91e",
"metadata": {},
"source": [
"
\n",
"\n",
"## Install Dependencies\n",
"---\n",
"\n",
"Neuron 컴파일을 위해 `torch-neuron`, `neuron-cc`를 설치해야 합니다. 컴파일을 Inf1 인스턴스에서 수행하실 필요는 없습니다."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b100ddd3",
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2657ac5a",
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade --no-cache-dir torch-neuron neuron-cc[tensorflow] torchvision torch --extra-index-url=https://pip.repos.neuron.amazonaws.com\n",
"!pip install --upgrade --no-cache-dir 'transformers==4.15.0'"
]
},
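{
"cell_type": "markdown",
"id": "a3f1c901",
"metadata": {},
"source": [
"As a quick, optional sanity check (not required for the rest of the notebook), you can confirm that the Neuron packages were installed into the current kernel environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f1c902",
"metadata": {},
"outputs": [],
"source": [
"# Show the installed versions of the Neuron compiler and the PyTorch-Neuron integration\n",
"!pip show torch-neuron neuron-cc | grep -E 'Name|Version'"
]
},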
{
"cell_type": "markdown",
"id": "b750e7c0",
"metadata": {},
"source": [
"
\n",
"\n",
"## 1. Get Model from HuggingFace Hub\n",
"---\n",
"\n",
"HuggingFace Model Hub의 BERT 파인 튜닝 모델을 가져옵니다. \n",
"\n",
"**[주의] 모델을 인스턴스화할 때 `return_dict=False`로 설정하지 않으면 neuron 컴파일이 정상적으로 수행되지 않습니다.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "242d03c9",
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torch_neuron\n",
"import json\n",
"from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig\n",
"from src.inference import model_fn, input_fn, predict_fn, output_fn\n",
"device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
"\n",
"model_id = 'bert-base-cased-finetuned-mrpc'\n",
"tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
"model = AutoModelForSequenceClassification.from_pretrained(model_id, return_dict=False).eval().to(device)\n",
"models = model, tokenizer"
]
},
{
"cell_type": "markdown",
"id": "736400f5",
"metadata": {},
"source": [
"### Understanding our inference code\n",
"\n",
"\n",
"#### For Normal instnace\n",
"먼저, 일반적인 인스턴스에서 추론을 수행하는 코드를 확인해 보겠습니다. SageMaker로 모델을 배포해 보신 분들은 익숙한 코드입니다. 단순화를 위해 HuggingFace Model Hub에서 직접 모델을 로드하며, `model_fn()`은 모델과 해당 토크나이저를 모두 포함하는 튜플을 반환합니다. 모델과 입력 데이터는 모두 `.to(device)`로 전송되며 디바이스는 CPU 또는 GPU가 될 수 있습니다."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "49a7ab51",
"metadata": {},
"outputs": [],
"source": [
"!pygmentize src/inference.py"
]
},
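{
"cell_type": "markdown",
"id": "a3f1c903",
"metadata": {},
"source": [
"For reference, here is a minimal sketch of the structure that `src/inference.py` follows; the function bodies below are illustrative assumptions rather than the exact contents of the file:\n",
"\n",
"```python\n",
"import json\n",
"import torch\n",
"from transformers import AutoTokenizer, AutoModelForSequenceClassification\n",
"\n",
"def model_fn(model_dir):\n",
"    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
"    model_id = 'bert-base-cased-finetuned-mrpc'\n",
"    tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
"    model = AutoModelForSequenceClassification.from_pretrained(model_id, return_dict=False).eval().to(device)\n",
"    return model, tokenizer\n",
"\n",
"def input_fn(request_body, content_type='application/json'):\n",
"    return json.loads(request_body)  # two sentences as a JSON list\n",
"\n",
"def predict_fn(data, models):\n",
"    model, tokenizer = models\n",
"    device = next(model.parameters()).device\n",
"    encoded = tokenizer.encode_plus(data[0], data[1], max_length=128, padding='max_length',\n",
"                                    truncation=True, return_tensors='pt').to(device)\n",
"    with torch.no_grad():\n",
"        logits = model(**encoded)[0]\n",
"    return {'paraphrase': int(logits.argmax().item())}\n",
"\n",
"def output_fn(prediction, accept='application/json'):\n",
"    return json.dumps(prediction)\n",
"```"
]
},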
{
"cell_type": "markdown",
"id": "87c3d08f",
"metadata": {},
"source": [
"#### For Inf1 instnace\n",
"이제 Inferentia용으로 컴파일된 모델로 추론을 수행하고자 할 때 추론 코드가 어떻게 변경되는지 살펴볼까요?\n",
"\n",
"`model_fn()`만 변경되었으며, 나머지 코드는 모두 동일합니다. 단, `.to(device)`가 제외된 것을 주목해 주세요. Neuron 런타임이 모델을 NeuronCores에 로드하기 때문입니다."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "216786d1",
"metadata": {},
"outputs": [],
"source": [
"!pygmentize src/inference_inf1.py"
]
},
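{
"cell_type": "markdown",
"id": "a3f1c904",
"metadata": {},
"source": [
"As a hedged sketch of that difference (the actual `src/inference_inf1.py` may differ in details), `model_fn()` simply loads the Neuron-compiled TorchScript with `torch.jit.load()` and never calls `.to(device)`:\n",
"\n",
"```python\n",
"import os\n",
"import torch\n",
"import torch_neuron  # registers the Neuron ops required to execute the compiled graph\n",
"from transformers import AutoTokenizer\n",
"\n",
"def model_fn(model_dir):\n",
"    tokenizer = AutoTokenizer.from_pretrained('bert-base-cased-finetuned-mrpc')\n",
"    # The Neuron runtime places this model on the NeuronCores; no .to(device) is needed.\n",
"    model = torch.jit.load(os.path.join(model_dir, 'neuron_compiled_model.pt'))\n",
"    return model, tokenizer\n",
"```"
]
},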
{
"cell_type": "markdown",
"id": "451b99cc",
"metadata": {},
"source": [
"문장 유사도를 판별하기 위한 3개의 샘플 문장들입니다."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a719e39",
"metadata": {},
"outputs": [],
"source": [
"sequence_0 = \"Machine learning is super easy and easy to follow\"\n",
"sequence_1 = \"Yesterday I went to the supermarket and bought meat.\"\n",
"sequence_2 = \"The best part of Amazon SageMaker is that it makes machine learning easy.\""
]
},
{
"cell_type": "markdown",
"id": "ecedb10c",
"metadata": {},
"source": [
"모델 추론 결과를 확인합니다. SageMaker 호스팅 인스턴스에 배포하기 위한 인터페이스를 구현 후 디버깅하는 것이 좋은 전략입니다."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8132a2f7",
"metadata": {},
"outputs": [],
"source": [
"inputs = json.dumps([sequence_0, sequence_1])\n",
"request_body = input_fn(inputs)\n",
"out_str = predict_fn(request_body, models)\n",
"response = output_fn(out_str)\n",
"print(request_body)\n",
"print(response)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "09ed7f06",
"metadata": {},
"outputs": [],
"source": [
"inputs = json.dumps([sequence_0, sequence_2])\n",
"request_body = input_fn(inputs)\n",
"out_str = predict_fn(request_body, models)\n",
"response = output_fn(out_str)\n",
"print(request_body)\n",
"print(response)"
]
},
{
"cell_type": "markdown",
"id": "b0085d4e",
"metadata": {},
"source": [
"
\n",
"\n",
"## 2. Compile the model into an AWS Neuron optimized TorchScript\n",
"---\n",
"\n",
"PyTorch-Neuron의 trace Python API는 TorchScript로 직렬화할 수 있는 Inferentia에서 실행할 PyTorch 모델을 컴파일합니다. PyTorch의 `torch.jit.trace()` 함수와 유사합니다. 컴파일 시간은 약 3분-5분 정도 소요됩니다."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6dcc1044",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"max_length = 128\n",
"\n",
"paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=max_length, padding='max_length', truncation=True, return_tensors=\"pt\")\n",
"not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors=\"pt\")\n",
"\n",
"# Convert example inputs to a format that is compatible with TorchScript tracing\n",
"example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']\n",
"example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']\n",
"\n",
"# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron\n",
"# This step may need 3-5 min\n",
"model_neuron = torch.neuron.trace(\n",
" model, example_inputs_paraphrase, verbose=1, compiler_workdir='./compilation_artifacts'\n",
")"
]
},
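{
"cell_type": "markdown",
"id": "a3f1c905",
"metadata": {},
"source": [
"If you happen to be compiling on an inf1 instance (the only place the Neuron runtime can actually execute the compiled graph), a quick optional sanity check is to compare the compiled model against the original on the same example inputs. This is a sketch only and will fail on an instance without NeuronCores; on inf1 there is no GPU, so `device` is the CPU and both calls use the CPU tensors prepared above:\n",
"\n",
"```python\n",
"with torch.no_grad():\n",
"    reference_logits = model(*example_inputs_paraphrase)[0]      # original PyTorch model\n",
"    neuron_logits = model_neuron(*example_inputs_paraphrase)[0]  # Neuron-compiled TorchScript\n",
"print(reference_logits, neuron_logits)\n",
"print('Same prediction:', int(reference_logits.argmax()) == int(neuron_logits.argmax()))\n",
"```"
]
},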
{
"cell_type": "markdown",
"id": "7c9d926a",
"metadata": {},
"source": [
"`model_neuron.graph`로 CPU에서 실행 중인 부분과 가속기에서 실행 중인 부분을 확인할 수 있습니다."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6f5a03e4",
"metadata": {},
"outputs": [],
"source": [
"# See which part is running on CPU versus running on the accelerator.\n",
"print(model_neuron.graph)"
]
},
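{
"cell_type": "markdown",
"id": "a3f1c906",
"metadata": {},
"source": [
"The introduction mentioned Neuron's Data Parallel and Dynamic Batching features. Below is a minimal sketch of how they would be used at serving time; it assumes an Inf1 instance with multiple NeuronCores and the compiled TorchScript saved as `neuron_compiled_model.pt`, so it is illustrative rather than part of this notebook's flow:\n",
"\n",
"```python\n",
"import torch\n",
"import torch_neuron\n",
"\n",
"# Load the Neuron-compiled TorchScript and replicate it across the visible NeuronCores.\n",
"model_neuron = torch.jit.load('neuron_compiled_model.pt')\n",
"model_parallel = torch.neuron.DataParallel(model_neuron)\n",
"\n",
"# Dynamic batching is enabled by default along dim 0, so requests whose batch size differs\n",
"# from the compiled batch size are split across the replicas automatically.\n",
"# outputs = model_parallel(input_ids, attention_mask, token_type_ids)\n",
"```"
]
},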
{
"cell_type": "markdown",
"id": "87985a5f",
"metadata": {},
"source": [
"Neuron으로 컴파일된 모델과 일반 인스턴스에서 사용할 0바이트의 더미 모델을 각각 `model.tar.gz`로 아카이빙하여 S3로 복사합니다."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f436416",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import boto3\n",
"import sys\n",
"import time\n",
"from sagemaker.utils import name_from_base\n",
"import sagemaker\n",
"\n",
"role = sagemaker.get_execution_role()\n",
"sess = sagemaker.Session()\n",
"region = sess.boto_region_name\n",
"bucket = sess.default_bucket()\n",
"sm_client = boto3.client('sagemaker')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4d40b9e",
"metadata": {},
"outputs": [],
"source": [
"os.makedirs('neuron_model', exist_ok=True)\n",
"os.makedirs('model', exist_ok=True)\n",
"\n",
"model_dir = 'model'\n",
"model_filename = 'model.pth'\n",
"model_neuron_dir = 'neuron_model'\n",
"model_neuron_filename = 'neuron_compiled_model.pt'\n",
"\n",
"os.makedirs(model_neuron_dir, exist_ok=True)\n",
"os.makedirs(model_dir, exist_ok=True)\n",
"\n",
"f = open(model_filename, 'w')\n",
"f.close()\n",
"!tar -czvf model.tar.gz {model_filename} && mv model.tar.gz {model_dir} && rm {model_filename}\n",
"\n",
"model_neuron.save(model_neuron_filename)\n",
"!tar -czvf model.tar.gz {model_neuron_filename} && mv model.tar.gz {model_neuron_dir} && rm {model_neuron_filename}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c41828aa",
"metadata": {},
"outputs": [],
"source": [
"def upload_s3(prefix, local_model_dir):\n",
" model_key = f'{prefix}/model.tar.gz'\n",
" s3_model_path = 's3://{}/{}'.format(bucket, model_key)\n",
" boto3.resource('s3').Bucket(bucket).upload_file(f'{local_model_dir}/model.tar.gz', model_key)\n",
" print(\"Uploaded model to S3:\")\n",
" print(s3_model_path)\n",
" return s3_model_path\n",
" \n",
"normal_prefix = 'normal/model'\n",
"neuron_prefix = 'inf1_compiled_model/model'\n",
"s3_model_path = upload_s3(normal_prefix, model_dir)\n",
"s3_model_neuron_path = upload_s3(neuron_prefix, model_neuron_dir) "
]
},
{
"cell_type": "markdown",
"id": "7697dfb6",
"metadata": {},
"source": [
"
\n",
"\n",
"## 3. Deploy Endpoint and run inference based on the pretrained model\n",
"---"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "244ef854",
"metadata": {},
"outputs": [],
"source": [
"!{sys.executable} -m pip install Transformers"
]
},
{
"cell_type": "markdown",
"id": "d6f6bc2f",
"metadata": {},
"source": [
"### [Optional] Deploying Model on Local\n",
"\n",
"디버깅을 위해 로컬 모드로 먼저 배포하는 것이 좋은 전략입니다."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3afd03bc",
"metadata": {},
"outputs": [],
"source": [
"DEBUG_LOCAL_MODE = False\n",
"#DEBUG_LOCAL_MODE = True"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "47502a05",
"metadata": {},
"outputs": [],
"source": [
"if DEBUG_LOCAL_MODE:\n",
" from sagemaker.pytorch.model import PyTorchModel\n",
" from sagemaker.predictor import Predictor\n",
" from sagemaker.serializers import JSONSerializer\n",
" from sagemaker.deserializers import JSONDeserializer\n",
" from datetime import datetime\n",
" local_model_path = f'file://{os.getcwd()}/model/model.tar.gz'\n",
"\n",
" sm_local_model = PyTorchModel(\n",
" model_data=local_model_path,\n",
" predictor_cls=Predictor,\n",
" framework_version='1.8.1',\n",
" role=role,\n",
" entry_point=\"inference.py\",\n",
" source_dir=\"src\", \n",
" py_version='py3'\n",
" )\n",
" \n",
" local_predictor = sm_local_model.deploy(\n",
" initial_instance_count=1,\n",
" instance_type='local',\n",
" serializer=JSONSerializer(),\n",
" deserializer=JSONDeserializer(), \n",
" ) "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e0744231",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"if DEBUG_LOCAL_MODE:\n",
" result = local_predictor.predict([sequence_0, sequence_1])\n",
" print(result)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a0a4dae",
"metadata": {},
"outputs": [],
"source": [
"if DEBUG_LOCAL_MODE:\n",
" local_predictor.delete_endpoint()\n",
" sm_local_model.delete_model()"
]
},
{
"cell_type": "markdown",
"id": "6b52de01",
"metadata": {},
"source": [
"### Deploying Model on g4dn Instance"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6230d91d",
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.pytorch.model import PyTorchModel\n",
"from sagemaker.predictor import Predictor\n",
"from sagemaker.serializers import JSONSerializer\n",
"from sagemaker.deserializers import JSONDeserializer\n",
"from datetime import datetime\n",
"date_string = datetime.now().strftime(\"%Y%m-%d%H-%M%S\")\n",
"\n",
"sm_model = PyTorchModel(\n",
" model_data=s3_model_path,\n",
" role=role,\n",
" predictor_cls=Predictor,\n",
" framework_version='1.8.1',\n",
" entry_point=\"inference.py\",\n",
" source_dir=\"src\", \n",
" py_version='py3',\n",
" name=f\"bert-classification-pt181-{date_string}\",\n",
" env={\"SAGEMAKER_CONTAINER_LOG_LEVEL\": \"10\"}, \n",
")\n",
"\n",
"predictor = sm_model.deploy(\n",
" initial_instance_count=1,\n",
" instance_type=\"ml.g4dn.xlarge\",\n",
" endpoint_name=f\"bert-classification-g4dn-{date_string}\", \n",
" serializer=JSONSerializer(),\n",
" deserializer=JSONDeserializer(), \n",
" wait=False \n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9f24c38f",
"metadata": {},
"outputs": [],
"source": [
"from IPython.core.display import display, HTML\n",
"\n",
"def make_endpoint_link(region, endpoint_name, endpoint_task):\n",
" endpoint_link = f'{endpoint_task} Review Endpoint' \n",
" return endpoint_link \n",
" \n",
"endpoint_link = make_endpoint_link(region, predictor.endpoint_name, '[Deploy normal model]')\n",
"display(HTML(endpoint_link))"
]
},
{
"cell_type": "markdown",
"id": "048245d2",
"metadata": {},
"source": [
"### Deploying Model on Inf1 Instance"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "13112309",
"metadata": {},
"outputs": [],
"source": [
"ecr_image = f'763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-inference-neuron:1.7.1-neuron-py36-ubuntu18.04'\n",
"\n",
"sm_neuron_model = PyTorchModel(\n",
" model_data=s3_model_neuron_path,\n",
" role=role,\n",
" framework_version=\"1.7.1\",\n",
" entry_point=\"inference_inf1.py\",\n",
" source_dir=\"src\", \n",
" image_uri=ecr_image,\n",
" name=f\"bert-classification-pt171-neuron-{date_string}\",\n",
" env={\"SAGEMAKER_CONTAINER_LOG_LEVEL\": \"10\"}, \n",
")\n",
"\n",
"# Let SageMaker know that we've already compiled the model via neuron-cc\n",
"sm_neuron_model._is_compiled_model = True"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a25ae3b4",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"neuron_predictor = sm_neuron_model.deploy(\n",
" initial_instance_count=1, \n",
" instance_type=\"ml.inf1.2xlarge\",\n",
" endpoint_name=f\"bert-classification-inf1-2x-{date_string}\", \n",
" serializer=JSONSerializer(),\n",
" deserializer=JSONDeserializer(), \n",
" wait=False\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "def8ce2d",
"metadata": {},
"outputs": [],
"source": [
"endpoint_link = make_endpoint_link(region, neuron_predictor.endpoint_name, '[Deploy neuron model]')\n",
"display(HTML(endpoint_link))"
]
},
{
"cell_type": "markdown",
"id": "f349f0d6",
"metadata": {},
"source": [
"### Wait for the endpoint jobs to complete\n",
"엔드포인트가 생성될 때까지 기다립니다. 약 5-10분의 시간이 소요됩니다."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3fefb480",
"metadata": {},
"outputs": [],
"source": [
"sess.wait_for_endpoint(predictor.endpoint_name, poll=5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00596e13",
"metadata": {},
"outputs": [],
"source": [
"sess.wait_for_endpoint(neuron_predictor.endpoint_name, poll=5)"
]
},
{
"cell_type": "markdown",
"id": "0a9a96cd",
"metadata": {},
"source": [
"### Inference Test\n",
"\n",
"모델 배포가 완료되었으면, 각 엔드포인트에 대해 추론을 수행합니다."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a10e5631",
"metadata": {},
"outputs": [],
"source": [
"sequence_0 = \"Machine learning is super easy and easy to follow\"\n",
"sequence_1 = \"Yesterday I went to the supermarket and bought meat.\"\n",
"sequence_2 = \"The best part of Amazon SageMaker is that it makes machine learning easy.\""
]
},
{
"cell_type": "markdown",
"id": "973b55b3",
"metadata": {},
"source": [
"#### For g4dn instance"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b8fbca3d",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"result = predictor.predict([sequence_0, sequence_1])\n",
"print(result)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c766d4e7",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"result = predictor.predict([sequence_0, sequence_2])\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"id": "5f93d52a",
"metadata": {},
"source": [
"#### For Inf1 instance"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4f4ea9c",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"result = neuron_predictor.predict([sequence_0, sequence_1])\n",
"print(result)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d286138f",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"result = neuron_predictor.predict([sequence_0, sequence_2])\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"id": "1872534b",
"metadata": {},
"source": [
"
\n",
"\n",
"## 4. Benchmark and comparison\n",
"---\n",
"\n",
"두 엔드포인트에 대한 간단한 벤치마크를 수행합니다. 각 벤치마크에서 우리는 각각 모델 엔드포인트에 1,000개의 요청을 수행하는 멀티프로세싱을 수행합니다. 각 요청에 대한 추론 지연 시간을 측정하고 작업을 완료하는 데 걸린 총 시간도 측정하여 요청 처리량/초(request throughput/second)를 추정할 수 있습니다."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "16166a67",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np \n",
"import datetime\n",
"import math\n",
"import time\n",
"import boto3 \n",
"import matplotlib.pyplot as plt\n",
"from joblib import Parallel, delayed\n",
"import numpy as np\n",
"from tqdm import tqdm\n",
"import random"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d28166b4",
"metadata": {},
"outputs": [],
"source": [
"def inference_latency(model,*inputs):\n",
" \"\"\"\n",
" infetence_time is a simple method to return the latency of a model inference.\n",
"\n",
" Parameters:\n",
" model: torch model onbject loaded using torch.jit.load\n",
" inputs: model() args\n",
"\n",
" Returns:\n",
" latency in seconds\n",
" \"\"\"\n",
" error = False\n",
" start = time.time()\n",
" try:\n",
" results = model(*inputs)\n",
" except:\n",
" error = True\n",
" results = []\n",
" return {'latency':time.time() - start, 'error': error, 'result': results}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d0cfd9ad",
"metadata": {},
"outputs": [],
"source": [
"def random_sentence():\n",
" \n",
" s_nouns = [\"A dude\", \"My mom\", \"The king\", \"Some guy\", \"A cat with rabies\", \"A sloth\", \"Your homie\", \"This cool guy my gardener met yesterday\", \"Superman\"]\n",
" p_nouns = [\"These dudes\", \"Both of my moms\", \"All the kings of the world\", \"Some guys\", \"All of a cattery's cats\", \"The multitude of sloths living under your bed\", \"Your homies\", \"Like, these, like, all these people\", \"Supermen\"]\n",
" s_verbs = [\"eats\", \"kicks\", \"gives\", \"treats\", \"meets with\", \"creates\", \"hacks\", \"configures\", \"spies on\", \"retards\", \"meows on\", \"flees from\", \"tries to automate\", \"explodes\"]\n",
" p_verbs = [\"eat\", \"kick\", \"give\", \"treat\", \"meet with\", \"create\", \"hack\", \"configure\", \"spy on\", \"retard\", \"meow on\", \"flee from\", \"try to automate\", \"explode\"]\n",
" infinitives = [\"to make a pie.\", \"for no apparent reason.\", \"because the sky is green.\", \"for a disease.\", \"to be able to make toast explode.\", \"to know more about archeology.\"]\n",
" \n",
" return (random.choice(s_nouns) + ' ' + random.choice(s_verbs) + ' ' + random.choice(s_nouns).lower() or random.choice(p_nouns).lower() + ' ' + random.choice(infinitives))\n",
"\n",
"print([random_sentence(), random_sentence()])"
]
},
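{
"cell_type": "markdown",
"id": "a3f1c907",
"metadata": {},
"source": [
"Before launching the full benchmark, a single smoke-test call (an optional step we add here) verifies the latency helper end to end and warms up the g4dn endpoint."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f1c908",
"metadata": {},
"outputs": [],
"source": [
"# One warm-up request against the g4dn endpoint using the helper defined above\n",
"print(inference_latency(predictor.predict, [random_sentence(), random_sentence()]))"
]
},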
{
"cell_type": "markdown",
"id": "75ce4c37",
"metadata": {},
"source": [
"### For g4dn instance"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "41755493",
"metadata": {},
"outputs": [],
"source": [
"# Defining Auxiliary variables\n",
"number_of_clients = 2\n",
"number_of_runs = 1000\n",
"t = tqdm(range(number_of_runs),position=0, leave=True)\n",
"\n",
"# Starting parallel clients\n",
"cw_start = datetime.datetime.utcnow()\n",
"\n",
"results = Parallel(n_jobs=number_of_clients,prefer=\"threads\")(delayed(inference_latency)(predictor.predict,[random_sentence(), random_sentence()]) for mod in t)\n",
"avg_throughput = t.total/t.format_dict['elapsed']\n",
"\n",
"cw_end = datetime.datetime.utcnow() \n",
"\n",
"# Computing metrics and print\n",
"latencies = [res['latency'] for res in results]\n",
"errors = [res['error'] for res in results]\n",
"error_p = sum(errors)/len(errors) *100\n",
"p50 = np.quantile(latencies[-1000:],0.50) * 1000\n",
"p90 = np.quantile(latencies[-1000:],0.95) * 1000\n",
"p95 = np.quantile(latencies[-1000:],0.99) * 1000\n",
"\n",
"print(f'Avg Throughput: :{avg_throughput:.1f}\\n')\n",
"print(f'50th Percentile Latency:{p50:.1f} ms')\n",
"print(f'90th Percentile Latency:{p90:.1f} ms')\n",
"print(f'95th Percentile Latency:{p95:.1f} ms\\n')\n",
"print(f'Errors percentage: {error_p:.1f} %\\n')\n",
"\n",
"# Querying CloudWatch\n",
"print('Getting Cloudwatch:')\n",
"cloudwatch = boto3.client('cloudwatch')\n",
"statistics=['SampleCount', 'Average', 'Minimum', 'Maximum']\n",
"extended=['p50', 'p90', 'p95', 'p100']\n",
"\n",
"# Give 5 minute buffer to end\n",
"cw_end += datetime.timedelta(minutes=5)\n",
"\n",
"# Period must be 1, 5, 10, 30, or multiple of 60\n",
"# Calculate closest multiple of 60 to the total elapsed time\n",
"factor = math.ceil((cw_end - cw_start).total_seconds() / 60)\n",
"period = factor * 60\n",
"print('Time elapsed: {} seconds'.format((cw_end - cw_start).total_seconds()))\n",
"print('Using period of {} seconds\\n'.format(period))\n",
"\n",
"cloudwatch_ready = False\n",
"# Keep polling CloudWatch metrics until datapoints are available\n",
"while not cloudwatch_ready:\n",
" time.sleep(30)\n",
" print('Waiting 30 seconds ...')\n",
" # Must use default units of microseconds\n",
" model_latency_metrics = cloudwatch.get_metric_statistics(MetricName='ModelLatency',\n",
" Dimensions=[{'Name': 'EndpointName',\n",
" 'Value': predictor.endpoint_name},\n",
" {'Name': 'VariantName',\n",
" 'Value': \"AllTraffic\"}],\n",
" Namespace=\"AWS/SageMaker\",\n",
" StartTime=cw_start,\n",
" EndTime=cw_end,\n",
" Period=period,\n",
" Statistics=statistics,\n",
" ExtendedStatistics=extended\n",
" )\n",
" # Should be 1000\n",
" if len(model_latency_metrics['Datapoints']) > 0:\n",
" print('{} latency datapoints ready'.format(model_latency_metrics['Datapoints'][0]['SampleCount']))\n",
" side_avg = model_latency_metrics['Datapoints'][0]['Average'] / number_of_runs\n",
" side_p50 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p50'] / number_of_runs\n",
" side_p90 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p90'] / number_of_runs\n",
" side_p95 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p95'] / number_of_runs\n",
" side_p100 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p100'] / number_of_runs\n",
"\n",
" print(f'50th Percentile Latency:{side_p50:.1f} ms')\n",
" print(f'90th Percentile Latency:{side_p90:.1f} ms')\n",
" print(f'95th Percentile Latency:{side_p95:.1f} ms\\n')\n",
"\n",
" cloudwatch_ready = True"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "533cf7c7",
"metadata": {},
"outputs": [],
"source": [
"from matplotlib.pyplot import hist, title, show, savefig, xlim\n",
"import numpy as np\n",
"\n",
"latency_percentiles = np.percentile(latencies, q=[50, 90, 95, 99])\n",
"\n",
"hist(latencies, bins=100)\n",
"title(\"Request latency histogram on GPU\")\n",
"xlim(0, 0.2)\n",
"show()\n",
"\n",
"print(\"==== Default HuggingFace model on GPU benchmark ====\\n\")\n",
"print(f\"95 % of requests take less than {latency_percentiles[2]*1000} ms\")\n",
"print(f\"Rough request throughput/second is {avg_throughput:.2f}\")"
]
},
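{
"cell_type": "markdown",
"id": "a3f1c909",
"metadata": {},
"source": [
"To compare the two endpoints at the end, keep the GPU run's numbers in separate variables before they are overwritten by the Inferentia run. This is an optional helper; the names `gpu_latencies` and `gpu_avg_throughput` are ours, not part of the original flow."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f1c910",
"metadata": {},
"outputs": [],
"source": [
"# Preserve the GPU benchmark results for the side-by-side summary after the Inf1 run\n",
"gpu_latencies = list(latencies)\n",
"gpu_avg_throughput = avg_throughput"
]
},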
{
"cell_type": "markdown",
"id": "bfd18cb4",
"metadata": {},
"source": [
"### For Inf1 instnace"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1a20c241",
"metadata": {},
"outputs": [],
"source": [
"# Defining Auxiliary variables\n",
"number_of_clients = 2\n",
"number_of_runs = 1000\n",
"t = tqdm(range(number_of_runs),position=0, leave=True)\n",
"\n",
"# Starting parallel clients\n",
"cw_start = datetime.datetime.utcnow()\n",
"\n",
"results = Parallel(n_jobs=number_of_clients,prefer=\"threads\")(delayed(inference_latency)(neuron_predictor.predict,[random_sentence(), random_sentence()]) for mod in t)\n",
"avg_throughput = t.total/t.format_dict['elapsed']\n",
"\n",
"cw_end = datetime.datetime.utcnow() \n",
"\n",
"# Computing metrics and print\n",
"latencies = [res['latency'] for res in results]\n",
"errors = [res['error'] for res in results]\n",
"error_p = sum(errors)/len(errors) *100\n",
"p50 = np.quantile(latencies[-1000:],0.50) * 1000\n",
"p90 = np.quantile(latencies[-1000:],0.95) * 1000\n",
"p95 = np.quantile(latencies[-1000:],0.99) * 1000\n",
"\n",
"print(f'Avg Throughput: :{avg_throughput:.1f}\\n')\n",
"print(f'50th Percentile Latency:{p50:.1f} ms')\n",
"print(f'90th Percentile Latency:{p90:.1f} ms')\n",
"print(f'95th Percentile Latency:{p95:.1f} ms\\n')\n",
"print(f'Errors percentage: {error_p:.1f} %\\n')\n",
"\n",
"# Querying CloudWatch\n",
"print('Getting Cloudwatch:')\n",
"cloudwatch = boto3.client('cloudwatch')\n",
"statistics=['SampleCount', 'Average', 'Minimum', 'Maximum']\n",
"extended=['p50', 'p90', 'p95', 'p100']\n",
"\n",
"# Give 5 minute buffer to end\n",
"cw_end += datetime.timedelta(minutes=5)\n",
"\n",
"# Period must be 1, 5, 10, 30, or multiple of 60\n",
"# Calculate closest multiple of 60 to the total elapsed time\n",
"factor = math.ceil((cw_end - cw_start).total_seconds() / 60)\n",
"period = factor * 60\n",
"print('Time elapsed: {} seconds'.format((cw_end - cw_start).total_seconds()))\n",
"print('Using period of {} seconds\\n'.format(period))\n",
"\n",
"cloudwatch_ready = False\n",
"# Keep polling CloudWatch metrics until datapoints are available\n",
"while not cloudwatch_ready:\n",
" time.sleep(30)\n",
" print('Waiting 30 seconds ...')\n",
" # Must use default units of microseconds\n",
" model_latency_metrics = cloudwatch.get_metric_statistics(MetricName='ModelLatency',\n",
" Dimensions=[{'Name': 'EndpointName',\n",
" 'Value': neuron_predictor.endpoint_name},\n",
" {'Name': 'VariantName',\n",
" 'Value': \"AllTraffic\"}],\n",
" Namespace=\"AWS/SageMaker\",\n",
" StartTime=cw_start,\n",
" EndTime=cw_end,\n",
" Period=period,\n",
" Statistics=statistics,\n",
" ExtendedStatistics=extended\n",
" )\n",
" # Should be 1000\n",
" if len(model_latency_metrics['Datapoints']) > 0:\n",
" print('{} latency datapoints ready'.format(model_latency_metrics['Datapoints'][0]['SampleCount']))\n",
" side_avg = model_latency_metrics['Datapoints'][0]['Average'] / number_of_runs\n",
" side_p50 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p50'] / number_of_runs\n",
" side_p90 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p90'] / number_of_runs\n",
" side_p95 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p95'] / number_of_runs\n",
" side_p100 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p100'] / number_of_runs\n",
"\n",
" print(f'50th Percentile Latency:{side_p50:.1f} ms')\n",
" print(f'90th Percentile Latency:{side_p90:.1f} ms')\n",
" print(f'95th Percentile Latency:{side_p95:.1f} ms\\n')\n",
"\n",
" cloudwatch_ready = True"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e2f75bad",
"metadata": {},
"outputs": [],
"source": [
"from matplotlib.pyplot import hist, title, show, savefig, xlim\n",
"import numpy as np\n",
"\n",
"latency_percentiles = np.percentile(latencies, q=[50, 90, 95, 99])\n",
"\n",
"hist(latencies, bins=100)\n",
"title(\"Request latency histogram for Inferentia\")\n",
"xlim(0, 0.2)\n",
"show()\n",
"\n",
"print(\"==== HuggingFace model compiled for Inferentia benchmark ====\\n\")\n",
"print(f\"95 % of requests take less than {latency_percentiles[2]*1000} ms\")\n",
"print(f\"Rough request throughput/second is {avg_throughput:.2f}\")"
]
},
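{
"cell_type": "markdown",
"id": "a3f1c911",
"metadata": {},
"source": [
"Finally, a short side-by-side summary of the two runs. It assumes the optional cell that saved `gpu_latencies` and `gpu_avg_throughput` after the g4dn benchmark was executed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f1c912",
"metadata": {},
"outputs": [],
"source": [
"# Compare client-side p50/p95 latency and rough throughput for the two endpoints\n",
"for name, lat, thr in [('g4dn (GPU)', gpu_latencies, gpu_avg_throughput),\n",
"                       ('inf1 (Inferentia)', latencies, avg_throughput)]:\n",
"    p50_ms, p95_ms = np.percentile(lat, [50, 95]) * 1000\n",
"    print(f'{name:<18} p50: {p50_ms:6.1f} ms  p95: {p95_ms:6.1f} ms  throughput: ~{thr:.1f} req/s')"
]
},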
{
"cell_type": "markdown",
"id": "e7a2bd7d",
"metadata": {},
"source": [
"### Wrap-up\n",
"\n",
"\n",
"Inferentia 기반 인스턴스로 모델 배포 시, 비용 절감과 성능 향상을 동시에 누릴 수 있다는 것이 매우 매력적입니다. 예제 코드를 통해 확인해 보았듯이, 러닝 커브 없이 친숙한 인터페이스와 API를 사용하여 Inferentia용 모델을 컴파일할 수 있습니다. 여러분께서도 본 코드를 활용하여 여러분의 모델을 자유롭게 컴파일해 보세요."
]
},
{
"cell_type": "markdown",
"id": "adfbe930",
"metadata": {},
"source": [
"
\n",
"\n",
"## Endpoint Clean-up\n",
"SageMaker Endpoint로 인한 과금을 막기 위해, 본 핸즈온이 끝나면 반드시 Endpoint를 삭제해 주시기 바랍니다."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36b664c6",
"metadata": {},
"outputs": [],
"source": [
"predictor.delete_endpoint()\n",
"sm_model.delete_model()\n",
"neuron_predictor.delete_endpoint()\n",
"sm_neuron_model.delete_model()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "40fa7b05",
"metadata": {},
"outputs": [],
"source": [
"!rm -rf compilation_artifacts {model_dir} {model_neuron_dir}"
]
}
],
"metadata": {
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "conda_pytorch_p38",
"language": "python",
"name": "conda_pytorch_p38"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}