# Compiling HuggingFace models for AWS Inferentia

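AWS Inferentia is a machine learning inference chip designed by AWS to deliver high throughput and low latency at low cost. Inferentia chips are paired with the latest custom 2nd Generation Intel® Xeon® processors and 100Gbps networking to offer high performance and the lowest cost in the industry for machine learning inference applications. Amazon EC2 Inf1 instances, which are based on AWS Inferentia, come with the AWS Neuron compiler, runtime, and profiling tools for compiling and optimizing machine learning models for Inferentia chips.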

AWS Neuron is the SDK for running machine learning inference on AWS Inferentia chips. With Neuron, you can run faster inference for computer vision and natural language processing models trained in deep learning frameworks (PyTorch, TensorFlow, MXNet). In addition, features such as [Dynamic Batching](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/appnotes/perf/torch-neuron-dataparallel-app-note.html#dynamic-batching-description) and [Data Parallel](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/api-torch-neuron-dataparallel-api.html) can improve inference performance for large models.

Inf1 instances can also be used as SageMaker hosting instances, and you can deploy a machine learning model quickly and easily by choosing one of the two options below.

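- **Option 1.** Compile with SageMaker Neo, then deploy to an Inf1 hosting instance. In this case, SageMaker Neo uses the Neuron SDK internally to compile the model. Hugging Face models must be compiled with dtype int64.
- **Option 2.** Compile directly with the Neuron SDK in your development environment, then deploy to an Inf1 hosting instance.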

This example notebook follows Option 2: we compile a HuggingFace BERT model ourselves, deploy it to both a g4dn instance and an Inf1 instance, and run a simple benchmark of throughput and latency.

### References
- AWS Neuron GitHub: https://github.com/aws/aws-neuron-sdk/
- AWS Neuron Developer Guide: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/

<br>

## Install Dependencies
---

Neuron compilation requires `torch-neuron` and `neuron-cc`. You do not need to run the compilation on an Inf1 instance.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
!pip install --upgrade --no-cache-dir torch-neuron neuron-cc[tensorflow] torchvision torch --extra-index-url=https://pip.repos.neuron.amazonaws.com
!pip install --upgrade --no-cache-dir 'transformers==4.15.0'

<br>

## 1. Get Model from HuggingFace Hub
---

Fetch a fine-tuned BERT model from the HuggingFace Model Hub.

**[Caution] If you do not set `return_dict=False` when instantiating the model, Neuron compilation will not work properly.**

In [None]:
import torch
import torch_neuron
import json
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
from src.inference import model_fn, input_fn, predict_fn, output_fn
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model_id = 'bert-base-cased-finetuned-mrpc'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, return_dict=False).eval().to(device)
models = model, tokenizer

### Understanding our inference code


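#### For Normal instance
First, let's look at the code that runs inference on a regular instance. It will look familiar if you have deployed models with SageMaker before. For simplicity, the model is loaded directly from the HuggingFace Model Hub, and `model_fn()` returns a tuple containing both the model and its tokenizer. Both the model and the input data are moved with `.to(device)`, where the device can be a CPU or a GPU.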

In [None]:
!pygmentize src/inference.py
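
The contents of `src/inference.py` are not reproduced in this notebook, so the following is only a minimal sketch of the four-handler interface that the SageMaker PyTorch serving container calls (`model_fn`, `input_fn`, `predict_fn`, `output_fn`). The word-overlap "model" is a hypothetical stand-in for BERT so that the sketch runs without any dependencies; the real handlers would load the model and tokenizer instead.

```python
import json


def model_fn(model_dir):
    """Load and return whatever predict_fn needs (here: a toy stand-in model)."""
    # Hypothetical stand-in for BERT: Jaccard word overlap between two sentences.
    def toy_model(sentence_a, sentence_b):
        words_a = set(sentence_a.lower().split())
        words_b = set(sentence_b.lower().split())
        union = words_a | words_b
        return len(words_a & words_b) / len(union) if union else 0.0

    return toy_model


def input_fn(request_body, content_type="application/json"):
    """Deserialize the request payload into a pair of sentences."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    sentence_a, sentence_b = json.loads(request_body)
    return sentence_a, sentence_b


def predict_fn(input_data, model):
    """Run the model on the deserialized input."""
    sentence_a, sentence_b = input_data
    return {"score": model(sentence_a, sentence_b)}


def output_fn(prediction, accept="application/json"):
    """Serialize the prediction for the response body."""
    return json.dumps(prediction)


# Chain the handlers locally in the order the serving container calls them.
model = model_fn(model_dir=".")
payload = json.dumps(["machine learning is easy", "machine learning is fun"])
response = output_fn(predict_fn(input_fn(payload), model))
print(response)
```

Calling the handlers in sequence like this, before any endpoint exists, is a cheap way to catch serialization mistakes early.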

#### For Inf1 instance
Now let's see how the inference code changes when we run inference with a model compiled for Inferentia.

Only `model_fn()` changes; the rest of the code is identical. Note, however, that `.to(device)` is gone, because the Neuron runtime loads the model onto the NeuronCores.

In [None]:
!pygmentize src/inference_inf1.py

Here are three sample sentences for testing sentence similarity.

In [None]:
sequence_0 = "Machine learning is super easy and easy to follow"
sequence_1 = "Yesterday I went to the supermarket and bought meat."
sequence_2 = "The best part of Amazon SageMaker is that it makes machine learning easy."

Check the model's inference results. It is a good strategy to implement the interface needed for deployment to a SageMaker hosting instance and debug it before deploying.

In [None]:
inputs = json.dumps([sequence_0, sequence_1])
request_body = input_fn(inputs)
out_str = predict_fn(request_body, models)
response = output_fn(out_str)
print(request_body)
print(response)

In [None]:
inputs = json.dumps([sequence_0, sequence_2])
request_body = input_fn(inputs)
out_str = predict_fn(request_body, models)
response = output_fn(out_str)
print(request_body)
print(response)

<br>

## 2. Compile the model into an AWS Neuron optimized TorchScript
---

PyTorch-Neuron's trace Python API compiles a PyTorch model to run on Inferentia and produces an artifact that can be serialized as TorchScript. It is similar to PyTorch's `torch.jit.trace()` function. Compilation takes roughly 3-5 minutes.

In [None]:
%%time

max_length = 128

paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")

# Convert example inputs to a format that is compatible with TorchScript tracing
example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']
example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']

# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron
# This step may need 3-5 min
model_neuron = torch.neuron.trace(
    model, example_inputs_paraphrase, verbose=1, compiler_workdir='./compilation_artifacts'
)
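
A model traced this way is compiled for the shape of the example inputs, which is why the cell above pads every input to `max_length=128`. At inference time, inputs need to be brought to that same fixed length. The tokenizer handles this in the notebook; the sketch below only illustrates the idea in plain Python. The helper name `pad_to_length`, the token IDs, and the pad ID of 0 are assumptions for illustration, and real tokenizer truncation also preserves special tokens, which this simplified version does not.

```python
def pad_to_length(token_ids, max_length=128, pad_id=0):
    """Truncate or right-pad token_ids to exactly max_length.

    Returns the fixed-length IDs together with the matching attention mask
    (1 for real tokens, 0 for padding).
    """
    clipped = list(token_ids[:max_length])
    n_pad = max_length - len(clipped)
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return clipped + [pad_id] * n_pad, attention_mask


# A short input is padded; a long input is truncated.
short_ids, short_mask = pad_to_length([101, 7592, 102], max_length=8)
long_ids, long_mask = pad_to_length(range(20), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```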

`model_neuron.graph` shows which parts run on the CPU and which run on the accelerator.

In [None]:
# See which part is running on CPU versus running on the accelerator.
print(model_neuron.graph)

Archive the Neuron-compiled model and a 0-byte dummy model (for the regular instance) into separate `model.tar.gz` files and copy them to S3.

In [None]:
import os
import boto3
import sys
import time
from sagemaker.utils import name_from_base
import sagemaker

role = sagemaker.get_execution_role()
sess = sagemaker.Session()
region = sess.boto_region_name
bucket = sess.default_bucket()
sm_client = boto3.client('sagemaker')

In [None]:
model_dir = 'model'
model_filename = 'model.pth'
model_neuron_dir = 'neuron_model'
model_neuron_filename = 'neuron_compiled_model.pt'

os.makedirs(model_neuron_dir, exist_ok=True)
os.makedirs(model_dir, exist_ok=True)

# Create the 0-byte placeholder model used by the regular (non-Neuron) endpoint
with open(model_filename, 'w'):
    pass
!tar -czvf model.tar.gz {model_filename} && mv model.tar.gz {model_dir} && rm {model_filename}

model_neuron.save(model_neuron_filename)
!tar -czvf model.tar.gz {model_neuron_filename} && mv model.tar.gz {model_neuron_dir} && rm {model_neuron_filename}
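
The shell `tar` calls above can also be expressed with Python's standard `tarfile` module, which avoids the move-and-delete steps and works outside a notebook. This is a sketch of an alternative, not what the notebook runs; `make_model_archive` is a hypothetical helper name, and the example uses a temporary directory so it does not touch the notebook's files.

```python
import os
import tarfile
import tempfile


def make_model_archive(model_path, output_dir):
    """Create output_dir/model.tar.gz with model_path stored at the archive root."""
    os.makedirs(output_dir, exist_ok=True)
    archive_path = os.path.join(output_dir, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname drops the directory part so the file sits at the archive root
        tar.add(model_path, arcname=os.path.basename(model_path))
    return archive_path


with tempfile.TemporaryDirectory() as workdir:
    # 0-byte placeholder, like the dummy model.pth in the notebook
    placeholder = os.path.join(workdir, "model.pth")
    with open(placeholder, "w"):
        pass
    archive = make_model_archive(placeholder, os.path.join(workdir, "model"))
    with tarfile.open(archive) as tar:
        members = tar.getnames()

print(members)
```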

In [None]:
def upload_s3(prefix, local_model_dir):
    model_key = f'{prefix}/model.tar.gz'
    s3_model_path = 's3://{}/{}'.format(bucket, model_key)
    boto3.resource('s3').Bucket(bucket).upload_file(f'{local_model_dir}/model.tar.gz', model_key)
    print("Uploaded model to S3:")
    print(s3_model_path)
    return s3_model_path
    
normal_prefix = 'normal/model'
neuron_prefix = 'inf1_compiled_model/model'
s3_model_path = upload_s3(normal_prefix, model_dir)
s3_model_neuron_path = upload_s3(neuron_prefix, model_neuron_dir)    

<br>

## 3. Deploy Endpoint and run inference based on the pretrained model
---

In [None]:
!{sys.executable} -m pip install Transformers

### [Optional] Deploying Model on Local

Deploying in local mode first is a good strategy for debugging.

In [None]:
DEBUG_LOCAL_MODE = False
#DEBUG_LOCAL_MODE = True

In [None]:
if DEBUG_LOCAL_MODE:
    from sagemaker.pytorch.model import PyTorchModel
    from sagemaker.predictor import Predictor
    from sagemaker.serializers import JSONSerializer
    from sagemaker.deserializers import JSONDeserializer
    from datetime import datetime
    local_model_path = f'file://{os.getcwd()}/model/model.tar.gz'

    sm_local_model = PyTorchModel(
        model_data=local_model_path,
        predictor_cls=Predictor,
        framework_version='1.8.1',
        role=role,
        entry_point="inference.py",
        source_dir="src", 
        py_version='py3'
    )
    
    local_predictor = sm_local_model.deploy(
        initial_instance_count=1,
        instance_type='local',
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),    
    )    

In [None]:
%%time
if DEBUG_LOCAL_MODE:
    result = local_predictor.predict([sequence_0, sequence_1])
    print(result)

In [None]:
if DEBUG_LOCAL_MODE:
    local_predictor.delete_endpoint()
    sm_local_model.delete_model()

### Deploying Model on g4dn Instance

In [None]:
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
from datetime import datetime
date_string = datetime.now().strftime("%Y%m-%d%H-%M%S")

sm_model = PyTorchModel(
    model_data=s3_model_path,
    role=role,
    predictor_cls=Predictor,
    framework_version='1.8.1',
    entry_point="inference.py",
    source_dir="src", 
    py_version='py3',
    name=f"bert-classification-pt181-{date_string}",
    env={"SAGEMAKER_CONTAINER_LOG_LEVEL": "10"},    
)

predictor = sm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    endpoint_name=f"bert-classification-g4dn-{date_string}",    
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),    
    wait=False    
)

In [None]:
from IPython.display import display, HTML

def make_endpoint_link(region, endpoint_name, endpoint_task):
    endpoint_link = f'<b><a target="blank" href="https://console.aws.amazon.com/sagemaker/home?region={region}#/endpoints/{endpoint_name}">{endpoint_task} Review Endpoint</a></b>'   
    return endpoint_link 
        
endpoint_link = make_endpoint_link(region, predictor.endpoint_name, '[Deploy normal model]')
display(HTML(endpoint_link))

### Deploying Model on Inf1 Instance

In [None]:
ecr_image = f'763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-inference-neuron:1.7.1-neuron-py36-ubuntu18.04'

sm_neuron_model = PyTorchModel(
    model_data=s3_model_neuron_path,
    role=role,
    framework_version="1.7.1",
    entry_point="inference_inf1.py",
    source_dir="src",    
    image_uri=ecr_image,
    name=f"bert-classification-pt171-neuron-{date_string}",
    env={"SAGEMAKER_CONTAINER_LOG_LEVEL": "10"},       
)

# Let SageMaker know that we've already compiled the model via neuron-cc
sm_neuron_model._is_compiled_model = True

In [None]:
%%time
neuron_predictor = sm_neuron_model.deploy(
    initial_instance_count=1, 
    instance_type="ml.inf1.2xlarge",
    endpoint_name=f"bert-classification-inf1-2x-{date_string}",        
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),       
    wait=False
)

In [None]:
endpoint_link = make_endpoint_link(region, neuron_predictor.endpoint_name, '[Deploy neuron model]')
display(HTML(endpoint_link))

### Wait for the endpoint jobs to complete
Wait until the endpoints are created. This takes about 5-10 minutes.

In [None]:
sess.wait_for_endpoint(predictor.endpoint_name, poll=5)

In [None]:
sess.wait_for_endpoint(neuron_predictor.endpoint_name, poll=5)
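
Conceptually, waiting for an endpoint is a polling loop over its status. The sketch below shows that loop with the status lookup injected as a function, so it runs without AWS access; `wait_until_in_service` is a hypothetical helper, not part of the SageMaker SDK. With real credentials, `get_status` could be something like `lambda: sm_client.describe_endpoint(EndpointName=name)["EndpointStatus"]`.

```python
import time


def wait_until_in_service(get_status, poll_seconds=5, max_polls=120):
    """Poll get_status() until it returns 'InService'.

    Raises RuntimeError if the status becomes 'Failed' and TimeoutError
    if max_polls is exhausted first.
    """
    for _ in range(max_polls):
        status = get_status()
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("Endpoint creation failed")
        time.sleep(poll_seconds)
    raise TimeoutError("Endpoint did not reach InService in time")


# Stand-in status source that simulates an endpoint finishing creation.
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_until_in_service(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```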

### Inference Test

Once deployment is complete, run inference against each endpoint.

In [None]:
sequence_0 = "Machine learning is super easy and easy to follow"
sequence_1 = "Yesterday I went to the supermarket and bought meat."
sequence_2 = "The best part of Amazon SageMaker is that it makes machine learning easy."

#### For g4dn instance

In [None]:
%%time
result = predictor.predict([sequence_0, sequence_1])
print(result)

In [None]:
%%time
result = predictor.predict([sequence_0, sequence_2])
print(result)

#### For Inf1 instance

In [None]:
%%time
result = neuron_predictor.predict([sequence_0, sequence_1])
print(result)

In [None]:
%%time
result = neuron_predictor.predict([sequence_0, sequence_2])
print(result)

<br>

## 4. Benchmark and comparison
---

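We run a simple benchmark against the two endpoints. In each benchmark, parallel client threads send a total of 1,000 requests to the model endpoint. We measure the inference latency of each request as well as the total time taken to finish the job, from which we estimate the request throughput per second.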

In [None]:
import datetime
import math
import random
import time

import boto3
import matplotlib.pyplot as plt
import numpy as np
from joblib import Parallel, delayed
from tqdm import tqdm

In [None]:
def inference_latency(model, *inputs):
    """
    inference_latency is a simple method to return the latency of a model inference.

        Parameters:
            model: callable to time, e.g. a predictor's predict method or a torch model object loaded using torch.jit.load
            inputs: model() args

        Returns:
            dict with the latency in seconds, an error flag, and the result
    """
    error = False
    start = time.time()
    try:
        results = model(*inputs)
    except Exception:
        # Record the failure instead of aborting the whole benchmark
        error = True
        results = []
    return {'latency': time.time() - start, 'error': error, 'result': results}
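
To see how per-request latency and overall throughput fall out of such a timing wrapper, here is a self-contained sketch that uses the standard library's `ThreadPoolExecutor` in place of joblib and a hypothetical `fake_predict` function in place of a real endpoint call. The request count and sleep time are arbitrary values chosen so the example finishes quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(fn, *args):
    """Call fn(*args) and return its latency in seconds, an error flag, and the result."""
    start = time.perf_counter()
    try:
        result, error = fn(*args), False
    except Exception:
        result, error = None, True
    return {"latency": time.perf_counter() - start, "error": error, "result": result}


def fake_predict(payload):
    """Hypothetical stand-in for predictor.predict: sleeps briefly, echoes the input size."""
    time.sleep(0.001)
    return {"n_inputs": len(payload)}


number_of_clients = 2
number_of_runs = 20

wall_start = time.perf_counter()
with ThreadPoolExecutor(max_workers=number_of_clients) as pool:
    futures = [pool.submit(timed_call, fake_predict, ["a", "b"]) for _ in range(number_of_runs)]
    records = [future.result() for future in futures]
elapsed = time.perf_counter() - wall_start

# Throughput comes from wall-clock time, not from the sum of per-request latencies.
throughput = number_of_runs / elapsed
error_rate = sum(record["error"] for record in records) / len(records) * 100
print(f"{throughput:.1f} requests/s, {error_rate:.1f} % errors")
```

Because the clients run concurrently, dividing the request count by wall-clock time gives a higher (and more realistic) throughput than inverting the mean latency would.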

In [None]:
def random_sentence():
    
    s_nouns = ["A dude", "My mom", "The king", "Some guy", "A cat with rabies", "A sloth", "Your homie", "This cool guy my gardener met yesterday", "Superman"]
    p_nouns = ["These dudes", "Both of my moms", "All the kings of the world", "Some guys", "All of a cattery's cats", "The multitude of sloths living under your bed", "Your homies", "Like, these, like, all these people", "Supermen"]
    s_verbs = ["eats", "kicks", "gives", "treats", "meets with", "creates", "hacks", "configures", "spies on", "retards", "meows on", "flees from", "tries to automate", "explodes"]
    p_verbs = ["eat", "kick", "give", "treat", "meet with", "create", "hack", "configure", "spy on", "retard", "meow on", "flee from", "try to automate", "explode"]
    infinitives = ["to make a pie.", "for no apparent reason.", "because the sky is green.", "for a disease.", "to be able to make toast explode.", "to know more about archeology."]
    
    # Subject + verb + object (singular or plural noun) + infinitive clause
    obj = random.choice(s_nouns + p_nouns).lower()
    return random.choice(s_nouns) + ' ' + random.choice(s_verbs) + ' ' + obj + ' ' + random.choice(infinitives)

print([random_sentence(), random_sentence()])

### For g4dn instance

In [None]:
# Defining Auxiliary variables
number_of_clients = 2
number_of_runs = 1000
t = tqdm(range(number_of_runs),position=0, leave=True)

# Starting parallel clients
cw_start = datetime.datetime.utcnow()

results = Parallel(n_jobs=number_of_clients,prefer="threads")(delayed(inference_latency)(predictor.predict,[random_sentence(), random_sentence()]) for mod in t)
avg_throughput = t.total/t.format_dict['elapsed']

cw_end = datetime.datetime.utcnow() 

# Computing metrics and print
latencies = [res['latency'] for res in results]
errors = [res['error'] for res in results]
error_p = sum(errors)/len(errors) *100
p50 = np.quantile(latencies[-1000:],0.50) * 1000
p90 = np.quantile(latencies[-1000:],0.90) * 1000
p95 = np.quantile(latencies[-1000:],0.95) * 1000

print(f'Avg Throughput: {avg_throughput:.1f}\n')
print(f'50th Percentile Latency:{p50:.1f} ms')
print(f'90th Percentile Latency:{p90:.1f} ms')
print(f'95th Percentile Latency:{p95:.1f} ms\n')
print(f'Errors percentage: {error_p:.1f} %\n')

# Querying CloudWatch
print('Getting Cloudwatch:')
cloudwatch = boto3.client('cloudwatch')
statistics=['SampleCount', 'Average', 'Minimum', 'Maximum']
extended=['p50', 'p90', 'p95', 'p100']

# Give 5 minute buffer to end
cw_end += datetime.timedelta(minutes=5)

# Period must be 1, 5, 10, 30, or multiple of 60
# Calculate closest multiple of 60 to the total elapsed time
factor = math.ceil((cw_end - cw_start).total_seconds() / 60)
period = factor * 60
print('Time elapsed: {} seconds'.format((cw_end - cw_start).total_seconds()))
print('Using period of {} seconds\n'.format(period))

cloudwatch_ready = False
# Keep polling CloudWatch metrics until datapoints are available
while not cloudwatch_ready:
    time.sleep(30)
    print('Waiting 30 seconds ...')
    # Must use default units of microseconds
    model_latency_metrics = cloudwatch.get_metric_statistics(MetricName='ModelLatency',
                                             Dimensions=[{'Name': 'EndpointName',
                                                          'Value': predictor.endpoint_name},
                                                         {'Name': 'VariantName',
                                                          'Value': "AllTraffic"}],
                                             Namespace="AWS/SageMaker",
                                             StartTime=cw_start,
                                             EndTime=cw_end,
                                             Period=period,
                                             Statistics=statistics,
                                             ExtendedStatistics=extended
                                             )
    # SampleCount should equal the number of requests sent (1,000)
    if len(model_latency_metrics['Datapoints']) > 0:
        print('{} latency datapoints ready'.format(model_latency_metrics['Datapoints'][0]['SampleCount']))
        # ModelLatency is reported in microseconds; divide by 1,000 to convert to milliseconds
        side_avg = model_latency_metrics['Datapoints'][0]['Average'] / 1000
        side_p50 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p50'] / 1000
        side_p90 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p90'] / 1000
        side_p95 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p95'] / 1000
        side_p100 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p100'] / 1000

        print(f'50th Percentile Latency:{side_p50:.1f} ms')
        print(f'90th Percentile Latency:{side_p90:.1f} ms')
        print(f'95th Percentile Latency:{side_p95:.1f} ms\n')

        cloudwatch_ready = True
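
Two small calculations in the cell above are easy to get wrong: rounding the query window up to a multiple of 60 seconds for the CloudWatch `Period`, and converting `ModelLatency` values from microseconds to milliseconds. The sketch below isolates both so they can be checked without calling CloudWatch. The helper names and the sample timestamps are assumptions for illustration.

```python
import datetime
import math


def cloudwatch_period(start, end):
    """Round the elapsed time between start and end up to a multiple of 60 seconds."""
    elapsed_seconds = (end - start).total_seconds()
    # max(1, ...) keeps the period at 60 s or more even for a zero-length window
    return max(1, math.ceil(elapsed_seconds / 60)) * 60


def microseconds_to_ms(value_us):
    """Convert a latency in microseconds to milliseconds."""
    return value_us / 1000.0


# A 95-second benchmark plus the 5-minute buffer used in the notebook.
start = datetime.datetime(2022, 1, 1, 0, 0, 0)
end = start + datetime.timedelta(seconds=95) + datetime.timedelta(minutes=5)

period = cloudwatch_period(start, end)
latency_ms = microseconds_to_ms(12500)
print(period, latency_ms)
```

Keeping the conversion factor as an explicit 1,000 means the result stays correct even if the number of benchmark requests changes.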

In [None]:
from matplotlib.pyplot import hist, title, show, savefig, xlim
import numpy as np

latency_percentiles = np.percentile(latencies, q=[50, 90, 95, 99])

hist(latencies, bins=100)
title("Request latency histogram on GPU")
xlim(0, 0.2)
show()

print("==== Default HuggingFace model on GPU benchmark ====\n")
print(f"95 % of requests take less than {latency_percentiles[2]*1000} ms")
print(f"Rough request throughput/second is {avg_throughput:.2f}")
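
The cell above picks the 95th percentile by position (`latency_percentiles[2]`), which silently breaks if the list of percentiles is reordered. One way to avoid that is to key the results by name. The sketch below does this with the standard library's `statistics.quantiles`; the `inclusive` method is chosen because it interpolates linearly between data points, which is intended to mirror NumPy's default behaviour. The helper name and the synthetic latencies are assumptions for illustration.

```python
import statistics


def latency_percentiles_ms(latencies_s, percentiles=(50, 90, 95, 99)):
    """Return {'p50': ..., 'p90': ...} in milliseconds from latencies in seconds."""
    # n=100 yields the 99 cut points p1..p99; cut point p is at index p - 1
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {f"p{p}": cuts[p - 1] * 1000 for p in percentiles}


# Synthetic latencies: 0 ms, 1 ms, ..., 100 ms expressed in seconds
sample = [i / 1000 for i in range(101)]
summary = latency_percentiles_ms(sample)

for name, value in summary.items():
    print(f"{name}: {value:.1f} ms")
```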

### For Inf1 instance

In [None]:
# Defining Auxiliary variables
number_of_clients = 2
number_of_runs = 1000
t = tqdm(range(number_of_runs),position=0, leave=True)

# Starting parallel clients
cw_start = datetime.datetime.utcnow()

results = Parallel(n_jobs=number_of_clients,prefer="threads")(delayed(inference_latency)(neuron_predictor.predict,[random_sentence(), random_sentence()]) for mod in t)
avg_throughput = t.total/t.format_dict['elapsed']

cw_end = datetime.datetime.utcnow() 

# Computing metrics and print
latencies = [res['latency'] for res in results]
errors = [res['error'] for res in results]
error_p = sum(errors)/len(errors) *100
p50 = np.quantile(latencies[-1000:],0.50) * 1000
p90 = np.quantile(latencies[-1000:],0.90) * 1000
p95 = np.quantile(latencies[-1000:],0.95) * 1000

print(f'Avg Throughput: {avg_throughput:.1f}\n')
print(f'50th Percentile Latency:{p50:.1f} ms')
print(f'90th Percentile Latency:{p90:.1f} ms')
print(f'95th Percentile Latency:{p95:.1f} ms\n')
print(f'Errors percentage: {error_p:.1f} %\n')

# Querying CloudWatch
print('Getting Cloudwatch:')
cloudwatch = boto3.client('cloudwatch')
statistics=['SampleCount', 'Average', 'Minimum', 'Maximum']
extended=['p50', 'p90', 'p95', 'p100']

# Give 5 minute buffer to end
cw_end += datetime.timedelta(minutes=5)

# Period must be 1, 5, 10, 30, or multiple of 60
# Calculate closest multiple of 60 to the total elapsed time
factor = math.ceil((cw_end - cw_start).total_seconds() / 60)
period = factor * 60
print('Time elapsed: {} seconds'.format((cw_end - cw_start).total_seconds()))
print('Using period of {} seconds\n'.format(period))

cloudwatch_ready = False
# Keep polling CloudWatch metrics until datapoints are available
while not cloudwatch_ready:
    time.sleep(30)
    print('Waiting 30 seconds ...')
    # Must use default units of microseconds
    model_latency_metrics = cloudwatch.get_metric_statistics(MetricName='ModelLatency',
                                             Dimensions=[{'Name': 'EndpointName',
                                                          'Value': neuron_predictor.endpoint_name},
                                                         {'Name': 'VariantName',
                                                          'Value': "AllTraffic"}],
                                             Namespace="AWS/SageMaker",
                                             StartTime=cw_start,
                                             EndTime=cw_end,
                                             Period=period,
                                             Statistics=statistics,
                                             ExtendedStatistics=extended
                                             )
    # SampleCount should equal the number of requests sent (1,000)
    if len(model_latency_metrics['Datapoints']) > 0:
        print('{} latency datapoints ready'.format(model_latency_metrics['Datapoints'][0]['SampleCount']))
        # ModelLatency is reported in microseconds; divide by 1,000 to convert to milliseconds
        side_avg = model_latency_metrics['Datapoints'][0]['Average'] / 1000
        side_p50 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p50'] / 1000
        side_p90 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p90'] / 1000
        side_p95 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p95'] / 1000
        side_p100 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p100'] / 1000

        print(f'50th Percentile Latency:{side_p50:.1f} ms')
        print(f'90th Percentile Latency:{side_p90:.1f} ms')
        print(f'95th Percentile Latency:{side_p95:.1f} ms\n')

        cloudwatch_ready = True

In [None]:
from matplotlib.pyplot import hist, title, show, savefig, xlim
import numpy as np

latency_percentiles = np.percentile(latencies, q=[50, 90, 95, 99])

hist(latencies, bins=100)
title("Request latency histogram for Inferentia")
xlim(0, 0.2)
show()

print("==== HuggingFace model compiled for Inferentia benchmark ====\n")
print(f"95 % of requests take less than {latency_percentiles[2]*1000} ms")
print(f"Rough request throughput/second is {avg_throughput:.2f}")

### Wrap-up


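Deploying models to Inferentia-based instances is attractive because it can reduce cost and improve performance at the same time. As the example code shows, you can compile a model for Inferentia through familiar interfaces and APIs, without a learning curve. Feel free to use this code to compile your own models.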

<br>

## Endpoint Clean-up
To avoid charges from SageMaker endpoints, be sure to delete the endpoints once you have finished this hands-on.

In [None]:
predictor.delete_endpoint()
sm_model.delete_model()
neuron_predictor.delete_endpoint()
sm_neuron_model.delete_model()

In [None]:
!rm -rf compilation_artifacts {model_dir} {model_neuron_dir}