# Module 4.1. Compile the pretrained PyTorch model with SageMaker Neo
---

In this notebook, we compile and deploy the model trained in the previous session with SageMaker Neo. SageMaker Neo is an API that optimizes machine learning models for hardware, and models compiled with Neo can run anywhere in the cloud and on edge devices.

As of Nov. 2021, SageMaker Neo supports up to PyTorch 1.8.0 on cloud instances and edge devices, and PyTorch 1.5.1 on AWS Inferentia.

Please refer to the links below for the instance types, hardware, and deep learning frameworks supported by SageMaker Neo.

- Cloud Instance: https://docs.aws.amazon.com/sagemaker/latest/dg/neo-supported-cloud.html
- Edge Device: https://docs.aws.amazon.com/sagemaker/latest/dg/neo-supported-devices-edge.html


This hands-on can be completed in about **10 minutes**. 

In [None]:
%load_ext autoreload
%autoreload 2
%store -r
%store

In [None]:
try:
    base_model_name
    local_model_path
    model_name
    num_classes
    print("[OK] You can proceed.")
except NameError:
    print("+"*60)
    print("[ERROR] Please run the previous notebooks before you continue.")
    print("+"*60)

In [None]:
import logging, sys
def _get_logger():
    '''
    # https://stackoverflow.com/questions/17745914/python-logging-module-is-printing-lines-multiple-times
    '''
    loglevel = logging.DEBUG
    l = logging.getLogger(__name__)
    if not l.hasHandlers():
        l.setLevel(loglevel)
        logging.getLogger().addHandler(logging.StreamHandler(sys.stdout))        
        l.handler_set = True
    return l  

logger = _get_logger()

In [None]:
import os, sys, sagemaker
sys.path.insert(0, "./src")
#!{sys.executable} -m pip install -qU "sagemaker>=2.45"
print(sagemaker.__version__)
model_trace_name = 'model.pth'
sample_img_path = f"./{dataset_dir}/valid/{classes[0]}"

<br>

# 1. Inference script
---

The code cell below stores the SageMaker inference script in the `src` directory.

In [None]:
%%writefile src/infer_pytorch_neo.py

import io
import json
import logging
import os
import pickle
import numpy as np
import torch
import torchvision.transforms as transforms
from PIL import Image  # Training container doesn't have this package

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)


def model_fn(model_dir):
    import neopytorch

    logger.info("model_fn")
    neopytorch.config(model_dir=model_dir, neo_runtime=True)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # The compiled model is saved as "compiled.pt"
    model = torch.jit.load(os.path.join(model_dir, "compiled.pt"), map_location=device)

    # It is recommended to run warm-up inference during model load
    sample_input_path = os.path.join(model_dir, "sample_input.pkl")
    with open(sample_input_path, "rb") as input_file:
        model_input = pickle.load(input_file)
    if torch.is_tensor(model_input):
        model_input = model_input.to(device)
        model(model_input)
    elif isinstance(model_input, tuple):
        model_input = (inp.to(device) for inp in model_input if torch.is_tensor(inp))
        model(*model_input)
    else:
        print("Only supports a torch tensor or a tuple of torch tensors")

    return model
    
    
def transform_fn(model, payload, request_content_type='application/octet-stream', 
                 response_content_type='application/json'):

    logger.info('Invoking user-defined transform function')

    if request_content_type != 'application/octet-stream':
        raise RuntimeError(
            'Content type must be application/octet-stream. Provided: {0}'.format(request_content_type))

    # preprocess
    decoded = Image.open(io.BytesIO(payload))
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406], 
            std=[0.229, 0.224, 0.225]
        ),
    ])
    normalized = preprocess(decoded)
    batchified = normalized.unsqueeze(0)

    # predict
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    batchified = batchified.to(device)
    result = model.forward(batchified)

    # Softmax (assumes batch size 1)
    result = np.squeeze(result.detach().cpu().numpy())
    result_exp = np.exp(result - np.max(result))
    result = result_exp / np.sum(result_exp)

    response_body = json.dumps(result.tolist())

    return response_body, response_content_type

<br>

# 2. Load trained model
---

Load the trained model. To reduce compatibility issues across framework versions and problems during serialization, it is recommended, wherever possible, to initialize the model structure first and then load the model weights, rather than loading the entire serialized model.

In [None]:
import torch
import torchvision.models as models
import tarfile
import src.train_utils as train_utils

classes, classes_dict = train_utils.get_classes(os.path.join(dataset_dir, 'valid'))
num_classes = len(classes)
model = train_utils.initialize_ft_model(base_model_name, num_classes=num_classes, feature_extract=False)

In [None]:
# Detect if we have a GPU available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
chkpt = torch.load(os.path.join(local_model_path, model_name), map_location=device)
model.load_state_dict(chkpt['state_dict'])
model = model.to(device)

In [None]:
import torch
import torchvision.models as models
import tarfile

input_shape = [1,3,224,224]
dummy_input = torch.zeros(input_shape).float()
dummy_input = dummy_input.to(device)
trace = torch.jit.trace(model.float().eval(), dummy_input)
trace.save(model_trace_name)

## Local Inference without Endpoint

Debugging while performing inference in a local environment is recommended because there are many risks to directly deploying a trained model to a production environment without sufficient testing. Please refer to the code in the code cell below as an example.

### Check Inference Results on Validation data

Once the traced model is loaded, let's run inference on the validation data.

In [None]:
import os
import json
import random
import numpy as np
from io import BytesIO
from PIL import Image
import src.infer_utils as infer_utils
from src.infer_pytorch_neo import transform_fn

model = torch.jit.load(model_trace_name)
model = model.to(device)

img_list = os.listdir(sample_img_path)
img_path_list = [os.path.join(sample_img_path, img) for img in img_list]

In [None]:
import random
test_idx = random.randrange(len(img_list))  # randint(0, n) includes n and could index past the end
img_path = img_path_list[test_idx]

with open(img_path, mode='rb') as file:
    payload = bytearray(file.read())
    
response_body, _ = transform_fn(model, payload)
result = json.loads(response_body)
infer_utils.parse_result(result, classes_dict, img_path, show_img=True)

### Check Inference Results for Test data

You can also check inference results on test data. If the training dataset is insufficient or the distribution of the test dataset differs from that of the training/validation dataset, inference results may be poor.

In [None]:
import glob
test_image_dir = 'sample_images'

extensions = (f"{test_image_dir}/*.jpg", f"{test_image_dir}/*.jpeg")
img_filelist = [f for f_ in [glob.glob(e) for e in extensions] for f in f_]

for img_filepath in img_filelist:
    ground_truth = img_filepath.split('/')[-1]
    
    with open(img_filepath, mode='rb') as file:
        payload = bytearray(file.read())

    response_body, _ = transform_fn(model, payload)
    result = json.loads(response_body)
    # Pass the image of the current iteration, not `img_path` left over from the previous cell
    pred_cls_idx, pred_cls_str, prob = infer_utils.parse_result(result, classes_dict, img_filepath, show_img=False)
    
    print("+"*80)
    print(f'predicted = {pred_cls_str}, {prob:.2f}%')
    print(f'ground_truth = {ground_truth}')  

<br>

# 3. Compile Model with SageMaker Neo
---

## Overview

Neo-AI supports various frameworks and automatically optimizes models with minimal loss of accuracy. The Neo-AI compiler optimizes the model for the target device's OS and hardware platform and transforms it into a form that is executable by the deep learning runtime (DLR). DLR can perform inference with just two lines of code regardless of the deep learning framework or edge device, and it is updated continuously.

If you have an AWS account, you can use Amazon SageMaker Neo, which is powered by Neo-AI. SageMaker Neo compiles a model for multiple target devices at the same time, with a few lines of code or a few clicks in the UI, without any additional package or infrastructure setup, and at no additional charge.

## (Deep Dive) SageMaker Neo Stack

Let's take a closer look at the SageMaker Neo Stack. Neo first optimizes the model with the compiler, and then runs the compiled model through the runtime in the cloud or on-device.

In the compiler, the computational graph generator loads deep learning models trained on various deep learning platforms and reconstructs them into graph structures. It then converts the operators defined in the model into primitive operators to create a computational graph.
After the graph is created, various graph optimization techniques are applied to produce an optimized graph.
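
As a toy illustration of the operator-conversion step described above (not Neo's or TVM's actual implementation), the sketch below represents a graph as a list of nodes and rewrites a composite `dense` operator into the primitive operators `matmul` and `add`. The `Node` format, the `lower_dense` function, and the operator names are assumptions made up for this example.

```python
from typing import List, NamedTuple, Tuple


class Node(NamedTuple):
    name: str                # name of the output tensor
    op: str                  # operator type
    inputs: Tuple[str, ...]  # names of the input tensors


def lower_dense(graph: List[Node]) -> List[Node]:
    """Rewrite every composite `dense(x, w, b)` node as `matmul` followed by `add`."""
    lowered = []
    for node in graph:
        if node.op == "dense":
            x, w, b = node.inputs
            tmp = f"{node.name}_matmul"
            lowered.append(Node(tmp, "matmul", (x, w)))
            # Keep the original output name so downstream nodes stay valid.
            lowered.append(Node(node.name, "add", (tmp, b)))
        else:
            lowered.append(node)
    return lowered


graph = [
    Node("fc1", "dense", ("input", "w1", "b1")),
    Node("act1", "relu", ("fc1",)),
]
for node in lower_dense(graph):
    print(node)
```

Because the `add` node reuses the name of the original `dense` output, the `relu` node that consumes `fc1` does not need to change.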

However, an optimized graph alone does not allow hardware-dependent optimization that takes into account the target hardware architecture on which the model will run.
Therefore, the computational graph must be transformed into an IR (Intermediate Representation), a form that is suitable for hardware-dependent optimization.

Based on the generated Relay IR, the compiler performs hardware-dependent optimizations such as memory allocation, parallelization, and execution ordering, and then generates code. For more details, please check the Relay IR paper linked below.

It should be noted here that Apache TVM is not always used. Depending on the deep learning framework or hardware specifications, TensorRT or TreeLite is used. For NVIDIA GPUs, Neo uses TensorRT.

Finally, the backend code generator generates backend code optimized for the target hardware architecture (CPU, GPU, TPU, etc.) where the deep learning model workload will be deployed based on IR.

The runtime part is handled by DLR. It is only a few megabytes in size, not hundreds of megabytes like TensorFlow or PyTorch.

![sm-neo-stack](../imgs/sm-neo-stack.png)

- Relay IR: https://arxiv.org/pdf/1810.00952.pdf 


## (Deep Dive) Graph Optimization

Most model compilers perform graph optimizations as shown in the diagram.
Graph optimization reduces the number of layers and computational complexity with techniques such as operator fusion, tensor fusion, and layer fusion.

Let's take calculating $1/\sqrt{w}$ as an example of operator fusion. The graph before optimization requires a total of 2 operations, $w \rightarrow \sqrt{\cdot} \rightarrow \text{div}$, but if $1/\sqrt{w}$ is stored as a pre-computed value, no operation is required at inference time.
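
The pre-computation described above can be sketched in plain Python. This is a minimal illustration, assuming that `w` is a constant weight known at compile time and that the result is used to scale a runtime input `x`; the input `x` and both function names are assumptions introduced for this example.

```python
import math

w = [4.0, 9.0, 16.0]   # constant weights, known at compile time
x = [2.0, 3.0, 4.0]    # runtime inputs (hypothetical)


def scale_naive(x, w):
    """Unoptimized graph: sqrt and div are evaluated on every inference call."""
    return [xi * (1.0 / math.sqrt(wi)) for xi, wi in zip(x, w)]


# Optimized graph: 1/sqrt(w) is computed once and stored as a constant.
INV_SQRT_W = [1.0 / math.sqrt(wi) for wi in w]


def scale_precomputed(x):
    """Only the multiplication remains at inference time."""
    return [xi * ci for xi, ci in zip(x, INV_SQRT_W)]


assert scale_naive(x, w) == scale_precomputed(x)
print(scale_precomputed(x))
```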

In the figure, the three operations conv, bias, and ReLU are bundled into one CBR (Conv, Bias, ReLU) block so that consecutive operations are processed in one function. This is called vertical layer fusion.
Also, there are three 1x1 CBR blocks that take the same input, which can be merged into one block. This is called horizontal layer fusion.
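
Both kinds of fusion can be illustrated at the level of a single pixel, where a 1x1 convolution reduces to a matrix-vector product over the channels. The sketch below is a simplified illustration under that assumption, not how TensorRT or Neo implement fusion; the `cbr` function and the branch weights are made up for this example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def cbr(weight: Matrix, bias: Vector, x: Vector) -> Vector:
    """Vertically fused block: 1x1 conv, bias, and ReLU computed in one pass."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weight, bias)
    ]


# Three 1x1 CBR branches that read the same input pixel (2 channels).
x = [1.0, -2.0]
branches = [
    ([[1.0, 0.0]], [0.5]),                   # branch 1: 1 output channel
    ([[0.0, 1.0], [1.0, 1.0]], [0.0, 2.0]),  # branch 2: 2 output channels
    ([[-1.0, -1.0]], [0.0]),                 # branch 3: 1 output channel
]
separate = [cbr(w, b, x) for w, b in branches]

# Horizontal fusion: stack the branch weights and biases into one wider block,
# run it once, then slice the output back into per-branch results.
fused_w = [row for w, _ in branches for row in w]
fused_b = [v for _, b in branches for v in b]
fused_out = cbr(fused_w, fused_b, x)

split, start = [], 0
for w, _ in branches:
    split.append(fused_out[start:start + len(w)])
    start += len(w)

assert split == separate
print(split)
```

The fused version calls `cbr` once instead of three times, which is the saving that horizontal fusion aims for on real hardware.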

According to NVIDIA's experiments, the total number of layers in the ResNet-152 network was 670, but it was reduced to 159 through graph optimization.

![graph-optimization](../imgs/graph-optimization.png)

- TensorRT Overview: https://blogs.nvidia.co.kr/2020/02/19/nvidia-tensor-rt/ 


## Model Compression

In [None]:
with tarfile.open('model.tar.gz', 'w:gz') as f:
    f.add(model_trace_name)

## Model Compilation

The code below compiles the model for 6 use cases.
- Cloud (CPU, `ml_m4` instance)
- Cloud (CPU, `ml_c5` instance)
- Cloud (GPU, `ml_g4dn` instance)
- Raspberry Pi 4 Model B (`rasp4b`)
- NVIDIA Jetson nano (CPU)
- NVIDIA Jetson nano (GPU)

Depending on the NVIDIA JetPack version, the CUDA or TensorRT version on the device may not be compatible, and loading the GPU model takes tens of seconds, so it is a good strategy to compile and test the CPU model as well.

Compilation usually takes 4-6 minutes.

**[Caution] If you get an error while compiling, please be sure to check the PyTorch version of the notebook running this code. The PyTorch version used to trace the model must match the framework version specified for the compilation job. This hands-on uses PyTorch 1.6.**
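
A quick way to check the version before compiling is sketched below. It assumes that comparing the major and minor components is sufficient, which may not hold for every framework release; the helper names `major_minor` and `versions_match` are hypothetical.

```python
def major_minor(version: str) -> tuple:
    """Return (major, minor) from strings such as '1.6.0' or '1.6.0+cu101'."""
    release = version.split("+")[0]  # drop a local build tag such as '+cu101'
    major, minor = release.split(".")[:2]
    return int(major), int(minor)


def versions_match(installed: str, expected: str) -> bool:
    """Compare only the major and minor components (an assumption of this sketch)."""
    return major_minor(installed) == major_minor(expected)


# In the notebook, `installed` would typically be torch.__version__.
print(versions_match("1.6.0+cu101", "1.6"))
print(versions_match("1.8.1", "1.6"))
```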

In [None]:
import time, boto3, sagemaker
role = sagemaker.get_execution_role()
bucket = sagemaker.Session().default_bucket()

# For cloud ML inference
compilation_job_cloud_cpu_m4 = infer_utils.compile_model_for_cloud(
    role, bucket, target_device='ml_m4', dataset_dir=dataset_dir
)
compilation_job_cloud_cpu_c5 = infer_utils.compile_model_for_cloud(
    role, bucket, target_device='ml_c5', dataset_dir=dataset_dir
)
compilation_job_cloud_gpu = infer_utils.compile_model_for_cloud(
    role, bucket, target_device='ml_g4dn', dataset_dir=dataset_dir
)

# For on-device ML inference
compilation_job_rasp4b = infer_utils.compile_model_for_rasp(
    role, bucket, target_device='rasp4b', dataset_dir=dataset_dir
)
compilation_job_jetson_cpu = infer_utils.compile_model_for_jetson(
    role, bucket, dataset_dir=dataset_dir, use_gpu=False
)
compilation_job_jetson_gpu = infer_utils.compile_model_for_jetson(
    role, bucket, dataset_dir=dataset_dir, use_gpu=True
)

In [None]:
compilation_jobs = [compilation_job_cloud_cpu_m4, compilation_job_cloud_cpu_c5, compilation_job_cloud_gpu, 
                    compilation_job_rasp4b, compilation_job_jetson_cpu, compilation_job_jetson_gpu]

In [None]:
compilation_job_cloud_cpu_m4['response']['CompilationJobArn']

In [None]:
sm_client = boto3.client('sagemaker')

max_time = time.time() + 15*60 # 15 mins, shared by all jobs
for job in compilation_jobs:
    while time.time() < max_time:
        resp = sm_client.describe_compilation_job(CompilationJobName=job['job_name'])    
        if resp['CompilationJobStatus'] in ['STARTING', 'INPROGRESS']:
            print('Running...')
        else:
            print(resp['CompilationJobStatus'], job)
            break
        time.sleep(30)
    else:
        # The loop ended without `break`: the deadline passed before the job finished
        print('TIMED OUT', job['job_name'])

## Review Compilation Jobs on AWS Console

In [None]:
from IPython.display import display, HTML
region = boto3.Session().region_name

for job in compilation_jobs:
    job_name = job['job_name']
    display(
        HTML(
            '<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/compilation-jobs/{}">Compilation Job</a> for {}</b>'.format(
                region, job_name, job_name
            )
        )
    )

## Copy Compiled model to local
This code cell copies the compiled models from S3 to the local environment. For cloud targets, you can deploy in real time by creating an endpoint on the corresponding instance type; for on-device targets such as NVIDIA Jetson nano, copy the model to the device and install DLR. With DLR, you can run inference with simple API calls, without installing a separate framework such as PyTorch or TensorFlow.

- Installing DLR: https://neo-ai-dlr.readthedocs.io/en/latest/install.html
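
On the device, inference with DLR might look like the sketch below. It assumes that DLR is installed, that the compiled archive has been extracted into a directory, and that the model takes a single input tensor; the helper name `run_compiled_model` and the directory `./model_cpu` are illustrative, so check the DLR documentation for the exact API of your version.

```python
def run_compiled_model(model_dir, input_array, dev_type="cpu"):
    """Load a Neo-compiled model from `model_dir` with DLR and run one inference."""
    import dlr  # imported lazily so that it is only required on the target device

    model = dlr.DLRModel(model_dir, dev_type)
    return model.run(input_array)


# Hypothetical usage on the device, after extracting the compiled archive:
#   import numpy as np
#   x = np.zeros((1, 3, 224, 224), dtype="float32")
#   outputs = run_compiled_model("./model_cpu", x)
```

The zero tensor only verifies that the model loads and runs; real inputs need the same resize, crop, and normalization steps as in `transform_fn`.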

In [None]:
model_root_path = 'neo-custom-model'
!rm -rf {model_root_path}
for job in compilation_jobs:
    model_path = f"{model_root_path}/{job['job_name']}"
    os.makedirs(model_path, exist_ok=True)
    !aws s3 cp {job['s3_compiled_model_path']} {model_path} --recursive 

## Wrap-up

If you continue with the hands-on lab that deploys a Greengrass ML component in the Cloud9 environment, take note of the output of the code cell below. You must run the following shell commands on Cloud9, replacing `[MODEL-CLOUD-CPU-S3-PATH]` with that output.

```shell
rm -rf model_cpu
mkdir model_cpu && cd model_cpu
aws s3 cp [MODEL-CLOUD-CPU-S3-PATH] . --recursive
tar -xzvf model-ml_m4.tar.gz && rm model-ml_m4.tar.gz
```

In [None]:
model_cloud_cpu_s3_path = compilation_jobs[0]['s3_compiled_model_path']
print(model_cloud_cpu_s3_path)

In [None]:
%store model_cloud_cpu_s3_path

### (Optional) Clean-up

In [None]:
# !rm -rf model neo-custom-model output model.pth model.tar.gz {dataset_dir}