# Training and Evaluating YOLOv5 with SageMaker

In this module, to make effective use of the Amazon SageMaker API, we train the model using only the PyTorch framework's own implementation of multi-GPU distributed training.

In [None]:
install_needed = True  # should only be True once
install_needed = False

In [None]:
%%bash

DAEMON_PATH="/etc/docker"
MEMORY_SIZE=10G

FLAG=$(cat $DAEMON_PATH/daemon.json | jq 'has("data-root")')
# echo $FLAG

if [ "$FLAG" == true ]; then
    echo "Already revised"
else
    echo "Add data-root and default-shm-size=$MEMORY_SIZE"
    sudo cp $DAEMON_PATH/daemon.json $DAEMON_PATH/daemon.json.bak
    sudo cat $DAEMON_PATH/daemon.json.bak | jq '. += {"data-root":"/home/ec2-user/SageMaker/.container/docker","default-shm-size":"'$MEMORY_SIZE'"}' | sudo tee $DAEMON_PATH/daemon.json > /dev/null
    sudo service docker restart
    echo "Docker Restart"
fi

In [None]:
import sys
import IPython

if install_needed:
    print("installing deps and restarting kernel")
#     !{sys.executable} -m pip install -U split-folders tqdm albumentations crc32c wget
    !{sys.executable} -m pip install 'sagemaker[local]' --upgrade
    !{sys.executable} -m pip install -U smdebug sagemaker-experiments
    !{sys.executable} -m pip install -U sagemaker ipyplot jsonlines
    IPython.Application.instance().kernel.do_shutdown(True)

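## 2. Setting Up the Model Experiment Environment

<p>Import the basic packages required for SageMaker training.</p>
<p>boto3 provides a convenient abstraction that hides the underlying HTTP API calls, and offers Python classes for working with AWS resources such as Amazon EC2 instances and S3 buckets.</p>
<p>The SageMaker Python SDK is an open-source library for training and deploying machine learning models on Amazon SageMaker.</p>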

In [None]:
import joblib
import sagemaker

import datetime
import glob
import os
import time
import warnings

import sys
import matplotlib.image as mpimg
import ipyplot

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

import shutil

import boto3
import numpy as np
import pandas as pd

# from tqdm import tqdm
from time import strftime
from PIL import Image

from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch

from sagemaker.debugger import (Rule,
                                rule_configs,
                                ProfilerConfig, 
                                FrameworkProfile, 
                                DetailedProfilingConfig, 
                                DataloaderProfilingConfig, 
                                PythonProfilingConfig)

%matplotlib inline

In [None]:
role = get_execution_role()

sess = boto3.Session()
region = sess.region_name
account = boto3.client("sts").get_caller_identity().get("Account")

In [None]:
sagemaker.__version__

In [None]:
def create_experiment(experiment_name):
    # Reuse the experiment if it already exists; otherwise create it
    try:
        sm_experiment = Experiment.load(experiment_name)
    except Exception:
        sm_experiment = Experiment.create(experiment_name=experiment_name)
    return sm_experiment

In [None]:
def create_trial(experiment_name):
    create_date = strftime("%m%d-%H%M%S")  # %S = seconds; lowercase %s is not a portable directive
    sm_trial = Trial.create(trial_name=f'{experiment_name}-{create_date}',
                            experiment_name=experiment_name)

    job_name = f'{sm_trial.trial_name}'
    return job_name

## 3. Creating a Custom Training Container
Add the IAMFullAccess policy to the IAM role used by the current notebook, then run the steps below.

In [None]:
# iam_client = boto3.client('iam')
# role=get_execution_role()
# base_role_name=role.split('/')[-1]

In [None]:
# iam_client.attach_role_policy(
#     RoleName=base_role_name,
#     PolicyArn='arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess'
# )

In [None]:
# %%bash
# cd ./docker
# echo $(pwd)
# container_name=yolov5-training-sagemaker
# account=$(aws sts get-caller-identity --query Account --output text)

# # Get the region defined in the current configuration (default to us-west-2 if none defined)
# region=$(aws configure get region)
# region=${region:-us-west-2}

# fullname="${account}.dkr.ecr.${region}.amazonaws.com/${container_name}:1.0"

# # If the repository doesn't exist in ECR, create it.
# aws ecr describe-repositories --repository-names "${container_name}" > /dev/null 2>&1
# if [ $? -ne 0 ]
# then
#     aws ecr create-repository --repository-name "${container_name}" > /dev/null
# fi

# # Log in to ECR in us-west-2 by piping the password straight into docker login
# aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin "763104351884.dkr.ecr.us-west-2.amazonaws.com"

# # Build the docker image locally with the image name and then push it to ECR
# # with the full name.
# docker build -f Dockerfile -t ${fullname} .
# # docker tag ${container_name} ${fullname}

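# # Log in to your own account's ECR registry before pushing (get-login was removed in AWS CLI v2)
# aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"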
# docker push ${fullname}

## 4. Preparing Data in YOLOv5 Format

The dataset used in this lab is the [BCCD Dataset published by Roboflow](https://public.roboflow.com/object-detection/bccd), which consists of images of blood cells: WBC (white blood cells), RBC (red blood cells), and Platelets.

The train/val/test datasets for training a YOLOv5 object detection model must follow the folder structure below. The `images` folder stores the images, and the `labels` folder stores the per-image annotation files.
```
├── test
│   ├── images
│   └── labels
├── train
│   ├── images
│   └── labels
└── valid
    ├── images
    └── labels
```
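Each file under `labels` is a plain-text file with one row per object, in the form `class x_center y_center width height`, where the coordinates are normalized to the image size. The sketch below converts a pixel-space bounding box into such a row. The helper name `to_yolo_label` and the sample box are illustrative assumptions and are not part of the lab code.

```python
def to_yolo_label(class_id, box, img_w, img_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) into a YOLO label row."""
    x_min, y_min, x_max, y_max = box
    # YOLO stores the box center and size, each divided by the image dimension
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical example: a class-1 object on a 640x480 image
line = to_yolo_label(1, (160, 120, 480, 360), 640, 480)
print(line)
```

An image with several objects simply has several such rows in its label file.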

YOLOv5 declares the dataset paths, the number of classes, and the class names in a separate YAML file.

- `data_sm.yaml`: The configuration file used when training runs in SageMaker Local mode or as SageMaker managed training. SageMaker places the dataset (stored in S3 for managed training, or on local disk for Local mode) under `/opt/ml/input/data/[channel_name]/` inside the SageMaker container, so the `train` and `val` paths must point to paths inside the SageMaker container rather than to local paths of the Jupyter notebook.
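As a rough illustration of how a channel name turns into a container path, the sketch below builds the directory for a channel called `inputdata` (the key used in `estimator.fit(inputs=...)` in this notebook). The helper names are hypothetical; the `SM_CHANNEL_<NAME>` environment variable is the one the SageMaker training toolkit sets for each channel.

```python
import posixpath

SM_INPUT_ROOT = "/opt/ml/input/data"  # root of all input channels in the container


def channel_dir(channel_name):
    """Directory inside the training container where a channel is mounted."""
    return posixpath.join(SM_INPUT_ROOT, channel_name)


def channel_env_var(channel_name):
    """Name of the environment variable that holds the channel directory."""
    return f"SM_CHANNEL_{channel_name.upper()}"


train_images = posixpath.join(channel_dir("inputdata"), "train", "images")
print(train_images)
print(channel_env_var("inputdata"))
```

The printed path is what goes into the `train:` entry of the YAML file.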

In [None]:
%%writefile yolov5/data/data_sm.yaml
train: /opt/ml/input/data/inputdata/train/images
val: /opt/ml/input/data/inputdata/valid/images

nc: 3
names: ['Platelets', 'RBC', 'WBC']

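## 5. Configuring the Estimator

### Metric Definitions

SageMaker can search the standard output logs emitted by the training code for values that match specific patterns and store them as custom CloudWatch metrics. The `metric_definitions` list is passed as the value of the `metric_definitions` parameter when declaring the SageMaker `Estimator`.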

In [None]:
# TODO
metric_definitions = [
    {'Name': 'Precision', 'Regex': r'all\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)'},
    {'Name': 'Recall', 'Regex': r'all\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)'},
    {'Name': 'mAP@.5', 'Regex': r'all\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)'},
    {'Name': 'mAP@.5:.95', 'Regex': r'all\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)'}
]
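Before launching a job, it can help to check the patterns locally against a log line. The sketch below applies the same four regular expressions with Python's `re` module. The `sample_log` line is a made-up stand-in for the `all ...` summary row that YOLOv5 validation prints, so treat the numbers as placeholders rather than real results.

```python
import re

# Same patterns as metric_definitions above, keyed by metric name
patterns = {
    'Precision': r'all\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)',
    'Recall': r'all\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)',
    'mAP@.5': r'all\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)',
    'mAP@.5:.95': r'all\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+[0-9.]+\s+([0-9.]+)',
}

# Hypothetical log row: class, images, labels, P, R, mAP@.5, mAP@.5:.95
sample_log = "                 all        128        929      0.577      0.414       0.46      0.279"

parsed = {}
for name, regex in patterns.items():
    match = re.search(regex, sample_log)
    # The single capture group holds the metric value
    parsed[name] = float(match.group(1)) if match else None

print(parsed)
```

If a metric comes back as `None`, the pattern no longer matches the log format and the CloudWatch metric would stay empty.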

### WandB Setup (Optional)
https://wandb.ai/cayush/yoloV5/reports/Track-and-debug-your-YOLOv5-models--VmlldzozMDQ1OTg

Running `wandb.init()` stores the WandB API key in the `/root/.netrc` file. To copy the `.netrc` file to `/root/.netrc` inside the container that runs during SageMaker Local/Managed training, add the following code to `utils/loggers/__init__.py`. The `.netrc` file must be saved in advance at `source_dir/.netrc`.

```
# __init__.py
...
################## For SageMaker ##################
from pathlib import Path
import subprocess

def wandb_setting():
    set_path = '/opt/ml/code/.netrc' #WANDB API Key
    file = Path(set_path)
    if file.exists():
        subprocess.run(['cp', '-r', set_path, '/root/.netrc'])
    else:
        print('=' * 100)
        print('Not found!!!')
        print('=' * 100)    

wandb_setting()
################## For SageMaker ##################
...
```
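Before submitting a job, you may want to confirm that the `.netrc` file in `source_dir` really contains a key. The following is a minimal sketch using Python's standard `netrc` module. The helper `has_wandb_key` is hypothetical, and the host name `api.wandb.ai` is an assumption about how WandB labels its entry, so adjust it if your file uses a different machine name.

```python
import netrc
import os
import tempfile


def has_wandb_key(netrc_path, host="api.wandb.ai"):
    """Return True if netrc_path has a non-empty password entry for host."""
    if not os.path.isfile(netrc_path):
        return False
    try:
        auth = netrc.netrc(netrc_path).authenticators(host)
    except netrc.NetrcParseError:
        return False
    # authenticators() returns (login, account, password) or None
    return auth is not None and bool(auth[2])


# Demonstration with a throwaway file and a placeholder key
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, ".netrc")
    with open(sample, "w") as f:
        f.write("machine api.wandb.ai\n  login user\n  password PLACEHOLDER_KEY\n")
    found = has_wandb_key(sample)
    missing = has_wandb_key(os.path.join(tmp, "absent"))

print(found, missing)
```

In the notebook you would call it on the real file, for example `has_wandb_key(f'{source_dir}/.netrc')`.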

### Storage Setup

In [None]:
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

In [None]:
# bucket = 'XXXXXXXXXXXXX'  ## <== Create a bucket in your account and modify the bucket name.
code_location = f's3://{bucket}/poc_yolov5/sm_codes'
output_path = f's3://{bucket}/poc_yolov5/output'

### Hyperparameter Setup

In [None]:
hyperparameters = {
    'data': 'data_sm.yaml',
    'cfg': 'yolov5s.yaml',
    'weights': 'weights/yolov5s.pt', # Transfer learning
    'batch-size': 64,
    'epochs': 1,
    'project': '/opt/ml/model',
    'workers': 0, # To avoid shm OOM issue
    'freeze': 10, # For transfer learning, freeze the first 10 layers (the YOLOv5 backbone); the remaining layers are trained.
}
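In script mode, SageMaker hands each hyperparameter to the entry point as a `--key value` command-line argument, which is why the keys above mirror the flags of `train.py`. The sketch below imitates that hand-off with a small stand-in `argparse` parser. It is not YOLOv5's actual parser, and `to_cli_args` is a hypothetical helper, not a SageMaker API.

```python
import argparse


def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into ['--key', 'value', ...]."""
    args = []
    for key, value in hyperparameters.items():
        args += [f"--{key}", str(value)]
    return args


# Subset of the hyperparameters defined above
hp = {'data': 'data_sm.yaml', 'batch-size': 64, 'epochs': 1, 'workers': 0}

# Stand-in parser with only the flags used in this sketch
parser = argparse.ArgumentParser()
parser.add_argument('--data', type=str)
parser.add_argument('--batch-size', type=int)
parser.add_argument('--epochs', type=int)
parser.add_argument('--workers', type=int)

cli_args = to_cli_args(hp)
opt = parser.parse_args(cli_args)
print(cli_args)
print(opt.batch_size, opt.epochs)
```

Note that argparse exposes `--batch-size` as `opt.batch_size`, so hyphenated keys in the dict are fine.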

### Instance and Maximum Training Time Setup

In [None]:
experiment_name = 'yolov5-poc-exp1'
instance_type = 'ml.m5.4xlarge' 
# instance_type = 'local'
instance_count = 1
do_spot_training = False
max_wait = None
max_run = 1*60*60

### Local mode

In [None]:
from pathlib import Path
source_dir = f'{Path.cwd()}/yolov5'


if instance_type in ['local_gpu', 'local']:
    from sagemaker.local import LocalSession
    
    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}
    s3_data_path = f'file://{Path.cwd()}/BCCD'
    
    checkpoint_s3_uri = None
else:
    sagemaker_session = sagemaker.Session()
    s3_data_path = f's3://{bucket}/dataset/BCCD'

    checkpoint_s3_uri = f's3://{bucket}/poc_yolov5/checkpoints'
    !aws s3 sync ./BCCD {s3_data_path} --quiet

In [None]:
image_uri = None
distribution = None
train_job_name = 'sm'

if do_spot_training:
    max_wait = max_run

print("train_job_name : {} \ntrain_instance_type : {} \ntrain_instance_count : {} \nimage_uri : {} \ndistribution : {}".format(train_job_name, instance_type, instance_count, image_uri, distribution))    

In [None]:
# all input configurations, parameters, and metrics specified in estimator 
# definition are automatically tracked
estimator = PyTorch(
    entry_point='train.py',
    source_dir=source_dir,
    role=role,
    sagemaker_session=sagemaker_session,
    framework_version='1.10',
    py_version='py38',
    # image_uri=image_uri,
    instance_count=instance_count,
    instance_type=instance_type,
    # volume_size=1024,
    code_location = code_location,
    output_path=output_path,
    hyperparameters=hyperparameters,
    # distribution=distribution,
    # disable_profiler=True,
    # debugger_hook_config=False,
    metric_definitions=metric_definitions,
    max_run=max_run,
    use_spot_instances=do_spot_training,
    max_wait=max_wait,
    checkpoint_s3_uri=checkpoint_s3_uri,
)

In [None]:
!sudo rm -rf ./yolov5/weights/

In [None]:
%%time
create_experiment(experiment_name)
job_name = create_trial(experiment_name)

# Now associate the estimator with the Experiment and Trial
estimator.fit(
    inputs={'inputdata': s3_data_path},
    job_name=job_name,
    experiment_config={
      'TrialName': job_name,
      'TrialComponentDisplayName': job_name,
    },
    wait=False,
)

In [None]:
job_name=estimator.latest_training_job.name

In [None]:
sagemaker_session.logs_for_job(job_name=job_name, wait=True)

## 6. Checking the Training Results

In [None]:
artifacts_dir = estimator.model_data.replace('model.tar.gz', '')
print(artifacts_dir)
!aws s3 ls --human-readable {artifacts_dir}

In [None]:
model_dir = './model'

!rm -rf $model_dir

import json , os

if not os.path.exists(model_dir):
    os.makedirs(model_dir)

!aws s3 cp {artifacts_dir}model.tar.gz {model_dir}/model.tar.gz
!tar -xvzf {model_dir}/model.tar.gz -C {model_dir}

In [None]:
# Put the yolov5 directory on sys.path before importing from it
sys.path.append(f'{os.getcwd()}/yolov5')

from yolov5 import utils

display = utils.notebook_init()  # checks

In [None]:
images = []
for img_path in glob.glob('model/*/*'):
    if img_path.split('.')[-1] in ['jpg','png']:
        images.append(mpimg.imread(img_path))

In [None]:
ipyplot.plot_images(images, max_images=2, img_width=450)

## 7. Creating a Custom Inference Docker Image

In [None]:
# %%bash
# cd ./docker
# echo $(pwd)
# container_name=yolov5-inference-sagemaker
# account=$(aws sts get-caller-identity --query Account --output text)

# # Get the region defined in the current configuration (default to us-west-2 if none defined)
# region=$(aws configure get region)
# region=${region:-us-west-2}

# fullname="${account}.dkr.ecr.${region}.amazonaws.com/${container_name}:1.0"

# # If the repository doesn't exist in ECR, create it.
# aws ecr describe-repositories --repository-names "${container_name}" > /dev/null 2>&1
# if [ $? -ne 0 ]
# then
#     aws ecr create-repository --repository-name "${container_name}" > /dev/null
# fi

# # Log in to ECR in us-west-2 by piping the password straight into docker login
# aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin "763104351884.dkr.ecr.us-west-2.amazonaws.com"

# # Build the docker image locally with the image name and then push it to ECR
# # with the full name.
# docker build -f Dockerfile.inf -t ${fullname} .
# # docker tag ${container_name} ${fullname}

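# # Log in to your own account's ECR registry before pushing (get-login was removed in AWS CLI v2)
# aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"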
# docker push ${fullname}

## 8. Running Inference with a Processing Job

In [None]:
from sagemaker.pytorch import PyTorch
from sagemaker.processing import Processor, ScriptProcessor, FrameworkProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

In [None]:
# inf_image_uri = f'{account}.dkr.ecr.{region}.amazonaws.com/yolov5-inference-sagemaker:1.0'

In [None]:
instance_count = 1
instance_type = "ml.c5.2xlarge"
# instance_type = 'local'

In [None]:
detect_processor = FrameworkProcessor(
    PyTorch,
    framework_version="1.10",
    py_version="py38",
    role=role, 
    # image_uri=inf_image_uri,
    instance_count=instance_count,
    instance_type=instance_type,
    code_location=code_location
    )

In [None]:
s3_test_path = f"s3://{bucket}/dataset/BCCD/test/images/"
detect_output = f"s3://{bucket}/poc_yolov5/detect_output"

In [None]:
if instance_type == 'local':
    from pathlib import Path
    from sagemaker.local import LocalSession
    
    sagemaker_session = LocalSession()
    
    sagemaker_session.config = {'local': {'local_code': True}}
    # s3_data_path = f'file://{Path.cwd()}/BCCD/test/images/'
else:
    sagemaker_session = sagemaker.Session()


### Modifying detect.py
Inside the `run` function, add code that extracts `model.tar.gz`.
```python
def run(...) :

    try:
        ############ MLOps for SageMaker ###########
        w = str(weights[0] if isinstance(weights, list) else weights)
        if w.endswith('.tar.gz'):
            import tarfile
            # Unpack the training artifact next to the mounted model input
            with tarfile.open(w) as ap:
                ap.extractall("/opt/ml/processing/weights/")

        import glob
        # print(glob.glob("/opt/ml/processing/weights/*/*/*"))
        # Take the first directory level under weights/ as the run name
        training_job_name = glob.glob("/opt/ml/processing/weights/*/*/*")[0].split("/")[-3]
        # LOGGER.info(f" **************** training_job_name : {training_job_name}")
        weights = f"/opt/ml/processing/weights/{training_job_name}/weights/best.pt"
    except Exception:
        # Reached when detect.py runs outside a SageMaker Processing job
        print("Not using a SageMaker Processing job")
```
If a permission issue occurs in local mode, add the following line at the very end of the code.
```python
 os.system('chmod 777 -R /opt/ml/processing/output/exp')
```
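The extract-then-locate step can be tried outside SageMaker with a throwaway archive. The sketch below assumes the archive layout `<run_name>/weights/best.pt`; the helper `find_best_weights` and the run name `exp` are illustrative assumptions, not names taken from `detect.py`.

```python
import glob
import os
import tarfile
import tempfile


def find_best_weights(archive_path, extract_dir):
    """Extract a model archive and return the path of the first best.pt found."""
    os.makedirs(extract_dir, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        tar.extractall(extract_dir)
    # Expected layout: <extract_dir>/<run_name>/weights/best.pt
    matches = sorted(glob.glob(os.path.join(extract_dir, "*", "weights", "best.pt")))
    return matches[0] if matches else None


with tempfile.TemporaryDirectory() as tmp:
    # Build a stand-in model.tar.gz containing exp/weights/best.pt
    run_dir = os.path.join(tmp, "src", "exp", "weights")
    os.makedirs(run_dir)
    with open(os.path.join(run_dir, "best.pt"), "wb") as f:
        f.write(b"placeholder")
    archive = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(os.path.join(tmp, "src", "exp"), arcname="exp")

    out_dir = os.path.join(tmp, "out")
    best = find_best_weights(archive, out_dir)
    rel = os.path.relpath(best, out_dir)

print(rel)
```

Returning `None` when nothing matches makes a missing checkpoint explicit instead of raising an `IndexError` on an empty glob result.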

In [None]:
create_experiment(experiment_name)
job_name = create_trial(experiment_name)

detect_processor.run(
    code="detect.py",
    source_dir=source_dir,
    inputs=[ProcessingInput(source=s3_test_path, input_name="test_data", destination="/opt/ml/processing/input"),
            ProcessingInput(source=artifacts_dir, input_name="model_weight", destination="/opt/ml/processing/weights")
    ],
    outputs=[
        ProcessingOutput(source="/opt/ml/processing/output", destination=detect_output),
    ],
    arguments=["--img", "640", "--conf", "0.25", "--source", "/opt/ml/processing/input", "--weights", "/opt/ml/processing/weights/model.tar.gz", "--project", "/opt/ml/processing/output"],
    job_name=job_name,
    experiment_config={
      'TrialName': job_name,
      'TrialComponentDisplayName': job_name,
    },
    wait=False
)

In [None]:
detect_processor.latest_job.wait()

In [None]:
detect_output

In [None]:
!rm -rf detect_output
!mkdir detect_output
!aws s3 sync {detect_output}/exp detect_output

In [None]:
detect_result = []
for img_path in glob.glob('./detect_output/*'):
    if img_path.split('.')[-1] in ['jpg','png']:
        detect_result.append(mpimg.imread(img_path))

In [None]:
ipyplot.plot_images(detect_result, max_images=2, img_width=250)

### Creating a Code Repository and Pushing

Add the IAMFullAccess policy to the IAM role used by the current notebook, then run the steps below.

In [None]:
iam_client = boto3.client('iam')

role=get_execution_role()
base_role_name=role.split('/')[-1]

iam_client.attach_role_policy(
    RoleName=base_role_name,
    PolicyArn='arn:aws:iam::aws:policy/AWSCodeCommitFullAccess'
)

In [None]:
codecommit = boto3.client('codecommit')
repository_name = 'model_code'

try:
    response = codecommit.create_repository(
        repositoryName=repository_name,
        repositoryDescription='Data Scientists share their training code using this Repository'
    )
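except codecommit.exceptions.RepositoryNameExistsException:
    # Fall back only when the repository already exists; let other errors surface
    print("Repository already exists")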
    response = codecommit.get_repository(
        repositoryName=repository_name
    )

In [None]:
codecommit_repo = response['repositoryMetadata']['cloneUrlHttp']
codecommit_repo

In [None]:
!sudo rm -rf ./yolov5/.git/

In [None]:
!git init
!git remote add repo_codecommit $codecommit_repo
!git checkout -b main
!git add yolov5
!git commit -m "yolov5-update"
!git push --set-upstream repo_codecommit main

### Storing Values Needed for the Pipeline

In [None]:
%store bucket artifacts_dir codecommit_repo 