# Amazon SageMaker Distributed Training (Image Classification for Oxford-IIIT Pet Dataset)

### Training/Deploying Model for Image dataset

### 1. Lab Overview

This lab proceeds through the steps described below.

### 2. Dataset Description

The Oxford-IIIT Pet Dataset provides about 200 images for each of 37 dog and cat breeds. Its ground truth covers classification, object detection, and segmentation. This lab uses only a subset of the images and trains a model for the classification task over the 37 classes.

### 3. Lab Procedure

This lab is configured so that SageMaker training can be distributed across multiple training instances. When multi-GPU instances such as ml.p3.8xlarge, ml.p3.16xlarge, ml.p3dn.24xlarge, or ml.p4d.24xlarge are used, the setup makes sure that every GPU takes part in training.

[SageMaker Distributed training](https://docs.aws.amazon.com/ko_kr/sagemaker/latest/dg/distributed-training.html) supports two approaches, [Data Parallel](https://docs.aws.amazon.com/ko_kr/sagemaker/latest/dg/data-parallel-intro.html) and [Model Parallel](https://docs.aws.amazon.com/ko_kr/sagemaker/latest/dg/model-parallel.html). Because both are designed for AWS infrastructure, they also perform better there than conventional distributed training. Conventional distributed training packages such as [Horovod](https://distributed-training-workshop.go-aws.com/) and [APEX](https://github.com/NVIDIA/apex) (A Pytorch EXtension) can still be used.

The distributed training environment in this lab can run both SageMaker Data Parallel and the APEX package, so you can compare the two in terms of model performance, training speed, and so on.

After training completes, the trained model is deployed to a SageMaker Endpoint. If you deploy to a lower-cost CPU instance instead of a GPU instance, Amazon Elastic Inference is used so that inference runs faster than on the CPU alone.
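To make the multi-GPU setup described above more concrete, the following is a minimal sketch of how the number of GPU workers and the estimator settings might be derived. The helper names (`world_size`, `per_gpu_batch_size`, `estimator_kwargs`) are hypothetical and are not taken from the lab notebooks. The GPU counts are those of the listed instance types, and the `distribution` dictionary is the form the SageMaker Python SDK estimator accepts for enabling the SageMaker data parallel library. The sketch assumes that, for the APEX path, the training script itself launches one process per GPU, so no `distribution` argument is passed.

```python
from typing import Any, Dict

# GPU counts for the multi-GPU instance types named in this lab.
GPUS_PER_INSTANCE: Dict[str, int] = {
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.p4d.24xlarge": 8,
}


def world_size(instance_type: str, instance_count: int) -> int:
    """Total number of GPU workers when every GPU on every instance is used."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return GPUS_PER_INSTANCE[instance_type] * instance_count


def per_gpu_batch_size(global_batch_size: int, workers: int) -> int:
    """Split a global batch evenly across data-parallel workers."""
    if global_batch_size % workers != 0:
        raise ValueError("global batch size must divide evenly across workers")
    return global_batch_size // workers


def estimator_kwargs(
    instance_type: str, instance_count: int, use_sm_data_parallel: bool
) -> Dict[str, Any]:
    """Build keyword arguments that could be passed to a SageMaker estimator.

    Hypothetical helper: only the instance settings and the optional
    `distribution` entry are produced here.
    """
    kwargs: Dict[str, Any] = {
        "instance_type": instance_type,
        "instance_count": instance_count,
    }
    if use_sm_data_parallel:
        # Enables the SageMaker distributed data parallel library.
        kwargs["distribution"] = {
            "smdistributed": {"dataparallel": {"enabled": True}}
        }
    # Assumption: for APEX, the training script spawns its own
    # per-GPU processes, so no `distribution` key is added.
    return kwargs


if __name__ == "__main__":
    workers = world_size("ml.p3.16xlarge", 2)
    print(workers)                           # 16
    print(per_gpu_batch_size(256, workers))  # 16
    print(estimator_kwargs("ml.p3.16xlarge", 2, use_sm_data_parallel=True))
```

In an actual notebook, a dictionary like the one returned by `estimator_kwargs` would be unpacked into the estimator constructor together with the entry point script, IAM role, and framework version, none of which are shown here.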
## 실습 종료 후 리소스 ì •ë¦¬ ì‹¤ìŠµì´ ì¢…ë£Œë˜ë©´, ì‹¤ìŠµì— ì‚¬ìš©ëœ ë¦¬ì†ŒìŠ¤ë“¤ì„ ëª¨ë‘ ì‚ì œí•´ 주셔야 불필요한 ê³¼ê¸ˆì„ í”¼í•˜ì‹¤ 수 있습니다. 아래 ì‚ì œì— ì•žì„œ SageMaker Notebookì„ í†µí•´ ìƒì„±í•œ ***SageMaker Endpoint***를 ê° Notebook ìƒì„± 페ì´ì§€ì—ì„œ SDK ëª…ë ¹ì–´ë¥¼ 통해 ì‚ì œí•´ 주시기 ë°”ëžë‹ˆë‹¤. ### IAM Role ì‚ì œ [IAMì˜ Role 콘솔](https://console.aws.amazon.com/iam/#/roles)ë¡œ ì´ë™í•˜ê³ ì‹¤ìŠµì— ì‚¬ìš©í–ˆë˜ IAM Roleì„ ê²€ìƒ‰í•˜ì—¬ ì°¾ì€ í›„, ***delete***를 í´ë¦í•˜ì—¬ ì‚ì œí•©ë‹ˆë‹¤. 예를 들어 ***SageMakerIamRole***ê³¼ ê°™ì€ ì´ë¦„으로 실습 ê³¼ì •ì—ì„œ IAM Roleì„ ìƒì„±í•˜ì…¨ë‹¤ë©´ ì´ê²ƒì„ 찾아서 ì‚ì œí•©ë‹ˆë‹¤. ### SageMaker Notebook ì‚ì œ [SageMaker 콘솔](https://ap-northeast-2.console.aws.amazon.com/sagemaker/home?region=ap-northeast-2#/dashboard)ë¡œ ì´ë™í•˜ê³ ì‹¤ìŠµì— ì‚¬ìš©í–ˆë˜ Notebook instance를 검색하여 ì°¾ì€ í›„, ***delete***를 í´ë¦í•˜ì—¬ ì‚ì œí•©ë‹ˆë‹¤. 예를 들어 ***sagemaker-hol-lab***ê³¼ ê°™ì€ ì´ë¦„으로 실습 ê³¼ì •ì—ì„œ Notebookì„ ìƒì„±í•˜ì…¨ë‹¤ë©´ ì´ê²ƒì„ 찾아서 ì‚ì œí•©ë‹ˆë‹¤. ### S3 Bucket ì‚ì œ [S3 콘솔](https://s3.console.aws.amazon.com/s3/home?region=ap-northeast-2)ë¡œ ì´ë™í•˜ê³ ì‹¤ìŠµì— ì‚¬ìš©í–ˆë˜ 2ê°œì˜ bucketì„ ê²€ìƒ‰í•˜ì—¬ ì°¾ì€ í›„, ***delete***를 í´ë¦í•˜ì—¬ ì‚ì œí•©ë‹ˆë‹¤. 예를 들어 ***sagemaker-experiments-ap-northeast-2***와 ***sagemaker-ap-northeast-2*** ê°™ì€ ì´ë¦„으로 실습 ê³¼ì •ì—ì„œ S3 Bucketì„ ìƒì„±í•˜ì…¨ë‹¤ë©´ ì´ê²ƒì„ 찾아서 ì‚ì œí•©ë‹ˆë‹¤. ìˆ˜ê³ í•˜ì…¨ìŠµë‹ˆë‹¤.\ ì´ì œ ëª¨ë“ ë¦¬ì†ŒìŠ¤ ì‚ì œë¥¼ 완료하셨습니다. ## Contributors - Youngjoon Choi (choijoon@amazon.com) - Daekeun Kim (daekeun@amazon.com)