# sagemaker-cv-preprocessing-training-performance

This repository contains an [Amazon SageMaker](https://aws.amazon.com/sagemaker/) training implementation with data pre-processing (decoding + augmentations) on both GPUs and CPUs for computer vision, allowing you to compare and reduce training time by addressing CPU bottlenecks caused by an increasing data pre-processing load. This is achieved by GPU-accelerated JPEG image decoding and by offloading augmentations to GPUs using [NVIDIA DALI](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/). Performance bottlenecks and system utilization metrics are compared using [Amazon SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html).

## Module Description:

- `util_train.py`: Launches [Amazon SageMaker PyTorch training](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html) jobs with your custom training script.
- `src/sm_augmentation_train-script.py`: Custom training script to train models of different complexities (`RESNET-18`, `RESNET-50`, `RESNET-152`) with data pre-processing implementations for:
  - JPEG decoding and augmentation on CPUs using the PyTorch DataLoader
  - JPEG decoding and augmentation on CPUs & GPUs using NVIDIA DALI
- `util_debugger.py`: Extracts system utilization metrics with [SageMaker Debugger](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_debugger.html).

## Run SageMaker training job with decoding and augmentation on GPU:

- Parameters such as the training data path, S3 bucket, number of epochs, and other training hyperparameters can be adapted in `util_train.py`.
- The custom training script used is `src/sm_augmentation_train-script.py`.

```
from util_debugger import get_sys_metric
from util_train import aug_exp_train

aug_exp_train(model_arch='RESNET50',
              batch_size='32',
              aug_operator='dali-gpu',
              instance_type='ml.p3.2xlarge',
              curr_sm_role='to-be-added')
```

- Note that this implementation is currently optimized for single-GPU training to address multi-core CPU bottlenecks. The [DALI Decoder operation](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/supported_ops.html#nvidia.dali.fn.decoders.image) can be updated with improved usage of `device_memory_padding` and `host_memory_padding` for larger multi-GPU instances.
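For reference, the general DALI pattern for moving JPEG decoding and augmentation onto the GPU looks roughly like the sketch below. This is a minimal illustration only, not the repository's exact implementation (see `src/sm_augmentation_train-script.py` for that); the data path, image size, thread count, and device id are placeholder values.

```
# Minimal NVIDIA DALI sketch: GPU-side JPEG decoding and augmentation.
# Paths, image size, and thread/device counts are illustrative placeholders.
from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator


@pipeline_def
def gpu_decode_augment_pipeline(data_dir):
    # Read raw, encoded JPEG bytes from disk on the CPU
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    # "mixed" device: JPEG headers are parsed on the CPU, the actual decode runs on the GPU
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    # Augmentations below operate on GPU tensors
    images = fn.random_resized_crop(images, size=[224, 224])
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        mirror=fn.random.coin_flip(probability=0.5),
    )
    return images, labels.gpu()


pipe = gpu_decode_augment_pipeline(
    data_dir="/opt/ml/input/data/training",  # SageMaker's default training channel path
    batch_size=32,
    num_threads=4,
    device_id=0,
)
pipe.build()

# Wrap the pipeline so it can be iterated like a PyTorch DataLoader
train_loader = DALIGenericIterator(pipe, ["data", "label"], reader_name="Reader")
for batch in train_loader:
    images, labels = batch[0]["data"], batch[0]["label"]
    # ... forward/backward pass on the GPU ...
```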
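The launch snippet above also imports `get_sys_metric` from `util_debugger.py`, which collects system utilization metrics from SageMaker Debugger. As a minimal sketch of what such a helper builds on (the job name and region below are placeholders, and the exact calls in `util_debugger.py` may differ), the `smdebug` library can read a job's system metrics like this:

```
# Sketch: reading SageMaker Debugger system metrics for a finished training job.
# "my-training-job-name" and "us-east-1" are placeholder values.
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram

tj = TrainingJob("my-training-job-name", "us-east-1")
tj.wait_for_sys_profiling_data_to_be_available()

system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

# Histogram of CPU and GPU utilization over the training run;
# the notebook also renders per-core CPU bottleneck heatmaps from the same reader.
MetricsHistogram(system_metrics_reader).plot(
    starttime=0,
    endtime=system_metrics_reader.get_timestamp_of_latest_available_file(),
    select_dimensions=["CPU", "GPU"],
    select_events=["total"],
)
```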
## Experiment to compare bottlenecks:

- Create an Amazon S3 bucket called `sm-aug-test` and upload the [Imagenette dataset](https://github.com/fastai/imagenette) ([download link](https://s3.amazonaws.com/fast-ai-imageclas/imagenette2.tgz)).
- Update your SageMaker execution role and run the notebook to compare seconds/epoch and system utilization across training jobs by toggling the following parameters:
  - `instance_type` (default: `ml.p3.2xlarge`)
  - `model_arch` (default: `RESNET18`)
  - `batch_size` (default: `32`)
  - `aug_load_factor` (default: `12`)
  - `AUGMENTATION_APPROACHES` (default: `['pytorch-cpu', 'dali-gpu']`)
- Comparison results using the default parameter setup above:
  - Seconds/epoch improved by `72.59%` for the Amazon SageMaker training job when JPEG decoding and heavy augmentation are offloaded to the GPU, addressing the data pre-processing bottleneck and improving the performance-cost ratio.
  - With this strategy, the training-time improvement is larger for lighter models such as `RESNET-18` (which are more prone to CPU bottlenecks) than for heavier models such as `RESNET-152`, as `aug_load_factor` is increased while keeping a low batch size of `32`.
- System utilization histograms and CPU bottleneck heatmaps are generated with SageMaker Debugger in the notebook; the Profiler Report and other interactive visualizations are available in SageMaker Studio.
- Further detailed results (covering different augmentation loads, batch sizes, and model complexities for training on 8 CPUs and 1 GPU) are available on request.

## Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

## License

This library is licensed under the MIT-0 License. See the LICENSE file.