{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GPU performance IO Optimise code and data format\n", "\n", "\n", "GPUs can significantly speed up deep learning training, and have the potential to reduce training time from weeks to just hours. However in order to fully benefit from the use of GPUS there are many aspects to consider such as a) code optimizations to ensure that underlying hardware is fully utilized b) using the latest high performant libraries and GPU drivers c) optimizing input/output and network operations to ensure that the data is fed to the GPU at the rate that matches its computations d) optimizing communication between GPUS during multi-GPU or distributed training.\n", " \n", "Here, we will be specifically focusing on optimizations for improving I/O for GPU performance tuning, regardless of the underlying infrastructure or deep learning framework, as shown in Figure1. This is one area where customers stand to benefit the most from, obtaining typically 10X improvements in overall GPU training performance by just optimizing IO processing routines. \n", "\n", "\n", "We will be using the Caltech 256 dataset to demonstrate the results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "import boto3\n", "import os\n", "import sys\n", "import pandas as pd\n", "import tempfile\n", "import shutil\n", "import numpy as np\n", "sys.path.append('./src')\n", "\n", "sagemaker_session = sagemaker.Session()\n", "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", "region = boto3.session.Session().region_name\n", "\n", "role = sagemaker.get_execution_role()\n", "# Optional you can use a role that you choose\n", "#role=\"arn:aws:iam::{}:role/service-role/AmazonSageMaker-ExecutionRole-20190118T115449\".format(account_id)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up data locations in S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bucket = sagemaker_session.default_bucket()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "prefix_s3_train = \"caltech_256/train/\"\n", "s3_train = \"s3://{}/{}\".format(bucket, prefix_s3_train)\n", "tmp_data_dir = \"temp/caltech\"\n", "\n", "prefix_s3_train_processed = \"caltech_256/train_processed/\"\n", "s3_train_processed = \"s3://{}/{}\".format(bucket, prefix_s3_train_processed)\n", "tmp_processed_data_dir = \"temp/caltech_processed\"\n", "\n", "s3_model_path = \"s3://{}/models\".format(bucket)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download Caltech 256 dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set prepare dataset to False , to avoid recreating data on repeated runs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prepare_dataset = True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Download dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "%%bash -s \"$prepare_dataset\" \"$s3_train\" \"$tmp_data_dir\"\n", "set -e\n", "set -x\n", "\n", "prepare_dataset=$1\n", "s3_data_url=$2\n", "local_data_dir=$3\n", "if [ \"$prepare_dataset\" == \"True\" ]\n", "then\n", " rm -rf $local_data_dir\n", " mkdir -p $local_data_dir\n", " \n", " # Download CALTECH 256 dataset\n", " wget 
"    # Download the Caltech 256 dataset\n", "    wget http://www.vision.caltech.edu/Image_Datasets/Caltech256/256_ObjectCategories.tar -O $local_data_dir/data.tar\n", "    \n", "    # Extract the archive\n", "    data_tmp_dir=$local_data_dir/downloaded_data\n", "    mkdir -p $data_tmp_dir\n", "    tar -xf $local_data_dir/data.tar -C $data_tmp_dir\n", "    mv $data_tmp_dir/* $local_data_dir\n", "    \n", "    # Delete the tar file and the temporary directory\n", "    rm $local_data_dir/data.tar\n", "    rm -d $data_tmp_dir\n", "\n", "fi\n", "echo \"$prepare_dataset\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upload raw files to S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "from s3_util import S3Util\n", "\n", "if prepare_dataset:\n", "    S3Util().upload_files(tmp_data_dir, s3_train)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 ls $s3_train\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 ls --recursive $s3_train | wc -l\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create the training job" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Source directory" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "source_dir = './src'\n", "entry_point_file = 'main.py'\n", "dependencies = ['./src/datasets']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_instance_type = \"ml.p3.2xlarge\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_def = [\n", "    {\"Name\": \"loss\",\n", "     \"Regex\": \"## loss ##: (\\d*[.]?\\d*)\"},\n", "    {\"Name\": \"secs_time_per_epoch\",\n", "     \"Regex\": \"## secs_time_per_epoch ##: (\\d*[.]?\\d*)\"}\n", "]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inputs = {\n", "    \"train\" : s3_train\n", "}\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "epochs = 20\n", "batch_size = 32\n", "learning_rate = 0.00001\n", "log_level = \"INFO\"  # or DEBUG" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pytorch_version = \"1.2.0\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### This is the most naive implementation\n", "\n", "This uses a single worker for the dataloader and an unoptimised dataset, so it will have the slowest performance." ] },
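{ "cell_type": "markdown", "metadata": {}, "source": [ "Before looking at the actual code, the sketch below illustrates the general shape of such an unoptimised, map-style dataset: every `__getitem__` call opens an image file from disk, decodes it and applies the transforms, so each sample costs a disk read plus CPU work while the GPU waits. This is only an illustration (the class name and directory handling are assumptions); the real `CustomCaltechDataset` used by the training script is shown in the next cell.\n", "\n", "```python\n", "# Illustrative sketch only -- not the actual implementation used by main.py.\n", "import os\n", "from PIL import Image\n", "from torch.utils.data import Dataset\n", "\n", "class NaiveImageFolderDataset(Dataset):\n", "    def __init__(self, root_dir, transform=None):\n", "        # Walk the class sub-directories once and remember (path, label) pairs.\n", "        self.samples, self.transform = [], transform\n", "        for label, class_dir in enumerate(sorted(os.listdir(root_dir))):\n", "            class_path = os.path.join(root_dir, class_dir)\n", "            if not os.path.isdir(class_path):\n", "                continue\n", "            for name in sorted(os.listdir(class_path)):\n", "                self.samples.append((os.path.join(class_path, name), label))\n", "\n", "    def __len__(self):\n", "        return len(self.samples)\n", "\n", "    def __getitem__(self, index):\n", "        # Every call pays for a disk read, an image decode and the transforms,\n", "        # so a single worker struggles to keep the GPU fed.\n", "        path, label = self.samples[index]\n", "        image = Image.open(path).convert('RGB')\n", "        if self.transform is not None:\n", "            image = self.transform(image)\n", "        return image, label\n", "```" ] },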
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at the custom dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pygmentize ./src/datasets/custom_caltech_dataset.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hp_naivedataset_single_worker = {\n", "    'batch-size': batch_size,\n", "    'numworkers': 1,\n", "    \"dataset_type\": \"CustomCaltechDataset\",\n", "    \"epochs\": epochs,\n", "    \"lr\": learning_rate,\n", "    \"log-level\": log_level\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch import PyTorch\n", "from time import gmtime, strftime\n", "\n", "estimator = PyTorch(entry_point=entry_point_file,\n", "                    source_dir=source_dir,\n", "                    dependencies=dependencies,\n", "                    role=role,\n", "                    py_version=\"py3\",\n", "                    framework_version=pytorch_version,\n", "                    hyperparameters=hp_naivedataset_single_worker,\n", "                    output_path=s3_model_path,\n", "                    metric_definitions=metric_def,\n", "                    train_instance_count=1,\n", "                    train_instance_type=train_instance_type)\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "job_name = \"gpu-performance-naive-one-worker{}\".format(strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime()))\n", "\n", "estimator.fit(inputs, job_name=job_name, wait=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Improve the performance by increasing the number of workers\n", "This uses multiple workers for the dataloader." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example uses an ml.p3.2xlarge instance, which has 8 vCPUs [https://aws.amazon.com/ec2/instance-types/p3/], so we will use 8 - 1 = 7 workers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "num_workers = 7" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hp_naivedataset_multiple_worker = {\n", "    'batch-size': batch_size,\n", "    'numworkers': num_workers,\n", "    \"dataset_type\": \"CustomCaltechDataset\",\n", "    \"epochs\": epochs,\n", "    \"lr\": learning_rate,\n", "    \"log-level\": log_level\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch import PyTorch\n", "from time import gmtime, strftime\n", "\n", "estimator = PyTorch(entry_point=entry_point_file,\n", "                    source_dir=source_dir,\n", "                    role=role,\n", "                    py_version=\"py3\",\n", "                    framework_version=pytorch_version,\n", "                    hyperparameters=hp_naivedataset_multiple_worker,\n", "                    output_path=s3_model_path,\n", "                    metric_definitions=metric_def,\n", "                    train_instance_count=1,\n", "                    train_instance_type=train_instance_type)\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "job_name = \"gpu-performance-naive-multi-worker{}\".format(strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime()))\n", "\n", "estimator.fit(inputs, job_name=job_name, wait=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Improve the performance by optimising the dataset code\n", "This uses a single worker for the dataloader, but uses preprocessed data, so the `__getitem__` function is quite lean. This allows you to obtain performance similar to using multiple workers.\n", "\n", "Sometimes, in order to obtain the best performance, you should both use multiple workers and optimise the dataset code." ] },
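{ "cell_type": "markdown", "metadata": {}, "source": [ "As a rough illustration of what a lean `__getitem__` means, the sketch below loads preprocessed arrays into memory once, so fetching a sample involves no per-item disk access or decoding. The file names and array layout are assumptions made for illustration only; the actual preprocessing logic and the actual `CustomCaltechOptimisedDataset` are shown in the cells that follow.\n", "\n", "```python\n", "# Illustrative sketch only -- see caltech_image_preprocessor.py and\n", "# custom_caltech_optimised_dataset.py below for the real implementation.\n", "import numpy as np\n", "import torch\n", "from torch.utils.data import Dataset\n", "\n", "class PreprocessedArrayDataset(Dataset):\n", "    def __init__(self, images_file, labels_file):\n", "        # The expensive work (decode, resize, normalise) was done once, offline;\n", "        # here we only load the resulting arrays into memory.\n", "        self.images = torch.from_numpy(np.load(images_file))\n", "        self.labels = torch.from_numpy(np.load(labels_file))\n", "\n", "    def __len__(self):\n", "        return len(self.labels)\n", "\n", "    def __getitem__(self, index):\n", "        # No per-sample disk access or decoding -- just an index into memory,\n", "        # so even a single worker can keep the GPU busy.\n", "        return self.images[index], self.labels[index]\n", "```" ] },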
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Preprocessing logic" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pygmentize ./src/datasets/caltech_image_preprocessor.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preprocess the dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "import os, shutil\n", "sys.path.append('./src')\n", "from datasets.caltech_image_preprocessor import CaltechImagePreprocessor\n", "\n", "if prepare_dataset:\n", "    if os.path.exists(tmp_processed_data_dir):\n", "        shutil.rmtree(tmp_processed_data_dir)\n", "    os.makedirs(tmp_processed_data_dir, exist_ok=False)\n", "    CaltechImagePreprocessor().dump(tmp_data_dir, tmp_processed_data_dir, parts=4)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upload the preprocessed data to S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "from s3_util import S3Util\n", "\n", "if prepare_dataset:\n", "    S3Util().upload_files(tmp_processed_data_dir, s3_train_processed)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 ls $s3_train_processed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pygmentize ./src/datasets/custom_caltech_optimised_dataset.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Here we use a single worker to achieve the same performance as multiple workers\n", "In this example, increasing the number of workers would barely yield any performance gain, because the `__getitem__` operation is already optimised as much as possible. In other cases, tuning your dataset code and increasing the number of workers together will provide the optimal performance." ] },
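{ "cell_type": "markdown", "metadata": {}, "source": [ "For reference, the `numworkers` hyperparameter set in these cells ultimately controls the `num_workers` argument of the PyTorch `DataLoader` that the training script builds around the chosen dataset. A minimal sketch of that wiring is shown below; the function and argument defaults are assumptions, not the actual `main.py`.\n", "\n", "```python\n", "# Illustrative sketch of how the numworkers knob is typically wired up;\n", "# see ./src/main.py for the actual training script.\n", "from torch.utils.data import DataLoader\n", "\n", "def build_train_loader(dataset, batch_size, num_workers):\n", "    # num_workers > 0 starts that many background worker processes, which call\n", "    # dataset.__getitem__ in parallel and hand ready batches to the training loop.\n", "    return DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=num_workers)\n", "```\n", "\n", "With the optimised dataset a single worker is enough; with the naive dataset the same knob had to be raised to 7 to hide the per-item decode cost." ] },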
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "num_workers = 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "inputs_procesesd = {\n", " \"train\" : s3_train_processed\n", "}\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hp_optimiseddataset_multiple_worker = {'epochs': 1, \n", " 'batch-size': batch_size,\n", " 'numworkers': num_workers,\n", " \"dataset_type\" : \"CustomCaltechOptimisedDataset\",\n", " \"epochs\":epochs \n", " ,\"lr\":learning_rate\n", " }" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch import PyTorch\n", "from time import gmtime, strftime\n", "\n", "\n", "\n", "estimator = PyTorch(entry_point=entry_point_file,\n", " source_dir=source_dir,\n", " role=role,\n", " py_version=\"py3\",\n", " framework_version = pytorch_version,\n", " hyperparameters=hp_optimiseddataset_multiple_worker,\n", " output_path = s3_model_path,\n", " metric_definitions = metric_def,\n", " train_instance_count=1, \n", " train_instance_type=train_instance_type)\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "job_name = \"gpu-performance-tuned-single-worker{}\".format(strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime()))\n", "\n", "\n", "estimator.fit( inputs_procesesd, job_name=job_name, wait = False)" ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p36", "language": "python", "name": "conda_pytorch_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" } }, "nbformat": 4, "nbformat_minor": 1 }