{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Analyze and plot experiment results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook illustrates how to analyze and plot results obtained with SageMaker Bencher.\n", "\n", "As an example, we will replicate the benchmark analysis and re-create the plots presented in the blog post *[\"Choose the best data source for your Amazon SageMaker training job\"](https://aws.amazon.com/blogs/machine-learning/choose-the-best-data-source-for-your-amazon-sagemaker-training-job/) (G. Nachum, A. Arzhanov)*. To this end, we assume that the benchmarks from that blog post have already been reproduced with SageMaker Bencher by running:\n", "\n", "`python start_experiment.py -f experiments/blog-benchmarks-all.yml`\n", "\n", "Once all the trials of the `blog-benchmarks-all` experiment have finished, we can proceed with this notebook to analyze and plot the results.\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import all required packages\n", "We will use the `sagemaker.analytics` package to fetch all trial component data stored in SageMaker Experiments and convert it into a Pandas DataFrame for detailed analysis. For more information, see *[\"Amazon SageMaker Experiments - Organize, Track, and Compare Your Machine Learning Trainings\"](https://aws.amazon.com/blogs/aws/amazon-sagemaker-experiments-organize-track-and-compare-your-machine-learning-trainings/)*." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from sagemaker.analytics import ExperimentAnalytics\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "sns.set_theme(style=\"whitegrid\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " SageMaker.InstanceCount SageMaker.VolumeSizeInGB batch_size epochs \\\n", "count 37.0 37.0 37.0 37.0 \n", "mean 1.0 500.0 64.0 1.0 \n", "std 0.0 0.0 0.0 0.0 \n", "min 1.0 500.0 64.0 1.0 \n", "25% 1.0 500.0 64.0 1.0 \n", "50% 1.0 500.0 64.0 1.0 \n", "75% 1.0 500.0 64.0 1.0 \n", "max 1.0 500.0 64.0 1.0 \n", "\n", " img_sec_ave_tot img_tot input_dim instance_count max_run \\\n", "count 36.000000 3.600000e+01 37.0 36.0 36.0 \n", "mean 277.114662 7.804785e+05 224.0 1.0 36000.0 \n", "std 56.452787 7.605085e+05 0.0 0.0 0.0 \n", "min 156.142469 3.060700e+04 224.0 1.0 36000.0 \n", "25% 268.677873 3.060700e+04 224.0 1.0 36000.0 \n", "50% 273.414449 7.804785e+05 224.0 1.0 36000.0 \n", "75% 326.925934 1.530350e+06 224.0 1.0 36000.0 \n", "max 330.192943 1.530350e+06 224.0 1.0 36000.0 \n", "\n", " num_parallel_calls ... t_downloading t_epoch_1 \\\n", "count 37.0 ... 36.000000 36.000000 \n", "mean -1.0 ... 988.231500 2779.954310 \n", "std 0.0 ... 3050.065027 2959.798800 \n", "min -1.0 ... 6.774000 111.297508 \n", "25% -1.0 ... 14.151250 113.122569 \n", "50% -1.0 ... 20.420000 2410.863770 \n", "75% -1.0 ... 192.413250 4680.089649 \n", "max -1.0 ... 
11166.507000 9800.736901 \n", "\n", " t_import_framework t_starting t_training t_training_exact \\\n", "count 36.000000 36.000000 36.000000 36.000000 \n", "mean 0.000002 116.300167 2972.268194 2780.133761 \n", "std 0.000001 15.139378 3001.777274 2959.823517 \n", "min 0.000000 99.926000 151.251000 111.408272 \n", "25% 0.000002 107.121250 305.722000 113.304401 \n", "50% 0.000002 111.388500 2546.336500 2411.098773 \n", "75% 0.000002 118.718000 4886.386500 4680.254941 \n", "max 0.000007 167.431000 10150.562000 9800.985027 \n", "\n", " t_training_sm t_uploading volume_size \\\n", "count 36.000000 36.000000 36.0 \n", "mean 3966.000000 5.447306 500.0 \n", "std 4660.215887 0.091283 0.0 \n", "min 309.000000 5.307000 500.0 \n", "25% 333.000000 5.394750 500.0 \n", "50% 2647.000000 5.422000 500.0 \n", "75% 5069.500000 5.490000 500.0 \n", "max 15878.000000 5.716000 500.0 \n", "\n", " SageMaker.ModelArtifact - MediaType \n", "count 0.0 \n", "mean NaN \n", "std NaN \n", "min NaN \n", "25% NaN \n", "50% NaN \n", "75% NaN \n", "max NaN \n", "\n", "[8 rows x 27 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment_name = 'blog-benchmarks-all'\n", "analytics = ExperimentAnalytics(experiment_name=experiment_name) \n", "df = analytics.dataframe()\n", "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some pre-processing\n", "While we track a lot of various parameters and metrics in each trial (see above), for this analysis we would need only a couple of them. 
Specifically, we will focus on the following trial parameters and metrics:\n", "- `img_sec_ave_tot` - the average throughput of the training job in *images/sec*,\n", "- `dataset/train` - the name of the dataset used in the *train* input channel,\n", "- `input_format/train` - the file format (e.g., JPG files or TFRecord files) of the dataset used in the *train* input channel,\n", "- `input_mode/train` - the data input mode used for the *train* input channel (e.g., File, FastFile, or FSx),\n", "- `t_*` metrics - the exact timings of the different phases of the training job (e.g., download, train, or total billable time),\n", "- `img_tot` - the total number of images processed by the training job." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(36, 12)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Columns to keep for the analysis\n", "col_focus = [\n", "    'TrialComponentName',\n", "    'img_sec_ave_tot',\n", "    'dataset/train',\n", "    'input_format/train',\n", "    'input_mode/train',\n", "    't_downloading',\n", "    't_training',\n", "    't_training_exact',\n", "    't_training_sm',\n", "    'instance_type',\n", "    'img_tot'\n", "]\n", "\n", "# Human-readable dataset descriptions used as plot labels\n", "dataset_map = {\n", "    'Caltech-jpg-1x': 'Smaller Dataset / Smaller Files',\n", "    'Caltech-tfr-jpg-1x': 'Smaller Dataset / Larger Files',\n", "    'Caltech-jpg-50x': 'Larger Dataset / Smaller Files',\n", "    'Caltech-tfr-jpg-50x': 'Larger Dataset / Larger Files',\n", "}\n", "\n", "# Keep only the selected columns and drop trial components with missing values\n", "df = df[col_focus]\n", "df = df.dropna()\n", "df['dataset_spec/train'] = df['dataset/train'].map(dataset_map)\n", "df.sort_values(by='input_mode/train', inplace=True, ascending=False)\n", "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tabular statistics\n", "We can slice and dice the extracted Pandas DataFrame as we see fit. For example, we can verify that each trial was indeed repeated 3 times to account for any variance in the timings." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " TrialComponentName \\\n", "input_format/train dataset/train input_mode/train \n", "jpg Caltech-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 \n", "tfrecord/jpg Caltech-tfr-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-tfr-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 \n", "\n", " img_sec_ave_tot \\\n", "input_format/train dataset/train input_mode/train \n", "jpg Caltech-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 \n", "tfrecord/jpg Caltech-tfr-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-tfr-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 \n", "\n", " t_downloading \\\n", "input_format/train dataset/train input_mode/train \n", "jpg Caltech-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 \n", "tfrecord/jpg Caltech-tfr-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-tfr-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 \n", "\n", " t_training \\\n", "input_format/train dataset/train input_mode/train \n", "jpg Caltech-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 \n", "tfrecord/jpg Caltech-tfr-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-tfr-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 \n", "\n", " t_training_exact \\\n", "input_format/train dataset/train input_mode/train \n", "jpg Caltech-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 \n", "tfrecord/jpg Caltech-tfr-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-tfr-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 \n", "\n", " t_training_sm \\\n", "input_format/train dataset/train input_mode/train \n", "jpg Caltech-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 \n", "tfrecord/jpg Caltech-tfr-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-tfr-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 
\n", "\n", " instance_type \\\n", "input_format/train dataset/train input_mode/train \n", "jpg Caltech-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 \n", "tfrecord/jpg Caltech-tfr-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-tfr-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 \n", "\n", " img_tot \\\n", "input_format/train dataset/train input_mode/train \n", "jpg Caltech-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 \n", "tfrecord/jpg Caltech-tfr-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-tfr-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 \n", "\n", " dataset_spec/train \n", "input_format/train dataset/train input_mode/train \n", "jpg Caltech-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 \n", "tfrecord/jpg Caltech-tfr-jpg-1x ffm 3 \n", " file 3 \n", " fsx 3 \n", " Caltech-tfr-jpg-50x ffm 3 \n", " file 3 \n", " fsx 3 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_grouped = df.groupby(['input_format/train', 'dataset/train', 'input_mode/train']).count()\n", "df_grouped" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After this, we can also print out the average values for every unique trial scenario." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " img_sec_ave_tot \\\n", "input_format/train dataset/train input_mode/train \n", "jpg Caltech-jpg-1x ffm 168.419023 \n", " file 268.861786 \n", " fsx 273.752569 \n", " Caltech-jpg-50x ffm 165.830324 \n", " file 329.453352 \n", " fsx 328.379459 \n", "tfrecord/jpg Caltech-tfr-jpg-1x ffm 271.146103 \n", " file 270.676306 \n", " fsx 269.522753 \n", " Caltech-tfr-jpg-50x ffm 325.788708 \n", " file 327.262672 \n", " fsx 326.282893 \n", "\n", " t_downloading \\\n", "input_format/train dataset/train input_mode/train \n", "jpg Caltech-jpg-1x ffm 18.845333 \n", " file 255.186333 \n", " fsx 10.397000 \n", " Caltech-jpg-50x ffm 169.415000 \n", " file 10954.684000 \n", " fsx 12.094333 \n", "tfrecord/jpg Caltech-tfr-jpg-1x ffm 15.862000 \n", " file 22.123333 \n", " fsx 13.735000 \n", " Caltech-tfr-jpg-50x ffm 17.082333 \n", " file 357.262333 \n", " fsx 12.091000 \n", "\n", " t_training \\\n", "input_format/train dataset/train input_mode/train \n", "jpg Caltech-jpg-1x ffm 388.769333 \n", " file 154.636667 \n", " fsx 322.110333 \n", " Caltech-jpg-50x ffm 9598.554333 \n", " file 4703.002000 \n", " fsx 5099.711667 \n", "tfrecord/jpg Caltech-tfr-jpg-1x ffm 293.759333 \n", " file 300.330000 \n", " fsx 303.629000 \n", " Caltech-tfr-jpg-50x ffm 4884.751667 \n", " file 4723.395667 \n", " fsx 4894.568333 \n", "\n", " t_training_exact \\\n", "input_format/train dataset/train input_mode/train \n", "jpg Caltech-jpg-1x ffm 181.852903 \n", " file 113.843675 \n", " fsx 111.805900 \n", " Caltech-jpg-50x ffm 9245.096518 \n", " file 4645.132867 \n", " fsx 4660.390200 \n", "tfrecord/jpg Caltech-tfr-jpg-1x ffm 112.882378 \n", " file 113.099438 \n", " fsx 113.571342 \n", " Caltech-tfr-jpg-50x ffm 4697.403220 \n", " file 4676.236294 \n", " fsx 4690.290394 \n", "\n", " t_training_sm \\\n", "input_format/train dataset/train input_mode/train \n", "jpg Caltech-jpg-1x ffm 413.000000 \n", " file 415.333333 \n", " fsx 338.000000 \n", " Caltech-jpg-50x ffm 9773.333333 \n", " file 
15663.333333 \n", " fsx 5117.666667 \n", "tfrecord/jpg Caltech-tfr-jpg-1x ffm 315.000000 \n", " file 327.666667 \n", " fsx 323.000000 \n", " Caltech-tfr-jpg-50x ffm 4907.333333 \n", " file 5086.000000 \n", " fsx 4912.333333 \n", "\n", " img_tot \n", "input_format/train dataset/train input_mode/train \n", "jpg Caltech-jpg-1x ffm 30607.0 \n", " file 30607.0 \n", " fsx 30607.0 \n", " Caltech-jpg-50x ffm 1530350.0 \n", " file 1530350.0 \n", " fsx 1530350.0 \n", "tfrecord/jpg Caltech-tfr-jpg-1x ffm 30607.0 \n", " file 30607.0 \n", " fsx 30607.0 \n", " Caltech-tfr-jpg-50x ffm 1530350.0 \n", " file 1530350.0 \n", " fsx 1530350.0 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# numeric_only=True skips the string columns, which newer pandas versions refuse to average\n", "df_grouped = df.groupby(['input_format/train', 'dataset/train', 'input_mode/train']).mean(numeric_only=True)\n", "df_grouped" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize the results\n", "We can now produce all kinds of plots to visualize and analyze the benchmark results. For example, let us first plot the average **throughput** (images/sec) for the different datasets, and then the **total billable time** (sec) of the corresponding SageMaker training jobs." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "g = sns.catplot(data=df, x='dataset_spec/train', y='img_sec_ave_tot',\n", " hue='input_mode/train', col='img_tot', ci='sd',\n", " kind='bar', sharey=False, sharex=False, height=6)\n", "g.set_axis_labels('', 'Throughput [imgs/sec]').set_titles(\"\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "g = sns.catplot(data=df, x='dataset_spec/train', y='t_training_sm',\n", " hue='input_mode/train', col='img_tot', ci='sd',\n", " kind='bar', sharey=False, sharex=False, height=6)\n", "g.set_axis_labels('', 'Billable time [sec]').set_titles(\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Replicate plots from the blog post\n", "Now let us try to replicate the benchmark plots from the blog post, which show the total job time with the breakdown for the downloading, training, and other stages. See *[\"Choose the best data source for your Amazon SageMaker training job\"](https://aws.amazon.com/blogs/machine-learning/choose-the-best-data-source-for-your-amazon-sagemaker-training-job/)* for details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Some helper function for fancy plotting" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def change_width(ax, new_value) :\n", " for patch in ax.patches :\n", " current_width = patch.get_width()\n", " diff = current_width - new_value\n", "\n", " # we change the bar width\n", " patch.set_width(new_value)\n", "\n", " # we recenter the bar\n", " patch.set_x(patch.get_x() + diff * .5)\n", " \n", "def show_values_on_bars(axs):\n", " def _show_on_single_plot(ax): \n", " for p in ax.patches:\n", " _x = p.get_x() + p.get_width() / 2\n", " _y = p.get_y() + p.get_height()\n", " value = '{:.2f}'.format(p.get_height())\n", " ax.text(_x, _y, value, ha=\"center\") \n", "\n", " if isinstance(axs, np.ndarray):\n", " for idx, ax in np.ndenumerate(axs):\n", " _show_on_single_plot(ax)\n", " else:\n", " _show_on_single_plot(axs)\n", "\n", "def plot_stacked_times(df, title=None, lim=None, savefile=None, bar_width=0.5, **kw):\n", " \n", " import matplotlib.patches as mpatches\n", " import matplotlib.pyplot as plt\n", " sns.set_palette(\"deep\")\n", " \n", " x_col = 'input_mode/train'\n", " \n", " 
clrs = {\n", " 'train': 'coral',\n", " 'download': 'cornflowerblue',\n", " 'rest': 'grey'\n", " }\n", " \n", " _df = df.copy()\n", " \n", " _df['t_dnt'] = _df['t_training_exact'] + _df['t_downloading']\n", " \n", " fig, ax = plt.subplots()\n", "\n", " bar_tot = sns.barplot(x=x_col, y=\"t_training_sm\", data=_df, color=clrs['rest'], ax=ax, **kw)\n", " bar_dnt = sns.barplot(x=x_col, y=\"t_dnt\", data=_df, color=clrs['download'], ax=ax, **kw)\n", " bar_exact = sns.barplot(x=x_col, y=\"t_training_exact\", data=_df, color=clrs['train'], ax=ax, **kw)\n", " \n", " if lim:\n", " bar_tot.set_ylim(lim)\n", " bar_dnt.set_ylim(lim)\n", " bar_exact.set_ylim(lim)\n", "\n", " top_bar = mpatches.Patch(color=clrs['rest'], label='setup time')\n", " mid_bar = mpatches.Patch(color=clrs['train'], label='training time')\n", " bottom_bar = mpatches.Patch(color=clrs['download'], label='downloading time')\n", " \n", " #change width\n", " change_width(ax, bar_width)\n", " \n", " #add text\n", " #show_values_on_bars(ax)\n", "\n", " #plt.legend(handles=[mid_bar, bottom_bar], loc='upper left')\n", " \n", " ax.tick_params(axis='y', which='major', labelsize=14)\n", " ax.tick_params(axis='x', which='major', labelsize=16)\n", " \n", " ticklabels = ['FSx', 'File', 'FastFile']\n", " ax.set_xticklabels(ticklabels)\n", " \n", " ax.set_xlabel('Input Mode', fontsize=18, fontweight='bold')\n", " ax.set_ylabel('Job Time [s]', fontsize=18, fontweight='bold')\n", " \n", " fig.set_size_inches(8, 6)\n", " \n", " if title:\n", " plt.title(title, fontdict = {'fontsize' : 20})\n", "\n", " # show the graph\n", " plt.show()\n", " \n", " if savefile:\n", " fig.savefig(savefile, bbox_inches='tight')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Blog post plots" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": false }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_params = [\n", " ('Smaller Dataset / Smaller Files', 500),\n", " ('Larger Dataset / Smaller Files', 16000),\n", " ('Smaller Dataset / Larger Files', 500),\n", " ('Larger Dataset / Larger Files', 16000),\n", "]\n", "\n", "for ds, ylim in plot_params:\n", " df_to_plot = df.loc[df['dataset_spec/train'] == ds]\n", " plot_stacked_times(df_to_plot, title=ds, lim=[0, ylim], ci=None, dodge=False, bar_width=0.75)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Do your results look similar to the ones from the blog post?**\n", "![Blog Results](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/02/15/ML-2979-image005.jpg)\n", "\n", "Source: *[\"Choose the best data source for your Amazon SageMaker training job\"](https://aws.amazon.com/blogs/machine-learning/choose-the-best-data-source-for-your-amazon-sagemaker-training-job/) (G. Nachum, A. Arzhanov)*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" } }, "nbformat": 4, "nbformat_minor": 4 }