{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Parallel Training\n", "\n", "This notebook trains the ENet model on a number of GPUs distributed across multiple `ml.p3.16xlarge` instances\n", "using [SageMaker's Distributed Data Parallel](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html) library.\n", "\n", "A prerequisite for model training is a preprocessed dataset which is done in a [separate notebook](preprocess-camvid.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports and Paths\n", "\n", "The next cell imports modules from the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/)\n", "that we need for training the model, sets up a SageMaker session,\n", "and then defines the S3 URIs for the preprocessed data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%reload_ext dotenv\n", "%dotenv\n", "\n", "import sagemaker\n", "from sagemaker.tensorflow import TensorFlow\n", "from sagemaker.inputs import TrainingInput\n", "\n", "session = sagemaker.Session()\n", "bucket = session.default_bucket()\n", "role = sagemaker.get_execution_role()\n", "training_role = role\n", "\n", "prefix = 'enet-tensorflow-distributed'\n", "train_path = f's3://{bucket}/{prefix}/preprocessed-data/camvid/train/'\n", "train_labels_path = f's3://{bucket}/{prefix}/preprocessed-data/camvid/train_labels/'\n", "val_path = f's3://{bucket}/{prefix}/preprocessed-data/camvid/val/'\n", "val_labels_path = f's3://{bucket}/{prefix}/preprocessed-data/camvid/val_labels/'\n", "test_path = f's3://{bucket}/{prefix}/preprocessed-data/camvid/test/'\n", "test_labels_path = f's3://{bucket}/{prefix}/preprocessed-data/camvid/test_labels/'\n", "report_path = f's3://{bucket}/{prefix}/preprocessed-data/camvid/report/'\n", "preprocessing_report_path = f'{report_path}preprocessing_report.json'\n", "class_dict_path = f'{report_path}class_dict.json'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define Training Job\n", "\n", "Since the ENet model is implemented in TensorFlow, we're using the [`TensorFlow estimator`](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html) to train it via Amazon SageMaker\n", "using a custom [training script](../scripts/train_data_parallel.py) (set via the `source_dir` and `entry_point` arguments).\n", "\n", "We also set the model's hyperparameters,\n", "as well as metric definitions that allow us to extract training metrics from log output.\n", "\n", "For cost efficiency we're using [managed spot training](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html) (by setting `use_spot_instances=True` and providing `max_run` and `max_wait`).\n", "\n", "For data parallel training we provide a [`distribution`](https://sagemaker.readthedocs.io/en/stable/api/training/smd_data_parallel_use_sm_pysdk.html) argument which configures the distributed training.\n", "\n", "Note that the training job runs on two `ml.p3.16xlarge` instances (`instance_count=2`)." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "hyperparameters = {\n", " 'dropout-rate1': 0.01,\n", " 'dropout-rate2': 0.1,\n", " 'batch-size': 4,\n", " 'learning-rate': 0.001,\n", " 'epochs': 25,\n", "}\n", "metric_definitions = [\n", " {'Name': 'Epoch', 'Regex': r'# epoch = (\\d+)'},\n", " {'Name': 'Loss', 'Regex': r'# loss = ([\\d.\\-\\+e]+)'},\n", " {'Name': 'Val Loss', 'Regex': r'# val_loss = ([\\d.\\-\\+e]+)'},\n", " {'Name': 'Mean IoU', 'Regex': r'# mean_iou = ([\\d.\\-\\+e]+)'},\n", " {'Name': 'Val Mean IoU', 'Regex': r'# val_mean_iou = ([\\d.\\-\\+e]+)'},\n", "]\n", "estimator = TensorFlow(\n", " base_job_name='enet-tf-dp-train',\n", " py_version='py39',\n", " framework_version='2.8.0',\n", " model_dir='/opt/ml/model',\n", " checkpoint_local_path='/opt/ml/checkpoints',\n", " entry_point='scripts/train_data_parallel.py',\n", " source_dir='../',\n", " hyperparameters=hyperparameters,\n", " metric_definitions=metric_definitions,\n", " role=training_role,\n", " sagemaker_session=session,\n", " instance_count=2,\n", " instance_type='ml.p3.16xlarge',\n", " distribution={\n", " 'smdistributed': {\n", " 'dataparallel': {\n", " 'enabled': True,\n", " 'custom_mpi_options': '-verbose -x NCCL_DEBUG=VERSION'\n", " }\n", " }\n", " },\n", " use_spot_instances=True,\n", " max_run=10*3600,\n", " max_wait=16*3600,\n", ")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run Training Job\n", "\n", "We then run the training job by invoking the TensorFlow estimator's `fit` method.\n", "As argument we provide the data [inputs](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html) with the locations of the preprocessed dataset in S3." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "estimator.fit({\n", " 'train': TrainingInput(train_path),\n", " 'train_labels': TrainingInput(train_labels_path),\n", " 'val': TrainingInput(val_path),\n", " 'val_labels': TrainingInput(val_labels_path),\n", " 'test': TrainingInput(test_path),\n", " 'test_labels': TrainingInput(test_labels_path),\n", " 'report': TrainingInput(report_path),\n", "}, wait=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "During training we can stream the logs of the training job to the notebook to follow its progress." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimator.logs()" ] } ], "metadata": { "interpreter": { "hash": "23e9edbf101ec1229ee15d5e8950818b02dd17cfc3730b3f1ee235cf2fb9b8d3" }, "kernelspec": { "display_name": "Python 3.9.11 64-bit ('3.9.11')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.11" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }