{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Module 3. Training on SageMaker Environment with Profiling Enabled\n", "---\n", "\n", "This notebook can train on a single GPU or perform distributed training using PyTorch DDP.\n", "\n", "This hands-on can be completed in about **20 minutes**. \n", "\n", "

Note

\n", "If you are considering training large models & hundreds of gigabytes to terabytes of data, consider the SageMaker Distributed Training option.

" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store -r \n", "%store" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import copy\n", "import time\n", "import numpy as np\n", "import torch, os\n", "import matplotlib.pyplot as plt\n", "import src.train_utils\n", "\n", "import boto3\n", "import sagemaker\n", "\n", "boto_session = boto3.Session()\n", "sagemaker_session = sagemaker.Session(boto_session=boto_session)\n", "region = boto3.Session().region_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " bucket \n", " dataset_dir \n", " print(\"[OK] You can proceed.\")\n", "except NameError:\n", " print(\"+\"*60)\n", " print(\"[ERROR] Please run '01_make_augmented_imgs.ipynb' before you continue.\")\n", " print(\"+\"*60)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch import PyTorch\n", "role = sagemaker.get_execution_role()\n", "s3_path = f's3://{bucket}/{dataset_dir}'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import wandb\n", "# os.environ[\"WANDB_NOTEBOOK_NAME\"] = dataset_dir\n", "# wandb.sagemaker_auth(path=\"./src\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# 1. SageMaker Debugger Profiling Enabled\n", "---\n", "\n", "To enable profiling, create a `ProfilerConfig` object and pass it to the SageMaker estimator `profiler_config` parameter. This example sets the profiling interval to 500 milliseconds (0.5 seconds)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.debugger import (ProfilerConfig,\n", " FrameworkProfile,\n", " CollectionConfig,\n", " DebuggerHookConfig,\n", " DetailedProfilingConfig, \n", " DataloaderProfilingConfig, \n", " PythonProfilingConfig,\n", " Rule,\n", " PythonProfiler,\n", " cProfileTimer,\n", " ProfilerRule,\n", " rule_configs)\n", "\n", "# Location in S3 where the debugger output will be stored is mentioned in the previous step\n", "\n", "# Set the profile config for both system and framework metrics\n", "profiler_config = ProfilerConfig(\n", " system_monitor_interval_millis = 500,\n", " framework_profile_params = FrameworkProfile(\n", " detailed_profiling_config = DetailedProfilingConfig(\n", " start_step = 5, \n", " num_steps = 10\n", " ),\n", " dataloader_profiling_config = DataloaderProfilingConfig(\n", " start_step = 7, \n", " num_steps = 10\n", " ),\n", " python_profiling_config = PythonProfilingConfig(\n", " start_step = 9, \n", " num_steps = 10,\n", " python_profiler = PythonProfiler.CPROFILE, \n", " cprofile_timer = cProfileTimer.TOTAL_TIME\n", " )\n", " )\n", ")\n", "\n", "# Set the debugger hook config to save tensors\n", "debugger_hook_config = DebuggerHookConfig(\n", " collection_configs = [\n", " CollectionConfig(name = 'weights'),\n", " CollectionConfig(name = 'gradients')\n", " ]\n", ")\n", "\n", "# Set the rules to analyze tensors emitted during training. \n", "# These specific set of rules will inspect the overall training performance and progress of the model\n", "rules = [\n", " ProfilerRule.sagemaker(rule_configs.ProfilerReport()),\n", " Rule.sagemaker(rule_configs.loss_not_decreasing()),\n", " Rule.sagemaker(rule_configs.overfit()),\n", " Rule.sagemaker(rule_configs.overtraining()),\n", " Rule.sagemaker(rule_configs.stalled_training_rule())\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# 2. Run a Training Job\n", "---\n", "\n", "Create a training job using the standard SageMaker Estimator API for PyTorch. \n", "\n", "The python training script file utilized by SageMaker will also use the same python file as the on-premise. Therefore, if you add only SageMaker environment variables, you can train without modifying the source code. The code snippet below is an example of setting SageMaker environment variables.\n", "\n", "```python\n", "parser = argparse.ArgumentParser()\n", "...\n", "\n", "# SageMaker environment variables\n", "parser.add_argument('--hosts', type=list,\n", " default=json.loads(os.environ['SM_HOSTS']))\n", "parser.add_argument('--current_host', type=str,\n", " default=os.environ['SM_CURRENT_HOST'])\n", "parser.add_argument('--model_dir', type=str,\n", " default=os.environ['SM_MODEL_DIR'])\n", "parser.add_argument('--model_chkpt_dir', type=str,\n", " default='/opt/ml/checkpoint') \n", "parser.add_argument('--train_dir', type=str,\n", " default=os.environ['SM_CHANNEL_TRAIN'])\n", "parser.add_argument('--valid_dir', type=str,\n", " default=os.environ['SM_CHANNEL_VALID']) \n", "parser.add_argument('--num_gpus', type=int,\n", " default=os.environ['SM_NUM_GPUS'])\n", "parser.add_argument('--output_data_dir', type=str,\n", " default=os.environ.get('SM_OUTPUT_DATA_DIR'))\n", "```\n", "\n", "For example, among the environment variables, `SM_MODEL_DIR` means `/opt/ml/model` in the SageMaker training container environment. For various environment variables, please refer to [SageMaker Inference Toolkit](https://github.com/aws/sagemaker-inference-toolkit)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "IS_DISTRIBUTED_TRAINING = False\n", "USE_WANDB = False\n", "FRAMEWORK_VERSION = '1.6.0'\n", "entry_point = 'train_single_gpu_wandb.py' if USE_WANDB else 'train_single_gpu.py'\n", "\n", "if IS_DISTRIBUTED_TRAINING:\n", " estimator = PyTorch(entry_point='train_multi_gpu.py',\n", " source_dir='./src',\n", " role=role,\n", " instance_type='ml.g4dn.12xlarge',\n", " instance_count=1,\n", " framework_version=FRAMEWORK_VERSION,\n", " py_version='py3',\n", " disable_profiler=False,\n", " profiler_config=profiler_config, \n", " debugger_hook_config=debugger_hook_config,\n", " hyperparameters = {'num_epochs': 10, \n", " 'batch_size': 128, # This parameter is divided by the number of GPUs\n", " 'lr': 0.0005,\n", " }\n", " )\n", "else:\n", " estimator = PyTorch(entry_point=entry_point,\n", " source_dir='./src',\n", " role=role,\n", " instance_type='ml.g4dn.xlarge',\n", " instance_count=1,\n", " framework_version=FRAMEWORK_VERSION,\n", " py_version='py3',\n", " disable_profiler=False,\n", " profiler_config=profiler_config, \n", " debugger_hook_config=debugger_hook_config, \n", " hyperparameters = {'num_epochs': 6, \n", " 'batch_size': 64,\n", " 'lr': 0.0005,\n", " } \n", " )\n", " \n", "train_input = sagemaker.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, dataset_dir)) \n", "valid_input = sagemaker.TrainingInput(s3_data='s3://{}/{}/valid'.format(bucket, dataset_dir)) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "estimator.fit({'train': train_input, 'valid': valid_input}, wait=False)\n", "train_job_name = estimator.latest_training_job.job_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.core.display import display, HTML\n", "\n", "display(\n", " HTML(\n", " 'Review Training Job After 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(\n",
"    HTML(\n",
"        'Review <a target=\"_blank\" href=\"https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/TrainingJobs;prefix={};streamFilter=typeLogStreamPrefix\">CloudWatch Logs</a> After About 5 Minutes'.format(\n",
"            region, train_job_name\n",
"        )\n",
"    )\n",
")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(HTML(\n",
"    'Review <a target=\"_blank\" href=\"https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview\">S3 Output Data</a> After the Training Job Has Completed'.format(\n",
"        bucket, train_job_name, region\n",
"    )\n",
"))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Because the `fit()` call above used `wait=False`, the training job runs asynchronously. Run the code cell below to switch to a synchronous wait; it streams the training logs and blocks until the job completes." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sagemaker.Session().logs_for_job(job_name=train_job_name, wait=True)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
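"For reference, the main SageMaker-specific step inside the training script itself is writing the final weights to `SM_MODEL_DIR`; SageMaker packages that directory as `model.tar.gz` in S3 (the archive downloaded in section 4). The sketch below illustrates that convention with a placeholder network; the actual scripts in `./src` may save their checkpoints differently.\n",
"\n",
"```python\n",
"# Sketch only: how a training script typically persists a model for SageMaker.\n",
"# Everything written to SM_MODEL_DIR (/opt/ml/model) is uploaded to S3 as model.tar.gz\n",
"# when the job finishes. The file name matches the one expected in section 4.\n",
"import os\n",
"import torch\n",
"import torch.nn as nn\n",
"\n",
"model = nn.Linear(10, 2)  # placeholder for the trained network\n",
"model_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')\n",
"os.makedirs(model_dir, exist_ok=True)\n",
"torch.save(model.state_dict(), os.path.join(model_dir, 'model_best.pth'))\n",
"```\n",
"\n",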
\n", " \n", "# 3. Analysis profiled data\n", "---\n", "\n", "When the training job starts, SageMaker Debugger starts collecting system and framework metrics. Once metrics collection begins, the profiling data can be analyzed in various ways, including plot and query." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rule_output_path = estimator.output_path + train_job_name + \"/rule-output\"\n", "print(f\"You will find the profiler report in {rule_output_path}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Profiler Report\n", "\n", "The `ProfilerReport()` rule generates an html report `profiler-report.html` with a summary of the basic rules and recommendations for next steps. You can find this report in your S3 bucket.\n", "\n", "For more information on how to download and open the Debugger Profiling Report, see '[SageMaker Debugger Profiling Report](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html) in the SageMaker Developer Guide.\n", "\n", "**[Caution] If running in JupyterLab, click \"Trust HTML\" at the top left of the screen to display the html report normally!**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import src.profiling_utils as profiling_utils\n", "import json, os\n", "from IPython.core.display import display, HTML\n", "\n", "rule_output_path = estimator.output_path + train_job_name + \"/rule-output\"\n", "output_dir = './output'\n", "profile_output = output_dir+'/ProfilerReport'\n", "profile_report_folder = profiling_utils.get_profile_report_folder(bucket, train_job_name + \"/rule-output\")\n", "\n", "!rm -rf $output_dir\n", "\n", "if not os.path.exists(output_dir):\n", " os.makedirs(output_dir)\n", " \n", "if not os.path.exists(profile_output):\n", " os.makedirs(profile_output) \n", " \n", "!aws s3 ls {rule_output_path}/{profile_report_folder}/profiler-output/\n", "!aws s3 cp {rule_output_path}/{profile_report_folder}/profiler-output/ {output_dir}/ProfilerReport/ --recursive \n", "\n", "display(HTML('ProfilerReport : Profiler Report'.format(output_dir+\"/ProfilerReport/\")))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob\n", "tj = TrainingJob(train_job_name, region)\n", "\n", "# Retrieve a description of the training job description and the S3 bucket URI where the metric data are saved\n", "tj.describe_training_job()\n", "tj.get_config_and_profiler_s3_output_path()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Wait for the data to be available\n", "tj.wait_for_sys_profiling_data_to_be_available()\n", "tj.wait_for_framework_profiling_data_to_be_available()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Metrics Dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "system_metrics_df, framework_metrics_df = profiling_utils.get_profiling_df(tj)\n", "\n", "display(system_metrics_df.head())\n", "display(framework_metrics_df.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot profiling metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "profiling_utils.plot_profiling_metrics(tj)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# 4. Import model artifacts from S3 into your Local Environment\n", "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "local_model_path = 'model'\n", "model_name = 'model_best.pth'\n", "base_model_name = 'mobilenetv2'\n", "os.makedirs(local_model_path, exist_ok=True)\n", "s3_model_path = estimator.model_data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash -s \"$local_model_path\" \"$s3_model_path\"\n", "aws s3 cp $2 $1\n", "cd $1\n", "tar -xzvf model.tar.gz\n", "rm model.tar.gz" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store train_job_name s3_model_path base_model_name local_model_path model_name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# Next Step\n", "\n", "In this session, the model was trained by invoking the SageMaker Training job. Please proceed to `4.1_neo_compile.ipynb`." ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p37", "language": "python", "name": "conda_pytorch_p37" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }