{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Profiling TensorFlow Multi GPU Multi Node Training Job with Amazon SageMaker Debugger (SageMaker SDK)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-debugger|tensorflow_profiling|tf-resnet-profiling-multi-gpu-multi-node.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "This notebook walks you through creating a TensorFlow training job with the SageMaker Debugger profiling feature enabled. It creates a multi-GPU, multi-node training job using Horovod. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (Optional) Install SageMaker and SMDebug Python SDKs\n", "To use the new Debugger profiling features released in December 2020, ensure that you have the latest versions of the SageMaker and SMDebug SDKs installed. Use the following cell to update the libraries and restart the Jupyter kernel to apply the updates." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "import IPython\n", "install_needed = False  # should only be True once\n", "if install_needed:\n", "    print(\"installing deps and restarting kernel\")\n", "    !{sys.executable} -m pip install -U sagemaker smdebug\n", "    IPython.Application.instance().kernel.do_shutdown(True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 
Create a Training Job with Profiling Enabled\n", "\n", "You will use the standard [SageMaker Estimator API for TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) to create training jobs. To enable profiling, create a `ProfilerConfig` object and pass it to the `profiler_config` parameter of the `TensorFlow` estimator." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define parameters for distributed training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This parameter tells SageMaker how to configure and run Horovod. If you want to use more than 4 GPUs per node, change the `processes_per_host` parameter accordingly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "distributions = {\n", "    \"mpi\": {\n", "        \"enabled\": True,\n", "        \"processes_per_host\": 4,\n", "        \"custom_mpi_options\": \"-verbose -x HOROVOD_TIMELINE=./hvd_timeline.json -x NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none\",\n", "    }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure rules\n", "We specify the following rules:\n", "- loss_not_decreasing: checks if the loss is decreasing and triggers if the loss has not decreased by a certain percentage in the last few iterations\n", "- LowGPUUtilization: checks if the GPU is underutilized\n", "- ProfilerReport: runs the entire set of performance rules and creates a final output report with further insights and recommendations." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.debugger import Rule, ProfilerRule, rule_configs\n", "\n", "rules = [\n", "    Rule.sagemaker(rule_configs.loss_not_decreasing()),\n", "    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),\n", "    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specify a profiler configuration\n", "The following configuration captures system metrics every 500 milliseconds. The system metrics include utilization per CPU and GPU, memory utilization per CPU and GPU, as well as I/O and network.\n", "\n", "Debugger captures detailed profiling information for 10 steps, starting at step 5. This information includes Horovod metrics, data loading, preprocessing, and operators running on CPU and GPU." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.debugger import ProfilerConfig, FrameworkProfile\n", "\n", "profiler_config = ProfilerConfig(\n", "    system_monitor_interval_millis=500,\n", "    framework_profile_params=FrameworkProfile(\n", "        local_path=\"/opt/ml/output/profiler/\", start_step=5, num_steps=10\n", "    ),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get the image URI\n", "The image that we will use depends on the region that you are running this notebook in." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "\n", "session = boto3.session.Session()\n", "region = session.region_name\n", "\n", "image_uri = f\"763104351884.dkr.ecr.{region}.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define estimator\n", "\n", "To enable profiling, you need to pass the Debugger profiling configuration (`profiler_config`), a list of Debugger rules (`rules`), and the image URI (`image_uri`) to the estimator. Debugger turns on monitoring and profiling when the SageMaker estimator requests the training job." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "from sagemaker.tensorflow import TensorFlow\n", "\n", "estimator = TensorFlow(\n", "    role=sagemaker.get_execution_role(),\n", "    image_uri=image_uri,\n", "    instance_count=2,\n", "    instance_type=\"ml.p3.8xlarge\",\n", "    entry_point=\"tf-hvd-train.py\",\n", "    source_dir=\"entry_point\",\n", "    profiler_config=profiler_config,\n", "    distribution=distributions,\n", "    rules=rules,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Start training job\n", "\n", "The following `estimator.fit()` call with the `wait=False` argument starts the training job in the background. You can proceed to run the dashboard or analysis notebooks." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimator.fit(wait=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Analyze Profiling Data\n", "\n", "Copy the outputs of the following cell (`training_job_name` and `region`) to run the analysis notebooks `profiling_generic_dashboard.ipynb`, `analyze_performance_bottlenecks.ipynb`, and `profiling_interactive_analysis.ipynb`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_job_name = estimator.latest_training_job.name\n", "print(f\"Training job name: {training_job_name}\")\n", "print(f\"Region: {region}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While the training is still in progress, you can visualize the performance data in SageMaker Studio or in the notebook.\n", "Debugger provides utilities to plot system metrics in the form of timeline charts or heatmaps. Check out the notebook \n", "[profiling_interactive_analysis.ipynb](analysis_tools/profiling_interactive_analysis.ipynb) for more details. In the following code cell we plot the total CPU and GPU utilization as time series charts. To visualize other metrics such as I/O, memory, and network, extend the lists passed to `select_dimensions` and `select_events`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install the SMDebug client library to use Debugger analysis tools" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "import sys\n", "\n", "\n", "def import_or_install(package):\n", "    try:\n", "        __import__(package)\n", "    except ImportError:\n", "        # pip.main is an internal pip API; run pip through the current interpreter instead\n", "        subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", package])\n", "\n", "\n", "import_or_install(\"smdebug\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Access the profiling data using the SMDebug `TrainingJob` utility class" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob\n", "\n", "tj = TrainingJob(training_job_name, region)\n", "tj.wait_for_sys_profiling_data_to_be_available()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot timeline charts\n", "\n", "The following code shows how to use the SMDebug `TrainingJob` object, refresh the object if new event files are available, and plot timeline 
charts of CPU and GPU usage." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts\n", "\n", "system_metrics_reader = tj.get_systems_metrics_reader()\n", "system_metrics_reader.refresh_event_file_list()\n", "\n", "view_timeline_charts = TimelineCharts(\n", "    system_metrics_reader,\n", "    framework_metrics_reader=None,\n", "    select_dimensions=[\"CPU\", \"GPU\"],\n", "    select_events=[\"total\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Download Debugger Profiling Report" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `ProfilerReport()` rule creates an HTML report, `profiler-report.html`, with a summary of the built-in rules and recommendations for next steps. You can find this report in your S3 bucket. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + \"/rule-output\"\n", "print(f\"You will find the profiler report in {rule_output_path}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more information about how to download and open the Debugger profiling report, see [SageMaker Debugger Profiling Report](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html) in the SageMaker developer guide." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2, which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-debugger|tensorflow_profiling|tf-resnet-profiling-multi-gpu-multi-node.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-debugger|tensorflow_profiling|tf-resnet-profiling-multi-gpu-multi-node.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-debugger|tensorflow_profiling|tf-resnet-profiling-multi-gpu-multi-node.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-debugger|tensorflow_profiling|tf-resnet-profiling-multi-gpu-multi-node.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-debugger|tensorflow_profiling|tf-resnet-profiling-multi-gpu-multi-node.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-debugger|tensorflow_profiling|tf-resnet-profiling-multi-gpu-multi-node.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-debugger|tensorflow_profiling|tf-resnet-profiling-multi-gpu-multi-node.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-debugger|tensorflow_profiling|tf-resnet-profiling-multi-gpu-multi-node.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-debugger|tensorflow_profiling|tf-resnet-profiling-multi-gpu-multi-node.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-debugger|tensorflow_profiling|tf-resnet-profiling-multi-gpu-multi-node.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-debugger|tensorflow_profiling|tf-resnet-profiling-multi-gpu-multi-node.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-debugger|tensorflow_profiling|tf-resnet-profiling-multi-gpu-multi-node.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-debugger|tensorflow_profiling|tf-resnet-profiling-multi-gpu-multi-node.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-debugger|tensorflow_profiling|tf-resnet-profiling-multi-gpu-multi-node.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-debugger|tensorflow_profiling|tf-resnet-profiling-multi-gpu-multi-node.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (Data Science 2.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/sagemaker-data-science-38" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 4 }