{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Compile and Train a Binary Classification Trainer Model with the SST2 Dataset for Single-Node Single-GPU Training" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|roberta-base|roberta-base.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. [Introduction](#Introduction) \n", "2. [Development Environment and Permissions](#Development-Environment-and-Permissions)\n", " 1. [Installation](#Installation) \n", " 2. [SageMaker environment](#SageMaker-environment)\n", "3. [Processing](#Preprocessing) \n", " 1. [Tokenization](#Tokenization) \n", " 2. [Uploading data to sagemaker_session_bucket](#Uploading-data-to-sagemaker_session_bucket) \n", "4. [SageMaker Training Job](#SageMaker-Training-Job) \n", " 1. [Training with Native PyTorch](#Training-with-Native-PyTorch) \n", " 2. [Training with Optimized PyTorch](#Training-with-Optimized-PyTorch) \n", " 3. [Analysis](#Analysis) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker Training Compiler Overview\n", "\n", "SageMaker Training Compiler is a capability of SageMaker that makes these hard-to-implement optimizations to reduce training time on GPU instances. The compiler optimizes DL models to accelerate training by more efficiently using SageMaker machine learning (ML) GPU instances. SageMaker Training Compiler is available at no additional charge within SageMaker and can help reduce total billable time as it accelerates training. \n", "\n", "SageMaker Training Compiler is integrated into the AWS Deep Learning Containers (DLCs). Using the SageMaker Training Compiler enabled AWS DLCs, you can compile and optimize training jobs on GPU instances with minimal changes to your code. Bring your deep learning models to SageMaker and enable SageMaker Training Compiler to accelerate the speed of your training job on SageMaker ML instances for accelerated computing. \n", "\n", "For more information, see [SageMaker Training Compiler](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html) in the *Amazon SageMaker Developer Guide*.\n", "\n", "## Introduction\n", "\n", "In this demo, you'll use Hugging Face's transformers and datasets libraries with Amazon SageMaker Training Compiler to train the RoBERTa model on the Stanford Sentiment Treebank v2 (SST2) dataset. To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on. \n", "\n", "**NOTE:** You can run this demo in SageMaker Studio, SageMaker notebook instances, or your local machine with AWS CLI set up. If using SageMaker Studio or SageMaker notebook instances, make sure you choose one of the PyTorch-based kernels, Python 3 (PyTorch x.y Python 3.x CPU Optimized) or conda_pytorch_p36 respectively.\n", "\n", "**NOTE:** This notebook uses two ml.p3.2xlarge instances that have single GPU. 
If you don't have enough quota, see [Request a service quota increase for SageMaker resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Development Environment " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Installation\n", "\n", "This example notebook requires the **SageMaker Python SDK v2.133.0** and **transformers**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install botocore boto3 awscli s3fs typing-extensions \"torch==1.12.0\" \"fsspec<=2022.7.1\" \"sagemaker>=2.133.0\" --upgrade" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U transformers \"datasets[s3]==2.5.2\" --upgrade" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import botocore\n", "import boto3\n", "import sagemaker\n", "import transformers\n", "import pandas as pd\n", "\n", "print(f\"sagemaker: {sagemaker.__version__}\")\n", "print(f\"transformers: {transformers.__version__}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copy and run the following code if you need to upgrade ipywidgets for datasets library and restart kernel. This is only needed when preprocessing is done in the notebook.\n", "\n", "```python\n", "%%capture\n", "import IPython\n", "!conda install -c conda-forge ipywidgets -y\n", "# has to restart kernel for the updates to be applied\n", "IPython.Application.instance().kernel.do_shutdown(True) \n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SageMaker environment " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "\n", "sess = sagemaker.Session()\n", "\n", "# SageMaker session bucket -> used for uploading data, models and logs\n", "# SageMaker will automatically create this bucket if it does not exist\n", "sagemaker_session_bucket = None\n", "if sagemaker_session_bucket is None and sess is not None:\n", " # set to default bucket if a bucket name is not given\n", " sagemaker_session_bucket = sess.default_bucket()\n", "\n", "role = sagemaker.get_execution_role()\n", "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n", "\n", "print(f\"sagemaker role arn: {role}\")\n", "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n", "print(f\"sagemaker session region: {sess.boto_region_name}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "whPRbBNbIrIl" }, "source": [ "## Loading the SST dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When using the [\ud83e\udd17 Datasets library](https://github.com/huggingface/datasets), datasets can be downloaded directly with the following `datasets.load_dataset()` method:\n", "\n", "```python\n", "from datasets import load_dataset\n", "load_dataset('dataset_name')\n", "```\n", "\n", "If you'd like to try other training datasets later, you can simply use this method.\n", "\n", "For this example notebook, we prepared the SST2 dataset in the public SageMaker sample file S3 bucket. The following code cells show how you can directly load the dataset and convert to a HuggingFace DatasetDict." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing\n", "\n", "We download and preprocess the SST2 dataset from the s3://sagemaker-sample-files/datasets bucket. 
After preprocessing, we'll upload the dataset to the sagemaker_session_bucket, which will be used as a data channel for the training job." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenization " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datasets import Dataset\n", "from transformers import AutoTokenizer\n", "import pandas as pd\n", "\n", "# tokenizer used in preprocessing\n", "tokenizer_name = \"roberta-base\"\n", "\n", "# s3 key prefix for the data\n", "s3_prefix = \"samples/datasets/sst2\"\n", "\n", "# Download the SST2 data from s3\n", "!curl https://sagemaker-sample-files.s3.amazonaws.com/datasets/text/SST2/sst2.test > ./sst2.test\n", "!curl https://sagemaker-sample-files.s3.amazonaws.com/datasets/text/SST2/sst2.train > ./sst2.train\n", "!curl https://sagemaker-sample-files.s3.amazonaws.com/datasets/text/SST2/sst2.val > ./sst2.val" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# download tokenizer\n", "tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)\n", "\n", "\n", "# tokenizer helper function\n", "def tokenize(batch):\n", " return tokenizer(batch[\"text\"], padding=\"max_length\", truncation=True)\n", "\n", "\n", "# load dataset\n", "test_df = pd.read_csv(\"sst2.test\", sep=\"delimiter\", header=None, engine=\"python\", names=[\"line\"])\n", "train_df = pd.read_csv(\"sst2.train\", sep=\"delimiter\", header=None, engine=\"python\", names=[\"line\"])\n", "\n", "test_df[[\"label\", \"text\"]] = test_df[\"line\"].str.split(\" \", 1, expand=True)\n", "train_df[[\"label\", \"text\"]] = train_df[\"line\"].str.split(\" \", 1, expand=True)\n", "\n", "test_df.drop(\"line\", axis=1, inplace=True)\n", "train_df.drop(\"line\", axis=1, inplace=True)\n", "\n", "test_df[\"label\"] = pd.to_numeric(test_df[\"label\"], downcast=\"integer\")\n", "train_df[\"label\"] = pd.to_numeric(train_df[\"label\"], downcast=\"integer\")\n", "\n", "train_dataset = Dataset.from_pandas(train_df)\n", "test_dataset = Dataset.from_pandas(test_df)\n", "\n", "# tokenize dataset\n", "train_dataset = train_dataset.map(tokenize, batched=True)\n", "test_dataset = test_dataset.map(tokenize, batched=True)\n", "\n", "# set format for pytorch\n", "train_dataset = train_dataset.rename_column(\"label\", \"labels\")\n", "train_dataset.set_format(\"torch\", columns=[\"input_ids\", \"attention_mask\", \"labels\"])\n", "\n", "test_dataset = test_dataset.rename_column(\"label\", \"labels\")\n", "test_dataset.set_format(\"torch\", columns=[\"input_ids\", \"attention_mask\", \"labels\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Uploading data to sagemaker_session_bucket\n", "\n", "We are going to use the new FileSystem [integration](https://huggingface.co/docs/datasets/filesystems.html) to upload our preprocessed dataset to S3." 
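, "\n", "The next code cell saves the tokenized `train_dataset` and `test_dataset` to S3 with `save_to_disk()`. As an optional sanity check after the upload, you can reload a split from S3 with `load_from_disk()`. The snippet below is illustrative only; it assumes the `datasets==2.5.2` S3 filesystem integration used in this notebook and the `s3` and `training_input_path` variables defined in the next cell.\n", "\n", "```python\n", "# Optional check (run after the next cell): reload the train split from S3\n", "from datasets import load_from_disk\n", "\n", "reloaded_train = load_from_disk(training_input_path, fs=s3)\n", "print(reloaded_train)\n", "```"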
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import botocore\n", "from datasets.filesystems import S3FileSystem\n", "\n", "s3 = S3FileSystem()\n", "\n", "# save train_dataset to s3\n", "training_input_path = f\"s3://{sess.default_bucket()}/{s3_prefix}/train\"\n", "train_dataset.save_to_disk(training_input_path, fs=s3)\n", "\n", "# save test_dataset to s3\n", "test_input_path = f\"s3://{sess.default_bucket()}/{s3_prefix}/test\"\n", "test_dataset.save_to_disk(test_input_path, fs=s3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker Training Job\n", "\n", "To create a SageMaker training job, we use a PyTorch estimator. Using the estimator, you can define which fine-tuning script should SageMaker use through entry_point, which instance_type to use for training, which hyperparameters to pass, and so on.\n", "\n", "When a SageMaker training job starts, SageMaker takes care of starting and managing all the required machine learning instances, picks up the PyTorch Deep Learning Container, uploads your training script, and downloads the data from sagemaker_session_bucket into the container at /opt/ml/input/data.\n", "\n", "In the following section, you learn how to set up two versions of the SageMaker PyTorch estimator, a native one without the compiler and an optimized one with the compiler." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set up an option for fine-tuning or full training. `FINE_TUNING = 1` is for fine-tuning, and it will use fine_tune_with_huggingface.py. `FINE_TUNING = 0` is for full training, and it will use full_train_roberta_with_huggingface.py." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Here we configure the training job. Please configure the appropriate options below:\n", "\n", "# Fine tuning trains a pre-trained model on a different dataset whereas full training trains the model from scratch.\n", "FINE_TUNING = 1\n", "FULL_TRAINING = not FINE_TUNING\n", "\n", "# Fine tuning is typically faster and is done for fewer epochs\n", "EPOCHS = 4 if FINE_TUNING else 100\n", "\n", "TRAINING_SCRIPT = (\n", " \"fine_tune_with_huggingface.py\" if FINE_TUNING else \"full_train_roberta_with_huggingface.py\"\n", ")\n", "\n", "# SageMaker Training Compiler currently only supports training on GPU\n", "# Select Instance type for training\n", "INSTANCE_TYPE = \"ml.p3.2xlarge\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training with Native PyTorch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `train_batch_size` in the following code cell is the maximum batch that can fit into the memory of the ml.p3.2xlarge instance. If you change the model, instance type, sequence length, and other parameters, you need to do some experiments to find the largest batch size that will fit into GPU memory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch import PyTorch\n", "\n", "# hyperparameters, which are passed into the training job\n", "hyperparameters = {\"epochs\": EPOCHS, \"train_batch_size\": 24, \"model_name\": \"roberta-base\"}\n", "# The original LR was set for a batch of 32. 
Here we are scaling learning rate with batch size.\n", "hyperparameters[\"learning_rate\"] = float(\"5e-5\") / 32 * hyperparameters[\"train_batch_size\"]\n", "\n", "# If checkpointing is enabled with higher epoch numbers, your disk requirements will increase as well\n", "volume_size = 60 + 2 * hyperparameters[\"epochs\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# configure the training job\n", "native_estimator = PyTorch(\n", " entry_point=TRAINING_SCRIPT,\n", " source_dir=\"./scripts\",\n", " instance_type=INSTANCE_TYPE,\n", " instance_count=1,\n", " role=role,\n", " py_version=\"py39\",\n", " framework_version=\"1.13.1\",\n", " volume_size=volume_size,\n", " hyperparameters=hyperparameters,\n", " disable_profiler=True,\n", " debugger_hook_config=False,\n", ")\n", "\n", "# start training with our uploaded datasets as input\n", "native_estimator.fit({\"train\": training_input_path, \"test\": test_input_path}, wait=False)\n", "\n", "# The name of the training job.\n", "native_estimator.latest_training_job.name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training with Optimized PyTorch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compilation through Training Compiler changes the memory footprint of the model. Most commonly, this manifests as a reduction in memory utilization and a consequent increase in the largest batch size that can fit on the GPU. Note that if you want to change the batch size, you must adjust the learning rate appropriately.\n", "\n", "**Note:** We recommend you to turn the SageMaker Debugger's profiling and debugging tools off when you use compilation to avoid additional overheads." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# With SageMaker Training Compiler enabled we are able to fit a larger batch into memory.\n", "hyperparameters[\"train_batch_size\"] = 36\n", "# The original LR was set for a batch of 32. 
Here we are scaling learning rate with batch size.\n", "hyperparameters[\"learning_rate\"] = float(\"5e-5\") / 32 * hyperparameters[\"train_batch_size\"]\n", "\n", "# If checkpointing is enabled with higher epoch numbers, your disk requirements will increase as well\n", "volume_size = 60 + 2 * hyperparameters[\"epochs\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch import PyTorch, TrainingCompilerConfig\n", "\n", "# configure the training job\n", "optimized_estimator = PyTorch(\n", " entry_point=TRAINING_SCRIPT,\n", " compiler_config=TrainingCompilerConfig(),\n", " source_dir=\"./scripts\",\n", " instance_type=INSTANCE_TYPE,\n", " instance_count=1,\n", " role=role,\n", " py_version=\"py39\",\n", " framework_version=\"1.13.1\",\n", " volume_size=volume_size,\n", " hyperparameters=hyperparameters,\n", " disable_profiler=True,\n", " debugger_hook_config=False,\n", ")\n", "\n", "# start training with our uploaded datasets as input\n", "optimized_estimator.fit({\"train\": training_input_path, \"test\": test_input_path}, wait=False)\n", "\n", "# The name of the training job\n", "optimized_estimator.latest_training_job.name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wait for training jobs to complete" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "waiter = native_estimator.sagemaker_session.sagemaker_client.get_waiter(\n", " \"training_job_completed_or_stopped\"\n", ")\n", "waiter.wait(TrainingJobName=native_estimator.latest_training_job.name)\n", "waiter = optimized_estimator.sagemaker_session.sagemaker_client.get_waiter(\n", " \"training_job_completed_or_stopped\"\n", ")\n", "waiter.wait(TrainingJobName=optimized_estimator.latest_training_job.name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load information and logs of the training job *without* SageMaker Training Compiler" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# container image used for native training job\n", "print(f\"container image used for training job: \\n{native_estimator.image_uri}\\n\")\n", "\n", "# s3 uri where the native trained model is located\n", "print(f\"s3 uri where the trained model is located: \\n{native_estimator.model_data}\\n\")\n", "\n", "# latest training job name for this estimator\n", "print(\n", " f\"latest training job name for this estimator: \\n{native_estimator.latest_training_job.name}\\n\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%capture native\n", "\n", "# access the logs of the native training job\n", "native_estimator.sagemaker_session.logs_for_job(native_estimator.latest_training_job.name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new PyTorch estimator. 
For example:\n", "```python\n", "native_estimator = PyTorch.attach(\"your_huggingface_training_job_name\")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load information and logs of the training job *with* SageMaker Training Compiler" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# container image used for optimized training job\n", "print(f\"container image used for training job: \\n{optimized_estimator.image_uri}\\n\")\n", "\n", "# s3 uri where the optimized trained model is located\n", "print(f\"s3 uri where the trained model is located: \\n{optimized_estimator.model_data}\\n\")\n", "\n", "# latest training job name for this estimator\n", "print(\n", " f\"latest training job name for this estimator: \\n{optimized_estimator.latest_training_job.name}\\n\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%capture optimized\n", "\n", "# access the logs of the optimized training job\n", "optimized_estimator.sagemaker_session.logs_for_job(optimized_estimator.latest_training_job.name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new PyTorch estimator. For example:\n", "```python\n", "optimized_estimator = PyTorch.attach(\"your_compiled_huggingface_training_job_name\")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create helper functions for analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ast import literal_eval\n", "from collections import defaultdict\n", "from matplotlib import pyplot as plt\n", "\n", "\n", "def _summarize(captured):\n", " final = []\n", " for line in captured.stdout.split(\"\\n\"):\n", " cleaned = line.strip()\n", " if \"{\" in cleaned and \"}\" in cleaned:\n", " final.append(cleaned[cleaned.index(\"{\") : cleaned.index(\"}\") + 1])\n", " return final\n", "\n", "\n", "def make_sense(string):\n", " try:\n", " return literal_eval(string)\n", " except:\n", " pass\n", "\n", "\n", "def summarize(summary):\n", " final = {\"train\": [], \"eval\": [], \"summary\": {}}\n", " for line in summary:\n", " interpretation = make_sense(line)\n", " if interpretation:\n", " if \"loss\" in interpretation:\n", " final[\"train\"].append(interpretation)\n", " elif \"eval_loss\" in interpretation:\n", " final[\"eval\"].append(interpretation)\n", " elif \"train_runtime\" in interpretation:\n", " final[\"summary\"].update(interpretation)\n", " return final" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot and compare throughput of compiled training and native training\n", "\n", "Visualize average throughput as reported by HuggingFace and see potential savings." 
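, "\n", "The helper functions above scan the captured training job logs for dict-like lines printed by the Hugging Face `Trainer` and sort them into `train`, `eval`, and `summary` buckets. For reference, the final summary line that the parser looks for resembles the following; the values shown here are purely illustrative, not measured results.\n", "\n", "```python\n", "# Illustrative example of a Hugging Face Trainer summary log line\n", "{'train_runtime': 1000.0, 'train_samples_per_second': 250.0, 'epoch': 4.0}\n", "```"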
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# collect the average throughput as reported by HF for the native training job\n", "n = summarize(_summarize(native))\n", "native_throughput = n[\"summary\"][\"train_samples_per_second\"]\n", "\n", "# collect the average throughput as reported by HF for the SageMaker Training Compiler enhanced training job\n", "o = summarize(_summarize(optimized))\n", "optimized_throughput = o[\"summary\"][\"train_samples_per_second\"]\n", "\n", "# Calculate speedup from SageMaker Training Compiler\n", "avg_speedup = f\"{round((optimized_throughput/native_throughput-1)*100)}%\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "plt.title(\"Training Throughput \\n (Higher is better)\")\n", "plt.ylabel(\"Samples/sec\")\n", "\n", "plt.bar(x=[1], height=native_throughput, label=\"Baseline PT\", width=0.35)\n", "plt.bar(x=[1.5], height=optimized_throughput, label=\"Compiler-enhanced PT\", width=0.35)\n", "\n", "plt.xlabel(\" ====> {} Compiler savings <====\".format(avg_speedup))\n", "plt.xticks(ticks=[1, 1.5], labels=[\"Baseline PT\", \"Compiler-enhanced PT\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convergence of Training Loss\n", "\n", "SageMaker Training Compiler does not affect the model convergence behavior. Here, we see the decrease in training loss is similar with and without SageMaker Training Compiler" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vanilla_loss = [i[\"loss\"] for i in n[\"train\"]]\n", "vanilla_epochs = [i[\"epoch\"] for i in n[\"train\"]]\n", "optimized_loss = [i[\"loss\"] for i in o[\"train\"]]\n", "optimized_epochs = [i[\"epoch\"] for i in o[\"train\"]]\n", "\n", "plt.title(\"Plot of Training Loss\")\n", "plt.xlabel(\"Epoch\")\n", "plt.ylabel(\"Training Loss\")\n", "plt.plot(vanilla_epochs, vanilla_loss, label=\"Baseline PT\")\n", "plt.plot(optimized_epochs, optimized_loss, label=\"Compiler-enhanced PT\")\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluation Stats\n", "\n", "SageMaker Training Compiler does not affect the quality of the model. Here, we compare the evaluation metrics of the models trained with and without SageMaker Training Compiler to verify the same." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "table = pd.DataFrame([n[\"eval\"][-1], o[\"eval\"][-1]], index=[\"Baseline PT\", \"Compiler-enhanced PT\"])\n", "table.drop(columns=[\"eval_runtime\", \"eval_samples_per_second\", \"epoch\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training Stats\n", "\n", "Let's compare various training metrics with and without SageMaker Training Compiler. SageMaker Training Compiler provides an increase in training throughput which translates to a decrease in total training time." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame([n[\"summary\"], o[\"summary\"]], index=[\"Native\", \"Optimized\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculate percentage speedup from SageMaker Training Compiler in terms of total training time reported by HF\n", "\n", "speedup = (\n", " (n[\"summary\"][\"train_runtime\"] - o[\"summary\"][\"train_runtime\"])\n", " * 100\n", " / n[\"summary\"][\"train_runtime\"]\n", ")\n", "print(\n", " f\"SageMaker Training Compiler integrated PyTorch is about {int(speedup)}% faster in terms of total training time as reported by HF.\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Total Billable Time\n", "\n", "Finally, the decrease in total training time results in a decrease in the billable seconds from SageMaker" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def BillableTimeInSeconds(name):\n", " describe_training_job = (\n", " optimized_estimator.sagemaker_session.sagemaker_client.describe_training_job\n", " )\n", " details = describe_training_job(TrainingJobName=name)\n", " return details[\"BillableTimeInSeconds\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Billable = {}\n", "Billable[\"Native\"] = BillableTimeInSeconds(native_estimator.latest_training_job.name)\n", "Billable[\"Optimized\"] = BillableTimeInSeconds(optimized_estimator.latest_training_job.name)\n", "pd.DataFrame(Billable, index=[\"BillableSecs\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "speedup = (Billable[\"Native\"] - Billable[\"Optimized\"]) * 100 / Billable[\"Native\"]\n", "print(f\"SageMaker Training Compiler integrated PyTorch was {int(speedup)}% faster in summary.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clean up\n", "\n", "Stop all training jobs launched if the jobs are still running." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "\n", "sm = boto3.client(\"sagemaker\")\n", "\n", "\n", "def stop_training_job(name):\n", " status = sm.describe_training_job(TrainingJobName=name)[\"TrainingJobStatus\"]\n", " if status == \"InProgress\":\n", " sm.stop_training_job(TrainingJobName=name)\n", "\n", "\n", "stop_training_job(native_estimator.latest_training_job.name)\n", "stop_training_job(optimized_estimator.latest_training_job.name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, to find instructions on cleaning up resources, see [Clean Up](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html) in the *Amazon SageMaker Developer Guide*." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|roberta-base|roberta-base.ipynb)\n", "\n", "![This us-east-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|roberta-base|roberta-base.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|roberta-base|roberta-base.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|roberta-base|roberta-base.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|roberta-base|roberta-base.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|roberta-base|roberta-base.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|roberta-base|roberta-base.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|roberta-base|roberta-base.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|roberta-base|roberta-base.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|roberta-base|roberta-base.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|roberta-base|roberta-base.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|roberta-base|roberta-base.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|roberta-base|roberta-base.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|roberta-base|roberta-base.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|roberta-base|roberta-base.ipynb)\n" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.7 (main, Sep 7 2022, 15:22:19) [GCC 9.4.0]" }, "vscode": { "interpreter": { "hash": "11bd1a5e22b55030e7dafa6146ff58da0a8819c60346681265e3e4937faaddb3" } } }, "nbformat": 4, "nbformat_minor": 4 }