{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Compile and Train a Hugging Face Transformer BERT Model with the SST Dataset using SageMaker Training Compiler" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|bert-base-cased|bert-base-cased-single-node-single-gpu.ipynb)\n", "\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "1. [SageMaker Training Compiler Overview](#SageMaker-Training-Compiler-Overview)\n", "2. [Introduction](#Introduction) \n", "3. [Prepare SageMaker Environment and Permissions](#Prepare-SageMaker-Environment-and-Permissions)\n", "    1. [Installation](#Installation) \n", "    2. [SageMaker environment](#SageMaker-environment) \n", "4. [Loading the SST dataset](#Loading-the-SST-dataset) \n", "    1. [Tokenization](#Tokenization) \n", "    2. [Uploading data to `sagemaker_session_bucket`](#Uploading-data-to-sagemaker_session_bucket) \n", "5. [SageMaker Training Job](#SageMaker-Training-Job) \n", "    1. [Training a PyTorch Trainer without compiling](#Training-a-PyTorch-Trainer-without-compiling) \n", "    2. [Training a PyTorch Trainer with SageMaker Training Compiler](#Training-a-PyTorch-Trainer-with-SageMaker-Training-Compiler) \n", "6. 
[Analysis and Results](#Analysis-and-Results) " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker Training Compiler Overview\n", "\n", "SageMaker Training Compiler is a capability of SageMaker that applies hard-to-implement optimizations to reduce training time on GPU instances. The compiler optimizes deep learning (DL) models to accelerate training by using SageMaker machine learning (ML) GPU instances more efficiently. SageMaker Training Compiler is available at no additional charge within SageMaker and can help reduce total billable time as it accelerates training. \n", "\n", "SageMaker Training Compiler is integrated into the AWS Deep Learning Containers (DLCs). Using the AWS DLCs with SageMaker Training Compiler enabled, you can compile and optimize training jobs on GPU instances with minimal changes to your code. Bring your deep learning models to SageMaker and enable SageMaker Training Compiler to speed up your training job on SageMaker ML instances for accelerated computing. \n", "\n", "For more information, see [SageMaker Training Compiler](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html) in the *Amazon SageMaker Developer Guide*." ] }, { "attachments": { "image.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "This notebook is an end-to-end binary text classification example. In this demo, we use Hugging Face's transformers and datasets libraries with SageMaker Training Compiler to compile and fine-tune a pre-trained transformer for binary text classification. In particular, the pre-trained model is fine-tuned on the Stanford Sentiment Treebank (SST) dataset. To get started, you need to set up the environment with a few prerequisite steps for permissions, configurations, and so on. 
\n", "\n", "![image.png](attachment:image.png)\n", "\n", "**NOTE:** You can run this demo in SageMaker Studio, on SageMaker notebook instances, or on your local machine with the AWS CLI set up. If you use SageMaker Studio or SageMaker notebook instances, make sure you choose one of the PyTorch-based kernels: Python 3 (PyTorch x.y Python 3.x CPU Optimized) or conda_pytorch_p36, respectively.\n", "\n", "**NOTE:** This notebook uses two ml.p3.2xlarge instances, each with a single GPU. If you don't have enough quota, see [Request a service quota increase for SageMaker resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure). " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare SageMaker Environment and Permissions " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Installation\n", "\n", "This example notebook requires the **SageMaker Python SDK v2.133.0** or later and **transformers v4.21**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install \"sagemaker>=2.133.0\" botocore boto3 awscli s3fs typing-extensions \"torch==1.12.0\" --upgrade" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install transformers datasets --upgrade" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import botocore\n", "import boto3\n", "import sagemaker\n", "import transformers\n", "\n", "print(f\"sagemaker: {sagemaker.__version__}\")\n", "print(f\"transformers: {transformers.__version__}\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Copy and run the following code if you need to upgrade ipywidgets for the datasets library and restart the kernel. 
This is only needed when preprocessing is done in the notebook.\n", "\n", "```python\n", "%%capture\n", "import IPython\n", "!conda install -c conda-forge ipywidgets -y\n", "# the kernel has to be restarted for the updates to be applied\n", "IPython.Application.instance().kernel.do_shutdown(True) \n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### SageMaker environment " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. To learn more, see [SageMaker Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "\n", "sess = sagemaker.Session()\n", "# sagemaker session bucket -> used for uploading data, models and logs\n", "# sagemaker will automatically create this bucket if it does not exist\n", "sagemaker_session_bucket = None\n", "if sagemaker_session_bucket is None and sess is not None:\n", "    # set to default bucket if a bucket name is not given\n", "    sagemaker_session_bucket = sess.default_bucket()\n", "\n", "role = sagemaker.get_execution_role()\n", "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n", "\n", "print(f\"sagemaker role arn: {role}\")\n", "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n", "print(f\"sagemaker session region: {sess.boto_region_name}\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the SST dataset\n", "\n", "When using the [🤗 Datasets library](https://github.com/huggingface/datasets), datasets can be downloaded directly with the following `datasets.load_dataset()` method:\n", "\n", "```python\n", "from datasets import load_dataset\n", "load_dataset('dataset_name')\n", "```\n", "\n", "If you'd like to try other training datasets 
later, you can simply use this method.\n", "\n", "For this example notebook, we prepared the [SST2 dataset](https://www.tensorflow.org/datasets/catalog/glue#gluesst2) in the public SageMaker sample S3 bucket. The following code cells show how you can load the dataset directly and convert it to Hugging Face `Dataset` objects." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenization " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datasets import load_dataset\n", "from transformers import AutoTokenizer\n", "from datasets import Dataset\n", "\n", "# tokenizer used in preprocessing\n", "tokenizer_name = \"bert-base-cased\"\n", "\n", "# dataset used\n", "dataset_name = \"sst\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Read our data into pandas DataFrames to prepare it for training\n", "# Initially our dataset will contain one column called 'line'\n", "# The separator \"delimiter\" is not expected to occur in the data, so each row is read whole into 'line'\n", "\n", "test_df = pd.read_csv(\n", "    f\"https://sagemaker-example-files-prod-{sess.boto_region_name}.s3.amazonaws.com/datasets/text/SST2/sst2.test\",\n", "    sep=\"delimiter\",\n", "    header=None,\n", "    engine=\"python\",\n", "    names=[\"line\"],\n", ")\n", "train_df = pd.read_csv(\n", "    f\"https://sagemaker-example-files-prod-{sess.boto_region_name}.s3.amazonaws.com/datasets/text/SST2/sst2.train\",\n", "    sep=\"delimiter\",\n", "    header=None,\n", "    engine=\"python\",\n", "    names=[\"line\"],\n", ")\n", "val_df = pd.read_csv(\n", "    f\"https://sagemaker-example-files-prod-{sess.boto_region_name}.s3.amazonaws.com/datasets/text/SST2/sst2.val\",\n", "    sep=\"delimiter\",\n", "    header=None,\n", "    engine=\"python\",\n", "    names=[\"line\"],\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Split each row into two columns, one for the label and one for the text\n", "\n", "test_df[[\"label\", \"text\"]] = 
test_df[\"line\"].str.split(\" \", n=1, expand=True)\n", "train_df[[\"label\", \"text\"]] = train_df[\"line\"].str.split(\" \", n=1, expand=True)\n", "val_df[[\"label\", \"text\"]] = val_df[\"line\"].str.split(\" \", n=1, expand=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Drop the line column as it is no longer needed since we split this column above\n", "\n", "test_df.drop(\"line\", axis=1, inplace=True)\n", "train_df.drop(\"line\", axis=1, inplace=True)\n", "val_df.drop(\"line\", axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert the label into numeric instead of object type to prepare binary labels for training\n", "# After conversion the labels will be 8 bit integers\n", "\n", "test_df[\"label\"] = pd.to_numeric(test_df[\"label\"], downcast=\"integer\")\n", "train_df[\"label\"] = pd.to_numeric(train_df[\"label\"], downcast=\"integer\")\n", "val_df[\"label\"] = pd.to_numeric(val_df[\"label\"], downcast=\"integer\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Combine test and val data into one df\n", "\n", "test_df = pd.concat([test_df, val_df], ignore_index=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_dataset = Dataset.from_pandas(train_df)\n", "test_dataset = Dataset.from_pandas(test_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Download tokenizer\n", "# The tokenizer is loaded from the AutoTokenizer class and we use the from_pretrained method\n", "# This allows us to instantiate a tokenizer based on a pretrained model\n", "tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)\n", "\n", "\n", "# Tokenizer helper function\n", "# This function specifies the input should be tokenized by padding to the max_length which is 512\n", "# Anything beyond this 
length will be truncated\n", "# This function will convert the 'text' column to a set of numeric input ids that can be used for\n", "# model training\n", "# For more information on tokenization see the Hugging Face documentation: https://huggingface.co/transformers/preprocessing.html\n", "def tokenize(batch):\n", "    return tokenizer(batch[\"text\"], padding=\"max_length\", truncation=True)\n", "\n", "\n", "# Tokenize dataset in batches\n", "# See Hugging Face documentation for more info on the map method: https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map\n", "train_dataset = train_dataset.map(tokenize, batched=True)\n", "test_dataset = test_dataset.map(tokenize, batched=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_dataset = train_dataset.rename_column(\"label\", \"labels\")\n", "train_dataset.set_format(\"torch\", columns=[\"input_ids\", \"attention_mask\", \"labels\"])\n", "test_dataset = test_dataset.rename_column(\"label\", \"labels\")\n", "test_dataset.set_format(\"torch\", columns=[\"input_ids\", \"attention_mask\", \"labels\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Uploading data to `sagemaker_session_bucket`\n", "\n", "After processing the datasets, we use the datasets library's FileSystem [integration](https://huggingface.co/docs/datasets/filesystems.html) to upload them to S3." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import botocore\n", "from datasets.filesystems import S3FileSystem\n", "\n", "s3 = S3FileSystem()\n", "\n", "# s3 key prefix for setting up the data channel for the current SageMaker session\n", "s3_prefix = \"samples/datasets/sst\"\n", "\n", "# save train_dataset to s3\n", "training_input_path = f\"s3://{sess.default_bucket()}/{s3_prefix}/train\"\n", "train_dataset.save_to_disk(training_input_path, fs=s3)\n", "\n", "# save test_dataset to s3\n", "test_input_path = f\"s3://{sess.default_bucket()}/{s3_prefix}/test\"\n", "test_dataset.save_to_disk(test_input_path, fs=s3)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker Training Job\n", "\n", "To create a SageMaker training job, we use a PyTorch estimator. Using the estimator, you can define which fine-tuning script SageMaker should use through `entry_point`, which `instance_type` to use for training, which hyperparameters to pass, and so on.\n", "\n", "When a SageMaker training job starts, SageMaker takes care of starting and managing all the required machine learning instances, picks up the PyTorch Deep Learning Container, uploads your training script, and downloads the data from `sagemaker_session_bucket` into the container at `/opt/ml/input/data`.\n", "\n", "In the following sections, you set up two versions of the SageMaker PyTorch estimator: a native one without the compiler and an optimized one with the compiler." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pygmentize ./scripts/train.py" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Training a PyTorch Trainer without compiling" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch import PyTorch\n", "\n", "hyperparameters = {\"epochs\": 4, \"train_batch_size\": 24, \"model_name\": \"bert-base-cased\"}\n", "\n", "# Scale the learning rate by batch size, as original LR was using batch size of 32\n", "hyperparameters[\"learning_rate\"] = float(\"5e-5\") / 32 * hyperparameters[\"train_batch_size\"]\n", "\n", "# Scale the volume size by number of epochs\n", "volume_size = 60 + 2 * hyperparameters[\"epochs\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# By setting the hyperparameters in the PyTorch Estimator below\n", "# and using the AutoModelForSequenceClassification class in the train.py script\n", "# we can fine-tune the bert-base-cased pretrained Transformer for sequence classification\n", "\n", "native_estimator = PyTorch(\n", " entry_point=\"train.py\",\n", " source_dir=\"./scripts\",\n", " instance_type=\"ml.p3.2xlarge\",\n", " instance_count=1,\n", " role=role,\n", " py_version=\"py39\",\n", " base_job_name=\"native-sst-bert-base-cased-p3-2x-pytorch-190\",\n", " volume_size=volume_size,\n", " framework_version=\"1.13.1\",\n", " hyperparameters=hyperparameters,\n", " disable_profiler=True,\n", " debugger_hook_config=False,\n", ")\n", "\n", "# starting the train job with our uploaded datasets as input\n", "native_estimator.fit({\"train\": training_input_path, \"test\": test_input_path}, wait=False)\n", "\n", "# The name of the training job. 
You might need to note this down in case your kernel crashes.\n", "native_estimator.latest_training_job.name" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Training a PyTorch Trainer with SageMaker Training Compiler" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Compilation through Training Compiler changes the memory footprint of the model. Most commonly, this manifests as a reduction in memory utilization and a consequent increase in the largest batch size that can fit on the GPU. Note that if you change the batch size, you must adjust the learning rate accordingly.\n", "\n", "**Note:** We recommend turning off SageMaker Debugger's profiling and debugging tools when you use compilation to avoid additional overhead.\n", "\n", "We use the tested batch size that's provided at [Tested Models](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html#training-compiler-tested-models) in the *SageMaker Training Compiler Developer Guide*." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch import PyTorch, TrainingCompilerConfig\n", "\n", "hyperparameters = {\n", " \"epochs\": 4,\n", " \"train_batch_size\": 36,\n", " \"model_name\": \"bert-base-cased\",\n", "}\n", "\n", "# Scale the learning rate by batch size, as original LR was using batch size of 32\n", "hyperparameters[\"learning_rate\"] = float(\"5e-5\") / 32 * hyperparameters[\"train_batch_size\"]\n", "\n", "# Scale the volume size by number of epochs\n", "volume_size = 60 + 2 * hyperparameters[\"epochs\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# By setting the hyperparameters in the PyTorch Estimator below\n", "# and using the AutoModelForSequenceClassification class in the train.py script\n", "# the bert-base-cased pretrained Transformer is fine-tuned for sequence classification\n", "\n", "# Importantly, the TrainingCompilerConfig() is passed below to enable the SageMaker Training Compiler\n", "\n", "sm_training_compiler_estimator = PyTorch(\n", " entry_point=\"train.py\",\n", " source_dir=\"./scripts\",\n", " instance_type=\"ml.p3.2xlarge\",\n", " instance_count=1,\n", " role=role,\n", " py_version=\"py39\",\n", " base_job_name=\"sm-compiled-sst-bert-base-cased-p3-2x-pytorch-190\",\n", " volume_size=volume_size,\n", " framework_version=\"1.13.1\",\n", " compiler_config=TrainingCompilerConfig(),\n", " hyperparameters=hyperparameters,\n", " disable_profiler=True,\n", " debugger_hook_config=False,\n", ")\n", "\n", "# starting the train job with our uploaded datasets as input\n", "sm_training_compiler_estimator.fit(\n", " {\"train\": training_input_path, \"test\": test_input_path}, wait=False\n", ")\n", "\n", "# The name of the training job. 
You might need to note this down in case your kernel crashes.\n", "sm_training_compiler_estimator.latest_training_job.name" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Wait for the training jobs to complete" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "waiter = native_estimator.sagemaker_session.sagemaker_client.get_waiter(\n", " \"training_job_completed_or_stopped\"\n", ")\n", "waiter.wait(TrainingJobName=native_estimator.latest_training_job.name)\n", "waiter = sm_training_compiler_estimator.sagemaker_session.sagemaker_client.get_waiter(\n", " \"training_job_completed_or_stopped\"\n", ")\n", "waiter.wait(TrainingJobName=sm_training_compiler_estimator.latest_training_job.name)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis and Results" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Load information and logs of the training job *without* SageMaker Training Compiler" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# container image used for native training job\n", "print(f\"container image used for training job: \\n{native_estimator.image_uri}\\n\")\n", "\n", "# s3 uri where the native trained model is located\n", "print(f\"s3 uri where the trained model is located: \\n{native_estimator.model_data}\\n\")\n", "\n", "# latest training job name for this estimator\n", "print(\n", " f\"latest training job name for this estimator: \\n{native_estimator.latest_training_job.name}\\n\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%capture native\n", "\n", "# access the logs of the native training job\n", "native_estimator.sagemaker_session.logs_for_job(native_estimator.latest_training_job.name)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ 
"**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new PyTorch estimator. For example:\n", "```python\n", "native_estimator = PyTorch.attach(\"your_native_training_job_name\")\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Load information and logs of the training job *with* SageMaker Training Compiler" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# container image used for optimized training job\n", "print(f\"container image used for training job: \\n{sm_training_compiler_estimator.image_uri}\\n\")\n", "\n", "# s3 uri where the optimized trained model is located\n", "print(f\"s3 uri where the trained model is located: \\n{sm_training_compiler_estimator.model_data}\\n\")\n", "\n", "# latest training job name for this estimator\n", "print(\n", " f\"latest training job name for this estimator: \\n{sm_training_compiler_estimator.latest_training_job.name}\\n\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%capture optimized\n", "\n", "# access the logs of the optimized training job\n", "sm_training_compiler_estimator.sagemaker_session.logs_for_job(\n", " sm_training_compiler_estimator.latest_training_job.name\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new PyTorch estimator. 
You can retrieve the training job name from the output above where it was printed, or by using the SageMaker console, the SageMaker SDK, or the AWS CLI (https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-training-jobs.html). \n", "\n", "For example:\n", "```python\n", "sm_training_compiler_estimator = PyTorch.attach(\"your_compiled_huggingface_training_job_name\")\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Create helper functions for analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ast import literal_eval\n", "from collections import defaultdict\n", "from matplotlib import pyplot as plt\n", "\n", "\n", "# Intermediary function for processing each line of stdout captured\n", "# Remove leading and trailing whitespace and append data in curly braces\n", "# to final list\n", "def _summarize(captured):\n", "    final = []\n", "    for line in captured.stdout.split(\"\\n\"):\n", "        cleaned = line.strip()\n", "        if \"{\" in cleaned and \"}\" in cleaned:\n", "            final.append(cleaned[cleaned.index(\"{\") : cleaned.index(\"}\") + 1])\n", "    return final\n", "\n", "\n", "# Check input with literal_eval\n", "# https://docs.python.org/3/library/ast.html\n", "def make_sense(string):\n", "    try:\n", "        return literal_eval(string)\n", "    except Exception:\n", "        # Not a Python literal (for example, an ordinary log line), so skip it\n", "        return None\n", "\n", "\n", "# Parse the stdout and organize by train, evaluation, and summary data\n", "def summarize(summary):\n", "    final = {\"train\": [], \"eval\": [], \"summary\": {}}\n", "    for line in summary:\n", "        interpretation = make_sense(line)\n", "        if interpretation:\n", "            if \"loss\" in interpretation:\n", "                final[\"train\"].append(interpretation)\n", "            elif \"eval_loss\" in interpretation:\n", "                final[\"eval\"].append(interpretation)\n", "            elif \"train_runtime\" in interpretation:\n", "                final[\"summary\"].update(interpretation)\n", "    return final" ] }, { "attachments": {}, 
"cell_type": "markdown", "metadata": {}, "source": [ "### Plot and compare throughput of compiled training and native training" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "native_summary = summarize(_summarize(native))\n", "native_throughput = native_summary[\"summary\"][\"train_samples_per_second\"]\n", "\n", "optimized_summary = summarize(_summarize(optimized))\n", "optimized_throughput = optimized_summary[\"summary\"][\"train_samples_per_second\"]\n", "\n", "avg_speedup = f\"{round((optimized_throughput/native_throughput-1)*100)}%\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "native_summary[\"summary\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "optimized_summary[\"summary\"]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Training Throughput Plot\n", "\n", "The following script creates a plot that compares the throughput (number_of_samples/second) of the two training jobs with and without SageMaker Training Compiler." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "plt.title(\"Training Throughput \\n (Higher is better)\")\n", "plt.ylabel(\"Samples/sec\")\n", "\n", "plt.bar(x=[1], height=native_throughput, label=\"Baseline PT\", width=0.35)\n", "plt.bar(x=[1.5], height=optimized_throughput, label=\"SM-Training-Compiler-enhanced PT\", width=0.35)\n", "\n", "plt.xlabel(\" ====> {} SM-Training-Compiler savings <====\".format(avg_speedup))\n", "plt.xticks(ticks=[1, 1.5], labels=[\"Baseline PT\", \"SM-Training-Compiler-enhanced PT\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Convergence of Training Loss\n", "\n", "The following script creates a plot that compares the loss function of the two training jobs with and without SageMaker Training Compiler." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vanilla_loss = [i[\"loss\"] for i in native_summary[\"train\"]]\n", "vanilla_epochs = [i[\"epoch\"] for i in native_summary[\"train\"]]\n", "optimized_loss = [i[\"loss\"] for i in optimized_summary[\"train\"]]\n", "optimized_epochs = [i[\"epoch\"] for i in optimized_summary[\"train\"]]\n", "\n", "plt.title(\"Plot of Training Loss\")\n", "plt.xlabel(\"Epoch\")\n", "plt.ylabel(\"Training Loss\")\n", "plt.plot(vanilla_epochs, vanilla_loss, label=\"Baseline PT\")\n", "plt.plot(optimized_epochs, optimized_loss, label=\"SM-Training-Compiler-enhanced PT\")\n", "plt.legend()\n", "plt.show()" ] }, { "attachments": { "trainingloss.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "## Convergence Example Plot\n", "\n", "![trainingloss.png](attachment:trainingloss.png)\n", "\n", "**Note:** In this example, because SageMaker Training Compiler allows a larger batch size to fit on the GPU, the initial decrease in training loss is greater when SageMaker Training Compiler is enabled." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create table of summary results including loss, accuracy, f1 score, precision, recall, and steps per second\n", "# for both native and optimized training jobs\n", "\n", "table = pd.DataFrame(\n", "    [native_summary[\"eval\"][-1], optimized_summary[\"eval\"][-1]],\n", "    index=[\"Baseline PT\", \"SM-Training-Compiler-enhanced PT\"],\n", ")\n", "table.drop(columns=[\"eval_runtime\", \"eval_samples_per_second\", \"epoch\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Clean up\n", "\n", "Stop any launched training jobs that are still running." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "\n", "sm = boto3.client(\"sagemaker\")\n", "\n", "\n", "# Stop the named training job only if it is still in progress\n", "def stop_training_job(name):\n", "    status = sm.describe_training_job(TrainingJobName=name)[\"TrainingJobStatus\"]\n", "    if status == \"InProgress\":\n", "        sm.stop_training_job(TrainingJobName=name)\n", "\n", "\n", "stop_training_job(native_estimator.latest_training_job.name)\n", "stop_training_job(sm_training_compiler_estimator.latest_training_job.name)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For instructions on cleaning up other resources, see [Clean Up](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html) in the *Amazon SageMaker Developer Guide*." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|bert-base-cased|bert-base-cased-single-node-single-gpu.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|bert-base-cased|bert-base-cased-single-node-single-gpu.ipynb)\n", "\n", "![This us-west-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|bert-base-cased|bert-base-cased-single-node-single-gpu.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|bert-base-cased|bert-base-cased-single-node-single-gpu.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|bert-base-cased|bert-base-cased-single-node-single-gpu.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|bert-base-cased|bert-base-cased-single-node-single-gpu.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|bert-base-cased|bert-base-cased-single-node-single-gpu.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|bert-base-cased|bert-base-cased-single-node-single-gpu.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|bert-base-cased|bert-base-cased-single-node-single-gpu.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|bert-base-cased|bert-base-cased-single-node-single-gpu.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|bert-base-cased|bert-base-cased-single-node-single-gpu.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|bert-base-cased|bert-base-cased-single-node-single-gpu.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|bert-base-cased|bert-base-cased-single-node-single-gpu.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|bert-base-cased|bert-base-cased-single-node-single-gpu.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-training-compiler|huggingface|pytorch_single_gpu_single_node|bert-base-cased|bert-base-cased-single-node-single-gpu.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (Data Science 3.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "vscode": { "interpreter": { "hash": "11bd1a5e22b55030e7dafa6146ff58da0a8819c60346681265e3e4937faaddb3" } } }, "nbformat": 4, "nbformat_minor": 4 }