{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Train and Deploy GPT-J-6B model using Tensor Parallelism approach within SageMaker Model Parallel Library" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook you will learn how to use the tensor parallelism feature provided by the [SageMaker Model Parallelism Library](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html) to train the [EleutherAI's](https://www.eleuther.ai/) [GPT-J](https://arankomatsuzaki.wordpress.com/2021/06/04/gpt-j/) Model using the [GLUE/sst2 Dataset](https://huggingface.co/datasets/glue/viewer/sst2/train).\n", "\n", "EleutherAI released GPT-J 6B, an open-source alternative to [OpenAIs GPT-3](https://openai.com/blog/gpt-3-apps/). [GPT-J 6B](https://huggingface.co/EleutherAI/gpt-j-6B) is the 6 billion parameter successor to EleutherAIs GPT-NEO family, a family of transformer-based language models based on the GPT architecture for text generation.\n", "\n", "EleutherAI's primary goal is to train a model that is equivalent in size to GPT-3 and make it available to the public under an open license.\n", "Over the last few months, GPT-J gained a lot of interest from Researchers, Data Scientists, and even Software Developers, but it remained very challenging to fine tune GPT-J.\n", "\n", "The weights of the 6 billion parameter model represent a ~24GB memory footprint. To load it in float32, one would need at least 2x model size CPU RAM: 1x for initial weights and another 1x to load the checkpoint. Apart from the model parameters, there are the gradients, optimizer states, and activations taking memory, so the actual memory usage might be significantly higher than 48GB. Just as an example, with Adam optimizer and FP32 training, the use from parameters, gradients and optimizer states might be 96GB+, and activation memory footprint would be even more than this, so the total memory usage might be easily larger than 200 GB.\n", "\n", "\n", "\n", "We will walk you through how to easily run FP16 training and fine tune GPT-J using Amazon SageMaker and Hugging Face on NVIDIA GPU instances. The notebook demonstrates the use of Tensor Parallel approach of SageMaker Model Parallel library. \n", "\n", "This notebook depends on the following files and folders:\n", "\n", "1. `train_gptj_smp_tensor_parallel_script.py`: This is an entry-point script that is passed to the PyTorch estimator in the notebook instructions. This script is responsible for end to end training of the GPT-J model with SMP. The script has additional comments at places where the SMP API is used.\n", "\n", "3. `learning_rates.py`: This contains the functions for learning rate schedule.\n", "4. `requirements.txt`: This will install the dependencies, like the right version of huggingface transformers.\n", "5. `memory_tracker.py`: This contains a function to print the memory status.\n", "\n", "\n", "## SageMaker Distributed Training \n", "\n", "SageMaker provides distributed training libraries for data parallelism and model parallelism. 
"\n", "We will walk you through how to run FP16 training and fine-tune GPT-J using Amazon SageMaker and Hugging Face on NVIDIA GPU instances. The notebook demonstrates the tensor parallelism approach of the SageMaker Model Parallel library.\n", "\n", "This notebook depends on the following files and folders:\n", "\n", "1. `train_gptj_smp_tensor_parallel_script.py`: This is the entry-point script that is passed to the PyTorch estimator in the notebook instructions. It is responsible for end-to-end training of the GPT-J model with SMP, and it has additional comments at places where the SMP API is used.\n", "\n", "2. `learning_rates.py`: This contains the functions for the learning rate schedule.\n", "3. `requirements.txt`: This installs the dependencies, such as the required version of Hugging Face Transformers.\n", "4. `memory_tracker.py`: This contains a function to print the memory status.\n", "\n", "\n", "## SageMaker Distributed Training\n", "\n", "SageMaker provides distributed training libraries for data parallelism and model parallelism. The libraries are optimized for the SageMaker training environment, help adapt your distributed training jobs to SageMaker, and improve training speed and throughput.\n", "\n", "### Approaches\n", "\n", "### SageMaker Model Parallel\n", "\n", "Model parallelism is the process of splitting a model across multiple devices or nodes (such as GPU-equipped instances) and creating an efficient pipeline to train the model across these devices, maximizing GPU utilization.\n", "\n", "Increasing deep learning model size (layers and parameters) can yield better accuracy. However, there is a limit to the maximum model size that fits in the memory of a single GPU. When training deep learning models, GPU memory limitations can be a bottleneck in the following ways:\n", "\n", "1. They limit the size of the model you can train. Because larger models tend to achieve higher accuracy, this directly limits the accuracy of the trained model.\n", "\n", "2. They limit the batch size you can train with, leading to lower GPU utilization and slower training.\n", "\n", "To overcome these limitations, you can use model parallelism to distribute and train your model across multiple computing devices.\n", "\n", "### Core features of SageMaker Model Parallel\n", "\n", "1. [Automated Model Splitting](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html): When you use SageMaker's model parallel library, you can take advantage of automated model splitting, also referred to as automated model partitioning. The library uses a partitioning algorithm that balances memory, minimizes communication between devices, and optimizes performance. You can configure the automated partitioning algorithm to optimize for speed or for memory.\n", "\n", "2. [Pipeline Execution Schedule](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html): A core feature of SageMaker's distributed model parallel library is pipelined execution, which determines the order in which computations are made and data is processed across devices during model training. Pipelining is a technique to achieve true parallelization in model parallelism, by having the GPUs compute simultaneously on different data samples, and to overcome the performance loss due to sequential computation.\n", "\n", "Pipelining is based on splitting a mini-batch into microbatches, which are fed into the training pipeline one by one and follow an execution schedule defined by the library runtime. A microbatch is a smaller subset of a given training mini-batch. The pipeline schedule determines which microbatch is executed by which device in every time slot.\n",
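"\n",
"For reference, this partitioning and pipelining behavior is controlled through the library's configuration parameters. The sketch below is illustrative only; the values are placeholder assumptions, and the configuration documentation linked in the next section lists the exact options and defaults:\n",
"\n",
"```python\n",
"# Illustrative subset of SMP parameters governing partitioning and pipelining.\n",
"# Placeholder values; consult the SMP configuration docs for options and defaults.\n",
"smp_parameters = {\n",
"    'auto_partition': True,     # let the library split the model automatically\n",
"    'optimize': 'speed',        # partitioning objective: 'speed' or 'memory'\n",
"    'partitions': 2,            # number of model partitions for pipeline parallelism\n",
"    'microbatches': 4,          # each mini-batch is split into this many microbatches\n",
"    'pipeline': 'interleaved',  # pipeline execution schedule ('interleaved' or 'simple')\n",
"}\n",
"```\n",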
"\n", "In addition to its [core features](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html), the SageMaker distributed model parallel library offers [memory-saving features](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch.html) for training deep learning models with PyTorch: [tensor parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html), [optimizer state sharding](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-optimizer-state-sharding.html), [activation checkpointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-checkpointing.html), and [activation offloading](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-offloading.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SageMaker Model Parallel configuration\n", "\n", "Please refer to the [configuration parameters](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) related to SageMaker Distributed Training.\n", "\n", "Because we use PyTorch and Hugging Face to train GPT-J, it is important to understand the [SageMaker Distributed configuration parameters specific to PyTorch](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html#pytorch-specific-parameters).\n", "\n", "#### Important\n", "\n", "`processes_per_host` must not be greater than the number of GPUs per instance, and it is typically equal to the number of GPUs per instance.\n", "\n", "#### SageMaker Tensor Parallel\n", "\n", "Tensor parallelism splits individual layers, or nn.Modules, across devices so that they run in parallel. The following figure shows the simplest example of how the library splits a model with four layers to achieve two-way tensor parallelism (\"tensor_parallel_degree\": 2). The layers of each model replica are bisected and distributed across two GPUs. In this example, the model parallel configuration also includes \"pipeline_parallel_degree\": 1 and \"ddp\": True (which uses the PyTorch DistributedDataParallel package in the background), so the degree of data parallelism becomes eight (16 GPUs in total). The library manages communication across the tensor-distributed model replicas.\n", "\n", "This feature is useful because you can select specific layers, or a subset of layers, to apply tensor parallelism to. To dive deep into tensor parallelism and other memory-saving features for PyTorch, and to learn how to set a combination of pipeline and tensor parallelism, see [Extended Features of the SageMaker Model Parallel Library for PyTorch](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch.html).\n",
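"\n",
"To make this concrete, the sketch below shows roughly how such a configuration could be passed to a SageMaker PyTorch estimator. It is illustrative only: the parallelism degrees, instance type, framework and Python versions, and source directory are assumptions to adapt, not values prescribed by this notebook.\n",
"\n",
"```python\n",
"import sagemaker\n",
"from sagemaker.pytorch import PyTorch\n",
"\n",
"role = sagemaker.get_execution_role()  # assumes the notebook runs with a SageMaker execution role\n",
"\n",
"# Model parallel settings (illustrative values; see the configuration docs above).\n",
"smp_options = {\n",
"    'enabled': True,\n",
"    'parameters': {\n",
"        'tensor_parallel_degree': 2,\n",
"        'pipeline_parallel_degree': 1,\n",
"        'ddp': True,\n",
"    },\n",
"}\n",
"\n",
"# One MPI process per GPU on the instance (a p4d.24xlarge has 8 GPUs).\n",
"mpi_options = {'enabled': True, 'processes_per_host': 8}\n",
"\n",
"estimator = PyTorch(\n",
"    entry_point='train_gptj_smp_tensor_parallel_script.py',\n",
"    source_dir='.',                # assumed location of the scripts listed earlier\n",
"    role=role,\n",
"    instance_type='ml.p4d.24xlarge',\n",
"    instance_count=1,\n",
"    framework_version='1.12',      # assumed; use a PyTorch version supported by SMP\n",
"    py_version='py38',\n",
"    distribution={\n",
"        'smdistributed': {'modelparallel': smp_options},\n",
"        'mpi': mpi_options,\n",
"    },\n",
")\n",
"# estimator.fit(...)  # pass the prepared GLUE/sst2 data channels here\n",
"```\n",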
"\n", "#### Additional Resources\n", "\n", "If you are new to Amazon SageMaker, you may find the following resources helpful for learning more about SMP and about using SageMaker with PyTorch.\n", "\n", "1. To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).\n", "\n", "2. To learn more about using the SageMaker Python SDK with PyTorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html).\n", "\n", "3. To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can run this notebook using the `Data Science, Python 3` Kernel on a SageMaker Studio Notebook or a `conda_pytorch_p38` Kernel on a SageMaker Notebook instance.\n", "\n", "The training job in this notebook requires a `g5.48xlarge`, `p3dn.24xlarge`, or `p4d.24xlarge` instance (or multiple smaller GPU instance types). You can use a single smaller GPU instance type with the GPT-J-XL 1.5B parameter model.\n", "\n", "If you get a `ResourceLimitExceeded` error message when running the training job, you can request an increase on the default quota by contacting AWS Support. For the quota Limit Type, select SageMaker Training Jobs and the instance type.\n",
"