{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Fine-tune FLAN-T5 XXL on Multiple nodes using DeepSpeed on Amazon SageMaker \n", "\n", "FLAN-T5 is an enhanced version of T5 that has been fine-tuned in a mixture of tasks, or simple words, a better T5 model in any aspect. FLAN-T5 outperforms T5 by double-digit improvements for the same number of parameters.\n", "\n", "This repo will show how to fine-tune FLAN-T5 XXL(11B) on multiple nodes using [DeepSpeed ZeRO](https://www.deepspeed.ai/tutorials/zero/) on Amazon SageMaker. And the repo is tested successfully on Data Science image and Python 3 kernel of Sagemaker studio with ml.m5.large kernel gateway instance in us-east-1 region.\n", "\n", "It is structured as follows:\n", "1. process dataset and upload to S3\n", "2. prepare training script and deepspeed launcher\n", "3. Fine-tune FLAN-T5 XXL on Amazon SageMaker\n", "\n", "Before we start, let’s install the required libraries and make sure we have the correct permissions to access S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install \"transformers==4.26.0\" \"datasets[s3]==2.9.0\" sagemaker --upgrade" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are going to use Sagemaker in a local environment. You need access to an IAM Role with the required permissions for Sagemaker. You can find [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) more about it.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import sagemaker\n", "from sagemaker import get_execution_role\n", "import boto3\n", "\n", "sess = sagemaker.Session()\n", "role = get_execution_role()\n", "\n", "print(f\"sagemaker role arn: {role}\")\n", "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n", "print(f\"sagemaker session region: {sess.boto_region_name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. process dataset and upload to S3\n", "\n", "We prepare a dataset on the [CNN Dailymail Dataset](https://huggingface.co/datasets/cnn_dailymail). \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# experiment config\n", "model_id = \"google/flan-t5-xxl\" # Hugging Face Model Id\n", "dataset_id = \"cnn_dailymail\" # Hugging Face Dataset Id\n", "dataset_config = \"3.0.0\" # config/verison of the dataset\n", "save_dataset_path = \"data\" # local path to save processed dataset\n", "text_column = \"article\" # column of input text is\n", "summary_column = \"highlights\" # column of the output text \n", "# custom instruct prompt start\n", "prompt_template = f\"Summarize the following news article:\\n{{input}}\\nSummary:\\n\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We process (tokenize) the dataset, upload to s3 and pass it into our managed Training job." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from datasets import load_dataset\n", "from transformers import AutoTokenizer\n", "import numpy as np \n", "\n", "dataset = load_dataset(dataset_id,name=dataset_config)\n", "tokenizer = AutoTokenizer.from_pretrained(model_id)\n", "\n", "print(f\"Train dataset size: {len(dataset['train'])}\")\n", "print(f\"Test dataset size: {len(dataset['test'])}\")\n", "\n", "# Train dataset size: 287113\n", "# Test dataset size: 11490" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We defined a `prompt_template` in our config, which we will use to construct an instruct prompt for better performance of our model. Our `prompt_template` has a “fixed” start and end, and our document is in the middle. This means we need to ensure that the “fixed” template parts + document are not exceeding the max length of the model. Therefore we calculate the max length of our document, which we will later use for padding and truncation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "prompt_lenght = len(tokenizer(prompt_template.format(input=\"\"))[\"input_ids\"])\n", "max_sample_length = tokenizer.model_max_length - prompt_lenght\n", "print(f\"Prompt length: {prompt_lenght}\")\n", "print(f\"Max input length: {max_sample_length}\")\n", "\n", "# Prompt length: 12\n", "# Max input length: 500" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We know now that our documents can be “500” tokens long to fit our `template_prompt` still correctly. In addition to our input, we need to understand better our “target” sequence length meaning and how long are the summarization ins our dataset. Therefore we iterate over the dataset and calculate the max input length (at max 500) and the max target length. (takes a few minutes)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from datasets import concatenate_datasets\n", "import numpy as np\n", "\n", "# The maximum total input sequence length after tokenization. \n", "# Sequences longer than this will be truncated, sequences shorter will be padded.\n", "tokenized_inputs = concatenate_datasets([dataset[\"train\"], dataset[\"test\"]]).map(lambda x: tokenizer(x[text_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])\n", "max_source_length = max([len(x) for x in tokenized_inputs[\"input_ids\"]])\n", "max_source_length = min(max_source_length, max_sample_length)\n", "print(f\"Max source length: {max_source_length}\")\n", "\n", "# The maximum total sequence length for target text after tokenization. \n", "# Sequences longer than this will be truncated, sequences shorter will be padded.\"\n", "tokenized_targets = concatenate_datasets([dataset[\"train\"], dataset[\"test\"]]).map(lambda x: tokenizer(x[summary_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])\n", "target_lenghts = [len(x) for x in tokenized_targets[\"input_ids\"]]\n", "# use 95th percentile as max target length\n", "max_target_length = int(np.percentile(target_lenghts, 95))\n", "print(f\"Max target length: {max_target_length}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have everything needed to process our dataset." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "def preprocess_function(sample, padding=\"max_length\"):\n", " # created prompted input\n", " inputs = [prompt_template.format(input=item) for item in sample[text_column]]\n", "\n", " # tokenize inputs\n", " model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)\n", "\n", " # Tokenize targets with the `text_target` keyword argument\n", " labels = tokenizer(text_target=sample[summary_column], max_length=max_target_length, padding=padding, truncation=True)\n", "\n", " # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore\n", " # padding in the loss.\n", " if padding == \"max_length\":\n", " labels[\"input_ids\"] = [\n", " [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels[\"input_ids\"]\n", " ]\n", "\n", " model_inputs[\"labels\"] = labels[\"input_ids\"]\n", " return model_inputs\n", "\n", "# process dataset\n", "tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=list(dataset[\"train\"].features))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After we processed the datasets we are going to use the new [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) to upload our dataset to S3. We are using the `sess.default_bucket()`, adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# save train_dataset to s3\n", "training_input_path = f's3://{sess.default_bucket()}/processed-404/{dataset_id}/train'\n", "tokenized_dataset[\"train\"].save_to_disk(training_input_path)\n", "\n", "# save test_dataset to s3\n", "test_input_path = f's3://{sess.default_bucket()}/processed-404/{dataset_id}/test'\n", "tokenized_dataset[\"test\"].save_to_disk(test_input_path)\n", "\n", "\n", "print(\"uploaded data to:\")\n", "print(f\"training dataset to: {training_input_path}\")\n", "print(f\"test dataset to: {test_input_path}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. prepare training script and deepspeed launcher\n", "\n", "Here we use torch.distribute.launch to launch deepspeed on multiple nodes. First, we use start.py to configure some enviroments and invoke the shell script torch_launch.sh. Second, the shell script torch_launch.sh will configure all of parameters required for both torch.distribute.launch and training script run_seq2seq_deepspeed.py.\n", "In addition, we create a deepspeed config file named ds_flan_t5_z3_config_bf16.json to configure our training setup. \n", "\n", "We are going to use a p4dn.24xlarge AWS EC2 Instance including 8x NVIDIA A100 40GB. This means we can leverage `bf16`, which reduces the memory footprint of the model by almost ~2x, which allows us to train without offloading efficiently. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Fine-tune FLAN-T5 XXL on Amazon SageMaker\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to create a sagemaker training job we need an `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. The Estimator manages the infrastructure use. 
\n", "SagMaker takes care of starting and managing all the required ec2 instances for us, provides the correct huggingface container, uploads the provided scripts and downloads the data from our S3 bucket into the container at /opt/ml/input/data. Then, it starts the training job by running." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import time\n", "from sagemaker.huggingface import HuggingFace\n", "from sagemaker import get_execution_role\n", "\n", "role = get_execution_role()\n", "# define Training Job Name \n", "job_name = f'huggingface-flan-t5-deepspeed-{time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.localtime())}'\n", "#define the model s3 path which will store your trained model asset\n", "#Note: you should use your real s3 path to configure model_s3_path\n", "model_s3_path='s3://your_bucket/flan-t5-xxl-4102xx668899-liangaws/model/'\n", "\n", "instance_count = 2\n", "#define the enviroment variables for your scripts.\n", "environment = {'NODE_NUMBER':str(instance_count),\n", " 'FI_PROVIDER': 'efa',\n", " 'NCCL_PROTO': 'simple',\n", " 'FI_EFA_USE_DEVICE_RDMA': '1',\n", " 'NCCL_DEBUG': 'INFO',\n", " 'MODEL_S3_PATH': model_s3_path\n", "}\n", "\n", "# create the Estimator\n", "huggingface_estimator = HuggingFace(\n", " entry_point = 'start.py', # user endpoint script\n", " source_dir = 'src', # directory which includes all the files needed for training\n", " instance_type = 'ml.p4d.24xlarge', # instances type used for the training job\n", " instance_count = instance_count, # the number of instances used for training\n", " base_job_name = job_name, # the name of the training job\n", " role = role, # Iam role used in training job to access AWS ressources, e.g. S3\n", " transformers_version = '4.17', # the transformers version used in the training job\n", " pytorch_version = '1.10', # the pytorch_version version used in the training job\n", " py_version = 'py38', # the python version used in the training job\n", " environment = environment,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We created our `HuggingFace` estimator including the `start.py` as `entry_point` . We can now start our training job, with the `.fit()` method passing our S3 path to the training script." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# define a data input dictonary with our uploaded s3 uris\n", "#Here we set test_input_path for both training channel and test channel to quickly verify the whole training procedure.\n", "data = {\n", " 'training': test_input_path,\n", " 'test': test_input_path\n", "}\n", "\n", "# starting the train job with our uploaded datasets as input\n", "huggingface_estimator.fit(data, wait=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 32, "name": "ml.m5d.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", 
"gpuNum": 0, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 21, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 144, "name": "ml.c5.18xlarge", "vcpuNum": 72 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 28, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 29, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 32, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "memoryGiB": 192, "name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 34, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 35, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 38, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 42, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 
0, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 46, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 47, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "memoryGiB": 384, "name": "ml.g5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 53, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 } ], "instance_type": "ml.m5.large", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" }, "vscode": { "interpreter": { "hash": "2d58e898dde0263bc564c6968b04150abacfd33eed9b19aaa8e45c040360e146" } } }, "nbformat": 4, "nbformat_minor": 4 }