{ "cells": [ { "cell_type": "markdown", "id": "b603f275", "metadata": {}, "source": [ "## Fine Tune BERT on Amazon Reviews Dataset\n", "\n", "\n", "This notebook demonstrates how to use SageMaker with AWS Trainium to train a text classification model. We are going to start with a pretrained BERT model from Hugging Face, and fine-tune it with Amazon Reviews dataset. This dataset consists of sentences labeled to be either positive or negative sentiment. The training job will take place on ml.trn1 instance which hosts the AWS Trainium accelerator. " ] }, { "cell_type": "markdown", "id": "567c1aec", "metadata": {}, "source": [ "### Lets begin by installing dependent libraries" ] }, { "cell_type": "code", "execution_count": null, "id": "b500cfc8", "metadata": {}, "outputs": [], "source": [ "!pip install -U sagemaker" ] }, { "cell_type": "code", "execution_count": null, "id": "757cc8ba", "metadata": {}, "outputs": [], "source": [ "!pip install torch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0" ] }, { "cell_type": "code", "execution_count": null, "id": "93adef05", "metadata": {}, "outputs": [], "source": [ "!pip install transformers==4.21.3 datasets==2.5.2" ] }, { "cell_type": "markdown", "id": "b2b1b05b", "metadata": {}, "source": [ "### Data Preparation\n", "\n", "We will use an existing Dataset Amazon reviews part of the HuggingFace Datasets. We will convert the dataset into a CSV format and upload it to S3. For practical use cases we can easily replace this step with actual data in csv format." ] }, { "cell_type": "code", "execution_count": null, "id": "f230b14a", "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "import transformers\n", "from sagemaker.pytorch import PyTorch\n", "from datasets import load_dataset\n", "from tqdm.auto import tqdm\n", "from transformers import AutoModelForSequenceClassification, AutoTokenizer\n", "from sagemaker import utils\n", "import os\n", "import boto3\n", "import botocore\n", "from datasets.filesystems import S3FileSystem\n", "from pathlib import Path\n", "from sagemaker.pytorch.model import PyTorchModel\n", "from sagemaker.predictor import Predictor\n", "from datetime import datetime\n", "import json\n", "from sagemaker.serializers import JSONSerializer\n", "from sagemaker.deserializers import JSONDeserializer\n", "import torch" ] }, { "cell_type": "code", "execution_count": null, "id": "284a53ea", "metadata": {}, "outputs": [], "source": [ "# import the amazon polarity dataset\n", "\n", "dataset = load_dataset(\"amazon_polarity\")" ] }, { "cell_type": "markdown", "id": "66160add", "metadata": {}, "source": [ "Lets look at the dataset structure" ] }, { "cell_type": "code", "execution_count": null, "id": "ac825bb8", "metadata": {}, "outputs": [], "source": [ "dataset[\"train\"][1]" ] }, { "cell_type": "markdown", "id": "3f3f18e4", "metadata": {}, "source": [ "The dataset consists of 3 fields label, title and content. For this training lets just use 'label' which is the target field and 'content' that is used to learn the features. The 'content' field is free text which contains the actual review for a product.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "7aeadd54", "metadata": {}, "outputs": [], "source": [ "train_ds = dataset[\"train\"]\n", "test_ds = dataset[\"test\"]" ] }, { "cell_type": "code", "execution_count": null, "id": "7ece1e8a", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "train_df = pd.DataFrame(train_ds)\n", "test_df = pd.DataFrame(test_ds)" ] }, { "cell_type": "code", "execution_count": null, "id": "bfe6772f", "metadata": {}, "outputs": [], "source": [ "# let use only label and content field\n", "\n", "train_df = train_df.drop([\"title\"], axis=1)\n", "test_df = test_df.drop([\"title\"], axis=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "9d1d0b6b", "metadata": {}, "outputs": [], "source": [ "num_labels = len(train_df[\"label\"].unique())\n", "\n", "print(\"Total number of labels {}\".format(num_labels))" ] }, { "cell_type": "markdown", "id": "afba68a3", "metadata": {}, "source": [ "Lets save the train and test dataset as CSV files." ] }, { "cell_type": "code", "execution_count": null, "id": "c2008f79", "metadata": {}, "outputs": [], "source": [ "train_df.to_csv(\"train.csv\", index=False)\n", "test_df.to_csv(\"test.csv\", index=False)" ] }, { "cell_type": "markdown", "id": "a96bf232", "metadata": {}, "source": [ "### Upload the data to S3.\n", "\n", "Lets upload the train.csv and test.csv files to S3 for us to be able to access this data during training." ] }, { "cell_type": "code", "execution_count": null, "id": "8e87337d", "metadata": {}, "outputs": [], "source": [ "sess = sagemaker.Session()\n", "# sagemaker session bucket -> used for uploading data, models and logs\n", "# sagemaker will automatically create this bucket if it not exists\n", "\n", "sagemaker_session_bucket = (\n", " None # Provide a bucket if you don't want to use the default bucket\n", ")\n", "if sagemaker_session_bucket is None and sess is not None:\n", " # set to default bucket if a bucket name is not given\n", " sagemaker_session_bucket = sess.default_bucket()\n", "\n", "role = sagemaker.get_execution_role()\n", "\n", "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n", "\n", "print(f\"sagemaker role arn: {role}\")\n", "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n", "print(f\"sagemaker session region: {sess.boto_region_name}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "b8fc763b", "metadata": {}, "outputs": [], "source": [ "train_data_url = sess.upload_data(\n", " path=\"train.csv\",\n", " key_prefix=\"classification/data/amazon\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "ab02854c", "metadata": {}, "outputs": [], "source": [ "test_data_url = sess.upload_data(\n", " path=\"test.csv\",\n", " key_prefix=\"classification/data/amazon\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "1369288a", "metadata": {}, "outputs": [], "source": [ "print(\"training data path - {}\".format(train_data_url))\n", "print(\"Test data path - {}\".format(test_data_url))" ] }, { "cell_type": "markdown", "id": "eca2ba88", "metadata": {}, "source": [ "### Start the training Job\n", "\n", "Now we are ready to run the training using Trn1 instance. A training script is required for SageMaker PyTorch estimator to run a model training job. Below is the script for fine-tuning a pretrained Hugging Face BERT model with the dataset (Amazon reviews) we just put in the S3." ] }, { "cell_type": "code", "execution_count": null, "id": "7e4b0ad5", "metadata": {}, "outputs": [], "source": [ "!pygmentize ./code/train.py" ] }, { "cell_type": "markdown", "id": "e499c786", "metadata": {}, "source": [ "In the training script, there are several important details worth mentioning:\n", "\n", "1. distributed training (hardware) This is an example of data parallel distributed training. In this training scenario, since there are multiple NeuronCores in this trn1 instance, each NeuronCore receives a copy of the model and a shard of data. Each NeuronCore is managed by a worker that runs a copy of the training script. Gradient from each worker is aggregated and averaged, such that each worker receives exactly same updates to the model weights. Then another iteration of training resumes.\n", "\n", "\n", "2. Distributed training (software) A specialized backend torch.xla.distributed.xla_backend is required for PyTorch to run on XLA device such as Trainium. In the training loop, since each worker generates its own gradient, xm.optimiser_Step(optimizer) makes sure all workers receive same gradient update before next iteration of training.\n", "\n", "\n", "3. The data from S3 will be copied to the training instance and the path will be made available as environment variables under channel names SM_CHANNEL_TRAIN and SM_CHANNEL_VAL\n", "\n", "\n", "4. The trained model config and weights are stored in a path provided by environment variable SM_MODEL_DIR. Amazon SageMaker will subsequently copy the files in SM_MODEL_DIR path to the S3 bucket once the training is complete. We can then use the model to deploy it to any hardware of our choice." ] }, { "cell_type": "code", "execution_count": null, "id": "c16b4648", "metadata": {}, "outputs": [], "source": [ "# start the training job with tranium\n", "base_job_name = \"amazon-review-classification\"" ] }, { "cell_type": "code", "execution_count": null, "id": "ae0ea2b1", "metadata": {}, "outputs": [], "source": [ "hyperparameters = {}\n", "\n", "hyperparameters[\n", " \"model_name_or_path\"\n", "] = \"bert-base-uncased\" # we can change this mode to any other pretrained bert base model\n", "hyperparameters[\"seed\"] = 100\n", "hyperparameters[\"max_length\"] = 128\n", "hyperparameters[\"per_device_train_batch_size\"] = 8\n", "hyperparameters[\"per_device_eval_batch_size\"] = 8\n", "hyperparameters[\"learning_rate\"] = 5e-5\n", "hyperparameters[\"max_train_steps\"] = 2000\n", "hyperparameters[\"num_train_epochs\"] = 1" ] }, { "cell_type": "code", "execution_count": null, "id": "e0032135", "metadata": { "scrolled": true }, "outputs": [], "source": [ "pt_estimator = PyTorch(\n", " entry_point=\"train.py\", # Specify your train script\n", " source_dir=\"code\",\n", " role=sagemaker.get_execution_role(),\n", " instance_count=1,\n", " instance_type=\"ml.trn1.32xlarge\",\n", " framework_version=\"1.11.0\",\n", " py_version=\"py38\",\n", " disable_profiler=True,\n", " base_job_name=base_job_name,\n", " hyperparameters=hyperparameters,\n", " volume_size=512,\n", " distribution={\"torch_distributed\": {\"enabled\": True}},\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "da71a2b2", "metadata": { "scrolled": true }, "outputs": [], "source": [ "pt_estimator.fit({\"train\": train_data_url, \"val\": test_data_url})" ] }, { "cell_type": "markdown", "id": "7409c0f1", "metadata": {}, "source": [ "\n", "\n", "\n", "Now that model is successfully trained and the model weights are stored to S3, We can take this model and deploy it using any hardware such as GPU,CPU or Inferentia.\n", "\n" ] }, { "cell_type": "markdown", "id": "e7b1e8d8", "metadata": {}, "source": [ "### Deploy the trained model\n", "\n", "The trained model can be taken and deployed to any instance such as CPU, GPU or AWS Inferentia. In this example we will take the trained model deploy it to a CPU instance and get some predictions. Inorder to deploy a model we need to do the following steps:\n", "\n", "1. Create a model.tar.gz with all the model files. \n", "2. Create an inference script to load, process and predict.\n", "3. Create a Pytorch Model and deploy it." ] }, { "cell_type": "markdown", "id": "0b8484b7", "metadata": {}, "source": [ "#### 1. Create model.tar.gz\n", "\n", "The output from the above training job is stored as an tar.gz file in S3. So we can directly retrive the url from the estimator and use it." ] }, { "cell_type": "code", "execution_count": null, "id": "f8266bfb", "metadata": { "collapsed": true }, "outputs": [], "source": [ "model_url = (\n", " pt_estimator.model_data\n", ") # Alternatively we can retrieve this from the training job details in console." ] }, { "cell_type": "code", "execution_count": null, "id": "92e1c48f", "metadata": {}, "outputs": [], "source": [ "print(model_url)" ] }, { "cell_type": "markdown", "id": "2011a6f2", "metadata": {}, "source": [ "#### 2. Create an inference script\n", "\n", "We need to write an inference script which tells how to load the model and do inference. The inference script should atlease include a model_fn function that loads the model. Optionally you may also implement input_fn and output_fn to process input and output, and predict_fn to customize how the model server gets predictions form the loaded model.\n", "\n", "The inference.py script we use contains implementation for the functions mentioned above. Lets see how it looks.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "e2a8a45f", "metadata": {}, "outputs": [], "source": [ "!pygmentize code/inference.py" ] }, { "cell_type": "markdown", "id": "541ba6af", "metadata": {}, "source": [ "#### 3. Create a Pytorch Model and deploy\n", "\n", "Once we have the model and the source files, deploying the model is as simple as creating a model object pointing to the model files and the source files and then deploying to the instance type we need." ] }, { "cell_type": "code", "execution_count": null, "id": "cd1f083c", "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch.model import PyTorchModel\n", "from sagemaker.serializers import JSONSerializer\n", "from sagemaker.deserializers import JSONDeserializer\n", "\n", "pytorch_model = PyTorchModel(\n", " model_data=model_url,\n", " role=role,\n", " framework_version=\"1.12.0\",\n", " py_version=\"py38\",\n", " source_dir=\"code\",\n", " entry_point=\"inference.py\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "817563e6", "metadata": {}, "outputs": [], "source": [ "predictor = pytorch_model.deploy(\n", " instance_type=\"ml.c5.xlarge\", # can be changed to GPU instance as well.\n", " initial_instance_count=1,\n", " serializer=JSONSerializer(),\n", " deserializer=JSONDeserializer(),\n", ")" ] }, { "cell_type": "markdown", "id": "860d3840", "metadata": {}, "source": [ "##### Note : The instance type can be changed as need to either a GPU/CPU based instance" ] }, { "cell_type": "markdown", "id": "ffa64385", "metadata": {}, "source": [ "### Predict using the model" ] }, { "cell_type": "code", "execution_count": null, "id": "76b202c2", "metadata": {}, "outputs": [], "source": [ "# Predict with model endpoint with a positive sample\n", "\n", "payload1 = \"I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny.\"" ] }, { "cell_type": "code", "execution_count": null, "id": "ab79ac38", "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "# invoke the endpoint\n", "out = predictor.predict(payload1)\n", "print(\"The prediction from the model --\")\n", "print(out)" ] }, { "cell_type": "code", "execution_count": null, "id": "52a0329e", "metadata": {}, "outputs": [], "source": [ "# Predict using a negative sample\n", "\n", "payload2 = \"I guess you have to be a romance novel lover for this one, and not a very discerning one. All others beware! It is absolute drivel. I figured I was in trouble when a typo is prominently featured on the back cover, but the first page of the book removed all doubt. Wait - maybe I'm missing the point. A quick re-read of the beginning now makes it clear. This has to be an intentional churning of over-heated prose for satiric purposes. Phew, so glad I didn't waste $10.95 after all.\"" ] }, { "cell_type": "code", "execution_count": null, "id": "9248dda2", "metadata": {}, "outputs": [], "source": [ "# invoke the endpoint\n", "out = predictor.predict(payload2)\n", "print(\"The prediction from the model --\")\n", "print(out)" ] }, { "cell_type": "markdown", "id": "dbfe6df2", "metadata": {}, "source": [ "### Clean Up\n", "\n", "Now that we have run some predicts, we should ideally free up the resource by deleting the model." ] }, { "cell_type": "code", "execution_count": null, "id": "17b050b3", "metadata": {}, "outputs": [], "source": [ "predictor.delete_model()\n", "predictor.delete_endpoint()" ] }, { "cell_type": "code", "execution_count": null, "id": "f0bae3a5", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }