{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Few shot learning using SageMaker and SetFit\n", "\n", "In this lab, you will learn how to fine tune a model to perform intent classification and deploy the trained model on SageMaker. By the end of this lab, you will become familiar with several key concepts in SageMaker including:\n", "- Using SageMaker [Studio Notebooks](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks.html) to prepare the data\n", "- Configuring and launching a [SageMaker Training](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html) job\n", "- Tracking the training job using [SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html)\n", "- Deploying the trained model to an endpoint with [SageMaker Hosting](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html)\n", "\n", "First, let us make sure we have the latest SageMaker SDK installed. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install sagemaker -U" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "After the installation, please restart your kernel. Then let us import all the packages needed later. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "from sagemaker.experiments.run import Run\n", "from sagemaker.huggingface import HuggingFace\n", "from sagemaker import image_uris\n", "import boto3\n", "import os\n", "import tarfile\n", "import json\n", "import pandas as pd\n", "from pathlib import Path" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Initial Setup\n", "This section contains defines some key variables used in this notebook, such as:\n", "* `role`: this is an IAM Role that assigns specific permissions to perform actions in AWS, such as training and deploying ML models in SageMaker. \n", "* `sess`: provides convenient methods for manipulating entities and resources that Amazon SageMaker uses. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "role = sagemaker.get_execution_role() # execution role for the endpoint\n", "sess = sagemaker.session.Session() # sagemaker session for interacting with different AWS APIs \n", "bucket = sess.default_bucket() # bucket to house artifacts\n", "model_bucket = sess.default_bucket() # bucket to house artifacts\n", "s3_key_prefix = \"set-fit-intent-classification\" # folder within bucket where code artifact will go\n", "\n", "region = sess._region_name # region name of the current SageMaker Studio environment\n", "account_id = sess.account_id() # account_id of the current SageMaker Studio environment" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Use case and Data Exploration\n", "For this lab we will use the [Banking 77](https://github.com/PolyAI-LDN/task-specific-datasets/tree/master/banking_data) dataset which classifies customer online banking queries into one of 77 intents " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/train.csv -O data/data.csv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"data/data.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Total number of records: {df.shape[0]}\")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's look at the distribution of the categories\n", "df[\"category\"].value_counts()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To train a model, we have to convert the text categories into numbers. We can use a Scikit Learn [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html), however for simplicity we'll just create a python dictionary that maps the text labels to numbers and vice versa and save these as json files." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create python dictionary to encode the text categories to integers\n", "cat_encoder = dict(zip(df[\"category\"].unique(), \n", " range(df[\"category\"].unique().shape[0]))\n", " )\n", "# Create a decoder to convert the predictions back to text\n", "cat_decoder = {v:k for k,v in cat_encoder.items()}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save the decoder and encoder json files for future use\n", "cat_encoders_path = Path(\"cat_encoders\")\n", "cat_encoders_path.mkdir(exist_ok=True)\n", "(cat_encoders_path / \"cat_encoder.json\").open(\"w\").write(json.dumps(cat_encoder))\n", "(cat_encoders_path / \"cat_decoder.json\").open(\"w\").write(json.dumps(cat_decoder))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate the few shot learning concept, we will use a subset of the data that contains only 8 examples for each category. This equates to using roughly 6% of the data for training with the remaining 94% of the data used for testing. \n", "\n", "While many Few Shot Learning techniques utilize prompt engineering with Large Language Models, we will use an alternative approach using the [SetFit](https://arxiv.org/abs/2209.11055) algorithm which is a simple yet effective method for few shot learning that produces great results with smaller language model.\n", "The algorithm is implemented in a very user friendly 🤗 [SetFit](https://github.com/huggingface/setfit) library. At a high level, the algorithm involves fine-tuning a [SentenceTransformer](https://www.sbert.net/) model in a contasrastive manner where pairs of examples are passed as input and the model learns to distinguish between pairs from the same category and pairs from different categories. The fine-tuned SentenceTransformer model is then used to generate embeddings that can be used as features of a classic classification model which by default is a Logistic Regression model from SKLearn" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Encode the categories\n", "df[\"label\"] = df[\"category\"].map(cat_encoder)\n", "\n", "# Take a random sample of 8 records from each category. Rest will be used for testing\n", "df_train = df.groupby(\"category\").sample(8)\n", "df_test = df.drop(df_train.index)\n", "print(f\"Train Data Size: {df_train.shape[0]}\\nTest Data Size: {df_test.shape[0]}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save the train and test data\n", "df_train.to_csv(\"data/train.csv\", index=False)\n", "df_test.to_csv(\"data/test.csv\", index=False)\n", "\n", "# Upload the data and the encoders to S3\n", "s3_train_data_path = sess.upload_data(\"data/train.csv\", bucket=bucket, key_prefix=f\"{s3_key_prefix}/data\")\n", "s3_test_data_path = sess.upload_data(\"data/test.csv\", bucket=bucket, key_prefix=f\"{s3_key_prefix}/data\")\n", "s3_encoder_path = sess.upload_data(\"cat_encoders\", bucket=bucket, key_prefix=f\"{s3_key_prefix}/encoders\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Configure the SageMaker Training Job\n", "Now we are ready to configure the SageMaker training job using the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/). \n", "Each supported framework has a corresponding estimator class that can be used to launch a training job. For example, the [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html) estimator can be used to launch a training job using TensorFlow. In this example, we use the [HuggingFace](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/index.html) estimator to launch a training job. \n", "\n", "The source directory contains the training script [train.py](src/train.py) and the [requirements.txt](src/requirements.txt) file which lists any additional dependencies needed for the training script. Additionally we specify the execution role, the instance type and the instance count, and the framework versions we want to use. \n", "\n", "When the job is launched on a training instance, internally the script will be launched with the following command:\n", "`python train.py [TRAINNING_ARGS]`\n", "The training arguments are passed to the training script as a dictionary via the `hyperparameters` argument to the Estimator. So per the code below, the full command that will be executed on the training instance is: \n", "\n", "`python train.py --pretrained_model_name_or_path sentence-transformers/paraphrase-mpnet-base-v2 --num_iterations 20`\n", "\n", "Our training script therefore needs to be able to parse command line arguments, argparse is a good way to do this as it is included in the Python standard library but you can use other parsers as well.\n", "\n", "When we launch the training job via the Estimator's `.fit` method, the data we specify will automatically be copied from S3 into a specific directory on the training instance. To understand more about the storage folders used in `train.py`, please refer to [Amazon SageMaker Training Storage Folders for Training Datasets, Checkpoints, Model Artifacts, and Outputs](https://docs.aws.amazon.com/sagemaker/latest/dg/model-train-storage.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Base model that will be used for transfer learning\n", "# Can be any sentence transformer model from https://huggingface.co/sentence-transformers\n", "model_id = \"sentence-transformers/paraphrase-mpnet-base-v2\"\n", "\n", "estimator = HuggingFace(source_dir = \"src\", # Path to the directory containing the training script\n", " entry_point=\"train.py\", # Script that will be run when training starts\n", " role=role, # IAM role to be used for training\n", " instance_count=1, \n", " instance_type=\"ml.g4dn.xlarge\", # Instance type to be used for training \n", " pytorch_version=\"1.10\", # PyTorch version to be used for training\n", " transformers_version=\"4.17\", # Transformers version to be used for training\n", " py_version=\"py38\",\n", " disable_profiler=True,\n", " hyperparameters={\"pretrained_model_name_or_path\": model_id,\n", " \"num_iterations\": 20,\n", " \"region\": region\n", " },\n", " keep_alive_period_in_seconds=3600)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "He will we use the [SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) capability to track our training job. Amazon SageMaker Experiments allows you to create, manage, analyze, and compare your machine learning experiments. We create an experiment `set-fit-intent-classification` defined below to help us track our training job. By launching the training job within the experiment Run context, we can pass in the experiment configuration and the training job will be automatically associated with the experiment. This is helpful if we want to log data from multiple places such as from the notebook and training job while associating them all with the same experiment.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "experiment_name = \"set-fit-intent-classification\"\n", "run_name = \"set-fit-intent-classification-run-1\"\n", "with Run(experiment_name=experiment_name, sagemaker_session=sess, run_name=run_name) as run:\n", " estimator.fit(\n", " {\"train\": s3_train_data_path, \"test\": s3_test_data_path, \"encoders\": s3_encoder_path}\n", " )" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Analyze Results\n", "After the training job is done, we can query SageMaker Experiments to see the results. We can see that the training job was associated with the experiment and the training job metrics were logged." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the latest run\n", "last_run = sagemaker.experiments.list_runs(experiment_name=experiment_name)[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# extract the logged metrics and confusion matrix\n", "tc = last_run._trial_component\n", "overall_metrics = pd.DataFrame([metrics.__dict__ for metrics in tc.metrics])[[\"metric_name\", \"last\"]]\n", "category_metrics = pd.read_json(tc.output_artifacts[\"Classification Report\"].value).T\n", "confusion_matrix = pd.read_csv(tc.output_artifacts[\"Confusion Matrix\"].value)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Overall Metrics on Test Data\")\n", "overall_metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Per Category Metrics\")\n", "category_metrics.sort_values(\"f1-score\", ascending=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Top Errors by Category\")\n", "confusion_matrix.query(\"actual_category != predicted_category and value > 0\").sort_values(by=\"value\", ascending=False)[:20]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Deploy Model\n", "Our last step is to deploy the model to an endpoint. Similar to the Estimator SDK used to create the Training job, we can use the [HuggingFaceModel](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/index.html) object to deploy the model to an endpoint. Instead of a training script, we provide an inference script [inference.py](src/inference.py) which will be used to load the model and perform inference. Our inference.py script needs to implement two functions: `model_fn` and `transform_fn`. The `model_fn` function is used to load the model and the `transform_fn` function is used to perform inference for each request. \n", "\n", "The signature of `model_fn` is `model_fn(model_dir)` where `model_dir` is a directory that contains the output of the training script (i.e. anything that was saved to the `opt/ml/model` directory during training). You simply need to write code to take the contents of `model_dir` and return the model object. \n", "\n", "The signature of `transform_fn` is `transform_fn(model, input_data, content_type, accept_type)` where model is the output of `model_fn`, `input_data` is the request payload, `input_content_type` is the request content type, and `output_content_type` is the desired response content type.\n", "\n", "For more details on writing inference scripts, please refer to [SageMaker HuggingFace Inference Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-deployment-hosting-services-prerequisites.html)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.huggingface import HuggingFaceModel" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# configure the model\n", "model = HuggingFaceModel(source_dir = \"src\", # Path to the directory containing the inference script\n", " entry_point=\"inference.py\", # Script that will be used to load model and handle requests\n", " role=role, # IAM role to be assumed by the endpoint\n", " model_data=estimator.model_data, # S3 location of the model artifacts\n", " pytorch_version=\"1.10\", # PyTorch version to be used for training\n", " transformers_version=\"4.17\", # Transformers version to be used for training\n", " py_version=\"py38\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# deploy the model\n", "predictor = model.deploy(initial_instance_count=1, instance_type=\"ml.g4dn.xlarge\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# run inference on an example. Note that the input is a list of strings\n", "predictor.predict([\"There seems to be a couple of payments listed in the app I know weren't made by me. Is there a possibility someone has access to my card? Can you find out what's going on?\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# delete the endpoint once done experimenting\n", "predictor.delete_endpoint()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }