{ "cells": [ { "cell_type": "markdown", "id": "ad718ed6-890f-4a3d-a6c4-9eb29d8f92e7", "metadata": { "tags": [] }, "source": [ "# A/B Testing with Amazon SageMaker\n", "\n", "***\n", "This notebooks is designed to run on `Python 3 (Data Science 2.0)` kernel in Amazon SageMaker Studio\n", "***\n", "\n", "In production ML workflows, data scientists and data engineers frequently try to improve their models in various ways, such as by performing [Perform Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html), training on additional or more-recent data, and improving feature selection. Performing A/B testing between a new model and an old model with production traffic can be an effective final step in the validation process for a new model. In A/B testing, you test different variants of your models and compare how each variant performs relative to each other. You then choose the best-performing model to replace a previously-existing model new version delivers better performance than the previously-existing version.\n", "\n", "Amazon SageMaker enables you to test multiple models or model versions behind the same endpoint using production variants. Each production variant identifies a machine learning (ML) model and the resources deployed for hosting the model. You can distribute endpoint invocation requests across multiple production variants by providing the traffic distribution for each variant, or you can invoke a specific variant directly for each request.\n", "\n", "In this notebook, we'll:\n", "* Evaluate models by invoking specific variants\n", "* Gradually release a new model by specifying traffic distribution\n", "\n", "Reference notebook example: [A/B Testing with Amazon SageMaker](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker_endpoints/a_b_testing/a_b_testing.ipynb)" ] }, { "cell_type": "markdown", "id": "8bce61ca-dfb0-455f-afb4-9a28fac22b84", "metadata": {}, "source": [ "## Setup\n", "Let's set up some required imports and basic initial variables:" ] }, { "cell_type": "code", "execution_count": null, "id": "96189bf5-9d67-4da7-88bd-9b33aa183225", "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import datetime\n", "import time\n", "import os, sys\n", "import boto3\n", "import re\n", "import json\n", "import pandas as pd\n", "import numpy as np\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "\n", "import csv\n", "import matplotlib.pyplot as plt\n", "from sklearn import metrics\n", "p = os.path.abspath('..')\n", "if p not in sys.path:\n", " sys.path.append(p)\n", "import utils\n", "\n", "sm_session = sagemaker.Session()\n", "role = get_execution_role()\n", "region = sm_session.boto_region_name\n", "bucket = sm_session.default_bucket()\n", "sm_client = sm_session.sagemaker_client\n", "sm_runtime = sm_session.sagemaker_runtime_client\n", "prefix = \"sagemaker/huggingface-pytorch-sentiment-analysis\"\n", "time_now = f'{datetime.datetime.now():%Y-%m-%d-%H-%M-%S}'\n", "time_now" ] }, { "cell_type": "code", "execution_count": null, "id": "51753a7c-b26e-40d3-be31-ea8f6b3e4328", "metadata": {}, "outputs": [], "source": [ "%store\n", "%store -r" ] }, { "cell_type": "markdown", "id": "064a1d40-d5cb-40c1-8895-386a9bb5e4ae", "metadata": {}, "source": [ "### Step 1: Deploy the models created in the previous multi-model endpoint notebook\n", "\n" ] }, { "cell_type": "markdown", "id": "8d951339-fbc0-43c5-89b3-61319c4ebf2e", "metadata": {}, "source": [ "Uncomment the below cell to view details of the `production_variant` function defined in SageMaker SDK" ] }, { "cell_type": "code", "execution_count": null, "id": "c5797d0e-6c95-4a19-9a00-917731164e58", "metadata": {}, "outputs": [], "source": [ "# ?? sagemaker.production_variant" ] }, { "cell_type": "code", "execution_count": null, "id": "5f2773a3-754f-4119-bf45-7cb4ba200f13", "metadata": {}, "outputs": [], "source": [ "variant1 = sagemaker.production_variant(\n", " model_name=roberta_mme_model_name,\n", " instance_type=\"ml.c5.2xlarge\",\n", " initial_instance_count=1,\n", " variant_name=\"Variant1\",\n", " initial_weight=1,\n", ")\n", "variant2 = sagemaker.production_variant(\n", " model_name=distilbert_model_name,\n", " instance_type=\"ml.c5.xlarge\",\n", " initial_instance_count=1,\n", " variant_name=\"Variant2\",\n", " initial_weight=1,\n", ")\n", "\n", "(variant1, variant2)" ] }, { "cell_type": "markdown", "id": "f006b375-e844-4d39-b431-205379141d28", "metadata": {}, "source": [ "#### Deploy\n", "Let's go ahead and deploy our two variants to a SageMaker endpoint:\n", "\n", "Uncomment below cells to view the details of the functions" ] }, { "cell_type": "code", "execution_count": null, "id": "21e948cc-570a-4017-8c3e-4ea1e6f07b2c", "metadata": {}, "outputs": [], "source": [ "# ?? sm_session.create_endpoint" ] }, { "cell_type": "code", "execution_count": null, "id": "198835ee-608a-45e4-9734-df09eeb3bd67", "metadata": {}, "outputs": [], "source": [ "# ?? sm_session.endpoint_from_production_variants" ] }, { "cell_type": "code", "execution_count": null, "id": "9537189d-4db5-47ec-b415-7e7242e92fc0", "metadata": {}, "outputs": [], "source": [ "endpoint_name = f\"demo-hf-pytorch-variant-{time_now}\"\n", "print(f\"EndpointName={endpoint_name}\")\n", "\n", "sm_session.endpoint_from_production_variants(\n", " name=endpoint_name, production_variants=[variant1, variant2]\n", ")" ] }, { "cell_type": "markdown", "id": "fdcc1c36-64d3-4bd0-8cfb-f08f6a4f80d4", "metadata": {}, "source": [ "## Step 2: Invoke the deployed models\n", "\n", "You can now send data to this endpoint to get inferences in real time.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "1e96fbfa-3823-41a0-ab02-5f70e339e3ff", "metadata": {}, "outputs": [], "source": [ "test_data = pd.read_csv(\"../sample_payload/test_data.csv\", header=None)\n", "json_data = dict({'inputs':test_data.iloc[:,0].to_list()})\n", "batch_data = pd.read_csv(\"../sample_payload/batch_data.csv\", header=None)" ] }, { "cell_type": "code", "execution_count": null, "id": "ccc8744a-9d1c-4a78-b52e-ddc9bc0f0a2a", "metadata": {}, "outputs": [], "source": [ "%%time\n", "predictions = []\n", "\n", "for i in range(5):\n", " response = sm_runtime.invoke_endpoint(\n", " EndpointName=endpoint_name,\n", " Body=json.dumps(json_data),\n", " ContentType=\"application/json\",\n", " )\n", " predictions.append(response[\"Body\"].read().decode(\"utf-8\"))\n", " time.sleep(0.5)\n", "\n", "print(*predictions, sep='\\n')" ] }, { "cell_type": "markdown", "id": "754e8afe-3ad3-4454-9cde-110d645b7a3e", "metadata": {}, "source": [ "### Invoke a specific variant\n", "\n", "Now, let’s use the new feature that was released today to invoke a specific variant. For this, we simply use the new parameter to define which specific ProductionVariant we want to invoke. Let us use this to invoke Variant1 for all requests." ] }, { "cell_type": "code", "execution_count": null, "id": "4618cbe2-2184-4775-ab85-8e277e0aa542", "metadata": {}, "outputs": [], "source": [ "%%time\n", "response = sm_runtime.invoke_endpoint(\n", " EndpointName=endpoint_name,\n", " Body=json.dumps(json_data),\n", " ContentType=\"application/json\",\n", " TargetVariant=variant1[\"VariantName\"],\n", ")\n", "\n", "print(response[\"Body\"].read())" ] }, { "cell_type": "code", "execution_count": null, "id": "b32ca3e9-0b2d-432b-98a1-88fe2c0c166c", "metadata": {}, "outputs": [], "source": [ "%%time\n", "response = sm_runtime.invoke_endpoint(\n", " EndpointName=endpoint_name,\n", " Body=json.dumps(json_data),\n", " ContentType=\"application/json\",\n", " TargetVariant=variant2[\"VariantName\"],\n", ")\n", "\n", "print(response[\"Body\"].read())" ] }, { "cell_type": "markdown", "id": "1178c6ae-4ed1-4744-9aea-8afba7e94959", "metadata": {}, "source": [ "## Step 3: Evaluate variant performance\n", "\n", "### Evaluating Variant 1\n", "\n", "Using the new targeting feature, let us evaluate the accuracy, precision, recall, F1 score, and ROC/AUC for Variant1:\n", "\n", "Note that the test data was from [Kaggle financial sentiment analysis dataset](https://www.kaggle.com/datasets/sbhatti/financial-sentiment-analysis)" ] }, { "cell_type": "code", "execution_count": null, "id": "5c74752c-ebc7-4896-9f1b-1b07eec4e61f", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_data = pd.read_csv(\"../sample_payload/batch_data.csv\")\n", "source_data = df_data.to_json(orient='records')\n", "json_lst = json.loads(source_data)\n", "json_lst[0]" ] }, { "cell_type": "code", "execution_count": null, "id": "23abc668-2c11-4d41-bd8b-f8a901ee677c", "metadata": {}, "outputs": [], "source": [ "predictions1 = utils.invoke_with_single_sentence(json_lst, endpoint_name, variant1[\"VariantName\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "939ddfd5-6c2a-4705-a348-adcbfe8ed7d1", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = pd.DataFrame(columns=['label','score'], dtype=object)\n", "for prediction in predictions1:\n", " tmp_df = pd.DataFrame(json.loads(prediction)[0])\n", " new_row = tmp_df[tmp_df['score']==max(tmp_df['score'])]\n", " df = df.append(new_row, ignore_index=True)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "f4f71370-cd81-497f-9b30-b747c58b3bb3", "metadata": {}, "outputs": [], "source": [ "value_map = {'LABEL_0': 0, 'LABEL_1': 1, 'LABEL_2': 2}\n", "df = df.replace({'label': value_map})\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "b2ff1f99-b6a4-4de5-9268-d0254e9cbc9d", "metadata": {}, "outputs": [], "source": [ "# Let's get the labels of our test set; we will use these to evaluate our predictions\n", "df_with_labels = pd.read_csv(\"../sample_payload/batch_data_groundtruth.csv\")\n", "\n", "value_map = {'negative': 0, 'neutral': 1, 'positive': 2}\n", "df_with_labels = df_with_labels.replace({'sentiment': value_map})" ] }, { "cell_type": "code", "execution_count": null, "id": "2cc71175-34ec-4da7-a970-44ba7a6db868", "metadata": {}, "outputs": [], "source": [ "test_labels = df_with_labels.iloc[:, 1]\n", "labels = test_labels.to_numpy()\n", "preds = df.label.to_numpy()\n", "\n", "# Calculate accuracy\n", "accuracy = sum(preds == labels) / len(labels)\n", "print(f\"Accuracy: {accuracy}\")\n" ] }, { "cell_type": "markdown", "id": "4d7da75c-b1bd-4a98-abdb-cb7d93e3534a", "metadata": {}, "source": [ "### Next, we collect data for Variant2" ] }, { "cell_type": "code", "execution_count": null, "id": "49e2b006-1229-42de-9534-45bcfaabb584", "metadata": {}, "outputs": [], "source": [ "predictions2 = utils.invoke_with_single_sentence(json_lst, endpoint_name, variant2[\"VariantName\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "ab24c154-51d0-41dc-9f38-bf97445ae442", "metadata": {}, "outputs": [], "source": [ "df2 = pd.DataFrame(columns=['label','score'], dtype=object)\n", "for prediction in predictions2:\n", " tmp_df = pd.DataFrame(json.loads(prediction))\n", " new_row = tmp_df[tmp_df['score']==max(tmp_df['score'])]\n", " df2 = df2.append(new_row, ignore_index=True)\n", "df2.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "9e0c5a65-c633-4126-9166-3b849f4c316c", "metadata": {}, "outputs": [], "source": [ "value_map = {'NEGATIVE': 0, 'POSITIVE': 1}\n", "df2 = df2.replace({'label': value_map})\n", "df2.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "4b17aa31-729e-47ef-b856-69d3b7313e0f", "metadata": {}, "outputs": [], "source": [ "preds = df2.label.to_numpy()\n", "\n", "# Calculate accuracy\n", "accuracy = sum(preds == labels) / len(labels)\n", "print(f\"Accuracy: {accuracy}\")" ] }, { "cell_type": "markdown", "id": "7847607d-6fdc-45d4-ba0a-54753adb2378", "metadata": {}, "source": [ "## Step 4: Dialing up our chosen variant in production\n", "\n", "Now that we have determined Variant1 to be better as compared to Variant2, we will shift more traffic to it. \n", "\n", "We can continue to use TargetVariant to continue invoking a chosen variant. A simpler approach is to update the weights assigned to each variant using UpdateEndpointWeightsAndCapacities. This changes the traffic distribution to your production variants without requiring updates to your endpoint. \n", "\n", "Recall our variant weights are as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "b0e9072b-9a94-491a-b8b5-df7120f5ff40", "metadata": {}, "outputs": [], "source": [ "{\n", " variant[\"VariantName\"]: variant[\"CurrentWeight\"]\n", " for variant in sm_client.describe_endpoint(EndpointName=endpoint_name)[\"ProductionVariants\"]\n", "}" ] }, { "cell_type": "markdown", "id": "aa78da1b-ecce-421c-9bc8-4d0706a93e01", "metadata": {}, "source": [ "We'll first write a method to easily invoke our endpoint (a copy of what we had been previously doing):" ] }, { "cell_type": "markdown", "id": "6f1a2a2d-37e6-4ea6-b42d-f2a34814ed03", "metadata": {}, "source": [ "We invoke our endpoint for a bit, to show the even split in invocations:" ] }, { "cell_type": "code", "execution_count": null, "id": "2d32d0bb-1f3a-48ed-9d74-db6cfa29330b", "metadata": {}, "outputs": [], "source": [ "invocation_start_time = datetime.datetime.now()\n", "utils.invoke_endpoint_for_two_minutes(endpoint_name)\n", "time.sleep(20) # give metrics time to catch up\n", "params = {\n", " \"endpoint_name\": endpoint_name, \n", " \"variant1\": variant1, \n", " \"variant2\": variant2, \n", " \"start_time\":invocation_start_time\n", "}\n", "utils.plot_endpoint_metrics(**params)" ] }, { "cell_type": "markdown", "id": "7b20419a-8832-4739-80ea-fae0b04afce4", "metadata": {}, "source": [ "Now let us shift 75% of the traffic to Variant1 by assigning new weights to each variant using UpdateEndpointWeightsAndCapacities. Amazon SageMaker will now send 75% of the inference requests to Variant1 and remaining 25% of requests to Variant2. " ] }, { "cell_type": "code", "execution_count": null, "id": "8d3415aa-1566-4e1b-adb5-1ef15ff5eb1f", "metadata": {}, "outputs": [], "source": [ "sm_client.update_endpoint_weights_and_capacities(\n", " EndpointName=endpoint_name,\n", " DesiredWeightsAndCapacities=[\n", " {\"DesiredWeight\": 75, \"VariantName\": variant1[\"VariantName\"]},\n", " {\"DesiredWeight\": 25, \"VariantName\": variant2[\"VariantName\"]},\n", " ],\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "e9d859e3-e2a9-406b-aa3e-05461b5e0dba", "metadata": {}, "outputs": [], "source": [ "print(\"Waiting for update to complete\")\n", "utils.endpoint_update_wait(endpoint_name)\n", "\n", "{\n", " variant[\"VariantName\"]: variant[\"CurrentWeight\"]\n", " for variant in sm_client.describe_endpoint(EndpointName=endpoint_name)[\"ProductionVariants\"]\n", "}" ] }, { "cell_type": "markdown", "id": "c965321b-43f6-42dc-8108-e534cbdc249b", "metadata": {}, "source": [ "Now let's check how that has impacted invocation metrics:" ] }, { "cell_type": "code", "execution_count": null, "id": "fd2207b5-e0c4-4b32-a2e9-6af422ed4502", "metadata": {}, "outputs": [], "source": [ "utils.invoke_endpoint_for_two_minutes(endpoint_name)\n", "time.sleep(20) # give metrics time to catch up\n", "utils.plot_endpoint_metrics(**params)" ] }, { "cell_type": "markdown", "id": "f9cc839b-ca65-44d4-8d0d-d8d810e65af4", "metadata": {}, "source": [ "We can continue to monitor our metrics and when we're satisfied with a variant's performance, we can route 100% of the traffic over the variant. We used UpdateEndpointWeightsAndCapacities to update the traffic assignments for the variants. The weight for Variant1 is set to 0 and the weight for Variant2 is set to 1. Therefore, Amazon SageMaker will send 100% of all inference requests to Variant2." ] }, { "cell_type": "code", "execution_count": null, "id": "74dfd003-8954-4542-8fef-74206d60872a", "metadata": {}, "outputs": [], "source": [ "sm_client.update_endpoint_weights_and_capacities(\n", " EndpointName=endpoint_name,\n", " DesiredWeightsAndCapacities=[\n", " {\"DesiredWeight\": 1, \"VariantName\": variant1[\"VariantName\"]},\n", " {\"DesiredWeight\": 0, \"VariantName\": variant2[\"VariantName\"]},\n", " ],\n", ")\n", "print(\"Waiting for update to complete\")\n", "utils.endpoint_update_wait(endpoint_name)\n", "\n", "{\n", " variant[\"VariantName\"]: variant[\"CurrentWeight\"]\n", " for variant in sm_client.describe_endpoint(EndpointName=endpoint_name)[\"ProductionVariants\"]\n", "}" ] }, { "cell_type": "code", "execution_count": null, "id": "9a8b8ae6-818d-4377-af7c-08a0a1a48acb", "metadata": {}, "outputs": [], "source": [ "utils.invoke_endpoint_for_two_minutes(endpoint_name)\n", "time.sleep(20) # give metrics time to catch up\n", "utils.plot_endpoint_metrics(**params)" ] }, { "cell_type": "markdown", "id": "846bcd5e-771c-48fe-8ba0-c10e7e5f954d", "metadata": {}, "source": [ "The Amazon CloudWatch metrics for the total invocations for each variant below shows us that all inference requests are being processed by Variant1 and there are no inference requests processed by Variant2.\n", "\n", "You can now safely update your endpoint and delete Variant2 from your endpoint. You can also continue testing new models in production by adding new variants to your endpoint and following steps 2 - 4. " ] }, { "cell_type": "markdown", "id": "f4338605-82d7-4274-a8a8-5e815c42f6e1", "metadata": {}, "source": [ "## Delete the endpoint\n", "\n", "If you do not plan to use this endpoint further, you should delete the endpoint to avoid incurring additional charges." ] }, { "cell_type": "code", "execution_count": null, "id": "925712f1-6cdc-414b-932d-30d59ba4c14f", "metadata": {}, "outputs": [], "source": [ "sm_session.delete_endpoint(endpoint_name)" ] }, { "cell_type": "code", "execution_count": null, "id": "1e6386a0-485e-44ae-af78-690fa2b9bd9e", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science 2.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-38" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "toc-showcode": false, "toc-showtags": false }, "nbformat": 4, "nbformat_minor": 5 }