{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Retail Demo Store Experimentation Workshop - A/B Testing Exercise\n", "\n", "In this exercise we will define, launch, and evaluate the results of an A/B experiment using the experimentation framework implemented in the Retail Demo Store project. If you have not already stepped through the **[3.1-Overview](./3.1-Overview.ipynb)** workshop notebook, please do so now as it provides the foundation built upon in this exercise.\n", "\n", "Recommended Time: 30 minutes\n", "\n", "## Prerequisites\n", "\n", "Since this module uses the Retail Demo Store's Recommendations service to run experiments across variations that depend on the personalization features of the Retail Demo Store, it is assumed that you have either completed the [Personalization](../1-Personalization/Lab-1-Introduction-and-data-preparation.ipynb) workshop or that those resources have been pre-provisioned in your AWS environment. If you are unsure and attending an AWS managed event such as a workshop, check with your event lead." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1: A/B Experiment\n", "\n", "For the first exercise we will demonstrate how to use the A/B testing technique to implement an experiment over two implementations, or variations, of product recommendations. The first variation will represent our current implementation using the [**Default Product Resolver**](https://github.com/aws-samples/retail-demo-store/blob/master/src/recommendations/src/recommendations-service/experimentation/resolvers.py) and the second variation will use the [**Personalize Resolver**](https://github.com/aws-samples/retail-demo-store/blob/master/src/recommendations/src/recommendations-service/experimentation/resolvers.py). The scenario we are simulating is adding product recommendations powered by Amazon Personalize to the home page and measuring the impact/uplift in click-throughs for products as a result of deploying a personalization strategy.\n", "\n", "### What is A/B Testing?\n", "\n", "A/B testing, also known as bucket or split testing, is used to compare the performance of two variations (A and B) of a single variable/experience by exposing separate groups of users to each variation and measuring user responses. An A/B experiment is run for a period of time, typically dictated by the number of users necessary to reach a statistically significant result, followed by statistical analysis of the results to determine whether a conclusion can be reached as to the best performing variation.\n", "\n", "### Our Experiment Hypothesis\n", "\n", "**Sample scenario:**\n", "\n", "Website analytics have shown that user sessions frequently end on the home page for our e-commerce site, the Retail Demo Store. Furthermore, when users do make a purchase, most purchases are for a single product. Currently on our home page we are using a basic approach of recommending featured products. We hypothesize that adding personalized recommendations to the home page will increase the click-through rate on products by 25%. The current click-through rate is 15%."
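, "\n", "To make the target concrete: the 25% figure is a *relative* lift on the 15% baseline, so the variation needs to reach an absolute click-through rate of 18.75%. The arithmetic, in plain Python:\n", "\n", "```python\n", "baseline_ctr = 0.15             # current click-through rate on the home page\n", "relative_lift = 0.25            # hypothesized 25% relative improvement\n", "target_ctr = baseline_ctr * (1 + relative_lift)\n", "print(target_ctr)               # 0.1875, i.e. 18.75%\n", "```"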
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ABExperiment Class\n", "\n", "Before stepping through creating and executing our A/B test, let's look at the relevant source code for the [**ABExperiment**](https://github.com/aws-samples/retail-demo-store/blob/master/src/recommendations/src/recommendations-service/experimentation/experiment_ab.py) class that implements A/B experiments in the Retail Demo Store project.\n", "\n", "As noted in the **3.1-Overview** notebook, all experiment types are subclasses of the abstract **Experiment** class. See **[3.1-Overview](./3.1-Overview.ipynb)** for more details on the experimentation framework.\n", "\n", "The `ABExperiment.get_items()` method is where item recommendations are retrieved for the experiment. The `ABExperiment.calculate_variation_index()` method is where users are assigned to a variation/group using a consistent hashing algorithm. This ensures that each user is assigned to the same variation across multiple requests for recommended items for the duration of the experiment. Once the variation is determined, the variation's **Resolver** is used to retrieve recommendations. Details on the experiment are added to item list to support conversion/outcome tracking and UI annotation. \n", "\n", "```python\n", "# from src/recommendations/src/recommendations-service/experimentation/experiment_ab.py\n", "\n", "class ABExperiment(Experiment):\n", " ...\n", "\n", " def get_items(self, user_id, current_item_id = None, item_list = None, num_results = 10, tracker = None):\n", " ...\n", " # Determine which variation to use for the user.\n", " variation_idx = self.calculate_variation_index(user_id)\n", "\n", " # Increment exposure counter for variation for this experiment.\n", " self._increment_exposure_count(variation_idx)\n", "\n", " # Get item recommendations from the variation's resolver.\n", " variation = self.variations[variation_idx]\n", "\n", " resolve_params = {\n", " 'user_id': user_id,\n", " 'product_id': current_item_id,\n", " 'num_results': num_results\n", " }\n", " items = variation.resolver.get_items(**resolve_params)\n", "\n", " # Inject experiment details into recommended item list.\n", " rank = 1\n", " for item in items:\n", " correlation_id = self._create_correlation_id(user_id, variation_idx, rank)\n", "\n", " item_experiment = {\n", " 'id': self.id,\n", " 'feature': self.feature,\n", " 'name': self.name,\n", " 'type': self.type,\n", " 'variationIndex': variation_idx,\n", " 'resultRank': rank,\n", " 'correlationId': correlation_id\n", " }\n", "\n", " item.update({ \n", " 'experiment': item_experiment\n", " })\n", "\n", " rank += 1\n", "\n", " ...\n", "\n", " return items\n", "\n", " def calculate_variation_index(self, user_id):\n", " \"\"\" Given a user_id and this experiment's configuration, return the variation\n", "\n", " The same variation will be returned for given user for this experiment no \n", " matter how many times this method is called.\n", " \"\"\"\n", " if len(self.variations) == 0:\n", " return -1\n", "\n", " hash_str = f'experiments.{self.feature}.{self.name}.{user_id}'.encode('ascii')\n", " hash_int = int(hashlib.sha1(hash_str).hexdigest()[:15], 16)\n", " index = hash_int % len(self.variations)\n", "\n", " return index\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup - Import Dependencies\n", "\n", "Throughout this workshop we will need access to some common libraries and clients for connecting to AWS services. Let's set those up now." 
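, "\n", "(One quick aside before running the setup cell: the deterministic user-to-variation assignment performed by `calculate_variation_index` above can be reproduced with nothing more than the Python standard library. The sketch below simply restates that hashing logic outside of the `ABExperiment` class, using an arbitrary example user id and the feature and experiment names we will create later in this notebook.)\n", "\n", "```python\n", "import hashlib\n", "\n", "def assign_variation(user_id, feature, experiment_name, num_variations=2):\n", "    # Mirrors ABExperiment.calculate_variation_index: hash the experiment/user key\n", "    # and bucket by modulo, so a given user always lands in the same variation.\n", "    hash_str = f'experiments.{feature}.{experiment_name}.{user_id}'.encode('ascii')\n", "    hash_int = int(hashlib.sha1(hash_str).hexdigest()[:15], 16)\n", "    return hash_int % num_variations\n", "\n", "# Same user, same assignment, no matter how many times it is called.\n", "print(assign_variation('42', 'home_product_recs', 'home_personalize_ab'))\n", "print(assign_variation('42', 'home_product_recs', 'home_personalize_ab'))\n", "```"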
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import json\n", "import uuid\n", "import numpy as np\n", "import requests\n", "import pandas as pd\n", "import random\n", "import scipy.stats as scs\n", "import time\n", "import decimal\n", "import matplotlib.pyplot as plt\n", "\n", "from boto3.dynamodb.conditions import Key\n", "from random import randint\n", "\n", "# import custom scripts used for plotting\n", "from src.plot import *\n", "from src.stats import *\n", "\n", "%matplotlib inline\n", "plt.style.use('ggplot')\n", "\n", "# We will be using a DynamoDB table to store configuration info for our experiments.\n", "dynamodb = boto3.resource('dynamodb')\n", "\n", "# Service discovery will allow us to dynamically discover Retail Demo Store resources\n", "servicediscovery = boto3.client('servicediscovery')\n", "# Retail Demo Store config parameters are stored in SSM\n", "ssm = boto3.client('ssm')\n", "\n", "# Utility class to convert types for printing as JSON.\n", "class CompatEncoder(json.JSONEncoder):\n", " def default(self, obj):\n", " if isinstance(obj, decimal.Decimal):\n", " if obj % 1 > 0:\n", " return float(obj)\n", " else:\n", " return int(obj)\n", " else:\n", " return super(CompatEncoder, self).default(obj)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sample Size Calculation\n", "\n", "The first step is to determine the sample size necessary to reach a statistically significant result given a target of 25% gain in click-through rate from the home page. There are several sample size calculators available online including calculators from [Optimizely](https://www.optimizely.com/sample-size-calculator/?conversion=15&effect=20&significance=95), [AB Tasty](https://www.abtasty.com/sample-size-calculator/), and [Evan Miller](https://www.evanmiller.org/ab-testing/sample-size.html#!15;80;5;25;1). For this exercise, we will use the following function to calculate the minimal sample size for each variation." 
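, "\n", "For reference, the function in the next cell computes the standard two-proportion approximation: with the pooled probability $\\bar{p} = \\mathrm{bcr} + \\mathrm{mde}/2$, the required sample size per variation is\n", "\n", "$$ n \\approx \\frac{2\\,\\bar{p}\\,(1-\\bar{p})\\,(z_{1-\\alpha/2} + z_{\\mathrm{power}})^2}{\\mathrm{mde}^2} $$\n", "\n", "where $z_{1-\\alpha/2}$ and $z_{\\mathrm{power}}$ are standard normal quantiles for the chosen significance level and desired power."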
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def min_sample_size(bcr, mde, power=0.8, sig_level=0.05):\n", " \"\"\"Returns the minimum sample size to set up a split test\n", "\n", " Arguments:\n", " bcr (float): probability of success for control, sometimes\n", " referred to as baseline conversion rate\n", "\n", " mde (float): minimum change in measurement between control\n", " group and test group if alternative hypothesis is true, sometimes\n", " referred to as minimum detectable effect\n", "\n", " power (float): probability of rejecting the null hypothesis when the\n", " null hypothesis is false, typically 0.8\n", "\n", " sig_level (float): significance level often denoted as alpha,\n", " typically 0.05\n", "\n", " Returns:\n", " min_N: minimum sample size (float)\n", "\n", " References:\n", " Stanford lecture on sample sizes\n", " http://statweb.stanford.edu/~susan/courses/s141/hopower.pdf\n", " \"\"\"\n", " # standard normal distribution to determine z-values\n", " standard_norm = scs.norm(0, 1)\n", "\n", " # find Z_beta from desired power\n", " Z_beta = standard_norm.ppf(power)\n", "\n", " # find Z_alpha\n", " Z_alpha = standard_norm.ppf(1-sig_level/2)\n", "\n", " # average of probabilities from both groups\n", " pooled_prob = (bcr + bcr+mde) / 2\n", "\n", " min_N = (2 * pooled_prob * (1 - pooled_prob) * (Z_beta + Z_alpha)**2\n", " / mde**2)\n", "\n", " return min_N" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This is the conversion rate using the current implementation\n", "baseline_conversion_rate = 0.15\n", "# This is the lift expected by adding personalization\n", "absolute_percent_lift = baseline_conversion_rate * .25\n", "\n", "# Calculate the sample size needed to reach a statistically significant result\n", "sample_size = int(min_sample_size(baseline_conversion_rate, absolute_percent_lift))\n", "\n", "print('Sample size for each variation: ' + str(sample_size))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Experiment Strategy Datastore\n", "\n", "With our sample size defined, let's create an experiment strategy for our A/B experiment. Walk through each of the following steps to configure your environment.\n", "\n", "A DynamoDB table was created by the Retail Demo Store CloudFormation template that we will use to store the configuration information for our experiments. The table name can be found in a system parameter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = ssm.get_parameter(Name='retaildemostore-experiment-strategy-table-name')\n", "\n", "table_name = response['Parameter']['Value'] # Do Not Change\n", "print('Experiments DDB table: ' + table_name)\n", "table = dynamodb.Table(table_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we need to lookup the Amazon Personalize campaign ARN for product recommendations. This is the campaign that was created in the [Personalization workshop](../1-Personalization/personalize.ipynb) (or was pre-built for you depending on your workshop event)." 
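, "\n", "If the lookup in the next cell fails or returns an unexpected value, the Personalize resources may not be ready yet. As a quick check, you can list the related parameters with the `ssm` client created earlier (a sketch; it assumes only that the parameters live under the `/retaildemostore/personalize/` prefix used in the next cell):\n", "\n", "```python\n", "# List Personalize-related parameters to confirm the campaign ARN has been populated.\n", "paginator = ssm.get_paginator('get_parameters_by_path')\n", "for page in paginator.paginate(Path='/retaildemostore/personalize/'):\n", "    for param in page['Parameters']:\n", "        print(param['Name'], '=', param['Value'])\n", "```"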
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = ssm.get_parameter(Name = '/retaildemostore/personalize/recommended-for-you-arn')\n", "\n", "campaign_arn = response['Parameter']['Value'] # Do Not Change\n", "print('Personalize product recommendations ARN: ' + campaign_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create A/B Experiment\n", "\n", "The Retail Demo Store supports running multiple experiments concurrently. For this workshop we will create a single A/B test/experiment that uniformly splits users between a control group that receives recommendations from the default behavior and a variation group that receives recommendations from Amazon Personalize. The Recommendations service already has logic that supports A/B tests once an active experiment is detected in our Experiment Strategy DynamoDB table.\n", "\n", "Experiment configurations are stored in a DynamoDB table where each item in the table represents an experiment and has the following fields.\n", "\n", "- **id** - Uniquely identifies this experiment (UUID).\n", "- **feature** - Identifies the Retail Demo Store feature where the experiment should be applied. The name for the home page product recommendations feature is `home_product_recs`.\n", "- **name** - The name of the experiment. Keep the name short but descriptive. It will be used in the UI for demo purposes and when logging events for experiment result tracking.\n", "- **status** - The status of the experiment (`ACTIVE`, `EXPIRED`, or `PENDING`).\n", "- **type** - The type of test (`ab` for an A/B test, `interleaving` for interleaved recommendations, or `mab` for a multi-armed bandit test).\n", "- **variations** - List of configurations representing variations for the experiment. For example, for A/B tests of the `home_product_recs` feature, the `variations` can be two Amazon Personalize campaign ARNs (variation type `personalize-recommendations`) or a single Personalize campaign ARN and the default product behavior."
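, "\n", "To make the schema concrete, the item we are about to create will look roughly like the following (the `id` and campaign ARN values are illustrative placeholders; the next cell builds and stores the real item):\n", "\n", "```python\n", "example_experiment_item = {\n", "    'id': '<uuid4 hex>',                  # generated with uuid.uuid4().hex\n", "    'feature': 'home_product_recs',       # where the experiment is applied\n", "    'name': 'home_personalize_ab',\n", "    'status': 'ACTIVE',\n", "    'type': 'ab',\n", "    'variations': [\n", "        { 'type': 'product' },            # control: default featured products resolver\n", "        { 'type': 'personalize-recommendations', 'inference_arn': '<campaign ARN>' }\n", "    ]\n", "}\n", "```"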
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature = 'home_product_recs'\n", "experiment_name = 'home_personalize_ab'\n", "\n", "# First, make sure there are no other active experiments so we can isolate\n", "# this experiment for the exercise (to keep things clean/simple).\n", "response = table.scan(\n", " ProjectionExpression='#k', \n", " ExpressionAttributeNames={'#k' : 'id'},\n", " FilterExpression=Key('status').eq('ACTIVE')\n", ")\n", "for item in response['Items']:\n", " response = table.update_item(\n", " Key=item,\n", " UpdateExpression='SET #s = :inactive',\n", " ExpressionAttributeNames={\n", " '#s' : 'status'\n", " },\n", " ExpressionAttributeValues={\n", " ':inactive' : 'INACTIVE'\n", " }\n", " )\n", "\n", "# Query the experiment strategy table to see if our experiment already exists\n", "response = table.query(\n", " IndexName='feature-name-index',\n", " KeyConditionExpression=Key('feature').eq(feature) & Key('name').eq(experiment_name),\n", " FilterExpression=Key('status').eq('ACTIVE')\n", ")\n", "\n", "if response.get('Items') and len(response.get('Items')) > 0:\n", " print('Experiment already exists')\n", " home_page_experiment = response['Items'][0]\n", "else:\n", " print('Creating experiment')\n", " \n", " # Default product resolver\n", " variation_0 = {\n", " 'type': 'product'\n", " }\n", " \n", " # Amazon Personalize resolver\n", " variation_1 = {\n", " 'type': 'personalize-recommendations',\n", " 'inference_arn': campaign_arn\n", " }\n", "\n", " home_page_experiment = { \n", " 'id': uuid.uuid4().hex,\n", " 'feature': feature,\n", " 'name': experiment_name,\n", " 'status': 'ACTIVE',\n", " 'type': 'ab',\n", " 'variations': [ variation_0, variation_1 ]\n", " }\n", " \n", " response = table.put_item(\n", " Item=home_page_experiment\n", " )\n", "\n", " print(json.dumps(response, indent=4))\n", " \n", "print('Experiment item:')\n", "print(json.dumps(home_page_experiment, indent=4, cls=CompatEncoder))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Users\n", "\n", "For our experiment simulation, we will load all Retail Demo Store users and run the experiment until the sample size for both variations has been met.\n", "\n", "First, let's discover the IP address for the Retail Demo Store's [Users](https://github.com/aws-samples/retail-demo-store/tree/master/src/users) service." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = servicediscovery.discover_instances(\n", " NamespaceName='retaildemostore.local',\n", " ServiceName='users',\n", " MaxResults=1,\n", " HealthStatus='HEALTHY'\n", ")\n", "\n", "users_service_instance = response['Instances'][0]['Attributes']['AWS_INSTANCE_IPV4']\n", "print('Users Service Instance IP: {}'.format(users_service_instance))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's fetch all users, randomize their order, and load them into a local data frame." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load all users so we have enough to satisfy our sample size requirements.\n", "response = requests.get('http://{}/users/all?count=10000'.format(users_service_instance))\n", "users = response.json()\n", "random.shuffle(users)\n", "users_df = pd.DataFrame(users)\n", "pd.set_option('display.max_rows', 5)\n", "\n", "users_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discover Recommendations Service\n", "\n", "Next, let's discover the IP address for the Retail Demo Store's [Recommendations](https://github.com/aws-samples/retail-demo-store/tree/master/src/recommendations) service. This is the service where the Experimentation framework is implemented and the `/recommendations` endpoint is what we call to simulate our A/B experiment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = servicediscovery.discover_instances(\n", " NamespaceName='retaildemostore.local',\n", " ServiceName='recommendations',\n", " MaxResults=1,\n", " HealthStatus='HEALTHY'\n", ")\n", "\n", "recommendations_service_instance = response['Instances'][0]['Attributes']['AWS_INSTANCE_IPV4']\n", "print('Recommendation Service Instance IP: {}'.format(recommendations_service_instance))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simulate Experiment\n", "\n", "Next we will define a function to simulate our A/B experiment by making calls to the [Recommendations](https://github.com/aws-samples/retail-demo-store/tree/master/src/recommendations) service across the users we just loaded. Then we will run our simulation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Simulation Function\n", "\n", "The following `simulate_experiment` function is supplied with the sample size for each group (A and B) and the probability of conversion for each group that we want to use for our simulation. It runs the simulation long enough to satisfy the sample size requirements and calls the Recommendations service for each user in the experiment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def simulate_experiment(N_A, N_B, p_A, p_B):\n", " \"\"\"Returns a pandas dataframe with simulated CTR data\n", "\n", " Parameters:\n", " N_A (int): sample size for control group\n", " N_B (int): sample size for test group\n", " Note: final sample size may not match N_A & N_B provided because the\n", " group at each row is chosen at random by the ABExperiment class.\n", " p_A (float): conversion rate; conversion rate of control group\n", " p_B (float): conversion rate; conversion rate of test group\n", "\n", " Returns:\n", " df (df)\n", " \"\"\"\n", "\n", " # will hold exposure/outcome data\n", " data = []\n", "\n", " # total number of users to sample for both variations\n", " N = N_A + N_B\n", " \n", " if N > len(users):\n", " raise ValueError('Sample size is greater than number of users')\n", "\n", " print('Generating data for {} users... 
this may take a few minutes'.format(N))\n", "\n", " # initiate bernoulli distributions to randomly sample from based on simulated probabilities\n", " A_bern = scs.bernoulli(p_A)\n", " B_bern = scs.bernoulli(p_B)\n", " \n", " for idx in range(N):\n", " if idx > 0 and idx % 500 == 0:\n", " print('Generated data for {} users so far'.format(idx))\n", " \n", " # initite empty row\n", " row = {}\n", "\n", " # Get next user from shuffled list\n", " user = users[idx]\n", "\n", " # Call Recommendations web service to get recommendations for the user\n", " response = requests.get('http://{}/recommendations?userID={}&feature={}'.format(recommendations_service_instance, user['id'], feature))\n", "\n", " recommendations = response.json()\n", " recommendation = recommendations[randint(0, len(recommendations)-1)]\n", " \n", " variation = recommendation['experiment']['variationIndex']\n", " row['variation'] = variation\n", " \n", " # Determine if variation converts based on probabilities provided\n", " if variation == 0:\n", " row['converted'] = A_bern.rvs()\n", " else:\n", " row['converted'] = B_bern.rvs()\n", " \n", " if row['converted'] == 1:\n", " # Update experiment with outcome/conversion\n", " correlation_id = recommendation['experiment']['correlationId']\n", " requests.post('http://{}/experiment/outcome'.format(recommendations_service_instance), data={'correlationId':correlation_id})\n", " \n", " data.append(row)\n", " \n", " # convert data into dataframe\n", " df = pd.DataFrame(data)\n", " \n", " print('Done')\n", "\n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run Simulation\n", "\n", "Next we run the simulation by defining our simulation parameters for sample sizes and probabilities and then call `simulate_experiment`. This will take several minutes depending on the sample sizes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "# Set size of both groups to calculated sample size\n", "N_A = N_B = sample_size\n", "\n", "# Use probabilities from our hypothesis\n", "# bcr: baseline conversion rate\n", "p_A = 0.15\n", "# d_hat: difference in a metric between the two groups, sometimes referred to as minimal detectable effect or lift depending on the context\n", "p_B = 0.1875\n", "\n", "# Run simulation\n", "ab_data = simulate_experiment(N_A, N_B, p_A, p_B)\n", "\n", "ab_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspect Experiment Summary Statistics\n", "\n", "Since the **Experiment** class updates statistics for the experiment in the experiment strategy DynamoDB table when a user is exposed to an experiment (\"exposure\") and when a user converts (\"outcome\"), we should see updated counts on our experiment. Let's reload our experiment and inspect the exposure and conversion counts for our simulation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Query DDB table for experiment item.\n", "response = table.get_item(Key={'id': home_page_experiment['id']})\n", "\n", "print(json.dumps(response['Item'], indent=4, cls=CompatEncoder))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should now see counters for `conversions` and `exposures` for each variation. These represent how many times a user has been exposed to a variation and how many times a user has converted for a variation (i.e. clicked on a recommended item/product)." 
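, "\n", "As a quick sanity check against the simulated data we will analyze next, you can compute a conversion rate per variation directly from those counters. The sketch below assumes each entry in the item's `variations` list carries the `exposures` and `conversions` counters shown in the output above:\n", "\n", "```python\n", "# Rough conversion rate per variation straight from the experiment item's counters.\n", "for idx, variation in enumerate(response['Item']['variations']):\n", "    exposures = int(variation.get('exposures', 0))\n", "    conversions = int(variation.get('conversions', 0))\n", "    rate = conversions / exposures if exposures else float('nan')\n", "    print('Variation {}: {} / {} = {:.4f}'.format(idx, conversions, exposures, rate))\n", "```"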
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analyze Simulation Results\n", "\n", "Next, let's take a closer look at the results of our simulation. We'll start by calculating some summary statistics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ab_summary = ab_data.pivot_table(values='converted', index='variation', aggfunc=np.sum)\n", "# add additional columns to the pivot table\n", "ab_summary['total'] = ab_data.pivot_table(values='converted', index='variation', aggfunc=lambda x: len(x))\n", "ab_summary['rate'] = ab_data.pivot_table(values='converted', index='variation')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ab_summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output above tells us how many users converted for each variation, the actual sample size for each variation in the simulation, and the conversion rate for each variation.\n", "\n", "Next let's isolate the data and conversion counts for each variation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "A_group = ab_data[ab_data['variation'] == 0]\n", "B_group = ab_data[ab_data['variation'] == 1]\n", "A_converted, B_converted = A_group['converted'].sum(), B_group['converted'].sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "A_converted, B_converted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Isolate the actual sample size for each variation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "A_total, B_total = len(A_group), len(B_group)\n", "A_total, B_total" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculate the actual conversion rates and uplift for our simulation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p_A, p_B = A_converted / A_total, B_converted / B_total\n", "p_A, p_B" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p_B - p_A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Determining Statistical Significance\n", "\n", "In statistical hypothesis testing there are two types of errors that can occur. These are referred to as type 1 and type 2 errors.\n", "\n", "Type 1 errors occur when the null hypothesis is true but is rejected. In other words, a \"false positive\" conclusion. Put in A/B testing terms, a type 1 error is when we conclude a statistically significant result when there isn't one.\n", "\n", "Type 2 errors occur when we conclude that there is not a winner between two variations when in fact there is an actual winner. In other words, the null hypothesis is false yet we fail to reject it. Therefore, type 2 errors are a \"false negative\" conclusion.\n", "\n", "If the probability of making a type 1 error is determined by \"α\" (alpha), the probability of a type 2 error is \"β\" (beta). Beta depends on the power of the test (i.e the probability of not committing a type 2 error, which is equal to 1-β).\n", "\n", "Let's inspect the results of our simulation more closely to verify that it is statistically significant." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Calculate p-value\n", "\n", "Formally, the p-value is the probability of seeing a particular result (or greater) from zero, assuming that the null hypothesis is TRUE. 
In other words, the p-value reflects the fluctuation we would expect to see in a given sample, similar to the variance. As an example, imagine we ran an A/A test where we displayed the same variation to two groups of users. After such an experiment we would expect the conversion rates of both groups to be very similar rather than dramatically different.\n", "\n", "What we are hoping to see is a p-value that is less than our significance level. The significance level we used when calculating our sample size was 5%, which means we are seeking results at the 95% confidence level. 5% is considered the industry standard." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p_value = scs.binom(A_total, p_A).pmf(p_B * B_total)\n", "print('p-value = {0:0.9f}'.format(p_value))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is the p-value less than the significance level of 5%? This tells us the probability of a type 1 error.\n", "\n", "Let's plot the data from both groups as binomial distributions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(12,6))\n", "xA = np.linspace(A_converted-49, A_converted+50, 100)\n", "yA = scs.binom(A_total, p_A).pmf(xA)\n", "ax.scatter(xA, yA, s=10)\n", "xB = np.linspace(B_converted-49, B_converted+50, 100)\n", "yB = scs.binom(B_total, p_B).pmf(xB)\n", "ax.scatter(xB, yB, s=10)\n", "plt.xlabel('converted')\n", "plt.ylabel('probability')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the probabilities from our hypothesis, we should see that the test group in blue (B) converted more users than the control group in red (A). However, the plot above is not a plot of the null and alternative hypotheses; those are expressed in terms of the difference between the conversion probabilities of the two groups.\n", "\n", "> Given the randomness of our user selection, group hashing, and probabilities, your simulation results will be different for each simulation run and therefore may or may not be statistically significant.\n", "\n", "In order to calculate the difference between the two groups, we need to standardize the data. Because the number of samples can be different between the two groups, we should compare the probability of successes, p.\n", "\n", "According to the central limit theorem, by calculating many sample means we can approximate the true mean of the population from which the data for the control group was taken. The distribution of the sample means will be normally distributed around the true mean with a standard deviation equal to the standard error of the mean." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "SE_A = np.sqrt(p_A * (1-p_A)) / np.sqrt(A_total)\n", "SE_B = np.sqrt(p_B * (1-p_B)) / np.sqrt(B_total)\n", "SE_A, SE_B" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(12,6))\n", "xA = np.linspace(0, .3, A_total)\n", "yA = scs.norm(p_A, SE_A).pdf(xA)\n", "ax.plot(xA, yA)\n", "ax.axvline(x=p_A, c='red', alpha=0.5, linestyle='--')\n", "\n", "xB = np.linspace(0, .3, B_total)\n", "yB = scs.norm(p_B, SE_B).pdf(xB)\n", "ax.plot(xB, yB)\n", "ax.axvline(x=p_B, c='blue', alpha=0.5, linestyle='--')\n", "\n", "plt.xlabel('Converted Proportion')\n", "plt.ylabel('PDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dashed lines represent the mean conversion rate for each group. 
The distance between the red dashed line and the blue dashed line is equal to d_hat, the observed difference in conversion rate between the two groups (which we hope is at least as large as our minimum detectable effect)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p_A_actual = ab_summary.loc[0, 'rate']\n", "p_B_actual = ab_summary.loc[1, 'rate']\n", "bcr = p_A_actual\n", "d_hat = p_B_actual - p_A_actual\n", "A_total, B_total, bcr, d_hat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's calculate the power, alpha, and beta from our simulation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "abplot(A_total, B_total, bcr, d_hat, show_power=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The power value we used when determining our sample size for the experiment was 80%. This is considered the industry standard. Is the power value calculated in the plot above greater than 80%?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "abplot(A_total, B_total, bcr, d_hat, show_beta=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "abplot(A_total, B_total, bcr, d_hat, show_alpha=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is the alpha value plotted above less than our significance level of 5%, and is the beta value small enough to give us at least 80% power (power = 1 - beta)? If so, we have a statistically significant result." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next Steps\n", "\n", "You have completed the exercise for implementing an A/B test using the experimentation framework in the Retail Demo Store. Close this notebook and open the notebook for the next exercise, **[3.3-Interleaving-Experiment](./3.3-Interleaving-Experiment.ipynb)**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### References and Further Reading\n", "\n", "- [A/B testing](https://en.wikipedia.org/wiki/A/B_testing), Wikipedia\n", "- [A/B testing](https://www.optimizely.com/optimization-glossary/ab-testing/), Optimizely\n", "- [Evan's Awesome A/B Tools](https://www.evanmiller.org/ab-testing/), Evan Miller" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 }