# Retail Demo Store Experimentation Workshop - A/B Testing Exercise

In this exercise we will define, launch, and evaluate the results of an A/B experiment using the experimentation framework implemented in the Retail Demo Store project. If you have not already stepped through the **[3.1-Overview](./3.1-Overview.ipynb)** workshop notebook, please do so now as it provides the foundation built upon in this exercise.

Recommended Time: 30 minutes

## Prerequisites

Since this module uses the Retail Demo Store's Recommendation service to run experiments across variations that depend on the personalization features of the Retail Demo Store, it is assumed that you have either completed the [Personalization](../1-Personalization/Lab-1-Introduction-and-data-preparation.ipynb) workshop or those resources have been pre-provisioned in your AWS environment. If you are unsure and attending an AWS managed event such as a workshop, check with your event lead.

## Exercise 1: A/B Experiment

For the first exercise we will demonstrate how to use the A/B testing technique to implement an experiment over two implementations, or variations, of product recommendations. The first variation will represent our current implementation using the [**Default Product Resolver**](https://github.com/aws-samples/retail-demo-store/blob/master/src/recommendations/src/recommendations-service/experimentation/resolvers.py) and the second variation will use the [**Personalize Resolver**](https://github.com/aws-samples/retail-demo-store/blob/master/src/recommendations/src/recommendations-service/experimentation/resolvers.py). The scenario we are simulating is adding product recommendations powered by Amazon Personalize to the home page and measuring the impact/uplift in click-throughs for products as a result of deploying a personalization strategy.

### What is A/B Testing?

A/B testing, also known as bucket or split testing, is used to compare the performance of two variations (A and B) of a single variable/experience by exposing separate groups of users to each variation and measuring user responses. An A/B experiment is run for a period of time, typically dictated by the number of users necessary to reach a statistically significant result, followed by statistical analysis of the results to determine if a conclusion can be reached as to the best performing variation.

### Our Experiment Hypothesis

**Sample scenario:**

Website analytics have shown that user sessions frequently end on the home page for our e-commerce site, the Retail Demo Store. Furthermore, when users do make a purchase, most purchases are for a single product. Currently on our home page we are using a basic approach of recommending featured products. We hypothesize that adding personalized recommendations to the home page will result in increasing the click-through rate of products by 25%. The current click-through rate is 15%.
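
To make the target concrete: the hypothesis describes a relative lift, so the 25% applies to the current 15% rate rather than being added to it. The arithmetic can be sketched as follows (the variable names are illustrative and not part of the Retail Demo Store code):

```python
# Illustrative arithmetic only; these names are not part of the Retail Demo Store code.
baseline_ctr = 0.15    # current home page click-through rate
relative_lift = 0.25   # hypothesized relative improvement

absolute_lift = baseline_ctr * relative_lift   # lift expressed in absolute terms
target_ctr = baseline_ctr + absolute_lift      # click-through rate we hope to reach

print(f'absolute lift: {absolute_lift:.4f}')
print(f'target CTR:    {target_ctr:.4f}')
```

In other words, the experiment is looking for a click-through rate of about 18.75% in the personalized variation.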

### ABExperiment Class

Before stepping through creating and executing our A/B test, let's look at the relevant source code for the [**ABExperiment**](https://github.com/aws-samples/retail-demo-store/blob/master/src/recommendations/src/recommendations-service/experimentation/experiment_ab.py) class that implements A/B experiments in the Retail Demo Store project.

As noted in the **3.1-Overview** notebook, all experiment types are subclasses of the abstract **Experiment** class. See **[3.1-Overview](./3.1-Overview.ipynb)** for more details on the experimentation framework.

The `ABExperiment.get_items()` method is where item recommendations are retrieved for the experiment. The `ABExperiment.calculate_variation_index()` method is where users are assigned to a variation/group using a consistent hashing algorithm. This ensures that each user is assigned to the same variation across multiple requests for recommended items for the duration of the experiment. Once the variation is determined, the variation's **Resolver** is used to retrieve recommendations. Details on the experiment are added to the item list to support conversion/outcome tracking and UI annotation.

```python
# from src/recommendations/src/recommendations-service/experimentation/experiment_ab.py

class ABExperiment(Experiment):
    ...

    def get_items(self, user_id, current_item_id = None, item_list = None, num_results = 10, tracker = None):
        ...
        # Determine which variation to use for the user.
        variation_idx = self.calculate_variation_index(user_id)

        # Increment exposure counter for variation for this experiment.
        self._increment_exposure_count(variation_idx)

        # Get item recommendations from the variation's resolver.
        variation = self.variations[variation_idx]

        resolve_params = {
            'user_id': user_id,
            'product_id': current_item_id,
            'num_results': num_results
        }
        items = variation.resolver.get_items(**resolve_params)

        # Inject experiment details into recommended item list.
        rank = 1
        for item in items:
            correlation_id = self._create_correlation_id(user_id, variation_idx, rank)

            item_experiment = {
                'id': self.id,
                'feature': self.feature,
                'name': self.name,
                'type': self.type,
                'variationIndex': variation_idx,
                'resultRank': rank,
                'correlationId': correlation_id
            }

            item.update({
                'experiment': item_experiment
            })

            rank += 1

        ...

        return items

    def calculate_variation_index(self, user_id):
        """ Given a user_id and this experiment's configuration, return the variation

        The same variation will be returned for given user for this experiment no
        matter how many times this method is called.
        """
        if len(self.variations) == 0:
            return -1

        hash_str = f'experiments.{self.feature}.{self.name}.{user_id}'.encode('ascii')
        hash_int = int(hashlib.sha1(hash_str).hexdigest()[:15], 16)
        index = hash_int % len(self.variations)

        return index
```
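
To see the two properties of this assignment scheme in isolation (the same user always lands in the same variation, and users are split roughly evenly), the hashing step can be pulled out into a standalone function. This is a sketch for illustration only: `variation_index` is a hypothetical helper that copies the hashing logic shown above, and the integer user IDs are made up.

```python
import hashlib
from collections import Counter

def variation_index(feature, name, user_id, num_variations):
    """Standalone copy of the hashing step above, for illustration only."""
    if num_variations == 0:
        return -1
    hash_str = f'experiments.{feature}.{name}.{user_id}'.encode('ascii')
    hash_int = int(hashlib.sha1(hash_str).hexdigest()[:15], 16)
    return hash_int % num_variations

# Feature and experiment names match the ones used later in this notebook;
# the user IDs are hypothetical.
feature, name = 'home_product_recs', 'home_personalize_ab'
assignments = [variation_index(feature, name, uid, 2) for uid in range(10000)]

# Repeated calls for the same user return the same variation.
assert variation_index(feature, name, 42, 2) == variation_index(feature, name, 42, 2)

# The split should be close to even, though not exactly 50/50.
print(Counter(assignments))
```

Because the hash input includes the feature and experiment name, the same user can land in different variations for different experiments, which keeps assignments independent across experiments.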

### Setup - Import Dependencies

Throughout this workshop we will need access to some common libraries and clients for connecting to AWS services. Let's set those up now.

In [None]:
import boto3
import json
import uuid
import numpy as np
import requests
import pandas as pd
import random
import scipy.stats as scs
import time
import decimal
import matplotlib.pyplot as plt

from boto3.dynamodb.conditions import Key
from random import randint

# import custom scripts used for plotting
from src.plot import *
from src.stats import *

%matplotlib inline
plt.style.use('ggplot')

# We will be using a DynamoDB table to store configuration info for our experiments.
dynamodb = boto3.resource('dynamodb')

# Service discovery will allow us to dynamically discover Retail Demo Store resources
servicediscovery = boto3.client('servicediscovery')
# Retail Demo Store config parameters are stored in SSM
ssm = boto3.client('ssm')

# Utility class to convert types for printing as JSON.
class CompatEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, decimal.Decimal):
            if obj % 1 > 0:
                return float(obj)
            else:
                return int(obj)
        else:
            return super(CompatEncoder, self).default(obj)

### Sample Size Calculation

The first step is to determine the sample size necessary to reach a statistically significant result given a target of 25% gain in click-through rate from the home page. There are several sample size calculators available online including calculators from [Optimizely](https://www.optimizely.com/sample-size-calculator/?conversion=15&effect=20&significance=95), [AB Tasty](https://www.abtasty.com/sample-size-calculator/), and [Evan Miller](https://www.evanmiller.org/ab-testing/sample-size.html#!15;80;5;25;1). For this exercise, we will use the following function to calculate the minimal sample size for each variation.

In [None]:
def min_sample_size(bcr, mde, power=0.8, sig_level=0.05):
    """Returns the minimum sample size to set up a split test

    Arguments:
        bcr (float): probability of success for control, sometimes
        referred to as baseline conversion rate

        mde (float): minimum change in measurement between control
        group and test group if alternative hypothesis is true, sometimes
        referred to as minimum detectable effect

        power (float): probability of rejecting the null hypothesis when the
        null hypothesis is false, typically 0.8

        sig_level (float): significance level often denoted as alpha,
        typically 0.05

    Returns:
        min_N: minimum sample size (float)

    References:
        Stanford lecture on sample sizes
        http://statweb.stanford.edu/~susan/courses/s141/hopower.pdf
    """
    # standard normal distribution to determine z-values
    standard_norm = scs.norm(0, 1)

    # find Z_beta from desired power
    Z_beta = standard_norm.ppf(power)

    # find Z_alpha
    Z_alpha = standard_norm.ppf(1 - sig_level / 2)

    # average of probabilities from both groups
    pooled_prob = (bcr + (bcr + mde)) / 2

    min_N = (2 * pooled_prob * (1 - pooled_prob) * (Z_beta + Z_alpha)**2
             / mde**2)

    return min_N

In [None]:
# This is the conversion rate using the current implementation
baseline_conversion_rate = 0.15
# This is the lift expected by adding personalization
absolute_percent_lift = baseline_conversion_rate * .25

# Calculate the sample size needed to reach a statistically significant result
sample_size = int(min_sample_size(baseline_conversion_rate, absolute_percent_lift))

print('Sample size for each variation: ' + str(sample_size))
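
As a cross-check, the same formula can be evaluated step by step using only the standard library. This is a sketch under the defaults used above (power of 0.8, two-sided significance level of 0.05); `statistics.NormalDist` stands in for `scipy.stats.norm`, and the variable names are illustrative.

```python
from statistics import NormalDist

bcr = 0.15            # baseline conversion rate
mde = bcr * 0.25      # absolute minimum detectable effect (0.0375)

z_beta = NormalDist().inv_cdf(0.8)             # z-value for power = 0.8
z_alpha = NormalDist().inv_cdf(1 - 0.05 / 2)   # z-value for two-sided alpha = 0.05

# Average of the control and expected test conversion rates
pooled = (bcr + (bcr + mde)) / 2
n_per_variation = 2 * pooled * (1 - pooled) * (z_beta + z_alpha) ** 2 / mde ** 2

print(f'z_beta={z_beta:.4f}, z_alpha={z_alpha:.4f}, n={n_per_variation:.1f}')
```

With these inputs the result should come out at roughly 1,566 users per variation before `int()` truncates it. Note how sensitive the result is to `mde`: because it appears squared in the denominator, halving the detectable effect roughly quadruples the required sample size.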

### Experiment Strategy Datastore

With our sample size defined, let's create an experiment strategy for our A/B experiment. Walk through each of the following steps to configure your environment.

A DynamoDB table was created by the Retail Demo Store CloudFormation template that we will use to store the configuration information for our experiments. The table name can be found in an SSM parameter.

In [None]:
response = ssm.get_parameter(Name='retaildemostore-experiment-strategy-table-name')

table_name = response['Parameter']['Value'] # Do Not Change
print('Experiments DDB table: ' + table_name)
table = dynamodb.Table(table_name)

Next we need to look up the Amazon Personalize campaign ARN for product recommendations. This is the campaign that was created in the [Personalization workshop](../1-Personalization/personalize.ipynb) (or was pre-built for you depending on your workshop event).

In [None]:
response = ssm.get_parameter(Name = '/retaildemostore/personalize/recommended-for-you-arn')

campaign_arn = response['Parameter']['Value'] # Do Not Change
print('Personalize product recommendations ARN: ' + campaign_arn)

### Create A/B Experiment

The Retail Demo Store supports running multiple experiments concurrently. For this workshop we will create a single A/B test/experiment that uniformly splits users between a control group that receives recommendations from the default behavior and a test group that receives recommendations from Amazon Personalize. The Recommendations service already has logic that supports A/B tests once an active experiment is detected in our Experiment Strategy DynamoDB table.

Experiment configurations are stored in a DynamoDB table where each item in the table represents an experiment and has the following fields.

- **id** - Uniquely identifies this experiment (UUID).
- **feature** - Identifies the Retail Demo Store feature where the experiment should be applied. The name for the home page product recommendations feature is `home_product_recs`.
- **name** - The name of the experiment. Keep the name short but descriptive. It will be used in the UI for demo purposes and when logging events for experiment result tracking.
- **status** - The status of the experiment (`ACTIVE`, `EXPIRED`, or `PENDING`).
- **type** - The type of test (`ab` for an A/B test, `interleaving` for interleaved recommendations, or `mab` for multi-armed bandit test)
- **variations** - List of configurations representing variations for the experiment. For example, for A/B tests of the `home_product_recs` feature, the `variations` can be two Amazon Personalize campaign ARNs (variation type `personalize-recommendations`) or a single Personalize campaign ARN and the default product behavior.

In [None]:
feature = 'home_product_recs'
experiment_name = 'home_personalize_ab'

# First, make sure there are no other active experiments so we can isolate
# this experiment for the exercise (to keep things clean/simple).
response = table.scan(
    ProjectionExpression='#k',
    ExpressionAttributeNames={'#k' : 'id'},
    FilterExpression=Key('status').eq('ACTIVE')
)
for item in response['Items']:
    response = table.update_item(
        Key=item,
        UpdateExpression='SET #s = :inactive',
        ExpressionAttributeNames={
            '#s' : 'status'
        },
        ExpressionAttributeValues={
            ':inactive' : 'INACTIVE'
        }
    )

# Query the experiment strategy table to see if our experiment already exists
response = table.query(
    IndexName='feature-name-index',
    KeyConditionExpression=Key('feature').eq(feature) & Key('name').eq(experiment_name),
    FilterExpression=Key('status').eq('ACTIVE')
)

if response.get('Items') and len(response.get('Items')) > 0:
    print('Experiment already exists')
    home_page_experiment = response['Items'][0]
else:
    print('Creating experiment')

    # Default product resolver
    variation_0 = {
        'type': 'product'
    }

    # Amazon Personalize resolver
    variation_1 = {
        'type': 'personalize-recommendations',
        'inference_arn': campaign_arn
    }

    home_page_experiment = {
        'id': uuid.uuid4().hex,
        'feature': feature,
        'name': experiment_name,
        'status': 'ACTIVE',
        'type': 'ab',
        'variations': [ variation_0, variation_1 ]
    }

    response = table.put_item(
        Item=home_page_experiment
    )

    print(json.dumps(response, indent=4))

print('Experiment item:')
print(json.dumps(home_page_experiment, indent=4, cls=CompatEncoder))

## Load Users

For our experiment simulation, we will load all Retail Demo Store users and run the experiment until the sample size for both variations has been met.

First, let's discover the IP address for the Retail Demo Store's [Users](https://github.com/aws-samples/retail-demo-store/tree/master/src/users) service.

In [None]:
response = servicediscovery.discover_instances(
    NamespaceName='retaildemostore.local',
    ServiceName='users',
    MaxResults=1,
    HealthStatus='HEALTHY'
)

users_service_instance = response['Instances'][0]['Attributes']['AWS_INSTANCE_IPV4']
print('Users Service Instance IP: {}'.format(users_service_instance))
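
The lookup above assumes at least one healthy instance is returned; if none is, indexing `response['Instances'][0]` fails with an `IndexError` that does not explain the cause. One way to make that failure clearer is a small wrapper like the sketch below. `first_instance_ip` is a hypothetical helper, not part of the Retail Demo Store code; it only relies on the `discover_instances` call and response shape already used in this notebook.

```python
def first_instance_ip(client, service_name, namespace='retaildemostore.local'):
    """Return the IPv4 address of the first healthy instance, or raise if none is found.

    `client` is expected to be a boto3 'servicediscovery' client (or any object
    exposing a compatible discover_instances method). Hypothetical helper for
    illustration; not part of the Retail Demo Store code.
    """
    response = client.discover_instances(
        NamespaceName=namespace,
        ServiceName=service_name,
        MaxResults=1,
        HealthStatus='HEALTHY'
    )
    instances = response.get('Instances', [])
    if not instances:
        raise RuntimeError(f'No healthy instances found for service "{service_name}"')
    return instances[0]['Attributes']['AWS_INSTANCE_IPV4']

# Example usage in this notebook (requires AWS access, so it is left commented out):
# users_service_instance = first_instance_ip(servicediscovery, 'users')
```

The same helper would work for the Recommendations service by passing `'recommendations'` as the service name.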

Next, let's fetch all users, randomize their order, and load them into a local data frame.

In [None]:
# Load all users so we have enough to satisfy our sample size requirements.
response = requests.get('http://{}/users/all?count=10000'.format(users_service_instance))
users = response.json()
random.shuffle(users)
users_df = pd.DataFrame(users)
pd.set_option('display.max_rows', 5)

users_df

## Discover Recommendations Service

Next, let's discover the IP address for the Retail Demo Store's [Recommendations](https://github.com/aws-samples/retail-demo-store/tree/master/src/recommendations) service. This is the service where the Experimentation framework is implemented and the `/recommendations` endpoint is what we call to simulate our A/B experiment.

In [None]:
response = servicediscovery.discover_instances(
    NamespaceName='retaildemostore.local',
    ServiceName='recommendations',
    MaxResults=1,
    HealthStatus='HEALTHY'
)

recommendations_service_instance = response['Instances'][0]['Attributes']['AWS_INSTANCE_IPV4']
print('Recommendation Service Instance IP: {}'.format(recommendations_service_instance))

## Simulate Experiment

Next we will define a function to simulate our A/B experiment by making calls to the [Recommendations](https://github.com/aws-samples/retail-demo-store/tree/master/src/recommendations) service across the users we just loaded. Then we will run our simulation.

### Simulation Function

The following `simulate_experiment` function is supplied with the sample size for each group (A and B) and the probability of conversion for each group that we want to use for our simulation. It runs the simulation long enough to satisfy the sample size requirements and calls the Recommendations service for each user in the experiment.

In [None]:
def simulate_experiment(N_A, N_B, p_A, p_B):
    """Returns a pandas dataframe with simulated CTR data

    Parameters:
        N_A (int): sample size for control group
        N_B (int): sample size for test group
            Note: final sample size may not match N_A & N_B provided because the
            group at each row is chosen by the ABExperiment class.
        p_A (float): conversion rate of control group
        p_B (float): conversion rate of test group

    Returns:
        df (df)
    """

    # will hold exposure/outcome data
    data = []

    # total number of users to sample for both variations
    N = N_A + N_B

    if N > len(users):
        raise ValueError('Sample size is greater than number of users')

    print('Generating data for {} users... this may take a few minutes'.format(N))

    # initiate bernoulli distributions to randomly sample from based on simulated probabilities
    A_bern = scs.bernoulli(p_A)
    B_bern = scs.bernoulli(p_B)

    for idx in range(N):
        if idx > 0 and idx % 500 == 0:
            print('Generated data for {} users so far'.format(idx))

        # initiate empty row
        row = {}

        # Get next user from shuffled list
        user = users[idx]

        # Call Recommendations web service to get recommendations for the user
        response = requests.get('http://{}/recommendations?userID={}&feature={}'.format(recommendations_service_instance, user['id'], feature))

        recommendations = response.json()
        recommendation = recommendations[randint(0, len(recommendations)-1)]

        variation = recommendation['experiment']['variationIndex']
        row['variation'] = variation

        # Determine if variation converts based on probabilities provided
        if variation == 0:
            row['converted'] = A_bern.rvs()
        else:
            row['converted'] = B_bern.rvs()

        if row['converted'] == 1:
            # Update experiment with outcome/conversion
            correlation_id = recommendation['experiment']['correlationId']
            requests.post('http://{}/experiment/outcome'.format(recommendations_service_instance), data={'correlationId':correlation_id})

        data.append(row)

    # convert data into dataframe
    df = pd.DataFrame(data)

    print('Done')

    return df
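
To get a feel for what this simulation produces without calling the Recommendations service, the sketch below approximates it offline. It is not how this notebook collects its data: `simulate_offline` is a hypothetical function that reproduces the hash-based assignment shown earlier and draws conversions locally, it uses made-up integer user IDs, and it does not update the exposure or conversion counters in DynamoDB.

```python
import hashlib
import random

def simulate_offline(user_ids, p_A, p_B, feature='home_product_recs',
                     name='home_personalize_ab', seed=0):
    """Assign each user to a variation by hashing and draw a simulated conversion."""
    rng = random.Random(seed)
    rows = []
    for user_id in user_ids:
        # Same hashing scheme as ABExperiment.calculate_variation_index, with two variations
        hash_str = f'experiments.{feature}.{name}.{user_id}'.encode('ascii')
        variation = int(hashlib.sha1(hash_str).hexdigest()[:15], 16) % 2
        # Bernoulli draw using the conversion probability for the assigned variation
        p = p_A if variation == 0 else p_B
        rows.append({'variation': variation, 'converted': int(rng.random() < p)})
    return rows

rows = simulate_offline(range(4000), p_A=0.15, p_B=0.1875)
for v in (0, 1):
    group = [r['converted'] for r in rows if r['variation'] == v]
    print(f'variation {v}: n={len(group)}, rate={sum(group) / len(group):.4f}')
```

As with the real simulation, the group sizes will not be exactly equal because assignment depends on the hash of each user ID rather than on a fixed quota per group.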

### Run Simulation

Next we run the simulation by defining our simulation parameters for sample sizes and probabilities and then call `simulate_experiment`. This will take several minutes depending on the sample sizes.

In [None]:
%%time

# Set size of both groups to calculated sample size
N_A = N_B = sample_size

# Use probabilities from our hypothesis
# p_A: baseline conversion rate (bcr) for the control group
p_A = 0.15
# p_B: expected conversion rate for the test group, i.e. the baseline plus the
# hypothesized 25% relative lift (0.15 + 0.0375)
p_B = 0.1875

# Run simulation
ab_data = simulate_experiment(N_A, N_B, p_A, p_B)

ab_data

### Inspect Experiment Summary Statistics

Since the **Experiment** class updates statistics for the experiment in the experiment strategy DynamoDB table when a user is exposed to an experiment ("exposure") and when a user converts ("outcome"), we should see updated counts on our experiment. Let's reload our experiment and inspect the exposure and conversion counts for our simulation.

In [None]:
# Query DDB table for experiment item.
response = table.get_item(Key={'id': home_page_experiment['id']})

print(json.dumps(response['Item'], indent=4, cls=CompatEncoder))

You should now see counters for `conversions` and `exposures` for each variation. These represent how many times a user has been exposed to a variation and how many times a user has converted for a variation (i.e. clicked on a recommended item/product).

### Analyze Simulation Results

Next, let's take a closer look at the results of our simulation. We'll start by calculating some summary statistics.

In [None]:
ab_summary = ab_data.pivot_table(values='converted', index='variation', aggfunc=np.sum)
# add additional columns to the pivot table
ab_summary['total'] = ab_data.pivot_table(values='converted', index='variation', aggfunc=lambda x: len(x))
ab_summary['rate'] = ab_data.pivot_table(values='converted', index='variation')

In [None]:
ab_summary

The output above tells us how many users converted for each variation, the actual sample size for each variation in the simulation, and the conversion rate for each variation.

Next let's isolate the data and conversion counts for each variation.

In [None]:
A_group = ab_data[ab_data['variation'] == 0]
B_group = ab_data[ab_data['variation'] == 1]
A_converted, B_converted = A_group['converted'].sum(), B_group['converted'].sum()

In [None]:
A_converted, B_converted

Isolate the actual sample size for each variation.

In [None]:
A_total, B_total = len(A_group), len(B_group)
A_total, B_total

Calculate the actual conversion rates and uplift for our simulation.

In [None]:
p_A, p_B = A_converted / A_total, B_converted / B_total
p_A, p_B

In [None]:
p_B - p_A

### Determining Statistical Significance

In statistical hypothesis testing there are two types of errors that can occur. These are referred to as type 1 and type 2 errors.

Type 1 errors occur when the null hypothesis is true but is rejected. In other words, a "false positive" conclusion. Put in A/B testing terms, a type 1 error is when we conclude a statistically significant result when there isn't one.

Type 2 errors occur when we conclude that there is not a winner between two variations when in fact there is an actual winner. In other words, the null hypothesis is false yet we fail to reject it. Therefore, type 2 errors are a "false negative" conclusion.

The probability of making a type 1 error is denoted "α" (alpha) and the probability of making a type 2 error is denoted "β" (beta). The power of the test is the probability of not committing a type 2 error, which is equal to 1-β.
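
One way to build intuition for alpha is to simulate many A/A experiments, where both groups share the same true conversion rate, and count how often a two-sided test at the 5% level still declares a difference. The sketch below does this with the standard library only; the group size, number of runs, and seed are arbitrary choices for illustration, and the test used is a pooled two-proportion z-test rather than anything taken from the Retail Demo Store code.

```python
import random
from statistics import NormalDist

rng = random.Random(7)                        # fixed seed so the sketch is repeatable
runs, n, p = 500, 500, 0.15                   # illustrative sizes, kept small so this runs quickly
z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)   # two-sided critical value for alpha = 0.05

false_positives = 0
for _ in range(runs):
    # Both groups share the same true conversion rate, so the null hypothesis holds.
    conv_a = sum(rng.random() < p for _ in range(n))
    conv_b = sum(rng.random() < p for _ in range(n))

    # Pooled two-proportion z statistic for the observed difference
    pooled = (conv_a + conv_b) / (2 * n)
    se = (pooled * (1 - pooled) * 2 / n) ** 0.5
    z = (conv_b - conv_a) / n / se if se > 0 else 0.0

    if abs(z) > z_crit:
        false_positives += 1

type_1_rate = false_positives / runs
print(f'type 1 error rate across {runs} A/A runs: {type_1_rate:.3f}')
```

The printed rate should hover around 5%, varying from run to run with the seed and the number of simulated experiments.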

Let's inspect the results of our simulation more closely to verify that it is statistically significant.

#### Calculate p-value

Formally, the p-value is the probability of observing a result at least as extreme as the one measured, assuming that the null hypothesis is true. In other words, the p-value tells us how likely it is that random fluctuation alone would produce the observed difference. As an example, imagine we ran an A/A test where we displayed the same variation to two groups of users. After such an experiment we would expect the conversion rates of both groups to be very similar but not identical, with the small difference between them attributable to chance.

What we are hoping to see is a p-value that is less than our significance level. The significance level we used when calculating our sample size was 5%, which means we are seeking results at a 95% confidence level. A significance level of 5% is a widely used convention.

In [None]:
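# Probability of observing B's conversion count under A's binomial distribution.
# B_converted is used directly because p_B * B_total can pick up floating point
# error, and the pmf is zero for non-integer counts.
p_value = scs.binom(A_total, p_A).pmf(B_converted)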
print('p-value = {0:0.9f}'.format(p_value))

Is the p-value less than the significance level of 5%? The significance level is the probability of a type 1 error that we are willing to accept.
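
The calculation above evaluates the binomial probability at a single point, which is a simplification. A more conventional alternative, swapped in here for comparison, is the pooled two-proportion z-test, which accounts for both group sizes. The sketch below uses only the standard library; `two_proportion_p_value` is a hypothetical helper and the conversion counts are made up for illustration rather than taken from a simulation run.

```python
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    # Pooled conversion rate under the null hypothesis that both groups are equal
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    # Probability of a z statistic at least this far from zero in either direction
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts for illustration: 235 of 1565 vs. 293 of 1565 conversions
p_value_z = two_proportion_p_value(235, 1565, 293, 1565)
print(f'p-value = {p_value_z:.6f}')
```

In the notebook, the same function could be called with `A_converted`, `A_total`, `B_converted`, and `B_total` from the cells above.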

Let's plot the data from both groups as binomial distributions.

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
xA = np.linspace(A_converted-49, A_converted+50, 100)
yA = scs.binom(A_total, p_A).pmf(xA)
ax.scatter(xA, yA, s=10)
xB = np.linspace(B_converted-49, B_converted+50, 100)
yB = scs.binom(B_total, p_B).pmf(xB)
ax.scatter(xB, yB, s=10)
plt.xlabel('converted')
plt.ylabel('probability')

Based on the probabilities from our hypothesis, we should see that the test group in blue (B) converted more users than the control group in red (A). However, the plot above is not a plot of the null and alternate hypotheses. The null hypothesis is stated in terms of the difference between the conversion probabilities of the two groups.

> Given the randomness of our user selection, group hashing, and probabilities, your simulation results should be different for each simulation run and therefore may or may not be statistically significant.

In order to calculate the difference between the two groups, we need to standardize the data. Because the number of samples can be different between the two groups, we should compare the probability of successes, p.

According to the central limit theorem, by calculating many sample means we can approximate the true mean of the population from which the data for the control group was taken. The distribution of the sample means will be normally distributed around the true mean with a standard deviation equal to the standard error of the mean.

In [None]:
SE_A = np.sqrt(p_A * (1-p_A)) / np.sqrt(A_total)
SE_B = np.sqrt(p_B * (1-p_B)) / np.sqrt(B_total)
SE_A, SE_B

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
xA = np.linspace(0, .3, A_total)
yA = scs.norm(p_A, SE_A).pdf(xA)
ax.plot(xA, yA)
ax.axvline(x=p_A, c='red', alpha=0.5, linestyle='--')

xB = np.linspace(0, .3, B_total)
yB = scs.norm(p_B, SE_B).pdf(xB)
ax.plot(xB, yB)
ax.axvline(x=p_B, c='blue', alpha=0.5, linestyle='--')

plt.xlabel('Converted Proportion')
plt.ylabel('PDF')

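The dashed lines represent the mean conversion rate for each group. The distance between the red dashed line and the blue dashed line is equal to d_hat, the observed difference in conversion rate between the two groups, which we can compare against the minimum detectable effect from our hypothesis.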

In [None]:
p_A_actual = ab_summary.loc[0, 'rate']
p_B_actual = ab_summary.loc[1, 'rate']
bcr = p_A_actual
d_hat = p_B_actual - p_A_actual
A_total, B_total, bcr, d_hat
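
Since d_hat is an estimate from a single simulation run, it helps to put an interval around it. The sketch below computes a normal-approximation confidence interval for the difference in conversion rates using the unpooled standard error. `diff_confidence_interval` is a hypothetical helper and the counts are made up for illustration; this is an approximation, not part of the notebook's `src.stats` module.

```python
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Normal-approximation confidence interval for p_B - p_A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error of the difference between two proportions
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical counts for illustration: 235 of 1565 vs. 293 of 1565 conversions
low, high = diff_confidence_interval(235, 1565, 293, 1565)
print(f'95% CI for d_hat: [{low:.4f}, {high:.4f}]')
```

If the interval excludes zero, the observed uplift is unlikely to be due to chance alone at the chosen confidence level.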

Finally, let's calculate the power, alpha, and beta from our simulation.

In [None]:
abplot(A_total, B_total, bcr, d_hat, show_power=True)

The power value we used when determining our sample size for our experiment was 80%. This is considered the industry standard. Is the power value calculated in the plot above greater than 80%?

In [None]:
abplot(A_total, B_total, bcr, d_hat, show_beta=True)

In [None]:
abplot(A_total, B_total, bcr, d_hat, show_alpha=True)

Is the alpha value plotted above within our significance level of 5%, and is the beta value within the 20% implied by our 80% power target? If so, we have a statistically significant result with the power we planned for.
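
The relationship between sample size, effect size, alpha, and power can also be checked numerically. The sketch below is an independent normal approximation and is not necessarily how `abplot` computes the values it displays; `approximate_power` is a hypothetical helper, and the inputs are the planned values from our hypothesis rather than results from a simulation run.

```python
from statistics import NormalDist

def approximate_power(n_a, n_b, bcr, d_hat, sig_level=0.05):
    """Approximate power of a two-sided test for a difference of d_hat."""
    # Pooled rate and standard error of the difference between the two groups
    pooled = (bcr * n_a + (bcr + d_hat) * n_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z_alpha = NormalDist().inv_cdf(1 - sig_level / 2)
    # Probability that the observed difference exceeds the upper critical value
    # when the true difference is d_hat (the negligible lower tail is ignored).
    return 1 - NormalDist().cdf(z_alpha - d_hat / se)

# Planned values: about 1565 users per variation, 15% baseline, 0.0375 absolute lift
power = approximate_power(1565, 1565, 0.15, 0.0375)
print(f'power={power:.3f}, beta={1 - power:.3f}')
```

With the planned sample size and effect, the power should land close to the 0.8 target used in the sample size calculation, which is a useful consistency check between the two calculations.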

## Next Steps

You have completed the exercise for implementing an A/B test using the experimentation framework in the Retail Demo Store. Close this notebook and open the notebook for the next exercise, **[3.3-Interleaving-Experiment](./3.3-Interleaving-Experiment.ipynb)**.

### References and Further Reading

- [A/B testing](https://en.wikipedia.org/wiki/A/B_testing), Wikipedia
- [A/B testing](https://www.optimizely.com/optimization-glossary/ab-testing/), Optimizely
- [Evan's Awesome A/B Tools](https://www.evanmiller.org/ab-testing/), Evan Miller