# Retail Demo Store Experimentation Workshop - Interleaving Recommendation Exercise

In this exercise we will define, launch, and evaluate the results of an experiment that uses recommendation interleaving, built on the experimentation framework implemented in the Retail Demo Store project. If you have not already stepped through the **[3.1-Overview](./3.1-Overview.ipynb)** workshop notebook, please do so now as it provides the foundation built upon in this exercise. It is also recommended, but not required, to complete the **[3.2-AB-Experiment](./3.2-AB-Experiment.ipynb)** workshop notebook.

Recommended Time: 30 minutes

## Prerequisites

Since this module uses the Retail Demo Store's Recommendation microservice to run experiments across variations that depend on the personalization features of the Retail Demo Store, it is assumed that you have either completed the [Personalization](../1-Personalization/Lab-1-Introduction-and-data-preparation.ipynb) workshop or those resources have been pre-provisioned in your AWS environment. If you are unsure and attending an AWS managed event such as a workshop, check with your event lead.

## Exercise 2: Interleaving Recommendations Experiment

In the first exercise, **[3.2-AB-Experiment](./3.2-AB-Experiment.ipynb)**, we demonstrated how to create and run an A/B experiment using two different variations for making product recommendations. We calculated the sample sizes of users needed to reach a statistically significant result comparing the two variations. Then we ran the experiment using a simulation until the sample sizes were reached for both variations. In real life, depending on the baseline conversion rate and minimum detectable effect combined with your site's user traffic, completing an experiment can take several days to a few weeks. This can be expensive, both in opportunity cost and in slowing the pace at which experiments and changes can be rolled out to your site.

In this exercise we will look at an alternative approach to evaluating product recommendation variations that requires a smaller sample size and shorter experiment durations. This technique is often used as a preliminary step before formal A/B testing to reduce a larger number of variations to just the top performers. Traditional A/B testing is then done against the best performing variations, significantly reducing the overall time necessary for experimentation.

We will use the same two variations as the last exercise. The first variation will represent our current implementation using the [**Default Product Resolver**](https://github.com/aws-samples/retail-demo-store/blob/master/src/recommendations/src/recommendations-service/experimentation/resolvers.py) and the second variation will use the [**Personalize Recommendation Resolver**](https://github.com/aws-samples/retail-demo-store/blob/master/src/recommendations/src/recommendations-service/experimentation/resolvers.py). The scenario we are simulating is adding product recommendations powered by Amazon Personalize to the home page and measuring the impact/uplift in click-throughs for products as a result of deploying a personalization strategy. We will use the same hypothesis from our A/B test where the conversion rate of our existing approach is 15% and we expect a 25% lift in this rate by adding personalized recommendations.
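The arithmetic behind this hypothesis can be sketched as follows. This is a minimal illustration that assumes the 25% lift is relative to the baseline (which matches the `0.1875` value used later in the simulation); the variable names are chosen for this example only.

```python
# Hypothesis from the text: 15% baseline conversion rate, 25% expected lift.
baseline_rate = 0.15   # conversion rate of the existing approach (variation A)
relative_lift = 0.25   # assumed to be a relative, not absolute, improvement

# Expected conversion rate for the personalized variation (variation B)
target_rate = baseline_rate * (1 + relative_lift)

# Absolute difference between the two rates
absolute_lift = target_rate - baseline_rate

print(f"target conversion rate: {target_rate:.4f}")
print(f"absolute lift: {absolute_lift:.4f}")
```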

### What is Interleaving Recommendation Testing?

The approach of interleaving recommendations is to take the recommendations from two or more variations and interleave, or blend, them into a single set of recommendations for *every user in the experiment*. Because each user in the sample is exposed to recommendations from all variations, we gain some key benefits. First, the sample size can be smaller since we don't need separate groups of users for each variation. This also results in a shorter experiment duration. Additionally, this approach is less susceptible to variances in user type and behavior that could throw off the results of an experiment. For example, it's not uncommon to have power users who shop/watch/listen/read much more than a typical user. With multiple sample groups, the behavior of these users can throw off results for their group, particularly with smaller sample sizes.

Care must be taken in how recommendations are interleaved, though, to account for position bias in the recommendations and to track variation attribution. There are two common methods for interleaving recommendations. The first is a balanced approach where recommendations are taken from each variation in alternating order, with the starting variation selected randomly. The other approach follows the team-draft analogy, where team captains take turns selecting their "best player" (recommendation) from the variations, with the selection order chosen randomly. The two methods can produce different interleaved outputs.
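The two methods described above can be sketched as follows. This is a simplified illustration and not the Recommendations service implementation: the function names, the `(item, variation_index)` return shape, and the sample product IDs are all assumptions made for this example. Each function skips items that another variation has already contributed and records which variation supplied each item so that clicks can be attributed.

```python
import random

def _next_unseen(ranking, start, seen):
    """Return (index, item) for the first item at or after `start` not in `seen`, else None."""
    for idx in range(start, len(ranking)):
        if ranking[idx] not in seen:
            return idx, ranking[idx]
    return None

def interleave_balanced(rankings, num_results, rng=random):
    """Alternate strictly between variations, starting with a randomly chosen one."""
    positions = [0] * len(rankings)
    seen, result, exhausted = set(), [], set()
    turn = rng.randrange(len(rankings))  # random starting variation
    while len(result) < num_results and len(exhausted) < len(rankings):
        found = _next_unseen(rankings[turn], positions[turn], seen)
        if found is None:
            exhausted.add(turn)  # this variation has nothing new left
        else:
            idx, item = found
            positions[turn] = idx + 1
            seen.add(item)
            result.append((item, turn))
        turn = (turn + 1) % len(rankings)
    return result

def interleave_team_draft(rankings, num_results, rng=random):
    """In each round, variations pick their best remaining item in a random order."""
    positions = [0] * len(rankings)
    seen, result = set(), []
    while len(result) < num_results:
        order = list(range(len(rankings)))
        rng.shuffle(order)  # random pick order for this round
        added = False
        for team in order:
            if len(result) >= num_results:
                break
            found = _next_unseen(rankings[team], positions[team], seen)
            if found is None:
                continue
            idx, item = found
            positions[team] = idx + 1
            seen.add(item)
            result.append((item, team))
            added = True
        if not added:  # every variation is exhausted
            break
    return result

# Hypothetical rankings from two variations
ranker_a = ['p1', 'p2', 'p3', 'p4']
ranker_b = ['p2', 'p5', 'p1', 'p6']
rng = random.Random(42)
print(interleave_balanced([ranker_a, ranker_b], 4, rng))
print(interleave_team_draft([ranker_a, ranker_b], 4, rng))
```

Because the starting variation and the per-round pick order are random, repeated calls can return different orderings for the same inputs, which is what spreads position bias across variations.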

Interleaving recommendations as an approach to experimentation got its start with information retrieval systems and search engines (Yahoo! & Bing) where different approaches to ranking results could be measured concurrently. More recently, [Netflix has adopted the interleaving technique](https://medium.com/netflix-techblog/interleaving-in-online-experiments-at-netflix-a04ee392ec55) to rapidly evaluate different approaches to making movie recommendations to its users. The image below depicts the recommendations from two different recommenders/variations (Ranker A and Ranker B) and examples of how they are interleaved.

![Interleaving at Netflix](./images/netflix-interleaving.png)


### InterleavingExperiment Class

Before stepping through creating and executing our interleaving test, let's look at the relevant source code for the [**InterleavingExperiment**](https://github.com/aws-samples/retail-demo-store/blob/master/src/recommendations/src/recommendations-service/experimentation/experiment_interleaving.py) class that implements this experiment type in the Retail Demo Store project.

As noted in the **[3.1-Overview](./3.1-Overview.ipynb)** notebook, all experiment types are subclasses of the abstract **Experiment** class. See **[3.1-Overview](./3.1-Overview.ipynb)** for more details on the experimentation framework.

The `InterleavingExperiment.get_items()` method is where item recommendations are retrieved for the experiment. This method will retrieve recommendations from the resolvers for all variations and then use the configured interleaving method (balanced or team-draft) to interleave the recommendations to produce the final result. Exposure tracking is also implemented to facilitate measuring the outcome of an experiment. The implementations for the balanced and team-draft interleaving methods are not included below but are available in the source code for the Recommendations service.

```python
# from src/recommendations/src/recommendations-service/experimentation/experiment_interleaving.py

class InterleavingExperiment(Experiment):
    """ Implements interleaving technique described in research paper by
    Chapelle et al http://olivier.chapelle.cc/pub/interleaving.pdf
    """
    METHOD_BALANCED = 'balanced'
    METHOD_TEAM_DRAFT = 'team-draft'

    def __init__(self, table, **data):
        super(InterleavingExperiment, self).__init__(table, **data)
        self.method = data.get('method', InterleavingExperiment.METHOD_BALANCED)

    def get_items(self, user_id, current_item_id = None, item_list = None, num_results = 10, tracker = None):
        ...

        # Initialize array structure to hold item recommendations for each variation
        variations_data = [[] for x in range(len(self.variations))]

        # Get recommended items for each variation
        for i in range(len(self.variations)):
            resolve_params = {
                'user_id': user_id,
                'product_id': current_item_id,
                'product_list': item_list,
                'num_results': num_results * 3  # account for overlaps
            }
            variation = self.variations[i]
            items = variation.resolver.get_items(**resolve_params)
            variations_data[i] = items

        # Interleave items to produce result
        interleaved = []
        if self.method == InterleavingExperiment.METHOD_TEAM_DRAFT:
            interleaved = self._interleave_team_draft(user_id, variations_data, num_results)
        else:
            interleaved = self._interleave_balanced(user_id, variations_data, num_results)

        # Increment exposure for each variation (can be optimized)
        for i in range(len(self.variations)):
            self._increment_exposure_count(i)

        ...

        return interleaved
```

### Setup - Import Dependencies

Throughout this workshop we will need access to some common libraries and clients for connecting to AWS services. Let's set those up now.

In [None]:
import boto3
import json
import uuid
import numpy as np
import requests
import pandas as pd
import random
import scipy.stats as scs
import time
import decimal
import matplotlib.pyplot as plt

from boto3.dynamodb.conditions import Key
from random import randint

# import custom scripts for plotting results
from src.plot import *
from src.stats import *

%matplotlib inline
plt.style.use('ggplot')

# We will be using a DynamoDB table to store configuration info for our experiments.
dynamodb = boto3.resource('dynamodb')

# Service discovery will allow us to dynamically discover Retail Demo Store resources
servicediscovery = boto3.client('servicediscovery')
# Retail Demo Store config parameters are stored in SSM
ssm = boto3.client('ssm')

# Utility class to convert types for printing as JSON.
class CompatEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, decimal.Decimal):
            if obj % 1 > 0:
                return float(obj)
            else:
                return int(obj)
        else:
            return super(CompatEncoder, self).default(obj)

### Experiment Strategy Datastore

Let's create an experiment using the interleaving technique.

A DynamoDB table was created by the Retail Demo Store CloudFormation template that we will use to store the configuration information for our experiments. The table name can be found in a system parameter.

In [None]:
response = ssm.get_parameter(Name='retaildemostore-experiment-strategy-table-name')

table_name = response['Parameter']['Value'] # Do Not Change
print('Experiments DDB table: ' + table_name)
table = dynamodb.Table(table_name)

Next we need to look up the Amazon Personalize campaign ARN for product recommendations. This is the campaign that was created in the Personalization workshop.

In [None]:
response = ssm.get_parameter(Name = '/retaildemostore/personalize/recommended-for-you-arn')

inference_arn = response['Parameter']['Value'] # Do Not Change
print('Personalize product recommendations ARN: ' + inference_arn)

### Create Interleaving Experiment

The Retail Demo Store supports running multiple experiments concurrently. For this workshop we will create a single interleaving test/experiment that will expose users of a single group to recommendations from the default behavior and recommendations from Amazon Personalize. The [Recommendations](https://github.com/aws-samples/retail-demo-store/tree/master/src/recommendations) microservice already has logic that supports interleaving experiments when an active experiment is detected.

Experiment configurations are stored in a DynamoDB table where each item in the table represents an experiment and has the following fields.

- **id** - Uniquely identifies this experiment (UUID).
- **feature** - Identifies the Retail Demo Store feature where the experiment should be applied. The name for the home page product recommendations feature is `home_product_recs`.
- **name** - The name of the experiment. Keep the name short but descriptive. It will be used in the UI for demo purposes and when logging events for experiment result tracking.
- **status** - The status of the experiment (`ACTIVE`, `EXPIRED`, or `PENDING`).
- **type** - The type of test (`ab` for an A/B test, `interleaving` for interleaved recommendations, or `mab` for multi-armed bandit test)
- **method** - The interleaving method (`balanced` or `team-draft`)
- **variations** - List of configurations representing variations for the experiment. For example, for interleaving tests of the `home_product_recs` feature, the `variations` can be two Amazon Personalize campaign ARNs (variation type `personalize-recommendations`) or a single Personalize campaign ARN and the default product behavior.
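As an illustration of how these fields fit together, the sketch below assembles an experiment item that interleaves two Amazon Personalize campaigns using the balanced method. The helper `build_interleaving_experiment` is hypothetical and not part of the Retail Demo Store, the ARNs are placeholders, and only the fields listed above are included; it builds a dictionary and does not write to DynamoDB.

```python
import uuid

def build_interleaving_experiment(feature, name, variations, method='balanced'):
    """Hypothetical helper: assemble an interleaving experiment item from the listed fields."""
    if method not in ('balanced', 'team-draft'):
        raise ValueError("method must be 'balanced' or 'team-draft'")
    if len(variations) < 2:
        raise ValueError('interleaving needs at least two variations')
    return {
        'id': uuid.uuid4().hex,
        'feature': feature,
        'name': name,
        'status': 'PENDING',      # not yet active
        'type': 'interleaving',
        'method': method,
        'variations': variations
    }

# Placeholder ARNs for illustration only; substitute real campaign ARNs.
example_experiment = build_interleaving_experiment(
    feature='home_product_recs',
    name='home_two_campaign_interleaving',
    variations=[
        {'type': 'personalize-recommendations',
         'inference_arn': 'arn:aws:personalize:REGION:ACCOUNT_ID:campaign/campaign-a'},
        {'type': 'personalize-recommendations',
         'inference_arn': 'arn:aws:personalize:REGION:ACCOUNT_ID:campaign/campaign-b'},
    ],
)
print(example_experiment['type'], example_experiment['method'])
```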

In [None]:
feature = 'home_product_recs'
experiment_name = 'home_personalize_interleaving'

# First, make sure there are no other active experiments so we can isolate
# this experiment for the exercise.
response = table.scan(
    ProjectionExpression='#k',
    ExpressionAttributeNames={'#k' : 'id'},
    FilterExpression=Key('status').eq('ACTIVE')
)
for item in response['Items']:
    response = table.update_item(
        Key=item,
        UpdateExpression='SET #s = :inactive',
        ExpressionAttributeNames={
            '#s' : 'status'
        },
        ExpressionAttributeValues={
            ':inactive' : 'INACTIVE'
        }
    )

# Query the experiment strategy table to see if our experiment already exists
response = table.query(
    IndexName='feature-name-index',
    KeyConditionExpression=Key('feature').eq(feature) & Key('name').eq(experiment_name),
    FilterExpression=Key('status').eq('ACTIVE')
)

if response.get('Items') and len(response.get('Items')) > 0:
    print('Experiment already exists')
    home_page_experiment = response['Items'][0]
else:
    print('Creating experiment')

    # Default product resolver
    variation_0 = {
        'type': 'product'
    }

    # Amazon Personalize resolver
    variation_1 = {
        'type': 'personalize-recommendations',
        'inference_arn': inference_arn
    }

    home_page_experiment = {
        'id': uuid.uuid4().hex,
        'feature': feature,
        'name': experiment_name,
        'status': 'ACTIVE',
        'type': 'interleaving',
        'method': 'team-draft',
        'analytics': {},
        'variations': [ variation_0, variation_1 ]
    }

    response = table.put_item(
        Item=home_page_experiment
    )

    print(json.dumps(response, indent=4))

print(json.dumps(home_page_experiment, indent=4, cls=CompatEncoder))

## Load Users

For our experiment simulation, we will load all Retail Demo Store users and run the experiment until the sample size has been met.

First, let's discover the IP address for the Retail Demo Store's [Users](https://github.com/aws-samples/retail-demo-store/tree/master/src/users) service.

In [None]:
response = servicediscovery.discover_instances(
    NamespaceName='retaildemostore.local',
    ServiceName='users',
    MaxResults=1,
    HealthStatus='HEALTHY'
)

users_service_instance = response['Instances'][0]['Attributes']['AWS_INSTANCE_IPV4']
print('Users Service Instance IP: {}'.format(users_service_instance))

Next, let's load all users into a local data frame.

In [None]:
# Load all users so we have enough to satisfy our sample size requirements.
response = requests.get('http://{}/users/all?count=10000'.format(users_service_instance))
users = response.json()
users_df = pd.DataFrame(users)
pd.set_option('display.max_rows', 5)

users_df

## Discover Recommendations Service

Next, let's discover the IP address for the Retail Demo Store's [Recommendations](https://github.com/aws-samples/retail-demo-store/tree/master/src/recommendations) service.

In [None]:
response = servicediscovery.discover_instances(
    NamespaceName='retaildemostore.local',
    ServiceName='recommendations',
    MaxResults=1,
    HealthStatus='HEALTHY'
)

recommendations_service_instance = response['Instances'][0]['Attributes']['AWS_INSTANCE_IPV4']
print('Recommendation Service Instance IP: {}'.format(recommendations_service_instance))

## Simulate Experiment

Next we will simulate our interleaving recommendation experiment by making calls to the [Recommendations](https://github.com/aws-samples/retail-demo-store/tree/master/src/recommendations) service across the users we just loaded.

### Simulation Function

The following `simulate_experiment` function is supplied with the number of trials we want to run and the probability of conversion for each variation for our simulation. It runs the simulation long enough to satisfy the number of trials and calls the Recommendations service for each trial in the experiment.

In [None]:
def simulate_experiment(n_trials, probs):
    """Simulates experiment based on pre-determined probabilities

    Example:

    Parameters:
        n_trials (int): number of trials to run for experiment
        probs (array float): array of floats containing probability/conversion
            rate for each variation

    Returns:
        df (df) - data frame of simulation data/results
    """

    # will hold exposure/outcome data
    data = []

    print('Simulating experiment for {} users... this may take a few minutes'.format(n_trials))

    for idx in range(n_trials):
        if idx > 0 and idx % 500 == 0:
            print('Simulated experiment for {} users so far'.format(idx))

        row = {}

        # Get random user
        user = users[randint(0, len(users)-1)]

        # Call Recommendations web service to get recommendations for the user
        response = requests.get('http://{}/recommendations?userID={}&feature={}'.format(recommendations_service_instance, user['id'], feature))

        recommendations = response.json()
        recommendation = recommendations[randint(0, len(recommendations)-1)]

        variation = recommendation['experiment']['variationIndex']
        row['variation'] = variation

        # Conversion based on probability of variation
        row['converted'] = np.random.binomial(1, p=probs[variation])

        if row['converted'] == 1:
            # Update experiment with outcome/conversion
            correlation_id = recommendation['experiment']['correlationId']
            requests.post('http://{}/experiment/outcome'.format(recommendations_service_instance), data={'correlationId':correlation_id})

        data.append(row)

    # convert data into pandas dataframe
    df = pd.DataFrame(data)

    print('Done')

    return df

### Run Simulation

Next we run the simulation by defining our simulation parameters for the number of trials and probabilities and then calling `simulate_experiment`. This will take a few minutes to run.

In [None]:
%%time

# Number of trials to run
N = 2000

# bcr: baseline conversion rate
p_A = 0.15
# p_B: expected conversion rate for variation B (15% baseline with a 25% relative lift). The difference p_B - p_A (d_hat) is sometimes referred to as minimal detectable effect or lift depending on the context
p_B = 0.1875

ab_data = simulate_experiment(N, [p_A, p_B])

In [None]:
ab_data

### Inspect Experiment Summary Statistics

Since the **Experiment** class updates statistics on the experiment in the experiment strategy table when a user is exposed to an experiment ("exposure") and when a user converts ("outcome"), we should see updated counts on our experiment. Let's reload our experiment and inspect the exposure and conversion counts for our simulation.

In [None]:
response = table.get_item(Key={'id': home_page_experiment['id']})

print(json.dumps(response['Item'], indent=4, cls=CompatEncoder))

Note the `conversions` and `exposures` counts for each variation above. These counts were incremented by the experiment class each time a trial was run (exposure) and a user converted in the `simulate_experiment` function above.
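To turn those counts into rates, something like the sketch below could be used. It assumes that `exposures` and `conversions` are stored on each entry of the item's `variations` list and may come back from DynamoDB as `Decimal` values; that layout is an assumption, so check it against the JSON printed above. The helper name `variation_rates` and the sample numbers are made up for illustration.

```python
from decimal import Decimal

def variation_rates(experiment_item):
    """Compute conversions / exposures for each variation in an experiment item."""
    rates = []
    for index, variation in enumerate(experiment_item.get('variations', [])):
        exposures = int(variation.get('exposures', 0))
        conversions = int(variation.get('conversions', 0))
        # Avoid dividing by zero for variations that were never exposed
        rate = conversions / exposures if exposures else None
        rates.append({
            'variation': index,
            'exposures': exposures,
            'conversions': conversions,
            'rate': rate
        })
    return rates

# Illustrative item only; real counts come from the experiment strategy table.
sample_item = {
    'variations': [
        {'type': 'product', 'exposures': Decimal(2000), 'conversions': Decimal(150)},
        {'type': 'personalize-recommendations', 'exposures': Decimal(2000), 'conversions': Decimal(190)},
    ]
}

for row in variation_rates(sample_item):
    print(row)
```

Because `InterleavingExperiment.get_items()` increments the exposure count of every variation on each request, the exposure counts should be the same across variations, while the conversion counts reflect which variation's item was clicked.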

### Analyze Simulation Results

To wrap up, let's analyze some of the results from our simulated interleaving experiment by inspecting the actual conversion rate and verifying our target confidence interval and power.

First, let's take a closer look at the results of our simulation. We'll start by calculating some summary statistics.

In [None]:
ab_summary = ab_data.pivot_table(values='converted', index='variation', aggfunc=np.sum)
# add additional columns to the pivot table
ab_summary['total'] = ab_data.pivot_table(values='converted', index='variation', aggfunc=lambda x: len(x))
ab_summary['rate'] = ab_data.pivot_table(values='converted', index='variation')

In [None]:
ab_summary

Next let's isolate data for each variation.

In [None]:
A_group = ab_data[ab_data['variation'] == 0]
B_group = ab_data[ab_data['variation'] == 1]
A_converted, B_converted = A_group['converted'].sum(), B_group['converted'].sum()

In [None]:
A_converted, B_converted

Determine the actual sample size for each variation.

In [None]:
A_total, B_total = len(A_group), len(B_group)
A_total, B_total

Calculate the actual conversion rates and uplift from our simulation.

In [None]:
p_A, p_B = A_converted / A_total, B_converted / B_total
p_A, p_B

In [None]:
p_B - p_A

### Determining Statistical Significance

For simplicity we will use the same approach as our A/B test to determine statistical significance. 

Let's plot the data from both groups as binomial distributions.

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
xA = np.linspace(A_converted-49, A_converted+50, 100)
yA = scs.binom(A_total, p_A).pmf(xA)
ax.scatter(xA, yA, s=10)
xB = np.linspace(B_converted-49, B_converted+50, 100)
yB = scs.binom(B_total, p_B).pmf(xB)
ax.scatter(xB, yB, s=10)
plt.xlabel('converted')
plt.ylabel('probability')

Based on the probabilities from our hypothesis, we should see that the test group in blue (B) converted more users than the control group in red (A). However, the plot above does not show the null and alternative hypotheses. The null hypothesis concerns the difference between the conversion probabilities of the two groups.

> Given the randomness of our user selection, group hashing, and probabilities, your simulation results should be different for each simulation run and therefore may or may not be statistically significant.

In order to calculate the difference between the two groups, we need to standardize the data. Because the number of samples can be different between the two groups, we should compare the probability of successes, p.

According to the central limit theorem, by calculating many sample means we can approximate the true mean of the population from which the data for the control group was taken. The distribution of the sample means will be normally distributed around the true mean with a standard deviation equal to the standard error of the mean.

In [None]:
SE_A = np.sqrt(p_A * (1-p_A)) / np.sqrt(A_total)
SE_B = np.sqrt(p_B * (1-p_B)) / np.sqrt(B_total)
SE_A, SE_B

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
xA = np.linspace(0, .3, A_total)
yA = scs.norm(p_A, SE_A).pdf(xA)
ax.plot(xA, yA)
ax.axvline(x=p_A, c='red', alpha=0.5, linestyle='--')

xB = np.linspace(0, .3, B_total)
yB = scs.norm(p_B, SE_B).pdf(xB)
ax.plot(xB, yB)
ax.axvline(x=p_B, c='blue', alpha=0.5, linestyle='--')

plt.xlabel('Converted Proportion')
plt.ylabel('PDF')
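One common way to turn the difference between the two proportions into a significance check is a pooled two-proportion z-test. The sketch below is one possible approach and not necessarily the calculation used in the A/B exercise; the function name `two_proportion_z_test` and the sample counts are assumptions for illustration.

```python
import numpy as np
import scipy.stats as scs

def two_proportion_z_test(conv_a, total_a, conv_b, total_b):
    """Return (z, two-sided p-value) for the null hypothesis p_B - p_A = 0."""
    p_a = conv_a / total_a
    p_b = conv_b / total_b
    # Pooled conversion rate under the null hypothesis of no difference
    p_pool = (conv_a + conv_b) / (total_a + total_b)
    # Standard error of the difference between the two sample proportions
    se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se_pool
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * scs.norm.sf(abs(z))
    return z, p_value

# Illustrative counts only; in the notebook, pass
# A_converted, A_total, B_converted, B_total instead.
z, p_value = two_proportion_z_test(150, 1000, 190, 1000)
print('z = {:.3f}, p-value = {:.4f}'.format(z, p_value))
```

A p-value below the chosen significance level (commonly 0.05) would suggest the observed difference is unlikely under the null hypothesis. Since the simulation is random, results will vary from run to run.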

## Next Steps

You have completed the exercise for implementing an interleaving recommendations experiment using the experimentation framework in the Retail Demo Store. Close this notebook and open the notebook for the next exercise, **[3.4-Multi-Armed-Bandit-Experiment](./3.4-Multi-Armed-Bandit-Experiment.ipynb)**.

### References and Further Reading

- [Large Scale Validation and Analysis of Interleaved Search Evaluation](http://olivier.chapelle.cc/pub/interleaving.pdf), Chapelle et al
- [Innovating Faster on Personalization Algorithms at Netflix Using Interleaving](https://medium.com/netflix-techblog/interleaving-in-online-experiments-at-netflix-a04ee392ec55), Netflix Technology Blog