# Building Your First E-Commerce Recommender

This notebook walks you through building a Domain dataset group and a recommender that returns product recommendations based on data generated for our fictitious retail store dataset. The goal is to recommend products that are relevant to a particular user.

This synthetic data comes from the [Retail Demo Store project](https://github.com/aws-samples/retail-demo-store). Follow the link to learn more about the data and potential uses.

# How to Use the Notebook

The code is broken up into cells like the one below. There's a triangular Run button at the top of this page that you can click to execute each cell and move onto the next, or you can press `Shift` + `Enter` while in the cell to execute it and move onto the next one.

While a cell is executing, the indicator to its left shows an `*`. Once all the code in the cell has finished executing, the `*` changes to a number that indicates the order in which the cell completed.

Simply follow the instructions below and execute the cells to get started with Amazon Personalize using use-case optimized recommenders.

## Imports
Python ships with a broad collection of libraries. We need to import some of those, as well as installed ones such as [boto3](https://aws.amazon.com/sdk-for-python/) (the AWS SDK for Python) and [Pandas](https://pandas.pydata.org/)/[NumPy](https://numpy.org/), which are core data science tools.

In [None]:
# Imports
import boto3
import json
import numpy as np
import pandas as pd
import time
import datetime

Next, configure the SDK clients that your environment will use to communicate with Amazon Personalize. The lines below do just that.

In [None]:
# Configure the SDK to Personalize:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

## Specify an S3 Bucket and Data Output Location

Amazon Personalize will need an S3 bucket to act as the source of your data. The code below creates a bucket whose `bucket_name` combines your account ID and region so that it is unique.

The Amazon S3 bucket needs to be in the same region as the Amazon Personalize resources. 

In [None]:
# Sets the same region as current Amazon SageMaker Notebook
with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
 data = json.load(notebook_info)
 resource_arn = data['ResourceArn']
 region = resource_arn.split(':')[3]
print('region:', region)

# Or you can specify the region where your bucket and model will be domiciled
# region = "us-east-1" 

s3 = boto3.client('s3')
account_id = boto3.client('sts').get_caller_identity().get('Account')
bucket_name = account_id + "-" + region + "-" + "personalizemanagedretailers"
print('bucket_name:', bucket_name)

try:
    if region == "us-east-1":
        # us-east-1 does not accept a LocationConstraint
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(
            Bucket = bucket_name,
            CreateBucketConfiguration={'LocationConstraint': region}
        )
except s3.exceptions.BucketAlreadyOwnedByYou:
    print("Bucket already exists. Using bucket", bucket_name)
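
The region lookup in the cell above depends on the colon-separated ARN layout (`arn:partition:service:region:account-id:resource`), where the region is the fourth field. As a minimal sketch, the same parsing can be wrapped in a small function that fails loudly on malformed input. The helper name `region_from_arn` and the sample ARN are illustrative assumptions, not part of the SDK.

```python
def region_from_arn(arn: str) -> str:
    """Return the region field of an ARN (hypothetical helper for illustration)."""
    # maxsplit=5 keeps any colons inside the resource part intact.
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError("not a valid ARN: {!r}".format(arn))
    if not parts[3]:
        raise ValueError("ARN has no region field: {!r}".format(arn))
    return parts[3]


# Made-up ARN in the same shape as the notebook's ResourceArn.
sample_arn = "arn:aws:sagemaker:us-west-2:123456789012:notebook-instance/demo"
print(region_from_arn(sample_arn))
```

Raising an error on an empty region field matters because some ARNs, such as S3 bucket ARNs, legitimately omit it.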

## Download, Prepare, and Upload Training Data

First, we need to download the training data. In this tutorial we'll use the purchase history from a retail store dataset. The dataset contains the user ID, the item ID, the interaction between customers and items, and the time the interaction took place (timestamp).

### Download and Explore the Dataset

In [None]:
!aws s3 cp s3://retail-demo-store-us-east-1/csvs/items.csv .
!aws s3 cp s3://retail-demo-store-us-east-1/csvs/interactions.csv .

The dataset has been successfully downloaded as `items.csv` and `interactions.csv`.

Let's learn more about the dataset by viewing its characteristics.

In [None]:
df = pd.read_csv('./interactions.csv')
df

In [None]:
df.EVENT_TYPE.value_counts()

In [None]:
def convert_event_type(event_type_in_some_format):
    if event_type_in_some_format == "ProductViewed":
        return "View"
    if event_type_in_some_format == "OrderCompleted":
        return "Purchase"
    else:
        return event_type_in_some_format

df['EVENT_TYPE'] = df['EVENT_TYPE'].apply(convert_event_type)

In [None]:
df.EVENT_TYPE.value_counts()

The ECOMMERCE recommenders require specific EVENT_TYPE values in order to understand the context of an interaction, which is why the cells above map `ProductViewed` to `View` and `OrderCompleted` to `Purchase` in the interactions EVENT_TYPE column.
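
The conversion can also be expressed as a lookup table, which keeps the mapping in one place if more event names need translating later. This is a sketch in plain Python; `EVENT_TYPE_MAP` and `map_event_type` are illustrative names, and the sample event list is made up.

```python
# Source event names mapped to the values used in this notebook.
EVENT_TYPE_MAP = {
    "ProductViewed": "View",
    "OrderCompleted": "Purchase",
}


def map_event_type(event_type: str) -> str:
    """Translate a source event name, leaving unknown names unchanged."""
    return EVENT_TYPE_MAP.get(event_type, event_type)


# Made-up sample; the middle entry has no mapping and passes through.
sample_events = ["ProductViewed", "ProductAdded", "OrderCompleted"]
converted = [map_event_type(e) for e in sample_events]
print(converted)
```

With a DataFrame, the same dictionary could be passed to `df['EVENT_TYPE'].replace(EVENT_TYPE_MAP)`.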

In [None]:
df.info()

From the cells above, we've learned that our data has 9 columns and 675004 rows, and that the headers include ITEM_ID, USER_ID, EVENT_TYPE, TIMESTAMP and DISCOUNT.

To be compatible with an Amazon Personalize interactions schema, this dataset requires column headings compatible with Amazon Personalize default column names (read about column names [here](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html)).
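
One way to check the column headings before uploading is to read only the header row of the CSV file and compare it with the fields the interactions schema in this notebook expects. The sketch below uses the standard `csv` module; the function name `missing_columns` and the in-memory sample file are assumptions for illustration.

```python
import csv
import io

# The four fields used by the interactions schema in this notebook.
REQUIRED_COLUMNS = ["USER_ID", "ITEM_ID", "TIMESTAMP", "EVENT_TYPE"]


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Small in-memory stand-in for a real file opened with open(path, newline="").
sample = io.StringIO("ITEM_ID,USER_ID,EVENT_TYPE,DISCOUNT\nabc,1,View,No\n")
print(missing_columns(sample))
```

An empty list means every required heading is present; extra columns are not reported by this check.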

## Prepare the Interactions Data


### Drop Columns

Some columns in this dataset, such as *discount*, would not add value to our model and therefore need to be dropped.

In [None]:
df = df.drop(columns=['DISCOUNT'])
df.sample(10)

In the cell below, we write our cleaned data to a file named "cleaned_training_data.csv".

In [None]:
# index=False keeps the DataFrame index out of the file; it is not a dataset column
df.to_csv("cleaned_training_data.csv", index=False)

### Upload to S3
Now that our training data is ready for Amazon Personalize, the next step is to upload it to the S3 bucket created earlier.

In [None]:
interactions_file_path = 'cleaned_training_data.csv'
boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_file_path).upload_file(interactions_file_path)
interactions_s3DataPath = "s3://"+bucket_name+"/"+interactions_file_path


## Configure an S3 bucket and an IAM role

So far, we have downloaded, manipulated, and saved the data onto the Amazon EBS volume attached to the instance running this Jupyter notebook. However, Amazon Personalize will need an S3 bucket to act as the source of your data, as well as IAM roles for accessing that bucket. Let's set all of that up.


## Set the S3 bucket policy
Amazon Personalize needs to be able to read the contents of your S3 bucket, so add a bucket policy that allows that.

Note: Make sure the role you are using to run the code in this notebook has the necessary permissions to modify the S3 bucket policy.

In [None]:
s3 = boto3.client("s3")
policy = {
 "Version": "2012-10-17",
 "Id": "PersonalizeS3BucketAccessPolicy",
 "Statement": [
 {
 "Sid": "PersonalizeS3BucketAccessPolicy",
 "Effect": "Allow",
 "Principal": {
 "Service": "personalize.amazonaws.com"
 },
 "Action": [
 "s3:GetObject",
 "s3:ListBucket"
 ],
 "Resource": [
 "arn:aws:s3:::{}".format(bucket_name),
 "arn:aws:s3:::{}/*".format(bucket_name)
 ]
 }
 ]
}

s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))

## Create and Wait for Dataset Group
The largest grouping in Personalize is a Dataset Group. It isolates your data, event trackers, solutions, recommenders, and campaigns, grouping together things that share a common collection of data. Feel free to alter the name below if you'd like.

### Create Dataset Group

In [None]:
response = personalize.create_dataset_group(
 name='personalize_ecomemerce_ds_group',
 domain='ECOMMERCE'
)

dataset_group_arn = response['datasetGroupArn']
print(json.dumps(response, indent=2))

### Wait for Dataset Group to Have ACTIVE Status
Before we can use the Dataset Group in any of the steps below, it must be active. Execute the cell below and wait for it to show ACTIVE.

In [None]:
%%time

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))

    if status == "ACTIVE" or status == "CREATE FAILED":
        break

    time.sleep(60)
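
The same poll-until-done loop recurs in this notebook for the dataset group, the import job, and the recommenders. As a sketch, it can be factored into one helper that takes a status-fetching callable. The name `wait_for_status`, its parameters, and the fake status source in the demo are all assumptions made for illustration; nothing here is part of boto3.

```python
import time


def wait_for_status(get_status, label, timeout_s=3 * 60 * 60, poll_s=60,
                    sleep=time.sleep):
    """Poll get_status() until it returns ACTIVE or CREATE FAILED.

    Returns the last status seen, which may be neither terminal value
    (or None) if the timeout expires first.
    """
    deadline = time.time() + timeout_s
    status = None
    while time.time() < deadline:
        status = get_status()
        print("{}: {}".format(label, status))
        if status in ("ACTIVE", "CREATE FAILED"):
            break
        sleep(poll_s)
    return status


# Demo with a fake status source so the sketch runs without AWS access.
fake_statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
final = wait_for_status(lambda: next(fake_statuses), "DatasetGroup",
                        sleep=lambda seconds: None)
```

In the notebook, `get_status` could be a lambda that calls `personalize.describe_dataset_group(datasetGroupArn=dataset_group_arn)` and returns `["datasetGroup"]["status"]` from the response.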

## Create Interactions Schema
A core component of how Personalize understands your data comes from the Schema that is defined below. This configuration tells the service how to digest the data provided via your CSV file. Note the columns and types align to what was in the file you created above.

In [None]:
interactions_schema = schema = {
 "type": "record",
 "name": "Interactions",
 "namespace": "com.amazonaws.personalize.schema",
 "fields": [
 {
 "name": "USER_ID",
 "type": "string"
 },
 {
 "name": "ITEM_ID",
 "type": "string"
 },
 {
 "name": "TIMESTAMP",
 "type": "long"
 },
 {
 "name": "EVENT_TYPE",
 "type": "string"
 
 }
 ],
 "version": "1.0"
}

create_schema_response = personalize.create_schema(
 name = "personalize-ecommerce-interatn_group",
 domain = "ECOMMERCE",
 schema = json.dumps(interactions_schema)
)

interaction_schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))
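
To see how the schema relates to the CSV rows, the sketch below checks a single row against the field list: every schema field must be present, and a `long` field must parse as an integer. The function `row_problems` and the sample rows are hypothetical, and this is a simplified local check, not the validation Amazon Personalize performs during import.

```python
import json

# Same fields as the interactions schema defined in this notebook.
schema_json = json.dumps({
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
        {"name": "EVENT_TYPE", "type": "string"},
    ],
    "version": "1.0",
})


def row_problems(row, schema_text):
    """Return a list of human-readable problems for one CSV row (a dict)."""
    problems = []
    for field in json.loads(schema_text)["fields"]:
        name, kind = field["name"], field["type"]
        value = row.get(name)
        if value is None or value == "":
            problems.append("missing " + name)
        elif kind == "long":
            try:
                int(value)
            except ValueError:
                problems.append(name + " is not an integer")
    return problems


# Made-up rows: one well-formed, one with a bad timestamp and no event type.
good_row = {"USER_ID": "777", "ITEM_ID": "abc",
            "TIMESTAMP": "1591803782", "EVENT_TYPE": "View"}
bad_row = {"USER_ID": "777", "ITEM_ID": "abc", "TIMESTAMP": "yesterday"}
print(row_problems(good_row, schema_json))
print(row_problems(bad_row, schema_json))
```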

## Create Datasets
After the group, the next thing to create is the actual datasets.

### Create Interactions Dataset

In [None]:
dataset_type = "INTERACTIONS"

create_dataset_response = personalize.create_dataset(
 name = "personalize_ecommerce_demo_interactions",
 datasetType = dataset_type,
 datasetGroupArn = dataset_group_arn,
 schemaArn = interaction_schema_arn
)

interactions_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

## Create Personalize Role
Amazon Personalize also needs the ability to assume roles in AWS in order to have the permissions to execute certain tasks. The lines below grant that.

Note: Make sure the role you are using to run the code in this notebook has the necessary permissions to create a role.

In [None]:
iam = boto3.client("iam")

role_name = "PersonalizeRoleEcommerceDemoRecommender"
assume_role_policy_document = {
 "Version": "2012-10-17",
 "Statement": [
 {
 "Effect": "Allow",
 "Principal": {
 "Service": "personalize.amazonaws.com"
 },
 "Action": "sts:AssumeRole"
 }
 ]
}

create_role_response = iam.create_role(
 RoleName = role_name,
 AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)

# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes "personalize" or "Personalize" 
# if you would like to use a bucket with a different name, please consider creating and attaching a new policy
# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess"
iam.attach_role_policy(
 RoleName = role_name,
 PolicyArn = policy_arn
)

# Now add S3 support
iam.attach_role_policy(
 PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
 RoleName=role_name
)
time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)
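
Re-running the role creation cell fails if the role already exists, because IAM role names are unique within an account. One possible way to make the step repeatable, sketched under the assumption that an existing role with this name is the one this notebook created, is to fall back to `get_role`. The helper name `get_or_create_role_arn` is illustrative; `iam` is expected to be a boto3 IAM client such as the one created above.

```python
import json


def get_or_create_role_arn(iam, role_name, trust_policy):
    """Create the role, or look it up if it already exists, and return its ARN."""
    try:
        response = iam.create_role(
            RoleName=role_name,
            AssumeRolePolicyDocument=json.dumps(trust_policy),
        )
    except iam.exceptions.EntityAlreadyExistsException:
        # Assumption: the existing role already has a suitable trust policy.
        response = iam.get_role(RoleName=role_name)
    return response["Role"]["Arn"]
```

In the notebook this could be called as `role_arn = get_or_create_role_arn(iam, role_name, assume_role_policy_document)`. Attaching a managed policy that is already attached should not raise, so the `attach_role_policy` calls can stay as they are.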


## Import the data
Earlier you created the Dataset Group and Dataset to house your information. Now you will execute an import job that loads the data from S3 into Amazon Personalize for use in building your model.

### Create Interactions Dataset Import Job

In [None]:
create_interactions_dataset_import_job_response = personalize.create_dataset_import_job(
 jobName = "personalize_ecommerce_demo_interactions_import",
 datasetArn = interactions_dataset_arn,
 dataSource = {
 "dataLocation": "s3://{}/{}".format(bucket_name, interactions_file_path)
 },
 roleArn = role_arn
)

dataset_interactions_import_job_arn = create_interactions_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_interactions_dataset_import_job_response, indent=2))

### Wait for Dataset Import Job to Have ACTIVE Status
It can take a while before the import job completes. Please wait until you see that it is ACTIVE below.

In [None]:
%%time

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_interactions_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("DatasetImportJob: {}".format(status))

    if status == "ACTIVE" or status == "CREATE FAILED":
        break

    time.sleep(60)

## Choose a recommender use case

Each domain has different use cases. When you create a recommender you create it for a specific use case, and each use case has different requirements for getting recommendations.


In [None]:
available_recipes = personalize.list_recipes(domain='ECOMMERCE') # See a list of recommenders for the domain. 
display (available_recipes['recipes'])
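
The `list_recipes` response can be long. A small sketch for condensing it into a name-to-ARN lookup follows; it runs against a hand-written stand-in for `available_recipes`, trimmed to the two recipes this notebook uses. The recipe names in the sample are assumed from the ARNs and may not match the service's output exactly, and `recipe_arns_by_name` is an illustrative helper.

```python
def recipe_arns_by_name(list_recipes_response):
    """Map each recipe name in a list_recipes response to its ARN."""
    return {
        recipe["name"]: recipe["recipeArn"]
        for recipe in list_recipes_response.get("recipes", [])
    }


# Hand-written stand-in for available_recipes (real responses carry more keys).
sample_response = {
    "recipes": [
        {"name": "aws-ecomm-customers-who-viewed-x-also-viewed",
         "recipeArn": "arn:aws:personalize:::recipe/aws-ecomm-customers-who-viewed-x-also-viewed"},
        {"name": "aws-ecomm-recommended-for-you",
         "recipeArn": "arn:aws:personalize:::recipe/aws-ecomm-recommended-for-you"},
    ]
}
for name, arn in recipe_arns_by_name(sample_response).items():
    print(name, "->", arn)
```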

We are going to create a recommender of the type "Customers who viewed X also viewed". This recommender gives recommendations for items that customers also viewed based on an item that you specify. With this use case, Amazon Personalize automatically filters items the user purchased based on the userId that you specify and `Purchase` events.

In [None]:
create_recommender_response = personalize.create_recommender(
 name = 'viewed_x_also_viewed_demo',
 recipeArn = 'arn:aws:personalize:::recipe/aws-ecomm-customers-who-viewed-x-also-viewed',
 datasetGroupArn = dataset_group_arn
)
viewed_x_also_viewed_arn = create_recommender_response["recommenderArn"]
print (json.dumps(create_recommender_response))

We are going to create a second recommender of the type "Recommended For You". This type of recommender offers personalized recommendations for items based on a user that you specify. With this use case, Amazon Personalize automatically filters items the user purchased based on the userId that you specify and `Purchase` events.

[More use cases per domain](https://docs.aws.amazon.com/personalize/latest/dg/domain-use-cases.html)

In [None]:
create_recommender_response = personalize.create_recommender(
 name = 'recommended_for_you_demo',
 recipeArn = 'arn:aws:personalize:::recipe/aws-ecomm-recommended-for-you',
 datasetGroupArn = dataset_group_arn
)
recommended_for_you_arn = create_recommender_response["recommenderArn"]
print (json.dumps(create_recommender_response))

We wait until the recommenders have finished creating and have status `ACTIVE`, checking periodically on the status of each recommender.

In [None]:
%%time

max_time = time.time() + 10*60*60 # 10 hours

# Wait for the "Customers who viewed X also viewed" recommender
while time.time() < max_time:

    version_response = personalize.describe_recommender(
        recommenderArn = viewed_x_also_viewed_arn
    )
    status = version_response["recommender"]["status"]

    if status == "ACTIVE":
        print("Build succeeded for {}".format(viewed_x_also_viewed_arn))

    elif status == "CREATE FAILED":
        print("Build failed for {}".format(viewed_x_also_viewed_arn))

    if status == "ACTIVE" or status == "CREATE FAILED":
        break
    else:
        print('The "Customers who viewed X also viewed" Recommender build is still in progress')

    time.sleep(60)

# Wait for the "Recommended for you" recommender
while time.time() < max_time:

    version_response = personalize.describe_recommender(
        recommenderArn = recommended_for_you_arn
    )
    status = version_response["recommender"]["status"]

    if status == "ACTIVE":
        print("Build succeeded for {}".format(recommended_for_you_arn))

    elif status == "CREATE FAILED":
        print("Build failed for {}".format(recommended_for_you_arn))

    if status == "ACTIVE" or status == "CREATE FAILED":
        break
    else:
        print('The "Recommended for you" Recommender build is still in progress')

    time.sleep(60)

## Getting recommendations with a recommender
Now that the recommenders have been trained, let's have a look at the recommendations we can get for our users!

In [None]:
# Read the original data in order to have a dataframe that has both item_ids
# and the corresponding descriptions, to make our recommendations easier to read.
items_df = pd.read_csv('./items.csv')
items_df.sample(10)

In [None]:
def get_item_by_id(item_id, item_df):
    """
    Takes an item_id from a recommendation, converts it to a string,
    looks it up in the specified dataframe, and returns the item description.

    A really broad try/except clause was added in case anything goes wrong.

    Feel free to add more debugging or filtering here to improve results if
    you hit an error.
    """
    try:
        # Use the dataframe passed in rather than the global items_df
        return item_df.loc[item_df["ITEM_ID"] == str(item_id)]['PRODUCT_DESCRIPTION'].values[0]
    except Exception:
        print(item_id)
        return "Error obtaining item description"

Let us get some recommendations using the "Customers who viewed X also viewed" Recommender:

In [None]:
# Use a random valid id for a quick sanity check; change the line of code below to a valid id in your dataset
get_item_by_id("c72257d4-430b-4eb7-9de3-28396e593381", items_df)

In [None]:
# First pick a user
test_user_id = "777"

# Select a random item
test_item_id = "8fbe091c-f73c-4727-8fe7-d27eabd17bea" # a random item: 8fbe091c-f73c-4727-8fe7-d27eabd17bea

# Get recommendations for the user for this item
get_recommendations_response = personalize_runtime.get_recommendations(
 recommenderArn = viewed_x_also_viewed_arn,
 itemId = test_item_id,
 userId = test_user_id,
 numResults = 10
)

# Build a new dataframe for the recommendations
item_list = get_recommendations_response['itemList']
recommendation_list = []

for item in item_list:
    item = get_item_by_id(item['itemId'], items_df)
    recommendation_list.append(item)

user_recommendations_df = pd.DataFrame(recommendation_list, columns = [get_item_by_id(test_item_id, items_df)])

pd.options.display.max_rows = 10
display(user_recommendations_df)
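
The lookup loop above filters the items dataframe once per recommended item. As a sketch of an alternative, a dictionary keyed by item ID can be built once and reused for every response. The function name, the sample catalog, and the sample response below are made up for illustration; only the `itemList` and `itemId` keys mirror the response used above.

```python
def describe_recommendations(item_list, descriptions):
    """Pair each recommended item ID with its description, in ranked order."""
    return [
        (item["itemId"], descriptions.get(item["itemId"], "Unknown item"))
        for item in item_list
    ]


# Made-up catalog and response, standing in for items_df and the API result.
descriptions = {"a1": "Leather wallet", "b2": "Running shoes"}
sample_item_list = [{"itemId": "b2"}, {"itemId": "zz"}, {"itemId": "a1"}]
for item_id, text in describe_recommendations(sample_item_list, descriptions):
    print(item_id, text)
```

With the notebook's data, such a dictionary could be built as `dict(zip(items_df['ITEM_ID'], items_df['PRODUCT_DESCRIPTION']))`.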

Get recommendations from the recommender returning "Recommended for you":

In [None]:
# First pick a user
test_user_id = "777" 

# Get recommendations for the user
get_recommendations_response = personalize_runtime.get_recommendations(
 recommenderArn = recommended_for_you_arn,
 userId = test_user_id,
 numResults = 20
)

# Build a new dataframe for the recommendations
item_list = get_recommendations_response['itemList']
recommendation_list = []
for item in item_list:
    item = get_item_by_id(item['itemId'], items_df)
    recommendation_list.append(item)

user_recommendations_df = pd.DataFrame(recommendation_list, columns = [test_user_id])

pd.options.display.max_rows = 20
display(user_recommendations_df)

## Review
Using the code above, you have successfully trained a deep learning model to generate item recommendations based on prior user behavior. You have created two recommenders for two foundational use cases.
Going forward, you can adapt this code to create other recommenders.

## Notes for the Next Notebook:
There are a few values you will need for the next notebook, execute the cell below to store them so they can be used in the `Clean_Up_Resources.ipynb` notebook.

This will overwrite any data stored for those variables and set them to the values specified in this notebook.

In [None]:
# store for cleanup
%store dataset_group_arn
%store role_name
%store region

If you have also run the `Building_Your_First_Recommender_Video_On_Demand.ipynb` notebook, first run `Clean_Up_Resources.ipynb` for the resources created here. Then re-run the previous step in `Building_Your_First_Recommender_Video_On_Demand.ipynb` and re-run `Clean_Up_Resources.ipynb` to remove the resources created in that notebook.