# Building Your First Video On Demand Recommender

This notebook walks you through the steps to build a Domain dataset group and a recommender that returns movie recommendations based on data collected from the MovieLens dataset. The goal is to recommend movies that are relevant to a particular user.

The data comes from the [MovieLens project](https://grouplens.org/datasets/movielens/). Follow the link to learn more about the data and potential uses.

# How to Use the Notebook

The code is broken up into cells like the one below. There's a triangular Run button at the top of this page that you can click to execute each cell and move onto the next, or you can press `Shift` + `Enter` while in the cell to execute it and move onto the next one.

While a cell is executing, an `*` appears beside it. Once the cell has finished executing all of its code, the `*` changes to a number that shows the order in which the cell completed.

Simply follow the instructions below and execute the cells to get started with Amazon Personalize using use-case optimized recommenders.

## Imports
Python ships with a broad collection of libraries. We need to import some of those, as well as installed packages such as [boto3](https://aws.amazon.com/sdk-for-python/) (the AWS SDK for Python) and [Pandas](https://pandas.pydata.org/)/[NumPy](https://numpy.org/), which are core data science tools.

In [None]:
# Imports
import boto3
import json
import numpy as np
import pandas as pd
import time
import datetime

Next, you will want to validate that your environment can communicate successfully with Amazon Personalize. The lines below create the SDK clients used for that.

In [None]:
# Configure the SDK to Personalize:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')
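
Creating the clients does not by itself contact the service, so a small read-only call can confirm that credentials and region are set up. The sketch below is one possible check, not part of the original notebook: the helper name `can_reach_personalize` is hypothetical, and it assumes the client exposes `list_dataset_groups`, as the boto3 `personalize` client does.

```python
def can_reach_personalize(client):
    """Return True if a cheap read-only call succeeds, False otherwise."""
    try:
        # list_dataset_groups is read-only; maxResults=1 keeps the response small.
        client.list_dataset_groups(maxResults=1)
        return True
    except Exception as error:  # broad on purpose: credentials, region, network
        print("Could not reach Amazon Personalize:", error)
        return False

# Usage in this notebook (this makes a real API call):
# can_reach_personalize(personalize)
```

The helper takes the client as an argument so it can be exercised with a stand-in object before any AWS call is made.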

## Configure the data
Data is imported into Amazon Personalize through Amazon S3. Below we will specify a bucket that you have created within AWS for the purposes of this exercise.
Update the `bucket` variable to the value that you created earlier in the CloudFormation steps; this should be in a text file from your earlier work. The `filename` does not need to be changed.
### Specify a Bucket and Data Output Location
Update the `bucket` name to a unique name.

In [None]:
filename = "movie-lens-100k.csv"

## Download, Prepare, and Upload Training Data
You do not yet have the MovieLens data available locally for examination. Execute the lines below to download the latest copy and examine it quickly.

### Download and Explore the Dataset

In [None]:
!wget -N https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!unzip -o ml-latest-small.zip

In [None]:
!ls ml-latest-small

In [None]:
!pygmentize ml-latest-small/README.txt

In [None]:
interactions_data = pd.read_csv('./ml-latest-small/ratings.csv')
pd.set_option('display.max_rows', 5)
interactions_data

In [None]:
interactions_data.info()

## Prepare the Data

### Interactions Data
As you can see the data contains a UserID, ItemID, Rating, and Timestamp.

We are now going to remove the interactions with low ratings (3 or below) and drop the Rating column before we build our model.

We are also adding the column EVENT_TYPE to all interactions.

In [None]:
interactions_data = interactions_data[interactions_data['rating'] > 3] # Keep only movies rated higher than 3 out of 5.
interactions_data = interactions_data[['userId', 'movieId', 'timestamp']]
interactions_data.rename(columns = {'userId':'USER_ID', 'movieId':'ITEM_ID', 
 'timestamp':'TIMESTAMP'}, inplace = True)
interactions_data['EVENT_TYPE'] = 'watch'  # Add an EVENT_TYPE column with the event type "watch" for all interactions
interactions_data.head()

### Item Metadata

Open the item data file and take a look at the first rows.

In [None]:
items_data = pd.read_csv('./ml-latest-small/movies.csv')
items_data.head(5)

In [None]:
items_data.info()

In [None]:
items_data['year'] = items_data['title'].str.extract(r'.*\((.*)\).*', expand=False)
items_data.head(5)
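
The pattern above captures whatever sits inside the last pair of parentheses in the title. As a sketch of a stricter alternative, the helper below only accepts a four-digit year at the end of the title. The function name `extract_year` and the sample titles are illustrative assumptions, not part of the original notebook.

```python
import re

# Matches a four-digit year in parentheses at the very end of the title.
YEAR_AT_END = re.compile(r"\((\d{4})\)\s*$")

def extract_year(title):
    """Return the release year as a string, or None if the title has no year."""
    match = YEAR_AT_END.search(title)
    return match.group(1) if match else None

print(extract_year("Toy Story (1995)"))
print(extract_year("City of Lost Children, The (Cité des enfants perdus, La) (1995)"))
print(extract_year("Untitled Example Movie"))
```

The same pattern should also work with pandas, for example `items_data['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)`, which leaves titles without a year as missing values.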

This example uses a recent date as the creation timestamp because the actual creation timestamp is unknown. In your use case, provide the appropriate creation timestamp.

In [None]:
ts = int(datetime.datetime(2022, 1, 1, 0, 0).timestamp())  # Unix epoch seconds (local time zone)
print(ts)
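
The timestamp above is computed in the notebook's local time zone, so it can differ between machines. If a reproducible value matters, one option is to state the time zone explicitly. This is a minimal sketch under that assumption; the helper name `utc_epoch_seconds` is hypothetical.

```python
import datetime

def utc_epoch_seconds(year, month, day):
    """Return midnight UTC of the given date as integer Unix epoch seconds."""
    moment = datetime.datetime(year, month, day, tzinfo=datetime.timezone.utc)
    return int(moment.timestamp())

print(utc_epoch_seconds(2022, 1, 1))  # 1640995200
```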

In [None]:
items_data["CREATION_TIMESTAMP"] = ts
items_data

In [None]:
# removing the title
items_data.drop(columns="title", inplace = True)

# renaming the columns to match schema
items_data.rename(columns = { 'movieId':'ITEM_ID', 'genres':'GENRES',
 'year':'YEAR'}, inplace = True)
items_data

### User Metadata

The dataset does not have any user metadata, so we will create a fake metadata field.

In [None]:
# get user ids from the interaction dataset

user_ids = interactions_data['USER_ID'].unique()
user_data = pd.DataFrame()
user_data["USER_ID"]=user_ids
user_data

#### Adding Metadata
The current dataset does not contain additional user information. For this example, we'll randomly assign a gender to the users with equal probability of male and female.

In [None]:
possible_genders = ['female', 'male']
# Draw one gender per user with equal probability
random_genders = np.random.choice(possible_genders, len(user_data.index), p=[0.5, 0.5])
user_data["GENDER"] = random_genders
user_data

## Configure an S3 bucket and an IAM role

So far, we have downloaded, manipulated, and saved the data on the Amazon EBS volume attached to the instance running this Jupyter notebook. However, Amazon Personalize will need an S3 bucket to act as the source of your data, as well as IAM roles for accessing that bucket. Let's set all of that up.

The Amazon S3 bucket needs to be in the same region as the Amazon Personalize resources that we create in this notebook. The cell below reads the region from the SageMaker notebook metadata; alternatively, define the region as a string.

In [None]:
# Sets the same region as current Amazon SageMaker Notebook
with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
 data = json.load(notebook_info)
 resource_arn = data['ResourceArn']
 region = resource_arn.split(':')[3]
print('region:', region)

# Or you can specify the region where your bucket and model will reside. This should be the same region as the Amazon Personalize resources.
# region = "us-east-1"


In [None]:
s3 = boto3.client('s3')
account_id = boto3.client('sts').get_caller_identity().get('Account')
bucket_name = account_id + "-" + region + "-" + "personalizemanagedvod"
print('bucket_name:', bucket_name)

try:
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={'LocationConstraint': region}
        )
except s3.exceptions.BucketAlreadyOwnedByYou:
    print("Bucket already exists. Using bucket", bucket_name)

### Upload data to S3
Now that your Amazon S3 bucket has been created, upload the CSV files containing the interactions, items, and users data.

In [None]:
interactions_filename = "interactions.csv"
interactions_data.to_csv(interactions_filename, index=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_filename)

items_filename = "items.csv"
items_data.to_csv(items_filename, index=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(items_filename).upload_file(items_filename)

user_filename = "users.csv"
user_data.to_csv(user_filename, index=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(user_filename).upload_file(user_filename)
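
The three uploads above follow the same pattern, so they could be folded into one helper. The sketch below is one way to do that and is not part of the original notebook: `upload_files` is a hypothetical name, and it assumes an S3 client with boto3's `upload_file(Filename, Bucket, Key)` method.

```python
def upload_files(s3_client, bucket, filenames):
    """Upload each local file to the bucket under the same key; return the S3 URIs."""
    uris = []
    for filename in filenames:
        # boto3's S3 client exposes upload_file(Filename, Bucket, Key).
        s3_client.upload_file(filename, bucket, filename)
        uris.append("s3://{}/{}".format(bucket, filename))
    return uris

# Usage in this notebook (this performs real uploads):
# upload_files(boto3.client("s3"), bucket_name,
#              [interactions_filename, items_filename, user_filename])
```

Returning the URIs is a design choice: the import jobs later need the same `s3://bucket/key` strings as their data location.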

## Set the S3 bucket policy
Amazon Personalize needs to be able to read the contents of your S3 bucket, so add a bucket policy that allows that.

Note: Make sure the role you are using to run the code in this notebook has the necessary permissions to modify the S3 bucket policy.

In [None]:
s3 = boto3.client("s3")
policy = {
 "Version": "2012-10-17",
 "Id": "PersonalizeS3BucketAccessPolicy",
 "Statement": [
 {
 "Sid": "PersonalizeS3BucketAccessPolicy",
 "Effect": "Allow",
 "Principal": {
 "Service": "personalize.amazonaws.com"
 },
 "Action": [
 "s3:GetObject",
 "s3:ListBucket"
 ],
 "Resource": [
 "arn:aws:s3:::{}".format(bucket_name),
 "arn:aws:s3:::{}/*".format(bucket_name)
 ]
 }
 ]
}

s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))

## Create and Wait for Dataset Group
The largest grouping in Personalize is a Dataset Group. It isolates your data, event trackers, solutions, recommenders, and campaigns, grouping together things that share a common collection of data. Feel free to alter the name below if you'd like.

### Create Dataset Group

In [None]:
response = personalize.create_dataset_group(
 name='personalize-video-on-demand-ds-group',
 domain='VIDEO_ON_DEMAND'
)

dataset_group_arn = response['datasetGroupArn']
print(json.dumps(response, indent=2))

### Wait for Dataset Group to Have ACTIVE Status
Before we can use the Dataset Group in any of the items below, it must be active. Execute the cell below and wait for it to show ACTIVE.

In [None]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))

    # Stop polling once the dataset group reaches a terminal state
    if status == "ACTIVE" or status == "CREATE FAILED":
        break

    time.sleep(60)
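
The same poll-until-terminal pattern recurs several times in this notebook. As a sketch, it could be factored into a helper that takes a callable returning the current status. The name `wait_for_status` and its parameters are assumptions introduced here, and the `sleep` and `clock` arguments exist only so the loop can be tried without waiting.

```python
import time

def wait_for_status(describe, done_states=("ACTIVE", "CREATE FAILED"),
                    timeout_seconds=3 * 60 * 60, poll_seconds=60,
                    sleep=time.sleep, clock=time.time):
    """Call describe() until it returns a terminal status or the timeout passes."""
    deadline = clock() + timeout_seconds
    while True:
        status = describe()
        print("status:", status)
        if status in done_states:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                "still {} after {} seconds".format(status, timeout_seconds))
        sleep(poll_seconds)

# Usage in this notebook (this makes real API calls):
# wait_for_status(lambda: personalize.describe_dataset_group(
#     datasetGroupArn=dataset_group_arn)["datasetGroup"]["status"])
```

Unlike the loop above, this version raises `TimeoutError` when the deadline passes, so a stalled resource is not mistaken for a finished one.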

## Create Interactions Schema
A core component of how Personalize understands your data comes from the Schema that is defined below. This configuration tells the service how to digest the data provided via your CSV file. Note the columns and types align to what was in the file you created above.

In [None]:
schema = {
 "type": "record",
 "name": "Interactions",
 "namespace": "com.amazonaws.personalize.schema",
 "fields": [
 {
 "name": "USER_ID",
 "type": "string"
 },
 {
 "name": "ITEM_ID",
 "type": "string"
 },
 {
 "name": "EVENT_TYPE",
 "type": "string"
 },
 {
 "name": "TIMESTAMP",
 "type": "long"
 }
 ],
 "version": "1.0"
}

create_interactions_schema_response = personalize.create_schema(
 name='personalize-demo-interactions-schema',
 schema=json.dumps(schema),
 domain='VIDEO_ON_DEMAND'
)

interactions_schema_arn = create_interactions_schema_response['schemaArn']
print(json.dumps(create_interactions_schema_response, indent=2))
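
Because the schema has to line up with the CSV columns, a quick local comparison can catch a mismatch before the import job runs. The sketch below uses only the standard library. The helper `missing_columns` and the sample CSV text are illustrative assumptions, not part of the original notebook.

```python
import csv
import io
import json

def missing_columns(schema_json, csv_text):
    """Return schema field names that are absent from the CSV header row."""
    fields = [field["name"] for field in json.loads(schema_json)["fields"]]
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in fields if name not in header]

# A schema with the same field names as the interactions schema above.
example_schema = json.dumps({
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "EVENT_TYPE", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
})

# Hypothetical CSV content that lacks the EVENT_TYPE column.
sample_csv = "USER_ID,ITEM_ID,TIMESTAMP\n1,1,964982703\n"
print(missing_columns(example_schema, sample_csv))  # ['EVENT_TYPE']
```

In the notebook, the CSV text could come from `open(interactions_filename).read()` and the schema from `json.dumps(schema)`.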

## Create Items (Movies) Schema

In [None]:
schema = {
 "type": "record",
 "name": "Items",
 "namespace": "com.amazonaws.personalize.schema",
 "fields": [
 {
 "name": "ITEM_ID",
 "type": "string"
 },
 {
 "name": "GENRES",
 "type": [
 "string"
 ],
 "categorical": True
 },
 {
 "name": "YEAR",
 "type": [
 "string"
 ],
 "categorical": True
 }, 
 {
 "name": "CREATION_TIMESTAMP",
 "type": "long"
 }
 ],
 "version": "1.0"
}
create_items_schema_response = personalize.create_schema(
 name='personalize-demo-items-schema',
 schema=json.dumps(schema),
 domain='VIDEO_ON_DEMAND'
)

items_schema_arn = create_items_schema_response['schemaArn']
print(json.dumps(create_items_schema_response, indent=2))

## Create Users Schema

In [None]:
schema = {
 "type": "record",
 "name": "Users",
 "namespace": "com.amazonaws.personalize.schema",
 "fields": [
 {
 "name": "USER_ID",
 "type": "string"
 },
 {
 "name": "GENDER",
 "type": "string",
 "categorical": True
 }
 ],
 "version": "1.0"
}
create_users_schema_response = personalize.create_schema(
 name='personalize-demo-users-schema',
 schema=json.dumps(schema),
 domain='VIDEO_ON_DEMAND'
)

users_schema_arn = create_users_schema_response['schemaArn']
print(json.dumps(create_users_schema_response, indent=2))

## Create Datasets
After the group, the next thing to create is the actual datasets.

### Create Interactions Dataset

In [None]:
dataset_type = "INTERACTIONS"

create_dataset_response = personalize.create_dataset(
 name = "personalize-demo-interactions",
 datasetType = dataset_type,
 datasetGroupArn = dataset_group_arn,
 schemaArn = interactions_schema_arn
)

interactions_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

### Create Items Dataset

In [None]:
dataset_type = "ITEMS"
create_dataset_response = personalize.create_dataset(
 name = "personalize-demo-items",
 datasetType = dataset_type,
 datasetGroupArn = dataset_group_arn,
 schemaArn = items_schema_arn
)

items_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

### Create Users Dataset

In [None]:
dataset_type = "USERS"
create_dataset_response = personalize.create_dataset(
 name = "personalize-demo-users",
 datasetType = dataset_type,
 datasetGroupArn = dataset_group_arn,
 schemaArn = users_schema_arn
)

users_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

## Create Personalize Role
Amazon Personalize also needs the ability to assume roles in AWS in order to have the permissions to execute certain tasks. The lines below grant that.

Note: Make sure the role you are using to run the code in this notebook has the necessary permissions to create a role.

In [None]:
iam = boto3.client("iam")

role_name = "PersonalizeRoleVODDemoRecommender"
assume_role_policy_document = {
 "Version": "2012-10-17",
 "Statement": [
 {
 "Effect": "Allow",
 "Principal": {
 "Service": "personalize.amazonaws.com"
 },
 "Action": "sts:AssumeRole"
 }
 ]
}

create_role_response = iam.create_role(
 RoleName = role_name,
 AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)

# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes "personalize" or "Personalize" 
# if you would like to use a bucket with a different name, please consider creating and attaching a new policy
# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess"
iam.attach_role_policy(
 RoleName = role_name,
 PolicyArn = policy_arn
)

# Now add S3 support
iam.attach_role_policy(
 PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
 RoleName=role_name
)
time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)
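
The cell above attaches `AmazonS3FullAccess`, which is broader than the import jobs need. As a sketch of a narrower alternative, the function below builds a policy document that grants read access to a single bucket. The function name, the example bucket name, and the policy name in the comment are hypothetical.

```python
import json

def bucket_read_policy(bucket):
    """Build an IAM policy document granting read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::{}".format(bucket),
                    "arn:aws:s3:::{}/*".format(bucket),
                ],
            }
        ],
    }

print(json.dumps(bucket_read_policy("example-bucket-name"), indent=2))

# Attaching it as an inline policy would be a real IAM call, for example:
# iam.put_role_policy(RoleName=role_name, PolicyName="PersonalizeBucketRead",
#                     PolicyDocument=json.dumps(bucket_read_policy(bucket_name)))
```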


## Import the data
Earlier you created the dataset group and datasets to house your information. Now you will execute import jobs that load the data from S3 into Amazon Personalize for use in building your model.
### Create Interactions Dataset Import Job

In [None]:
create_interactions_dataset_import_job_response = personalize.create_dataset_import_job(
 jobName = "personalize-demo-import-interactions",
 datasetArn = interactions_dataset_arn,
 dataSource = {
 "dataLocation": "s3://{}/{}".format(bucket_name, interactions_filename)
 },
 roleArn = role_arn
)

dataset_interactions_import_job_arn = create_interactions_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_interactions_dataset_import_job_response, indent=2))

### Create Items Dataset Import Job

In [None]:
create_items_dataset_import_job_response = personalize.create_dataset_import_job(
 jobName = "personalize-demo-import-items",
 datasetArn = items_dataset_arn,
 dataSource = {
 "dataLocation": "s3://{}/{}".format(bucket_name, items_filename)
 },
 roleArn = role_arn
)

dataset_items_import_job_arn = create_items_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_items_dataset_import_job_response, indent=2))

### Create Users Dataset Import Job

In [None]:
create_users_dataset_import_job_response = personalize.create_dataset_import_job(
 jobName = "personalize-demo-import-users",
 datasetArn = users_dataset_arn,
 dataSource = {
 "dataLocation": "s3://{}/{}".format(bucket_name, user_filename)
 },
 roleArn = role_arn
)

dataset_users_import_job_arn = create_users_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_users_dataset_import_job_response, indent=2))

### Wait for Dataset Import Jobs to Have ACTIVE Status
It can take a while before the import jobs complete. Please wait until you see that each one is active below.

In [None]:
# Wait for the interactions import job
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_interactions_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("Interactions DatasetImportJob: {}".format(status))

    if status == "ACTIVE" or status == "CREATE FAILED":
        break

    time.sleep(60)

# Wait for the items import job
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_items_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("Items DatasetImportJob: {}".format(status))

    if status == "ACTIVE" or status == "CREATE FAILED":
        break

    time.sleep(60)

# Wait for the users import job
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_users_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("Users DatasetImportJob: {}".format(status))

    if status == "ACTIVE" or status == "CREATE FAILED":
        break

    time.sleep(60)
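
The three loops above run one after another, so each job is only checked once the previous one has finished. A possible variant checks every pending job in each round and reports them together. This is a sketch, not the notebook's method: `wait_for_all` and its parameters are hypothetical, and each job is represented by a callable that returns its current status.

```python
import time

def wait_for_all(describe_by_name, done_states=("ACTIVE", "CREATE FAILED"),
                 timeout_seconds=3 * 60 * 60, poll_seconds=60,
                 sleep=time.sleep, clock=time.time):
    """Poll every pending job each round; return {name: final_status}.

    Jobs that have not reached a terminal state by the deadline are left out.
    """
    deadline = clock() + timeout_seconds
    pending = dict(describe_by_name)
    finished = {}
    while pending and clock() < deadline:
        for name in list(pending):
            status = pending[name]()
            print("{} DatasetImportJob: {}".format(name, status))
            if status in done_states:
                finished[name] = status
                del pending[name]
        if pending:
            sleep(poll_seconds)
    return finished

# Usage in this notebook (this makes real API calls):
# def job_status(arn):
#     return personalize.describe_dataset_import_job(
#         datasetImportJobArn=arn)["datasetImportJob"]["status"]
# wait_for_all({
#     "Interactions": lambda: job_status(dataset_interactions_import_job_arn),
#     "Items": lambda: job_status(dataset_items_import_job_arn),
#     "Users": lambda: job_status(dataset_users_import_job_arn),
# })
```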

## Choose Recommender Use Cases

Each domain has different use cases. When you create a recommender you create it for a specific use case, and each use case has different requirements for getting recommendations.


In [None]:
available_recipes = personalize.list_recipes(domain='VIDEO_ON_DEMAND') # List the recipes (use cases) available for the domain.
if len(available_recipes["recipes"]) == 0 and "nextToken" in available_recipes:
    # Workaround: the first page can come back empty, so fetch the next page using the returned token.
    available_recipes = personalize.list_recipes(domain='VIDEO_ON_DEMAND', nextToken=available_recipes["nextToken"])
display(available_recipes["recipes"])
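
The cell above looks at one extra page at most. A more general approach follows `nextToken` until no token is returned. The sketch below assumes a client whose `list_recipes` accepts `domain` and `nextToken`, as the boto3 `personalize` client does; the helper name `list_all_recipes` is hypothetical.

```python
def list_all_recipes(client, domain):
    """Collect recipes from every page of list_recipes for one domain."""
    recipes = []
    kwargs = {"domain": domain}
    while True:
        page = client.list_recipes(**kwargs)
        recipes.extend(page.get("recipes", []))
        token = page.get("nextToken")
        if not token:
            return recipes
        # Ask for the following page on the next iteration.
        kwargs["nextToken"] = token

# Usage in this notebook (this makes real API calls):
# display(list_all_recipes(personalize, "VIDEO_ON_DEMAND"))
```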
 

We are going to create a recommender of the type "More like X". This type of recommender offers recommendations for videos that are similar to a video a user watched. With this use case, Amazon Personalize automatically filters videos the user watched based on the userId specified in the `get_recommendations` call. For better performance, record Click events in addition to the required Watch events.

In [None]:
create_recommender_response = personalize.create_recommender(
 name = 'more_like_x_demo',
 recipeArn = 'arn:aws:personalize:::recipe/aws-vod-more-like-x',
 datasetGroupArn = dataset_group_arn
)
recommender_more_like_x_arn = create_recommender_response["recommenderArn"]
print (json.dumps(create_recommender_response))

We are going to create a second recommender of the type "Top picks for you". This type of recommender offers personalized streaming content recommendations for a user that you specify. With this use case, Amazon Personalize automatically filters videos the user watched based on the userId that you specify and `Watch` events.

[More use cases per domain](https://docs.aws.amazon.com/personalize/latest/dg/domain-use-cases.html)

In [None]:
create_recommender_response = personalize.create_recommender(
 name = 'top_picks_for_you_demo',
 recipeArn = 'arn:aws:personalize:::recipe/aws-vod-top-picks',
 datasetGroupArn = dataset_group_arn
)
recommender_top_picks_arn = create_recommender_response["recommenderArn"]
print (json.dumps(create_recommender_response))

We wait until the recommenders have finished creating and have the status `ACTIVE`, checking periodically on the status of each recommender.

In [None]:
%%time

max_time = time.time() + 10*60*60 # 10 hours
while time.time() < max_time:

    version_response = personalize.describe_recommender(
        recommenderArn = recommender_more_like_x_arn
    )
    status = version_response["recommender"]["status"]

    # Stop polling on either terminal state; a failed build will not recover on its own
    if status == "ACTIVE":
        print("Build succeeded for {}".format(recommender_more_like_x_arn))
        break
    elif status == "CREATE FAILED":
        print("Build failed for {}".format(recommender_more_like_x_arn))
        break
    else:
        print("The More Like X Recommender build is still in progress")

    time.sleep(60)

while time.time() < max_time:

    version_response = personalize.describe_recommender(
        recommenderArn = recommender_top_picks_arn
    )
    status = version_response["recommender"]["status"]

    if status == "ACTIVE":
        print("Build succeeded for {}".format(recommender_top_picks_arn))
        break
    elif status == "CREATE FAILED":
        print("Build failed for {}".format(recommender_top_picks_arn))
        break
    else:
        print("The Top Picks for You Recommender build is still in progress")

    time.sleep(60)

## Getting Recommendations with a Recommender
Now that the recommenders have been trained, let's have a look at the recommendations we can get for our users!

In [None]:
# Read the original data to get a dataframe that has both the movie IDs
# and the corresponding titles, which makes our recommendations easier to read.
items_df = pd.read_csv('./ml-latest-small/movies.csv')
items_df.sample(10)

In [None]:
def get_movie_by_id(movie_id, movie_df):
    """
    This takes in a movie_id from a recommendation in string format,
    converts it to an int, and then does a lookup in a specified
    dataframe.

    A really broad try/except clause was added in case anything goes wrong.

    Feel free to add more debugging or filtering here to improve results if
    you hit an error.
    """
    try:
        return movie_df.loc[movie_df["movieId"]==int(movie_id)]['title'].values[0]
    except Exception:
        # Print the offending ID so that it can be investigated
        print(movie_id)
        return "Error obtaining title"

### Let us get some 'More Like X' recommendations:

In [None]:
# First pick a user
test_user_id = "1"

# Select an item
test_item_id = "81847" # Iron Man: 59315, Tangled: 81847

# Get recommendations for the user for this item
get_recommendations_response = personalize_runtime.get_recommendations(
 recommenderArn = recommender_more_like_x_arn,
 userId = test_user_id,
 itemId = test_item_id,
 numResults = 20
)

# Build a new dataframe for the recommendations
item_list = get_recommendations_response['itemList']
recommendation_list = []
for item in item_list:
 movie = get_movie_by_id(item['itemId'], items_df)
 recommendation_list.append(movie)

user_recommendations_df = pd.DataFrame(recommendation_list, columns = [get_movie_by_id(test_item_id, items_df)])

pd.options.display.max_rows = 20
display(user_recommendations_df)
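
Each call to `get_movie_by_id` scans the dataframe. For longer recommendation lists, one option is to build an ID-to-title dictionary once and reuse it. The sketch below shows the idea on made-up data: `titles_for_items`, the example lookup, and the example item list are illustrative assumptions.

```python
def titles_for_items(item_list, title_by_id):
    """Map each recommended itemId (a string) to its title, keeping the order."""
    return [title_by_id.get(int(item["itemId"]), "Unknown title")
            for item in item_list]

# In the notebook, the lookup could be built once from the movies file:
# title_by_id = dict(zip(items_df["movieId"], items_df["title"]))
example_lookup = {1: "Toy Story (1995)", 2: "Jumanji (1995)"}

# Shaped like the 'itemList' entry of a get_recommendations response.
example_items = [{"itemId": "2"}, {"itemId": "1"}, {"itemId": "999"}]

print(titles_for_items(example_items, example_lookup))
# ['Jumanji (1995)', 'Toy Story (1995)', 'Unknown title']
```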

### Get recommendations from the recommender returning "Top picks for you":

We also add the user's metadata to our sample user. You can use this type of metadata to get insights on your users.

In [None]:
users_data_df = pd.read_csv('./users.csv')

def get_gender_by_id(user_id, user_df):
    """
    This takes in a user_id and then does a lookup in a specified
    dataframe.

    A really broad try/except clause was added in case anything goes wrong.

    Feel free to add more debugging or filtering here to improve results if
    you hit an error.
    """
    try:
        return user_df.loc[user_df["USER_ID"]==int(user_id)]['GENDER'].values[0]
    except Exception:
        # Print the offending ID so that it can be investigated
        print(user_id)
        return "Error obtaining gender"

In [None]:
# First pick a user
test_user_id = "111" # sample users: 55, 75, 76, 111

# Get recommendations for the user
get_recommendations_response = personalize_runtime.get_recommendations(
 recommenderArn = recommender_top_picks_arn,
 userId = test_user_id,
 numResults = 20
)

# Build a new dataframe for the recommendations
item_list = get_recommendations_response['itemList']
recommendation_list = []
for item in item_list:
 movie = get_movie_by_id(item['itemId'], items_df)
 recommendation_list.append(movie)

column_name = test_user_id+" ("+get_gender_by_id(test_user_id, users_data_df)+")"

user_recommendations_df = pd.DataFrame(recommendation_list, columns = [column_name])

pd.options.display.max_rows =20
display(user_recommendations_df)

## Review
Using the code above, you have successfully trained a deep learning model to generate movie recommendations based on prior user behavior. You have created two recommenders for two foundational use cases.
Going forward, you can adapt this code to create other recommenders.

## Notes for the Next Notebook:
There are a few values you will need for the next notebook. Execute the cell below to store them so they can be used in the `Clean_Up_Resources.ipynb` notebook.

This will overwrite any data stored for those variables and set them to the values specified in this notebook.

In [None]:
# store for cleanup
%store dataset_group_arn
%store role_name
%store region

If you have also run the `Building_Your_First_Recommender_Ecommerce.ipynb` notebook, first run `Clean_Up_Resources.ipynb` to remove the resources created here. Then re-run the previous step (the `%store` cell) in the `Building_Your_First_Recommender_Ecommerce.ipynb` notebook and run `Clean_Up_Resources.ipynb` again to remove the resources created in that notebook.