# Validating and Importing User-Item-Interaction Data 

In this notebook, you will choose a dataset and prepare it for use with Amazon Personalize.

1. [How to Use the Notebook](#usenotebook)
1. [Introduction](#intro)
1. [Define your Use Case](#usecase)
1. [Choose a Dataset or Data Source](#source)
1. [Prepare the Interactions Data](#prepare_interactions)
1. [Prepare the Item Metadata](#prepare_items)
1. [Prepare the User Metadata](#prepare_users)
1. [Configure an S3 bucket and an IAM role](#bucket_role)
1. [Create Dataset Group](#group_dataset)
1. [Create the Interactions Schema](#interact_schema)
1. [Create the Items (Movies) Schema](#items_schema)
1. [Create the Users Schema](#users_schema)
1. [Import the Interactions Data](#import_interactions)
1. [Import the Item Metadata](#import_items)
1. [Import the User Metadata](#import_users)

## How to Use the Notebook 

The code is broken up into cells like the one below. There's a triangular Run button at the top of this page that you can click to execute each cell and move onto the next, or you can press `Shift` + `Enter` while in the cell to execute it and move onto the next one.

While a cell is executing, the indicator beside it shows an `*`. When the cell has finished running all of its code, the indicator changes to a number that shows the order in which the cell completed.

Simply follow the instructions below and execute the cells to get started with Amazon Personalize using use-case-optimized recommenders.


## Introduction 
[Back to top](#top)

In Amazon Personalize, you start by creating a dataset group, which is a container for Amazon Personalize components. Your dataset group can be one of the following:

A Domain dataset group, where you create preconfigured resources for different business domains and use cases, such as getting recommendations for similar videos (VIDEO_ON_DEMAND domain) or best selling items (ECOMMERCE domain). You choose your business domain, import your data, and create recommenders. You use recommenders in your application to get recommendations.

Use a [Domain dataset group](https://docs.aws.amazon.com/personalize/latest/dg/domain-dataset-groups.html) if you have a video on demand or e-commerce application and want Amazon Personalize to find the best configurations for your use cases. If you start with a Domain dataset group, you can also add custom resources such as solutions with solution versions trained with recipes for custom use cases.

A [Custom dataset group](https://docs.aws.amazon.com/personalize/latest/dg/custom-dataset-groups.html), where you create configurable resources for custom use cases and batch recommendation workflows. You choose a recipe, train a solution version (model), and deploy the solution version with a campaign. You use a campaign in your application to get recommendations.

Use a Custom dataset group if you don't have a video on demand or e-commerce application or want to configure and manage only custom resources, or want to get recommendations in a batch workflow. If you start with a Custom dataset group, you can't associate it with a domain later. Instead, create a new Domain dataset group.

You can create and manage Domain dataset groups and Custom dataset groups with the AWS console, the AWS Command Line Interface (AWS CLI), or programmatically with the AWS SDKs.


## Define your Use Case 
[Back to top](#top)

There are a few guidelines for scoping a problem suitable for Personalize. We recommend the values below as a starting point, although the [official limits](https://docs.aws.amazon.com/personalize/latest/dg/limits.html) lie a little lower.

* Authenticated users
* At least 50 unique users
* At least 100 unique items
* At least 2 dozen interactions for each user 

Most of the time this is easily attainable, and if you are low in one category, you can often make up for it by having a larger number in another category.
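
As a minimal sketch, the starting-point values above can be checked against a list of interactions before any further preparation. The function name `check_minimums`, the tuple layout, and the tiny sample are assumptions made for illustration, not part of Amazon Personalize.

```python
from collections import Counter

# Starting-point thresholds taken from the list above.
MIN_USERS = 50
MIN_ITEMS = 100
MIN_INTERACTIONS_PER_USER = 24  # "2 dozen"


def check_minimums(interactions):
    """Summarize how a list of (user_id, item_id, timestamp) tuples
    compares with the starting-point guidelines."""
    per_user = Counter(user for user, _item, _ts in interactions)
    items = {item for _user, item, _ts in interactions}
    users_below = sum(
        1 for count in per_user.values() if count < MIN_INTERACTIONS_PER_USER
    )
    return {
        "unique_users": len(per_user),
        "unique_items": len(items),
        "enough_users": len(per_user) >= MIN_USERS,
        "enough_items": len(items) >= MIN_ITEMS,
        "users_below_interaction_guideline": users_below,
    }


# Tiny synthetic sample (hypothetical data, far below the guidelines).
sample = [(1, 10, 100), (1, 11, 101), (2, 10, 102)]
report = check_minimums(sample)
print(report)
```

Because a shortfall in one category can often be offset by another, the report counts users below the interaction guideline rather than failing outright.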

The user-item-interaction data is key for getting started with the service. This means we need to look for use cases that generate that kind of data. A few common examples are:

1. Video-on-demand applications
1. E-commerce platforms

Defining your use-case will inform what data and what type of data you need.

In this example we are going to be creating:

1. Amazon Personalize VIDEO_ON_DEMAND Domain recommender for the ["More Like X"](https://docs.aws.amazon.com/personalize/latest/dg/VIDEO_ON_DEMAND-use-cases.html#more-like-y-use-case) use case.
1. Amazon Personalize VIDEO_ON_DEMAND Domain recommender for the ["Top picks for you"](https://docs.aws.amazon.com/personalize/latest/dg/VIDEO_ON_DEMAND-use-cases.html#top-picks-use-case) use case.
1. Amazon Personalize Custom Campaign for a personalized ranked list of movies, for instance a shelf, rail, or carousel built around some attribute (director, location, superhero franchise, etc.).

All of these will be created within the same dataset group and with the same input data.

The diagram below shows an overview of what we will be building in this workshop.



In this notebook we will be working on the data layer: creating a dataset group and importing the datasets.


## Choose a Dataset or Data Source 
[Back to top](#top)

Regardless of the use case, the algorithms all share a base of learning on user-item-interaction data which is defined by 3 core attributes:

1. **UserID** - The user who interacted
1. **ItemID** - The item the user interacted with
1. **Timestamp** - The time at which the interaction occurred

To begin, we are going to use the latest MovieLens dataset. This dataset has over 25 million interactions and a rich collection of metadata for items. There is also a smaller version of this dataset, which can be used to shorten training times while still incorporating the same capabilities as the full dataset.

Generally speaking your data will not arrive in a perfect form for Personalize, and will take some modification to be structured correctly. This notebook guides you through all of that. 

Set USE_FULL_MOVIELENS to True to use the full dataset.

In [None]:
USE_FULL_MOVIELENS = False

First, you will download the dataset from the MovieLens website and unzip it in a new folder using the code below.

In [None]:
data_dir = "poc_data"
!mkdir $data_dir

if not USE_FULL_MOVIELENS:
 !cd $data_dir && wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
 !cd $data_dir && unzip ml-latest-small.zip
 dataset_dir = data_dir + "/ml-latest-small/"
else:
 !cd $data_dir && wget http://files.grouplens.org/datasets/movielens/ml-25m.zip
 !cd $data_dir && unzip ml-25m.zip
 dataset_dir = data_dir + "/ml-25m/"

Take a look at the data files you have downloaded.

In [None]:
!ls $dataset_dir

At present not much is known except that we have a few CSVs and a readme. Next we will output the readme to learn more!

In [None]:
!pygmentize $dataset_dir/README.txt

From the README, we see there is a file `ratings.csv` that should work as a proxy for our interactions data. After all, rating a film is definitely a form of interacting with it. The dataset also has some genre information as well as some movie genome data. In this POC we will focus on the interactions and the genre data.


## Prepare the Interactions Data
[Back to top](#top)

The next thing to be done is to load the data and confirm the data is in a good state.

Python ships with a broad collection of libraries. We need to import some of those, as well as installed libraries such as [boto3](https://aws.amazon.com/sdk-for-python/) (the AWS SDK for Python) and [Pandas](https://pandas.pydata.org/)/[NumPy](https://numpy.org/), which are core data science tools.

In [None]:
import time
from time import sleep
import json
from datetime import datetime
import boto3
import pandas as pd
import numpy as np

Next, open the `ratings.csv` file and take a look at the first rows.

In [None]:
original_data = pd.read_csv(dataset_dir + '/ratings.csv')
original_data.head(5)

In [None]:
original_data.shape

In [None]:
original_data.describe()

This shows that we have a good range of values for `userId` and `movieId`. Next, it is always a good idea to confirm the data format.

In [None]:
original_data.info()

In [None]:
original_data.isnull().any()

In [None]:
arb_time_stamp = original_data.iloc[50]['timestamp']
print(arb_time_stamp)
print(datetime.utcfromtimestamp(arb_time_stamp).strftime('%Y-%m-%d %H:%M:%S'))

From this, you can see that the dataset has 25,000,095 entries for the full version (100,836 for the small version) and 4 columns. Every column is stored as int64 except `rating`, which is float64.

The int64 format is clearly suitable for `userId` and `movieId`. However, we need to dive deeper to understand the timestamps in the data. To use Amazon Personalize, you need to save timestamps in [Unix Epoch](https://en.wikipedia.org/wiki/Unix_time) format.

The raw timestamp values are not human-readable, so as a quick sanity check the last cell above picks an arbitrary timestamp and converts it to a human-readable date.

This date makes sense as a timestamp, so we can continue formatting the rest of the data. Remember, the data we need is user-item-interaction data, which is `userId`, `movieId`, and `timestamp` in this case. Our dataset has an additional column, `rating`, which can be dropped from the dataset after we have leveraged it to focus on positive interactions.
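
`datetime.utcfromtimestamp` is deprecated as of Python 3.12. As a minimal sketch, the same sanity check can be written with a timezone-aware conversion. The helper name `epoch_to_utc_string` and the sample value are assumptions chosen for illustration; substitute any timestamp from your data.

```python
from datetime import datetime, timezone


def epoch_to_utc_string(epoch_seconds):
    """Format a Unix epoch timestamp (in seconds) as a UTC date string."""
    moment = datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')


sample_timestamp = 964982703  # illustrative value, in seconds since 1970-01-01 UTC
print(epoch_to_utc_string(sample_timestamp))
```

If the printed date falls in a plausible range for your data, the column is most likely already in epoch seconds. A date near 1970 or far in the future usually points to a unit mismatch such as milliseconds.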

### Convert the Interactions Data

Since this is a dataset of explicit-feedback movie ratings, it includes movies rated from 0.5 to 5 in half-star steps. We want to include only movies that were "liked" by the users, and simulate the data that would be gathered by a VOD platform. To do that, we will drop all interactions rated 1 or lower and create two event types: "Click" and "Watch". We will then give every movie rated above 1 a "Click" event, and every movie rated above 3 both a "Click" and a "Watch" event.

Note that for a real data set you would actually model based on implicit feedback such as clicks, watches and/or explicit feedback such as ratings, likes etc.

In [None]:
watched_df = original_data.copy()
watched_df = watched_df[watched_df['rating'] > 3]
watched_df = watched_df[['userId', 'movieId', 'timestamp']]
watched_df['EVENT_TYPE']='Watch'
watched_df.head()

In [None]:
clicked_df = original_data.copy()
clicked_df = clicked_df[clicked_df['rating'] > 1]
clicked_df = clicked_df[['userId', 'movieId', 'timestamp']]
clicked_df['EVENT_TYPE']='Click'
clicked_df.head()

In [None]:
interactions_df = clicked_df.copy()
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
interactions_df = pd.concat([interactions_df, watched_df])
interactions_df.sort_values("timestamp", axis=0, ascending=True,
                            inplace=True, na_position='last')
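
The rule applied in the cells above can also be stated as a small pure function, which makes the thresholds easy to see and adjust. This is only a sketch: the names `events_for_rating`, `CLICK_THRESHOLD`, and `WATCH_THRESHOLD` are assumptions, and the cutoffs mirror the `rating > 1` and `rating > 3` filters used in the notebook.

```python
CLICK_THRESHOLD = 1  # ratings above this count as a "Click"
WATCH_THRESHOLD = 3  # ratings above this also count as a "Watch"


def events_for_rating(rating):
    """Return the simulated event types for a single explicit rating."""
    events = []
    if rating > CLICK_THRESHOLD:
        events.append("Click")
    if rating > WATCH_THRESHOLD:
        events.append("Watch")
    return events


for rating in (1.0, 2.5, 4.0):
    print(rating, events_for_rating(rating))
```

A highly rated movie therefore contributes two rows to the interactions data, one per event type, both with the same timestamp.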

In [None]:
interactions_df.info()

Let's look at the new dataset.

In [None]:
interactions_df.describe()

After manipulating the data, always confirm the data format has not changed.

In [None]:
interactions_df.dtypes

Amazon Personalize has default column names for users, items, timestamps, and event types. For the [VIDEO_ON_DEMAND domain dataset](https://docs.aws.amazon.com/personalize/latest/dg/VIDEO-ON-DEMAND-datasets-and-schemas.html), these are `USER_ID`, `ITEM_ID`, `TIMESTAMP`, and `EVENT_TYPE`. The final modification to the dataset is to replace the existing column headers with the default headers.

In [None]:
interactions_df.rename(columns = {'userId':'USER_ID', 'movieId':'ITEM_ID', 
 'timestamp':'TIMESTAMP'}, inplace = True) 

That's it! At this point the data is ready to go, and we just need to save it as a CSV file.

In [None]:
interactions_filename = "interactions.csv"
interactions_df.to_csv((data_dir+"/"+interactions_filename), index=False, float_format='%.0f')

## Prepare the Item Metadata 
[Back to top](#top)

Item metadata allows you to work with filters. It also supports the [Top picks for you domain recommender](https://docs.aws.amazon.com/personalize/latest/dg/VIDEO_ON_DEMAND-use-cases.html#top-picks-use-case) and satisfies the [VIDEO_ON_DEMAND domain dataset and schema requirements](https://docs.aws.amazon.com/personalize/latest/dg/VIDEO-ON-DEMAND-datasets-and-schemas.html#VIDEO-ON-DEMAND-dataset-requirements).

Next, open the `movies.csv` file, take a look at the first rows, and confirm the data is in a good state. This file has information about each movie.

In [None]:
original_data = pd.read_csv(dataset_dir + '/movies.csv')
original_data.head(5)

In [None]:
original_data.describe()

This does not really tell us much about the dataset, so we will explore a bit more and look at the raw information. We can see that genres often appear in groups. That is fine for us as Personalize supports this structure.

In [None]:
original_data.info()

From this, you can see that the dataset has more than 62,000 entries for the full version (9,742 for the small version) and 3 columns.

Let's look for potential data issues. First we will check for null values.

In [None]:
original_data.isnull().sum()

Looks good, we currently have no null values.

This is a pretty small dataset of just the movieId, title, and the list of genres that are applicable to each entry. However, there is additional data available in the MovieLens dataset. For instance, the title includes the year of the movie's release. Let's make that another column of metadata.

In [None]:
# Raw string avoids invalid-escape warnings; captures the contents of the last parentheses
original_data['year'] = original_data['title'].str.extract(r'.*\((.*)\).*', expand=False)
original_data.head(5)
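
The pattern in the cell above captures whatever sits inside the last pair of parentheses, which is not guaranteed to be a year. As a hedged alternative, a stricter pattern can require exactly four digits at the end of the title. The names `YEAR_AT_END` and `extract_year`, and the sample titles, are assumptions made for illustration.

```python
import re

# Matches a four-digit year in parentheses at the very end of the title.
YEAR_AT_END = re.compile(r'\((\d{4})\)\s*$')


def extract_year(title):
    """Return the release year as an int, or None when the title has none."""
    match = YEAR_AT_END.search(title)
    return int(match.group(1)) if match else None


# Hypothetical titles in the MovieLens style.
for title in ["Toy Story (1995)", "Some Movie (Alternate Title) (2001) ", "No Year Here"]:
    print(repr(title), extract_year(title))
```

The same pattern string could be passed to `Series.str.extract` in place of the looser one, in which case titles without a trailing year would yield null values and be dropped by the later `dropna` step.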

Let's check again for null values, now that we have added a new field.

In [None]:
original_data.isnull().sum()

It looks like we have introduced some null values. This is likely due to something in the original data. If we had time, we could investigate the titles that resulted in the null values. However, for this workshop we will drop the titles with null values.

In [None]:
original_data = original_data.dropna(axis=0)

Let's validate that we resolved the data issue.

In [None]:
original_data.isnull().sum()

From an item metadata perspective, we only want to include information that is relevant to training a model and/or filtering results, so we will drop the title column, and keep the genre information.

In [None]:
itemmetadata_df = original_data.copy()
itemmetadata_df = itemmetadata_df[['movieId', 'genres', 'year']]
itemmetadata_df.head()

We will add a new dataframe to help us generate a creation timestamp. If you don’t provide the CREATION_TIMESTAMP for an item, the model infers this information from the interaction dataset and uses the timestamp of the item’s earliest interaction as its corresponding release date. If an item doesn’t have an interaction, its release date is set as the timestamp of the latest interaction in the training set and it is considered a new item. 

For the current example we use a fixed date (January 1, 2022) as the creation timestamp because the actual creation timestamp is unknown. In your use case, please provide the appropriate creation timestamp for the item. This can be when the item was added to your platform.

In [None]:
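from datetime import timezone

# strftime('%s') is platform-specific; timestamp() is portable and returns epoch seconds
ts = int(datetime(2022, 1, 1, 0, 0, tzinfo=timezone.utc).timestamp())
print(ts)

itemmetadata_df['CREATION_TIMESTAMP'] = ts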

After manipulating the data, always confirm that the data format has not changed.

In [None]:
itemmetadata_df.dtypes

Amazon Personalize has a default column for `ITEM_ID` that will map to our `movieId`. We will flesh out more information by specifying `GENRES` and `YEAR` as well.

In [None]:
itemmetadata_df.rename(columns = {'genres':'GENRES', 'movieId':'ITEM_ID', 'year':'YEAR'}, inplace = True) 

In [None]:
itemmetadata_df

That's it! At this point the item data is ready to go, and we just need to save it as a CSV file.

In [None]:
items_filename = "item-meta.csv"
itemmetadata_df.to_csv((data_dir+"/"+items_filename), index=False, float_format='%.0f')

## Prepare the User Metadata 
[Back to top](#top)

User metadata supports the [Top picks for you domain recommender](https://docs.aws.amazon.com/personalize/latest/dg/VIDEO_ON_DEMAND-use-cases.html#top-picks-use-case) and satisfies the [VIDEO_ON_DEMAND domain dataset and schema requirements](https://docs.aws.amazon.com/personalize/latest/dg/VIDEO-ON-DEMAND-datasets-and-schemas.html#VIDEO-ON-DEMAND-dataset-requirements).

The dataset does not have any user metadata so we will create a fake metadata field.

In [None]:
# get all unique user ids from the interaction dataset

user_ids = interactions_df['USER_ID'].unique()
user_data = pd.DataFrame()
user_data["USER_ID"]= user_ids
user_data

### Adding Metadata

The current dataset does not contain additional user information. For this example, we'll randomly assign a gender to the users with equal probability of male and female.

In [None]:
possible_genders = ['female', 'male']
random = np.random.choice(possible_genders, len(user_data.index), p=[0.5, 0.5])
user_data["GENDER"] = random
user_data

In [None]:
# Saving the data as a CSV file
users_filename = "users.csv"
user_data.to_csv((data_dir+"/"+users_filename), index=False, float_format='%.0f')
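
Because the assignment is random, rerunning the notebook produces a different `users.csv` each time. If reproducibility matters, a seeded generator is one option. The sketch below uses only the standard library; the function name `assign_genders`, the seed value, and the sample IDs are assumptions made for illustration.

```python
import random


def assign_genders(user_ids, seed=42):
    """Assign 'female' or 'male' to each user ID with equal probability,
    using a seeded generator so reruns give the same result."""
    rng = random.Random(seed)
    return {user_id: rng.choice(['female', 'male']) for user_id in user_ids}


# Hypothetical user IDs for illustration.
assignments = assign_genders([1, 2, 3, 4, 5])
print(assignments)
```

With NumPy, a comparable effect could be had by drawing from `np.random.default_rng(seed)` instead of the global `np.random` state.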


## Configure an S3 bucket and an IAM role 
[Back to top](#top)

So far, we have downloaded, manipulated, and saved the data onto the Amazon EBS volume attached to the instance running this Jupyter notebook.

By default, the Personalize service does not have permission to access the data we will upload into the S3 bucket in our account. In order to grant access to the Personalize service to read our CSVs, we need to set a bucket policy and create an IAM role that the Amazon Personalize service will assume. Let's set all of that up.

Use the metadata stored on the instance underlying this Amazon SageMaker notebook to determine the region it is operating in. If you are using a Jupyter notebook outside of Amazon SageMaker, simply define the region as a string below. The Amazon S3 bucket needs to be in the same region as the Amazon Personalize resources we will create.

In [None]:
with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
 data = json.load(notebook_info)
 resource_arn = data['ResourceArn']
 region = resource_arn.split(':')[3]
print('region:', region)
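
The metadata file read above exists only on SageMaker notebook instances. As a minimal sketch of a fallback for other environments, the helper below reads the file when it is present and otherwise consults the `AWS_DEFAULT_REGION` environment variable. The function names and the example ARN are assumptions made for illustration.

```python
import json
import os


def region_from_arn(arn):
    """Return the region component of an ARN (the fourth colon-separated field)."""
    return arn.split(':')[3]


def detect_region(metadata_path='/opt/ml/metadata/resource-metadata.json',
                  default=None):
    """Read the region from SageMaker notebook metadata when the file exists,
    otherwise fall back to AWS_DEFAULT_REGION or the supplied default."""
    if os.path.exists(metadata_path):
        with open(metadata_path) as notebook_info:
            return region_from_arn(json.load(notebook_info)['ResourceArn'])
    return os.environ.get('AWS_DEFAULT_REGION', default)


# Hypothetical ARN used only to show the parsing step.
example_arn = 'arn:aws:sagemaker:us-west-2:123456789012:notebook-instance/demo'
print(region_from_arn(example_arn))
```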

Amazon S3 bucket names are globally unique. To create a unique bucket name, the code below combines your AWS account ID, the region, and the string `personalizepocvod`. Then it creates a bucket with this name in the region discovered in the previous cell.

In [None]:
s3 = boto3.client('s3')
account_id = boto3.client('sts').get_caller_identity().get('Account')
bucket_name = account_id + "-" + region + "-" + "personalizepocvod"
print('bucket_name:', bucket_name)
try:
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={'LocationConstraint': region}
        )
except s3.exceptions.BucketAlreadyOwnedByYou:
    print("Bucket already exists. Using bucket", bucket_name)
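
The naming scheme used in the cell above can be sketched as a small helper that also checks the result against the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit). The helper name `build_bucket_name` and the account ID are assumptions made for illustration, and the pattern covers only these basic rules.

```python
import re

# Basic S3 bucket naming rules: 3-63 characters; lowercase letters, digits,
# dots, and hyphens; must start and end with a letter or digit.
BUCKET_NAME_PATTERN = re.compile(r'^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$')


def build_bucket_name(account_id, region, suffix='personalizepocvod'):
    """Combine account ID, region, and suffix the same way the cell above does."""
    name = '-'.join([account_id, region, suffix])
    if not BUCKET_NAME_PATTERN.match(name):
        raise ValueError('Invalid S3 bucket name: {!r}'.format(name))
    return name


# Hypothetical account ID for illustration.
print(build_bucket_name('123456789012', 'us-east-1'))
```

Validating the name locally surfaces problems such as uppercase characters before the `create_bucket` call is made.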

### Set the S3 bucket policy
Amazon Personalize needs to be able to read the contents of your S3 bucket. So add a bucket policy which allows that.

In [None]:
policy = {
 "Version": "2012-10-17",
 "Id": "PersonalizeS3BucketAccessPolicy",
 "Statement": [
 {
 "Sid": "PersonalizeS3BucketAccessPolicy",
 "Effect": "Allow",
 "Principal": {
 "Service": "personalize.amazonaws.com"
 },
 "Action": [
 "s3:*Object",
 "s3:ListBucket"
 ],
 "Resource": [
 "arn:aws:s3:::{}".format(bucket_name),
 "arn:aws:s3:::{}/*".format(bucket_name)
 ]
 }
 ]
}

s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))
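
The `s3:*Object` action in the policy above matches write and delete actions such as `s3:PutObject` and `s3:DeleteObject` as well as reads. Since importing data only requires reading, a narrower policy may be enough for this notebook. The sketch below is an assumption about what such a policy could look like: the helper name `build_read_only_policy` and the bucket name are hypothetical, and workflows that write results back to the bucket would need additional actions.

```python
import json


def build_read_only_policy(bucket_name):
    """Build a bucket policy that lets Amazon Personalize list and read objects only."""
    return {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::{}".format(bucket_name),
                    "arn:aws:s3:::{}/*".format(bucket_name),
                ],
            }
        ],
    }


# Hypothetical bucket name for illustration.
policy_json = json.dumps(build_read_only_policy("example-bucket"), indent=2)
print(policy_json)
```

The resulting JSON string has the same shape as the one passed to `put_bucket_policy` above.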

### Create an IAM role

Amazon Personalize needs the ability to assume roles in AWS in order to have the permissions to execute certain tasks. Let's create an IAM role and attach the required policies to it. The code below attaches very permissive policies; please use more restrictive policies for any production application.

In [None]:
iam = boto3.client("iam")

role_name = account_id+"-PersonalizeS3-Immersion-Day"
assume_role_policy_document = {
 "Version": "2012-10-17",
 "Statement": [
 {
 "Effect": "Allow",
 "Principal": {
 "Service": "personalize.amazonaws.com"
 },
 "Action": "sts:AssumeRole"
 }
 ]
}

try:
    create_role_response = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(assume_role_policy_document)
    )
except iam.exceptions.EntityAlreadyExistsException as e:
    print('Warning: role already exists:', e)
    create_role_response = iam.get_role(RoleName=role_name)

role_arn = create_role_response["Role"]["Arn"]
print('IAM Role: {}'.format(role_arn))

attach_response = iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess"
)

# Pause to allow the role to become fully consistent
time.sleep(30)
print('Done.')

### Upload data to S3

Now that your Amazon S3 bucket has been created, upload the CSV files containing the interactions, item metadata, and user metadata.

In [None]:
interactions_file_path = data_dir + "/" + interactions_filename
boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_file_path)

items_file_path = data_dir + "/" + items_filename
boto3.Session().resource('s3').Bucket(bucket_name).Object(items_filename).upload_file(items_file_path)

users_file_path = data_dir + "/" + users_filename
boto3.Session().resource('s3').Bucket(bucket_name).Object(users_filename).upload_file(users_file_path)

## Create the Dataset Group 
[Back to top](#top)

The highest level of isolation and abstraction with Amazon Personalize is a *dataset group*. Information stored within one of these dataset groups has no impact on any other dataset group or models created from one - they are completely isolated. This allows you to run many experiments and is part of how we keep your models private and fully trained only on your data. 

Before importing the data prepared earlier, there needs to be a dataset group and a dataset added to it that handles the interactions.

Dataset groups can house the following types of information:

* User-item-interactions
* Event streams (real-time interactions)
* User metadata
* Item metadata

We need to create the dataset group that will contain our three datasets.

### Create the Dataset Group

The following cell will create a new dataset group with the name `personalize-poc-movielens`.

In [None]:
# Configure the SDK to Personalize:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')
print("We can communicate with Personalize!")

In [None]:
create_dataset_group_response = personalize.create_dataset_group(
 name = "personalize-poc-movielens",
 domain='VIDEO_ON_DEMAND'
)

dataset_group_arn = create_dataset_group_response['datasetGroupArn']
print(json.dumps(create_dataset_group_response, indent=2))

print(f'DatasetGroupArn = {dataset_group_arn}')

#### Wait for Dataset Group to Have ACTIVE Status 

Before we can use the dataset group in any of the steps below, it must be active. This can take a minute or two. Execute the cell below and wait for it to show the ACTIVE status. It checks the status of the dataset group every 60 seconds, up to a maximum of 3 hours.

In [None]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))

    if status == "ACTIVE" or status == "CREATE FAILED":
        break

    time.sleep(60)

Now that you have a dataset group, you can create a dataset for the interaction data.

## Create the Interactions Schema 
[Back to top](#top)

Now that we've loaded and prepared our three datasets, we need to configure the Amazon Personalize service to understand our data so that it can be used to train models for generating recommendations. Amazon Personalize requires a schema for each dataset so it can map the columns in our CSVs to fields for model training. Each schema is declared in JSON using the [Apache Avro](https://avro.apache.org/) format.

First, define a schema to tell Amazon Personalize what type of dataset you are uploading. There are several reserved and mandatory keywords required in the schema, based on the type of dataset. More detailed information can be found in the [documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).

Here, you will create a schema for interactions data, which requires the `USER_ID`, `ITEM_ID`, and `TIMESTAMP` fields. These must be defined in the same order in the schema as they appear in the dataset.

The `TIMESTAMP` represents when the user interacted with an item and must be expressed in Unix timestamp format (seconds). For this dataset we also have an `EVENT_TYPE` column.

In [None]:
interactions_schema = {
 "type": "record",
 "name": "Interactions",
 "namespace": "com.amazonaws.personalize.schema",
 "fields": [
 {
 "name": "USER_ID",
 "type": "string"
 },
 {
 "name": "ITEM_ID",
 "type": "string"
 },
 {
 "name": "EVENT_TYPE", # "Watch", "Click", etc.
 "type": "string"
 },
 {
 "name": "TIMESTAMP",
 "type": "long"
 }
 ],
 "version": "1.0"
}

try:
    create_schema_response = personalize.create_schema(
        name = "personalize-poc-movielens-interactions-schema",
        schema = json.dumps(interactions_schema),
        domain='VIDEO_ON_DEMAND'
    )
    print(json.dumps(create_schema_response, indent=2))
    interactions_schema_arn = create_schema_response['schemaArn']
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this schema.')
    schemas = personalize.list_schemas(maxResults=100)['schemas']
    for schema_response in schemas:
        if schema_response['name'] == "personalize-poc-movielens-interactions-schema":
            interactions_schema_arn = schema_response['schemaArn']
            print(f"Using existing schema: {interactions_schema_arn}")
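
Before importing, it can help to confirm that every field named in a schema actually appears as a column header in the corresponding CSV. The following is a minimal local sketch using only the standard library. The function name `missing_schema_fields`, the trimmed-down schema, and the CSV snippets are assumptions made for illustration, and the check does not replace the validation Amazon Personalize performs during import.

```python
import csv
import io


def missing_schema_fields(schema, csv_text):
    """Return schema field names that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [field["name"] for field in schema["fields"] if field["name"] not in header]


# Trimmed-down schema and hypothetical CSV contents for illustration.
example_schema = {
    "type": "record",
    "name": "Interactions",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "EVENT_TYPE", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
}
good_csv = "USER_ID,ITEM_ID,TIMESTAMP,EVENT_TYPE\n1,10,964982703,Click\n"
bad_csv = "userId,movieId,timestamp\n1,10,964982703\n"

print(missing_schema_fields(example_schema, good_csv))
print(missing_schema_fields(example_schema, bad_csv))
```

An empty list means every schema field has a matching header; any names returned point to columns that were not renamed or not written.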

### Create the Interactions Dataset

With a schema created, you can create a dataset within the dataset group. Note that this does not load the data yet; it only creates an empty dataset that uses the schema.

In [None]:
dataset_type = "INTERACTIONS"
create_dataset_response = personalize.create_dataset(
 name = "personalize-poc-movielens-interactions",
 datasetType = dataset_type,
 datasetGroupArn = dataset_group_arn,
 schemaArn = interactions_schema_arn
)

interactions_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

## Create the Items (Movies) Schema
[Back to top](#top)

First, define a schema to tell Amazon Personalize what type of dataset you are uploading. There are several reserved and mandatory keywords required in the schema, based on the type of dataset. More detailed information can be found in the [documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).

Here, you will create a schema for item metadata that defines the `ITEM_ID`, `GENRES`, `YEAR`, and `CREATION_TIMESTAMP` fields. These must be defined in the same order in the schema as they appear in the dataset.

In [None]:
items_schema = {
 "type": "record",
 "name": "Items",
 "namespace": "com.amazonaws.personalize.schema",
 "fields": [
 {
 "name": "ITEM_ID",
 "type": "string"
 },
 {
 "name": "GENRES",
 "type": "string",
 "categorical": True
 },{
 "name": "YEAR",
 "type": "int",
 },
 {
 "name": "CREATION_TIMESTAMP",
 "type": "long",
 }
 ],
 "version": "1.0"
}
 
try:
    create_schema_response = personalize.create_schema(
        name = "personalize-poc-movielens-items-schema",
        schema = json.dumps(items_schema),
        domain='VIDEO_ON_DEMAND'
    )
    items_schema_arn = create_schema_response['schemaArn']
    print(json.dumps(create_schema_response, indent=2))
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this schema.')
    schemas = personalize.list_schemas(maxResults=100)['schemas']
    for schema_response in schemas:
        if schema_response['name'] == "personalize-poc-movielens-items-schema":
            items_schema_arn = schema_response['schemaArn']
            print(f"Using existing schema: {items_schema_arn}")


### Create the Items Dataset

With a schema created, you can create a dataset within the dataset group. Note that this does not load the data yet; it only creates an empty dataset that uses the schema.

In [None]:
dataset_type = "ITEMS"
create_dataset_response = personalize.create_dataset(
 name = "personalize-poc-movielens-items",
 datasetType = dataset_type,
 datasetGroupArn = dataset_group_arn,
 schemaArn = items_schema_arn
)

items_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

## Create the Users Schema
[Back to top](#top)

First, define a schema to tell Amazon Personalize what type of dataset you are uploading. There are several reserved and mandatory keywords required in the schema, based on the type of dataset. More detailed information can be found in the [documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).

Here, you will create a schema for user data, which requires the `USER_ID` field and an additional metadata field, in this case `GENDER`. These must be defined in the same order in the schema as they appear in the dataset.

In [None]:
users_schema = {
 "type": "record",
 "name": "Users",
 "namespace": "com.amazonaws.personalize.schema",
 "fields": [
 {
 "name": "USER_ID",
 "type": "string"
 },
 {
 "name": "GENDER",
 "type": "string",
 "categorical": True
 }
 ],
 "version": "1.0"
}
 
try:
    create_schema_response = personalize.create_schema(
        name = "personalize-poc-movielens-users-schema",
        schema = json.dumps(users_schema),
        domain='VIDEO_ON_DEMAND'
    )
    print(json.dumps(create_schema_response, indent=2))
    users_schema_arn = create_schema_response['schemaArn']
except personalize.exceptions.ResourceAlreadyExistsException:
    print('You already created this schema.')
    schemas = personalize.list_schemas(maxResults=100)['schemas']
    for schema_response in schemas:
        if schema_response['name'] == "personalize-poc-movielens-users-schema":
            users_schema_arn = schema_response['schemaArn']
            print(f"Using existing schema: {users_schema_arn}")

### Create the Users dataset

With a schema created, you can create a dataset within the dataset group. Note that this does not load the data yet; it only creates an empty dataset that uses the schema.

In [None]:
dataset_type = "USERS"
create_dataset_response = personalize.create_dataset(
 name = "personalize-poc-movielens-users",
 datasetType = dataset_type,
 datasetGroupArn = dataset_group_arn,
 schemaArn = users_schema_arn
)

users_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

Let's wait until all the datasets have been created.

In [None]:
%%time

max_time = time.time() + 6*60*60 # 6 hours
datasets_to_check = [
    ("Interactions", interactions_dataset_arn),
    ("Items", items_dataset_arn),
    ("Users", users_dataset_arn),
]
# Poll each dataset in turn until it is ACTIVE, has failed, or the deadline passes
for dataset_label, dataset_arn in datasets_to_check:
    while time.time() < max_time:
        describe_dataset_response = personalize.describe_dataset(
            datasetArn = dataset_arn
        )
        status = describe_dataset_response["dataset"]['status']
        print("{} Dataset: {}".format(dataset_label, status))

        if status == "ACTIVE" or status == "CREATE FAILED":
            break

        time.sleep(60)

## Import the Interactions Data
[Back to top](#top)

Earlier you created the dataset group and dataset to house your information, so now you will execute an import job that will load the interactions data from the S3 bucket into the Amazon Personalize dataset. 

In [None]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "personalize-poc-interactions-import",
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, interactions_filename)
    },
    roleArn = role_arn
)

interactions_dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))
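
Rerunning an import cell with the same `jobName` is likely to fail with a `ResourceAlreadyExistsException`, because an import job with that name already exists. One possible workaround is to append a timestamp to the name. The sketch below assumes that approach; `unique_job_name` is a hypothetical helper, and the 63-character limit is an assumption that should be checked against the `CreateDatasetImportJob` API reference.

```python
from datetime import datetime, timezone


def unique_job_name(base, now=None, max_len=63):
    """Append a UTC timestamp so each run gets a distinct job name.

    max_len=63 is an assumed limit; verify it in the API reference.
    """
    now = now or datetime.now(timezone.utc)
    suffix = now.strftime("-%Y%m%d-%H%M%S")
    # Trim the base name if needed so the suffix always fits.
    return base[: max_len - len(suffix)] + suffix


print(unique_job_name("personalize-poc-interactions-import"))
```

The returned string could then be passed as `jobName` in place of the fixed name used above.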

## Import the Item Metadata 
[Back to top](#top)

Earlier you created the dataset group and dataset to house your information. Now you will execute an import job that will load the item data from the S3 bucket into the Amazon Personalize dataset. 

In [None]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "personalize-poc-items-import",
    datasetArn = items_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, items_filename)
    },
    roleArn = role_arn
)

items_dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

## Import the User Metadata 
[Back to top](#top)

Earlier you created the dataset group and dataset to house your information. Now you will execute an import job that will load the user data from the S3 bucket into the Amazon Personalize dataset. 

In [None]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "personalize-poc-users-import",
    datasetArn = users_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, users_filename)
    },
    roleArn = role_arn
)

users_dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

Before we can use the datasets, all three import jobs must be active. Execute the cell below and wait for each job to show the ACTIVE status. The cell checks the status of each import job every minute, up to a combined maximum of 6 hours.

Importing the data can take some time, depending on the size of the dataset. In this workshop, the data import jobs should take around 15 minutes. While you're waiting, you can learn more about datasets and schemas in [the documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).

In [None]:
%%time

max_time = time.time() + 6*60*60 # 6 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = interactions_dataset_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("Interactions DatasetImportJob: {}".format(status))

    if status == "ACTIVE" or status == "CREATE FAILED":
        break

    time.sleep(60)

while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = items_dataset_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("Items DatasetImportJob: {}".format(status))

    if status == "ACTIVE" or status == "CREATE FAILED":
        break

    time.sleep(60)

while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = users_dataset_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("Users DatasetImportJob: {}".format(status))

    if status == "ACTIVE" or status == "CREATE FAILED":
        break

    time.sleep(60)
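
The cell above waits for the jobs one after another and only prints the status, so a `CREATE FAILED` job gives no detail. As an alternative, the sketch below polls all jobs in a single loop and records the `failureReason` field that `describe_dataset_import_job` returns for failed jobs. It is a rough illustration rather than the notebook's method: `wait_for_import_jobs` and `FakePersonalize` are hypothetical names, and the fake client with its example ARNs exists only so the code runs offline.

```python
import time


def wait_for_import_jobs(client, job_arns, timeout_s=6 * 60 * 60, poll_s=60,
                         sleep=time.sleep):
    """Poll every import job in one loop; return {label: (status, failure_reason)}."""
    deadline = time.time() + timeout_s
    pending = dict(job_arns)  # label -> ARN for jobs not yet in a terminal state
    results = {}
    while pending and time.time() < deadline:
        for label, arn in list(pending.items()):
            job = client.describe_dataset_import_job(
                datasetImportJobArn=arn
            )["datasetImportJob"]
            status = job["status"]
            print("{} DatasetImportJob: {}".format(label, status))
            if status in ("ACTIVE", "CREATE FAILED"):
                results[label] = (status, job.get("failureReason"))
                del pending[label]
        if pending:
            sleep(poll_s)
    return results


class FakePersonalize:
    """Hypothetical stand-in for the boto3 client so the sketch runs offline."""

    def __init__(self, scripted):
        self._scripted = {arn: iter(jobs) for arn, jobs in scripted.items()}

    def describe_dataset_import_job(self, datasetImportJobArn):
        return {"datasetImportJob": next(self._scripted[datasetImportJobArn])}


fake = FakePersonalize({
    "arn:example:interactions": [{"status": "CREATE IN_PROGRESS"}, {"status": "ACTIVE"}],
    "arn:example:users": [{"status": "CREATE FAILED", "failureReason": "example reason"}],
})
results = wait_for_import_jobs(
    fake,
    {"Interactions": "arn:example:interactions", "Users": "arn:example:users"},
    sleep=lambda seconds: None,  # skip the real delay in this illustration
)
print(results)
```

With the real client, the call would pass `personalize` and a dictionary built from `interactions_dataset_import_job_arn`, `items_dataset_import_job_arn`, and `users_dataset_import_job_arn`.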

With all imports now complete, you can start training recommenders and solutions. Before moving on, run the cell below to store a few values for use in the next notebooks. After that cell completes, open notebook `02_Training_Layer.ipynb` to continue.

In [None]:
%store USE_FULL_MOVIELENS
%store dataset_dir
%store dataset_group_arn
%store bucket_name
%store role_arn
%store role_name
%store data_dir
%store region
%store interactions_dataset_arn
%store items_dataset_arn
%store users_dataset_arn
%store interactions_schema_arn
%store items_schema_arn
%store users_schema_arn