# Data Preparation Notebook

In this notebook, you will run code to:

1. download the [MovieLens](https://grouplens.org/datasets/movielens/) dataset into the `ml-latest-small` directory
2. split the data into training and testing sets
3. perform negative sampling
4. calculate the statistics needed to train the NCF model
5. upload the data to an S3 bucket

## 1. Download dataset

In [None]:
%%bash
# delete the data directory if it exists (-f avoids an error on the first run)
rm -rf ml-latest-small

# download movielens small dataset
curl -O http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

# unzip into data directory
unzip ml-latest-small.zip
rm ml-latest-small.zip

**About the Data**

In [None]:
!cat ml-latest-small/README.txt

This model mainly uses `ratings.csv`, which contains four columns:
- userId
- movieId
- rating
- timestamp

## 2. Read data and perform train and test split

In [None]:
# requirements
import os
import boto3
import sagemaker
import numpy as np
import pandas as pd

In [None]:
# read rating data
fpath = './ml-latest-small/ratings.csv'
df = pd.read_csv(fpath)

In [None]:
# let's see what the data look like
df.head(2)

In [None]:
# find the smallest number of rated movies per user, which bounds how many items we can hold out
df.groupby('userId').movieId.nunique().min()

Note: The least active user has 20 ratings. Holding out 10 items per user for the testing set therefore removes at most 50% of any single user's ratings.

In [None]:
def train_test_split(df, holdout_num):
    """Perform a per-user training/testing split.

    @param df: ratings dataframe
    @param holdout_num: number of most recent items to hold out for each user

    @return df_train: training data
    @return df_test: testing data

    """
    # sort each user's ratings from newest to oldest
    df = df.sort_values(['userId', 'timestamp'], ascending=[True, False])

    # perform deep copy on the dataframe to avoid modification on the original dataframe
    df_train = df.copy(deep=True)
    df_test = df.copy(deep=True)

    # get test set: the `holdout_num` most recent ratings of every user
    df_test = df_test.groupby(['userId']).head(holdout_num).reset_index()

    # get train set: drop every (userId, movieId) pair that is in the test set
    df_train = df_train.merge(
        df_test[['userId', 'movieId']].assign(remove=1),
        how='left'
    ).query('remove != 1').drop(columns='remove').reset_index(drop=True)

    # sanity check to make sure we're not duplicating/losing data
    assert len(df) == len(df_train) + len(df_test)

    return df_train, df_test

In [None]:
df_train, df_test = train_test_split(df, 10)
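
To make the split logic concrete, here is a minimal sketch on toy data. The frame `toy` and all values in it are made up for illustration, and the sketch uses `drop` on the index instead of the merge used above. It holds out the two most recent ratings per user and checks that every held-out rating is newer than that user's training ratings.

```python
import pandas as pd

# Toy ratings (hypothetical values): two users with four ratings each.
toy = pd.DataFrame({
    "userId":    [1, 1, 1, 1, 2, 2, 2, 2],
    "movieId":   [10, 11, 12, 13, 10, 14, 15, 16],
    "rating":    [4.0, 3.5, 5.0, 2.0, 3.0, 4.5, 4.0, 1.0],
    "timestamp": [100, 200, 300, 400, 150, 250, 350, 450],
})
holdout_num = 2

# Sort newest-first within each user, then keep the first `holdout_num` rows per user.
toy_sorted = toy.sort_values(["userId", "timestamp"], ascending=[True, False])
toy_test = toy_sorted.groupby("userId").head(holdout_num)

# Everything that was not held out stays in the training set.
toy_train = toy_sorted.drop(toy_test.index)

# Number of held-out rows per user.
test_counts = toy_test.groupby("userId").size().to_dict()

# For every user, the oldest held-out rating is newer than the newest training rating.
newest_train = toy_train.groupby("userId").timestamp.max()
oldest_test = toy_test.groupby("userId").timestamp.min()
is_temporal = bool((oldest_test > newest_train).all())

print(test_counts, is_temporal)
```

The same two checks (rows per user and timestamp ordering) can be run on the real `df_test` before its `timestamp` column is dropped later in the notebook.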

## 3. Perform negative sampling

If we treat every rating a user gave as a positive label, the dataset contains no negative samples, and a model cannot be trained on positives alone. Therefore, for every positive user-movie pair, we randomly sample `n` movies that the user has not rated and label them as negatives.
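
The idea can be sketched on a toy catalogue before applying it to the full dataset. Everything below (`all_movies`, `seen`, the seed) is hypothetical. Unlike the notebook's function, this sketch samples without replacement from an explicit list of unseen movies, which is only practical when the catalogue is small.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

all_movies = np.array([10, 11, 12, 13, 14, 15])  # hypothetical catalogue
seen = {1: {10, 11}, 2: {12}}                    # hypothetical positives per user
n_neg = 2                                        # negatives per positive pair

rows = []
for user, movies in seen.items():
    # candidate negatives: movies this user has never rated
    unseen = np.array(sorted(set(all_movies.tolist()) - movies))
    for movie in sorted(movies):
        rows.append((user, movie, 1))  # observed (positive) pair
        for neg_movie in rng.choice(unseen, size=n_neg, replace=False):
            rows.append((user, int(neg_movie), 0))  # sampled negative pair

n_pos = sum(1 for _, _, label in rows if label == 1)
n_negatives = sum(1 for _, _, label in rows if label == 0)
print(n_pos, n_negatives)
```

Each positive pair contributes `n_neg` negatives, so the negative count is always `n_neg` times the positive count.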

In [None]:
def negative_sampling(user_ids, movie_ids, items, n_neg):
    """This function creates n_neg negative labels for every positive label

    @param user_ids: list of user ids
    @param movie_ids: list of movie ids
    @param items: unique list of movie ids
    @param n_neg: number of negative labels to sample

    @return df_neg: negative sample dataframe

    """

    neg = []
    ui_pairs = zip(user_ids, movie_ids)
    records = set(ui_pairs)

    # for every positive label case
    for (u, i) in records:
        # generate n_neg negative labels
        for _ in range(n_neg):
            j = np.random.choice(items)
            # if the randomly sampled movie exists for that user
            while (u, j) in records:
                # resample
                j = np.random.choice(items)
            neg.append([u, j, 0])
    # convert to pandas dataframe for concatenation later
    df_neg = pd.DataFrame(neg, columns=['userId', 'movieId', 'rating'])

    return df_neg

In [None]:
# create negative samples for training set
neg_train = negative_sampling(
    user_ids=df_train.userId.values,
    movie_ids=df_train.movieId.values,
    items=df.movieId.unique(),
    n_neg=5
)

In [None]:
print(f'created {neg_train.shape[0]:,} negative samples')

In [None]:
df_train = df_train[['userId', 'movieId']].assign(rating=1)
df_test = df_test[['userId', 'movieId']].assign(rating=1)

df_train = pd.concat([df_train, neg_train], ignore_index=True)
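
After concatenating, it can be worth confirming that the labels look as expected. The following is a minimal sketch on made-up frames (`pos` and `neg` are hypothetical stand-ins for the positive and negative samples). It counts the labels and checks that no negative pair collides with a positive one.

```python
import pandas as pd

# Hypothetical stand-ins for the positive and negative frames.
pos = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 11, 12]}).assign(rating=1)
neg = pd.DataFrame({
    "userId":  [1, 1, 1, 1, 2, 2],
    "movieId": [12, 13, 14, 15, 10, 11],
    "rating":  0,
})

combined = pd.concat([pos, neg], ignore_index=True)

# Number of rows per label; with 2 negatives per positive the ratio is 2.0.
label_counts = combined.rating.value_counts().to_dict()
neg_per_pos = label_counts[0] / label_counts[1]

# A negative sample must never repeat a (userId, movieId) pair that is positive.
overlap = pos.merge(neg, on=["userId", "movieId"])

print(label_counts, neg_per_pos, len(overlap))
```

On the real data, the same ratio should equal the `n_neg` value passed to `negative_sampling`.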

## 4. Calculate statistics for our understanding and model training

In [None]:
def get_unique_count(df):
    """calculate unique user and movie counts"""
    return df.userId.nunique(), df.movieId.nunique()

In [None]:
# number of unique users and movies in the whole dataset
get_unique_count(df)

In [None]:
print('training set (unique users, unique movies)', get_unique_count(df_train))
print('testing set (unique users, unique movies)', get_unique_count(df_test))

Next, we calculate the statistics that the training step needs.

In [None]:
# number of unique users and number of unique items (movies)
n_user, n_item = get_unique_count(df_train)

print("number of unique users", n_user)
print("number of unique items", n_item)

In [None]:
# save the variable for the model training notebook
# -----
# read about `store` magic here: 
# https://ipython.readthedocs.io/en/stable/config/extensions/storemagic.html

%store n_user
%store n_item

## 5. Preprocess the data and upload it to S3

In [None]:
# get current session region
session = boto3.session.Session()
region = session.region_name
print(f'currently in {region}')

In [None]:
# use the default sagemaker s3 bucket to store processed data
# here we figure out what that default bucket name is 
sagemaker_session = sagemaker.Session()
bucket_name = sagemaker_session.default_bucket()
print(bucket_name) # bucket name format: "sagemaker-{region}-{aws_account_id}"

**Upload data to the bucket**

In [None]:
# save data locally first
dest = 'ml-latest-small/s3'
train_path = os.path.join(dest, 'train.npy')
test_path = os.path.join(dest, 'test.npy')

# create the local output directory (no error if it already exists)
os.makedirs(dest, exist_ok=True)
np.save(train_path, df_train.values)
np.save(test_path, df_test.values)

# upload to S3 bucket (see the bucket name above)
sagemaker_session.upload_data(train_path, key_prefix='data')
sagemaker_session.upload_data(test_path, key_prefix='data')
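
Because `np.save` stores only the raw array, the column names are lost and whoever reads the file has to rely on column order. The sketch below illustrates the round trip locally with a made-up frame (`toy_train` is a hypothetical stand-in for `df_train`) and a temporary directory. It does not touch S3.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for df_train, with columns in the order userId, movieId, rating.
toy_train = pd.DataFrame({
    "userId":  [1, 1, 2],
    "movieId": [10, 13, 12],
    "rating":  [1, 0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.npy")
    np.save(path, toy_train.values)  # writes a plain 2-D array, no header
    loaded = np.load(path)

# Column names are not stored in the .npy file, so they are reattached by position.
restored = pd.DataFrame(loaded, columns=["userId", "movieId", "rating"])
print(loaded.shape)
```

Any code that loads `train.npy` or `test.npy` would need to assume the same column order: user id, movie id, then label.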