{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Preparation Notebook\n", "\n", "In this notebook, you will execute code to \n", "\n", "1. download [MovieLens](https://grouplens.org/datasets/movielens/) dataset into `ml-latest-small` directory\n", "2. split the data into training and testing sets\n", "3. perform negative sampling\n", "4. calculate statistics needed to train the NCF model\n", "5. upload data onto S3 bucket" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Download dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# delete the data directory if exists\n", "rm -r ml-latest-small\n", "\n", "# download movielens small dataset\n", "curl -O http://files.grouplens.org/datasets/movielens/ml-latest-small.zip\n", "\n", "# unzip into data directory\n", "unzip ml-latest-small.zip\n", "rm ml-latest-small.zip" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**About the Data**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!cat ml-latest-small/README.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this model, we will be using `ratings.csv` mainly, which contains 4 columns,\n", "- userId\n", "- movieId\n", "- rating\n", "- timestamp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Read data and perform train and test split" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# requirements\n", "import os\n", "import boto3\n", "import sagemaker\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# read rating data\n", "fpath = './ml-latest-small/ratings.csv'\n", "df = pd.read_csv(fpath)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's see what the data look like\n", "df.head(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# understand what's the maximum number of hold out portion should be\n", "df.groupby('userId').movieId.nunique().min()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: Since the \"least active\" user has 20 ratings, for our testing set, let's hold out 10 items for every user so that the max test set portion is 50%." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def train_test_split(df, holdout_num):\n", " \"\"\" perform training/testing split\n", " \n", " @param df: dataframe\n", " @param holdhout_num: number of items to be held out\n", " \n", " @return df_train: training data\n", " @return df_test testing data\n", " \n", " \"\"\"\n", " # first sort the data by time\n", " df = df.sort_values(['userId', 'timestamp'], ascending=[True, False])\n", " \n", " # perform deep copy on the dataframe to avoid modification on the original dataframe\n", " df_train = df.copy(deep=True)\n", " df_test = df.copy(deep=True)\n", " \n", " # get test set\n", " df_test = df_test.groupby(['userId']).head(holdout_num).reset_index()\n", " \n", " # get train set\n", " df_train = df_train.merge(\n", " df_test[['userId', 'movieId']].assign(remove=1),\n", " how='left'\n", " ).query('remove != 1').drop('remove', 1).reset_index(drop=True)\n", " \n", " # sanity check to make sure we're not duplicating/losing data\n", " assert len(df) == len(df_train) + len(df_test)\n", " \n", " return df_train, df_test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_train, df_test = train_test_split(df, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Perform negative sampling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assuming if a user rating an item is a positive label, there is no negative sample in the dataset, which is not possible for model training. Therefore, we random sample `n` items from the unseen movie list for every user to provide the negative samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def negative_sampling(user_ids, movie_ids, items, n_neg):\n", " \"\"\"This function creates n_neg negative labels for every positive label\n", " \n", " @param user_ids: list of user ids\n", " @param movie_ids: list of movie ids\n", " @param items: unique list of movie ids\n", " @param n_neg: number of negative labels to sample\n", " \n", " @return df_neg: negative sample dataframe\n", " \n", " \"\"\"\n", " \n", " neg = []\n", " ui_pairs = zip(user_ids, movie_ids)\n", " records = set(ui_pairs)\n", " \n", " # for every positive label case\n", " for (u, i) in records:\n", " # generate n_neg negative labels\n", " for _ in range(n_neg):\n", " # if the randomly sampled movie exists for that user\n", " j = np.random.choice(items)\n", " while(u, j) in records:\n", " # resample\n", " j = np.random.choice(items)\n", " neg.append([u, j, 0])\n", " # conver to pandas dataframe for concatenation later\n", " df_neg = pd.DataFrame(neg, columns=['userId', 'movieId', 'rating'])\n", " \n", " return df_neg" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create negative samples for training set\n", "neg_train = negative_sampling(\n", " user_ids=df_train.userId.values, \n", " movie_ids=df_train.movieId.values,\n", " items=df.movieId.unique(),\n", " n_neg=5\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f'created {neg_train.shape[0]:,} negative samples')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_train = df_train[['userId', 'movieId']].assign(rating=1)\n", "df_test = df_test[['userId', 'movieId']].assign(rating=1)\n", "\n", "df_train = pd.concat([df_train, neg_train], ignore_index=True)" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "## 4. Calulate statistics for our understanding and model training" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_unique_count(df):\n", " \"\"\"calculate unique user and movie counts\"\"\"\n", " return df.userId.nunique(), df.movieId.nunique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# unique number of user and movie in the whole dataset\n", "get_unique_count(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('training set shape', get_unique_count(df_train))\n", "print('testing set shape', get_unique_count(df_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we calculate some statistics for training purpose." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# number of unique user and number of unique item/movie\n", "n_user, n_item = get_unique_count(df_train)\n", "\n", "print(\"number of unique users\", n_user)\n", "print(\"number of unique items\", n_item)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# save the variable for the model training notebook\n", "# -----\n", "# read about `store` magic here: \n", "# https://ipython.readthedocs.io/en/stable/config/extensions/storemagic.html\n", "\n", "%store n_user\n", "%store n_item" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Preprocess data and upload them onto S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get current session region\n", "session = boto3.session.Session()\n", "region = session.region_name\n", "print(f'currently in {region}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use the default sagemaker s3 bucket to store processed data\n", "# here we figure out what that default bucket name is \n", "sagemaker_session = sagemaker.Session()\n", "bucket_name = sagemaker_session.default_bucket()\n", "print(bucket_name) # bucket name format: \"sagemaker-{region}-{aws_account_id}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**upload data to the bucket**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# save data locally first\n", "dest = 'ml-latest-small/s3'\n", "train_path = os.path.join(dest, 'train.npy')\n", "test_path = os.path.join(dest, 'test.npy')\n", "\n", "!mkdir {dest}\n", "np.save(train_path, df_train.values)\n", "np.save(test_path, df_test.values)\n", "\n", "# upload to S3 bucket (see the bucket name above)\n", "sagemaker_session.upload_data(train_path, key_prefix='data')\n", "sagemaker_session.upload_data(test_path, key_prefix='data')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", 
"name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }