{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Preparation Notebook\n", "\n", "In this notebook, you will execute code to \n", "\n", "1. download [MovieLens](https://grouplens.org/datasets/movielens/) dataset into `ml-latest-small` directory\n", "2. split the data into training and testing sets\n", "3. perform negative sampling\n", "4. calculate statistics needed to train the NCF model\n", "5. upload data onto S3 bucket" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Download dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# delete the data directory if exists\n", "rm -r ml-latest-small\n", "\n", "# download movielens small dataset\n", "curl -O http://files.grouplens.org/datasets/movielens/ml-latest-small.zip\n", "\n", "# unzip into data directory\n", "unzip ml-latest-small.zip\n", "rm ml-latest-small.zip" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**About the Data**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!cat ml-latest-small/README.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this model, we will be using `ratings.csv` mainly, which contains 4 columns,\n", "- userId\n", "- movieId\n", "- rating\n", "- timestamp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Read data and perform train and test split" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# requirements\n", "import os\n", "import boto3\n", "import sagemaker\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# read rating data\n", "fpath = './ml-latest-small/ratings.csv'\n", "df = pd.read_csv(fpath)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's see what the data look like\n", "df.head(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# understand what's the maximum number of hold out portion should be\n", "df.groupby('userId').movieId.nunique().min()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: Since the \"least active\" user has 20 ratings, for our testing set, let's hold out 10 items for every user so that the max test set portion is 50%." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def train_test_split(df, holdout_num):\n", " \"\"\" perform training/testing split\n", " \n", " @param df: dataframe\n", " @param holdhout_num: number of items to be held out\n", " \n", " @return df_train: training data\n", " @return df_test testing data\n", " \n", " \"\"\"\n", " # first sort the data by time\n", " df = df.sort_values(['userId', 'timestamp'], ascending=[True, False])\n", " \n", " # perform deep copy on the dataframe to avoid modification on the original dataframe\n", " df_train = df.copy(deep=True)\n", " df_test = df.copy(deep=True)\n", " \n", " # get test set\n", " df_test = df_test.groupby(['userId']).head(holdout_num).reset_index()\n", " \n", " # get train set\n", " df_train = df_train.merge(\n", " df_test[['userId', 'movieId']].assign(remove=1),\n", " how='left'\n", " ).query('remove != 1').drop('remove', 1).reset_index(drop=True)\n", " \n", " # sanity check to make sure we're not duplicating/losing data\n", " assert len(df) == len(df_train) + len(df_test)\n", " \n", " return df_train, df_test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_train, df_test = train_test_split(df, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Perform negative sampling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assuming if a user rating an item is a positive label, there is no negative sample in the dataset, which is not possible for model training. Therefore, we random sample `n` items from the unseen movie list for every user to provide the negative samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def negative_sampling(user_ids, movie_ids, items, n_neg):\n", " \"\"\"This function creates n_neg negative labels for every positive label\n", " \n", " @param user_ids: list of user ids\n", " @param movie_ids: list of movie ids\n", " @param items: unique list of movie ids\n", " @param n_neg: number of negative labels to sample\n", " \n", " @return df_neg: negative sample dataframe\n", " \n", " \"\"\"\n", " \n", " neg = []\n", " ui_pairs = zip(user_ids, movie_ids)\n", " records = set(ui_pairs)\n", " \n", " # for every positive label case\n", " for (u, i) in records:\n", " # generate n_neg negative labels\n", " for _ in range(n_neg):\n", " # if the randomly sampled movie exists for that user\n", " j = np.random.choice(items)\n", " while(u, j) in records:\n", " # resample\n", " j = np.random.choice(items)\n", " neg.append([u, j, 0])\n", " # conver to pandas dataframe for concatenation later\n", " df_neg = pd.DataFrame(neg, columns=['userId', 'movieId', 'rating'])\n", " \n", " return df_neg" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create negative samples for training set\n", "neg_train = negative_sampling(\n", " user_ids=df_train.userId.values, \n", " movie_ids=df_train.movieId.values,\n", " items=df.movieId.unique(),\n", " n_neg=5\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f'created {neg_train.shape[0]:,} negative samples')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_train = df_train[['userId', 'movieId']].assign(rating=1)\n", "df_test = df_test[['userId', 'movieId']].assign(rating=1)\n", "\n", "df_train = pd.concat([df_train, neg_train], ignore_index=True)" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "## 4. Calulate statistics for our understanding and model training" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_unique_count(df):\n", " \"\"\"calculate unique user and movie counts\"\"\"\n", " return df.userId.nunique(), df.movieId.nunique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# unique number of user and movie in the whole dataset\n", "get_unique_count(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('training set shape', get_unique_count(df_train))\n", "print('testing set shape', get_unique_count(df_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we calculate some statistics for training purpose." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# number of unique user and number of unique item/movie\n", "n_user, n_item = get_unique_count(df_train)\n", "\n", "print(\"number of unique users\", n_user)\n", "print(\"number of unique items\", n_item)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# save the variable for the model training notebook\n", "# -----\n", "# read about `store` magic here: \n", "# https://ipython.readthedocs.io/en/stable/config/extensions/storemagic.html\n", "\n", "%store n_user\n", "%store n_item" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Preprocess data and upload them onto S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get current session region\n", "session = boto3.session.Session()\n", "region = session.region_name\n", "print(f'currently in {region}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use the default sagemaker s3 bucket to store processed data\n", "# here we figure out what that default bucket name is \n", "sagemaker_session = sagemaker.Session()\n", "bucket_name = sagemaker_session.default_bucket()\n", "print(bucket_name) # bucket name format: \"sagemaker-{region}-{aws_account_id}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**upload data to the bucket**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# save data locally first\n", "dest = 'ml-latest-small/s3'\n", "train_path = os.path.join(dest, 'train.npy')\n", "test_path = os.path.join(dest, 'test.npy')\n", "\n", "!mkdir {dest}\n", "np.save(train_path, df_train.values)\n", "np.save(test_path, df_test.values)\n", "\n", "# upload to S3 bucket (see the bucket name above)\n", "sagemaker_session.upload_data(train_path, key_prefix='data')\n", "sagemaker_session.upload_data(test_path, key_prefix='data')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", 
"name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }