{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Movie recommendation on Amazon SageMaker with Factorization Machines"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recommendation is one of the most popular applications in machine learning (ML). This lab is a modified version of [Build a movie recommender with factorization machines on Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/build-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker/) AWS ML blog. It will show you how to build a movie recommendation model based on factorization machines — one of the built-in algorithms of Amazon SageMaker — and the popular [MovieLens](https://grouplens.org/datasets/movielens/) dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lab Time\n",
    "This lab will take around 10 to 15 minutes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A word about factorization machines"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Factorization Machines (FM) are a supervised machine learning technique introduced in 2010 ([research paper](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf), PDF). FM get their name from their ability to reduce problem dimensionality thanks to matrix factorization.\n",
    "\n",
    "Factorization machines can be used for classification or regression and are much more computationally efficient on large sparse data sets than traditional algorithms like linear regression. This property is why FM are widely used for recommendation. User count and item count are typically very large although the actual number of recommendations is very small (users don’t rate all available items!).\n",
    "\n",
    "Here’s a simple example: Where a sparse rating matrix (dimension 4×4) is factored into a dense user matrix (dimension 4×2) and a dense item matrix (2×4). As you can see, the number of factors (2) is smaller than the number of columns of the rating matrix (4). In addition, this multiplication also lets us fill all blank values in the rating matrix, which we can then use to recommend new items to any user."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../../../images/Factorization2.png\" alt=\"Factorization\" style=\"width: 800px;\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The MovieLens dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This dataset is a great starting point for recommendation. It comes in multiples sizes. In this blog post we’ll use ml100k: 100,000 ratings from 943 users on 1682 movies. As you can see, the ml100k rating matrix is quite sparse (93.6% to be precise) because  it only holds 100,000 ratings out of a possible 1,586,126 (943*1682).\n",
    "\n",
    "Here are the first 10 lines in the data set: user 754 gave movie 595 a 2-star rating, and so on.\n",
    "\n",
    "<pre>\n",
    "# user id, movie id, rating, timestamp\n",
    "754         595         2    879452073\n",
    "932         157         4    891250667\n",
    "751         100         4    889132252\n",
    "101         820         3    877136954\n",
    "606         1277        3    878148493\n",
    "581         475         4    879641850\n",
    "13          50          5    882140001\n",
    "457         59          5    882397575\n",
    "111         321         3    891680076\n",
    "123         657         4    879872066\n",
    "</pre>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Recommendation Engine Implementation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import timeit\n",
    "start_time = timeit.default_timer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "import sagemaker.amazon.common as smac\n",
    "from sagemaker import get_execution_role\n",
    "from sagemaker.predictor import json_deserializer\n",
    "\n",
    "import boto3, csv, io, json\n",
    "import numpy as np\n",
    "from scipy.sparse import lil_matrix\n",
    "\n",
    "bucket = '<put your S3 bucket name>'  # replace with an existing bucket if needed\n",
    "prefix = 'sagemaker/fm-movielens'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download ml-100k dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip -P /tmp/\n",
    "!unzip -o /tmp/ml-100k.zip -d /tmp/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%cd /tmp/ml-100k\n",
    "!shuf ua.base -o ua.base.shuffled\n",
    "!head -10 ua.base.shuffled"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!head -10 ua.test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build training set and test set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nbUsers=943\n",
    "nbMovies=1682\n",
    "nbFeatures=nbUsers+nbMovies\n",
    "\n",
    "nbRatingsTrain=90570\n",
    "nbRatingsTest=9430"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For each user, build a list of rated movies.\n",
    "# We'd need this to add random negative samples.\n",
    "moviesByUser = {}\n",
    "for userId in range(nbUsers):\n",
    "    moviesByUser[str(userId)]=[]\n",
    " \n",
    "with open('/tmp/ml-100k/ua.base.shuffled','r') as f:\n",
    "    samples=csv.reader(f,delimiter='\\t')\n",
    "    for userId,movieId,rating,timestamp in samples:\n",
    "        moviesByUser[str(int(userId)-1)].append(int(movieId)-1) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def loadDataset(filename, lines, columns):\n",
    "    # Features are one-hot encoded in a sparse matrix\n",
    "    X = lil_matrix((lines, columns)).astype('float32')\n",
    "    # Labels are stored in a vector\n",
    "    Y = []\n",
    "    line=0\n",
    "    with open(filename,'r') as f:\n",
    "        samples=csv.reader(f,delimiter='\\t')\n",
    "        for userId,movieId,rating,timestamp in samples:\n",
    "            X[line,int(userId)-1] = 1\n",
    "            X[line,int(nbUsers)+int(movieId)-1] = 1\n",
    "            if int(rating) >= 4:\n",
    "                Y.append(1)\n",
    "            else:\n",
    "                Y.append(0)\n",
    "            line=line+1\n",
    "            \n",
    "    Y=np.array(Y).astype('float32')\n",
    "    return X,Y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, Y_train = loadDataset('/tmp/ml-100k/ua.base.shuffled', nbRatingsTrain, nbFeatures)\n",
    "X_test, Y_test = loadDataset('/tmp/ml-100k/ua.test',nbRatingsTest,nbFeatures)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(X_train[1000])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(X_train.shape)\n",
    "print(Y_train.shape)\n",
    "assert X_train.shape == (nbRatingsTrain, nbFeatures)\n",
    "assert Y_train.shape == (nbRatingsTrain, )\n",
    "zero_labels = np.count_nonzero(Y_train)\n",
    "print(\"Training labels: %d zeros, %d ones\" % (zero_labels, nbRatingsTrain-zero_labels))\n",
    "\n",
    "print(X_test.shape)\n",
    "print(Y_test.shape)\n",
    "assert X_test.shape  == (nbRatingsTest, nbFeatures)\n",
    "assert Y_test.shape  == (nbRatingsTest, )\n",
    "zero_labels = np.count_nonzero(Y_test)\n",
    "print(\"Test labels: %d zeros, %d ones\" % (zero_labels, nbRatingsTest-zero_labels))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert to protobuf and save to S3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_key      = 'train.protobuf'\n",
    "train_prefix   = '{}/{}'.format(prefix, 'train3')\n",
    "\n",
    "test_key       = 'test.protobuf'\n",
    "test_prefix    = '{}/{}'.format(prefix, 'test3')\n",
    "\n",
    "output_prefix  = 's3://{}/{}/output'.format(bucket, prefix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def writeDatasetToProtobuf(X, Y, bucket, prefix, key):\n",
    "    buf = io.BytesIO()\n",
    "    smac.write_spmatrix_to_sparse_tensor(buf, X, Y)\n",
    "    buf.seek(0)\n",
    "    obj = '{}/{}'.format(prefix, key)\n",
    "    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)\n",
    "    return 's3://{}/{}'.format(bucket,obj)\n",
    "    \n",
    "train_data = writeDatasetToProtobuf(X_train, Y_train, bucket, train_prefix, train_key)    \n",
    "test_data  = writeDatasetToProtobuf(X_test, Y_test, bucket, test_prefix, test_key)    \n",
    "  \n",
    "print(train_data)\n",
    "print(test_data)\n",
    "print('Output: {}'.format(output_prefix))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run training job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
    "\n",
    "container = get_image_uri(boto3.Session().region_name, 'factorization-machines')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fm = sagemaker.estimator.Estimator(container,\n",
    "                                   get_execution_role(), \n",
    "                                   train_instance_count=1, \n",
    "                                   train_instance_type='ml.c4.xlarge',\n",
    "                                   output_path=output_prefix,\n",
    "                                   sagemaker_session=sagemaker.Session())\n",
    "\n",
    "fm.set_hyperparameters(feature_dim=nbFeatures,\n",
    "                      predictor_type='binary_classifier',\n",
    "                      mini_batch_size=1000,\n",
    "                      num_factors=64,\n",
    "                      epochs=100)\n",
    "\n",
    "fm.fit({'train': train_data, 'test': test_data})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deploy model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "fm_predictor = fm.deploy(instance_type='ml.c4.xlarge', initial_instance_count=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fm_serializer(data):\n",
    "    js = {'instances': []}\n",
    "    for row in data:\n",
    "        js['instances'].append({'features': row.tolist()})\n",
    "    #print js\n",
    "    return json.dumps(js)\n",
    "\n",
    "fm_predictor.content_type = 'application/json'\n",
    "fm_predictor.serializer = fm_serializer\n",
    "fm_predictor.deserializer = json_deserializer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run predictions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's test the performance. We will perform a batch prediction on below 10 test sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(X_test[1000:1010])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the above cell output, each test set contains two non-zero values (ex, (0,100) and (0,1164). The first one (100) indicates the user ID and the second one subtrated by the total user count (1164 - nbUsers) indicates the movie ID."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = fm_predictor.predict(X_test[1000:1010].toarray())\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's test a whole batch of data and evaluate our predictive accuracy. We will firstly see the prediction result on the training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "predictions = []\n",
    "for array in np.array_split(X_train[0:20000].toarray(), 100):\n",
    "    result = fm_predictor.predict(array)\n",
    "    predictions += [r['predicted_label'] for r in result['predictions']]\n",
    "\n",
    "predictions = np.array(predictions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "pd.crosstab(Y_train[0:20000], predictions, rownames=['actuals'], colnames=['predictions'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now check the prediction result on the test set which was not used in the training process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "predictions = []\n",
    "for array in np.array_split(X_test.toarray(), 100):\n",
    "    result = fm_predictor.predict(array)\n",
    "    predictions += [r['predicted_label'] for r in result['predictions']]\n",
    "\n",
    "predictions = np.array(predictions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(X_test.toarray().shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(Y_test.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(predictions.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.crosstab(Y_test, predictions, rownames=['actuals'], colnames=['predictions'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# code you want to evaluate\n",
    "elapsed = timeit.default_timer() - start_time\n",
    "print(elapsed/60)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Online-demo"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hoang and Alastair developed an online demo based on [Simon's blog](https://medium.com/@julsimon/building-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker-cedbfc8c93d8) which is available at the below link. Their demo includes training cases using bigger data set (20 million) too."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* http://sagemaker-nab-demo.s3-website-us-west-2.amazonaws.com/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Original blog: [Build a movie recommender with factorization machines on Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/build-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker/)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}