{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Movie recommendation on Amazon SageMaker with Factorization Machines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recommendation is one of the most popular applications in machine learning (ML). This lab is a modified version of [Build a movie recommender with factorization machines on Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/build-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker/) AWS ML blog. It will show you how to build a movie recommendation model based on factorization machines — one of the built-in algorithms of Amazon SageMaker — and the popular [MovieLens](https://grouplens.org/datasets/movielens/) dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lab Time\n",
"This lab will take around 10 to 15 minutes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A word about factorization machines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Factorization Machines (FM) are a supervised machine learning technique introduced in 2010 ([research paper](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf), PDF). FM get their name from their ability to reduce problem dimensionality thanks to matrix factorization.\n",
"\n",
"Factorization machines can be used for classification or regression and are much more computationally efficient on large sparse data sets than traditional algorithms like linear regression. This property is why FM are widely used for recommendation. User count and item count are typically very large although the actual number of recommendations is very small (users don’t rate all available items!).\n",
"\n",
"Here’s a simple example: Where a sparse rating matrix (dimension 4×4) is factored into a dense user matrix (dimension 4×2) and a dense item matrix (2×4). As you can see, the number of factors (2) is smaller than the number of columns of the rating matrix (4). In addition, this multiplication also lets us fill all blank values in the rating matrix, which we can then use to recommend new items to any user."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
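{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of the factorization described above, the sparse rating matrix can be written as an approximate product of the two dense matrices. The symbols $R$, $P$, $Q$ and $k$ are notation introduced here for illustration only; they do not come from the original post. With $k = 2$ factors, $R$ is the 4×4 rating matrix, $P$ the 4×2 user matrix and $Q$ the 2×4 item matrix:\n",
"\n",
"$$R \\approx P\\,Q, \\qquad \\hat{r}_{ui} = \\sum_{f=1}^{k} p_{uf}\\, q_{fi}$$\n",
"\n",
"Every entry $\\hat{r}_{ui}$ of the product is defined, including the ones that are blank in $R$, which is why the product can be used to predict ratings for items a user has not rated.\n",
"\n",
"For reference, the second-order factorization machine model from the 2010 paper applies the same idea to an arbitrary feature vector $x$ with $n$ features:\n",
"\n",
"$$\\hat{y}(x) = w_0 + \\sum_{i=1}^{n} w_i x_i + \\sum_{i=1}^{n} \\sum_{j=i+1}^{n} \\langle v_i, v_j \\rangle\\, x_i x_j$$\n",
"\n",
"Here $w_0$ is a global bias, $w_i$ is the weight of feature $i$, and $\\langle v_i, v_j \\rangle$ is the dot product of the $k$-dimensional factor vectors of features $i$ and $j$. The pairwise interaction weights are therefore never stored individually; they are reconstructed from the factor vectors."
]
},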
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The MovieLens dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset is a great starting point for recommendation. It comes in multiples sizes. In this blog post we’ll use ml100k: 100,000 ratings from 943 users on 1682 movies. As you can see, the ml100k rating matrix is quite sparse (93.6% to be precise) because it only holds 100,000 ratings out of a possible 1,586,126 (943*1682).\n",
"\n",
"Here are the first 10 lines in the data set: user 754 gave movie 595 a 2-star rating, and so on.\n",
"\n",
"
\n", "# user id, movie id, rating, timestamp\n", "754 595 2 879452073\n", "932 157 4 891250667\n", "751 100 4 889132252\n", "101 820 3 877136954\n", "606 1277 3 878148493\n", "581 475 4 879641850\n", "13 50 5 882140001\n", "457 59 5 882397575\n", "111 321 3 891680076\n", "123 657 4 879872066\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recommendation Engine Implementation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import timeit\n", "start_time = timeit.default_timer()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "import sagemaker.amazon.common as smac\n", "from sagemaker import get_execution_role\n", "from sagemaker.predictor import json_deserializer\n", "\n", "import boto3, csv, io, json\n", "import numpy as np\n", "from scipy.sparse import lil_matrix\n", "\n", "bucket = '