# Movie recommendation on Amazon SageMaker with Factorization Machines

Recommendation is one of the most popular applications of machine learning (ML). This lab is a modified version of the AWS ML blog post [Build a movie recommender with factorization machines on Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/build-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker/). It shows you how to build a movie recommendation model based on factorization machines, one of the built-in algorithms of Amazon SageMaker, and the popular [MovieLens](https://grouplens.org/datasets/movielens/) dataset.

### Lab Time
This lab will take around 10 to 15 minutes.

## A word about factorization machines

Factorization Machines (FM) are a supervised machine learning technique introduced in 2010 ([research paper](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf), PDF). FM get their name from their ability to reduce problem dimensionality thanks to matrix factorization.

Factorization machines can be used for classification or regression and are much more computationally efficient on large sparse data sets than traditional algorithms like linear regression. This property is why FM are widely used for recommendation: user count and item count are typically very large, but the number of actual ratings is very small (users don’t rate all available items!).
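
To make the efficiency claim concrete, here is a minimal NumPy sketch of the second-order FM scoring function described in the research paper linked above. The names `fm_predict`, `w0`, `w` and `V` are illustrative assumptions, and this is not the code SageMaker runs internally. The pairwise interaction term is computed in time proportional to the number of features times the number of factors, rather than with a double loop over feature pairs.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order factorization machine score for one feature vector.

    x: (n,) features, w0: global bias, w: (n,) linear weights,
    V: (n, k) factor matrix holding one k-dimensional vector per feature.
    """
    linear = w0 + w @ x
    # Pairwise term sum_{i<j} <V_i, V_j> x_i x_j, rewritten so that it
    # costs O(n * k) instead of O(n^2 * k).
    xv = x @ V                    # shape (k,)
    x2v2 = (x ** 2) @ (V ** 2)    # shape (k,)
    pairwise = 0.5 * np.sum(xv ** 2 - x2v2)
    return linear + pairwise

rng = np.random.default_rng(0)
n, k = 6, 2
# One-hot style input: exactly two active features (think one user, one movie).
x = np.array([0, 1, 0, 0, 0, 1], dtype=float)
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))

score = fm_predict(x, w0, w, V)
# With two active features, the pairwise term reduces to <V[1], V[5]>.
expected = w0 + w[1] + w[5] + V[1] @ V[5]
print(np.isclose(score, expected))
```

With one-hot encoded users and movies, each sample activates only two features, so the score reduces to a bias, two linear weights and a single dot product between a user vector and a movie vector.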

Here’s a simple example, in which a sparse rating matrix (dimension 4×4) is factored into a dense user matrix (dimension 4×2) and a dense item matrix (dimension 2×4). As you can see, the number of factors (2) is smaller than the number of columns of the rating matrix (4). Multiplying the two dense matrices back together also fills in all the blank values in the rating matrix, which we can then use to recommend new items to any user.

<img src="../../../images/Factorization2.png" alt="Factorization" style="width: 800px;"/>
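
The multiplication step can be sketched in a few lines of NumPy. The factor values below are made up for illustration and are not the ones shown in the figure; the point is only that a 4×2 matrix times a 2×4 matrix yields a fully populated 4×4 matrix.

```python
import numpy as np

# Hypothetical factors: 4 users x 2 factors, and 2 factors x 4 items.
user_factors = np.array([[1.0, 0.5],
                         [0.2, 1.0],
                         [0.9, 0.1],
                         [0.4, 0.8]])
item_factors = np.array([[4.0, 1.0, 3.0, 0.5],
                         [1.0, 4.5, 0.5, 3.0]])

# The product is a dense 4x4 matrix: every (user, item) cell gets a score,
# including cells that had no rating in the original sparse matrix.
predicted = user_factors @ item_factors
print(predicted.shape)
print(np.round(predicted, 2))
```

The two factor matrices hold 16 numbers in total, the same as the 4×4 rating matrix, so nothing is saved at this toy size. The saving appears when the user and item counts are in the hundreds or thousands and the number of factors stays small.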

### The MovieLens dataset

This dataset is a great starting point for recommendation. It comes in multiple sizes. In this lab we’ll use ml100k: 100,000 ratings from 943 users on 1682 movies. As you can see, the ml100k rating matrix is quite sparse (about 93.7% empty) because it only holds 100,000 ratings out of a possible 1,586,126 (943*1682).
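
The sparsity figure follows directly from the counts quoted above. A quick check, using illustrative variable names:

```python
nb_users, nb_movies, nb_ratings = 943, 1682, 100_000

possible = nb_users * nb_movies    # every user rating every movie
density = nb_ratings / possible    # fraction of cells that hold a rating
sparsity = 1 - density             # fraction of empty cells

print(possible)
print(f"{sparsity:.2%}")
```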

Here are the first 10 lines in the data set: user 754 gave movie 595 a 2-star rating, and so on.

<pre>
# user id, movie id, rating, timestamp
754         595         2    879452073
932         157         4    891250667
751         100         4    889132252
101         820         3    877136954
606         1277        3    878148493
581         475         4    879641850
13          50          5    882140001
457         59          5    882397575
111         321         3    891680076
123         657         4    879872066
</pre>

## Recommendation Engine Implementation

In [None]:
import timeit
start_time = timeit.default_timer()

In [None]:
import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.predictor import json_deserializer

import boto3, csv, io, json
import numpy as np
from scipy.sparse import lil_matrix

bucket = '<put your S3 bucket name>'  # replace with an existing bucket if needed
prefix = 'sagemaker/fm-movielens'

### Download ml-100k dataset

In [None]:
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip -P /tmp/
!unzip -o /tmp/ml-100k.zip -d /tmp/

In [None]:
%cd /tmp/ml-100k
!shuf ua.base -o ua.base.shuffled
!head -10 ua.base.shuffled

In [None]:
!head -10 ua.test

### Build training set and test set

In [None]:
nbUsers=943
nbMovies=1682
nbFeatures=nbUsers+nbMovies

nbRatingsTrain=90570
nbRatingsTest=9430

In [None]:
# For each user, build a list of rated movies.
# We'd need this to add random negative samples.
moviesByUser = {}
for userId in range(nbUsers):
    moviesByUser[str(userId)]=[]
 
with open('/tmp/ml-100k/ua.base.shuffled','r') as f:
    samples=csv.reader(f,delimiter='\t')
    for userId,movieId,rating,timestamp in samples:
        moviesByUser[str(int(userId)-1)].append(int(movieId)-1) 

In [None]:
def loadDataset(filename, lines, columns):
    # Features are one-hot encoded in a sparse matrix
    X = lil_matrix((lines, columns)).astype('float32')
    # Labels are stored in a vector
    Y = []
    line=0
    with open(filename,'r') as f:
        samples=csv.reader(f,delimiter='\t')
        for userId,movieId,rating,timestamp in samples:
            X[line,int(userId)-1] = 1
            X[line,int(nbUsers)+int(movieId)-1] = 1
            if int(rating) >= 4:
                Y.append(1)
            else:
                Y.append(0)
            line=line+1
            
    Y=np.array(Y).astype('float32')
    return X,Y

In [None]:
X_train, Y_train = loadDataset('/tmp/ml-100k/ua.base.shuffled', nbRatingsTrain, nbFeatures)
X_test, Y_test = loadDataset('/tmp/ml-100k/ua.test',nbRatingsTest,nbFeatures)

In [None]:
print(X_train[1000])
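
Each row of the feature matrix has exactly two non-zero entries, one for the user and one for the movie. The sketch below reproduces the column arithmetic used in `loadDataset` on its own; `one_hot_columns` is a hypothetical helper written for illustration and is not part of the lab code.

```python
NB_USERS, NB_MOVIES = 943, 1682

def one_hot_columns(user_id, movie_id, nb_users=NB_USERS):
    """Return the two zero-based column indices set to 1 for one rating line."""
    user_col = user_id - 1                 # users occupy columns 0..942
    movie_col = nb_users + movie_id - 1    # movies occupy columns 943..2624
    return user_col, movie_col

# First line of the data sample shown earlier: user 754 rated movie 595.
print(one_hot_columns(754, 595))  # (753, 1537)
```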

In [None]:
print(X_train.shape)
print(Y_train.shape)
assert X_train.shape == (nbRatingsTrain, nbFeatures)
assert Y_train.shape == (nbRatingsTrain, )
# count_nonzero returns the number of 1 labels; the remainder are zeros
one_labels = np.count_nonzero(Y_train)
print("Training labels: %d zeros, %d ones" % (nbRatingsTrain-one_labels, one_labels))

print(X_test.shape)
print(Y_test.shape)
assert X_test.shape  == (nbRatingsTest, nbFeatures)
assert Y_test.shape  == (nbRatingsTest, )
# count_nonzero returns the number of 1 labels; the remainder are zeros
one_labels = np.count_nonzero(Y_test)
print("Test labels: %d zeros, %d ones" % (nbRatingsTest-one_labels, one_labels))

### Convert to protobuf and save to S3

In [None]:
train_key      = 'train.protobuf'
train_prefix   = '{}/{}'.format(prefix, 'train3')

test_key       = 'test.protobuf'
test_prefix    = '{}/{}'.format(prefix, 'test3')

output_prefix  = 's3://{}/{}/output'.format(bucket, prefix)

In [None]:
def writeDatasetToProtobuf(X, Y, bucket, prefix, key):
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, X, Y)
    buf.seek(0)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket,obj)
    
train_data = writeDatasetToProtobuf(X_train, Y_train, bucket, train_prefix, train_key)    
test_data  = writeDatasetToProtobuf(X_test, Y_test, bucket, test_prefix, test_key)    
  
print(train_data)
print(test_data)
print('Output: {}'.format(output_prefix))

### Run training job

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(boto3.Session().region_name, 'factorization-machines')

In [None]:
fm = sagemaker.estimator.Estimator(container,
                                   get_execution_role(), 
                                   train_instance_count=1, 
                                   train_instance_type='ml.c4.xlarge',
                                   output_path=output_prefix,
                                   sagemaker_session=sagemaker.Session())

fm.set_hyperparameters(feature_dim=nbFeatures,
                      predictor_type='binary_classifier',
                      mini_batch_size=1000,
                      num_factors=64,
                      epochs=100)

fm.fit({'train': train_data, 'test': test_data})

### Deploy model

In [None]:
fm_predictor = fm.deploy(instance_type='ml.c4.xlarge', initial_instance_count=1)

In [None]:
def fm_serializer(data):
    js = {'instances': []}
    for row in data:
        js['instances'].append({'features': row.tolist()})
    #print js
    return json.dumps(js)

fm_predictor.content_type = 'application/json'
fm_predictor.serializer = fm_serializer
fm_predictor.deserializer = json_deserializer
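
To see what the endpoint receives, the serializer can be run locally on a tiny batch. This is a standalone sketch that re-declares an equivalent serializer; the 2×5 input array is made up for illustration (the lab's real samples have 2,625 features each).

```python
import json
import numpy as np

def fm_serializer(data):
    # Same payload layout as the serializer above: one 'features' list per row.
    return json.dumps({'instances': [{'features': row.tolist()} for row in data]})

# Hypothetical tiny batch: 2 samples with 5 features each.
batch = np.array([[1, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1]], dtype='float32')

payload = fm_serializer(batch)
print(payload)
```

Note that the payload is dense: every zero is spelled out in the JSON, which is why the prediction cells below convert sparse rows with `toarray()` and send them in small batches.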

## Run predictions

Let's test the model. We will run a batch prediction on the 10 test samples below.

In [None]:
print(X_test[1000:1010])

In the above cell output, each test sample contains two non-zero values, for example (0, 100) and (0, 1164). The first column index (100) identifies the user, and the second, minus the total user count (1164 - nbUsers), identifies the movie. Both indices are zero-based, so each one is the original ID minus 1.
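
The mapping from column indices back to the 1-based IDs used in the MovieLens files can be sketched as follows. `decode_row` is a hypothetical helper, and it assumes the index arithmetic from `loadDataset` (user column = user ID - 1, movie column = nbUsers + movie ID - 1).

```python
NB_USERS = 943

def decode_row(nonzero_cols, nb_users=NB_USERS):
    """Map the two non-zero column indices of a sample back to 1-based IDs."""
    user_col, movie_col = sorted(nonzero_cols)
    user_id = user_col + 1
    movie_id = movie_col - nb_users + 1
    return user_id, movie_id

# The example columns mentioned above.
print(decode_row([100, 1164]))  # (101, 222)
```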

In [None]:
result = fm_predictor.predict(X_test[1000:1010].toarray())
print(result)

Let's test a whole batch of data and evaluate our predictive accuracy. We will first look at the prediction results on the first 20,000 samples of the training set.

In [None]:
import numpy as np

predictions = []
for array in np.array_split(X_train[0:20000].toarray(), 100):
    result = fm_predictor.predict(array)
    predictions += [r['predicted_label'] for r in result['predictions']]

predictions = np.array(predictions)

In [None]:
import pandas as pd

pd.crosstab(Y_train[0:20000], predictions, rownames=['actuals'], colnames=['predictions'])

Let's now check the prediction results on the test set, which the model was not trained on.

In [None]:
import numpy as np

predictions = []
for array in np.array_split(X_test.toarray(), 100):
    result = fm_predictor.predict(array)
    predictions += [r['predicted_label'] for r in result['predictions']]

predictions = np.array(predictions)

In [None]:
print(X_test.toarray().shape)

In [None]:
print(Y_test.shape)

In [None]:
print(predictions.shape)

In [None]:
pd.crosstab(Y_test, predictions, rownames=['actuals'], colnames=['predictions'])
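
The crosstab is a confusion matrix, and summary metrics can be derived from its four cells. The sketch below shows one way to do that; `binary_metrics` is a hypothetical helper and the label arrays are made-up values, not results from the model. In the notebook, you would pass `Y_test` and `predictions` instead.

```python
import numpy as np

def binary_metrics(actuals, predictions):
    """Accuracy, precision and recall for 0/1 labels."""
    actuals = np.asarray(actuals).astype(int)
    predictions = np.asarray(predictions).astype(int)
    tp = int(np.sum((actuals == 1) & (predictions == 1)))
    tn = int(np.sum((actuals == 0) & (predictions == 0)))
    fp = int(np.sum((actuals == 0) & (predictions == 1)))
    fn = int(np.sum((actuals == 1) & (predictions == 0)))
    return {
        'accuracy': (tp + tn) / len(actuals),
        # Guard against division by zero when a class is never predicted.
        'precision': tp / (tp + fp) if (tp + fp) else 0.0,
        'recall': tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```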

In [None]:
# Total elapsed time since the first cell, in minutes
elapsed = timeit.default_timer() - start_time
print(elapsed/60)

## Online demo

Hoang and Alastair developed an online demo based on [Simon's blog](https://medium.com/@julsimon/building-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker-cedbfc8c93d8), which is available at the link below. Their demo also includes training cases that use a bigger data set (20 million).

* http://sagemaker-nab-demo.s3-website-us-west-2.amazonaws.com/

## References

* Original blog: [Build a movie recommender with factorization machines on Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/build-a-movie-recommender-with-factorization-machines-on-amazon-sagemaker/)